Open Source Ascendant


How Cendant Travel Distribution Services replaced a $100 million mainframe with 144 Linux servers and lived to tell about it.

July 1, 2005
By Christopher Koch

In the summer of 2003, Mickey Lutz did something that most CIOs, even today, would consider unthinkable: He moved a critical part of his IT infrastructure from the mainframe and Unix to Linux. For Lutz, the objections to Linux, regarding its technical robustness and lack of vendor support, had melted enough to justify the gamble. "The issues raised around open source, around its viability, were in the past," recalls Lutz, CIO for Global Agency Solutions with Cendant Travel Distribution Services, the parent company of online travel brands Orbitz and CheapTickets.com.

Few CIOs agreed with Lutz then or now. Many CIOs are experimenting with Linux these days, but less than 10 percent of the Fortune 1000, according to research company Meta Group, have been willing to bet their core infrastructures on it—to transform the Linux penguin mascot from cute to brute.

They've had some good reasons for their fear of flying. For starters, the technical challenge is significant. You need many carefully formed flocks of Linux-based Intel servers to equal the might of a single mainframe. In addition, the slow uptake of Linux in high-transaction applications has kept support for big, complex Linux environments scarcer and slightly more expensive than support for traditional heavy-duty platforms such as Unix and mainframes. And the savings from Linux and Intel matter less in a complex environment where applications, databases and their related support and maintenance can account for as much as 80 percent of the overall cost of running a system, adds Jerald Murphy, a Meta Group analyst.

And it's true that Cendant has needed every bit of support that it could get for Linux so far. Lutz's IT group rewrote a complex, real-time airline pricing application that serves hundreds of thousands of travel agents around the world and that also acts as the system of record for all of United Airlines' ticket reservations. When this application came up on Linux, it proved to be so demanding—it handles up to 700 pricing requests per second—that it completely redefined Cendant's expectations about what it would take to get Linux to work. "We have broken every piece of software we've ever thrown at this platform, including Linux itself," says Lutz.

That has resulted in some scary moments, including an initial slowdown in the system that left United Airlines agents intermittently unable to access the reservation application (one outage lasted about 45 minutes) over the course of four days in July 2003. If you are United Airlines and move roughly 8,000 passengers per hour, you need the computers to work all the time. "Even a little downtime is a big deal," admits Lutz.

But he maintains that the gamble on Linux has been worth it. "Our business strategy is to be as efficient as possible [while] processing transactions," he says. "To do that, we have to bring down the cost of our technology." Lutz claims he has done that. A platform on the mainframe that was projected to cost $100 million now costs about $2.5 million on Linux and Intel servers.

The final hurdle for the adoption of Linux at the highest level of the corporate infrastructure is the comfort level of CIOs. Just as few CIOs are interested in first versions of software, few are ready to risk their most important applications on a technical infrastructure that most of their peers haven't embraced. Furthermore, although Linux is closely related to Unix, for a staff trained on the mainframe, the change to the Intel environment will be complete and dramatic. The morass of litigation threats and fears about the open-source model of development and support haven't helped, even though a number of high-profile vendors—such as IBM, Hewlett-Packard and Oracle—have loudly pledged support for Linux.

That means adopting Linux is still very much a personal decision and a personal risk for CIOs. It is a chicken-and-egg game. Which comes first, adoption or vendor support? Reduction of risk or cost savings? Solid vendor support is critical, as is an internal staff capable of handling technical issues and finding answers that vendors—who don't control the development of Linux any more than CIOs do—cannot provide. Proper testing is also crucial, because Linux runs on an architecture—namely, Intel chips—that has not yet been widely used for mission-critical, transaction-intensive workloads.

In other words, Linux is free, but not risk-free.

From Bleeding Edge to Leading Edge

Linux moved from bleeding edge to leading edge in Lutz's mind as Cendant looked for ways to bring down the high cost of maintaining an ancient transaction infrastructure. The pressure to save money became intolerable after the dust from the Internet bust cleared. Travel, led by brand names such as Expedia, Travelocity and Orbitz, emerged as one of the most powerful online channels left standing.

Lutz was in command of the alternative to those bright, shiny websites: an expensive, aging global distribution system (GDS) called Galileo. It is one of the original four mainframe-based travel reservation systems developed in the 1970s (the others are Amadeus, Worldspan and Sabre) that travel agents access through their desktops. A perennial also-ran to Sabre in the travel agency market, Galileo, like the other GDS relics, has lost more than 40 percent of its market share in the past decade to Internet rivals—including the airlines themselves—that have lower infrastructure costs and can afford to charge smaller fees to agents and travelers, according to Morgan Stanley analyst Christopher Gutek. "The GDSs aren't growing; they're fighting to keep from shrinking," says another analyst, James Wilson, managing director at JMP Securities. "What [Galileo] has to do is keep driving its processing efficiency."

In 2001, to cut costs and to try to differentiate Galileo from its GDS rivals, the business brass authorized an update of the centerpiece of the aging Galileo infrastructure, an airfare pricing application called Galileo 360° Fares. While it was hot stuff in the '70s, Fares had fallen behind the times. For example, it was very fast at reaching into the mainframe and retrieving flights, but it could not automatically administer any of the rules that applied to pricing the flights—such as requiring a Saturday night stay-over to qualify for a discount. Galileo IT employees had to match the rules to the flights and manually input them—thousands per day—into the system. The update would eliminate all the manual work and the errors it created and push new fares to travel agents in a fraction of the time. It would also give Galileo a leg up (temporarily, anyway) on its GDS competitors, some of whom were rushing to update their pricing software too.

Lutz also saw an opportunity to reduce the cost of the infrastructure behind Fares by moving it from the mainframe to Unix, which by then had matured enough to run the volumes and speeds necessary for Fares. At the time, Lutz looked into Linux and rejected it. "The performance of the hardware and the software just wasn't there," he recalls. Questions about finding real enterprise support and the long-term viability of the open-source model also rang in his ears.

But the Fares rewrite took time. By 2003, the outlook for Linux had changed dramatically. Linux could operate on larger systems, Intel servers were much faster and Lutz's data center provider, IBM, had emerged as the leading champion of the platform.

The technical robustness of the hardware and software and support availability all crossed an invisible baseline that Lutz (and every IT leader) has in his mind for new technologies: Lutz felt personally comfortable with it. He decided that the benefits finally outweighed the risks. "I saw many companies adopting it, and the vendor support was there," he recalls. "There are significant cost savings possible with open source, and they became far too compelling for us to ignore."

Will This Penguin Fly?

The transition of Fares to Unix was already 25 percent complete, but Lutz halted it, ordering a five-person internal team to put the application through its paces on Linux servers. They would check to make sure that data flowed properly and that the servers could handle the expected speed and volume of the transactions. If Linux held up, the potential cost savings would be enormous—up to 90 percent over Unix, according to Robert Wiseman, CTO for Global Agency Solutions with Cendant Travel Distribution Services.

The testing was risky, however. The Linux architecture called for the application and the data to be distributed over more than 100 servers. This model meant that the team could not rely on a subset of the production environment to accurately predict how the penguins would fly, and building a full-scale test environment was deemed too costly and time-consuming.

The decision not to focus more on testing came back to haunt them. In June 2003, after three months of testing, Cendant moved the Fares production system to Linux. Lutz and Wiseman were at a conference in Portugal when calls started coming in, saying that the system was experiencing mysterious slowdowns.

The team had not envisioned the intensity with which Fares would crunch the data being held on multiple storage servers. For example, when travel agents asked the Fares system for a price for a ticket from Boston to Denver, they unleashed a torrent of calculations. According to Lutz, the number of possible combinations of flights and prices for all the airline carriers between two major cities has been estimated by researchers at MIT to be 10 to the 30th power. The Fares software pulls millions of different combinations out of Galileo's storage complex and calculates prices within a second.

According to Wiseman, Fares' vast appetite for data being held on the storage servers quickly created hot spots in which the demand for certain data types began to overwhelm some of the storage servers. Wiseman says that the volume and data distribution requirements of the Fares application, which the original environment could not satisfy, forced him to find a different replication solution (which he declined to identify). Meanwhile, the application servers were pecking the storage servers to death with requests for data. Some slowed down to a crawl. The application slowed down with them.
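
The hot-spot effect Wiseman describes can be illustrated with a small sketch. This is not Cendant's replication solution, which the article does not identify. The data-type names, the three storage servers and the split of demand are all hypothetical; only the 700 requests per second total echoes the peak figure cited earlier.

```python
from collections import defaultdict

def server_load(demand, placement):
    """Return requests per second landing on each storage server.

    demand:    {data_type: requests_per_second}
    placement: {data_type: [server, ...]}; requests for a data type are
               split evenly across the servers holding a copy of it.
    """
    load = defaultdict(float)
    for data_type, rps in demand.items():
        replicas = placement[data_type]
        for server in replicas:
            load[server] += rps / len(replicas)
    return dict(load)

# Hypothetical demand: one data type dominates the request stream.
demand = {"popular_fares": 600, "rules": 60, "schedules": 40}

# One copy of each data type: the popular data lives on a single server.
single_copy = {
    "popular_fares": ["s1"],
    "rules": ["s2"],
    "schedules": ["s3"],
}

# Same demand, but the popular data is replicated to all three servers.
replicated = {
    "popular_fares": ["s1", "s2", "s3"],
    "rules": ["s2"],
    "schedules": ["s3"],
}

hot = server_load(demand, single_copy)
spread = server_load(demand, replicated)

print("single copy:", hot)      # s1 carries 600 of the 700 requests
print("replicated: ", spread)   # the busiest server now carries 260
```

Total demand is unchanged in both cases. Only the placement of the popular data differs, which is why adding servers alone does not help unless the hot data is spread across them.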

Hard Lessons Learned

Frantic calls began coming in from some of the 44,000 travel agency locations in 116 countries that were unable to access Fares. Worse, because of significant outages, United Airlines' employees could not access core flight information—including schedules and connections—for as long as 45 minutes. The problems were intermittent over the course of four days. Lutz would not comment on the financial losses incurred by United or Galileo during the downtimes. Once the problem servers were pinpointed, a 40- to 50-person cutover team of IBM, Red Hat and Cendant engineers brought the problems under control by throwing more servers into the mix.

"In hindsight," says Lutz, "we shouldn't have tried to cut over to a new infrastructure at the same time we were deploying a new software application. It was too much at once."

Wiseman faults the limited testing of the new system, especially the storage servers, for the failure. "We were focused on testing [the performance of Linux on] individual servers, and we didn't have a full ratio of servers in the testing environment to predict the load on the storage servers," he says. Rather than falling back to the old platform at the first signs of trouble and reworking the new one, the engineers always thought the answer was around the corner. "We always believed that the next fix would solve our problems," recalls Wiseman. "Eventually it did, of course, but not without system slowdowns and occasional time-outs during high-peak periods for the next few days."

To make sure the new system would remain stable over the long haul, the team decided to re-architect it after the failures in 2003, creating about a dozen redundant clusters of 12 servers apiece, each using a new network-attached-storage architecture that Wiseman says was not utilized the first time. Each cluster is designed to handle the full transaction load of Fares, but if demand for a particular function starts to peak, a single server no longer faces down thousands of impatient travel agents on its own. Together, the clusters are designed to handle the largest experienced Fares peak, with 25 percent headroom for situations such as outages and fare wars. "The things that are most important for an environment like this are stability and availability," says Wiseman. "We've designed it so that the possibility of all those clusters failing at once is so small as to be almost incomprehensible."

The new architecture also makes testing more predictable and accurate. "We build a single complete cluster, and we can scale the results linearly," says Wiseman. "As long as our testing on one cluster is accurate, we can predict how it will scale over the rest because they are all the same."
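
The linear-scaling reasoning can be sketched in a few lines. The 700 requests per second peak and the 25 percent headroom come from the figures quoted in this article. The per-cluster throughput of 100 requests per second is a made-up measurement for illustration, and the function names are hypothetical.

```python
import math

def required_capacity(peak_rps, headroom):
    """Capacity target: the largest observed peak plus a safety margin."""
    return peak_rps * (1 + headroom)

def predicted_capacity(clusters, per_cluster_rps):
    """Linear scaling assumption: identical clusters add up."""
    return clusters * per_cluster_rps

def clusters_needed(peak_rps, headroom, per_cluster_rps):
    """Smallest number of identical clusters that meets the target."""
    return math.ceil(required_capacity(peak_rps, headroom) / per_cluster_rps)

PEAK_RPS = 700          # peak pricing requests per second (from the article)
HEADROOM = 0.25         # 25 percent headroom (from the article)
PER_CLUSTER_RPS = 100   # hypothetical result of testing one cluster

print(required_capacity(PEAK_RPS, HEADROOM))                 # 875.0
print(clusters_needed(PEAK_RPS, HEADROOM, PER_CLUSTER_RPS))  # 9
print(predicted_capacity(12, PER_CLUSTER_RPS))               # 1200
```

The prediction is only as good as the assumption that every cluster is identical and that nothing shared between clusters becomes a bottleneck, which is the condition Wiseman attaches to it.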

Despite having to re-architect the Linux platform, Wiseman says the combination of Linux on Intel servers still saves more than 90 percent over Unix. All told, the platform cost for Fares for the three years beginning in 2001 went from a projected $100 million for the mainframe to an estimated $25 million for Unix to $2.5 million for Linux, according to Lutz.
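
As a quick check on these figures, the savings percentages can be worked out directly. The dollar amounts are Lutz's projections and estimates as quoted above; the helper function is only an illustration.

```python
def savings_percent(old_cost, new_cost):
    """Percentage saved by moving from old_cost to new_cost."""
    return (old_cost - new_cost) * 100 / old_cost

MAINFRAME = 100_000_000   # projected three-year platform cost
UNIX = 25_000_000         # estimated
LINUX = 2_500_000         # estimated

print(savings_percent(UNIX, LINUX))        # 90.0
print(savings_percent(MAINFRAME, LINUX))   # 97.5
print(savings_percent(MAINFRAME, UNIX))    # 75.0
```

With these rounded estimates the Linux platform comes out at 90 percent below Unix and 97.5 percent below the mainframe projection, in line with the savings Wiseman cites.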

Culture Shift

Yet hardware and software don't account for the entire picture in such an infrastructure change. "When anyone in my position makes a commitment to a new technology, it's not simply the cost of the project, it's the cost of everything moving forward," says Lutz. "You're retraining people. And so if you have a $2 million dollar project to implement a Linux system, you're maybe making a $10 million to $15 million decision, because you're changing the whole course of IT development—training, support [and] application development."

The change to Linux and subsequent projects that use open source, such as Web services, has affected probably 50 percent of his 380-person staff, says Lutz. "Open source is propelling us to adopt Java and a new way of programming," he says. For some of his staff, those changes haven't been for the better, he says. "We had to reassign those who could not—or would not—move forward."

The staff (both applications developers and systems administrators) who did make the change had to become more aggressive and intuitive in finding solutions to problems on their own. "We have to have a higher degree of technical support internally now," says Lutz. "When you're working with [commercial software], there are pretty standard diagnostic methods to use when things don't work. [But] Red Hat isn't going to give us the solution to every problem," he adds, because it doesn't control the core development of Linux. "My teams have to be far better technically and in their problem-solving skills than before."

This frontier approach to problem solving has made architecture a more critical component of project planning and development, adds Lutz. "Before open source, our architects were much more involved at the beginning of the project and less at the end. Today, our architects are living with the architecture and living with the project teams, because the technology is more difficult to figure out, and the cause of problems are more difficult to diagnose." That has driven total costs up 5 percent for application development and support, as Lutz has brought in more architects and more skilled support people to manage the new infrastructure. "That is an easy price to pay for free software," he adds.

Linux Without Fear

The savings from the new architecture have Cendant looking at an even more ambitious migration to Linux. The Fares application and infrastructure represent just 10 percent of the Galileo computing platform. The rest houses the massive collection of flight information for every airline and every route in the world, and runs on a 1970s-era mainframe operating system called the Transaction Processing Facility (TPF). "Unlike today's operating systems, TPF was designed almost exclusively for speed," says Wiseman.

Wiseman has no idea how many flocks of penguins he would need to displace the polar bear mainframe, but he is looking into it. Such a move would put Galileo on the same infrastructure footing as the other pieces of Cendant Travel Distribution Services, most of which have a dotcom heritage. For example, Orbitz's infrastructure was built from scratch on Linux.

To Lutz, Linux has achieved its goal: to become a viable alternative to proprietary operating systems. He professes no interest in, nor understanding of, the mechanics of the open-source movement. "The Linux community is still a black box to me," he says.

The community is irrelevant to him, because the software can run his infrastructure, and he can buy enough support for it from vendors. "When I look at the constant reengineering we have to do within the travel agency business [to become more efficient], to me, there's no other solution besides open source, given our volumes, our transaction rates and the problems we have to solve."

The "black box" of open source has transformed into something any CIO can appreciate: reliable performance and consistent uptime. The penguin can fly now.



Executive Editor Christopher Koch can be reached at ckoch@cio.com.