Wednesday, January 09, 2013

Velcro: Young Google's sticky little secret

The biggest stories in recent application development history -- Amazon.com and Google -- are so big that they are pretty much hidden. Both applications required a big helping of chutzpah to happen at all. And both disrupted existing industries, creating whole new ones. Amazon.com's and Google's development managers stuck their necks out and trusted clusters of cheap computers to deliver the goods. In Amazon's case, it was personalized shopping. For Google, it was lightning-fast Internet searches that served up usually useful results along with targeted advertising. Both applications, under the hood, employed a fair helping of warmed-over AI technology of yore. It was the cheap clusters, though, that made things possible. Amazon.com doesn't mind putting developer effort into making custom logic servers, or tweaking OSes to provide the type of fault tolerance this mega-site needs. Even in 2004 the Amazon crew thought of their end product as an application, not just a Web site.

***

This reporter got another take on "fast, reliable and cheap" deployments at Usenix in Boston in June 2004. At a keynote there, Rob Pike described a Google application development mentality that led the company to take on the responsibility for developing its own fault tolerance. Google's approach to constantly indexing 4 billion Web documents was based on the carefully honed notion that failures are always there. If you can admit that failure is always lurking, said Pike, you might as well not spend a fortune on fault-tolerant hardware, as companies have done to a fare-thee-well on more than a few occasions. Pike is a member of the Systems Lab at Google Inc., as well as a principal designer and implementer of the Plan 9 and Inferno operating systems.

"Failures happen no matter what you do," he said. "That means the software you use has to cope. That means replicate everything." Placing it all into perspective, he said: "Two pieces of crap are better than one."

If your software can cope, then you can buy "really cheap, nasty" hardware. If your application is like Google's, and you have to write fault-tolerant software anyway, indicated Pike, you might as well buy cheap stuff. Cheap or expensive, it will fail. He gleefully described the cheapness of the hardware Google used, at least to get going, complete with photos of loose, stacked commodity disk drives held to racks with good old Velcro. Google, the killer app, uses cheap disks that are expected to fail. The company has been able to fashion Linux to make up the difference, creating a self-healing system, although day by day, individual humans (you might call them "Healers") must go down the racks swapping in good disk drives for bad. Fault tolerance is not a job; it is a mindset, Pike said, and to succeed "you better understand failure."
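In practice, Pike's "replicate everything" dictum reduces to software that treats any single failure as routine. Here is a minimal sketch in Python of that mindset; every name is invented for illustration, and an IOError stands in for a dead disk or machine. It is emphatically not Google's code:

    import random

    class AllReplicasFailed(Exception):
        """Raised only when every copy is gone -- the case replication makes rare."""

    def read_with_failover(replicas, key):
        # Try the copies in random order; any one of them dying is routine.
        # 'replicas' are stand-ins for Pike's cheap, nasty disks: objects
        # with a get(key) method that may raise IOError at any moment.
        for replica in random.sample(replicas, len(replicas)):
            try:
                return replica.get(key)
            except IOError:
                continue  # "Two pieces of crap are better than one."
        raise AllReplicasFailed("all %d replicas failed for %r" % (len(replicas), key))

The hardware can be as nasty as you like, so long as the loop above almost never falls out the bottom.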
***
This is roughly how failover works, as found in this reporter's notes after Pike spoke. The Google search-indexing problem is too large for one machine, so multiple machines are used. The search system uses Google's PageRank system to establish the total order of things. The Index Server's version of all the Web's pages is split into pieces called "shards." The shards are small enough that you can put multiple shards on one machine, but the shards are replicated on different machines so that they can fail over. PageRank tells you how much to replicate: a high-rank shard is copied many, many times (see the sketch below). The same thing is done on the Google Document Server side. The software is aware of the structure of the app, and it spreads things around to avoid single points of failure.

***

Pike is kidding when he says the commodity disk drives are held to racks "with Velcro." He says this to drive home the lesson that commodity hardware is the way to do fault tolerance. However, he does admit that the cheap hardware approach may not be the best choice in every instance. "We treat commodity disks like server disks and pay the price sometimes," Pike said. As always, a look at the notes on risk in an IPO prospectus provides caution to the optimist and tonic for the jaundiced. Google warns: "We may have difficulty scaling and adapting our existing architecture to accommodate increased traffic and technology advances or changing business requirements." So the future, as usual, lies ahead. In the past, of course, there have been snarls. As stated in its IPO prospectus, in November 2003 Google failed to provide Web search results for about 20% of its traffic for a period of 30 minutes. But the usual result is that failures happen and the end user doesn't notice. Not bad for Velcro.
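The replication arithmetic from those notes is simple enough to sketch. Here is one hypothetical rendering in Python: the Shard type, the place_shards helper, the replica-count formula and its constants, and the 0.0-1.0 rank scale are all invented for illustration, since Pike gave no actual numbers:

    import random
    from dataclasses import dataclass

    @dataclass
    class Shard:
        name: str
        page_rank: float  # normalized to 0.0-1.0 for this sketch

    def replica_count(rank, floor=2, scale=10):
        # High-rank shards get many copies; every shard gets at least 'floor'.
        return floor + int(rank * scale)

    def place_shards(shards, machines):
        # Put each shard's replicas on distinct machines, so that no single
        # machine is ever the only home of a shard.
        return {
            s.name: random.sample(machines, min(replica_count(s.page_rank), len(machines)))
            for s in shards
        }

    machines = ["machine-%02d" % i for i in range(20)]
    shards = [Shard("popular-shard", 0.9), Shard("obscure-shard", 0.05)]
    print(place_shards(shards, machines))  # 11 homes for the popular shard, 2 for the obscure

The point of the sketch is the placement function: application-level software, not the hardware, decides where the copies live, which is exactly the structure-awareness Pike described.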
***
Much of this sounds familiar. At least it should. The "fast and cheap" theory is behind some of the work of Rodney Brooks, director of MIT's Computer Science and Artificial Intelligence Laboratory, who argued for many cheap, somewhat intelligent robots over fewer, more intelligent ones in his 1989 paper "Fast, Cheap and Out of Control: A Robot Invasion of the Solar System." I think Amazon's and Google's teams arrived at a similar conception from first principles, but Rodney Brooks had provided the rhetorical underpinnings in his work, which reached a wider audience as a result of the movie "Fast, Cheap & Out of Control." [I mention this movie only because it gives me the opportunity to mention "The Batmen of Africa." I couldn't make sense of it, but it did include vignettes on Rodney, moles, a topiary gardener, and a lion tamer, one who studied under Clyde Beatty, who, you see, was the star of "The Batmen of Africa," portions of which are included in "Fast, Cheap & Out of Control," which is in turn one of my favorite movies of all time.] Most are aware as well that NASA also got hold of the fast-and-cheap notion, as "faster, better, cheaper," which it backed away from after some interplanetary expedition failures that did not strike too many people as adequately "cheap."
