Sunday, June 15, 2008

James Hamilton's Take on Cluster Computing

James Hamilton of Windows Live Services wrote a compelling paper, "On Designing and Deploying Internet-Scale Services" (PDF).

Here's what I think are the most intriguing bits:

  • - Never shut down your services normally. (p. 2)
    - One pod should not affect another pod [great for testing new versions of a cluster] (p. 3)
    - Correlated failures are common [a switch failure may affect lots of computers] (p. 5)
    - It must be easy to host the entire service on a single system (p. 7)
    - Soft delete only [great for debugging] (p. 9)
    - Avoiding latencies is the toughest problem (p. 10)
  • Google's Lessons of Real Hardware

    Interesting slide:


    Taken from Jeff Dean's slides on Google - A Behind-the-Scenes Tour. [via Geeking with Greg]
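
To make the "soft delete only" point from Hamilton's list a bit more concrete, here is a minimal Python sketch. It is not taken from the paper: the `SoftDeleteStore` class, its method names, and the in-memory dictionary are all hypothetical, chosen only to illustrate the idea of marking a record as deleted instead of destroying it.

```python
import time


class SoftDeleteStore:
    """Hypothetical in-memory store that marks rows deleted instead of removing them."""

    def __init__(self):
        # key -> {"value": ..., "deleted_at": timestamp or None}
        self._rows = {}

    def put(self, key, value):
        # Writing a key (re)creates it as a live row.
        self._rows[key] = {"value": value, "deleted_at": None}

    def get(self, key):
        # Soft-deleted rows are hidden from normal reads.
        row = self._rows.get(key)
        if row is None or row["deleted_at"] is not None:
            return None
        return row["value"]

    def delete(self, key):
        # Mark the row as deleted; the data itself is kept.
        row = self._rows.get(key)
        if row is None or row["deleted_at"] is not None:
            return False
        row["deleted_at"] = time.time()
        return True

    def undelete(self, key):
        # Recover from an accidental delete by clearing the marker.
        row = self._rows.get(key)
        if row is None or row["deleted_at"] is None:
            return False
        row["deleted_at"] = None
        return True

    def deleted_keys(self):
        # Handy when debugging: list everything that was soft-deleted.
        return sorted(k for k, r in self._rows.items() if r["deleted_at"] is not None)


store = SoftDeleteStore()
store.put("user:1", {"name": "alice"})
store.delete("user:1")
print(store.get("user:1"))    # hidden from normal reads
print(store.deleted_keys())   # still inspectable
store.undelete("user:1")
print(store.get("user:1"))    # restored
```

Because the data is only flagged, a mistaken delete can be undone and the deleted rows remain available for inspection, which is where the debugging benefit comes from. A real service would presumably also need a separate, deliberate purge step; that is left out of this sketch.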