Easen.co.uk The home of Marc Easen

28Apr/110

Gearman – Distributed processing

How do you scale a web site/application? Or any application for that matter? Well you can bury you head in the sand and throw more hardware resources to the problem (faster hardware, more memory, etc.) ... or make it horizontally scalable of course!

I'm really hate using this phrase but it is technically correct ... "to the cloud" (sorry Chris). What I hate about it the most is the fact that people in marketing (Microsoft's latest advert campaign) tend to use it without actually knowing what the cloud is. Cloud computing is just a fancy word someone coined for the Internet - the Internet being a vast collection of servers. That's what I mean when I say horizontally scalable, use a collection of servers to do your work for you and not one really powerful machine.

Now let's get a bit technical. The popular way of distributing your work I know is to use a Message-Oriented Middleware, where messages are passed between two or more processors via a message broker. The brokers job is to accept incoming messages, save them to a queue and then dispatch them to an available process. This all well and good, but in my experience the broker can become the single point of failure, unless you use a cluster of them. MOM's tend to be quite CPU and memory intensive, meaning it has to live on it's own hardware. This is where Gearman comes in...

Gearman...

"...allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates." (http://gearman.org/).

The main difference for me is the fact you can have multiple Gearman job servers running on your platform, compared a MOM platform where you have one message broker being your have a single point of failure - meaning less headaches for everyone.

One way I see it adding benefit a scalable project if is to configure your boxes which are going to processing work a single Gearman job server and have all the workers on that box pointing to that server. Next load balance these boxes via virtual IP (VIP). Now all you do is send your jobs to this VIP which will be sent a job server and then be picked up by one of the workers. To scale your application add more boxes to your little Gearman cluster. This solution not only scales your throughput, it also scales the application which is handling your jobs. Another difference which is quite interesting, you can run processes in parallel and wait for them to completed before continuing, e.g. computing basic statistics for group of users and then combining the results (see MapReduce).

Gearman is still quite young and still in beta so expect the somethings to change as it develops and matures. Saying that it's currently being used by some really large websites - Grooveshark, Yahoo and Digg. As Gearman is written by the same people who write and maintain Memcache, I'm expecting it to stay around for quite a while.