Why would someone build a cluster?

I’d like to answer that question while introducing myself. I was born in 1984 and I started my PC career at the age of 10 with an Intel 80486DX2. Before that, we had an 80286 which I mostly used for playing Golden Axe ^^. Ever since we got the 80486, my interest in computers has grown steadily, and after connecting to the internet with my first 14.4K dial-up modem, I can’t remember the last time I was offline. Back in those days, the internet was slow, full of static HTML pages (GeoCities, I miss you) and it even sounded different.

My fascination with computers in general, computer networks and programming led me to the idea of studying computer science. After a short “detour” I got my Bachelor’s degree in 2011 and started a Master’s program which I finished at the beginning of 2014. Now I’m working on my PhD in the Database Group at the University of Leipzig while also being responsible for exercises and practical courses. During my studies I developed a strong interest in distributed systems, data management and graph theory. My research focuses on graph-based software systems, i.e. graph databases (e.g. Neo4j) and graph processing systems (e.g. Apache Giraph). I also gained some practical experience working as a Java/PHP/Python/C++ developer for an e-commerce company, a graph database vendor and SAP. During my Master’s program, I worked as a research assistant doing mostly Java and Python programming.

Ever since I first dialed up to the internet, I have been fascinated by the idea of connected machines communicating with each other, doing work for humans and making new kinds of services possible. Using many machines organized in a computer cluster to solve heavy algorithmic problems is an especially interesting and exciting use case for me. So it just made sense to combine my personal interest with my research focus. The first step was to find out how I could get my hands on a cluster.

Google Datacenter Images

The image above was taken in one of Google’s datacenters and gives you a good impression of what a cluster can look like. Since I do not work at Google, getting exclusive access to such a cluster is nowhere close to possible ... and also not necessary. I came up with the following options:

1. Use a multi-CPU server at the University and install virtual machines to simulate a distributed environment

At our department we have access to some servers with 8 to 32 cores and 16 to 32 GB of main memory. The main advantage of using one of these servers is that we have exclusive access to them: the resources are not shared with other departments, and experiments can run without interference from other processes. The operating system is either Windows Server or SUSE Linux Enterprise. So one option would be to install some virtual machines, let them share the given hardware resources and start experimenting. I found three main disadvantages of that setup:

  • Scalability is limited because the hardware resources of a single server are limited, and so is the number of virtual machines it can host.
  • Virtual machines on a single server form a virtual shared-nothing setup that is in fact a shared-everything setup. This could lead to effects (due to hardware allocation by the hypervisor) which cannot occur in a real shared-nothing setup where every machine has its own physical resources.
  • My colleagues at the department wouldn’t be so happy about me blocking our servers which, for reasons of social responsibility, is the most serious disadvantage of that option :)

2. Use the computer cluster at the University datacenter

The second option is using the computer cluster at the University datacenter. This would be my preferred solution since the datacenter has the most resources and the highest computing power per computing node. In fact, I want to use their cluster at a later stage of my research, but at the beginning it’s not suitable because of the following disadvantages:

  • Non-exclusive access is by far the most critical disadvantage: resources are shared with other departments and even faculties, so you can only run your experiments in allotted time windows.
  • The datacenter also uses virtual machines, which leads to the same disadvantages as stated for the first option.
  • Even if they give you access to a real machine, your user rights are very limited, so just doing a sudo apt-get install foo starts with writing an email to the admin, who is a busy man and gets a lot of emails. As you can probably see, this adds a lot of communication overhead just to get the software stack or config you want.
  • I don’t know why that is, but one has to pay money to use the University datacenter. I don’t know exactly how much it is, but imho that’s a clear disadvantage, especially in the beginning when you want to play around a lot.

3. Use a bunch of scattered desktop machines and connect them as a cluster

A third option, which I think is often used for building a small playground for distributed computing, is to take some desktop machines, plug them together and play. The main advantages of that setup are exclusive access and full admin rights to install whatever you want on the machines. We also have some spare desktops at work, and at the moment I’m using some of them to play around and evaluate different graph processing frameworks. But that option also has some important disadvantages:

  • The number of free desktop machines is limited. Colleagues and students use them for their regular work, and at a university you don’t get new hardware every year.
  • Unused machines are often abused as a source of spare parts, like “Oh, I could use some more main memory, let’s take it from that old machine.” This leads to a very heterogeneous hardware setup, which is not very good if you want to run experiments on the scalability of algorithms, where doubling the number of machines needs to double the available hardware resources.
  • They need space, are loud, waste a lot of energy and get warm, so you don’t want them standing next to your desk or even in your office.
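To make the heterogeneity point a bit more concrete, here is a minimal sketch in Python. The memory sizes are made-up values for illustration only (not measurements from our machines), and the function names are hypothetical. It compares how total main memory grows when going from two to four machines in a uniform setup versus one where some machines have been raided for spare parts:

```python
# Hypothetical main memory per machine in GB (illustrative values only).
homogeneous = [8, 8, 8, 8]
heterogeneous = [8, 4, 8, 2]  # some machines lost RAM to "spare part" raids


def total_memory(machines, n):
    """Total main memory of the first n machines."""
    return sum(machines[:n])


def scaling_factor(machines, n):
    """Growth of total memory when going from n to 2n machines."""
    return total_memory(machines, 2 * n) / total_memory(machines, n)


print(scaling_factor(homogeneous, 2))    # exactly 2.0: resources double
print(scaling_factor(heterogeneous, 2))  # below 2.0: resources do not double
```

With uniform machines the factor is exactly 2, so a runtime measured on twice the machines can be attributed to the algorithm. In the heterogeneous case the hardware itself does not scale linearly, which blurs any scalability result.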

4. Use Amazon Web Services EC2 machines

A very popular approach in research, and in my opinion one of the best ideas a company has come up with in recent years, is Amazon Web Services (AWS). The idea of renting out spare hardware resources which would otherwise sit idle, waste energy and cost a lot of money is as brilliant as it is simple (not in its realization, of course). Amazon even offers an educational program where one can apply for a research grant and get a fixed amount of credits which can be spent on virtual hardware resources. You just register, leave your credit card information, spin up some virtual machines with predefined performance specifications and start working with them. Payment is based on resource consumption, which offers a lot of flexibility. The number of virtual machines is not limited, so this seems to be the perfect approach for testing the scalability of distributed algorithms and software in general. For some extra money, Amazon even offers mechanisms that let you use the underlying hardware resources exclusively. Besides the previously stated disadvantages of working with a virtual shared-nothing system, there are two more disadvantages I can think of at the moment:

  • Experimental overhead costs money: setting up and configuring the system, deploying and testing software or just learning while playing with the cluster. I guess a nice approach would be to use a local cluster of desktop machines as described before, set up the experiment, play around until you’re satisfied with the configuration, and then deploy it to AWS and perform the actual computation / experiment.
  • Many people, especially companies, are concerned about data security and system reliability. Uploading confidential data into a third-party datacenter is of course a matter of trust. Nevertheless, since I mostly compute on generated, artificial data, these concerns do not apply to me.

My short conclusion at that point is: if money did not matter, I would stick with AWS. It offers flexible scalability, consumption-based payment, no big servers standing around, exclusive access, and my own images with admin rights.

But research is always about trying new things out (which is the nicer way of saying “money DOES matter”). So I continued my search for other options …