Hello list,

Here is my thesis describing my work on Elastic Load Balancing in StarCluster. Many thanks to Justin Riley for his help in getting this done.

The entire PDF is located at:
http://www.hindoogle.com/thesis/BanerjeeR_Thesis0316.pdf
It is 71 pages long.

Here is the abstract:

Computing in the cloud provides companies and colleges with a new way to perform sophisticated computational tasks. Amazon.com, Inc. (Amazon) is the leading provider of cloud infrastructure, and its solutions are used by thousands of companies, universities, and individuals. Amazon's service, dubbed the Elastic Compute Cloud (EC2), allows users to rent servers by the hour, so that computing power can be increased and decreased as needed. It eliminates the need for companies to build and maintain expensive data centers. Instead, customers can rent servers to perform tasks as needed and turn them off when the tasks are completed.
The ability to quickly add and remove computing capacity enables users to scale computing power in business and academic settings alike. When one needs to perform sophisticated calculations, process large data sets, or serve many concurrent clients, having more computing power improves the throughput and responsiveness of the system. Tasks can be completed in less time and client requests can be served faster. In a traditional environment, where a company or university builds and maintains every server in its data center, adding new computing capacity takes days or even weeks and costs a significant amount of money. Amazon EC2 allows capacity to be added and removed almost instantly, and its services are reasonably priced. A new server can be available in as little as five minutes and can be terminated at any time. Server usage is billed by the hour, so users pay only for the hours they use. This flexibility, coupled with Amazon's low prices, is a boon to anyone who needs to perform complex computational tasks for short or unpredictable periods of time.
The need for enormous amounts of computing power for short periods of time is a common characteristic of scientists performing High Performance Computing (HPC). HPC tasks are crucially important to modern science and can range from modeling the microscopic molecular interactions in a protein to simulating a nuclear weapon. Before the availability of cloud computing resources, HPC users ran their computational tasks almost exclusively on very expensive supercomputers, which can cost in excess of $500 per hour and must be reserved ahead of time. These supercomputers are installed at many major universities, corporations, and research laboratories, but are not easily accessible because of their high cost. The recent installation of IBM's Roadrunner supercomputer at Los Alamos National Laboratory in New Mexico cost over $133 million.
With program decomposition techniques, scientists can break seemingly intractable problems into smaller, more manageable subtasks that run independently. Such a problem can then be solved by these extremely powerful supercomputers, which distribute the subtasks among the many discrete processors within the machine. The processors are connected by fast, high-bandwidth communication channels. When discrete subtasks within the larger problem need to share information, such as the attractive charges emitted by a molecule in a protein folding simulation, that information is exchanged quickly and frequently over these inter-processor links. Protein folding simulations are particularly well suited to parallelization because small parts of the molecule can be simulated independently, and the individual results can then be combined to find the ideal structure of the complete protein. Parallelized problems like this can be solved on powerful, expensive supercomputers, or on a cluster of computers that are cheaper and more readily available. Some problems have unique requirements, such as continuous single-threaded access to a high-powered processor, and those problems are outside the scope of this project.
A project called StarCluster brings the flexibility and low cost of clustered cloud computing to scientists and other users of High Performance Computing. Users can launch a cluster of Amazon EC2 servers, also called instances, through StarCluster and have a fully configured, ready-to-use computational cluster online in less than ten minutes, for as little as $0.08 per instance per hour. No reservations are required, and a cluster of up to 20 machines can be launched whenever the user desires.
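For readers who have not used StarCluster, launching a cluster amounts to describing a cluster template in its configuration file and running a single command. The excerpt below is only an illustrative sketch; the template name, keypair name, and AMI ID are placeholders, and the exact keys and values depend on your StarCluster version and AWS account (AWS credentials go in a separate [aws info] section, omitted here).

    # ~/.starcluster/config (excerpt)
    [global]
    DEFAULT_TEMPLATE = smallcluster

    [cluster smallcluster]
    KEYNAME = mykey                # an EC2 keypair you have created
    CLUSTER_SIZE = 4               # number of instances to launch
    NODE_IMAGE_ID = ami-xxxxxxxx   # placeholder StarCluster AMI ID
    NODE_INSTANCE_TYPE = m1.small  # EC2 instance type

    $ starcluster start mycluster      # bring the cluster online
    $ starcluster terminate mycluster  # shut it down when finished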
StarCluster has made high performance computing in the cloud an affordable reality for many scientists who do not have access to expensive supercomputers. StarCluster, which is free, has approximately 500 users worldwide, most of whom are in academia. Using StarCluster incurs no additional fees beyond the per-hour cost of the EC2 instances themselves. StarCluster is a superb tool for scientists who need supercomputing power and who know in advance how much time and how many computational resources their tasks will require.
Despite its many strengths, StarCluster does not easily adapt to changing workloads. This type of adaptability in the cloud is called elasticity. In StarCluster, when a cluster of instances is launched, the scientist must specify how many instances he or she wants. Those instances are launched together and can only be terminated together; instances cannot be terminated individually, even if one of them is idle. In some situations it is impossible to predict the workload of a cluster, such as when a scientist overestimates the duration of a task, or when data processing finishes early because an unexpected network upgrade transfers files faster. There are many reasons a task could complete faster or slower than expected. Keeping many idle instances running indefinitely wastes both money, in fees paid to Amazon, and energy.
This project, Elastic Load Balancing in EC2, aims to address this weakness by adding an Elastic Load Balancer to StarCluster. The Elastic Load Balancer (ELB) will add instances to the cluster to improve job throughput when the cluster is heavily loaded, and terminate instances when they are idle to save money and energy. The ELB will periodically poll the cluster, analyze its workload, decide whether the cluster needs to be resized, and add or remove instances accordingly. Through this process, StarCluster will maximize job throughput at busy times and save money at idle times.
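In rough terms, the balancer's main loop follows the poll / analyze / act cycle described above. The Python sketch below is only an illustration of that cycle, not the actual StarCluster implementation; every name in it (the cluster object, get_queued_jobs, get_idle_nodes, add_node, remove_node, and the threshold values) is hypothetical.

    import time

    POLL_INTERVAL = 60          # seconds between polls (hypothetical value)
    MAX_QUEUE_WAIT = 10 * 60    # grow the cluster if jobs have waited this long
    MIN_NODES = 1               # never shrink below the master node

    def oldest_wait_time(jobs):
        """Return how long (in seconds) the longest-waiting job has been queued."""
        return max(time.time() - job.submit_time for job in jobs)

    def balance_loop(cluster):
        """Poll a hypothetical cluster object and grow or shrink it accordingly."""
        while True:
            queued = cluster.get_queued_jobs()   # jobs waiting to run
            idle = cluster.get_idle_nodes()      # nodes with no running jobs

            if queued and oldest_wait_time(queued) > MAX_QUEUE_WAIT:
                # Cluster is overloaded: add capacity to improve throughput.
                cluster.add_node()
            elif not queued and idle and cluster.node_count() > MIN_NODES:
                # Cluster is underused: remove an idle node to save money.
                cluster.remove_node(idle[0])

            time.sleep(POLL_INTERVAL)

A production balancer would likely also need to account for EC2's hour-based billing when deciding exactly when to terminate an idle node, since a node that is shut down mid-hour has already been paid for through the end of that hour.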
Several powerful Elastic Load Balancers are commercially available for cloud and EC2 deployments, but StarCluster's ELB is the only one specifically targeted at the High Performance Computing domain. Existing ELB implementations are geared toward web server and application server environments and are discussed in the Prior Work section. HPC workloads have a unique computing profile: jobs are long-running and seldom serve external clients. This profile mandates a new Elastic Load Balancing strategy.
Any comments or questions are welcome.

Best,
Rajat