A supercomputer is a high-performance computer designed to handle complex tasks far beyond the capabilities of a typical personal computer. Performance is usually measured in floating-point operations per second (FLOPS) rather than the traditional million instructions per second (MIPS). Since 2017, some supercomputers have been capable of performing over a hundred quadrillion FLOPS (100 petaFLOPS).
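A machine's theoretical peak FLOPS is commonly estimated from its core count, clock rate, and the number of floating-point operations each core can complete per cycle. The sketch below illustrates the arithmetic; the figures plugged in are made up for illustration, not taken from any real system.

```python
def peak_flops(nodes, cores_per_node, clock_hz, flops_per_cycle):
    """Theoretical upper bound; real workloads reach only a fraction of this."""
    return nodes * cores_per_node * clock_hz * flops_per_cycle

# Assumed example figures (hypothetical machine):
est = peak_flops(nodes=10_000, cores_per_node=64, clock_hz=2.0e9, flops_per_cycle=16)
print(f"{est:.3e} FLOPS")  # -> 2.048e+16 FLOPS, i.e. about 20 petaFLOPS
```

Benchmarks such as LINPACK report sustained rather than peak performance, which is why measured FLOPS figures sit below this kind of estimate.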
As of November 2017, all of the world's 500 fastest supercomputers run Linux-based operating systems. Research continues in the United States, the European Union, Taiwan, Japan, and China to build even faster and more powerful machines.
Supercomputers are crucial in fields like computational science and are used for various tasks, such as quantum mechanics, weather forecasting, climate studies, oil and gas exploration, molecular modeling, and simulating physical processes like the early universe or nuclear fusion. They also play a significant role in cryptanalysis.
The development of supercomputers began in the 1960s, with early machines built by Seymour Cray at Control Data Corporation and later at Cray Research. These initial machines were faster versions of conventional designs; the first real leap in processing power came in the 1970s with the introduction of vector processors. The Cray-1, released in 1976, was one of the most notable early models. Over the years, supercomputers evolved toward ever greater parallelism, and massively parallel systems with tens of thousands of processors are now standard.
Energy Usage and Heat Management
One major challenge in supercomputing is managing the heat these powerful systems generate. Supercomputers consume a significant amount of electrical power, almost all of which is converted into heat, requiring extensive cooling. Managing this heat is essential, as excessive temperatures can damage components and shorten the system's lifespan. Cooling methods range from circulating Fluorinert to hybrid liquid-air systems. The cost to power and cool these machines can be quite high—around $400 per hour, or about $3.5 million annually.
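The cost figures above can be checked with simple arithmetic. The 4 MW power draw and $0.10/kWh electricity price used below are assumptions consistent with the $400-per-hour figure, not data about any specific machine.

```python
# Illustrative cost check; power draw and electricity price are assumed values.
power_mw = 4.0          # assumed power draw in megawatts
price_per_kwh = 0.10    # assumed electricity price in USD per kilowatt-hour

hourly_cost = power_mw * 1_000 * price_per_kwh  # 4,000 kWh per hour x $0.10
annual_cost = hourly_cost * 24 * 365

print(f"${hourly_cost:,.0f}/hour, ${annual_cost:,.0f}/year")
# -> $400/hour, $3,504,000/year
```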
Heat density remains a critical issue in supercomputing, especially when thousands of processors are packed closely together. Systems like the Cray-2 managed heat with liquid cooling, using a "cooling waterfall" of liquid forced through the modules under pressure.
Operating Systems
Supercomputer operating systems have evolved as the architecture of the machines has changed. Early supercomputers used custom-built operating systems designed for speed, but today, many supercomputers use more generic operating systems like Linux. Most modern supercomputers separate computations and other services across multiple nodes. They typically run lightweight kernels, like CNK or CNL, on compute nodes and a larger Linux-based system on server and I/O nodes.
Job scheduling in supercomputing has also become more complex. In traditional systems, job scheduling mainly deals with processing and peripheral resources. In massively parallel systems, it must also manage computational resources, communication resources, and handle hardware failures across tens of thousands of processors.
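The core allocation problem can be sketched in a few lines: jobs request some number of nodes, and the scheduler launches the highest-priority job that fits the currently free nodes. This is a minimal illustration only; real batch schedulers such as Slurm or PBS also handle backfill, fairness, communication topology, and hardware failures, all omitted here. The job names and sizes are made up.

```python
import heapq

def schedule(jobs, total_nodes):
    """jobs: list of (priority, name, nodes_needed); lower priority value runs first."""
    heap = list(jobs)
    heapq.heapify(heap)
    free, started, deferred = total_nodes, [], []
    while heap:
        _priority, name, needed = heapq.heappop(heap)
        if needed <= free:
            free -= needed          # allocate nodes and launch the job
            started.append(name)
        else:
            deferred.append(name)   # must wait for nodes to free up
    return started, deferred

started, deferred = schedule(
    [(1, "climate-sim", 512), (2, "qcd", 768), (3, "cfd", 256)], total_nodes=1024)
print(started, deferred)  # -> ['climate-sim', 'cfd'] ['qcd']
```

Even this toy version shows why scheduling gets harder at scale: the "qcd" job is deferred despite its priority because the remaining nodes cannot satisfy it, the situation real backfill algorithms exist to exploit.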
Although most supercomputers use Linux-based operating systems, manufacturers often create their own hardware-specific Linux variants to optimize the system for their machines. This lack of a universal industry standard stems from the need to tailor the operating system to each unique hardware design.
Cloud Computing in HPC
Cloud computing has gained significant attention from high-performance computing (HPC) users and developers. The goal is to offer HPC-as-a-service: scalable, on-demand resources that are fast and cost-effective. However, moving HPC applications to the cloud comes with challenges, including virtualization overhead, resource multi-tenancy, and network latency. Research is ongoing to address these issues and make HPC in the cloud a more viable option.
In 2016, companies like Penguin Computing, Amazon Web Services, and others began offering HPC cloud computing solutions. For example, Penguin Computing’s POD (Penguin On Demand) cloud model offers bare-metal compute nodes for executing code, connected via high-speed, non-virtualized networks.
Penguin Computing has argued that virtualization of compute nodes, as used in services like Amazon's Elastic Compute Cloud, isn't ideal for HPC, since it adds latency and impairs performance. Some HPC applications also suffer when compute nodes are spread out over long distances, which further increases latency.
In conclusion, demand for supercomputers continues to grow as technology advances. These powerful machines are essential for tackling some of the world's most complex problems, making their effective use increasingly important.