Designing the Big Compute Architecture Style

ALIF Consulting
Nov 9, 2022
Updated: Nov 29, 2023

In cloud computing, the term “compute” describes concepts and objects related to software computation. It is a generic term for the processing power, memory, networking, storage, and other resources a program needs to run.

For example, applications that run machine learning algorithms or render 3D graphics need many gigabytes of RAM and multiple CPUs to run successfully. In this case, the CPUs, RAM, and graphics processing units (GPUs) are the compute resources, and the applications are compute-intensive applications.

Compute resources are measurable quantities of computing power that can be requested, allocated, and consumed for computing activities. Examples of compute resources include:

CPU

The central processing unit (CPU) is the brain of any computer. On container platforms such as Kubernetes, CPU is measured in units called millicores, where 1,000 millicores equal one core. Application developers can specify how much CPU their application needs to run and process data.

Memory

Memory is measured in bytes. Applications can request the amount of memory they need to run efficiently.

The term big compute describes large-scale workloads that require many cores, often numbering in the hundreds or thousands. Scenarios include image rendering, fluid dynamics, financial risk modelling, oil exploration, drug design, and engineering stress analysis, among others.

Here are some typical characteristics of big compute applications:

The work can be split into discrete tasks, which can be run across many cores simultaneously.
Each task is finite. It takes some input, does some processing, and produces output. The entire application runs for a finite amount of time (minutes to days). A common pattern is to provision a large number of cores in a burst, and then spin down to zero once the application completes.
The application does not need to stay up 24/7. However, the system must handle node failures or application crashes.
For some applications, tasks are independent and can run in parallel. In other cases, tasks are tightly coupled, meaning they must interact or exchange intermediate results. In that case, consider using high-speed networking technologies such as InfiniBand and remote direct memory access (RDMA).
Depending on your workload, you might use compute-intensive VM sizes (H16r, H16mr, and A9).
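
The first two characteristics above, discrete tasks and finite input-process-output work, can be sketched on a single machine with Python's standard library. This is a minimal illustration, not a cluster implementation: the function names are hypothetical, and the sum of squares is a stand-in for real work such as rendering one frame.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_tasks(total_items, task_size):
    """Split the range [0, total_items) into discrete (start, stop) tasks."""
    return [
        (start, min(start + task_size, total_items))
        for start in range(0, total_items, task_size)
    ]


def run_task(bounds):
    """One finite task: take input, process it, return output.

    Assumption: summing squares stands in for the real computation.
    """
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def run_job(total_items, task_size, executor_cls=ProcessPoolExecutor):
    """Fan independent tasks out to workers, then combine their outputs."""
    tasks = split_into_tasks(total_items, task_size)
    with executor_cls() as pool:
        # The tasks share no state, so they can run in any order.
        return sum(pool.map(run_task, tasks))
```

Calling run_job(1_000_000, 50_000) splits the work into 20 tasks and spreads them across the local cores. On platforms that start worker processes by spawning, make that call under an `if __name__ == "__main__":` guard. On a real big compute platform, the same shape applies, with VMs in place of local processes.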

When to use this architecture

Computationally intensive operations such as simulation and number crunching.
Simulations that are computationally intensive and must be split across CPUs in multiple computers (10-1000s).
Simulations that require too much memory for one computer and must be split across multiple computers.
Long-running computations that would take too long to complete on a single computer.
Smaller computations that must be run 100s or 1000s of times, such as Monte Carlo simulations.
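
To make the Monte Carlo case concrete, here is a small sketch of one computation repeated many times. The names are hypothetical, and estimating pi is only an assumed stand-in for a real model such as a financial risk simulation. Each run depends only on its own seed, so every call to simulate_once could be dispatched as a separate task.

```python
import random
from statistics import mean


def simulate_once(seed, samples=10_000):
    """One small, independent run: estimate pi from random points.

    Assumption: pi estimation stands in for a real simulation model.
    """
    rng = random.Random(seed)  # a per-run seed keeps each run reproducible
    inside = sum(
        1
        for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # The fraction of points inside the quarter circle approximates pi / 4.
    return 4.0 * inside / samples


def run_many(runs, samples=10_000):
    """Average the estimates from `runs` independent simulations."""
    return mean(simulate_once(seed, samples) for seed in range(runs))
```

Because the runs share nothing, throughput grows roughly in proportion to the number of cores, until provisioning and scheduling overhead start to dominate.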

Benefits

High performance with "embarrassingly parallel" processing.
Can harness hundreds or thousands of computer cores to solve large problems faster.
Access to specialized high-performance hardware with dedicated high-speed InfiniBand networks.
You can provision VMs as needed to do work and then tear them down.

Challenges

Managing the VM infrastructure.
Managing the volume of number-crunching.
Provisioning thousands of cores in a timely manner.
For tightly coupled tasks, adding more cores can have diminishing returns. You may need to experiment to find the optimum number of cores.
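
One simple way to reason about diminishing returns is Amdahl's law, which is brought in here as an assumption and does not come from the original guidance. It models a workload in which a fixed fraction of the work cannot be parallelized. Real tightly coupled jobs also pay communication costs that the formula ignores, so treat the result as an optimistic upper bound.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work runs in parallel.

    Assumption: Amdahl's law, which ignores communication overhead.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # With 95% of the work parallelizable, the speedup can never reach 20x,
    # no matter how many cores are added.
    for cores in (1, 16, 64, 256, 1024):
        print(cores, round(amdahl_speedup(0.95, cores), 1))
```

The curve flattens quickly, which is why measuring a job at several core counts is usually more reliable than provisioning the maximum available.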

Big compute using Azure Batch

Azure Batch is a managed service for running large-scale high-performance computing (HPC) applications.

Using Azure Batch, you configure a VM pool and upload the applications and data files. Then, the Batch service provisions the VMs, assigns tasks to them, runs the tasks, and monitors their progress. Batch can automatically scale out the VMs in response to the workload and also provides job scheduling.
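
The workflow above (queue the tasks, hand them to nodes in a pool, monitor them, and retry failures) can be modelled with a toy scheduler. This is not the Azure Batch SDK, and it runs everything in a single process. The function name, its parameters, and the retry policy are all hypothetical, chosen only to show the shape of the control loop.

```python
from collections import deque


def schedule(tasks, node_count, max_retries=2):
    """Toy scheduler (an assumption, not the Azure Batch API).

    `tasks` maps a task id to a zero-argument callable.
    Returns (results, failed), both keyed by task id.
    """
    queue = deque((task_id, 0) for task_id in tasks)
    results, failed = {}, {}
    while queue:
        # Each round, every node in the pool picks up at most one task.
        batch = [queue.popleft() for _ in range(min(node_count, len(queue)))]
        for task_id, attempts in batch:
            try:
                results[task_id] = tasks[task_id]()
            except Exception as exc:
                if attempts < max_retries:
                    # Simulates rescheduling after a node or task failure.
                    queue.append((task_id, attempts + 1))
                else:
                    failed[task_id] = exc
    return results, failed
```

A managed service takes over this loop, along with VM provisioning and scaling, which is the main reason to choose it over operating the cluster yourself.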

Big compute running on Virtual Machines

You can use Microsoft HPC Pack to administer a cluster of Virtual Machines and to schedule and monitor HPC jobs. With this approach, you must provision and manage the VMs and network infrastructure. Consider this approach if you have existing HPC workloads and want to move some or all of them to Azure. You can move the entire HPC cluster to Azure, or you can keep your HPC cluster on-premises but use Azure for burst capacity.

HPC Pack deployed to Azure

In this scenario, the HPC cluster is created entirely within Azure.

The head node provides management and job scheduling services to the cluster. For tightly coupled tasks, use an RDMA network that provides very high bandwidth and low latency communication between VMs.

Burst an HPC cluster to Azure

In this scenario, an organization runs HPC Pack on-premises and uses Azure VMs for burst capacity. The cluster head node is on-premises. ExpressRoute or VPN Gateway connects the on-premises network to the Azure VNet.