Lecture 10: High performance computing – Lecture

Jan Benáček

Institute for Physics and Astronomy, University of Potsdam

January 16, 2025

1 Lecture overview

Table of Contents 1/2

  • 17.10. Introduction + numerical methods summary
  • 24.10. Numerical methods of differential equations — lecture + hands-on
  • 31.10. State holiday
  • 07.11. Test particle approach — lecture
  • 14.11. Test particle approach — lecture + hands-on
  • 21.11. PIC method — lecture
  • 28.11. PIC method — hands-on

Table of Contents 2/2

  • 05.12. Fluid and MHD — lecture (online)
  • 12.12. Fluid and MHD — hands-on given by Xin-Yue Shi
  • 19.12. Canceled
  • 09.01. Radiative transfer — lecture + hands-on
  • 16.01. HPC computing — lecture + hands-on
  • 23.01. Advanced — Smooth particle hydrodynamics method — lecture
  • 30.01. Advanced — Hybrid, Gyrokinetics — lecture
  • 06.02. Advanced — Vlasov and Linear Vlasov dispersion solvers — lecture

2 High performance computing (HPC) introduction

2.1 Introduction

Definition of High performance computing (HPC)

The use of supercomputers and computer clusters to solve advanced computational problems.

Simple performance estimation

  • Let us assume we have numerically implemented one of the space plasma simulation algorithms.
  • What is the simulation performance on a typical computer?
  • 4 cores at 5 GHz, each executing 4 floating point operations (Flops) per CPU tick \(\Rightarrow\) 80 GFlops.
  • What can we achieve?
  • Pushing one particle for one time step can take ~100 Flops \(\Rightarrow\) pushing 800 million particles per second (in the ideal case).
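As a quick back-of-the-envelope check, the two numbers above follow directly from

\[
4~\text{cores} \times 5~\text{GHz} \times 4~\tfrac{\text{Flops}}{\text{tick}} = 80~\text{GFlops},
\qquad
\frac{80 \times 10^{9}~\text{Flops/s}}{100~\text{Flops/particle}} = 8 \times 10^{8}~\text{particles/s}.
\]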

2.2 Your turn

  • Do we need higher performance? Why?
  • How can we speed it up?
  • What can go wrong in my calculation?
  • Have you used any computer cluster?
  • If yes, which one and to calculate what?
  • Have you heard about supercomputer “fights” between countries?
  • Do you plan to follow the academic path after finishing your degree?
  • Do you consider working in industry after finishing your degree?

3 Examples of supercomputers

3.1 Top500

  • List of the most powerful supercomputers in the world: Top 500
  • Released every half year; the current release is from November 2024
  • Not all computers are listed

3.2 Examples

3.3 Green500

  • List of the most energy-efficient supercomputers in the world: Green500

3.4 Quantum computing

IQM quantum computers

3.5 Distributed vs cloud computing


3.6 Access to a (super)computer

Node List of Uni Potsdam cluster

3.7 Application procedure to cluster

  • Profile your simulation to verify that it scales well with the number of CPUs or GPUs
  • Find an interesting large-scale simulation research problem
  • Write a proposal and apply at the computing center
  • Do (not) obtain computing time
  • Access the cluster, compile your simulation, upload your data
  • Submit jobs to the queue (see the sketch below)
  • Wait for them to run!
  • Process, analyze, and download your data
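Job submission usually goes through a batch script handed to the scheduler (often Slurm). Below is a minimal, purely illustrative sketch; the partition, account, module, and executable names are hypothetical placeholders and differ on every cluster.

    #!/bin/bash
    #SBATCH --job-name=plasma-sim        # name shown in the queue
    #SBATCH --nodes=4                    # number of compute nodes
    #SBATCH --ntasks-per-node=96         # MPI ranks per node
    #SBATCH --time=24:00:00              # wall-clock limit
    #SBATCH --partition=general          # hypothetical partition name
    #SBATCH --account=myproject          # hypothetical project account

    module load gcc openmpi              # module names are cluster-specific
    srun ./my_simulation input.params    # hypothetical executable and input

The script would then be submitted with sbatch job.sh and monitored with squeue.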

4 Inside the supercomputer

4.1 Supercomputer scheme

Scheme of a MeluXina supercomputer (Luxembourg)

4.2 Front-nodes / Access nodes

  • Access point to the cluster
  • Typically connected through SSH (Secure SHell)
  • Not for computing purposes
  • Useful for data upload/download, easy analysis, visualization,…
  • Code compilation, running Jupyter Lab or ParaView, job submission

4.3 Compute nodes

  • Units dedicated to the calculations
  • Typically not directly accessible to a user; jobs are processed automatically through a job queue
  • Large computing power
  • Various types of nodes can be present in a supercomputer
    • Compute nodes (typically 2 CPUs with 96 cores and 128 GB RAM)
    • Fat nodes (typically 8 CPUs with 256 cores and 1 TB RAM)
    • Testing nodes
    • GPU/accelerator nodes (typically 2 CPUs with 128 GB RAM and 6 GPUs with 384 GB of HBM memory)

4.4 Storage

  • Large-scale storage for simulation outputs
  • It is always good to know its properties
  • Local storage at nodes / common storage for the whole cluster
  • To upload/download large data sets (>TBs), specific transfer tools are available, e.g., Globus
  • Usual topology:
    • Scratch - direct output from simulations, fast
    • Work - data processing, fast
    • Archive - long-term data backup, tapes, slow
  • Each typically holds tens of PB of data

4.5 Infiniband

  • Needed to get more nodes cooperating on the calculation of the same simulation
  • It provides connection and communication between nodes and between nodes and storage
  • Extremely fast (~400 Gbps between each pair of nodes) + huge aggregated bandwidth
  • Extremely low latency (~1 \(\mu\)s)
  • For comparison, fast Ethernet connections reach ~100 Gbps with ~1 ms latency
  • Simulations must be prepared to use this communication - Message Passing Interface (MPI)

4.6 Software

  • Operating system
  • Modules
  • Licenses
  • Containers
  • Compilation
  • Analysis

4.7 SuperMUC-NG Queues

4.8 Example of SuperMUC-NG - My Hands-on session

5 Simulation on a PC/node

5.1 How do our computers compute our simulations?

  • The following slides show various ways in which computers are optimized to run as fast as possible
  • Understanding them can help us speed up our codes
  • Several hardware + software levels
  • Hardware for calculations: CPUs + GPUs + other accelerators
  • Software speed-up techniques: OpenMP, MPI, CUDA, HIP, …

5.2 CPUs

5.3 CPU - Central Processing Unit

  • Basic and general-purpose processing unit of a computer
  • Reads the instructions of binary programs and executes them
  • Advantage: can run various programs and types of workloads; it serves as the control unit of a computer
  • Disadvantage: not many calculations can be done in parallel
  • Execution speed is given approximately by the clock speed (typically a few GHz)
  • Two main speed-up techniques:
    1. Instruction-level parallelism
    2. Task-level parallelism

Scheme of a CPU (Wikipedia).

5.4 Pipelines

  • Every instruction of a program has several phases
  • Pipeline: Load/Fetch instruction -> Decode -> (Load data) -> Execute -> Memory Access -> Write Back -> …
  • This sequence can be parallelized across consecutive instructions
  • A different hardware unit handles each step
  • E.g., while one instruction is being decoded, the next one can already be loaded.

Pipeline

5.5 Superscalarity

  • Each step of the pipeline can also be executed in parallel
  • There are specific units for floating-point, integer, load, and store operations
  • Some pipeline units are present multiple times
  • Execution times differ: integer add (1 cycle); double add (4 cycles), multiply (7), divide (23), sqrt (>23)
  • Not every instruction takes the same time (from a few to thousands of clock ticks)

Superscalar pipeline

5.6 Vector Instructions

  • To process even more data at a time, each instruction can operate on several scalar values
  • E.g., AVX-512 instructions can take 8 double-precision scalars; the mathematical operation is applied to all 8 numbers simultaneously
  • Advantage: performance increased up to 8 times
  • Disadvantages:
    • Large memory transfers - large bandwidth required
    • Not all processors support all types of vector operations
    • The code (e.g., a loop) must be able to process data independently (see the sketch below)

Vector instructions
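As a minimal sketch (my illustration, assuming a C compiler with OpenMP SIMD support), a loop whose iterations are independent can be vectorized; the function and array names are made up for the example:

    #include <stddef.h>

    // Element-wise addition: the iterations are independent, so the
    // compiler can map groups of 8 doubles onto one AVX-512 instruction.
    void add_arrays(size_t n, const double *a, const double *b, double *c)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }

Compilers can often auto-vectorize such loops at higher optimization levels (e.g., -O3) even without the pragma, as long as they can prove that the iterations do not depend on each other.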

5.7 Threads — Hyperthreading

  • To efficiently utilize all execution units, each core can run two or more independent threads
  • The threads are independent programs, but they utilize the same execution units
  • If the execution units were not utilized, they would only wait and consume energy

Hyperthreading

5.8 Cores

  • Implementing more independent cores means more programs can run in parallel
  • However, if several programs work on the same problem, they have to communicate
  • Communication between cores is relatively slow (~100 clock ticks)
  • Also, the programs must be prepared (by the programmer) to run in multiple instances and communicate with each other
  • If they are not prepared, a parallel run on multiple cores is basically not possible

Node Topology

5.9 GPUs

5.10 GPU - Graphical Processing Unit

  • Originally developed for computer games
  • Advantage: a relatively narrow-purpose unit designed to run as many simultaneous calculations as possible.
  • Disadvantage: the programs must be prepared to take many execution units into account; if not, the program runs very inefficiently (and a lot of power and money is burned)
  • Typical clock speed of ~2 GHz
  • Each multiprocessor contains many parallel execution units called stream processors
  • Typical performance: 16,000 stream processors × 2 GHz = 32 TFlops (a 4-core CPU has 200-1000 GFlops)

Scheme of a GPU

5.11 GPU acceleration — offloading

  • Start the program on the CPU
  • Send the data and calculation to the GPU
  • Obtain the results from the GPU
  • Store the output / finish the simulation on the CPU

GPU offloading
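A minimal sketch of these four steps in CUDA (my illustration; the array size and kernel are made up for the example):

    #include <stdio.h>
    #include <stdlib.h>

    // Kernel: each GPU thread doubles one array element
    __global__ void double_elements(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // 1. Start on the CPU: allocate and fill host data
        float *h_x = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h_x[i] = 1.0f;

        // 2. Send the data and calculation to the GPU
        float *d_x;
        cudaMalloc(&d_x, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        double_elements<<<(n + 255) / 256, 256>>>(d_x, n);

        // 3. Obtain the results from the GPU
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

        // 4. Store/finish the simulation on the CPU
        printf("h_x[0] = %f\n", h_x[0]);
        cudaFree(d_x);
        free(h_x);
        return 0;
    }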

5.12 Memory management

5.13 Memory management

  • There are differences between CPU and GPU memory management
  • Scheme of how data flow from RAM through the L3, L2, and L1 caches to the registers and execution units
  • Data transfer times vary greatly \(\Rightarrow\) the best approach is to keep the data as close to the core as possible
  • Avoid dependent loop iterations; if the program has to wait for a previous result, it cannot proceed.
  • Memory-bound vs. compute-bound tasks
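A minimal C sketch (my illustration, with a made-up matrix size) of why keeping data close to the core matters: both functions compute the same sum, but the first walks memory contiguously (cache-friendly), while the second jumps by a whole row on every access and is typically memory-bound.

    #define N 4096
    static double m[N][N];   // C stores this matrix row by row (row-major)

    // Cache-friendly: consecutive iterations touch consecutive addresses
    double sum_by_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    // Cache-unfriendly: each access jumps N * sizeof(double) bytes ahead
    double sum_by_columns(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }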

5.14 AMD Ryzen Die Scheme

Die scheme

5.15 Test the performance

  • We want to utilize the CPU as much as possible during our calculations
  • Compare the following cases in a Jupyter notebook
    import numpy as np

    N = 1_000_000
    a = np.random.random(N)
    b = np.random.random(N)
    c = np.zeros_like(a)

Explicit Python loop:

    %%timeit
    for i in range(N):
        c[i] = a[i] + b[i]

    231 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Vectorized NumPy expression:

    %%timeit
    c = a + b

    1.1 ms ± 83.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
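The vectorized version is roughly two hundred times faster here because the loop runs inside NumPy's compiled code, where it can use vector instructions and avoid the overhead of interpreting each iteration in Python.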

6 Parallelization techniques

6.1 OpenMP — Open Multi-Processing

  • The easiest way to run your program in parallel

  • Works only in some compiled languages — C/C++ and Fortran

  • Maximum parallelization is limited to a single shared-memory node

  • Example:

    int a[100000];
    
    for (int i = 0; i < 100000; i++) {
        a[i] = 2 * i;
    }
  • Is transformed to

    int a[100000];
    
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        a[i] = 2 * i;
    }
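For reference (standard OpenMP usage, not specific to this course): the code is compiled with an extra compiler flag, e.g., gcc -fopenmp program.c, and the number of threads can be controlled through the OMP_NUM_THREADS environment variable.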

6.2 MPI — Message Passing Interface

  • Communication protocol between processes
  • A standard for supercomputer processing
  • Enables utilization of all nodes of the cluster
  • The communication must be implemented in the simulation code
  • Example

Initialization and finalization

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
        // Initialize the MPI environment
        MPI_Init(NULL, NULL);

        // Get the number of processes
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Print off a hello world message
        printf("Hello world from processor rank %d out of %d processors\n",
               world_rank, world_size);

        // Finalize the MPI environment.
        MPI_Finalize();
        return 0;
    }

Communication

    // Placed inside main(), between MPI_Comm_rank() and MPI_Finalize() above
    int number;
    if (world_rank == 0) {
        number = -1;
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (world_rank == 1) {
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n",
               number);
    }
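For reference, the arguments of MPI_Send are the data buffer, the element count, the datatype, the destination rank, a message tag, and the communicator; MPI_Recv mirrors them (with the source rank instead of the destination) and adds a status argument. Note that this example assumes the job is run with at least two processes.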

Compilation and run

    mpicc -o hello_world hello_world.c
    mpirun -n 4 ./hello_world

6.3 CUDA/HIP

  • Used for offloading calculations to the GPUs
  • There are several ways to do it
  • OpenACC or OpenMP offloading
    • Similar to OpenMP but with more directives
    • Performance is not as good
    • Not broadly supported
  • CUDA/HIP
    • Your code must be rewritten in a different programming language
    • Excellent performance can be reached
    • Runs on NVIDIA (CUDA) or AMD (HIP) GPUs
    • Uses so-called kernels - pieces of code that run in parallel on the GPU

6.4 CUDA Example

C

    #include <stdio.h>

    void c_hello(){
        printf("Hello World!\n");
    }

    int main() {
        c_hello();
        return 0;
    }

CUDA

    #include <stdio.h>

    __global__ void cuda_hello(){
        printf("Hello World from GPU!\n");
    }

    int main() {
        // Launch the kernel with 1 block of 1 thread
        cuda_hello<<<1,1>>>();
        // Wait for the GPU to finish so the message is actually printed
        cudaDeviceSynchronize();
        return 0;
    }
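For reference, such a file is compiled with the NVIDIA compiler, e.g., nvcc hello_world.cu -o hello_world; the <<<1,1>>> launch configuration requests one block of one thread, while real workloads use much larger grids.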

6.5 Others

  • Other accelerator types exist
  • The most widely used are FPGAs — Field-Programmable Gate Arrays
  • A programmable chip that is configured once at the beginning and then runs only that specific calculation
  • Extremely fast calculations
  • Difficult to program