Threading and Tasks on the Cluster

For background: “hyperthreading”, the ability of a single physical core to run more than one thread at a time, is turned off on the cluster by default because it does not work well with most HPC codes.

So the “maximum” number of threads a node in the standard partition can run is 36, since there are 36 cores in each standard node and 1 thread per core.
“Maximum” is a bit of a misnomer, though, because you can tell OpenMP to use as many threads as you like.
OpenMP will spin up that many threads and behave as if each one has its own core, but if you ask for more threads than you have cores, the cores you do have are doing a bunch of context-switching to get through the work.
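
One way to see whether a run is oversubscribed is to compare the number of threads OpenMP plans to use with the number of processors it can see. The sketch below is only illustrative: ThreadBudget and current_budget are made-up names, and it assumes omp_get_num_procs() reflects the CPUs Slurm assigned to the job, which depends on how CPU binding is configured on the cluster.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds without -fopenmp.
static int omp_get_max_threads() { return 1; }
static int omp_get_num_procs() { return 1; }
#endif

// Hypothetical helper type: threads OpenMP would start vs. processors it sees.
struct ThreadBudget
{
  int requested;  // e.g. the value of OMP_NUM_THREADS
  int available;  // processors visible to this process

  // More threads than processors means the threads must take turns.
  bool oversubscribed() const { return requested > available; }
};

// Ask the OpenMP runtime for both numbers.
ThreadBudget current_budget()
{
  return ThreadBudget{omp_get_max_threads(), omp_get_num_procs()};
}
```

Under that assumption, a one-CPU allocation with OMP_NUM_THREADS=20 would be expected to report 20 requested threads against 1 available processor, so oversubscribed() would return true.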

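You can test this by writing a small C++ program like this:

// main.c (C++ source despite the .c extension; g++ compiles it as C++)
#include <iostream>
#include <omp.h>

int main()
{
  // Every thread in the team runs this block; nothing synchronizes the output.
  #pragma omp parallel
  {
    // omp_get_num_threads() is the size of the team running this region.
    std::cout << "Number of available threads: " << omp_get_num_threads() << std::endl;
    // omp_get_thread_num() is this thread's id, from 0 to team size - 1.
    std::cout << "Current thread number: " << omp_get_thread_num() << std::endl;
    std::cout << "Hello World!" << std::endl;
  }
  return 0;
}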

Then request an interactive job with one task and one CPU assigned to that task:
salloc --account=<your_account_here> --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem-per-cpu=1g --time=8:00:00

Let’s compile the code (g++ treats the .c file as C++ source):
g++ main.c -fopenmp -o main
set this environment variable:
export OMP_NUM_THREADS=20
and finally run the code:
./main

You’ll see 20 “threads” being used. Notice that the code does nothing to guard against race conditions on the output, yet each thread’s lines come out together: with only one core doing all the work, the threads run one after another rather than at the same time.
If you exit and request a new interactive job, this time with 20 CPUs per task:
salloc --account=<your_account_here> --nodes=1 --ntasks-per-node=1 --cpus-per-task=20 --mem-per-cpu=1g --time=8:00:00
and set:
export OMP_NUM_THREADS=20
and run ./main again, you’ll see the race conditions appear: the threads now print at the same time, so their output gets interleaved.
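
If you want each thread’s message to stay intact even when the threads really do run in parallel, one option is to build the message per thread and write it inside an OpenMP critical section. This is a minimal sketch, not a change to the test program above; report_threads is a hypothetical name.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds without -fopenmp.
static int omp_get_num_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Each thread builds its whole message first, then writes it inside a
// critical section, so only one thread touches the stream at a time.
void report_threads(std::ostream& out)
{
  #pragma omp parallel
  {
    std::ostringstream msg;  // private to each thread
    msg << "Thread " << omp_get_thread_num()
        << " of " << omp_get_num_threads() << ": Hello World!\n";

    #pragma omp critical
    out << msg.str();
  }
}
```

Calling report_threads(std::cout) from main and compiling with -fopenmp should give one unbroken line per thread. The order of the lines can still change from run to run, since the critical section only stops threads from writing at the same time.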

So how do we get multiple threads working on a task?
Set the following in your batch script:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=2

With this setup you’ll get one MPI task, and that task will get two threads since we set OMP_NUM_THREADS=2, with a single thread on each of the two CPUs we assigned to the task.
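
To avoid keeping the two numbers in sync by hand, a program can also read the CPU count from Slurm and hand it to OpenMP. The sketch below assumes Slurm has set SLURM_CPUS_PER_TASK in the job environment (it does so when --cpus-per-task is given). cpus_per_task and match_threads_to_allocation are hypothetical helper names, and the fallback of 1 thread is an arbitrary choice.

```cpp
#include <climits>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Parse a CPU count such as the value of SLURM_CPUS_PER_TASK.
// Returns `fallback` when the value is missing or not a positive integer.
int cpus_per_task(const char* value, int fallback = 1)
{
  if (value == nullptr) return fallback;
  char* end = nullptr;
  long n = std::strtol(value, &end, 10);
  if (end == value || *end != '\0') return fallback;  // not a clean number
  if (n < 1 || n > INT_MAX) return fallback;          // out of range
  return static_cast<int>(n);
}

// Use one OpenMP thread per CPU that Slurm assigned to this task.
int match_threads_to_allocation()
{
  int n = cpus_per_task(std::getenv("SLURM_CPUS_PER_TASK"));
#ifdef _OPENMP
  omp_set_num_threads(n);  // takes effect for later parallel regions
#endif
  return n;
}
```

Calling match_threads_to_allocation() at the top of main, before the first parallel region, would take the place of the export OMP_NUM_THREADS line for that program.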

OMP_NUM_THREADS applies at the task level, so if you have MPI running multiple tasks, you can assign multiple CPUs to each task and tell OpenMP to use that many threads within each task.
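
The arithmetic for such a hybrid MPI and OpenMP job can be sketched as follows. JobLayout and the functions are hypothetical names for illustration; the sketch assumes OMP_NUM_THREADS is set equal to --cpus-per-task, and the default of 36 cores is the standard-node figure mentioned earlier.

```cpp
// Hypothetical description of a job request; the fields mirror the
// --nodes, --ntasks-per-node and --cpus-per-task options.
struct JobLayout
{
  int nodes;
  int tasks_per_node;
  int cpus_per_task;  // also the OMP_NUM_THREADS value, by assumption
};

// Threads running on each node: one per CPU of every task on that node.
int threads_per_node(const JobLayout& job)
{
  return job.tasks_per_node * job.cpus_per_task;
}

// Threads across the whole job.
int total_threads(const JobLayout& job)
{
  return job.nodes * threads_per_node(job);
}

// True when each node's threads fit on its cores (36 on a standard node).
bool fits_on_node(const JobLayout& job, int cores_per_node = 36)
{
  return threads_per_node(job) <= cores_per_node;
}
```

For example, 2 nodes with 4 tasks per node and 9 CPUs per task gives 36 threads per node and 72 in total, which exactly fills two standard nodes. 4 tasks per node with 10 CPUs each would need 40 cores per node and would not fit.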