Doing More with Slurm
Advanced Capabilities
Nick Ihli, Director - Cloud and Sales Engineering - SchedMD
nick@schedmd.com
Most people know Slurm…
● Policy-driven, open source, fault-tolerant, and
highly scalable workload management and job
scheduling system
● Three Key Functions
○ Allocates exclusive and/or non-exclusive access to
resources to users for some duration of time for a workload
○ Provides a framework for starting, executing, and
monitoring work on the set of allocated nodes
○ Arbitrates contention for resources by managing a queue of
pending work
Slurm on Top500
● Slurm runs on 5 of the top 10 systems
● More than 50% of the Top100

Rank  System (Site)                                                               Cores       Rpeak (TFlop/s)
1     Supercomputer Fugaku (RIKEN Center for Computational Science - Japan)       7,630,848   537,212.0
2     Summit - IBM (DOE/SC/Oak Ridge National Laboratory - United States)         2,414,592   200,794.9
3     Sierra - IBM / NVIDIA / Mellanox (DOE/NNSA/LLNL - United States)            1,572,480   125,712.0
4     Sunway TaihuLight - NRCPC (National Supercomputing Center in Wuxi - China)  10,649,600  125,435.9
5     Perlmutter - HPE (DOE/SC/LBNL/NERSC - United States)                        761,856     93,750.0
6     Selene - Nvidia (NVIDIA Corporation - United States)                        555,520     79,215.0
7     Tianhe-2A - NUDT (National Super Computer Center in Guangzhou - China)      4,981,760   100,678.7
8     JUWELS Booster Module - Atos (Forschungszentrum Juelich (FZJ) - Germany)    449,280     70,980.0
9     HPC5 - Dell EMC (Eni S.p.A. - Italy)                                        669,760     51,720.8
10    Voyager-EUS2 - Microsoft Azure                                              253,440     39,531.2
But what is SchedMD?
● Maintainers and Supporters of Slurm
● Only organization providing level-3 support
● Training
● Consultation
● Custom Development
Industry Trends
● Industries: Manufacturing & EDA, Healthcare & Life Sciences, Financial Services & Insurance,
Energy, Government, Academic
● Trends:
○ GPUs - AI Workloads
○ Hybrid Cloud
○ AI Tooling Integration
GPU Scheduling for
AI Workloads
Fine-Grained GPU Control
Same options apply to salloc, sbatch and srun commands
● --cpus-per-gpu= CPUs required per allocated GPU
● -G/--gpus= GPU count across entire job allocation
● --gpu-bind= Task/GPU binding option
● --gpu-freq= Specify GPU and memory frequency
● --gpus-per-node= Works like the existing "--gres=gpu:#" option
● --gpus-per-socket= GPUs per allocated socket
● --gpus-per-task= GPUs per spawned task
● --mem-per-gpu= Memory per allocated GPU
Examples of Use
$ sbatch --ntasks=16 --gpus-per-task=2 my.bash
$ sbatch --ntasks=8 --ntasks-per-socket=2 --gpus-per-socket=k80:1 my.bash
$ sbatch --gpus=16 --gpu-bind=closest --nodes=2 my.bash
$ sbatch --gpus=k80:8,a100:2 --nodes=1 my.bash
Configuring GPUs
● GPUs fall under the Generic Resource (GRES) plugin
○ Node-specific resources
● Requires definition in slurm.conf and gres.conf on node
● GRES can be associated with specific device files (e.g. specific GPUs)
● GPUs can be autodetected with NVML or RSMI libraries
● Sets CUDA_VISIBLE_DEVICES environment variable for the job
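As an illustration, a minimal configuration sketch; the node names, GPU type, and counts here are assumptions:

# slurm.conf
GresTypes=gpu
NodeName=node[01-04] Gres=gpu:a100:4 CPUs=64 RealMemory=512000

# gres.conf on each node - either autodetect via NVML...
AutoDetect=nvml
# ...or list the device files explicitly
# Name=gpu Type=a100 File=/dev/nvidia[0-3]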
Restricting Devices with Cgroups
● Uses the devices subsystem
○ devices.allow and devices.deny control access to devices
○ All devices in gres.conf that the job does not request are added to
devices.deny so the job can’t use them
● Must be a Unix device file. Cgroups restrict devices based on major/minor
number, not file path
● GPUs are the most common use case, but any Unix device file can be
restricted with cgroups
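A sketch of the configuration that turns this on:

# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainDevices=yes    # deny access to gres.conf devices the job did not request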
NVIDIA MIG Support
● Configured like regular GPUs in Slurm
● Fully supported by task/cgroup and --gpu-bind
● AutoDetect support
● MIG devices are exposed to jobs through CUDA_VISIBLE_DEVICES
● MIGs must be manually partitioned outside of Slurm beforehand via
nvidia-smi
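As an illustration, that partitioning might look like the following; the GPU index and profile IDs are placeholders for an A100:

$ nvidia-smi -i 0 -mig 1          # enable MIG mode on GPU 0 (may require a GPU reset)
$ nvidia-smi mig -cgi 9,9 -C      # create two 3g.20gb GPU instances plus their compute instances
$ nvidia-smi -L                   # list the resulting MIG devices, which Slurm can then autodetect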
Hybrid Cloud Autoscaling
Hybrid Cloud
Cloud Enablement
● Power Saving module
○ Requires 3 parameters to enable
■ ResumeProgram
■ SuspendProgram
■ SuspendTime (either global or per partition)
○ Other important parameters
■ ResumeTimeout
■ SuspendTimeout
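A minimal slurm.conf sketch; the script paths and timing values here are assumptions, not recommendations:

# slurm.conf
ResumeProgram=/opt/slurm/bin/resume.sh     # called with the node list to power up / create in the cloud
SuspendProgram=/opt/slurm/bin/suspend.sh   # called with the node list to power down / delete
SuspendTime=600                            # seconds a node must sit idle before suspend (global or per partition)
ResumeTimeout=900                          # max seconds for a resumed node to boot and register
SuspendTimeout=120                         # max seconds for a node to finish powering down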
Power State Transition - Resume
● Job state: CONFIGURING > RUNNING > COMPLETING
● Node state: IDLE + POWERED_DOWN (~) > ALLOCATED/MIXED + POWERING_UP (#) > ALLOCATED/MIXED
Power State Transition - Suspend
● Node state: IDLE > (idle for SuspendTime) IDLE + POWERING_DOWN (%) > (after SuspendTimeout)
IDLE + POWERED_DOWN (~)
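These transitions show up in sinfo as state suffixes; a hypothetical cloud partition (partition and node names made up) might look like:

$ sinfo -p cloud
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cloud        up   infinite      4  idle~ cloud[05-08]
cloud        up   infinite      1  idle% cloud04
cloud        up   infinite      1  mix#  cloud03
cloud        up   infinite      2  alloc cloud[01-02]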
What about the Data?
● Most common question: how do we get data from on-prem to the cloud?
● Previous best option: a mini-workflow with job dependencies (sketched below)
Stage-in job > Application job > Stage-out job
● Benefit: easy to increase the number of nodes involved in moving the data
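A minimal sketch of that dependency chain with sbatch; the script names are placeholders:

$ stage_in=$(sbatch --parsable stage_in.sh)
$ app=$(sbatch --parsable --dependency=afterok:$stage_in app.sh)
$ sbatch --dependency=afterok:$app stage_out.sh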
New Option: Lua Burst Buffer plugin
● Originally developed for Cray Datawarp
○ Intermediate storage - in between slow long-term storage and the fast memory
on compute nodes
● Asynchronously calls an external script so it does not interfere with the scheduler
● This functionality has been generalized, so you no longer need Cray DataWarp, Cray's API, or
actual "burst buffer" hardware
● Good for data movement or provisioning cloud nodes
○ Anything you want to do while the job is pending (or at other job states)
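The plugin is enabled in slurm.conf; a minimal sketch:

# slurm.conf
BurstBufferType=burst_buffer/lua
# slurmctld then loads burst_buffer.lua (kept in the same directory as slurm.conf)
# and calls its stage functions asynchronously as jobs move through their states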
Asynchronous “stages”
● Stage in - called before the job is scheduled, job state == pending
○ Best time for Cloud data staging
● Pre run - called after the job is scheduled, job state == running + configuring
○ Job not actually running yet
● Stage out - called after the job completes, job state == stage out
○ Job cannot be purged until this is done
● Teardown - called after stage out, job state == complete
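For a concrete picture, here is a minimal burst_buffer.lua sketch of those stages. The callback names follow the example script distributed with Slurm; the exact signatures vary by release, and the rclone command is purely illustrative.

-- burst_buffer.lua (sketch; lives alongside slurm.conf)

-- Stage in: called while the job is still pending - a good place to pull
-- data from on-prem storage into the cloud, or to provision resources
function slurm_bb_data_in(job_id, job_script)
    slurm.log_info("stage-in for job " .. job_id)
    -- e.g. os.execute("rclone copy onprem:/project/data /scratch/" .. job_id)
    return slurm.SUCCESS
end

-- Pre run: job is scheduled (RUNNING + CONFIGURING) but not yet started
function slurm_bb_pre_run(job_id, job_script)
    return slurm.SUCCESS
end

-- Stage out: job has completed; the job is not purged until this returns
function slurm_bb_data_out(job_id, job_script)
    return slurm.SUCCESS
end

-- Teardown: called after stage out completes
function slurm_bb_job_teardown(job_id, job_script, hurry)
    return slurm.SUCCESS
end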
AI Tooling Integration:
Enter the REST API
New Integration Requirements
What is the Slurm REST API?
● An HTTP server: a client sends a request (GET, POST, PUT, DELETE) with a JSON/YAML body, and
the server sends a response
● It is not srun, sbatch, or salloc - clients talk to Slurm over HTTP instead of the command line
slurmrestd
● A tool that runs inside the Slurm perimeter and translates JSON/YAML requests from REST clients
into Slurm RPC requests to slurmctld and slurmdbd
Slurm REST API Architecture (rest_auth/jwt)
● (Diagram) Clients outside the Munge perimeter authenticate to slurmrestd with JWT (the
AuthAltTypes perimeter); slurmrestd relays requests to slurmctld and slurmdbd inside the Munge
perimeter, and slurmctld manages the slurmd daemons on the cluster network
Slurm REST API Architecture (rest_auth/jwt + Proxy)
● (Diagram) An additional site authentication perimeter sits in front of the JWT perimeter: an
authenticating HTTP proxy (TLS wrapped) handles site authentication and forwards requests from
authenticated clients to slurmrestd
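For illustration, a JWT can be issued with scontrol and passed in the request headers; the host, port, and API version here are assumptions for a local test setup:

$ unset SLURM_JWT; export $(scontrol token lifespan=3600)
$ curl -s -H "X-SLURM-USER-NAME: $USER" -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
       http://localhost:6820/slurm/v0.0.37/diag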
JSON/YAML output
● Slurmrestd uses content (a.k.a. openapi) plugins. These plugins have been made
global to allow other parts of Slurm to be able to dump JSON/YAML output.
● New output formatting (limited to these binaries only):
○ sacct --json or sacct --yaml
○ sinfo --json or sinfo --yaml
○ squeue --json or squeue --yaml
● Output always uses the format of the latest slurmrestd version.
○ Formatting arguments are ignored for JSON or YAML output as it is expected
that clients can easily pick and choose what they want.
$ sinfo --json
{
  "meta": {
    "plugin": {
      "type": "openapi\/v0.0.37",
      "name": "Slurm OpenAPI v0.0.37"
    },
    "Slurm": {
      "version": {
        "major": 22,
        "micro": 0,
        "minor": 5
      },
      "release": "21.08.6"
    }
  },
  "errors": [
  ],
  "nodes": [
    {
      "architecture": "x86_64",
      "burstbuffer_network_address": "",
      "boards": 1,
      "boot_time": 1646380817,
      "comment": "",
      "cores": 6,
      "cpu_binding": 0,
      "cpu_load": 64,
      "extra": "",
      "free_memory": 3208,
      "cpus": 12,
      "last_busy": 1646430364,
      "features": "",
      "active_features": "",
      "gres": "",
      "gres_drained": "N\/A",
      "gres_used": "scratch:0",
      "mcs_label": "",
      "name": "node00",
      "next_state_after_reboot": "invalid",
      "address": "node00",
      "hostname": "node00",
      "state": "idle",
      "state_flags": [
      ],
      "next_state_after_reboot_flags": [
      ],
      "operating_system": "Linux 5.4.0-100-generic #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022",
      "owner": null,
      "partitions": [
        "debug"
      ],
      "port": 6818,
      "real_memory": 31856,
      "reason": "",
      "reason_changed_at": 0,
      "reason_set_by_user": null,
      "slurmd_start_time": 1646430151,
      "sockets": 1,
      "threads": 2,
      "temporary_disk": 0,
      "weight": 1,
      "tres": "cpu=12,mem=31856M,billing=12",
      "slurmd_version": "22.05.0-0pre1",
      "alloc_memory": 0,
      "alloc_cpus": 0,
      "idle_cpus": 12,
      "tres_used": null,
      "tres_weighted": 0.0
    }
  ]
}
A Migration Journey
Large Energy Company
● Using their scheduler for many years
○ Can’t just flip a switch and go to production
● Massive scale - multiple international sites, many nodes, and diverse workloads
● Many integrations required
3-4 Months to Production
Three Migration Steps
● Admin/User education
○ Training - Help admins identify the commonalities and learn the Slurm way
○ Wrappers - a bridge to migration, not a crutch
■ LSF, Grid Engine - command and submission
■ PBS - command, submission, environment variables, #PBS scripts
● Policy replication
○ Reevaluate policies
■ Are we continuing to produce technical debt by "doing things how we've always done them"?
○ Optimizing for scale and throughput - 1 million jobs/day
■ Some financial sites run up to 15 million jobs/day
● Tooling integration
○ The most time-consuming part of the journey
Questions?
Thank You
schedmd.com slurm.schedmd.com nick@schedmd.com