
Giang Nguyen
12 Jan 2026
In Part 1, we learned the fundamentals by building a single-node Slurm cluster. Now it's time to scale up to a production-ready, multi-node cluster with automated deployment, monitoring, and alerting.
In this post, we'll use Ansible to automate the entire deployment process, making it reproducible and maintainable.
Moving from a single-node setup to a multi-node production cluster involves:
Doing this manually is error-prone and time-consuming. Infrastructure automation tools like Ansible, Puppet, or Terraform solve this problem. We chose Ansible because:
For a quick introduction to Ansible, watch this video: https://www.youtube.com/watch?v=xRMPKQweySE
Figure 1: The standard architecture of a multi-node Slurm cluster
Figure 1 shows how multiple nodes are arranged in the Slurm cluster architecture alone. In real production, however, bioinformatics services require some additional components, including shared storage and monitoring. These are explained in more detail below.
Figure 2: The standard shared storage system across multiple nodes
Why do we need a shared storage system? The simple idea is that a job running on a worker needs to access the data. Later, when we stop working or request a node with more resources, we should still be able to access that data. In more advanced tasks that run on multiple compute nodes, the raw data and intermediate files must be stored in a shared location so that they can be analyzed across multiple machines.
Normally, the standard setup looks like Figure 2, with:
HOME directory, then your user home folder will be available on the target nodes.
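To check that a shared file system is actually mounted on a node, a small sketch like the following may help. It assumes the shared storage is NFS, which the text above does not state, and `shared_mounts` is a hypothetical helper name.

```shell
# Print the mount points of NFS file systems.
# Input is expected in /proc/mounts format: device mountpoint fstype options ...
# Assumption: the shared storage is NFS (type "nfs" or "nfs4").
shared_mounts() {
  awk '$3 == "nfs" || $3 == "nfs4" { print $2 }'
}

# Usage on a Linux node:
#   shared_mounts < /proc/mounts
```

If the home directory is shared as described, its mount point should appear in the output on every node.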
Figure 3: The monitoring system designed for multiple nodes to collect the system and Slurm metrics
As the administrator of the HPC, you should not have to log in to the cluster just to check its status. As Figure 3 shows, the administrator will normally:
executor=slurm: it helps to submit and monitor tasks automatically. In contrast, submitting a single job with high computing resources and running the pipeline with executor=local wastes resources when a few long-running steps do not need many resources.
How the metrics can be collected:
prometheus-node-exporter runs on each node, where it collects the system metrics and exposes them as a web API at http://<node name>:9100/metrics (9100 is the exporter's default port; port 3000 is used by Grafana). The head node can then send requests to collect these metrics.
prometheus-slurm-exporter runs on a node in the Slurm cluster, usually the controller node.
prometheus collects these metrics and serves as the data source for Grafana, with configured dashboards for interactive analysis.
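To make the collection step concrete, here is a minimal sketch of reading one value from an exporter's plain-text output. `metric_value` is a hypothetical helper, `node_load1` is one of the standard node exporter metrics, and `EXPORTER_URL` is a placeholder you would set yourself.

```shell
# Print the value of an unlabelled metric from Prometheus text-format input.
# Lines look like "node_load1 0.42"; comment lines start with "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Usage (EXPORTER_URL is a placeholder, e.g. the /metrics endpoint of a node):
#   curl -s "$EXPORTER_URL" | metric_value node_load1
```

In practice prometheus scrapes these endpoints itself; the sketch only shows what the scraped data looks like.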
The HPC does not always work properly. If a worker stops working, the administrator should be notified as soon as possible instead of waiting for users to report the incident. Therefore, prometheus can be integrated with alertmanager. Put simply, we can configure a rule: if the system metrics cannot be collected from a node because it does not respond, the node is marked as down and an alert message is sent to the chat application. This can be Slack or Zalo (Vietnam region), so that the issue can be fixed quickly.
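The "node is down" rule can be illustrated with the `up` metric that Prometheus records for every scrape target (1 when the scrape succeeded, 0 when it failed). The sketch below is only an illustration of the logic, not the alerting rule shipped with the playbooks; `down_instances` is a hypothetical helper, and it assumes label values contain no spaces.

```shell
# Read Prometheus text-format samples and print the "instance" label of
# every target whose "up" value is 0 (the last scrape failed).
down_instances() {
  awk 'index($1, "up{") == 1 && $2 == 0 { print $1 }' |
    sed -n 's/.*instance="\([^"]*\)".*/\1/p'
}

# Usage against the federation endpoint of a Prometheus server
# (assumes Prometheus listens on its default port 9090 on this host):
#   curl -s -G --data-urlencode 'match[]=up' http://localhost:9090/federate | down_instances
```

In a real deployment the same condition would normally be written as a Prometheus alerting rule and routed through alertmanager rather than polled from a script.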
It can be done step by step, via:
Prometheus: when it sees that a health check has failed, it calls the API to alert the administrators and users.
inventories, which store the information on how to connect to your machines and set up the nodes, and decide which nodes are configured as the controller, login, or worker nodes with their respective services installed.
inventories file and install/configure the software on the remote machines automatically, instead of manually logging in to each node and installing software.
git clone https://github.com/gianglabs/omicslab-hpc -b 1.0.0
cd omicslab-hpc
# pixi is similar to conda
# install pixi
make ${HOME}/.pixi/bin/pixi
# activate environment
pixi shell
# install dependencies and start example instances with the chosen cluster OS version (multiple versions are supported)
# bash scripts/setup.sh 22.04
# bash scripts/setup.sh 20.04
bash scripts/setup.sh 24.04
# or
make vm-start
This script installs:
For alertmanager, Slack is used by default. Contact us at contact@omicslab.io if you want a customized solution for another platform (Zalo, Discord, etc.).
Create a Slack App

Enable Incoming Webhooks
#cluster-alerts)


Manually Test The Webhook
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"Hello from Slurm cluster!"}' \
https://hooks.slack.com/services/YOUR/WEBHOOK/URL
You should see the message appear in your Slack channel!
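If you want to reuse the webhook test above in your own scripts, a small wrapper might look like the sketch below. `slack_payload`, `notify_slack`, and the `SLACK_WEBHOOK_URL` variable are hypothetical names introduced here; the escaping covers quotes and backslashes only and assumes a single-line message.

```shell
# Build a JSON payload for a Slack incoming webhook.
# Escapes backslashes and double quotes; assumes the message has no newlines.
slack_payload() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# Send a message to the webhook stored in SLACK_WEBHOOK_URL (a placeholder).
notify_slack() {
  curl --fail --silent --show-error -X POST \
    -H 'Content-type: application/json' \
    --data "$(slack_payload "$1")" \
    "$SLACK_WEBHOOK_URL"
}

# Usage:
#   SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
#   notify_slack 'Node "worker-01" is down'
```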
Create inventories/hosts (or copy from inventories/hosts.example):
[slurm_master]
controller-01 ansible_host=192.168.58.10
[slurm_worker]
worker-01 ansible_host=192.168.58.11
worker-02 ansible_host=192.168.58.12
[slurm:children]
slurm_master
slurm_worker
[all:vars]
ansible_user=your_username
slurm_password=secure_munge_password
slurm_account_db_pass=secure_db_password
slack_api_url=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
slack_channel=#cluster-alerts
admin_user=admin
admin_password=secure_grafana_password
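Before deploying, it can be useful to confirm which hosts ended up in which group. Ansible can show this itself with `ansible-inventory -i inventories/hosts --graph`; the sketch below does a rough equivalent with plain `awk` so the file layout is easier to see. `hosts_in_group` is a hypothetical helper and only makes sense for host groups, not for `:children` or `:vars` sections.

```shell
# Print the host names listed under one group of an INI inventory.
# $1: inventory file, $2: group name (without brackets)
hosts_in_group() {
  awk -v group="[$2]" '
    /^\[/ { in_group = ($1 == group); next }      # a new section header starts
    in_group && NF && $1 !~ /^[#;]/ { print $1 }  # first field is the host name
  ' "$1"
}

# Usage:
#   hosts_in_group inventories/hosts slurm_worker
```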
Use Ansible Vault to encrypt sensitive variables like passwords and API keys.
# create an encrypted vault: enter your vault password, add the content, then save and exit with the vim command :wq
ansible-vault create inventories/hosts.prod
# edit an existing file: enter your vault password to edit it in place
ansible-vault edit inventories/hosts.prod
default_password=temporary_user_password # Forces change on first login
users=alice,bob,charlie # Comma-separated list
Now for the magic moment - deploy your entire cluster with one command!
# If you have passwordless sudo configured
ansible-playbook -i inventories/hosts cluster_slurm.yml
# If you need to enter the sudo password and the inventory is encrypted with Ansible Vault
ansible-playbook -i inventories/hosts cluster_slurm.yml --ask-become-pass --ask-vault-pass
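Before the real run, you may want a few low-risk checks: connectivity, playbook syntax, and a dry run. The sketch below wraps them in a hypothetical `preflight` function; the `run` helper and the `DRY_RUN`, `INVENTORY`, and `PLAYBOOK` variables are assumptions made for this example, while `-m ping`, `--syntax-check`, `--check`, and `--diff` are standard Ansible options.

```shell
INVENTORY="${INVENTORY:-inventories/hosts}"
PLAYBOOK="${PLAYBOOK:-cluster_slurm.yml}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Stop at the first failing check.
preflight() {
  run ansible all -i "$INVENTORY" -m ping &&
    run ansible-playbook -i "$INVENTORY" "$PLAYBOOK" --syntax-check &&
    run ansible-playbook -i "$INVENTORY" "$PLAYBOOK" --check --diff
}

# Usage:
#   DRY_RUN=1 preflight   # print the commands only
#   preflight             # actually run the checks
```

Note that check mode can report failures on a fresh cluster when a task depends on something an earlier task would have installed, so treat its result as a hint rather than a verdict.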
What this playbook does:
ansible-playbook -i inventories/hosts cluster_account.yml
This creates Linux users on all nodes with:
For production, consider integrating with LDAP or Active Directory. However, NIS and LDAP setup can be complex on Ubuntu. Our Ansible approach provides a simpler alternative that works well for small to medium clusters.
SSH into the controller node and run:
# Check cluster status
sinfo
# Expected output:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# compute* up infinite 2 idle worker-01,worker-02
# View job queue
squeue
# Submit a test job
srun --nodes=1 --ntasks=1 hostname
# Check accounting
sacct
# View cluster configuration
scontrol show config | head -20
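The `sinfo` check above can also be scripted. The following sketch lists nodes that are not in a normal working state; `unhealthy_nodes` is a hypothetical helper, and the set of states treated as healthy is an assumption you may want to adjust.

```shell
# Read "node state" pairs and print the nodes that need attention.
# States with a suffix such as "idle*" (node not responding) are also
# reported, because they do not match the plain healthy state names.
unhealthy_nodes() {
  awk '$2 != "idle" && $2 != "mixed" && $2 != "allocated" && $2 != "completing" { print $1, $2 }' |
    sort -u
}

# Usage on the controller node (one line per node, no header):
#   sinfo -h -N -o '%N %T' | unhealthy_nodes
```

An empty result means every node reported a healthy state at the time of the check.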
Grafana runs on the controller node at port 3000. To access it securely from your local machine:
# Create SSH tunnel
ssh -N -L 3001:localhost:3000 your_user@controller_ip
# Now open in browser: http://localhost:3001
# Login: admin / your_grafana_password
You'll see pre-configured dashboards showing:


Alertmanager is configured to send Slack notifications for:
Example alert in Slack when a node goes down:

For detailed information, check the Grafana dashboard:

Let's run some tests to ensure everything works:
srun hostname
srun --nodes=2 --ntasks=2 hostname
srun --nodes=1 --cpus-per-task=2 --mem=2G --pty bash
# Inside the session
hostname
nproc
free -h
exit
Create test_job.sh:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=1G
#SBATCH --time=00:05:00
echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "CPUs allocated: $SLURM_CPUS_PER_TASK"
echo "Memory allocated: $SLURM_MEM_PER_NODE MB"
# Do some work
sleep 60
echo "Job finished at $(date)"
Submit it:
sbatch test_job.sh
# Check status
squeue
# When done, view output
cat test_*.out
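When the submit step is scripted, it helps to capture the job ID instead of reading it off the screen. This is a minimal sketch with a hypothetical `job_id` helper; it accepts both the default `sbatch` message and the `--parsable` output.

```shell
# Extract the numeric job ID from sbatch output.
# Handles "Submitted batch job 42" as well as the --parsable forms
# "42" and "42;cluster_name".
job_id() {
  sed -e 's/^Submitted batch job //' -e 's/[^0-9].*$//'
}

# Usage:
#   jid=$(sbatch --parsable test_job.sh | job_id)
#   # wait until the job no longer appears in the queue
#   while [ -n "$(squeue -h -j "$jid" 2>/dev/null)" ]; do sleep 5; done
#   cat "test_${jid}.out"
```

For simple cases, `sbatch --wait test_job.sh` blocks until the job finishes and avoids the polling loop entirely.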
# Submit job requesting more resources than available
srun --mem=999999 --pty bash
# Should fail with:
# srun: error: Unable to allocate resources: Requested node configuration is not available
# View your jobs
sacct
# Detailed accounting info
sacct --format=JobID,JobName,User,State,Start,End,Elapsed,CPUTime,MaxRSS
# Cluster usage summary
sreport cluster utilization
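For a quick summary of how jobs ended, the accounting output can be aggregated by state. The sketch below uses a hypothetical `count_job_states` helper and expects the pipe-delimited output that `sacct -P` produces; job steps such as `12.batch` are skipped so each job is counted once.

```shell
# Count jobs per final state from "JobID|State" lines.
# Lines whose JobID contains a dot are job steps and are ignored.
count_job_states() {
  awk -F'|' '$1 !~ /\./ { count[$2]++ } END { for (s in count) print s, count[s] }' |
    sort
}

# Usage:
#   sacct -n -P --format=JobID,State | count_job_states
```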
In this post, we've covered:
In Part 3, we'll cover daily administration tasks, troubleshooting, security best practices, and advanced resource management.
1. Slurm Overview: Official documentation for the Slurm workload manager
2. NVIDIA/deepops: Open-source cluster deployment toolkit (BSD-3-Clause License)
3. elasticluster: Elastic cluster provisioning tool (GPL-3.0 License)
4. Ansible: Ansible for IT Automation DevOps
5. GitHub Repository, Omicslab HPC: Ansible scripts for setting up the Slurm HPC
This is Part 2 of the RiverXData series on building Slurm HPC clusters. Continue to Part 3 for administration and best practices.