
Queue-State Learning: Job Wait Time Prediction in HPC Job Queues

AGOSTINI, DAVIDE
2024/2025

Abstract

The surge of interest in artificial intelligence in recent years has driven growing attention to High-Performance Computing (HPC), particularly in parallel computing, resource allocation, and job scheduling. Among the many systems available, SLURM has established itself as a leading workload manager for job allocation and shared resource usage within clusters. The features behind its success include high scalability, an open-source licence, a large support community, highly flexible scheduling policies, and strong integration with modern computing systems. Despite the central role these systems play in the modern technological landscape, the tuning of their many parameters is still largely carried out through empirical methods and ad hoc choices, guided by the common sense of system administrators. This is the context in which this work fits, whose main aim is to study how a SLURM system can be simulated with machine learning models. The research focused on developing a predictive model capable of estimating job queue waiting times. In particular, by exploiting Transformer-based architectures, it was shown that dependencies in job submission and scheduling behaviour can be modelled. Using SLURM logs, the model processes a representation of the cluster's internal state in order to predict queue waiting times. This system opens the door to models for speeding up simulations and, with further research, to adaptive implementations for optimising cluster workloads.
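The approach outlined above — encoding the pending-job queue as a sequence of feature vectors and attending over it to regress a wait time — can be illustrated with a minimal sketch. Everything below (the feature choices, single-head attention, mean pooling, and all weight shapes) is a hypothetical toy for illustration, not the thesis's actual architecture:

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def self_attention(seq, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a job sequence."""
    Q = [matvec(Wq, x) for x in seq]
    K = [matvec(Wk, x) for x in seq]
    V = [matvec(Wv, x) for x in seq]
    d = len(Q[0])
    out = []
    for q in Q:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                          for k in K])
        # Each output row is the attention-weighted mix of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(scores, V)) for i in range(d)])
    return out

def predict_wait(queue_state, Wq, Wk, Wv, w_out, b_out):
    """Attend over the queue state, mean-pool, and regress a single wait time."""
    ctx = self_attention(queue_state, Wq, Wk, Wv)
    pooled = [sum(col) / len(ctx) for col in zip(*ctx)]
    return sum(w * x for w, x in zip(w_out, pooled)) + b_out

# Hypothetical queue state: one feature vector per pending job, e.g.
# [requested_nodes, requested_hours, priority, queue_length_at_submit].
random.seed(0)
d_in, d = 4, 3
rand_mat = lambda r, c: [[random.uniform(-0.5, 0.5) for _ in range(c)]
                         for _ in range(r)]
Wq, Wk, Wv = rand_mat(d, d_in), rand_mat(d, d_in), rand_mat(d, d_in)
w_out = [random.uniform(-0.5, 0.5) for _ in range(d)]

queue = [[1.0, 0.5, 0.2, 3.0],
         [2.0, 1.0, 0.8, 3.0],
         [8.0, 4.0, 0.1, 3.0]]
# With untrained random weights the value is arbitrary; in the thesis setting
# the weights would be fitted on historical SLURM accounting logs.
wait_estimate = predict_wait(queue, Wq, Wk, Wv, w_out, 0.0)
```

In a real pipeline one would train such a model with a regression loss (e.g. mean squared error) against the wait times recorded in the SLURM accounting database, and a production model would use a full multi-head, multi-layer Transformer encoder rather than this single attention head.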
Keywords: Transformer; Scheduling; Regression; Slurm; Cluster optimization
Files in this record: master-thesis-dagostini.pdf (open access, 2.47 MB, Adobe PDF)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/5710