
Queue-State Learning: Job Wait Time Prediction in HPC Job Queues

AGOSTINI, DAVIDE
2024/2025

Abstract

The surge of interest in artificial intelligence in recent years has driven growing attention to High-Performance Computing (HPC), particularly in parallel computing, resource allocation, and job scheduling. Among the many systems available, SLURM has established itself as a leading workload manager for job allocation and shared resource usage within clusters. The features behind its success include high scalability, an open-source licence, a large support community, highly flexible scheduling policies, and strong integration with modern computing systems. Despite the central role these systems play in the modern technological landscape, the tuning of their many parameters is still largely carried out through empirical methods and ad hoc choices, guided by the common sense of system administrators. This is the context in which this work fits, whose main aim is to study how a SLURM system can be simulated with machine learning models. The research focused on developing a predictive model capable of estimating job queue waiting times. In particular, by exploiting Transformer-based architectures, it was shown that dependencies in job submission and scheduling behaviour can be modelled. Using SLURM logs, the model processes a representation of the cluster's internal state in order to predict queue waiting times. This system opens the door to models for speeding up simulations and, with further research, to adaptive implementations for optimising cluster workloads.
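The approach outlined above — encoding the pending-job queue as a sequence of feature vectors and attending over it to regress a wait time — can be illustrated with a minimal sketch. Everything below (the feature choices, single-head attention, mean pooling, and all weight shapes) is a hypothetical toy for illustration, not the thesis's actual architecture:

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def self_attention(seq, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a job sequence."""
    Q = [matvec(Wq, x) for x in seq]
    K = [matvec(Wk, x) for x in seq]
    V = [matvec(Wv, x) for x in seq]
    d = len(Q[0])
    out = []
    for q in Q:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                          for k in K])
        # Each output row is the attention-weighted mix of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(scores, V)) for i in range(d)])
    return out

def predict_wait(queue_state, Wq, Wk, Wv, w_out, b_out):
    """Attend over the queue state, mean-pool, and regress a single wait time."""
    ctx = self_attention(queue_state, Wq, Wk, Wv)
    pooled = [sum(col) / len(ctx) for col in zip(*ctx)]
    return sum(w * x for w, x in zip(w_out, pooled)) + b_out

# Hypothetical queue state: one feature vector per pending job, e.g.
# [requested_nodes, requested_hours, priority, queue_length_at_submit].
random.seed(0)
d_in, d = 4, 3
rand_mat = lambda r, c: [[random.uniform(-0.5, 0.5) for _ in range(c)]
                         for _ in range(r)]
Wq, Wk, Wv = rand_mat(d, d_in), rand_mat(d, d_in), rand_mat(d, d_in)
w_out = [random.uniform(-0.5, 0.5) for _ in range(d)]

queue = [[1.0, 0.5, 0.2, 3.0],
         [2.0, 1.0, 0.8, 3.0],
         [8.0, 4.0, 0.1, 3.0]]
# With untrained random weights the value is arbitrary; in the thesis setting
# the weights would be fitted on historical SLURM accounting logs.
wait_estimate = predict_wait(queue, Wq, Wk, Wv, w_out, 0.0)
```

In a real pipeline one would train such a model with a regression loss (e.g. mean squared error) against the wait times recorded in the SLURM accounting database, and a production model would use a full multi-head, multi-layer Transformer encoder rather than this single attention head.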
Keywords: Transformer; Scheduling; Regression; Slurm; Cluster optimization
Files in this record: master-thesis-dagostini.pdf (open access, 2.47 MB, Adobe PDF)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/5710