Document Type
Conference Paper
Publication Date
2025
DOI
10.1051/epjconf/202533701120
Publication Title
EPJ Web of Conferences
Volume
337
Pages
01120
Conference Name
27th International Conference on Computing in High Energy and Nuclear Physics, October 19-25, 2024, Krakow, Poland
Abstract
In this study, we address the mounting challenge of monitoring high-throughput computing clusters running computationally intensive jobs, which increasingly strains system administrators. We develop autoencoders that analyze traces of Linux kernel CPU metrics to capture salient system features by producing robust compressed embeddings for various downstream tasks. In addition, we employ graph neural networks to incorporate contextual information from surrounding CPUs and assess their performance. We also demonstrate the enhanced job differentiation achieved by increasing the sampling rate of these traces. Our models are evaluated based on their ability to generate meaningful latent representations, detect anomalies, and distinguish between different job types, marking a preliminary step towards self-supervised, large-scale foundation models for computing centers.
Rights
© 2025 The Authors.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Original Publication Citation
Mohammed, A. H., Jones, M., McSpadden, D., Schram, M., Hess, B., & Rajput, K. (2025). Decode the workload: Training deep learning models for efficient compute cluster representation. EPJ Web of Conferences, 337, 01120. https://doi.org/10.1051/epjconf/202533701120
Included in
Artificial Intelligence and Robotics Commons, Data Science Commons, Engineering Physics Commons