Document Type

Conference Paper

Publication Date

2025

DOI

10.1051/epjconf/202533701120

Publication Title

EPJ Web of Conferences

Volume

337

Pages

01120

Conference Name

27th International Conference on Computing in High Energy and Nuclear Physics, October 19-25, 2024, Krakow, Poland

Abstract

In this study, we address the mounting challenge of monitoring high-throughput computing clusters running computationally intensive jobs, which increasingly strains system administrators. We develop autoencoders that analyze traces of Linux kernel CPU metrics to capture salient system features by producing robust compressed embeddings for various downstream tasks. In addition, we employ graph neural networks to incorporate contextual information from surrounding CPUs and assess their performance. We also demonstrate the enhanced job differentiation achieved by increasing the sampling rate of these traces. Our models are evaluated based on their ability to generate meaningful latent representations, detect anomalies, and distinguish between different job types, marking a preliminary step towards self-supervised, large-scale foundation models for computing centers.

Rights

© 2025 The Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Original Publication Citation

Mohammed, A. H., Jones, M., McSpadden, D., Schram, M., Hess, B., & Rajput, K. (2025). Decode the workload: Training deep learning models for efficient compute cluster representation. EPJ Web of Conferences, 337, 01120. https://doi.org/10.1051/epjconf/202533701120
