Document Type

Article

Publication Date

2025

DOI

10.1093/bioinformatics/btaf228

Publication Title

Bioinformatics

Volume

41

Issue

Supplement_1

Pages

i362-i372

Conference Name

Joint 33rd Annual Conference on Intelligent Systems for Molecular Biology and 24th Annual European Conference on Computational Biology (ISMB/ECCB 2025), 20-24 July 2025, Liverpool, U.K.

Abstract

Motivation: Protein-protein interactions (PPIs) are fundamental to understanding biological processes. Accurately predicting the effects of mutations on PPIs remains a critical requirement for drug design and for mechanistic studies of disease. Recently, deep learning models that use protein 3D structures have become the predominant approach to predicting mutation effects. However, significant challenges remain in practical applications, in part because generalization differs considerably between easy and hard mutations. Specifically, a hard mutation is defined as one whose maximum TM-score against the training set is below 0.6. Additionally, compared with physics-based approaches, the reported performance of deep learning models may be overestimated because of potential data leakage.
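The hard-mutation criterion above can be illustrated with a minimal sketch. It assumes the TM-scores between a test structure and every training structure have already been computed by an external tool; the function name `is_hard_mutation` and the treatment of a score of exactly 0.6 as easy are illustrative assumptions, not taken from the article.

```python
from typing import Sequence

# Threshold stated in the abstract: a mutation is "hard" when its
# maximum TM-score against the training set is below 0.6.
HARD_TM_THRESHOLD = 0.6


def is_hard_mutation(
    tm_scores_to_training: Sequence[float],
    threshold: float = HARD_TM_THRESHOLD,
) -> bool:
    """Return True if the best structural match in the training set is weak.

    `tm_scores_to_training` holds precomputed TM-scores between the test
    complex and each training complex (hypothetical input layout).
    """
    if not tm_scores_to_training:
        raise ValueError("at least one TM-score is required")
    # Strict inequality, following the "< 0.6" wording of the abstract.
    return max(tm_scores_to_training) < threshold


if __name__ == "__main__":
    print(is_hard_mutation([0.31, 0.58, 0.42]))  # closest training match is 0.58
    print(is_hard_mutation([0.31, 0.72]))        # closest training match is 0.72
```

Using the maximum rather than the mean reflects the definition: a single close structural neighbor in the training set is enough to make a test case easy.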

Results: We propose new training/test splits based on the CATH homologous superfamily to mitigate data leakage. Under the constraints of physical energy, protein 3D structures, and CATH domain objectives, we employ a hybrid noise strategy as data augmentation and present a geometric encoder framework, named CATH-ddG, that represents the differences in mutational microenvironment between wild-type and mutated protein complexes. Additionally, we fine-tune ESM2 representations by incorporating a lightweight nonlinear module to transfer sequence co-evolutionary information. Finally, our study demonstrates that the CATH-ddG framework generalizes better, outperforming other baselines on splits without superfamily leakage, which is crucial for robust regression-based prediction of mutation effects. Independent case studies demonstrate successful enhancement of binding affinity for 419 antibody variants against human epidermal growth factor receptor 2 (HER2) and for 285 variants in the receptor-binding domain (RBD) of SARS-CoV-2 binding to the angiotensin-converting enzyme 2 (ACE2) receptor.
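The idea of a split without superfamily leakage can be sketched as a group-wise partition: every sample sharing a CATH homologous superfamily goes to the same side of the split. This is a simplified illustration, not the authors' procedure. It assumes each sample carries exactly one superfamily label (real complexes may span several domains), and the function name, the sample identifiers, and the `test_fraction` parameter are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def split_by_superfamily(
    samples: List[Tuple[str, str]],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str], Set[str]]:
    """Split (sample_id, superfamily_id) pairs so no superfamily spans both sets.

    Returns the training ids, the test ids, and the set of test superfamilies.
    """
    # Group sample ids by their superfamily label.
    groups: Dict[str, List[str]] = defaultdict(list)
    for sample_id, superfamily in samples:
        groups[superfamily].append(sample_id)

    # Shuffle superfamilies (not samples) so whole groups move together.
    superfamilies = sorted(groups)
    random.Random(seed).shuffle(superfamilies)

    # Hold out at least one superfamily for testing.
    n_test = max(1, round(len(superfamilies) * test_fraction))
    test_superfamilies = set(superfamilies[:n_test])

    train = [s for sf in superfamilies[n_test:] for s in groups[sf]]
    test = [s for sf in superfamilies[:n_test] for s in groups[sf]]
    return train, test, test_superfamilies


if __name__ == "__main__":
    # Illustrative identifiers only; not data from the article.
    demo = [
        ("m1", "3.30.70.100"), ("m2", "3.30.70.100"),
        ("m3", "2.60.40.10"), ("m4", "1.10.8.10"),
    ]
    train_ids, test_ids, held_out = split_by_superfamily(demo, test_fraction=0.34)
    print(train_ids, test_ids, held_out)
```

Splitting at the superfamily level rather than the sample level is what prevents structurally related complexes from appearing in both training and test data, at the cost of less control over the exact test-set size.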

Availability and implementation: CATH-ddG is available at https://github.com/ak422/CATH-ddG.

Rights

© The Authors 2025.

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Data Availability

Article states: "All datasets utilized in this study are publicly available, and the data and source code are available in Github, at https://github.com/ak422/CATH-ddG."

Original Publication Citation

Yu, G., Bi, X., Ma, T., Li, Y., & Wang, J. (2025). CATH-ddG: Towards robust mutation effect prediction on protein-protein interactions out of CATH homologous superfamily. Bioinformatics, 41(Supplement_1), i362-i372. https://doi.org/10.1093/bioinformatics/btaf228
