Document Type
Article
Publication Date
2025
DOI
10.1093/bioinformatics/btaf228
Publication Title
Bioinformatics
Volume
41
Issue
Supplement_1
Pages
i362-i372
Conference Name
Joint 33rd Annual Conference on Intelligent Systems for Molecular Biology and 24th Annual European Conference on Computational Biology (ISMB/ECCB 2025), 20-24 July 2025, Liverpool, U.K.
Abstract
Motivation: Protein-protein interactions (PPIs) are fundamental to understanding biological processes. Accurately predicting the effects of mutations on PPIs remains a critical requirement for drug design and mechanistic studies of disease. Recently, deep learning models that use protein 3D structures have become the predominant approach to predicting mutation effects. However, significant challenges remain in practical applications, in part because generalization differs considerably between easy and hard mutations. Specifically, a hard mutation is defined as one whose maximum TM-score against the training set is below 0.6. Additionally, compared with physics-based approaches, the reported performance of deep learning models may be overestimated because of potential data leakage.
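The easy/hard criterion above can be illustrated with a minimal Python sketch. It assumes the TM-scores between a test complex and every training structure have already been computed by an external structure-alignment tool; the function name `is_hard_mutation` and the example scores are hypothetical and do not come from the article or its repository.

```python
from typing import Sequence

# Threshold stated in the abstract: a mutation is "hard" when its best
# structural match in the training set has a TM-score below 0.6.
HARD_TM_THRESHOLD = 0.6


def is_hard_mutation(tm_scores_vs_training: Sequence[float],
                     threshold: float = HARD_TM_THRESHOLD) -> bool:
    """Return True if the maximum TM-score against the training set is below threshold.

    tm_scores_vs_training holds one precomputed TM-score per training
    structure (hypothetical input; the alignment itself is not done here).
    """
    if not tm_scores_vs_training:
        raise ValueError("at least one TM-score is required")
    return max(tm_scores_vs_training) < threshold


# Illustrative values only, not data from the study.
print(is_hard_mutation([0.31, 0.58, 0.42]))  # best match is 0.58, so hard
print(is_hard_mutation([0.31, 0.75]))        # best match is 0.75, so easy
```

Because the comparison is strict, a maximum TM-score of exactly 0.6 counts as easy under this reading of the definition.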
Results: We propose new training/test splits, based on CATH homologous superfamilies, that mitigate data leakage. Under the constraints of physical energy, protein 3D structures, and CATH domain objectives, we employ a hybrid noise strategy as data augmentation and present a geometric encoder framework, named CATH-ddG, that represents the differences in mutational microenvironment between wild-type and mutated protein complexes. Additionally, we fine-tune ESM2 representations with a lightweight nonlinear module to transfer sequence co-evolutionary information. Our study demonstrates that the CATH-ddG framework generalizes better, outperforming other baselines on splits without superfamily leakage, which is crucial for robust regression of mutation effects. Independent case studies demonstrate successful enhancement of binding affinity for 419 antibody variants to human epidermal growth factor receptor 2 (HER2) and 285 variants in the receptor-binding domain (RBD) of SARS-CoV-2 to the angiotensin-converting enzyme 2 (ACE2) receptor.
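One way to picture a split without superfamily leakage is sketched below in Python. This is not the authors' procedure, which the abstract does not specify; it is an assumed illustration in which complexes sharing any CATH superfamily are grouped with union-find, and whole groups are then assigned to either the training or the test set. The function name `superfamily_split`, the complex identifiers, and the superfamily assignments are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def superfamily_split(complex_to_sfs: Dict[str, Set[str]],
                      test_fraction: float = 0.2) -> Tuple[List[str], List[str]]:
    """Split complexes so that train and test share no CATH superfamily."""
    parent = {c: c for c in complex_to_sfs}

    def find(x: str) -> str:
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every complex with the first complex seen for each superfamily.
    first_seen: Dict[str, str] = {}
    for c in sorted(complex_to_sfs):
        for sf in complex_to_sfs[c]:
            if sf in first_seen:
                ra, rb = find(first_seen[sf]), find(c)
                if ra != rb:
                    parent[rb] = ra
            else:
                first_seen[sf] = c

    groups: Dict[str, List[str]] = defaultdict(list)
    for c in sorted(complex_to_sfs):
        groups[find(c)].append(c)

    # Fill the test set with the smallest groups until the target size is reached.
    target = test_fraction * len(complex_to_sfs)
    train: List[str] = []
    test: List[str] = []
    for g in sorted(groups.values(), key=lambda g: (len(g), g[0])):
        (test if len(test) + len(g) <= target else train).extend(g)
    return train, test


# Hypothetical complexes and superfamily labels, for illustration only.
example = {
    "A": {"3.40.50.300"},
    "B": {"3.40.50.300", "2.60.40.10"},
    "C": {"2.60.40.10"},
    "D": {"1.10.8.10"},
    "E": {"3.30.70.270"},
}
train_ids, test_ids = superfamily_split(example, test_fraction=0.4)
print(train_ids, test_ids)
```

Grouping before splitting matters because a multi-domain complex can link two superfamilies; assigning complexes one at a time could place such linked entries on opposite sides of the split.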
Availability and implementation: CATH-ddG is available at https://github.com/ak422/CATH-ddG.
Rights
© The Authors 2025.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Data Availability
Article states: "All datasets utilized in this study are publicly available, and the data and source code are available in Github, at https://github.com/ak422/CATH-ddG."
Original Publication Citation
Yu, G., Bi, X., Ma, T., Li, Y., & Wang, J. (2025). CATH-ddG: Towards robust mutation effect prediction on protein-protein interactions out of CATH homologous superfamily. Bioinformatics, 41(Supplement_1), i362-i372. https://doi.org/10.1093/bioinformatics/btaf228
Included in
Amino Acids, Peptides, and Proteins Commons, Computational Biology Commons, Genetics Commons, Influenza Humans Commons, Virology Commons