Document Type
Conference Paper
Publication Date
2025
DOI
10.1145/3765612.3767255
Publication Title
BCB '25: Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Pages
65
Conference Name
16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, October 12-15, 2025, Philadelphia, PA
Abstract
Accurately quantifying dietary contents, such as calories, proteins, carbohydrates, and fats, from an image of a meal plate is vital for managing diabetes. Recently, Large Multimodal Models (LMMs) have excelled in complex vision-language tasks owing to their use of very large, highly diverse data. This study benchmarked seven LMMs, including full and lightweight models of GPT, Gemini, and Llama, for nutrition estimation on Google's Nutrition5k dataset and our own phone-collected DonateAndLearn dataset. We compared the performance of the LMMs with that of an RGB-D fusion model trained specifically on Nutrition5k data. On our DonateAndLearn dataset, the full-weight versions of the LMMs significantly outperformed the RGB-D fusion model, suggesting superior generalization capacity of the LMMs.
We propose a method to integrate Nutrition5k images with phone-collected meal images, which often lack the physical sizes of the objects they show. We applied scaled phone images to the RGB-D fusion model to predict the total weight of food in each phone-collected image. Using the predicted weight, the mean absolute percentage error (MAPE) of carbohydrate prediction with the Gemini 2.5 Flash model decreased from 56.6% to 39.5%, based on 78 test cases in the DonateAndLearn dataset. Furthermore, when the ground-truth food weight was provided to Gemini 2.5 Flash and GPT-4.1, the MAPE improved further to 20.2% and 26.8%, respectively, which underscores the critical value of integrating physical information into dietary assessment tools.
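For readers unfamiliar with the metric, the following is a minimal sketch of how a mean absolute percentage error is conventionally computed. It is not the authors' evaluation code: the function name `mape`, the handling of zero ground-truth values, and the sample numbers are assumptions made purely for illustration and are not taken from the paper or its datasets.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Return the mean absolute percentage error, in percent.

    Assumption: a ground-truth value of zero is rejected, because the
    percentage error is undefined there. The paper does not say how
    such cases were handled.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one test case is required")

    total = 0.0
    for truth, estimate in zip(actual, predicted):
        if truth == 0:
            raise ValueError("percentage error is undefined for a zero ground truth")
        # Absolute error relative to the ground-truth value.
        total += abs(estimate - truth) / abs(truth)

    return 100.0 * total / len(actual)


# Made-up carbohydrate values in grams (NOT from DonateAndLearn or Nutrition5k).
ground_truth_carbs = [50.0, 80.0, 20.0]
estimated_carbs = [40.0, 100.0, 25.0]

print(f"MAPE: {mape(ground_truth_carbs, estimated_carbs):.1f}%")
```

Under this definition every meal contributes equally regardless of its size, so a 5 g error on a 20 g meal counts as much as a 20 g error on an 80 g meal.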
Rights
© 2025 Copyright is held by the owner/authors.
This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Original Publication Citation
Mu, Y., Sun, J., & He, J. (2025). Benchmarking and improving foundation model dietary estimates from meal images. In X. M. Shi, X. Qian, M. Chen, J. M. Luber, G. L. Rosen, & Y. Luo (Eds.), BCB '25: Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (Article 65). Association for Computing Machinery. https://doi.org/10.1145/3765612.3767255
ORCID
0009-0005-7058-2979 (Mu), 0009-0000-8905-7553 (Sun)
Included in
Artificial Intelligence and Robotics Commons, Endocrinology, Diabetes, and Metabolism Commons, Medical Nutrition Commons, Nutrition Commons