Introduction
Foundation models have changed computer vision. Because of their broad pre-training, they are ideal starting points for finetuning on a range of downstream tasks, opening up new possibilities even when only limited training data is available. For RGB image tasks, foundation models have become the default choice. For infrared images, however, this is not so straightforward. Recently, Liu et al. (ECCV 2024) proposed InfMAE, a foundation model for infrared images. It uses an information-aware masking strategy, because infrared images exhibit less diversity and carry less information (e.g. less texture) than RGB images.
Surprisingly, finetuning this infrared model for object detection and semantic segmentation did not yield significantly better results than finetuning a large RGB model. Both the finetuned RGB models and the finetuned infrared model achieved good performance on the infrared tasks, and in some cases an RGB model even outperformed the infrared model. This raises the question: how can we get the most out of RGB foundation models (e.g. DINOv2) for infrared tasks?
Liu, F., Gao, C., Zhang, Y., Guo, J., Wang, J., & Meng, D. (2024, September). InfMAE: A foundation model in the infrared modality. In European Conference on Computer Vision (pp. 420-437). Cham: Springer Nature Switzerland.
Hu, Z., Yang, B., & Ye, M. (2025). Empowering Visible-Infrared Person Re-Identification with Large Foundation Models. Advances in Neural Information Processing Systems, 37, 117363-117387.
Fan, R., Zhao, W., Lin, M., Wang, Q., Liu, Y. J., & Wang, W. (2024, May). Generalizable Thermal-based Depth Estimation via Pre-trained Visual Foundation Model. In 2024 IEEE International Conference on Robotics and Automation (ICRA) (pp. 14614-14621). IEEE.
Yuan, M., Cui, B., Zhao, T., Wang, J., Fu, S., & Wei, X. (2024). UniRGB-IR: A Unified Framework for RGB-Infrared Semantic Tasks via Adapter Tuning. arXiv preprint arXiv:2404.17360.
What will you be doing?
Explore the opportunities that foundation models offer for infrared images. Research how to transfer the knowledge captured in RGB foundation models to the infrared domain, and how to finetune these models adequately on infrared data. Investigate the added value for object detection by integrating the resulting recipes into a modern detector such as Grounding DINO.
You will perform this assignment within TNO’s Intelligent Imaging department. The Intelligent Imaging department is a passionate, creative, and dedicated team of 60 professionals specializing in developing groundbreaking computer vision applications. Our team members have diverse backgrounds, ranging from the medical field to artificial intelligence. Intelligent Imaging is a young and growing department that has built up extensive expertise in AI and deep learning over the past years.