Chunxu Liu*,
Chi Xie*,
Xiaxu Chen,
Feng Zhu,
Rui Zhao,
Limin Wang
Nanjing University, SenseTime Research
TL;DR. We introduce the Small Object Retrieval in Complex Environments (SORCE) task, a new subfield of text-to-image retrieval (T2IR) that focuses on retrieving small objects in complex images.
We introduce a new dataset, SORCE-1K, comprising 1,023 image-text pairs in which each caption describes only a localized object region. This design explicitly avoids providing contextual clues from the broader scene, thereby preventing models from exploiting shortcut cues.
Additionally, we demonstrate that, with simple yet effective Regional Prompts (ReP), multimodal large language models (MLLMs) can accurately attend to and embed the corresponding image regions. Our fine-tuned models are available for evaluation here.
Please download the SORCE-1K dataset from Hugging Face and place it in the datasets folder.
mkdir datasets
huggingface-cli download --repo-type dataset --resume-download lcxrocks/sorce-1k --local-dir ./datasets/sorce-1k
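After the download finishes, a quick sanity check can confirm that the target folder exists and is not empty. The snippet below is a minimal sketch using only the standard library; the helper name dataset_ready is hypothetical and not part of this repository, and the check only detects a missing or empty folder, not an incomplete download.

```python
from pathlib import Path


def dataset_ready(root: str = "./datasets/sorce-1k") -> bool:
    """Return True if `root` is a directory containing at least one file.

    Hypothetical helper for illustration; the default path mirrors the
    --local-dir used in the download command above.
    """
    path = Path(root)
    if not path.is_dir():
        return False
    # Any regular file anywhere under the directory counts as "not empty".
    return any(p.is_file() for p in path.rglob("*"))


if __name__ == "__main__":
    print("dataset found" if dataset_ready() else "dataset missing or empty")
```

If the check reports a missing folder, re-running the huggingface-cli command above from the repository root should recreate it.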
Please make sure the transformers version is compatible.
conda create -n sorce python=3.11
conda activate sorce
pip install -r requirements.txt
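To see which transformers version actually ended up in the environment, something like the following sketch may help. It relies only on the standard library's importlib.metadata. The helper names and the "4.0.0" minimum are placeholders chosen for illustration, not the project's real requirement; requirements.txt remains the authoritative source.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '4.41.2' into (4, 41, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # e.g. 'dev0' or '0rc1' suffixes are ignored
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted version strings numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def check_package(name: str, minimum: str) -> str:
    """Report the installed version of `name` relative to `minimum`."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name} is not installed"
    status = "ok" if meets_minimum(installed, minimum) else "too old"
    return f"{name} {installed} ({status}, wanted >= {minimum})"


if __name__ == "__main__":
    # "4.0.0" is a placeholder, NOT the version this project requires.
    print(check_package("transformers", "4.0.0"))
```

Comparing integer tuples rather than raw strings avoids the classic pitfall where "4.9.0" sorts after "4.10.0" lexicographically.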
To evaluate the model, run the following command, which will download the pretrained model from 🤗 Hugging Face.
bash dist_eval.sh
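For readers new to retrieval evaluation, the sketch below shows how Recall@K is commonly computed for text-to-image retrieval. This README does not state which metrics dist_eval.sh reports, so the choice of Recall@K is an assumption, as is the convention that query i is paired with image i. The function name recall_at_k and the toy similarity matrix are illustrative only and are not taken from the repository.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose paired image ranks in the top k.

    similarity[i][j] is the score between text query i and image j.
    Assumption: the ground-truth image for query i is image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Image indices sorted from highest to lowest score for this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    # Toy 3x3 similarity matrix: rows are captions, columns are images.
    sim = [
        [0.9, 0.1, 0.3],
        [0.2, 0.4, 0.8],
        [0.1, 0.7, 0.6],
    ]
    print(recall_at_k(sim, 1))  # only the first caption ranks its image first
    print(recall_at_k(sim, 2))  # every caption has its image in the top 2
```

In the toy matrix, one of three captions retrieves its image at rank 1, and all three do so within the top 2.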
If you find this project helpful for your research or applications, please feel free to leave a star ⭐️ and cite our paper:
@misc{liu2025sorcesmallobjectretrieval,
title={SORCE: Small Object Retrieval in Complex Environments},
author={Chunxu Liu and Chi Xie and Xiaxu Chen and Wei Li and Feng Zhu and Rui Zhao and Limin Wang},
year={2025},
eprint={2505.24441},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.24441},
}
This project is released under the Apache 2.0 license. The code is based on E5-V. Please also follow their licenses. Thanks for their awesome work!
