InstructionRetrieval

  • Number of tasks: 8

IFIRAila

Benchmarks the IFIR aila subset for instruction-following abilities. The instructions simulate lawyers' or legal assistants' nuanced queries to retrieve relevant legal documents.

Dataset: if-ir/aila • License: mit • Learn more →

  • Task category: text to text (t2t)
  • Score: ndcg_at_20
  • Languages: eng
  • Domains: Legal, Written
  • Annotations Creators: human-annotated
  • Sample Creation: found
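The reported score, ndcg_at_20, is normalized discounted cumulative gain truncated at rank 20. A minimal sketch of the metric using scikit-learn (the relevance judgments and model scores below are toy values, not drawn from this dataset):

import numpy as np
from sklearn.metrics import ndcg_score

# Graded relevance of five candidate documents for one query (toy values),
# and the ranking scores a retrieval model assigned to the same documents.
true_relevance = np.asarray([[2, 0, 1, 0, 0]])
model_scores = np.asarray([[0.9, 0.8, 0.7, 0.2, 0.1]])

print(ndcg_score(true_relevance, model_scores, k=20))  # nDCG@20 for this query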
Citation
@inproceedings{song2025ifir,
  author = {Song, Tingyu and Gan, Guo and Shang, Mingsheng and Zhao, Yilun},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages = {10186--10204},
  title = {IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval},
  year = {2025},
}
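Each task on this page can be run by name through the mteb library. A minimal sketch, assuming a recent mteb version (the model here is just an example; the same pattern applies to every subset below):

import mteb

# Load the IFIRAila task and evaluate an example embedding model on it.
tasks = mteb.get_tasks(tasks=["IFIRAila"])
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")  # example model
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")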

IFIRCds

Benchmarks the IFIR cds subset for instruction-following abilities. The instructions simulate a doctor's nuanced queries to retrieve suitable clinical trials, treatment, and diagnosis information.

Dataset: if-ir/cds • License: mit • Learn more →

  • Task category: text to text (t2t)
  • Score: ndcg_at_20
  • Languages: eng
  • Domains: Medical, Written
  • Annotations Creators: human-annotated
  • Sample Creation: found
Citation
@inproceedings{song2025ifir,
  author = {Song, Tingyu and Gan, Guo and Shang, Mingsheng and Zhao, Yilun},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages = {10186--10204},
  title = {IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval},
  year = {2025},
}

IFIRFiQA

Benchmarks the IFIR fiqa subset for instruction-following abilities. The instructions simulate people's everyday queries to retrieve suitable financial advice.

Dataset: if-ir/fiqa • License: mit • Learn more →

  • Task category: text to text (t2t)
  • Score: ndcg_at_20
  • Languages: eng
  • Domains: Financial, Written
  • Annotations Creators: human-annotated
  • Sample Creation: created
Citation
@inproceedings{song2025ifir,
  author = {Song, Tingyu and Gan, Guo and Shang, Mingsheng and Zhao, Yilun},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages = {10186--10204},
  title = {IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval},
  year = {2025},
}

IFIRFire

Benchmarks the IFIR fire subset for instruction-following abilities. The instructions simulate lawyers' or legal assistants' nuanced queries to retrieve relevant legal documents.

Dataset: if-ir/fire • License: mit • Learn more →

  • Task category: text to text (t2t)
  • Score: ndcg_at_20
  • Languages: eng
  • Domains: Legal, Written
  • Annotations Creators: human-annotated
  • Sample Creation: found
Citation
@inproceedings{song2025ifir,
  author = {Song, Tingyu and Gan, Guo and Shang, Mingsheng and Zhao, Yilun},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages = {10186--10204},
  title = {IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval},
  year = {2025},
}

IFIRNFCorpus

Benchmarks the IFIR nfcorpus subset for instruction-following abilities. The instructions simulate nuanced queries from students or researchers to retrieve relevant scientific literature in the medical and biological domains.

Dataset: if-ir/nfcorpus • License: mit • Learn more →

  • Task category: text to text (t2t)
  • Score: ndcg_at_20
  • Languages: eng
  • Domains: Academic, Medical, Written
  • Annotations Creators: human-annotated
  • Sample Creation: found
Citation
@inproceedings{song2025ifir,
  author = {Song, Tingyu and Gan, Guo and Shang, Mingsheng and Zhao, Yilun},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages = {10186--10204},
  title = {IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval},
  year = {2025},
}

IFIRPm

Benchmarks the IFIR pm subset for instruction-following abilities. The instructions simulate a doctor's nuanced queries to retrieve suitable clinical trials, treatment, and diagnosis information.

Dataset: if-ir/pm • License: mit • Learn more →

  • Task category: text to text (t2t)
  • Score: ndcg_at_20
  • Languages: eng
  • Domains: Medical, Written
  • Annotations Creators: human-annotated
  • Sample Creation: found
Citation
@inproceedings{song2025ifir,
  author = {Song, Tingyu and Gan, Guo and Shang, Mingsheng and Zhao, Yilun},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages = {10186--10204},
  title = {IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval},
  year = {2025},
}

IFIRScifact

Benchmarks the IFIR scifact_open subset for instruction-following abilities. The instructions simulate nuanced queries from students or researchers to retrieve relevant scientific literature.

Dataset: if-ir/scifact_open • License: mit • Learn more →

  • Task category: text to text (t2t)
  • Score: ndcg_at_20
  • Languages: eng
  • Domains: Academic, Written
  • Annotations Creators: human-annotated
  • Sample Creation: found
Citation
@inproceedings{song2025ifir,
  author = {Song, Tingyu and Gan, Guo and Shang, Mingsheng and Zhao, Yilun},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages = {10186--10204},
  title = {IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval},
  year = {2025},
}

InstructIR

A benchmark specifically designed to evaluate the instruction-following ability of information retrieval models. It focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics of real-world search scenarios. NOTE: scores on this task may differ unless the model encodes the instruction first, then "[SEP]", then the query, which can be done by redefining combine_query_and_instruction in your model; see the sketch after the citation below.

Dataset: mteb/InstructIR-mteb • License: mit • Learn more →

  • Task category: text to text (t2t)
  • Score: robustness_at_10
  • Languages: eng
  • Domains: Web
  • Annotations Creators: human-annotated
  • Sample Creation: created
Citation
@article{oh2024instructir,
  archiveprefix = {arXiv},
  author = {Hanseok Oh and Hyunji Lee and Seonghyeon Ye and Haebin Shin and Hansol Jang and Changwook Jun and Minjoon Seo},
  eprint = {2402.14334},
  primaryclass = {cs.CL},
  title = {INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models},
  year = {2024},
}
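A minimal sketch of the combine_query_and_instruction override mentioned in the note above, hosted on an illustrative sentence-transformers wrapper (only the method name and the instruction-[SEP]-query order come from the note; the rest of the class is an assumption, not mteb's API):

from sentence_transformers import SentenceTransformer

class InstructIRWrapper:
    # Illustrative host class: it simply pairs an encoder with the
    # combine_query_and_instruction hook described in the note above.
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def combine_query_and_instruction(self, query: str, instruction: str) -> str:
        # InstructIR expects the instruction first, then "[SEP]", then the query.
        return f"{instruction} [SEP] {query}"

    def encode(self, sentences, **kwargs):
        return self.model.encode(sentences, **kwargs)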