BitextMining¶

Number of tasks: 31
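
Every task listed below reports `f1` as its score. As a rough illustration of what such a score measures in bitext mining, the sketch below pairs each source sentence with its most similar target sentence by cosine similarity and computes F1 against gold alignments. The toy vectors, the helper names `mine_pairs` and `f1_score`, and the simple nearest-neighbour matching rule are assumptions made for illustration; they are not taken from the benchmark's own implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(src_vecs, tgt_vecs):
    """For each source index, pick the target index with the highest similarity."""
    pairs = set()
    for i, u in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(u, tgt_vecs[j]))
        pairs.add((i, best))
    return pairs


def f1_score(predicted, gold):
    """F1 over sets of (source_index, target_index) pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 2-d "embeddings"; a real run would use a sentence encoder.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
tgt = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
gold = {(0, 0), (1, 1), (2, 2)}

predicted = mine_pairs(src, tgt)
score = f1_score(predicted, gold)
print(sorted(predicted), round(score, 3))
```

In this toy setup the third source sentence is matched to the wrong target, so two of three predicted pairs are correct and both precision and recall drop to 2/3.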

BUCC¶

BUCC bitext mining dataset train split.

Dataset: mteb/BUCC • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	cmn, deu, eng, fra, rus	Written	human-annotated	human-translated

Citation

@inproceedings{zweigenbaum-etal-2017-overview,
  address = {Vancouver, Canada},
  author = {Zweigenbaum, Pierre  and
Sharoff, Serge  and
Rapp, Reinhard},
  booktitle = {Proceedings of the 10th Workshop on Building and Using Comparable Corpora},
  doi = {10.18653/v1/W17-2512},
  editor = {Sharoff, Serge  and
Zweigenbaum, Pierre  and
Rapp, Reinhard},
  month = aug,
  pages = {60--67},
  publisher = {Association for Computational Linguistics},
  title = {Overview of the Second {BUCC} Shared Task: Spotting Parallel Sentences in Comparable Corpora},
  url = {https://aclanthology.org/W17-2512},
  year = {2017},
}

BUCC.v2¶

BUCC bitext mining dataset train split, gold set only.

Dataset: mteb/bucc-bitext-mining • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	cmn, deu, eng, fra, rus	Written	human-annotated	human-translated

Citation

@inproceedings{zweigenbaum-etal-2017-overview,
  address = {Vancouver, Canada},
  author = {Zweigenbaum, Pierre  and
Sharoff, Serge  and
Rapp, Reinhard},
  booktitle = {Proceedings of the 10th Workshop on Building and Using Comparable Corpora},
  doi = {10.18653/v1/W17-2512},
  editor = {Sharoff, Serge  and
Zweigenbaum, Pierre  and
Rapp, Reinhard},
  month = aug,
  pages = {60--67},
  publisher = {Association for Computational Linguistics},
  title = {Overview of the Second {BUCC} Shared Task: Spotting Parallel Sentences in Comparable Corpora},
  url = {https://aclanthology.org/W17-2512},
  year = {2017},
}

BibleNLPBitextMining¶

Partial Bible translations in 829 languages, aligned by verse.
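
Because the translations are partial, two languages only form sentence pairs for verses that both contain. A minimal sketch of that alignment step is shown below; the verse ID format and the placeholder texts are hypothetical and do not reflect the actual dataset schema.

```python
def align_by_verse(src_verses, tgt_verses):
    """Return (verse_id, src_text, tgt_text) for verse IDs present in both inputs."""
    shared_ids = sorted(src_verses.keys() & tgt_verses.keys())
    return [(vid, src_verses[vid], tgt_verses[vid]) for vid in shared_ids]


# Hypothetical partial translations keyed by verse ID (placeholder texts).
translation_a = {"GEN 1:1": "text A1", "GEN 1:2": "text A2", "JHN 3:16": "text A3"}
translation_b = {"GEN 1:1": "text B1", "JHN 3:16": "text B3", "REV 22:21": "text B4"}

pairs = align_by_verse(translation_a, translation_b)
for verse_id, src_text, tgt_text in pairs:
    print(verse_id, src_text, tgt_text)
```

Verses missing from either side are dropped, so the number of pairs varies per language pair.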

Dataset: davidstap/biblenlp-corpus-mmteb • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	aai, aak, aau, aaz, abt, ... (829)	Religious, Written	expert-annotated	created

Citation

@article{akerman2023ebible,
  author = {Akerman, Vesa and Baines, David and Daspit, Damien and Hermjakob, Ulf and Jang, Taeho and Leong, Colin and Martin, Michael and Mathew, Joel and Robie, Jonathan and Schwarting, Marcus},
  journal = {arXiv preprint arXiv:2304.09919},
  title = {The eBible Corpus: Data and Model Benchmarks for Bible Translation for Low-Resource Languages},
  year = {2023},
}

BornholmBitextMining¶

Danish Bornholmsk Parallel Corpus. Bornholmsk is a Danish dialect spoken on the island of Bornholm, Denmark. Historically, it belongs to East Danish, which was also spoken in Scania and Halland, Sweden.

Dataset: mteb/BornholmBitextMining • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	dan	Fiction, Social, Web, Written	expert-annotated	created

Citation

@inproceedings{derczynskiBornholmskNaturalLanguage2019,
  author = {Derczynski, Leon and Kjeldsen, Alex Speed},
  booktitle = {Proceedings of the Nordic Conference of Computational Linguistics (2019)},
  date = {2019},
  pages = {338--344},
  publisher = {Linköping University Electronic Press},
  shorttitle = {Bornholmsk natural language processing},
  title = {Bornholmsk natural language processing: Resources and tools},
  url = {https://pure.itu.dk/ws/files/84551091/W19_6138.pdf},
  urldate = {2024-04-24},
}

DanishMedicinesAgencyBitextMining¶

A bilingual English-Danish parallel corpus from the Danish Medicines Agency.

Dataset: mteb/english-danish-parallel-corpus • License: https://opendefinition.org/od/2.1/en/ • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	dan, eng	Medical, Written	human-annotated	found

Citation

@misc{elrc_danish_medicines_agency_2018,
  author = {Rozis, Roberts},
  institution = {European Union},
  license = {Open Under-PSI},
  note = {Dataset created within the European Language Resource Coordination (ELRC) project under the Connecting Europe Facility - Automated Translation (CEF.AT) actions SMART 2014/1074 and SMART 2015/1091.},
  title = {Bilingual English-Danish Parallel Corpus from the Danish Medicines Agency},
  url = {https://sprogteknologi.dk/dataset/bilingual-english-danish-parallel-corpus-from-the-danish-medicines-agency},
  year = {2019},
}

DiaBlaBitextMining¶

English-French Parallel Corpus. DiaBLa is an English-French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue.

Dataset: mteb/DiaBlaBitextMining • License: cc-by-nc-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	eng, fra	Social, Written	human-annotated	created

Citation

@article{bawden2021diabla,
  author = {Bawden, Rachel and Bilinski, Eric and Lavergne, Thomas and Rosset, Sophie},
  journal = {Language Resources and Evaluation},
  title = {DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation},
  year = {2021},
}

FloresBitextMining¶

FLORES is a benchmark dataset for machine translation between English and low-resource languages.

Dataset: mteb/FloresBitextMining • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	ace, acm, acq, aeb, afr, ... (196)	Encyclopaedic, Non-fiction, Written	human-annotated	created

Citation

@inproceedings{goyal2022flores,
  author = {Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc'Aurelio and Guzm{\'a}n, Francisco},
  booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages = {19--35},
  title = {The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation},
  year = {2022},
}

IN22ConvBitextMining¶

IN22-Conv is an n-way parallel conversation-domain benchmark dataset for machine translation spanning English and 22 Indic languages.

Dataset: mteb/IN22ConvBitextMining • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	asm, ben, brx, doi, eng, ... (23)	Fiction, Social, Spoken	expert-annotated	created

Citation

@article{gala2023indictrans,
  author = {Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  issn = {2835-8856},
  journal = {Transactions on Machine Learning Research},
  note = {},
  title = {IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  url = {https://openreview.net/forum?id=vfT4YuzAYA},
  year = {2023},
}

IN22GenBitextMining¶

IN22-Gen is an n-way parallel, general-purpose, multi-domain benchmark dataset for machine translation spanning English and 22 Indic languages.

Dataset: mteb/IN22GenBitextMining • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	asm, ben, brx, doi, eng, ... (23)	Government, Legal, News, Non-fiction, Religious, ... (7)	expert-annotated	created

Citation

@article{gala2023indictrans,
  author = {Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  issn = {2835-8856},
  journal = {Transactions on Machine Learning Research},
  note = {},
  title = {IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  url = {https://openreview.net/forum?id=vfT4YuzAYA},
  year = {2023},
}

IWSLT2017BitextMining¶

The IWSLT 2017 Multilingual Task addresses text translation, including zero-shot translation, with a single MT system across all directions including English, German, Dutch, Italian and Romanian.

Dataset: mteb/IWSLT2017BitextMining • License: cc-by-nc-nd-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	ara, cmn, deu, eng, fra, ... (10)	Fiction, Non-fiction, Written	expert-annotated	found

Citation

@inproceedings{cettolo-etal-2017-overview,
  address = {Tokyo, Japan},
  author = {Cettolo, Mauro  and
Federico, Marcello  and
Bentivogli, Luisa  and
Niehues, Jan  and
St{\"u}ker, Sebastian  and
Sudoh, Katsuhito  and
Yoshino, Koichiro  and
Federmann, Christian},
  booktitle = {Proceedings of the 14th International Conference on Spoken Language Translation},
  editor = {Sakti, Sakriani  and
Utiyama, Masao},
  month = dec # { 14-15},
  pages = {2--14},
  publisher = {International Workshop on Spoken Language Translation},
  title = {Overview of the {IWSLT} 2017 Evaluation Campaign},
  url = {https://aclanthology.org/2017.iwslt-1.1},
  year = {2017},
}

IndicGenBenchFloresBitextMining¶

Flores-IN is an extension of the Flores dataset, released as part of IndicGenBench by Google.

Dataset: mteb/IndicGenBenchFloresBitextMining • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	asm, awa, ben, bgc, bho, ... (30)	News, Web, Written	expert-annotated	human-translated and localized

Citation

@misc{singh2024indicgenbench,
  archiveprefix = {arXiv},
  author = {Harman Singh and Nitish Gupta and Shikhar Bharadwaj and Dinesh Tewari and Partha Talukdar},
  eprint = {2404.16816},
  primaryclass = {cs.CL},
  title = {IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages},
  year = {2024},
}

LinceMTBitextMining¶

LinceMT is a parallel corpus for machine translation pairing code-mixed Hinglish (a fusion of Hindi and English commonly used in modern India) with human-generated English translations.

Dataset: gentaiscool/bitext_lincemt_miners • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	eng, hin	Social, Written	human-annotated	found

Citation

@inproceedings{aguilar2020lince,
  author = {Aguilar, Gustavo and Kar, Sudipta and Solorio, Thamar},
  booktitle = {Proceedings of the Twelfth Language Resources and Evaluation Conference},
  pages = {1803--1813},
  title = {LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation},
  year = {2020},
}

NTREXBitextMining¶

NTREX is a News Test References dataset for Machine Translation Evaluation, covering translation from English into 128 languages. We select language pairs according to the M2M-100 language grouping strategy, resulting in 1916 directions.

Dataset: mteb/NTREXBitextMining • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	afr, amh, arb, aze, bak, ... (119)	News, Written	expert-annotated	human-translated and localized

Citation

@inproceedings{federmann-etal-2022-ntrex,
  address = {Online},
  author = {Federmann, Christian and Kocmi, Tom and Xin, Ying},
  booktitle = {Proceedings of the First Workshop on Scaling Up Multilingual Evaluation},
  month = {nov},
  pages = {21--24},
  publisher = {Association for Computational Linguistics},
  title = {{NTREX}-128 {--} News Test References for {MT} Evaluation of 128 Languages},
  url = {https://aclanthology.org/2022.sumeval-1.4},
  year = {2022},
}

NollySentiBitextMining¶

NollySenti is a dataset of Nollywood movie reviews in five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian-Pidgin, and Yoruba).

Dataset: gentaiscool/bitext_nollysenti_miners • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	eng, hau, ibo, pcm, yor	Reviews, Social, Written	human-annotated	found

Citation

@inproceedings{shode2023nollysenti,
  author = {Shode, Iyanuoluwa and Adelani, David Ifeoluwa and Peng, Jing and Feldman, Anna},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  pages = {986--998},
  title = {NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification},
  year = {2023},
}

NorwegianCourtsBitextMining¶

Nynorsk and Bokmål parallel corpus from Norwegian courts. Norwegian has two standardised written languages. Bokmål is a variant closer to Danish, while Nynorsk was created to resemble regional dialects of Norwegian.

Dataset: kardosdrur/norwegian-courts • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	nno, nob	Legal, Written	human-annotated	found

Citation

@inproceedings{opus4,
  author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  title = {OPUS-MT — Building open translation services for the World},
  year = {2020},
}

NusaTranslationBitextMining¶

NusaTranslation is a parallel dataset for machine translation on 11 Indonesian languages and English.

Dataset: gentaiscool/bitext_nusatranslation_miners • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	abs, bbc, bew, bhp, ind, ... (12)	Social, Written	human-annotated	created

Citation

@inproceedings{cahyawijaya-etal-2023-nusawrites,
  address = {Nusa Dua, Bali},
  author = {Cahyawijaya, Samuel  and  Lovenia, Holy  and Koto, Fajri  and  Adhista, Dea  and  Dave, Emmanuel  and  Oktavianti, Sarah  and  Akbar, Salsabil  and  Lee, Jhonson  and  Shadieq, Nuur  and  Cenggoro, Tjeng Wawan  and  Linuwih, Hanung  and  Wilie, Bryan  and  Muridan, Galih  and  Winata, Genta  and  Moeljadi, David  and  Aji, Alham Fikri  and  Purwarianti, Ayu  and  Fung, Pascale},
  booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  editor = {Park, Jong C.  and  Arase, Yuki  and  Hu, Baotian  and  Lu, Wei  and  Wijaya, Derry  and  Purwarianti, Ayu  and  Krisnadhi, Adila Alfa},
  month = nov,
  pages = {921--945},
  publisher = {Association for Computational Linguistics},
  title = {NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages},
  url = {https://aclanthology.org/2023.ijcnlp-main.60},
  year = {2023},
}

NusaXBitextMining¶

NusaX is a parallel dataset for machine translation and sentiment analysis on 11 Indonesian languages and English.

Dataset: gentaiscool/bitext_nusax_miners • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	ace, ban, bbc, bjn, bug, ... (12)	Reviews, Written	human-annotated	created

Citation

@inproceedings{winata2023nusax,
  author = {Winata, Genta Indra and Aji, Alham Fikri and Cahyawijaya, Samuel and Mahendra, Rahmad and Koto, Fajri and Romadhony, Ade and Kurniawan, Kemal and Moeljadi, David and Prasojo, Radityo Eko and Fung, Pascale and others},
  booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics},
  pages = {815--834},
  title = {NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages},
  year = {2023},
}

@misc{winata2024miners,
  archiveprefix = {arXiv},
  author = {Genta Indra Winata and Ruochen Zhang and David Ifeoluwa Adelani},
  eprint = {2406.07424},
  primaryclass = {cs.CL},
  title = {MINERS: Multilingual Language Models as Semantic Retrievers},
  year = {2024},
}

PhincBitextMining¶

Phinc is a parallel corpus for machine translation pairing code-mixed Hinglish (a fusion of Hindi and English commonly used in modern India) with human-generated English translations.

Dataset: gentaiscool/bitext_phinc_miners • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	eng, hin	Social, Written	human-annotated	found

Citation

@inproceedings{srivastava2020phinc,
  author = {Srivastava, Vivek and Singh, Mayank},
  booktitle = {Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)},
  pages = {41--49},
  title = {PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation},
  year = {2020},
}

PubChemSMILESBitextMining¶

ChemTEB evaluates the performance of text embedding models on chemical domain data.

Dataset: BASF-AI/PubChemSMILESBitextMining • License: cc-by-nc-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	eng	Chemistry	derived	created

Citation

@article{kasmaee2024chemteb,
  author = {Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
  journal = {arXiv preprint arXiv:2412.00532},
  title = {ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
  year = {2024},
}

@article{kim2023pubchem,
  author = {Kim, Sunghwan and Chen, Jie and Cheng, Tiejun and Gindulyte, Asta and He, Jia and He, Siqian and Li, Qingliang and Shoemaker, Benjamin A and Thiessen, Paul A and Yu, Bo and others},
  journal = {Nucleic acids research},
  number = {D1},
  pages = {D1373--D1380},
  publisher = {Oxford University Press},
  title = {PubChem 2023 update},
  volume = {51},
  year = {2023},
}

RomaTalesBitextMining¶

Parallel corpus of Roma Tales in Lovari with Hungarian translations.

Dataset: kardosdrur/roma-tales • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	hun, rom	Fiction, Written	expert-annotated	created

RuSciBenchBitextMining¶

This task focuses on finding translations of scientific articles. The dataset is sourced from eLibrary, Russia's largest electronic library of scientific publications. Russian authors often provide English translations for their abstracts and titles, and the data consists of these paired titles and abstracts. The task evaluates a model's ability to match an article's Russian title and abstract to its English counterpart, or vice versa.

Dataset: mlsa-iai-msu-lab/ru_sci_bench_bitext_mining • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to category (t2c)	f1	eng, rus	Academic, Non-fiction, Written	derived	found

Citation

@article{vatolin2024ruscibench,
  author = {Vatolin, A. and Gerasimenko, N. and Ianina, A. and Vorontsov, K.},
  doi = {10.1134/S1064562424602191},
  issn = {1531-8362},
  journal = {Doklady Mathematics},
  month = {12},
  number = {1},
  pages = {S251--S260},
  title = {RuSciBench: Open Benchmark for Russian and English Scientific Document Representations},
  url = {https://doi.org/10.1134/S1064562424602191},
  volume = {110},
  year = {2024},
}

RuSciBenchBitextMining.v2¶

This task focuses on finding translations of scientific articles. The dataset is sourced from eLibrary, Russia's largest electronic library of scientific publications. Russian authors often provide English translations for their abstracts and titles, and the data consists of these paired titles and abstracts. The task evaluates a model's ability to match an article's Russian title and abstract to its English counterpart, or vice versa. Compared to the previous version, 6 erroneous examples have been removed.

Dataset: mlsa-iai-msu-lab/ru_sci_bench_bitext_mining • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to category (t2c)	f1	eng, rus	Academic, Non-fiction, Written	derived	found

Citation

@article{vatolin2024ruscibench,
  author = {Vatolin, A. and Gerasimenko, N. and Ianina, A. and Vorontsov, K.},
  doi = {10.1134/S1064562424602191},
  issn = {1531-8362},
  journal = {Doklady Mathematics},
  month = {12},
  number = {1},
  pages = {S251--S260},
  title = {RuSciBench: Open Benchmark for Russian and English Scientific Document Representations},
  url = {https://doi.org/10.1134/S1064562424602191},
  volume = {110},
  year = {2024},
}

SAMSumFa¶

Translated Version of SAMSum Dataset for summary retrieval.

Dataset: MCINext/samsum-fa • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	fas	Spoken	LM-generated	machine-translated

SRNCorpusBitextMining¶

SRNCorpus is a machine translation corpus for the creole language Sranantongo and Dutch.

Dataset: mteb/SRNCorpusBitextMining • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	nld, srn	Social, Web, Written	human-annotated	found

Citation

@article{zwennicker2022towards,
  author = {Zwennicker, Just and Stap, David},
  journal = {arXiv preprint arXiv:2212.06383},
  title = {Towards a general purpose machine translation system for Sranantongo},
  year = {2022},
}

SynPerChatbotRAGSumSRetrieval¶

Synthetic Persian Chatbot RAG Summary Dataset for summary retrieval.

Dataset: MCINext/synthetic-persian-chatbot-rag-summary-retrieval • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	fas	Spoken	LM-generated	LM-generated and verified

SynPerChatbotSumSRetrieval¶

Synthetic Persian Chatbot Summary Dataset for summary retrieval.

Dataset: MCINext/synthetic-persian-chatbot-summary-retrieval • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	fas	Spoken	LM-generated	LM-generated and verified

Tatoeba¶

1,000 English-aligned sentence pairs for each language, based on the Tatoeba corpus.

Dataset: mteb/tatoeba-bitext-mining • License: cc-by-2.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	afr, amh, ang, ara, arq, ... (113)	Written	human-annotated	found

Citation

@misc{tatoeba,
  author = {Tatoeba community},
  title = {Tatoeba: Collection of sentences and translations},
  year = {2021},
}

TbilisiCityHallBitextMining¶

Parallel news titles from the Tbilisi City Hall website (https://tbilisi.gov.ge/).

Dataset: jupyterjazz/tbilisi-city-hall-titles • License: not specified • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	eng, kat	News, Written	derived	created

VieMedEVBitextMining¶

High-quality Vietnamese-English parallel data from the medical domain for machine translation.

Dataset: mteb/VieMedEVBitextMining • License: cc-by-nc-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	eng, vie	Medical, Written	expert-annotated	human-translated and localized

Citation

@inproceedings{medev,
  author = {Nhu Vo and Dat Quoc Nguyen and Dung D. Le and Massimo Piccardi and Wray Buntine},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
  title = {{Improving Vietnamese-English Medical Machine Translation}},
  year = {2024},
}

WebFAQBitextMiningQAs¶

The WebFAQ Bitext Dataset consists of natural FAQ-style Question-Answer pairs that align across languages. A sentence in the "WebFAQBitextMiningQAs" task is a concatenation of a question and its corresponding answer. The dataset is sourced from FAQ pages on the web.
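
The concatenation described above can be sketched as follows. The single-space separator, the `build_sentence` helper, and the example strings are assumptions for illustration; the dataset's actual joining format is not specified here.

```python
def build_sentence(question, answer=None, sep=" "):
    """Join a question and its answer into one mining 'sentence'.

    With answer=None only the question is returned, which mirrors the
    question-only view used by the sibling WebFAQBitextMiningQuestions task.
    """
    question = question.strip()
    if answer is None:
        return question
    return f"{question}{sep}{answer.strip()}"


# Hypothetical FAQ entry.
qa_sentence = build_sentence(
    "How do I reset my password? ",
    " Use the reset link on the login page.",
)
question_only = build_sentence("How do I reset my password? ")
print(qa_sentence)
print(question_only)
```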

Dataset: PaDaS-Lab/webfaq-bitexts • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	ara, aze, ben, bul, cat, ... (49)	Web, Written	human-annotated	human-translated

Citation

@misc{dinzinger2025webfaq,
  archiveprefix = {arXiv},
  author = {Michael Dinzinger and Laura Caspari and Kanishka Ghosh Dastidar and Jelena Mitrović and Michael Granitzer},
  eprint = {2502.20936},
  primaryclass = {cs.CL},
  title = {WebFAQ: A Multilingual Collection of Natural Q\&A Datasets for Dense Retrieval},
  url = {https://arxiv.org/abs/2502.20936},
  year = {2025},
}

WebFAQBitextMiningQuestions¶

The WebFAQ Bitext Dataset consists of natural FAQ-style Question-Answer pairs that align across languages. A sentence in the "WebFAQBitextMiningQuestions" task is the question originating from an aligned QA. The dataset is sourced from FAQ pages on the web.

Dataset: PaDaS-Lab/webfaq-bitexts • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
text to text (t2t)	f1	ara, aze, ben, bul, cat, ... (49)	Web, Written	human-annotated	human-translated

Citation

@misc{dinzinger2025webfaq,
  archiveprefix = {arXiv},
  author = {Michael Dinzinger and Laura Caspari and Kanishka Ghosh Dastidar and Jelena Mitrović and Michael Granitzer},
  eprint = {2502.20936},
  primaryclass = {cs.CL},
  title = {WebFAQ: A Multilingual Collection of Natural Q\&A Datasets for Dense Retrieval},
  url = {https://arxiv.org/abs/2502.20936},
  year = {2025},
}