Clustering¶
- Number of tasks: 98
AlloProfClusteringP2P¶
Clustering of document titles and descriptions from Allo Prof dataset. Clustering of 10 sets on the document topic.
Dataset: lyon-nlp/alloprof
• License: mit • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fra | Encyclopaedic, Written | human-annotated | found |
Citation
@misc{lef23,
author = {Lefebvre-Brossard, Antoine and Gazaille, Stephane and Desmarais, Michel C.},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
doi = {10.48550/ARXIV.2302.07738},
keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
publisher = {arXiv},
title = {Alloprof: a new French question-answer education dataset and its use in an information retrieval case study},
url = {https://arxiv.org/abs/2302.07738},
year = {2023},
}
AlloProfClusteringP2P.v2¶
Clustering of document titles and descriptions from Allo Prof dataset. Clustering of 10 sets on the document topic.
Dataset: lyon-nlp/alloprof
• License: mit • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fra | Encyclopaedic, Written | human-annotated | found |
Citation
@misc{lef23,
author = {Lefebvre-Brossard, Antoine and Gazaille, Stephane and Desmarais, Michel C.},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
doi = {10.48550/ARXIV.2302.07738},
keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
publisher = {arXiv},
title = {Alloprof: a new French question-answer education dataset and its use in an information retrieval case study},
url = {https://arxiv.org/abs/2302.07738},
year = {2023},
}
AlloProfClusteringS2S¶
Clustering of document titles from Allo Prof dataset. Clustering of 10 sets on the document topic.
Dataset: lyon-nlp/alloprof
• License: mit • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fra | Encyclopaedic, Written | human-annotated | found |
Citation
@misc{lef23,
author = {Lefebvre-Brossard, Antoine and Gazaille, Stephane and Desmarais, Michel C.},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
doi = {10.48550/ARXIV.2302.07738},
keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
publisher = {arXiv},
title = {Alloprof: a new French question-answer education dataset and its use in an information retrieval case study},
url = {https://arxiv.org/abs/2302.07738},
year = {2023},
}
AlloProfClusteringS2S.v2¶
Clustering of document titles from Allo Prof dataset. Clustering of 10 sets on the document topic.
Dataset: lyon-nlp/alloprof
• License: mit • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fra | Encyclopaedic, Written | human-annotated | found |
Citation
@misc{lef23,
author = {Lefebvre-Brossard, Antoine and Gazaille, Stephane and Desmarais, Michel C.},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
doi = {10.48550/ARXIV.2302.07738},
keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
publisher = {arXiv},
title = {Alloprof: a new French question-answer education dataset and its use in an information retrieval case study},
url = {https://arxiv.org/abs/2302.07738},
year = {2023},
}
ArXivHierarchicalClusteringP2P¶
Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category
Dataset: mteb/arxiv-clustering-p2p
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | found |
ArXivHierarchicalClusteringS2S¶
Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category
Dataset: mteb/arxiv-clustering-s2s
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | found |
ArxivClusteringP2P¶
Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category
Dataset: mteb/arxiv-clustering-p2p
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | found |
Citation
@misc{arxiv_org_submitters_2024,
author = {arXiv.org submitters},
doi = {10.34740/KAGGLE/DSV/7548853},
publisher = {Kaggle},
title = {arXiv Dataset},
url = {https://www.kaggle.com/dsv/7548853},
year = {2024},
}
ArxivClusteringP2P.v2¶
Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category
Dataset: mteb/arxiv-clustering-p2p
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | found |
Citation
@misc{arxiv_org_submitters_2024,
author = {arXiv.org submitters},
doi = {10.34740/KAGGLE/DSV/7548853},
publisher = {Kaggle},
title = {arXiv Dataset},
url = {https://www.kaggle.com/dsv/7548853},
year = {2024},
}
ArxivClusteringS2S¶
Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category
Dataset: mteb/arxiv-clustering-s2s
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | found |
Citation
@misc{arxiv_org_submitters_2024,
author = {arXiv.org submitters},
doi = {10.34740/KAGGLE/DSV/7548853},
publisher = {Kaggle},
title = {arXiv Dataset},
url = {https://www.kaggle.com/dsv/7548853},
year = {2024},
}
BeytooteClustering¶
Beytoote Website Articles Clustering
Dataset: MCINext/beytoote-clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fas | News | derived | found |
Citation
BigPatentClustering¶
Clustering of documents from the Big Patent dataset. Test set only includes documents belonging to a single category, with a total of 9 categories.
Dataset: jinaai/big-patent-clustering
• License: cc-by-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Legal, Written | derived | found |
Citation
@article{DBLP:journals/corr/abs-1906-03741,
author = {Eva Sharma and
Chen Li and
Lu Wang},
bibsource = {dblp computer science bibliography, https://dblp.org},
biburl = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
eprint = {1906.03741},
eprinttype = {arXiv},
journal = {CoRR},
timestamp = {Wed, 26 Jun 2019 07:14:58 +0200},
title = {{BIGPATENT:} {A} Large-Scale Dataset for Abstractive and Coherent
Summarization},
url = {http://arxiv.org/abs/1906.03741},
volume = {abs/1906.03741},
year = {2019},
}
BigPatentClustering.v2¶
Clustering of documents from the Big Patent dataset. Test set only includes documents belonging to a single category, with a total of 9 categories.
Dataset: mteb/big-patent
• License: cc-by-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Legal, Written | derived | found |
Citation
@article{DBLP:journals/corr/abs-1906-03741,
author = {Eva Sharma and
Chen Li and
Lu Wang},
bibsource = {dblp computer science bibliography, https://dblp.org},
biburl = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
eprint = {1906.03741},
eprinttype = {arXiv},
journal = {CoRR},
timestamp = {Wed, 26 Jun 2019 07:14:58 +0200},
title = {{BIGPATENT:} {A} Large-Scale Dataset for Abstractive and Coherent
Summarization},
url = {http://arxiv.org/abs/1906.03741},
volume = {abs/1906.03741},
year = {2019},
}
BiorxivClusteringP2P¶
Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category.
Dataset: mteb/biorxiv-clustering-p2p
• License: https://www.biorxiv.org/content/about-biorxiv • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | created |
BiorxivClusteringP2P.v2¶
Clustering of titles+abstract from biorxiv across 26 categories.
Dataset: mteb/biorxiv-clustering-p2p
• License: https://www.biorxiv.org/content/about-biorxiv • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | created |
BiorxivClusteringS2S¶
Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category.
Dataset: mteb/biorxiv-clustering-s2s
• License: https://www.biorxiv.org/content/about-biorxiv • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | created |
BiorxivClusteringS2S.v2¶
Clustering of titles from biorxiv across 26 categories.
Dataset: mteb/biorxiv-clustering-s2s
• License: https://www.biorxiv.org/content/about-biorxiv • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | created |
BlurbsClusteringP2P¶
Clustering of book titles+blurbs. Clustering of 28 sets, either on the main or secondary genre.
Dataset: slvnwhrl/blurbs-clustering-p2p
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu | Written | not specified | not specified |
Citation
@inproceedings{Remus2019GermEval2T,
author = {Steffen Remus and Rami Aly and Chris Biemann},
booktitle = {Conference on Natural Language Processing},
title = {GermEval 2019 Task 1: Hierarchical Classification of Blurbs},
url = {https://api.semanticscholar.org/CorpusID:208334484},
year = {2019},
}
BlurbsClusteringP2P.v2¶
Clustering of book titles+blurbs. Clustering of 28 sets, either on the main or secondary genre.
Dataset: slvnwhrl/blurbs-clustering-p2p
• License: cc-by-nc-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu | Fiction, Written | derived | found |
Citation
@inproceedings{Remus2019GermEval2T,
author = {Steffen Remus and Rami Aly and Chris Biemann},
booktitle = {Conference on Natural Language Processing},
title = {GermEval 2019 Task 1: Hierarchical Classification of Blurbs},
url = {https://api.semanticscholar.org/CorpusID:208334484},
year = {2019},
}
BlurbsClusteringS2S¶
Clustering of book titles. Clustering of 28 sets, either on the main or secondary genre.
Dataset: slvnwhrl/blurbs-clustering-s2s
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu | Written | not specified | not specified |
Citation
@inproceedings{Remus2019GermEval2T,
author = {Steffen Remus and Rami Aly and Chris Biemann},
booktitle = {Conference on Natural Language Processing},
title = {GermEval 2019 Task 1: Hierarchical Classification of Blurbs},
url = {https://api.semanticscholar.org/CorpusID:208334484},
year = {2019},
}
BlurbsClusteringS2S.v2¶
Clustering of book titles. Clustering of 28 sets, either on the main or secondary genre.
Dataset: slvnwhrl/blurbs-clustering-s2s
• License: cc-by-nc-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu | Fiction, Written | derived | found |
Citation
@inproceedings{Remus2019GermEval2T,
author = {Steffen Remus and Rami Aly and Chris Biemann},
booktitle = {Conference on Natural Language Processing},
title = {GermEval 2019 Task 1: Hierarchical Classification of Blurbs},
url = {https://api.semanticscholar.org/CorpusID:208334484},
year = {2019},
}
BuiltBenchClusteringP2P¶
Clustering of built asset item descriptions based on categories identified within industry classification systems such as IFC, Uniclass, etc.
Dataset: mehrzad-shahin/BuiltBench-clustering-p2p
• License: cc-by-nd-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Engineering, Written | derived | created |
Citation
@article{shahinmoghadam2024benchmarking,
author = {Shahinmoghadam, Mehrzad and Motamedi, Ali},
journal = {arXiv preprint arXiv:2411.12056},
title = {Benchmarking pre-trained text embedding models in aligning built asset information},
year = {2024},
}
BuiltBenchClusteringS2S¶
Clustering of built asset names/titles based on categories identified within industry classification systems such as IFC, Uniclass, etc.
Dataset: mehrzad-shahin/BuiltBench-clustering-s2s
• License: cc-by-nd-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Engineering, Written | derived | created |
Citation
@article{shahinmoghadam2024benchmarking,
author = {Shahinmoghadam, Mehrzad and Motamedi, Ali},
journal = {arXiv preprint arXiv:2411.12056},
title = {Benchmarking pre-trained text embedding models in aligning built asset information},
year = {2024},
}
CLSClusteringP2P¶
Clustering of titles + abstract from CLS dataset. Clustering of 13 sets on the main category.
Dataset: C-MTEB/CLSClusteringP2P
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | cmn | not specified | not specified | not specified |
Citation
@article{li2022csl,
author = {Li, Yudong and Zhang, Yuqing and Zhao, Zhe and Shen, Linlin and Liu, Weijie and Mao, Weiquan and Zhang, Hui},
journal = {arXiv preprint arXiv:2209.05034},
title = {CSL: A large-scale Chinese scientific literature dataset},
year = {2022},
}
CLSClusteringP2P.v2¶
Clustering of titles + abstract from CLS dataset. Clustering of 13 sets on the main category.
Dataset: C-MTEB/CLSClusteringP2P
• License: apache-2.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | cmn | Academic, Written | derived | found |
Citation
@misc{li2022csl,
archiveprefix = {arXiv},
author = {Yudong Li and Yuqing Zhang and Zhe Zhao and Linlin Shen and Weijie Liu and Weiquan Mao and Hui Zhang},
eprint = {2209.05034},
primaryclass = {cs.CL},
title = {CSL: A Large-scale Chinese Scientific Literature Dataset},
year = {2022},
}
CLSClusteringS2S¶
Clustering of titles from CLS dataset. Clustering of 13 sets on the main category.
Dataset: C-MTEB/CLSClusteringS2S
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | cmn | not specified | not specified | not specified |
Citation
@article{li2022csl,
author = {Li, Yudong and Zhang, Yuqing and Zhao, Zhe and Shen, Linlin and Liu, Weijie and Mao, Weiquan and Zhang, Hui},
journal = {arXiv preprint arXiv:2209.05034},
title = {CSL: A large-scale Chinese scientific literature dataset},
year = {2022},
}
CLSClusteringS2S.v2¶
Clustering of titles from CLS dataset. Clustering of 13 sets on the main category.
Dataset: C-MTEB/CLSClusteringS2S
• License: apache-2.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | cmn | Academic, Written | derived | found |
Citation
@misc{li2022csl,
archiveprefix = {arXiv},
author = {Yudong Li and Yuqing Zhang and Zhe Zhao and Linlin Shen and Weijie Liu and Weiquan Mao and Hui Zhang},
eprint = {2209.05034},
primaryclass = {cs.CL},
title = {CSL: A Large-scale Chinese Scientific Literature Dataset},
year = {2022},
}
ClusTREC-Covid¶
A Topical Clustering Benchmark for COVID-19 Scientific Research across 50 covid-19 related topics.
Dataset: Uri-ka/ClusTREC-Covid
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Medical, Written | expert-annotated | created |
Citation
@inproceedings{katz-etal-2024-knowledge,
address = {Miami, Florida, USA},
author = {Katz, Uri and
Levy, Mosh and
Goldberg, Yoav},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
month = nov,
pages = {8838--8855},
publisher = {Association for Computational Linguistics},
title = {Knowledge Navigator: {LLM}-guided Browsing Framework for Exploratory Search in Scientific Literature},
url = {https://aclanthology.org/2024.findings-emnlp.516},
year = {2024},
}
DigikalamagClustering¶
A total of 8,515 articles scraped from Digikala Online Magazine. This dataset includes seven different classes.
Dataset: PNLPhub/DigiMag
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fas | Web | derived | found |
Citation
EightTagsClustering¶
Clustering of headlines from social media posts in Polish belonging to 8 categories: film, history, food, medicine, motorization, work, sport and technology.
Dataset: PL-MTEB/8tags-clustering
• License: gpl-3.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | pol | Social, Written | derived | found |
Citation
@inproceedings{dadas-etal-2020-evaluation,
abstract = {Methods for learning sentence representations have been actively developed in recent years. However, the lack of pre-trained models and datasets annotated at the sentence level has been a problem for low-resource languages such as Polish which led to less interest in applying these methods to language-specific tasks. In this study, we introduce two new Polish datasets for evaluating sentence embeddings and provide a comprehensive evaluation of eight sentence representation methods including Polish and multilingual models. We consider classic word embedding models, recently developed contextual embeddings and multilingual sentence encoders, showing strengths and weaknesses of specific approaches. We also examine different methods of aggregating word vectors into a single sentence vector.},
address = {Marseille, France},
author = {Dadas, Slawomir and
Pere{\\l}kiewicz, Micha{\\l} and
Po{\\'s}wiata, Rafa{\\l}},
booktitle = {Proceedings of the Twelfth Language Resources and Evaluation Conference},
editor = {Calzolari, Nicoletta and
B{\'e}chet, Fr{\'e}d{\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\\'e}l{\\`e}ne and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios},
isbn = {979-10-95546-34-4},
language = {English},
month = may,
pages = {1674--1680},
publisher = {European Language Resources Association},
title = {Evaluation of Sentence Representations in {P}olish},
url = {https://aclanthology.org/2020.lrec-1.207},
year = {2020},
}
EightTagsClustering.v2¶
Clustering of headlines from social media posts in Polish belonging to 8 categories: film, history, food, medicine, motorization, work, sport and technology.
Dataset: PL-MTEB/8tags-clustering
• License: gpl-3.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | pol | Social, Written | derived | found |
Citation
@inproceedings{dadas-etal-2020-evaluation,
abstract = {Methods for learning sentence representations have been actively developed in recent years. However, the lack of pre-trained models and datasets annotated at the sentence level has been a problem for low-resource languages such as Polish which led to less interest in applying these methods to language-specific tasks. In this study, we introduce two new Polish datasets for evaluating sentence embeddings and provide a comprehensive evaluation of eight sentence representation methods including Polish and multilingual models. We consider classic word embedding models, recently developed contextual embeddings and multilingual sentence encoders, showing strengths and weaknesses of specific approaches. We also examine different methods of aggregating word vectors into a single sentence vector.},
address = {Marseille, France},
author = {Dadas, Slawomir and
Pere{\\l}kiewicz, Micha{\\l} and
Po{\\'s}wiata, Rafa{\\l}},
booktitle = {Proceedings of the Twelfth Language Resources and Evaluation Conference},
editor = {Calzolari, Nicoletta and
B{\\'e}chet, Fr{\\'e}d{\\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\\'e}l{\\`e}ne and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios},
isbn = {979-10-95546-34-4},
language = {English},
month = may,
pages = {1674--1680},
publisher = {European Language Resources Association},
title = {Evaluation of Sentence Representations in {P}olish},
url = {https://aclanthology.org/2020.lrec-1.207},
year = {2020},
}
GeoreviewClusteringP2P¶
Review clustering based on Yandex Georeview dataset
Dataset: ai-forever/georeview-clustering-p2p
• License: mit • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | rus | Reviews, Written | derived | found |
HALClusteringS2S¶
Clustering of titles from HAL (https://huggingface.co/datasets/lyon-nlp/clustering-hal-s2s)
Dataset: lyon-nlp/clustering-hal-s2s
• License: apache-2.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fra | Academic, Written | human-annotated | found |
Citation
@misc{ciancone2024extending,
archiveprefix = {arXiv},
author = {Mathieu Ciancone and Imene Kerboua and Marion Schaeffer and Wissam Siblini},
eprint = {2405.20468},
primaryclass = {cs.CL},
title = {Extending the Massive Text Embedding Benchmark to French},
year = {2024},
}
HALClusteringS2S.v2¶
Clustering of titles from HAL (https://huggingface.co/datasets/lyon-nlp/clustering-hal-s2s)
Dataset: lyon-nlp/clustering-hal-s2s
• License: apache-2.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fra | Academic, Written | human-annotated | found |
Citation
@misc{ciancone2024extending,
archiveprefix = {arXiv},
author = {Mathieu Ciancone and Imene Kerboua and Marion Schaeffer and Wissam Siblini},
eprint = {2405.20468},
primaryclass = {cs.CL},
title = {Extending the Massive Text Embedding Benchmark to French},
year = {2024},
}
HamshahriClustring¶
These datasets have been extracted from the RSS feed of two Farsi news agency websites.
Dataset: community-datasets/farsi_news
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fas | News | derived | found |
Citation
IndicReviewsClusteringP2P¶
Clustering of reviews from IndicSentiment dataset. Clustering of 14 sets on the generic categories label.
Dataset: mteb/IndicReviewsClusteringP2P
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | asm, ben, brx, guj, hin, ... (13) | Reviews, Written | human-annotated | machine-translated and verified |
Citation
@article{doddapaneni2022towards,
author = {Sumanth Doddapaneni and Rahul Aralikatte and Gowtham Ramesh and Shreyansh Goyal and Mitesh M. Khapra and Anoop Kunchukuttan and Pratyush Kumar},
doi = {10.18653/v1/2023.acl-long.693},
journal = {Annual Meeting of the Association for Computational Linguistics},
title = {Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages},
year = {2022},
}
KlueMrcDomainClustering¶
this dataset is a processed and redistributed version of the KLUE-MRC dataset. Domain: Game / Media / Automotive / Finance / Real Estate / Education
Dataset: on-and-on/clustering_klue_mrc_context_domain
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | kor | News, Written | human-annotated | found |
Citation
@misc{park2021klue,
archiveprefix = {arXiv},
author = {Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
eprint = {2105.09680},
primaryclass = {cs.CL},
title = {KLUE: Korean Language Understanding Evaluation},
year = {2021},
}
KlueYnatMrcCategoryClustering¶
this dataset is a processed and redistributed version of the KLUE-Ynat & KLUE-MRC dataset. News_category: IT/Science, Sports, Media/Culture, Ecomomy/Finance, Real Estate
Dataset: on-and-on/clustering_klue_mrc_ynat_title
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to text (t2t) | v_measure | kor | News, Written | human-annotated | found |
Citation
@misc{park2021klue,
archiveprefix = {arXiv},
author = {Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
eprint = {2105.09680},
primaryclass = {cs.CL},
title = {KLUE: Korean Language Understanding Evaluation},
year = {2021},
}
LivedoorNewsClustering¶
Clustering of the news reports of a Japanese news site, Livedoor News by RONDHUIT Co, Ltd. in 2012. It contains over 7,000 news report texts across 9 categories (topics).
Dataset: sbintuitions/JMTEB
• License: cc-by-nd-2.1-jp • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | jpn | News, Written | derived | found |
LivedoorNewsClustering.v2¶
Clustering of the news reports of a Japanese news site, Livedoor News by RONDHUIT Co, Ltd. in 2012. It contains over 7,000 news report texts across 9 categories (topics). Version 2 updated on LivedoorNewsClustering by removing pairs where one of entries contain an empty sentences.
Dataset: sbintuitions/JMTEB
• License: cc-by-nd-2.1-jp • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | jpn | News, Written | derived | found |
MLSUMClusteringP2P¶
Clustering of newspaper article contents and titles from MLSUM dataset. Clustering of 10 sets on the newpaper article topics.
Dataset: mteb/mlsum
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu, fra, rus, spa | News, Written | derived | found |
Citation
@article{scialom2020mlsum,
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
journal = {arXiv preprint arXiv:2004.14900},
title = {MLSUM: The Multilingual Summarization Corpus},
year = {2020},
}
MLSUMClusteringP2P.v2¶
Clustering of newspaper article contents and titles from MLSUM dataset. Clustering of 10 sets on the newpaper article topics.
Dataset: mteb/mlsum
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu, fra, rus, spa | News, Written | derived | found |
Citation
@article{scialom2020mlsum,
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
journal = {arXiv preprint arXiv:2004.14900},
title = {MLSUM: The Multilingual Summarization Corpus},
year = {2020},
}
MLSUMClusteringS2S¶
Clustering of newspaper article contents and titles from MLSUM dataset. Clustering of 10 sets on the newpaper article topics.
Dataset: mteb/mlsum
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu, fra, rus, spa | News, Written | derived | found |
Citation
@article{scialom2020mlsum,
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
journal = {arXiv preprint arXiv:2004.14900},
title = {MLSUM: The Multilingual Summarization Corpus},
year = {2020},
}
MLSUMClusteringS2S.v2¶
Clustering of newspaper article contents and titles from MLSUM dataset. Clustering of 10 sets on the newpaper article topics.
Dataset: mteb/mlsum
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu, fra, rus, spa | News, Written | derived | found |
Citation
@article{scialom2020mlsum,
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
journal = {arXiv preprint arXiv:2004.14900},
title = {MLSUM: The Multilingual Summarization Corpus},
year = {2020},
}
MasakhaNEWSClusteringP2P¶
Clustering of news article headlines and texts from MasakhaNEWS dataset. Clustering of 10 sets on the news article label.
Dataset: masakhane/masakhanews
• License: afl-3.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | amh, eng, fra, hau, ibo, ... (16) | News, Non-fiction, Written | derived | found |
Citation
@article{adelani2023masakhanews,
author = {David Ifeoluwa Adelani and Marek Masiak and Israel Abebe Azime and Jesujoba Oluwadara Alabi and Atnafu Lambebo Tonja and Christine Mwase and Odunayo Ogundepo and Bonaventure F. P. Dossou and Akintunde Oladipo and Doreen Nixdorf and Chris Chinenye Emezue and Sana Sabah al-azzawi and Blessing K. Sibanda and Davis David and Lolwethu Ndolela and Jonathan Mukiibi and Tunde Oluwaseyi Ajayi and Tatiana Moteu Ngoli and Brian Odhiambo and Abraham Toluwase Owodunni and Nnaemeka C. Obiefuna and Shamsuddeen Hassan Muhammad and Saheed Salahudeen Abdullahi and Mesay Gemeda Yigezu and Tajuddeen Gwadabe and Idris Abdulmumin and Mahlet Taye Bame and Oluwabusayo Olufunke Awoyomi and Iyanuoluwa Shode and Tolulope Anu Adelani and Habiba Abdulganiy Kailani and Abdul-Hakeem Omotayo and Adetola Adeeko and Afolabi Abeeb and Anuoluwapo Aremu and Olanrewaju Samuel and Clemencia Siro and Wangari Kimotho and Onyekachi Raphael Ogbu and Chinedu E. Mbonu and Chiamaka I. Chukwuneke and Samuel Fanijo and Jessica Ojo and Oyinkansola F. Awosan and Tadesse Kebede Guge and Sakayo Toadoum Sari and Pamela Nyatsine and Freedmore Sidume and Oreen Yousuf and Mardiyyah Oduwole and Ussen Kimanuka and Kanda Patrick Tshinu and Thina Diko and Siyanda Nxakama and Abdulmejid Tuni Johar and Sinodos Gebre and Muhidin Mohamed and Shafie Abdi Mohamed and Fuad Mire Hassan and Moges Ahmed Mehamed and Evrard Ngabire and and Pontus Stenetorp},
journal = {ArXiv},
title = {MasakhaNEWS: News Topic Classification for African languages},
volume = {},
year = {2023},
}
MasakhaNEWSClusteringS2S¶
Clustering of news article headlines from MasakhaNEWS dataset. Clustering of 10 sets on the news article label.
Dataset: masakhane/masakhanews
• License: afl-3.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | amh, eng, fra, hau, ibo, ... (16) | News, Written | human-annotated | not specified |
Citation
@article{adelani2023masakhanews,
author = {David Ifeoluwa Adelani and Marek Masiak and Israel Abebe Azime and Jesujoba Oluwadara Alabi and Atnafu Lambebo Tonja and Christine Mwase and Odunayo Ogundepo and Bonaventure F. P. Dossou and Akintunde Oladipo and Doreen Nixdorf and Chris Chinenye Emezue and Sana Sabah al-azzawi and Blessing K. Sibanda and Davis David and Lolwethu Ndolela and Jonathan Mukiibi and Tunde Oluwaseyi Ajayi and Tatiana Moteu Ngoli and Brian Odhiambo and Abraham Toluwase Owodunni and Nnaemeka C. Obiefuna and Shamsuddeen Hassan Muhammad and Saheed Salahudeen Abdullahi and Mesay Gemeda Yigezu and Tajuddeen Gwadabe and Idris Abdulmumin and Mahlet Taye Bame and Oluwabusayo Olufunke Awoyomi and Iyanuoluwa Shode and Tolulope Anu Adelani and Habiba Abdulganiy Kailani and Abdul-Hakeem Omotayo and Adetola Adeeko and Afolabi Abeeb and Anuoluwapo Aremu and Olanrewaju Samuel and Clemencia Siro and Wangari Kimotho and Onyekachi Raphael Ogbu and Chinedu E. Mbonu and Chiamaka I. Chukwuneke and Samuel Fanijo and Jessica Ojo and Oyinkansola F. Awosan and Tadesse Kebede Guge and Sakayo Toadoum Sari and Pamela Nyatsine and Freedmore Sidume and Oreen Yousuf and Mardiyyah Oduwole and Ussen Kimanuka and Kanda Patrick Tshinu and Thina Diko and Siyanda Nxakama and Abdulmejid Tuni Johar and Sinodos Gebre and Muhidin Mohamed and Shafie Abdi Mohamed and Fuad Mire Hassan and Moges Ahmed Mehamed and Evrard Ngabire and and Pontus Stenetorp},
journal = {ArXiv},
title = {MasakhaNEWS: News Topic Classification for African languages},
volume = {},
year = {2023},
}
MedrxivClusteringP2P¶
Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category.
Dataset: mteb/medrxiv-clustering-p2p
• License: https://www.medrxiv.org/content/about-medrxiv • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Written | derived | created |
MedrxivClusteringP2P.v2¶
Clustering of titles+abstract from medrxiv across 51 categories.
Dataset: mteb/medrxiv-clustering-p2p
• License: https://www.medrxiv.org/content/about-medrxiv • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Medical, Written | derived | created |
MedrxivClusteringS2S¶
Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category.
Dataset: mteb/medrxiv-clustering-s2s
• License: https://www.medrxiv.org/content/about-medrxiv • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Medical, Written | derived | created |
MedrxivClusteringS2S.v2¶
Clustering of titles from medrxiv across 51 categories.
Dataset: mteb/medrxiv-clustering-s2s
• License: https://www.medrxiv.org/content/about-medrxiv • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Academic, Medical, Written | derived | created |
MewsC16JaClustering¶
MewsC-16 (Multilingual Short Text Clustering Dataset for News in 16 languages) is constructed from Wikinews. This dataset is the Japanese split of MewsC-16, containing topic sentences from Wikinews articles in 12 categories. More detailed information is available in the Appendix E of the citation.
Dataset: sbintuitions/JMTEB
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | jpn | News, Written | derived | found |
Citation
@inproceedings{nishikawa-etal-2022-ease,
abstract = {We present EASE, a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities.The advantage of using entity supervision is twofold: (1) entities have been shown to be a strong indicator of text semantics and thus should provide rich training signals for sentence embeddings; (2) entities are defined independently of languages and thus offer useful cross-lingual alignment supervision.We evaluate EASE against other unsupervised models both in monolingual and multilingual settings.We show that EASE exhibits competitive or better performance in English semantic textual similarity (STS) and short text clustering (STC) tasks and it significantly outperforms baseline methods in multilingual settings on a variety of tasks.Our source code, pre-trained models, and newly constructed multi-lingual STC dataset are available at https://github.com/studio-ousia/ease.},
address = {Seattle, United States},
author = {Nishikawa, Sosuke and
Ri, Ryokan and
Yamada, Ikuya and
Tsuruoka, Yoshimasa and
Echizen, Isao},
booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = jul,
pages = {3870--3885},
publisher = {Association for Computational Linguistics},
title = {{EASE}: Entity-Aware Contrastive Learning of Sentence Embedding},
url = {https://aclanthology.org/2022.naacl-main.284},
year = {2022},
}
NLPTwitterAnalysisClustering¶
Clustering of tweets from twitter across 26 categories.
Dataset: hamedhf/nlp_twitter_analysis
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fas | Social | derived | found |
Citation
PlscClusteringP2P¶
Clustering of Polish article titles+abstracts from Library of Science (https://bibliotekanauki.pl/), either on the scientific field or discipline.
Dataset: PL-MTEB/plsc-clustering-p2p
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | pol | Academic, Written | derived | found |
PlscClusteringP2P.v2¶
Clustering of Polish article titles+abstracts from Library of Science (https://bibliotekanauki.pl/), either on the scientific field or discipline.
Dataset: PL-MTEB/plsc-clustering-p2p
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | pol | Academic, Written | derived | found |
PlscClusteringS2S¶
Clustering of Polish article titles from Library of Science (https://bibliotekanauki.pl/), either on the scientific field or discipline.
Dataset: PL-MTEB/plsc-clustering-s2s
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | pol | Academic, Written | derived | found |
PlscClusteringS2S.v2¶
Clustering of Polish article titles from Library of Science (https://bibliotekanauki.pl/), either on the scientific field or discipline.
Dataset: PL-MTEB/plsc-clustering-s2s
• License: cc0-1.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | pol | Academic, Written | derived | found |
RedditClustering¶
Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.
Dataset: mteb/reddit-clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Social, Web, Written | derived | found |
Citation
@article{geigle:2021:arxiv,
archiveprefix = {arXiv},
author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
eprint = {2104.07081},
journal = {arXiv preprint},
title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
url = {http://arxiv.org/abs/2104.07081},
volume = {abs/2104.07081},
year = {2021},
}
RedditClustering-VN¶
A translated dataset from Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.
Dataset: GreenNode/reddit-clustering-vn
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | vie | Social, Web, Written | derived | machine-translated and LM verified |
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
archiveprefix = {arXiv},
author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
eprint = {2507.21500},
primaryclass = {cs.CL},
title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
url = {https://arxiv.org/abs/2507.21500},
year = {2025},
}
RedditClustering.v2¶
Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.
Dataset: mteb/reddit-clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Social, Web, Written | derived | found |
Citation
@article{geigle:2021:arxiv,
archiveprefix = {arXiv},
author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
eprint = {2104.07081},
journal = {arXiv preprint},
title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
url = {http://arxiv.org/abs/2104.07081},
volume = {abs/2104.07081},
year = {2021},
}
RedditClusteringP2P¶
Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs.
Dataset: mteb/reddit-clustering-p2p
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Social, Web, Written | derived | found |
Citation
@article{geigle:2021:arxiv,
archiveprefix = {arXiv},
author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
eprint = {2104.07081},
journal = {arXiv preprint},
title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
url = {http://arxiv.org/abs/2104.07081},
volume = {abs/2104.07081},
year = {2021},
}
RedditClusteringP2P-VN¶
A translated dataset from Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.
Dataset: GreenNode/reddit-clustering-p2p-vn
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | vie | Social, Web, Written | derived | machine-translated and LM verified |
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
archiveprefix = {arXiv},
author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
eprint = {2507.21500},
primaryclass = {cs.CL},
title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
url = {https://arxiv.org/abs/2507.21500},
year = {2025},
}
RedditClusteringP2P.v2¶
Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs.
Dataset: mteb/reddit-clustering-p2p
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Social, Web, Written | derived | found |
Citation
@article{geigle:2021:arxiv,
archiveprefix = {arXiv},
author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
eprint = {2104.07081},
journal = {arXiv preprint},
title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
url = {http://arxiv.org/abs/2104.07081},
volume = {abs/2104.07081},
year = {2021},
}
RomaniBibleClustering¶
Clustering verses from the Bible in Kalderash Romani by book.
Dataset: kardosdrur/romani-bible
• License: mit • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | rom | Religious, Written | derived | human-translated and localized |
RuSciBenchGRNTIClusteringP2P¶
Clustering of scientific papers (title+abstract) by rubric
Dataset: ai-forever/ru-scibench-grnti-classification
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | rus | Academic, Written | derived | found |
RuSciBenchOECDClusteringP2P¶
Clustering of scientific papers (title+abstract) by rubric
Dataset: ai-forever/ru-scibench-oecd-classification
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | rus | Academic, Written | derived | found |
SIB200ClusteringS2S¶
SIB-200 is the largest publicly available topic classification dataset based on Flores-200 covering 205 languages and dialects annotated. The dataset is annotated in English for the topics, science/technology, travel, politics, sports, health, entertainment, and geography. The labels are then transferred to the other languages in Flores-200 which are human-translated.
Dataset: mteb/sib200
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | ace, acm, acq, aeb, afr, ... (197) | News, Written | expert-annotated | human-translated and localized |
Citation
@article{adelani2023sib,
author = {Adelani, David Ifeoluwa and Liu, Hannah and Shen, Xiaoyu and Vassilyev, Nikita and Alabi, Jesujoba O and Mao, Yanke and Gao, Haonan and Lee, Annie En-Shiun},
journal = {arXiv preprint arXiv:2309.07445},
title = {SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects},
year = {2023},
}
SIDClustring¶
Clustering of summariesfrom SIDClustring across categories.
Dataset: MCINext/sid-clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | fas | Academic | derived | found |
Citation
SNLClustering¶
Webscrabed articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses articles categories as clusters.
Dataset: navjordj/SNL_summarization
• License: cc-by-nc-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | nob | Encyclopaedic, Non-fiction, Written | derived | found |
Citation
@mastersthesis{navjord2023beyond,
author = {Navjord, J{\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
school = {Norwegian University of Life Sciences, {\AA}s},
title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
year = {2023},
}
SNLHierarchicalClusteringP2P¶
Webscrabed articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses articles categories as clusters.
Dataset: mteb/SNLHierarchicalClusteringP2P
• License: cc-by-nc-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | nob | Encyclopaedic, Non-fiction, Written | derived | found |
Citation
@mastersthesis{navjord2023beyond,
author = {Navjord, J{\\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
school = {Norwegian University of Life Sciences, {\\AA}s},
title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
year = {2023},
}
SNLHierarchicalClusteringS2S¶
Webscrabed articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses articles categories as clusters.
Dataset: mteb/SNLHierarchicalClusteringS2S
• License: cc-by-nc-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | nob | Encyclopaedic, Non-fiction, Written | derived | found |
Citation
@mastersthesis{navjord2023beyond,
author = {Navjord, J{\\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
school = {Norwegian University of Life Sciences, {\\AA}s},
title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
year = {2023},
}
SpanishNewsClusteringP2P¶
Clustering of news articles, 7 topics in total.
Dataset: jinaai/spanish_news_clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | spa | not specified | not specified | not specified |
StackExchangeClustering¶
Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.
Dataset: mteb/stackexchange-clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Web, Written | derived | found |
Citation
@article{geigle:2021:arxiv,
archiveprefix = {arXiv},
author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
eprint = {2104.07081},
journal = {arXiv preprint},
title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
url = {http://arxiv.org/abs/2104.07081},
volume = {abs/2104.07081},
year = {2021},
}
StackExchangeClustering-VN¶
A translated dataset from Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.
Dataset: GreenNode/stackexchange-clustering-vn
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | vie | Web, Written | derived | machine-translated and LM verified |
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
archiveprefix = {arXiv},
author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
eprint = {2507.21500},
primaryclass = {cs.CL},
title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
url = {https://arxiv.org/abs/2507.21500},
year = {2025},
}
StackExchangeClustering.v2¶
Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.
Dataset: mteb/stackexchange-clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Web, Written | derived | found |
Citation
@article{geigle:2021:arxiv,
archiveprefix = {arXiv},
author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
eprint = {2104.07081},
journal = {arXiv preprint},
title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
url = {http://arxiv.org/abs/2104.07081},
volume = {abs/2104.07081},
year = {2021},
}
StackExchangeClusteringP2P¶
Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs.
Dataset: mteb/stackexchange-clustering-p2p
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Web, Written | derived | found |
Citation
@article{geigle:2021:arxiv,
archiveprefix = {arXiv},
author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
eprint = {2104.07081},
journal = {arXiv preprint},
title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
url = {http://arxiv.org/abs/2104.07081},
volume = {abs/2104.07081},
year = {2021},
}
StackExchangeClusteringP2P-VN¶
A translated Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.
Dataset: GreenNode/stackexchange-clustering-p2p-vn
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | vie | Web, Written | derived | machine-translated and LM verified |
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
archiveprefix = {arXiv},
author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
eprint = {2507.21500},
primaryclass = {cs.CL},
title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
url = {https://arxiv.org/abs/2507.21500},
year = {2025},
}
StackExchangeClusteringP2P.v2¶
Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs.
Dataset: mteb/stackexchange-clustering-p2p
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Web, Written | derived | found |
Citation
@article{geigle:2021:arxiv,
archiveprefix = {arXiv},
author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
eprint = {2104.07081},
journal = {arXiv preprint},
title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
url = {http://arxiv.org/abs/2104.07081},
volume = {abs/2104.07081},
year = {2021},
}
SwednClustering¶
The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000--2020. The articles are filtered to resemble the CNN/DailyMail dataset both regarding textual structure. This dataset uses the category labels as clusters.
Dataset: sbx/superlim-2
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | swe | News, Non-fiction, Written | derived | found |
Citation
@inproceedings{monsen2021method,
author = {Monsen, Julius and J{\"o}nsson, Arne},
booktitle = {Proceedings of CLARIN Annual Conference},
title = {A method for building non-english corpora for abstractive text summarization},
year = {2021},
}
SwednClusteringP2P¶
The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000--2020. The articles are filtered to resemble the CNN/DailyMail dataset both regarding textual structure. This dataset uses the category labels as clusters.
Dataset: sbx/superlim-2
• License: cc-by-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | swe | News, Non-fiction, Written | derived | found |
Citation
@inproceedings{monsen2021method,
author = {Monsen, Julius and J{\"o}nsson, Arne},
booktitle = {Proceedings of CLARIN Annual Conference},
title = {A method for building non-english corpora for abstractive text summarization},
year = {2021},
}
SwednClusteringS2S¶
The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000--2020. The articles are filtered to resemble the CNN/DailyMail dataset both regarding textual structure. This dataset uses the category labels as clusters.
Dataset: sbx/superlim-2
• License: cc-by-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | swe | News, Non-fiction, Written | derived | found |
Citation
@inproceedings{monsen2021method,
author = {Monsen, Julius and J{\"o}nsson, Arne},
booktitle = {Proceedings of CLARIN Annual Conference},
title = {A method for building non-english corpora for abstractive text summarization},
year = {2021},
}
TenKGnadClusteringP2P¶
Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category.
Dataset: slvnwhrl/tenkgnad-clustering-p2p
• License: cc-by-nc-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu | Web, Written | not specified | found |
TenKGnadClusteringP2P.v2¶
Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category.
Dataset: slvnwhrl/tenkgnad-clustering-p2p
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu | News, Non-fiction, Written | derived | found |
TenKGnadClusteringS2S¶
Clustering of news article titles. Clustering of 10 splits on the news article category.
Dataset: slvnwhrl/tenkgnad-clustering-s2s
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu | News, Non-fiction, Written | not specified | not specified |
TenKGnadClusteringS2S.v2¶
Clustering of news article titles. Clustering of 10 splits on the news article category.
Dataset: slvnwhrl/tenkgnad-clustering-s2s
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | deu | News, Non-fiction, Written | derived | found |
ThuNewsClusteringP2P¶
Clustering of titles + abstracts from the THUCNews dataset
Dataset: C-MTEB/ThuNewsClusteringP2P
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | cmn | not specified | not specified | not specified |
Citation
@inproceedings{eisner2007proceedings,
author = {Eisner, Jason},
booktitle = {Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)},
title = {Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)},
year = {2007},
}
@inproceedings{li2006comparison,
author = {Li, Jingyang and Sun, Maosong and Zhang, Xian},
booktitle = {proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics},
pages = {545--552},
title = {A comparison and semi-quantitative analysis of words and character-bigrams as features in chinese text categorization},
year = {2006},
}
ThuNewsClusteringP2P.v2¶
Clustering of titles + abstracts from the THUCNews dataset
Dataset: C-MTEB/ThuNewsClusteringP2P
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | cmn | News, Written | derived | found |
Citation
@software{THUCTC,
author = {Sun, M. and Li, J. and Guo, Z. and Yu, Z. and Zheng, Y. and Si, X. and Liu, Z.},
note = {THU Chinese Text Classification Toolkit},
publisher = {THU Natural Language Processing Lab},
title = {THUCTC: An Efficient Chinese Text Classifier},
url = {https://github.com/thunlp/THUCTC},
year = {2016},
}
ThuNewsClusteringS2S¶
Clustering of titles from the THUCNews dataset
Dataset: C-MTEB/ThuNewsClusteringS2S
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | cmn | not specified | not specified | not specified |
Citation
@inproceedings{eisner2007proceedings,
author = {Eisner, Jason},
booktitle = {Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)},
title = {Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)},
year = {2007},
}
@inproceedings{li2006comparison,
author = {Li, Jingyang and Sun, Maosong and Zhang, Xian},
booktitle = {proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics},
pages = {545--552},
title = {A comparison and semi-quantitative analysis of words and character-bigrams as features in chinese text categorization},
year = {2006},
}
ThuNewsClusteringS2S.v2¶
Clustering of titles from the THUCNews dataset
Dataset: C-MTEB/ThuNewsClusteringS2S
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | cmn | News, Written | derived | found |
Citation
@software{THUCTC,
author = {Sun, M. and Li, J. and Guo, Z. and Yu, Z. and Zheng, Y. and Si, X. and Liu, Z.},
note = {THU Chinese Text Classification Toolkit},
publisher = {THU Natural Language Processing Lab},
title = {THUCTC: An Efficient Chinese Text Classifier},
url = {https://github.com/thunlp/THUCTC},
year = {2016},
}
TwentyNewsgroupsClustering¶
Clustering of the 20 Newsgroups dataset (subject only).
Dataset: mteb/twentynewsgroups-clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | News, Written | derived | found |
Citation
@incollection{LANG1995331,
address = {San Francisco (CA)},
author = {Ken Lang},
booktitle = {Machine Learning Proceedings 1995},
doi = {https://doi.org/10.1016/B978-1-55860-377-6.50048-7},
editor = {Armand Prieditis and Stuart Russell},
isbn = {978-1-55860-377-6},
pages = {331-339},
publisher = {Morgan Kaufmann},
title = {NewsWeeder: Learning to Filter Netnews},
url = {https://www.sciencedirect.com/science/article/pii/B9781558603776500487},
year = {1995},
}
TwentyNewsgroupsClustering-VN¶
A translated dataset from Clustering of the 20 Newsgroups dataset (subject only). The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.
Dataset: GreenNode/twentynewsgroups-clustering-vn
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | vie | News, Written | derived | machine-translated and LM verified |
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
archiveprefix = {arXiv},
author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
eprint = {2507.21500},
primaryclass = {cs.CL},
title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
url = {https://arxiv.org/abs/2507.21500},
year = {2025},
}
TwentyNewsgroupsClustering.v2¶
Clustering of the 20 Newsgroups dataset (subject only).
Dataset: mteb/twentynewsgroups-clustering
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | News, Written | derived | found |
Citation
@incollection{LANG1995331,
address = {San Francisco (CA)},
author = {Ken Lang},
booktitle = {Machine Learning Proceedings 1995},
doi = {https://doi.org/10.1016/B978-1-55860-377-6.50048-7},
editor = {Armand Prieditis and Stuart Russell},
isbn = {978-1-55860-377-6},
pages = {331-339},
publisher = {Morgan Kaufmann},
title = {NewsWeeder: Learning to Filter Netnews},
url = {https://www.sciencedirect.com/science/article/pii/B9781558603776500487},
year = {1995},
}
VGClustering¶
Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus.
Dataset: navjordj/VG_summarization
• License: not specified • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | nob | News, Non-fiction, Written | derived | found |
Citation
@mastersthesis{navjord2023beyond,
author = {Navjord, J{\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
school = {Norwegian University of Life Sciences, {\AA}s},
title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
year = {2023},
}
VGHierarchicalClusteringP2P¶
Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus.
Dataset: navjordj/VG_summarization
• License: cc-by-nc-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | nob | News, Non-fiction, Written | derived | found |
Citation
@mastersthesis{navjord2023beyond,
author = {Navjord, J{\\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
school = {Norwegian University of Life Sciences, {\\AA}s},
title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
year = {2023},
}
VGHierarchicalClusteringS2S¶
Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus.
Dataset: navjordj/VG_summarization
• License: cc-by-nc-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | nob | News, Non-fiction, Written | derived | found |
Citation
@mastersthesis{navjord2023beyond,
author = {Navjord, J{\\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
school = {Norwegian University of Life Sciences, {\\AA}s},
title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
year = {2023},
}
WikiCitiesClustering¶
Clustering of Wikipedia articles of cities by country from https://huggingface.co/datasets/wikipedia. Test set includes 126 countries, and a total of 3531 cities.
Dataset: jinaai/cities_wiki_clustering
• License: cc-by-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Encyclopaedic, Written | derived | found |
Citation
@online{wikidump,
author = {Wikimedia Foundation},
title = {Wikimedia Downloads},
url = {https://dumps.wikimedia.org},
}
WikiClusteringP2P¶
Clustering of wikipedia articles inspired by BlubrbsClusteringP2P. Labels are taken from top-level categories of the respective languages (e.g., https://lv.wikipedia.org/wiki/Kategorija:Pamatkategorijas).
Dataset: ryzzlestrizzle/multi-wiki-clustering-p2p
• License: cc-by-sa-3.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | bos, cat, ces, dan, eus, ... (14) | Encyclopaedic, Written | derived | created |
WikiClusteringP2P.v2¶
Clustering of wikipedia articles inspired by BlubrbsClusteringP2P. Labels are taken from top-level categories of the respective languages (e.g., https://lv.wikipedia.org/wiki/Kategorija:Pamatkategorijas).
Dataset: ryzzlestrizzle/multi-wiki-clustering-p2p
• License: cc-by-sa-3.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | bos, cat, ces, dan, eus, ... (14) | Encyclopaedic, Written | derived | created |
WikipediaChemistryTopicsClustering¶
ChemTEB evaluates the performance of text embedding models on chemical domain data.
Dataset: BASF-AI/WikipediaEasy10Clustering
• License: cc-by-nc-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Chemistry | derived | created |
Citation
@article{kasmaee2024chemteb,
author = {Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
journal = {arXiv preprint arXiv:2412.00532},
title = {ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
year = {2024},
}
WikipediaSpecialtiesInChemistryClustering¶
ChemTEB evaluates the performance of text embedding models on chemical domain data.
Dataset: BASF-AI/WikipediaMedium5Clustering
• License: cc-by-nc-sa-4.0 • Learn more →
Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
---|---|---|---|---|---|
text to category (t2c) | v_measure | eng | Chemistry | derived | created |
Citation
@article{kasmaee2024chemteb,
author = {Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
journal = {arXiv preprint arXiv:2412.00532},
title = {ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
year = {2024},
}