MultilabelClassification
- Number of tasks: 11
BrazilianToxicTweetsClassification
ToLD-Br is the biggest dataset for toxic tweets in Brazilian Portuguese, crowdsourced by 42 annotators selected from a pool of 129 volunteers. Annotators were selected aiming to create a plural group in terms of demographics (ethnicity, sexual orientation, age, gender). Each tweet was labeled by three annotators in 6 possible categories: LGBTQ+phobia, Xenophobia, Obscene, Insult, Misogyny and Racism.
Dataset: mteb/BrazilianToxicTweetsClassification • License: cc-by-sa-4.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | accuracy | por | Constructed, Written | expert-annotated | found |
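Because a tweet can carry several of the six categories at once, multi-label targets are usually stored as one 0/1 flag per category. The following is a minimal sketch of that encoding, not the dataset's actual loader: the category order in `CATEGORIES` and the helper name `to_multi_hot` are assumptions made for illustration.

```python
# Category names come from the task description; their ORDER here is an
# assumption for illustration and may differ from the dataset's own columns.
CATEGORIES = ["LGBTQ+phobia", "Xenophobia", "Obscene", "Insult", "Misogyny", "Racism"]


def to_multi_hot(labels, categories=CATEGORIES):
    """Return a 0/1 list with a 1 at the position of every label present."""
    present = set(labels)
    unknown = present - set(categories)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if category in present else 0 for category in categories]


# A tweet flagged as both obscene and insulting.
print(to_multi_hot(["Insult", "Obscene"]))  # [0, 0, 1, 1, 0, 0]

# A tweet with no toxic category maps to the all-zero vector.
print(to_multi_hot([]))  # [0, 0, 0, 0, 0, 0]
```

An all-zero vector is a valid target in this representation, which is one way multi-label classification differs from single-label classification.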
Citation
@article{DBLP:journals/corr/abs-2010-04543,
author = {Joao Augusto Leite and
Diego F. Silva and
Kalina Bontcheva and
Carolina Scarton},
eprint = {2010.04543},
eprinttype = {arXiv},
journal = {CoRR},
timestamp = {Tue, 15 Dec 2020 16:10:16 +0100},
title = {Toxic Language Detection in Social Media for Brazilian Portuguese:
New Dataset and Multilingual Analysis},
url = {https://arxiv.org/abs/2010.04543},
volume = {abs/2010.04543},
year = {2020},
}
CEDRClassification
Classification of sentences by emotions, labeled into 5 categories (joy, sadness, surprise, fear, and anger).
Dataset: ai-forever/cedr-classification • License: apache-2.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | accuracy | rus | Blog, Social, Web, Written | human-annotated | found |
Citation
@article{sboev2021data,
author = {Sboev, Alexander and Naumov, Aleksandr and Rybka, Roman},
journal = {Procedia Computer Science},
pages = {637--642},
publisher = {Elsevier},
title = {Data-Driven Model for Emotion Detection in Russian Texts},
volume = {190},
year = {2021},
}
CovidDisinformationNLMultiLabelClassification
The dataset is curated to address questions of interest to journalists, fact-checkers, social media platforms, policymakers, and the general public.
Dataset: clips/mteb-nl-COVID-19-disinformation • License: cc-by-4.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | f1 | nld | Social, Web, Written | human-annotated | found |
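The Score column above lists f1, whereas most other tasks on this page list accuracy. The page does not define either metric, so the sketch below rests on two assumptions: that accuracy means exact-match (subset) accuracy over the whole label vector, and that f1 is macro-averaged over labels. The function names and toy data are hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label vector matches exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1; a label with no TP, FP or FN scores 0."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy multi-hot targets and predictions (4 samples, 3 labels), invented here.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

print(subset_accuracy(y_true, y_pred))     # 0.5
print(round(macro_f1(y_true, y_pred), 3))  # 0.822
```

Subset accuracy gives no credit for a partially correct label set, while macro F1 rewards each label independently, so the two scores can rank models differently on the same predictions.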
Citation
@inproceedings{alam-etal-2021-fighting-covid,
address = {Punta Cana, Dominican Republic},
author = {Alam, Firoj and
Shaar, Shaden and
Dalvi, Fahim and
Sajjad, Hassan and
Nikolov, Alex and
Mubarak, Hamdy and
Da San Martino, Giovanni and
Abdelali, Ahmed and
Durrani, Nadir and
Darwish, Kareem and
Al-Homaid, Abdulaziz and
Zaghouani, Wajdi and
Caselli, Tommaso and
Danoe, Gijs and
Stolk, Friso and
Bruntink, Britt and
Nakov, Preslav},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2021},
doi = {10.18653/v1/2021.findings-emnlp.56},
editor = {Moens, Marie-Francine and
Huang, Xuanjing and
Specia, Lucia and
Yih, Scott Wen-tau},
month = nov,
pages = {611--649},
publisher = {Association for Computational Linguistics},
title = {Fighting the {COVID}-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society},
url = {https://aclanthology.org/2021.findings-emnlp.56/},
year = {2021},
}
EmitClassification
The EMit dataset is a comprehensive resource for the detection of emotions in Italian social media texts. The EMit dataset consists of social media messages about TV shows, TV series, music videos, and advertisements. Each message is annotated with one or more of the 8 primary emotions defined by Plutchik (anger, anticipation, disgust, fear, joy, sadness, surprise, trust), as well as an additional label “love.”
Dataset: MattiaSangermano/emit • License: cc-by-sa-4.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | accuracy | ita | Social, Written | expert-annotated | found |
Citation
@inproceedings{araque2023emit,
author = {Araque, O and Frenda, S and Sprugnoli, R and Nozza, D and Patti, V and others},
booktitle = {CEUR WORKSHOP PROCEEDINGS},
organization = {CEUR-WS},
pages = {1--8},
title = {EMit at EVALITA 2023: Overview of the Categorical Emotion Detection in Italian Social Media Task},
volume = {3473},
year = {2023},
}
KorHateSpeechMLClassification
The Korean Multi-label Hate Speech Dataset, K-MHaS, consists of 109,692 utterances from Korean online news comments, labelled with 8 fine-grained hate speech classes (Politics, Origin, Physical, Age, Gender, Religion, Race, Profanity) or a Not Hate Speech class. Each utterance carries from one to four labels, which lets the annotation scheme handle Korean language patterns effectively. For more details, refer to the K-MHaS paper, published at COLING 2022. The dataset is based on Korean online news comments available on Kaggle and GitHub. The unlabeled raw data was collected between January 2018 and June 2020; the language producers are users who commented on the Korean online news platform during that period.
Dataset: mteb/KorHateSpeechMLClassification • License: cc-by-sa-4.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | accuracy | kor | Social, Written | expert-annotated | found |
Citation
@inproceedings{lee-etal-2022-k,
address = {Gyeongju, Republic of Korea},
author = {Lee, Jean and
Lim, Taejun and
Lee, Heejun and
Jo, Bogeun and
Kim, Yangsok and
Yoon, Heegeun and
Han, Soyeon Caren},
booktitle = {Proceedings of the 29th International Conference on Computational Linguistics},
month = oct,
pages = {3530--3538},
publisher = {International Committee on Computational Linguistics},
title = {K-{MH}a{S}: A Multi-label Hate Speech Detection Dataset in {K}orean Online News Comment},
url = {https://aclanthology.org/2022.coling-1.311},
year = {2022},
}
MalteseNewsClassification
A multi-label topic classification dataset for Maltese News Articles. The data was collected from the press_mt subset from Korpus Malti v4.0. Article contents were cleaned to filter out JavaScript, CSS, & repeated non-Maltese sub-headings. The labels are based on the category field from this corpus.
Dataset: MLRS/maltese_news_categories • License: cc-by-nc-sa-4.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | accuracy | mlt | Constructed, Written | expert-annotated | found |
Citation
@inproceedings{maltese-news-datasets,
author = {Chaudhary, Amit Kumar and
Micallef, Kurt and
Borg, Claudia},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation},
month = may,
publisher = {Association for Computational Linguistics},
title = {Topic Classification and Headline Generation for {M}altese using a Public News Corpus},
year = {2024},
}
MultiEURLEXMultilabelClassification
EU laws in 23 EU languages containing annotated labels for 21 EUROVOC concepts.
Dataset: mteb/eurlex-multilingual • License: cc-by-sa-4.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | accuracy | bul, ces, dan, deu, ell, ... (23) | Government, Legal, Written | expert-annotated | found |
Citation
@inproceedings{chalkidis-etal-2021-multieurlex,
author = {Chalkidis, Ilias
and Fergadiotis, Manos
and Androutsopoulos, Ion},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing},
location = {Punta Cana, Dominican Republic},
publisher = {Association for Computational Linguistics},
title = {MultiEURLEX -- A multi-lingual and multi-label legal document
classification dataset for zero-shot cross-lingual transfer},
url = {https://arxiv.org/abs/2109.00904},
year = {2021},
}
SensitiveTopicsClassification
Multilabel classification of sentences across 18 sensitive topics.
Dataset: ai-forever/sensitive-topics-classification • License: cc-by-nc-sa-4.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | accuracy | rus | Social, Web, Written | human-annotated | found |
Citation
@inproceedings{babakov-etal-2021-detecting,
address = {Kiyv, Ukraine},
author = {Babakov, Nikolay and
Logacheva, Varvara and
Kozlova, Olga and
Semenov, Nikita and
Panchenko, Alexander},
booktitle = {Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing},
editor = {Babych, Bogdan and
Kanishcheva, Olga and
Nakov, Preslav and
Piskorski, Jakub and
Pivovarova, Lidia and
Starko, Vasyl and
Steinberger, Josef and
Yangarber, Roman and
Marci{\'n}czuk, Micha{\l} and
Pollak, Senja and
P{\v{r}}ib{\'a}{\v{n}}, Pavel and
Robnik-{\v{S}}ikonja, Marko},
month = apr,
pages = {26--36},
publisher = {Association for Computational Linguistics},
title = {Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company{'}s Reputation},
url = {https://aclanthology.org/2021.bsnlp-1.4},
year = {2021},
}
SwedishPatentCPCGroupClassification
This dataset contains historical Swedish patent documents (1885-1972) classified according to the Cooperative Patent Classification (CPC) system at the group level. Each document can have multiple labels, making this a challenging multi-label classification task with significant class imbalance and data sparsity characteristics. The dataset includes patent claims text extracted from digitally recreated versions of historical Swedish patents, generated using Optical Character Recognition (OCR) from original paper documents. The text quality varies due to OCR limitations, but all CPC labels were manually assigned by patent engineers at PRV (Swedish Patent and Registration Office), ensuring high reliability for machine learning applications.
Dataset: atheer2104/swedish-patent-cpc-group-new • License: mit
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to text (t2t) | accuracy | swe | Government, Legal | expert-annotated | found |
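The class imbalance and sparsity mentioned in the description can be quantified with a few simple statistics. The sketch below computes per-label frequencies, the mean number of labels per document, and a max/min imbalance ratio. The helper name `label_stats` and the toy label sets are invented for illustration and are not taken from the dataset.

```python
from collections import Counter


def label_stats(label_sets):
    """Return (per-label document counts, mean number of labels per document)."""
    counts = Counter(label for labels in label_sets for label in labels)
    cardinality = sum(len(labels) for labels in label_sets) / len(label_sets)
    return counts, cardinality


# Toy CPC-style group labels for four documents (invented, not real records).
docs = [
    ["A01B 1/00", "B62D 1/00"],
    ["A01B 1/00"],
    ["H04L 9/00"],
    ["A01B 1/00", "H04L 9/00", "B62D 1/00"],
]

counts, cardinality = label_stats(docs)
imbalance = max(counts.values()) / min(counts.values())

print(counts["A01B 1/00"])  # 3
print(cardinality)          # 1.75
print(imbalance)            # 1.5
```

On real data, a large imbalance ratio combined with many labels that occur only a handful of times is what makes exact-match scoring hard at the group level.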
Citation
@mastersthesis{Salim1987995,
author = {Salim, Atheer},
institution = {KTH, School of Electrical Engineering and Computer Science (EECS)},
keywords = {Multi-label Text Classification, Machine Learning, Patent Classification, Deep Learning, Natural Language Processing, Textklassificering med flera Klasser, Maskininlärning, Patentklassificering, Djupinlärning, Språkteknologi},
number = {2025:571},
pages = {70},
school = {KTH, School of Electrical Engineering and Computer Science (EECS)},
series = {TRITA-EECS-EX},
title = {Machine Learning for Classifying Historical Swedish Patents : A Comparison of Textual and Combined Data Approaches},
url = {https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-368254},
year = {2025},
}
SwedishPatentCPCSubclassClassification
This dataset contains historical Swedish patent documents (1885-1972) classified according to the Cooperative Patent Classification (CPC) system. Each document can have multiple labels, making this a multi-label classification task with significant implications for patent retrieval and prior art search. The dataset includes patent claims text extracted from digitally recreated versions of historical Swedish patents, generated using Optical Character Recognition (OCR) from original paper documents. The text quality varies due to OCR limitations, but all CPC labels were manually assigned by patent engineers at PRV (Swedish Patent and Registration Office), ensuring high reliability for machine learning applications.
Dataset: atheer2104/swedish-patent-cpc-subclass-new • License: mit
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to text (t2t) | accuracy | swe | Government, Legal | expert-annotated | found |
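The subclass level is coarser than the group level used by the preceding task. In the CPC scheme a full symbol such as `A01B 1/02` nests a section (`A`), a class (`A01`), a subclass (`A01B`) and a main group (`A01B 1/00`). The sketch below splits a symbol into those levels. The regular expression and the `parse_cpc` helper are illustrative assumptions, and the label strings stored in the dataset may be formatted differently.

```python
import re

# Section letter, two-digit class, subclass letter, main group, subgroup.
# This pattern is an assumption about the written form of a full CPC symbol.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol):
    """Split a full CPC symbol into its hierarchical levels."""
    match = CPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a full CPC symbol: {symbol!r}")
    section, cls, sub, main, _subgroup = match.groups()
    subclass = f"{section}{cls}{sub}"
    return {
        "section": section,
        "class": f"{section}{cls}",
        "subclass": subclass,
        "main_group": f"{subclass} {main}/00",
    }


levels = parse_cpc("A01B 1/02")
print(levels["subclass"])    # A01B
print(levels["main_group"])  # A01B 1/00
```

Collapsing symbols to the subclass level merges many group labels into one, which reduces the number of distinct classes a model has to separate.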
Citation
@mastersthesis{Salim1987995,
author = {Salim, Atheer},
institution = {KTH, School of Electrical Engineering and Computer Science (EECS)},
keywords = {Multi-label Text Classification, Machine Learning, Patent Classification, Deep Learning, Natural Language Processing, Textklassificering med flera Klasser, Maskininlärning, Patentklassificering, Djupinlärning, Språkteknologi},
number = {2025:571},
pages = {70},
school = {KTH, School of Electrical Engineering and Computer Science (EECS)},
series = {TRITA-EECS-EX},
title = {Machine Learning for Classifying Historical Swedish Patents : A Comparison of Textual and Combined Data Approaches},
url = {https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-368254},
year = {2025},
}
VABBMultiLabelClassification
This dataset contains the fourteenth edition of the Flemish Academic Bibliography for the Social Sciences and Humanities (VABB-SHW), a database of academic publications from the social sciences and humanities authored by researchers affiliated with Flemish universities. Publications in the database are used as one of the parameters of the Flemish performance-based research funding system.
Dataset: clips/mteb-nl-vabb-mlcls-pr • License: cc-by-4.0
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| text to category (t2c) | f1 | nld | Academic, Written | human-annotated | found |
Citation
@dataset{aspeslagh2024vabb,
author = {Aspeslagh, Pieter and Guns, Raf and Engels, Tim C. E.},
doi = {10.5281/zenodo.14214806},
publisher = {Zenodo},
title = {VABB-SHW: Dataset of Flemish Academic Bibliography for the Social Sciences and Humanities (edition 14)},
url = {https://doi.org/10.5281/zenodo.14214806},
year = {2024},
}