Adding a Dataset¶
To add a new dataset to MTEB, you need to do four things:
1) Implement a task with the desired dataset, by subclassing an abstract task
2) Add metadata to the task
3) Calculate statistics of the task (run task.calculate_descriptive_statistics())
4) Submit the edits to the MTEB repository
If you have any questions regarding this process, feel free to open a discussion thread.
Note
When we mention adding a dataset, we refer to adding a subclass of one of the abstract task classes (the AbsTask classes).
Creating a new subclass¶
A Simple Example¶
To add a new task, you need to implement a new class that inherits from the AbsTask associated with the task type (e.g. AbsTaskRetrieval for retrieval tasks). You can find the supported task types here.
SciDocs Reranking Task
import mteb
from mteb.abstasks.retrieval import AbsTaskRetrieval
from mteb.abstasks.task_metadata import TaskMetadata


class SciDocsReranking(AbsTaskRetrieval):
    metadata = TaskMetadata(
        name="SciDocsRR",
        description="Ranking of related scientific papers based on their title.",
        reference="https://allenai.org/data/scidocs",
        type="Reranking",
        category="t2t",
        modalities=["text"],
        eval_splits=["test"],
        eval_langs=["eng-Latn"],
        main_score="map",
        dataset={
            "path": "mteb/scidocs-reranking",
            "revision": "d3c5e1fc0b855ab6097bf1cda04dd73947d7caab",
        },
        date=("2000-01-01", "2020-12-31"),  # best guess
        domains=["Academic", "Non-fiction", "Written"],
        task_subtypes=["Scientific Reranking"],
        license="cc-by-4.0",
        annotations_creators="derived",
        dialect=[],
        sample_creation="found",
        bibtex_citation="""
@inproceedings{cohan-etal-2020-specter,
    title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers",
    author = "Cohan, Arman and
      Feldman, Sergey and
      Beltagy, Iz and
      Downey, Doug and
      Weld, Daniel",
    editor = "Jurafsky, Dan and
      Chai, Joyce and
      Schluter, Natalie and
      Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.207",
    doi = "10.18653/v1/2020.acl-main.207",
    pages = "2270--2282",
}
""",
    )


# testing the task with a model:
model = mteb.get_model("intfloat/multilingual-e5-small")
results = mteb.evaluate(model, tasks=[SciDocsReranking()])
Note
For multilingual/crosslingual tasks, make sure you've specified eval_langs as a dictionary, as shown in this example.
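As a minimal sketch of the dictionary form: each key names a subset of the dataset, and each value lists that subset's languages using the same code format as `eng-Latn` above. The subset names and the `check_eval_langs` helper are hypothetical and are not part of `mteb`; the helper only checks that the codes have the expected shape.

```python
import re

# Hypothetical subsets: keys are subset names, values are lists of language codes.
eval_langs = {
    "en": ["eng-Latn"],
    "de": ["deu-Latn"],
    "en-de": ["eng-Latn", "deu-Latn"],  # a crosslingual subset lists both languages
}

# Three lowercase letters, a hyphen, then a capitalised four-letter script name.
LANG_CODE = re.compile(r"^[a-z]{3}-[A-Z][a-z]{3}$")


def check_eval_langs(langs: dict[str, list[str]]) -> list[str]:
    """Return every code that does not look like `eng-Latn` (illustrative helper)."""
    return [
        code
        for codes in langs.values()
        for code in codes
        if not LANG_CODE.match(code)
    ]


print(check_eval_langs(eval_langs))
```

An empty list means every code has the expected shape. It does not confirm that the codes are ones MTEB recognises.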
A Detailed Example¶
Often the dataset from Hugging Face is not in the format expected by MTEB. To resolve this, you can either change the format on Hugging Face or add a dataset_transform method to your task to transform the dataset into the right format on the fly. Here is an example along with some design considerations:
DBpediaClassificationV2 Task
from datasets import load_dataset

from mteb.abstasks.classification import AbsTaskClassification
from mteb.abstasks.task_metadata import TaskMetadata


class DBpediaClassificationV2(AbsTaskClassification):
    metadata = TaskMetadata(
        ...  # fill in metadata as shown in the simple example above
    )

    def load_dataset(self):
        self.dataset = load_dataset(
            **self.metadata.dataset,
        )
        ...  # some processing
        self.data_loaded = True

    # dataset_transform will be called if `load_dataset` is not overridden
    def dataset_transform(self):
        self.dataset = self.stratified_subsampling(
            self.dataset, seed=self.seed, splits=["train", "test"]
        )
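To illustrate what a stratified subsampling step does conceptually, here is a standard-library sketch. It is not MTEB's `stratified_subsampling` implementation: the function name `stratified_subsample`, its parameters, and the toy rows are assumptions made for illustration. Each label keeps roughly its original share of a fixed example budget.

```python
import random
from collections import defaultdict


def stratified_subsample(rows, label_key="label", n_samples=2048, seed=42):
    """Sample about n_samples rows while keeping label proportions roughly intact."""
    if len(rows) <= n_samples:
        return list(rows)  # already small enough, nothing to do

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sampled = []
    for label in sorted(by_label, key=str):
        group = by_label[label]
        # Each label gets its proportional share of the budget, at least one row.
        k = max(1, round(n_samples * len(group) / len(rows)))
        sampled.extend(rng.sample(group, min(k, len(group))))

    rng.shuffle(sampled)
    return sampled


# Toy data: 10,000 rows spread evenly over 4 labels.
rows = [{"text": f"doc {i}", "label": i % 4} for i in range(10_000)]
subset = stratified_subsample(rows, n_samples=2048)
print(len(subset))  # 2048
```

Because of rounding, the result can differ from `n_samples` by a few rows when label counts do not divide evenly.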
Creating the metadata object¶
Along with the task, MTEB requires metadata describing it. If a piece of metadata isn't available, please provide your best guess or leave the field as None.
To get an overview of the fields in the metadata object, you can look at the TaskMetadata class.
Note
These fields can be left blank if the information is not available and can be extended if necessary. We do not include datasets that were machine-translated without verification in the benchmark.
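As a small illustration of "best guess or None", the sketch below uses a plain dictionary as a stand-in for a few descriptive fields taken from the simple example. It does not construct a real TaskMetadata object, and the `missing_fields` helper is hypothetical. It lists which fields are still unset so they can be revisited before submission.

```python
# Plain dict standing in for a few descriptive metadata fields (not a TaskMetadata).
descriptive_fields = {
    "date": ("2000-01-01", "2020-12-31"),  # best guess
    "domains": ["Academic"],
    "license": None,  # unknown, left as None
    "annotations_creators": "derived",
    "sample_creation": None,  # unknown, left as None
}


def missing_fields(fields):
    """Return the names of fields that are still None, in sorted order."""
    return sorted(name for name, value in fields.items() if value is None)


print(missing_fields(descriptive_fields))  # ['license', 'sample_creation']
```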
Submit a PR¶
Once you are finished, create a PR to the MTEB repository. If you haven't created a PR before, please refer to the GitHub documentation.
The PR will be reviewed by one of the organizers or contributors, who might ask you to change things. Once the PR is approved, the dataset will be added to the main repository.
Before submitting, please complete the following checklist:
- [ ] I have outlined why this dataset is filling an existing gap in `mteb`
- [ ] I have tested that the dataset runs with the `mteb` package.
- [ ] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
    - [ ] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
    - [ ] `intfloat/multilingual-e5-small`
- [ ] I have checked that the performance is neither trivial (both models achieve close to perfect scores) nor random (both models achieve close to random scores).
- [ ] I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
An easy way to test the task is:
import mteb
# sample model:
model = mteb.get_model("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
task = mteb.get_task("{name of your task}")
results = mteb.evaluate(model, task)
Or, from the command line:
mteb run -m sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 -t {name of your task}
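The checklist item about trivial or random performance can be made concrete with a small helper. This is only a sketch: the `check_scores` function, the 0.05 tolerance, the assumption that main scores lie between 0 and 1, and the scores themselves are all made up for illustration and are not part of `mteb` or real results.

```python
def check_scores(scores, random_baseline, tolerance=0.05):
    """Return 'trivial', 'random', or 'ok' for a {model_name: main_score} mapping."""
    values = list(scores.values())
    # Trivial: every model is within `tolerance` of a perfect score.
    if all(v >= 1.0 - tolerance for v in values):
        return "trivial"
    # Random: every model is within `tolerance` of the chance-level score.
    if all(abs(v - random_baseline) <= tolerance for v in values):
        return "random"
    return "ok"


# Made-up scores for illustration only.
scores = {
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": 0.71,
    "intfloat/multilingual-e5-small": 0.78,
}
# random_baseline=0.25 assumes, for example, a balanced 4-class accuracy task.
print(check_scores(scores, random_baseline=0.25))  # ok
```

The appropriate chance-level baseline depends on the task type and metric, so it is passed in explicitly instead of being guessed.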