Selecting Tasks or Benchmarks¶
This section describes how to select benchmarks and tasks to evaluate, including selecting specific subsets or splits to run.
Selecting a Benchmark¶
mteb comes with a set of predefined benchmarks. These can be fetched using get_benchmark or get_benchmarks and run in a similar fashion to other sets of tasks.
For instance, to select the English benchmark that forms the English leaderboard:
import mteb
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
model = ...
results = mteb.evaluate(model, tasks=benchmark)
The benchmark specifies not only a list of tasks, but also which splits and languages to run on.
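To see what a fetched benchmark will run, you can inspect the object before evaluating. The snippet below is a minimal sketch; it assumes the benchmark exposes its tasks via a tasks attribute and that each task stores its name under task.metadata.name, which may differ between mteb versions:
import mteb
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
# assumption: the benchmark object exposes its selected tasks via `.tasks`
for task in benchmark.tasks:
    # assumption: each task stores its name in `task.metadata.name`
    print(task.metadata.name)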
Note
Generally we use the naming scheme MTEB(*) for benchmarks, where the "*" denotes the target of the benchmark. In the case of a language, we use the three-letter language code. For large groups of languages, we use the group notation, e.g., MTEB(Scandinavian, v1) for Scandinavian languages.
External benchmarks implemented in MTEB, like CoIR [1], use their original name.
To get an overview of all available benchmarks, simply run:
import mteb
benchmarks = mteb.get_benchmarks()
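For example, to print the name of every available benchmark (assuming each returned benchmark object carries a name attribute matching names such as "MTEB(eng, v2)"):
import mteb
for benchmark in mteb.get_benchmarks():
    # assumption: each benchmark object exposes a `name` attribute
    print(benchmark.name)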
When using a benchmark from MTEB, please cite mteb along with the citations of the benchmark, which you can access using benchmark.citation.
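For instance, a minimal sketch for retrieving the citation of a benchmark (the returned value is typically a BibTeX string, though the exact format may vary):
import mteb
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
# benchmark.citation holds the citation of the benchmark itself (assumed BibTeX)
print(benchmark.citation)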
Selecting a Task¶
mteb comes with the utility functions get_task and get_tasks for fetching and analysing the tasks of interest.
This can be done in multiple ways, e.g.:
- by the task name
- by their type (e.g. "Clustering" or "Classification")
- by their languages (specified as three-letter ISO 639-3 codes)
- by their domains
- by their modalities
- and many more
import mteb
# by name
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
# by type
tasks = mteb.get_tasks(task_types=["Clustering", "Retrieval"]) # only select clustering and retrieval tasks
# by language
tasks = mteb.get_tasks(languages=["eng", "deu"]) # only select datasets which contain "eng" or "deu" (ISO 639-3 codes)
# by domain
tasks = mteb.get_tasks(domains=["Legal"])
# by modality
tasks = mteb.get_tasks(modalities=["text", "image"]) # only select tasks with text or image modalities
# or using multiple filters at once
tasks = mteb.get_tasks(languages=["eng", "deu"], script=["Latn"], domains=["Legal"])
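The returned tasks can then be passed to the evaluation in the same way as a benchmark:
model = ...
results = mteb.evaluate(model, tasks=tasks)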
You can also specify which languages to load for multilingual/cross-lingual tasks, as shown below:
import mteb
tasks = [
    mteb.get_task("AmazonReviewsClassification", languages=["eng", "fra"]),
    mteb.get_task("BUCCBitextMining", languages=["deu"]), # all subsets containing "deu"
]
For more filtering options, see the documentation of get_tasks and get_task.
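Fetched tasks can also be inspected programmatically, e.g. to check what a filter actually selected. This sketch assumes each task exposes a metadata object with name and type fields; attribute names may differ between mteb versions:
import mteb
tasks = mteb.get_tasks(domains=["Legal"], languages=["eng"])
for task in tasks:
    # assumption: task metadata exposes `name` and `type`
    print(task.metadata.name, task.metadata.type)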
Selecting Evaluation Split or Subsets¶
A task in mteb mirrors the structure of a dataset on Huggingface. It includes splits (e.g. "test") and subsets.
# selecting an evaluation split
task = mteb.get_task("Banking77Classification", eval_splits=["test"])
# selecting a Huggingface subset
task = mteb.get_task("AmazonReviewsClassification", hf_subsets=["en", "fr"])
What is a subset?
A subset of a Huggingface dataset is what you specify after the dataset name, e.g. datasets.load_dataset("nyu-mll/glue", "cola"). Often the subset does not need to be defined and is left as "default". The subset is, however, useful for multilingual datasets, where it specifies the desired language or language pair; e.g., in mteb/bucc-bitext-mining we might want to evaluate only on the French-English subset "fr-en".
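The two arguments can be combined (assuming get_task accepts both in a single call, as the separate examples above suggest), e.g. to restrict AmazonReviewsClassification to the French test split:
import mteb
# evaluate only the French subset on the test split
task = mteb.get_task(
    "AmazonReviewsClassification",
    eval_splits=["test"],
    hf_subsets=["fr"],
)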
Using a Custom Task¶
To evaluate on a custom task, you can run the following code. See how to add a new task for details on creating a new task in MTEB.
import mteb
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
class MyCustomTask(AbsTaskReranking):
...
model = mteb.get_model(...)
results = mteb.evaluate(model, tasks=[MyCustomTask()])
[1] Xiangyang Li, Kuicai Dong, Yi Quan Lee, Wei Xia, Yichun Yin, Hao Zhang, Yong Liu, Yasheng Wang, and Ruiming Tang. CoIR: A Comprehensive Benchmark for Code Information Retrieval Models. 2024. URL: https://arxiv.org/abs/2407.02883, arXiv:2407.02883.