# Adding a benchmark
MTEB has a growing list of benchmarks, and we are always looking to add more. MTEB includes both benchmarks that are displayed on the leaderboard and benchmarks that are not displayed but are still available for evaluation. The non-leaderboard benchmarks are available via the mteb.get_benchmark(s) functions and are useful, for example, for evaluating models during development, or for benchmarks that are too specific to be added to the leaderboard.
## Implement a new benchmark
To implement a new benchmark, create a Benchmark object and select the MTEB tasks that it will contain. If some of the tasks do not exist in MTEB, follow the "add a dataset" instructions to add them.
Once you have selected the tasks, you can create a new benchmark as follows:
```python
import mteb

custom_bench = mteb.Benchmark(
    name="MTEB(custom, v1)",  # set the name
    tasks=mteb.get_tasks(  # (1)
        tasks=["AmazonCounterfactualClassification", "AmazonPolarityClassification"],
        languages=["eng"],
    ),
    # give a short description of what the benchmark seeks to test for:
    description=(
        "My custom Amazon benchmark, "
        "which seeks to test for the ability of models "
        "to classify Amazon reviews based on their embeddings."
    ),
)
```
- Select the tasks that will be in the benchmark. See selecting tasks for more details.
### Selecting high-quality tasks
When selecting tasks for a benchmark, it is important to select high-quality tasks that reflect what you seek to measure. To facilitate this process, each task in mteb comes with metadata (task.metadata) that includes a description of the task, the construction and annotation process, licensing, and more. We additionally include descriptive statistics (task.metadata.descriptive_stats) with information about the number of samples, minimum length, and other statistics that can be useful for selecting the right tasks for your benchmark.

Generally, we recommend selecting tasks that are well established in the community, are not machine translated, are not too small, and are not too similar to other tasks in the benchmark. However, the selection of tasks will depend on what you seek to measure with the benchmark, so we recommend carefully reading the metadata of the tasks and selecting the ones that best fit your needs.
## Submitting a Benchmark
To submit a benchmark to MTEB, you need to add your benchmark to benchmarks.py and then open a pull request (PR).
Once submitted, the PR will be reviewed by one of the organizers or contributors, who might ask you to change things. The reviewer reviews both the implementation of the benchmark and the quality and relevance of its tasks.
Once the PR is approved, the benchmark will be added to mteb and will be fetchable using mteb.get_benchmark(name). Note that this does not automatically add the benchmark to the leaderboard; see the next section for instructions on how to do that.
## Submitting a Benchmark to the Leaderboard
To submit a benchmark to the leaderboard, you need to:
- Have added the benchmark to MTEB as described in the previous section
- Evaluate a set of models on the benchmark and open a PR to the results repository with their results.
- When your PR with benchmark results is merged, add your benchmark to the most fitting section in benchmark_selector.py so it is shown on the leaderboard. You can check that the leaderboard renders correctly by running it locally.
- Once the PRs are merged, your benchmark will be added to the leaderboard automatically after the next workflow trigger (every day at midnight Pacific Time, 8 AM UTC).
### Not all benchmarks become leaderboards
A benchmark is a selection of tasks intended to test for a specific purpose. Some benchmarks are very specific, are intended for development, or have been superseded by newer benchmarks. We continually try to keep the benchmarks on the leaderboard relevant, and we may therefore remove benchmarks from the leaderboard when they no longer are. Such benchmarks remain available in MTEB and can still be used for evaluation; they just won't be shown on the leaderboard.