What's New

This section gives an overview of releases. For more information, check out the autogenerated changelog.

New in v2.8

Added Audio Support

Added audio support to MTEB 🎉. This includes support for loading and processing audio data in tasks. Overall, this includes new audio tasks and models:

import mteb

tasks = mteb.get_tasks()

audio_tasks = [task for task in tasks if "audio" in task.metadata.modalities]
len(audio_tasks)  # 108 tasks

models = mteb.get_model_metas()
audio_models = [model for model in models if "audio" in model.modalities]
len(audio_models)  # 56 models

# and as easy as always to evaluate on these tasks:
audio_task = audio_tasks[0]
print(audio_task) # CREMAD(name='CREMA_D', languages=['eng'])

audio_model = audio_models[0]
print(audio_model.name) # google/vggish

mteb.evaluate(audio_model, audio_task)

To run audio tasks you will need the audio extension installed, which you can do using pip install mteb[audio]. For more information on installation, check out the extended installation guide in the documentation.

Added event logging support

Added event logging support. This change introduces a new event_logger module for tracking key user interactions within MTEB’s leaderboard UI and backend. Logged events include actions such as page loads, benchmark switches, and filter changes, along with associated metadata. This enables better insight into how users interact with the leaderboard and provides groundwork for analytics and future improvements.

New in v2.7

Added vLLM support

Added vLLM support. While it is currently not the reference implementation for any models, it allows you to compare performance and throughput on a single model. This can inform whether it is worth switching your local setup over to vLLM. You can read more about it here.

New in v2.6

Added leaderboard CLI command

While the mteb leaderboard could already be run locally, we have now added an official CLI command for it. It comes with additional arguments, e.g. for changing which results cache to use, so that companies, for example, can host it with their internal results.

# Launch leaderboard with custom results directory
mteb leaderboard --cache-path results

# Launch with specific host and port
mteb leaderboard --cache-path ./my_results --host 0.0.0.0 --port 8080

# Create public shareable link
mteb leaderboard --share

# View all options
mteb leaderboard --help

Improved typing throughout

mteb has now added type checking, which improves our typing both now and going forward. This also comes with a lot of additional typing information.

New in v2.5

Work with leaderboard tables locally

If you have loaded results for a specific benchmark, you can get the aggregated benchmark scores for each model using the get_benchmark_result() method:

import mteb
from mteb.cache import ResultCache

# Load results for a specific benchmark
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
cache = ResultCache()
cache.download_from_remote()  # download results from the remote repository
results = cache.load_results(
    models=["intfloat/e5-small", "intfloat/multilingual-e5-small"],
    tasks=benchmark,
)

benchmark_scores_df = results.get_benchmark_result()
print(benchmark_scores_df)
#    Rank (Borda)                                              Model  Zero-shot  Memory Usage (MB)  Number of Parameters (B)  Embedding Dimensions  Max Tokens  ...  Classification  Clustering  Pair Classification  Reranking  Retrieval       STS  Summarization
# 0             1  [e5-small](https://huggingface.co/intfloat/e5-...        100                127                     0.033                   384       512.0  ...        0.599545    0.422085             0.850895   0.444613   0.450684  0.790284       0.310609
# 1             2  [multilingual-e5-small](https://huggingface.co...         95                449                     0.118                   384       512.0  ...        0.673919    0.413591             0.840878   0.431942   0.464342  0.800185       0.292190

New in v2.4

Added utilities for autogenerating ModelMeta

To make it easier to generate high-quality metadata for models, we created .from_hf_hub, .from_sentence_transformer_model and .from_cross_encoder.

This does not fill out everything, but it fills out everything that can be automated.

from sentence_transformers import SentenceTransformer

from mteb.models import ModelMeta

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cpu")
meta = ModelMeta.from_sentence_transformer_model(model)
print(meta.to_dict())
# {'loader_kwargs': {}, 'name': 'Qwen/Qwen3-Embedding-0.6B', 'revision': 'c54f2e6e80b2d7b7de06f51cec4959f6b3e03418', 'release_date': None, 'languages': None, 'n_parameters': 595776512, 'memory_usage_mb': 1136, 'max_tokens': 32768, 'embed_dim': 1024, 'license': 'apache-2.0', 'open_weights': True, 'public_training_code': None, 'public_training_data': None, 'framework': ['Sentence Transformers'], 'reference': None, 'similarity_fn_name': <ScoringFunction.COSINE: 'cosine'>, 'use_instructions': None, 'training_datasets': None, 'adapted_from': None, 'superseded_by': None, 'modalities': ['text'], 'is_cross_encoder': None, 'citation': None, 'contacts': None, 'loader': 'sentence_transformers_loader'}

New in v2.3

Support for custom search backends

MTEB v2.3 adds support for custom search backends through the IndexEncoderSearchProtocol and adds the FaissSearchIndex.

import mteb
from mteb.models import SearchEncoderWrapper
from mteb.models.search_encoder_index import FaissSearchIndex

model = mteb.get_model(...)
index_backend = FaissSearchIndex(model)
model = SearchEncoderWrapper(
    model,
    index_backend=index_backend
)
...

This leads to a slight speedup: for example, running minishlab/potion-base-2M on SWEbenchVerifiedRR took 694 seconds instead of 769. It does not, however, change the default behaviour.

New in v2.2

Support for Asymmetric embeddings in STS and PairClassification

MTEB v2.2 adds support for prompt_type for STS and PairClassification, thus allowing for asymmetric embeddings.

E.g. for TERRa, this allows us to add TERRa.v2:

class TERRaV2(AbsTaskPairClassification):
    input1_prompt_type = PromptType.document
    input2_prompt_type = PromptType.query

    metadata = TaskMetadata(
        name="TERRa.V2", ...
    )

This is not backward compatible in scores for models with query/document separation, which is why we introduce the v2, but it better reflects the actual performance of these models.

Example for intfloat/multilingual-e5-small:

| Task     | main     | PR       |
|----------|----------|----------|
| TERRa.v2 | 0.575105 | 0.589083 |
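
To illustrate why the prompt type matters, here is a minimal sketch of asymmetric encoding for a pair task. The `PromptType` enum, the `PREFIXES` mapping and `prepare_pair` are hypothetical stand-ins written for this example, not MTEB's implementation; the `query: ` and `passage: ` prefixes follow the convention used by E5-style models.

```python
from enum import Enum


class PromptType(Enum):  # stand-in for illustration, not imported from mteb
    query = "query"
    document = "document"


# Assumed prefixes, following the E5 convention of marking queries and passages.
PREFIXES = {PromptType.query: "query: ", PromptType.document: "passage: "}


def prepare_pair(text1, text2, input1_prompt_type, input2_prompt_type):
    """Prefix each side of a pair according to its prompt type."""
    return (
        PREFIXES[input1_prompt_type] + text1,
        PREFIXES[input2_prompt_type] + text2,
    )


# Mirrors the TERRaV2 configuration above: input1 is a document, input2 a query.
pair = prepare_pair(
    "The cat sat on the mat.",
    "A cat is sitting.",
    input1_prompt_type=PromptType.document,
    input2_prompt_type=PromptType.query,
)
print(pair)
```

With symmetric handling both sides would receive the same prefix, so a model trained with separate query and document prompts would see one side in a format it was not trained on.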

New Benchmark Vidore v3

Added Vidore V3 to the leaderboard (#3542), thanks QuentinJGMace et al for working on this!

Added support for Python 3.14

Support for Python 3.14 was added in #3450.

New in v2.1

New benchmark for Dutch

MTEB v2.1 introduces a new benchmark for Dutch, MTEB(nld, v1) (#3464). Thanks to nikolay-banar for the PR.

New in v2.0

This section goes through new features added in v2. Below we give an overview of changes followed by detailed examples.

What are the reasons for the changes? Generally, the many inconsistencies in the library made it hard to maintain without introducing breaking changes. We also think there are multiple important areas to expand into, e.g. adding a new benchmark for image embeddings [1] and supporting new model types in general, making the library more accessible. We have already been able to add many new features in v2.0, and hope that this new version allows us to keep doing so without breaking backward compatibility. See upgrading from v1 for specific deprecations and how to fix them.

Easier evaluation

Evaluations are now a lot easier using mteb.evaluate:

results = mteb.evaluate(model, tasks)

Better local and online caching

The new mteb.ResultCache makes managing the cache notably easier:

import mteb

model = ...
tasks = ...

cache = mteb.ResultCache(cache_path="~/.cache/mteb")  # default

# simple evaluate with cache
results = mteb.evaluate(model, tasks, cache=cache)  # only runs if results not in cache

It also allows you to access the online cache, so you don't have to rerun existing models.

# no need to rerun already public results
cache.download_from_remote() # download the latest results from the remote repository
results = mteb.evaluate(model, tasks, cache=cache)
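
The caching behaviour can be pictured with a small stand-in. The sketch below is not MTEB's `ResultCache`; `ToyCache`, `evaluate_with_cache` and the `main_score` key are hypothetical names that only illustrate the idea of skipping (model, task) pairs whose results already exist.

```python
class ToyCache:
    """In-memory stand-in for a results cache, keyed by (model, task)."""

    def __init__(self):
        self._results = {}

    def get(self, model_name, task_name):
        return self._results.get((model_name, task_name))

    def put(self, model_name, task_name, scores):
        self._results[(model_name, task_name)] = scores


def evaluate_with_cache(model_name, task_names, run_task, cache):
    """Run only the tasks that are missing from the cache."""
    results = {}
    for task_name in task_names:
        cached = cache.get(model_name, task_name)
        if cached is None:
            cached = run_task(model_name, task_name)  # the expensive step
            cache.put(model_name, task_name, cached)
        results[task_name] = cached
    return results


calls = []  # records which tasks were actually evaluated


def fake_run(model_name, task_name):
    calls.append(task_name)
    return {"main_score": 0.5}  # made-up score for illustration


cache = ToyCache()
evaluate_with_cache("my-model", ["STS12"], fake_run, cache)
evaluate_with_cache("my-model", ["STS12", "STS13"], fake_run, cache)
print(calls)  # each task is evaluated only once
```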

Multimodal Input format

Models in mteb that implement the Encoder protocol now support multimodal input, with the model protocol roughly looking like so:

class EncoderProtocol(Protocol):  # simplified
    """The interface for an encoder in MTEB."""

    def encode(self, inputs: DataLoader[BatchedInput], ...) -> Array: ...

Not only does this allow more efficient loading using the torch dataloader, but it also allows keys for multiple modalities:

batch_input: BatchedInput = {
    "text": list[str],
    "images": list[PIL.Image],
    "audio": list[list[audio]], # upcoming
    # + optional fields such as document title
}

Here text is a batch of texts and images is the batch of images belonging to those texts. This allows, for example, markdown documents with multiple figures like so:

> As you see in the following figure [figure 1](image_1) there is a correlation between A and B.

Note

You can find more examples of the new multimodal inputs in the BatchedInput documentation.

This format also allows inputs with no text and multiple images (e.g. for PDFs). Overall, this greatly expands the possible tasks that can now be evaluated in MTEB. To see how to convert a legacy model, see the converting model section.

Better support for CrossEncoders

We've also introduced a new CrossEncoderProtocol, so all cross-encoders now have better support for evaluation:

class CrossEncoderProtocol(Protocol):
    def predict(
        self,
        inputs1: DataLoader[BatchedInput],
        inputs2: DataLoader[BatchedInput],
        ...
    ) -> Array: ...

Unified Retrieval, Reranking and instruction variants

The retrieval tasks in MTEB now support both retrieval and reranking using the same base task. The main difference is that reranking tasks should have a top_ranked subset to be evaluated on. The new structure of retrieval tasks is dataset[subset][split] = RetrievalSplitData. On HF, the dataset should have these subsets:

  1. Corpus - the corpus to retrieve from. Monolingual name: corpus, multilingual name: {subset}-corpus. Can contain columns:
     - id, text, title for text corpus
     - id, image, (text optionally) for image or multimodal corpus
  2. Queries - the queries to retrieve with. Monolingual name: queries, multilingual name: {subset}-queries.
     - id, text for text queries, where text can be a str for a single query, or a list[str] or Conversation for multi-turn dialog queries
     - id, text, instructions for instruction retrieval/reranking tasks
     - id, image, (text optionally) for image or multimodal queries
  3. Qrels - the relevance judgements. Monolingual name: qrels, multilingual name: {subset}-qrels.
     - query-id, corpus-id, score (int or float) for relevance judgements
  4. Top Ranked - the top ranked documents to rerank. Only for reranking tasks. Monolingual name: top_ranked, multilingual name: {subset}-top_ranked.
     - query-id, corpus-ids (list[str]) - the top ranked documents for each query
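
As a rough illustration of the layout above, the sketch below builds a tiny monolingual reranking dataset as plain Python lists and checks that the subsets reference each other consistently. The rows and the `check_reranking_dataset` helper are made up for this example; real datasets live on the Hugging Face Hub and are loaded by MTEB.

```python
dataset = {
    "corpus": [
        {"id": "d1", "title": "Cats", "text": "Cats are small felines."},
        {"id": "d2", "title": "Dogs", "text": "Dogs are loyal companions."},
    ],
    "queries": [{"id": "q1", "text": "Which animal is a feline?"}],
    "qrels": [{"query-id": "q1", "corpus-id": "d1", "score": 1}],
    # Only needed for reranking tasks.
    "top_ranked": [{"query-id": "q1", "corpus-ids": ["d1", "d2"]}],
}


def check_reranking_dataset(data):
    """Return a list of consistency problems (empty if none are found)."""
    problems = []
    corpus_ids = {row["id"] for row in data["corpus"]}
    query_ids = {row["id"] for row in data["queries"]}
    for row in data["qrels"]:
        if row["query-id"] not in query_ids:
            problems.append(f"qrels: unknown query {row['query-id']}")
        if row["corpus-id"] not in corpus_ids:
            problems.append(f"qrels: unknown document {row['corpus-id']}")
    for row in data.get("top_ranked", []):
        if row["query-id"] not in query_ids:
            problems.append(f"top_ranked: unknown query {row['query-id']}")
        for doc_id in row["corpus-ids"]:
            if doc_id not in corpus_ids:
                problems.append(f"top_ranked: unknown document {doc_id}")
    return problems


print(check_reranking_dataset(dataset))
```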

Search Interface

To make it easier to use MTEB for search, we have added a simple search interface using the new SearchProtocol:

class SearchProtocol(Protocol):
    """Interface for searching models."""

    def index(
        self,
        corpus: CorpusDatasetType,
        *,
        task_metadata: TaskMetadata,
        hf_split: str,
        hf_subset: str,
        encode_kwargs: dict[str, Any],
    ) -> None:
        ...

    def search(
        self,
        queries: QueryDatasetType,
        *,
        task_metadata: TaskMetadata,
        hf_split: str,
        hf_subset: str,
        top_k: int,
        encode_kwargs: dict[str, Any],
        top_ranked: TopRankedDocumentsType | None = None,
    ) -> RetrievalOutputType:
        ...

We automatically wrap Encoder and CrossEncoder models so that they support SearchProtocol. However, if your model needs a custom index, you can implement this protocol directly, as was done for ColBERT-like models.
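
To give a feel for what a custom implementation might look like, here is a toy sketch that follows the shape of the `index` and `search` methods but scores documents by word overlap. It makes several assumptions that are not taken from MTEB: the corpus and queries are plain lists of dicts with `id` and `text`, `top_ranked` maps a query id to a list of document ids, the result maps each query id to `{document id: score}`, and the keyword arguments have defaults so the snippet runs standalone.

```python
class WordOverlapSearch:
    """Toy search model that scores documents by shared lowercase words."""

    def __init__(self):
        self._corpus = {}

    def index(self, corpus, *, task_metadata=None, hf_split="test",
              hf_subset="default", encode_kwargs=None):
        # Store a bag of words per document id.
        self._corpus = {
            row["id"]: set(row["text"].lower().split()) for row in corpus
        }

    def search(self, queries, *, task_metadata=None, hf_split="test",
               hf_subset="default", top_k=10, encode_kwargs=None,
               top_ranked=None):
        results = {}
        for row in queries:
            words = set(row["text"].lower().split())
            # Restrict candidates when reranking a given shortlist.
            candidates = (
                top_ranked[row["id"]] if top_ranked else list(self._corpus)
            )
            scores = {
                doc_id: float(len(words & self._corpus[doc_id]))
                for doc_id in candidates
            }
            best = sorted(scores, key=scores.get, reverse=True)[:top_k]
            results[row["id"]] = {doc_id: scores[doc_id] for doc_id in best}
        return results


searcher = WordOverlapSearch()
searcher.index([
    {"id": "d1", "text": "cats are felines"},
    {"id": "d2", "text": "dogs bark loudly"},
])
hits = searcher.search([{"id": "q1", "text": "do dogs bark"}], top_k=1)
print(hits)
```

A real implementation would replace the word sets with embeddings and a proper index; the point here is only the split between building the index once and querying it afterwards.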

New Documentation

We've added a lot of new documentation to make it easier to get started with MTEB.

Better support for loading and comparing results

The new ResultCache also makes it easier to load, inspect and compare both local and online results:

import mteb

cache = mteb.ResultCache(cache_path="~/.cache/mteb") # default
cache.download_from_remote() # download the latest results from the remote repository

# load both local and online results
results = cache.load_results(models=["sentence-transformers/all-MiniLM-L6-v2", ...], tasks=["STS12"])
df = results.to_dataframe()

Descriptive Statistics

Descriptive statistics aren't new in MTEB; however, they are now available for every task. To extract them, simply run:

import mteb
task = mteb.get_task("MIRACLRetrievalHardNegatives")

task.metadata.descriptive_stats

And you will get a highly detailed set of descriptive statistics covering everything from the number of samples to query lengths, duplicates, etc. These not only make it easier for you to examine tasks, but also make it easier for us to run quality checks on future tasks.

Example for reranking task:

{
    "test": {
        "num_samples": 160,
        "number_of_characters": 310133,
        "documents_text_statistics": {
            "total_text_length": 307938,
            "min_text_length": 0,
            "average_text_length": 2199.557142857143,
            "max_text_length": 2710,
            "unique_texts": 140
        },
        "documents_image_statistics": null,
        "queries_text_statistics": {
            "total_text_length": 2195,
            "min_text_length": 55,
            "average_text_length": 109.75,
            "max_text_length": 278,
            "unique_texts": 20
        },
        "queries_image_statistics": null,
        "relevant_docs_statistics": {
            "num_relevant_docs": 60,
            "min_relevant_docs_per_query": 7,
            "average_relevant_docs_per_query": 3.0,
            "max_relevant_docs_per_query": 7,
            "unique_relevant_docs": 140
        },
        "top_ranked_statistics": {
            "num_top_ranked": 140,
            "min_top_ranked_per_query": 7,
            "average_top_ranked_per_query": 7.0,
            "max_top_ranked_per_query": 7
        }
    }
}

Documentation for the descriptive statistics types.
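
To make the text statistics fields concrete, the following sketch computes the same keys for a small list of strings. It is an illustration only, assuming lengths are measured in characters; `text_statistics` is a hypothetical helper, not the function MTEB uses.

```python
def text_statistics(texts):
    """Compute character-length statistics for a list of strings."""
    lengths = [len(text) for text in texts]
    return {
        "total_text_length": sum(lengths),
        "min_text_length": min(lengths),
        "average_text_length": sum(lengths) / len(lengths),
        "max_text_length": max(lengths),
        "unique_texts": len(set(texts)),  # duplicates are counted once
    }


queries = ["what is mteb?", "what is mteb?", "how do I add a task?"]
stats = text_statistics(queries)
print(stats)
```

A gap between the number of texts and `unique_texts`, as in this toy input, is a quick way to spot duplicated queries or documents.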

Saving Predictions

To support error analysis, it is now possible to save the model predictions on a given task. You can do this as follows:

import mteb

# using a small model and small dataset
encoder = mteb.get_model("sentence-transformers/static-similarity-mrl-multilingual-v1")
task = mteb.get_task("NanoArguAnaRetrieval")

prediction_folder = "path/to/model_predictions"

res = mteb.evaluate(
    encoder,
    task,
    prediction_folder=prediction_folder,
)

The predictions will be saved in path/to/model_predictions/{task_name}_predictions.json and will look like so for retrieval tasks:

{
  "test": {
        "query1": {"document1": 0.77, "document2": 0.12, ...},
        "query2": {"document2": 0.87, "document1": 0.32, ...},
        ...
    }
}
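
As a starting point for error analysis, the sketch below reads predictions in the retrieval format shown above and lists the highest-scoring documents per query. The JSON string stands in for the contents of a saved file, and `top_documents` is a hypothetical helper written for this example.

```python
import json

# Stand-in for the contents of a saved predictions file.
raw = """
{
  "test": {
    "query1": {"document1": 0.77, "document2": 0.12},
    "query2": {"document2": 0.87, "document1": 0.32}
  }
}
"""
predictions = json.loads(raw)


def top_documents(predictions, split, k=1):
    """Return the k highest-scoring document ids for each query."""
    return {
        query_id: sorted(scores, key=scores.get, reverse=True)[:k]
        for query_id, scores in predictions[split].items()
    }


best = top_documents(predictions, "test")
print(best)
```

Comparing these top documents against the relevance judgements of the task shows which queries the model gets wrong.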

Support datasets v4

With the new functionality for reuploading datasets to the standard datasets Parquet format, we’ve reuploaded all tasks that required trust_remote_code, and MTEB now fully supports Datasets v4.

Upgrading from v1

This section gives an introduction to upgrading from v1 to v2.

Replacing mteb.MTEB

The previous approach to evaluation required you to first create an MTEB object and then call .run on that object. The MTEB object was initially a sort of catch-all object intended for filtering tasks, selecting tasks, evaluating, and a few other cases.

This overload of functionality made it hard to change. For a while now, get_tasks has made it easier to filter and select tasks, and mteb.evaluate now supersedes MTEB as the method for evaluation.

# Approach before 2.0.0:
eval = mteb.MTEB(tasks=tasks)  # now throws a deprecation warning
results = eval.run(
    model,
    overwrite=True,
    encode_kwargs={},
    ...
)

# Recommended:
mteb.evaluate(
    model,
    tasks,
    overwrite_strategy="only-missing", # only rerun missing splits
    encode_kwargs={},
    ...
)

Replacing mteb.load_results()

Given that the new ResultCache makes dealing with results from both local and online caches a lot easier, it now replaces mteb.load_results:

import mteb
from mteb.cache import ResultCache

tasks = mteb.get_tasks(tasks=["STS12"])
model_names = ["intfloat/multilingual-e5-large"]

# Approach before 2.0.0:
results = mteb.load_results(models=model_names, tasks=tasks, download_latest=True)

# Recommended:
cache = ResultCache("~/.cache/mteb")  # default
cache.download_from_remote()  # downloads remote results

results = cache.load_results(models=model_names, tasks=tasks)

Converting model to new format

As mentioned in the section above, MTEB v2 now supports multimodal input as the default. Luckily, all models implemented in MTEB already support this new format! However, if you have a local model that you would like to evaluate, here is a quick conversion guide. If your previous implementation looks like so:

# v1.X.X
class MyDummyEncoder:
    def __init__(self, **kwargs):
        self.model = ...

    def encode(self, sentences: list[str], **kwargs) -> Array:
        embeddings = self.model.encode(sentences)
        return embeddings

You can simply unpack it to its text input like so:

# v2.0.0
class MyDummyEncoder:
    def __init__(self, **kwargs):
        self.model = ...

    def encode(self, inputs: DataLoader[BatchedInput], **kwargs) -> Array:
        # unpack to v1 format:
        sentences = [text for batch in inputs for text in batch["text"]]
        # do as you did beforehand:
        embeddings = self.model.encode(sentences)
        return embeddings

Of course, it will be more efficient if you work directly with the dataloader.
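
A minimal sketch of working batch by batch is shown below. It assumes only that iterating over the inputs yields dicts with a `"text"` key, as in the conversion above; `embed_texts` is a hypothetical stand-in for a real model, and a list of dicts stands in for the dataloader.

```python
def embed_texts(texts):
    """Hypothetical stand-in for a real model: one 2-d vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in texts]


class MyBatchedEncoder:
    def encode(self, inputs, **kwargs):
        embeddings = []
        # Encode one batch at a time instead of collecting every text first.
        for batch in inputs:
            embeddings.extend(embed_texts(batch["text"]))
        return embeddings


# A list of batches stands in for DataLoader[BatchedInput].
batches = [{"text": ["hello world", "hi"]}, {"text": ["a b c"]}]
vectors = MyBatchedEncoder().encode(batches)
print(len(vectors))  # one embedding per input text
```

Processing each batch as it arrives keeps memory use bounded by the batch size rather than by the size of the whole dataset.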

Reuploading datasets

If your dataset is in the old format, or you want to reupload it to the new Parquet format, you can do so using the new push_dataset_to_hub method:

import mteb

task = mteb.get_task("MyOldTask")
task.push_dataset_to_hub("my-username/my-new-task")

Converting Reranking datasets to new format

If you have a reranking dataset, you can convert it to the retrieval format. To do this, add your task name to mteb.abstasks.text.reranking.OLD_FORMAT_RERANKING_TASKS, after which it will be converted to the new format automatically. To reupload it in the new reranking format, refer to the reuploading datasets section.

import mteb
from mteb.abstasks.text.reranking import OLD_FORMAT_RERANKING_TASKS

OLD_FORMAT_RERANKING_TASKS.append("MyOldRerankingTask")

task = mteb.get_task("MyOldRerankingTask")
model = ...
mteb.evaluate(model, task)

  1. Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, and Niklas Muennighoff. MIEB: Massive Image Embedding Benchmark. arXiv preprint arXiv:2504.10471, 2025. URL: https://arxiv.org/abs/2504.10471, doi:10.48550/ARXIV.2504.10471