Loading and working with results¶
To make the results more easily accessible, we have designed functionality for retrieving them from both the online results repository and the local cache.
Generally, you access this functionality through the ResultCache object.
For instance, if you are selecting the best model for semantic text similarity (STS) you could fetch the relevant tasks and create a dataframe of the results using the following code:
import mteb
from mteb.cache import ResultCache
# select the relevant task(s) and model(s)
tasks = mteb.get_tasks(tasks=["STS12"])
model_names = ["intfloat/multilingual-e5-large"]
# point the cache at a local directory and load the matching results
cache = ResultCache("~/.cache/mteb")
results = cache.load_results(models=model_names, tasks=tasks)
From this you will get a BenchmarkResults object:
results
# BenchmarkResults(model_results=[...])
type(results)
# mteb.load_results.benchmark_results.BenchmarkResults
df = results.to_dataframe()
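The returned df behaves like a regular pandas DataFrame (as the printed output further down this page suggests), so the usual pandas tooling applies. As a minimal sketch, you can peek at its contents; the exact column layout is shown in the "Creating a Dataframe" section below:
# quick inspection of the resulting dataframe
print(df.head())
print(df.columns)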
Working with public results¶
All previously submitted results are available in the results repository.
You can download them using:
from mteb.cache import ResultCache
cache = ResultCache()
cache.download_from_remote() # download results from the remote repository
From here, you can work with the cache as usual. For instance, if you are selecting the best model for your French and English retrieval task on legal documents, you could fetch the relevant tasks and create a dataframe of the results using the following code:
import mteb
from mteb.cache import ResultCache
# select your tasks
tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["eng", "fra"], domains=["Legal"])
model_names = [
"GritLM/GritLM-7B",
"intfloat/multilingual-e5-large",
]
cache = ResultCache()
cache.download_from_remote() # download results from the remote repository. Might take a while the first time.
results = cache.load_results(
models=model_names,
tasks=tasks,
include_remote=True, # default
)
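Since the goal here is to pick the best model across the selected tasks, one simple approach is to average the main score per model with pandas. This is only a sketch, assuming the long-format columns (model_name, task_name, score) shown further down this page:
# average the main score per model across the selected tasks and rank the models
long_df = results.to_dataframe(format="long")
mean_scores = long_df.groupby("model_name")["score"].mean().sort_values(ascending=False)
print(mean_scores)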
Working with BenchmarkResults¶
The BenchmarkResults object is a convenient way to work with results in mteb, allowing you to quickly examine them and convert them to dataframes.

The object contains a number of convenience attributes and methods for inspecting and examining the results:
print(results.model_names)
# ['GritLM/GritLM-7B', 'intfloat/multilingual-e5-large']
task_names = results.task_names
print(task_names)
# ['SpartQA', 'PlscClusteringP2P.v2', 'StackOverflowQA', 'JSICK', ...
Getting Benchmark Results¶
If you loaded results for a specific benchmark, you can get the aggregated benchmark scores for each model using the get_benchmark_result() method:
import mteb
from mteb.cache import ResultCache
# Load results for a specific benchmark
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
cache = ResultCache()
cache.download_from_remote() # download results from the remote repository
results = cache.load_results(
models=["intfloat/e5-small", "intfloat/multilingual-e5-small"],
tasks=benchmark,
)
benchmark_scores_df = results.get_benchmark_result()
print(benchmark_scores_df)
# Rank (Borda) Model Zero-shot Memory Usage (MB) Number of Parameters (B) Embedding Dimensions Max Tokens ... Classification Clustering Pair Classification Reranking Retrieval STS Summarization
# 0 1 [e5-small](https://huggingface.co/intfloat/e5-... 100 127 0.033 384 512.0 ... 0.599545 0.422085 0.850895 0.444613 0.450684 0.790284 0.310609
# 1 2 [multilingual-e5-small](https://huggingface.co... 95 449 0.118 384 512.0 ... 0.673919 0.413591 0.840878 0.431942 0.464342 0.800185 0.292190
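Assuming the returned benchmark table behaves like a regular pandas DataFrame (as the output above suggests), you can slice it down to the columns you care about, using the column names shown above:
# keep a few of the columns shown above and order the models by Borda rank
summary = benchmark_scores_df[["Rank (Borda)", "Model", "Retrieval", "STS"]]
print(summary.sort_values("Rank (Borda)"))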
Filtering Results¶
There are also utility functions that allow you to select specific models or tasks:
# select only GritLM
results = results.select_models(["GritLM/GritLM-7B"])
# select only retrieval tasks
tasks = mteb.get_tasks(tasks=task_names)
retrieval_tasks = [task for task in tasks if task.metadata.type == "Retrieval"]
results = results.select_tasks(retrieval_tasks)
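As a quick sanity check, the convenience attributes introduced above reflect the filtering (a sketch, assuming they behave the same on the filtered object):
# only the selected model and the retrieval tasks should remain
print(results.model_names)
# ['GritLM/GritLM-7B']
print(len(results.task_names))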
Creating a Dataframe¶
df = results.to_dataframe()
print(df)
# model_name task_name GritLM/GritLM-7B
# 0 AILAStatutes 0.418000
# 1 ArguAna 0.631710
# 2 BelebeleRetrieval 0.717035
# 3 CovidRetrieval 0.734010
# 4 HagridRetrieval 0.986730
# 5 LEMBPasskeyRetrieval 0.382500
# 6 LegalBenchCorporateLobbying 0.949990
# 7 MIRACLRetrievalHardNegatives 0.516793
# 8 MLQARetrieval 0.727420
# 9 SCIDOCS 0.244090
# 10 SpartQA 0.093550
# 11 StackOverflowQA 0.933670
# 12 StatcanDialogueDatasetRetrieval 0.457587
# 13 TRECCOVID 0.743130
# 14 TempReasonL1 0.071640
# 15 TwitterHjerneRetrieval 0.432660
# 16 WikipediaRetrievalMultilingual 0.917722
# 17 WinoGrande 0.536970
By default, this gives you the results in a "wide" format. However, you can just as well get them in a "long" format:
long_format_df = results.to_dataframe(format="long")
print(long_format_df.head(5))
# model_name task_name score
# 0 GritLM/GritLM-7B AILAStatutes 0.418000
# 1 GritLM/GritLM-7B ArguAna 0.631710
# 2 GritLM/GritLM-7B BelebeleRetrieval 0.717035
# 3 GritLM/GritLM-7B CovidRetrieval 0.734010
# 4 GritLM/GritLM-7B HagridRetrieval 0.986730
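The long format also works well with standard pandas reshaping; for example, you can pivot it back into a wide, task-by-model table (a minimal sketch using plain pandas):
# pivot the long format back into a wide table with one column per model
wide_again = long_format_df.pivot(index="task_name", columns="model_name", values="score")
print(wide_again.head())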
Adding metadata to the table¶
You might want to add more task metadata to the table. Luckily, this is quite easy to do:
import pandas as pd
# build a dataframe of task metadata and merge it with the results table
task_df = tasks.to_dataframe(properties=["name", "type", "domains"])
task_df = task_df.rename(columns={"name": "task_name"})
df_with_meta = pd.merge(task_df, df)
print(df_with_meta.head(5))
# task_name type domains GritLM/GritLM-7B
# 0 SpartQA Retrieval [Encyclopaedic, Written] 0.093550
# 1 StackOverflowQA Retrieval [Programming, Written] 0.933670
# 2 BelebeleRetrieval Retrieval [Web, News, Written] 0.717035
# 3 ArguAna Retrieval [Medical, Written] 0.631710
# 4 TempReasonL1 Retrieval [Encyclopaedic, Written] 0.071640
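With the metadata attached, you can aggregate scores by task properties, for example per domain. This is a sketch using plain pandas; note that the domains column contains lists, so it is exploded first:
# average the GritLM scores per domain; "domains" holds lists, so explode it first
per_domain = (
    df_with_meta.explode("domains")
    .groupby("domains")["GritLM/GritLM-7B"]
    .mean()
    .sort_values(ascending=False)
)
print(per_domain)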