
Evaluation

mteb.evaluate

OverwriteStrategy

Bases: HelpfulStrEnum

Enum for the overwrite strategy when running a task.

  • "always": Always run the task, overwriting the results
  • "never": Run the task only if the results are not found in the cache. If the results are found, it will not run the task.
  • "only-missing": Only rerun the missing splits of a task. It will not rerun the splits if the dataset revision or mteb version has changed.
  • "only-cache": Only load the results from the cache folder and do not run the task. Useful if you just want to load the results from the cache.
Source code in mteb/evaluate.py
class OverwriteStrategy(HelpfulStrEnum):
    """Enum for the overwrite strategy when running a task.

    - "always": Always run the task, overwriting the results
    - "never": Run the task only if the results are not found in the cache. If the results are found, it will not run the task.
    - "only-missing": Only rerun the missing splits of a task. It will not rerun the splits if the dataset revision or mteb version has
        changed.
    - "only-cache": Only load the results from the cache folder and do not run the task. Useful if you just want to load the results from the
        cache.
    """

    ALWAYS = "always"
    NEVER = "never"
    ONLY_MISSING = "only-missing"
    ONLY_CACHE = "only-cache"
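
For orientation, here is a minimal sketch of passing a strategy to mteb.evaluate; the string values map one-to-one to the enum members, and the model and task names are purely illustrative (they mirror the evaluate examples further down).

>>> import mteb
>>> model_meta = mteb.get_model_meta("sentence-transformers/all-MiniLM-L6-v2")
>>> task = mteb.get_task("STS12")
>>>
>>> # default: rerun only the splits missing from the cache
>>> result = mteb.evaluate(model_meta, task, overwrite_strategy="only-missing")
>>>
>>> # force a full rerun, overwriting cached results
>>> result = mteb.evaluate(model_meta, task, overwrite_strategy="always")
>>>
>>> # load cached results only; raises if any splits are missing
>>> result = mteb.evaluate(model_meta, task, overwrite_strategy="only-cache")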

evaluate(model, tasks, *, co2_tracker=None, raise_error=True, encode_kwargs=None, cache=ResultCache(), overwrite_strategy='only-missing', prediction_folder=None, show_progress_bar=True)

This function runs a model on a given task and returns the results.

Parameters:

model (ModelMeta | MTEBModels | SentenceTransformer | CrossEncoder), required
    The model to use for encoding.

tasks (AbsTask | Iterable[AbsTask]), required
    The task or tasks to run.

co2_tracker (bool | None), default: None
    If True, track the CO₂ emissions of the evaluation. This requires codecarbon to be installed, which can be done using pip install mteb[codecarbon]. If None is passed, CO₂ tracking is only run if codecarbon is installed.

encode_kwargs (dict[str, Any] | None), default: None
    Additional keyword arguments passed to the model's encode method.

raise_error (bool), default: True
    If True, raise an error if the task fails. If False, return a result with an empty task_results list.

cache (ResultCache | None), default: ResultCache()
    The cache to use for loading the results. If None, no cache is used. The default cache stores results in the ~/.cache/mteb directory. It can be overridden by setting the MTEB_CACHE environment variable to a different directory or by passing a ResultCache object directly.

overwrite_strategy (str | OverwriteStrategy), default: 'only-missing'
    The strategy for running a task and overwriting its results. Can be:
      • "always": Always run the task, overwriting any existing results.
      • "never": Run the task only if the results are not found in the cache; otherwise, skip it.
      • "only-missing": Only rerun the missing splits of a task. Existing splits are not rerun even if the dataset revision or mteb version has changed.
      • "only-cache": Only load the results from the cache folder and do not run the task. Useful if you just want to load cached results.

prediction_folder (Path | str | None), default: None
    Optional folder in which to save model predictions for the task. Predictions of the tasks will be saved in prediction_folder/{task_name}_predictions.json.

show_progress_bar (bool), default: True
    Whether to show a progress bar when running the evaluation. Setting this to False will also set encode_kwargs['show_progress_bar'] to False if encode_kwargs is unspecified.

Returns:

ModelResult
    The results of the evaluation.
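
A minimal sketch of inspecting the returned object, assuming the model_name, model_revision, and task_results fields visible in the ModelResult constructor calls in the source listing below:

>>> result = mteb.evaluate(model_meta, task)
>>> result.model_name, result.model_revision  # which model produced these results
>>> result.task_results  # one entry per evaluated task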

Examples:

>>> import mteb
>>> model_meta = mteb.get_model_meta("sentence-transformers/all-MiniLM-L6-v2")
>>> task = mteb.get_task("STS12")
>>> result = mteb.evaluate(model_meta, task)
>>>
>>> # with CO2 tracking
>>> result = mteb.evaluate(model_meta, task, co2_tracker=True)
>>>
>>> # with encode kwargs
>>> result = mteb.evaluate(model_meta, task, encode_kwargs={"batch_size": 16})
>>>
>>> # with online cache
>>> cache = mteb.ResultCache(cache_path="~/.cache/mteb")
>>>
>>> cache.download_from_remote()
>>> result = mteb.evaluate(model_meta, task, cache=cache)
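
A few further illustrative calls extending the examples above; the mteb.get_tasks helper usage, the prediction folder path, and the extra task name are assumptions for illustration.

>>> # evaluate several tasks at once, saving per-task predictions
>>> tasks = mteb.get_tasks(tasks=["STS12", "STS13"])
>>> result = mteb.evaluate(model_meta, tasks, prediction_folder="predictions")
>>>
>>> # log failures and continue instead of raising
>>> result = mteb.evaluate(model_meta, tasks, raise_error=False)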
Source code in mteb/evaluate.py
def evaluate(
    model: ModelMeta | MTEBModels | SentenceTransformer | CrossEncoder,
    tasks: AbsTask | Iterable[AbsTask],
    *,
    co2_tracker: bool | None = None,
    raise_error: bool = True,
    encode_kwargs: dict[str, Any] | None = None,
    cache: ResultCache | None = ResultCache(),
    overwrite_strategy: str | OverwriteStrategy = "only-missing",
    prediction_folder: Path | str | None = None,
    show_progress_bar: bool = True,
) -> ModelResult:
    """This function runs a model on a given task and returns the results.

    Args:
        model: The model to use for encoding.
        tasks: The task or tasks to run.
        co2_tracker: If True, track the CO₂ emissions of the evaluation. This requires codecarbon to be installed, which can be done using
            `pip install mteb[codecarbon]`. If None is passed, CO₂ tracking is only run if codecarbon is installed.
        encode_kwargs: Additional keyword arguments passed to the model's `encode` method.
        raise_error: If True, raise an error if the task fails. If False, return a result with an empty `task_results` list.
        cache: The cache to use for loading the results. If None, no cache is used. The default cache stores results in the
            `~/.cache/mteb` directory. It can be overridden by setting the `MTEB_CACHE` environment variable to a different directory or by directly
            passing a `ResultCache` object.
        overwrite_strategy: The strategy for running a task and overwriting its results. Can be:
            - "always": Always run the task, overwriting any existing results.
            - "never": Run the task only if the results are not found in the cache; otherwise, skip it.
            - "only-missing": Only rerun the missing splits of a task. Existing splits are not rerun even if the dataset revision or mteb version
                has changed.
            - "only-cache": Only load the results from the cache folder and do not run the task. Useful if you just want to load cached results.
        prediction_folder: Optional folder in which to save model predictions for the task. Predictions of the tasks will be saved in
            `prediction_folder/{task_name}_predictions.json`.
        show_progress_bar: Whether to show a progress bar when running the evaluation. Default is True. Setting this to False will also set
            `encode_kwargs['show_progress_bar']` to False if `encode_kwargs` is unspecified.

    Returns:
        The results of the evaluation.

    Examples:
        >>> import mteb
        >>> model_meta = mteb.get_model_meta("sentence-transformers/all-MiniLM-L6-v2")
        >>> task = mteb.get_task("STS12")
        >>> result = mteb.evaluate(model_meta, task)
        >>>
        >>> # with CO2 tracking
        >>> result = mteb.evaluate(model_meta, task, co2_tracker=True)
        >>>
        >>> # with encode kwargs
        >>> result = mteb.evaluate(model_meta, task, encode_kwargs={"batch_size": 16})
        >>>
        >>> # with online cache
        >>> cache = mteb.ResultCache(cache_path="~/.cache/mteb")
        >>>
        >>> cache.download_from_remote()
        >>> result = mteb.evaluate(model_meta, task, cache=cache)
    """
    if isinstance(prediction_folder, str):
        prediction_folder = Path(prediction_folder)

    if encode_kwargs is None:
        encode_kwargs = (
            {"show_progress_bar": False} if show_progress_bar is False else {}
        )
    if "batch_size" not in encode_kwargs:
        encode_kwargs["batch_size"] = 32
        logger.info(
            "No batch size defined in encode_kwargs. Setting `encode_kwargs['batch_size'] = 32`. Explicitly set the batch size to silence this message."
        )

    model, meta, model_name, model_revision = _sanitize_model(model)
    _check_model_modalities(meta, tasks)

    # AbsTaskAggregate is a special case where we have to run multiple tasks and combine the results
    if isinstance(tasks, AbsTaskAggregate):
        task = cast(AbsTaskAggregate, tasks)
        results = evaluate(
            model,
            task.metadata.task_list,
            co2_tracker=co2_tracker,
            raise_error=raise_error,
            encode_kwargs=encode_kwargs,
            cache=cache,
            overwrite_strategy=overwrite_strategy,
            prediction_folder=prediction_folder,
            show_progress_bar=show_progress_bar,
        )
        result = task.combine_task_results(results.task_results)
        return ModelResult(
            model_name=results.model_name,
            model_revision=results.model_revision,
            task_results=[result],
        )

    if isinstance(tasks, AbsTask):
        task = tasks
    else:
        results = []
        tasks_tqdm = tqdm(
            tasks,
            desc="Evaluating tasks",
            disable=not show_progress_bar,
        )
        for i, task in enumerate(tasks_tqdm):
            tasks_tqdm.set_description(f"Evaluating task {task.metadata.name}")
            _res = evaluate(
                model,
                task,
                co2_tracker=co2_tracker,
                raise_error=raise_error,
                encode_kwargs=encode_kwargs,
                cache=cache,
                overwrite_strategy=overwrite_strategy,
                prediction_folder=prediction_folder,
                show_progress_bar=False,
            )
            results.extend(_res.task_results)
        return ModelResult(
            model_name=_res.model_name,
            model_revision=_res.model_revision,
            task_results=results,
        )

    overwrite_strategy = OverwriteStrategy.from_str(overwrite_strategy)

    existing_results = None
    if cache and overwrite_strategy != OverwriteStrategy.ALWAYS:
        results = cache.load_task_result(task.metadata.name, meta)
        if results:
            existing_results = results

    if (
        existing_results
        and overwrite_strategy == "only-missing"
        and overwrite_strategy == OverwriteStrategy.ONLY_MISSING
        and existing_results.is_mergeable(task)
    ):
        missing_eval = existing_results.get_missing_evaluations(task)
    else:
        missing_eval = dict.fromkeys(task.eval_splits, task.hf_subsets)

    if (
        existing_results
        and not missing_eval
        and overwrite_strategy != OverwriteStrategy.ALWAYS
    ):
        # if there are no missing evals we can just return the results
        logger.info(
            f"Results for {task.metadata.name} already exist in cache. Skipping evaluation and loading results."
        )
        return ModelResult(
            model_name=model_name,
            model_revision=model_revision,
            task_results=[existing_results],
        )
    if missing_eval and overwrite_strategy in [
        OverwriteStrategy.NEVER,
        OverwriteStrategy.ONLY_CACHE,
    ]:
        raise ValueError(
            f"overwrite_strategy is set to '{overwrite_strategy.value}' and the results file exists. However there are the following missing splits (and subsets): {missing_eval}. To rerun these set overwrite_strategy to 'only-missing'."
        )

    if existing_results:
        logger.info(
            f"Found existing results for {task.metadata.name}, only running missing splits: {list(missing_eval.keys())}"
        )

    if isinstance(model, ModelMeta):
        logger.info(
            f"Loading model {model_name} with revision {model_revision} from ModelMeta."
        )
        model = model.load_model()
        logger.info("✓ Model loaded")

    if raise_error is False:
        try:
            result = _evaluate_task(
                model=model,
                splits=missing_eval,
                task=task,
                co2_tracker=co2_tracker,
                encode_kwargs=encode_kwargs,
                prediction_folder=prediction_folder,
            )
        except Exception as e:
            logger.error(
                f"Error while running task {task.metadata.name} on splits {list(missing_eval.keys())}: {e}"
            )
            return ModelResult(
                model_name=model_name,
                model_revision=model_revision,
                task_results=[],
            )
    else:
        result = _evaluate_task(
            model=model,
            splits=missing_eval,
            task=task,
            co2_tracker=co2_tracker,
            encode_kwargs=encode_kwargs,
            prediction_folder=prediction_folder,
        )
    logger.info(f"✓ Finished evaluation for {task.metadata.name}")

    if existing_results:
        result = result.merge(existing_results)

    if cache:
        cache.save_to_cache(result, meta)

    return ModelResult(
        model_name=model_name,
        model_revision=model_revision,
        task_results=[result],
    )