Tasks

A task is an implementation of a dataset for evaluation. It could, for instance, be the MIRACL dataset, consisting of queries, a corpus of documents, and the correct documents to retrieve for a given query. In addition to the dataset, a task includes the specifications for how a model should be run on the dataset and how its output should be evaluated. Each task also comes with extensive metadata, including the license, who annotated the data, etc.

An overview of the tasks within mteb
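
A minimal sketch of fetching a single task and inspecting it (attribute values depend on the task; both utilities are documented below):

>>> import mteb
>>> task = mteb.get_task("BornholmBitextMining")
>>> task.metadata.type        # e.g. "BitextMining"
>>> task.metadata.main_score  # the metric used to score the task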

Utilities

mteb.get_tasks(tasks=None, *, languages=None, script=None, domains=None, task_types=None, categories=None, exclude_superseded=True, eval_splits=None, exclusive_language_filter=False, modalities=None, exclusive_modality_filter=False, exclude_aggregate=False, exclude_private=True)

Get a list of tasks based on the specified filters.

Parameters:

Name Type Description Default
tasks list[str] | None

A list of task names to include. If None, all tasks which pass the filters are included.

None
languages list[str] | None

A list of languages, specified either as three-letter language codes (ISO 639-3, e.g. "eng") or as language-script codes, e.g. "eng-Latn". For multilingual tasks this will also remove languages that are not in the specified list.

None
script list[str] | None

A list of script codes (ISO 15924 codes, e.g. "Latn"). If None, all scripts are included. For multilingual tasks this will also remove scripts that are not in the specified list.

None
domains list[TaskDomain] | None

A list of task domains, e.g. "Legal", "Medical", "Fiction".

None
task_types list[TaskType] | None

A list of task types, e.g. "Classification" or "Retrieval". If None, all tasks are included.

None
categories list[TaskCategory] | None

A list of task categories; these include "t2t" (text-to-text) and "t2i" (text-to-image). See TaskMetadata for the full list.

None
exclude_superseded bool

A boolean flag to exclude datasets which are superseded by another.

True
eval_splits list[str] | None

A list of evaluation splits to include. If None, all splits are included.

None
exclusive_language_filter bool

Some datasets contain more than one language; e.g. for STS22 the subset "de-en" contains both eng and deu. If exclusive_language_filter is set to False, both of these will be kept; if set to True, only subsets that contain all of the specified languages will be kept.

False
modalities list[Modalities] | None

A list of modalities to include. If None, all modalities are included.

None
exclusive_modality_filter bool

If True, only keep tasks where all filter modalities are included in the task's modalities and ALL task modalities are in filter modalities (exact match). If False, keep tasks if any of the task's modalities match the filter modalities.

False
exclude_aggregate bool

If True, exclude aggregate tasks. If False, both aggregate and non-aggregate tasks are returned.

False
exclude_private bool

If True (default), exclude private/closed datasets (is_public=False). If False, include both public and private datasets.

True

Returns:

Type Description
MTEBTasks

A list of all initialized task objects that pass all of the filters (AND operation).

Examples:

>>> get_tasks(languages=["eng", "deu"], script=["Latn"], domains=["Legal"])
>>> get_tasks(languages=["eng"], script=["Latn"], task_types=["Classification"])
>>> get_tasks(languages=["eng"], script=["Latn"], task_types=["Clustering"], exclude_superseded=False)
>>> get_tasks(languages=["eng"], tasks=["WikipediaRetrievalMultilingual"], eval_splits=["test"])
>>> get_tasks(tasks=["STS22"], languages=["eng"], exclusive_language_filter=True) # don't include multilingual subsets containing English
Source code in mteb/overview.py
def get_tasks(
    tasks: list[str] | None = None,
    *,
    languages: list[str] | None = None,
    script: list[str] | None = None,
    domains: list[TaskDomain] | None = None,
    task_types: list[TaskType] | None = None,
    categories: list[TaskCategory] | None = None,
    exclude_superseded: bool = True,
    eval_splits: list[str] | None = None,
    exclusive_language_filter: bool = False,
    modalities: list[Modalities] | None = None,
    exclusive_modality_filter: bool = False,
    exclude_aggregate: bool = False,
    exclude_private: bool = True,
) -> MTEBTasks:
    """Get a list of tasks based on the specified filters.

    Args:
        tasks: A list of task names to include. If None, all tasks which pass the filters are included.
        languages: A list of languages, specified either as three-letter language codes (ISO 639-3, e.g. "eng") or as language-script codes, e.g.
            "eng-Latn". For multilingual tasks this will also remove languages that are not in the specified list.
        script: A list of script codes (ISO 15924 codes, e.g. "Latn"). If None, all scripts are included. For multilingual tasks this will also remove scripts
            that are not in the specified list.
        domains: A list of task domains, e.g. "Legal", "Medical", "Fiction".
        task_types: A list of task types, e.g. "Classification" or "Retrieval". If None, all tasks are included.
        categories: A list of task categories; these include "t2t" (text-to-text) and "t2i" (text-to-image). See TaskMetadata for the full list.
        exclude_superseded: A boolean flag to exclude datasets which are superseded by another.
        eval_splits: A list of evaluation splits to include. If None, all splits are included.
        exclusive_language_filter: Some datasets contain more than one language; e.g. for STS22 the subset "de-en" contains both eng and deu. If
            exclusive_language_filter is set to False, both of these will be kept; if set to True, only subsets that contain all of the
            specified languages will be kept.
        modalities: A list of modalities to include. If None, all modalities are included.
        exclusive_modality_filter: If True, only keep tasks where _all_ filter modalities are included in the
            task's modalities and ALL task modalities are in filter modalities (exact match).
            If False, keep tasks if _any_ of the task's modalities match the filter modalities.
        exclude_aggregate: If True, exclude aggregate tasks. If False, both aggregate and non-aggregate tasks are returned.
        exclude_private: If True (default), exclude private/closed datasets (is_public=False). If False, include both public and private datasets.

    Returns:
        A list of all initialized task objects that pass all of the filters (AND operation).

    Examples:
        >>> get_tasks(languages=["eng", "deu"], script=["Latn"], domains=["Legal"])
        >>> get_tasks(languages=["eng"], script=["Latn"], task_types=["Classification"])
        >>> get_tasks(languages=["eng"], script=["Latn"], task_types=["Clustering"], exclude_superseded=False)
        >>> get_tasks(languages=["eng"], tasks=["WikipediaRetrievalMultilingual"], eval_splits=["test"])
        >>> get_tasks(tasks=["STS22"], languages=["eng"], exclusive_language_filter=True) # don't include multilingual subsets containing English
    """
    if tasks:
        _tasks = [
            get_task(
                task,
                languages,
                script,
                eval_splits=eval_splits,
                exclusive_language_filter=exclusive_language_filter,
                modalities=modalities,
                exclusive_modality_filter=exclusive_modality_filter,
            )
            for task in tasks
        ]
        return MTEBTasks(_tasks)

    _tasks = [
        cls().filter_languages(languages, script).filter_eval_splits(eval_splits)
        for cls in _create_task_list()
    ]

    if languages:
        _tasks = _filter_tasks_by_languages(_tasks, languages)
    if script:
        _tasks = _filter_tasks_by_script(_tasks, script)
    if domains:
        _tasks = _filter_tasks_by_domains(_tasks, domains)
    if task_types:
        _tasks = _filter_tasks_by_task_types(_tasks, task_types)
    if categories:
        logger.warning(
            "`s2p`, `p2p`, and `s2s` will be removed and replaced by `t2t` in v2.0.0."
        )
        _tasks = _filter_task_by_categories(_tasks, categories)
    if exclude_superseded:
        _tasks = _filter_superseded_datasets(_tasks)
    if modalities:
        _tasks = _filter_tasks_by_modalities(
            _tasks, modalities, exclusive_modality_filter
        )
    if exclude_aggregate:
        _tasks = _filter_aggregate_tasks(_tasks)

    # Apply privacy filtering
    if exclude_private:
        _tasks = [t for t in _tasks if t.metadata.is_public]

    return MTEBTasks(_tasks)
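
A quick sketch of working with the returned collection (the filter values are illustrative); MTEBTasks is an iterable of initialized task objects:

>>> tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["eng"])
>>> len(tasks)                        # number of tasks passing all filters
>>> [t.metadata.name for t in tasks]  # each element is an AbsTask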

mteb.get_task(task_name, languages=None, script=None, eval_splits=None, hf_subsets=None, exclusive_language_filter=False, modalities=None, exclusive_modality_filter=False)

Get a task by name.

Parameters:

Name Type Description Default
task_name str

The name of the task to fetch.

required
languages list[str] | None

A list of languages, specified either as three-letter language codes (ISO 639-3, e.g. "eng") or as language-script codes, e.g. "eng-Latn". For multilingual tasks this will also remove languages that are not in the specified list.

None
script list[str] | None

A list of script codes (ISO 15924 codes). If None, all scripts are included. For multilingual tasks this will also remove scripts that are not in the specified list.

None
eval_splits list[str] | None

A list of evaluation splits to include. If None, all splits are included.

None
hf_subsets list[str] | None

A list of Huggingface subsets to evaluate on.

None
exclusive_language_filter bool

Some datasets contain more than one language; e.g. for STS22 the subset "de-en" contains both eng and deu. If exclusive_language_filter is set to False, both of these will be kept; if set to True, only subsets that contain all of the specified languages will be kept.

False
modalities list[Modalities] | None

A list of modalities to include. If None, all modalities are included.

None
exclusive_modality_filter bool

If True, only keep tasks where all filter modalities are included in the task's modalities and ALL task modalities are in filter modalities (exact match). If False, keep tasks if any of the task's modalities match the filter modalities.

False

Returns:

Type Description
AbsTask

An initialized task object.

Examples:

>>> get_task("BornholmBitextMining")
Source code in mteb/overview.py
def get_task(
    task_name: str,
    languages: list[str] | None = None,
    script: list[str] | None = None,
    eval_splits: list[str] | None = None,
    hf_subsets: list[str] | None = None,
    exclusive_language_filter: bool = False,
    modalities: list[Modalities] | None = None,
    exclusive_modality_filter: bool = False,
) -> AbsTask:
    """Get a task by name.

    Args:
        task_name: The name of the task to fetch.
        languages: A list of languages, specified either as three-letter language codes (ISO 639-3, e.g. "eng") or as language-script codes, e.g.
            "eng-Latn". For multilingual tasks this will also remove languages that are not in the specified list.
        script: A list of script codes (ISO 15924 codes). If None, all scripts are included. For multilingual tasks this will also remove scripts
            that are not in the specified list.
        eval_splits: A list of evaluation splits to include. If None, all splits are included.
        hf_subsets: A list of Huggingface subsets to evaluate on.
        exclusive_language_filter: Some datasets contain more than one language; e.g. for STS22 the subset "de-en" contains both eng and deu. If
            exclusive_language_filter is set to False, both of these will be kept; if set to True, only subsets that contain all of the
            specified languages will be kept.
        modalities: A list of modalities to include. If None, all modalities are included.
        exclusive_modality_filter: If True, only keep tasks where _all_ filter modalities are included in the
            task's modalities and ALL task modalities are in filter modalities (exact match).
            If False, keep tasks if _any_ of the task's modalities match the filter modalities.

    Returns:
        An initialized task object.

    Examples:
        >>> get_task("BornholmBitextMining")
    """
    if task_name in _TASK_RENAMES:
        _task_name = _TASK_RENAMES[task_name]
        logger.warning(
            f"The task with the given name '{task_name}' has been renamed to '{_task_name}'. To prevent this warning use the new name."
        )

    if task_name not in _TASKS_REGISTRY:
        close_matches = difflib.get_close_matches(task_name, _TASKS_REGISTRY.keys())
        if close_matches:
            suggestion = f"KeyError: '{task_name}' not found. Did you mean: '{close_matches[0]}'?"
        else:
            suggestion = (
                f"KeyError: '{task_name}' not found and no similar keys were found."
            )
        raise KeyError(suggestion)
    task = _TASKS_REGISTRY[task_name]()
    if eval_splits:
        task.filter_eval_splits(eval_splits=eval_splits)
    if modalities:
        task.filter_modalities(modalities, exclusive_modality_filter)
    return task.filter_languages(
        languages,
        script,
        hf_subsets=hf_subsets,
        exclusive_language_filter=exclusive_language_filter,
    )
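
A sketch combining get_task with the filters above (the subset and split values are illustrative):

>>> task = mteb.get_task("STS22", languages=["eng"], eval_splits=["test"])
>>> task.hf_subsets  # only the subsets containing English remain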

Metadata

Each task also contains extensive metadata. We annotate this using the following object, which allows us to use pydantic to validate the metadata.
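
For illustration, a hedged sketch of filling in this metadata for a hypothetical task (the dataset path, revision, and all field values are placeholders drawn from the documented options):

from mteb.abstasks.task_metadata import TaskMetadata

# A hypothetical task; the dataset path and revision are placeholders.
metadata = TaskMetadata(
    name="MyRetrievalTask",
    description="Retrieve the relevant document for a given query.",
    type="Retrieval",
    category="t2t",
    reference="https://example.com/paper",
    dataset={"path": "my-org/my-dataset", "revision": "main"},
    eval_splits=["test"],
    eval_langs=["eng-Latn"],
    main_score="ndcg_at_10",
    date=("2020-01-01", "2021-01-01"),
    domains=["Academic"],
    task_subtypes=["Article retrieval"],
    license="cc-by-4.0",
    annotations_creators="derived",
    dialect=[],
    sample_creation="found",
    bibtex_citation="",  # empty string if no citation is available
)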

mteb.TaskMetadata

Bases: BaseModel

Metadata for a task.

Attributes:

Name Type Description
dataset MetadataDatasetDict

All arguments to pass to datasets.load_dataset to load the dataset for the task.

name str

The name of the task.

description str

A description of the task.

type TaskType

The type of the task. These include "Classification", "Summarization", "STS", "Retrieval", "Reranking", "Clustering", "PairClassification", "BitextMining". The type should match the abstask type.

category TaskCategory | None

The category of the task. E.g. includes "t2t" (text to text), "t2i" (text to image).

reference StrURL | None

A URL to the documentation of the task. E.g. a published paper.

eval_splits list[str]

The splits of the dataset used for evaluation.

eval_langs Languages

The languages of the dataset used for evaluation. Languages follow the IETF BCP 47 standard, consisting of a "{language}-{script}" tag (e.g. "eng-Latn"), where the language is given as an ISO 639-3 language code (e.g. "eng") followed by an ISO 15924 script code (e.g. "Latn"). Can be either a list of languages or a dictionary mapping huggingface subsets to lists of languages (e.g. if the huggingface dataset contains different languages).

main_score str

The main score used for evaluation.

date tuple[StrDate, StrDate] | None

The date when the data was collected. Specified as a tuple of two dates.

domains list[TaskDomain] | None

The domains of the data. These include "Non-fiction", "Social", "Fiction", "News", "Academic", "Blog", "Encyclopaedic", "Government", "Legal", "Medical", "Poetry", "Religious", "Reviews", "Web", "Spoken", "Written". A dataset can belong to multiple domains.

task_subtypes list[TaskSubtype] | None

The subtypes of the task. E.g. includes "Sentiment/Hate speech", "Thematic Clustering". Feel free to update the list as needed.

license Licenses | StrURL | None

The license of the data specified as lowercase, e.g. "cc-by-nc-4.0". If the license is not specified, use "not specified". For custom licenses a URL is used.

annotations_creators AnnotatorType | None

The type of the annotators. Includes "expert-annotated" (annotated by experts), "human-annotated" (annotated e.g. by mturkers), "derived" (derived from structure in the data).

dialect list[str] | None

The dialect of the data, if applicable. Ideally specified as a BCP-47 language tag. Empty list if no dialects are present.

sample_creation SampleCreationMethod | None

The method of text creation. Includes "found", "created", "machine-translated", "machine-translated and verified", and "machine-translated and localized".

prompt str | PromptDict | None

The prompt used for the task. Can be a string or a dictionary containing the query and passage prompts.

bibtex_citation str | None

The BibTeX citation for the dataset. Should be an empty string if no citation is available.

adapted_from Sequence[str] | None

Datasets adapted (translated, sampled from, etc.) from other datasets.

is_public bool

Whether the dataset is publicly available. If False (closed/private), a HuggingFace token is required to run the datasets.

Source code in mteb/abstasks/task_metadata.py
class TaskMetadata(BaseModel):
    """Metadata for a task.

    Attributes:
        dataset: All arguments to pass to [datasets.load_dataset](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/loading_methods#datasets.load_dataset) to load the dataset for the task.
        name: The name of the task.
        description: A description of the task.
        type: The type of the task. These include "Classification", "Summarization", "STS", "Retrieval", "Reranking", "Clustering",
            "PairClassification", "BitextMining". The type should match the abstask type.
        category: The category of the task. E.g. includes "t2t" (text to text), "t2i" (text to image).
        reference: A URL to the documentation of the task. E.g. a published paper.
        eval_splits: The splits of the dataset used for evaluation.
        eval_langs: The languages of the dataset used for evaluation. Languages follow the IETF BCP 47 standard, consisting of a "{language}-{script}"
            tag (e.g. "eng-Latn"), where the language is given as an ISO 639-3 language code (e.g. "eng") followed by an ISO 15924 script code
            (e.g. "Latn"). Can be either a list of languages or a dictionary mapping huggingface subsets to lists of languages (e.g. if the
            huggingface dataset contains different languages).
        main_score: The main score used for evaluation.
        date: The date when the data was collected. Specified as a tuple of two dates.
        domains: The domains of the data. These include "Non-fiction", "Social", "Fiction", "News", "Academic", "Blog", "Encyclopaedic",
            "Government", "Legal", "Medical", "Poetry", "Religious", "Reviews", "Web", "Spoken", "Written". A dataset can belong to multiple domains.
        task_subtypes: The subtypes of the task. E.g. includes "Sentiment/Hate speech", "Thematic Clustering". Feel free to update the list as needed.
        license: The license of the data specified as lowercase, e.g. "cc-by-nc-4.0". If the license is not specified, use "not specified". For custom licenses a URL is used.
        annotations_creators: The type of the annotators. Includes "expert-annotated" (annotated by experts), "human-annotated" (annotated e.g. by
            mturkers), "derived" (derived from structure in the data).
        dialect: The dialect of the data, if applicable. Ideally specified as a BCP-47 language tag. Empty list if no dialects are present.
        sample_creation: The method of text creation. Includes "found", "created", "machine-translated", "machine-translated and verified", and
            "machine-translated and localized".
        prompt: The prompt used for the task. Can be a string or a dictionary containing the query and passage prompts.
        bibtex_citation: The BibTeX citation for the dataset. Should be an empty string if no citation is available.
        adapted_from: Datasets adapted (translated, sampled from, etc.) from other datasets.
        is_public: Whether the dataset is publicly available. If False (closed/private), a HuggingFace token is required to run the datasets.
    """

    model_config = ConfigDict(extra="forbid")

    dataset: MetadataDatasetDict

    name: str
    description: str
    prompt: str | PromptDict | None = None
    type: TaskType
    modalities: list[Modalities] = ["text"]
    category: TaskCategory | None = None
    reference: StrURL | None = None

    eval_splits: list[str] = ["test"]
    eval_langs: Languages
    main_score: str

    date: tuple[StrDate, StrDate] | None = None
    domains: list[TaskDomain] | None = None
    task_subtypes: list[TaskSubtype] | None = None
    license: Licenses | StrURL | None = None

    annotations_creators: AnnotatorType | None = None
    dialect: list[str] | None = None

    sample_creation: SampleCreationMethod | None = None
    bibtex_citation: str | None = None
    adapted_from: Sequence[str] | None = None
    is_public: bool = True

    def _validate_metadata(self) -> None:
        self._eval_langs_are_valid(self.eval_langs)

    @field_validator("prompt")
    @classmethod
    def _check_prompt_is_valid(
        cls, prompt: str | PromptDict | None
    ) -> str | PromptDict | None:
        if isinstance(prompt, dict):
            for key in prompt:
                if key not in [e.value for e in PromptType]:
                    raise ValueError(
                        "The prompt dictionary should only contain the keys 'query' and 'passage'."
                    )
        return prompt

    def _eval_langs_are_valid(self, eval_langs: Languages) -> None:
        """This method checks that the eval_langs are specified as a list of languages."""
        if isinstance(eval_langs, dict):
            for langs in eval_langs.values():
                for code in langs:
                    check_language_code(code)
        else:
            for code in eval_langs:
                check_language_code(code)

    @property
    def bcp47_codes(self) -> list[ISOLanguageScript]:
        """Return the languages and script codes of the dataset formatting in accordance with the BCP-47 standard."""
        if isinstance(self.eval_langs, dict):
            return sorted(
                {lang for langs in self.eval_langs.values() for lang in langs}
            )
        return sorted(set(self.eval_langs))

    @property
    def languages(self) -> list[str]:
        """Return the languages of the dataset as iso639-3 codes."""

        def get_lang(lang: str) -> str:
            return lang.split("-")[0]

        if isinstance(self.eval_langs, dict):
            return sorted(
                {get_lang(lang) for langs in self.eval_langs.values() for lang in langs}
            )
        return sorted({get_lang(lang) for lang in self.eval_langs})

    @property
    def scripts(self) -> set[str]:
        """Return the scripts of the dataset as iso15924 codes."""

        def get_script(lang: str) -> str:
            return lang.split("-")[1]

        if isinstance(self.eval_langs, dict):
            return {
                get_script(lang) for langs in self.eval_langs.values() for lang in langs
            }
        return {get_script(lang) for lang in self.eval_langs}

    def is_filled(self) -> bool:
        """Check if all the metadata fields are filled."""
        return all(
            getattr(self, field_name) is not None
            for field_name in self.model_fields
            if field_name not in ["prompt", "adapted_from"]
        )

    @property
    def hf_subsets_to_langscripts(self) -> dict[HFSubset, list[ISOLanguageScript]]:
        """Return a dictionary mapping huggingface subsets to languages."""
        if isinstance(self.eval_langs, dict):
            return self.eval_langs
        return {"default": self.eval_langs}  # type: ignore

    @property
    def intext_citation(self, include_cite: bool = True) -> str:
        """Create an in-text citation for the dataset."""
        cite = ""
        if self.bibtex_citation:
            cite = f"{self.bibtex_citation.split(',')[0].split('{')[1]}"
        if include_cite and cite:
            # check for whitespace in the citation
            if " " in cite:
                logger.warning(
                    "Citation contains whitespace. Please ensure that the citation is correctly formatted."
                )
            return f"\\cite{{{cite}}}"
        return cite

    @property
    def descriptive_stats(self) -> dict[str, DescriptiveStatistics] | None:
        """Return the descriptive statistics for the dataset."""
        if self.descriptive_stat_path.exists():
            with self.descriptive_stat_path.open("r") as f:
                return json.load(f)
        return None

    @property
    def descriptive_stat_path(self) -> Path:
        """Return the path to the descriptive statistics file."""
        descriptive_stat_base_dir = Path(__file__).parent.parent / "descriptive_stats"
        if self.type in MIEB_TASK_TYPE:
            descriptive_stat_base_dir = descriptive_stat_base_dir / "Image"
        task_type_dir = descriptive_stat_base_dir / self.type
        if not descriptive_stat_base_dir.exists():
            descriptive_stat_base_dir.mkdir()
        if not task_type_dir.exists():
            task_type_dir.mkdir()
        return task_type_dir / f"{self.name}.json"

    @property
    def n_samples(self) -> dict[str, int] | None:
        """Returns the number of samples in the dataset"""
        stats = self.descriptive_stats
        if not stats:
            return None

        n_samples = {}
        for subset, subset_value in stats.items():
            if subset == "hf_subset_descriptive_stats":
                continue
            n_samples[subset] = subset_value["num_samples"]  # type: ignore
        return n_samples

    @property
    def hf_subsets(self) -> list[str]:
        """Return the huggingface subsets."""
        return list(self.hf_subsets_to_langscripts.keys())

    @property
    def is_multilingual(self) -> bool:
        """Check if the task is multilingual."""
        return isinstance(self.eval_langs, dict)

    def __hash__(self) -> int:
        return hash(self.model_dump_json())

    @property
    def revision(self) -> str:
        """Return the dataset revision."""
        return self.dataset["revision"]

    def _create_dataset_card_data(
        self,
        existing_dataset_card_data: DatasetCardData | None = None,
    ) -> tuple[DatasetCardData, dict[str, Any]]:
        """Create a DatasetCardData object from the task metadata.

        Args:
            existing_dataset_card_data: The existing DatasetCardData object to update. If None, a new object will be created.

        Returns:
            A DatasetCardData object with the metadata for the task with kwargs to card
        """
        if existing_dataset_card_data is None:
            existing_dataset_card_data = DatasetCardData()

        dataset_type = [
            *self._hf_task_type(),
            *self._hf_task_category(),
            *self._hf_subtypes(),
        ]
        languages = self._hf_languages()

        multilinguality = "monolingual" if len(languages) == 1 else "multilingual"
        if self.sample_creation and "translated" in self.sample_creation:
            multilinguality = "translated"

        if self.adapted_from is not None:
            source_datasets = [
                task.metadata.dataset["path"]
                for task in mteb.get_tasks(self.adapted_from)
            ]
            source_datasets.append(self.dataset["path"])
        else:
            source_datasets = None if not self.dataset else [self.dataset["path"]]

        tags = ["mteb"] + self.modalities

        descriptive_stats = self.descriptive_stats
        if descriptive_stats is not None:
            for split, split_stat in descriptive_stats.items():
                if len(split_stat.get("hf_subset_descriptive_stats", {})) > 10:
                    split_stat.pop("hf_subset_descriptive_stats", {})
            descriptive_stats = json.dumps(descriptive_stats, indent=4)

        dataset_card_data_params = existing_dataset_card_data.to_dict()
        # override the existing values
        dataset_card_data_params.update(
            dict(
                language=languages,
                license=self._hf_license(),
                annotations_creators=[self.annotations_creators]
                if self.annotations_creators
                else None,
                multilinguality=multilinguality,
                source_datasets=source_datasets,
                task_categories=dataset_type,
                task_ids=self._hf_subtypes(),
                tags=tags,
            )
        )

        return (
            DatasetCardData(**dataset_card_data_params),
            # parameters for readme generation
            dict(
                citation=self.bibtex_citation,
                dataset_description=self.description,
                dataset_reference=self.reference,
                descritptive_stats=descriptive_stats,
                dataset_task_name=self.name,
                category=self.category,
                domains=", ".join(self.domains) if self.domains else None,
            ),
        )

    def generate_dataset_card(
        self,
        existing_dataset_card: DatasetCard | None = None,
    ) -> DatasetCard:
        """Generates a dataset card for the task.

        Args:
            existing_dataset_card: The existing dataset card to update. If None, a new dataset card will be created.

        Returns:
            DatasetCard: The dataset card for the task.
        """
        path = Path(__file__).parent / "dataset_card_template.md"
        existing_dataset_card_data = (
            existing_dataset_card.data if existing_dataset_card else None
        )
        dataset_card_data, template_kwargs = self._create_dataset_card_data(
            existing_dataset_card_data
        )
        dataset_card = DatasetCard.from_template(
            card_data=dataset_card_data,
            template_path=str(path),
            **template_kwargs,
        )
        return dataset_card

    def push_dataset_card_to_hub(self, repo_name: str) -> None:
        """Pushes the dataset card to the huggingface hub.

        Args:
            repo_name: The name of the repository to push the dataset card to.
        """
        dataset_card = None
        if repo_exists(
            repo_name, repo_type=constants.REPO_TYPE_DATASET
        ) and file_exists(
            repo_name, constants.REPOCARD_NAME, repo_type=constants.REPO_TYPE_DATASET
        ):
            dataset_card = DatasetCard.load(repo_name)
        dataset_card = self.generate_dataset_card(dataset_card)
        dataset_card.push_to_hub(repo_name, commit_message="Add dataset card")

    def _hf_subtypes(self) -> list[str]:
        # to get full list of available task_ids execute
        # requests.post("https://huggingface.co/api/validate-yaml", json={
        #   "content": "---\ntask_ids: 'test'\n---",
        #   "repoType": "dataset"
        # })
        mteb_to_hf_subtype = {
            "Article retrieval": ["document-retrieval"],
            "Conversational retrieval": ["conversational", "utterance-retrieval"],
            "Dialect pairing": [],
            "Dialog Systems": ["dialogue-modeling", "dialogue-generation"],
            "Discourse coherence": [],
            "Duplicate Image Retrieval": [],
            "Language identification": ["language-identification"],
            "Linguistic acceptability": ["acceptability-classification"],
            "Political classification": [],
            "Question answering": [
                "multiple-choice-qa",
                "question-answering",
            ],
            "Sentiment/Hate speech": [
                "sentiment-analysis",
                "sentiment-scoring",
                "sentiment-classification",
                "hate-speech-detection",
            ],
            "Thematic clustering": [],
            "Scientific Reranking": [],
            "Claim verification": ["fact-checking", "fact-checking-retrieval"],
            "Topic classification": ["topic-classification"],
            "Code retrieval": [],
            "False Friends": [],
            "Cross-Lingual Semantic Discrimination": [],
            "Textual Entailment": ["natural-language-inference"],
            "Counterfactual Detection": [],
            "Emotion classification": [],
            "Reasoning as Retrieval": [],
            "Rendered Texts Understanding": [],
            "Image Text Retrieval": [],
            "Object recognition": [],
            "Scene recognition": [],
            "Caption Pairing": ["image-captioning"],
            "Emotion recognition": [],
            "Textures recognition": [],
            "Activity recognition": [],
            "Tumor detection": [],
            "Duplicate Detection": [],
            "Rendered semantic textual similarity": [
                "semantic-similarity-scoring",
                "rendered semantic textual similarity",
            ],
            "Intent classification": [
                "intent-classification",
            ],
        }
        subtypes = []
        if self.task_subtypes:
            for subtype in self.task_subtypes:
                subtypes.extend(mteb_to_hf_subtype.get(subtype, []))
        return subtypes

    def _hf_task_type(self) -> list[str]:
        # to get full list of task_types execute:
        # requests.post("https://huggingface.co/api/validate-yaml", json={
        #     "content": "---\ntask_categories: ['test']\n---", "repoType": "dataset"
        # }).json()
        # or look at https://huggingface.co/tasks
        mteb_task_type_to_datasets = {
            # Text
            "BitextMining": ["translation"],
            "Classification": ["text-classification"],
            "MultilabelClassification": ["text-classification"],
            "Clustering": ["text-classification"],
            "PairClassification": ["text-classification"],
            "Reranking": ["text-ranking"],
            "Retrieval": ["text-retrieval"],
            "STS": ["sentence-similarity"],
            "Summarization": ["summarization"],
            "InstructionRetrieval": ["text-retrieval"],
            "InstructionReranking": ["text-ranking"],
            # Image
            "Any2AnyMultiChoice": ["visual-question-answering"],
            "Any2AnyRetrieval": ["visual-document-retrieval"],
            "Any2AnyMultilingualRetrieval": ["visual-document-retrieval"],
            "VisionCentricQA": ["visual-question-answering"],
            "ImageClustering": ["image-clustering"],
            "ImageClassification": ["image-classification"],
            "ImageMultilabelClassification": ["image-classification"],
            "DocumentUnderstanding": ["visual-document-retrieval"],
            "VisualSTS(eng)": ["other"],
            "VisualSTS(multi)": ["other"],
            "ZeroShotClassification": ["zero-shot-classification"],
            "Compositionality": ["other"],
        }
        if self.type == "ZeroShotClassification":
            if self.modalities == ["image"]:
                return ["zero-shot-image-classification"]
            return ["zero-shot-classification"]

        return mteb_task_type_to_datasets[self.type]

    def _hf_task_category(self) -> list[str]:
        dataset_type = []
        if self.category in ["i2i", "it2i", "i2it", "it2it"]:
            dataset_type.append("image-to-image")
        if self.category in ["i2t", "t2i", "it2t", "it2i", "t2it", "i2it", "it2it"]:
            dataset_type.extend(["image-to-text", "text-to-image"])
        if self.category in ["it2t", "it2i", "t2it", "i2it", "it2it"]:
            dataset_type.extend(["image-text-to-text"])
        return dataset_type

    def _hf_languages(self) -> list[str]:
        languages: list[str] = []
        if self.is_multilingual:
            for val in list(self.eval_langs.values()):
                languages.extend(val)
        else:
            languages = self.eval_langs
        # value "python" is not valid. It must be an ISO 639-1, 639-2 or 639-3 code (two/three letters),
        # or a special value like "code", "multilingual".
        readme_langs = []
        for lang in languages:
            lang_name, family = lang.split("-")
            if family == "Code":
                readme_langs.append("code")
            else:
                readme_langs.append(lang_name)
        return sorted(set(readme_langs))

    def _hf_license(self) -> str:
        dataset_license = self.license
        if dataset_license:
            license_mapping = {
                "not specified": "unknown",
                "msr-la-nc": "other",
                "cc-by-nd-2.1-jp": "other",
            }
            dataset_license = license_mapping.get(
                dataset_license,
                "other" if dataset_license.startswith("http") else dataset_license,
            )
        return dataset_license

bcp47_codes property

Return the languages and script codes of the dataset formatting in accordance with the BCP-47 standard.

descriptive_stat_path property

Return the path to the descriptive statistics file.

descriptive_stats property

Return the descriptive statistics for the dataset.

hf_subsets property

Return the huggingface subsets.

hf_subsets_to_langscripts property

Return a dictionary mapping huggingface subsets to languages.

intext_citation property

Create an in-text citation for the dataset.

is_multilingual property

Check if the task is multilingual.

languages property

Return the languages of the dataset as iso639-3 codes.

n_samples property

Returns the number of samples in the dataset

revision property

Return the dataset revision.

scripts property

Return the scripts of the dataset as iso15924 codes.
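
A short sketch of the derived language properties (values shown in comments are illustrative):

>>> meta = mteb.get_task("BornholmBitextMining").metadata
>>> meta.languages      # ISO 639-3 codes, e.g. ["dan"]
>>> meta.bcp47_codes    # language-script codes, e.g. ["dan-Latn"]
>>> meta.is_multilingual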

generate_dataset_card(existing_dataset_card=None)

Generates a dataset card for the task.

Parameters:

Name Type Description Default
existing_dataset_card DatasetCard | None

The existing dataset card to update. If None, a new dataset card will be created.

None

Returns:

Name Type Description
DatasetCard DatasetCard

The dataset card for the task.

Source code in mteb/abstasks/task_metadata.py
def generate_dataset_card(
    self,
    existing_dataset_card: DatasetCard | None = None,
) -> DatasetCard:
    """Generates a dataset card for the task.

    Args:
        existing_dataset_card: The existing dataset card to update. If None, a new dataset card will be created.

    Returns:
        DatasetCard: The dataset card for the task.
    """
    path = Path(__file__).parent / "dataset_card_template.md"
    existing_dataset_card_data = (
        existing_dataset_card.data if existing_dataset_card else None
    )
    dataset_card_data, template_kwargs = self._create_dataset_card_data(
        existing_dataset_card_data
    )
    dataset_card = DatasetCard.from_template(
        card_data=dataset_card_data,
        template_path=str(path),
        **template_kwargs,
    )
    return dataset_card

is_filled()

Check if all the metadata fields are filled.

Source code in mteb/abstasks/task_metadata.py
def is_filled(self) -> bool:
    """Check if all the metadata fields are filled."""
    return all(
        getattr(self, field_name) is not None
        for field_name in self.model_fields
        if field_name not in ["prompt", "adapted_from"]
    )

push_dataset_card_to_hub(repo_name)

Pushes the dataset card to the huggingface hub.

Parameters:

Name Type Description Default
repo_name str

The name of the repository to push the dataset card to.

required
Source code in mteb/abstasks/task_metadata.py
def push_dataset_card_to_hub(self, repo_name: str) -> None:
    """Pushes the dataset card to the huggingface hub.

    Args:
        repo_name: The name of the repository to push the dataset card to.
    """
    dataset_card = None
    if repo_exists(
        repo_name, repo_type=constants.REPO_TYPE_DATASET
    ) and file_exists(
        repo_name, constants.REPOCARD_NAME, repo_type=constants.REPO_TYPE_DATASET
    ):
        dataset_card = DatasetCard.load(repo_name)
    dataset_card = self.generate_dataset_card(dataset_card)
    dataset_card.push_to_hub(repo_name, commit_message="Add dataset card")

Metadata Types

mteb.abstasks.task_metadata.AnnotatorType = Literal['expert-annotated', 'human-annotated', 'derived', 'LM-generated', 'LM-generated and reviewed'] module-attribute

The type of the annotators. Is often important for understanding the quality of a dataset.

mteb.abstasks.task_metadata.SampleCreationMethod = Literal['found', 'created', 'human-translated and localized', 'human-translated', 'machine-translated', 'machine-translated and verified', 'machine-translated and localized', 'LM-generated and verified', 'machine-translated and LM verified', 'rendered', 'multiple'] module-attribute

How the text was created. It can be an important factor for understanding the quality of a dataset. E.g. used to filter out machine-translated datasets.

mteb.abstasks.task_metadata.TaskCategory = Literal['t2t', 't2c', 'i2i', 'i2c', 'i2t', 't2i', 'it2t', 'it2i', 'i2it', 't2it', 'it2it'] module-attribute

The category of the task. E.g. includes "t2t" (text to text), "t2i" (text to image) and "i2c" (image to category).
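
These categories can be passed directly to the get_tasks filter described above, e.g.:

>>> mteb.get_tasks(categories=["t2i"])  # only tasks whose category is text-to-image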

mteb.abstasks.task_metadata.TaskDomain = Literal['Academic', 'Blog', 'Constructed', 'Encyclopaedic', 'Engineering', 'Fiction', 'Government', 'Legal', 'Medical', 'News', 'Non-fiction', 'Poetry', 'Religious', 'Reviews', 'Scene', 'Social', 'Spoken', 'Subtitles', 'Web', 'Written', 'Programming', 'Chemistry', 'Financial', 'Entertainment'] module-attribute

mteb.abstasks.task_metadata.TaskType = Literal[_TASK_TYPE] module-attribute

The type of the task. E.g. includes "Classification", "Retrieval" and "Clustering".

mteb.abstasks.task_metadata.TaskSubtype = Literal['Article retrieval', 'Patent retrieval', 'Conversational retrieval', 'Dialect pairing', 'Dialog Systems', 'Discourse coherence', 'Duplicate Image Retrieval', 'Language identification', 'Linguistic acceptability', 'Political classification', 'Question answering', 'Sentiment/Hate speech', 'Thematic clustering', 'Scientific Reranking', 'Claim verification', 'Topic classification', 'Code retrieval', 'False Friends', 'Cross-Lingual Semantic Discrimination', 'Textual Entailment', 'Counterfactual Detection', 'Emotion classification', 'Reasoning as Retrieval', 'Rendered Texts Understanding', 'Image Text Retrieval', 'Object recognition', 'Scene recognition', 'Caption Pairing', 'Emotion recognition', 'Textures recognition', 'Activity recognition', 'Tumor detection', 'Duplicate Detection', 'Rendered semantic textual similarity', 'Intent classification'] module-attribute

The subtypes of the task. E.g. includes "Sentiment/Hate speech", "Thematic Clustering". This list can be updated as needed.

mteb.abstasks.task_metadata.PromptDict = TypedDict('PromptDict', {prompt_type.value: str for prompt_type in PromptType}, total=False) module-attribute

A dictionary containing the prompt used for the task.

Parameters:

Name Type Description Default
query

The prompt used for the queries in the task.

required
document

The prompt used for the passages in the task.

required
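
A sketch of a prompt dictionary (the prompt strings are illustrative):

>>> prompt: PromptDict = {
...     "query": "Represent this sentence for retrieving relevant documents: ",
...     "document": "Represent this document for retrieval: ",
... }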

The Task Object

All tasks in mteb inherit from the following abstract class.

mteb.AbsTask

AbsTask

Bases: ABC

The abstract class for the tasks

Attributes:

Name Type Description
metadata TaskMetadata

The metadata describing the task

dataset dict[HFSubset, DatasetDict] | None

The dataset represented as a dictionary of the form {"hf subset": {"split": Dataset}} where "split" is the dataset split (e.g. "test") and Dataset is a datasets.Dataset object. "hf subset" is the data subset on Huggingface, typically used to denote the language, e.g. datasets.load_dataset("data", "en"). If the dataset does not have a subset this is simply "default".

seed

The random seed used for reproducibility.

hf_subsets list[HFSubset]

The list of Huggingface subsets to use.

data_loaded bool

Denotes if the dataset is loaded or not. This is used to avoid loading the dataset multiple times.

abstask_prompt str | None

The potential prompt of the abstask

superseded_by str | None

Denotes the task that this task is superseded by. Used to issue a warning to users of outdated datasets, while maintaining reproducibility of existing benchmarks.

fast_loading bool

Deprecated. Denotes if the task should be loaded using the fast loading method. This is only possible if the dataset has a "default" config. We don't recommend using this method; prefer separate subsets for loading datasets. This was used only for historical reasons and will be removed in the future.
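
A minimal sketch of running a model on a single task (the model name is illustrative; mteb.get_model loads an MTEB-compatible model):

>>> import mteb
>>> task = mteb.get_task("BornholmBitextMining")
>>> model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
>>> scores = task.evaluate(model, split="test", encode_kwargs={})  # dict mapping HF subsets to scores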

Source code in mteb/abstasks/AbsTask.py
class AbsTask(ABC):
    """The abstract class for the tasks

    Attributes:
        metadata: The metadata describing the task
        dataset: The dataset represented as a dictionary of the form {"hf subset": {"split": Dataset}} where "split" is the dataset split (e.g. "test")
            and Dataset is a datasets.Dataset object. "hf subset" is the data subset on Huggingface, typically used to denote the language, e.g.
            datasets.load_dataset("data", "en"). If the dataset does not have a subset this is simply "default".
        seed: The random seed used for reproducibility.
        hf_subsets: The list of Huggingface subsets to use.
        data_loaded: Denotes if the dataset is loaded or not. This is used to avoid loading the dataset multiple times.
        abstask_prompt: The potential prompt of the abstask
        superseded_by: Denotes the task that this task is superseded by. Used to issue a warning to users of outdated datasets, while maintaining
            reproducibility of existing benchmarks.
        fast_loading: **Deprecated**. Denotes if the task should be loaded using the fast loading method.
            This is only possible if the dataset has a "default" config. We don't recommend using this method; prefer separate subsets for loading datasets.
            This was used only for historical reasons and will be removed in the future.
    """

    metadata: TaskMetadata
    abstask_prompt: str | None = None
    _eval_splits: list[str] | None = None
    superseded_by: str | None = None
    dataset: dict[HFSubset, DatasetDict] | None = None
    data_loaded: bool = False
    hf_subsets: list[HFSubset]
    fast_loading: bool = False

    support_cross_encoder: bool = False
    support_search: bool = False

    def __init__(self, seed: int = 42, **kwargs: Any):
        """The init function. This is called primarily to set the seed.

        Args:
            seed: An integer seed.
            kwargs: arguments passed to subclasses.
        """
        self.seed = seed
        self.rng_state, self.np_rng = set_seed(seed)
        self.hf_subsets = self.metadata.hf_subsets

    def check_if_dataset_is_superseded(self):
        """Check if the dataset is superseded by a newer version."""
        if self.superseded_by:
            logger.warning(
                f"Dataset '{self.metadata.name}' is superseded by '{self.superseded_by}', you might consider using the newer version of the dataset."
            )

    def dataset_transform(self):
        """A transform operations applied to the dataset after loading.

        This method is useful when the dataset from Huggingface is not in an `mteb` compatible format.
        Override this method if your dataset requires additional transformation.
        """
        pass

    def evaluate(
        self,
        model: MTEBModels,
        split: str = "test",
        subsets_to_run: list[HFSubset] | None = None,
        *,
        encode_kwargs: dict[str, Any],
        prediction_folder: Path | None = None,
        **kwargs: Any,
    ) -> dict[HFSubset, ScoresDict]:
        """Evaluates an MTEB compatible model on the task.

        Args:
            model: MTEB compatible model. Implements an encode(sentences) method that encodes sentences and returns an array of embeddings
            split: Which split (e.g. *"test"*) to be used.
            subsets_to_run: List of huggingface subsets (HFSubsets) to evaluate. If None, all subsets are evaluated.
            encode_kwargs: Additional keyword arguments that are passed to the model's `encode` method.
            prediction_folder: Folder to save model predictions
            kwargs: Additional keyword arguments that are passed to the _evaluate_subset method.
        """
        if isinstance(model, CrossEncoderProtocol) and not self.support_cross_encoder:
            raise TypeError(
                f"Model {model} is a CrossEncoder, but this task {self.metadata.name} does not support CrossEncoders. "
                "Please use a Encoder model instead."
            )

        # encoders might implement search protocols
        if (
            isinstance(model, SearchProtocol)
            and not isinstance(model, Encoder)
            and not self.support_search
        ):
            raise TypeError(
                f"Model {model} is a SearchProtocol, but this task {self.metadata.name} does not support Search. "
                "Please use a Encoder model instead."
            )

        if not self.data_loaded:
            self.load_data()

        self.dataset = cast(dict[HFSubset, DatasetDict], self.dataset)

        scores = {}
        if self.hf_subsets is None:
            hf_subsets = list(self.dataset.keys())
        else:
            hf_subsets = copy(self.hf_subsets)

        if subsets_to_run is not None:  # allow overwrites of pre-filtering
            hf_subsets = [s for s in hf_subsets if s in subsets_to_run]

        for hf_subset in hf_subsets:
            logger.info(
                f"Task: {self.metadata.name}, split: {split}, subset: {hf_subset}. Running..."
            )
            if hf_subset not in self.dataset and hf_subset == "default":
                data_split = self.dataset[split]
            else:
                data_split = self.dataset[hf_subset][split]
            scores[hf_subset] = self._evaluate_subset(
                model,
                data_split,
                hf_split=split,
                hf_subset=hf_subset,
                encode_kwargs=encode_kwargs,
                prediction_folder=prediction_folder,
                **kwargs,
            )
            self._add_main_score(scores[hf_subset])
        return scores

    @abstractmethod
    def _evaluate_subset(
        self,
        model: Encoder,
        data_split: Dataset,
        *,
        encode_kwargs: dict[str, Any],
        hf_split: str,
        hf_subset: str,
        prediction_folder: Path | None = None,
        **kwargs: Any,
    ) -> ScoresDict:
        raise NotImplementedError(
            "If you are using the default evaluate method, you must implement _evaluate_subset method."
        )

    def save_task_predictions(
        self,
        predictions: dict[str, Any],
        model: MTEBModels,
        prediction_folder: Path,
        hf_split: str,
        hf_subset: str,
    ) -> None:
        predictions_path = self._predictions_path(prediction_folder)
        existing_results = {
            "mteb_model_meta": {
                "model_name": model.mteb_model_meta.name,
                "revision": model.mteb_model_meta.revision,
            }
        }
        if predictions_path.exists():
            with predictions_path.open("r") as predictions_file:
                existing_results = json.load(predictions_file)

        if hf_subset not in existing_results:
            existing_results[hf_subset] = {}

        existing_results[hf_subset][hf_split] = predictions
        with predictions_path.open("w") as predictions_file:
            json.dump(existing_results, predictions_file)

    def _predictions_path(
        self,
        output_folder: Path | str,
    ) -> Path:
        if isinstance(output_folder, str):
            output_folder = Path(output_folder)

        if not output_folder.exists():
            output_folder.mkdir(parents=True, exist_ok=True)
        return output_folder / self.prediction_file_name

    @property
    def prediction_file_name(self) -> str:
        return f"{self.metadata.name}_predictions.json"

    @staticmethod
    def stratified_subsampling(
        dataset_dict: datasets.DatasetDict,
        seed: int,
        splits: list[str] = ["test"],
        label: str = "label",
        n_samples: int = 2048,
    ) -> datasets.DatasetDict:
        """Subsamples the dataset with stratification by the supplied label.
        Returns a DatasetDict object.

        Args:
            dataset_dict: the DatasetDict object.
            seed: the random seed.
            splits: the splits of the dataset.
            label: the label on which the stratified sampling is based.
            n_samples: Optional; number of samples to subsample. Defaults to 2048.
        """
        ## Can only do this if the label column is of ClassLabel.
        if not isinstance(dataset_dict[splits[0]].features[label], datasets.ClassLabel):
            try:
                dataset_dict = dataset_dict.class_encode_column(label)
            except ValueError as e:
                if isinstance(dataset_dict[splits[0]][label][0], Sequence):
                    return _multilabel_subsampling(
                        dataset_dict, seed, splits, label, n_samples
                    )
                else:
                    raise e

        for split in splits:
            if n_samples >= len(dataset_dict[split]):
                logger.debug(
                    f"Subsampling not needed for split {split}, as n_samples is equal or greater than the number of samples."
                )
                continue
            dataset_dict.update(
                {
                    split: dataset_dict[split].train_test_split(
                        test_size=n_samples, seed=seed, stratify_by_column=label
                    )["test"]
                }
            )  ## only take the specified test split.
        return dataset_dict

    def load_data(self) -> None:
        """Loads dataset from HuggingFace hub

        This is the main loading function for Task. Do not overwrite this, instead we recommend using `dataset_transform`, which is called after the
        dataset is loaded using `datasets.load_dataset`.
        """
        if self.data_loaded:
            return
        if self.metadata.is_multilingual:
            if self.fast_loading:
                self.fast_load()
            else:
                self.dataset = {}
                for hf_subset in self.hf_subsets:
                    self.dataset[hf_subset] = datasets.load_dataset(
                        name=hf_subset,
                        **self.metadata.dataset,
                    )
        else:
            # some monolingual datasets explicitly add the split name to the dataset name
            self.dataset = datasets.load_dataset(**self.metadata.dataset)  # type: ignore
        self.dataset_transform()
        self.data_loaded = True

    def fast_load(self) -> None:
        """**Deprecated**. Load all subsets at once, then group by language. Using fast loading has two requirements:

        - Each row in the dataset should have a 'lang' feature giving the corresponding language/language pair
        - The dataset must have a 'default' config that loads all the subsets of the dataset (see more [here](https://huggingface.co/docs/datasets/en/repository_structure#configurations))
        """
        self.dataset = {}
        merged_dataset = datasets.load_dataset(
            **self.metadata.dataset
        )  # load "default" subset
        for split in merged_dataset.keys():
            df_split = merged_dataset[split].to_polars()
            df_grouped = dict(df_split.group_by(["lang"]))
            for lang in set(df_split["lang"].unique()) & set(self.hf_subsets):
                self.dataset.setdefault(lang, {})
                self.dataset[lang][split] = datasets.Dataset.from_polars(
                    df_grouped[(lang,)].drop("lang")
                )  # Remove lang column and convert back to HF datasets, not strictly necessary but better for compatibility
        for lang, subset in self.dataset.items():
            self.dataset[lang] = datasets.DatasetDict(subset)

    def calculate_descriptive_statistics(
        self, overwrite_results: bool = False
    ) -> dict[str, DescriptiveStatistics]:
        """Calculates descriptive statistics from the dataset."""
        from mteb.abstasks import AbsTaskAnyClassification

        if self.metadata.descriptive_stat_path.exists() and not overwrite_results:
            logger.info("Loading metadata descriptive statistics from cache.")
            return self.metadata.descriptive_stats

        if not self.data_loaded:
            self.load_data()

        descriptive_stats: dict[str, DescriptiveStatistics] = {}
        hf_subset_stat = "hf_subset_descriptive_stats"
        eval_splits = self.metadata.eval_splits
        if isinstance(self, AbsTaskAnyClassification):
            eval_splits.append(self.train_split)

        pbar_split = tqdm.tqdm(eval_splits, desc="Processing Splits...")
        for split in pbar_split:
            pbar_split.set_postfix_str(f"Split: {split}")
            logger.info(f"Processing metadata for split {split}")
            if self.metadata.is_multilingual:
                descriptive_stats[split] = (
                    self._calculate_descriptive_statistics_from_split(
                        split, compute_overall=True
                    )
                )
                descriptive_stats[split][hf_subset_stat] = {}

                pbar_subsets = tqdm.tqdm(
                    self.metadata.hf_subsets,
                    desc="Processing Languages...",
                )
                for hf_subset in pbar_subsets:
                    pbar_subsets.set_postfix_str(f"Huggingface subset: {hf_subset}")
                    logger.info(f"Processing metadata for subset {hf_subset}")
                    split_details = self._calculate_descriptive_statistics_from_split(
                        split, hf_subset
                    )
                    descriptive_stats[split][hf_subset_stat][hf_subset] = split_details
            else:
                split_details = self._calculate_descriptive_statistics_from_split(split)
                descriptive_stats[split] = split_details

        with self.metadata.descriptive_stat_path.open("w") as f:
            json.dump(descriptive_stats, f, indent=4)

        return descriptive_stats

    def calculate_metadata_metrics(
        self, overwrite_results: bool = False
    ) -> dict[str, DescriptiveStatistics]:
        return self.calculate_descriptive_statistics(
            overwrite_results=overwrite_results
        )

    @abstractmethod
    def _calculate_descriptive_statistics_from_split(
        self, split: str, hf_subset: str | None = None, compute_overall: bool = False
    ) -> SplitDescriptiveStatistics:
        raise NotImplementedError

    @property
    def languages(self) -> list[str]:
        """Returns the languages of the task."""
        if self.hf_subsets:
            eval_langs = self.metadata.hf_subsets_to_langscripts
            languages = []

            for lang in self.hf_subsets:
                for langscript in eval_langs[lang]:
                    iso_lang, script = langscript.split("-")
                    languages.append(iso_lang)

            return sorted(set(languages))

        return self.metadata.languages

    def filter_eval_splits(self, eval_splits: list[str] | None) -> AbsTask:
        """Filter the evaluation splits of the task.

        Args:
            eval_splits: A list of evaluation splits to keep. If None, all splits are kept.

        Returns:
            The filtered task
        """
        self._eval_splits = eval_splits
        return self

    def filter_modalities(
        self, modalities: list[str] | None, exclusive_modality_filter: bool = False
    ) -> AbsTask:
        """Filter the modalities of the task.

        Args:
            modalities: A list of modalities to filter by. If None, the task is returned unchanged.
            exclusive_modality_filter: If True, only keep tasks where _all_ filter modalities are included in the
                task's modalities and ALL task modalities are in filter modalities (exact match).
                If False, keep tasks if _any_ of the task's modalities match the filter modalities.

        Returns:
            The filtered task
        """
        if modalities is None:
            return self
        filter_modalities_set = set(modalities)
        task_modalities_set = set(self.modalities)
        if exclusive_modality_filter:
            if not (filter_modalities_set == task_modalities_set):
                self.metadata.modalities = []
        else:
            if not filter_modalities_set.intersection(task_modalities_set):
                self.metadata.modalities = []
        return self

    def filter_languages(
        self,
        languages: list[str] | None,
        script: list[str] | None = None,
        hf_subsets: list[HFSubset] | None = None,
        exclusive_language_filter: bool = False,
    ) -> AbsTask:
        """Filter the languages of the task.

        Args:
            languages: A list of languages to filter the task by; each entry can be either a 3-letter language code (e.g. "eng") or also
                include the script (e.g. "eng-Latn")
            script: A list of scripts to filter the task by. Ignored if the language code specifies the script. If None, all scripts are
                included. If the language code does not specify the script, the intersection of the language and script filters is used.
            hf_subsets: A list of huggingface subsets to filter on. This is useful if a dataset has multiple subsets containing the desired
                language, but you only want to test on one. An example is STS22, which has both "en" and "de-en" subsets, both of which
                contain English.
            exclusive_language_filter: Some datasets contain more than one language; e.g. for STS22 the subset "de-en" contains eng and deu.
                If exclusive_language_filter is set to False, both of these are kept; if set to True, only subsets that contain all of the
                specified languages are kept.

        Returns:
            The filtered task
        """
        lang_scripts = LanguageScripts.from_languages_and_scripts(languages, script)

        subsets_to_keep = []

        for hf_subset, langs in self.metadata.hf_subsets_to_langscripts.items():
            if (hf_subsets is not None) and (hf_subset not in hf_subsets):
                continue
            if exclusive_language_filter is False:
                for langscript in langs:
                    if lang_scripts.contains_language(
                        langscript
                    ) or lang_scripts.contains_script(langscript):
                        subsets_to_keep.append(hf_subset)
                        break

            if exclusive_language_filter is True and languages:
                if lang_scripts.contains_languages(langs):
                    subsets_to_keep.append(hf_subset)

        self.hf_subsets = subsets_to_keep
        return self

    def _add_main_score(self, scores: ScoresDict) -> None:
        scores["main_score"] = scores[self.metadata.main_score]

    def _upload_dataset_to_hub(
        self, repo_name: str, fields: list[str] | dict[str, str]
    ) -> None:
        if self.metadata.is_multilingual:
            for config in self.metadata.eval_langs:
                logger.info(f"Converting {config} of {self.metadata.name}")
                sentences = {}
                for split in self.dataset[config]:
                    if isinstance(fields, dict):
                        sentences[split] = Dataset.from_dict(
                            {
                                mapped_name: self.dataset[config][split][original_name]
                                for original_name, mapped_name in fields.items()
                            }
                        )
                    else:
                        sentences[split] = Dataset.from_dict(
                            {
                                field: self.dataset[config][split][field]
                                for field in fields
                            }
                        )
                sentences = DatasetDict(sentences)
                sentences.push_to_hub(
                    repo_name, config, commit_message=f"Add {config} dataset"
                )
        else:
            sentences = {}
            for split in self.dataset:
                if isinstance(fields, dict):
                    sentences[split] = Dataset.from_dict(
                        {
                            mapped_name: self.dataset[split][original_name]
                            for original_name, mapped_name in fields.items()
                        }
                    )
                else:
                    sentences[split] = Dataset.from_dict(
                        {field: self.dataset[split][field] for field in fields}
                    )
            sentences = DatasetDict(sentences)
            sentences.push_to_hub(repo_name, commit_message="Add dataset")

    def _push_dataset_to_hub(self, repo_name: str) -> None:
        raise NotImplementedError

    def push_dataset_to_hub(self, repo_name: str) -> None:
        """Push the dataset to the HuggingFace Hub.

        Args:
            repo_name: The name of the repository to push the dataset to.

        Examples:
            >>> import mteb
            >>> task = mteb.get_task("Caltech101")
            >>> repo_name = f"myorg/{task.metadata.name}"
            >>> # Push the dataset to the Hub
            >>> task.push_dataset_to_hub(repo_name)
        """
        if not self.data_loaded:
            self.load_data()

        self._push_dataset_to_hub(repo_name)
        # pushing the card alone does not create the dataset repo, so push the dataset first
        self.metadata.push_dataset_card_to_hub(repo_name)

    @property
    def is_aggregate(self) -> bool:
        """Whether the task is an aggregate of multiple tasks."""
        return False

    @property
    def eval_splits(self) -> list[str]:
        if self._eval_splits:
            return self._eval_splits
        return self.metadata.eval_splits

    @property
    def modalities(self) -> list[Modalities]:
        """Returns the modalities of the task."""
        return self.metadata.modalities

    def __repr__(self) -> str:
        # Format the representation of the task such that it appears as:
        # TaskObjectName(name='{name}', languages={lang1, lang2, ...})

        langs = self.languages
        if len(langs) > 3:
            langs = langs[:3]
            langs.append("...")
        return (
            f"{self.__class__.__name__}(name='{self.metadata.name}', languages={langs})"
        )

    def __hash__(self) -> int:
        return hash(self.metadata)

    def unload_data(self) -> None:
        """Unloads the dataset from memory"""
        if self.data_loaded:
            self.dataset = None
            self.data_loaded = False
            logger.info(f"Unloaded dataset {self.metadata.name} from memory.")
        else:
            logger.warning(
                f"Dataset {self.metadata.name} is not loaded, cannot unload it."
            )
is_aggregate property

Whether the task is an aggregate of multiple tasks.

languages property

Returns the languages of the task.

modalities property

Returns the modalities of the task.

__init__(seed=42, **kwargs)

The init function. This is called primarily to set the seed.

Parameters:

Name Type Description Default
seed int

An integer seed.

42
kwargs Any

arguments passed to subclasses.

{}
Source code in mteb/abstasks/AbsTask.py
def __init__(self, seed: int = 42, **kwargs: Any):
    """The init function. This is called primarily to set the seed.

    Args:
        seed: An integer seed.
        kwargs: arguments passed to subclasses.
    """
    self.seed = seed
    self.rng_state, self.np_rng = set_seed(seed)
    self.hf_subsets = self.metadata.hf_subsets
calculate_descriptive_statistics(overwrite_results=False)

Calculates descriptive statistics from the dataset.

Source code in mteb/abstasks/AbsTask.py
def calculate_descriptive_statistics(
    self, overwrite_results: bool = False
) -> dict[str, DescriptiveStatistics]:
    """Calculates descriptive statistics from the dataset."""
    from mteb.abstasks import AbsTaskAnyClassification

    if self.metadata.descriptive_stat_path.exists() and not overwrite_results:
        logger.info("Loading metadata descriptive statistics from cache.")
        return self.metadata.descriptive_stats

    if not self.data_loaded:
        self.load_data()

    descriptive_stats: dict[str, DescriptiveStatistics] = {}
    hf_subset_stat = "hf_subset_descriptive_stats"
    eval_splits = self.metadata.eval_splits
    if isinstance(self, AbsTaskAnyClassification):
        eval_splits.append(self.train_split)

    pbar_split = tqdm.tqdm(eval_splits, desc="Processing Splits...")
    for split in pbar_split:
        pbar_split.set_postfix_str(f"Split: {split}")
        logger.info(f"Processing metadata for split {split}")
        if self.metadata.is_multilingual:
            descriptive_stats[split] = (
                self._calculate_descriptive_statistics_from_split(
                    split, compute_overall=True
                )
            )
            descriptive_stats[split][hf_subset_stat] = {}

            pbar_subsets = tqdm.tqdm(
                self.metadata.hf_subsets,
                desc="Processing Languages...",
            )
            for hf_subset in pbar_subsets:
                pbar_subsets.set_postfix_str(f"Huggingface subset: {hf_subset}")
                logger.info(f"Processing metadata for subset {hf_subset}")
                split_details = self._calculate_descriptive_statistics_from_split(
                    split, hf_subset
                )
                descriptive_stats[split][hf_subset_stat][hf_subset] = split_details
        else:
            split_details = self._calculate_descriptive_statistics_from_split(split)
            descriptive_stats[split] = split_details

    with self.metadata.descriptive_stat_path.open("w") as f:
        json.dump(descriptive_stats, f, indent=4)

    return descriptive_stats
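
A minimal usage sketch (the task name is taken from the examples elsewhere on this page; the exact split names depend on the task):

>>> import mteb
>>> task = mteb.get_task("Caltech101")
>>> stats = task.calculate_descriptive_statistics()
>>> sorted(stats.keys())  # one entry per evaluation split, e.g. ['test']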
check_if_dataset_is_superseded()

Check if the dataset is superseded by a newer version.

Source code in mteb/abstasks/AbsTask.py
def check_if_dataset_is_superseded(self):
    """Check if the dataset is superseded by a newer version."""
    if self.superseded_by:
        logger.warning(
            f"Dataset '{self.metadata.name}' is superseded by '{self.superseded_by}', you might consider using the newer version of the dataset."
        )
dataset_transform()

Transform operations applied to the dataset after loading.

This method is useful when the dataset from Huggingface is not in an mteb-compatible format. Override this method if your dataset requires additional transformation.

Source code in mteb/abstasks/AbsTask.py
def dataset_transform(self):
    """A transform operations applied to the dataset after loading.

    This method is useful when the dataset from Huggingface is not in an `mteb` compatible format.
    Override this method if your dataset requires additional transformation.
    """
    pass
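
As a minimal sketch, an override might rename columns and drop unused splits. The subclass and column names below are illustrative assumptions, not part of mteb:

import datasets

from mteb.abstasks import AbsTaskAnyClassification

class MyTask(AbsTaskAnyClassification):
    def dataset_transform(self) -> None:
        # rename the raw text column to the name the evaluator expects
        self.dataset = self.dataset.rename_column("sentence", "text")
        # keep only the splits that are actually evaluated
        self.dataset = datasets.DatasetDict(
            {split: self.dataset[split] for split in self.metadata.eval_splits}
        )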
evaluate(model, split='test', subsets_to_run=None, *, encode_kwargs, prediction_folder=None, **kwargs)

Evaluates an MTEB compatible model on the task.

Parameters:

Name Type Description Default
model MTEBModels

MTEB compatible model. Implements an encode(sentences) method that encodes sentences and returns an array of embeddings.

required
split str

Which split (e.g. "test") to use.

'test'
subsets_to_run list[HFSubset] | None

List of huggingface subsets (HFSubsets) to evaluate. If None, all subsets are evaluated.

None
encode_kwargs dict[str, Any]

Additional keyword arguments that are passed to the model's encode method.

required
prediction_folder Path | None

Folder in which to save model predictions.

None
kwargs Any

Additional keyword arguments that are passed to the _evaluate_subset method.

{}
Source code in mteb/abstasks/AbsTask.py
def evaluate(
    self,
    model: MTEBModels,
    split: str = "test",
    subsets_to_run: list[HFSubset] | None = None,
    *,
    encode_kwargs: dict[str, Any],
    prediction_folder: Path | None = None,
    **kwargs: Any,
) -> dict[HFSubset, ScoresDict]:
    """Evaluates an MTEB compatible model on the task.

    Args:
        model: MTEB compatible model. Implements an encode(sentences) method that encodes sentences and returns an array of embeddings
        split: Which split (e.g. *"test"*) to use.
        subsets_to_run: List of huggingface subsets (HFSubsets) to evaluate. If None, all subsets are evaluated.
        encode_kwargs: Additional keyword arguments that are passed to the model's `encode` method.
        prediction_folder: Folder to save model predictions
        kwargs: Additional keyword arguments that are passed to the _evaluate_subset method.
    """
    if isinstance(model, CrossEncoderProtocol) and not self.support_cross_encoder:
        raise TypeError(
            f"Model {model} is a CrossEncoder, but this task {self.metadata.name} does not support CrossEncoders. "
            "Please use a Encoder model instead."
        )

    # encoders might implement search protocols
    if (
        isinstance(model, SearchProtocol)
        and not isinstance(model, Encoder)
        and not self.support_search
    ):
        raise TypeError(
            f"Model {model} is a SearchProtocol, but this task {self.metadata.name} does not support Search. "
            "Please use a Encoder model instead."
        )

    if not self.data_loaded:
        self.load_data()

    self.dataset = cast(dict[HFSubset, DatasetDict], self.dataset)

    scores = {}
    if self.hf_subsets is None:
        hf_subsets = list(self.dataset.keys())
    else:
        hf_subsets = copy(self.hf_subsets)

    if subsets_to_run is not None:  # allow overwrites of pre-filtering
        hf_subsets = [s for s in hf_subsets if s in subsets_to_run]

    for hf_subset in hf_subsets:
        logger.info(
            f"Task: {self.metadata.name}, split: {split}, subset: {hf_subset}. Running..."
        )
        if hf_subset not in self.dataset and hf_subset == "default":
            data_split = self.dataset[split]
        else:
            data_split = self.dataset[hf_subset][split]
        scores[hf_subset] = self._evaluate_subset(
            model,
            data_split,
            hf_split=split,
            hf_subset=hf_subset,
            encode_kwargs=encode_kwargs,
            prediction_folder=prediction_folder,
            **kwargs,
        )
        self._add_main_score(scores[hf_subset])
    return scores
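
A minimal usage sketch (get_model and the checkpoint name are assumptions about your setup; the "default" subset applies to monolingual tasks):

>>> import mteb
>>> task = mteb.get_task("Caltech101")
>>> model = mteb.get_model("openai/clip-vit-base-patch32")
>>> scores = task.evaluate(model, split="test", encode_kwargs={"batch_size": 32})
>>> scores["default"]["main_score"]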
fast_load()

Deprecated. Load all subsets at once, then group by language. Using fast loading has two requirements:

  • Each row in the dataset should have a 'lang' feature giving the corresponding language/language pair
  • The dataset must have a 'default' config that loads all the subsets of the dataset (see more here)
Source code in mteb/abstasks/AbsTask.py
def fast_load(self) -> None:
    """**Deprecated**. Load all subsets at once, then group by language. Using fast loading has two requirements:

    - Each row in the dataset should have a 'lang' feature giving the corresponding language/language pair
    - The dataset must have a 'default' config that loads all the subsets of the dataset (see more [here](https://huggingface.co/docs/datasets/en/repository_structure#configurations))
    """
    self.dataset = {}
    merged_dataset = datasets.load_dataset(
        **self.metadata.dataset
    )  # load "default" subset
    for split in merged_dataset.keys():
        df_split = merged_dataset[split].to_polars()
        df_grouped = dict(df_split.group_by(["lang"]))
        for lang in set(df_split["lang"].unique()) & set(self.hf_subsets):
            self.dataset.setdefault(lang, {})
            self.dataset[lang][split] = datasets.Dataset.from_polars(
                df_grouped[(lang,)].drop("lang")
            )  # Remove lang column and convert back to HF datasets, not strictly necessary but better for compatibility
    for lang, subset in self.dataset.items():
        self.dataset[lang] = datasets.DatasetDict(subset)
filter_eval_splits(eval_splits)

Filter the evaluation splits of the task.

Parameters:

Name Type Description Default
eval_splits list[str] | None

A list of evaluation splits to keep. If None, all splits are kept.

required

Returns:

Type Description
AbsTask

The filtered task

Source code in mteb/abstasks/AbsTask.py
def filter_eval_splits(self, eval_splits: list[str] | None) -> AbsTask:
    """Filter the evaluation splits of the task.

    Args:
        eval_splits: A list of evaluation splits to keep. If None, all splits are kept.

    Returns:
        The filtered task
    """
    self._eval_splits = eval_splits
    return self
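
For example, to restrict evaluation to the test split (a sketch; the task name is illustrative):

>>> import mteb
>>> task = mteb.get_task("Caltech101").filter_eval_splits(["test"])
>>> task.eval_splits
['test']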
filter_languages(languages, script=None, hf_subsets=None, exclusive_language_filter=False)

Filter the languages of the task.

Parameters:

Name Type Description Default
languages list[str] | None

A list of languages to filter the task by; each entry can be either a 3-letter language code (e.g. "eng") or also include the script (e.g. "eng-Latn").

required
script list[str] | None

A list of scripts to filter the task by. Ignored if the language code specifies the script. If None, all scripts are included. If the language code does not specify the script, the intersection of the language and script filters is used.

None
hf_subsets list[HFSubset] | None

A list of huggingface subsets to filter on. This is useful if a dataset has multiple subsets containing the desired language, but you only want to test on one. An example is STS22, which has both "en" and "de-en" subsets, both of which contain English.

None
exclusive_language_filter bool

Some datasets contain more than one language; e.g. for STS22 the subset "de-en" contains eng and deu. If exclusive_language_filter is set to False, both of these are kept; if set to True, only subsets that contain all of the specified languages are kept.

False

Returns:

Type Description
AbsTask

The filtered task

Source code in mteb/abstasks/AbsTask.py
def filter_languages(
    self,
    languages: list[str] | None,
    script: list[str] | None = None,
    hf_subsets: list[HFSubset] | None = None,
    exclusive_language_filter: bool = False,
) -> AbsTask:
    """Filter the languages of the task.

    Args:
        languages: A list of languages to filter the task by; each entry can be either a 3-letter language code (e.g. "eng") or also
            include the script (e.g. "eng-Latn")
        script: A list of scripts to filter the task by. Ignored if the language code specifies the script. If None, all scripts are
            included. If the language code does not specify the script, the intersection of the language and script filters is used.
        hf_subsets: A list of huggingface subsets to filter on. This is useful if a dataset has multiple subsets containing the desired
            language, but you only want to test on one. An example is STS22, which has both "en" and "de-en" subsets, both of which
            contain English.
        exclusive_language_filter: Some datasets contain more than one language; e.g. for STS22 the subset "de-en" contains eng and deu.
            If exclusive_language_filter is set to False, both of these are kept; if set to True, only subsets that contain all of the
            specified languages are kept.

    Returns:
        The filtered task
    """
    lang_scripts = LanguageScripts.from_languages_and_scripts(languages, script)

    subsets_to_keep = []

    for hf_subset, langs in self.metadata.hf_subsets_to_langscripts.items():
        if (hf_subsets is not None) and (hf_subset not in hf_subsets):
            continue
        if exclusive_language_filter is False:
            for langscript in langs:
                if lang_scripts.contains_language(
                    langscript
                ) or lang_scripts.contains_script(langscript):
                    subsets_to_keep.append(hf_subset)
                    break

        if exclusive_language_filter is True and languages:
            if lang_scripts.contains_languages(langs):
                subsets_to_keep.append(hf_subset)

    self.hf_subsets = subsets_to_keep
    return self
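
A sketch using the STS22 example from the docstring (the exact subsets kept depend on the dataset's metadata):

>>> import mteb
>>> task = mteb.get_task("STS22")
>>> task = task.filter_languages(["deu"], exclusive_language_filter=False)
>>> task.hf_subsets  # every subset containing German, e.g. "de" and "de-en"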
filter_modalities(modalities, exclusive_modality_filter=False)

Filter the modalities of the task.

Parameters:

Name Type Description Default
modalities list[str] | None

A list of modalities to filter by. If None, the task is returned unchanged.

required
exclusive_modality_filter bool

If True, only keep tasks where all filter modalities are included in the task's modalities and ALL task modalities are in filter modalities (exact match). If False, keep tasks if any of the task's modalities match the filter modalities.

False

Returns:

Type Description
AbsTask

The filtered task

Source code in mteb/abstasks/AbsTask.py
def filter_modalities(
    self, modalities: list[str] | None, exclusive_modality_filter: bool = False
) -> AbsTask:
    """Filter the modalities of the task.

    Args:
        modalities: A list of modalities to filter by. If None, the task is returned unchanged.
        exclusive_modality_filter: If True, only keep tasks where _all_ filter modalities are included in the
            task's modalities and ALL task modalities are in filter modalities (exact match).
            If False, keep tasks if _any_ of the task's modalities match the filter modalities.

    Returns:
        The filtered task
    """
    if modalities is None:
        return self
    filter_modalities_set = set(modalities)
    task_modalities_set = set(self.modalities)
    if exclusive_modality_filter:
        if not (filter_modalities_set == task_modalities_set):
            self.metadata.modalities = []
    else:
        if not filter_modalities_set.intersection(task_modalities_set):
            self.metadata.modalities = []
    return self
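
For example, on a text-only task an overlapping filter leaves the modalities unchanged (a sketch):

>>> import mteb
>>> task = mteb.get_task("STS22")
>>> task.filter_modalities(["text"]).modalities
['text']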
load_data()

Loads the dataset from the HuggingFace Hub.

This is the main loading function for the task. Do not override it; instead, override dataset_transform, which is called after the dataset is loaded using datasets.load_dataset.

Source code in mteb/abstasks/AbsTask.py
def load_data(self) -> None:
    """Loads dataset from HuggingFace hub

    This is the main loading function for Task. Do not overwrite this, instead we recommend using `dataset_transform`, which is called after the
    dataset is loaded using `datasets.load_dataset`.
    """
    if self.data_loaded:
        return
    if self.metadata.is_multilingual:
        if self.fast_loading:
            self.fast_load()
        else:
            self.dataset = {}
            for hf_subset in self.hf_subsets:
                self.dataset[hf_subset] = datasets.load_dataset(
                    name=hf_subset,
                    **self.metadata.dataset,
                )
    else:
        # some monolingual datasets explicitly add the split name to the dataset name
        self.dataset = datasets.load_dataset(**self.metadata.dataset)  # type: ignore
    self.dataset_transform()
    self.data_loaded = True
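
load_data is normally invoked lazily by evaluate, but it can be called directly, e.g. to inspect the dataset (a sketch; the task name is illustrative):

>>> import mteb
>>> task = mteb.get_task("Caltech101")
>>> task.load_data()
>>> task.data_loaded
True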
push_dataset_to_hub(repo_name)

Push the dataset to the HuggingFace Hub.

Parameters:

Name Type Description Default
repo_name str

The name of the repository to push the dataset to.

required

Examples:

>>> import mteb
>>> task = mteb.get_task("Caltech101")
>>> repo_name = f"myorg/{task.metadata.name}"
>>> # Push the dataset to the Hub
>>> task.push_dataset_to_hub(repo_name)
Source code in mteb/abstasks/AbsTask.py
def push_dataset_to_hub(self, repo_name: str) -> None:
    """Push the dataset to the HuggingFace Hub.

    Args:
        repo_name: The name of the repository to push the dataset to.

    Examples:
        >>> import mteb
        >>> task = mteb.get_task("Caltech101")
        >>> repo_name = f"myorg/{task.metadata.name}"
        >>> # Push the dataset to the Hub
        >>> task.push_dataset_to_hub(repo_name)
    """
    if not self.data_loaded:
        self.load_data()

    self._push_dataset_to_hub(repo_name)
    # pushing the card alone does not create the dataset repo, so push the dataset first
    self.metadata.push_dataset_card_to_hub(repo_name)
stratified_subsampling(dataset_dict, seed, splits=['test'], label='label', n_samples=2048) staticmethod

Subsamples the dataset with stratification by the supplied label. Returns a DatasetDict object.

Parameters:

Name Type Description Default
dataset_dict DatasetDict

the DatasetDict object.

required
seed int

the random seed.

required
splits list[str]

the splits of the dataset.

['test']
label str

the label on which the stratified sampling is based.

'label'
n_samples int

Optional; number of samples to subsample. Defaults to 2048.

2048
Source code in mteb/abstasks/AbsTask.py
@staticmethod
def stratified_subsampling(
    dataset_dict: datasets.DatasetDict,
    seed: int,
    splits: list[str] = ["test"],
    label: str = "label",
    n_samples: int = 2048,
) -> datasets.DatasetDict:
    """Subsamples the dataset with stratification by the supplied label.
    Returns a DatasetDict object.

    Args:
        dataset_dict: the DatasetDict object.
        seed: the random seed.
        splits: the splits of the dataset.
        label: the label on which the stratified sampling is based.
        n_samples: Optional; number of samples to subsample. Defaults to 2048.
    """
    ## Can only do this if the label column is of ClassLabel.
    if not isinstance(dataset_dict[splits[0]].features[label], datasets.ClassLabel):
        try:
            dataset_dict = dataset_dict.class_encode_column(label)
        except ValueError as e:
            if isinstance(dataset_dict[splits[0]][label][0], Sequence):
                return _multilabel_subsampling(
                    dataset_dict, seed, splits, label, n_samples
                )
            else:
                raise e

    for split in splits:
        if n_samples >= len(dataset_dict[split]):
            logger.debug(
                f"Subsampling not needed for split {split}, as n_samples is equal or greater than the number of samples."
            )
            continue
        dataset_dict.update(
            {
                split: dataset_dict[split].train_test_split(
                    test_size=n_samples, seed=seed, stratify_by_column=label
                )["test"]
            }
        )  ## only take the specified test split.
    return dataset_dict
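
The typical call site is a subclass's dataset_transform, using the task's own seed. A sketch (the label column name depends on your dataset):

def dataset_transform(self) -> None:
    # downsample the test split to at most 2048 examples, stratified by "label"
    self.dataset = self.stratified_subsampling(
        self.dataset, seed=self.seed, splits=["test"], label="label"
    )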
unload_data()

Unloads the dataset from memory

Source code in mteb/abstasks/AbsTask.py
def unload_data(self) -> None:
    """Unloads the dataset from memory"""
    if self.data_loaded:
        self.dataset = None
        self.data_loaded = False
        logger.info(f"Unloaded dataset {self.metadata.name} from memory.")
    else:
        logger.warning(
            f"Dataset {self.metadata.name} is not loaded, cannot unload it."
        )
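
When iterating over many tasks, unloading keeps memory bounded. A sketch, where model stands in for any MTEB-compatible model you have already loaded:

>>> import mteb
>>> for task in mteb.get_tasks(task_types=["Classification"]):
...     scores = task.evaluate(model, encode_kwargs={})
...     task.unload_data()  # free the dataset before the next task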