Additional Types¶
MTEB implements a variety of utility types to allow us and you to better know what a model returns. This page documents some of these types.
mteb.types
¶
Array = Union[np.ndarray, torch.Tensor]
module-attribute
¶
General array type, can be a numpy array or a torch tensor.
BatchedInput = Union[TextInput, CorpusInput, QueryInput, ImageInput, AudioInput, MultimodalInput]
module-attribute
¶
The input to the encoder for a batch of data.
Conversation = list[ConversationTurn]
module-attribute
¶
A conversation, consisting of a list of messages.
CorpusDatasetType = Dataset
module-attribute
¶
Retrieval corpus dataset, containing documents. Should have columns id
, title
, body
.
HFSubset = str
module-attribute
¶
The name of a HuggingFace dataset subset, e.g. 'en-de', 'en', 'default' (default is used when there is no subset).
ISOLanguage = str
module-attribute
¶
A string representing the language. Language is denoted as a 3-letter ISO 639-3 language code (e.g. "eng").
ISOLanguageScript = str
module-attribute
¶
ISOScript = str
module-attribute
¶
A string representing the script. The script is denoted by a 4-letter ISO 15924 script code (e.g. "Latn").
InstructionDatasetType = Dataset
module-attribute
¶
Retrieval instruction dataset, containing instructions. Should have columns query-id
, instruction
.
Languages = Union[list[ISOLanguageScript], Mapping[HFSubset, list[ISOLanguageScript]]]
module-attribute
¶
A list of languages or a mapping from HFSubset to a list of languages. E.g. ["eng-Latn", "deu-Latn"] or {"en-de": ["eng-Latn", "deu-Latn"], "fr-it": ["fra-Latn", "ita-Latn"]}.
Licenses = Literal['not specified', 'mit', 'cc-by-2.0', 'cc-by-3.0', 'cc-by-4.0', 'cc-by-sa-3.0', 'cc-by-sa-4.0', 'cc-by-nc-3.0', 'cc-by-nc-4.0', 'cc-by-nc-sa-3.0', 'cc-by-nc-sa-4.0', 'cc-by-nc-nd-4.0', 'cc-by-nd-4.0', 'openrail', 'openrail++', 'odc-by', 'afl-3.0', 'apache-2.0', 'cc-by-nd-2.1-jp', 'cc0-1.0', 'bsd-3-clause', 'gpl-3.0', 'lgpl', 'lgpl-3.0', 'cdla-sharing-1.0', 'mpl-2.0', 'msr-la-nc', 'multiple', 'gemma']
module-attribute
¶
The different licenses that a dataset or model can have. This list can be extended as needed.
Modalities = Literal['text', 'image']
module-attribute
¶
The different modalities that a model can support.
ModelName = str
module-attribute
¶
The name of a model, typically as found on HuggingFace e.g. sentence-transformers/all-MiniLM-L6-v2
.
QueryDatasetType = Dataset
module-attribute
¶
Retrieval query dataset, containing queries. Should have columns id
, text
.
RelevantDocumentsType = Mapping[str, Mapping[str, float]]
module-attribute
¶
Relevant documents for each query, mapping query IDs to a mapping of document IDs and their relevance
scores. Should have columns query-id
, corpus-id
, score
.
RetrievalOutputType = dict[str, dict[str, float]]
module-attribute
¶
Retrieval output, containing the scores for each query-document pair.
Revision = str
module-attribute
¶
The revision of a model, typically a git commit hash. For APIs this can be a version string e.g. 1
.
Score = Any
module-attribute
¶
A score value, could e.g. be accuracy. Normally it is a float or int, but it can take on any value. Should be json serializable.
ScoresDict = dict[str, Any]
module-attribute
¶
A dictionary of scores, typically also include metadata, e.g {'main_score': 0.5, 'accuracy': 0.5, 'f1': 0.6, 'hf_subset': 'en-de', 'languages': ['eng-Latn', 'deu-Latn']}
SplitName = str
module-attribute
¶
The name of a data split, e.g. 'test', 'validation', 'train'.
StrDate = Annotated[str, BeforeValidator(lambda value: str(pastdate_adapter.validate_python(value)))]
module-attribute
¶
A string that is a valid date in the past, e.g. formatted as YYYY-MM-DD.
StrURL = Annotated[str, BeforeValidator(lambda value: str(http_url_adapter.validate_python(value)))]
module-attribute
¶
A string that is a valid URL.
TopRankedDocumentsType = Mapping[str, list[str]]
module-attribute
¶
Top-ranked documents for each query, mapping query IDs to a list of document IDs. Should
have columns query-id
, corpus-ids
.
ConversationTurn
¶
Bases: TypedDict
A conversation, consisting of a list of messages.
Attributes:
Name | Type | Description |
---|---|---|
role |
str
|
The role of the message sender. |
content |
str
|
The content of the message. |
Source code in mteb/types/_encoder_io.py
27 28 29 30 31 32 33 34 35 36 |
|
PromptType
¶
Bases: str
, Enum
The type of prompt used in the input for retrieval models. Used to differentiate between queries and documents.
Source code in mteb/types/_encoder_io.py
20 21 22 23 24 |
|
RetrievalEvaluationResult
¶
Bases: NamedTuple
Holds the results of retrieval evaluation metrics.
Source code in mteb/types/_result.py
16 17 18 19 20 21 22 23 24 25 26 27 |
|