BM25 Baselines

MTEB includes language-aware BM25 baselines that can be loaded like any other model:

import mteb

# Language-aware BM25: auto-selects stopwords, stemmer, and tokenizer from task metadata
model = mteb.get_model("mteb/baseline-bm25s")

# Subword BM25: uses a HuggingFace subword tokenizer (Qwen3) for better multilingual coverage
model = mteb.get_model("mteb/baseline-bm25s-subword")

Install the required extras first:

pip install "mteb[bm25s]"

Performance comparison on Chinese retrieval

Tokenizer choice has a large impact on retrieval quality for scripts that are written without spaces between words, such as Chinese. Results on LeCaRDv2 (Chinese legal case retrieval, 3,795 docs, 159 queries):

| Model / tokenizer | nDCG@10 |
| --- | --- |
| mteb/baseline-bm25s (mteb<=2.13.5) | 0.359 |
| mteb/baseline-bm25s (mteb>2.13.5) | 0.567 |
| mteb/baseline-bm25s-subword (Qwen3-0.6B) | 0.631 |
| Custom Jieba tokenizer (see example below) | 0.641 |

In PR #4405, the default mteb/baseline-bm25s switched from whitespace to character-level tokenization for Chinese and other logographic scripts, which corresponds to the two mteb/baseline-bm25s rows in the table. For a language-agnostic baseline, mteb/baseline-bm25s-subword performs well across scripts, but a language-specific tokenizer such as Jieba will generally give the best results when the language is known.
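To see why whitespace splitting hurts on Chinese, the sketch below compares the two strategies on a short query and document. It is plain Python for illustration only and is not the code MTEB uses internally; the function names whitespace_tokenize and char_tokenize, and the example strings, are hypothetical.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace only (illustrative; not MTEB's implementation)."""
    return text.split()


def char_tokenize(text: str) -> list[str]:
    """Emit one token per non-whitespace character (illustrative)."""
    return [ch for ch in text if not ch.isspace()]


query = "合同纠纷"  # "contract dispute"
document = "本案系合同纠纷案件"  # "this case is a contract dispute case"

# Whitespace splitting treats each unsegmented string as a single token,
# so the query and document share no terms and a lexical scorer such as
# BM25 has nothing to match on.
ws_overlap = set(whitespace_tokenize(query)) & set(whitespace_tokenize(document))

# Character-level tokens overlap on every character of the query.
char_overlap = set(char_tokenize(query)) & set(char_tokenize(document))

print(ws_overlap)
print(sorted(char_overlap))
```

With whitespace tokens the overlap is empty, while character tokens recover all four query characters. Word-level segmentation (as Jieba does) would instead match the two-character words, which is a plausible reason it scores slightly higher in the table.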

Custom tokenizer

You can pass any text -> list[str] callable as a custom tokenizer, or provide a HuggingFace tokenizer name:

import mteb

# Using a HuggingFace tokenizer by name (e.g. for a specific language)
model = mteb.get_model("mteb/baseline-bm25s", tokenizer="bert-base-multilingual-cased")

# Using a custom callable (e.g. Jieba for Chinese)
import jieba

def jieba_tokenize(text: str) -> list[str]:
    return [t for t in jieba.lcut(text) if t.strip()]

model = mteb.get_model("mteb/baseline-bm25s", tokenizer=jieba_tokenize)
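Any function with the same text -> list[str] shape should work in place of jieba_tokenize. As a minimal sketch with no third-party dependency, the hypothetical char_bigram_tokenize below emits overlapping character bigrams; it is an illustration of the callable interface, not a tokenizer shipped with MTEB, and its retrieval quality has not been measured here.

```python
def char_bigram_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: overlapping character bigrams per whitespace-separated chunk."""
    tokens: list[str] = []
    for chunk in text.split():
        if len(chunk) < 2:
            tokens.append(chunk)  # keep a lone character as its own token
            continue
        # Slide a window of width 2 across the chunk.
        tokens.extend(chunk[i : i + 2] for i in range(len(chunk) - 1))
    return tokens


print(char_bigram_tokenize("合同纠纷 案件"))

# The callable has the same shape as jieba_tokenize above, so it could be
# passed the same way (requires mteb with the bm25s extra installed):
# model = mteb.get_model("mteb/baseline-bm25s", tokenizer=char_bigram_tokenize)
```

The mteb.get_model call is left commented out so the snippet runs without MTEB installed.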