BM25 Baselines
MTEB includes language-aware BM25 baselines that can be loaded like any other model:
```python
import mteb

# Language-aware BM25: auto-selects stopwords, stemmer, and tokenizer from task metadata
model = mteb.get_model("mteb/baseline-bm25s")

# Subword BM25: uses a HuggingFace subword tokenizer (Qwen3) for better multilingual coverage
model = mteb.get_model("mteb/baseline-bm25s-subword")
```
Install the required extras first:
```bash
pip install "mteb[bm25s]"
```
Performance comparison on Chinese retrieval
The choice of tokenizer has a large impact on retrieval quality for non-Latin scripts. Results on LeCaRDv2 (Chinese legal case retrieval, 3,795 documents, 159 queries):
| Model / tokenizer | ndcg@10 |
|---|---|
| `mteb/baseline-bm25s` (mteb<=2.13.5) | 0.359 |
| `mteb/baseline-bm25s` (mteb>2.13.5) | 0.567 |
| `mteb/baseline-bm25s-subword` (Qwen3-0.6B) | 0.631 |
| Custom Jieba tokenizer (see example below) | 0.641 |
The default mteb/baseline-bm25s switched from whitespace to character-level tokenization for Chinese and other logographic scripts in PR #4405.
As a language-agnostic baseline, mteb/baseline-bm25s-subword performs well across scripts, but a language-specific tokenizer such as Jieba will generally give the best results when the language is known.
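To make the effect of the tokenizer concrete, the following minimal sketch contrasts whitespace splitting with character-level splitting on a short Chinese query and document. It is plain Python for illustration only: the function names `whitespace_tokenize` and `char_tokenize` and the example strings are hypothetical, and this is not how MTEB implements its tokenizers internally.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace only (illustrative, not MTEB's code)."""
    return text.split()


def char_tokenize(text: str) -> list[str]:
    """Emit every non-whitespace character as its own token (illustrative)."""
    return [ch for ch in text if not ch.isspace()]


query = "合同纠纷"            # "contract dispute"
document = "本案系合同纠纷案件"  # "this case is a contract dispute case"

# Chinese text has no spaces, so each string becomes a single token
# and the query shares no term with the document.
ws_overlap = set(whitespace_tokenize(query)) & set(whitespace_tokenize(document))

# Character-level tokens expose the characters the two strings share.
char_overlap = set(char_tokenize(query)) & set(char_tokenize(document))

print(len(ws_overlap), len(char_overlap))  # 0 4
```

Because BM25 only scores terms that appear in both the query and the document, a tokenizer that produces no shared terms cannot rank the relevant document at all, which is consistent with the gap between the first two rows of the table.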
Custom tokenizer
You can pass any text -> list[str] callable as a custom tokenizer, or provide a HuggingFace tokenizer name:
```python
import jieba
import mteb

# Using a HuggingFace tokenizer by name (e.g. for a specific language)
model = mteb.get_model("mteb/baseline-bm25s", tokenizer="bert-base-multilingual-cased")


# Using a custom callable (e.g. Jieba for Chinese)
def jieba_tokenize(text: str) -> list[str]:
    # Drop tokens that are empty or whitespace-only
    return [t for t in jieba.lcut(text) if t.strip()]


model = mteb.get_model("mteb/baseline-bm25s", tokenizer=jieba_tokenize)
```
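Since any `text -> list[str]` callable is accepted, a tokenizer does not need an external dependency. As a rough sketch, here is a hypothetical character-bigram tokenizer, a common dictionary-free approach for CJK text. The name `char_bigram_tokenize` is an assumption made for illustration and is not part of MTEB, and no retrieval score is claimed for it.

```python
def char_bigram_tokenize(text: str) -> list[str]:
    """Return overlapping character bigrams over non-whitespace characters.

    Hypothetical helper for illustration; not part of MTEB.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        # Too short to form a bigram: fall back to the characters themselves
        return chars
    return ["".join(pair) for pair in zip(chars, chars[1:])]


print(char_bigram_tokenize("合同纠纷"))  # ['合同', '同纠', '纠纷']
```

Such a function could be passed in the same way as `jieba_tokenize` above, via the `tokenizer` argument.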