AudioClassification¶
- Number of tasks: 38
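These tasks are run through the `mteb` Python package like any other task type. The snippet below is a minimal sketch rather than a canonical invocation: it assumes the audio tasks on this page are registered under the names shown in their headings in your installed `mteb` version, and the model identifier is a placeholder for whichever audio-capable embedding model you have available.

```python
import mteb

# Select one of the audio classification tasks on this page by its registered name
# (assumes "ESC50" is available under this name in the installed mteb version).
tasks = mteb.get_tasks(tasks=["ESC50"])

# Placeholder identifier: replace with an audio embedding model supported by your setup.
model = mteb.get_model("my-audio-embedding-model")

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")

# Each task reports accuracy as its main score (see the Score column in the tables below).
for result in results:
    print(result.task_name, result.get_score())
```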
AmbientAcousticContext¶
The Ambient Acoustic Context dataset contains 1-second audio segments of activities that occur in a workplace setting. This is a downsampled version with ~100 train and ~50 test samples per class.
Dataset: mteb/ambient-acoustic-context-small • License: not specified • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | Speech, Spoken | human-annotated | found |
Citation
@inproceedings{10.1145/3379503.3403535,
address = {New York, NY, USA},
articleno = {33},
author = {Park, Chunjong and Min, Chulhong and Bhattacharya, Sourav and Kawsar, Fahim},
booktitle = {22nd International Conference on Human-Computer Interaction with Mobile Devices and Services},
doi = {10.1145/3379503.3403535},
isbn = {9781450375160},
keywords = {Acoustic ambient context, Conversational agents},
location = {Oldenburg, Germany},
numpages = {9},
publisher = {Association for Computing Machinery},
series = {MobileHCI '20},
title = {Augmenting Conversational Agents with Ambient Acoustic Contexts},
url = {https://doi.org/10.1145/3379503.3403535},
year = {2020},
}
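Each entry's Dataset field is a repository ID on the Hugging Face Hub, so the underlying data can also be inspected directly with the `datasets` library. A minimal sketch, assuming the usual split layout; the exact column names vary between the datasets listed here:

```python
from datasets import load_dataset

# Load the downsampled Ambient Acoustic Context data from the Hub.
ds = load_dataset("mteb/ambient-acoustic-context-small")
print(ds)  # shows the available splits and their sizes

# Inspect a single example to see the audio and label columns
# (column names are not standardized across the datasets on this page).
example = ds["train"][0]
print(example.keys())
```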
BeijingOpera¶
Audio classification of percussion instruments into one of 4 classes: Bangu, Naobo, Daluo, and Xiaoluo
Dataset: mteb/beijing-opera • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | Music | human-annotated | created |
Citation
@inproceedings{6853981,
author = {Tian, Mi and Srinivasamurthy, Ajay and Sandler, Mark and Serra, Xavier},
booktitle = {2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
doi = {10.1109/ICASSP.2014.6853981},
keywords = {Decision support systems;Conferences;Acoustics;Speech;Speech processing;Time-frequency analysis;Beijing Opera;Onset Detection;Drum Transcription;Non-negative matrix factorization},
number = {},
pages = {2159-2163},
title = {A study of instrument-wise onset detection in Beijing Opera percussion ensembles},
volume = {},
year = {2014},
}
BirdCLEF¶
BirdCLEF+ 2025 dataset for species identification from audio, focused on birds, amphibians, mammals and insects from the Middle Magdalena Valley of Colombia. Downsampled to 50 classes with 20 samples each.
Dataset: mteb/birdclef25-mini • License: cc-by-nc-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | Bioacoustics, Speech, Spoken | expert-annotated | found |
Citation
@dataset{birdclef2025,
author = {Christopher},
publisher = {Hugging Face},
title = {BirdCLEF+ 2025},
url = {https://huggingface.co/datasets/christopher/birdclef-2025},
year = {2025},
}
CREMA_D¶
Emotion classification of audio into one of 6 classes: Anger, Disgust, Fear, Happy, Neutral, Sad.
Dataset: mteb/crema-d • License: http://opendatacommons.org/licenses/odbl/1.0/ • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Speech | human-annotated | created |
Citation
@article{Cao2014-ih,
author = {Cao, Houwei and Cooper, David G and Keutmann, Michael K and Gur,
Ruben C and Nenkova, Ani and Verma, Ragini},
copyright = {https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html},
journal = {IEEE Transactions on Affective Computing},
keywords = {Emotional corpora; facial expression; multi-modal recognition;
voice expression},
language = {en},
month = oct,
number = {4},
pages = {377--390},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
title = {{CREMA-D}: Crowd-sourced emotional multimodal actors dataset},
volume = {5},
year = {2014},
}
CSTRVCTKAccentID¶
Accent classification from CSTR-VCTK dataset. This is a stratified and downsampled version of the original dataset. The dataset was recorded with 2 different microphones, and this mini version uniformly samples data from the 2 microphone types.
Dataset: mteb/cstr-vctk-accent-mini • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to text (a2t) | accuracy | eng | Speech, Spoken | human-annotated | found |
Citation
@inproceedings{Yamagishi2019CSTRVC,
author = {Junichi Yamagishi and Christophe Veaux and Kirsten MacDonald},
title = {CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)},
url = {https://api.semanticscholar.org/CorpusID:213060286},
year = {2019},
}
CSTRVCTKGender¶
Gender classification from CSTR-VCTK dataset. This is a stratified and downsampled version of the original dataset. The dataset was recorded with 2 different microphones, and this mini version uniformly samples data from the 2 microphone types.
Dataset: mteb/cstr-vctk-gender-mini • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to text (a2t) | accuracy | eng | Speech, Spoken | human-annotated | found |
Citation
@inproceedings{Yamagishi2019CSTRVC,
author = {Junichi Yamagishi and Christophe Veaux and Kirsten MacDonald},
title = {CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)},
url = {https://api.semanticscholar.org/CorpusID:213060286},
year = {2019},
}
CommonLanguageAgeDetection¶
Age Classification. This is a stratified subsampled version of the original CommonLanguage dataset.
Dataset: mteb/commonlanguage-age-mini • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Scene, Speech, Spoken | human-annotated | found |
Citation
@dataset{ganesh_sinisetty_2021_5036977,
author = {Ganesh Sinisetty and
Pavlo Ruban and
Oleksandr Dymov and
Mirco Ravanelli},
doi = {10.5281/zenodo.5036977},
month = jun,
publisher = {Zenodo},
title = {CommonLanguage},
url = {https://doi.org/10.5281/zenodo.5036977},
version = {0.1},
year = {2021},
}
CommonLanguageGenderDetection¶
Gender Classification. This is a stratified subsampled version of the original CommonLanguage dataset.
Dataset: mteb/commonlanguage-gender-mini • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Scene, Speech, Spoken | human-annotated | found |
Citation
@dataset{ganesh_sinisetty_2021_5036977,
author = {Ganesh Sinisetty and
Pavlo Ruban and
Oleksandr Dymov and
Mirco Ravanelli},
doi = {10.5281/zenodo.5036977},
month = jun,
publisher = {Zenodo},
title = {CommonLanguage},
url = {https://doi.org/10.5281/zenodo.5036977},
version = {0.1},
year = {2021},
}
CommonLanguageLanguageDetection¶
Language Classification. This is a stratified subsampled version of the original CommonLanguage dataset.
Dataset: mteb/commonlanguage-lang-mini • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Scene, Speech, Spoken | human-annotated | found |
Citation
@dataset{ganesh_sinisetty_2021_5036977,
author = {Ganesh Sinisetty and
Pavlo Ruban and
Oleksandr Dymov and
Mirco Ravanelli},
doi = {10.5281/zenodo.5036977},
month = jun,
publisher = {Zenodo},
title = {CommonLanguage},
url = {https://doi.org/10.5281/zenodo.5036977},
version = {0.1},
year = {2021},
}
ESC50¶
Environmental Sound Classification Dataset.
Dataset: mteb/esc50 • License: cc-by-nc-sa-3.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | Spoken | human-annotated | found |
Citation
@inproceedings{piczak2015dataset,
author = {Piczak, Karol J.},
booktitle = {Proceedings of the 23rd {Annual ACM Conference} on {Multimedia}},
date = {2015-10-13},
doi = {10.1145/2733373.2806390},
isbn = {978-1-4503-3459-4},
location = {{Brisbane, Australia}},
pages = {1015--1018},
publisher = {{ACM Press}},
title = {{ESC}: {Dataset} for {Environmental Sound Classification}},
url = {http://dl.acm.org/citation.cfm?doid=2733373.2806390},
}
ExpressoConv¶
Multiclass expressive speech style classification. This is a stratified and downsampled version of the original dataset, which contains 40 hours of speech. The original dataset has two subsets, read speech and conversational speech, each with its own set of style labels. This task only includes the conversational speech subset.
Dataset: mteb/expresso-conv-mini • License: cc-by-nc-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to text (a2t) | accuracy | eng | Speech, Spoken | human-annotated | created |
Citation
@inproceedings{nguyen2023expresso,
author = {Nguyen, Tu Anh and Hsu, Wei-Ning and d'Avirro, Antony and Shi, Bowen and Gat, Itai and Fazel-Zarani, Maryam and Remez, Tal and Copet, Jade and Synnaeve, Gabriel and Hassid, Michael and others},
booktitle = {INTERSPEECH 2023-24th Annual Conference of the International Speech Communication Association},
pages = {4823--4827},
title = {Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis},
year = {2023},
}
ExpressoRead¶
Multiclass expressive speech style classification. This is a stratified and downsampled version of the original dataset, which contains 40 hours of speech. The original dataset has two subsets, read speech and conversational speech, each with its own set of style labels. This task only includes the read speech subset.
Dataset: mteb/expresso-read-mini • License: cc-by-nc-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to text (a2t) | accuracy | eng | Speech, Spoken | human-annotated | created |
Citation
@inproceedings{nguyen2023expresso,
author = {Nguyen, Tu Anh and Hsu, Wei-Ning and d'Avirro, Antony and Shi, Bowen and Gat, Itai and Fazel-Zarani, Maryam and Remez, Tal and Copet, Jade and Synnaeve, Gabriel and Hassid, Michael and others},
booktitle = {INTERSPEECH 2023-24th Annual Conference of the International Speech Communication Association},
pages = {4823--4827},
title = {Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis},
year = {2023},
}
FSDD¶
Spoken digit classification of audio into one of 10 classes: 0-9
Dataset: mteb/free-spoken-digit-dataset • License: cc-by-sa-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Music | human-annotated | created |
Citation
@misc{zohar2018free,
author = {Zohar Jackson and César Souza and Jason Flaks and Yuxin Pan and Hereman Nicolas and Adhish Thite},
month = {aug},
title = {Jakobovski/Free-Spoken-Digit-Dataset: V1.0.8},
url = {https://doi.org/10.5281/zenodo.1342401},
year = {2018},
}
GLOBEV2Age¶
Age classification from the GLOBE v2 dataset (sampled and enhanced from the CommonVoice dataset for TTS purposes). This is a stratified and downsampled version of the original dataset, which contains about 535 hours of speech data across 164 accents. We use the age column as the target label for audio classification.
Dataset: mteb/globe-v2-age-mini • License: cc0-1.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to text (a2t) | accuracy | eng | Speech, Spoken | human-annotated | found |
Citation
@misc{wang2024globe,
archiveprefix = {arXiv},
author = {Wenbin Wang and Yang Song and Sanjay Jha},
eprint = {2406.14875},
title = {GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech},
year = {2024},
}
GLOBEV2Gender¶
Gender classification from the GLOBE v2 dataset (sampled and enhanced from the CommonVoice dataset for TTS purposes). This is a stratified and downsampled version of the original dataset, which contains about 535 hours of speech data across 164 accents. We use the gender column as the target label for audio classification.
Dataset: mteb/globe-v2-gender-mini • License: cc0-1.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to text (a2t) | accuracy | eng | Speech, Spoken | human-annotated | found |
Citation
@misc{wang2024globe,
archiveprefix = {arXiv},
author = {Wenbin Wang and Yang Song and Sanjay Jha},
eprint = {2406.14875},
title = {GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech},
year = {2024},
}
GLOBEV3Age¶
Age classification from the GLOBE v3 dataset (sampled and enhanced from the CommonVoice dataset for TTS purposes). This is a stratified and downsampled version of the original dataset, which contains about 535 hours of speech data across 164 accents. We use the age column as the target label for audio classification.
Dataset: mteb/globe-v3-age-mini • License: cc0-1.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to text (a2t) | accuracy | eng | Speech, Spoken | automatic | found |
Citation
@misc{wang2024globe,
archiveprefix = {arXiv},
author = {Wenbin Wang and Yang Song and Sanjay Jha},
eprint = {2406.14875},
title = {GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech},
year = {2024},
}
GLOBEV3Gender¶
Gender classification from the GLOBE v3 dataset (sampled and enhanced from the CommonVoice dataset for TTS purposes). This is a stratified and downsampled version of the original dataset, which contains about 535 hours of speech data across 164 accents. We use the gender column as the target label for audio classification.
Dataset: mteb/globe-v3-gender-mini • License: cc0-1.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to text (a2t) | accuracy | eng | Speech, Spoken | automatic | found |
Citation
@misc{wang2024globe,
archiveprefix = {arXiv},
author = {Wenbin Wang and Yang Song and Sanjay Jha},
eprint = {2406.14875},
title = {GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech},
year = {2024},
}
GTZANGenre¶
Music Genre Classification (10 classes)
Dataset: mteb/gtzan-genre • License: not specified • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | Music | human-annotated | found |
Citation
@article{1021072,
author = {Tzanetakis, G. and Cook, P.},
doi = {10.1109/TSA.2002.800560},
journal = {IEEE Transactions on Speech and Audio Processing},
keywords = {Humans;Music information retrieval;Instruments;Computer science;Multiple signal classification;Signal analysis;Pattern recognition;Feature extraction;Wavelet analysis;Cultural differences},
number = {5},
pages = {293-302},
title = {Musical genre classification of audio signals},
volume = {10},
year = {2002},
}
GunshotTriangulation¶
Classifying a weapon based on its muzzle blast
Dataset: mteb/GunshotTriangulationHear • License: not specified • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | not specified | derived | found |
Citation
@misc{raponi2021soundgunsdigitalforensics,
archiveprefix = {arXiv},
author = {Simone Raponi and Isra Ali and Gabriele Oligeri},
eprint = {2004.07948},
primaryclass = {eess.AS},
title = {Sound of Guns: Digital Forensics of Gun Audio Samples meets Artificial Intelligence},
url = {https://arxiv.org/abs/2004.07948},
year = {2021},
}
IEMOCAPEmotion¶
Classification of speech samples into emotions (angry, happy, sad, neutral, frustrated, excited, fearful, surprised, disgusted) from interactive emotional dyadic conversations.
Dataset: mteb/iemocap • License: cc-by-nc-sa-3.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Speech, Spoken | expert-annotated | created |
Citation
@article{busso2008iemocap,
author = {Busso, Carlos and Bulut, Murtaza and Lee, Chi-Chun and Kazemzadeh, Abe and Mower, Emily and Kim, Samuel and Chang, Jeannette N and Lee, Sungbok and Narayanan, Shrikanth S},
journal = {Language resources and evaluation},
number = {4},
pages = {335--359},
publisher = {Springer},
title = {IEMOCAP: Interactive emotional dyadic motion capture database},
volume = {42},
year = {2008},
}
IEMOCAPGender¶
Classification of speech samples by speaker gender (male/female) from the IEMOCAP database of interactive emotional dyadic conversations.
Dataset: mteb/iemocap • License: cc-by-nc-sa-3.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Speech, Spoken | expert-annotated | created |
Citation
@article{busso2008iemocap,
author = {Busso, Carlos and Bulut, Murtaza and Lee, Chi-Chun and Kazemzadeh, Abe and Mower, Emily and Kim, Samuel and Chang, Jeannette N and Lee, Sungbok and Narayanan, Shrikanth S},
journal = {Language resources and evaluation},
number = {4},
pages = {335--359},
publisher = {Springer},
title = {IEMOCAP: Interactive emotional dyadic motion capture database},
volume = {42},
year = {2008},
}
LibriCount¶
Multiclass speaker count identification. The dataset contains audio recordings with between 0 and 10 speakers.
Dataset: mteb/libricount • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Speech | algorithmic | created |
Citation
@inproceedings{Stoter_2018,
author = {Stoter, Fabian-Robert and Chakrabarty, Soumitro and Edler, Bernd and Habets, Emanuel A. P.},
booktitle = {2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
doi = {10.1109/icassp.2018.8462159},
month = apr,
pages = {436-440},
publisher = {IEEE},
title = {Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation},
url = {http://dx.doi.org/10.1109/ICASSP.2018.8462159},
year = {2018},
}
MInDS14¶
MInDS-14 is an evaluation resource for intent detection with spoken data in 14 diverse languages.
Dataset: mteb/minds14-multilingual • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | ces, deu, eng, fra, ita, ... (12) | Speech, Spoken | human-annotated | found |
Citation
@article{DBLP:journals/corr/abs-2104-08524,
author = {Daniela Gerz and Pei{-}Hao Su and Razvan Kusztos and Avishek Mondal and Michal Lis and Eshan Singhal and Nikola Mrkšić and Tsung{-}Hsien Wen and Ivan Vulic},
eprint = {2104.08524},
eprinttype = {arXiv},
journal = {CoRR},
title = {Multilingual and Cross-Lingual Intent Detection from Spoken Data},
url = {https://arxiv.org/abs/2104.08524},
volume = {abs/2104.08524},
year = {2021},
}
MridinghamStroke¶
Stroke classification of the Mridangam (a pitched percussion instrument) into one of 10 classes: bheem, cha, dheem, dhin, num, tham, ta, tha, thi, thom.
Dataset: mteb/mridingham-stroke • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | Music | human-annotated | created |
Citation
@inproceedings{6637633,
author = {Anantapadmanabhan, Akshay and Bellur, Ashwin and Murthy, Hema A},
booktitle = {2013 IEEE International Conference on Acoustics, Speech and Signal Processing},
doi = {10.1109/ICASSP.2013.6637633},
keywords = {Instruments;Vectors;Hidden Markov models;Harmonic analysis;Modal analysis;Dictionaries;Music;Modal Analysis;Mridangam;automatic transcription;Non-negative Matrix Factorization;Hidden Markov models},
number = {},
pages = {181-185},
title = {Modal analysis and transcription of strokes of the mridangam using non-negative matrix factorization},
volume = {},
year = {2013},
}
MridinghamTonic¶
Tonic classification of the Mridangam (a pitched percussion instrument) into one of 6 classes: B, C, C#, D, D#, E.
Dataset: mteb/mridingham-tonic • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | Music | human-annotated | created |
Citation
@inproceedings{6637633,
author = {Anantapadmanabhan, Akshay and Bellur, Ashwin and Murthy, Hema A},
booktitle = {2013 IEEE International Conference on Acoustics, Speech and Signal Processing},
doi = {10.1109/ICASSP.2013.6637633},
keywords = {Instruments;Vectors;Hidden Markov models;Harmonic analysis;Modal analysis;Dictionaries;Music;Modal Analysis;Mridangam;automatic transcription;Non-negative Matrix Factorization;Hidden Markov models},
number = {},
pages = {181-185},
title = {Modal analysis and transcription of strokes of the mridangam using non-negative matrix factorization},
volume = {},
year = {2013},
}
NSynth¶
Instrument Source Classification: one of acoustic, electronic, or synthetic.
Dataset: mteb/nsynth-mini • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | Music | human-annotated | created |
Citation
@misc{engel2017neuralaudiosynthesismusical,
archiveprefix = {arXiv},
author = {Jesse Engel and Cinjon Resnick and Adam Roberts and Sander Dieleman and Douglas Eck and Karen Simonyan and Mohammad Norouzi},
eprint = {1704.01279},
primaryclass = {cs.LG},
title = {Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders},
url = {https://arxiv.org/abs/1704.01279},
year = {2017},
}
SpeechCommands¶
A set of one-second .wav audio files, each containing a single spoken English word or background noise. To keep evaluation fast, we use a downsampled version of the original dataset by keeping ~50 samples per class for training.
Dataset: mteb/speech-commands-mini • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Speech | human-annotated | found |
Citation
@article{DBLP:journals/corr/abs-1804-03209,
author = {Pete Warden},
bibsource = {dblp computer science bibliography, https://dblp.org},
biburl = {https://dblp.org/rec/journals/corr/abs-1804-03209.bib},
eprint = {1804.03209},
eprinttype = {arXiv},
journal = {CoRR},
timestamp = {Mon, 13 Aug 2018 16:48:32 +0200},
title = {Speech Commands: {A} Dataset for Limited-Vocabulary Speech Recognition},
url = {http://arxiv.org/abs/1804.03209},
volume = {abs/1804.03209},
year = {2018},
}
SpokeNEnglish¶
Classification of spoken numbers in English from the SpokeN-100 dataset.
Dataset: mteb/SpokeN-100-English • License: cc-by-sa-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Spoken | LM-generated | found |
Citation
@misc{groh2024spoken100crosslingualbenchmarkingdataset,
archiveprefix = {arXiv},
author = {René Groh and Nina Goes and Andreas M. Kist},
eprint = {2403.09753},
primaryclass = {cs.SD},
title = {SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages},
url = {https://arxiv.org/abs/2403.09753},
year = {2024},
}
SpokenQAForIC¶
The SpokenQA dataset reformulated as an Intent Classification (IC) task.
Dataset: mteb/SpokenQA_SLUE • License: not specified • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Spoken | human-annotated | multiple |
Citation
@misc{shon2023sluephase2benchmarksuite,
archiveprefix = {arXiv},
author = {Suwon Shon and Siddhant Arora and Chyi-Jiunn Lin and Ankita Pasad and Felix Wu and Roshan Sharma and Wei-Lun Wu and Hung-Yi Lee and Karen Livescu and Shinji Watanabe},
eprint = {2212.10525},
primaryclass = {cs.CL},
title = {SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks},
url = {https://arxiv.org/abs/2212.10525},
year = {2023},
}
TAUAcousticScenes2022Mobile¶
The TAU Urban Acoustic Scenes 2022 Mobile development dataset consists of 1-second audio recordings of 10 different acoustic scenes, captured in 12 European cities with 4 different devices. This is a stratified subsampled version of the evaluation_setup subset of the original dataset.
Dataset: mteb/tau-acoustic-scenes-2022-mobile-mini • License: not specified • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | AudioScene | expert-annotated | found |
Citation
@dataset{heittola_2022_6337421,
author = {Toni Heittola and Annamaria Mesaros and Tuomas Virtanen},
publisher = {Zenodo},
title = {TAU Urban Acoustic Scenes 2022 Mobile, Development Dataset},
url = {https://doi.org/10.5281/zenodo.6337421},
year = {2022},
}
TUTAcousticScenes¶
The TUT Urban Acoustic Scenes 2018 dataset consists of 10-second audio segments from 10 acoustic scenes recorded in six European cities. This is a stratified subsampled version of the original dataset.
Dataset: mteb/tut-acoustic-scenes-mini • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | AudioScene | expert-annotated | found |
Citation
@inproceedings{Mesaros2018_DCASE,
address = {Tampere, Finland},
author = {Annamaria Mesaros and Toni Heittola and Tuomas Virtanen},
booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)},
publisher = {Tampere University of Technology},
title = {A Multi-Device Dataset for Urban Acoustic Scene Classification},
url = {https://arxiv.org/abs/1807.09840},
year = {2018},
}
UrbanSound8k¶
Environmental Sound Classification Dataset.
Dataset: mteb/urbansound8K • License: cc-by-nc-sa-3.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | zxx | AudioScene | human-annotated | found |
Citation
@inproceedings{Salamon:UrbanSound:ACMMM:14,
author = {Salamon, Justin and Jacoby, Christopher and Bello, Juan Pablo},
booktitle = {Proceedings of the 22nd ACM international conference on Multimedia},
organization = {ACM},
pages = {1041--1044},
title = {A Dataset and Taxonomy for Urban Sound Research},
year = {2014},
}
VocalSound¶
Human Vocal Sound Classification Dataset.
Dataset: mteb/vocalsound • License: cc-by-sa-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Spoken | human-annotated | found |
Citation
@inproceedings{Gong_2022,
author = {Gong, Yuan and Yu, Jin and Glass, James},
booktitle = {ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
doi = {10.1109/icassp43922.2022.9746828},
month = may,
publisher = {IEEE},
title = {Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition},
url = {http://dx.doi.org/10.1109/ICASSP43922.2022.9746828},
year = {2022},
}
VoxCelebSA¶
The VoxCeleb dataset augmented for a sentiment analysis (SA) task.
Dataset: mteb/voxceleb-sentiment • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Spoken | human-annotated | found |
Citation
@misc{shon2022sluenewbenchmarktasks,
archiveprefix = {arXiv},
author = {Suwon Shon and Ankita Pasad and Felix Wu and Pablo Brusco and Yoav Artzi and Karen Livescu and Kyu J. Han},
eprint = {2111.10367},
primaryclass = {cs.CL},
title = {SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech},
url = {https://arxiv.org/abs/2111.10367},
year = {2022},
}
VoxLingua107_Top10¶
Spoken language identification from audio samples (10 classes/languages).
Dataset: mteb/voxlingua107-top10 • License: cc-by-4.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Speech | automatic-and-reviewed | found |
Citation
@misc{valk2020voxlingua107datasetspokenlanguage,
archiveprefix = {arXiv},
author = {Jörgen Valk and Tanel Alumäe},
eprint = {2011.12998},
primaryclass = {eess.AS},
title = {VoxLingua107: a Dataset for Spoken Language Recognition},
url = {https://arxiv.org/abs/2011.12998},
year = {2020},
}
VoxPopuliAccentID¶
Classification of English speech samples into one of 15 non-native accents from European Parliament recordings. This is a stratified subsampled version of the original VoxPopuli dataset.
Dataset: mteb/voxpopuli-accent-mini • License: cc0-1.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | eng | Speech, Spoken | human-annotated | found |
Citation
@inproceedings{wang-etal-2021-voxpopuli,
address = {Online},
author = {Wang, Changhan and
Riviere, Morgane and
Lee, Ann and
Wu, Anne and
Talnikar, Chaitanya and
Haziza, Daniel and
Williamson, Mary and
Pino, Juan and
Dupoux, Emmanuel},
booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
doi = {10.18653/v1/2021.acl-long.80},
month = aug,
pages = {993--1003},
publisher = {Association for Computational Linguistics},
title = {{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
url = {https://aclanthology.org/2021.acl-long.80},
year = {2021},
}
VoxPopuliGenderID¶
Classification of speech samples by speaker gender (male/female) from European Parliament recordings. This is a subsampled version of the original VoxPopuli dataset.
Dataset: mteb/voxpopuli-mini • License: cc0-1.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | deu, eng, fra, pol, spa | Speech, Spoken | human-annotated | found |
Citation
@inproceedings{wang-etal-2021-voxpopuli,
address = {Online},
author = {Wang, Changhan and
Riviere, Morgane and
Lee, Ann and
Wu, Anne and
Talnikar, Chaitanya and
Haziza, Daniel and
Williamson, Mary and
Pino, Juan and
Dupoux, Emmanuel},
booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
doi = {10.18653/v1/2021.acl-long.80},
month = aug,
pages = {993--1003},
publisher = {Association for Computational Linguistics},
title = {{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
url = {https://aclanthology.org/2021.acl-long.80},
year = {2021},
}
VoxPopuliLanguageID¶
Classification of speech samples into one of 5 European languages (English, German, French, Spanish, Polish) from European Parliament recordings. This is a subsampled version of the original VoxPopuli dataset.
Dataset: mteb/voxpopuli-mini • License: cc0-1.0 • Learn more →
| Task category | Score | Languages | Domains | Annotations Creators | Sample Creation |
|---|---|---|---|---|---|
| audio to category (a2c) | accuracy | deu, eng, fra, pol, spa | Speech, Spoken | human-annotated | found |
Citation
@inproceedings{wang-etal-2021-voxpopuli,
address = {Online},
author = {Wang, Changhan and
Riviere, Morgane and
Lee, Ann and
Wu, Anne and
Talnikar, Chaitanya and
Haziza, Daniel and
Williamson, Mary and
Pino, Juan and
Dupoux, Emmanuel},
booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
doi = {10.18653/v1/2021.acl-long.80},
month = aug,
pages = {993--1003},
publisher = {Association for Computational Linguistics},
title = {{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
url = {https://aclanthology.org/2021.acl-long.80},
year = {2021},
}