AudioZeroshotClassification¶

Number of tasks: 5

ESC50_Zeroshot¶

Environmental Sound Classification Dataset.

Dataset: mteb/esc50 • License: cc-by-nc-sa-3.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
audio to text (a2t)	accuracy	zxx	Spoken	human-annotated	found

Citation

@inproceedings{piczak2015dataset,
  author = {Piczak, Karol J.},
  booktitle = {Proceedings of the 23rd {Annual ACM Conference} on {Multimedia}},
  date = {2015-10-13},
  doi = {10.1145/2733373.2806390},
  isbn = {978-1-4503-3459-4},
  location = {{Brisbane, Australia}},
  pages = {1015--1018},
  publisher = {{ACM Press}},
  title = {{ESC}: {Dataset} for {Environmental Sound Classification}},
  url = {http://dl.acm.org/citation.cfm?doid=2733373.2806390},
}

RavdessZeroshot¶

Emotion classification Dataset. RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions includes neutral,calm, happy, sad, angry, fearful, surprise, and disgust expressions. These 8 emtoions also serve as labels for the dataset.

Dataset: mteb/RavdessZeroshot • License: cc-by-nc-sa-3.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
audio to text (a2t)	accuracy	eng	Spoken	human-annotated	found

Citation

@article{10.1371/journal.pone.0196391,
  author = {Livingstone, Steven R. AND Russo, Frank A.},
  doi = {10.1371/journal.pone.0196391},
  journal = {PLOS ONE},
  month = {05},
  number = {5},
  pages = {1-35},
  publisher = {Public Library of Science},
  title = {The Ryerson Audio-Visual Database ofal Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English},
  url = {https://doi.org/10.1371/journal.pone.0196391},
  volume = {13},
  year = {2018},
}

SpeechCommandsZeroshotv0.01¶

Sound Classification/Keyword Spotting Dataset. This is a set of one-second audio clips containing a single spoken English word or background noise. These words are from a small set of commands such as 'yes', 'no', and 'stop' spoken by various speakers. With a total of 10 labels/commands for keyword spotting and a total of 30 labels for other auxiliary tasks

Dataset: mteb/SpeechCommandsZeroshotv0.01 • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
audio to text (a2t)	accuracy	eng	Spoken	human-annotated	found

Citation

@article{DBLP:journals/corr/abs-1804-03209,
  author = {Pete Warden},
  bibsource = {dblp computer science bibliography, https://dblp.org},
  biburl = {https://dblp.org/rec/journals/corr/abs-1804-03209.bib},
  eprint = {1804.03209},
  eprinttype = {arXiv},
  journal = {CoRR},
  timestamp = {Mon, 13 Aug 2018 16:48:32 +0200},
  title = {Speech Commands: {A} Dataset for Limited-Vocabulary Speech Recognition},
  url = {http://arxiv.org/abs/1804.03209},
  volume = {abs/1804.03209},
  year = {2018},
}

SpeechCommandsZeroshotv0.02¶

Sound Classification/Keyword Spotting Dataset. This is a set of one-second audio clips containing a single spoken English word or background noise. These words are from a small set of commands such as 'yes', 'no', and 'stop' spoken by various speakers. With a total of 10 labels/commands for keyword spotting and a total of 30 labels for other auxiliary tasks

Dataset: mteb/SpeechCommandsZeroshotv0.02 • License: cc-by-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
audio to text (a2t)	accuracy	eng	Spoken	human-annotated	found

Citation

@article{DBLP:journals/corr/abs-1804-03209,
  author = {Pete Warden},
  bibsource = {dblp computer science bibliography, https://dblp.org},
  biburl = {https://dblp.org/rec/journals/corr/abs-1804-03209.bib},
  eprint = {1804.03209},
  eprinttype = {arXiv},
  journal = {CoRR},
  timestamp = {Mon, 13 Aug 2018 16:48:32 +0200},
  title = {Speech Commands: {A} Dataset for Limited-Vocabulary Speech Recognition},
  url = {http://arxiv.org/abs/1804.03209},
  volume = {abs/1804.03209},
  year = {2018},
}

UrbanSound8kZeroshot¶

Environmental Sound Classification Dataset.

Dataset: mteb/urbansound8K • License: cc-by-nc-sa-3.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
audio to text (a2t)	accuracy	zxx	AudioScene	human-annotated	found

Citation

@inproceedings{Salamon:UrbanSound:ACMMM:14,
  author = {Salamon, Justin and Jacoby, Christopher and Bello, Juan Pablo},
  booktitle = {Proceedings of the 22nd ACM international conference on Multimedia},
  organization = {ACM},
  pages = {1041--1044},
  title = {A Dataset and Taxonomy for Urban Sound Research},
  year = {2014},
}