
VisionCentricQA

  • Number of tasks: 6

BLINKIT2IMultiChoice

Given a query image and a retrieval instruction, select the matching image from a set of candidate images.

Dataset: JamieSJS/blink-it2i-multi • License: not specified

  • Task category: image, text to image (it2i)
  • Score: accuracy
  • Languages: eng
  • Domains: Encyclopaedic
  • Annotations creators: derived
  • Sample creation: found
Citation
@article{fu2024blink,
  author = {Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A. and Ma, Wei-Chiu and Krishna, Ranjay},
  journal = {arXiv preprint arXiv:2404.12390},
  title = {{BLINK}: Multimodal large language models can see but not perceive},
  year = {2024},
}

BLINKIT2TMultiChoice

Given a query image and a retrieval instruction, select the correct text answer from a set of candidates.

Dataset: JamieSJS/blink-it2t-multi • License: not specified

  • Task category: image, text to text (it2t)
  • Score: accuracy
  • Languages: eng
  • Domains: Encyclopaedic
  • Annotations creators: derived
  • Sample creation: found
Citation
@article{fu2024blink,
  author = {Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A. and Ma, Wei-Chiu and Krishna, Ranjay},
  journal = {arXiv preprint arXiv:2404.12390},
  title = {{BLINK}: Multimodal large language models can see but not perceive},
  year = {2024},
}
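
Both BLINK tasks follow the standard mteb evaluation interface, so in principle they can be run with the mteb Python package. The sketch below is a minimal example under stated assumptions, not a confirmed recipe: it assumes the task names above are registered in mteb, and the model identifier is only a hypothetical example of a model available in mteb's registry.

import mteb

# Minimal sketch: evaluate an image-text embedding model on both BLINK
# multiple-choice tasks. Task names are taken from this page; the model
# name is a hypothetical example, not confirmed by this page.
tasks = mteb.get_tasks(tasks=["BLINKIT2IMultiChoice", "BLINKIT2TMultiChoice"])
model = mteb.get_model("openai/clip-vit-base-patch32")

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")

for result in results:
    # Each result reports accuracy, the main score for these tasks.
    print(result.task_name, result.scores)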

CVBenchCount

Count the number of objects in the image.

Dataset: nyu-visionx/CV-Bench • License: MIT

  • Task category: image, text to text (it2t)
  • Score: accuracy
  • Languages: eng
  • Domains: Academic
  • Annotations creators: derived
  • Sample creation: found
Citation
@article{tong2024cambrian,
  author = {Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and others},
  journal = {arXiv preprint arXiv:2406.16860},
  title = {Cambrian-1: A fully open, vision-centric exploration of multimodal {LLMs}},
  year = {2024},
}

CVBenchDepth

Judge the relative depth of objects in the image via similarity matching.

Dataset: nyu-visionx/CV-Bench • License: MIT

  • Task category: image, text to text (it2t)
  • Score: accuracy
  • Languages: eng
  • Domains: Academic
  • Annotations creators: derived
  • Sample creation: found
Citation
@article{tong2024cambrian,
  author = {Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and others},
  journal = {arXiv preprint arXiv:2406.16860},
  title = {Cambrian-1: A fully open, vision-centric exploration of multimodal {LLMs}},
  year = {2024},
}

CVBenchDistance

Judge the relative distance of objects in the image via similarity matching.

Dataset: nyu-visionx/CV-Bench • License: MIT

  • Task category: image, text to text (it2t)
  • Score: accuracy
  • Languages: eng
  • Domains: Academic
  • Annotations creators: derived
  • Sample creation: found
Citation
@article{tong2024cambrian,
  author = {Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and others},
  journal = {arXiv preprint arXiv:2406.16860},
  title = {Cambrian-1: A fully open, vision-centric exploration of multimodal {LLMs}},
  year = {2024},
}

CVBenchRelation

Determine the spatial relation between objects in the image.

Dataset: nyu-visionx/CV-Bench • License: MIT

  • Task category: image, text to text (it2t)
  • Score: accuracy
  • Languages: eng
  • Domains: Academic
  • Annotations creators: derived
  • Sample creation: found
Citation
@article{tong2024cambrian,
  author = {Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and others},
  journal = {arXiv preprint arXiv:2406.16860},
  title = {Cambrian-1: A fully open, vision-centric exploration of multimodal {LLMs}},
  year = {2024},
}
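
All four CV-Bench tasks are drawn from the same nyu-visionx/CV-Bench dataset, so the raw examples can be inspected directly with the Hugging Face datasets library. The sketch below assumes the dataset loads under its default configuration with a test split; the field names mentioned in the comments are assumptions about the schema, not confirmed by this page.

from datasets import load_dataset

# Minimal sketch: load CV-Bench and inspect one example. Assumes the
# default configuration exposes a "test" split.
ds = load_dataset("nyu-visionx/CV-Bench", split="test")

example = ds[0]
print(example.keys())  # check the real schema before relying on field names
# Assumed fields: "task" (Count / Depth / Distance / Relation), "question",
# "choices", and "answer"; adjust to whatever keys() actually reports.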