CharXiv

Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Princeton Language and Intelligence (PLI), Princeton University
University of Wisconsin–Madison
The University of Hong Kong
NeurIPS 2024

Watch the 80-second music video to learn about the motivation and key findings of CharXiv!
(Lyrics by GPT-4o, based on the abstract; music by Suno.)

Introduction

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from scientific papers. CharXiv includes two types of questions: (1) descriptive questions about examining basic chart elements and (2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress.
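For illustration, here is a minimal sketch of how a single CharXiv-style example could be represented, pairing one chart with the two question types described above. The class and field names are hypothetical and do not reflect the released data schema.

# A minimal, hypothetical representation of one chart paired with the two
# question types described above; field names are illustrative assumptions,
# not the official CharXiv schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChartExample:
    chart_id: str                      # identifier of the source chart
    image_path: str                    # path to the chart image
    descriptive_questions: List[str] = field(default_factory=list)  # basic chart-element questions
    reasoning_question: str = ""       # requires synthesizing information across visual elements
    reasoning_answer: str = ""         # human-verified reference answer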

Figure: Many open-source models surpass proprietary models on existing benchmarks (subsets of DVQA, FigureQA, and ChartQA from MathVista), yet they consistently fail on reasoning questions from CharXiv.


Leaderboard

We evaluate general-purpose MLLMs on CharXiv and provide a leaderboard for the community to track progress. Note that all models are evaluated in a zero-shot setting with a set of natural instructions for each question type. The numbers below reflect model performance on the validation set, which consists of 1,000 charts and 5,000 questions in total.
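As a rough illustration of this protocol, the sketch below computes per-question-type accuracy over the validation items. The function name, the ask_model callable, the instruction templates, and the dictionary keys (question_type, image_path, question, answer) are assumptions made for the example, not the official evaluation code or data schema, and exact-match scoring is a simplification of the actual grading.

# A minimal sketch of the zero-shot evaluation loop described above
# (assumed names and schema; not the official CharXiv evaluation code).
from collections import defaultdict
from typing import Callable, Dict, Iterable

def evaluate_zero_shot(
    examples: Iterable[dict],              # validation items with image, question, type, answer
    ask_model: Callable[[str, str], str],  # (image_path, prompt) -> model answer
    instructions: Dict[str, str],          # one natural-instruction template per question type
) -> Dict[str, float]:
    """Compute accuracy separately for descriptive and reasoning questions."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        qtype = ex["question_type"]        # "descriptive" or "reasoning"
        prompt = instructions[qtype].format(question=ex["question"])
        prediction = ask_model(ex["image_path"], prompt)  # zero-shot: no in-context exemplars
        total[qtype] += 1
        correct[qtype] += prediction.strip().lower() == ex["answer"].strip().lower()
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

Any MLLM wrapper that maps an image path and a prompt to a text answer can be plugged in as ask_model.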

The website is under construction. We will roll out all the updates for this project in the coming days!



Citation

@article{wang2024charxiv,
  title={CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs},
  author={Wang, Zirui and Xia, Mengzhou and He, Luxi and Chen, Howard and Liu, Yitao and Zhu, Richard and Liang, Kaiqu and Wu, Xindi and Liu, Haotian and Malladi, Sadhika and Chevalier, Alexis and Arora, Sanjeev and Chen, Danqi},
  journal={arXiv preprint arXiv:2406.18521},
  year={2024}
}