
---
title: README
emoji: 🌍
colorFrom: blue
colorTo: blue
sdk: static
pinned: false
license: apache-2.0
---

# On Path to Multimodal Generalist: General-Level and General-Bench

[📖 Project] [🏆 Leaderboard] [📄 Paper] [🤗 Paper-HF] [🤗 Dataset-HF (Close-Set)] [🤗 Dataset-HF (Open-Set)] [📝 Github]


General-Level Scorer


Does higher performance across tasks indicate stronger MLLM capability, and a step closer to AGI?
NO! But synergy does.

Most current MLLMs predominantly build on the language intelligence of LLMs to simulate an indirect form of multimodal intelligence, merely extending language intelligence to aid multimodal understanding. While LLMs (e.g., ChatGPT) have already demonstrated such synergy across NLP tasks, reflecting genuine language intelligence, the vast majority of MLLMs unfortunately do not achieve it across modalities and tasks.

We argue that the key to advancing towards AGI lies in the synergy effect—a capability that enables knowledge learned in one modality or task to generalize and enhance mastery in other modalities or tasks, fostering mutual improvement across different modalities and tasks through interconnected learning.


## 🏆 Overall Leaderboard


## 🚀 General-Level

A five-level evaluation system with a new norm for assessing multimodal generalists (multimodal LLMs/agents).
Its core is the use of synergy as the evaluative criterion, categorizing capabilities by whether an MLLM preserves synergy across comprehension and generation, as well as across multimodal interactions.

General-Level evaluates generalists based on the levels and strengths of the synergy they preserve. Specifically, beyond the no-synergy baseline, we define three scopes of synergy, ranked from low to high: task-level synergy (`task-task`), paradigm-level synergy (`comprehension-generation`), and cross-modal total synergy (`modality-modality`), as illustrated here:

Achieving these scopes of synergy becomes progressively more challenging, corresponding to higher degrees of general intelligence. Assume we have a benchmark spanning various modalities and tasks, where the tasks under these modalities can be categorized into a Comprehension group and a Generation group, plus a language (i.e., NLP) group, as illustrated here:

Let’s denote the number of datasets or tasks within the Comprehension task group by M; the number within the Generation task group by N; and the number of NLP tasks by T.

Now, we demonstrate the specific definition and calculation of each level:


## ⚠️ Scoring Relaxation

A central aspect of our General-Level framework is how synergy effects are computed. Under the standard notion of synergy, the performance of a generalist model when jointly modeling tasks A and B (e.g., Pθ(y|A,B)) should exceed its performance when modeling task A alone (Pθ(y|A)) or task B alone (Pθ(y|B)). However, this definition poses a significant obstacle to measuring synergy: there is no feasible way to obtain the two independent distributions, Pθ(y|A) and Pθ(y|B), alongside the joint distribution Pθ(y|A,B). A given generalist model has already undergone extensive pre-training and fine-tuning in which tasks A and B were likely modeled jointly, and retraining it to learn task A or task B in isolation is impractical; doing so would also incur excessive redundant computation and inference on the benchmark data.

To simplify and relax the evaluation of synergy, we introduce a key assumption in the scoring algorithm:

Theoretically, we posit that the stronger a model's synergy capability, the more likely it is to surpass the task performance of SoTA specialists when that synergy is effectively employed. We can therefore simplify the synergy measurement: if a generalist outperforms the SoTA specialist on a given task, we take this as evidence of a synergy effect, i.e., of leveraging knowledge learned from other tasks or modalities to enhance performance on the target task.

By making this assumption, we avoid the need for direct pairwise measurements between `task-task`, `comprehension-generation`, or `modality-modality`, which would otherwise require complex and computationally intensive algorithms.
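
To make this relaxation concrete, below is a minimal Python sketch of the rule, not the official scorer: the task names and scores are hypothetical, and `synergy_credits` is an illustrative helper.

```python
# Minimal sketch of the relaxed synergy rule (illustrative, not the official scorer).
# Assumption: per-task scores on a shared metric are available for both the
# generalist and the SoTA specialist, over the Comprehension (M), Generation (N),
# and NLP (T) task groups defined above.
from typing import Dict

def synergy_credits(generalist: Dict[str, float],
                    specialist_sota: Dict[str, float]) -> Dict[str, bool]:
    """A task earns a synergy credit iff the generalist beats the SoTA specialist."""
    return {task: generalist[task] > specialist_sota[task] for task in generalist}

# Hypothetical per-task scores (higher is better).
generalist = {"image_caption": 0.82, "vqa": 0.71, "text_to_image": 0.64}
specialist_sota = {"image_caption": 0.79, "vqa": 0.75, "text_to_image": 0.60}

print(synergy_credits(generalist, specialist_sota))
# {'image_caption': True, 'vqa': False, 'text_to_image': True}
```

Under this relaxation, level scores can be computed from per-task comparisons alone, with no estimates of joint and marginal distributions needed.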


## 📌 Citation

If you find our benchmark useful in your research, please consider citing us:

```bibtex
@article{fei2025pathmultimodalgeneralistgenerallevel,
  title={On Path to Multimodal Generalist: General-Level and General-Bench},
  author={Hao Fei and Yuan Zhou and Juncheng Li and Xiangtai Li and Qingshan Xu and Bobo Li and Shengqiong Wu and Yaoting Wang and Junbao Zhou and Jiahao Meng and Qingyu Shi and Zhiyuan Zhou and Liangtao Shi and Minghe Gao and Daoan Zhang and Zhiqi Ge and Weiming Wu and Siliang Tang and Kaihang Pan and Yaobo Ye and Haobo Yuan and Tao Zhang and Tianjie Ju and Zixiang Meng and Shilin Xu and Liyu Jia and Wentao Hu and Meng Luo and Jiebo Luo and Tat-Seng Chua and Shuicheng Yan and Hanwang Zhang},
  eprint={2505.04620},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.04620},
}
```


# GenBench Scoring System: User Guide

This system evaluates large models on the General-Bench multimodal task suite, covering prediction generation, per-task scoring, and final Level score computation.

## Environment Setup

- Python 3.9 or later
- Installing dependencies in advance is recommended (e.g., pandas, numpy, openpyxl)
- For Video Generation evaluation, install dependencies following the steps in video_generation_evaluation/README.md
- For Video Comprehension evaluation, install dependencies following the README.md in sa2va

## Dataset Download

## One-Command Run

Run the main script run.sh to complete the entire pipeline:

```bash
bash run.sh
```

This command performs, in order:

1. Generate predictions for each modality
2. Compute each task's score
3. Compute the final Level score

## Step-by-Step Run (Optional)

To run only some of the steps, use the --step flag:

- Run only step 1 (generate predictions):

  ```bash
  bash run.sh --step 1
  ```

- Run steps 1 and 2:

  ```bash
  bash run.sh --step 12
  ```

- Run steps 2 and 3:

  ```bash
  bash run.sh --step 23
  ```

- Without the flag, all steps run by default (equivalent to --step 123)

What each step produces:

- Step 1: generates prediction.json, saved in the same directory as each dataset's annotation.json (see the sketch after this list)
- Step 2: computes each task's score, saved to outcome/{model_name}_result.xlsx
- Step 3: computes the Level scores of the evaluated models
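
For illustration, here is a minimal sketch of the step-1 file placement, assuming the General-Bench-Openset/ layout described under Directory Layout below; the JSON body is a placeholder, since the real prediction format is task-specific:

```python
# Sketch: place a placeholder prediction.json next to every annotation.json
# under the open-set dataset root. Only the file placement follows the docs;
# the prediction content here is a stand-in, not the required schema.
import json
from pathlib import Path

root = Path("General-Bench-Openset")
for ann in root.rglob("annotation.json"):
    pred = ann.with_name("prediction.json")
    pred.write_text(json.dumps({"predictions": []}))  # placeholder content
    print(f"wrote {pred}")
```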

Note:

- For the Close Set (private datasets), run only step 1 (i.e., bash run.sh --step 1) and submit the generated prediction.json to the system.
- For the Open Set (public datasets), run steps 1, 2, and 3 in sequence (i.e., bash run.sh --step 123) to complete the full evaluation pipeline.

## Viewing Results

- Predictions (prediction.json) are written to each task's dataset folder, alongside annotation.json.
- Score sheets (e.g., Qwen2.5-7B-Instruct_result.xlsx) are written to the outcome/ directory; see the sketch below for a quick way to inspect them.
- The final Level score is printed directly to the terminal.
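
As a quick check of the step-2 output, here is a minimal pandas sketch; the file name follows the pattern above, but the sheet's column layout is an assumption:

```python
# Sketch: inspect a per-task score sheet produced by step 2.
# Reading .xlsx via pandas requires openpyxl; column names are not guaranteed.
import pandas as pd

df = pd.read_excel("outcome/Qwen2.5-7B-Instruct_result.xlsx")
print(df.head())      # first few per-task rows
print(df.describe())  # summary statistics over numeric score columns
```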

## Directory Layout

- General-Bench-Openset/: public datasets
- General-Bench-Closeset/: private datasets
- outcome/: output results
- references/: reference templates
- run.sh: main entry script (users normally need only this script)

## FAQ

- If a dependency is missing, install the corresponding Python package indicated in the error message.
- To customize model or data paths, edit the relevant variables in run.sh.

For further help, contact the system maintainers or consult the detailed development documentation.
