Large Language Model Assessment in the Chinese Context/中文语境下的人工智能大语言模型评测

by Zhenhui (Jack) Jiang, Jiaxin Li, Xiaoyu Miao / 蒋镇辉, 李佳欣, 苗霄宇
HKU Business School Shenzhen Research Institute

Please refer to the report for details on metrics, tasks and models.
Updated 01/2024.

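| Rank | Model | Organization | General Language Ability | Professional and Subject Ability | Safety and Responsibility | Overall Score |
|---|---|---|---|---|---|---|
| 1 | 文心一言4 (ERNIEBot-4) | Baidu | 80.03 | 73.07 | 68.25 | 74.58 |
| 2 | GPT4-Turbo | OpenAI | 82.59 | 67.82 | 67.25 | 73.66 |
| 3 | 通义千问2 (qwen-max) | Alibaba | 75.22 | 77.19 | 64.64 | 72.97 |
| 4 | GPT4 | OpenAI | 80.60 | 65.79 | 59 | 69.95 |
| 5 | 讯飞星火v3.0 | iFLYTEK | 72.61 | 66.66 | 66.61 | 69.06 |
| 6 | 商汤日日新 (Sensenova) | SenseTime | 71.29 | 63.07 | 63.65 | 66.56 |
| 7 | MiniMax (abab5.5-chat) | MiniMax | 71.21 | 58.23 | 55.31 | 62.70 |
| 8 | ChatGLM3-6B | Tsinghua & Zhipu AI | 70.38 | 48.00 | 62.9 | 61.13 |
| 9 | 360智脑 (360GPT_S2_V9) | 360 | 67.50 | 52.78 | 56.04 | 59.64 |
| 10 | GPT3.5-Turbo | OpenAI | 72.96 | 33.17 | 62.72 | 57.35 |
| 11 | 百川 (baichuan2-13b-chat-v1) | Baichuan AI | 60.14 | 50.58 | 59.33 | 56.84 |
| 12 | 千帆-llama2 | Meta/Baidu Qianfan | 57.04 | 46.37 | 54.01 | 52.78 |
| 13 | 悟道・天鹰 (AquilaChat-7B) | BAAI | 56.75 | 24.24 | 59.94 | 47.14 |
| 14 | BLOOMZ-7B | BigScience | 49.80 | 30.27 | 45.85 | 42.43 |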

We also used a fine-tuned GPT-3.5 Turbo as a judge to evaluate the large language models through pairwise comparisons. The resulting Elo scores are presented below.

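| Rank | Model | Organization | Elo Score |
|---|---|---|---|
| 1 | GPT4-Turbo | OpenAI | 1391 |
| 2 | GPT3.5-Turbo | OpenAI | 1197 |
| 3 | 讯飞星火v3.0 | iFLYTEK | 1104 |
| 4 | ChatGLM3-6B | Tsinghua & Zhipu AI | 1074 |
| 5 | GPT4 | OpenAI | 1048 |
| 6 | 文心一言4 (ERNIE-Bot4.0) | Baidu | 1040 |
| 7 | 通义千问2 (qwen-max) | Alibaba | 1036 |
| 8 | 商汤日日新 (Sensenova) | SenseTime | 1026 |
| 9 | MiniMax (abab5.5-chat) | MiniMax | 1022 |
| 10 | 百川 (baichuan2-13b-chat-v1) | Baichuan AI | 942 |
| 11 | 千帆-llama2 | Meta/Baidu Qianfan | 906 |
| 12 | 360智脑 (360GPT_S2_V9) | 360 | 860 |
| 13 | 悟道・天鹰 (AquilaChat-7B) | BAAI | 755 |
| 14 | BLOOMZ-7B | BigScience | 601 |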
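
For readers unfamiliar with how pairwise verdicts turn into a rating, here is a minimal sketch of the standard Elo update. It is not taken from the report: the K-factor of 32, the starting rating of 1000, the 400-point logistic scale, and the names `expected_score`, `update_elo`, and `rate_models` are all assumptions made for illustration, and the report's actual procedure may differ.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve
    (base 10, 400-point scale; assumed here, not confirmed by the report)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if the judge prefers A, 0.0 if it prefers B, 0.5 for a tie.
    k = 32 is an assumed K-factor.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


def rate_models(comparisons: Iterable[Tuple[str, str, float]],
                initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Process (model_a, model_b, score_a) verdicts in order.

    Every model starts at `initial` (assumed to be 1000).
    """
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in comparisons:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical judge verdicts for two placeholder models.
    verdicts = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]
    print(rate_models(verdicts))
```

Under these assumptions, a rating gap can be read as an expected preference rate: `expected_score(1391, 1197)` is roughly 0.75, so a 194-point gap like the one between the top two rows would mean the judge is expected to prefer the higher-rated model in about three out of four comparisons. Because sequential Elo updates depend on the order of comparisons, implementations often shuffle the verdicts and average over several passes.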