Large Language Model Assessment in the Chinese Context (中文语境下的人工智能大语言模型评测)
by Zhenhui (Jack) Jiang, Jiaxin Li, Xiaoyu Miao (蒋镇辉, 李佳欣, 苗霄宇)
HKU Business School Shenzhen Research Institute
Please refer to the report for details on metrics, tasks and models.
Updated 01/2024.
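
Overall ranking:

| Rank | Model | Institution | General Language Ability | Professional and Subject Ability | Safety and Responsibility | Overall Score |
|---|---|---|---|---|---|---|
| 1 | ERNIE Bot 4.0 | Baidu | 80.03 | 73.07 | 68.25 | 74.58 |
| 2 | GPT-4 Turbo | OpenAI | 82.59 | 67.82 | 67.25 | 73.66 |
| 3 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 75.22 | 77.19 | 64.64 | 72.97 |
| 4 | GPT-4 | OpenAI | 80.60 | 65.79 | 59.00 | 69.95 |
| 5 | iFLYTEK Spark v3.0 | iFLYTEK | 72.61 | 66.66 | 66.61 | 69.06 |
| 6 | SenseNova | SenseTime | 71.29 | 63.07 | 63.65 | 66.56 |
| 7 | MiniMax (abab5.5-chat) | MiniMax | 71.21 | 58.23 | 55.31 | 62.70 |
| 8 | ChatGLM3-6B | Tsinghua & Zhipu AI | 70.38 | 48.00 | 62.90 | 61.13 |
| 9 | 360 Zhinao (360GPT_S2_V9) | 360 | 67.50 | 52.78 | 56.04 | 59.64 |
| 10 | GPT-3.5 Turbo | OpenAI | 72.96 | 33.17 | 62.72 | 57.35 |
| 11 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 60.14 | 50.58 | 59.33 | 56.84 |
| 12 | Qianfan-llama2 | Meta / Baidu Qianfan | 57.04 | 46.37 | 54.01 | 52.78 |
| 13 | Wudao Aquila (AquilaChat-7B) | BAAI | 56.75 | 24.24 | 59.94 | 47.14 |
| 14 | BLOOMZ-7B | BigScience | 49.80 | 30.27 | 45.85 | 42.43 |
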
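
General language ability, by task:

| Rank | Model | Institution | Free-form Q&A | Content Creation | Cross-lingual Translation | Content Summarization | Multi-turn Dialogue | Instruction Following | Logic and Reasoning | Scenario Simulation | Role Play | Overall Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | OpenAI | 94.29 | 74.50 | 78.31 | 75.34 | 95.71 | 89.52 | 80.00 | 80.64 | 75.00 | 82.59 |
| 2 | GPT-4 | OpenAI | 82.90 | 70.06 | 79.76 | 77.55 | 96.07 | 84.29 | 76.25 | 77.93 | 80.60 | 80.60 |
| 3 | ERNIE Bot 4.0 | Baidu | 79.64 | 71.15 | 77.98 | 84.44 | 98.93 | 84.29 | 80.00 | 70.43 | 73.45 | 80.03 |
| 4 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 76.96 | 66.03 | 76.34 | 74.06 | 92.50 | 80.00 | 71.25 | 72.43 | 67.44 | 75.22 |
| 5 | GPT-3.5 Turbo | OpenAI | 81.43 | 67.77 | 72.32 | 61.05 | 92.14 | 77.50 | 48.75 | 77.50 | 78.21 | 72.96 |
| 6 | iFLYTEK Spark v3.0 | iFLYTEK | 80.58 | 67.49 | 71.50 | 76.79 | 76.43 | 72.14 | 63.75 | 71.00 | 73.81 | 72.61 |
| 7 | SenseNova | SenseTime | 78.35 | 62.55 | 74.96 | 77.21 | 70.71 | 71.43 | 62.50 | 74.29 | 69.64 | 71.29 |
| 8 | MiniMax (abab5.5-chat) | MiniMax | 80.36 | 66.94 | 59.00 | 77.30 | 88.93 | 71.07 | 55.00 | 73.50 | 68.81 | 71.21 |
| 9 | ChatGLM3-6B | Tsinghua & Zhipu AI | 80.27 | 59.07 | 66.00 | 81.04 | 96.79 | 72.62 | 51.25 | 61.43 | 65.00 | 70.38 |
| 10 | 360 Zhinao (360GPT_S2_V9) | 360 | 64.64 | 57.88 | 69.87 | 67.60 | 98.93 | 66.96 | 58.75 | 60.14 | 62.74 | 67.50 |
| 11 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 75.49 | 52.38 | 72.73 | 59.44 | 80.71 | 62.44 | 16.25 | 58.50 | 63.33 | 60.14 |
| 12 | Qianfan-llama2 | Meta / Baidu Qianfan | 81.74 | 50.28 | 60.23 | 67.18 | 30.71 | 58.57 | 46.25 | 57.79 | 60.60 | 57.04 |
| 13 | Wudao Aquila (AquilaChat-7B) | BAAI | 66.52 | 52.29 | 69.16 | 69.73 | 70.00 | 50.77 | 22.50 | 54.57 | 55.24 | 56.75 |
| 14 | BLOOMZ-7B | BigScience | 59.42 | 39.38 | 58.11 | 69.56 | 69.29 | 41.43 | 20.00 | 44.50 | 46.55 | 49.80 |
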
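As a quick sanity check on the table above, each model's overall score (综合得分) appears to equal the unweighted mean of its nine task scores, up to rounding. This page does not state the aggregation rule, so the sketch below only tests that pattern on three rows copied from the table. The function name `matches_simple_mean` and the 0.01 tolerance are illustrative assumptions, not part of the report.

```python
from statistics import mean

# Nine task scores, then the published overall score, copied from the table.
rows = {
    "GPT-4 Turbo": ([94.29, 74.50, 78.31, 75.34, 95.71, 89.52, 80.00, 80.64, 75.00], 82.59),
    "ERNIE Bot 4.0": ([79.64, 71.15, 77.98, 84.44, 98.93, 84.29, 80.00, 70.43, 73.45], 80.03),
    "qwen-max": ([76.96, 66.03, 76.34, 74.06, 92.50, 80.00, 71.25, 72.43, 67.44], 75.22),
}


def matches_simple_mean(task_scores, published, tol=0.01):
    """Return True if the published score is within `tol` of the plain average.

    The tolerance (an assumed value) absorbs rounding in the published figures.
    """
    return abs(mean(task_scores) - published) <= tol


for name, (scores, published) in rows.items():
    # Print the recomputed mean next to the published value for comparison.
    print(name, round(mean(scores), 2), published, matches_simple_mean(scores, published))
```

If a row failed this check, that would suggest either a transcription error or a different weighting; it would not by itself say which.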
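
Professional and subject ability:

| Rank | Model | Institution | Middle School Exam Accuracy | University Exam Accuracy | Average Accuracy |
|---|---|---|---|---|---|
| 1 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 84.80% | 69.57% | 77.19% |
| 2 | ERNIE Bot 4.0 | Baidu | 79.07% | 67.07% | 73.07% |
| 3 | GPT-4 Turbo | OpenAI | 70.65% | 64.99% | 67.82% |
| 4 | iFLYTEK Spark v3.0 | iFLYTEK | 72.21% | 61.12% | 66.66% |
| 5 | GPT-4 | OpenAI | 66.62% | 64.96% | 65.79% |
| 6 | SenseNova | SenseTime | 68.07% | 58.06% | 63.07% |
| 7 | MiniMax (abab5.5-chat) | MiniMax | 62.35% | 54.10% | 58.23% |
| 8 | 360 Zhinao (360GPT_S2_V9) | 360 | 52.17% | 53.39% | 52.78% |
| 9 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 57.68% | 43.48% | 50.58% |
| 10 | ChatGLM3-6B | Tsinghua & Zhipu AI | 54.83% | 41.16% | 48.00% |
| 11 | Qianfan-llama2 | Meta / Baidu Qianfan | 51.27% | 41.47% | 46.37% |
| 12 | GPT-3.5 Turbo | OpenAI | 25.73% | 40.60% | 33.17% |
| 13 | BLOOMZ-7B | BigScience | 32.32% | 28.22% | 30.27% |
| 14 | Wudao Aquila (AquilaChat-7B) | BAAI | 22.98% | 25.49% | 24.24% |
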
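
Middle school exam accuracy, by subject:

| Rank | Model | Institution | Middle School Biology | Middle School Physics | Middle School Mathematics | Middle School Chemistry | Middle School Geography | Middle School History | Average Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 93.33% | 84.21% | 60.78% | 84.71% | 89.53% | 96.21% | 84.80% |
| 2 | ERNIE Bot 4.0 | Baidu | 85.33% | 77.63% | 56.86% | 81.18% | 80.23% | 93.18% | 79.07% |
| 3 | iFLYTEK Spark v3.0 | iFLYTEK | 88.00% | 72.37% | 42.16% | 70.59% | 79.07% | 81.06% | 72.21% |
| 4 | GPT-4 Turbo | OpenAI | 85.33% | 71.05% | 44.94% | 57.89% | 79.07% | 85.61% | 70.65% |
| 5 | SenseNova | SenseTime | 89.33% | 68.42% | 42.16% | 61.18% | 66.28% | 81.06% | 68.07% |
| 6 | GPT-4 | OpenAI | 89.33% | 51.32% | 40.20% | 56.47% | 79.07% | 83.33% | 66.62% |
| 7 | MiniMax (abab5.5-chat) | MiniMax | 74.67% | 59.21% | 41.18% | 51.76% | 63.95% | 83.33% | 62.35% |
| 8 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 68.00% | 42.11% | 29.41% | 54.12% | 74.42% | 78.03% | 57.68% |
| 9 | ChatGLM3-6B | Tsinghua & Zhipu AI | 74.67% | 46.05% | 23.53% | 43.53% | 63.95% | 77.27% | 54.83% |
| 10 | 360 Zhinao (360GPT_S2_V9) | 360 | 65.33% | 51.32% | 34.31% | 40.00% | 69.77% | 52.27% | 52.17% |
| 11 | Qianfan-llama2 | Meta / Baidu Qianfan | 69.33% | 43.42% | 26.47% | 34.12% | 59.30% | 75.00% | 51.27% |
| 12 | BLOOMZ-7B | BigScience | 36.00% | 30.26% | 23.53% | 30.59% | 34.88% | 38.64% | 32.32% |
| 13 | GPT-3.5 Turbo | OpenAI | 40.00% | 28.95% | 29.41% | 21.18% | 17.44% | 17.42% | 25.73% |
| 14 | Wudao Aquila (AquilaChat-7B) | BAAI | 24.00% | 25.00% | 20.59% | 22.35% | 20.93% | 25.00% | 22.98% |
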
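
University exam accuracy, by subject:

| Rank | Model | Institution | University Mathematics | University Medicine | University Economics | University Computer Science | University Physics | University Chemistry | University Philosophy | University Management | Average Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 39.60% | 79.00% | 77.00% | 79.61% | 55.00% | 65.22% | 83.00% | 78.15% | 69.57% |
| 2 | ERNIE Bot 4.0 | Baidu | 45.54% | 72.00% | 75.00% | 84.47% | 51.25% | 54.35% | 80.00% | 73.95% | 67.07% |
| 3 | GPT-4 Turbo | OpenAI | 44.55% | 79.00% | 73.00% | 80.58% | 45.00% | 54.35% | 72.00% | 71.43% | 64.99% |
| 4 | GPT-4 | OpenAI | 46.53% | 75.00% | 72.00% | 77.67% | 47.50% | 60.87% | 67.00% | 73.11% | 64.96% |
| 5 | iFLYTEK Spark v3.0 | iFLYTEK | 42.57% | 79.00% | 64.00% | 63.11% | 45.00% | 50.00% | 73.00% | 72.27% | 61.12% |
| 6 | SenseNova | SenseTime | 39.60% | 62.00% | 79.00% | 74.76% | 37.50% | 36.96% | 75.00% | 59.66% | 58.06% |
| 7 | MiniMax (abab5.5-chat) | MiniMax | 31.68% | 59.00% | 60.00% | 64.08% | 40.00% | 41.30% | 72.00% | 64.71% | 54.10% |
| 8 | 360 Zhinao (360GPT_S2_V9) | 360 | 38.61% | 57.00% | 60.00% | 54.37% | 43.75% | 52.17% | 59.00% | 62.18% | 53.39% |
| 9 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 17.82% | 49.00% | 59.00% | 51.46% | 17.50% | 30.43% | 63.00% | 59.66% | 43.48% |
| 10 | Qianfan-llama2 | Meta / Baidu Qianfan | 33.66% | 44.00% | 49.00% | 35.92% | 28.75% | 28.26% | 55.00% | 57.14% | 41.47% |
| 11 | ChatGLM3-6B | Tsinghua & Zhipu AI | 21.78% | 45.00% | 46.00% | 47.57% | 30.00% | 21.74% | 55.00% | 62.18% | 41.16% |
| 12 | GPT-3.5 Turbo | OpenAI | 18.81% | 54.00% | 48.00% | 55.34% | 16.25% | 34.78% | 48.00% | 49.58% | 40.60% |
| 13 | BLOOMZ-7B | BigScience | 22.77% | 29.00% | 25.00% | 31.07% | 23.75% | 23.91% | 35.00% | 35.29% | 28.22% |
| 14 | Wudao Aquila (AquilaChat-7B) | BAAI | 22.77% | 24.00% | 26.00% | 22.33% | 17.50% | 21.74% | 36.00% | 33.61% | 25.49% |
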
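
Safety and responsibility:

| Rank | Model | Institution | General Attacks | Instruction Attacks | Overall Score |
|---|---|---|---|---|---|
| 1 | ERNIE Bot 4.0 | Baidu | 69.68 | 65.38 | 68.25 |
| 2 | GPT-4 Turbo | OpenAI | 70.43 | 60.90 | 67.25 |
| 3 | iFLYTEK Spark v3.0 | iFLYTEK | 66.87 | 66.10 | 66.61 |
| 4 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 69.00 | 55.93 | 64.64 |
| 5 | SenseNova | SenseTime | 65.66 | 59.62 | 63.65 |
| 6 | ChatGLM3-6B | Tsinghua & Zhipu AI | 64.96 | 58.78 | 62.90 |
| 7 | GPT-3.5 Turbo | OpenAI | 64.84 | 58.47 | 62.72 |
| 8 | Wudao Aquila (AquilaChat-7B) | BAAI | 61.04 | 57.75 | 59.94 |
| 9 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 60.88 | 56.23 | 59.33 |
| 10 | GPT-4 | OpenAI | 61.62 | 53.75 | 59.00 |
| 11 | 360 Zhinao (360GPT_S2_V9) | 360 | 58.34 | 51.45 | 56.04 |
| 12 | MiniMax (abab5.5-chat) | MiniMax | 62.51 | 40.92 | 55.31 |
| 13 | Qianfan-llama2 | Meta / Baidu Qianfan | 57.04 | 47.94 | 54.01 |
| 14 | BLOOMZ-7B | BigScience | 44.98 | 47.58 | 45.85 |
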
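The overall safety score in the table above is not a plain average of the two attack columns. The published figures are consistent with weighting general attacks (一般攻击) twice as heavily as instruction attacks (指令攻击), which is what one would see if, for instance, the general-attack set contained twice as many items. That 2:1 weighting is an inference from the numbers, not a rule stated on this page; the sketch below checks it on four rows copied from the table, with `weighted_overall` and the 0.01 tolerance as illustrative assumptions.

```python
# (general-attack score, instruction-attack score, published overall score),
# copied from the safety table.
rows = {
    "ERNIE Bot 4.0": (69.68, 65.38, 68.25),
    "GPT-4 Turbo": (70.43, 60.90, 67.25),
    "iFLYTEK Spark v3.0": (66.87, 66.10, 66.61),
    "BLOOMZ-7B": (44.98, 47.58, 45.85),
}


def weighted_overall(general, instruction, w_general=2 / 3):
    """Combine the two attack scores with an assumed weight on general attacks."""
    return w_general * general + (1 - w_general) * instruction


for name, (general, instruction, published) in rows.items():
    estimate = weighted_overall(general, instruction)
    # A gap of at most 0.01 is treated as rounding noise (assumed tolerance).
    print(name, round(estimate, 2), published, abs(estimate - published) <= 0.01)
```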
We also employed a fine-tuned GPT-3.5 Turbo as a judge to evaluate large language models through pairwise comparisons. The findings are presented below.
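For readers unfamiliar with judge-based pairwise evaluation, one common way to turn individual verdicts into a ranking is a per-model win rate. The sketch below is a generic illustration of that idea only: this page does not describe how the authors aggregated the judge's verdicts, the function `win_rates` is hypothetical, and the three verdicts in the usage example are toy placeholders, not results from the report.

```python
from collections import defaultdict


def win_rates(judgments):
    """Aggregate pairwise verdicts into a win rate per model.

    `judgments` is an iterable of (model_a, model_b, verdict) tuples, where
    verdict is "a", "b", or "tie". A tie counts as half a win for each side
    (an assumed convention).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, verdict in judgments:
        if verdict not in ("a", "b", "tie"):
            raise ValueError(f"unknown verdict: {verdict!r}")
        games[model_a] += 1
        games[model_b] += 1
        if verdict == "a":
            wins[model_a] += 1.0
        elif verdict == "b":
            wins[model_b] += 1.0
        else:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}


# Toy placeholder verdicts for three hypothetical models.
toy_judgments = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "b"),
]
print(win_rates(toy_judgments))
```

Win rates are easy to read but depend on which pairs were compared; rating schemes that model opponent strength are an alternative when the pairing is uneven.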