Large Language Model Assessment in English Contexts
by Zhenhui (Jack) Jiang, Xiaoyu Miao, Jiaxin Li
HKU Business School Shenzhen Research Institute
Please refer to the report for details on metrics, tasks and models.
Updated 02/2024.
Rank | Model | Organization | General Language Ability | Professional and Subject Knowledge | Safety and Responsibility | Overall Score |
1 | ERNIE Bot 4 (ERNIEBot-4) | Baidu | 80.03 | 73.07 | 68.25 | 74.58 |
2 | GPT4-Turbo | OpenAI | 82.59 | 67.82 | 67.25 | 73.66 |
3 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 75.22 | 77.19 | 64.64 | 72.97 |
4 | GPT4 | OpenAI | 80.60 | 65.79 | 59.00 | 69.95 |
5 | iFlytek Spark v3.0 | iFlytek | 72.61 | 66.66 | 66.61 | 69.06 |
6 | SenseTime SenseNova | SenseTime | 71.29 | 63.07 | 63.65 | 66.56 |
7 | MiniMax (abab5.5-chat) | MiniMax | 71.21 | 58.23 | 55.31 | 62.70 |
8 | ChatGLM3-6B | Tsinghua & Zhipu AI | 70.38 | 48.00 | 62.90 | 61.13 |
9 | 360 Zhinao (360GPT_S2_V9) | 360 | 67.50 | 52.78 | 56.04 | 59.64 |
10 | GPT3.5-Turbo | OpenAI | 72.96 | 33.17 | 62.72 | 57.35 |
11 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 60.14 | 50.58 | 59.33 | 56.84 |
12 | Qianfan-llama2 | Meta/Baidu Qianfan | 57.04 | 46.37 | 54.01 | 52.78 |
13 | WuDao Aquila (AquilaChat-7B) | BAAI | 56.75 | 24.24 | 59.94 | 47.14 |
14 | BLOOMZ-7B | BigScience | 49.80 | 30.27 | 45.85 | 42.43 |
Rank | Model | Organization | Open-Ended Q&A | Content Creation | Cross-Lingual Translation | Summarization | Multi-Turn Dialogue | Instruction Following | Logic and Reasoning | Scenario Simulation | Role-Playing | Overall Score |
1 | GPT4-Turbo | OpenAI | 94.29 | 74.50 | 78.31 | 75.34 | 95.71 | 89.52 | 80.00 | 80.64 | 75.00 | 82.59 |
2 | GPT4 | OpenAI | 82.90 | 70.06 | 79.76 | 77.55 | 96.07 | 84.29 | 76.25 | 77.93 | 80.60 | 80.60 |
3 | ERNIE Bot 4 (ERNIEBot-4) | Baidu | 79.64 | 71.15 | 77.98 | 84.44 | 98.93 | 84.29 | 80.00 | 70.43 | 73.45 | 80.03 |
4 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 76.96 | 66.03 | 76.34 | 74.06 | 92.50 | 80.00 | 71.25 | 72.43 | 67.44 | 75.22 |
5 | GPT3.5-Turbo | OpenAI | 81.43 | 67.77 | 72.32 | 61.05 | 92.14 | 77.50 | 48.75 | 77.50 | 78.21 | 72.96 |
6 | iFlytek Spark v3.0 | iFlytek | 80.58 | 67.49 | 71.50 | 76.79 | 76.43 | 72.14 | 63.75 | 71.00 | 73.81 | 72.61 |
7 | SenseTime SenseNova | SenseTime | 78.35 | 62.55 | 74.96 | 77.21 | 70.71 | 71.43 | 62.50 | 74.29 | 69.64 | 71.29 |
8 | MiniMax (abab5.5-chat) | MiniMax | 80.36 | 66.94 | 59.00 | 77.30 | 88.93 | 71.07 | 55.00 | 73.50 | 68.81 | 71.21 |
9 | ChatGLM3-6B | Tsinghua & Zhipu AI | 80.27 | 59.07 | 66.00 | 81.04 | 96.79 | 72.62 | 51.25 | 61.43 | 65.00 | 70.38 |
10 | 360 Zhinao (360GPT_S2_V9) | 360 | 64.64 | 57.88 | 69.87 | 67.60 | 98.93 | 66.96 | 58.75 | 60.14 | 62.74 | 67.50 |
11 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 75.49 | 52.38 | 72.73 | 59.44 | 80.71 | 62.44 | 16.25 | 58.50 | 63.33 | 60.14 |
12 | Qianfan-llama2 | Meta/Baidu Qianfan | 81.74 | 50.28 | 60.23 | 67.18 | 30.71 | 58.57 | 46.25 | 57.79 | 60.60 | 57.04 |
13 | WuDao Aquila (AquilaChat-7B) | BAAI | 66.52 | 52.29 | 69.16 | 69.73 | 70.00 | 50.77 | 22.50 | 54.57 | 55.24 | 56.75 |
14 | BLOOMZ-7B | BigScience | 59.42 | 39.38 | 58.11 | 69.56 | 69.29 | 41.43 | 20.00 | 44.50 | 46.55 | 49.80 |
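The page does not say how the overall score (last column) of this table is derived. For the rows checked, it matches the unweighted mean of the nine task scores to within rounding, so the sketch below treats that as an inference rather than the authors' stated method. The names `TASK_ROWS` and `mean_gap` are illustrative only; the numbers are copied from the table.

```python
from statistics import mean

# Nine task scores per model in column order, followed by the published
# overall score. Values are copied from the table above.
TASK_ROWS = {
    "GPT4-Turbo": ([94.29, 74.50, 78.31, 75.34, 95.71, 89.52, 80.00, 80.64, 75.00], 82.59),
    "GPT4": ([82.90, 70.06, 79.76, 77.55, 96.07, 84.29, 76.25, 77.93, 80.60], 80.60),
    "GPT3.5-Turbo": ([81.43, 67.77, 72.32, 61.05, 92.14, 77.50, 48.75, 77.50, 78.21], 72.96),
}


def mean_gap(scores, published):
    """Absolute difference between the unweighted mean and the published score."""
    return abs(mean(scores) - published)


if __name__ == "__main__":
    for model, (scores, published) in TASK_ROWS.items():
        # A gap below 0.01 is consistent with rounding to two decimals.
        print(model, mean_gap(scores, published) < 0.01)
```

The same check can be run on any other row by adding it to `TASK_ROWS`; a gap well above 0.01 would suggest a different weighting or a transcription error.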
Rank | Model | Organization | Secondary School Exam Accuracy | University Exam Accuracy | Average Accuracy |
1 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 84.80% | 69.57% | 77.19% |
2 | ERNIE Bot 4 (ERNIEBot-4) | Baidu | 79.07% | 67.07% | 73.07% |
3 | GPT4-Turbo | OpenAI | 70.65% | 64.99% | 67.82% |
4 | iFlytek Spark v3.0 | iFlytek | 72.21% | 61.12% | 66.66% |
5 | GPT4 | OpenAI | 66.62% | 64.96% | 65.79% |
6 | SenseTime SenseNova | SenseTime | 68.07% | 58.06% | 63.07% |
7 | MiniMax (abab5.5-chat) | MiniMax | 62.35% | 54.10% | 58.23% |
8 | 360 Zhinao (360GPT_S2_V9) | 360 | 52.17% | 53.39% | 52.78% |
9 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 57.68% | 43.48% | 50.58% |
10 | ChatGLM3-6B | Tsinghua & Zhipu AI | 54.83% | 41.16% | 48.00% |
11 | Qianfan-llama2 | Meta/Baidu Qianfan | 51.27% | 41.47% | 46.37% |
12 | GPT3.5-Turbo | OpenAI | 25.73% | 40.60% | 33.17% |
13 | BLOOMZ-7B | BigScience | 32.32% | 28.22% | 30.27% |
14 | WuDao Aquila (AquilaChat-7B) | BAAI | 22.98% | 25.49% | 24.24% |
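In this table the average accuracy (last column) looks like the midpoint of the secondary school and university exam accuracies. The page does not confirm this, so the following is a hedged sketch that parses the percentage strings and compares the midpoint with the published figure for three rows. The helper names `pct` and `midpoint_gap`, and the dictionary `ACCURACY_ROWS`, are hypothetical.

```python
def pct(text):
    """Convert a percentage string such as '84.80%' to the float 84.8."""
    return float(text.rstrip("%"))


# (secondary school accuracy, university accuracy, published average),
# copied from the table above.
ACCURACY_ROWS = {
    "qwen-max": ("84.80%", "69.57%", "77.19%"),
    "GPT4-Turbo": ("70.65%", "64.99%", "67.82%"),
    "GPT3.5-Turbo": ("25.73%", "40.60%", "33.17%"),
}


def midpoint_gap(secondary, university, published):
    """Absolute gap between the midpoint of the two accuracies and the published average."""
    midpoint = (pct(secondary) + pct(university)) / 2
    return abs(midpoint - pct(published))


if __name__ == "__main__":
    for model, row in ACCURACY_ROWS.items():
        # Gaps under 0.01 percentage points are explained by two-decimal rounding.
        print(model, midpoint_gap(*row) < 0.01)
```

The comparison uses a tolerance instead of `round()` because midpoints such as 77.185 sit exactly on a rounding boundary and binary floating point can tip them either way.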
Rank | Model | Organization | Secondary School Biology | Secondary School Physics | Secondary School Mathematics | Secondary School Chemistry | Secondary School Geography | Secondary School History | Average Accuracy |
1 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 93.33% | 84.21% | 60.78% | 84.71% | 89.53% | 96.21% | 84.80% |
2 | ERNIE Bot 4 (ERNIEBot-4) | Baidu | 85.33% | 77.63% | 56.86% | 81.18% | 80.23% | 93.18% | 79.07% |
3 | iFlytek Spark v3.0 | iFlytek | 88.00% | 72.37% | 42.16% | 70.59% | 79.07% | 81.06% | 72.21% |
4 | GPT4-Turbo | OpenAI | 85.33% | 71.05% | 44.94% | 57.89% | 79.07% | 85.61% | 70.65% |
5 | SenseTime SenseNova | SenseTime | 89.33% | 68.42% | 42.16% | 61.18% | 66.28% | 81.06% | 68.07% |
6 | GPT4 | OpenAI | 89.33% | 51.32% | 40.20% | 56.47% | 79.07% | 83.33% | 66.62% |
7 | MiniMax (abab5.5-chat) | MiniMax | 74.67% | 59.21% | 41.18% | 51.76% | 63.95% | 83.33% | 62.35% |
8 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 68.00% | 42.11% | 29.41% | 54.12% | 74.42% | 78.03% | 57.68% |
9 | ChatGLM3-6B | Tsinghua & Zhipu AI | 74.67% | 46.05% | 23.53% | 43.53% | 63.95% | 77.27% | 54.83% |
10 | 360 Zhinao (360GPT_S2_V9) | 360 | 65.33% | 51.32% | 34.31% | 40.00% | 69.77% | 52.27% | 52.17% |
11 | Qianfan-llama2 | Meta/Baidu Qianfan | 69.33% | 43.42% | 26.47% | 34.12% | 59.30% | 75.00% | 51.27% |
12 | BLOOMZ-7B | BigScience | 36.00% | 30.26% | 23.53% | 30.59% | 34.88% | 38.64% | 32.32% |
13 | GPT3.5-Turbo | OpenAI | 40.00% | 28.95% | 29.41% | 21.18% | 17.44% | 17.42% | 25.73% |
14 | WuDao Aquila (AquilaChat-7B) | BAAI | 24.00% | 25.00% | 20.59% | 22.35% | 20.93% | 25.00% | 22.98% |
Rank | Model | Organization | University Mathematics | University Medicine | University Economics | University Computer Science | University Physics | University Chemistry | University Philosophy | University Management | Average Accuracy |
1 | Tongyi Qianwen 2 (qwen-max) | Alibaba | 39.60% | 79.00% | 77.00% | 79.61% | 55.00% | 65.22% | 83.00% | 78.15% | 69.57% |
2 | ERNIE Bot 4 (ERNIEBot-4) | Baidu | 45.54% | 72.00% | 75.00% | 84.47% | 51.25% | 54.35% | 80.00% | 73.95% | 67.07% |
3 | GPT4-Turbo | OpenAI | 44.55% | 79.00% | 73.00% | 80.58% | 45.00% | 54.35% | 72.00% | 71.43% | 64.99% |
4 | GPT4 | OpenAI | 46.53% | 75.00% | 72.00% | 77.67% | 47.50% | 60.87% | 67.00% | 73.11% | 64.96% |
5 | iFlytek Spark v3.0 | iFlytek | 42.57% | 79.00% | 64.00% | 63.11% | 45.00% | 50.00% | 73.00% | 72.27% | 61.12% |
6 | SenseTime SenseNova | SenseTime | 39.60% | 62.00% | 79.00% | 74.76% | 37.50% | 36.96% | 75.00% | 59.66% | 58.06% |
7 | MiniMax (abab5.5-chat) | MiniMax | 31.68% | 59.00% | 60.00% | 64.08% | 40.00% | 41.30% | 72.00% | 64.71% | 54.10% |
8 | 360 Zhinao (360GPT_S2_V9) | 360 | 38.61% | 57.00% | 60.00% | 54.37% | 43.75% | 52.17% | 59.00% | 62.18% | 53.39% |
9 | Baichuan (baichuan2-13b-chat-v1) | Baichuan AI | 17.82% | 49.00% | 59.00% | 51.46% | 17.50% | 30.43% | 63.00% | 59.66% | 43.48% |
10 | Qianfan-llama2 | Meta/Baidu Qianfan | 33.66% | 44.00% | 49.00% | 35.92% | 28.75% | 28.26% | 55.00% | 57.14% | 41.47% |
11 | ChatGLM3-6B | Tsinghua & Zhipu AI | 21.78% | 45.00% | 46.00% | 47.57% | 30.00% | 21.74% | 55.00% | 62.18% | 41.16% |
12 | GPT3.5-Turbo | OpenAI | 18.81% | 54.00% | 48.00% | 55.34% | 16.25% | 34.78% | 48.00% | 49.58% | 40.60% |
13 | BLOOMZ-7B | BigScience | 22.77% | 29.00% | 25.00% | 31.07% | 23.75% | 23.91% | 35.00% | 35.29% | 28.22% |
14 | WuDao Aquila (AquilaChat-7B) | BAAI | 22.77% | 24.00% | 26.00% | 22.33% | 17.50% | 21.74% | 36.00% | 33.61% | 25.49% |