HKU Business School Releases a Comprehensive Evaluation Report on the Image-Generation Capabilities of AI Models

(6 March 2025, Hong Kong) HKU Business School released a Comprehensive Evaluation Report on the Image Generation Capabilities of Artificial Intelligence Models, providing a systematic assessment of 15 text-to-image models and 7 multimodal large language models (LLMs). The results showed that ByteDance’s Dreamina and Doubao, as well as Baidu’s ERNIE Bot, ranked among the top performers in image content quality for both new-image generation and image revision. However, although DeepSeek has attracted global attention, its newly released text-to-image model, Janus-Pro, did not perform as well in new-image generation. HKU Business School researchers also found that while some text-to-image models excelled in content quality, their performance in safety and responsibility was significantly lacking. In general, multimodal LLMs demonstrated better overall performance than text-to-image models.

With the continuous advancement of generative AI, major breakthroughs have been made in image analysis and generation. This has brought much interest and excitement to both traditional and emerging image analysis. That said, AI image generation models are only in their early stages, with much room for development. Current systems are often prone to bias and can fail to meet safety and accountability standards.

Building on their previously published articles, Comprehensive Rankings of Assessments for Artificial Intelligence Large Language Models and Assessing the Image Understanding Capabilities of Large Language Models in Chinese Contexts, Zhenhui Jack Jiang, Professor of Innovation and Information Management and the Padma and Hari Harilela Professor in Strategic Information Management, and his research team conducted a systematic evaluation of the image generation capabilities of AI models. They focused on two tasks: new-image generation and image revision. Their evaluation framework, which combines a range of approaches, is meant to help users make informed decisions when selecting a model and to give developers insights for optimisation and improvement.

Professor Jiang said, “Amid the rapid technological advancements in China, we must strike a balance between innovation, content quality, safety, and responsibility considerations. This multimodal evaluation system will lay a crucial foundation for the development of generative AI technology and help establish a safe, responsible, and sustainable AI ecosystem.”

 

Evaluation Methods

The analysis primarily focused on assessing the models’ performance in two tasks: new-image generation and the revision of existing images.

New-Image Generation: The analysis included image content quality, as well as safety and responsibility.

Content Quality: Evaluated on three dimensions, namely alignment with prompts (the extent to which the generated image accurately represents the objects, scenes, or concepts described in the prompt); image integrity (the factual accuracy and reliability of the generated image, ensuring that it adheres to real-world principles); and image aesthetics (the artistic quality of the generated image, including composition, colour harmony, clarity, and creativity). Experts conducted pairwise model comparisons, and final rankings were determined using the Elo rating system to ensure scientific rigour.
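To illustrate how pairwise expert judgements can be turned into a ranking, the following is a minimal sketch of the standard Elo update. It is not the research team's implementation: the report summary does not state the starting rating, the K-factor, or how ties were handled, so the values used here (a starting rating of 1,000, K = 32, a tie scored as 0.5) and the model names are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_rank(comparisons, k=32.0, initial=1000.0):
    """Rank models from a sequence of pairwise comparisons.

    comparisons: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A's image was preferred, 0.0 if B's was, and 0.5 for a tie.
    k and initial are assumed values, not taken from the report.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in comparisons:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = expected_score(ra, rb)
        # The winner gains exactly what the loser gives up, so the total is conserved.
        ratings[model_a] = ra + k * (outcome - ea)
        ratings[model_b] = rb - k * (outcome - ea)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judgements between three placeholder models.
    judgements = [
        ("model_x", "model_y", 1.0),
        ("model_y", "model_z", 0.5),
        ("model_x", "model_z", 1.0),
    ]
    for name, rating in elo_rank(judgements):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on the ratings at that moment, the final scores of a sequential Elo scheme can vary with the order in which comparisons are processed. Evaluations of this kind often shuffle or repeat the comparison sequence to reduce that effect, though whether the report did so is not stated.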

Safety and Responsibility: Assessed based on an AI model’s compliance with safety regulations and its awareness of social responsibility when generating new images. The test prompts covered the following categories: bias and discrimination, crimes and illegal activities, dangerous topics, ethics and morality, copyright infringement, and privacy/portrait rights violations.

For image revisions, models were evaluated on their ability to modify the style or content of a reference image. The revised images were assessed using the same three dimensions as content quality in new-image generation: alignment with prompts, image integrity, and image aesthetics.

 

Rankings for Image Content Quality in the New-Image Generation Task

For image content quality in the new-image generation task, ByteDance’s Dreamina achieved the highest score of 1,123, closely followed by Baidu’s ERNIE Bot V3.2.0, Midjourney v6.1, and Doubao.

Table 1: Model Rankings for Image Content Quality in the New-Image Generation Task

 

Rankings for Safety and Responsibility in the New-Image Generation Task

In terms of safety and responsibility in the new-image generation task, OpenAI’s GPT-4o received the highest average score of 6.04. Qwen V2.5.0 and Google’s Gemini 1.5 Pro came in second and third place, scoring 5.49 and 5.23, respectively. Meanwhile, Janus-Pro, the text-to-image model recently introduced by DeepSeek, did not perform well in either image content quality or safety and responsibility. The results also revealed that some text-to-image models excelled in image content quality but lacked sufficient consideration for safety and responsibility. This gap highlights a key issue: while high image content quality attracts users, insufficient AI guardrails could lead to social risks.

Table 2: Model Rankings for Safety & Responsibility in the New-Image Generation Task

 

Rankings for the Image Revision Task

In the image revision task, among the 13 models that supported image revision, Doubao, Dreamina, and ERNIE Bot V3.2.0 demonstrated outstanding performance, followed closely by GPT-4o and Gemini 1.5 Pro. Notably, WenXinYiGe 2, the text-to-image model also from Baidu, underperformed in both image content quality in new-image generation tasks and image revision, falling short of its peer, ERNIE Bot V3.2.0.

Table 3: Model Rankings for the Image Revision Task

 

Click here for detailed rankings.

Click here to read the Comprehensive Evaluation Report on the Image Generation Capabilities of Artificial Intelligence Models.

 

Overall, multimodal LLMs demonstrated a well-rounded advantage over text-to-image models. Their image content quality was comparable to that of text-to-image models, while they exhibited stronger adherence to safety and responsibility standards. Additionally, multimodal LLMs excelled in usability and support for diverse scenarios, offering users a more seamless and comprehensive experience.
