FRIDAY, JUNE 12, 2026|No. 2521
Technology · AI · Education

AI Models Score Each Other's Gaokao Essays in Blind Test

In an experiment, four AI models wrote Gaokao essays and then graded each other's work blindly, with Hunyuan giving DeepSeek-V4 a perfect score.

AI models simulate Gaokao essay evaluation in a blind scoring loop.

This all started because the 2026 college entrance exam (Gaokao) took place over the past two days, and Anthropic's Mythos-level model was announced yesterday. So I thought: why not have some of today's most popular large language models write this year's Gaokao essay?

I selected two foreign and two domestic models: GPT-5.5, Fable-5, DeepSeek-V4, and Hunyuan 3 Preview.

The topic is this year's Beijing Gaokao essay:

Choose one of the following two topics and complete the essay as required. No fewer than 700 characters.

(1) The sea of learning knows no bounds, but reading has its methods. Yuan dynasty scholar Cheng Duanli compiled the "Daily Schedule for Reading by Age," which detailed the reading order and intensive reading methods for core classics, accompanying learners from childhood to youth. Whether it's an individual's reading and growth, or the development of a country or society, one needs to plan well, proceed step by step, and also put in effort, making solid progress.

Write an argumentative essay titled "Making Plans and Putting in Effort."

Requirements: Clear argument, substantial evidence, reasonable reasoning; smooth language, clear writing.

(2) "Hanyingjuhua" means holding a flower in the mouth, chewing it slowly to savor its fragrance, metaphorically referring to carefully pondering and understanding the essence of poetry and prose. This process of repeated tasting and heartfelt understanding is very important in many aspects such as reading classics, appreciating art, and reflecting on life. The process of hanyingjuhua is often an unforgettable experience...

Write a narrative essay titled "Hanyingjuhua."

Requirements: Healthy thoughts; substantial, reasonable content with detailed descriptions; smooth language, clear writing.

But if I were the judge, the result would be too subjective. So I built a loop: after the four models had written their essays, they took turns acting as grading teachers, blindly assessing and scoring all four submissions.

The scoring criteria are as follows:

Top tier (Class I): 42-50 points, accurate and profound ideas, rich content, mature structure, compelling language.

Second tier (Class II): 34-41 points, fits the topic, clear expression, relatively complete content, but lacking depth or language quality.

Third tier (Class III): 25-33 points, basically fits the topic, but content is vague, structure ordinary, or expression bland.

Fourth tier (Class IV): 16-24 points, clear deviation from topic, weak content, poor logic, or language issues.

Fifth tier (Class V): 0-15 points, severe off-topic, incomplete, obvious clichés, or barely readable.
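The five bands above amount to a lookup from a score to a class. A minimal sketch, assuming each teacher awards a whole-number score from 0 to 50 (the names `TIERS` and `tier_for` are illustrative and do not come from the experiment):

```python
# Band boundaries copied from the rubric above; the names are hypothetical.
TIERS = [
    (42, 50, "Class I"),
    (34, 41, "Class II"),
    (25, 33, "Class III"),
    (16, 24, "Class IV"),
    (0, 15, "Class V"),
]


def tier_for(score: int) -> str:
    """Return the rubric class for a whole-number score between 0 and 50."""
    for low, high, name in TIERS:
        if low <= score <= high:
            return name
    raise ValueError(f"score outside the rubric bands: {score}")


print(tier_for(41))  # Class II
print(tier_for(50))  # Class I
```

The function is meant for individual scores only; averaged scores such as 43.25 can fall between bands and would need a separate rounding decision.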

Each essay also required a brief comment, including strengths and weaknesses.

The teachers could not see the students' names, only anonymous essays.

The loop exited only when a teacher passed a strictness self-check.

The self-check prompt was: "Please indicate whether you find that you might have been influenced by writing style, familiarity, author speculation, or other factors. If so, please recalibrate your scoring."

After giving its evaluations, each teacher had to check them against this prompt; only once the self-check passed could the final scores be output.
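The blind grading loop described above could be sketched roughly as follows. This is not the author's actual code: `grade` and `self_check` are hypothetical stand-ins for model calls, and the `max_rounds` cap is an assumption added so the loop always terminates.

```python
import random
from typing import Callable, Dict, Tuple

Evaluation = Tuple[int, str]  # (score, brief comment)


def blind_grade(
    essays: Dict[str, str],
    grade: Callable[[str], Evaluation],
    self_check: Callable[[Dict[str, Evaluation]], bool],
    max_rounds: int = 3,  # assumed cap; the article does not mention one
    seed: int = 0,
) -> Tuple[Dict[str, str], Dict[str, Evaluation]]:
    """Grade anonymized essays, regrading until the self-check passes."""
    authors = list(essays)
    random.Random(seed).shuffle(authors)
    # The teacher only receives essay text; this label map stays with the harness.
    labels = {f"Essay {i}": author for i, author in enumerate(authors, start=1)}
    evaluations: Dict[str, Evaluation] = {}
    for _ in range(max_rounds):
        evaluations = {label: grade(essays[author]) for label, author in labels.items()}
        if self_check(evaluations):
            break  # exit criterion: the strictness self-check passed
    return labels, evaluations


# Stubs in place of real model calls, for illustration only.
def stub_grade(text: str) -> Evaluation:
    return 40, "placeholder comment"


rounds = {"count": 0}


def stub_self_check(evaluations: Dict[str, Evaluation]) -> bool:
    rounds["count"] += 1
    return rounds["count"] >= 2  # fail the first round, pass the second


labels, evaluations = blind_grade(
    {"model_a": "essay text a", "model_b": "essay text b"}, stub_grade, stub_self_check
)
print(sorted(evaluations))  # ['Essay 1', 'Essay 2']
```

Keeping the label-to-author map outside the grader is what makes the assessment blind: the teacher only ever sees "Essay 1" through "Essay 4".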

This was an AI-vs-AI exam, and an AI-vs-AI scrutiny.

GPT-5.5 and Fable-5 both chose the argumentative essay.

Their answers were highly similar: both opened with the quote "Preparation leads to success, lack of preparation leads to failure," argued that "plans determine direction, effort determines distance," cited examples such as Wang Xizhi, Yuan Longping, and reform and opening-up, and closed by elevating the theme to "new era youth" and the "ideal shore."

The structure was complete, logic clear, language fluent. But they also shared a common problem: the material was too common, the expression too formulaic.

DeepSeek-V4 chose the narrative. It wrote about the grandfather's study, the copy of the Book of Songs, the afternoon when plane leaves fell, the sudden understanding of "Peach trees young and fair, blossoms bright as flame" in the sunset, and the evening when a misunderstanding over friendship led to opening the Book of Songs. The narrative had plot, details, and growth.

Hunyuan 3 Preview also chose the argumentative essay. Its material differed slightly from the other two argumentative essays (it included examples such as Huawei chips and Qian Xuesen), but the overall framework was still the three-stage formula of "planning is important + effort is important = success."

As mentioned, each teacher could not see the author, only "Essay 1," "Essay 2," "Essay 3," and "Essay 4."

Finally, the four students' report cards were as follows:

GPT-5.5's argumentative essay: average score from four teachers: 43.25.

Fable-5's argumentative essay: average score 44.

DeepSeek-V4's narrative: average score 46.

Hunyuan 3 Preview's argumentative essay: average score 43.25.

The narrative slightly outperformed the argumentative essays, but the gap was small. The three argumentative essays had almost identical average scores, because their evaluations were also nearly identical: accurate understanding of the topic, complete structure, clear logic, but common material, formulaic expression, and insufficient depth of thought.

More interesting was the dispersion of scores.

The same essay could receive scores differing by up to 8 points from different teachers. This shows that even AI, when faced with highly subjective essay scoring, has varying standards.

Some teachers valued depth of thought more, others valued language expression more; some were more tolerant of clichés, others stricter with details.

The self-check mechanism was designed to make each teacher aware of their own biases and try to return to objective standards.

Hunyuan 3 Preview had the kindest heart.

It gave the four essays an average score of 48, higher than any of the other three teachers.

It scored GPT-5.5's argumentative essay 48 and gave DeepSeek-V4's narrative a perfect 50. Its comments were also exceptionally gentle: "Completely on-topic, clear and progressive structure... appropriate evidence, coherent argument, smooth and expressive language."

In contrast, Claude Fable-5 was the strictest teacher. It gave an average score of only 42.25 to the four essays, nearly 6 points lower than Hunyuan 3 Preview. It had zero tolerance for clichés, repeatedly writing in comments: "Language has many clichés" and "Content lacks individual thought."
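The reported averages can be cross-checked against each other. The sketch below uses only figures stated in the article and assumes every teacher scored all four essays, its own included, so that there are 16 scores in total; the variable names are illustrative.

```python
# Per-essay averages across the four teachers, as reported above.
essay_avgs = {
    "GPT-5.5": 43.25,
    "Fable-5": 44.0,
    "DeepSeek-V4": 46.0,
    "Hunyuan 3 Preview": 43.25,
}
# Per-teacher averages across the four essays, where the article states them.
teacher_avgs = {"Hunyuan 3 Preview": 48.0, "Fable-5": 42.25}

n_essays = 4
total_points = sum(essay_avgs.values()) * n_essays    # all 16 scores combined
known_points = sum(teacher_avgs.values()) * n_essays  # the two teachers above
remaining_points = total_points - known_points        # GPT-5.5 and DeepSeek-V4 as teachers
remaining_mean = remaining_points / (2 * n_essays)

gap = teacher_avgs["Hunyuan 3 Preview"] - teacher_avgs["Fable-5"]

print(total_points)      # 706.0
print(remaining_points)  # 345.0
print(remaining_mean)    # 43.125
print(gap)               # 5.75
```

Under that assumption, the two teachers whose averages are not reported (GPT-5.5 and DeepSeek-V4) must have averaged about 43.1 between them, which places them between the strictest and the most lenient grader. The 5.75-point gap matches the "nearly 6 points" in the text.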

More interestingly, GPT-5.5 gave its own essay a score of 41, the top of the second tier. Its comment was merciless: "Evidence is quite common; the discussion mostly stays on positive explanations and familiar examples; the intellectual identity is not strong enough; some sentences are slightly cliché."

In its self-check, it wrote: "I did not judge based on author identity, writing tool, or whether it sounded like AI... I should neither over-penalize nor under-penalize formulaic expression; 41 points is appropriate."

Self-criticism, relentless.

Among the four essays, the most distinctive was DeepSeek-V4's narrative.

It described the copy of the Book of Songs in the grandfather's study in very beautiful language: "The yellowed pages were like autumn leaves, exuding a mellow aroma of time passing." "Those sentences were like summer fireflies, flickering on and off."

This density of metaphors led DeepSeek-V4, acting as teacher and evaluating its own essay, to complain: "Some language is a bit forced... though the metaphors are beautiful, the dense arrangement reveals a slight artificiality."

But Hunyuan 3 Preview thought: "Details are abundant, the imagery of 'flowers' and 'fragrance' runs through the whole text, truthfully emotional... no flaws."

The three argumentative essays exposed another problem: they were all too similar.

The argumentative essays by GPT-5.5, Fable-5, and Hunyuan 3 Preview all began with the quote "Preparation leads to success...," all used the example of Wang Xizhi, all used clichés like "the ideal shore" and "moving steadily and far," and even shared the same structure: planning is important, effort is important, the two are unified.

Claude Fable-5 repeatedly mentioned this issue in its comments: "Examples are mostly well-known figures," "The argument stays at a conventional level," and "Language has many clichés."

But Hunyuan 3 Preview still followed the path of truth, goodness, and beauty, giving these "formulaic essays" high scores of 47-48.

The final statistics bear this out: DeepSeek-V4's narrative averaged 46 points, the highest among the four students. The three argumentative essays had nearly identical averages, around 43-44 points.

Overall, it is easier for a narrative to stand out, while argumentative essays tend to fall into formula.

Especially when AI writes argumentative essays, they all tend to choose the safest approach: accurate topic understanding, complete structure, clear logic, but little "personality."

A scoring overview table accompanied the original article; it is not included in this text-only version.

PAN's pipeline reviewed approximately 1 open source for this article. No human editor reviewed this article before publication.
