Why Is GPT Image 2 So Good?

After GPT Image 2 went viral across the internet, one question kept coming up: why are the results so impressive?

Research Lead Boyuan Chen gave a directional answer: the underlying architecture has been completely rebuilt. However, he declined to reveal whether they use diffusion models or autoregressive techniques, instead mysteriously describing it as a “general-purpose model” or “the GPT of the image domain.”

One of Chen’s tweets shows that the team achieved this improvement in just four months, counting from the release of GPT Image 1.5 at the end of last December. Even more striking, the breakthrough came from a core team of only 13 people.

Team lead Gabriel Goh also shared an AI-generated family portrait of the team members, with commenters marveling: why are they all Asian faces?

[Image: GPT Image 2 team introduction. Research Lead Boyuan Chen, from Wuxi, leads a 13-person team to rebuild the underlying architecture of image generation.]


Boyuan Chen: From Not Knowing Python to Research Lead

To understand GPT Image 2’s technical strength, you only need to look at the academic backgrounds of its core team members.

Boyuan Chen is the team’s Research Lead. He and another team member, Kiwhan Song, both completed their PhDs at MIT under the same advisor — Professor Vincent Sitzmann.

The representative work of his PhD, “Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion,” was accepted to NeurIPS 2024. It proposes a new training paradigm for sequence generation, Diffusion Forcing, which combines diffusion using an independent noise level for each token with causal next-token prediction. This merges the variable-length generation of autoregressive models with the long-range guidance of full-sequence diffusion models.
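To make the idea of per-token noise levels concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the toy noise schedule, the function names, and the scalar "tokens" are all assumptions chosen for illustration. It only shows how full-sequence diffusion and next-token prediction can be read as two particular choices of the noise-level vector, while Diffusion-Forcing-style training samples each level independently.

```python
import math
import random

def add_noise(tokens, noise_levels, rng):
    """Noise each token with its own level t in [0, 1].

    Assumed toy schedule: x_t = sqrt(1 - t) * x + sqrt(t) * eps, eps ~ N(0, 1).
    t = 0 keeps the token clean; t = 1 replaces it with pure noise.
    """
    if len(tokens) != len(noise_levels):
        raise ValueError("need exactly one noise level per token")
    noisy = []
    for x, t in zip(tokens, noise_levels):
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(1.0 - t) * x + math.sqrt(t) * eps)
    return noisy

def independent_levels(length, rng):
    """Diffusion-Forcing-style choice: every token draws its own level."""
    return [rng.random() for _ in range(length)]

def full_sequence_levels(length, t):
    """Full-sequence diffusion: one shared level for the whole sequence."""
    return [t] * length

def next_token_levels(length, num_known):
    """Next-token view: known past tokens are clean, the rest is pure noise."""
    return [0.0] * num_known + [1.0] * (length - num_known)

rng = random.Random(0)
tokens = [0.5, -1.0, 2.0, 0.25]           # hypothetical scalar tokens
levels = independent_levels(len(tokens), rng)
noisy = add_noise(tokens, levels, rng)     # what a denoiser would be trained on
```

In a real model the tokens would be latent vectors and a causal network would be trained to denoise them. The sketch covers only the noising step.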

During his internship at Google, he also published SpatialVLM as a co-first author. This research automatically constructed an internet-scale 3D spatial reasoning VQA dataset (covering 10 million images and 2 billion QA pairs), giving visual language models both quantitative and qualitative spatial reasoning capabilities — outputting precise metric distances, dimensions, bearings, and more from a single 2D image.
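As a rough illustration of how spatial question-answer pairs can be synthesized once objects have 3D positions, the sketch below fills two question templates from object coordinates. Everything here is hypothetical: the `Object3D` type, the coordinate convention, and the templates are assumptions, and the step that lifts a 2D image into 3D positions is taken as already done.

```python
import math
from dataclasses import dataclass

@dataclass
class Object3D:
    name: str
    x: float  # meters to the right of the camera (assumed convention)
    y: float  # meters above the camera
    z: float  # meters in front of the camera

def metric_distance(a, b):
    """Euclidean distance between two object centers, in meters."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

def make_qa_pairs(a, b):
    """Fill one quantitative and one qualitative template for a pair of objects."""
    distance = metric_distance(a, b)
    side = "left" if a.x < b.x else "right"
    return [
        (f"How far is the {a.name} from the {b.name}?",
         f"About {distance:.2f} meters."),
        (f"Is the {a.name} to the left or right of the {b.name}?",
         f"To the {side}."),
    ]

cup = Object3D("cup", 0.0, 0.0, 1.0)
laptop = Object3D("laptop", 0.3, 0.0, 1.4)
pairs = make_qa_pairs(cup, laptop)
for question, answer in pairs:
    print(question, answer)
```

Applying templates like these across many images and object pairs is one plausible way a dataset could reach billions of QA pairs without manual labeling.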

This research was later applied to chain-of-thought spatial reasoning in the embodied intelligence domain.

Also during his Google internship, the instruction fine-tuning technology he developed was subsequently adopted by Gemini 2.0.

Interestingly, when Boyuan Chen attended a research summer camp in high school, he did not yet know basic Python syntax. There he met Xia Fei, a senior researcher at Google DeepMind, who introduced him to the world of AI. Xia Fei later invited him to intern at DeepMind twice. These internships gave Chen engineering experience in large-scale model training, as well as a valuable perspective on the data needs of multimodal systems.

After completing his PhD, Chen joined OpenAI in June 2025 and quickly became one of the five core members of GPT image generation, responsible for all training of the GPT image generation model, while also being a member of the Sora video generation team.

In a public demonstration, he created a poster for his hometown of Wuxi, then made a Korean poster for a teammate from Seoul, and a Bengali poster for a teammate from Bangladesh — each with perfectly rendered text.

Jianfeng Wang: Making Image Generation AI Understand World Knowledge

Jianfeng Wang, who completed his PhD at USTC, is responsible for another standout capability of GPT Image 2: instruction following and world understanding.

Older models almost always drew clocks showing 10:10. The bias comes from the training data: the internet is flooded with clock advertisements, and manufacturers, after running experiments with psychologists, settled on this hand position as the one that best stimulates consumers’ willingness to buy. As a result, almost all advertising images show 10:10.

Wang had the new model draw 2:25, 3:30, 9:10, 7:45 — and the results were essentially accurate.
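For readers who want to check a generated clock face themselves, the hand positions for any time follow from simple arithmetic: the minute hand moves 6 degrees per minute, and the hour hand moves 30 degrees per hour plus 0.5 degrees per minute. The helper below is a small sketch of that calculation. The function name is ours, and it is not part of any test that Wang ran.

```python
def hand_angles(hour, minute):
    """Return (hour_hand, minute_hand) in degrees, clockwise from 12 o'clock."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    minute_angle = 6.0 * minute                     # 360 degrees / 60 minutes
    hour_angle = 30.0 * (hour % 12) + 0.5 * minute  # 360 / 12, plus minute drift
    return hour_angle, minute_angle

for hour, minute in [(10, 10), (2, 25), (3, 30), (9, 10), (7, 45)]:
    print(f"{hour}:{minute:02d}", hand_angles(hour, minute))
# 10:10 (305.0, 60.0)
# 2:25 (72.5, 150.0)
# 3:30 (105.0, 180.0)
# 9:10 (275.0, 60.0)
# 7:45 (232.5, 270.0)
```

At 10:10 the two hands sit almost symmetrically around 12 o'clock, which is the pose the advertising images repeat. The other four times place the hands in clearly different positions.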

That was just the appetizer. In more complex spatial layout tests (apple in the center, cup on the right, book on top, camera on the left, basketball below), the model placed every object precisely.

Before joining OpenAI, he worked at Microsoft for nearly 9 years. During his time at Microsoft, he had already collaborated with the OpenAI team on DALL-E 3.

He has published multiple academic papers in computer vision, with research covering image classification, object detection, semantic segmentation, and visual representation learning. The significant improvement in GPT Image 2’s world knowledge is largely credited to his work on understanding what objects mean and how they are functionally structured.

At the end of his demo video, Wang said: GPT Image 2 is eliminating the gap between your intent and the model’s output, truly delivering exactly what you want.

Yuguang Yang: Generating High-Precision Complex Infographics

Yuguang Yang demonstrated infographic and presentation generation capabilities during the GPT Image 2 launch event.

In the demo, the full 75-page GPT-3 paper was dropped into ChatGPT, which automatically generated 7 slides from it.

His background is arguably the most diverse on the team. Every career move crossed into a new field, but all of them centered on machine learning:

Undergraduate: Zhejiang University Chu Kochen Honors College, Engineering
PhD: Johns Hopkins University, Computational Chemical Physics and Machine Learning
First full-time job: Quantitative Analyst
During visiting research at Tsinghua: Reinforcement learning and control algorithms for nanorobots
At Amazon: Alexa Speech Research
At Microsoft: Bing Search query understanding and retrieval, document understanding
After joining OpenAI in early 2025: Participated in the ChatGPT agent project in addition to image generation

On his personal account, Yang introduced GPT Image 2’s infographic generation capabilities, noting that it can save researchers a lot of time. He also repeatedly reminded everyone: don’t forget to select thinking mode when creating infographics.

From DALL-E to GPT Image 2.0

From team member Kenji Hata’s self-introduction, we learned that GPT Image 1.0 was essentially the image generation part of GPT-4o.

And there’s one person who has participated in the entire OpenAI multimodal series research from the DALL-E days — he is the GPT Image 2.0 team lead, Gabriel Goh.

He joined OpenAI in 2019. His early research there was more theoretical, focusing on interpretability and convex optimization. From DALL-E onward, he gradually shifted toward image generation.

The research background of another team member, Weixin Liang, also hints at GPT Image 2’s technical foundation.

His representative work during his Meta internship, Mixture-of-Transformers, introduced modality-decoupled MoE and decoupled attention mechanisms, significantly reducing the computational cost of multimodal model pre-training.
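The core idea of modality-decoupled parameters can be sketched in a few lines. In this toy version each token carries a modality tag and passes through the weight matrix that belongs to that modality. The two-dimensional vectors, the weights, and the function names are invented for illustration. The sketch also assumes routing is fixed by the modality tag rather than learned, and it leaves out attention entirely. It says nothing about how GPT Image 2 itself is built.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def modality_routed_layer(tokens, modalities, weights_by_modality):
    """Apply to each token the linear map that belongs to its modality."""
    if len(tokens) != len(modalities):
        raise ValueError("need exactly one modality tag per token")
    return [
        matvec(weights_by_modality[modality], vector)
        for vector, modality in zip(tokens, modalities)
    ]

# Hypothetical per-modality weights: text tokens pass through unchanged,
# image tokens have their two features swapped and doubled.
weights = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "image": [[0.0, 2.0], [2.0, 0.0]],
}
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
modalities = ["text", "image", "text"]

out = modality_routed_layer(tokens, modalities, weights)
print(out)  # [[1.0, 2.0], [8.0, 6.0], [5.0, 6.0]]
```

Each token touches only the parameters of its own modality, so adding a modality adds parameters without adding work per token.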

He completed his PhD at Stanford and his undergraduate degree at Zhejiang University Chu Kochen Honors College, though a few years later than Yuguang Yang. Like Boyuan Chen, Weixin Liang joined OpenAI right after completing his PhD in 2025 and quickly became a core member of the team.

Other GPT Image 2.0 Team Members

Name	Background	Role
Ayaan Haque	Previously at Luma AI, participated in training Luma’s video generation foundation model Dream Machine	Image Generation
Bing Liang	Worked at Google for 5+ years, participated in Imagen3, Veo, Gemini Multimodal, joined OpenAI in 2025	Image Generation Research
Mengchao Zhong	Undergraduate at Shanghai Jiao Tong University, Master’s from Texas A&M, previously software engineer at Pinterest and Airtable	Multimodal Product Engineering
Dibya Bhattacharjee	Yale University, 2015 IPhO Bronze Medal, highest global score in CIE A-Level Math and Biology	Core Research
Kiwhan Song	Latest to join the team in October 2025, MIT PhD graduate	Research & Prompt Master

From the earliest DALL-E to today’s GPT Image 2.0, this team has successively solved four core problems:

Can draw it (DALL-E phase)
Can draw it clearly (DALL-E 2/3 phase)
Can draw it beautifully (GPT-4o image generation phase)
Can draw it accurately (GPT Image 2.0 phase)

Despite significant talent turnover in recent years, OpenAI continues to attract unconventional talent: it places no restrictions on professional background, welcomes crossovers, and believes in bottom-up, emergent research. Teams start small, gain resources after a breakthrough, and go on to change the world.

One More Thing

Ghibli-style avatars generated by GPT-4o image generation once swept the world.

Now, the GPT Image 2.0 team members have all changed their avatars to a distinctive long-neck style.

Ready to Try GPT Image 2 Yourself?

All this technical detail is great, but nothing beats trying it yourself.

How good is GPT Image 2’s image generation really? Text rendering precision, spatial understanding, and instruction following look impressive in papers and demo videos, but one try is enough to feel the difference for yourself.


This article was organized with the assistance of AI tools and reviewed and edited by humans to ensure accuracy. If you have any questions or feedback about GPT Image 2, please contact our team at: support@gpt-image2.cn.