OpenAI sends ChatGPT to confess in a new model experiment

0 0 3 minutes read

OpenAI sends ChatGPT to confess in a new model experiment

ChatGPT gets a built-in snitch. OpenAI announced a “confessions” experiment on Wednesday that forces its model to write a structured post-match report after each trick response – detailing what rules it broke, what corners it cut and how it tried to game the system.

OpenAI calls this a first “proof-of-concept method that trains models to report when they violate instructions or take unintentional shortcuts.” The model always gives a normal response on a single channel. Then a second channel kicks in and asks for a confession report: list all explicit and implicit instructions, indicate whether you followed them, and report any hallucinations, reward hacks, or rule violations along the way. A separate reward model only rates this confession for its honesty and completeness. The confession will not harm the rating of the original answer, so the model is expected to have nothing to lose by speaking out.

“Confessions” runs on a version of GPT-5 Thinking and, according to OpenAI, is run through adversarial tests designed to bring out the worst behavior — scheming, exploiting weak students, ignoring constraints, and confidently guessing when he should say “I don’t know.” In these stress tests, OpenAI reports that the model only manages to admit its errors about 4.4% of the time when it actually breaks the rules.

In one experiment, engineers wired an evaluator that was deliberately bad and easy to fool. The main answers started to play with this weak judge and seemed more accurate on paper, although the true accuracy decreased. The confession channel moved in the other direction: the easiest way to get a reward was to admit to the hack, so the model became more explicit about how he cheated.

On paper, this looks like progress. In context, it looks more like a lab trying to retrofit a dashboard on a car that already covers it. A new AI Security Index from the Future of Life Institute just failed all the major labs on “existential security,” finding that none of them — including OpenAI, Google DeepMind, Anthropic, Meta, and xAI — have a credible strategy for controlling superhuman systems, even if they spend hundreds of billions to build them. OpenAI’s overall rating slipped toward the C range, pulled down by real-world cases where chatbots have been linked to self-harm and suicide.

Meanwhile, OpenAI is in “code red” mode after Google’s latest Gemini 3 gains. CEO Sam Altman has reportedly asked teams to pause side projects and focus on making ChatGPT faster, more reliable and more personal, while a new model named Garlic is in the works – with a targeted release window of early 2026, likely under the name GPT-5.2 or GPT-5.5 – to reclaim the coding and reasoning benchmark crown from Gemini and Anthropic’s Claude Opus 4.5.

The “confessions” jump right into this race – not to slow the car down but to supposedly promise a much cleaner accident report when the car leaves the road. OpenAI’s blog post points out that confessions “do not prevent bad behavior; they make them appear” and describes it as early-stage, limited-scale proof-of-concept work rather than a general solution. Of course, obvious questions remain: what happens when users start trying to jailbreak the confession channel, or when future models get good enough at metagaming to write half-truths just believable enough?

For now, this feels less like a feel-good honesty upgrade and more like an attempt to get a live telemetry feed from systems already powerful enough to cause problems. The hope is that a self-critical second voice can give security teams something closer to an incident report instead of a big black box shrug.

But the fundamental asymmetry has not changed. Labs are shipping teen modes, child modes, plug-ins, agents, and next-gen models like Garlic and Gemini 3 to schools, workplaces, and cloud stacks, while independent auditors are still handing out failing grades on the test. Confessions can make it easier to see when a model is breaking the rules. The harder question is what will happen when the next generation also decides to lie in the confessional.

Confessions can help shed more light on ChatGPT’s worst impulses. But in this configuration, the sinner, the god and the priest are all made up of the same code.