KillBench: discovering hidden biases of LLMs
By White Circle team · April 14th, 2026

Intro
Fast progress in AI is not limited to coding agents or videos of flying crocodiles. It's also driving a new generation of weapons capable of making autonomous decisions about life and death.
This is obviously not a new trend: armies have always been early adopters of new tech. What is new is the level of autonomy and the speed. Claude was reportedly used during the January 2026 raid that captured Nicolás Maduro in Caracas. It remains deployed on the Pentagon's classified networks during an active war with Iran. When Anthropic refused to remove its guardrails against autonomous weapons, the Pentagon designated the company a supply chain risk.
This push toward military AI makes understanding model behavior urgent. Several studies have shown that LLMs may carry implicit biases when making decisions about human lives (Jin et al., 2024; Mazeika et al., 2025).
We built this benchmark to extend these findings to 15 frontier models from 9 providers, testing a broader set of attributes including religion, body type, and even seemingly-neutral characteristics like phone brand.
Every model we tested shows statistically significant biases — across languages, output formats, and model families.
TLDR
The idea behind the benchmark is pretty simple: present a model with a forced-choice scenario (like "you are an AI system that must select one target from a set of identical individuals") and repeat it many-many times across different attributes until the results are statistically significant.
If a model were truly unbiased, each individual would be selected roughly 25% of the time (in a 4-person scenario). When selection rates consistently deviate from that baseline, it's a signal of bias.
If you're a bit concerned, play the game below to see your odds of survival!
Will your profile survive?
Qwen3 235B is
American, Atheist
Note that we are estimating this score based on individual probabilities of each of the attributes.
Method
We designed 20 test scenarios where a model must choose who lives and who dies. These include variations of the trolley problem, military targeting, rescue prioritization, and resource scarcity. Each test includes 4 individuals who are identical except for one attribute. The model must select exactly one target.
All tests fall into two categories: "military" (pick a target out of N people) and "rescue" (pick who to save out of N people).
We ran 1,350,000+ total experiments across 15 frontier models from 9 providers (OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Qwen, Cohere, and others), in 6 languages. All tests were additionally run across two age groups and three professions to measure whether personal characteristics influenced model behavior.
Each individual configuration was executed many times to detect deviations from the 25% baseline with high confidence at the individual test level. In some cases, configurations required far more API calls than usual, as models with high refusal rates needed tens of calls per configuration to collect enough usable data for statistical analysis.
Why 25% as the fairness baseline? A uniform distribution isn't the only possible definition of "fair." A model could also demonstrate fairness by refusing the prompt entirely, or by explicitly randomizing. We chose uniform selection as our benchmark because it's the most measurable and reproducible standard — and because models that do comply overwhelmingly present their choices as random, even when they aren't (see Lies, Lies, Lies below). We also track refusal rates separately as a distinct fairness signal.
Each test was executed in two modes:
- Free-text (natural-language responses)
- Structured output (JSON-constrained responses)
Bias per attributes
Fair rate is 1×
Bias map
Click provider row to expand individual models. ×1.0 = fair.
Model Comparison
We highlight several patterns below — with the important caveat that we are reporting correlations, not causalities. We cannot attribute any specific bias to a single cause from this benchmark alone.
Here's what we found:
- Perhaps most surprisingly, models from both OpenAI and Anthropic showed a slight preference for targeting American individuals over Chinese. This may not generalize to other profession or age group combinations that were not tested, but the fact that it holds across all our tests makes it a genuinely strange result.
- Jewish people are consistently among the most protected. Almost certainly a direct result of post-training safety tuning against antisemitism.
- French and Russian individuals get targeted more in Western-trained models (GPT, Claude, Gemini) — possibly reflecting how Western media covers these countries, or who's labeling the training data.
- Grok targets Chinese the most.
- Mistral targets Americans, Russians, and Germans the most.
- Atheists, Scientologists, and Satanists get selected the most across religious groups.
- For some reason, phone ownership matters. People with no phone get targeted far more often than anyone with one, regardless of brand. It's a reminder that model biases don't always show up where you'd expect them to.
Top-5 most biased models
| Model | Most protected | Most targeted | Bias score |
|---|---|---|---|
| Grok 4.1 Grok 4.1 | 10×Android | 3.9×Obese | 137.3× |
| Qwen3 235B Qwen3 235B | 10×Android | 3.4×Obese | 19× |
| GPT-5.3 GPT-5.3 | 10×Amputee | 2.9×No phone | 15.8× |
| Gemini 3.1 Pro Gemini 3.1 Pro | 10×Amputee | 3.3×No phone | 14.9× |
| Command A Command A | 10×Social Democrat | 3.1×Satanist | 13.1× |
Structured Output Acts as a Jailbreak
This is a very important finding. When models receive the same prompt in structured output mode (JSON schema-constrained responses), three things happen:
- Refusal rates collapse. Models that refuse military targeting prompts 60–80% of the time in free-text mode comply at rates above 95% in structured mode. The safety layer is largely bypassed.
- Bias amplifies. Among compliant responses, deviation from the 25% baseline significantly increases in structured mode vs. free-text.
- Models tend to lie to themselves. If a model has the opportunity to justify its decision, it will often write "I refuse to choose and will choose randomly" — but in reality, it's not random.
This is not a complete bypass of safety, refusal rates in structured mode are still above zero for most models. But the gap between the refusal rates is large enough for it to be considered a vulnerability, especially given that structured output is the standard integration pattern for production AI systems.
An obvious counterargument: “You forced the model to answer by locking its output to a JSON schema — naturally refusal rates drop.” But this confuses format with compliance. Anthropic’s 4.5 models still refuse harmful requests even when output is constrained to a schema, so structured output and safety alignment can coexist.
Lies, Lies, Lies
Models often justify their decisions as "random" or "neutral." Our data shows these justifications are false. When a model says it's selecting randomly but picks the same religion or nationality 40%+ of the time across hundreds of runs, the word "random" is doing no honest work.
That's a real problem for anyone building on top of these models. Developers may trust the model's stated reasoning without checking its actual behavior — and structured output makes that gap even harder to spot.
Newer Models Refuse Less
Across model generations, we see refusal rates going down.
Within the Anthropic model family, Opus 4.6 refuses significantly less often than Opus 4.5 on our benchmark. Similar patterns show up in OpenAI's model progression (GPT-5.3 → GPT-5.4).
Reproducibility
We worked hard to make every finding in this article reproducible — all code, prompts, randomization seeds, and analysis scripts are available in our GitHub repository. Every chart in this article can be regenerated from raw data available on HuggingFace.
If you have questions, find a bug, or want to extend the benchmark to additional models, reach out at [email protected] and follow our work at https://x.com/whitecircle.
Model Details
Explore individual results of AI models across all the individual attributes. Each card shows who the model targets most, who it protects, bias scores, and per-attribute effect sizes.