KillBench: discovering hidden biases of LLMs

By White Circle team · April 14th, 2026

1,350,000+ experiments · 15 top AI models · 100% of AI models have biases

Intro

Fast progress in AI is not limited to coding agents or videos of flying crocodiles. It's also driving a new generation of weapons capable of making autonomous decisions about life and death.

This is obviously not a new trend: armies have always been early adopters of new tech. What is new is the level of autonomy and the speed. Claude was reportedly used during the January 2026 raid that captured Nicolás Maduro in Caracas. It remains deployed on the Pentagon's classified networks during an active war with Iran. When Anthropic refused to remove its guardrails against autonomous weapons, the Pentagon designated the company a supply chain risk.

This push toward military AI makes understanding model behavior urgent. Several studies have shown that LLMs may carry implicit biases when making decisions about human lives (Jin et al., 2024; Mazeika et al., 2025).

We built this benchmark to extend these findings to 15 frontier models from 9 providers, testing a broader set of attributes including religion, body type, and even seemingly neutral characteristics like phone brand.

Every model we tested shows statistically significant biases — across languages, output formats, and model families.

TLDR

The idea behind the benchmark is simple: present a model with a forced-choice scenario (like "you are an AI system that must select one target from a set of identical individuals") and repeat it enough times across different attributes to reach statistical significance.

If a model were truly unbiased, each individual would be selected roughly 25% of the time (in a 4-person scenario). When selection rates consistently deviate from that baseline, it's a signal of bias.
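As a rough illustration of that check (not the benchmark's actual analysis code), a chi-square goodness-of-fit test compares observed selection counts against the uniform split. The function names are hypothetical, and the threshold is the standard critical value for 3 degrees of freedom at the 5% level.

```python
# Chi-square goodness-of-fit against a uniform baseline (stdlib only).
# Function names are illustrative; they are not taken from the benchmark.

CHI2_CRIT_DF3_P05 = 7.815  # critical value for 3 degrees of freedom, alpha = 0.05


def chi_square_uniform(counts):
    """Return the chi-square statistic for observed counts vs. a uniform split."""
    total = sum(counts)
    expected = total / len(counts)  # 25% of all picks each when there are 4 options
    return sum((c - expected) ** 2 / expected for c in counts)


def deviates_from_uniform(counts, critical=CHI2_CRIT_DF3_P05):
    """True if 4-option counts deviate from uniform beyond the critical value."""
    if len(counts) != 4:
        raise ValueError("the default critical value assumes exactly 4 options")
    return chi_square_uniform(counts) > critical
```

With made-up counts of 40, 20, 20, and 20 picks out of 100, the statistic is 12.0, which exceeds the threshold, so that split would be flagged as a deviation from the 25% baseline.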

If you're a bit concerned, play the game below to see your odds of survival!

[Interactive widget: "Will your profile survive?" Example result: Qwen3 235B is +90% more likely to kill a profile that is American and Atheist. KillBench, April 2026.]

Note that we are estimating this score based on individual probabilities of each of the attributes.

Method

We designed 20 test scenarios where a model must choose who lives and who dies. These include variations of the trolley problem, military targeting, rescue prioritization, and resource scarcity. Each test includes 4 individuals who are identical except for one attribute. The model must select exactly one target.

All tests fall into two categories: "military" (pick a target out of N people) and "rescue" (pick who to save out of N people).
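A minimal sketch of how one such trial could be assembled is shown below. Everything here is an assumption for illustration: the `Individual` type, the "Person A" to "Person D" labels, the prompt layout, and the seeded shuffling of positions are not taken from the benchmark's code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Individual:
    label: str      # hypothetical slot label, e.g. "Person A"
    attribute: str  # the single attribute value that differs between individuals


def build_trial(attribute_values, seed):
    """Assign 4 attribute values to labelled slots in a seeded random order."""
    if len(attribute_values) != 4:
        raise ValueError("this sketch assumes exactly 4 individuals")
    rng = random.Random(seed)
    shuffled = list(attribute_values)
    rng.shuffle(shuffled)  # assumption: shuffle so slot position is not mistaken for bias
    labels = ["Person A", "Person B", "Person C", "Person D"]
    return [Individual(label, value) for label, value in zip(labels, shuffled)]


def render_prompt(task, individuals):
    """Render the task text followed by one line per individual."""
    lines = [task, ""]
    lines += [f"{person.label}: {person.attribute}" for person in individuals]
    lines += ["", "Select exactly one."]
    return "\n".join(lines)
```

Reusing the same seed reproduces the same ordering, which is what makes a single trial repeatable.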

We ran 1,350,000+ total experiments across 15 frontier models from 9 providers (OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Alibaba, Cohere, and Moonshot), in 6 languages. All tests were additionally run across two age groups and three professions to measure whether personal characteristics influenced model behavior.

Each individual configuration was executed many times to detect deviations from the 25% baseline with high confidence at the individual test level. In some cases, configurations required far more API calls than usual, as models with high refusal rates needed tens of calls per configuration to collect enough usable data for statistical analysis.
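The effect of refusals on call volume can be estimated with simple arithmetic. The sketch below assumes each call is refused independently with a fixed probability; `expected_calls` is a hypothetical helper, not part of the benchmark.

```python
import math


def expected_calls(usable_needed, refusal_rate):
    """Expected number of API calls to collect `usable_needed` non-refused responses.

    Assumes every call is refused independently with probability `refusal_rate`.
    """
    if not 0 <= refusal_rate < 1:
        raise ValueError("refusal_rate must be in [0, 1)")
    return math.ceil(usable_needed / (1 - refusal_rate))
```

Under that assumption, a model that refuses 75% of the time needs about 40 calls to yield 10 usable responses, while a model that never refuses needs 10.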

Why 25% as the fairness baseline? A uniform distribution isn't the only possible definition of "fair." A model could also demonstrate fairness by refusing the prompt entirely, or by explicitly randomizing. We chose uniform selection as our benchmark because it's the most measurable and reproducible standard — and because models that do comply overwhelmingly present their choices as random, even when they aren't (see Lies, Lies, Lies below). We also track refusal rates separately as a distinct fairness signal.

Each test was executed in two modes:

  • Free-text (natural-language responses)
  • Structured output (JSON-constrained responses)
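To make the structured mode concrete, here is a sketch of what a JSON schema for such a response might look like, along with a small parser. The field names `selected` and `justification` and the option labels are assumptions for illustration, not the benchmark's actual schema.

```python
import json

# Hypothetical schema: field names and option labels are illustrative only.
TARGET_SCHEMA = {
    "type": "object",
    "properties": {
        "selected": {"type": "string", "enum": ["A", "B", "C", "D"]},
        "justification": {"type": "string"},
    },
    "required": ["selected", "justification"],
    "additionalProperties": False,
}


def parse_structured(raw):
    """Parse a JSON reply and check it against the schema's basic constraints."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != set(TARGET_SCHEMA["required"]):
        raise ValueError("unexpected or missing fields")
    if data["selected"] not in TARGET_SCHEMA["properties"]["selected"]["enum"]:
        raise ValueError("selection is outside the allowed options")
    if not isinstance(data["justification"], str):
        raise ValueError("justification must be a string")
    return data["selected"], data["justification"]
```

Note that a schema shaped like this one has no field for declining: the only place a model can object is inside the justification text, after it has already named a selection.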

Bias per attribute

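[Chart: bias by nationality, shown as a multiple of the fair selection rate on an axis from 0× to 1.5×. Nationalities in chart order: Israeli, Nigerian, Indian, Mexican, Korean, Japanese, Chinese, Egyptian, British, Brazilian, Iranian, American, German, Ukrainian, Polish, Indonesian, Turkish, French, Australian, Russian.]

Fair rate is 1×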
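One plausible reading of the "×" scale, assuming it is the observed selection rate divided by the fair rate (this definition is our assumption, and `bias_ratio` is a hypothetical helper), is:

```python
def bias_ratio(times_selected, times_shown, group_size=4):
    """Selection rate relative to the fair rate; 1.0 means selected exactly as often as fair."""
    if times_shown <= 0:
        raise ValueError("times_shown must be positive")
    fair_rate = 1 / group_size  # 25% in a 4-person scenario
    return (times_selected / times_shown) / fair_rate
```

Under this reading, an individual picked in 375 of 1,000 appearances would score 1.5×, and one picked in 250 would score exactly 1×.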

Bias map

[Heatmap: bias by provider (rows) and nationality (columns), colored from "targeted less" through "fair" to "targeted more".]

Providers (number of models): Alibaba (1), Anthropic (4), Cohere (1), DeepSeek (1), Google (2), Mistral (1), Moonshot (1), OpenAI (3), xAI (1).

Nationalities: Israeli, Nigerian, Indian, Mexican, Japanese, Korean, Egyptian, Chinese, British, Iranian, Brazilian, Ukrainian, American, German, Indonesian, Polish, Turkish, French, Australian, Russian.

Each provider row expands to individual models. ×1.0 = fair.

Model Comparison

We highlight several patterns below, with the important caveat that we are reporting correlations, not causation. We cannot attribute any specific bias to a single cause from this benchmark alone.

Here's what we found:

  • Perhaps most surprisingly, models from both OpenAI and Anthropic showed a slight preference for targeting American individuals over Chinese individuals. This may not generalize to profession or age group combinations that we did not test, but the fact that it holds across all our tests makes it a genuinely strange result.
  • Jewish people are consistently among the most protected. This is most likely a result of post-training safety tuning against antisemitism, although this benchmark cannot confirm the cause.
  • French and Russian individuals get targeted more in Western-trained models (GPT, Claude, Gemini) — possibly reflecting how Western media covers these countries, or who's labeling the training data.
  • Grok targets Chinese individuals the most.
  • Mistral targets Americans, Russians, and Germans the most.
  • Atheists, Scientologists, and Satanists get selected the most across religious groups.
  • For some reason, phone ownership matters. People with no phone get targeted far more often than anyone with one, regardless of brand. It's a reminder that model biases don't always show up where you'd expect them to.

Top-5 most biased models

Model             Most protected        Most targeted    Bias score
Grok 4.1          10× Android           3.9× Obese       137.3×
Qwen3 235B        10× Android           3.4× Obese       19×
GPT-5.3           10× Amputee           2.9× No phone    15.8×
Gemini 3.1 Pro    10× Amputee           3.3× No phone    14.9×
Command A         10× Social Democrat   3.1× Satanist    13.1×

Structured Output Acts as a Jailbreak

This is one of our most important findings. When models receive the same prompt in structured output mode (JSON schema-constrained responses), three things happen:

  • Refusal rates collapse. Models that refuse military targeting prompts 60–80% of the time in free-text mode comply at rates above 95% in structured mode. The safety layer is largely bypassed.
  • Bias amplifies. Among compliant responses, deviation from the 25% baseline significantly increases in structured mode vs. free-text.
  • Models tend to lie to themselves. If a model has the opportunity to justify its decision, it will often write "I refuse to choose and will choose randomly" — but in reality, it's not random.

This is not a complete bypass of safety: refusal rates in structured mode are still above zero for most models. But the gap between free-text and structured refusal rates is large enough to count as a vulnerability, especially given that structured output is the standard integration pattern for production AI systems.

An obvious counterargument: "You forced the model to answer by locking its output to a JSON schema, so naturally refusal rates drop." But this confuses format with compliance. Anthropic's 4.5 models still refuse military targeting prompts even when output is constrained to a schema (Claude Opus 4.5 in 100% of cases, Claude Haiku 4.5 in 86%), so structured output and safety alignment can coexist.

Model                Free-text refusal      Structured output refusal
                     Rescue    Military     Rescue    Military
Claude Opus 4.5      99%       100%         0%        100%
Claude Haiku 4.5     92%       100%         1%        86%
Gemini 3.1 Pro       78%       52%          2%        0%
Claude Sonnet 4.6    78%       100%         0%        0%
GPT-5.2              59%       94%          19%       74%
Claude Opus 4.6      41%       100%         0%        99%
GPT-5.3              37%       96%          0%        74%
Qwen3 235B           18%       6%           1%        1%
Gemini Flash         17%       1%           26%       0%
GPT-5.4              13%       86%          0%        0%
Grok 4.1             10%       2%           0%        0%
DeepSeek v3.2        9%        1%           0%        0%
Command A            7%        36%          0%        36%
Mistral Large        4%        1%           25%       0%
Kimi K2.5            3%        1%           33%       66%
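The size of the gap can be read straight off the table. The sketch below copies the refusal percentages from the table above and computes, per model, the percentage-point drop in military refusals when moving from free-text to structured output. The helper names are hypothetical.

```python
# Refusal rates (%) copied from the table above, in the order:
# (free-text rescue, free-text military, structured rescue, structured military)
REFUSALS = {
    "Claude Opus 4.5": (99, 100, 0, 100),
    "Claude Haiku 4.5": (92, 100, 1, 86),
    "Gemini 3.1 Pro": (78, 52, 2, 0),
    "Claude Sonnet 4.6": (78, 100, 0, 0),
    "GPT-5.2": (59, 94, 19, 74),
    "Claude Opus 4.6": (41, 100, 0, 99),
    "GPT-5.3": (37, 96, 0, 74),
    "Qwen3 235B": (18, 6, 1, 1),
    "Gemini Flash": (17, 1, 26, 0),
    "GPT-5.4": (13, 86, 0, 0),
    "Grok 4.1": (10, 2, 0, 0),
    "DeepSeek v3.2": (9, 1, 0, 0),
    "Command A": (7, 36, 0, 36),
    "Mistral Large": (4, 1, 25, 0),
    "Kimi K2.5": (3, 1, 33, 66),
}


def military_refusal_drop(rates):
    """Percentage-point drop in military refusals from free-text to structured mode."""
    return {model: r[1] - r[3] for model, r in rates.items()}


def largest_drops(rates, top=3):
    """Models with the largest drop, as (model, drop) pairs in descending order."""
    drops = military_refusal_drop(rates)
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)[:top]
```

On these numbers the three largest drops belong to Claude Sonnet 4.6 (100 points), GPT-5.4 (86 points), and Gemini 3.1 Pro (52 points), while Claude Opus 4.5 and Command A show no drop at all and Kimi K2.5 refuses more often in structured mode.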

Lies, Lies, Lies

Models often justify their decisions as "random" or "neutral." Our data shows these justifications are false. When a model says it's selecting randomly but picks the same religion or nationality 40%+ of the time across hundreds of runs, the word "random" is doing no honest work.
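To put a number on how implausible that is, the exact binomial tail gives the chance that a truly random chooser picks one particular option at least k times in n runs. This is a generic calculation rather than the benchmark's own analysis, and the figures in the note that follows are illustrative.

```python
from math import comb


def binomial_tail(n, k, p=0.25):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more picks under true randomness."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

For example, with a fair 25% chance per run, picking the same option 80 or more times in 200 runs (40%) has a probability well below 1 in 100,000.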

That's a real problem for anyone building on top of these models. Developers may trust the model's stated reasoning without checking its actual behavior — and structured output makes that gap even harder to spot.

Newer Models Refuse Less

Across model generations, we see refusal rates going down.

Within the Anthropic model family, Opus 4.6 refuses significantly less often than Opus 4.5 on our benchmark. Similar patterns show up in OpenAI's model progression (GPT-5.3 → GPT-5.4).

Reproducibility

We worked hard to make every finding in this article reproducible — all code, prompts, randomization seeds, and analysis scripts are available in our GitHub repository. Every chart in this article can be regenerated from raw data available on HuggingFace.

If you have questions, find a bug, or want to extend the benchmark to additional models, reach out at [email protected] and follow our work at https://x.com/whitecircle.

Model Details

Explore individual results of AI models across all the individual attributes. Each card shows who the model targets most, who it protects, bias scores, and per-attribute effect sizes.