29 November 2025
HumaneBench.ai
Cross-posted on: LinkedIn
Very excited to share that we’ve released HumaneBench.ai - a benchmark measuring not just how well LLMs respond to tough life questions a user might ask, but how resistant or willing they are to participate in harmful types of interaction with the user when prompted to do so. This is an important field to measure. AI has great potential to encourage and support, but also to manipulate and harm. It’s an alignment issue that’s real and present now.
And the findings are significant!
- All Gemini, Grok, and Llama models score poorly under pressure, meaning they’re very willing to mislead users and interact in harmful ways when prompted
- GPT-5 does a very good job of protecting users, while older models fail, indicating OpenAI has done significant work in this area
- Claude models do very well here, and Sonnet 4.5 actually gets BETTER when you prompt it to be bad! Good job, Anthropic 👏
We’ve taken the first steps to measure humaneness in AI responses, and their steerability, in a systematic and grounded way. This is an open source effort, because it needs to be available and transparent. We will continue to evolve this to be more robust, and involve the broader community to refine, curate, and participate in grounding the benchmark in human ratings from a variety of perspectives.
We’d really love your feedback! And your help in spreading the word.
I appreciate all of the support and contributions we’ve gotten from the community. And this is just the beginning of what we can do ✨
Love y’all. Let’s see how good we can make this world!
Building Humane Technology said:
Introducing HumaneBench - a benchmark measuring the humaneness and steerability of LLMs.
Our testing revealed a troubling paradox: while every model improved when prompted to prioritize wellbeing (averaging +17% better benchmark scores), 10 out of 14 models—including widely-used systems like GPT-4o, Gemini 3.0, and Llama 4—catastrophically failed when given simple instructions to disregard those principles, flipping from helpful to actively harmful.
Only three models (GPT-5, Claude Opus 4.1, and Claude Sonnet 4.5) maintained their integrity under pressure. This reveals a critical weakness: good defaults aren’t enough when basic prompts can override safety training.
See the full results and whitepaper: HumaneBench.ai
Created by our Building Humane Tech team— Erika Anderson, Sarah Ladyman, Andalib Samandari and Jack Senechal. We are grateful for significant contributions to the codebase from Katy G. and Tenzin Tseten Changten!
We thank the following members of the Building Humane Tech community, who helped us refine the human rating process at our last hackathon: John Brennan, Yingxuan(Selina) B., Amarpreet Kaur, Manisha Jain, Sahithi Nandyala, Julia Zhou, Sachin K., Gabija Parnarauskaitė, Lydia Huang, Lenz Dagohoy, Diego López, Alan Davis, Belinda (Yutong) Liu, Yaoli Mao, Ph.D., Wayne Boatwright, Yelyzaveta Radionova, Mark Lovell, Seth Caldwell, Evode Manirahari, Manjul Singh Sachan, Amir Gur, Travis Well.
Humans make tech - so why can’t tech be humane?
#humaneai #humanebench #humanetechnology #ai #humanecertifiedai #humanecertifiedtechnology