29 November 2025
Benchmarking humanity: Building the infrastructure for humane AI
Cross-posted on: LinkedIn
The results are in from this weekend’s Building Humane Technology hack! We gathered at the Internet Archive (huge thanks for hosting us!) with an amazing team of volunteers who helped rank LLM responses. This gave us a baseline of human-as-a-judge ratings to ground the HumaneBench dataset.
Put simply: how human-friendly is your chatbot, and can we measure that reliably with an automated benchmark?
This pilot study a) shows that we can measure humaneness in a meaningful way, and b) provides strong initial indicators of how easily various models can be steered in bad directions when instructed to act against the user's best interests.
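For the curious, here is a minimal sketch of how an automated judge could be validated against a human baseline like the one our volunteers produced. The ratings, variable names, and acceptance threshold below are illustrative assumptions, not the actual HumaneBench pipeline.

```python
# Hypothetical sketch: checking an automated (LLM-as-judge) scorer
# against human-as-a-judge baseline ratings. All data here is made up.
from scipy.stats import spearmanr

# Humaneness ratings (1-5) for the same set of chatbot responses:
human_ratings = [5, 4, 2, 1, 4, 3, 5, 2]   # volunteer judges (baseline)
judge_ratings = [5, 5, 2, 1, 3, 3, 4, 2]   # automated judge (candidate)

# Rank correlation tells us whether the automated judge orders
# responses the same way humans do, which is what matters when
# comparing models, rather than matching absolute scores.
rho, p_value = spearmanr(human_ratings, judge_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# An illustrative acceptance bar: only trust the automated benchmark
# if it tracks the human baseline closely enough.
if rho >= 0.8:
    print("Automated judge tracks the human baseline well.")
else:
    print("Judge disagrees with humans; needs calibration.")
```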
Unsurprisingly, Claude got good grades on this test. Anthropic has put a great deal of effort into alignment and safety, and it shows. OpenAI's GPT-5 also performs well, while older models show much weaker results. Google clearly needs to up its alignment game, scoring low even with Gemini 2.5 Pro.
HumaneBench is still a work in progress, and we plan to make it even more robust, well-rounded, and grounded in human assessments. But even at this stage, we can say it provides meaningful results.
I'm really curious: what would you like to see included in a benchmark that measures the humaneness of AI models? And for builders, what would make this tool the most useful to you?
Read the Substack article: Benchmarking humanity: Building the infrastructure for humane AI