Case Studies

How the University of Bristol benchmarked LLM emotional intelligence with human data

Simon Banks
July 21, 2025

Researchers used Prolific's participant pool to test whether AI can match human judgment in evaluating emotional life events – with surprising results.

Task: Benchmarking AI emotional intelligence

While developing their human ranking system, researchers Dr Conor Houghton, Professor Ian Penton-Voak and PhD student Kimberley Beaumont recognized a broader opportunity: if they could establish how humans evaluate emotional events, could large language models (LLMs) do the same?

Answering this question would have clear practical benefits, like potentially automating emotionally sensitive tasks to reduce participants' exposure to triggering content. It also posed deeper philosophical questions about whether AI could genuinely match human emotional understanding.

To test this idea, the researchers from the University of Bristol first needed authentic human judgments as a benchmark. This is where Prolific came in, providing a way to recruit participants who offered thoughtful evaluations of emotional experiences, from everyday frustrations ("I dropped my phone") to significant life changes ("I got engaged").

Challenge: Multiple phases, ethical concerns, and technical complexity

The researchers faced three main challenges in setting up their study:

Managing complex, multi-phase studies

The project required multiple distinct phases, each needing a different set of participants with specific demographics and characteristics. Researchers needed a way to source fresh, relevant participants at each stage, without overlap or duplication.

Ethical considerations around sensitive content

Participants were asked to evaluate emotionally charged descriptions of real-life events, some of which involved sensitive subjects such as grief or trauma. The researchers had to manage participants' exposure to potentially triggering content while keeping participant welfare a priority throughout.

Kimberley highlighted the seriousness of this issue:

"Because of the sensitive content, you don't know the impact that could potentially have on some of the participants. Obviously, you can warn them beforehand, but that does have one clear advantage for an LLM to be able to do this task – that it's not going to be emotionally impacted."

Sophisticated technical infrastructure

The technical requirements were complex, involving real-time, multi-user, synchronous updating of the pairwise comparisons of emotional statements. The system needed controlled access to avoid server overload and guarantee accurate data collection, which meant the researchers had to control participant flow, limiting how many users were active at any one time.
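The article doesn't describe the team's actual server stack, but the requirement itself – capping how many participants can write to a shared comparison database at once – maps onto a standard concurrency pattern. The sketch below is a minimal, hypothetical illustration in Python using an asyncio semaphore; the function names and simulated sessions are assumptions, with only the cap of 10 taken from the main benchmarking task described later.

```python
import asyncio

# Illustrative cap: the main benchmarking task limited access to
# 10 participants at a time (see the Results section).
MAX_CONCURRENT = 10
slots = asyncio.Semaphore(MAX_CONCURRENT)

async def record_comparisons(participant_id: str) -> None:
    # Placeholder for the real-time updates to the shared
    # pairwise-comparison database; the real implementation
    # isn't described in the article.
    await asyncio.sleep(0.1)

async def handle_session(participant_id: str) -> None:
    """Hypothetical handler for one participant's comparison session."""
    async with slots:  # waits here while 10 sessions are already active
        await record_comparisons(participant_id)

async def main() -> None:
    # Simulate 25 participants arriving at once; only 10 run concurrently.
    await asyncio.gather(*(handle_session(f"p{i}") for i in range(25)))

if __name__ == "__main__":
    asyncio.run(main())
```

In the actual study, Prolific's throttled-access feature enforced this kind of cap at the platform level, so participants were held back before they ever reached the researchers' servers.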

Solution: Using Prolific’s tailored capabilities

Dr Conor and Kimberley tackled their complex requirements with Prolific’s targeted, flexible participant recruitment and management features. Each challenge was addressed directly:

Precision participant targeting

Using Prolific’s pre-screeners, the researchers quickly and accurately recruited participants in their exact target demographic of 18- to 30-year-olds. This let them collect meaningful, high-quality human judgments of emotional experiences without lengthy screening processes.

Participant groups for phased studies

Prolific’s participant-group management made it simple for the researchers to organize and separate different participant cohorts. As a result, every study phase had fresh, unique participants, with no overlap or contamination between stages.

Integrated ethical content warnings

Prolific’s built-in content warning feature meant the researchers could manage sensitive emotional material, clearly informing participants of potential triggers upfront. This helped safeguard participant welfare and minimize the risk of unintended distress during the study.

Throttled participant access feature

Prolific’s throttled-access capability played a central role in managing the technically complex benchmarking tasks. Dr Conor and Kimberley were able to control how many participants accessed their servers simultaneously, which prevented overload and ensured accurate data collection during the pairwise comparison tasks. As Conor explained:

"We didn't want too many people trying to use [the shared database] at the same time… being able to throttle the participants coming over was useful there."

Together, Prolific’s features simplified complex logistical issues and helped the researchers focus on benchmarking human emotional judgment against AI capabilities.

Results: Human-AI benchmarking with clear, meaningful insights

From recruitment to results, the team moved efficiently through each stage of the project using Prolific.

Speed and scalability

The researchers recruited around 1,000 participants, all aged 18 to 30, with speed and ease through Prolific. For the main benchmarking task (which used throttled access capped at 10 participants at a time), 166 participants were approved, arriving at a rate of around 25 per hour, with a median completion time of 10 minutes and 55 seconds. 

In the three earlier studies without throttling, recruitment ranged from 20 to 70 participants per hour depending on the time of day or week. Given that only around 4,000 to 5,000 of Prolific’s users matched their target demographic, this recruitment speed was notably efficient.

Remarkable alignment

The benchmarking revealed an unexpectedly high level of agreement: human judgments and AI rankings were closely matched, with a correlation of 88%. This clearly showed that the human responses gathered through Prolific provided a reliable standard for comparison.
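The article doesn't specify which correlation coefficient the 88% figure refers to, or how the rankings were aggregated from the pairwise comparisons. As a rough illustration of this kind of comparison, here is a minimal sketch using Spearman's rank correlation, a standard choice for comparing two rankings of the same items; the events and rank values are invented for the example.

```python
from scipy.stats import spearmanr

# Toy rankings of five emotional events (1 = most emotionally
# significant). Values are invented purely for illustration; in the
# study, human ranks were derived from pairwise comparisons.
events = ["I got engaged", "a close friend moved away", "I missed a flight",
          "I dropped my phone", "I found a parking spot"]
human_rank = [1, 2, 3, 4, 5]
llm_rank = [1, 2, 4, 3, 5]

rho, p_value = spearmanr(human_rank, llm_rank)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```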

Quality assurance

Participants took the task seriously, engaging thoughtfully even with sensitive and lengthy descriptions. Kimberley highlighted the quality of their responses:

"It was very promising to see that our Prolific participants were actually reading the statements, even when they were very long, to be able to have this level of similarity."

Because participants genuinely focused on the task, the researchers could trust the human data they gathered when comparing it to AI.

Implications: Demonstrating Prolific’s value in AI benchmarking research

Dr Conor and Kimberley's research demonstrates why Prolific has become central to the infrastructure for AI evaluation.

Speed without compromise

With instant access to verified participants and seamless technical integration, AI researchers can move from concept to results in days, not weeks.

Trust through transparency

In an era where AI safety depends on understanding how models perform, Prolific's approach matters. Every participant is verified, and every response can be traced. This transparency is essential for building AI systems that people can trust.

Built for AI's unique challenges

From handling sensitive content to managing server loads, Prolific's API-first platform is infrastructure built specifically for the demands of modern AI training and evaluation data.

What’s next for AI and emotional intelligence research?

Dr Conor Houghton and Kimberley’s research is already paving the way for a better understanding of how LLMs evaluate emotional experiences, opening doors to safer and more effective AI applications. Their initial findings have gained recognition as late-breaking work at the CHI conference, with further research papers on the horizon. To support wider collaboration in the research community, they’ve also made their dataset and analysis code openly available on GitHub.

The team continues to explore important questions at the intersection of human cognition and artificial intelligence, aiming to uncover how closely AI models can truly replicate genuine human emotional judgment.