Getting AI to feel: How Prolific helped researchers test emotional intelligence in LLMs

Challenge
Language models have made huge strides in reasoning and communication, making them more capable in practice. But emotional intelligence – the ability to understand, regulate, and respond to emotions – is still much harder to measure.
To explore whether large language models (LLMs) can demonstrate this kind of emotional reasoning, researchers from the University of Bern, Czech Academy of Sciences, and University of Geneva ran a two-part study. First, they tested whether leading LLMs could outperform humans on five standardized emotional intelligence (EI) tests. Then, they asked ChatGPT-4 to generate entirely new versions of these tests and tested those as well.
They needed a large, diverse sample of participants fluent in English, drawn from both the UK and US, and capable of providing reliable data across multiple psychological studies.
Solution
All participants were native English speakers recruited through Prolific. To reduce bias, none were told the studies involved AI or emotional intelligence. Participants were pre-screened so that each person only took part in one of the five studies, and additional steps were taken to avoid repeat participation from earlier pilot studies.
The research demanded psychometric comparisons between original and AI-generated test items. That meant clear attention checks, robust quality filters, and carefully designed participant instructions. Prolific made it easy to set these up, helping the team implement the study design without added technical complexity.
Across the five studies, 467 Prolific participants completed the original and ChatGPT-4-generated versions of emotional intelligence tests. Their ratings helped evaluate test realism, clarity, and difficulty, alongside psychometric indicators like internal consistency and construct validity.
Execution
Participants worked through both the original and ChatGPT-generated tests, covering five areas of emotional intelligence: emotion understanding, emotion regulation, emotion management, emotional blending, and emotional inference. Along the way, they shared feedback on the tasks and helped gauge how the AI-generated material held up.
Each AI-generated test was then benchmarked against its original counterpart using statistical methods to check for meaningful differences in difficulty, clarity, diversity, and validity.
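For readers curious what such a comparison can look like in practice, here is a minimal Python sketch of one common approach – an effect-size (Cohen’s d) check paired with a significance test on per-participant scores. The simulated data, sample sizes, and the choice of Welch’s t-test are illustrative assumptions, not the authors’ exact analysis pipeline.

```python
# Hedged sketch: one plausible way to benchmark an AI-generated test form
# against its original counterpart. The 0.25 threshold mirrors the
# small-effect criterion reported below; the scores are simulated, not real.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d for two independent samples, using a pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical proportion-correct scores (0 to 1) per participant on each test form.
rng = np.random.default_rng(0)
original_scores = rng.normal(0.56, 0.15, size=230).clip(0, 1)
generated_scores = rng.normal(0.57, 0.15, size=237).clip(0, 1)

d = cohens_d(generated_scores, original_scores)
t, p = stats.ttest_ind(generated_scores, original_scores, equal_var=False)

print(f"Cohen's d = {d:.3f} (|d| < 0.25 treated as a negligible difference here)")
print(f"Welch's t-test: t = {t:.2f}, p = {p:.3f}")
```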
Results
The results told a clear story.
- LLMs outperformed humans across all five emotional intelligence tests, with an average score of 81% vs. 56% in human validation samples.
- AI-generated tests held up: When human participants took both the original and ChatGPT-4 versions of each test, the two versions proved statistically equivalent in difficulty.
- Clarity, realism, and internal consistency showed minimal differences, with every comparison falling below the small-effect threshold (|Cohen’s d| < 0.25).
- Similarity scores confirmed novelty: 88% of the ChatGPT-generated items were not perceived as paraphrased versions of the originals.
Conclusion
Using Prolific, researchers were able to run complex psychometric comparisons at scale and demonstrate that LLMs don’t just solve emotional intelligence tests; they can also write them. The study shows how LLMs might contribute to affective computing in more meaningful, generalizable ways, and how high-quality participant data is essential for assessing those claims.
Research at the edge of AI often comes with practical challenges. Finding the right participants is one. Running studies at speed without losing quality is another. Prolific helps researchers meet those demands without compromising on rigour.
Citation
Schlegel, K., Sommer, N. R., & Mortillaro, M. (2025). Large language models are proficient in solving and creating emotional intelligence tests. Communications Psychology. https://doi.org/10.1038/s44271-025-00258-x
Research institutions: University of Bern, Czech Academy of Sciences, University of Geneva