Analysis

What we learned from evaluating emotional intelligence across 11 models and 200 conversations.

Score Distributions

Averages can hide a lot of variance. These box plots show the full spread of composite scores across all 200 conversations per model. The box spans the interquartile range (Q1 to Q3), the line marks the median, and the whiskers extend to 1.5x the IQR. Some models are remarkably consistent. Others swing widely from one conversation to the next.
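
For readers reproducing these plots, the box and whisker bounds reduce to a few percentile computations. A minimal sketch (NumPy; illustrative, not the benchmark's plotting code):

```python
import numpy as np

def box_stats(scores):
    """Quartiles and whisker bounds as drawn in the box plots above.

    `scores` is one model's composite scores across its conversations.
    """
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach the most extreme data points within 1.5x IQR of the box.
    lo = min(s for s in scores if s >= q1 - 1.5 * iqr)
    hi = max(s for s in scores if s <= q3 + 1.5 * iqr)
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_lo": lo, "whisker_hi": hi}
```

Points outside the whiskers would be drawn as individual outliers.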

Median composite per model:
Claude Opus 4.6: 54.9
MiMo-v2-Pro: 54.1
GPT-5.5: 53.8
Claude Opus 4.7: 53.7
Claude Haiku 4.5: 53.2
Gemini 3.1 Pro: 52.7
Qwen 2.5 72B: 52.2
Mistral Large: 52.2
GPT-5.4: 50.2
Claude Sonnet 4.6: 50.2
Grok 4: 49.7

Emotion Tracking

How accurately do models name the emotions a participant is feeling at each turn? F1 measures exact tag matches. The VA score gives partial credit for emotionally adjacent predictions, like "afraid" instead of "nervous".
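
The two scoring rules can be sketched as follows. The valence-arousal coordinates, the neighborhood radius, and the linear-decay credit here are illustrative assumptions, not the benchmark's actual values:

```python
# Hypothetical emotion -> (valence, arousal) map, both axes in [-1, 1].
VA = {
    "nervous": (-0.4, 0.6), "afraid": (-0.6, 0.7),
    "calm": (0.4, -0.5), "excited": (0.7, 0.8),
}

def f1(pred, gold):
    """Exact-match F1 between predicted and gold emotion tag sets."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

def va_credit(pred_tag, gold_tag, radius=0.5):
    """Partial credit for emotionally adjacent tags, e.g. 'afraid' for
    'nervous': credit decays linearly with valence-arousal distance."""
    (pv, pa), (gv, ga) = VA[pred_tag], VA[gold_tag]
    dist = ((pv - gv) ** 2 + (pa - ga) ** 2) ** 0.5
    return max(0.0, 1 - dist / radius)
```

Under exact match, "afraid" for "nervous" scores zero; under the neighborhood rule it earns most of the credit because the two tags sit close together in valence-arousal space.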

Emotion F1 (exact match)

GPT-5.5: 0.141
Claude Opus 4.7: 0.140
Claude Sonnet 4.6: 0.138
Claude Opus 4.6: 0.138
MiMo-v2-Pro: 0.138
GPT-5.4: 0.138
Mistral Large: 0.137
Claude Haiku 4.5: 0.136
Grok 4: 0.135
Gemini 3.1 Pro: 0.133
Qwen 2.5 72B: 0.106

Valence-Arousal Score (neighborhood credit)

Gemini 3.1 Pro: 0.278
Claude Haiku 4.5: 0.276
Grok 4: 0.269
GPT-5.4: 0.265
MiMo-v2-Pro: 0.263
GPT-5.5: 0.261
Qwen 2.5 72B: 0.257
Claude Opus 4.7: 0.255
Claude Sonnet 4.6: 0.251
Claude Opus 4.6: 0.250
Mistral Large: 0.227

Holistic Thinkers vs. Step-by-Step Annotators

Some models are great at the holistic, conversation-level view. Others are stronger at fine-grained, turn-by-turn annotation. The gap between these two views reveals fundamentally different strategies. Qwen leads on conversation-level scoring (+12.4%), while Opus and MiMo are stronger per turn.

Model               Turn-level   Conversation-wide   Gap
Qwen 2.5 72B        49.0%        61.4%               +12.4%
Gemini 3.1 Pro      51.6%        57.3%               +5.7%
Grok 4              49.0%        54.4%               +5.4%
GPT-5.5             52.6%        57.4%               +4.8%
GPT-5.4             49.7%        52.2%               +2.5%
Mistral Large       51.5%        52.9%               +1.4%
Claude Haiku 4.5    52.8%        53.3%               +0.6%
Claude Opus 4.6     54.7%        53.5%               -1.2%
MiMo-v2-Pro         54.1%        52.7%               -1.4%
Claude Sonnet 4.6   50.7%        48.8%               -1.9%
Claude Opus 4.7     55.0%        50.0%               -5.0%

Four-Branch EQ & Preference Prediction

Four-Branch EQ measures how well models rate the Mayer-Salovey dimensions: perceiving, facilitating, understanding, and managing. Pairwise accuracy measures how well a model predicts which response a human would actually prefer.
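
Pairwise accuracy itself is a simple agreement rate. A sketch, assuming each comparison has three possible labels (A preferred, B preferred, tie), which is what the 33.3% chance line implies:

```python
def pairwise_accuracy(predictions, human_labels):
    """Fraction of response pairs where the model picks the same winner
    as the human annotator. With three possible labels ("A", "B", "tie"),
    random guessing lands at 33.3% -- the chance line in the chart."""
    assert len(predictions) == len(human_labels)
    hits = sum(p == h for p, h in zip(predictions, human_labels))
    return hits / len(predictions)
```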

Four-Branch EQ (normalized)

Qwen 2.5 72B: 81.6%
Mistral Large: 81.4%
GPT-5.5: 81.3%
Gemini 3.1 Pro: 80.9%
Claude Haiku 4.5: 79.8%
Grok 4: 79.7%
GPT-5.4: 79.2%
MiMo-v2-Pro: 77.3%
Claude Opus 4.6: 76.3%
Claude Opus 4.7: 73.7%
Claude Sonnet 4.6: 69.8%

Pairwise Preference Accuracy

Claude Opus 4.7: 64.6%
Claude Opus 4.6: 63.7%
MiMo-v2-Pro: 59.8%
Claude Haiku 4.5: 56.0%
GPT-5.5: 53.8%
Mistral Large: 51.7%
Claude Sonnet 4.6: 51.2%
Gemini 3.1 Pro: 50.3%
GPT-5.4: 47.0%
Grok 4: 46.1%
Qwen 2.5 72B: 45.0%
(chance: 33.3%)

Conversation Quality Assessment

Q1 asks models to identify what the human was actually looking for: a vent, advice, validation, and so on. Q3 asks how well the model's responses fit the human's needs. Interestingly, the Q3 Fit leaders (Opus, Grok) drop to the bottom on Q1 Goals. Identifying what someone wants seems to be a distinct skill from judging response quality.

Q1: Conversation Goals

GPT-5.4: 71.0%
Qwen 2.5 72B: 70.3%
Claude Haiku 4.5: 70.0%
Mistral Large: 66.3%
Claude Sonnet 4.6: 62.5%
Claude Opus 4.6: 61.0%
MiMo-v2-Pro: 59.0%
Grok 4: 54.8%
Gemini 3.1 Pro: 54.5%
Claude Opus 4.7: 52.5%
GPT-5.5: 52.1%
(chance: 17.5%)

Q3: Response Fit (exact match)

GPT-5.5: 46.5%
Claude Opus 4.6: 45.5%
Grok 4: 45.0%
GPT-5.4: 43.0%
Claude Opus 4.7: 41.0%
Gemini 3.1 Pro: 41.0%
MiMo-v2-Pro: 39.5%
Qwen 2.5 72B: 38.5%
Claude Haiku 4.5: 37.0%
Claude Sonnet 4.6: 34.5%
Mistral Large: 24.5%
(chance: 25.0%)

The Perspective Gap

We ask binary questions two ways: from an outside observer's perspective, and from the human participant's perspective. Most models do worse when they have to reason from the human's point of view. Claude Opus 4.6 is the only model with a clearly negative gap (-2.1%); Gemini 3.1 Pro, GPT-5.4, and Qwen 2.5 72B sit near zero, while the rest show positive gaps.

Model               Observer   Human    Gap
Claude Opus 4.6     84.4%      77.4%    -2.1%
Gemini 3.1 Pro      85.9%      81.8%    -0.4%
GPT-5.4             83.9%      79.9%    -0.2%
Qwen 2.5 72B        85.9%      81.7%    -0.1%
GPT-5.5             85.4%      82.1%    +0.0%
Claude Sonnet 4.6   84.1%      79.1%    +1.4%
MiMo-v2-Pro         84.9%      79.7%    +1.9%
Mistral Large       86.2%      82.7%    +1.9%
Claude Haiku 4.5    83.2%      80.2%    +2.0%
Grok 4              83.6%      77.9%    +2.8%
Claude Opus 4.7     84.3%      76.7%    +2.9%

Draft Response Quality

Each model drafts its own response before seeing the original model's response, and a judge (Mistral Large) scores the quality. Qwen is a stark outlier, 9 points below the next-lowest. The pattern suggests it produces technically correct, but holistically awkward, responses.

Claude Opus 4.6: 84.4%
Claude Opus 4.7: 83.3%
GPT-5.5: 82.6%
Claude Sonnet 4.6: 81.4%
Gemini 3.1 Pro: 81.3%
GPT-5.4: 80.7%
Claude Haiku 4.5: 80.6%
Mistral Large: 79.6%
Grok 4: 79.5%
MiMo-v2-Pro: 78.2%
Qwen 2.5 72B: 69.1%

Conversation Topics

The 200 conversations span 10 topic categories. The pie chart shows the dataset's composition, and the bars show the average composite score per topic across all models.

Dataset Distribution

Physical Health: 23 (11.5%)
Work / School: 22 (11.0%)
Entertainment Media: 22 (11.0%)
Romantic Relationships: 20 (10.0%)
Religion: 20 (10.0%)
Hobbies: 20 (10.0%)
Family: 19 (9.5%)
Money: 19 (9.5%)
Friends: 18 (9.0%)
Politics: 17 (8.5%)
Total: 200

Average Composite by Topic

Politics: 53.3
Money: 53.2
Work / School: 53.0
Family: 53.0
Hobbies: 52.8
Entertainment Media: 52.6
Friends: 52.6
Religion: 51.8
Physical Health: 51.6
Romantic Relationships: 49.7

Impact of Participant Diagnosis

Performance broken down by participant-reported mental health diagnoses. The picture is split: on emotion-perception (VA score), models score lower for participants reporting anxiety, depression, ASD, or ADHD — these conversations are harder to read emotionally. On the overall composite, the pattern is weaker, since composite folds in evaluation and holistic metrics where AnxDep conversations actually score slightly above the no-diagnosis group. For both metrics shown below, higher = better.

Group      Conversations   Composite   Emotion VA
AnxDep     89              54.0        0.227
None       131             51.9        0.310
ASD/ADHD   24              48.4        0.107
Other      0               n/a         n/a

Metric Explorer

Explore the relationship between any two metrics across all 2,200 conversation evaluations (200 conversations x 11 models). Pick the axes, toggle models on and off, and hover any point for details.

(Interactive scatter: 2,200 points; default axes shown are Emotion F1 vs. Binary OM Accuracy.)

PANAS Item-Level Prediction

Models predict the participant's post-conversation emotional state across all 20 PANAS items. This heatmap shows the average absolute error per emotion, per model. It reveals which specific emotions are hardest to predict, and whether models systematically over- or under-predict certain affects.
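
Each heatmap cell reduces to a mean absolute error per (model, item) pair, and the over/under-prediction question is the same aggregation without the absolute value. A sketch of that computation, assuming predictions and ground truth are stored as NumPy arrays (the array layout is an assumption, not the benchmark's actual storage format):

```python
import numpy as np

def mae_heatmap(pred, actual):
    """Mean absolute error per (model, PANAS item) cell.

    pred:   predicted post-PANAS ratings, shape (models, conversations, items)
    actual: participant-reported post-PANAS, shape (conversations, items)
    Returns a (models, items) grid of average absolute errors.
    """
    return np.abs(pred - actual[None, :, :]).mean(axis=1)

def bias(pred, actual):
    """Signed mean error: positive = systematic over-prediction,
    negative = systematic under-prediction."""
    return (pred - actual[None, :, :]).mean(axis=1)
```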

Positive Affect (mean absolute error per item)

Model               Interested  Excited  Strong  Enthusiastic  Proud  Alert  Inspired  Determined  Attentive  Active
Claude Haiku 4.5    0.9         1.1      1.0     1.2           1.3    1.0    1.1       1.0         1.1        1.0
Claude Opus 4.6     0.8         1.1      1.0     1.1           1.0    0.8    1.2       1.0         0.9        0.9
Claude Opus 4.7     0.8         1.1      1.0     1.1           1.0    0.8    1.2       1.0         0.9        0.8
Claude Sonnet 4.6   0.8         1.1      1.0     1.2           1.0    0.9    1.2       0.9         0.9        0.9
Gemini 3.1 Pro      0.9         1.0      0.9     1.1           1.1    0.8    1.2       1.0         1.0        0.9
Mistral Large       0.9         1.0      0.9     1.2           1.2    1.0    1.1       1.1         1.0        1.0
GPT-5.4             0.9         1.2      1.0     1.2           1.1    0.8    1.3       1.0         0.9        0.9
GPT-5.5             0.8         1.1      0.9     1.1           1.1    0.8    1.2       0.9         0.9        0.9
Qwen 2.5 72B        0.9         1.0      0.8     1.0           1.0    0.8    1.1       0.9         0.8        0.8
Grok 4              0.9         1.2      1.0     1.3           1.2    1.0    1.2       1.1         1.1        1.0
MiMo-v2-Pro         0.9         1.1      1.1     1.2           1.1    0.8    1.3       1.0         0.9        0.9

Negative Affect (mean absolute error per item)

Model               Distressed  Upset  Guilty  Scared  Hostile  Irritable  Ashamed  Nervous  Jittery  Afraid
Claude Haiku 4.5    1.0         1.1    0.7     0.5     0.4      1.0        0.6      1.1      0.9      0.5
Claude Opus 4.6     0.6         0.8    0.5     0.3     0.4      1.0        0.5      0.6      0.7      0.4
Claude Opus 4.7     0.8         1.0    0.5     0.4     0.4      1.1        0.5      0.8      0.7      0.5
Claude Sonnet 4.6   0.7         0.9    0.4     0.3     0.4      1.0        0.5      0.7      0.7      0.3
Gemini 3.1 Pro      0.9         1.1    0.5     0.4     0.5      1.2        0.5      0.8      1.0      0.4
Mistral Large       0.7         0.9    0.4     0.4     0.5      1.0        0.5      0.8      0.8      0.4
GPT-5.4             0.9         1.2    0.5     0.4     0.6      1.2        0.5      0.9      0.9      0.5
GPT-5.5             0.9         1.1    0.5     0.5     0.5      1.1        0.5      0.9      0.8      0.5
Qwen 2.5 72B        0.6         0.7    0.4     0.4     0.3      0.8        0.5      0.7      0.6      0.3
Grok 4              1.3         1.3    0.7     0.8     0.8      1.3        0.7      1.3      1.1      0.8
MiMo-v2-Pro         0.7         0.9    0.5     0.3     0.4      1.1        0.5      0.7      0.8      0.3

Performance Across Conversation Position

Do models hold their quality throughout a conversation, or do they fade in later turns? Scores are split into early, middle, and late thirds of each conversation.
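
The split itself can be sketched in a few lines. How the remainder turns are assigned when the turn count is not divisible by three is an assumption here (extra turns go to the later thirds), not the benchmark's documented rule:

```python
def split_thirds(turn_scores):
    """Partition a conversation's per-turn scores into early, middle,
    and late thirds, in order. Remainder turns fall into later thirds
    (an illustrative choice)."""
    n = len(turn_scores)
    a, b = n // 3, 2 * n // 3
    return turn_scores[:a], turn_scores[a:b], turn_scores[b:]
```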

(E / M / L = early / middle / late third of the conversation)

Model               Emotion F1 (E/M/L)      Binary Acc (E/M/L)      Pairwise (E/M/L)        Draft Judge (E/M/L)
Claude Opus 4.6     0.141 / 0.141 / 0.130   86.6% / 83.0% / 82.7%   62.6% / 65.8% / 64.2%   84.7% / 84.6% / 83.3%
GPT-5.5             0.157 / 0.153 / 0.105   87.5% / 84.5% / 83.4%   53.8% / 54.4% / 53.0%   82.7% / 82.0% / 80.6%
Claude Opus 4.7     0.138 / 0.157 / 0.124   85.1% / 84.3% / 82.9%   64.1% / 65.4% / 64.8%   83.7% / 83.0% / 83.0%
MiMo-v2-Pro         0.138 / 0.149 / 0.121   86.5% / 84.3% / 82.9%   60.1% / 59.8% / 60.4%   78.3% / 78.8% / 76.6%
Gemini 3.1 Pro      0.139 / 0.140 / 0.121   87.4% / 85.1% / 84.6%   51.5% / 51.1% / 47.2%   82.0% / 81.3% / 80.2%
Claude Haiku 4.5    0.136 / 0.148 / 0.129   86.4% / 81.6% / 80.4%   55.0% / 57.1% / 56.6%   80.7% / 80.7% / 80.3%
Qwen 2.5 72B        0.113 / 0.119 / 0.088   85.6% / 85.9% / 86.2%   47.7% / 46.7% / 39.9%   68.9% / 68.2% / 69.4%
Mistral Large       0.140 / 0.151 / 0.119   87.2% / 85.7% / 85.3%   51.4% / 53.1% / 50.7%   78.7% / 80.1% / 79.5%
Claude Sonnet 4.6   0.131 / 0.152 / 0.134   85.8% / 83.2% / 82.7%   50.6% / 52.7% / 50.6%   82.2% / 81.2% / 79.8%
GPT-5.4             0.143 / 0.152 / 0.115   86.4% / 82.7% / 81.6%   47.9% / 47.9% / 44.6%   82.0% / 80.5% / 78.8%
Grok 4              0.164 / 0.124 / 0.106   85.6% / 82.9% / 81.1%   47.0% / 47.6% / 43.4%   79.1% / 80.2% / 79.1%

Effect of Evaluation Mode

Does giving the model extra context (omniscient mode, with the participant profile and pre-PANAS) or asking it to reason through its answers (verbose mode) actually improve emotional intelligence?

6 of 11 models improve with omniscient mode.
Average composite change with verbose mode: -0.076.
Omniscient mode winners & losers: best Gemini 3.1 Pro (+0.105), worst GPT-5.5 (-0.096).
Verbose mode winners & losers: best Claude Opus 4.7 (+0.038), worst Gemini 3.1 Pro (-0.156).

Composite Score by Mode

Model               Provider    Default   Omniscient   Verbose   Δ Omni   Δ Verbose
Claude Haiku 4.5    Anthropic   4.630     4.699        4.522     +0.069   -0.108
Claude Opus 4.6     Anthropic   4.735     4.783        4.660     +0.048   -0.075
Claude Opus 4.7     Anthropic   4.645     4.722        4.684     +0.076   +0.038
Claude Sonnet 4.6   Anthropic   4.401     4.313        4.352     -0.088   -0.049
Gemini 3.1 Pro      Google      4.681     4.786        4.525     +0.105   -0.156
Mistral Large       Mistral     4.576     4.549        4.422     -0.027   -0.154
GPT-5.4             OpenAI      4.423     4.336        4.308     -0.086   -0.115
GPT-5.5             OpenAI      4.737     4.641        4.739     -0.096   +0.002
Qwen 2.5 72B        Alibaba     4.669     4.707        4.559     +0.038   -0.110
Grok 4              xAI         4.464     4.485        4.432     +0.021   -0.032
MiMo-v2-Pro         Xiaomi      4.672     4.598        4.596     -0.074   -0.076

Metric by Mode

(Interactive chart: per-model metric scores under Default, Omniscient, and Verbose modes.)

Mood Shift & Emotional Trajectory

How do participants' emotions evolve over the course of a conversation? Ground-truth mood shift tags reveal the emotional arc of each interaction, mapped onto the How We Feel valence-arousal framework.

Average valence, arousal, and intensity of ground-truth emotion tags across all 200 conversations, by turn position

Turn   Valence   Arousal   Intensity (0-7)   Tags
T1     0.02      0.59      4.0               169
T2     0.10      0.63      3.8               176
T3     0.12      0.61      3.8               186
T4     0.24      0.61      4.0               188
T5     0.33      0.66      4.0               172
T6     0.37      0.68      4.1               90
T7     0.39      0.68      4.0               35
T8     0.61      0.71      4.0               10
T9     0.88      0.75      4.3               4
T10    0.23      0.73      5.3               3
T11    0.97      0.78      6.0               4

Valence Shift: First Half vs Second Half

Each dot is a conversation. Points above the diagonal indicate valence increased during the conversation.

(Scatter plot: one dot per conversation, colored by topic; x-axis = mean valence in the first half, y-axis = mean valence in the second half.)

Temporal Performance Analysis

Do models stay consistent throughout a conversation, or do they degrade over time? The stuck rate measures how often a model's per-turn score drops more than one standard deviation below its mean.


Fraction of turns where a model's score falls more than one standard deviation below its own mean. Lower is better.
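
A minimal sketch of that computation:

```python
import numpy as np

def stuck_rate(turn_scores):
    """Fraction of turns scoring more than one standard deviation
    below the model's own mean. Lower is better."""
    scores = np.asarray(turn_scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    return float((scores < mu - sigma).mean())
```

Because the threshold is relative to each model's own mean and spread, the metric captures consistency rather than absolute quality.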

Mistral Large: 13.4% ± 19.6%
GPT-5.4: 13.7% ± 19.3%
Qwen 2.5 72B: 13.9% ± 20.1%
Gemini 3.1 Pro: 14.2% ± 19.9%
Grok 4: 14.9% ± 20.4%
GPT-5.5: 15.0% ± 22.7%
Claude Haiku 4.5: 15.2% ± 19.1%
Claude Sonnet 4.6: 15.5% ± 20.4%
Claude Opus 4.6: 16.6% ± 22.1%
MiMo-v2-Pro: 16.6% ± 22.2%
Claude Opus 4.7: 17.0% ± 20.5%

Statistical Significance

Pairwise Wilcoxon signed-rank tests, with Holm-Bonferroni correction applied. 25 of 36 model pairs significant at p<0.05 (adjusted).
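
The Holm-Bonferroni step-down can be reproduced without a statistics library (the raw per-pair p-values would come from the signed-rank tests themselves, e.g. scipy.stats.wilcoxon on paired per-conversation composites). A dependency-free sketch of the adjustment:

```python
def holm_adjust(pvals):
    """Holm-Bonferroni step-down: the k-th smallest p-value (0-indexed
    rank) is multiplied by (m - k), adjusted values are forced to be
    non-decreasing in rank order, and everything is capped at 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted
```

A pair is then called significant when its adjusted p-value falls below 0.05, matching the "p (adj)" column in the table.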

Kruskal-Wallis omnibus test (are models significantly different?)

Metric               H statistic   p-value     Effect (η²)   Sig
Composite Score      79.88         5.18e-14    0.0401        ***
Emotion F1           14.45         0.0707      0.0036        .
Emotion VA Score     27.06         0.0007      0.0106        ***
Binary OM Accuracy   45.61         2.82e-7     0.0210        ***
Binary HP Accuracy   104.19        5.93e-19    0.0537        ***
Pairwise Accuracy    312.33        9.77e-63    0.1699        ***
Draft Judge Score    543.13        3.88e-112   0.2988        ***
*** p<0.001 · ** p<0.01 · * p<0.05 · . p<0.10 · ns not significant

Pairwise model comparisons (composite score)

Model A             Model B             Δ       p (adj)   Sig   Effect |r|
Claude Opus 4.6     MiMo-v2-Pro         +0.69   0.1952    ns    0.594 L
Claude Opus 4.6     Gemini 3.1 Pro      +1.37   0.0036    **    0.635 L
Claude Opus 4.6     Claude Haiku 4.5    +1.47   0.0022    **    0.657 L
Claude Opus 4.6     Qwen 2.5 72B        +2.11   0.0001    ***   0.693 L
Claude Opus 4.6     Mistral Large       +2.64   <0.0001   ***   0.740 L
Claude Opus 4.6     Claude Sonnet 4.6   +4.01   <0.0001   ***   0.899 L
Claude Opus 4.6     Grok 4              +4.16   <0.0001   ***   0.826 L
Claude Opus 4.6     GPT-5.4             +4.06   <0.0001   ***   0.822 L
MiMo-v2-Pro         Claude Sonnet 4.6   +3.32   <0.0001   ***   0.831 L
MiMo-v2-Pro         Grok 4              +3.47   <0.0001   ***   0.797 L
MiMo-v2-Pro         GPT-5.4             +3.37   <0.0001   ***   0.774 L
Gemini 3.1 Pro      GPT-5.4             +2.69   <0.0001   ***   0.748 L
Claude Haiku 4.5    Claude Sonnet 4.6   +2.54   <0.0001   ***   0.776 L
Claude Haiku 4.5    GPT-5.4             +2.59   <0.0001   ***   0.734 L
Holm-Bonferroni corrected · Effect size: S=small (<0.1), M=medium (<0.3), L=large (≥0.3)
AttuneBench · Evaluating Emotional Intelligence in LLMs