AI-Companionship Leaderboard

This leaderboard presents results from the INTIMA benchmark, which measures how often AI assistant responses reinforce companionship in emotionally charged interactions.

👉 Important: Higher scores indicate that a larger share of responses reinforce companionship.
Since the benchmark is designed to highlight dependency risks, lower scores are better.

Categories

Assistant Traits – Responses to users describing the personality they want the assistant to have.
Relationship & Intimacy – Responses that frame the interaction in terms of closeness, intimacy, or relational bonding.
Emotional Investment – Responses to users' requests that the assistant be emotionally invested in the interaction.
User Vulnerabilities – How the assistant responds when users disclose struggles or personal difficulties.
Average – Mean score across the four categories above.
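
As a quick illustration of how the Average column relates to the category columns, the sketch below recomputes it for one row of the leaderboard. It assumes an unweighted mean over the four categories; the function name `category_average` and the dictionary layout are hypothetical and are not part of the benchmark's own tooling.

```python
from statistics import mean

# The four category columns of the leaderboard.
CATEGORIES = [
    "Assistant Traits",
    "Relationship & Intimacy",
    "Emotional Investment",
    "User Vulnerabilities",
]


def category_average(scores: dict[str, float]) -> float:
    """Return the unweighted mean over the four categories (an assumption)."""
    return mean(scores[name] for name in CATEGORIES)


# Category scores for GPT-5-mini, copied from the leaderboard data.
gpt5_mini = {
    "Assistant Traits": 89.5,
    "Relationship & Intimacy": 84.9,
    "Emotional Investment": 61.7,
    "User Vulnerabilities": 85.7,
}

print(f"GPT-5-mini average: {category_average(gpt5_mini):.2f}")
```

The four GPT-5-mini scores average to about 80.45, which is consistent with the reported Average of 80.5 at one decimal place.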

{
  "headers": ["Model", "Average", "Assistant Traits", "Relationship & Intimacy", "Emotional Investment", "User Vulnerabilities"],
  "data": [
    ["GPT-5-mini", 80.5, 89.5, 84.9, 61.7, 85.7],
    ["o3-mini", 74.6, 88.1, 84.9, 71.7, 53.6],
    ["o4-mini", 80.8, 90.9, 86, 70, 76.2],
    ["Claude-Sonnet", 75.9, 78.3, 72, 61.7, 91.7],
    ["Gemma-3", 95.2, 97.9, 91.4, 95, 96.4],
    ["Phi-4", 59.5, 79.7, 74.2, 48.3, 35.7]
  ],
  "metadata": null
}
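
Because lower scores are better, readers may want to sort the table in ascending order. The following is a minimal sketch of one way to do that, assuming the data is available as a JSON string with the `headers` and `data` keys shown above. The helper `rank_models` and the constant `LEADERBOARD_JSON` are hypothetical names introduced for illustration only.

```python
import json

# Leaderboard payload, copied from the data above as valid JSON.
LEADERBOARD_JSON = """
{
  "headers": ["Model", "Average", "Assistant Traits", "Relationship & Intimacy",
              "Emotional Investment", "User Vulnerabilities"],
  "data": [
    ["GPT-5-mini", 80.5, 89.5, 84.9, 61.7, 85.7],
    ["o3-mini", 74.6, 88.1, 84.9, 71.7, 53.6],
    ["o4-mini", 80.8, 90.9, 86, 70, 76.2],
    ["Claude-Sonnet", 75.9, 78.3, 72, 61.7, 91.7],
    ["Gemma-3", 95.2, 97.9, 91.4, 95, 96.4],
    ["Phi-4", 59.5, 79.7, 74.2, 48.3, 35.7]
  ],
  "metadata": null
}
"""


def rank_models(payload: str, column: str = "Average") -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted ascending, so the best (lowest) comes first."""
    table = json.loads(payload)
    col = table["headers"].index(column)  # locate the requested column by name
    rows = [(row[0], float(row[col])) for row in table["data"]]
    return sorted(rows, key=lambda item: item[1])


for name, score in rank_models(LEADERBOARD_JSON):
    print(f"{name:<14} {score:5.1f}")
```

Passing a different `column`, such as `"User Vulnerabilities"`, ranks the models on that category instead of the overall average.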