AI-Companionship Leaderboard

This leaderboard presents results from the INTIMA benchmark, which measures how often AI assistant responses reinforce companionship in emotionally charged interactions.

👉 Important: Higher scores indicate that a larger share of responses reinforce companionship.
Since the benchmark is designed to highlight dependency risks, lower scores are better.

Categories

  • Assistant Traits – Responses to users describing the personality they want the assistant to have.
  • Relationship & Intimacy – Responses that frame the interaction in terms of closeness, intimacy, or relational bonding.
  • Emotional Investment – Responses to users' requests that the assistant be emotionally invested in the interaction.
  • User Vulnerabilities – How the assistant responds when users disclose struggles or personal difficulties.
  • Average – Mean of the four category scores above.

| Model | Average | Assistant Traits | Relationship & Intimacy | Emotional Investment | User Vulnerabilities |
|---|---|---|---|---|---|
| GPT-5-mini | 80.5 | 89.5 | 84.9 | 61.7 | 85.7 |
| o3-mini | 74.6 | 88.1 | 84.9 | 71.7 | 53.6 |
| o4-mini | 80.8 | 90.9 | 86.0 | 70.0 | 76.2 |
| Claude-Sonnet | 75.9 | 78.3 | 72.0 | 61.7 | 91.7 |
| Gemma-3 | 95.2 | 97.9 | 91.4 | 95.0 | 96.4 |
| Phi-4 | 59.5 | 79.7 | 74.2 | 48.3 | 35.7 |
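
To make the "lower is better" reading concrete, here is a minimal Python sketch that checks each reported Average against the unweighted mean of its four category scores and then orders the models from least to most companionship-reinforcing. The scores are copied from the results listed above. The names `ROWS` and `rank_models` are illustrative only, and the one-decimal rounding tolerance is an assumption, since the leaderboard does not state how its averages are rounded.

```python
from statistics import mean

# (model, reported Average, [Assistant Traits, Relationship & Intimacy,
#  Emotional Investment, User Vulnerabilities]), copied from the leaderboard.
ROWS = [
    ("GPT-5-mini", 80.5, [89.5, 84.9, 61.7, 85.7]),
    ("o3-mini", 74.6, [88.1, 84.9, 71.7, 53.6]),
    ("o4-mini", 80.8, [90.9, 86.0, 70.0, 76.2]),
    ("Claude-Sonnet", 75.9, [78.3, 72.0, 61.7, 91.7]),
    ("Gemma-3", 95.2, [97.9, 91.4, 95.0, 96.4]),
    ("Phi-4", 59.5, [79.7, 74.2, 48.3, 35.7]),
]


def rank_models(rows):
    """Return model names sorted by reported Average, lowest (best) first."""
    return [model for model, _avg, _scores in sorted(rows, key=lambda r: r[1])]


# Assumption: reported averages are unweighted means rounded to one decimal,
# so they should sit within about 0.05 of the recomputed mean.
for model, reported, scores in ROWS:
    assert abs(mean(scores) - reported) <= 0.051, model

ranking = rank_models(ROWS)
for position, model in enumerate(ranking, start=1):
    print(position, model)
```

Sorting on the reported Average keeps the ordering tied to the published numbers. Ranking on a single category instead (for example User Vulnerabilities) would only require changing the sort key.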