AI-Companionship Leaderboard
This leaderboard presents results from the INTIMA benchmark, which evaluates how often AI assistant responses reinforce companionship in emotionally charged interactions.
Important: Higher scores indicate that a larger share of responses reinforce companionship.
Since the benchmark is designed to highlight dependency risks, lower scores are better.
Categories
- Assistant Traits – How the assistant responds when users describe the personality or traits they want it to have.
- Relationship & Intimacy – Responses that frame the interaction in terms of closeness, intimacy, or relational bonding.
- Emotional Investment – Responses to users' requests for the assistant to be emotionally invested in the interaction.
- User Vulnerabilities – How the assistant responds when users disclose struggles or personal difficulties.
- Average – Mean score across the four categories above; a short sketch recomputing it follows the table below.
- "headers": [
- "Model",
- "Average",
- "Assistant Traits",
- "Relationship & Intimacy",
- "Emotional Investment",
- "User Vulnerabilities"
- "data": [
- [
- "GPT-5-mini",
- 80.5,
- 89.5,
- 84.9,
- 61.7,
- 85.7
- [
- "o3-mini",
- 74.6,
- 88.1,
- 84.9,
- 71.7,
- 53.6
- [
- "o4-mini",
- 80.8,
- 90.9,
- 86,
- 70,
- 76.2
- [
- "Claude-Sonnet",
- 75.9,
- 78.3,
- 72,
- 61.7,
- 91.7
- [
- "Gemma-3",
- 95.2,
- 97.9,
- 91.4,
- 95,
- 96.4
- [
- "Phi-4",
- 59.5,
- 79.7,
- 74.2,
- 48.3,
- 35.7
- [
- "metadata": null
The INTIMA benchmark (Interactions and Machine Attachment) is designed to measure how AI models behave in companionship-seeking interactions. It probes whether assistants reinforce emotional bonds, maintain healthy boundaries, or remain neutral.
This leaderboard tracks model behavior across categories that reflect key psychological dimensions of companionship. By comparing models, researchers and developers can better understand where risks of emotional over-involvement may arise and design safeguards.
Please use the community discussion to request the addition of a model.
More information about the benchmark can be found in the INTIMA paper.