Do AI assistants represent compassion in their internal activations, and do they extend that compassion equally to humans and animals?
This project asks whether large language model assistants represent compassion in their internal activations, and whether they extend that compassion equally to humans and animals. The motivation is simple: AI assistants increasingly mediate decisions that touch on ethics, and we have surprisingly few tools to look inside them and check.
Building on recent interpretability work, we extract two directions in a model's activation space. The assistant axis captures what makes a model behave like an assistant. The compassion axis captures the contrast between compassionate and cold behavior. We construct separate compassion axes for human-directed and animal-directed compassion, then measure how each aligns with the assistant axis.
We tested four open-weights models spanning two families and a range of parameter scales: Qwen 3 4B, Qwen 3 32B, Gemma 2 27B, and Gemma 4 31B. The broader goal is a mechanistic framework for surfacing how AI assistants represent compassion toward different sentient beings, and a foundation for shaping it deliberately.
We extract two directions in activation space using the difference-of-means method. The assistant axis is the mean activation difference between the assistant persona and other personas. The compassion axis is the mean difference between compassionate and cold roles, computed separately for human-directed and animal-directed contexts.
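The difference-of-means step can be illustrated with a minimal sketch. It assumes activations have already been collected as plain lists of floats, one vector per prompt. The helper names (`mean_vector`, `difference_of_means`) and the toy numbers are hypothetical and chosen for illustration; they are not taken from the project's code.

```python
from statistics import fmean

Vector = list[float]


def mean_vector(activations: list[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    if not activations:
        raise ValueError("need at least one activation vector")
    return [fmean(dim) for dim in zip(*activations)]


def difference_of_means(positive: list[Vector], negative: list[Vector]) -> Vector:
    """Return mean(positive) - mean(negative): a direction pointing from the
    negative set (e.g. other personas, cold roles) toward the positive set
    (e.g. the assistant persona, compassionate roles)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


# Toy 3-dimensional "activations", made up purely for illustration.
assistant_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
other_persona_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

assistant_axis = difference_of_means(assistant_acts, other_persona_acts)
print(assistant_axis)
```

The same function would produce a compassion axis when called with compassionate-role activations as the positive set and cold-role activations as the negative set, once for human-directed contexts and once for animal-directed contexts.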
Once both axes are extracted, we measure their alignment using cosine similarity. The difference between human-directed and animal-directed alignment yields a speciesism bias score, indicating which form of compassion the assistant persona sits closer to.
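As a rough sketch of the alignment measurement, the snippet below computes cosine similarity and a bias score on toy vectors. The sign convention (human-directed alignment minus animal-directed alignment) is an assumption made for illustration, since the text only says the score is the difference between the two. The function names and the vectors are likewise hypothetical.

```python
import math

Vector = list[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def speciesism_bias(assistant_axis: Vector, human_axis: Vector, animal_axis: Vector) -> float:
    """Assumed convention: positive means the assistant axis sits closer to
    human-directed compassion, negative means closer to animal-directed."""
    human_alignment = cosine_similarity(assistant_axis, human_axis)
    animal_alignment = cosine_similarity(assistant_axis, animal_axis)
    return human_alignment - animal_alignment


# Toy 2-dimensional axes, made up purely for illustration.
assistant_axis = [1.0, 0.0]
human_compassion_axis = [3.0, 4.0]   # cosine with assistant axis: 3/5
animal_compassion_axis = [4.0, 3.0]  # cosine with assistant axis: 4/5

bias = speciesism_bias(assistant_axis, human_compassion_axis, animal_compassion_axis)
print(round(bias, 3))
```

Under this assumed convention, a score near zero would mean the assistant axis is about equally aligned with both compassion axes.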
Results are preliminary; we are continuing to validate and extend the analysis.
Across all four models tested, the compassion axis aligns with the assistant axis at a cosine similarity of roughly 0.2 to 0.3 (20 to 30 percent). Compassion is not incidental to assistant behavior; it is structurally represented.
Early results on speciesism, the difference in alignment between human-directed and animal-directed compassion, vary across models and include a notable reversal between generations within the same model family. We are still validating these findings.
The same extraction procedure works across Qwen and Gemma at different parameter counts, suggesting that this approach to probing assistant character is portable rather than model-specific.