TY - GEN
T1 - Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations
AU - Waterschoot, Cedric
AU - Tintarev, Nava
AU - Barile, Francesco
PY - 2025/8/7
Y1 - 2025/8/7
N2 - Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS.
AB - Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS.
KW - Large Language Models
KW - Group Recommender Systems
KW - Social choice-based aggregation strategies
KW - Explanations
U2 - 10.1145/3705328.3748015
DO - 10.1145/3705328.3748015
M3 - Conference article in proceeding
SN - 9798400713644
T3 - RecSys : Proceedings of the ACM Conference on Recommender Systems
SP - 539
EP - 544
BT - Proceedings of the Nineteenth ACM Conference on Recommender Systems
PB - Association for Computing Machinery
CY - New York, NY, USA
ER -