A team of eleven researchers led by Iordanis Fostiropoulos submitted GISTBench to arXiv on March 31, 2026. The benchmark is designed to evaluate how well large language models can extract and verify user interests from engagement histories in recommendation systems, and the paper introduces two new metric families along with a synthetic dataset built on real interactions from a global short-form video platform.
- GISTBench introduces two metric families — Interest Groundedness (IG) and Interest Specificity (IS) — to evaluate LLM-generated user profiles against actual engagement data.
- Eight open-weight LLMs ranging from 7B to 120B parameters were evaluated, with performance bottlenecks identified across all of them.
- The dataset is synthetic but was constructed from real user interactions on a global short-form video platform and validated against user surveys.
- A key failure mode identified: current LLMs struggle to accurately count and attribute engagement signals across heterogeneous interaction types.
What Happened
Fostiropoulos and co-authors — including Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, and Xiangjun Fan — published GISTBench as a direct response to what they describe as a gap in how recommendation systems are evaluated. Traditional benchmarks in the RecSys field measure item prediction accuracy: did the model correctly guess the next item a user would engage with? GISTBench instead asks whether an LLM can accurately model the user themselves — inferring and verifying their interests from prior engagement data.
The paper was submitted on March 31, 2026 and is available at arXiv identifier 2603.29112. The authors also released the dataset alongside the benchmark.
Why It Matters
Recommendation systems increasingly rely on LLMs not just to rank content but to build semantic user profiles — summarizing what a user likes, dislikes, or is likely to engage with next. If those profiles contain hallucinated or vague interest categories, downstream recommendations degrade in ways that standard accuracy metrics do not capture. GISTBench is designed to surface exactly that failure mode.
Prior work in the RecSys field has focused on collaborative filtering and sequential recommendation tasks. The shift toward LLM-based user modeling introduces a new class of errors — fabricated interest categories and poorly attributed engagement signals — that existing benchmarks were not designed to detect.
Technical Details
The benchmark introduces two metric families. Interest Groundedness (IG) is decomposed into precision and recall components: precision penalizes an LLM for hallucinating interest categories not supported by the engagement data, while recall rewards coverage of interests that are genuinely present. Interest Specificity (IS) measures how distinctive the LLM-predicted user profile is — whether it meaningfully differentiates one user from another, rather than producing generic descriptions.
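The paper's exact formulas are not reproduced here, but the precision/recall decomposition of Interest Groundedness can be illustrated with a minimal set-based sketch. Everything below is hypothetical: the actual benchmark may use softer matching (for example, embedding similarity between interest labels) rather than exact string comparison.

```python
def interest_groundedness(predicted, ground_truth):
    """Set-based precision/recall over interest categories.

    Hypothetical sketch only: GISTBench's real IG metrics may match
    interests more flexibly than normalized exact labels.
    """
    pred = {p.strip().lower() for p in predicted}
    true = {t.strip().lower() for t in ground_truth}
    overlap = pred & true
    # Precision penalizes hallucinated interests not supported by engagement data.
    precision = len(overlap) / len(pred) if pred else 0.0
    # Recall rewards coverage of interests genuinely present in the data.
    recall = len(overlap) / len(true) if true else 0.0
    return precision, recall

p, r = interest_groundedness(
    ["Cooking", "Travel", "Crypto"],   # model-predicted profile ("Crypto" is hallucinated)
    ["cooking", "travel", "fitness"],  # interests supported by the data ("fitness" is missed)
)
# p = 2/3, r = 2/3
```

Under this toy matching scheme, a profile padded with generic or fabricated categories loses precision, while a profile that omits demonstrated interests loses recall.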
The researchers evaluated eight open-weight LLMs spanning a parameter range of 7B to 120B. The dataset was constructed synthetically but grounded in real user interactions on a global short-form video platform, and includes both implicit engagement signals (such as watch time or scroll behavior) and explicit signals (such as likes or saves), accompanied by rich textual content descriptions.
Dataset fidelity was validated against user surveys, giving the synthetic labels a degree of real-world calibration. The authors report that across all eight models tested, a consistent bottleneck emerged: the models showed limited ability to accurately count and attribute engagement signals when those signals came from heterogeneous interaction types — that is, when implicit and explicit signals needed to be weighed against each other simultaneously.
As the authors state in the paper: “Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.”
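To make that bottleneck concrete, the sketch below shows the deterministic bookkeeping the task implicitly requires: tallying implicit and explicit signals per interest category from a mixed event log. The field names and signal taxonomy are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

# Hypothetical engagement log mixing implicit and explicit signal types.
events = [
    {"category": "cooking", "type": "watch_time", "value": 45.0},  # implicit
    {"category": "cooking", "type": "like",       "value": 1},     # explicit
    {"category": "travel",  "type": "watch_time", "value": 3.0},   # implicit
    {"category": "travel",  "type": "save",       "value": 1},     # explicit
]

IMPLICIT = {"watch_time", "scroll"}
EXPLICIT = {"like", "save"}

def attribute_signals(events):
    """Count signals per category, keeping implicit and explicit tallies separate."""
    counts = defaultdict(lambda: {"implicit": 0, "explicit": 0})
    for e in events:
        kind = "implicit" if e["type"] in IMPLICIT else "explicit"
        counts[e["category"]][kind] += 1
    return dict(counts)

tallies = attribute_signals(events)
# tallies["cooking"] == {"implicit": 1, "explicit": 1}
```

This aggregation is trivial in code, which is what makes the reported finding notable: when the same counting and attribution must happen inside an LLM's profile-generation step, across heterogeneous signal types, the eight evaluated models did it unreliably.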
Who’s Affected
The benchmark is directly relevant to teams building LLM-powered personalization layers in content platforms — particularly short-form video, news aggregation, and e-commerce — where user interest modeling is a core component. Developers integrating LLMs into RecSys pipelines will find the IG and IS metrics directly applicable to evaluating whether their models are generating grounded or hallucinated user profiles.
Researchers working on user modeling, personalization, and LLM evaluation methodology also have a new public dataset to work with. The dataset’s dual signal structure — implicit plus explicit — makes it more representative of production environments than datasets that capture only one signal type.
What’s Next
The dataset and benchmark are publicly released, making immediate replication and extension possible. One stated limitation is the synthetic nature of the dataset: while the authors validated it against user surveys, real-world deployment will determine how well performance on GISTBench correlates with actual recommendation quality. The identified bottleneck — heterogeneous signal attribution — represents a concrete research direction for improving LLM user modeling.
The evaluation covered only open-weight models. How closed-weight frontier models perform on GISTBench was not assessed in this paper, leaving a gap that follow-up work could address.