A Benchmark for People Intelligence
Background checks, fraud investigations, and people-related data enrichment all need the same thing — verified facts about real individuals. We built RECON to measure how well AI systems actually do this.
Deep research benchmarks like BrowseComp, DeepSearchQA, and SealQA measure how well systems synthesize general information across the web. But tasks like background checks, fraud investigations, and people-related data enrichment call for something different: people research.
RECON is the first people research benchmark. It tests a system's ability to find and verify datapoints about real individuals from sources that are scattered, unindexed, and often conflicting: archived school records, old forum posts, and public records. The dataset covers 100 people and 474 verified fields across six categories, spanning direct lookups and multi-hop causal chains.
What makes people research hard
People research is challenging in three ways.
Identity resolution
The internet is full of people named "Saarth Shah." Finding facts about the right one requires anchoring to known ground truth and verifying each new connection before extending further. A wrong identity match produces a confidently wrong answer, and every finding built on it inherits the error.
Causal chains
Many people research tasks can't be answered with a single search. Confirming whether someone held a prior directorship at a dissolved company might require: finding their current business registration, tracing a shared address to a dissolved entity, then searching that entity's filings to surface their prior role. Each step depends on the previous one.
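As an illustration of that dependency, here is a minimal sketch in Python. The Entity shape and the lookup helpers (find_registration_address, find_entities_at_address) are hypothetical stand-ins for whatever registries an agent would actually query; they are not part of RECON or any real API.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical record for a registered business entity."""
    name: str
    dissolved: bool
    directors: list[str] = field(default_factory=list)


def find_registration_address(person: str) -> str | None:
    """Hypothetical lookup: address on the person's current business registration."""
    return None  # stub for illustration


def find_entities_at_address(address: str) -> list[Entity]:
    """Hypothetical lookup: entities (including dissolved ones) registered at an address."""
    return []  # stub for illustration


def held_prior_directorship(person: str) -> bool:
    # Step 1: anchor on the current, verified business registration.
    address = find_registration_address(person)
    if address is None:
        return False  # the chain breaks here; nothing downstream can be checked

    # Steps 2-3: trace the shared address to dissolved entities, then check
    # their filings for the same person's prior role. An identity error at
    # any step propagates into the final answer.
    return any(
        entity.dissolved and person in entity.directors
        for entity in find_entities_at_address(address)
    )
```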
Fragmented signals
People leave traces across dozens of platforms. Unlike corporate filings, this information is scattered, often unindexed, and rarely conclusive on its own. A key fact might be a single line buried in a scanned PDF, a public record, or an old forum post, or it might require reconciling conflicting signals across multiple sources.
How we built the benchmark
People & fields
We compiled 100 individuals across industries, roles, and levels of digital footprint, excluding public figures, whose information is too widely indexed to provide a meaningful stress test. The dataset covers 474 verified facts across six categories that reflect what comes up most often in real people research tasks. Every field was verified against a primary source with an unambiguous identity link to the person; fields that relied on inference were rejected.
A sample task
Each entry pairs a lead with a set of fields to find. lead_info gives the agent its ground-truth anchor, and struct defines what it needs to return.
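The entry itself is not reproduced here; as a rough sketch of the shape only, assuming a JSON-like record (every value below is a placeholder, and how struct encodes each field is an assumption):

```python
# Placeholder sketch of a single benchmark entry; not the real data.
entry = {
    "lead_info": {
        # Ground-truth anchor the agent starts from.
        "name": "<full name>",
        "current_company": "<placeholder identifying detail>",
    },
    "struct": {
        # Fields the agent must find and return.
        "prev_startup_first_customer": "<description of the field to find>",
        "isc_exam_score_and_ranking": "<description of the field to find>",
        "early_twitter_handle": "<description of the field to find>",
    },
}
```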
prev_startup_first_customer is a multi-hop chain: find Saarth's startup before Sixtyfour, find its launch announcement, then extract the first customer from the post. isc_exam_score_and_ranking requires locating a La Martiniere College Facebook post from 2019 that lists top board exam scorers. early_twitter_handle predates any association with Sixtyfour and was used under a different online identity, recoverable through an archived version of a personal website. The expected answers pair each of these fields with a verified ground-truth value.
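Those values are not reproduced here; as a sketch of the expected shape only, with placeholders standing in for the real answers:

```python
# Placeholders only; the benchmark stores the verified ground-truth values.
expected_answers = {
    "prev_startup_first_customer": "<first customer named in the launch post>",
    "isc_exam_score_and_ranking": "<score and rank from the 2019 Facebook post>",
    "early_twitter_handle": "<handle recovered via the archived personal site>",
}
```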
Each field is graded by a GPT-4.1-mini judge as correct, incorrect, or missing, ignoring formatting differences. The judge's grades were validated against human review, with 98.5% agreement. Weighted accuracy is computed as (correct − incorrect) / total evaluated fields; standard accuracy is simply correct / total evaluated fields.
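As a minimal sketch of the scoring arithmetic (assuming missing fields count toward the total of evaluated fields, as the formulas above suggest):

```python
from collections import Counter


def score(labels: list[str]) -> dict[str, float]:
    """Aggregate per-field judge labels ('correct', 'incorrect', 'missing')
    into the benchmark's two accuracy metrics."""
    counts = Counter(labels)
    total = len(labels)  # all evaluated fields for this provider
    return {
        "standard_accuracy": counts["correct"] / total,
        "weighted_accuracy": (counts["correct"] - counts["incorrect"]) / total,
    }


# Example: 7 correct, 1 incorrect, 2 missing out of 10 evaluated fields
# -> standard = 0.70, weighted = (7 - 1) / 10 = 0.60
print(score(["correct"] * 7 + ["incorrect"] + ["missing"] * 2))
```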
What the numbers tell us
Four patterns emerge clearly across the benchmark. We walk through each one in turn.
1. Agentic multi-turn research outperforms out-of-the-box providers
As long-horizon research harnesses, Sixtyfour and Parallel significantly outperform out-of-the-box providers (GPT-5.4, Gemini, Exa, Grok). Sixtyfour goes further: because it is purpose-built for people research, it manages context more effectively at depth and runs more efficiently. Sixtyfour High (73.0% weighted, 328s) outperforms Parallel Ultra 8x (60.6%, 990s) by 12.4 points while running in roughly a third of the latency. For domain-specific research, the harness and the specialization both compound.
3. Hallucination rate matters as much as accuracy
A wrong answer is harder to catch than a missing one, and in many contexts more damaging. We report weighted accuracy — (correct − incorrect) / total evaluated fields — as our primary metric to account for this. The breakdown below shows correct, incorrect, and missing fields per provider.
4. People research is an unsolved problem
Even our best system gets roughly 1 in 10 answers wrong. In a background check or fraud investigation, that error rate has real consequences. The hardest fields remain out of reach for all systems tested. Long-horizon agents sometimes have to exercise judgment, and in high-stakes contexts, critical findings should still be verified against primary sources.
What this benchmark does not say
RECON measures accuracy on a fixed set of fields; different field selection could produce different rankings. Latency reflects conditions at the time of testing and may vary with provider updates. The dataset is not publicly available as it contains sensitive personally identifiable information. All evaluations were conducted April 22 – 26, 2026.