RECON · Sixtyfour Research

People research, benchmarked.
100 people · 474 verified fields · 6 research categories

Filed: April 26, 2026
Author: Sixtyfour Research
Length: 9 min read
Benchmark · Sixtyfour Research

A Benchmark for People Intelligence

Background checks, fraud investigations, and people-related data enrichment all need the same thing — verified facts about real individuals. We built RECON to measure how well AI systems actually do this.

Deep research benchmarks like BrowseComp, DeepSearchQA, and SealQA measure how well systems synthesize general information across the web. But tasks like background checks, fraud investigations, and people-related data enrichment call for something different: people research.

RECON is the first people research benchmark. It tests a system's ability to find and verify datapoints about real individuals from sources that are scattered, unindexed, and often conflicting: archived school records, old forum posts, and obscure public records. The dataset covers 100 people and 474 verified fields across six categories, spanning direct lookups and multi-hop causal chains.

Figure 01 · Weighted accuracy
Weighted accuracy = (correct − incorrect) / total evaluated fields. Sixtyfour systems are shown in solid black; external providers are shown in gray.
Figure 02 · Precision & recall
Precision = correct fields / attempted fields. Recall = correct fields / total evaluated fields.
The problem

What makes people research hard

People research is challenging in three ways.

Identity resolution

The internet is full of people named "Saarth Shah." Finding facts about the right one requires anchoring to known ground truth and verifying each new connection before extending further. A wrong identity match produces a confidently wrong answer, and every finding built on it inherits the error.

Causal chains

Many people research tasks can't be answered with a single search. Confirming whether someone held a prior directorship at a dissolved company might require: finding their current business registration, tracing a shared address to a dissolved entity, then searching that entity's filings to surface their prior role. Each step depends on the previous one.
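The directorship example above can be sketched as three dependent lookups. The sketch below is illustrative only: the registry, address, and filings values are stubbed toy data, and every function name is hypothetical rather than part of RECON.

```python
# Illustrative three-hop chain: each query is built from the previous
# step's answer, so a miss at any hop breaks the whole chain.
# All data below is stubbed; a real agent would back each hop with searches.

def find_registration(person):
    """Hop 1: locate the person's current business registration."""
    registry = {"A. Example": {"company": "NewCo LLC", "address": "12 Main St"}}
    return registry.get(person)

def trace_dissolved_entity(address):
    """Hop 2: trace a shared registered address to a dissolved entity."""
    dissolved = {"12 Main St": "OldCo Ltd"}
    return dissolved.get(address)

def prior_role_in_filings(person, entity):
    """Hop 3: search the dissolved entity's filings for the person's role."""
    filings = {("A. Example", "OldCo Ltd"): "Director, 2019-2021"}
    return filings.get((person, entity))

def confirm_prior_directorship(person):
    reg = find_registration(person)
    if reg is None:
        return None  # chain breaks at hop 1
    entity = trace_dissolved_entity(reg["address"])
    if entity is None:
        return None  # chain breaks at hop 2
    return prior_role_in_filings(person, entity)
```

Note that nothing here recovers from a wrong answer at hop 1: an incorrect registration silently poisons every later hop, which is why identity anchoring matters.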

Fragmented signals

People leave traces across dozens of platforms. Unlike corporate filings, this information is scattered, often unindexed, and rarely conclusive on its own. A key fact might be a single line buried in a scanned PDF, a public record, or an old forum post, or it might require reconciling conflicting signals across multiple sources.

A wrong identity match produces a confidently wrong answer, and every finding built on it inherits the error.
Methodology

How we built the benchmark

People & fields

We compiled 100 individuals across industries, roles, and digital-footprint levels, excluding public figures, whose widely indexed information would not stress-test a research system. The dataset covers 474 verified facts across six categories that reflect what comes up most in real people research tasks. Every field was verified against a primary source with an unambiguous identity link to the person. Fields relying on inference were rejected.

| Category | Count | Share | Example fields |
| --- | --- | --- | --- |
| Online Presence | 165 | 34.8% | GitHub username, old Twitter handle, gaming profile, linked aliases |
| Education | 75 | 15.8% | University attended, student ID, GPA, research lab |
| Professional History | 70 | 14.8% | First employer, co-founder names, incorporation date |
| Identity & Contact | 59 | 12.4% | Legal name, personal email, birth date, phone |
| Achievements & Published Work | 66 | 13.9% | Hackathon placements, ArXiv papers, conference talks, awards |
| Personal | 39 | 8.2% | Hobbies, personal milestones, life events |

A sample task

Each entry pairs a lead with a set of fields to find. lead_info gives the agent its ground-truth anchor, and struct defines what it needs to return:

```json
{
  "lead_info": { "name": "Saarth Shah", "company": "Sixtyfour" },
  "struct": {
    "prev_startup_first_customer": "First client of his startup directly before Sixtyfour",
    "isc_exam_score_and_ranking": "ISC board exam percentage and approximate ranking among high scorers",
    "early_twitter_handle": "Early Twitter handle used during high school"
  }
}
```
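Programmatically, an entry reduces to an anchor plus a field map. Below is a minimal sketch of a task record and loader; the Task class and load_task helper are our own illustration, not part of a published RECON API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One benchmark entry: a ground-truth anchor plus the fields to find."""
    lead_info: dict  # known-true anchor, e.g. {"name": ..., "company": ...}
    struct: dict     # field name -> natural-language description of the target

def load_task(entry):
    # Minimal validation: every task needs an anchor and at least one field.
    if not entry.get("lead_info"):
        raise ValueError("task missing lead_info anchor")
    if not entry.get("struct"):
        raise ValueError("task has no fields to research")
    return Task(lead_info=entry["lead_info"], struct=entry["struct"])
```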

prev_startup_first_customer is a multi-hop chain: find Saarth's startup before Sixtyfour, find its launch announcement, then extract the first customer from the post. isc_exam_score_and_ranking requires a La Martiniere College Facebook post from 2019 listing top board exam scorers. early_twitter_handle predates any association with Sixtyfour and was used under a different online identity, recoverable through an archived version of a personal website. The expected answers:

```json
{
  "prev_startup_first_customer": "Walnut Creek Dental",
  "isc_exam_score_and_ranking": "95.0%, ranked 9th among La Martiniere ISC high scorers (2019)",
  "early_twitter_handle": "herossgraphics"
}
```

Each field is graded by a GPT-4.1-mini judge as correct, incorrect, or missing, ignoring format differences. Judge verdicts were validated against human grading, with 98.5% agreement. Weighted accuracy is computed as (correct − incorrect) / total evaluated fields; standard accuracy is simply correct / total evaluated fields.
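All four metrics reduce to simple counts over the judge labels. A minimal sketch, assuming verdicts arrive as per-field strings (the score function is our own illustration):

```python
from collections import Counter

def score(labels):
    """Compute benchmark metrics from per-field judge verdicts.

    labels: one "correct" / "incorrect" / "missing" verdict per
    evaluated field.
    """
    c = Counter(labels)
    total = len(labels)
    attempted = c["correct"] + c["incorrect"]  # "missing" = not attempted
    return {
        # Standard accuracy: correct / total evaluated fields.
        "accuracy": c["correct"] / total,
        # Weighted accuracy penalizes confident wrong answers:
        # (correct - incorrect) / total evaluated fields.
        "weighted_accuracy": (c["correct"] - c["incorrect"]) / total,
        # Precision: correct fields / attempted fields.
        "precision": c["correct"] / attempted if attempted else 0.0,
        # Recall: correct fields / total evaluated fields.
        "recall": c["correct"] / total,
    }
```

With 7 correct, 2 incorrect, and 1 missing field, standard accuracy is 0.7 but weighted accuracy drops to 0.5; that gap is the penalty the benchmark applies to hallucinated answers.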

Key findings

What the numbers tell us

Four patterns emerge clearly across the benchmark. We walk through each one in turn.

1. Agentic multi-turn research outperforms out-of-the-box providers

As long-horizon research harnesses, Sixtyfour and Parallel significantly outperform out-of-the-box providers (GPT-5.4, Gemini, Exa, Grok). Sixtyfour goes further: being purpose-built for people research allows optimized context management at depth and greater efficiency. Sixtyfour High (73.0% weighted, 328s) outperforms Parallel Ultra 8x (60.6%, 990s) by 12.4 points while running in roughly a third of the latency. For domain-specific research, the harness and the specialization both compound.

Figure 03 · Accuracy graphs
Standard accuracy = correct fields / total evaluated fields. Weighted accuracy = (correct − incorrect) / total evaluated fields. Latency is p50 runtime.

3. Hallucination rate matters as much as accuracy

A wrong answer is harder to catch than a missing one, and in many contexts more damaging. To account for this, we report weighted accuracy, (correct − incorrect) / total evaluated fields, as our primary metric. The breakdown below shows correct, incorrect, and missing fields per provider.

Figure 05 · Error breakdown
Correct, incorrect, and missing fields are shown as shares of all evaluated fields. Weighted accuracy = (correct − incorrect) / total evaluated fields.

4. People research is an unsolved problem

Even our best system gets roughly 1 in 10 answers wrong. In a background check or fraud investigation, that error rate has real consequences. The hardest fields remain out of reach for all systems tested. Long-horizon agents sometimes have to exercise judgment, and in high-stakes contexts, critical findings should still be verified against primary sources.

Even our best system gets roughly 1 in 10 answers wrong. The hardest fields remain out of reach for all systems tested.
Limitations

What this benchmark does not say

RECON measures accuracy on a fixed set of fields; different field selection could produce different rankings. Latency reflects conditions at the time of testing and may vary with provider updates. The dataset is not publicly available as it contains sensitive personally identifiable information. All evaluations were conducted April 22 – 26, 2026.

Published April 26, 2026 by Sixtyfour Research.