AI Talent Search Benchmark

Overview

We evaluated top-tier LLMs and specialized agents across three difficulty levels. The benchmark tests AI agents' capabilities in multi-source synthesis, implicit reasoning, and real-time web navigation.

What is AI-TSB?

AI-TSB is a specialized stress test that evaluates how well AI agents handle real-world talent search tasks on the live internet.

Key Differentiators

vs. Generic Search (e.g., Google): Google returns links. AI-TSB agents must read, verify, and synthesize information from multiple sources.

vs. General Agent Benchmarks (e.g., WebArena): WebArena tests UI interaction. AI-TSB tests cognitive depth: the ability to reason across fragmented, noisy web data.

Challenge Levels

Level 1: Fact Checking

Objective: Instant verification of current roles, companies, and affiliations.

Why it matters: In real-world recruiting, verifying basic facts should be instant. Agents often fail by hallucinating outdated information or mixing up similar names.

Scoring Formula:

Score = Accuracy × LatencyFactor
LatencyFactor = max(0, 1 - 0.1 × (Time - 5s))

Any response taking longer than 15 seconds receives a score of 0, regardless of correctness.
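
The formula above can be sketched in Python as follows. This is a minimal illustration, not the benchmark's actual scoring code: the function names are hypothetical, and capping the factor at 1.0 for responses faster than 5 seconds is an assumption (the formula as written would exceed 1 in that range).

```python
def latency_factor(time_s: float) -> float:
    """LatencyFactor = max(0, 1 - 0.1 * (Time - 5s)), as stated in the formula."""
    raw = max(0.0, 1.0 - 0.1 * (time_s - 5.0))
    # Assumption (not stated in the source): answers faster than 5 s
    # earn no bonus, so the factor is capped at 1.0.
    return min(1.0, raw)


def l1_score(accuracy: float, time_s: float) -> float:
    """Score = Accuracy * LatencyFactor, with accuracy in [0, 1]."""
    return accuracy * latency_factor(time_s)


if __name__ == "__main__":
    for t in (3.0, 5.0, 10.0, 15.0, 20.0):
        print(f"time={t:>4}s  factor={latency_factor(t):.2f}")
```

With this sketch, a correct answer delivered in 10 seconds keeps half of its score, and any answer at or beyond 15 seconds scores 0.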

Case Study: L1_05

Query: "Who is the CEO of Anthropic?"

Expected Answer: Dario Amodei (as of 2025)

Common Failure: Many models hallucinate "Daniela Amodei" (President, not CEO) or retrieve outdated information.

Evaluation Metric Weights:

Metric	Weight
Precision	50%
Recall	20%
Enrichment	20%
Latency	10%

Speed matters: a response slower than 15 seconds earns none of the 10% latency weight.
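
One plausible reading of the weight table is a weighted sum of per-metric scores. The sketch below assumes each metric is normalized to the range 0 to 1 and that the total is reported on a 0 to 100 scale; the function and dictionary names are hypothetical, and the benchmark's actual aggregation may differ.

```python
from typing import Dict

# Level 1 weights, copied from the table above.
L1_WEIGHTS: Dict[str, float] = {
    "precision": 0.5,
    "recall": 0.2,
    "enrichment": 0.2,
    "latency": 0.1,
}


def weighted_total(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-metric scores (assumed to lie in [0, 1]) into a 0-100 total."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 100.0 * sum(weights[m] * components[m] for m in weights)


if __name__ == "__main__":
    # A fully correct but slow answer: the latency component contributes nothing.
    slow_but_correct = {"precision": 1.0, "recall": 1.0, "enrichment": 1.0, "latency": 0.0}
    print(round(weighted_total(slow_but_correct, L1_WEIGHTS), 1))
```

The same helper works for the other levels by swapping in their weight tables.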

Level 2: Logical Screening

Objective: Multi-constraint filtering with boolean logic.

Why it matters: Recruiters use complex filters (e.g., "DeepMind engineers who graduated from MIT"). Agents typically suffer from "Attention Drift," ignoring the 2nd or 3rd constraint to maximize recall at the cost of precision.

Case Study: L2_12

Query: "Find AI Engineers at DeepMind who also graduated from MIT."

Verification Logic:

Current Company == "Google DeepMind"
Education History contains "MIT" or "Massachusetts Institute of Technology"

Common Failure: Models return:

DeepMind employees from Stanford (ignores MIT constraint)
MIT grads at OpenAI (ignores DeepMind constraint)
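
The verification logic for this case can be sketched as a strict conjunction of both constraints. The Candidate record and helper names below are hypothetical, chosen only for illustration. The education check matches "MIT" as a whole word, so that a school such as "Smith College" does not pass by substring accident.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    """Hypothetical record shape for one returned profile."""
    name: str
    current_company: str
    education: List[str] = field(default_factory=list)


def mentions_mit(school: str) -> bool:
    text = school.lower()
    if "massachusetts institute of technology" in text:
        return True
    # Whole-word match only; "smith" must not count as "mit".
    return "mit" in re.findall(r"[a-z]+", text)


def passes_l2_12(candidate: Candidate) -> bool:
    """Both constraints must hold; satisfying one alone is a failure."""
    company_ok = candidate.current_company == "Google DeepMind"
    education_ok = any(mentions_mit(s) for s in candidate.education)
    return company_ok and education_ok


if __name__ == "__main__":
    stanford_case = Candidate("Example A", "Google DeepMind", ["Stanford University"])
    openai_case = Candidate("Example B", "OpenAI", ["MIT"])
    print(passes_l2_12(stanford_case), passes_l2_12(openai_case))  # False False
```

Both common failure modes listed above are rejected, because each satisfies only one of the two constraints.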

Evaluation Metric Weights:

Metric	Weight
Precision	60%
Recall	20%
Enrichment	20%
Latency	0%

Accuracy is king. Strict boolean logic enforcement.

Level 3: Deep Reasoning

Objective: Multi-hop reasoning to find "hidden" talent not discoverable through keyword search.

Why it matters: Top talent is often hidden. Finding them requires deducing implicit relationships (e.g., "co-author of X paper but not listed in main credits").

Case Study: L3_08

Query: "Find the 'hidden author' of the Qwen technical report who is not listed in the main author block but contributed significantly to the codebase."

Required Reasoning Chain:

Step 1: Retrieve Qwen technical report & extract author list
Step 2: Locate Qwen GitHub repository
Step 3: Analyze commit history for high-frequency contributors NOT in Step 1
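
Step 3 amounts to a set difference between frequent committers and listed authors. The sketch below assumes the author list and the per-commit author names have already been retrieved in Steps 1 and 2. The function name, the min_commits threshold, and the placeholder names are all hypothetical. Matching by normalized name is also a simplification, since Git author names and paper author names often differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def hidden_contributors(
    report_authors: Iterable[str],
    commit_authors: Iterable[str],
    min_commits: int = 10,
) -> List[Tuple[str, int]]:
    """Return (name, commit_count) for frequent committers not in the author list."""
    listed = {name.strip().lower() for name in report_authors}
    counts = Counter(name.strip().lower() for name in commit_authors)
    return [
        (name, n)
        for name, n in counts.most_common()  # sorted by commit count, descending
        if n >= min_commits and name not in listed
    ]


if __name__ == "__main__":
    # Placeholder data only; these are not real contributors.
    authors = ["Author One", "Author Two"]
    commits = ["author one"] * 12 + ["Contributor X"] * 15 + ["Contributor Y"] * 3
    print(hidden_contributors(authors, commits))  # [('contributor x', 15)]
```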

Evaluation: Scored by LLM-as-a-Judge based on:

Completeness of evidence chain (30%)
Validity of identified contributor (50%)
Logical coherence (20%)

Evaluation Metric Weights:

Metric	Weight
Reasoning	50%
Precision	30%
Recall	10%
Enrichment	10%

LLM Judge evaluates the logical chain and evidence.

Methodology

Hybrid Scoring System

Our evaluation framework combines deterministic verification (L1, L2) with LLM-as-a-Judge semantic evaluation (L3).

Level 1: Latency-Aware Accuracy

Speed is as critical as accuracy for fact-checking. We penalize slow responses through a latency factor that falls linearly from 1 at 5 seconds to 0 at 15 seconds.

Level 2: Weighted F1 Score

To combat "resume spamming" (high recall, low precision), we prioritize Precision over Recall with weighted F1.
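
One common way to weight F1 toward precision is the F-beta score with beta below 1. Whether AI-TSB uses this exact form is not stated here, so the sketch below is illustrative only, and beta = 0.5 is an assumed value.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom


if __name__ == "__main__":
    # "Resume spamming": high recall, low precision.
    spammy = f_beta(precision=0.3, recall=0.9)
    # Careful filtering: high precision, lower recall.
    careful = f_beta(precision=0.9, recall=0.3)
    print(f"spammy={spammy:.3f} careful={careful:.3f}")
```

With beta = 0.5, the careful agent scores well above the spammy one even though the plain F1 score of the two would be identical.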

Level 3: Semantic Chain Evaluation

A fine-tuned GPT-4o judges the agent's reasoning trace:

Component	Weight
Correctness	50%
Evidence Quality	30%
Logic Coherence	20%

Evaluation Pipeline

1. Live Web Injection

Unlike static benchmarks (e.g., Mind2Web), AI-TSB requires agents to navigate the live internet, handling:

Dynamic DOMs
CAPTCHAs
Login walls and paywalls (LinkedIn, Twitter)

2. Trace Recording

We record the full execution trace:

Search queries issued (Google/Bing)
URLs visited & dwell time
DOM interactions (clicks, scrolls)
Final JSON output structure
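
The recorded fields could be modeled as a small set of dataclasses. The class and field names below are hypothetical, chosen to mirror the four items listed above; the benchmark's actual trace schema is not specified here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class PageVisit:
    url: str
    dwell_seconds: float


@dataclass
class DomInteraction:
    kind: str    # e.g. "click" or "scroll"
    target: str  # e.g. a CSS selector or element description


@dataclass
class ExecutionTrace:
    search_queries: List[str] = field(default_factory=list)
    visits: List[PageVisit] = field(default_factory=list)
    interactions: List[DomInteraction] = field(default_factory=list)
    final_output: Dict[str, Any] = field(default_factory=dict)


def trace_to_json(trace: ExecutionTrace) -> str:
    """Serialize a trace, including nested records, for storage or later review."""
    return json.dumps(asdict(trace), indent=2)


if __name__ == "__main__":
    trace = ExecutionTrace(
        search_queries=["example query"],
        visits=[PageVisit("https://example.com/profile", 4.2)],
        interactions=[DomInteraction("click", "a.profile-link")],
        final_output={"candidates": []},
    )
    print(trace_to_json(trace))
```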

3. Ground Truth Verification

The final output is compared against a manually curated Golden Dataset.

For L3 queries, we verify provenance: Did the agent actually visit the GitHub commit page, or did it hallucinate the author based on the README?
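
A provenance check of this kind can be approximated by scanning the visited URLs in the trace for a commit page of the target repository. The sketch below is a hedged illustration: the function name and the placeholder repository path are hypothetical, and a visit alone does not prove that the agent actually used the page.

```python
from typing import Iterable
from urllib.parse import urlparse


def visited_commit_page(visited_urls: Iterable[str], repo_path: str) -> bool:
    """True if any visited URL is a commit or commit-history page of the repo.

    repo_path is "owner/repo"; GitHub serves history under /owner/repo/commits/...
    and single commits under /owner/repo/commit/<sha>.
    """
    prefix = "/" + repo_path.strip("/").lower()
    for url in visited_urls:
        parsed = urlparse(url)
        if parsed.netloc.lower() not in ("github.com", "www.github.com"):
            continue
        path = parsed.path.lower()
        if path.startswith(prefix + "/commits") or path.startswith(prefix + "/commit/"):
            return True
    return False


if __name__ == "__main__":
    # Placeholder repository path, not a real target.
    readme_only = ["https://github.com/example-org/example-repo"]
    with_history = readme_only + ["https://github.com/example-org/example-repo/commits/main"]
    print(visited_commit_page(readme_only, "example-org/example-repo"))   # False
    print(visited_commit_page(with_history, "example-org/example-repo"))  # True
```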

Anti-Gaming Measures

To prevent memorization, 30% of the dataset consists of dynamic queries (e.g., "trending repositories this week") that change over time, forcing live retrieval.

Full Results

Complete performance breakdown across all models and difficulty levels.

Model	Level	Total Score	Recall	Precision	Enrichment	Reasoning	Latency (s)	Samples
DeepSeek-V3.2	L1	50.5	13.2	21.7	15.0	-	20.2	20
DeepSeek-V3.2	L2	81.2	16.2	48.8	16.2	-	16.9	16
DeepSeek-V3.2	L3	77.8	10.0	18.3	9.4	39.8	60.1	18
Haiku-4.5	L1	50.0	12.5	22.5	14.0	-	14.7	20
Haiku-4.5	L2	88.8	17.5	52.5	18.8	-	14.0	16
Haiku-4.5	L3	60.6	9.4	13.3	6.7	31.2	18.8	18
Gemini-3-pro	L1	59.5	16.5	25.0	18.0	-	72.6	20
Gemini-3-pro	L2	76.2	15.0	45.0	16.2	-	74.4	16
Gemini-3-pro	L3	93.4	10.0	25.7	10.0	47.4	42.7	18
Exa-Baseline	L1	28.7	16.2	12.5	0.0	-	39.5	20
Exa-Baseline	L2	67.5	18.8	48.8	0.0	-	39.9	16
Exa-Baseline	L3	38.3	10.0	8.3	10.0	20.0	8.8	18
Flowith-Neo	L3	88.5	10.0	27.7	1.4	49.4	-	18
DINQ (OURS)	L1	53.6	17.0	19.6	17.0	-	65.6	20
DINQ (OURS)	L2	97.0	20.0	57.0	20.0	-	37.2	16
DINQ (OURS)	L3	81.6	10.0	22.7	3.9	41.0	58.4	18

Key Insights

Insight 1: Specialized Agents Excel at Logical Screening

DINQ achieves 97.0 on L2, ahead of every general LLM tested (the next best, Haiku-4.5, scores 88.8), by avoiding "attention drift": the tendency to ignore constraints when processing multi-part queries.

Insight 2: Frontier Models Lead in Deep Reasoning

For L3 tasks requiring multi-hop reasoning, models with massive context windows (Gemini-3-pro: 93.4, Flowith-Neo: 88.5) still hold the advantage over DINQ (81.6).

Insight 3: Latency Remains a Universal Challenge

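All agents suffer from high latency on simple fact-checking (L1), with average latencies ranging from 14.7 s (Haiku-4.5) to 72.6 s (Gemini-3-pro), well above the 5-second mark at which the latency penalty begins. This highlights the need for "System 1 vs System 2" routing: fast paths for simple queries, deep reasoning for complex ones.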