The ancestry of intelligence

High-quality data, at scale and per sample.

Whether it's generated at scale by our expert-designed pipelines or verified by a specialist sample by sample, every Ancest dataset clears the same high-quality bar — source-traced, leakage-checked, and built for the research problem you're solving. Every stage, every modality.

Expert-designed & verified · Pretraining → SFT → RL → alignment → eval · 10+ research domains · Every modality
What we build

One data partner for your whole pipeline

Pretraining corpora

Recaptioned, deduplicated, provenance-traced web-scale corpora — text, image-text, and multimodal — cleaned for signal, not volume.

SFT & instruction tuning

Expert-written instructions and demonstrations, plus reasoning traces with verified intermediate steps for chain-of-thought and tool use.

RL & preference data

Preference pairs, process-reward labels, and verifiable reward signals authored by people who know the right answer in their field.

Alignment & safety

Curated safety-alignment and red-team sets that harden reasoning models without dulling their capability.

Evaluation & benchmarks

Leakage-sealed benchmarks scored on held-out data, with measured anchors so a number means what it claims to mean.

Agent trajectories

Long-horizon, closed-loop rollouts in RL/SFT-ready format — including our AutoResearch task bank — archived uniformly to Hugging Face.

How we verify

Quality is a process, not a promise.

Experts design every pipeline and the checks that gate it: automated quality controls at scale, hand-verification where it counts, so every dataset clears the same bar. Provenance is recorded, leakage paths are cut, and benchmarks are scored from scratch on sealed held-out data. Every dataset is checked against four gates before it ships.

01

Value — does this measure something a real researcher cares about?

02

Measurability — is success defined by an objective, reproducible metric?

03

Real data, no leakage — sourced, anonymized, leakage paths cut.

04

Metrics + measured anchors — every number tied to a verified baseline.
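To make gate 03 concrete, here is one common way a leakage check can be approximated: flag benchmark items whose word n-grams overlap heavily with a training corpus. This is a minimal sketch under stated assumptions, not a description of Ancest's actual pipeline. The function names, the whitespace tokenization, the 8-gram window, and the 0.5 threshold are all illustrative choices.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_items(
    train_docs: Iterable[str],
    eval_items: Iterable[str],
    n: int = 8,
    threshold: float = 0.5,
) -> List[Tuple[int, float]]:
    """Flag eval items whose n-gram overlap with the training docs meets `threshold`.

    Returns (index, overlap_fraction) pairs. Hypothetical helper for illustration.
    """
    # Collect every n-gram seen anywhere in the training corpus.
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)

    flagged: List[Tuple[int, float]] = []
    for idx, item in enumerate(eval_items):
        grams = ngrams(item, n)
        if not grams:
            # Item is shorter than n tokens, so this check cannot score it.
            continue
        overlap = len(grams & train_grams) / len(grams)
        if overlap >= threshold:
            flagged.append((idx, overlap))
    return flagged


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    benchmark = [
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "completely different sentence about protein folding and molecular "
        "dynamics simulation runs here",
    ]
    print(leaked_items(train, benchmark))
```

Exact n-gram matching only catches verbatim or near-verbatim copies. Paraphrased leakage would need a fuzzier signal such as embedding similarity, and items shorter than the window are skipped rather than cleared.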

Track record · the agents the field runs on

We build what researchers run on

Our open-source agents have tens of thousands of GitHub stars, our datasets train models across the industry, and our work is published at CVPR, ICML, ICLR, NeurIPS, and ECCV. The long-horizon data we sell comes from the same lab.

Open-source agents & benchmarks
Autonomous research · 2026

AutoResearchClaw

Chat an idea, get a paper — fully autonomous, self-evolving research.

13.6k GitHub stars
Research-agent benchmark · 2026

AutoResearch-Bench

Measures AutoResearch agents — improve a real method from a weak baseline, scored on sealed hidden data.

10+ expert-authored domains
Agent memory · 2026

SimpleMem

Efficient lifelong memory for LLM agents — text and multimodal.

3.6k GitHub stars
Self-evolving agent · 2026

MetaClaw

Talk to your agent; it learns and turns conversation into training data.

3.4k GitHub stars
Self-evolving agent · 2026

Agent0

Two agents co-evolve from zero data via tool-integrated reasoning.

1.2k GitHub stars
Agent RL · 2026

SkillRL

RL agents distill trajectories into a reusable, co-evolving skill library.

855 GitHub stars
Multimodal agent · 2026

VisualClaw

Real-time self-evolving VLM agent — frame-gating and skill banks cut API cost dramatically.

~98% lower API cost
Agent benchmark · 2026

VisualClawArena

200-scenario benchmark pairing video clips with a persistent workspace and executable checkers.

9.4k downloads / mo
Agent governance · 2026

AutoHarness

Lightweight runtime harness adding risk control, cost tracking, and audit trails to any LLM client.

334 GitHub stars
Multimodal benchmark · CVPR 2026

MIRA

Visual chain-of-thought: models must draw intermediate images to reason.

546 visual-CoT tasks
Agent benchmark · 2026

ClawArena

Benchmarking AI agents in evolving information environments.

12 domains · 337 rounds
Agent benchmark · new · 2026

ClawForge

Executable interactive benchmarks for command-line agents.

best frontier model 45%
Agent safety · 2026

CIK-Bench

Safety benchmark — 88 attacks probing capability, identity, and knowledge poisoning of personal agents.

70 GitHub stars
GUI agent · 2026

VLAA-GUI

Modular GUI-automation agent that knows when to stop, recover, and search.

34 GitHub stars
Image-edit benchmark · TMLR 2026

Complex-Edit

Complexity-controllable image-editing benchmark with a Chain-of-Edit evaluation pipeline.

29 GitHub stars
Physics reasoning · NeurIPS 2025

PHYBench

500 original physics problems, high-school to Olympiad — best model 37% vs humans 62%.

500 expert-curated problems
Video reasoning · EMNLP 2025 · Oral

GLIMPSE

Tests whether video LVLMs truly reason over time — 3,269 videos, 4,342 human-crafted questions.

3,269 videos · 4,342 Qs
Datasets & training data

Selected releases · HF downloads (trailing 30 days) and GitHub stars verified June 2026.

Built on Harbor

Massively parallel cloud sandboxes, native RL / SFT rollout export, two-stage anti-cheat verifier, and trajectories archived uniformly to Hugging Face — born out of the Terminal-Bench ecosystem.

Read the docs

Tell us your training stage and your domain.

We build the expert-verified data your pipeline is missing — from pretraining corpora to RL trajectories, in any field and any modality.

Talk to us: hello@ancest.ai