The ancestry of intelligence

High-quality data, from scale to per-sample.

Whether it's generated at scale by our expert-designed pipelines or verified by a specialist sample by sample, every Ancest dataset clears the same high-quality bar — source-traced, leakage-checked, and built for the research problem you're solving. Every stage, every modality.

Expert-designed & verified · Pretraining → SFT → RL → alignment → eval · 10+ research domains · Every modality

What we build

One data partner for your whole pipeline

Pretraining corpora

Recaptioned, deduplicated, provenance-traced web-scale corpora — text, image-text, and multimodal — cleaned for signal, not volume.

SFT & instruction tuning

Expert-written instructions and demonstrations, plus reasoning traces with verified intermediate steps for chain-of-thought and tool use.

RL & preference data

Preference pairs, process-reward labels, and verifiable reward signals authored by people who know the right answer in their field.

Alignment & safety

Curated safety-alignment and red-team sets that harden reasoning models without dulling their capability.

Evaluation & benchmarks

Leakage-sealed benchmarks scored on held-out data, with measured anchors so a number means what it claims to mean.

Agent trajectories

Long-horizon, closed-loop rollouts in RL/SFT-ready format — including our AutoResearch task bank — archived uniformly to Hugging Face.
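As a rough illustration of what a leakage check on benchmark data can look like, the sketch below flags benchmark items that share a long word n-gram with training text. N-gram overlap is one common technique, assumed here for the example; the source does not say which method Ancest uses. The function names `ngrams` and `leaked_items` and the default window of 8 words are hypothetical.

```python
import re


def ngrams(text, n=8):
    """Return the set of word n-grams in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    # Texts shorter than n words produce an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaked_items(benchmark_items, training_docs, n=8):
    """Return indices of benchmark items that share any n-gram with the training docs."""
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items) if ngrams(item, n) & seen]


if __name__ == "__main__":
    training = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    benchmark = [
        "Question: the quick brown fox jumps over the lazy dog near what?",
        "An unrelated item about protein folding and energy landscapes in cells",
    ]
    print(leaked_items(benchmark, training))
```

A shorter window catches more paraphrased overlap at the cost of more false positives, so the window size is a judgment call rather than a fixed rule.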

How we verify

Quality is a process, not a promise.

Experts design every pipeline and the checks that gate it: automated quality controls at scale, hand-verification where it counts, so every dataset clears the same bar. Provenance is recorded, leakage paths are cut, and benchmarks are scored from scratch on sealed held-out data. Every dataset is checked against four gates before it ships.

01

Value — does this measure something a real researcher cares about?

02

Measurability — is success defined by an objective, reproducible metric?

03

Real data, no leakage — sourced, anonymized, leakage paths cut.

04

Metrics + measured anchors — every number tied to a verified baseline.
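Read as a checklist, the four gates above mean a dataset ships only when every gate passes. The sketch below is illustrative only and is not Ancest's actual tooling; the `GateReport` class, its field names, and the helper functions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, fields


@dataclass
class GateReport:
    """Hypothetical record of the four release gates for one dataset."""
    value: bool                 # 01: measures something a real researcher cares about
    measurability: bool         # 02: success defined by an objective, reproducible metric
    real_data_no_leakage: bool  # 03: sourced, anonymized, leakage paths cut
    measured_anchors: bool      # 04: every number tied to a verified baseline


def failed_gates(report):
    """Return the names of the gates that did not pass, in gate order."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


def ready_to_ship(report):
    """A dataset is releasable only if no gate failed."""
    return not failed_gates(report)


if __name__ == "__main__":
    report = GateReport(value=True, measurability=True,
                        real_data_no_leakage=False, measured_anchors=True)
    print(failed_gates(report), ready_to_ship(report))
```

Keeping the gates as named fields rather than a single pass/fail flag means a rejected dataset reports which gate blocked it.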

Track record · the agents the field runs on

We build what researchers run on

Our open-source agents have tens of thousands of GitHub stars, and our datasets, published at CVPR, ICML, ICLR, NeurIPS, and ECCV, train models across the industry. The long-horizon data we sell comes from the same lab.

Open-source agents & benchmarks

Autonomous research · 2026

AutoResearchClaw

Chat an idea, get a paper — fully autonomous, self-evolving research.

13.6k GitHub stars

Research-agent benchmark · 2026

AutoResearch-Bench

Measures AutoResearch agents: each must improve a real method from a weak baseline, scored on sealed hidden data.

10+ expert-authored domains

Agent memory · 2026

SimpleMem

Efficient lifelong memory for LLM agents — text and multimodal.

3.6k GitHub stars

Self-evolving agent · 2026

MetaClaw

Talk to your agent; it learns and turns conversation into training data.

3.4k GitHub stars

Self-evolving agent · 2026

Agent0

Two agents co-evolve from zero data via tool-integrated reasoning.

1.2k GitHub stars

SkillRL

RL agents distill trajectories into a reusable, co-evolving skill library.

855 GitHub stars

Multimodal agent · 2026

VisualClaw

Real-time self-evolving VLM agent — frame-gating and skill banks cut API cost dramatically.

~98% lower API cost

Agent benchmark · 2026

VisualClawArena

200-scenario benchmark pairing video clips with a persistent workspace and executable checkers.

9.4k downloads / mo

Agent governance · 2026

AutoHarness

Lightweight runtime harness adding risk control, cost tracking, and audit trails to any LLM client.

334 GitHub stars

Multimodal benchmark · CVPR 2026

MIRA

Visual chain-of-thought: models must draw intermediate images to reason.

546 visual-CoT tasks

Agent benchmark · 2026

ClawArena

Benchmarking AI agents in evolving information environments.

12 domains · 337 rounds

Agent benchmark · new · 2026

ClawForge

Executable interactive benchmarks for command-line agents.

best frontier model 45%

Agent safety · 2026

CIK-Bench

Safety benchmark — 88 attacks probing capability, identity, and knowledge poisoning of personal agents.

70 GitHub stars

VLAA-GUI

Modular GUI-automation agent that knows when to stop, recover, and search.

34 GitHub stars

Image-edit benchmark · TMLR 2026

Complex-Edit

Complexity-controllable image-editing benchmark with a Chain-of-Edit evaluation pipeline.

29 GitHub stars

Physics reasoning · NeurIPS 2025

PHYBench

500 original physics problems, high-school to Olympiad — best model 37% vs humans 62%.

500 expert-curated problems

Video reasoning · EMNLP 2025 · Oral

GLIMPSE

Tests whether video LVLMs truly reason over time — 3,269 videos, 4,342 human-crafted questions.

3,269 videos · 4,342 Qs

Datasets & training data

Image editing · 2025

GPT-Image-Edit-1.5M

1.5M+ GPT-4o-refined image-edit triplets for training instruction-based editors.

33.7k downloads / mo

Image-text · ICML 2025

Recap-DataComp-1B

1.3B web images recaptioned with LLaMA-3 to train CLIP and diffusion models.

19.5k downloads / mo

Medical · ICLR 2025

MedTrinity-25M

25M+ medical images across ten modalities with multigranular annotations.

Medical reasoning · 2025

MedReason

32,682 medical QA pairs with knowledge-graph reasoning paths for training clinical reasoners.

278 GitHub stars

Reasoning traces · TMLR 2025

VLAA-Thinking

~150K image-question-answer reasoning traces for training R1-style reasoning VLMs.

~150K traces (SFT + RL)

Reward data · EMNLP 2025

ViLReward-73K

73K vision-language process-reward samples for training VL reward models.

73K reward samples

Safety alignment · AAAI 2026

STAR-1

Safer alignment of reasoning LLMs (e.g. DeepSeek-R1) from just 1K curated examples.

1K-example safety set

RLVR curriculum · ICML 2026

VLM-CapCurriculum

Capability-staged RLVR curriculum — perception, visual-reasoning, and text-reasoning data for VLM post-training.

~33K curriculum samples

Medical reasoning · 2026

MedVerse14k

13.9K medical problems with DAG-structured, knowledge-grounded reasoning traces for training reasoners.

~13.9K reasoning traces

Video preference · NeurIPS 2025 · Spotlight

MJ-Bench-Video

Large-scale video-preference data scoring generation across five aspects and 28 fine-grained criteria.

5 aspects · 28 criteria

Selected releases · HF downloads (trailing 30 days) and GitHub stars verified June 2026.

Built on Harbor

Massively parallel cloud sandboxes, native RL / SFT rollout export, two-stage anti-cheat verifier, and trajectories archived uniformly to Hugging Face — born out of the Terminal-Bench ecosystem.

Tell us your training stage and your domain.

We build the expert-verified data your pipeline is missing — from pretraining corpora to RL trajectories, in any field and any modality.

Talk to us: hello@ancest.ai