From 10% to 90% on Adyen's DABstep: configuring a data warehouse agent
DABstep is Adyen's open benchmark for AI data agents: hundreds of multi-step reasoning questions built on real payment data. Google's Data Science Agent, the strongest public AI agent on the leaderboard, reaches around 52%. Trained human analysts at Adyen put in three or more hours of work and land near 62% on the easier tasks. A configured Dot answered its assigned slice at 90% accuracy, in seconds per question: same underlying model, sharper configuration around it.
The benchmark
DABstep was published by Adyen and Hugging Face as a deliberately uncomfortable test for AI data agents. Hundreds of questions, the majority on the hard split, all built on real payment data — the kind of brittle joins and undocumented conventions that production warehouses are full of.
The questions look simple but require composition:
- "What would the fee have been for this merchant if they'd been on a different plan in March?"
- "Which merchants would have crossed a volume tier if their activity had been 10% higher?"
- "How does the issuing country change the fee for a Visa transaction at this merchant?"
The answers can't be looked up in a dashboard. They have to be reasoned out, often across three or four tables (payments, customers, pricing rules, card networks) under business rules the warehouse never wrote down.
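To make the composition concrete, here is a minimal pandas sketch of the first question's shape. The file names, column names, plan names, and fee convention are assumptions for illustration, not DABstep's actual schema.

```python
import pandas as pd

# Hypothetical sketch only: file, column, and plan names are illustrative.
payments = pd.read_csv("payments.csv", parse_dates=["created_at"])
fee_rules = pd.read_csv("fee_rules.csv")  # one row per (plan, card_scheme)

def counterfactual_fees(merchant: str, other_plan: str, month: str) -> float:
    """Recompute one month of fees as if the merchant were on another plan."""
    txns = payments[
        (payments["merchant"] == merchant)
        & (payments["created_at"].dt.strftime("%Y-%m") == month)
    ]
    # Join each transaction to the rule that would apply under the other plan.
    rules = fee_rules[fee_rules["plan"] == other_plan]
    joined = txns.merge(rules, on="card_scheme", how="left")
    # Assumed fee convention: a fixed amount plus a rate quoted in basis points.
    fees = joined["fixed_amount"] + joined["rate_bps"] * joined["amount"] / 10_000
    return float(fees.sum())

print(counterfactual_fees("merchant_123", "platinum", "2024-03"))
```

Even this toy version composes a filter, a counterfactual join, and an unwritten fee convention; the real questions layer three or four of these on top of each other.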
Easy questions hover near 80% across most agents. The hard ones, where reasoning has to compose across multiple steps, sit much lower. Google's Data Science Agent, the top-ranked public agent on the benchmark, reaches around 52% weighted overall. Trained human analysts at Adyen, given as much time as they need, land near 62% on the easier tasks after three or more hours of work.
The result
90% accuracy, reproduced on the public Hugging Face leaderboard. The model in the loop is the same one that runs in production. The lift came from the configuration around it.
Sources: DABstep paper · Google Data Science Agent (Nov 2025).
We reached 90% on the assigned slice with focused configuration. The point wasn't to chase 100% — we likely could have, given more time. The point was to show how fast a configurable agent reaches the kind of accuracy enterprise teams need on their own data, with the kind of effort they'd spend onboarding any other tool.
What follows is how that configuration took shape — what changed, what regressed, what stuck.
Reading the warehouse like an analyst would
DABstep is built like any business database that's been around long enough to matter. The schema — the part you can read — tells you almost nothing about what the data really means. The rest is unwritten: small but consequential rules about how a fee is recorded, what a blank cell signals, how a list is stored, which joins behave the way they look. Senior analysts pick this up over time. The schema doesn't.
That's the layer where every analytics agent stalls. Hand the schema to a generic model and you'll get confident answers that look right and are off by orders of magnitude. The first stretch of the climb was getting Dot to read DABstep the way someone who'd worked at Adyen would. Once the conventions surface as context the agent uses on every question, a whole class of confidently-wrong answers disappears together.
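What surfacing those conventions looks like in practice, as a minimal sketch: the three rules below are the kind of unwritten conventions the paragraph describes, written down as reusable helpers. They are illustrative, not the benchmark's actual rules.

```python
import ast

def parse_list_cell(raw):
    """Convention 1 (assumed): lists are stored as strings, e.g. "['visa', 'amex']"."""
    return ast.literal_eval(raw) if isinstance(raw, str) else raw

def rule_applies(rule_value, txn_value) -> bool:
    """Convention 2 (assumed): a blank cell on a pricing rule means 'applies to all'."""
    if rule_value is None or rule_value == []:
        return True
    return txn_value in rule_value

def fee(rule: dict, amount: float) -> float:
    """Convention 3 (assumed): rates are quoted in basis points, not percent."""
    return rule["fixed_amount"] + rule["rate_bps"] * amount / 10_000
```

Once helpers like these carry the conventions, every query inherits them instead of rediscovering them, and the confidently-wrong class of answers disappears together.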
That work got us to about 30%. The questions beyond that point need more than one query: counterfactuals, multi-tier comparisons, joins that don't compose into a single SQL statement. And the shapes repeat: the same kinds of analysis surface across the dataset in slightly different language.
A generic model redoes that reasoning from scratch on every request. Answers come back slow, expensive, and stochastic. Two queries with the same intent can disagree, depending on the model's mood on that pass.
The next stretch closed that gap. Once Dot worked out a piece of analysis correctly, every future variation of it resolved the same way — in seconds, with the same logic, at a fraction of the cost. The agent gets sharper at the data the more it works against it. That compounding is what makes the system feel like infrastructure.
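A minimal sketch of that compounding, assuming a simple registry rather than Dot's actual mechanism: once an analysis is worked out, it is frozen as a parameterized skill, and later variations resolve through the same deterministic code.

```python
from typing import Callable

# Illustrative registry, not Dot's API: worked-out analyses are stored
# under stable names so repeat questions never start from scratch.
SKILLS: dict[str, Callable] = {}

def skill(name: str):
    """Register a worked-out analysis under a stable name."""
    def register(fn: Callable) -> Callable:
        SKILLS[name] = fn
        return fn
    return register

@skill("monthly_fee")
def monthly_fee(rules: list[dict], txns: list[dict]) -> float:
    """Same inputs, same answer, every time; no model call in the loop."""
    total = 0.0
    for t in txns:
        rule = next(r for r in rules if r["card_scheme"] == t["card_scheme"])
        total += rule["fixed_amount"] + rule["rate_bps"] * t["amount"] / 10_000
    return total

# A new question with the same shape routes through the stored skill:
# SKILLS["monthly_fee"](rules, march_txns)
```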
The remaining climb to 90% on the assigned slice was a different kind of work — driven not by us, but by the agent itself.
Dot updates its own configuration
To improve a system across many iterations, you need to remember what you've already tried. The configuration carries forward not just what works, but what failed and why — so each round builds on the last instead of relitigating the same dead ends.
An AI agent drove that iteration. It read what failed, proposed the next change, ran it, read the result, and proposed again — without anyone in the middle writing the next prompt.
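A minimal sketch of that loop, with stubs standing in for a real benchmark run and a real agent proposal; the function names and config format here are assumptions, not Dot's actual harness.

```python
import random

def run_benchmark(config: dict) -> tuple[float, list[str]]:
    """Stub: run the eval and return (score, failing task ids)."""
    score = random.random()
    return score, [] if score > 0.9 else ["task_17", "task_23"]

def propose_change(config: dict, history: list[dict]) -> dict:
    """Stub: in the real loop, the agent reads the failures and the full
    history and proposes the next configuration change itself."""
    return {**config, "version": config.get("version", 0) + 1}

def improve(config: dict, rounds: int = 10) -> dict:
    """Run, read the result, propose again, with the whole record carried
    forward so no dead end is relitigated."""
    history: list[dict] = []
    for _ in range(rounds):
        score, failures = run_benchmark(config)
        history.append({"config": config, "score": score, "failures": failures})
        config = propose_change(config, history)
    return config
```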
The biggest jumps came from the agent identifying a category of failures and fixing the right thing once. The version that broke past the 30% ceiling fixed a class of subtle correctness issues in how skills handled their inputs. The version that landed the slice at 90% came from reconciling several open hypotheses and committing the one that held up.
What this means outside the benchmark
DABstep is hard because real warehouses are hard. The data is messy, the rules are undocumented, and the questions don't compose into a single SQL statement. An AI data analyst that holds 90% accuracy on a benchmark designed to break agents — answering in seconds what takes trained analysts hours — is the kind of system enterprise teams should look at for their own data. Not because every warehouse looks like Adyen's, but because every warehouse has its own version of basis points and monthly tiers.
The assigned slice is 30 representative tasks drawn from the public DABstep benchmark from Adyen × Hugging Face — a mix of easy and hard questions. Each task ran through the same Dot endpoint that powers production, scored with DABstep's own scorer. Results were reproduced on the public Hugging Face leaderboard.
"Google Data Science Agent" refers to the top-ranked AI agent on DABstep at the time of writing — Google's DS-Star, around 52% weighted overall (Nov 2025). The Adyen human-analyst baseline of ~62% on easy tasks comes from the original DABstep paper.
Every version's score, configuration, and scoring output was tracked in a versioned harness so the curve above is reproducible.
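For illustration, a harness record of that shape could look like the following sketch; the field names are assumptions, not the harness's actual format.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_run(config: dict, score: float, scorer_output: dict,
            path: str = "runs.jsonl") -> None:
    """Append one immutable record per version, so any point on the curve
    can be re-run from its exact configuration."""
    record = {
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:12],
        "config": config,
        "score": score,
        "scorer_output": scorer_output,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```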
Alex Tatarinov
Alex is a Forward Deployed AI/ML Engineer at Dot — embedded with enterprise customers, architecting the systems that get Dot working accurately and autonomously across their data warehouses and BI tools.
