Benchmark Open Models on Your Own Tool Schemas

Leaderboard scores for tool calling are measured on clean, synthetic schemas — not the nested, polymorphic mess your production MCP server actually exposes. Before you commit to an open model for your agent pipeline, run your own eval harness against your own tool specs. The numbers move, sometimes by a lot.

This is a 60-minute job, not a research project. Three metrics, ~70 lines of Python, a handful of test cases. By the end you'll have a CSV that settles which model goes into your agent loop, and a harness that actually measures all three metrics.

Why public benchmarks lie about your schema

BFCL v4 (April 2026 release, the one that moved toward holistic agentic evaluation) and MCP-Bench use well-formed synthetic schemas. They're cleaner and more regular than the real production specs your team has accumulated: the ones with optional nested objects, polymorphic argument unions, and that one required field nobody documented. Scores on public benchmarks therefore tend to overstate performance on your schemas.

There's also a hidden cost the chat-benchmark world doesn't surface: every tool-call request prefills 400–800 tokens of schema before the first completion token. A 10-tool agent eats that on every turn. If you sized your infrastructure off chat-benchmark latency numbers, you'll be surprised in production.

WildToolBench (arXiv 2604.06185) is worth a read — they evaluated 57 models against real-world user behaviour and introduced multi-step metrics like Optimal Path Rate and Accomplish Progress Rate. Single-call accuracy can't proxy these. Their headline result is the sobering part: no model in the suite cleared 15% session accuracy, against near-saturation on the older single-call benchmarks. That gap is the gap between leaderboard and production.

The three metrics that actually matter

Most teams measure one thing — "did it call the right tool" — and call it accuracy. That's not enough. You need three numbers, separately:

Tool-selection accuracy: did the model pick the right function name?
Argument hallucination rate: given the right tool, did it fill the parameters with real values from the prompt, or invent a plausible-looking wrong one? Measure this narrowly: a hallucination is a confidently incorrect value, not a missing field (that's an omission) and not a badly-formatted one (that's a format error). Lumping all three into one "invalid args" bucket inflates the single number you actually care about.
Retry cost: when it failed, did the agent recover on the next turn, or did the loop stall and need a full re-plan?

Argument hallucination is the one almost nobody measures separately, and it's usually the one that bites in production. A model picks create_calendar_event, fills start_iso with a confident-looking but wrong timestamp, the downstream API accepts it, and the bug shows up two days later when the customer asks why their meeting is on the wrong Tuesday.

Encode this in a dataclass so the distinctions are machine-readable, not prose — three failure buckets kept apart, plus the retry fields so the loop cost is a first-class value rather than an afterthought:

The scoring rubric, as code

from dataclasses import dataclass, field


@dataclass
class EvalResult:
    case_id: str
    tool_selected_correct: bool
    args_valid: bool
    hallucinated_fields: list[str] = field(default_factory=list)  # confidently wrong values
    missing_fields: list[str] = field(default_factory=list)       # omissions
    malformed_fields: list[str] = field(default_factory=list)     # bad format / unparseable
    retry_count: int = 0
    recovered: bool = False
    notes: str = ""

    @property
    def passed(self) -> bool:
        return self.tool_selected_correct and self.args_valid

    @property
    def hallucinated(self) -> bool:
        # Only wrong values count. Omissions and format errors are different failures.
        return bool(self.hallucinated_fields)

A note on sub-7B models: their failure isn't "a bit worse." It's structural. The DEV post on Llama 3.2 3B is the cleanest illustration — across nine tasks the model made zero tool calls. It reasoned about what it didn't know, then confabulated an answer rather than reaching for the tool sitting right in front of it; its only success was a Fibonacci number it could compute directly. That's the dangerous failure mode — not a malformed call you can catch in validation, but a confident wrong answer that looks like success. A separate r/LocalLLaMA judgment benchmark makes the mirror-image point: BitNet 2B-4T emits flawless tool-call JSON and is the only small model that handles multi-tool requests, yet its judgment about when to call collapses on hard prompts. Clean single-call JSON is a capability island, not multi-step reliability — and "produces valid JSON" tells you nothing about either of the two failure modes above.

And prefer deterministic argument checks where you can: exact match on enums, range checks on numerics, regex on structured strings (ISO timestamps, UUIDs, emails). Save LLM-as-judge for fields where semantic equivalence genuinely matters — paraphrased summaries, free-form titles. Otherwise you're stacking model error on top of model error.
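Here is a minimal sketch of those three deterministic checks. The per-field spec format (the kind, choices, min, max, and pattern keys) is a hypothetical convention for illustration, not part of any library, and the email pattern is deliberately loose:

```python
import re

# Illustrative patterns; tighten them to match what your schema really accepts.
UUID_RE = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
EMAIL_RE = r"[^@\s]+@[^@\s]+\.[^@\s]+"  # loose shape check, not full RFC validation


def check_field(value, spec: dict) -> bool:
    """Return True if value passes the deterministic check described by spec."""
    kind = spec["kind"]
    if kind == "enum":
        # Exact match against the allowed values (case-sensitive on purpose).
        return value in spec["choices"]
    if kind == "range":
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        return spec["min"] <= value <= spec["max"]
    if kind == "regex":
        return isinstance(value, str) and re.fullmatch(spec["pattern"], value) is not None
    raise ValueError(f"unknown check kind: {kind}")
```

Anything that cannot be expressed as one of these three kinds is a candidate for the LLM-as-judge path; everything else stays deterministic and cheap.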

The eval harness

Here's the harness. Drop in your own schemas and test cases. It runs against any OpenAI-compatible endpoint (Ollama, vLLM, the llama.cpp server, or a hosted API). Unlike the usual single-shot eval, it closes the loop: when a call fails validation it feeds the error back as a tool message and lets the model correct itself. That's what turns retry cost from a footnote into a column you can read.

Start with validation. One pass over the expected fields, each failure dropped into exactly one bucket:

validate_args — one failure bucket per field

import json
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

ISO_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}")


def validate_args(expected: dict, got: dict) -> dict[str, list[str]]:
    """Classify each expected field into at most one failure bucket."""
    buckets: dict[str, list[str]] = {"missing": [], "malformed": [], "wrong": []}
    for key, want in expected.items():
        if key not in got:
            buckets["missing"].append(key)
            continue
        value = got[key]
        if key == "start_iso" and not ISO_RE.match(str(value)):
            buckets["malformed"].append(key)
        elif value != want:
            # Nested objects (e.g. recurrence_rule) are compared by value here too —
            # dict equality is order-independent, so this validates the whole structure.
            buckets["wrong"].append(key)
    return buckets

Two things to note. Nothing special-cases the nested recurrence_rule, so its value is actually checked, not just its presence. And start_iso only lands in malformed on a real format failure; a well-formed-but-wrong timestamp falls through to wrong, where a hallucinated value belongs.

Next, the feedback the model sees when it gets something wrong. This is the piece that makes a retry meaningful — a bare "try again" teaches nothing:

Feeding failures back so the model can actually retry

def explain(buckets: dict[str, list[str]]) -> str:
    parts = []
    if buckets["missing"]:
        parts.append(f"missing required fields: {buckets['missing']}")
    if buckets["malformed"]:
        parts.append(f"badly formatted fields: {buckets['malformed']}")
    if buckets["wrong"]:
        parts.append(f"incorrect values for: {buckets['wrong']}")
    return "; ".join(parts) or "unknown validation error"


def replay(call, buckets: dict[str, list[str]]) -> list[dict]:
    """Echo the assistant's tool call, then return a tool error so it can correct itself."""
    return [
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": call.id,
                    "type": "function",
                    "function": {
                        "name": call.function.name,
                        "arguments": call.function.arguments,
                    },
                }
            ],
        },
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps({"error": explain(buckets)}),
        },
    ]

Now the loop itself. Each attempt records the buckets, and a pass on attempt > 0 is flagged as a recovery, so retry_count and recovered both hold measured values rather than defaults:

run_case — the retry loop that makes retry-cost measurable

def run_case(model: str, tools: list, case: dict, max_retries: int = 2) -> EvalResult:
    messages: list[dict] = [{"role": "user", "content": case["prompt"]}]
    result = EvalResult(case["id"], tool_selected_correct=False, args_valid=False)

    for attempt in range(max_retries + 1):
        result.retry_count = attempt
        msg = client.chat.completions.create(
            model=model, tools=tools, messages=messages
        ).choices[0].message

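        if not msg.tool_calls:
            result.notes = "no_call"
            # Keep the assistant's text in the transcript: some chat templates
            # reject two consecutive user messages.
            messages.append({"role": "assistant", "content": msg.content or ""})
            messages.append({"role": "user", "content": "You must call a tool to answer."})
            continue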

        call = msg.tool_calls[0]  # single-tool cases only — see "Two edges to file down"
        result.tool_selected_correct = call.function.name == case["expected_tool"]

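        try:
            args = json.loads(call.function.arguments or "{}")
            # Valid JSON that is not an object (null, a list, a bare string) is unusable too.
            if not isinstance(args, dict):
                raise TypeError("arguments must decode to a JSON object")
            buckets = validate_args(case["expected_args"], args)
        except (json.JSONDecodeError, TypeError):
            buckets = {"missing": [], "malformed": ["__unparseable__"], "wrong": []}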

        result.missing_fields = buckets["missing"]
        result.malformed_fields = buckets["malformed"]
        result.hallucinated_fields = buckets["wrong"]
        result.args_valid = not any(buckets.values())

        if result.passed:
            result.recovered = attempt > 0
            return result

        messages += replay(call, buckets)  # feed the error back and let it try again

    return result

And a test case for create_calendar_event — the kind of schema BFCL doesn't have, with optional fields, typed args, and a nested recurrence rule:

One real-ish test case (date and offset verified: 2026-05-26 is a Tuesday, +02:00 is correct CEST)

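case = {
    "id": "cal_001",
    # The prompt states today's date; without it, "next Tuesday" has no single right answer.
    "prompt": "Today is Wednesday, 2026-05-20. Book a 30-min sync with anya@acme.com next Tuesday at 10am Berlin time, weekly.",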
    "expected_tool": "create_calendar_event",
    "expected_args": {
        "title": "Sync",
        "start_iso": "2026-05-26T10:00:00+02:00",
        "attendees": ["anya@acme.com"],
        "recurrence_rule": {"freq": "WEEKLY", "interval": 1},
    },
}
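The harness takes the schema as a tools list in the OpenAI-compatible function-calling format. The schema for create_calendar_event is not shown here, so the one below is a hypothetical reconstruction from expected_args: the descriptions, the required list, and the freq enum values are all assumptions to swap for your real spec.

```python
# Hypothetical schema for create_calendar_event, inferred from the test case.
# Only the outer "type"/"function" envelope follows the OpenAI-compatible
# tools format; every field detail is an assumption.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "create_calendar_event",
            "description": "Create a calendar event, optionally recurring.",
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "start_iso": {
                        "type": "string",
                        "description": "ISO 8601 start time with UTC offset",
                    },
                    "attendees": {"type": "array", "items": {"type": "string"}},
                    # Optional nested object: the kind of field clean benchmarks rarely include.
                    "recurrence_rule": {
                        "type": "object",
                        "properties": {
                            "freq": {"type": "string", "enum": ["DAILY", "WEEKLY", "MONTHLY"]},
                            "interval": {"type": "integer", "minimum": 1},
                        },
                        "required": ["freq"],
                    },
                },
                "required": ["title", "start_iso"],
            },
        },
    }
]
```

With that in place, a single run looks like run_case(model, TOOLS, case), where model is whatever name your endpoint serves.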

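Run 20–50 cases per model and aggregate. Each of the three metrics gets its own value, plus a recovery rate that tells you whether a retry budget actually buys anything:

Aggregating the run

# results: one EvalResult per case, e.g. [run_case(model, tools, c) for c in cases]
n = len(results)
selection_accuracy = sum(r.tool_selected_correct for r in results) / n
accuracy = sum(r.passed for r in results) / n  # right tool and valid args
# Hallucination rate is conditional on the right tool (final attempt only).
right_tool = [r for r in results if r.tool_selected_correct]
halluc_rate = sum(r.hallucinated for r in right_tool) / max(1, len(right_tool))
avg_retries = sum(r.retry_count for r in results) / n

needed_retry = [r for r in results if r.retry_count > 0]
recovery_rate = sum(r.recovered for r in needed_retry) / max(1, len(needed_retry))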
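To get from those aggregates to the CSV promised at the top, one row per model is enough. This is a sketch using only the standard library; the model names and numbers are placeholders, and the column set is whatever keys you put in each row.

```python
import csv
import io


def write_summary(rows: list[dict], out) -> None:
    """Write one CSV row per model; columns come from the first row's keys."""
    if not rows:
        return
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for row in rows:
        # Fix floats at three decimals so columns line up across models.
        writer.writerow(
            {k: f"{v:.3f}" if isinstance(v, float) else v for k, v in row.items()}
        )


# "model-a" and "model-b" are placeholder names; the numbers are made up.
buf = io.StringIO()
write_summary(
    [
        {"model": "model-a", "accuracy": 0.88, "avg_retries": 0.2},
        {"model": "model-b", "accuracy": 0.92, "avg_retries": 0.6},
    ],
    buf,
)
print(buf.getvalue())
```

To write a file instead of a buffer, open it with newline="" (as the csv module expects) and pass the file handle as out.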

Serving backend gotcha: Ollama vs. vLLM

Ollama applies the model's chat template including tool-call tokens automatically. vLLM with --enable-auto-tool-choice behaves differently depending on the --tool-call-parser flag (hermes, llama3_json, mistral, etc.). A harness that parses tool calls correctly against Ollama may silently mis-parse vLLM responses. Test against the backend you'll deploy with, not the one that was easiest to spin up.
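One way to catch a silent mis-parse is to check the no-call responses for text that looks like a tool call. The sketch below assumes a parser mismatch shows up as the call left in message.content with tool_calls empty; that is a plausible symptom, not documented behaviour for every parser, and the "name" plus "arguments"/"parameters" shape is a heuristic.

```python
import json
import re
from typing import Optional


def looks_like_leaked_tool_call(content: Optional[str]) -> bool:
    """Heuristic: does plain-text content embed a JSON object shaped like a tool call?"""
    if not content:
        return False
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", content):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever text follows it.
            obj, _ = decoder.raw_decode(content, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj and (
            "arguments" in obj or "parameters" in obj
        ):
            return True
    return False
```

In run_case, a natural place for it is the no_call branch: if it returns True for msg.content, record something like "leaked_call" in notes instead of "no_call", so a backend problem is not scored as a model failure.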

Two edges to file down

The harness inspects msg.tool_calls[0] — fine for single-tool cases like this one. For parallel or multi-tool turns you'll want to iterate over every call in msg.tool_calls; the first-call shortcut silently ignores the rest, and if your schema invites parallel calls that omission will quietly skew your numbers.
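A sketch of the multi-call version, under the assumption that each case lists its expected calls as dicts with an expected_tool key and that order does not matter. It takes (name, arguments) pairs, which you would build with [(c.function.name, c.function.arguments) for c in msg.tool_calls]; the function name match_calls is illustrative.

```python
def match_calls(expected: list[dict], emitted: list[tuple[str, str]]) -> dict:
    """Pair each expected call with the first unused emitted call of the same name."""
    unused = list(emitted)
    matched, missing = [], []
    for want in expected:
        hit = next((c for c in unused if c[0] == want["expected_tool"]), None)
        if hit is None:
            missing.append(want["expected_tool"])
            continue
        unused.remove(hit)  # each emitted call can satisfy only one expectation
        matched.append((want, hit))
    return {
        "matched": matched,                       # pairs to pass on to argument validation
        "missing": missing,                       # expected tools the model never called
        "extra": [name for name, _ in unused],    # calls nobody asked for
    }
```

Missing and extra calls are selection errors; only the matched pairs go on to validate_args.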

validate_args does an exact comparison on start_iso after the format gate, which is stricter than the regex implies: two timestamps that denote the same instant in different offsets (...+02:00 vs the same moment as ...Z) compare unequal. If your schema accepts either, normalise to a single representation, or compare parsed instants, before you score it, or you'll log real successes as hallucinations.
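Comparing parsed instants takes a few lines with the standard library. This sketch assumes both timestamps carry an explicit offset and treats a naive timestamp as an error rather than guessing a zone:

```python
from datetime import datetime


def same_instant(a: str, b: str) -> bool:
    """True if two ISO 8601 timestamps with explicit offsets denote the same moment."""

    def parse(s: str) -> datetime:
        # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11,
        # so rewrite it as an explicit offset for older versions.
        if s.endswith("Z"):
            s = s[:-1] + "+00:00"
        dt = datetime.fromisoformat(s)
        if dt.tzinfo is None:
            raise ValueError(f"timestamp has no UTC offset: {s}")
        return dt

    # Timezone-aware datetimes compare by instant, not by wall-clock fields.
    return parse(a) == parse(b)
```

Run it after the format gate, since a malformed string raises ValueError here instead of landing in the malformed bucket.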

Schema token budget: check before running 500 cases
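
The 400–800 tokens of schema prefill mentioned earlier are paid on every request, so it is worth estimating the figure for your own tools list before a long run. The sketch below is a rough estimate only: the 4-characters-per-token ratio is a rule-of-thumb assumption, and the real count depends on the model's tokenizer and on how the serving backend renders tools into the prompt.

```python
import json
import math


def estimate_schema_tokens(tools: list[dict], chars_per_token: float = 4.0) -> int:
    """Rough prefill estimate: compact JSON length over an assumed chars-per-token ratio."""
    serialized = json.dumps(tools, separators=(",", ":"))
    return math.ceil(len(serialized) / chars_per_token)


def estimate_run_prefill(tools: list[dict], cases: int, max_retries: int = 2) -> int:
    """Worst-case schema tokens for a run: every case uses all of its attempts."""
    return estimate_schema_tokens(tools) * cases * (max_retries + 1)
```

If the worst-case figure for 500 cases looks uncomfortable, trim descriptions or split the tool set before you start, and confirm the estimate against the prompt token count your backend reports for one real request.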

Reading the results

Once you have numbers for 3–4 models, the retry-cost column usually settles the debate more cleanly than raw accuracy. A model with 88% first-pass accuracy that recovers on the first retry is cheaper to run than a 92% model that burns its retry budget and still stalls. Read avg_retries next to recovery_rate: high recovery with low average retries is the profile you want, because it means failures are correctable rather than terminal.

Look at the argument-hallucination column second. It counts only confidently wrong values, not omissions or format errors, so the number means what it says. Anything above ~5% on required fields means you'll be writing defensive validation in the calling code anyway, which costs latency and complexity. Selection accuracy is the floor: if a model picks the wrong tool more than 10% of the time on your schemas, the rest of the metrics don't matter.

Run the harness against your real tool schemas before you write a single line of agent code. On Monday morning, copy the snippets above, plug in your three top-of-leaderboard candidates, and let the retry-cost column tell you which model to commit to.