
Improving Synthetic Data Generation Bounds via Constrained Decoding

Auxerta Research · February 2025 · 10 min read

Large language models can generate synthetic training data at scale, but "plausible" is not the same as "correct." In high-stakes domains — healthcare, legal reasoning, engineering — even small amounts of noise in training data compound into downstream failures. This post examines how constrained decoding narrows the error distribution at generation time, and why that matters more than post-hoc filtering.

The Fidelity Gap in Synthetic Data

The promise of synthetic data is straightforward: generate training samples programmatically instead of collecting and labeling them manually. Industry forecasts suggest synthetic data will soon make up a significant share of all AI training data, driven by privacy regulations, data scarcity in specialized fields, and the sheer cost of human annotation.

But there's a well-documented problem. Synthetic data generated by unconstrained LLMs suffers from what researchers call the fidelity gap — a disconnect between the statistical patterns the model has learned and the domain-specific invariants that real data must satisfy. The model produces text that looks right but violates constraints that matter:

  • A synthetic clinical note prescribes a drug at a dosage that falls within the model's learned distribution but exceeds the maximum approved dose for the indicated condition
  • A generated legal brief cites a case in perfect Bluebook format — but the case doesn't exist
  • An engineering specification lists material tolerances that are internally consistent but physically impossible for the specified alloy

These aren't hallucinations in the colloquial sense. They're plausible errors — outputs that pass surface-level inspection but fail domain validation. Models trained on this data don't just underperform; they develop confidently wrong behaviors that are difficult to diagnose after the fact.

Research published in 2024 has also raised concerns about model collapse — the progressive degradation of model quality when training recursively on synthetic data. Each generation of synthetic data loses rare but important edge cases, and the distribution narrows over time. Constrained decoding addresses this directly by preserving the structural diversity that unconstrained generation tends to erode.

Constrained Decoding: Restricting the Output Space

Constrained decoding is a family of techniques that restrict an LLM's generation process so that outputs must conform to predefined rules at each decoding step. Instead of sampling freely from the model's full probability distribution over the vocabulary, you mask out tokens that would produce structurally or semantically invalid outputs.

This is distinct from post-hoc filtering (generate-then-validate), which is the naive approach most pipelines still use. Post-hoc filtering is wasteful: you generate a large number of samples, run validation checks, and discard the ones that fail. Depending on the domain and constraint complexity, rejection rates can reach 40-60% of generated samples. Constrained decoding eliminates most of that waste by preventing invalid outputs from being generated in the first place.
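The arithmetic behind that waste is easy to sketch. A minimal back-of-envelope comparison in Python, with all throughput and rejection numbers invented for illustration:

```python
# Illustrative comparison (hypothetical numbers): post-hoc filtering
# discards invalid samples after generation, so effective throughput
# scales with the acceptance rate; constrained decoding pays a small
# per-sample cost but keeps nearly every sample.

def effective_throughput(samples_per_sec: float, acceptance_rate: float) -> float:
    """Usable samples per second after validation."""
    return samples_per_sec * acceptance_rate

# Post-hoc filtering: fast generation, but ~50% rejected (per the text).
posthoc = effective_throughput(100.0, 0.5)      # 50 usable samples/sec

# Constrained decoding: assume ~20% slower generation, ~98% valid.
constrained = effective_throughput(80.0, 0.98)  # ~78 usable samples/sec

assert constrained > posthoc
```

Even with a generous per-sample slowdown, the constrained pipeline comes out ahead once rejection rates climb.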

The key mechanisms that have matured over the past year include:

Grammar-Constrained Decoding (GCD)

GCD enforces syntactic structure at the token level. At each decoding step, a grammar (typically expressed as a context-free grammar, regular expression, or JSON schema) determines which tokens are valid continuations. Invalid tokens are masked from the probability distribution before sampling.
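A minimal sketch of that masking step, with a toy string-keyed "vocabulary" standing in for real token IDs and a hard-coded allowed set standing in for a compiled grammar:

```python
# Toy sketch of grammar-constrained token masking (not a real tokenizer
# or grammar): tokens the grammar rejects get their logit set to -inf
# before the next token is chosen.

def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Zero out probability for tokens outside the grammar's allowed set."""
    return {tok: (lp if tok in allowed else float("-inf"))
            for tok, lp in logits.items()}

def greedy_pick(logits: dict[str, float]) -> str:
    """Pick the highest-logit token (greedy decoding)."""
    return max(logits, key=logits.get)

# Hypothetical logits for the next token at the start of a JSON value.
logits = {'"': 2.1, "{": 1.7, "maybe": 3.0, "[": 0.4}
# The grammar says a JSON value here may start with '"', '{', or '['.
allowed = {'"', "{", "["}

masked = mask_logits(logits, allowed)
print(greedy_pick(masked))  # '"' — the best *grammatical* token wins,
                            # even though "maybe" had the highest logit
```

Real implementations operate on logit tensors and compile the grammar into an automaton so the allowed set can be computed per step in near-constant time.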

This technique has seen significant adoption since OpenAI introduced structured outputs in their API in August 2024, achieving what they report as 100% schema compliance when strict mode is enabled. Open-source implementations like Outlines and Guidance provide similar capabilities for local models.

A key advancement in 2024 was Grammar-Aligned Decoding (GAD), presented at NeurIPS. Traditional GCD can distort the model's probability distribution — by masking tokens, you change the relative probabilities of the remaining tokens, which can degrade output quality. GAD introduces algorithms like ASAp (Adaptive Sampling with Approximate Expected Futures) that preserve the model's original distribution while still guaranteeing grammatical correctness. This is a meaningful improvement: you get structural guarantees without sacrificing the model's learned knowledge.
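The distortion GAD corrects is easy to see in a toy example. Assume a grammar that accepts exactly two strings, "ab" and "c", and invented model probabilities over four candidate strings; naive per-step masking shifts probability mass away from the globally conditioned distribution that GAD targets:

```python
# Toy illustration (invented probabilities) of the distortion GAD corrects.
# The grammar accepts exactly two strings: "ab" and "c". The model's
# learned probabilities over the strings it might emit:
P = {"ab": 0.1, "ac": 0.4, "c": 0.3, "cb": 0.2}

# What GAD targets: the model's distribution *conditioned* on validity.
valid = {"ab", "c"}
Z = sum(P[s] for s in valid)
ideal = {s: P[s] / Z for s in valid}           # ab: 0.25, c: 0.75

# What naive per-step masking yields: at step 1 both 'a' and 'c' can
# still reach a valid string, so neither is masked and each keeps its
# full marginal probability; the forced continuations then get
# probability 1 after masking.
p_first_a = P["ab"] + P["ac"]                  # 0.5
p_first_c = P["c"] + P["cb"]                   # 0.5
naive = {"ab": p_first_a * 1.0, "c": p_first_c * 1.0}  # ab: 0.5, c: 0.5

assert abs(ideal["ab"] - 0.25) < 1e-9
assert abs(naive["ab"] - 0.5) < 1e-9           # mass shifted toward "ab"
```

Naive masking doubles the probability of "ab" relative to the conditioned distribution, because the low-probability continuation 'b' becomes certain once everything else is masked.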

Knowledge-Grounded Constraints

Grammar constraints handle structure but not semantics. Knowledge-grounded constraints cross-reference generated content against verified knowledge bases during generation. Research presented at ESWC 2024 (the GECKO system) demonstrated how encoder-decoder models can be constrained to generate outputs grounded in knowledge graphs, significantly reducing factual errors in question-answering tasks.

For domain-specific data generation, this means connecting the decoding process to authoritative sources: drug databases for clinical data, case law indices for legal text, materials science databases for engineering specifications. The constraint isn't just "is this valid JSON?" — it's "does this drug-dosage pair exist in the FDA's approved labeling?"
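A hedged sketch of such a check: the lookup table below is a stand-in for a real drug database (its values are illustrative, not medical guidance), and the validator rejects any pair the knowledge base cannot verify:

```python
# Hypothetical knowledge base: maximum approved daily doses in mg.
# Illustrative values only — a real pipeline would query a licensed
# drug database, not a hard-coded dict.
MAX_DAILY_MG = {
    "acetaminophen": 4000,
    "ibuprofen": 3200,
}

def dosage_is_approved(drug: str, daily_mg: float) -> bool:
    """Accept only drug-dosage pairs the knowledge base can verify."""
    limit = MAX_DAILY_MG.get(drug.lower())
    return limit is not None and 0 < daily_mg <= limit

assert dosage_is_approved("Acetaminophen", 3000)
assert not dosage_is_approved("acetaminophen", 6000)  # exceeds max
assert not dosage_is_approved("unknownium", 10)       # not in KB -> reject
```

Note the default-deny behavior: an unverifiable pair is rejected, not waved through. That asymmetry is what makes the constraint a guarantee rather than a heuristic.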

Domain-Specific Validators

The third layer applies field-specific rules as hard constraints on the output space. These go beyond what a grammar or knowledge base can express: unit consistency checks, cross-field relationship validation, temporal coherence in sequential records.

Work on type-constrained code generation (accepted at PLDI 2025) demonstrates the principle: by enforcing type system rules during decoding, compilation errors in generated code drop substantially and functional correctness improves. The same principle applies to any domain with formal validation rules.
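The same idea in miniature for unit consistency, one of the validator types mentioned above; the field names and conversion table are hypothetical:

```python
# Hypothetical unit-consistency validator for engineering records:
# normalize both fields to millimeters, then check the cross-field
# invariant that a tolerance cannot exceed the nominal dimension.
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}

def tolerance_consistent(nominal: float, nominal_unit: str,
                         tolerance: float, tolerance_unit: str) -> bool:
    """Reject records whose tolerance is non-positive or larger than the part."""
    nom_mm = nominal * UNIT_TO_MM[nominal_unit]
    tol_mm = tolerance * UNIT_TO_MM[tolerance_unit]
    return 0 < tol_mm < nom_mm

assert tolerance_consistent(5.0, "cm", 0.1, "mm")
assert not tolerance_consistent(5.0, "mm", 1.0, "cm")  # 10 mm tolerance on a 5 mm part
```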

Why This Matters: Bounding the Error Distribution

The core insight is geometric. Unconstrained generation produces samples from a wide distribution in output space. Some samples land inside the valid region; others don't. Post-hoc filtering draws a boundary around the valid region and discards everything outside it. Constrained decoding reshapes the generation distribution itself so that it concentrates probability mass inside the valid region.

This has three practical consequences:

Higher effective throughput
You generate fewer invalid samples and discard less. The pipeline produces usable data faster, even though individual decoding steps may be marginally slower.
Predictable error rates
Instead of an error rate that varies unpredictably with prompts and sampling settings, you get a bounded error rate determined by your constraint specification. This is what production systems actually need.
Preserved distributional diversity
Unlike aggressive filtering that can collapse the output distribution, well-designed constraints maintain diversity within the valid region — mitigating the model collapse problem.

Implementation: A Cascading Architecture

In practice, constraints are most effective when applied at multiple levels in a cascading architecture. The key principle is that cheaper checks should run first:

  1. Token-level masks — applied at each decoding step. These enforce structural validity (schema compliance, syntax rules) and are computationally cheap because they operate on the logit vector before sampling. Libraries like Outlines compile grammars into finite automata for fast token masking. The DOMINO method (ICML 2024) demonstrated that grammar-constrained decoding can achieve zero or negative overhead compared to unconstrained decoding through careful subword alignment.
  2. Segment-level validators — applied after generating logical chunks (a sentence, a field, a record). These check domain-specific invariants: does this dosage fall within the approved range? Is this citation verifiable? Does this measurement use consistent units? Invalid segments trigger targeted regeneration of just that segment, not the entire output.
  3. Document-level consistency checks — applied to complete generated records. These verify cross-field relationships: does the diagnosis match the prescribed treatment? Are the dates in this legal filing internally consistent? This is the most expensive layer, but levels 1 and 2 prevent most errors from reaching it.
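One way to sketch the cascade, with trivial stand-in validators at each level (a real pipeline would wrap a constrained decoder and domain rule sets, and the field names and compatibility table here are invented):

```python
from typing import Callable

Record = dict[str, str]

def cascade(record: Record,
            levels: list[Callable[[Record], bool]]) -> tuple[bool, int]:
    """Run validators cheapest-first; return (ok, index of first failure,
    or the number of levels if all passed)."""
    for i, check in enumerate(levels):
        if not check(record):
            return False, i  # regenerate from this level, not from scratch
    return True, len(levels)

# Level 1: structural — required fields present.
structural = lambda r: {"diagnosis", "treatment"} <= r.keys()
# Level 2: segment-level — no empty fields.
segments = lambda r: all(v.strip() for v in r.values())
# Level 3: document-level — hypothetical diagnosis/treatment table.
compatible = {("influenza", "oseltamivir")}
cross_field = lambda r: (r["diagnosis"], r["treatment"]) in compatible

ok, level = cascade({"diagnosis": "influenza", "treatment": "oseltamivir"},
                    [structural, segments, cross_field])
assert ok and level == 3
```

Because the cascade short-circuits at the first failure, the expensive document-level check only ever runs on records that are already structurally and locally valid.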

The Constraint Specification Problem

The hardest part of constrained decoding isn't the decoding — it's specifying the constraints. This is fundamentally a knowledge engineering problem: someone has to define what "valid" means for each domain, and that definition has to be precise enough to compile into decoding rules.

For structured outputs (JSON, code, tabular data), this is relatively straightforward — schemas and type systems provide natural constraint specifications. For clinical text, legal documents, and engineering reports, it requires working directly with domain experts who can articulate the invariants that distinguish valid data from plausible-but-wrong data.
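For the structured-output case, the constraint specification can literally be a JSON schema. The toy schema and hand-rolled checker below illustrate the idea; in practice a library such as Outlines compiles the schema into token-level masks rather than validating after the fact:

```python
import json

# Toy constraint specification as a JSON schema fragment.
SCHEMA = {
    "type": "object",
    "required": ["drug", "daily_mg"],
    "properties": {
        "drug": {"type": "string"},
        "daily_mg": {"type": "number", "minimum": 0},
    },
}

def conforms(record: dict) -> bool:
    """Minimal hand-rolled check for the toy schema above (sketch only;
    a real validator would interpret the schema generically)."""
    if not all(k in record for k in SCHEMA["required"]):
        return False
    return (isinstance(record.get("drug"), str)
            and isinstance(record.get("daily_mg"), (int, float))
            and record["daily_mg"] >= 0)

assert conforms(json.loads('{"drug": "ibuprofen", "daily_mg": 1200}'))
assert not conforms({"drug": "ibuprofen"})  # missing required field
```

For clinical or legal text there is no such ready-made schema language, which is exactly why the specification step requires domain experts.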

This is where the research community has the most work left to do. The SynthTextEval framework (2025) proposes standardized evaluation metrics for synthetic text quality across fidelity, utility, and privacy dimensions, particularly for high-stakes domains. But evaluation is downstream of specification — you need to define the constraints before you can measure compliance.

Trade-offs Worth Acknowledging

Constrained decoding isn't a silver bullet. There are real trade-offs that any honest assessment should address:

  • Diversity vs. accuracy: Tighter constraints reduce the variety of generated outputs. Research in 2024 showed that overly strict format restrictions can degrade an LLM's reasoning performance — the model spends capacity on satisfying constraints rather than generating useful content. Multi-step pipelines that separate reasoning from formatting can mitigate this, but add complexity.
  • Constraint coverage: Formal constraints can only catch errors that are formally specifiable. Subtle issues like tone mismatches in clinical notes, or reasoning errors in legal arguments, often fall outside what grammar or knowledge-base constraints can express. Human review remains necessary for these cases.
  • Distribution distortion: As noted in the GAD work, naive token masking changes the model's output distribution in ways that can reduce quality. This is a solved problem in theory (GAD provides distribution-preserving algorithms) but still an active area of engineering in practice.
  • Domain portability: Constraints built for one domain don't transfer to another. A pipeline validated for clinical NER needs entirely different constraints for legal document synthesis. Each new domain requires fresh expert input.

Where We Go From Here

The trajectory is clear: as synthetic data becomes a larger fraction of training corpora, the quality bar rises proportionally. Generating more data isn't the bottleneck anymore — generating trustworthy data is.

Constrained decoding gives us a principled framework for defining what "trustworthy" means in formal terms and enforcing it during generation. The combination of grammar-guided decoding for structure, knowledge grounding for factual accuracy, and domain-specific validators for semantic correctness creates a layered defense that makes the error rate predictable and bounded — which is what production deployments actually require.

At Auxerta, this is central to how we build data pipelines for regulated industries. We don't generate data and hope it passes validation. We work with domain experts to define the constraints, encode them into our generation pipeline, and verify compliance at every level.

The hard problem was never generation. It was knowing what to constrain, and having the domain expertise to get it right.

References
  • Geng et al., "Grammar-Aligned Decoding" — NeurIPS 2024
  • Ugare et al., "Type-Constrained Code Generation with Language Models" — PLDI 2025
  • "DOMINO: Fast Grammar-Constrained Decoding" — ICML 2024
  • "GECKO: Constrained Knowledge-Based Expression Decoding" — ESWC 2024
  • Lehman et al., "Data-Constrained Synthesis of Training Data for De-Identification" — ACL 2025
  • "Constrained Deep Generative Models for Tabular Data" — arXiv 2024
  • "SynthTextEval: Evaluation Framework for Synthetic Text" — arXiv 2025

Questions about our approach? Reach out at service@auxerta.com