A single general-purpose model cannot be the best at everything. In domains where accuracy is non-negotiable — financial risk, clinical decision support, autonomous systems — the emerging architecture is not one large model but several specialized models that interlock: each an expert in its vertical, coordinated by a routing layer that knows where to send each query. This post examines how interlocking specialized models work, why they outperform monolithic alternatives in production, and what this means for training data infrastructure.
The Problem with One Model to Rule Them All
The default assumption in most LLM deployments is simple: take the largest available model, fine-tune it on your domain data, deploy it, and hope for the best. This works surprisingly well for general-purpose tasks — summarization, translation, open-ended chat. But it fails predictably in specialized verticals.
A model fine-tuned on both legal compliance and agricultural sensor data will be mediocre at both. The training signals interfere with each other — a phenomenon the research community calls negative transfer. Legal reasoning requires precise citation, cautious hedging, and strict adherence to jurisdictional rules. Agricultural modeling requires spatial reasoning, temporal pattern recognition, and tolerance for noisy sensor inputs. These are fundamentally different cognitive modes, and forcing them into the same parameter space creates compromises in both.
The empirical evidence is consistent. Internal benchmarks across our client deployments show that single-domain expert models outperform multi-domain generalists by 15–30% on domain-specific evaluation tasks, even when the generalist has 3–5× more parameters. The smaller model wins because every parameter is dedicated to one job.
Interlocking Architecture: Design Principles
An interlocking system is not simply an ensemble. Ensembles run every model on every query and average the results. Interlocking systems are sparse by design — for any given input, only the relevant expert is activated. The key components are:
1. The Router
The router is a lightweight classifier that examines incoming queries and determines which expert model should handle them. This is architecturally similar to the gating function in Mixture-of-Experts (MoE) models like Switch Transformer and Mixtral, but operates at the system level rather than the layer level.
In production, the router is typically a small transformer or even a logistic regression model trained on query embeddings. It needs to be fast (sub-millisecond inference), accurate (misrouting degrades the entire system), and calibrated (it should know when a query falls outside all expert domains and escalate accordingly). Recent work on learned routers achieves 97%+ routing accuracy on well-defined domain boundaries with less than 0.5ms latency overhead.
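As a minimal sketch, a calibrated router over query embeddings might look like the following. The domain weight vectors and the 0.6 cutoff are illustrative stand-ins for values that would be learned from labeled query embeddings; the point is the escalation path when no domain is a confident match.

```python
import math

# Hypothetical per-domain weight vectors; in practice these come from
# training a classifier on labeled query embeddings.
DOMAIN_WEIGHTS = {
    "risk":        [0.9, 0.1, 0.0],
    "agriculture": [0.0, 0.8, 0.2],
    "education":   [0.1, 0.1, 0.9],
}

CONFIDENCE_THRESHOLD = 0.6  # below this, escalate rather than route

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(embedding):
    """Return (domain, confidence); domain is None when the router is
    not confident enough and the query should be escalated."""
    domains = list(DOMAIN_WEIGHTS)
    logits = [sum(w * x for w, x in zip(DOMAIN_WEIGHTS[d], embedding))
              for d in domains]
    probs = softmax(logits)
    best = max(range(len(domains)), key=probs.__getitem__)
    if probs[best] < CONFIDENCE_THRESHOLD:
        return None, probs[best]  # calibrated fallback: escalate
    return domains[best], probs[best]
```

A real deployment would replace the linear scoring with a trained classifier and tune the threshold against measured misrouting cost, but the interface — query in, (domain, confidence) out, with an explicit escalation case — stays the same.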
2. Domain Expert Models
Each expert is a model fine-tuned exclusively on data from its target domain. The key insight: these don't all need to be the same architecture or even the same size. A risk assessment expert might be a 7B parameter model fine-tuned on regulatory filings and case law. A drone telemetry expert might be a 1.3B model trained on time-series sensor data. An education expert might be a 3B model optimized for pedagogical dialogue.
The heterogeneity is a feature, not a bug. Each model is right-sized for its domain's complexity. This has direct cost implications: a system of five 3B-parameter experts is cheaper to serve than one 70B generalist, and in our benchmarks it produced better results on every domain-specific task we measured.
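One way to make the heterogeneity concrete is an expert registry that records each domain's right-sized model. The domains, sizes, and checkpoint paths below are illustrative (mirroring the examples above), not real artifacts:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpertSpec:
    domain: str
    params_billions: float  # right-sized per domain, not uniform
    checkpoint: str         # illustrative path, not a real artifact

# Illustrative registry mirroring the examples in the text.
REGISTRY = {
    "risk":      ExpertSpec("risk", 7.0, "models/risk-7b"),
    "telemetry": ExpertSpec("telemetry", 1.3, "models/telemetry-1.3b"),
    "education": ExpertSpec("education", 3.0, "models/education-3b"),
}

def total_params(registry):
    """Aggregate parameter count across all experts, for comparing
    serving cost against a single large generalist."""
    return sum(spec.params_billions for spec in registry.values())
```

Here the three experts total 11.3B parameters, a fraction of a 70B generalist even before accounting for the fact that only one expert runs per query.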
3. The Merge Layer
Some queries span multiple domains. A financial fraud investigation might require both transaction pattern analysis (the fraud expert) and regulatory compliance checking (the risk expert). The merge layer handles these compound queries by orchestrating multiple experts and synthesizing their outputs.
This is the hardest component to get right. Naive concatenation of expert outputs produces incoherent responses. The current best practice is a hierarchical synthesis approach: each expert produces a structured output (findings, confidence score, supporting evidence), and a lightweight synthesis model combines them into a unified response. The synthesis model is itself specialized — trained on examples of multi-domain reasoning — but its job is coordination, not domain expertise.
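A sketch of the structured-output contract, with the synthesis model stubbed out. In a real system the `synthesize` step would be a small trained model; here it simply filters low-confidence findings and orders the rest, which is enough to show the shape of the data flowing between experts and the merge layer:

```python
from dataclasses import dataclass

@dataclass
class ExpertFinding:
    domain: str
    finding: str
    confidence: float  # the expert's own calibrated confidence
    evidence: list     # supporting items (citations, record IDs, ...)

def synthesize(findings, min_confidence=0.5):
    """Stand-in for the synthesis model: drop low-confidence findings,
    order the rest by confidence, and emit one unified response."""
    kept = sorted((f for f in findings if f.confidence >= min_confidence),
                  key=lambda f: f.confidence, reverse=True)
    return {
        "summary": "; ".join(f"[{f.domain}] {f.finding}" for f in kept),
        "evidence": [e for f in kept for e in f.evidence],
    }
```

The important design choice is that experts emit structured findings, not free text; the synthesis layer then coordinates without needing any domain expertise of its own.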
Why the Data Layer Is the Hard Part
Building the router and the serving infrastructure is well-understood systems engineering. The genuinely difficult problem is upstream: producing the domain-specific training data that makes each expert actually expert.
Each expert model requires a curated dataset that is:
- Domain-pure: Free of contamination from adjacent domains. A risk model trained on data that includes unrelated trading signals will develop spurious correlations. Data curation requires domain experts who can identify and remove subtle contamination.
- Structurally consistent: Annotations must follow a uniform schema within the domain. If some clinical records use ICD-10 codes and others use free-text diagnoses, the model learns the inconsistency rather than the domain.
- Edge-case rich: General-purpose datasets are dominated by common patterns. Domain experts need to identify the rare-but-critical cases — the unusual fraud pattern, the atypical drug interaction, the edge-case sensor failure — and deliberately over-represent them in the training set relative to their natural frequency.
- Continuously refreshed: Domains evolve. Regulations change. New drug approvals alter clinical guidelines. The training data pipeline must be a pipeline, not a one-time collection effort.
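The structural-consistency requirement in particular is checkable in the pipeline. As an illustrative sketch, an audit pass over clinical records might flag any diagnosis field that is not a code, so a curation step can normalize them before training (the regex is a simplified approximation of ICD-10 format, not the full standard):

```python
import re

# Simplified ICD-10 code shape: letter, two digits, optional subcode.
ICD10_PATTERN = re.compile(r"^[A-TV-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")

def audit_diagnosis_field(records):
    """Return indices of records whose diagnosis is not an ICD-10-shaped
    code (e.g. free text), so curation can normalize them pre-training."""
    inconsistent = []
    for i, rec in enumerate(records):
        if not ICD10_PATTERN.match(rec.get("diagnosis", "")):
            inconsistent.append(i)
    return inconsistent
```

The same pattern generalizes: every schema rule in the annotation spec should have a corresponding automated audit, so inconsistency is caught at ingestion rather than discovered as a model failure.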
Routing Strategies in Practice
Router design has converged on three main patterns, each with different trade-offs:
Embedding-Based Classification
The simplest approach. Encode the query using a general-purpose embedding model, then classify it against pre-defined domain clusters. This works well when domains are semantically distinct (medical vs. legal) but struggles with overlapping domains (financial compliance vs. financial trading). In production, we see 94–97% accuracy with sub-millisecond latency.
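In its simplest form this is nearest-centroid classification by cosine similarity. The centroids below are made-up placeholders; in practice each would be the mean embedding of a domain's labeled training queries:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# Illustrative domain centroids (mean embedding of each domain's
# labeled queries); the values here are made up.
CENTROIDS = {
    "medical": [0.9, 0.1, 0.1],
    "legal":   [0.1, 0.9, 0.1],
}

def classify(query_embedding):
    """Route to the domain whose centroid is most similar."""
    return max(CENTROIDS, key=lambda d: cosine(query_embedding, CENTROIDS[d]))
```

The weakness is visible in the geometry: overlapping domains produce nearby centroids, and queries between them classify almost arbitrarily, which is exactly where the cascading pattern below earns its keep.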
Cascading Routers
A two-stage approach for ambiguous queries. The first router makes a coarse domain classification. If confidence is below a threshold, a second, more expensive router examines the query in more detail — potentially using a small LLM to analyze intent. This reduces misrouting on edge cases by roughly 60% compared to single-stage classification, at the cost of ~3ms additional latency on uncertain queries.
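The cascade itself is a small piece of control flow. In this sketch the two classifiers are toy stand-ins (a keyword check for the cheap stage, a second keyword check for the expensive one); in production the first would be an embedding classifier and the second a small LLM:

```python
COARSE_THRESHOLD = 0.8  # illustrative confidence cutoff

def cascade_route(query, coarse_classifier, fine_classifier):
    """Two-stage routing: trust the cheap classifier when it is
    confident, otherwise pay for the expensive second stage."""
    domain, confidence = coarse_classifier(query)
    if confidence >= COARSE_THRESHOLD:
        return domain, "coarse"
    return fine_classifier(query), "fine"

# Toy classifiers standing in for an embedding model and a small LLM.
def toy_coarse(query):
    if "fraud" in query:
        return "fraud", 0.95
    return "risk", 0.55  # uncertain: triggers the second stage

def toy_fine(query):
    return "compliance" if "regulation" in query else "risk"
```

Because the second stage only fires below the confidence threshold, the extra latency is paid only on the ambiguous minority of queries.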
Multi-Expert Activation
For compound queries, the router activates multiple experts simultaneously. The merge layer then synthesizes their outputs. This is the most expensive pattern but necessary for real-world queries that genuinely span domains. The key constraint: the router must be conservative about multi-activation, since each additional expert increases latency and cost linearly.
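The conservatism can be encoded as a threshold plus a hard cap on the number of activated experts. The threshold and cap below are illustrative knobs, not recommended values:

```python
ACTIVATION_THRESHOLD = 0.3  # conservative: each extra expert adds cost

def select_experts(domain_scores, max_experts=2):
    """Activate every expert scoring above the threshold, capped to
    bound latency and cost; always routes to at least one expert."""
    ranked = sorted(domain_scores.items(), key=lambda kv: kv[1],
                    reverse=True)
    chosen = [d for d, s in ranked if s >= ACTIVATION_THRESHOLD]
    return chosen[:max_experts] or [ranked[0][0]]
```

The fallback to the top-scoring expert matters: a compound query that narrowly misses the threshold everywhere should still be answered, just by a single expert rather than none.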
Production Considerations
Deploying an interlocking system introduces operational complexity that monolithic deployments avoid. The trade-off is worth it for accuracy-critical applications, but teams should go in with clear expectations:
- Versioning becomes multi-dimensional. Each expert model has its own version, training data version, and evaluation benchmark. The router has its own version. A "system version" is a specific combination of all component versions. This requires disciplined MLOps tooling.
- Monitoring is per-expert. A degradation in the fraud expert doesn't affect the agriculture expert. This is an advantage (blast radius is contained) but requires per-domain monitoring dashboards and alerting rules.
- Adding a new domain is modular. You train a new expert, add it to the router's classification targets, and deploy. The existing experts are untouched. This is dramatically simpler than retraining a monolithic model to add a new capability.
- Cost scales with usage, not capacity. Domains that receive fewer queries can be served on smaller infrastructure or even cold-started on demand. A monolithic model must be provisioned for peak aggregate load at all times.
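The multi-dimensional versioning point can be made concrete with a manifest that pins every component version into one deployable system version. The version strings here are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemVersion:
    """A deployable system version pins the router version plus, for
    each expert, its model version and its training-data version."""
    router: str
    experts: tuple  # (domain, model_version, data_version) triples

    def label(self):
        parts = [f"router={self.router}"]
        parts += [f"{d}={m}/{dv}" for d, m, dv in self.experts]
        return ";".join(parts)
```

Rolling back one expert then means emitting a new manifest that changes exactly one triple, leaving every other component, and its monitoring baseline, untouched.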
What This Means for Training Data
The interlocking architecture shifts the bottleneck from model scale to data quality per domain. A 70B parameter model can absorb noisy, heterogeneous data and still produce passable outputs through sheer parameter count. A 3B expert model cannot — it needs clean, domain-pure, expert-validated training data, or it will fail on exactly the edge cases that matter most.
This is why we believe the training data layer is becoming the most critical infrastructure in production AI. The model architectures are converging. The serving infrastructure is commoditizing. What differentiates a system that works from one that doesn't is the quality, specificity, and freshness of the data each expert was trained on.
At Auxerta, we build the data pipelines that make interlocking systems work. Each domain pipeline produces curated, annotated, continuously refreshed datasets — with the structural consistency and edge-case coverage that expert models require. We don't train the models. We build the data that makes them expert.
- Fedus et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" — JMLR 2022
- Jiang et al. "Mixtral of Experts" — Mistral AI, 2024
- Shnitzer et al. "Large Language Model Routing with Benchmark Datasets" — NeurIPS 2024
- Li et al. "Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models" — arXiv 2022
- Li et al. "Merge, Then Compress: Demystifying Efficient SMoE with Hints from Its Routing Policy" — ICLR 2024
- Wan et al. "Knowledge Fusion of Large Language Models" — ICLR 2024
Questions about our approach? Reach out at service@auxerta.com