January 14, 2025

LLMs in Production Data Pipelines: What Actually Works

After integrating LLMs into data pipelines across multiple companies, here's the unfiltered truth about what holds up.

Ryan Watts

Principal AI & Data Engineer

There's a gap between LLM demos and LLM production systems that almost no one talks about honestly. I've integrated LLMs into production data pipelines at multiple companies now, and I want to share what actually works — and what the hype glosses over.

Where LLMs genuinely add value in data pipelines

1. Unstructured → structured transformation

The strongest production use case I've found: transforming unstructured text (emails, PDFs, free-form notes, API responses) into structured records. LLMs are remarkably good at this when you use structured output schemas.

At MANTL, we used this pattern to parse and normalize account application text from diverse banking partners. What would have been hundreds of brittle regex rules became a reliable extraction pipeline with ~96% accuracy on first pass.
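To make the schema-first idea concrete, here is a minimal sketch that assumes the model has been asked to reply with JSON. The `AccountApplication` fields and the allowed account types are hypothetical, not the actual MANTL schema, and a stdlib dataclass stands in for a Pydantic model so the example runs without dependencies.

```python
import json
from dataclasses import dataclass

# Hypothetical set of accepted values, for illustration only.
ALLOWED_ACCOUNT_TYPES = {"checking", "savings"}


@dataclass(frozen=True)
class AccountApplication:
    """Hypothetical target record; field names are illustrative."""
    applicant_name: str
    account_type: str
    initial_deposit_cents: int


def parse_application(raw_model_output: str) -> AccountApplication:
    """Turn the model's JSON text into a typed record, or raise ValueError."""
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    name = data.get("applicant_name")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("applicant_name must be a non-empty string")

    account_type = data.get("account_type")
    if not isinstance(account_type, str) or account_type not in ALLOWED_ACCOUNT_TYPES:
        raise ValueError(f"account_type must be one of {sorted(ALLOWED_ACCOUNT_TYPES)}")

    deposit = data.get("initial_deposit_cents")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(deposit, int) or isinstance(deposit, bool) or deposit < 0:
        raise ValueError("initial_deposit_cents must be a non-negative integer")

    return AccountApplication(name.strip(), account_type, deposit)
```

The point of the design is that anything the model returns either becomes a typed record or fails loudly with a message specific enough to feed back into a retry.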

2. Data quality classification

LLMs can classify data quality issues in ways that rule-based systems struggle with. Ambiguous addresses, conflicting records, plausible-but-wrong values — LLMs catch these with nuance that's hard to encode in rules.
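One way to keep this kind of classification usable downstream is to constrain the model to a closed label set and route anything else to review. The sketch below assumes that approach; the label names and prompt wording are hypothetical.

```python
from enum import Enum


class QualityIssue(str, Enum):
    """Hypothetical closed label set for the classifier."""
    OK = "ok"
    AMBIGUOUS_ADDRESS = "ambiguous_address"
    CONFLICTING_RECORDS = "conflicting_records"
    IMPLAUSIBLE_VALUE = "implausible_value"
    NEEDS_REVIEW = "needs_review"  # catch-all when the reply is off-script


def build_prompt(record: dict) -> str:
    """Ask for exactly one label from the closed set (wording is illustrative)."""
    labels = ", ".join(
        issue.value for issue in QualityIssue if issue is not QualityIssue.NEEDS_REVIEW
    )
    return (
        f"Classify the data quality of this record as exactly one of: {labels}.\n"
        f"Reply with the label only.\n\nRecord: {record}"
    )


def parse_label(model_reply: str) -> QualityIssue:
    """Map the model's reply onto the enum; unknown replies go to review."""
    cleaned = model_reply.strip().lower()
    try:
        return QualityIssue(cleaned)
    except ValueError:
        return QualityIssue.NEEDS_REVIEW
```

Under this scheme a chatty or malformed reply never becomes a new, unplanned category; it lands in `NEEDS_REVIEW` instead.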

3. Schema inference and documentation

Point an LLM at a SQL table or API response and ask it to document the schema with field-level descriptions. It's not perfect, but it dramatically accelerates the documentation work that data engineers typically skip.
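As an illustration of the "point an LLM at a table" step, the sketch below reads column metadata and builds a documentation prompt. It assumes SQLite (via the stdlib `sqlite3` module) purely so it runs anywhere; the prompt wording is hypothetical and the actual model call is left out.

```python
import sqlite3


def describe_table(conn: sqlite3.Connection, table: str) -> list:
    """Read column metadata for one table via SQLite's PRAGMA table_info."""
    # The table name is interpolated directly, so only pass trusted names.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [
        {"name": r[1], "type": r[2], "not_null": bool(r[3]), "primary_key": bool(r[5])}
        for r in rows
    ]


def build_doc_prompt(table: str, columns: list) -> str:
    """Format the column metadata into a prompt asking for field descriptions."""
    lines = []
    for col in columns:
        flags = []
        if col["primary_key"]:
            flags.append("primary key")
        if col["not_null"]:
            flags.append("not null")
        suffix = f" ({', '.join(flags)})" if flags else ""
        lines.append(f"- {col['name']}: {col['type']}{suffix}")
    return (
        f"Write a one-sentence description for each column of table '{table}'.\n"
        "Say so explicitly when a column's meaning is unclear from its name.\n\n"
        + "\n".join(lines)
    )
```

Asking the model to flag unclear columns is one way to deal with the "not perfect" part: a human then reviews the flagged fields instead of every field.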

Where LLMs fail in pipelines

1. Consistency under volume

At small scale, LLM outputs feel consistent. At 10,000 records per hour, you'll discover the edge cases: records that trigger different reasoning paths, run-to-run variation from nonzero sampling temperature, and subtle instruction-following failures. Robust validation and retry logic are not optional.

2. Cost at scale

GPT-4 at $0.03 per 1K input tokens sounds cheap until you're processing 5M records/day. Run the math before committing to an architecture. For high-volume pipelines, fine-tuned smaller models almost always beat frontier models on cost/performance ratio.
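"Run the math" can be a three-line function. The sketch below uses the price and volume quoted above; the 500 tokens per record is an assumption for illustration, and output tokens (typically priced higher) are ignored, so treat the result as a floor.

```python
def daily_cost_usd(records_per_day: int, tokens_per_record: int, usd_per_1k_tokens: float) -> float:
    """Estimate daily spend from volume, average record size, and unit price."""
    total_tokens = records_per_day * tokens_per_record
    return total_tokens / 1000 * usd_per_1k_tokens


# 5M records/day at $0.03 per 1K tokens, assuming ~500 tokens per record.
cost = daily_cost_usd(records_per_day=5_000_000, tokens_per_record=500, usd_per_1k_tokens=0.03)
print(f"${cost:,.0f} per day")
```

With those inputs the estimate comes to about $75,000 per day, which is the kind of number that changes an architecture decision.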

3. Debugging

When a traditional pipeline fails, you get a stack trace. When an LLM pipeline fails, you get... a plausible but wrong output. Building observability into LLM pipelines — logging inputs, outputs, token counts, and model versions — is essential and often underbuilt.
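A minimal sketch of that logging, assuming the model is reachable through some callable that takes a prompt and returns text. `call_model`, `sink`, and the record field names are hypothetical, and the token counts are a crude whitespace-split proxy; a real pipeline would log the counts reported by the provider.

```python
import hashlib
import json
import time
from typing import Callable


def logged_call(
    call_model: Callable[[str], str],
    prompt: str,
    model_version: str,
    sink: Callable[[str], None],
) -> str:
    """Call the model and emit one JSON line describing the call."""
    started = time.perf_counter()
    output = call_model(prompt)
    record = {
        "ts": time.time(),
        "model_version": model_version,
        # A hash makes it cheap to group identical prompts later.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "output": output,
        # Rough proxy only: whitespace-separated words, not real tokens.
        "approx_prompt_tokens": len(prompt.split()),
        "approx_output_tokens": len(output.split()),
        "latency_ms": round((time.perf_counter() - started) * 1000, 2),
    }
    sink(json.dumps(record))
    return output
```

Passing `sink` in as a function keeps the wrapper indifferent to where the lines go: `print`, a file's `write`, or a list's `append` in tests.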

The pattern I use

For every LLM node in a pipeline:

1. Input validation — sanitize and type-check before sending to the model

2. Structured output schemas — Pydantic models, always

3. Retry with error feedback — inject validation failures back into the prompt

4. Fallback — deterministic fallback for high-failure-rate inputs

5. Sampling audit — randomly sample 1-5% of outputs for human review

This adds overhead, but it's what separates a proof of concept from something you can run in production and sleep soundly about.
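The five steps above can be sketched as a single function. This is an illustration under stated assumptions, not a reference implementation: `call_model` and `fallback` are hypothetical callables supplied by the caller, the `{"category": ...}` schema is invented for the example, and a hand-written check stands in for a Pydantic model so the code runs on the stdlib alone.

```python
import json
import random
from typing import Callable, Optional


def validate_output(raw: str) -> dict:
    """Step 2: check the reply against the (hypothetical) output schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("category"), str):
        raise ValueError('expected a JSON object with a string "category"')
    return data


def run_llm_node(
    text: str,
    call_model: Callable[[str], str],
    fallback: Callable[[str], dict],
    audit_queue: list,
    max_attempts: int = 3,
    audit_rate: float = 0.02,
    rng: Optional[random.Random] = None,
) -> dict:
    rng = rng or random.Random()

    # Step 1: input validation before anything is sent to the model.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")

    prompt = (
        'Classify the text. Reply with JSON like {"category": "..."}.\n\n'
        f"Text: {text.strip()}"
    )
    result = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            result = validate_output(raw)
            break
        except ValueError as exc:
            # Step 3: retry with the validation error injected into the prompt.
            prompt += f"\n\nYour previous reply was rejected: {exc}. Reply with valid JSON only."

    if result is None:
        # Step 4: deterministic fallback once retries are exhausted.
        result = fallback(text)

    # Step 5: sample a small fraction of outputs for human review.
    if rng.random() < audit_rate:
        audit_queue.append({"input": text, "output": result})
    return result
```

Injecting `rng` and the two callables keeps the node testable without a live model: a test can pass a fake model that fails once and then succeeds, and set `audit_rate=1.0` to check that the audit path fires.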

The tools that actually help

· Pydantic AI: the cleanest structured output experience I've used
· LangSmith: essential for LLM pipeline observability
· Prefect or Dagster: for retry orchestration and pipeline observability
· Instructor: if you're not on Pydantic AI, this is the next best thing

The bottom line: LLMs in data pipelines are genuinely useful, but they need to be treated as probabilistic components in a deterministic system. Design accordingly.

Ryan Watts

Principal AI & Data Engineer with 15+ years building enterprise systems. Head of AI at DVx Ventures, Staff Data Engineer at Cork, and independent consultant.