Kostas Pardalis

Designing the Ideal Synthetic Data Generation Pipeline for LLMs

AI synthetic-data LLM data-engineering

Robust, maintainable, expressive, and composable pipelines are critical for scaling synthetic data generation. This post argues for abstractions that reduce boilerplate, for avoiding ad-hoc scripts, and for building on dataframe APIs with structured document representations.

Use Case: SEC Filings

The concrete example fine-tunes a smaller model on synthetic QA pairs that a frontier model generates from SEC corporate reports. The goal is to maintain answer quality while reducing inference costs.

Data Processing Workflow

OCR and Initial Processing

Documents are converted from PDF to Markdown using an OCR model (such as Mistral’s). The conversion preserves document structure, including sections, subsections, and tables, and captures mathematical notation.

Data Cleanup

Three primary cleaning operations:

  • Removing HTML tags via regex
  • Stripping embedded image references
  • Converting checkbox symbols to boolean representations (“Yes”/“No”)
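The three operations above can be sketched with the standard library alone. This is a minimal illustration, not the post's actual code: the function name `clean_markdown`, the specific checkbox glyphs (☒, ☑, ☐), and the assumption that images appear as standard Markdown `![alt](target)` references are all hypothetical choices.

```python
import re

# Assumption: tags look like <b>, </td>, <br/>. Requiring a letter after "<"
# keeps comparisons such as "x < 5" intact.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")

# Assumption: embedded images use the standard Markdown form ![alt](target).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")

# Assumption: these are the checkbox glyphs that show up in the OCR output.
CHECKBOX = {"☒": "Yes", "☑": "Yes", "☐": "No"}


def clean_markdown(text: str) -> str:
    """Apply the three cleanup steps to one Markdown string."""
    text = IMAGE_REF.sub("", text)   # strip embedded image references
    text = HTML_TAG.sub("", text)    # remove HTML tags
    for symbol, label in CHECKBOX.items():
        text = text.replace(symbol, label)  # checkbox -> "Yes"/"No"
    return text


sample = "<b>Large accelerated filer</b> ☒ ![img-0.jpeg](img-0.jpeg) Shell company ☐"
cleaned = clean_markdown(sample)
```

One caveat of the regex approach: the tag pattern would also strip Markdown autolinks written as `<https://...>`, so a real pipeline may need a stricter pattern or an HTML parser.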

Markdown as First-Class Data Type

The proposal treats Markdown documents the way databases treat JSON: as hierarchical structures with an AST-like representation. Operations include:

  • Extracting document structure/schema
  • Transforming strings into structured columns
  • Extracting specific sections
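To make these operations concrete, here is a rough sketch that parses ATX headings (`#` through `######`) into a flat list of sections, then derives an outline and looks up one section by title. The names `Section`, `parse_sections`, `outline`, and `get_section` are hypothetical; a dataframe engine with a native Markdown type would expose its own API, and a production parser would build a full AST rather than scan lines.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


@dataclass
class Section:
    level: int   # 1 for "#", 2 for "##", ...
    title: str
    body: str    # text between this heading and the next one


def parse_sections(markdown: str) -> list[Section]:
    """Split a Markdown string into sections. Text before the first heading is dropped."""
    sections: list[Section] = []
    current = None
    body_lines: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence          # do not treat "#" inside code fences as headings
        match = None if in_fence else HEADING.match(line)
        if match:
            if current is not None:
                current.body = "\n".join(body_lines).strip()
                sections.append(current)
            current = Section(len(match.group(1)), match.group(2), "")
            body_lines = []
        elif current is not None:
            body_lines.append(line)
    if current is not None:
        current.body = "\n".join(body_lines).strip()
        sections.append(current)
    return sections


def outline(sections: list[Section]) -> list[tuple[int, str]]:
    """The document 'schema': heading levels and titles in order."""
    return [(s.level, s.title) for s in sections]


def get_section(sections: list[Section], title: str):
    """Return the first section whose title matches (case-insensitive), or None."""
    return next((s for s in sections if s.title.lower() == title.lower()), None)


doc = "# 10-K\nIntro.\n## Item 1. Business\nWe sell things.\n## Item 1A. Risk Factors\nRisks."
sections = parse_sections(doc)
risk = get_section(sections, "item 1a. risk factors")
```

Each `Section` maps naturally onto a row with `level`, `title`, and `body` columns, which is the "strings into structured columns" step in miniature.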

Chunking Strategy

Rather than breaking documents by pages, the approach maintains semantic integrity by chunking at section boundaries, preserving context and enabling consistent evaluation.
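A section-boundary chunker might look like the sketch below. It assumes ATX-style headings in the OCR output and attaches a heading path (for example "Annual Report > Item 1. Business") to every chunk so that context survives the split. The function name `chunk_by_section` and the `path`/`text` keys are illustrative, not taken from the post.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def chunk_by_section(markdown: str) -> list[dict]:
    """Return one chunk per section, each tagged with its heading path."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []   # stack of (level, title) for enclosing headings
    lines: list[str] = []

    def flush() -> None:
        # Emit the text collected under the current heading path, if any.
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"path": " > ".join(t for _, t in path), "text": text})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Pop headings at the same or deeper level before entering the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            lines.append(line)
    flush()
    return chunks


doc = (
    "# Annual Report\nOverview text.\n"
    "## Item 1. Business\nWe sell things.\n"
    "### Segments\nTwo segments.\n"
    "## Item 7. MD&A\nRevenue grew."
)
chunks = chunk_by_section(doc)
```

Because chunk boundaries follow the document's own structure, the same section yields the same chunk on every run, which is what makes evaluation across iterations comparable. Very long sections would still need a secondary split, which this sketch does not attempt.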

Synthetic Data Generation

The process involves:

  1. Summarization: Creating document summaries using semantic reduction
  2. QA Generation: Using LLMs with prompts requesting 10 question-answer pairs per section
  3. Parsing Results: Extracting structured QA pairs from model responses
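Steps 2 and 3 can be illustrated without committing to any particular model client. The sketch below builds a prompt asking for 10 QA pairs per section and parses the reply. The prompt wording, the JSON output format, and the names `build_prompt` and `parse_qa_pairs` are assumptions made for illustration; the post does not specify them, and the model call itself is deliberately left out.

```python
import json
import re


def build_prompt(section_title: str, section_text: str, n_pairs: int = 10) -> str:
    """Compose a QA-generation prompt for one section (wording is illustrative)."""
    return (
        f"Read the following section of an SEC filing, titled '{section_title}'.\n"
        f"Write {n_pairs} question-answer pairs that can be answered from the text alone.\n"
        'Reply with a JSON array of objects with the keys "question" and "answer".\n\n'
        f"{section_text}"
    )


def parse_qa_pairs(response: str) -> list[dict]:
    """Extract well-formed QA pairs from a model reply; return [] if none can be parsed."""
    # Models often wrap JSON in prose or code fences, so grab the outermost [...] span.
    match = re.search(r"\[.*\]", response, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    pairs = []
    for item in items:
        # Keep only objects that have both a non-empty question and answer.
        if isinstance(item, dict) and item.get("question") and item.get("answer"):
            pairs.append(
                {"question": str(item["question"]).strip(),
                 "answer": str(item["answer"]).strip()}
            )
    return pairs


prompt = build_prompt("Item 1. Business", "We sell things.")
reply = 'Here are the pairs: [{"question": "What does the company do?", "answer": "It sells things."}, {"question": "", "answer": "dropped"}]'
pairs = parse_qa_pairs(reply)
```

Returning an empty list on malformed output, rather than raising, keeps one bad response from failing a whole batch; the failed rows can then be inspected or retried separately.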

The generation code itself requires minimal boilerplate: the model invocation takes approximately one line.

Additional Considerations

  • Session management for defining model choices and constraints
  • Row-based lineage tracking for debugging failed QA pairs and iterating on quality
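Row-based lineage can be pictured as every derived row keeping a pointer to the row it came from, so a bad QA pair can be walked back to its chunk and source document. The sketch below is one hypothetical way to model that in plain Python; the `Row`, `derive`, and `trace` names and the sample file name are invented for illustration, and a dataframe engine would typically track this internally.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple process-local id generator


@dataclass
class Row:
    stage: str                      # e.g. "document", "chunk", "qa_pair"
    data: dict
    parent: "Row | None" = None     # the row this one was derived from
    row_id: int = field(default_factory=lambda: next(_ids))


def derive(parent: Row, stage: str, data: dict) -> Row:
    """Create a new row that remembers which row produced it."""
    return Row(stage, data, parent)


def trace(row: Row) -> list[Row]:
    """Walk parent links back to the source; return the chain from source to row."""
    chain = []
    current = row
    while current is not None:
        chain.append(current)
        current = current.parent
    return list(reversed(chain))


# Hypothetical example: a QA pair traced back through its chunk to its document.
document = Row("document", {"file": "example-10k.pdf"})
chunk = derive(document, "chunk", {"path": "Item 1A. Risk Factors"})
qa = derive(chunk, "qa_pair", {"question": "What is the main risk?", "answer": ""})
lineage = trace(qa)
```

With this in place, a QA pair that fails validation (here, the empty answer) can be mapped to the exact section and file that produced it, which narrows down whether the problem lies in OCR, chunking, or the prompt.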