# Designing the Ideal Synthetic Data Generation Pipeline for LLMs
Robust, maintainable, expressive, and composable pipelines are critical for scaling synthetic data generation. This post advocates for abstractions that reduce boilerplate, avoid ad-hoc scripts, and leverage dataframe APIs with structured document representations.
## Use Case: SEC Filings
The running example is fine-tuning a smaller model on synthetic QA pairs that a frontier model generates from SEC corporate reports, maintaining quality while reducing inference costs.
## Data Processing Workflow
### OCR and Initial Processing
Documents are converted from PDF to Markdown using OCR models (such as Mistral's), preserving structure, including sections, subsections, and tables, while capturing mathematical notation.
### Data Cleanup
Three primary cleaning operations:
- Removing HTML tags via regex
- Stripping embedded image references
- Converting checkbox symbols to boolean representations (“Yes”/“No”)
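The three cleanup operations above can be sketched as a single pass over each document. This is a minimal sketch, not the post's actual code; the specific checkbox symbols (☒/☐) are assumptions about what SEC filings contain after OCR.

```python
import re

HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")      # residual HTML tags
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")  # Markdown image references

def clean_markdown(text: str) -> str:
    text = HTML_TAG.sub("", text)   # strip HTML tags via regex
    text = IMAGE_REF.sub("", text)  # drop embedded image references
    # Convert checkbox symbols to boolean representations
    return text.replace("☒", "Yes").replace("☐", "No")
```

In a dataframe pipeline, `clean_markdown` would be applied column-wise over the raw Markdown text.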
### Markdown as First-Class Data Type
The proposal treats Markdown documents much as databases handle JSON: as hierarchical structures with AST-like representations. Operations include:
- Extracting document structure/schema
- Transforming strings into structured columns
- Extracting specific sections
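A minimal sketch of these operations, assuming ATX-style `#` headings: each document is split into `(level, title, body)` records, an AST-like view that loads naturally into dataframe columns. The function names here are illustrative, not a real library API.

```python
import re

def extract_sections(markdown: str):
    """Split a Markdown document into (level, title, body) records."""
    heading = re.compile(r"^(#{1,6})\s+(.*)$")
    sections, current = [], None
    for line in markdown.splitlines():
        m = heading.match(line)
        if m:
            # New heading starts a new section record
            current = {"level": len(m.group(1)), "title": m.group(2), "body": []}
            sections.append(current)
        elif current is not None:
            current["body"].append(line)
    return [{**s, "body": "\n".join(s["body"]).strip()} for s in sections]

def get_section(markdown: str, title: str) -> str:
    """Extract a specific section's body by title."""
    for s in extract_sections(markdown):
        if s["title"] == title:
            return s["body"]
    return ""
```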
### Chunking Strategy
Rather than splitting documents at page boundaries, the approach chunks at section boundaries, which maintains semantic integrity, preserves context, and enables consistent evaluation.
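One way to sketch section-boundary chunking: pack consecutive sections into chunks without ever splitting a section. The character budget is an assumption for illustration; a real pipeline would likely count tokens instead.

```python
def chunk_by_sections(sections, max_chars=4000):
    """Group consecutive (title, body) sections into chunks, never
    breaking inside a section, so each chunk stays semantically whole."""
    chunks, current, size = [], [], 0
    for title, body in sections:
        text = f"## {title}\n{body}"
        # Flush the current chunk if adding this section would exceed the budget
        if current and size + len(text) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```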
## Synthetic Data Generation
The process involves:
- Summarization: Creating document summaries using semantic reduction
- QA Generation: Using LLMs with prompts requesting 10 question-answer pairs per section
- Parsing Results: Extracting structured QA pairs from model responses
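The parsing step can be sketched as follows. The `Q:`/`A:` response format is an assumption: the prompt would instruct the model to answer in this shape (requesting JSON output is a common, stricter alternative).

```python
import re

# Lazily match "Q: ... A: ..." pairs; each answer runs until the next
# "Q:" line or the end of the response.
QA_PATTERN = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\nQ:|\Z)", re.DOTALL)

def parse_qa_pairs(response: str):
    """Extract structured QA pairs from a model response."""
    return [{"question": q.strip(), "answer": a.strip()}
            for q, a in QA_PATTERN.findall(response)]
```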
The actual generation code requires minimal boilerplate: approximately one line for the model invocation itself.
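To show the shape of that claim, here is a sketch in which the generation step reduces to one line per row. `generate` is a stub standing in for a real LLM client call (e.g. a provider SDK invocation); the prompt wording and column names are illustrative assumptions.

```python
PROMPT = "Generate 10 question-answer pairs about this section:\n\n{section}"

def generate(prompt: str) -> str:
    # Stub: a real implementation would call the frontier model's API here.
    return "Q: placeholder? A: placeholder."

def add_qa_column(rows):
    # The entire generation step is effectively this one line per row:
    return [{**row, "qa_raw": generate(PROMPT.format(section=row["section"]))}
            for row in rows]
```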
## Additional Considerations
- Session management for defining model choices and constraints
- Row-based lineage tracking for debugging failed QA pairs and iterating on quality
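Row-based lineage can be sketched as stable ids carried through each transformation: every ingested row gets an id, derived rows record their parent's id, and a bad QA pair traces back to the exact source section. Field names here are assumptions for illustration.

```python
import uuid

def ingest(sections):
    """Assign a stable row id to each source section at ingestion."""
    return [{"row_id": str(uuid.uuid4()), "section": s} for s in sections]

def derive_qa(rows, qa_lists):
    """Derive QA rows, each recording the id of the row it came from."""
    derived = []
    for row, qas in zip(rows, qa_lists):
        for qa in qas:
            derived.append({"row_id": str(uuid.uuid4()),
                            "parent_id": row["row_id"], **qa})
    return derived

def trace(qa_row, source_rows):
    """Find the source section that produced a given QA pair."""
    return next(r for r in source_rows if r["row_id"] == qa_row["parent_id"])
```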