Inside Meta's Synthetic-Data Kit for Llama Fine-Tuning
Meta’s synthetic-data-kit is a toolkit for generating high-quality synthetic datasets for fine-tuning large language models (LLMs). It streamlines the creation of training data through an ETL-like pipeline built around four key operations.
Core Functionality
The toolkit exposes a simple CLI interface with these commands:
1. Ingest
Converts various file formats into clean text. Supported formats include:
- PDF (using pdfminer)
- HTML (with BeautifulSoup4)
- YouTube (via pytube and transcript APIs)
- DOCX (python-docx)
- PPTX (python-pptx)
- TXT (standard parsing)
Regardless of the source format, the ingestion phase always outputs plain text.
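To make the ingest step concrete, here is a minimal sketch of what the PDF path can look like, using pdfminer.six's high-level `extract_text` API. The `ingest_pdf` wrapper and the `data/output` directory are illustrative, not the toolkit's internal names.

```python
# Minimal sketch of the PDF path in the ingest step, using
# pdfminer.six's high-level API. The wrapper name and output
# directory are illustrative, not the toolkit's internals.
from pathlib import Path

from pdfminer.high_level import extract_text


def ingest_pdf(source: str, output_dir: str = "data/output") -> Path:
    """Extract plain text from a PDF and write it out as a .txt file."""
    text = extract_text(source)  # layout analysis handled by pdfminer
    out_path = Path(output_dir) / (Path(source).stem + ".txt")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path


# ingest_pdf("paper.pdf")  # -> data/output/paper.txt
```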
2. Create
Generates synthetic training data in multiple formats:
- QA Pairs: Question-answer pairs extracted from source material
- QA with Chain-of-Thought: Pairs augmented with reasoning traces
- Summaries: Document-level summaries
The creation process involves:
- Generating low-temperature summaries
- Chunking text into ~4K character segments
- Generating QA pairs per chunk using configurable prompts
- Consolidating results with document context
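Conceptually, the chunk-and-generate loop reduces to something like the sketch below. It assumes a local vLLM server exposing the OpenAI-compatible endpoint described under Technical Requirements; the helper names, prompt wording, and model name are illustrative rather than the toolkit's actual internals.

```python
# Conceptual sketch of the create step: chunk the text, then request
# QA pairs per chunk. Prompts, helper names, and the model name are
# illustrative; they are not the toolkit's actual internals.
import json

from openai import OpenAI

# vLLM's OpenAI-compatible server; the key is unused for local serving.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumed model name


def chunk_text(text: str, size: int = 4000) -> list[str]:
    """Split the document into ~4K-character segments."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def qa_pairs_for_chunk(chunk: str, summary: str) -> list[dict]:
    """Ask the model for QA pairs, passing the document summary as context."""
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0.7,
        messages=[
            {"role": "system", "content": (
                "Generate question-answer pairs from the text. Respond with "
                'a JSON list of {"question": ..., "answer": ...} objects.')},
            {"role": "user", "content": f"Summary: {summary}\n\nText: {chunk}"},
        ],
    )
    # Assumes the model returns valid JSON; real code needs retry/repair.
    return json.loads(resp.choices[0].message.content)


# pairs = [p for c in chunk_text(doc) for p in qa_pairs_for_chunk(c, summary)]
```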
3. Curate
Rates and filters generated content, using the same LLM that generated it as a quality judge. The curation process:
- Loads generated datasets
- Splits pairs into batches
- Rates each batch using evaluation prompts
- Filters out pairs that score below a user-defined quality threshold
- Reports retention metrics
The rating prompt evaluates each pair on “accuracy, relevance, clarity, and usefulness” using a 10-point scale.
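A rough sketch of the batch-rate-filter loop follows. The batch size, default threshold, and judging prompt are assumptions for illustration; only the rating criteria and the retention report come from the description above.

```python
# Sketch of the curate step: rate pairs in batches, keep those at or
# above the threshold. Batch size and prompt wording are illustrative.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumed model name


def curate(pairs: list[dict], threshold: float = 7.0, batch: int = 8) -> list[dict]:
    kept = []
    for i in range(0, len(pairs), batch):
        group = pairs[i : i + batch]
        resp = client.chat.completions.create(
            model=MODEL,
            temperature=0.1,  # keep the judge as deterministic as possible
            messages=[
                {"role": "system", "content": (
                    "Rate each QA pair from 1 to 10 for accuracy, relevance, "
                    "clarity, and usefulness. Respond with a JSON list of "
                    "numbers, one score per pair.")},
                {"role": "user", "content": json.dumps(group)},
            ],
        )
        scores = json.loads(resp.choices[0].message.content)  # assumes valid JSON
        kept += [p for p, s in zip(group, scores) if s >= threshold]
    # Retention metrics help tune the threshold.
    print(f"kept {len(kept)}/{len(pairs)} pairs ({len(kept) / len(pairs):.0%})")
    return kept
```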
4. Save-as
Converts filtered data into downstream-compatible formats:
- jsonl: One JSON object per line
- alpaca: Instruction/input/output fields
- ft: OpenAI fine-tuning format
- chatml: Chat format with role-based messages
Storage options include JSON or HuggingFace Arrow datasets.
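To illustrate two of these targets, the sketch below writes curated QA pairs as Alpaca-style records and as ChatML-style JSONL. The field layouts follow those formats' public conventions; the function names are not the toolkit's.

```python
# Sketch: convert curated QA pairs into Alpaca-style JSON and
# ChatML-style JSONL. Field layouts follow the public conventions
# of those formats; function names are illustrative.
import json
from pathlib import Path


def save_alpaca(pairs: list[dict], path: str) -> None:
    records = [
        {"instruction": p["question"], "input": "", "output": p["answer"]}
        for p in pairs
    ]
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def save_chatml(pairs: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for p in pairs:
            messages = [
                {"role": "user", "content": p["question"]},
                {"role": "assistant", "content": p["answer"]},
            ]
            f.write(json.dumps({"messages": messages}) + "\n")
```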
Technical Requirements
The toolkit assumes a local deployment in which vLLM serves a Llama model. It connects via an OpenAI-compatible API to http://localhost:8000/v1/chat/completions (the endpoint is configurable).
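Before running the pipeline, it is worth confirming the endpoint answers. Below is a minimal check, assuming the server is already up (e.g. via vLLM's OpenAI-compatible entrypoint) and that the model name matches whatever it is serving:

```python
# Minimal connectivity check against the local endpoint above.
# The model name is an assumption; use whatever your server loads.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```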
Workflow Summary
The four-stage pipeline mirrors a traditional extract-transform-load (ETL) workflow, with an added quality gate:
- Extraction: Ingest diverse source formats
- Transformation: Generate synthetic training data
- QA: Curate for quality assurance
- Load: Export in training-framework formats
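Put together, one plausible end-to-end run looks like the driver below. The subcommand names follow the four operations described above, but the flags and intermediate paths are illustrative assumptions; check `synthetic-data-kit --help` for the authoritative interface.

```python
# End-to-end driver for the four-stage pipeline via the CLI.
# Flags and intermediate paths are illustrative assumptions.
import subprocess


def run(*args: str) -> None:
    subprocess.run(["synthetic-data-kit", *args], check=True)


run("ingest", "paper.pdf")                               # Extraction
run("create", "data/output/paper.txt", "--type", "qa")   # Transformation
run("curate", "data/generated/paper_qa_pairs.json",      # QA
    "--threshold", "7.0")
run("save-as", "data/curated/paper_curated.json",        # Load
    "--format", "alpaca")
```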