Inside Meta's Synthetic-Data Kit for Llama Fine-Tuning
Meta’s synthetic-data-kit is a toolkit for generating high-quality synthetic datasets for fine-tuning large language models (LLMs). It streamlines the creation of training data through an ETL-like pipeline built around four key operations.
Core Functionality
The toolkit exposes a simple CLI interface with these commands:
1. Ingest
Converts various file formats into clean text. Supported formats include:
- PDF (using pdfminer)
- HTML (with BeautifulSoup4)
- YouTube (via pytube and transcript APIs)
- DOCX (python-docx)
- PPTX (python-pptx)
- TXT (standard parsing)
Regardless of the source format, the ingestion phase always outputs plain text.
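To make the ingest step concrete, here is a minimal sketch of what the PDF path can look like, using pdfminer.six's high-level `extract_text` API. The `ingest_pdf` wrapper and the `data/output` directory are illustrative, not the toolkit's internal names.

```python
# Minimal sketch of the PDF path in the ingest step, using
# pdfminer.six's high-level API. The wrapper name and output
# directory are illustrative, not the toolkit's internals.
from pathlib import Path

from pdfminer.high_level import extract_text


def ingest_pdf(source: str, output_dir: str = "data/output") -> Path:
    """Extract plain text from a PDF and write it out as a .txt file."""
    text = extract_text(source)  # layout analysis handled by pdfminer
    out_path = Path(output_dir) / (Path(source).stem + ".txt")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path


# ingest_pdf("paper.pdf")  # -> data/output/paper.txt
```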
2. Create
Generates synthetic training data in multiple formats:
- QA Pairs: Question-answer pairs extracted from source material
- QA with Chain-of-Thought: Pairs augmented with reasoning traces
- Summaries: Document-level summaries
The creation process involves:
- Generating low-temperature summaries
- Chunking text into ~4K character segments
- Generating QA pairs per chunk using configurable prompts
- Consolidating results with document context
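Conceptually, the chunk-and-generate loop reduces to something like the sketch below. It assumes a local vLLM server exposing the OpenAI-compatible endpoint described under Technical Requirements; the helper names, prompt wording, and model name are illustrative rather than the toolkit's actual internals.

```python
# Conceptual sketch of the create step: chunk the text, then request
# QA pairs per chunk. Prompts, helper names, and the model name are
# illustrative; they are not the toolkit's actual internals.
import json

from openai import OpenAI

# vLLM's OpenAI-compatible server; the key is unused for local serving.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumed model name


def chunk_text(text: str, size: int = 4000) -> list[str]:
    """Split the document into ~4K-character segments."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def qa_pairs_for_chunk(chunk: str, summary: str) -> list[dict]:
    """Ask the model for QA pairs, passing the document summary as context."""
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0.7,
        messages=[
            {"role": "system", "content": (
                "Generate question-answer pairs from the text. Respond with "
                'a JSON list of {"question": ..., "answer": ...} objects.')},
            {"role": "user", "content": f"Summary: {summary}\n\nText: {chunk}"},
        ],
    )
    # Assumes the model returns valid JSON; real code needs retry/repair.
    return json.loads(resp.choices[0].message.content)


# pairs = [p for c in chunk_text(doc) for p in qa_pairs_for_chunk(c, summary)]
```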
3. Curate
Rates and filters generated content, using the same LLM that generated it as a quality judge. The curation process:
- Loads generated datasets
- Splits pairs into batches
- Rates each batch using evaluation prompts
- Filters out pairs that score below a user-defined quality threshold
- Reports retention metrics
The rating prompt evaluates each pair on “accuracy, relevance, clarity, and usefulness” using a 10-point scale.
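A rough sketch of the batch-rate-filter loop follows. The batch size, default threshold, and judging prompt are assumptions for illustration; only the rating criteria and the retention report come from the description above.

```python
# Sketch of the curate step: rate pairs in batches, keep those at or
# above the threshold. Batch size and prompt wording are illustrative.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumed model name


def curate(pairs: list[dict], threshold: float = 7.0, batch: int = 8) -> list[dict]:
    kept = []
    for i in range(0, len(pairs), batch):
        group = pairs[i : i + batch]
        resp = client.chat.completions.create(
            model=MODEL,
            temperature=0.1,  # keep the judge as deterministic as possible
            messages=[
                {"role": "system", "content": (
                    "Rate each QA pair from 1 to 10 for accuracy, relevance, "
                    "clarity, and usefulness. Respond with a JSON list of "
                    "numbers, one score per pair.")},
                {"role": "user", "content": json.dumps(group)},
            ],
        )
        scores = json.loads(resp.choices[0].message.content)  # assumes valid JSON
        kept += [p for p, s in zip(group, scores) if s >= threshold]
    # Retention metrics help tune the threshold.
    print(f"kept {len(kept)}/{len(pairs)} pairs ({len(kept) / len(pairs):.0%})")
    return kept
```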
4. Save-as
Converts filtered data into downstream-compatible formats:
- jsonl: One JSON object per line
- alpaca: Instruction/input/output fields
- ft: OpenAI fine-tuning format
- chatml: Chat format with role-based messages
Storage options include JSON or HuggingFace Arrow datasets.
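To illustrate two of these targets, the sketch below writes curated QA pairs as Alpaca-style records and as ChatML-style JSONL. The field layouts follow those formats' public conventions; the function names are not the toolkit's.

```python
# Sketch: convert curated QA pairs into Alpaca-style JSON and
# ChatML-style JSONL. Field layouts follow the public conventions
# of those formats; function names are illustrative.
import json
from pathlib import Path


def save_alpaca(pairs: list[dict], path: str) -> None:
    records = [
        {"instruction": p["question"], "input": "", "output": p["answer"]}
        for p in pairs
    ]
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def save_chatml(pairs: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for p in pairs:
            messages = [
                {"role": "user", "content": p["question"]},
                {"role": "assistant", "content": p["answer"]},
            ]
            f.write(json.dumps({"messages": messages}) + "\n")
```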
Technical Requirements
The toolkit assumes a local deployment in which vLLM serves a Llama model. It connects via an OpenAI-compatible API to http://localhost:8000/v1/chat/completions (the endpoint is configurable).
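Before running the pipeline, it is worth confirming the endpoint answers. Below is a minimal check, assuming the server is already up (e.g. via vLLM's OpenAI-compatible entrypoint) and that the model name matches whatever it is serving:

```python
# Minimal connectivity check against the local endpoint above.
# The model name is an assumption; use whatever your server loads.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```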
Workflow Summary
The four-stage pipeline mirrors a traditional extract-transform-load (ETL) workflow, with an added quality gate:
- Extraction: Ingest diverse source formats
- Transformation: Generate synthetic training data
- QA: Curate for quality assurance
- Load: Export in training-framework formats
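Put together, one plausible end-to-end run looks like the driver below. The subcommand names follow the four operations described above, but the flags and intermediate paths are illustrative assumptions; check `synthetic-data-kit --help` for the authoritative interface.

```python
# End-to-end driver for the four-stage pipeline via the CLI.
# Flags and intermediate paths are illustrative assumptions.
import subprocess


def run(*args: str) -> None:
    subprocess.run(["synthetic-data-kit", *args], check=True)


run("ingest", "paper.pdf")                               # Extraction
run("create", "data/output/paper.txt", "--type", "qa")   # Transformation
run("curate", "data/generated/paper_qa_pairs.json",      # QA
    "--threshold", "7.0")
run("save-as", "data/curated/paper_curated.json",        # Load
    "--format", "alpaca")
```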