
Inside Meta's Synthetic-Data Kit for Llama Fine-Tuning


Meta’s synthetic-data-kit is a toolkit designed to generate high-quality synthetic datasets for fine-tuning Large Language Models. The tool streamlines the process of creating training data through an ETL-like pipeline with four key operations.

Core Functionality

The toolkit exposes a simple CLI with these commands:

1. Ingest

Converts various file formats into clean text. Supported formats include:

  • PDF (using pdfminer)
  • HTML (with BeautifulSoup4)
  • YouTube (via pytube and transcript APIs)
  • DOCX (python-docx)
  • PPTX (python-pptx)
  • TXT (standard parsing)

Regardless of the input format, the ingestion phase always outputs plain text.
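
To make the dispatch concrete, here is a minimal sketch of what an ingest step like this might look like in Python, assuming pdfminer.six and BeautifulSoup4 are installed. The function name and structure are illustrative, not the toolkit's actual API.

```python
from pathlib import Path

from bs4 import BeautifulSoup                                 # HTML parsing
from pdfminer.high_level import extract_text as pdf_extract   # PDF parsing


def ingest(path: str) -> str:
    """Convert a source document into clean plain text (simplified)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return pdf_extract(path)
    if suffix in (".html", ".htm"):
        html = Path(path).read_text(encoding="utf-8")
        return BeautifulSoup(html, "html.parser").get_text(separator="\n")
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Unsupported format: {suffix}")


# Hypothetical usage: every input ends up as a text file for the next stage.
Path("report.txt").write_text(ingest("report.pdf"), encoding="utf-8")
```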

2. Create

Generates synthetic training data in multiple formats:

  • QA Pairs: Question-answer pairs extracted from source material
  • QA with Chain-of-Thought: Pairs augmented with reasoning traces
  • Summaries: Document-level summaries

The creation process involves these steps, illustrated in the sketch after this list:

  • Generating low-temperature summaries
  • Chunking text into ~4K character segments
  • Generating QA pairs per chunk using configurable prompts
  • Consolidating results with document context
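
The sketch below shows the chunk-and-generate loop under the assumptions described later in the post (a local vLLM server exposing an OpenAI-compatible endpoint). The prompt wording, model name, and function names are placeholders, not the toolkit's actual prompts or configuration.

```python
import json

from openai import OpenAI

# Assumes a local vLLM server, as described under Technical Requirements.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-3.3-70B-Instruct"   # placeholder for the served model


def chunk_text(text: str, size: int = 4000) -> list[str]:
    """Split the ingested text into ~4K-character segments."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def qa_prompt(chunk: str) -> str:
    return (
        "Generate 5 question-answer pairs from the text below. Respond only "
        "with a JSON list of objects with 'question' and 'answer' keys.\n\n"
        + chunk
    )


def create_qa_pairs(text: str) -> list[dict]:
    pairs: list[dict] = []
    for chunk in chunk_text(text):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": qa_prompt(chunk)}],
            temperature=0.7,
        )
        # Assumes well-formed JSON output; the real tool is more defensive.
        pairs.extend(json.loads(resp.choices[0].message.content))
    return pairs
```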

3. Curate

Rates and filters generated content using the same LLM as a quality judge. The curation process:

  • Loads generated datasets
  • Splits pairs into batches
  • Rates each batch using evaluation prompts
  • Filters out pairs that fall below a user-defined quality threshold
  • Reports retention metrics

The rating prompt evaluates pairs on “accuracy, relevance, clarity, and usefulness” on a 10-point scale.
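
A hedged sketch of the judging loop follows. It rates one pair at a time for brevity (the toolkit batches pairs), and the rating prompt is paraphrased from the criteria quoted above rather than copied from the tool.

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-3.3-70B-Instruct"   # placeholder for the served model

JUDGE_PROMPT = (
    "Rate the following question-answer pair from 1 to 10 for accuracy, "
    "relevance, clarity, and usefulness. Respond with a single number.\n\n"
)


def rate_pair(pair: dict) -> float:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": JUDGE_PROMPT + json.dumps(pair)}],
        temperature=0.0,                      # deterministic judging
    )
    return float(resp.choices[0].message.content.strip())


def curate(pairs: list[dict], threshold: float = 7.0) -> list[dict]:
    kept = [p for p in pairs if rate_pair(p) >= threshold]
    retention = 100 * len(kept) / max(len(pairs), 1)
    print(f"Retained {len(kept)}/{len(pairs)} pairs ({retention:.1f}%)")
    return kept
```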

4. Save-as

Converts filtered data into formats expected by downstream fine-tuning frameworks:

  • jsonl: Line-by-line JSON
  • alpaca: Instruction/input/output fields
  • ft: OpenAI fine-tuning format
  • chatml: Chat format with role-based messages

Storage options include JSON or HuggingFace Arrow datasets.
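
As an illustration of what those conversions amount to, the sketch below maps curated QA pairs into Alpaca and ChatML records and writes them as JSONL. Field names follow the usual conventions for each format; the toolkit's exact mapping may differ.

```python
import json


def to_alpaca(pair: dict) -> dict:
    return {"instruction": pair["question"], "input": "", "output": pair["answer"]}


def to_chatml(pair: dict) -> dict:
    return {"messages": [
        {"role": "user", "content": pair["question"]},
        {"role": "assistant", "content": pair["answer"]},
    ]}


def save_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line (the jsonl option above)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Hypothetical usage: export curated pairs in Alpaca format.
pairs = [{"question": "What does the ingest step do?",
          "answer": "It converts source documents into clean text."}]
save_jsonl([to_alpaca(p) for p in pairs], "train_alpaca.jsonl")
```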

Technical Requirements

The toolkit assumes a local deployment with vLLM serving a Llama model. It connects over an OpenAI-compatible API to http://localhost:8000/v1/chat/completions (the endpoint is configurable).
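
A minimal connectivity check might look like the following, assuming the server was started with something like `vllm serve <model>` (which listens on port 8000 by default). The model name is a placeholder for whatever checkpoint is actually being served.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",   # placeholder for the served model
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```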

Workflow Summary

The four-stage pipeline mirrors traditional data engineering practices:

  • Extraction: Ingest diverse source formats
  • Transformation: Generate synthetic training data
  • Quality assurance: Curate and filter the generated pairs
  • Load: Export in training-framework formats