Designing the Ideal Synthetic Data Generation Pipeline for LLMs
Robust, maintainable, expressive, and composable pipelines are critical for scaling synthetic data generation. This post advocates for abstractions that reduce boilerplate, avoid ad-hoc scripts, and leverage dataframe APIs over structured document representations. As a concrete example, a frontier model generates synthetic QA pairs from SEC corporate reports, which are then used to fine-tune a smaller model that maintains quality while cutting inference costs.
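To make the dataframe-centric idea concrete before diving in, here is a minimal sketch of what such a pipeline could look like. It is not the implementation from this post: the names (`DocSection`, `generate_qa`) are hypothetical, pandas stands in for whatever dataframe API you prefer, and the frontier-model call is stubbed out. The point is only that when each document is a structured row, every stage becomes a composable column transformation rather than ad-hoc glue code.

```python
from dataclasses import dataclass, asdict
import pandas as pd

# Hypothetical structured representation of one section of an SEC filing.
@dataclass
class DocSection:
    filing_id: str
    section: str
    text: str

def generate_qa(text: str) -> list[dict]:
    # Stub for a frontier-model call. In practice this would wrap your
    # provider's API with a prompt asking for QA pairs grounded in `text`.
    return [{"question": f"What does this section cover? ({text[:40]}...)",
             "answer": text[:200]}]

sections = [
    DocSection("0000320193-24", "Risk Factors", "Excerpt from the filing..."),
    DocSection("0000320193-24", "MD&A", "Another excerpt..."),
]

# Each stage is a dataframe transformation, so filtering, generation,
# and reshaping compose without bespoke orchestration code.
df = pd.DataFrame([asdict(s) for s in sections])
df = df[df["text"].str.len() > 0]              # drop empty sections; real
                                               # pipelines filter more aggressively
df["qa_pairs"] = df["text"].map(generate_qa)   # one model call per row
train = df.explode("qa_pairs").dropna()        # one row per QA pair, ready
                                               # for fine-tuning export
```

Because every intermediate result is just a dataframe, steps like deduplication, quality scoring, or train/test splitting slot in as additional transformations instead of new scripts.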