Exploring Synthetic Data for LLM Fine Tuning
In this post, I explore how synthetic data is used to train and fine-tune large language models. I’ll focus on Meta’s open-source synthetic-data-kit, a tool built for exactly this purpose.
Why synthetic data matters
LLMs owe their success to two factors: human ingenuity and the vast, annotated text of the internet.
As model capacity grows, however, human-curated data no longer scales, making synthetic data an attractive alternative.
Synthetic data is already powering state-of-the-art models. Meta notes that most of Llama 3’s supervised fine-tuning corpus was generated synthetically after six iterative rounds of data creation.
Academic groups have matched that success on a shoestring: Berkeley’s TinyZero reproduces DeepSeek-R1-Zero-style reasoning in a 3-billion-parameter model for under $30 using CoT traces, while NovaSky’s Sky-T1 trains a 32-billion-parameter model for $450 and beats o1-preview on math and coding benchmarks.
I got curious about how synthetic data for LLMs is actually generated. I’ve been using synthetic data for various reasons for many years now, primarily as a way to emulate data processing workloads, but synthetic data for LLMs operates on a completely different scale of impact and importance.
So I was excited to find that, around the time of the first-ever LlamaCon in 2025, Meta released synthetic-data-kit.
What is it? Meta’s synthetic-data-kit generates high-quality synthetic datasets for fine-tuning LLMs.
Being open source, it is a great candidate for digging into the code and seeing how synthetic data generation for LLMs actually works.
Training Models to Use Tools Better
An important new capability of large language models is their ability to call tools. This capability is what makes practical AI agents possible.
The demo is based on the premise that chain-of-thought traces about tool use can be used to fine-tune models and thus improve their performance in tool-calling dialogs.
The workflow breaks down into four clear stages:
- Select a relevant dataset — the example uses ToolACE.
- Enrich each record with chain-of-thought traces using synthetic-data-kit.
- Fine-tune the base Llama model on the enriched dataset.
- Evaluate the model’s tool-calling performance and reasoning quality.
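Before going further, it helps to picture what one of these records looks like. The sketch below is purely illustrative; the field names, the tool, and the values are mine, not an actual ToolACE record:

```python
# Illustrative shape of one tool-calling dialog (not an actual ToolACE record).
example = {
    "system": "You can call these tools: get_weather(city: str) -> dict ...",
    "conversations": [
        {"role": "user", "content": "What's the weather in Paris right now?"},
        {"role": "assistant", "content": "[get_weather(city='Paris')]"},   # raw tool call
        {"role": "tool", "content": "{'temp_c': 18, 'condition': 'cloudy'}"},
        {"role": "assistant", "content": "It is 18°C and cloudy in Paris right now."},
    ],
}
```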
The fun part is that enriching the data this way also introduces reasoning capabilities: the model not only makes a better choice when picking a tool, it also reasons about why.
Which is pretty cool, isn’t it?
And all of that is built on synthetic data generated by another LLM.
Generating the synthetic data
The process of generating the data is surprisingly straightforward and will feel familiar to anyone with a data engineering background.
The data-generation pipeline follows a familiar ETL rhythm:
- Load the source dataset from Hugging Face.
- Transform each record into the shape expected by synthetic-data-kit (add a UUID, prefix the system prompt, wrap tool calls in <tool></tool> tags).
- Enrich the transformed data with chain-of-thought traces via an LLM and a purpose-built prompt.
- Save the final JSON to disk for downstream fine-tuning jobs.
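To make that rhythm concrete, here is a minimal Python sketch of the load-and-save skeleton. It is my own illustration rather than the demo’s code: the Hugging Face dataset id, the file name, and the pass-through `transform` stub (fleshed out in the next section) are all assumptions.

```python
# Minimal sketch of the load -> transform -> save skeleton (illustrative, not the demo's code).
import json
from datasets import load_dataset  # pip install datasets

def transform(record: dict) -> dict:
    # Placeholder: the real reshaping (UUID, prompt prefix, <tool></tool> tags)
    # is sketched in the Data Preparation section below.
    return dict(record)

raw = load_dataset("Team-ACE/ToolACE", split="train")  # assumed Hugging Face dataset id
records = [transform(row) for row in raw]

# Save to the JSON file that synthetic-data-kit will later enrich with CoT traces.
with open("multi_conversations.json", "w") as f:
    json.dump(records, f, indent=2)
```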
Data Preparation
The original dataset comes as either a JSON or a Parquet file containing LLM dialogs that include tool invocations.
To use this data with synthetic-data-kit, it needs to change shape. Here’s what the author does (sketched in code after the list):
- Every example gets a new UUID
- Its system prompt is prefixed for ToolLlama
- Any assistant messages that directly represent a tool call are encapsulated in XML tags
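A minimal sketch of that reshaping might look like the following. The field names, the prefix text, and the heuristic for spotting raw tool calls are my assumptions, not the demo’s exact code:

```python
# Illustrative reshaping of one record; field names, prefix text, and heuristics are assumptions.
import uuid

TOOLLLAMA_PREFIX = "[ToolLlama] "  # placeholder prefix; the real prefix text is not shown in the post

def prepare_record(record: dict) -> dict:
    """Reshape one dialog into the form synthetic-data-kit expects (field names assumed)."""
    messages = []
    for msg in record["conversations"]:
        content = msg["content"]
        # Heuristic (assumed): raw tool calls appear as "[func(arg=...)]" in assistant turns.
        if msg["role"] == "assistant" and content.strip().startswith("["):
            content = f"<tool>{content}</tool>"
        messages.append({"role": msg["role"], "content": content})
    return {
        "id": str(uuid.uuid4()),                        # 1. a new UUID per example
        "system": TOOLLLAMA_PREFIX + record["system"],  # 2. system prompt prefixed for ToolLlama
        "conversations": messages,                      # 3. tool calls wrapped in <tool></tool>
    }
```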
Data Generation
After the data is prepared, we are ready for the generation phase where we enrich the original data with CoT traces generated by an LLM.
There’s a bit of configuration that needs to happen:
- Define the model you want to use and any associated parameters
- Define the prompt that will be used for generating the data
The most important part of this is the prompt:
You are a high 170IQ reasoning super smart AI, your job is to enhance
existing conversation examples. Remember return the entire conversation
as is BUT
We are adding Chain of Thought and planning to "Assistant" messages
whenever it returns a tool call.
Remember ONLY When it does return a tool, we add thinking and reasoning
Traces before it to add logic otherwise we don't touch the conversation
history.
With everything in place, we just have to run:
synthetic-data-kit -c cot_tools_config.yaml create \
../../tool_examples/multi_conversations.json \
--type cot-enhance \
-o ../../enhanced_results/
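Conceptually, the cot-enhance step boils down to a structured LLM call: send the prompt above together with each conversation to an OpenAI-compatible endpoint (the kit talks to a local vLLM server by default) and write the enriched conversation back out. The sketch below illustrates that idea only; it is not the kit’s internal API, and the endpoint, model name, and file names are assumptions:

```python
# Illustrative sketch of the enrichment idea, NOT synthetic-data-kit's internal API.
import json
from openai import OpenAI  # pip install openai

# Assumed local OpenAI-compatible endpoint (e.g. a vLLM server) and model name.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumption, not necessarily what the demo used

with open("cot_enhance_prompt.txt") as f:     # the prompt shown above, stored in a file (name assumed)
    ENHANCE_PROMPT = f.read()

def enhance(conversation: list[dict]) -> list[dict]:
    """Ask the model to insert reasoning traces before each tool call and return the new messages."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": ENHANCE_PROMPT},
            {"role": "user", "content": json.dumps(conversation)},
        ],
        temperature=0.2,
    )
    return json.loads(response.choices[0].message.content)  # assumes the model returns valid JSON

with open("multi_conversations.json") as f:   # prepared examples from the previous steps
    examples = json.load(f)

enhanced = [{**ex, "conversations": enhance(ex["conversations"])} for ex in examples]

with open("enhanced_results.json", "w") as f:
    json.dump(enhanced, f, indent=2)
```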
Did it work?
The authors claim it did! Here’s an example:
Before:
**Assistant:** [calc_binomial_probability(n=20, k=5, p=1/6)]
After:
**Assistant:** To solve this problem, I need to identify the relevant
function that can calculate the probability of getting a certain number
of successes in a given number of trials. The function
'calc_binomial_probability' seems to fit this purpose...
[calc_binomial_probability(n=20, k=5, p=1/6)]
The additional reasoning makes the assistant’s decision process transparent, which helps humans verify the logic and, in practice, boosts tool-selection accuracy in benchmark tests.
Thoughts and what’s next
Gratitude first. Meta’s decision to open-source synthetic-data-kit demystifies how synthetic datasets are built and lowers the barrier for anyone who wants to fine-tune a model.
Why that matters. Fine-tuning is quickly becoming a competitive moat—think of Cursor’s autocomplete or any vertical LLM that quietly outperforms its base model.
It’s still engineering. Under the hood the workflow looks familiar: load, transform, generate, validate. The “magic” lives in the model, but the day-to-day work is good old software and data engineering.
Iteration is the hard part. Today the kit excels at one-shot dataset creation. What it doesn’t offer is a feedback loop that lets you regenerate slices of data after seeing training results, version datasets alongside model checkpoints, and rerun evaluation automatically.
Bottom line. Synthetic-data-kit is a great on-ramp, but production-grade pipelines still need orchestration, data versioning, and automated evaluation.
References
- Meta. synthetic-data-kit. GitHub, 2025.
- Meta AI. The Llama 3 Herd of Models. 2024.
- Jiayi Pan et al. TinyZero: Minimal Reproduction of DeepSeek R1-Zero. GitHub, 2025.
- NovaSky Team. Sky-T1: Train Your Own O1-Preview Model Within $450. January 2025.
- ToolACE dataset. Hugging Face.