The Idea of the Project
As part of my exploration into building a prototype that generates Manim animations from natural language descriptions, I came across the bespoke-manim dataset on Hugging Face. It looked like a perfect starting point: each example includes the narration, the visual structure, and the Python code used to generate an animation video with ManimCE.
But to actually use this dataset for fine-tuning an instruction-following model, I had to reshape it significantly. In this post, I’ll walk through how I approached this process: inspecting the raw dataset, deciding what to keep (and what to filter), and ultimately turning it into a set of instruction → output pairs that a model can learn from.
Exploring the Dataset
When I loaded the dataset, it looked like this:
DatasetDict({
train: Dataset({
features: [
'subject', 'topic', 'question', 'title', 'narration',
'visual_elements', 'equations', 'key_timestamps', 'visual_style',
'concept_id', 'python_code', 'scene_class_name', 'generation_time',
'filename', 'message', 'error', 'stdout', 'stderr', 'video',
'instruction', 'output'
],
num_rows: 1000
})
})
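For anyone reproducing this step, loading the dataset is a single call to the Hugging Face datasets library. The repo id bespokelabs/bespoke-manim below is my assumption of the Hub name, so verify it on the dataset card:

from datasets import load_dataset

# Load the bespoke-manim dataset from the Hugging Face Hub
# (repo id assumed here; check the dataset card for the exact name)
dataset = load_dataset("bespokelabs/bespoke-manim")
print(dataset)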
At first glance, I noticed that not all of the 1000 entries were usable. Some had Python code that failed to execute (captured in the error field), and others were missing videos entirely — which probably meant the code never produced a usable animation.
I also noticed that many of the fields exist only to store metadata about the generation process. After weighing each field’s potential value for fine-tuning, I concluded that not all of them are necessary:
Essential Fields:
- question: Core input that defines what the animation should explain
- python_code: The target output the model needs to generate
- title: Provides a concise focus for the animation
- narration: Detailed explanation that significantly shapes the animation’s content
- visual_elements: Critical for understanding what should be visually represented in the code
- equations: Mathematical formulas that need to be rendered in the animation
Potentially Useful Fields:
- subject and topic: Provide context that might help the model generate more appropriate code
- visual_style: Contains styling information that influences the visual appearance
- key_timestamps: Provides structure for the animation sequence
Unnecessary Fields:
- scene_class_name: Typically derived from the code itself
- generation_time: Metadata unlikely to influence code generation
- filename, message: More metadata
- stdout, stderr: Execution logs not needed for generation
- error: Should already be filtered down to successful examples
- video: The rendered output, not needed for training
- concept_id: Not needed
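A quick way to back up this field-by-field analysis is to print a short preview of a single raw example. This is just an inspection sketch using the field names from the schema above:

# Preview a few fields of the first training example
example = dataset["train"][0]
for field in ["question", "title", "narration", "equations", "error"]:
    preview = str(example[field])[:80]
    print(f"{field}: {preview}")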
Filtering Clean Examples
To keep the training signal clean, I applied a simple filter to remove any examples where:
- The error field was not None
- The video field was None
This reduces the size of the dataset, but it leaves only data points whose code executed successfully and produced a video.
dataset = dataset.filter(
    lambda example: example['error'] is None and example['video'] is not None
)
dataset
As expected, we are left with only 252 examples, which is not bad, since we can now be sure the remaining dataset is high quality: every example has working code and a rendered video.
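If you are curious why rows were dropped, you can count the failure modes on an unfiltered copy. The variable raw_dataset below is hypothetical; you would need to keep it around before overwriting dataset with the filtered version:

# Count execution failures on an unfiltered copy (raw_dataset is hypothetical,
# e.g. raw_dataset = load_dataset("bespokelabs/bespoke-manim"))
errors = raw_dataset["train"]["error"]
print("examples with execution errors:", sum(e is not None for e in errors))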
Designing the Prompt Format
At this point we need to create training examples by combining the essential fields into well-structured prompts paired with the working code. This will give our model the most relevant information while removing noise from the training data.
Let’s implement this data transformation using the Hugging Face datasets library.
def create_prompt(example):
    prompt = f"Subject: {example['subject']}\n"
    prompt += f"Topic: {example['topic']}\n\n"
    prompt += f"Question: {example['question']}\n\n"
    prompt += f"Title: {example['title']}\n\n"
    prompt += f"Narration:\n{example['narration']}\n\n"

    if example['visual_elements'] and len(example['visual_elements']) > 0:
        prompt += "Visual Elements:\n"
        for i, elem in enumerate(example['visual_elements']):
            prompt += f"- {elem['description']} (Timestamp: {elem['timestamp']})\n"
        prompt += "\n"

    if example['equations'] and len(example['equations']) > 0:
        prompt += "Equations:\n"
        for i, eq in enumerate(example['equations']):
            prompt += f"- {eq}\n"
        prompt += "\n"

    if example['visual_style']:
        prompt += f"Visual Style:\n{example['visual_style']}\n\n"

    prompt += "Generate manim code to create this animation:"

    return prompt

# Create the training dataset with input-output pairs
def process_example(example):
    return {
        "instruction": create_prompt(example),
        "output": example['python_code']
    }
This resulted in a rich, semantically meaningful prompt — not just a raw dump of metadata, but something that reads like a real instruction.
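To see what the model will actually be trained on, you can render the prompt for one filtered example. A minimal check (the 500-character cutoff is arbitrary):

# Print the beginning of one generated prompt as a sanity check
sample_prompt = create_prompt(dataset["train"][0])
print(sample_prompt[:500])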
Mapping Instruction → Output Pairs
Once I had a clear structure for the prompt, I turned each example into an input-output pair for fine-tuning:
# Map the function over the filtered dataset
training_dataset = dataset.map(process_example)

# Select only the columns we need for fine-tuning
training_dataset = training_dataset.select_columns(["instruction", "output"])
Now I had exactly what I needed: a list of instructions describing an animation, and the Manim Python code that generates it.
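A quick print confirms the new shape; only the two fine-tuning columns should remain:

# Expect features ['instruction', 'output'] and num_rows 252 in the train split
print(training_dataset)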
Train-Validation Split and Saving
To prepare for fine-tuning, I split the dataset into a training and validation set:
from datasets import Dataset

dataset_split = training_dataset['train'].train_test_split(test_size=0.1)
train_dataset = dataset_split['train']
validation_dataset = dataset_split['test']
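One caveat worth noting: train_test_split shuffles randomly, so if you want the split to be reproducible across runs you can pass a fixed seed (42 below is an arbitrary choice):

# Reproducible 90/10 split; the seed value is arbitrary
dataset_split = training_dataset['train'].train_test_split(test_size=0.1, seed=42)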
I also saved each split to disk and pushed the final version to the Hugging Face Hub:
"manim_train_dataset")
train_dataset.save_to_disk("manim_validation_dataset") validation_dataset.save_to_disk(
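The push to the Hugging Face Hub itself isn’t shown above; with the datasets API it looks roughly like the sketch below, where your-username/manim-instruct is a placeholder repo id and you are assumed to be authenticated (for example via huggingface-cli login):

# Push both splits to the Hub under a placeholder repo id
from datasets import DatasetDict

DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
}).push_to_hub("your-username/manim-instruct")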
What’s next?
This preprocessing pipeline is just the first step in my journey to build a system that can generate Manim animations from natural language prompts. Now that I have a clean dataset of instruction → output pairs, I can move on to experimenting with fine-tuning small instruction-following models like Mistral or Phi.
The end goal is to create a workflow where you can go from:
A structured prompt → Valid Manim code → A rendered animation
There’s still a lot to explore — from prompt engineering and model evaluation, to automated testing of generated code and scalable video rendering. But having this dataset ready brings me much closer to building a working prototype.