The Idea of the Project
As part of my exploration into building a prototype that generates Manim animations from natural language descriptions, I came across the bespoke-manim dataset on Hugging Face. It looked like a perfect starting point: each example includes the narration, the visual structure, and the Python code used to generate an animation video with ManimCE.
But to actually use this dataset for fine-tuning an instruction-following model, I had to reshape it significantly. In this post, I’ll walk through how I approached this process: inspecting the raw dataset, deciding what to keep (and what to filter), and ultimately turning it into a set of instruction → output pairs that a model can learn from.
Exploring the Dataset
When I loaded the dataset, it looked like this:
DatasetDict({
train: Dataset({
features: [
'subject', 'topic', 'question', 'title', 'narration',
'visual_elements', 'equations', 'key_timestamps', 'visual_style',
'concept_id', 'python_code', 'scene_class_name', 'generation_time',
'filename', 'message', 'error', 'stdout', 'stderr', 'video',
'instruction', 'output'
],
num_rows: 1000
})
})
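For anyone reproducing this step, loading the dataset is a single call to the Hugging Face datasets library. The repo id bespokelabs/bespoke-manim below is my assumption of the Hub name, so verify it on the dataset card:

from datasets import load_dataset

# Load the bespoke-manim dataset from the Hugging Face Hub
# (repo id assumed here; check the dataset card for the exact name)
dataset = load_dataset("bespokelabs/bespoke-manim")
print(dataset)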
At first glance, I noticed that not all of the 1000 entries were usable. Some had Python code that failed to execute (captured in the error field), and others were missing videos entirely — which probably meant the code never produced a usable animation.
I also noticed that many of the fields exist only to store metadata about the generation process. After weighing each field’s potential value for fine-tuning, I concluded that not all of them are necessary:
Essential Fields:
- question: Core input that defines what the animation should explain
- python_code: The target output the model needs to generate
- title: Provides a concise focus for the animation
- narration: Detailed explanation that significantly shapes the animation’s content
- visual_elements: Critical for understanding what should be visually represented in the code
- equations: Mathematical formulas that need to be rendered in the animation
Potentially Useful Fields:
- subject and topic: Provide context that might help the model generate more appropriate code
- visual_style: Contains styling information that influences the visual appearance
- key_timestamps: Provides structure for the animation sequence
Unnecessary Fields:
- scene_class_name: Typically derived from the code itself
- generation_time: Metadata unlikely to influence code generation
- filename, message: More metadata
- stdout, stderr: Execution logs not needed for generation
- error: Should already be filtered down to successful examples
- video: The rendered output, not needed for training
- concept_id: Not needed
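A quick way to back up this field-by-field analysis is to print a short preview of a single raw example. This is just an inspection sketch using the field names from the schema above:

# Preview a few fields of the first training example
example = dataset["train"][0]
for field in ["question", "title", "narration", "equations", "error"]:
    preview = str(example[field])[:80]
    print(f"{field}: {preview}")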
Filtering Clean Examples
To keep the training signal clean, I applied a simple filter to remove any examples where:
- The error field was not None
- The video field was None
This reduces the size of the dataset, but it leaves only data points whose code executed successfully and produced a video.
dataset = dataset.filter(
    lambda example: example['error'] is None and example['video'] is not None
)
dataset
As expected, we are left with only 252 examples, which is not bad, since we can now be sure the remaining dataset is high quality: every example has working code and a rendered video.
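If you are curious why rows were dropped, you can count the failure modes on an unfiltered copy. The variable raw_dataset below is hypothetical; you would need to keep it around before overwriting dataset with the filtered version:

# Count execution failures on an unfiltered copy (raw_dataset is hypothetical,
# e.g. raw_dataset = load_dataset("bespokelabs/bespoke-manim"))
errors = raw_dataset["train"]["error"]
print("examples with execution errors:", sum(e is not None for e in errors))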
Designing the Prompt Format
At this point we need to create training examples by combining the essential fields into well-structured prompts paired with the working code. This will give our model the most relevant information while removing noise from the training data.
Let’s implement this data transformation using the Hugging Face datasets library.
def create_prompt(example):
    prompt = f"Subject: {example['subject']}\n"
    prompt += f"Topic: {example['topic']}\n\n"
    prompt += f"Question: {example['question']}\n\n"
    prompt += f"Title: {example['title']}\n\n"
    prompt += f"Narration:\n{example['narration']}\n\n"

    if example['visual_elements'] and len(example['visual_elements']) > 0:
        prompt += "Visual Elements:\n"
        for i, elem in enumerate(example['visual_elements']):
            prompt += f"- {elem['description']} (Timestamp: {elem['timestamp']})\n"
        prompt += "\n"

    if example['equations'] and len(example['equations']) > 0:
        prompt += "Equations:\n"
        for i, eq in enumerate(example['equations']):
            prompt += f"- {eq}\n"
        prompt += "\n"

    if example['visual_style']:
        prompt += f"Visual Style:\n{example['visual_style']}\n\n"

    prompt += "Generate manim code to create this animation:"

    return prompt

# Create the training dataset with input-output pairs
def process_example(example):
    return {
        "instruction": create_prompt(example),
        "output": example['python_code']
    }
This resulted in a rich, semantically meaningful prompt — not just a raw dump of metadata, but something that reads like a real instruction.
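To see what the model will actually be trained on, you can render the prompt for one filtered example. A minimal check (the 500-character cutoff is arbitrary):

# Print the beginning of one generated prompt as a sanity check
sample_prompt = create_prompt(dataset["train"][0])
print(sample_prompt[:500])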
Mapping Instruction → Output Pairs
Once I had a clear structure for the prompt, I turned each example into an input-output pair for fine-tuning:
# Map the function over the filtered dataset
training_dataset = dataset.map(process_example)

# Select only the columns we need for fine-tuning
training_dataset = training_dataset.select_columns(["instruction", "output"])
Now I had exactly what I needed: a list of instructions describing an animation, and the Manim Python code that generates it.
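A quick print confirms the new shape; only the two fine-tuning columns should remain:

# Expect features ['instruction', 'output'] and num_rows 252 in the train split
print(training_dataset)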
Train-Validation Split and Saving
To prepare for fine-tuning, I split the dataset into a training and validation set:
from datasets import Dataset

dataset_split = training_dataset['train'].train_test_split(test_size=0.1)
train_dataset = dataset_split['train']
validation_dataset = dataset_split['test']
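One caveat worth noting: train_test_split shuffles randomly, so if you want the split to be reproducible across runs you can pass a fixed seed (42 below is an arbitrary choice):

# Reproducible 90/10 split; the seed value is arbitrary
dataset_split = training_dataset['train'].train_test_split(test_size=0.1, seed=42)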
I also saved each split to disk and pushed the final version to the Hugging Face Hub:
"manim_train_dataset")
train_dataset.save_to_disk("manim_validation_dataset") validation_dataset.save_to_disk(
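The push to the Hugging Face Hub itself isn’t shown above; with the datasets API it looks roughly like the sketch below, where your-username/manim-instruct is a placeholder repo id and you are assumed to be authenticated (for example via huggingface-cli login):

# Push both splits to the Hub under a placeholder repo id
from datasets import DatasetDict

DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
}).push_to_hub("your-username/manim-instruct")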
What’s next?
This preprocessing pipeline is just the first step in my journey to build a system that can generate Manim animations from natural language prompts. Now that I have a clean dataset of instruction → output pairs, I can move on to experimenting with fine-tuning small instruction-following models like Mistral or Phi.
The end goal is to create a workflow where you can go from:
A structured prompt → Valid Manim code → A rendered animation
There’s still a lot to explore — from prompt engineering and model evaluation, to automated testing of generated code and scalable video rendering. But having this dataset ready brings me much closer to building a working prototype.