Building Video Agents
Learn how to create video agents that autonomously generate video content by combining AI models, web search, and the Video Jungle API.
Example Output
Here's an example of a video generated by this agent workflow:
(Embedded video: example output generated by the Nathan Fielder agent workflow.)
Introduction
Agents are autonomous programs that can perform complex tasks by combining multiple tools and APIs. In this guide, we'll build an agent that:
- Searches the web for current topics
- Generates voiceover narration
- Downloads relevant video clips
- Creates edited videos with synchronized audio
This example demonstrates building a Nathan Fielder content generator, but the patterns can be adapted for any video generation workflow.
Prerequisites
Before building agents, ensure you have:
- API Keys:
  - Video Jungle API key (VJ_API_KEY)
  - Serper API key for web search (SERPER_API_KEY)
  - Anthropic API key for AI models
- Python Packages: I recommend using uv to manage dependencies. First make sure uv is installed, then:

uv init
uv add pydantic-ai videojungle instructor anthropic logfire click

This creates a uv environment and adds all the packages to your project.
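The snippets below assume the API keys are read from the environment and that a shared Video Jungle client (`vj`) is created once at module level. Here's a minimal setup sketch, assuming the videojungle package exposes an ApiClient class:

```python
import os

from videojungle import ApiClient  # assumed client class from the videojungle package

# Read credentials from the environment rather than hardcoding them
vj_api_key = os.environ["VJ_API_KEY"]
serper_api_key = os.environ["SERPER_API_KEY"]
# The Anthropic SDK reads ANTHROPIC_API_KEY from the environment automatically

# Shared Video Jungle client used by the audio-generation workflow below
vj = ApiClient(vj_api_key)
```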
Agent Architecture
Our agent system consists of two specialized agents working together:

The architecture uses Pydantic models to ensure structured outputs from our agents, making the data flow predictable and type-safe.
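The guide doesn't show the model definitions, but their shape can be inferred from how they're used later (result.output.videos, video.url, video.title, resp.clip_topics, result.output.edit_id). Here's a plausible sketch of the structured-output models; the exact field types are assumptions:

```python
from pydantic import BaseModel


class Video(BaseModel):
    """A single source video found by the search agent."""
    title: str
    url: str


class VideoList(BaseModel):
    """Structured output for the search agent."""
    videos: list[Video]


class VideoEdit(BaseModel):
    """Structured output for the edit agent."""
    project_id: str
    edit_id: str


class ClipParameters(BaseModel):
    """Structured topic suggestions returned via Instructor."""
    clip_topics: list[str]
    latest_episode_topic: str
```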
Setting Up MCP Servers
Model Context Protocol (MCP) servers provide tools that agents can use. We'll set up two servers:
from pydantic_ai.mcp import MCPServerStdio

# Video Editor MCP Server - provides video editing tools
vj_server = MCPServerStdio(
    'uvx',
    args=[
        '-p', '3.11',
        '--from', 'video_editor_mcp@0.1.36',
        'video-editor-mcp'
    ],
    env={
        'VJ_API_KEY': vj_api_key,
    },
    timeout=30
)
# Serper MCP Server - provides web search capabilities
serper_server = MCPServerStdio(
    'uvx',
    args=[
        '-p', '3.11',
        'serper-mcp-server@latest',
    ],
    env={
        'SERPER_API_KEY': serper_api_key,
    },
    timeout=30
)
MCP servers run as separate processes and communicate with agents via standard I/O. They provide a secure way to give agents access to external tools without exposing credentials directly.
Creating the Search Agent
The search agent specializes in finding video content from the web:
from pydantic_ai import Agent
from pydantic_ai.models.gemini import GeminiModel
from pydantic_ai.usage import UsageLimits

# Use a cheaper model for search tasks
cheap_model = GeminiModel("gemini-2.5-flash-preview-05-20")

search_agent = Agent(
    model=cheap_model,
    instructions='You are an expert video sourcer. You find the best source videos for a given topic.',
    mcp_servers=[vj_server, serper_server],
    output_type=VideoList,
    instrument=True,
)

# Using the search agent
async with search_agent.run_mcp_servers():
    result = await search_agent.run(
        "Find recent Nathan Fielder clips from 'The Rehearsal'",
        usage_limits=UsageLimits(request_limit=5)
    )
    videos = result.output.videos  # Structured list of videos
The search agent combines web search capabilities with structured output formatting to return clean, usable video data.
Creating the Edit Agent
The edit agent handles the complex task of creating video edits:
from pydantic_ai.models.anthropic import AnthropicModel

# Use a more capable model for complex editing tasks
good_model = AnthropicModel("claude-sonnet-4-20250514")

edit_agent = Agent(
    model=good_model,
    instructions='You are an expert video editor, creating fast paced, interesting video edits for social media. '
                 'You can answer questions, download and analyze videos, and create rough video edits using a mix of project assets and remote videos. '
                 'By default, if a project id is provided, you will use ONLY the assets in that project to create the edit. If no project id is provided, '
                 'you will create a new project, and search videofiles to create an edit instead. For video assets in a project, you will use the type "user" instead of "videofile". '
                 'If you are doing a voice over, you will use the audio asset in the project as the voiceover for the edit, and set the video asset\'s audio level to 0 so that the voiceover is the only audio in the edit.',
    mcp_servers=[vj_server],
    output_type=VideoEdit,
    instrument=True,
)
The edit agent has detailed instructions about handling audio levels, asset types, and project management to ensure high-quality outputs.
Generating Audio Content
Before creating video edits, we generate contextual audio narration:
import random

import instructor
from anthropic import Anthropic


def search_and_render_audio():
    # Use Instructor for structured prompt generation
    workflow_client = Anthropic()
    client = instructor.from_anthropic(workflow_client)

    # Search for current topics
    search_prompt = """
    I'm trying to come up with an interesting spoken dialogue prompt about Nathan Fielder's The Rehearsal season 2.
    Can you help me come up with ideas for what might be interesting? You can search the web to get up to date info.
    """

    # Get structured topic suggestions
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": search_prompt}],
        tools=[{
            "type": "web_search_20250305",
            "name": "web_search",
            "max_uses": 5
        }],
        response_model=ClipParameters,
    )

    # Create a prompt template for voice generation
    prompt = vj.prompts.generate(
        task="You are a 'The Rehearsal' episode analyzer, diving deep into a meta idea to discuss. You aim for a 30-second read: a script concept that is funny and insightful.",
        parameters=["clip topic", "latest episode topic"]
    )

    # Create project and generate audio
    project = vj.projects.create(
        name="Nathan Fielder Clips",
        description="Clips from Nathan Fielder episodes",
        prompt_id=prompt.id,
        generation_method="prompt-to-speech"
    )

    # Generate the actual audio asset
    audio = vj.projects.generate(
        script_id=project.scripts[0].id,
        project_id=project.id,
        parameters={
            "clip topic": random.choice(resp.clip_topics),
            "latest episode topic": resp.latest_episode_topic
        }
    )
    return (project.id, audio['asset_id'])
This workflow demonstrates how to:
- Use AI to research current topics
- Create parameterized prompts for content generation
- Generate audio assets that will guide the video edit
Complete Workflow
The main workflow orchestrates all components:
import os
import time
from typing import Optional


async def async_main(project_id: Optional[str] = None, asset_id: Optional[str] = None):
    if project_id:
        # Use existing project
        project = vj.projects.get(project_id)
        audio_asset_id = asset_id
    else:
        # Create new project with audio
        project_id, audio_asset_id = search_and_render_audio()
        project = vj.projects.get(project_id)

    # Search and download videos
    successful_videos = 0
    processed_urls = set()

    async with search_agent.run_mcp_servers():
        result = await search_agent.run(
            "Find recent Nathan Fielder clips",
            usage_limits=UsageLimits(request_limit=5)
        )

    # Download and upload videos to project
    for video in result.output.videos:
        if video.url not in processed_urls and successful_videos < 5:
            processed_urls.add(video.url)
            try:
                # Download video ('download' is a helper for fetching
                # remote video files; its implementation is not shown here)
                safe_title = video.title.replace('/', '-')
                output_filename = f"{safe_title}.mp4"
                download(video.url, output_path=output_filename)

                # Upload to project
                project.upload_asset(
                    name=video.title,
                    description=f"Agent downloaded: {video.title}",
                    filename=output_filename
                )
                successful_videos += 1
                os.remove(output_filename)  # Clean up
            except Exception as e:
                print(f"Error processing {video.title}: {e}")

    # Wait for video analysis to complete before editing
    time.sleep(45)

    # Create the final edit
    async with edit_agent.run_mcp_servers():
        asset = vj.assets.get(audio_asset_id)
        asset_length = asset.create_parameters['metadata']['duration_seconds']
        result = await edit_agent.run(
            f"""Create an edit using all video assets in project '{project.id}'.
            Use audio asset '{audio_asset_id}' as voiceover (0 to {asset_length} seconds).
            Set all video audio levels to 0. Show outdoor scenes first.
            Total video duration must match voiceover duration ({asset_length} seconds).
            Create the edit but don't render the final video.""",
            usage_limits=UsageLimits(request_limit=14)
        )
        print(f"Created edit: {result.output.edit_id} in project: {result.output.project_id}")
The workflow handles:
- Creating or reusing projects
- Downloading and managing video assets
- Synchronizing video edits with audio duration
- Error handling for failed downloads and uploads
Running the Agent
The agent can be run from the command line:
import asyncio

import click


@click.command()
@click.option('--project-id', '-p', help='Existing project ID to use')
@click.option('--asset-id', '-a', help='Audio asset ID for the edit')
def main(project_id: Optional[str] = None, asset_id: Optional[str] = None):
    asyncio.run(async_main(project_id, asset_id))


if __name__ == "__main__":
    main()
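Assuming the script is saved as agent.py (the filename is illustrative), you can run it with or without an existing project:

```bash
# Generate everything from scratch: new project, audio, and edit
uv run python agent.py

# Reuse an existing project and audio asset
uv run python agent.py --project-id <project-id> --asset-id <audio-asset-id>
```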
Best Practices
When building agents with Video Jungle:
1. Error Handling
- Always wrap download and upload operations in try-except blocks
- Keep track of processed URLs to avoid duplicates
- Implement retry logic for network operations (see the sketch after this list)
2. Resource Management
- Clean up downloaded files after uploading
- Use appropriate timeouts for MCP servers
- Set usage limits to control API costs
3. Agent Design
- Use specialized agents for different tasks (search vs. edit)
- Choose appropriate models based on task complexity
- Provide detailed instructions to guide agent behavior
4. Audio-Video Synchronization
- Always check audio duration before creating edits
- Set video audio levels to 0 when using voiceovers
- Ensure total edit duration matches audio length
5. Monitoring
- Use logfire for observability
- Instrument agents to track performance
- Log progress for long-running operations
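As an example of the retry advice above, here's a minimal retry-with-backoff helper that could wrap the download/upload step. The helper name and parameters are illustrative, not part of the Video Jungle SDK:

```python
import time


def with_retries(operation, max_attempts=3, base_delay=2.0):
    """Run `operation` (a zero-argument callable), retrying on failure
    with exponential backoff. Illustrative helper, not part of any SDK."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as e:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)


# Usage: retry the download step from the workflow above
# with_retries(lambda: download(video.url, output_path=output_filename))
```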
Remember to handle API rate limits and implement appropriate delays between operations. The time.sleep(45) in the example allows time for video analysis to complete.
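A fixed sleep is simple but fragile; if your version of the Video Jungle API exposes an asset status field, a polling loop is more robust. The is_analyzed check below is hypothetical, a stand-in for whatever status your SDK version actually reports:

```python
import time


def wait_for_analysis(asset_ids, poll_interval=10, timeout=300):
    """Poll until all assets report analysis complete, or time out.
    `is_analyzed` is a hypothetical predicate; adapt it to the actual
    status field exposed by your videojungle SDK version."""
    deadline = time.time() + timeout
    pending = set(asset_ids)
    while pending and time.time() < deadline:
        for asset_id in list(pending):
            asset = vj.assets.get(asset_id)
            if is_analyzed(asset):  # hypothetical status check
                pending.discard(asset_id)
        if pending:
            time.sleep(poll_interval)
    if pending:
        raise TimeoutError(f"Assets not analyzed in time: {pending}")
```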
By following these patterns, you can build sophisticated agents that automate complex video generation workflows while maintaining quality and reliability.