If a picture is worth a thousand words, a video is worth a million.
For creators, generative video holds the promise of bringing any story or concept to life. However, the reality has often been a frustrating cycle of “prompt and pray” – typing a prompt and hoping for a usable result, with little to no control over character consistency, cinematic quality, or narrative coherence.
This guide is a framework for directing Veo 3.1, our latest model that marks a shift from simple generation to creative control. Veo 3.1 builds on Veo 3, with stronger prompt adherence and improved audiovisual quality when turning images into videos.
What you’ll learn in this guide:
- Learn Veo 3.1’s full range of capabilities on Vertex AI.
- Implement a formula to direct scenes with consistent characters and styles.
- Direct video and sound using professional cinematic techniques.
- Execute complex ideas by combining Veo with Gemini 2.5 Flash Image (Nano Banana) in advanced workflows.
Veo 3.1 model capabilities
First, it’s essential to understand the model’s full range of capabilities. Veo 3.1 brings audio to existing capabilities to help you craft the perfect scene. These features are experimental and actively improving, and we’re excited to see what you create as we iterate based on your feedback.
Core generation features:
- High-fidelity video: Generate video at 720p or 1080p resolution.
- Aspect ratio: 16:9 or 9:16
- Variable clip length: Create clips of 4, 6, or 8 seconds.
- Rich audio & dialogue: Veo 3.1 excels at generating realistic, synchronized sound, from multi-person conversations to precisely timed sound effects, all guided by the prompt.
- Complex scene comprehension: The model has a deeper understanding of narrative structure and cinematic styles, enabling it to better depict character interactions and follow storytelling cues.
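The resolution, aspect ratio, and clip length options listed above are small fixed sets, so it can help to check them before submitting a request. The following is a minimal sketch of such a check. The `ClipSettings` class, its field names, and its default values are hypothetical and chosen for illustration only; they are not part of any Veo or Vertex AI SDK.

```python
from dataclasses import dataclass

# Allowed values copied from the feature list above.
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
ALLOWED_DURATIONS_SECONDS = {4, 6, 8}


@dataclass(frozen=True)
class ClipSettings:
    """Hypothetical per-clip settings holder (not an SDK type)."""

    # Defaults are arbitrary choices for this sketch.
    resolution: str = "720p"
    aspect_ratio: str = "16:9"
    duration_seconds: int = 8

    def __post_init__(self) -> None:
        # Reject any value outside the documented option sets.
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution!r}")
        if self.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio!r}")
        if self.duration_seconds not in ALLOWED_DURATIONS_SECONDS:
            raise ValueError(f"unsupported duration: {self.duration_seconds!r}")


# Example: a vertical 1080p clip that runs for six seconds.
settings = ClipSettings(resolution="1080p", aspect_ratio="9:16", duration_seconds=6)
```

Validating locally like this surfaces a typo such as a 5-second duration before any generation request is made.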
Advanced creative controls:
- Improved image-to-video: Animate a source image with greater prompt adherence and enhanced audiovisual quality.
- Consistent elements with “ingredients to video”: Provide reference images of a scene, character, object, or style to maintain a consistent aesthetic across multiple shots. This feature now includes audio generation.
- Seamless transitions with “first and last frame”: Generate a natural video transition between a provided start image and end image, complete with audio.
- Add/remove object: Introduce new objects or remove existing ones from a generated video. Veo preserves the scene’s original composition.
- Digital watermarking: All generated videos are marked with SynthID to indicate the content is AI-generated.
Note: Add/remove object currently uses the Veo 2 model and does not generate audio.
A formula for effective prompts
A structured prompt yields consistent, high-quality results. Consider this five-part formula for optimal control.
[Cinematography] + [Subject] + [Action] + [Context] + [Style & Ambiance]
- Cinematography: Define the camera work and shot composition.
- Subject: Identify the main character or focal point.
- Action: Describe what the subject is doing.
- Context: Detail the environment and background elements.
- Style & ambiance: Specify the overall aesthetic, mood, and lighting.
Example prompt: Medium shot, a tired corporate worker, rubbing his temples in exhaustion, in front of a bulky 1980s computer in a cluttered office late at night. The scene is lit by the harsh fluorescent overhead lights and the green glow of the monochrome monitor. Retro aesthetic, shot as if on 1980s color film, slightly grainy.
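If you write many prompts, one way to keep the five parts explicit is to hold them as separate fields and join them only at the end. The sketch below assumes nothing about the Veo API: `ShotPrompt` and `render` are hypothetical names, and the joining rule (commas between the first four parts, then the style sentences) is simply one way to reproduce the example prompt above.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """Hypothetical holder for the five-part formula (not an SDK type)."""

    cinematography: str
    subject: str
    action: str
    context: str
    style_and_ambiance: str

    def render(self) -> str:
        """Join the five parts into a single prompt string."""
        lead_parts = [self.cinematography, self.subject, self.action, self.context]
        all_parts = lead_parts + [self.style_and_ambiance]
        if any(not part.strip() for part in all_parts):
            raise ValueError("every part of the formula needs some text")
        # Strip stray trailing punctuation so the joined sentence reads cleanly.
        lead = ", ".join(part.strip().rstrip(".,") for part in lead_parts)
        return f"{lead}. {self.style_and_ambiance.strip()}"


# The example prompt above, split into its five parts.
prompt = ShotPrompt(
    cinematography="Medium shot",
    subject="a tired corporate worker",
    action="rubbing his temples in exhaustion",
    context=(
        "in front of a bulky 1980s computer in a cluttered office late at night"
    ),
    style_and_ambiance=(
        "The scene is lit by the harsh fluorescent overhead lights and the "
        "green glow of the monochrome monitor. Retro aesthetic, shot as if on "
        "1980s color film, slightly grainy."
    ),
).render()
```

Keeping the parts separate makes it easy to vary one element, such as the cinematography, while holding the subject and style fixed across shots.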