Timestamp Prompting: One Call, Multiple Shots
Veo 3.1 understands temporal structure using a [MM:SS-MM:SS] notation inside a single prompt. You can direct four distinct shots within one 8-second generation call:
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

prompt = """[00:00-00:02] Wide shot from above: a lone hiker cresting a mountain ridge at golden hour.
SFX: Wind, boots on gravel.
[00:02-00:04] Close-up of the hiker's face, eyes narrowing against the light, slight smile.
Shallow depth of field. SFX: Wind fades.
[00:04-00:06] Reverse shot, hiker's POV: a vast valley stretching to the horizon,
mist on the far peaks. SFX: Distant birdsong begins.
[00:06-00:08] Slow crane pull-back, the hiker silhouetted against the sunset sky.
Ambient: quiet orchestral swell, building."""

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt=prompt,
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        resolution="1080p",
        duration_seconds=8,
        enhance_prompt=False,  # keep our structured prompt intact
    ),
)
The model makes editorial decisions about cut timing and transition style — you don’t specify whether it’s a hard cut or a dissolve. What you control is camera position, subject action, and audio for each segment. The output is a single continuous MP4 with no visible seam between shots when your prompts are internally consistent.
The failure mode: inconsistent context. If your first segment specifies “a male hiker in a red jacket” and a later segment describes just “the hiker,” the model maintains the subject reasonably well. But any segment that accidentally implies a different time of day or a different location gets interpreted literally. The model follows instructions; it doesn’t infer that segment 3 is meant to follow segment 2 temporally unless the prompt makes that explicit. Be redundant about scene continuity details across segments.
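One way to stay redundant without hand-editing every segment is to assemble the prompt in code and append the same continuity details to each shot. The sketch below is illustrative only: `Shot`, `build_timestamp_prompt`, and the `continuity` string are hypothetical names, not part of the Veo API. The result is a plain string you would pass as `prompt`.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """One timestamped segment of a multi-shot prompt (hypothetical helper type)."""
    start: int       # segment start, in seconds
    end: int         # segment end, in seconds
    visual: str      # camera position and subject action
    audio: str = ""  # optional "SFX: ..." or "Ambient: ..." line


def build_timestamp_prompt(shots: list[Shot], continuity: str) -> str:
    """Render shots as [MM:SS-MM:SS] segments, repeating continuity details in each."""
    lines = []
    for shot in shots:
        if shot.end <= shot.start:
            raise ValueError(f"segment {shot.start}-{shot.end} has no duration")
        stamp = (f"[{shot.start // 60:02d}:{shot.start % 60:02d}-"
                 f"{shot.end // 60:02d}:{shot.end % 60:02d}]")
        # Repeat the shared details so no segment is read out of context.
        segment = f"{stamp} {shot.visual} {continuity}"
        if shot.audio:
            segment += f" {shot.audio}"
        lines.append(segment)
    return "\n".join(lines)


shots = [
    Shot(0, 2, "Wide shot from above: the hiker cresting a mountain ridge.",
         "SFX: Wind, boots on gravel."),
    Shot(2, 4, "Close-up of the hiker's face, eyes narrowing against the light.",
         "SFX: Wind fades."),
]
prompt = build_timestamp_prompt(
    shots, continuity="Same male hiker in a red jacket, same ridge, golden hour."
)
print(prompt)
```

The repetition costs a few extra tokens per segment, which is usually a fair trade for not having a shot drift to a different time of day.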
First and Last Frame: Controlled Transitions
You provide a starting image and an ending image; the model generates the motion between them with synchronized audio. The main use cases are product transitions, before/after comparisons, and scene changes where you need visual precision at both endpoints.
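def load_image_bytes(path: str) -> bytes:
    """Read an image file as raw bytes (types.Image takes bytes, not base64 text)."""
    with open(path, "rb") as f:
        return f.read()

# Assumes the google-genai setup: `from google import genai`,
# `from google.genai import types`, `client = genai.Client()`.
# In the google-genai SDK the first frame is passed as the top-level `image`
# argument and the last frame as `last_frame` in the config; check these
# names against your installed SDK version.
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt=(
        "A smooth cinematic transition: the empty coffee mug fills with steaming espresso, "
        "steam curling upward. SFX: Espresso machine hiss, liquid settling."
    ),
    image=types.Image(
        image_bytes=load_image_bytes("empty_mug.png"),
        mime_type="image/png",
    ),
    config=types.GenerateVideosConfig(
        last_frame=types.Image(
            image_bytes=load_image_bytes("full_mug.png"),
            mime_type="image/png",
        ),
        aspect_ratio="16:9",
        duration_seconds=6,
    ),
)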
Two things that bite developers here. First: your source images must match the requested aspect ratio. A 16:9 aspect_ratio config with a square source image produces awkward cropping, not letterboxing. Crop your input images before the call, not after. Second: the model respects the first frame strongly but treats the last frame as a guide rather than a hard constraint. For 4-second clips, endpoint adherence is tighter. For 8-second clips, expect the model to take more liberty with how it arrives at the final frame. If the ending frame precision matters — product close-up, specific text on screen — use a 4-second duration and run a few variants.
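To make the "crop before the call" step concrete, here is a minimal sketch that computes a centered crop box for a target aspect ratio using integer arithmetic only. The function name `center_crop_box` is hypothetical; the returned `(left, top, right, bottom)` tuple is in the form that image libraries such as Pillow accept in `Image.crop`, but any cropping tool would do.

```python
def center_crop_box(width: int, height: int,
                    ratio_w: int = 16, ratio_h: int = 9) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered box with the target ratio."""
    if width * ratio_h > height * ratio_w:
        # Image is too wide: keep the full height, trim the sides.
        new_w = height * ratio_w // ratio_h
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is too tall (or already matches): keep the full width, trim top and bottom.
    new_h = width * ratio_h // ratio_w
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


square_box = center_crop_box(1024, 1024)   # square source, 16:9 target
matching_box = center_crop_box(1920, 1080) # already 16:9, box covers the whole image
print(square_box, matching_box)
```

A centered crop is only a default. For product shots, you may want to pick the box by hand so the subject is not cut off.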
Ingredients to Video: Character Consistency Across Clips
Before Veo 3.1, maintaining consistent character appearance across multiple separate generation calls required careful prompt engineering and produced visible drift after 3–4 clips. Ingredients to Video fixes this by accepting reference images as additional input. The model anchors character appearance, style, and setting to your provided references.
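def load_image_bytes(path: str) -> bytes:
    """Read an image file as raw bytes (types.Image takes bytes, not base64 text)."""
    with open(path, "rb") as f:
        return f.read()

# Assumes the google-genai setup: `from google import genai`,
# `from google.genai import types`, `client = genai.Client()`.
# The google-genai SDK wraps each Veo reference in
# types.VideoGenerationReferenceImage; check the class and field names
# against your installed SDK version.
operation = client.models.generate_videos(
    model="veo-3.1-fast-generate-preview",
    prompt=(
        "Using the provided detective and office images: medium shot of the detective "
        "behind his desk. He looks up and says in a weary voice, "
        "'Of all the offices in this town, you had to walk into mine.'"
    ),
    config=types.GenerateVideosConfig(
        reference_images=[
            types.VideoGenerationReferenceImage(
                image=types.Image(
                    image_bytes=load_image_bytes("detective_character.png"),
                    mime_type="image/png",
                ),
                reference_type="asset",
            ),
            types.VideoGenerationReferenceImage(
                image=types.Image(
                    image_bytes=load_image_bytes("office_setting.png"),
                    mime_type="image/png",
                ),
                reference_type="asset",
            ),
        ],
        aspect_ratio="16:9",
        duration_seconds=8,
    ),
)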
Three reference images is the practical ceiling. Beyond three, the model starts averaging features across references in ways that produce blended, uncanny characters. Two is usually sufficient for one character plus one setting. If you need a second character in the scene, describe them in the prompt rather than adding a third reference image. The recommended workflow is to generate character and setting references first using Gemini 2.5 Flash Image (or Nano Banana Pro for higher fidelity), then feed those into Veo 3.1 for the actual video.
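If references are assembled programmatically, a small guard can enforce that ceiling before any paid call is made. This is a sketch under assumptions: `MAX_REFERENCE_IMAGES` and `select_reference_paths` are hypothetical names, and the limit of three comes from the guidance above, not from an SDK constant.

```python
from typing import Optional

MAX_REFERENCE_IMAGES = 3  # practical ceiling from the text, not an SDK constant


def select_reference_paths(character: str, setting: str,
                           extras: Optional[list[str]] = None) -> list[str]:
    """Return reference image paths, character first, refusing to exceed the ceiling."""
    paths = [character, setting] + list(extras or [])
    if len(paths) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"{len(paths)} reference images requested; keep it to "
            f"{MAX_REFERENCE_IMAGES} and describe additional subjects in the prompt"
        )
    return paths


refs = select_reference_paths("detective_character.png", "office_setting.png")
print(refs)
```

Failing loudly here is a deliberate choice: silently dropping a reference would change the output in ways that are hard to trace back.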
Audio Prompting Syntax
Veo 3.1 generates audio by default, and the prompting syntax is explicit enough that learning it pays off immediately. Three patterns:
Dialogue uses quotation marks around the spoken text, with the speaker described before the quote. Example: A woman says, "We have to leave now." The model infers a voice that matches the described character. Multiple speakers work within the same prompt. Example: A man in a suit says, "Sign here." The woman shakes her head: "Not yet." The model will cast two distinct voices and position them in the stereo field based on any camera angle cues in your prompt.
Sound effects use the "SFX:" prefix. Example: SFX: thunder cracks in the distance. Timing relative to the visual action is inferred from context: if your prompt shows a character slamming a door followed by "SFX: door slam echoes", the model places the sound at the door-slam action. You can’t set millisecond timing, but the inference is accurate enough for editorial use.
Ambient audio uses the "Ambient noise:" or "Ambient:" prefix. Example: Ambient noise: the quiet hum of a starship bridge, crew murmur, distant alerts. This sets the background bed for the entire clip. Combining Ambient with SFX produces layered audio: the background ambience plus discrete event sounds on top.
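The three patterns are plain text conventions, so they are easy to generate consistently. The helpers below are a minimal sketch with hypothetical names (`dialogue`, `sfx`, `ambient`, `compose_prompt`); they only format strings and make no claim about how the model weighs each cue.

```python
def dialogue(speaker: str, line: str) -> str:
    """Speaker description first, then the spoken text in double quotes."""
    return f'{speaker} says, "{line}"'


def sfx(description: str) -> str:
    """Discrete event sound, marked with the SFX: prefix."""
    return f"SFX: {description}"


def ambient(description: str) -> str:
    """Background bed for the whole clip, marked with the Ambient: prefix."""
    return f"Ambient: {description}"


def compose_prompt(visual: str, *audio_cues: str) -> str:
    """Join the visual description and audio cues into one prompt string."""
    return " ".join([visual, *audio_cues])


prompt = compose_prompt(
    "Medium shot of a starship bridge, a woman in uniform turns from the console.",
    dialogue("The woman", "We have to leave now."),
    sfx("an alarm begins to pulse."),
    ambient("the quiet hum of a starship bridge, crew murmur, distant alerts."),
)
print(prompt)
```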
To disable audio entirely, set generate_audio=False in your GenerateVideosConfig. This saves approximately $0.13/sec on Standard, $0.05/sec on Fast. For any clip where you’re adding a post-production music track or voice-over, disabling generated audio avoids paying for audio you won’t use.
Pricing in Practice
At $1.20 per 8-second clip, 100 daily generations on Fast cost $120 a day, or roughly $3,600 over a 30-day month. That’s a real production budget, and most pipelines can reduce it without meaningful quality loss.
The cheapest workflow that maintains publishable quality: use Lite for all iteration and Fast only for approved shots. If you run 4 Lite drafts per final clip before approving: 4 × $0.40 + 1 × $1.20 = $2.80 per published 8-second clip, versus 5 × $1.20 = $6.00 if you iterated on Fast throughout. Across 100 published clips per month, that’s $280 versus $600, a saving of about 53%.
The other lever is audio. If your workflow adds music or voice-over in post, turning off audio generation on draft iterations cuts those iteration costs by about a third (on Fast, $0.05 off a $0.15/sec rate). On the 100-generations-a-day Fast pipeline above, running with audio off comes to roughly $2,400/month versus $3,600 with audio on throughout. Over a year, that’s $14,400 back.
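Because these figures shift with every price change, it can help to keep the arithmetic in a small script. The sketch below assumes the per-clip prices quoted in this section ($0.40 for a Lite clip, $1.20 for a Fast clip, both 8 seconds with audio) and the $0.05/sec Fast audio saving mentioned earlier; the function names are hypothetical, and current prices should be checked before relying on the output.

```python
def cost_per_published_clip(drafts: int, draft_price: float, final_price: float) -> float:
    """Cost of N draft generations plus one final generation."""
    return drafts * draft_price + final_price


def audio_saving_fraction(rate_with_audio: float, saving_per_second: float) -> float:
    """Share of the per-second rate saved by turning audio off."""
    return saving_per_second / rate_with_audio


LITE_CLIP = 0.40   # USD per 8-second Lite clip, figure quoted in this section
FAST_CLIP = 1.20   # USD per 8-second Fast clip, figure quoted in this section
CLIP_SECONDS = 8

per_clip = cost_per_published_clip(4, LITE_CLIP, FAST_CLIP)
monthly = per_clip * 100  # 100 published clips per month
fast_rate = FAST_CLIP / CLIP_SECONDS
saving = audio_saving_fraction(fast_rate, 0.05)

print(f"per published clip: ${per_clip:.2f}")
print(f"100 published clips: ${monthly:.2f}")
print(f"audio-off saving on Fast: {saving:.0%}")
```

Swapping in your own draft count is the main use: the Lite-drafting advantage grows with every extra iteration per approved shot.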
Standard is justified when the clip goes directly to client or broadcast without post color grading. The visible quality gap shows under technical scrutiny — hair detail, fine textures, complex lighting interactions. For casual viewing on mobile, Fast is indistinguishable to most audiences. Run your actual prompts on both tiers and evaluate at your target display size before committing to Standard for production.
What Veo 3.1 Still Can’t Do
The SynthID watermark is non-optional. It’s invisible to the eye but detectable by Google’s verification tools and increasingly by third-party detection services. If client contracts specify undetectable AI generation, Veo 3.1 doesn’t comply. This isn’t a bug; Google has been explicit that all Veo output will carry SynthID permanently.
The Add/Remove Object feature, available through AI Studio but not the generate_videos API, still runs on Veo 2 internally, so edited output has no audio. Google hasn’t announced a timeline for migrating editing features to Veo 3.1.
Native long-form generation doesn’t exist in the API. The 60-second clips referenced in Google’s marketing are assembled from sequential 8-second calls chained in post. Character consistency across chained clips is good using Ingredients to Video, but you’re managing a reference image library and re-prompting context for every segment. There is no single API call that produces a 60-second video.
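Planning a chained sequence starts with simple arithmetic: how many 8-second calls cover the target runtime, and which window of the timeline each call is responsible for. A minimal sketch, with the hypothetical helper `plan_segments`; it only produces the schedule and does not call the API.

```python
import math

CLIP_SECONDS = 8  # length of one generation call, per the text


def plan_segments(total_seconds: int, clip_seconds: int = CLIP_SECONDS) -> list[tuple[int, int]]:
    """Split a target runtime into (start, end) windows, one per generation call."""
    count = math.ceil(total_seconds / clip_seconds)
    return [
        (i * clip_seconds, min((i + 1) * clip_seconds, total_seconds))
        for i in range(count)
    ]


segments = plan_segments(60)
print(len(segments), "calls")
for start, end in segments:
    # Each window gets its own prompt and the same reference images;
    # a final window shorter than the clip length is trimmed in post.
    print(f"{start:>2}s to {end:>2}s")
```

For a 60-second target this yields eight calls, the last of which covers only four seconds of the timeline.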
Extreme close-up human faces still produce occasional artifacts. The uncanny valley problem is reduced versus Veo 3, but not eliminated. For shots where a face is the primary subject, generate 3–4 variants and select. For mid-ground or wider framing, this is a non-issue.
Where Veo 3.1 Fits Right Now
Three scenarios where it’s the right tool today.
Content pipelines at volume. If you need 20+ short-form clips per day — product highlights, social teasers, ad variants — Fast tier at roughly 90-second generation time is the only API-native path that fits a real production cadence. Sora 2 and Kling 2 are comparable on quality but slower on average; neither matches Fast’s throughput for pipeline use.
Multi-shot narrative from a single call. Timestamp prompting has no direct equivalent in competitor models as of early June 2026. If you need a structured 4-shot sequence in a single 8-second clip, this is the only model with that built in.
Character-consistent series. The Ingredients to Video workflow produces noticeably better character consistency than prompt-only approaches in Runway ML or Pika. If you’re building multi-episode content with recurring characters, this matters enough to drive model selection on its own.
Where it’s not the right call: photorealistic portrait close-ups (Veo 3.1 Standard is adequate but not market-leading), highly stylized 2D animation aesthetics (Kling 2 performs better there), and any context where AI attribution must remain undisclosed. There is also no free tier, so there’s no cost-free way to evaluate it before your first invoice. Budget at least $50 for a realistic evaluation run against your actual prompts.