Happy Horse — Alibaba's Four-Mode Video Suite Is Now Live on GenFire
Happy Horse is Alibaba's new video generation family — four endpoints (image-to-video, text-to-video, reference-to-video, and video edit) sharing a single 720p/1080p, 3–15 second model. Here's what it does and how to use it inside GenFire.
Meet Happy Horse
Happy Horse is Alibaba's new video model family, now live on GenFire. It's not a single model — it's a four-endpoint suite covering the full short-form generation workflow with a consistent visual language across all four entry points.
The four modes are:
- Image-to-Video — animate a still image
- Text-to-Video — generate from a prompt alone
- Reference-to-Video — feed up to nine reference images and control which one drives which subject
- Video Edit — modify an existing video with a text prompt
What makes Happy Horse worth paying attention to is that all four modes share a model. You can plan a piece across multiple shots, switch endpoints depending on what each shot needs, and the visual identity holds. That's the reason we've integrated all four endpoints rather than just the headline text-to-video.
What Each Mode Is For
Image-to-Video
The classic use case. Upload a still — a product photo, a generated hero frame, a portrait — and Happy Horse animates it for 3 to 15 seconds at 720p or 1080p. The model is image-led: it commits hard to the input frame and builds motion outward from there rather than re-interpreting the subject.
The 15-second duration ceiling matters. For sustained product shots, hero-frame extensions, and slow-burn cinematic moments, that runway is meaningful — it's the difference between a beat and a scene.
Text-to-Video
Prompt-only generation, with full aspect-ratio control (16:9, 9:16, 1:1, 4:3, 3:4) and the same 3–15 second range. The aspect-ratio matrix is broader than most current models offer. 4:3 and 3:4 are unusual options, and they suit stacked multi-frame layouts where a vertical panel and a horizontal panel need to feel like the same piece.
Reference-to-Video
This is the mode that genuinely differentiates Happy Horse. You can supply up to nine reference images, and the model uses them as character/subject anchors, so a single video can render multiple distinct, visually locked subjects without losing identity from frame to frame.
The control mechanism is unusually intuitive: reference subjects in your prompt by writing `character1`, `character2`, all the way up to `character9`, and Happy Horse maps those tokens to the corresponding reference images you uploaded. A prompt like "a dance battle between character1 and character2, cinematic lighting" with two uploaded references will produce a video where each character holds their identity for the entire shot.
For brand work, this is a major capability. You can lock a spokesperson, a product, and a setting in a single generation — no compositing, no separate passes, no identity drift halfway through the clip.
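The token convention lends itself to a quick pre-flight check before you spend a generation. The sketch below is not part of GenFire or Happy Horse; `referenced_slots` and `check_references` are hypothetical helper names, and it assumes that `characterN` refers to the N-th uploaded reference image, as the positional description above suggests.

```python
import re

MAX_REFERENCES = 9  # cap stated in the article

# Matches character1 .. character9 as whole words only,
# so "character10" and "character0" are ignored.
TOKEN_PATTERN = re.compile(r"\bcharacter([1-9])\b")


def referenced_slots(prompt: str) -> list:
    """Return the sorted, de-duplicated slot numbers mentioned in a prompt."""
    return sorted({int(n) for n in TOKEN_PATTERN.findall(prompt)})


def check_references(prompt: str, reference_count: int) -> list:
    """Hypothetical check: every token in the prompt needs an uploaded image."""
    if not 0 <= reference_count <= MAX_REFERENCES:
        raise ValueError(
            f"expected 0-{MAX_REFERENCES} references, got {reference_count}"
        )
    slots = referenced_slots(prompt)
    missing = [s for s in slots if s > reference_count]
    if missing:
        raise ValueError(
            f"prompt uses slots {missing} but only "
            f"{reference_count} image(s) were supplied"
        )
    return slots


prompt = "a dance battle between character1 and character2, cinematic lighting"
print(check_references(prompt, reference_count=2))
```

With two uploaded references the example prompt passes; adding `character3` to the prompt without a third image would raise a `ValueError` before anything is submitted.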
Video Edit
The fourth mode is the most underrated. Feed Happy Horse an existing video plus a text prompt, and it modifies the source — recoloring, restyling, replacing subjects, adjusting environments — while preserving the underlying motion and structure of the original.
It optionally accepts up to nine reference images for guided edits, using the same `character1..character9` convention, along with an `audio_setting` parameter that controls whether the original audio is preserved (`origin`) or handled automatically (`auto`). The duration is derived from the source video, so there is no duration parameter to set: the edit covers exactly the runtime you provide.
This is the mode that turns Happy Horse from "another generation model" into a post-production tool. You can take a piece of footage you already love — generated, shot, or licensed — and re-stylize it without re-cutting it.
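As a rough illustration of the Video Edit inputs described above, here is a minimal sketch of assembling an edit request. Only `audio_setting` and its two values (`origin`, `auto`) come from this article; the function name and the `video_url`, `prompt`, and `reference_image_urls` keys are assumptions made for the example and may not match the real field names.

```python
AUDIO_SETTINGS = ("origin", "auto")  # the two values named in the article
MAX_REFERENCES = 9


def build_edit_request(video_url, prompt, audio_setting, reference_urls=()):
    """Assemble a hypothetical Video Edit payload.

    There is deliberately no duration field: the edit runs for the
    length of the source video.
    """
    if audio_setting not in AUDIO_SETTINGS:
        raise ValueError(f"audio_setting must be one of {AUDIO_SETTINGS}")
    references = list(reference_urls)
    if len(references) > MAX_REFERENCES:
        raise ValueError(f"at most {MAX_REFERENCES} reference images are accepted")

    request = {
        "video_url": video_url,          # assumed key name
        "prompt": prompt,                # assumed key name
        "audio_setting": audio_setting,  # parameter named in the article
    }
    if references:
        request["reference_image_urls"] = references  # assumed key name
    return request


request = build_edit_request(
    "https://example.com/source.mp4",
    "restyle as hand-painted watercolor, keep character1 in frame",
    audio_setting="origin",
    reference_urls=["https://example.com/spokesperson.png"],
)
print(request)
```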
What Makes Happy Horse Distinct
Rather than a head-to-head spec table, here's the honest read on where Happy Horse fits in GenFire's video catalog:
- It's the only model in our catalog with all four modes (I2V, T2V, Reference, Edit) sharing a unified visual language. Mixed-model pipelines can do all of this, but every endpoint switch introduces a small but visible shift in style and motion character. Happy Horse sidesteps that.
- The 15-second duration ceiling on I2V and T2V is generous for short-form work that needs sustained shots rather than rapid beats.
- The 9-reference cap with named tokens is the strongest in-prompt subject control of any model we've integrated. Other reference-capable models accept references; Happy Horse lets you address them by position in the prompt itself.
- The 4:3 and 3:4 aspect ratios in T2V and Reference modes are unusual and pair well with multi-frame compositions.
- Video Edit accepting both a text prompt and reference images is a combination most edit-capable models don't expose.
For comparisons against specific models in your catalog (Veo 3.1, Sora 2, Seedance, Kling, etc.), the model selector in Video Studio displays each model's current resolution, duration, and audio capabilities side-by-side at generation time — those are the authoritative specs to plan against.
Where to Use Each Mode
Use Image-to-Video when…
- You've already generated a hero frame in Recraft V4 Pro or Nano Banana Pro and want to animate it
- You're doing extended product shots that need 10–15 seconds of motion, not just a beat
- The visual identity must be locked to a specific reference photo
Use Text-to-Video when…
- You're iterating on concepts and want fast turnaround without uploading assets
- You need 4:3 or 3:4 aspect ratios specifically
- You're producing for stacked multi-frame compositions that need both vertical and horizontal panels in the same visual world
Use Reference-to-Video when…
- You have multiple distinct subjects (a spokesperson, a product, a setting) that all need to hold identity in a single shot
- You're producing campaign work where character consistency across multiple clips is non-negotiable
- You need to render a scene with named, controllable subjects rather than describe them in prose
Use Video Edit when…
- You have existing footage — generated, shot, or licensed — and want to restyle it
- You're producing color or aesthetic variants of a hero piece for A/B testing
- You need to change the look of a clip without changing its cut
Building a Multi-Mode Workflow in GenFire
The reason we integrated all four endpoints is that the most interesting workflows use them in sequence. A typical Happy Horse pipeline in the Workflow editor might look like:
1. Reference-to-Video to generate a hero shot with two locked characters and a fixed setting (5s, 1080p, 16:9)
2. Image-to-Video on a still pulled from that hero shot to extend a specific beat into a 12-second linger
3. Text-to-Video for a transitional cutaway (3s, 9:16) using the same prompt language to keep visual continuity
4. Video Edit on the final stitched cut to apply a unified color treatment across all three clips
Because all four modes share the underlying Happy Horse model, the visual identity holds across the whole piece. This is the workflow that's hard to do with mixed-model pipelines, where every endpoint switch introduces a shift in style. For brand work where consistency is the entire point, that's the win.
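For planning purposes, the four-step pipeline above can be written down as plain data before anything is built in the Workflow editor. This is only a planning sketch: the `Step` class and the mode labels are hypothetical, and fields are left as `None` wherever the example pipeline does not state a value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    mode: str                           # illustrative label for a Happy Horse mode
    purpose: str
    duration_s: Optional[int] = None    # None where no figure is given
    resolution: Optional[str] = None
    aspect_ratio: Optional[str] = None


PIPELINE = [
    Step("reference-to-video", "hero shot, two locked characters", 5, "1080p", "16:9"),
    Step("image-to-video", "extend a beat from a hero-shot still", 12),
    Step("text-to-video", "transitional cutaway", 3, aspect_ratio="9:16"),
    # Video Edit has no duration: it runs for the length of the stitched cut.
    Step("video-edit", "unified color treatment on the stitched cut"),
]


def generated_runtime(steps) -> int:
    """Sum the seconds of footage produced by the generating steps."""
    return sum(s.duration_s for s in steps if s.duration_s is not None)


print(generated_runtime(PIPELINE))  # 5 + 12 + 3 = 20 seconds of raw footage
```

Note that 20 seconds is the raw footage generated, not necessarily the length of the final cut once the clips are trimmed and stitched.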
Per-Shot Routing in Storyboard
Happy Horse is also available in Storyboard, where you can assign a specific mode per shot. The pattern that works well:
- Wide establishers — Text-to-Video (3:4 or 4:3, longer duration)
- Character-driven beats — Reference-to-Video with locked references
- Insert details and product shots — Image-to-Video from a Recraft V4 Pro hero frame
- Final pass — Video Edit to unify color and style across the whole storyboard
The storyboard's per-shot model assignment was built for exactly this kind of multi-mode workflow. You're not committing the whole piece to one endpoint — you're routing each beat to the mode that suits it, while keeping the visual signature consistent.
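The routing pattern above amounts to a small lookup table. The sketch below shows one way to express it when drafting a shot list outside GenFire; the shot-kind keys, mode labels, and `assign_modes` helper are all hypothetical names chosen for the example.

```python
# Shot kind -> Happy Horse mode, following the pattern listed above.
ROUTING = {
    "wide_establisher": "text-to-video",
    "character_beat": "reference-to-video",
    "insert_detail": "image-to-video",
    "final_pass": "video-edit",
}


def assign_modes(shots):
    """Map (shot_name, shot_kind) pairs to a mode; unknown kinds fail loudly."""
    plan = {}
    for name, kind in shots:
        if kind not in ROUTING:
            raise KeyError(f"no routing rule for shot kind {kind!r}")
        plan[name] = ROUTING[kind]
    return plan


shot_list = [
    ("opening skyline", "wide_establisher"),
    ("spokesperson intro", "character_beat"),
    ("product close-up", "insert_detail"),
    ("grade", "final_pass"),
]
for shot, mode in assign_modes(shot_list).items():
    print(f"{shot}: {mode}")
```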
Audio Strategy
Happy Horse is image- and motion-focused; for any piece that needs voiced narration or a music bed, pair it with the rest of GenFire's audio stack:
- Audio node in the Workflow editor with ElevenLabs v3 for any voiced narration
- Music bed dropped in via the Clips editor's audio mixing during final assembly
- For Video Edit specifically, set `audio_setting: 'origin'` to preserve the source audio while transforming the visual
The `audio_setting: 'auto'` option is useful when you're radically transforming a video and the original audio no longer matches. It allows the model to drop or adapt the audio rather than producing a mismatch.
Practical Notes
A few things to know about working with Happy Horse:
- Duration is an integer enum. The model expects whole seconds (3, 4, 5, …, 15). GenFire handles the conversion automatically when you pick from the duration picker, but if you're hitting the API directly via the Workflow editor's advanced fields, pass an integer rather than a string like `"5s"`.
- Aspect ratio differs by mode. Image-to-Video and Video Edit don't accept aspect ratio (the input dictates it). Text-to-Video and Reference-to-Video accept 16:9, 9:16, 1:1, 4:3, 3:4.
- Reference images cap at 9. You can pass fewer, but the `character1..character9` token range is fixed. If your prompt only mentions `character1` and `character2`, you only need two reference images.
- Safety checker is opt-in. The `enable_safety_checker` flag is exposed for advanced users. Most users should leave it on.
- Video Edit derives duration from the source. You don't pick a runtime; the edit runs the length of the input video.
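The duration and aspect-ratio rules in these notes can be checked locally before a value reaches the advanced fields. The following is a minimal sketch under stated assumptions: the mode labels, the `duration` and `aspect_ratio` key names, and both helper functions are illustrative and are not taken from the Happy Horse API.

```python
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MODES = {"image-to-video", "text-to-video", "reference-to-video", "video-edit"}
MODES_WITH_ASPECT_RATIO = {"text-to-video", "reference-to-video"}
MODES_WITH_DURATION = MODES - {"video-edit"}  # edit derives runtime from source


def normalize_duration(value) -> int:
    """Coerce 5, "5" or "5s" to the integer 5 and enforce the 3-15 range."""
    if isinstance(value, str):
        text = value.strip().lower()
        if text.endswith("s"):
            text = text[:-1]
        value = int(text)  # raises ValueError for inputs like "5.5" or "abc"
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("duration must be a whole number of seconds")
    if not 3 <= value <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    return value


def build_params(mode, duration=None, aspect_ratio=None) -> dict:
    """Return only the parameters that the given mode accepts."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}")
    params = {}
    if duration is not None:
        if mode not in MODES_WITH_DURATION:
            raise ValueError(f"{mode} derives its runtime from the source video")
        params["duration"] = normalize_duration(duration)
    if aspect_ratio is not None:
        if mode not in MODES_WITH_ASPECT_RATIO:
            raise ValueError(f"{mode} takes its aspect ratio from the input")
        if aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio {aspect_ratio!r}")
        params["aspect_ratio"] = aspect_ratio
    return params


print(build_params("text-to-video", duration="5s", aspect_ratio="4:3"))
```

Failing early on a string like `"5s"` or on an aspect ratio passed to Image-to-Video is a design choice for the sketch; a looser wrapper could silently drop unsupported fields instead.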
Getting Started
1. Open Video Studio from your dashboard.
2. Select Happy Horse from the model dropdown; you'll see all four sub-modes (I2V, T2V, Reference, Edit) listed.
3. Pick the mode that matches your input. Upload an image for I2V, write a prompt for T2V, upload up to 9 references and use `character1..character9` in your prompt for Reference, or upload a video for Edit.
4. Set duration (3–15s, integer seconds) and resolution (720p or 1080p). Aspect ratio is available for T2V and Reference modes.
5. Generate.
For workflow-driven projects, drop Happy Horse nodes into the Workflow editor and chain modes — Reference for hero shots, Image for extensions, Text for cutaways, Edit for color unification.
The Bottom Line
Happy Horse is the most versatile single-family video model we've integrated this year. It doesn't try to win on every spec dimension — it wins on workflow flexibility, because the four modes share enough underlying behavior that you can build a complete short-form piece without leaving the family.
For brand work, multi-character scenes, and post-production restyles, it's a strong default. For glossy hero shots where you need maximum fidelity or native audio, pair it with one of the other models in your catalog and let Happy Horse handle the connective tissue.