
Happy Horse — Alibaba's Four-Mode Video Suite Is Now Live on GenFire

Happy Horse is Alibaba's new video generation family — four endpoints (image-to-video, text-to-video, reference-to-video, and video edit) sharing a single 720p/1080p, 3–15 second model. Here's what it does and how to use it inside GenFire.

GenFire Team

April 30, 2026 · 9 min read

Meet Happy Horse

Happy Horse is Alibaba's new video model family, now live on GenFire. It's not a single endpoint but a four-endpoint suite covering the full short-form generation workflow, with a consistent visual language across all four entry points.

The four modes are:

Image-to-Video — animate a still image
Text-to-Video — generate from a prompt alone
Reference-to-Video — feed up to nine reference images and control which one drives which subject
Video Edit — modify an existing video with a text prompt

What makes Happy Horse worth paying attention to is that all four modes share a model. You can plan a piece across multiple shots, switch endpoints depending on what each shot needs, and the visual identity holds. That's the reason we've integrated all four endpoints rather than just the headline text-to-video.

What Each Mode Is For

Image-to-Video

The classic use case. Upload a still — a product photo, a generated hero frame, a portrait — and Happy Horse animates it for 3 to 15 seconds at 720p or 1080p. The model is image-led: it commits hard to the input frame and builds motion outward from there rather than re-interpreting the subject.

The 15-second duration ceiling matters. For sustained product shots, hero-frame extensions, and slow-burn cinematic moments, that runway is meaningful — it's the difference between a beat and a scene.

Text-to-Video

Prompt-only generation, with full aspect-ratio control (16:9, 9:16, 1:1, 4:3, 3:4) and the same 3–15 second range. The aspect-ratio selection is broader than most current models offer: 4:3 and 3:4 are unusual options, and they suit stacked multi-frame layouts where a vertical panel and a horizontal panel need to feel like the same piece.

Reference-to-Video

This is the mode that genuinely differentiates Happy Horse. You can supply up to nine reference images, and the model uses them as character/subject anchors, so a single video can render multiple distinct, visually locked subjects without losing identity from frame to frame.

The control mechanism is unusually intuitive: reference subjects in your prompt by writing `character1`, `character2`, all the way up to `character9`, and Happy Horse maps those tokens to the corresponding reference images you uploaded. A prompt like "a dance battle between character1 and character2, cinematic lighting" with two uploaded references will produce a video where each character holds their identity for the entire shot.
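Before submitting a Reference-to-Video prompt, it can help to confirm that every token has an image behind it. The sketch below is a local pre-flight check, not part of GenFire or Happy Horse: the function names are hypothetical, and it assumes that characterN refers to the Nth reference image in upload order.

```python
import re

MAX_REFERENCES = 9  # cap stated in the article

# Matches character1 .. character9 as whole tokens; "character10" does not match.
TOKEN_PATTERN = re.compile(r"\bcharacter([1-9])\b")


def referenced_slots(prompt: str) -> list[int]:
    """Return the sorted, de-duplicated slot numbers mentioned in the prompt."""
    return sorted({int(m.group(1)) for m in TOKEN_PATTERN.finditer(prompt)})


def check_prompt(prompt: str, reference_count: int) -> list[str]:
    """Return a list of problems; an empty list means prompt and uploads line up.

    Assumption: slot N corresponds to the Nth uploaded reference image.
    """
    problems = []
    if reference_count > MAX_REFERENCES:
        problems.append(
            f"{reference_count} references exceeds the cap of {MAX_REFERENCES}"
        )
    for slot in referenced_slots(prompt):
        if slot > reference_count:
            problems.append(f"character{slot} has no matching reference image")
    return problems
```

With the dance-battle prompt above and two uploads, check_prompt returns an empty list. Mentioning character3 with only two uploads would be flagged.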

For brand work, this is a major capability. You can lock a spokesperson, a product, and a setting in a single generation — no compositing, no separate passes, no identity drift halfway through the clip.

Video Edit

The fourth mode is the most underrated. Feed Happy Horse an existing video plus a text prompt, and it modifies the source — recoloring, restyling, replacing subjects, adjusting environments — while preserving the underlying motion and structure of the original.

It optionally accepts up to nine reference images for guided edits (same character1..character9 convention), and an audio_setting parameter that controls whether the original audio is preserved (origin) or auto-handled (auto). The duration is derived from the source video, so there's no duration parameter to set — it edits exactly the runtime you provide.
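As a rough illustration of what a Video Edit request carries, here is a minimal sketch of a payload builder. Only audio_setting and its two values (origin, auto) come from the description above. The function name and the video_url, prompt, and reference_image_urls keys are placeholders chosen for this example, not confirmed field names.

```python
AUDIO_SETTINGS = ("origin", "auto")  # the two values named in the article
MAX_REFERENCES = 9


def build_edit_request(video_url, prompt, *, audio_setting, reference_urls=None):
    """Assemble a Video Edit payload as a plain dict.

    Key names other than "audio_setting" are hypothetical placeholders.
    There is deliberately no duration key: the edit runs for the length
    of the source video.
    """
    if audio_setting not in AUDIO_SETTINGS:
        raise ValueError(f"audio_setting must be one of {AUDIO_SETTINGS}")
    references = list(reference_urls or [])
    if len(references) > MAX_REFERENCES:
        raise ValueError(f"at most {MAX_REFERENCES} reference images are allowed")

    payload = {
        "video_url": video_url,
        "prompt": prompt,
        "audio_setting": audio_setting,
    }
    if references:  # references are optional for guided edits
        payload["reference_image_urls"] = references
    return payload
```

The audio setting is a required keyword here so the choice between keeping and auto-handling the source audio is always explicit in the calling code.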

This is the mode that turns Happy Horse from "another generation model" into a post-production tool. You can take a piece of footage you already love — generated, shot, or licensed — and re-stylize it without re-cutting it.

What Makes Happy Horse Distinct

Rather than a head-to-head spec table, here's the honest read on where Happy Horse fits in GenFire's video catalog:

It's the only model in our catalog with all four modes (I2V, T2V, Reference, Edit) sharing a unified visual language. Mixed-model pipelines can do all of this, but every endpoint switch introduces a small but visible shift in style and motion character. Happy Horse sidesteps that.
The 15-second duration ceiling on I2V and T2V is generous for short-form work that needs sustained shots rather than rapid beats.
The 9-reference cap with named tokens is the strongest in-prompt subject control of any model we've integrated. Other reference-capable models accept references; Happy Horse lets you address them by position in the prompt itself.
The 4:3 and 3:4 aspect ratios in T2V and Reference modes are unusual and pair well with multi-frame compositions.
Video Edit accepting both a text prompt and reference images is a combination most edit-capable models don't expose.

For comparisons against specific models in our catalog (Veo 3.1, Sora 2, Seedance, Kling, etc.), the model selector in Video Studio displays each model's current resolution, duration, and audio capabilities side-by-side at generation time. Those are the authoritative specs to plan against.

Where to Use Each Mode

Use Image-to-Video when…

You've already generated a hero frame in Recraft V4 Pro or Nano Banana Pro and want to animate it
You're doing extended product shots that need 10–15 seconds of motion, not just a beat
The visual identity must be locked to a specific reference photo

Use Text-to-Video when…

You're iterating on concepts and want fast turnaround without uploading assets
You need 4:3 or 3:4 aspect ratios specifically
You're producing for stacked multi-frame compositions that need both vertical and horizontal panels in the same visual world

Use Reference-to-Video when…

You have multiple distinct subjects (a spokesperson, a product, a setting) that all need to hold identity in a single shot
You're producing campaign work where character consistency across multiple clips is non-negotiable
You need to render a scene with named, controllable subjects rather than describe them in prose

Use Video Edit when…

You have existing footage — generated, shot, or licensed — and want to restyle it
You're producing color or aesthetic variants of a hero piece for A/B testing
You need to change the look of a clip without changing its cut

Building a Multi-Mode Workflow in GenFire

The reason we integrated all four endpoints is that the most interesting workflows use them in sequence. A typical Happy Horse pipeline in the Workflow editor might look like:

1. Reference-to-Video to generate a hero shot with two locked characters and a fixed setting (5s, 1080p, 16:9)
2. Image-to-Video on a still pulled from that hero shot to extend a specific beat into a 12-second linger
3. Text-to-Video for a transitional cutaway (3s, 9:16) using the same prompt language to keep visual continuity
4. Video Edit on the final stitched cut to apply a unified color treatment across all three clips

Because all four modes share the underlying Happy Horse model, the visual identity holds across the whole piece. This is the workflow that's hard to do with mixed-model pipelines, where every endpoint switch introduces a shift in style. For brand work where consistency is the entire point, that's the win.
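The four-step pipeline above can also be written down as plain data, which makes it easy to check the runtime the final Video Edit pass will inherit. This is a planning sketch only: the Step class and the mode strings are hypothetical, and settings the steps above leave unstated are left as None rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    mode: str
    purpose: str
    duration_s: Optional[int] = None   # None: unstated, or derived from the source
    resolution: Optional[str] = None
    aspect_ratio: Optional[str] = None


PIPELINE = [
    Step("reference-to-video", "hero shot, two locked characters", 5, "1080p", "16:9"),
    Step("image-to-video", "extend one beat into a linger", 12),
    Step("text-to-video", "transitional cutaway", 3, aspect_ratio="9:16"),
    Step("video-edit", "unified color treatment on the stitched cut"),
]


def stitched_runtime(steps):
    """Sum the generated clips; the edit pass adds no runtime of its own."""
    return sum(
        s.duration_s
        for s in steps
        if s.mode != "video-edit" and s.duration_s is not None
    )


print(stitched_runtime(PIPELINE))  # 5 + 12 + 3 = 20 seconds
```

The edit step carries no duration because Video Edit takes its runtime from the stitched source.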

Per-Shot Routing in Storyboard

Happy Horse is also available in Storyboard, where you can assign a specific mode per shot. The pattern that works well:

Wide establishers — Text-to-Video (3:4 or 4:3, longer duration)
Character-driven beats — Reference-to-Video with locked references
Insert details and product shots — Image-to-Video from a Recraft V4 Pro hero frame
Final pass — Video Edit to unify color and style across the whole storyboard

The storyboard's per-shot model assignment was built for exactly this kind of multi-mode workflow. You're not committing the whole piece to one endpoint — you're routing each beat to the mode that suits it, while keeping the visual signature consistent.
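If you plan a storyboard outside the UI first, the routing pattern above reduces to a small lookup table. The sketch below is illustrative only: the shot-kind labels, mode strings, and function name are hypothetical and do not mirror any Storyboard data format.

```python
# Shot kind -> Happy Horse mode, following the pattern described above.
ROUTING = {
    "wide_establisher": "text-to-video",
    "character_beat": "reference-to-video",
    "insert_detail": "image-to-video",
    "product_shot": "image-to-video",
}
FINAL_PASS_MODE = "video-edit"


def route_storyboard(shots):
    """Map (shot name, shot kind) pairs to (shot name, mode) pairs.

    A closing Video Edit pass is appended to unify color and style.
    """
    plan = []
    for name, kind in shots:
        if kind not in ROUTING:
            raise ValueError(f"no routing rule for shot kind {kind!r}")
        plan.append((name, ROUTING[kind]))
    plan.append(("final pass", FINAL_PASS_MODE))
    return plan
```

Raising on an unknown shot kind, rather than falling back to a default mode, keeps an unplanned shot from being silently routed to the wrong endpoint.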

Audio Strategy

Happy Horse is image- and motion-focused; for any piece that needs voiced narration or a music bed, pair it with the rest of GenFire's audio stack:

Audio node in the Workflow editor with ElevenLabs v3 for any voiced narration
Music bed dropped in via the Clips editor's audio mixing during final assembly
For Video Edit specifically, set audio_setting: 'origin' to preserve the source audio while transforming the visual

The audio_setting: 'auto' option is useful when you're radically transforming a video and the original audio no longer matches — it allows the model to drop or adapt the audio rather than producing a mismatch.

Practical Notes

A few things to know about working with Happy Horse:

Duration is an integer enum. The model expects whole seconds (3, 4, 5, …, 15). GenFire handles the conversion automatically when you pick from the duration picker, but if you're hitting the API directly via the Workflow editor's advanced fields, pass an integer rather than a string like "5s".
Aspect ratio differs by mode. Image-to-Video and Video Edit don't accept aspect ratio (the input dictates it). Text-to-Video and Reference-to-Video accept 16:9, 9:16, 1:1, 4:3, 3:4.
Reference images cap at 9. You can pass fewer, but the character1..character9 token range is fixed. If your prompt only mentions character1 and character2, you only need two reference images.
Safety checker is configurable. The enable_safety_checker flag is exposed for advanced users. Most users should leave it on.
Video Edit derives duration from the source. You don't pick a runtime — the edit runs the length of the input video.
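The duration and aspect-ratio rules above can be collected into a small pre-flight validator. This is a sketch under stated assumptions: the mode strings and function names are hypothetical, and it assumes Reference-to-Video takes the same 3–15 second range as the other generation modes.

```python
import re

ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MODES_WITH_ASPECT_RATIO = {"text-to-video", "reference-to-video"}
# Assumption: reference-to-video shares the 3-15 second range.
MODES_WITH_DURATION = {"image-to-video", "text-to-video", "reference-to-video"}
ALL_MODES = MODES_WITH_DURATION | {"video-edit"}


def normalize_duration(value):
    """Accept 5, "5", or "5s" and return whole seconds in the range 3..15."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        raise TypeError("duration must be an int or a string such as '5s'")
    if isinstance(value, str):
        match = re.fullmatch(r"\s*(\d+)\s*s?\s*", value)
        if match is None:
            raise ValueError(f"cannot read a whole number of seconds from {value!r}")
        value = int(match.group(1))
    if not 3 <= value <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    return value


def validate_params(mode, duration=None, aspect_ratio=None):
    """Return cleaned parameters for a mode, or raise on a rule violation."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown mode {mode!r}")
    params = {}
    if mode in MODES_WITH_DURATION:
        if duration is None:
            raise ValueError(f"{mode} needs a duration")
        params["duration"] = normalize_duration(duration)
    elif duration is not None:
        raise ValueError("video-edit derives its duration from the source video")
    if aspect_ratio is not None:
        if mode not in MODES_WITH_ASPECT_RATIO:
            raise ValueError(f"{mode} takes its aspect ratio from the input")
        if aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio {aspect_ratio!r}")
        params["aspect_ratio"] = aspect_ratio
    return params
```

For example, validate_params("text-to-video", "5s", "4:3") returns {"duration": 5, "aspect_ratio": "4:3"}, while passing an aspect ratio to image-to-video or a duration to video-edit raises a ValueError.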

Getting Started

1. Open Video Studio from your dashboard.
2. Select Happy Horse from the model dropdown. You'll see all four sub-modes (I2V, T2V, Reference, Edit) listed.
3. Pick the mode that matches your input. Upload an image for I2V, write a prompt for T2V, upload up to 9 references and use character1..character9 in your prompt for Reference, or upload a video for Edit.
4. Set duration (3–15s, integer seconds) and resolution (720p or 1080p). Aspect ratio is available for T2V and Reference modes.
5. Generate.

For workflow-driven projects, drop Happy Horse nodes into the Workflow editor and chain modes — Reference for hero shots, Image for extensions, Text for cutaways, Edit for color unification.

The Bottom Line

Happy Horse is the most versatile single-family video model we've integrated this year. It doesn't try to win on every spec dimension — it wins on workflow flexibility, because the four modes share enough underlying behavior that you can build a complete short-form piece without leaving the family.

For brand work, multi-character scenes, and post-production restyles, it's a strong default. For glossy hero shots where you need maximum fidelity or native audio, pair it with one of the other models in your catalog and let Happy Horse handle the connective tissue.
