
China’s $9 AI Video Tool Kling 2.1 Adds Audio—Can It Beat Google’s $250 Veo 3?

decrypt.co 5 h

Chinese short video platform Kuaishou has added an audio generation feature to Kling 2.1, its AI-powered video creation tool, enabling users to produce clips with synchronized sound effects such as footsteps, rainfall, and ambient noise.

The feature, which launched quietly last week, is available in Kling's image-to-video mode, where users upload a still image and the platform animates it with both motion and audio generated by artificial intelligence.

The timing pits Kling against Google's Veo 3, which launched with integrated audio capabilities from day one.

Early users on X praised Kling's seamless audio-visual synchronization, with creator Roberto Nickson calling it "one of the most useful models on the market" for producing generative video content.

The feature is free during initial rollout, accessible through Kling's website and mobile app.

Kling 2.1 one of the most useful models on the market

— Roberto Nickson (@rpnickson) June 12, 2025

Kling 2.1 generates 5- to 10-second clips at up to 1080p resolution, utilizing what the company describes as "3D spatiotemporal attention mechanisms" to synchronize sounds with visuals.

The audio tool currently generates sound effects only, with no dialogue or music. When dialogue is involved, it produces something that resembles a tonal Southeast Asian language and is completely unintelligible. But that by itself isn’t enough to crown Google as the undisputed king of generative video.

We tested Kling 2.1's new audio features against Google's Veo 3 to see how the upstart stacks up.

The Price of Creation

The price gap between the two platforms turns out to be massive.

Kling 2.1's audio feature is only compatible with the standard version, not the higher-end Master edition. However, at regular prices, users can generate more than 20 videos on Kling for every single Veo 3 creation.

For example, using Freepik’s credit system, one generation with Google Veo 3 is currently on sale for 4,000 credits (with the normal price being 8,000 credits per video), whereas Kling 2.1 costs 300 credits per video.

Google's model runs exclusively through its $250-per-month Ultra subscription. Kling is available on its official site, offering some free generations, with subscriptions starting at around $9 per month.

Even with Google's current promotional pricing, Veo 3 remains more than ten times as expensive as Kling per generation.
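To make the arithmetic behind these ratios explicit, here is a minimal sketch that uses only the Freepik credit prices quoted above. The function and variable names are illustrative, not part of any Freepik, Kling, or Google API, and the figures will drift as pricing changes.

```python
# Credit prices quoted in this article (Freepik's credit system).
VEO3_SALE_CREDITS = 4_000     # current promotional price per Veo 3 video
VEO3_REGULAR_CREDITS = 8_000  # normal price per Veo 3 video
KLING_CREDITS = 300           # price per Kling 2.1 video


def kling_clips_per_veo_clip(veo_credits: int, kling_credits: int = KLING_CREDITS) -> float:
    """Return how many Kling generations cost the same as one Veo 3 generation."""
    return veo_credits / kling_credits


sale_ratio = kling_clips_per_veo_clip(VEO3_SALE_CREDITS)
regular_ratio = kling_clips_per_veo_clip(VEO3_REGULAR_CREDITS)

# Whole clips you could actually afford for the price of one Veo 3 clip.
sale_whole_clips = VEO3_SALE_CREDITS // KLING_CREDITS
regular_whole_clips = VEO3_REGULAR_CREDITS // KLING_CREDITS

print(f"Sale price:    {sale_ratio:.1f}x ({sale_whole_clips} whole Kling clips)")
print(f"Regular price: {regular_ratio:.1f}x ({regular_whole_clips} whole Kling clips)")
```

On these numbers, the sale price works out to roughly 13 Kling clips per Veo 3 clip, and the regular price to roughly 26, which is where the "more than 20" figure comes from.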

For creators who know video generation involves plenty of trial and error, with failure rates that frustrate even patient users, Kling's economics make experimentation feasible.

The Premium plan on Kling unlocks 1080p resolution, improving overall video quality while still maintaining the cost advantage.

Audio Capabilities

But you get what you pay for. Veo 3 offers sophisticated sound generation, accurately synthesizing speech and matching complex audio elements to visual scenes.

Its understanding of spatial audio and contextual sounds surpassed Kling's offerings by a wide margin.

Kling 2.1 can’t compete there but, in fairness, it aims at something different: ambient sounds and background effects, with no dialogue and no music. So forget about those viral AI street interviews for now. Attempts to generate speech produce gibberish.

Yet for scenes or videos requiring atmospheric audio, its results were serviceable.

2. An off-road SUV drives through rocky, muddy, and wet forest terrain.

You hear the crunch, the splash, the growl of the engine. Felt like a real shoot. pic.twitter.com/S0gVhCAQjk

— ZOYA ✪ (@Zoya_ai) June 12, 2025

The platform's new ability to add effects to existing silent videos gives it an edge that Veo 3 can't match.

Users can upload finished videos and retrofit them with appropriate soundscapes, a workflow that Google's model doesn't support. Weirdly, Veo can create videos, but it can’t edit them.

Besides the ability to create sounds for silent videos, Kling also offers a lip-syncing feature.

Users can upload a photo and, separately, an audio track of speech or dialogue, and the model will make a video in which the subjects appear to speak the uploaded audio, interacting naturally as if they were talking to each other.

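[Kling AI (@Kling_ai)] Lip-sync update!! 📢
A lip-sync editing feature has been added that lets you select a character appearing in the video, choose which person is speaking, and adjust the timing of the audio.… pic.twitter.com/brvGUOgLKs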

— SEIIIRU😈動画生成AI×AfterEffects (@seiiiiiiiiiiru) June 10, 2025

The twenty-to-one generation ratio means creators can experiment with different audio approaches on Kling while Veo 3 users have to nail their sound design in fewer attempts.

For hobbyists and those learning generative video, Kling's approach offers more room for trial and error.

But professional creators needing precise audio-visual synchronization and dialogue will find Veo 3's sophisticated sound engine worth the premium.

Video Generation Quality

Video quality testing produced unexpected results. In a test scene featuring a woman fleeing from a giant spider, Kling 2.1's standard version outperformed both Veo 3 and its own Master edition.

The standard model accurately represented the scene dynamics, exhibiting fluid motion and proper directional movement. Veo 3 inexplicably generated the woman running toward the spider instead of away from it.

The Master edition typically produces sharper, crisper visuals, but the standard version demonstrated superior scene comprehension and more fluid movement.

This is odd, since the higher-end edition would be expected to deliver better results, but maybe the problem boiled down to prompt technique or simply bad luck in the generation.

That said, Kling 2.1 standard with 1080p generations is a great model that holds its own against Google Veo 3 here.

Platform Workflows and Limitations

Platform limitations shape each tool's workflow differently. Kling 2.1's audio feature works only with image-to-video generation, not text-to-video, which remains exclusive to the Master edition without audio support—yes, this is odd, but it is what it is.

The best workaround is using Kolors, Kuaishou's image generator, to create starting frames before converting them to video with synchronized audio. Kolors produces highly realistic images that serve as excellent starting points for video generation.

However, you might find that models including Reve, MidJourney, Recraft, Flux, and even ChatGPT are easier to prompt.

Veo 3 took the opposite approach, offering only text-to-video generation without any image-to-video option.

This forces users to rely entirely on prompt engineering, with no way to control the starting visual.

Google's decision also seems particularly odd given that the previous Veo 2 does actually support image-to-video through its separate Flow platform.

The lack of visual control means users have to generate videos blindly, hoping their text prompts will produce the desired starting frames.

Content Moderation Approaches

Content moderation revealed contrasting philosophies. Veo 3 employs aggressive keyword filtering and post-generation checks, blocking content that violates Google's policies.

The system flags potentially problematic prompts before generation and analyzes completed videos for policy violations.

Kling applies more liberal restrictions, allowing content that Veo will block outright.

However, the model's training data naturally excluded explicit content—the model generates figures without anatomical details and violence without gore.

So, users can generate certain types of content that bypass keyword filters while still maintaining safety boundaries.

Both platforms refund credits when post-generation censorship blocks a video, but Kling's lighter touch allows more creative freedom within boundaries.

Conclusions

Veo 3 might still be the king, but Kling 2.1 looks a lot like a populist on a mission to overthrow the monarchy.

Its audio feature is pretty revolutionary when you consider it’s a $9 tool competing against a $250 subscription.

The atmospheric sounds work, the rain sounds like rain, footsteps match the movement most of the time, and you can generate twenty attempts while Veo users carefully craft their single shot.

That retrofit feature, where you add sound to finished videos, is something Google doesn't offer, and it's genuinely useful for salvaging silent clips.

Things will look completely different if your primary goal is speech. Kling’s gibberish won't fool anyone.

For this kind of specific requirement, Google Veo 3 is the obvious and only choice. The king is (almost) dead. Long live the Kling!

Edited by Josh Quittner and Sebastian Sinclair
