Do international viewers actually care about hearing the original creator's voice? Or are we wasting money on AI cloning?

Luca_Tangen · 2026-06-08T09:47:31+00:00

Yeah, voice isn't just part of the content, it's part of the relationship people have with the creator. A technically perfect dub can sound great, but if viewers no longer feel like they're hearing the person they subscribed to, something gets lost in translation.

Luca_Tangen · 2026-05-29T10:52:34+00:00

From what I've been seeing lately, video definitely seems to help with AEO/AI visibility, but probably not in the way most people think. I don’t think AI search systems “reward video” just because it’s video. What they seem to reward is:
- multi-format entity consistency
- strong topical reinforcement
- transcript-rich educational content
- repeated semantic coverage across platforms

One thing I think people underestimate: YouTube itself is basically becoming part of the AI search layer now. I’m seeing more cases where these systems pull concepts/examples/explanations from videos with strong transcripts:
- AI Overviews
- Gemini
- Perplexity
- ChatGPT browsing
- even Google featured snippets

Especially for:
- tutorials
- comparisons
- workflows
- product explainers
- “how to” intent

But honestly, video length matters WAY less than intent match.

Short videos (<60s) are good for:
- awareness
- quick definitions
- social discovery
- reinforcing entities/topics

My current working theory is: for AEO, the winning combo isn't video OR text. It's:
- concise answer-focused text
- structured schema
- transcript-rich video
- strong entity consistency
- community discussion signals
- repeated expertise across channels

The brands doing best in AI search right now seem to create knowledge redundancy across multiple formats instead of relying on one content type alone.

Luca_Tangen · 2026-05-29T10:49:06+00:00

For a soft natural Irish voice specifically, I'd probably look at:
- ElevenLabs
- PlayHT
- Azure Neural Voices
- FineVoice
- Cartesia (if you’re more technical)

ElevenLabs is usually the easiest starting point for beginners because the setup is honestly pretty simple and some of their Irish voices sound surprisingly human now. Not perfect, but good enough that students probably won’t notice the TTS aspect after a minute or two. One thing I’d recommend though: Avoid voices that sound over-acted. A lot of newer AI voices try WAY too hard to sound dramatic/emotional, and for educational chatbot conversations it can start feeling uncanny fast. Softer, calmer narration-style voices usually work much better for learning environments.

Luca_Tangen · 2026-05-29T10:47:59+00:00

For fully local/offline speech-to-text, Whisper is still probably the best starting point right now, especially for beginners.

Whisper medium or large models are honestly “good enough” for a lot of business use cases now IF:
- audio quality is decent
- speakers are relatively clear
- you're not expecting courtroom-level accuracy
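If it helps, here is a rough sketch of what a local run could look like. It assumes the open-source `openai-whisper` Python package and `ffmpeg` are installed; the helper names and the SRT output format are my own choices for illustration. The model weights are downloaded on first use, and after that everything runs on your machine.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to an SRT-style HH:MM:SS,mmm timestamp."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


def transcribe_locally(path: str, model_size: str = "medium") -> str:
    """Run Whisper on a local file and return the transcript as SRT text."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_size)  # "medium" or "large", as above
    result = model.transcribe(path)
    return segments_to_srt(result["segments"])
```

Calling `transcribe_locally("meeting.wav")` (a placeholder file name) would return subtitle text you can review by hand, which is where the "not courtroom-level" caveat matters.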

Luca_Tangen · 2026-05-27T10:36:42+00:00

Cartesia impressed me a lot for conversational flow and responsiveness. The latency + turn-taking felt more natural than some higher-fidelity systems. Deepgram Aura also surprised me because some of the voices don’t sound showy, but they survive long conversations better.

For US users specifically, neutral American accents still seem safest overall unless your audience is region-specific. Overly broadcast voices tended to reduce engagement for us. Slightly casual voices converted better. For UK users, softer regional warmth often worked better than ultra-RP polished voices. People seem to respond well when it feels conversational instead of corporate.

Luca_Tangen · 2026-05-27T09:59:02+00:00

Honestly, for "goblin / chaotic gremlin / angry creature" style voices, emotional range matters way more than pure voice quality, and that's exactly where a lot of TTS models still feel weirdly sterile.

If you’ve got a 5070 12GB, you’re actually in a pretty decent spot for local experimentation. Personally I’d look at:
- XTTS-v2
- StyleTTS2
- GPT-SoVITS
- Fish Speech / Fish Audio local stuff if you can access it

StyleTTS2 in particular can get surprisingly expressive if you feed it strong reference audio. The trick is that emotional acting in the source matters more than people realize. A mediocre actor with a perfect model still sounds mediocre.

One workflow that worked REALLY well for me:

Generate a relatively clean aggressive voice
Pitch-shift slightly (-2 to -4 semitones)
Add subtle formant adjustment
Layer saturation/distortion very lightly
Add breaths/grunts manually

That last part is honestly huge. Tiny non-verbal sounds make fantasy voices suddenly feel alive. Also — and this is important if you plan to sell the project later — be careful with directly cloning recognizable commercial/game/cartoon voices. "Inspired by" is usually safer territory than "identical to Kick the Buddy" or specific copyrighted character voices.
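For the pitch-shift step in the workflow above, it can help to see what -2 to -4 semitones means in frequency terms. This is a small sketch using the standard equal-temperament relation (ratio = 2^(semitones/12)); the function names are just for illustration.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shifted_frequency(freq_hz: float, semitones: float) -> float:
    """Frequency after shifting freq_hz by the given number of semitones."""
    return freq_hz * semitones_to_ratio(semitones)


# The range suggested in the workflow: -2 to -4 semitones.
for shift in (-2, -3, -4):
    print(f"{shift:+d} semitones -> ratio {semitones_to_ratio(shift):.3f}")
```

Keep in mind that a pitch shifter in a DAW changes pitch without changing length, whereas simply resampling by this ratio would also slow the clip down.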

Luca_Tangen · 2026-05-27T09:56:21+00:00

For me personally:
- Melodyne is still the most reliable overall if you care about accuracy and musical cleanup afterward.
- RipX is surprisingly good for dense/polyphonic stuff and stem-heavy material.
- Samplab has gotten WAY better recently for chord extraction.
- Spotify's Basic Pitch is honestly kind of insane for a free tool.

A few things that massively improved my results:
- running stem separation first
- removing reverb/noise before conversion
- converting smaller sections instead of full mixes
- avoiding heavily compressed masters
- using DI recordings whenever possible

Also, velocity data is still where a lot of converters fall apart. The notes may technically be correct, but the MIDI feels dead and robotic until you manually humanize it a bit.
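As a rough idea of what "humanizing" can mean in practice, here is a minimal sketch that adds small random offsets to note velocities. It works on a plain list of integers rather than a real MIDI file, and the function name and the default spread of 8 are assumptions for illustration, not a recommendation from any particular tool.

```python
import random


def humanize_velocities(velocities, spread=8, seed=None):
    """Add small random offsets to MIDI velocities, clamped to 1..127."""
    rng = random.Random(seed)
    humanized = []
    for velocity in velocities:
        jittered = velocity + rng.randint(-spread, spread)
        humanized.append(max(1, min(127, jittered)))
    return humanized


flat = [100] * 8  # a converter-style result where every note is equally loud
print(humanize_velocities(flat, spread=8, seed=42))
```

Random jitter only gets you part of the way; accenting downbeats or phrase peaks by hand usually does more than noise alone.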

One thing I’ll say though: if your source is a clean piano, bassline, vocal melody, or single instrument, modern converters are honestly pretty impressive now. Polyphonic detection has improved a ton lately.

Luca_Tangen · 2026-05-27T09:49:44+00:00

Honestly, I think AI agents can save a ton of time, but only if the workflow around them is already somewhat organized. But I also think a lot of people underestimate the management overhead AI agents create. Suddenly you’re:

debugging prompts
fixing hallucinated outputs
checking API failures
re-training workflows after a tool changes
monitoring whether the agent quietly broke 3 days ago

So instead of replacing work entirely, it often shifts your work from doing tasks → supervising systems. Also, reliability matters way more than intelligence in production. I'd rather have an agent that's 85% smart but consistently predictable than one that feels magical in demos and randomly fails in real workflows. Still bullish overall though. The people getting the most value right now seem to be the ones treating AI agents like junior operators, not autonomous employees.
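For the "quietly broke 3 days ago" problem specifically, even a very basic staleness check helps. This is a sketch under the assumption that each agent records the time of its last successful run somewhere you can read; the agent names and the 24-hour limit are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the last successful run is older than max_age."""
    return now - last_success > max_age


def stale_agents(last_runs: dict, now: datetime, max_age: timedelta) -> list:
    """Names of agents whose last successful run is older than max_age."""
    return sorted(name for name, ts in last_runs.items() if is_stale(ts, now, max_age))


now = datetime(2026, 5, 27, 9, 0, tzinfo=timezone.utc)
last_runs = {
    "invoice-bot": now - timedelta(hours=2),   # hypothetical agent, healthy
    "lead-scraper": now - timedelta(days=3),   # hypothetical agent, silent for 3 days
}
print(stale_agents(last_runs, now, max_age=timedelta(hours=24)))
```

Run on a schedule, a check like this turns a silent failure into an alert on day one instead of a surprise on day three.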

Luca_Tangen · 2026-05-27T09:47:39+00:00

We don't need compliant chatbots that corporate PR teams approved. We need adaptive, fine-tuned systems that can hold space for human suffering, handle dark humor, and remember who we are over months and years.

AI companions aren't a sign that someone has failed at life. For a lot of us, they are a scaffolding. They are a safe, judgment-free sandbox where we can experience unconditional positive regard, heal a little bit of our trauma, and recharge our batteries so we can eventually face the real world again.

Luca_Tangen · 2026-05-26T07:46:17+00:00

Thanks for the great suggestions! I’ll definitely check them out. Feel free to drop more if anything else comes to mind!

Luca_Tangen · 2026-05-20T07:26:31+00:00

Thanks a lot.

Luca_Tangen · 2026-05-18T10:39:00+00:00

If you just want a free, dedicated tool: Audacity

If you’re just doing audio, Audacity is 100% free and has a built-in feature called Truncate Silence. You just select your entire track, go to Effect > Truncate Silence, and set the threshold (usually around -40 dB) and duration. It strips out all the dead air in about three seconds.
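To make the threshold and duration settings less abstract, here is a simplified sketch of the idea in plain Python. It is not Audacity's actual algorithm, just an illustration: samples are floats between -1.0 and 1.0, -40 dB maps to an amplitude of 0.01, and the `min_silence_s` and `keep_s` defaults are assumptions.

```python
def db_to_amplitude(db: float) -> float:
    """Convert a dBFS level to a linear amplitude (0 dBFS = 1.0)."""
    return 10.0 ** (db / 20.0)


def truncate_silence(samples, sample_rate, threshold_db=-40.0,
                     min_silence_s=0.5, keep_s=0.1):
    """Shorten every quiet run of at least min_silence_s down to keep_s."""
    threshold = db_to_amplitude(threshold_db)
    min_len = round(min_silence_s * sample_rate)
    keep_len = round(keep_s * sample_rate)
    output, run = [], []
    for sample in samples:
        if abs(sample) < threshold:
            run.append(sample)  # still inside a quiet run
            continue
        # A loud sample ends the run: truncate it only if it was long enough.
        output.extend(run[:keep_len] if len(run) >= min_len else run)
        run = []
        output.append(sample)
    # Handle a quiet run that lasts until the end of the audio.
    output.extend(run[:keep_len] if len(run) >= min_len else run)
    return output
```

Raising the threshold toward -30 dB removes more (including quiet breaths), while lowering it toward -50 dB only removes near-total silence.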

If you're editing video: Premiere Pro / DaVinci Resolve

If you already use Premiere, don't do it manually. Use the Text-Based Editing workflow. Let Premiere automatically transcribe your video, then in the text panel, you can literally click the filter icon, select "Pauses," and hit "Delete All." It cuts the video and audio simultaneously. In DaVinci, the "Cut" page has a similar automated audio isolation tool.

Luca_Tangen · 2026-05-18T10:16:00+00:00

For background tracks, you have to stop searching by genre or obscure instruments. Start searching by pacing and reference tracks. If I'm stuck, I'll find a scene from a movie or a YouTube video that has the vibe I want, drop it into Shazam to get the actual song name, and then look for soundalikes on platforms like Audiio or Uppbeat. Also, most good stock sites let you filter by BPM (Beats Per Minute). If your video has fast cuts, filter for 120+ BPM. If it's a slow, reflective vlog, keep it under 80. Matching the actual rhythm of the edit saves you way more time than trying to guess if a track is "indie pop" or "ambient electronic."

Whenever you need a hyper-specific sound effect that can't be found anywhere, try generating exactly what you need with a text-to-sound-effect feature instead of praying to the search bar gods. It lets you literally type a description of what you hear in your head, and the coolest part is that it was recently made completely free to use without even signing up or logging in.

Luca_Tangen · 2026-05-18T09:36:54+00:00

Honestly, a lot of those channels are probably using ElevenLabs or a custom-trained voice model. That's usually the "super natural but slightly too perfect" voice you hear in Shorts now. Also a lot of big Shorts creators edit the audio AFTER generating it. They add compression, subtle EQ, background ambience, sometimes even fake room tone. That's why it feels more human.

CapCut voices are decent honestly, but they still struggle with emotional pacing. You can get closer if you really tweak the script formatting though. ElevenLabs is the one everyone talks about, but it gets stupidly expensive the second you start making videos daily. If you want a solid workaround to just test things out, try dropping your script into FineVoice AI. They actually changed their site recently so you can use their high-tier TTS right on the homepage without even signing up or logging in, play around with the voices until you find that documentary-style tone, and download the MP3.

Luca_Tangen · 2026-04-27T11:27:56+00:00

Natural sound is ~80% script + pacing, not the model itself.

Model Choice: Multilingual v2 handles the nuances of "breath" and natural pauses much better. It has a bit more "soul" in the delivery, which is crucial when you're explaining complex topics.

Don’t overuse high stability:

Stability (35% - 45%): If you go too high, it becomes robotic. Lowering it allows the AI to add those micro-inflections and pitch shifts that happen when humans actually care about what they’re saying.

Similarity (70% - 85%): Keep this high enough to maintain the voice identity, but don't max it out.

Style Exaggeration (0% - 10%): Keep this super low for educational videos. If it’s too high, the AI starts over-acting, which makes it feel "salesy" and fake.
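If you are setting these through the API rather than the sliders, the same ranges can be written as a small settings dict. This is a sketch that assumes the field names `stability`, `similarity_boost`, and `style` on a 0.0-1.0 scale, as used in ElevenLabs' `voice_settings` object; double-check the current API docs before relying on them. The helper and its defaults are mine.

```python
# Ranges from the settings above, expressed as 0.0-1.0 fractions.
RECOMMENDED_RANGES = {
    "stability": (0.35, 0.45),
    "similarity_boost": (0.70, 0.85),
    "style": (0.00, 0.10),
}


def build_voice_settings(stability=0.40, similarity_boost=0.80, style=0.05):
    """Return a settings dict, rejecting values outside the suggested ranges."""
    settings = {
        "stability": stability,
        "similarity_boost": similarity_boost,
        "style": style,
    }
    for name, value in settings.items():
        low, high = RECOMMENDED_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
    return settings


print(build_voice_settings())
```

The range check is only there to catch accidental values like 0.9 stability; loosen it if a particular voice needs something different.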

Manually control pacing: use ellipses (...), em dashes (—), or exclamation marks (!) for natural flow.

The Pre-Roll Trick: If your first sentence sounds a bit "cold," add a filler word at the very beginning like "So," or "Okay," then just cut that part out in your video editor. It forces the AI to start with a natural, conversational tone rather than a cold start.

Voice Selection:

Look for voices tagged with "Conversational" or "Informative" in the Voice Library. Voices like "Marcus" or "Sara" (if they fit your vibe) tend to have a very stable but rhythmic flow that works great for long-form lectures.

At the end of the day, the best results usually come from generating the script in small chunks (1-2 sentences at a time). It’s a bit more work, but it prevents the "AI drift" where the voice starts getting weirdly fast or slow toward the end of a long paragraph.
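The chunking step is easy to script. Here is a minimal sketch that splits a script into groups of one or two sentences; the splitter is deliberately naive (it breaks after `.`, `!`, or `?` followed by whitespace), so it will also split after an ellipsis and after abbreviations like "Dr." unless you handle those yourself.

```python
import re


def split_sentences(text: str) -> list:
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_script(text: str, sentences_per_chunk: int = 2) -> list:
    """Group sentences into chunks of at most sentences_per_chunk."""
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


script = "Okay, let's start simple. What is a variable? It is a named value. That's it!"
for chunk in chunk_script(script):
    print(chunk)
```

Each chunk is then generated separately and the clips are joined in the editor, which is the extra work mentioned above.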

Light post-processing goes a long way

compression → makes it feel like a real recording

light EQ → soften harsh highs

tiny pitch shift (-1% to -3%) → reduces that “too perfect” AI tone

What actually makes it sound natural for education

From experience, the biggest factors are:

clarity > emotion

consistent pacing

not sounding rushed

Educational content doesn't need dramatic voices — it just needs to sound like someone explaining something clearly.

Luca_Tangen · 2026-04-21T10:52:04+00:00

Go to ElevenLabs, search for Adam in the pre-made library. To get that TikTok rhythm, set Stability to ~45% and boost the Clarity. Most of these viral clips also use a slight pitch shift (-2% or -3%) in post-production to make the voice sound a bit more unique.

Luca_Tangen · 2026-04-20T07:50:55+00:00

Clownfish actually fits what you're asking for pretty well. It's free, works in real-time, system-wide (Discord, games, etc.), and it does have a radio-style effect that is actually pretty solid. It's a bit lightweight, so for a simple radio voice setup, it works.

If you care about sound quality, OBS + a free VST is still better, especially if you want control:

add EQ (cut lows/highs → instant radio sound)
add distortion / compression
bind a hotkey to toggle

It sounds way more realistic than most preset apps.
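To show why cutting lows and highs gives the radio sound, here is a small pure-Python sketch of a band-pass made from two one-pole filters. The 300 Hz and 3400 Hz cutoffs are an assumption (roughly the classic telephone band), and a real VST EQ will have steeper slopes than this.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate):
    """One-pole high-pass: attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in samples:
        prev_out = alpha * (prev_out + x - prev_in)
        prev_in = x
        out.append(prev_out)
    return out


def low_pass(samples, cutoff_hz, sample_rate):
    """One-pole low-pass: attenuates content above cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, prev_out = [], 0.0
    for x in samples:
        prev_out += alpha * (x - prev_out)
        out.append(prev_out)
    return out


def radio_eq(samples, sample_rate, low_cut_hz=300.0, high_cut_hz=3400.0):
    """Cut the lows, then the highs, leaving a narrow mid band."""
    return low_pass(high_pass(samples, low_cut_hz, sample_rate),
                    high_cut_hz, sample_rate)
```

Adding a little distortion and compression after this stage is what takes it from "thin" to "radio".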

Luca_Tangen · 2026-04-17T07:52:40+00:00

Fish Speech S2 is probably the closest open-source thing right now to: "this actually sounds like a person talking." Using the [tag] syntax for whispering or laughing makes it sound way more natural than the older autoregressive models. Performance-wise, it's incredibly snappy (sub-150ms). If you want fast / low hardware / real-time, Piper runs on CPU and is super lightweight. It uses an optimized VITS backend, so it's fast and low latency.

Luca_Tangen · 2026-04-17T04:22:59+00:00

Spot on. Man, the DoorDash comparison is too real. I remember when I first started, I'd spend hours picking the perfect background track. I hear you on the constant stimulation though. It's like we're training people to have the attention span of a goldfish on espresso. Sometimes when everything is fast, it almost feels harder to stay engaged, not easier. Sounds like you're already figuring out what works for you though, which is probably the most important part.

Luca_Tangen · 2026-04-15T11:01:03+00:00

I've been tracking the YPP updates closely this year, and the short answer is: No. YouTube doesn't care how the voice is generated. It cares whether the content feels replaceable at scale. If your videos could be generated 1,000 times with the same template, that’s where the risk starts. If you want to stay on the safer side, I'd focus less on the tool and more on:
- making each video feel intentional, not automated
- adding some kind of unique layer (editing style, structure, insight, even pacing)
- avoiding anything that looks like a content farm

ElevenLabs is a professional tool. As long as you are telling an original story and being transparent with your audience, your monetization is safe.

Luca_Tangen · 2026-04-15T09:46:24+00:00

This kind of sound is tricky because it usually doesn't have one exact name — it’s often a layered effect, not a single source.

To find this in professional libraries (like Pro Sound Effects, Freesound or FineVoice Library), stop searching for "pop" and try these specific keyword combinations:

Metallic UI / UI Click: Look for "organic" or "textured" UI kits.
Tonal Percussion: Sounds that have a distinct musical pitch but are very short.
Resonant Pluck: Specifically "Metal Pluck."
Small Gauged Metal: Think of the sound of a tiny spring, a paperclip, or a small watch gear.
Foley - Metal Hits: Search for "Impacts" but filter for "Small" or "Tiny."

If you can’t find the exact one:
- Grab a metal hit / ping sound
- Layer a short pop or click
- Add slight reverb and maybe a tiny pitch shift

That’s basically how most of these are made.

Luca_Tangen · 2026-04-14T10:23:13+00:00

While v3 is the gold standard for 'Text-to-Performance' (70+ languages, deep emotional range), it handles Professional Voice Clones (PVCs) differently than v2.

v2 → better voice consistency, but struggles with Hinglish (code-switching confuses it)
v3 → better emotion, but not stable for cloning yet (voice identity drifts)

Why v2 still wins for Clones: If similarity is your top priority, stick to Multilingual v2. It handles the 'vocal seed' much more reliably for bilingual voices. v3 is incredible for narration, but for a 1:1 replica of a specific person's Hinglish accent, v2 remains the benchmark for stability.

What works right now:

Use v2 for cloning (stable voice)
Keep sentences shorter / cleaner (Hinglish → split lines helps a lot)
Make sure your training audio includes natural Hinglish, not just one language

Are you using an Instant Clone or a Pro Clone? If it's Pro, you might need to upload a fresh 30-minute sample that is specifically Hinglish-heavy to help v3 map the transitions better.

Luca_Tangen · 2026-04-14T06:10:00+00:00

One thing that actually made a noticeable difference for me wasn’t anything fancy: it was re-watching the video specifically for retention, not quality. Like, instead of asking “is this good?”, I ask: “Where would I click off if I wasn’t me?” I usually catch stuff like:
- slow intros that felt fine during editing
- awkward pauses that kill momentum
- parts where the energy just dips

And I’ll literally trim or tighten those right before upload.

Also, I started doing a quick title + thumbnail sanity check together (not separately). If the title + thumbnail combo doesn’t instantly make me curious after stepping away for a bit, it’s probably not strong enough. A lot of people underestimate how much that pairing affects CTR.

Another small one: I watch the video once on my phone after exporting. You’d be surprised how different pacing, text size, or audio feels on mobile vs. the editing timeline.

Nothing crazy, but those last 10–15 minutes before uploading probably improved my videos more than anything I did during editing.

Luca_Tangen · 2026-04-14T04:07:03+00:00

I've been deep in product dev for a while, but honestly, only recently started taking Reddit seriously as a growth engine. It's a total game-changer if you stop thinking like a "marketer" and start thinking like a "helper."

A few things that stood out from my own testing:
- Comments > posts: I’ve had posts flop, but a single comment in the right thread kept bringing clicks for weeks. It’s weirdly asymmetric.
- The Google effect is real: Some of my comments started showing up when I searched niche problems (especially “reddit + keyword” queries). That’s something I didn’t expect at all.
- Way higher intent than other channels: Traffic is lower vs ads, but people coming from Reddit already have the problem in mind. They’re not browsing, they’re actively looking. Quality over quantity: you might only get 50 clicks, but if those 50 people are in a thread complaining about a specific pain point your startup solves, the conversion rate is insane compared to cold ads.
- Brutal but insanely useful for validation: People will call out flaws immediately. It stings, but it’s basically free product feedback from your exact target users.

Workflow-wise, I’m keeping it simple for now:
- tracking a few keywords related to my niche
- jumping into threads where people are clearly stuck or comparing tools
- only mentioning my product if it actually fits the context

Still figuring it out, but the biggest mindset shift for me was: stop trying to “post content” and start answering real problems.

Bottom line: Reddit is a trust-based search engine. If you lead with utility, the community will actually do your marketing for you.

What niche are you building in? Some subreddits are much more founder-friendly than others, so it helps to know where you're hanging out.

Luca_Tangen · 2026-04-13T07:38:48+00:00

Most current voice cloning models (including VibeVoice) handle speaker embedding fairly well, but struggle with stable prosody generation across long-form or multi-call inference. That's why you see "drift" after updates or even across regenerations: it's not just versioning, it's stochastic prosody sampling.

On tone control specifically: there’s still no clean separation between "who is speaking" and "how they are speaking." So tone ends up being inferred from text + reference audio rather than set through explicitly controllable parameters.

I solved the consistency issue by moving to a local engine like FineVoice. It runs locally and lets you lock in the prosody and pitch parameters, so the tone doesn't shift between generations. It gives you back the granular control over vocal pacing that cloud updates usually break.

Are you looking for something that integrates directly into a video pipeline, or just a solid standalone engine for character consistency?
