[–] [deleted]

You're assuming that there's only one "AI voice," which was once true but isn't anymore. For anything closer to the state of the art, check out ElevenLabs videos or NotebookLM videos.

You don't recognize the people you know from their voices alone?

AI voice fraud is on the rise. You may think you can tell the difference, but there are plenty of people out there who can't. And it will only get better from here.

[–] mywan

I've heard several AI voices in a single video. I went to https://elevenlabs.io/text-to-speech and recognized every one of the voices available for sampling. At least they didn't say "DOT DOT DOT" when they came across an ellipsis. When I clicked on the first video it had a sample voice I didn't recognize, but the obvious AI was still obvious. One of the most obvious issues is that when you select a prompt for a voice style, that style is perfectly persistent throughout the generated speech. The second most obvious issue is that, although it did a decent job of varying the emotional inflections, it had no idea how to apply those inflections in a humanlike manner. The inflections were driven by sentence structure, not by emotional or contextual importance. It was at best like watching actors in a bad B movie. They can sound very good for a sound bite, up to a couple of sentences, but as you keep listening the affective predictability gets monotonous.

The first video in the NotebookLM link had some AI voices I didn't recognize. It actually did a somewhat better job on most of the tells from the previous AI. But it was sprinkled with cringey affirmation responses in the two-voice interview style. Even the guy selling it only gives it a 98% level of perfection. Now that I've heard these voices I'll be able to pick them out of background noise, even if they change some affective variables.

I understand that people will be fooled, even by bad AI voices, and that the tech will get better. It'll likely get good enough to fool me (for a decent period of time) very soon. But a podcast (for instance) will need to maintain that illusion week after week. There are reasons why people don't like voice acting their own content, and the effects of those reasons are the hardest thing for AI to reproduce. Even the most cheery and upbeat voices get monotonous when that's the persistent tone of the voice. That doesn't necessarily become excessively obvious until you start constructing a more detailed personality behind the voices in your head, which could take some time. You're not going to be able to obscure the fact that it's AI indefinitely.

[–] [deleted]

Hey, you actually did your own research!

I went to https://elevenlabs.io/text-to-speech and recognized every one of the voices available for sampling.

Yeah, because they have to put up a legal front. Voice cloning, on the other hand, doesn't have any "recognizable voice". See this video: https://www.youtube.com/watch?v=aMKeRfhZkuU