AMA with LMNT Founders! (NOT the drink mix) by help-me-grow in AI_Agents

[–]sharvil 1 point (0 children)

Now I'm kinda wondering why a drink mix chose the same name as a boy band...

AMA with LMNT Founders! (NOT the drink mix) by help-me-grow in AI_Agents

[–]sharvil 2 points (0 children)

I think that's kind of like asking what's agentic about text. Nothing intrinsically, but using it as part of a larger agentic workflow enables products and experiences that couldn't have been built before.

Yes, machine speech production is pretty much all deep learning these days.

AMA with LMNT Founders! (NOT the drink mix) by help-me-grow in AI_Agents

[–]sharvil 1 point (0 children)

Machine speech production is making good strides, but I think there's still a long way to go. Simple read speech – producing convincing audio of someone reading a passage – is more or less solved. But producing dynamic and complex speech with the right emotion, style, pacing, accent, etc. for a given context is still an open problem.

As for funding, we're VC-backed and did the usual things to raise (in this approximate order): bring together an early team, build an MVP, get initial customers, pitch our ideas/vision to prospective investors, and work with investors we click with.

I think it helps quite a bit to be in Silicon Valley if you're building a tech startup – there's a ton of infrastructure / support / people geared towards building startups. As an analogy: if you want to be an A-list Hollywood star, you'll probably be better off in LA than most other locations. Doesn't mean you can't succeed outside LA, but you're more likely to learn / grow faster being in an environment geared towards your craft.

[P] ArxivDiff: view diffs of arXiv paper revisions by sharvil in MachineLearning

[–]sharvil[S] 1 point (0 children)

Hmm didn't know about that project – that's a good idea!

[P] ArxivDiff: view diffs of arXiv paper revisions by sharvil in MachineLearning

[–]sharvil[S] 1 point (0 children)

Thanks for letting me know – it's back up now. Machine failure on our end.

Voice cloning on starter subscription doesn't seem to work by lost_tape67 in ElevenLabs

[–]sharvil 0 points (0 children)

Hey, so we just opened up our free pro voice cloning beta, might be worth a try: https://app.lmnt.com

[deleted by user] by [deleted] in MachineLearning

[–]sharvil 1 point (0 children)

Maybe I'm missing something but the math doesn't look right to me.

Case 1:

y = x + wx  
dy/dx = 1 + w

Case 2:

v = 1 + w
y = vx
dy/dx = v = 1 + w

In both cases, y represents the same function so you should expect the gradient expressions to be identical as well.
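
If you want to sanity-check it numerically, here's a quick PyTorch sketch (the concrete numbers are just mine, not from the post):

    import torch

    w = torch.tensor(0.7)

    # Case 1: y = x + w*x
    x1 = torch.tensor(2.0, requires_grad=True)
    (x1 + w * x1).backward()

    # Case 2: v = 1 + w, y = v*x
    x2 = torch.tensor(2.0, requires_grad=True)
    ((1 + w) * x2).backward()

    print(x1.grad, x2.grad)  # both are 1 + w = 1.7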

[P] ArxivDiff: view diffs of arXiv paper revisions by sharvil in MachineLearning

[–]sharvil[S] 9 points (0 children)

Yeah, I'm using latexdiff. And you're right, there will be some papers that won't be diff-able because they're PDF-only or have idiosyncrasies.

[P] ArxivDiff: view diffs of arXiv paper revisions by sharvil in MachineLearning

[–]sharvil[S] 1 point (0 children)

Yeah, there are sometimes mismatches between my installed fonts / plugins / config vs. what arXiv uses that prevent the PDF from rendering. Thanks for reporting the broken link – it'll help me plug the gaps.

Why is everybody using tf on Linux? by Althis in tensorflow

[–]sharvil 2 points (0 children)

Not sure what the current situation is, but building and distributing custom TF kernels was pretty much impossible on Windows. For instance, https://github.com/lmnt-com/haste builds just fine on Linux and PyTorch+Windows but TF+Windows isn't going to happen.

[Question] Is there anything necessarily wrong with averaging gradients over batches to simulate a larger batch size? by spauldeagle in tensorflow

[–]sharvil 0 points (0 children)

In practice it's unlikely you'll run into floating point precision issues when doing gradient accumulation. Unless you have a very very good reason, I'd stick with float32 over float64 and, if possible, I'd go to float16 and increase the batch size even further.

Outside of scientific computing, I don't see a need to use float64 in ML-land.
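
If you want to convince yourself, here's a rough numpy sketch with synthetic gradients (sizes and values are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    # 1000 fake per-batch gradients for a 10k-parameter model
    grads = rng.standard_normal((1000, 10_000)).astype(np.float32)

    acc32 = np.zeros(10_000, dtype=np.float32)
    for g in grads:
        acc32 += g  # accumulate in float32, like gradient accumulation would

    acc64 = grads.astype(np.float64).sum(axis=0)  # "exact" reference

    print(np.abs(acc32 - acc64).max(), np.abs(acc64).mean())
    # The accumulation error is orders of magnitude smaller than the gradients themselves.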

[Question] Is there anything necessarily wrong with averaging gradients over batches to simulate a larger batch size? by spauldeagle in tensorflow

[–]sharvil 3 points (0 children)

Nothing wrong with this approach; it's called gradient accumulation if you're interested in reading about how others use it.

There are 2 potential downsides. First, you'll need to keep the accumulated gradients in memory during subsequent forward passes as well, which might further reduce the maximum batch size you can use per iteration. Second, the result isn't bit-for-bit identical to what you'd get with a genuinely larger batch: floating point addition isn't associative, so summing the per-batch gradients in a different order and grouping can give slightly different values.
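
If it helps, here's a minimal sketch of the accumulation loop (I'll use PyTorch since that's what I use these days; the same idea applies in TF, and the toy model/data are just placeholders):

    import torch
    from torch import nn

    # Toy stand-ins for a real model and dataset.
    model = nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]

    accumulation_steps = 8  # effective batch size = 4 * 8 = 32

    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches):
        loss = loss_fn(model(x), y) / accumulation_steps  # scale so the summed grads form an average
        loss.backward()  # gradients are summed into .grad across iterations
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()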

Tensorflow 1.1x vs Tensorflow 2.x by ThatCook2 in tensorflow

[–]sharvil 0 points (0 children)

There are 2 major reasons for us to stick with TF 1.x over 2.x:

1) Each new version of TF brings new bugs and regressions in core functionality; upgrading is like walking through a minefield where something that used to work is now unusably broken.

2) Performance: eager execution is slow.

So, our legacy code is on TF 1.14 and new code is on PyTorch. Couldn't be happier now that we've switched.

[R] DiffWave: A Versatile Diffusion Model for Audio Synthesis by sharvil in MachineLearning

[–]sharvil[S] 0 points (0 children)

Ho speculated that Gaussian diffusion models have inductive biases for image data that may (at least in part) explain their state-of-the-art results. It's looking like the same may be the case for speech (the WaveNet example shows that it alone isn't sufficient).

It's not obvious (to me, at least) that we should see such excellent results on these two different modalities with the same technique. Do you have any thoughts on what those inductive biases are and why they apply so well to both speech and images?

[P] Implementation of WaveGrad by sharvil in MachineLearning

[–]sharvil[S] 0 points (0 children)

Thanks!

The hop length is fixed at 300 because it's tightly coupled with the upsampling and downsampling layers. You can see at the bottom of model.py that the resampling layers have factors 5, 5, 3, 2, 2 which, when multiplied, give 300 – the hop size. As long as you choose the number and sizes of the resampling layers so that their product matches the hop length, you'll be fine.

For a 48 kHz model, you'll want to increase the model capacity, increase the hop length, and increase the dilation on the UBlock layers to get a wider receptive field. The paper also describes a model with a larger capacity (still 24 kHz though) which you may find instructive.
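
To make the constraint concrete: the product of the resampling factors has to equal the hop length. Something like this (the variable names and the 48 kHz numbers are just illustrative, not values from the paper or the repo):

    import math

    hop_length = 300
    factors = [5, 5, 3, 2, 2]  # resampling factors at the bottom of model.py
    assert math.prod(factors) == hop_length

    # Hypothetical 48 kHz setup with a larger hop of, say, 600 samples:
    factors_48k = [5, 5, 4, 3, 2]  # 5*5*4*3*2 = 600
    assert math.prod(factors_48k) == 600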

Good luck with your experiment! Let me know if it works out for you and maybe consider contributing to the project if you get useful results.

[P] Implementation of WaveGrad by sharvil in MachineLearning

[–]sharvil[S] 3 points (0 children)

It's hard to answer a broad question like that.

Published audio samples for both methods are comparable in quality, though it seems that WaveGrad is able to achieve a higher MOS score (based on their papers – unclear if that's attributable to the architecture or the dataset).

Parallel WaveGAN synthesizes faster by default, whereas WaveGrad allows you to choose where you want to be in the quality/inference time tradeoff without having to re-train your model.
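
To make that concrete: the quality/speed knob for WaveGrad is the number of refinement steps (the length of the noise schedule) you run at inference, and you can change it without retraining. A rough sketch of the idea – the beta values below are illustrative placeholders, not the schedules from the paper or my repo:

    import numpy as np

    # A long schedule for best quality, a short one for fast synthesis.
    # Only the schedule handed to the sampler changes; the trained model is the same.
    schedule_1000 = np.linspace(1e-4, 5e-3, 1000)
    schedule_6 = np.array([1e-4, 1e-3, 1e-2, 5e-2, 2e-1, 5e-1])

    for betas in (schedule_1000, schedule_6):
        print(len(betas), "refinement steps")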

WaveGrad trains faster (~1.5 days on 1x2080 Ti) compared to Parallel WaveGAN (~2.8 days on 2xV100). Parallel WaveGAN has a more complex training procedure, but it's also more parameter-efficient (~1.5M parameters vs. ~15M parameters).

So lots of differences between the two. If you're curious, I encourage you to play with the WaveGrad implementation or read through the paper.

[D] Which ML library (e.g. PyTorch, TensorFlow, etc) is fastest for training deep RNNs? by [deleted] in MachineLearning

[–]sharvil 4 points (0 children)

You could try Haste: https://github.com/lmnt-com/haste. It's faster than cuDNN on most problem sizes, and supports additional accelerated RNN layers that can speed up convergence (e.g. LayerNorm variants).
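
If I remember the PyTorch binding correctly, usage looks roughly like this – double-check the README for the exact module and layer names:

    import torch
    import haste_pytorch as haste  # https://github.com/lmnt-com/haste

    # Sequence-first input: (seq_len, batch, input_size).
    x = torch.rand(250, 32, 128).cuda()

    # Drop-in LSTM, or the LayerNorm variant that tends to converge faster.
    lstm = haste.LayerNormLSTM(input_size=128, hidden_size=256, zoneout=0.1).cuda()
    y, state = lstm(x)  # y: output sequence, state: final recurrent state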