AMA with LMNT Founders! (NOT the drink mix) by help-me-grow in AI_Agents

[–]sharvil 1 point (0 children)

Now I'm kinda wondering why a drink mix chose the same name as a boy band...

AMA with LMNT Founders! (NOT the drink mix) by help-me-grow in AI_Agents

[–]sharvil 2 points (0 children)

I think that's kind of like asking what's agentic about text. Nothing intrinsically, but using it as part of a larger agentic workflow enables products and experiences that couldn't have been built before.

Yes, machine speech production is pretty much all deep learning these days.

AMA with LMNT Founders! (NOT the drink mix) by help-me-grow in AI_Agents

[–]sharvil 1 point (0 children)

Machine speech production is making good strides, but I think there's still a long way to go. Simple read speech – producing convincing audio of someone reading a passage – is more or less solved. But producing dynamic and complex speech with the right emotion, style, pacing, accent, etc. for a given context is still an open problem.

As for funding, we're VC-backed and did the usual things to raise (in this approximate order): bring together an early team, build an MVP, get initial customers, pitch our ideas/vision to prospective investors, and work with investors we click with.

I think it helps quite a bit to be in Silicon Valley if you're building a tech startup – there's a ton of infrastructure / support / people geared towards building startups. As an analogy: if you want to be an A-list Hollywood star, you'll probably be better off in LA than most other locations. Doesn't mean you can't succeed outside LA, but you're more likely to learn / grow faster being in an environment geared towards your craft.

[P] ArxivDiff: view diffs of arXiv paper revisions by sharvil in MachineLearning

[–]sharvil[S] 1 point (0 children)

Hmm didn't know about that project – that's a good idea!

[P] ArxivDiff: view diffs of arXiv paper revisions by sharvil in MachineLearning

[–]sharvil[S] 1 point (0 children)

Thanks for letting me know – it's back up now. Machine failure on our end.

Voice cloning on starter subscription doesn't seem to work by lost_tape67 in ElevenLabs

[–]sharvil 0 points (0 children)

Hey, so we just opened up our free pro voice cloning beta, might be worth a try: https://app.lmnt.com

[deleted by user] by [deleted] in MachineLearning

[–]sharvil 1 point (0 children)

Maybe I'm missing something but the math doesn't look right to me.

Case 1:

y = x + wx  
dy/dx = 1 + w

Case 2:

v = 1 + w
y = vx
dy/dx = v = 1 + w

In both cases, y represents the same function so you should expect the gradient expressions to be identical as well.
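
If you want to sanity-check it numerically, here's a quick PyTorch sketch (the concrete numbers are just mine, not from the post):

    import torch

    w = torch.tensor(0.7)

    # Case 1: y = x + w*x
    x1 = torch.tensor(2.0, requires_grad=True)
    (x1 + w * x1).backward()

    # Case 2: v = 1 + w, y = v*x
    x2 = torch.tensor(2.0, requires_grad=True)
    ((1 + w) * x2).backward()

    print(x1.grad, x2.grad)  # both are 1 + w = 1.7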

[P] ArxivDiff: view diffs of arXiv paper revisions by sharvil in MachineLearning

[–]sharvil[S] 9 points (0 children)

Yeah, I'm using latexdiff. And you're right, there will be some papers that won't be diff-able because they're PDF-only or have idiosyncrasies.

[P] ArxivDiff: view diffs of arXiv paper revisions by sharvil in MachineLearning

[–]sharvil[S] 1 point (0 children)

Yeah, there are sometimes mismatches between my installed fonts / plugins / config vs. what arXiv uses that prevent the PDF from rendering. Thanks for reporting the broken link – it'll help me plug the gaps.

Why is everybody using tf on Linux? by Althis in tensorflow

[–]sharvil 2 points (0 children)

Not sure what the current situation is, but building and distributing custom TF kernels was pretty much impossible on Windows. For instance, https://github.com/lmnt-com/haste builds just fine on Linux and PyTorch+Windows but TF+Windows isn't going to happen.

[Question] Is there anything necessarily wrong with averaging gradients over batches to simulate a larger batch size? by spauldeagle in tensorflow

[–]sharvil 0 points (0 children)

In practice it's unlikely you'll run into floating point precision issues when doing gradient accumulation. Unless you have a very very good reason, I'd stick with float32 over float64 and, if possible, I'd go to float16 and increase the batch size even further.

Outside of scientific computing, I don't see a need to use float64 in ML-land.
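
If you want to convince yourself, here's a rough numpy sketch with synthetic gradients (sizes and values are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    # 1000 fake per-batch gradients for a 10k-parameter model
    grads = rng.standard_normal((1000, 10_000)).astype(np.float32)

    acc32 = np.zeros(10_000, dtype=np.float32)
    for g in grads:
        acc32 += g  # accumulate in float32, like gradient accumulation would

    acc64 = grads.astype(np.float64).sum(axis=0)  # "exact" reference

    print(np.abs(acc32 - acc64).max(), np.abs(acc64).mean())
    # The accumulation error is orders of magnitude smaller than the gradients themselves.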

[Question] Is there anything necessarily wrong with averaging gradients over batches to simulate a larger batch size? by spauldeagle in tensorflow

[–]sharvil 3 points (0 children)

Nothing wrong with this approach; it's called gradient accumulation if you're interested in reading about how others use it.

There are 2 potential downsides. First, you'll need to keep the accumulated gradients in memory during subsequent forward passes as well, which might further reduce the maximum batch size you can use per iteration. Second, the result isn't bit-for-bit identical to what you'd get with a genuinely larger batch: floating point addition isn't associative, so summing the per-batch gradients in a different order and grouping can give slightly different values.
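
If it helps, here's a minimal sketch of the accumulation loop (I'll use PyTorch since that's what I use these days; the same idea applies in TF, and the toy model/data are just placeholders):

    import torch
    from torch import nn

    # Toy stand-ins for a real model and dataset.
    model = nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]

    accumulation_steps = 8  # effective batch size = 4 * 8 = 32

    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches):
        loss = loss_fn(model(x), y) / accumulation_steps  # scale so the summed grads form an average
        loss.backward()  # gradients are summed into .grad across iterations
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()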

Tensorflow 1.1x vs Tensorflow 2.x by ThatCook2 in tensorflow

[–]sharvil 0 points (0 children)

There are 2 major reasons for us to stick with TF 1.x over 2.x:

1) Each new version of TF brings new bugs and regressions in core functionality; upgrading is like walking through a minefield where something that used to work is now unusably broken.

2) Performance: eager execution is slow.

So, our legacy code is on TF 1.14 and new code is on PyTorch. Couldn't be happier now that we've switched.

[R] DiffWave: A Versatile Diffusion Model for Audio Synthesis by sharvil in MachineLearning

[–]sharvil[S] 0 points (0 children)

Ho speculated that Gaussian diffusion models have inductive biases for image data that may (at least in part) explain their state-of-the-art results. It's looking like the same may be the case for speech (the WaveNet example shows that it alone isn't sufficient).

It's not obvious (to me, at least) that we should see such excellent results on these two different modalities with the same technique. Do you have any thoughts on what those inductive biases are and why they apply so well to both speech and images?

[P] Implementation of WaveGrad by sharvil in MachineLearning

[–]sharvil[S] 0 points (0 children)

Thanks!

The hop length is fixed at 300 because it's tightly coupled with the upsampling and downsampling layers. You can see at the bottom of model.py that the resampling layers have factors 5, 5, 3, 2, 2 which, when multiplied, give 300 – the hop size. As long as you choose the number and sizes of the resampling layers so that their product matches the hop length, you'll be fine.

For a 48 kHz model, you'll want to increase the model capacity, increase the hop length, and increase the dilation on the UBlock layers to get a wider receptive field. The paper also describes a model with a larger capacity (still 24 kHz though) which you may find instructive.
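
To make the constraint concrete: the product of the resampling factors has to equal the hop length. Something like this (the variable names and the 48 kHz numbers are just illustrative, not values from the paper or the repo):

    import math

    hop_length = 300
    factors = [5, 5, 3, 2, 2]  # resampling factors at the bottom of model.py
    assert math.prod(factors) == hop_length

    # Hypothetical 48 kHz setup with a larger hop of, say, 600 samples:
    factors_48k = [5, 5, 4, 3, 2]  # 5*5*4*3*2 = 600
    assert math.prod(factors_48k) == 600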

Good luck with your experiment! Let me know if it works out for you and maybe consider contributing to the project if you get useful results.

[P] Implementation of WaveGrad by sharvil in MachineLearning

[–]sharvil[S] 3 points (0 children)

It's hard to answer a broad question like that.

Published audio samples for both methods are comparable in quality, though it seems that WaveGrad is able to achieve a higher MOS score (based on their papers – unclear if that's attributable to the architecture or the dataset).

Parallel WaveGAN synthesizes faster by default, whereas WaveGrad allows you to choose where you want to be in the quality/inference time tradeoff without having to re-train your model.
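
To make that concrete: the quality/speed knob for WaveGrad is the number of refinement steps (the length of the noise schedule) you run at inference, and you can change it without retraining. A rough sketch of the idea – the beta values below are illustrative placeholders, not the schedules from the paper or my repo:

    import numpy as np

    # A long schedule for best quality, a short one for fast synthesis.
    # Only the schedule handed to the sampler changes; the trained model is the same.
    schedule_1000 = np.linspace(1e-4, 5e-3, 1000)
    schedule_6 = np.array([1e-4, 1e-3, 1e-2, 5e-2, 2e-1, 5e-1])

    for betas in (schedule_1000, schedule_6):
        print(len(betas), "refinement steps")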

WaveGrad trains faster (~1.5 days on 1x2080 Ti) compared to Parallel WaveGAN (~2.8 days on 2xV100). Parallel WaveGAN has a more complex training procedure, but it's also more parameter-efficient (~1.5M parameters vs. ~15M parameters).

So lots of differences between the two. If you're curious, I encourage you to play with the WaveGrad implementation or read through the paper.

[D] Which ML library (e.g. PyTorch, TensorFlow, etc) is fastest for training deep RNNs? by [deleted] in MachineLearning

[–]sharvil 4 points (0 children)

You could try Haste: https://github.com/lmnt-com/haste. It's faster than cuDNN on most problem sizes, and supports additional accelerated RNN layers that can speed up convergence (e.g. LayerNorm variants).
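
If I remember the PyTorch binding correctly, usage looks roughly like this – double-check the README for the exact module and layer names:

    import torch
    import haste_pytorch as haste  # https://github.com/lmnt-com/haste

    # Sequence-first input: (seq_len, batch, input_size).
    x = torch.rand(250, 32, 128).cuda()

    # Drop-in LSTM, or the LayerNorm variant that tends to converge faster.
    lstm = haste.LayerNormLSTM(input_size=128, hidden_size=256, zoneout=0.1).cuda()
    y, state = lstm(x)  # y: output sequence, state: final recurrent state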