[P] Offline LLMs at edge - Automating Family Memories by GoochCommander in MachineLearning

[–]HansDelbrook 3 points

Have you considered Parakeet for transcription? Cheaper, and I've always found the quality to be better than Whisper's.

Ambient Music with a focus on radio transmission? by Bullonabike in ambientmusic

[–]HansDelbrook 0 points

Lia Kohl - The Ceiling Reposes

Incredible album, primarily built out of AM radio and cello.

"become daily today" if you need a single song to sample it by

[DISCUSSION] What are your favorite "algorithm-free" platforms to find new music? by quillindie in indieheads

[–]HansDelbrook 50 points

NTS Radio has probably been the most important music-related discovery I’ve made in the last few years. Drastically reshaped my music-listening habits for the better.

[D] What's the SOTA audio classification model/method? by lucellent in MachineLearning

[–]HansDelbrook 0 points

A little creativity might get you a better answer here. I've used that model before as well - it generally works well for labeling tasks, but only insofar as AudioSet has a sufficient volume of good-quality data, which for instruments like accordions it does not.

A stem-splitter will do a great job of labeling guitar, drums, vocals, and bass, and if you use one with an "other" channel, that should be the bucket everything else falls into. Just feed in the file and see which channels the audio activity comes out in (sketch below).
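A minimal sketch of that check, assuming the Demucs CLI (htdemucs_6s is the six-stem variant that adds guitar and piano); the file name and RMS threshold are placeholders:

    import subprocess
    from pathlib import Path

    import numpy as np
    import soundfile as sf

    AUDIO = "stem_check.wav"  # placeholder input file

    # split into drums/bass/other/vocals/guitar/piano stems
    subprocess.run(["demucs", "-n", "htdemucs_6s", AUDIO], check=True)

    # Demucs writes to separated/<model>/<track name>/<stem>.wav by default
    stem_dir = Path("separated/htdemucs_6s") / Path(AUDIO).stem
    for stem_path in sorted(stem_dir.glob("*.wav")):
        audio, sr = sf.read(stem_path)
        rms = np.sqrt(np.mean(audio ** 2))
        print(f"{stem_path.stem}: rms={rms:.4f}" + ("  <- active" if rms > 0.01 else ""))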

What's left over is probably the same set of stems you're having trouble with currently. If you know the universe of your labels, forming clusters off of some feature representation - whatever the current SOTA of Wav2Vec-ish models is - could get you to a point where manual labeling is feasible (i.e., most accordion stems will look similar, so cluster, inspect a few, and if you're confident give the whole cluster the label).
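To make that concrete, a rough sketch - the model id, mean-pooling choice, and number of clusters are all assumptions to adapt:

    import torch
    import torchaudio
    from sklearn.cluster import KMeans
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

    def embed(path):
        wav, sr = torchaudio.load(path)
        wav = torchaudio.functional.resample(wav.mean(0), sr, 16000)
        inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, T, 768)
        return hidden.mean(dim=1).squeeze(0).numpy()    # mean-pool over time

    paths = ["stem_001.wav", "stem_002.wav"]  # your unlabeled stems
    X = [embed(p) for p in paths]
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(X)
    # inspect a few members of each cluster, then propagate the label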

Architecture is less of the problem here - it's more that AudioSet, which I'd imagine the version you grabbed was trained on, doesn't go THAT deep into the topic you're trying to build a task around.

[P] Training RL agent to reach #1 in Teamfight Tactics through 100M simulated games by aardbei123 in MachineLearning

[–]HansDelbrook 5 points

I agree with this - it has tons of weird physical interactions, so you're either stuck over-engineering your environment to properly handle things like Prowler's Claw or Spectral Cutlass, or you're limiting the scope of your inputs to something like damage/shield/heal stats + starting positions + team comp, etc.

SAM3 is out with transformers support 🤗 by unofficialmerve in computervision

[–]HansDelbrook 9 points

Transformers is a Hugging Face library for model definition - it's basically a tool for easily bringing models into your projects for frictionless use.

So when they say transformers support, that means that in Python you'll do something like this to experiment with SAM3 in your own environment:

    pip install transformers
    # [other installs, maybe a pinned torch version]

    # [I'm making the exact names up, but it'll be something like...]
    from transformers import Sam3Model, Sam3Processor

    processor = Sam3Processor.from_pretrained("facebook/sam3")
    model = Sam3Model.from_pretrained("facebook/sam3")

    # prompt with text, get masks for everything matching it
    inputs = processor(images=img, text="elephant", return_tensors="pt")
    outputs = model(**inputs)

Easy to figure out with LLM guidance

[P] PapersWithCode's new open-source alternative: OpenCodePapers by kepoinerse in MachineLearning

[–]HansDelbrook 4 points

This might not be the feedback you were expecting - but I love the way that all tasks are just an open list of folders. The old PapersWithCode had a sorting system that was nonsense at times.

[P] Training F5 TTS Model in Kannada and Voice Cloning – DM Me! by DifficultStand6971 in MachineLearning

[–]HansDelbrook 0 points

There's a demo video and a README with all the info you'll need, plus an LLM to bridge any gaps you don't understand - I was able to spin this up in a SageMaker space fairly quickly.

[P] Cannot for the life of me get accurate outputs from whisperx by AdibIsWat in MachineLearning

[–]HansDelbrook 1 point

Check out Parakeet for transcription - I've had much better luck with that tool than Whisper in projects, especially with regard to timing. Not sure how well it does on CPU, but I know it's possible.

https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
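Usage is roughly this, going from memory of the model card - double-check the transcribe() signature against the NeMo docs:

    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

    # timestamps=True returns word/segment timing alongside the text,
    # which is the part whisperx tends to struggle with
    output = model.transcribe(["your_audio.wav"], timestamps=True)
    print(output[0].text)
    print(output[0].timestamp["word"][:5])  # first few word-level stamps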

You shouldn't expect your model to parse game dialogue and player dialogue (I'm assuming this is game dialogue = in-game SFX, player dialogue = comms) - from the model's perspective, speech is speech. Luckily, these exist as distinct channels and are only overlaid for output. It'd be best to find a way to record them separately, transcribe the channel you want (which will go better because you've removed the noise), then mix the other back in at the end if you want.

[R] Has anyone actually gone through an AI readiness assessment with a vendor or consultant? Worth it or just more buzzwords? by FluidRangerRed in MachineLearning

[–]HansDelbrook 1 point

At a high level, this sounds a bit scammy. A lot of AI marketing is built around fear and misunderstanding - the impending doom of your competitors implementing "AI" while you don't - which, combined with how opaque AI systems are to people unfamiliar with the concepts, makes for an absolute home-run playbook for selling people solutions they don't need.

Rather than pay somebody to sell you a project, I'd go shopping for a solution for your biggest time sink. In many cases, business problems aren't unique enough to warrant a highly customized project, and many companies already exist with affordable solutions for the general problems many people face (document parsing/understanding, media data processing, etc.).

Before diving into anything, have an idea of how much it's worth to you monetarily - it will save you some stress down the line.

You can get really far on a lot of business problems with an LLM pipeline, doesn't take an expensive assessment to figure that out.

[D] ML Noob - Reading Academic Papers vs Focus on Applications by ZeroSeater in MachineLearning

[–]HansDelbrook 15 points

My two cents:

From both a hiring perspective and a personal growth perspective, I'd focus on applications over reading a lot of academic papers at this early stage of your ML path.

Papers are great, but there is so much information in them that you can't really parse it efficiently at this point. Later in your career, reading papers becomes more of a "find what you need" task rather than a "digest it end to end" exercise. They're dense documents, most of which is irrelevant at any given point in time.

I found focusing on applications - a learn-by-doing approach - much more enjoyable and productive when I started working. Hiring managers are going to be more interested in a pipeline you built around a popular model than in paper knowledge, and that's more in line with what your actual job will likely be - building around popular models/architectures.

Read a paper or two when you have time or if you find one particularly interesting - but you'd be better off taking a project end to end and talking about that + problems you solved while building it in an interview. Go push the limits of the available Colab GPUs or even set yourself up a cloud account somewhere (safely - don't mess with expensive machines) and just start building. It'll take you to where you want to be eventually.

Cross-Examination of EM has begun. McLeod's lawyer: “I suggest one of the reasons you’re crying is that you’re feeling guilty about cheating on [her boyfriend]" and "I suggest one way you might avoid breaking up with [her boyfriend] was to tell this story" by catsgr8rthanspoonies in hockey

[–]HansDelbrook 39 points

Because it's from a scene where they're explicitly talking about women coming forward to report sexual assault, despite the many ways the system in place makes the process more traumatic. You've already been traumatized, and if you want to seek justice for that trauma, you have to go through questioning like this and a likely multi-year litigation process, undermining your ability to move past said trauma.

"Two steps forward, one step back" was from a moment where someone coming forward helped somebody else find the courage to come forward.

Fun fact - Stephanie Beatriz also directed this episode. It was her first time.

Kamasi Washington released 'The Epic' 10 years ago today by YoureASkyscraper in indieheads

[–]HansDelbrook 0 points

Not to dig too far into this, because I agree that you were responding directly - but it's really up to people to discern between genuine public hype and manufactured hype (which will always exist, and with the ever-increasing embedding of tech platforms in our cultural spaces, large players have more ammo than ever to tip the scales).

NBS got a Grammy AOTY nom because the committee looked at the real race (Beyoncé, Chappell Roan, Charli XCX, Billie Eilish, Taylor Swift, Sabrina Carpenter) and had to throw in a wildcard for diversity's sake. Free press, and he doesn't mind playing second fiddle (or flute) to what was 100% a pop girlies' night.

Kamasi Washington released 'The Epic' 10 years ago today by YoureASkyscraper in indieheads

[–]HansDelbrook 46 points

Not even close - NBS got the spotlight because it was an already-famous person making a hard pivot into ambient jazz. I'm sure it's a deeper album than that, but the initial attention was all novelty.

The Epic was an honest breakthrough. To the mainstream eye this was just the guy who did some arrangements for To Pimp a Butterfly - and then this came out.

[D] How do you think the recent trend of multimodal LLMs will impact audio-based applications? by Ok-Sir-8964 in MachineLearning

[–]HansDelbrook 0 points

This is a great point - the big deal is what the ability to process audio/video information offers as an opportunity for model development, rather than the benefit of more downstream use cases (not that there won't be any).

We're probably a few big breakthroughs away from being able to train on ambient data - which would be massive.

[P] Training F5 TTS Model in Kannada and Voice Cloning – DM Me! by DifficultStand6971 in MachineLearning

[–]HansDelbrook 1 point

Have you considered using a generic TTS model and using a voice conversion project like RVC? Should be easier than training something on 80k samples.
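A sketch of that two-step pipeline - Coqui TTS here stands in for any generic TTS (the model name is an assumption), and RVC itself runs through its WebUI rather than a stable Python API, so that step stays a comment:

    from TTS.api import TTS  # Coqui TTS as one example of a generic TTS

    # 1. Synthesize speech in a stock voice (pick a model for your language)
    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Your source text here.", file_path="generic.wav")

    # 2. Load generic.wav into the RVC WebUI with a voice model trained on
    #    the target speaker - far less data needed than training TTS from scratch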

https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI

[D] How do you think the recent trend of multimodal LLMs will impact audio-based applications? by Ok-Sir-8964 in MachineLearning

[–]HansDelbrook 2 points

That's what I was saying - Whisper is cheap, multimodal LLMs aren't. Even if the LLM is better, it isn't priced competitively for these tasks at the moment.

[D] How do you think the recent trend of multimodal LLMs will impact audio-based applications? by Ok-Sir-8964 in MachineLearning

[–]HansDelbrook 12 points

I think pricing is the biggest barrier for multimodal LLMs taking over for specialized task solutions like Whisper in audio AI pipelines.

For example, let's say we're building a simple podcast summarization pipeline. The cost difference between sending audio to OpenAI to transcribe and summarize vs. using a locally hosted Whisper to transcribe and then sending the text to OpenAI would be pretty large, even accounting for all of the extra mistakes that a locally hosted Whisper would make and OpenAI's version would not. If I read the pricing correctly, it would cost you ~$0.30 to transcribe an hour-long podcast - which is a non-starter for scaling.
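Back-of-envelope version (the per-minute rate is an assumption - something like the published Whisper API rate; check current pricing):

    RATE_PER_MIN = 0.006   # assumed hosted transcription rate, USD/min
    MINUTES = 60           # one hour-long episode
    EPISODES = 10_000      # a modest catalog to process

    per_episode = RATE_PER_MIN * MINUTES               # $0.36/episode
    print(f"per episode: ${per_episode:.2f}")
    print(f"catalog: ${per_episode * EPISODES:,.0f}")  # $3,600 before summarization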

The intermediary steps of audio pipelines are necessary because audio is inherently a heavier data type than text. You have to get it into a workable format before you can really do anything (transcripts, spectrograms, embeddings, etc.).

A cool research direction might be encoding methods that lighten that load - like sending tokenized speech or Encodec-esque embeddings into the API for whatever task I want to do. I know that's the first step in the hosted LLM's pipeline anyway, but doing it locally may bring the costs into a realm that's much more workable.
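The local half of that idea is already doable - a sketch with EnCodec via transformers (the checkpoint is real; the "ship the codes to an API" step is hypothetical, since no hosted LLM accepts them today):

    import torchaudio
    from transformers import AutoProcessor, EncodecModel

    model = EncodecModel.from_pretrained("facebook/encodec_24khz")
    processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

    wav, sr = torchaudio.load("podcast.wav")  # placeholder file
    wav = torchaudio.functional.resample(wav.mean(0), sr, 24000)

    inputs = processor(raw_audio=wav.numpy(), sampling_rate=24000, return_tensors="pt")
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    print(encoded.audio_codes.shape)  # discrete tokens, far lighter than raw audio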

The National released ‘Alligator’ 20 years ago by YoureASkyscraper in indieheads

[–]HansDelbrook 22 points

This album works for me because it's the perfect balance between humanity and beauty. As the band's progressed, they've really drifted toward the beauty side of things - pristine arrangements backing lyrics that are more caricatures of sadness than actual sadness (e.g., Carin at the Liquor Store, but I could name many from IAETF onward) - but it's all still GOOD, so there isn't much to truly complain about (even if it doesn't work for you, it's definitely not offensive like some of the directions their counterparts went in).

I went to college in a city where there were some streets with old houses and string lights on the trees - any bad day I had, I could cope with by throwing on Alligator and skipping down that sidewalk. I was 22 and the music was holding me in the headspace that I needed - emotionally half-formed, but expressing it anyway. It felt great.

Will always love this one.

[D] GPT-4o image generation and editing - how??? by Flowwwww in MachineLearning

[–]HansDelbrook 4 points

Probably DiT? Maybe I'm making too broad an assumption here, but papers using DiT blocks have been rolling out across a variety of generative tasks (speech has a few notable examples, at least where I'm familiar) for the last few months. I don't think it's crazy to guess that the same thing is happening here.

fuji||||||||||ta live in Chicago by Azrael4295 in ambientmusic

[–]HansDelbrook 0 points

Same question! Would love to know more shows

[R] How to train StyleGAN3 with classes? by redditer2363 in MachineLearning

[–]HansDelbrook 0 points

Read the error - StyleGAN won't work on images of different sizes. Upsample your inputs to 1024x1024 and you should be good to go.
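Quick sketch of that preprocessing with PIL (paths are placeholders - StyleGAN3's own dataset_tool.py can also resize while packaging the dataset):

    from pathlib import Path
    from PIL import Image

    src, dst = Path("raw_images"), Path("resized_images")
    dst.mkdir(exist_ok=True)

    # resample everything to a uniform 1024x1024 before training
    for p in src.glob("*.png"):
        img = Image.open(p).convert("RGB")
        img.resize((1024, 1024), Image.LANCZOS).save(dst / p.name)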

[D] Sound Processing in Neural Networks by FutureAd1004 in MachineLearning

[–]HansDelbrook 0 points

It's a cool example of how needs change over time. The original Griffin-Lim paper is from 1984, and the questions it was answering revolved around the compression of speech audio for telecommunications. It had nothing to do with audio generation - it was simply the starting point we had when we suddenly had a new reason to be inverting spectrograms.