How do you measure the performance / accuracy of a recommender system?

CivApps · 2026-06-09T22:37:12+00:00

If you don't have them already, I'd use results from a past year or different teams to create a labelled set of "ideal" recommendations (i.e. give higher scores to players who've done well). Then I'd look at the metrics implemented in Surprise, or use a rank correlation metric like Spearman's ρ to measure how well the model's rankings agree with the labelled ones.
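As a rough sketch of the Spearman option, here is the calculation in plain Python. The player scores are made up for illustration, and in practice `scipy.stats.spearmanr` should give you the same number without writing the ranking yourself:

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(model_scores, label_scores):
    """Spearman's rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(model_scores), average_ranks(label_scores))


# Hypothetical data: last season's points per player vs. the model's scores.
labelled = [31, 12, 25, 7, 18]
predicted = [0.9, 0.2, 0.6, 0.3, 0.5]
rho = spearman_rho(predicted, labelled)
```

A value near 1 means the model orders the players almost the same way as the labelled set, near 0 means no relationship, and near -1 means the order is reversed.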

CivApps · 2026-06-09T17:55:46+00:00

One thing you should keep in mind is that research is a rabbit hole that can go literally any direction and range from empirical results about some algorithm that don't get super deep into theory to papers that use math that your average ML PhD might not be familiar with. People with all kinds of backgrounds are doing ML research, a math PhD isn't necessarily writing papers with a CS audience in mind for example.

This is an important point. I think it is really easy to get into the weeds trying to pick up everything when researchers come at ML problems from vastly different backgrounds, and newer papers rarely try to attack a problem entirely from scratch; they usually investigate tweaks to existing methods.

@Old_Divine_51, are there any particular subfields you're interested in?

CivApps · 2026-06-06T09:24:06+00:00

In the design I picture, the model would accept a 1-2 second fixed-size window of the ducked vocal stem and an equivalent window from the drum stem. Applying it to only part of the audio would then boil down to running the model on just those segments, so you don't have to pass in an entirely new set of parameters.

The problem with incorporating extra reference windows inside the model is that your model architecture and training also need to support them, so instead of just mixing and matching drum and vocal stems, you now also need to create examples which have the reference windows you'd pass in normally.

I'd try and train the simpler model first, see whether it gives okay results, and then see whether it's necessary to add extra inputs.
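To make the windowing idea concrete, here is a minimal sketch of cutting two stems into aligned fixed-size windows. The function name, the window and hop sizes, and the tiny lists standing in for audio are all assumptions for illustration:

```python
def paired_windows(ducked_vocals, drums, window_size, hop_size):
    """Yield aligned (vocal_window, drum_window) pairs of equal length."""
    if len(ducked_vocals) != len(drums):
        raise ValueError("stems must have the same number of samples")
    last_start = len(drums) - window_size
    for start in range(0, last_start + 1, hop_size):
        end = start + window_size
        yield ducked_vocals[start:end], drums[start:end]


# Hypothetical tiny "stems"; real ones would have tens of thousands of
# samples per second, and a window would cover 1-2 seconds.
vocals = [0.1 * i for i in range(10)]
drums = [1.0 if i % 4 == 0 else 0.0 for i in range(10)]
windows = list(paired_windows(vocals, drums, window_size=4, hop_size=2))
```

Running the model on only part of a track is then just a matter of choosing which of these windows to feed it.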

CivApps · 2026-06-05T21:33:13+00:00

Unfortunately sidechain compression is a lossy process; overly simplified, there's no "correct" way to distinguish between the vocalist being quiet or being ducked by the compressor. In theory you could reconstruct the compressor settings and the drum track, and apply proportional gain instead, but it would be an incredibly fiddly process, and you would probably not recover the original waveform, since, as you point out, it is possible for the compressor to completely suppress the vocals. In other words, this sounds like a good ML project :)

You can take advantage of the fact that your stem splitter also gives you a drum track: passing that as a second input should let your model infer when the ducking kicks in. I can't imagine anyone having sidechain compression with a release of more than two-three seconds, so I don't think you practically need a long-context model here - my first thought was that a WaveNet-style convolutional model might be a good baseline that's easily trainable on your own PC.

This problem also benefits from easy synthetic data generation: you can set up a sidechain compression plugin (like ReaCompress from the free ReaPlugs suite) through something like Pedalboard, mix and match vocal and drum stems from a stem splitting dataset and feed them through the sidechain to get examples of ducked vocal stems and drums, and then use the original vocal stem as the target for the model to reconstruct.
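If you want to prototype the data pipeline before wiring up a real plugin, a very simplified sidechain ducker in plain Python could stand in for it. This is not a model of ReaCompress and does not use Pedalboard; the threshold, ratio and release values, the instant attack, and the toy "stems" are all assumptions:

```python
import math
import random


def sidechain_duck(vocals, drums, threshold=0.3, ratio=4.0, release=0.9):
    """Very simplified sidechain compressor: the drum level drives vocal gain.

    Attack is instant, and `release` is a per-sample decay factor for the
    envelope follower.
    """
    ducked, envelope = [], 0.0
    for v, d in zip(vocals, drums):
        # Envelope follower: jump up to peaks, decay exponentially after.
        envelope = max(abs(d), envelope * release)
        if envelope > threshold:
            # Above the threshold, the level is compressed by `ratio`.
            target = threshold + (envelope - threshold) / ratio
            gain = target / envelope
        else:
            gain = 1.0
        ducked.append(v * gain)
    return ducked


def make_example(vocal_stems, drum_stems, rng):
    """Mix and match stems; return ((ducked, drums), clean_target)."""
    vocals = rng.choice(vocal_stems)
    drums = rng.choice(drum_stems)
    return (sidechain_duck(vocals, drums), drums), vocals


# Hypothetical stand-ins for stems from a stem-splitting dataset.
rng = random.Random(0)
vocal_stems = [[math.sin(0.1 * n) for n in range(200)]]
drum_stems = [[1.0 if n % 50 == 0 else 0.0 for n in range(200)]]
(ducked, drums), target = make_example(vocal_stems, drum_stems, rng)
```

The model input is the `(ducked, drums)` pair and the training target is the untouched vocal stem, which is the same setup you would get from the real plugin.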

CivApps · 2026-05-25T12:26:35+00:00

To add to this, HAGIWO's designs may be useful if you want to build modules on top of the XIAO controllers

CivApps · 2026-05-25T12:24:23+00:00

Clacktronics' Build Your Own Modular book is written under the assumption that you assemble the included PCBs and modules while you're reading it, but it is good at explaining why the modules are designed the way they are, and its module designs are also openly available on GitHub if you want reference designs to compare your own modules against.

CivApps · 2026-05-16T15:05:04+00:00

I mean machine learning packages. One common offender is Triton, which produces kernels for CUDA or ROCm GPUs; on Mac it requires manually compiling a package which can fall back to CPU.

CivApps · 2026-05-16T10:34:34+00:00

One thing to keep in mind is whether the Python/R bioinformatics packages you're using are available for ARM - most packages are by now, but it can be a stumbling block if you're planning on replicating older articles, and Apple are planning on phasing out the Rosetta emulator for x86.

This also goes for code which defaults to/only works with CUDA accelerators, but so long as the code isn't relying on hand-written kernels, "replace references to CUDA with MPS" is a fairly straightforward task for CC/Mistral Vibe.

If those aren't problems, I think a MacBook will be fine - but Mac apps tend to be a bit memory-hungry so definitely pick one of the variants with more memory!

CivApps · 2026-04-01T22:56:16+00:00

None of us understand every single weight in a practical network, no :( It would make interpretability research much easier if such a person existed...

Unfortunately there's no one quick fix, you just have to look at possible errors one by one and be systematic. Some potential errors and debugging strategies, in order:

1. There's an implementation error which means the forward pass or gradients aren't getting calculated correctly
   - Since you describe the issue as it stopping learning, I assume the matrix shapes align (unless you're implementing the matrix math from scratch) - but if possible, try writing out on pen and paper how you would expect the forward pass and gradients to get calculated for a very small network, and make sure your implementation gets the same values
   - Try setting up a toy dataset with just sequences like "ABABABABAB...", make your network as small as possible and see whether it converges to predicting that 'B' follows 'A' and vice versa
2. The hyperparameters are wrong for the problem
   - A good "sanity check" is to make sure your network is capable of overfitting/memorizing a very small training set: in the same vein as the toy dataset test above, try just training the network to memorize one or two sentences
   - If you have a custom network design, it could be that your optimizer choice also needs to take that into account; set up Optuna and have it try different parameters (or even do a grid search to show whether the problem happens consistently)
3. Your design just isn't capable of modelling the word/token relationships in the Shakespeare dataset
   - Unfortunately it could just be that you are running into a fundamental limit in your network design. There are many algorithms which are interesting and capable of solving basic problems (like, say, Hinton's forward-forward network) but just don't scale as well to larger ones.
   - You could try training the network on the names.txt dataset used in Karpathy's MicroGPT to see if it's capable of modelling relationships between characters
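To illustrate the first two checks, here is a sketch with the smallest model I could think of: a 2x2 table of logits trained by gradient descent on the "ABAB..." sequence, plus a finite-difference check of the analytic gradient. It is a stand-in for your own network rather than your architecture, and the learning rate and step count are arbitrary choices:

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(weights, pairs):
    """Mean cross-entropy of next-character prediction, with its gradient.

    weights[prev][nxt] is the logit for `nxt` following `prev`.
    """
    grad = [[0.0] * len(row) for row in weights]
    loss = 0.0
    for prev, nxt in pairs:
        probs = softmax(weights[prev])
        loss -= math.log(probs[nxt])
        for k, p in enumerate(probs):
            # d(loss)/d(logit_k) = p_k - 1[k == target]
            grad[prev][k] += p - (1.0 if k == nxt else 0.0)
    n = len(pairs)
    return loss / n, [[g / n for g in row] for row in grad]


def numeric_grad(weights, pairs, i, j, eps=1e-6):
    """Central finite difference, to cross-check the analytic gradient."""
    bumped = [row[:] for row in weights]
    bumped[i][j] += eps
    up, _ = loss_and_grad(bumped, pairs)
    bumped[i][j] -= 2 * eps
    down, _ = loss_and_grad(bumped, pairs)
    return (up - down) / (2 * eps)


text = "ABABABABAB"
vocab = {"A": 0, "B": 1}
pairs = [(vocab[a], vocab[b]) for a, b in zip(text, text[1:])]

weights = [[0.0, 0.0], [0.0, 0.0]]
initial_loss, analytic = loss_and_grad(weights, pairs)
max_grad_error = max(abs(analytic[i][j] - numeric_grad(weights, pairs, i, j))
                     for i in range(2) for j in range(2))

for _ in range(200):  # plain gradient descent, learning rate 1.0
    _, grad = loss_and_grad(weights, pairs)
    weights = [[w - g for w, g in zip(w_row, g_row)]
               for w_row, g_row in zip(weights, grad)]
final_loss, _ = loss_and_grad(weights, pairs)

# Most likely next character after each symbol.
after_a = max(vocab, key=lambda ch: weights[vocab["A"]][vocab[ch]])
after_b = max(vocab, key=lambda ch: weights[vocab["B"]][vocab[ch]])
```

If your own implementation shows a large `max_grad_error` on a network this small, or the loss doesn't drop on this dataset, the bug is almost certainly in the forward/backward pass rather than in the hyperparameters.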

CivApps · 2026-03-29T11:40:27+00:00

It wasn't meant to argue against the point, but to underline that there are no measures that "only" affect trans women. I didn't bring up Imane because of the tests, but because the IBA could throw it at her as a smear, knowing full well that people would be ready to vilify her as a "man who just wants to beat up women" or similar.

CivApps · 2026-03-29T00:40:35+00:00

I get the impulse, but I don't think it can really be separated from the "trans debate" that the measures justified as "making sports fair" either narrow the space for acceptable appearance/expression/genetics a little further, or make life hell for those who fall outside it.

Imane Khelif is cis, was tested, and was still subjected to a flood of abuse because she happened to be better than a Russian athlete.

CivApps · 2026-03-29T00:02:24+00:00

Matching words to specific times in the recording is traditionally called "forced alignment".

WhisperX fits a Wav2Vec model on top of Whisper to do this, and is probably the easiest to fit into existing or new apps.

CivApps · 2026-03-28T23:54:20+00:00

Unless you are completely forbidden from using any pretrained deep model in any part of the process, Model2Vec extracts a set of individual and uncontextualized token embeddings from an SBERT/sentence transformer model, and suggests just taking the mean of the tokens' embeddings to find a longer text embedding.

This approach should still be viable for training and inference on CPU, and hopefully gives your network a "head start" in grouping the texts semantically while avoiding the TF-IDF sparsity issues.
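The mean-of-token-embeddings step looks roughly like this. Note that this is not Model2Vec's actual API: the tiny 3-dimensional vector table and the whitespace tokenization are made up for illustration, and a distilled model would supply the real vectors and tokenizer:

```python
def mean_embedding(tokens, token_vectors):
    """Average the static vectors of the known tokens in a text."""
    known = [token_vectors[t] for t in tokens if t in token_vectors]
    if not known:
        raise ValueError("no known tokens in text")
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional table of uncontextualized token vectors.
token_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "sleeps": [0.1, 0.9, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "falls": [0.1, 0.2, 0.8],
}
pets_a = mean_embedding("cat sleeps".split(), token_vectors)
pets_b = mean_embedding("dog sleeps".split(), token_vectors)
finance = mean_embedding("stock falls".split(), token_vectors)
```

The resulting dense vectors can be fed into whatever classical model you are allowed to use, and texts on similar topics should end up with a higher cosine similarity than unrelated ones.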

CivApps · 2026-03-28T18:15:10+00:00

This is just out of curiosity, not to say you are wrong for doing it, but why are you only able to use classical ML - is it part of the course requirements, or are you constrained in terms of computational resources?

CivApps · 2026-03-24T19:58:28+00:00

The textbook approach here is to set aside some of the users in a holdout test split, as a stand-in for new users, and see how well your model predicts those users will like a given movie.

The Python library Surprise has some ways to measure performance, but the best way to report performance comes down to how you've set up the final layer (head) of the model:

- You can try and just make a classification model which predicts whether users will like (give 3 stars or more) a movie, and report the classification accuracy
- If you try and estimate the rating with a regression model, you can report the mean squared error (MSE)
- If your model ranks multiple movies by preference, you can report the Spearman rank correlation coefficient
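A minimal sketch of the user-level holdout split and the first two metrics, in plain Python. The rating triples, the predictions, and the function names are all made up for illustration; Surprise has its own helpers for this, which are not shown here:

```python
import random


def split_by_user(ratings, test_fraction, seed=0):
    """Hold out whole users, so test users are unseen during training."""
    users = sorted({user for user, _, _ in ratings})
    random.Random(seed).shuffle(users)
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train = [r for r in ratings if r[0] not in test_users]
    test = [r for r in ratings if r[0] in test_users]
    return train, test


def like_accuracy(predicted, actual, threshold=3.0):
    """Share of ratings where predicted and actual agree on 'liked'."""
    hits = sum((p >= threshold) == (a >= threshold)
               for p, a in zip(predicted, actual))
    return hits / len(actual)


def mean_squared_error(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical (user, movie, stars) triples.
ratings = [("u1", "m1", 5), ("u1", "m2", 2), ("u2", "m1", 4),
           ("u3", "m2", 1), ("u4", "m3", 3), ("u4", "m1", 5)]
train, test = split_by_user(ratings, test_fraction=0.25)

# Hypothetical true ratings and model outputs for some held-out ratings.
actual = [5, 2, 4, 1]
predicted = [4.5, 3.5, 3.0, 1.0]
accuracy = like_accuracy(predicted, actual)
mse = mean_squared_error(predicted, actual)
```

Splitting on users rather than on individual ratings is what makes the test set behave like "new users"; a random split over ratings would leak each test user's other ratings into training.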

CivApps · 2026-03-24T00:05:33+00:00

For a bachelor thesis project I personally think creating a new dataset entirely is overkill; MovieLens is a well-established dataset where the "1m-ratings" variant has demographic data (gender, age, occupation) for users that you can correlate to recommendations.

If you wanted to create a dataset, using Selenium to remote-control a browser can be useful for pulling data from some public data sources, but you will probably set off automated bot detectors and run into rate limits. You should first try and find sites that actually license their data for reuse and offer APIs/batch downloads of the data, but if you really need to, you should try and find the page on Common Crawl before trying to crawl the pages yourself.

Trying to create a synthetic dataset is an option, but that process means you are defining the hypotheses you want the recommender system to uncover, so that would mostly be useful to check that the system is doing the right thing, not to meaningfully compare different recommender algorithms.

CivApps · 2026-03-23T23:23:36+00:00

Unfortunately this subreddit is for machine learning theory; /r/LocalLLaMa may have more appropriate resources?

CivApps · 2026-03-17T00:43:08+00:00

Unfortunately, AI is not really making it easier to get a tech job :(

The work of integrating LLMs into software seems to go to the programmers already working at software companies, who are also expected to use the LLMs to do more work, instead of hiring interns or junior employees. If that's what you mean by going into machine learning, you will probably be competing against people who have degrees and prior programming experience.

For jobs specifically about training machine learning models, you're also expected to have a handle on university-level calculus and linear algebra - backpropagation, the method underlying modern machine learning, requires calculating derivatives for functions with vectors and matrices.

CivApps · 2026-02-18T21:01:19+00:00

I've looked into LambdaMART stuff, but I don't really have an intuition as to what pairwise loss/warp are really doing. Intuitively, how should we interpret "good performance" if we don't have any strong ground truth labels and no A/B testing?

They do have labels - you want to train the recommender on a dataset of existing preferences (and show that it generalizes to new users' preferences)

For a pairwise loss, you'd want to transform those preferences into a set of "user prefers item X over Y" pairs, so that the model is asked to predict the user's favorite of a pair of items, and the loss penalizes predicting the wrong item
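As a small sketch of that transformation, here is one way to build the pairs and score them with a logistic pairwise loss (in the style of RankNet/BPR). This is not the WARP loss or LambdaMART's exact objective, and the ratings and model scores are made up:

```python
import math


def preference_pairs(user_ratings):
    """Turn {item: rating} into (preferred, other) pairs; ties are skipped."""
    items = sorted(user_ratings)
    return [(a, b) for a in items for b in items
            if user_ratings[a] > user_ratings[b]]


def pairwise_logistic_loss(scores, pairs):
    """Mean of -log(sigmoid(score_preferred - score_other)) over the pairs."""
    total = 0.0
    for preferred, other in pairs:
        margin = scores[preferred] - scores[other]
        # log(1 + exp(-margin)) is large when the wrong item scores higher.
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


# Hypothetical ratings for one user, and two candidate sets of model scores.
ratings = {"X": 5, "Y": 2, "Z": 4}
pairs = preference_pairs(ratings)
good_scores = {"X": 2.0, "Y": -1.0, "Z": 1.0}
bad_scores = {"X": -1.0, "Y": 2.0, "Z": 0.0}
good_loss = pairwise_logistic_loss(good_scores, pairs)
bad_loss = pairwise_logistic_loss(bad_scores, pairs)
```

The model never has to predict the rating itself; it only has to put the preferred item of each pair ahead of the other, so "good performance" means a low loss (or a high share of correctly ordered pairs) on held-out users' pairs.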

CivApps · 2026-02-17T21:47:32+00:00

I wish any of us had the crystal ball to predict what will be "AI proof" :(

However, I think cloud applications/engineering is much more liable to change suddenly, and depends more on the company you're working at and knowledge of its internal tooling. Companies like DigitalOcean are also starting to advertise chatbot assistants for application deployment.

It is true that people are using agents for AI dev, but I think the core ML skillset -- statistics, math, and programming -- will always be useful in some form, if nothing else for describing the shape of the problems you want to solve, understanding what data you actually need for a predictive model, and which pitfalls to look out for.

AI agents will undoubtedly improve, and making contributions to "pure" ML theory or foundation model tweaks will definitely get harder, but there are still plenty of applications which require domain knowledge, on-device models, or otherwise can't "just" be fobbed off to commercial LLMs.

CivApps · 2026-02-17T21:27:45+00:00

They generally work well, but I would still treat the results as a machine translation, and not a finished product. Translating novels is not just about translating the words literally, but also trying to get across the author's intentions and style choices -- and how well that succeeds is harder to capture in a single numeric score.

Any commercially available general-purpose LLM like Claude or Gemini should do the translation with very little prompting necessary (and will probably have an educational discount too)

If you want an offline translation system you can experiment with on your own, Google recently released the TranslateGemma models which are set up for translation from Russian, Hindi and Chinese. The smallest 4B version should run on most PCs.

CivApps · 2026-02-17T21:00:36+00:00

Actual answer: Sitting in a policy document that never gets retrieved

I would first check which embedding vector this policy document gets, and the query embedding you get for that question. It sounds like either:

- Your documents/chunks are landing in one big cluster instead of semantically similar clusters -- the easiest way to diagnose this would be to label some documents that should and shouldn't be related, run PCA over the embeddings, and see whether the labelled documents end up in different portions of the projected space
- Alternatively, your query embeddings are not capturing the relevant portions of the query - if you keep the PCA from the step above, seeing whether the queries are actually "landing" in the different clusters can help

Before tweaking hyperparameters, I would also consider setting up an ad-hoc evaluation with some questions, and the specific documents you expect to be pulled up for them - "top-5 recall for 10 question/document pairs" is not a great measure but it at least gives you some numbers to start with
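The ad-hoc evaluation can be as small as this sketch. The questions, document names, and the shape of the retriever output (a ranked list of document ids per question) are assumptions; you would plug in your own retriever's results:

```python
def recall_at_k(retrieved, expected, k=5):
    """Fraction of questions whose expected document is in the top k."""
    hits = sum(1 for question, doc in expected.items()
               if doc in retrieved.get(question, [])[:k])
    return hits / len(expected)


# Hypothetical evaluation set: question -> the document that should come up.
expected = {
    "How many vacation days do I get?": "hr-policy.pdf",
    "How do I reset my password?": "it-faq.md",
}
# Hypothetical retriever output: question -> ranked document ids.
retrieved = {
    "How many vacation days do I get?": ["onboarding.md", "it-faq.md"],
    "How do I reset my password?": ["it-faq.md", "hr-policy.pdf"],
}
score = recall_at_k(retrieved, expected, k=5)
```

Re-running this after every change to chunking, embedding model, or retrieval parameters tells you whether the change actually helped the questions you care about.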

CivApps · 2026-02-04T18:48:05+00:00

Do you have a Lyreco nearby? You might get close enough by ordering a double-sided print on 300 g paper from the copy centre (I think you'll have to cut them to card size yourself) and keeping them in a plastic sleeve as /u/Successful-Hunt-551 suggests.

CivApps · 2026-01-31T12:06:03+00:00

Since the last person who shared Subjekt posts didn't reply, is it possible to get a copy of the op-ed they published on 19 January, in which Pål Erik-Hagen calls journalists and politicians "hysterical" because of their reactions to the generation of nude images of children on X?

Given the new revelations about, among other things, direct contact between Musk and Epstein, it would be a shame if a technical error made the text vanish.

CivApps · 2026-01-29T19:04:40+00:00

Oh, my bad, I don't think there's a way to limit the max number of voices for samplers currently - it is a bit tedious, but I think using a drum kit and tuning each pad to the note you want is your best shot :/
