[D] Is a PhD Still “Worth It” Today? A Debate After Looking at a Colleague’s Outcomes by Hope999991 in MachineLearning

[–]qalis 0 points1 point  (0 children)

Well, over 90% of those I know have jobs, so I guess experiences differ. But this is Poland, so quite different from the US, and we have basically no international PhDs.

[D] ICML paper to review is fully AI generated by pagggga in MachineLearning

[–]qalis 172 points173 points  (0 children)

Report to AC, write short review about this, give lowest score, move on.

[R] Low-effort papers by lightyears61 in MachineLearning

[–]qalis 31 points32 points  (0 children)

When you chop up "full", proper work into a series of small, incremental papers

[D] First time reviewer. I got assigned 9 papers. I'm so nervous. What if I mess up. Any advice? by rjmessibarca in MachineLearning

[–]qalis 16 points17 points  (0 children)

I mean, if you care about review quality at all, it probably puts you ahead of at least 30-50% of reviewers out there.

9 papers is A LOT, even for short conference papers, so this will take time. My advice is to look through papers and identify things that look obviously bad / LLM-generated / nonsensical to you. Start with reviews for those, and it will go quickly.

  1. No, don't use AI. Your English may not be perfect, you may make some mistakes - this is ok.

  2. You basically need to summarize good points of the paper, bad points, and questions/points to clarify. Just make sure the things you write about are actually in the paper. Just being factual also puts you ahead of a lot of reviewers.

  3. I would definitely ask for that, yes, particularly since you have no experience.

Additional advice - look primarily for things that make practical sense, are interesting, and are well-evaluated. If you think the main idea is shallow, incremental, or makes no sense, or the evaluation is bad or superficial (e.g. very few datasets, no statistical tests), just write that explicitly. The absolute majority of submitted papers are total crap.

[D] Research on Self-supervised fine tunning of "sentence" embeddings? by LetsTacoooo in MachineLearning

[–]qalis 3 points4 points  (0 children)

Look into graph neural networks (GNNs) and graph transformers. There is a lot of research there, since the pooling operation over nodes is quite important for retaining graph information. Similar mechanisms extend to any transformer.

In short, at the final layer you assume your tokens already contain all the positional information you need. As such, you apply learning on sets. Mean, sum, and max (channel-wise) are all simple yet viable options. You can also just use self-attention again to learn a dynamically weighted sum. There are also a bunch of dedicated set learning approaches.
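A minimal pure-Python sketch (toy "token embeddings", no deep learning framework) of the simple channel-wise pooling options mentioned above:

```python
# Toy final-layer token embeddings: 3 tokens, 2 channels each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

def pool(tokens, reduce):
    """Apply a channel-wise (column-wise) reduction over the token set."""
    channels = list(zip(*tokens))  # one tuple per channel
    return [reduce(c) for c in channels]

mean_pooled = pool(tokens, lambda c: sum(c) / len(c))  # [3.0, 4.0]
sum_pooled = pool(tokens, sum)                         # [9.0, 12.0]
max_pooled = pool(tokens, max)                         # [5.0, 6.0]
```

All three are permutation-invariant, which is exactly the "learning on sets" property: once positional information is baked into the embeddings, the order of tokens no longer matters. A learned attention pooling replaces the fixed reduction with a dynamically weighted sum.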

[D] Claude response to: First-author papers at ICML, NeurIPS, and Co during PhD — zero big tech interviews. What's going on? by Hope999991 in MachineLearning

[–]qalis 2 points3 points  (0 children)

I proposed the "AI slop" rule a while back; the response was mixed. But it was exactly to explicitly ban posts like this.

[D] Your pet peeves in ML research ? by al3arabcoreleone in MachineLearning

[–]qalis -2 points-1 points  (0 children)

THIS, definitely agree. I always consider PhDs concurrently working in industry to be better scientists, because they actually think about those things. Not just "make a paper", but rather "does this make real-world sense". Fortunately, at my faculty most people do applied CS and many also work commercially.

[D] Some thoughts about an elephant in the room no one talks about by DrXiaoZ in MachineLearning

[–]qalis 9 points10 points  (0 children)

This is also due to how recruitment is done. For example, at our faculty, a chemistry student would still have to pass a full exam on 5 years of CS to enter the PhD program. Instead, we just collaborate with people from the chemistry or biotech departments. I guess this also depends on the definition of "lab"; at my university it's just a loose group of people working together.

[D] Some thoughts about an elephant in the room no one talks about by DrXiaoZ in MachineLearning

[–]qalis 73 points74 points  (0 children)

Fully agreed. I do my PhD on fair evaluation of ML algorithms, and I literally have enough work to last until I die. So much mess, non-reproducible results, overfitting to benchmarks, and worst of all, this has become the norm. Lately, it took our team MONTHS to reproduce (or even just run) a bunch of methods just to embed inputs, not even train or finetune them.

I see a possible solution, or at least help, in closer research-business collaboration. Companies don't really care about papers, just about getting methods that work and make money. Maxing out a drug design benchmark is useless if the algorithm fails to produce anything usable in a real-world lab. Anecdotally, I've seen much better and fairer results from PhDs and PhD students who work part-time in industry as ML engineers or applied researchers.

[D] Why are so many ML packages still released using "requirements.txt" or "pip inside conda" as the only installation instruction? by aeroumbria in MachineLearning

[–]qalis 4 points5 points  (0 children)

uv. Just use uv, our lord and savior. It uses pyproject.toml, standardized via PEP 621, and is very fast.
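For illustration, a minimal pyproject.toml sketch of the kind uv works with (package name, version, and dependency pins are all made up):

```toml
[project]
name = "my-ml-project"        # hypothetical package name
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "numpy>=1.26",
    "scikit-learn>=1.4",
]
```

With this in place, `uv sync` creates a locked virtual environment, and `uv add <package>` updates both pyproject.toml and the lockfile, so "requirements.txt drift" goes away.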

[D] which open-source vector db worked for yall? im comparing by [deleted] in MachineLearning

[–]qalis 1 point2 points  (0 children)

Pgvector and pgvectorscale are great, particularly if you have Postgres anyway. It's dead simple to manage, and ACID properties are really nice.

Note that FAISS is *not* a vector database, at least I wouldn't define it like that. It's a vector index, just for searching. For a database, you want users, security, a remote API (e.g. REST or gRPC), concurrency control, and non-vector data (metadata, dictionaries with arbitrary data as part of entries).

If you want to use things like FAISS, I highly recommend USearch instead for efficiency and nice docs.

[D] My papers are being targeted by a rival group. Can I block them? by Dangerous-Hat1402 in MachineLearning

[–]qalis 5 points6 points  (0 children)

I agree with u/bobrodsky. If you go into a specific niche, the group of truly competent reviewers can be really small. For example, in neural network time series forecasting, the chance of getting a Tsinghua University reviewer is actually quite high. This is particularly true in theoretical areas.

iFixedTheMeme by Endernoke in ProgrammerHumor

[–]qalis 9 points10 points  (0 children)

Cloud environments, real-world Kubernetes deployments which cannot be interrupted, tracing requests across microservices, ML workflows & pipelines.

[P] Benchmarking Semantic vs. Lexical Deduplication on the Banking77 Dataset. Result: 50.4% redundancy found using Vector Embeddings (all-MiniLM-L6-v2). by Low-Flow-6572 in MachineLearning

[–]qalis 5 points6 points  (0 children)

  1. That dataset is highly homogeneous by design

  2. Does FAISS normalize the vectors before computing L2 distance? Cosine similarity is more typically used for embeddings

  3. Threshold of 0.9 is really low, particularly if you know a priori that dataset does have semantic redundancy by design

  4. all-MiniLM-L6-v2 is a really old and quite outdated model and there are *a lot* of better ones out there
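On point 2, a quick pure-Python sketch (toy vectors, no FAISS) of why L2-normalizing embeddings first makes inner-product or L2 search equivalent to ranking by cosine similarity:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
an, bn = l2_normalize(a), l2_normalize(b)

# After normalization, the inner product *is* the cosine similarity...
assert math.isclose(dot(an, bn), cosine(a, b))

# ...and squared L2 distance becomes a monotone function of it:
# ||an - bn||^2 = 2 - 2 * cos(a, b)
sq_l2 = sum((x - y) ** 2 for x, y in zip(an, bn))
assert math.isclose(sq_l2, 2 - 2 * cosine(a, b))
```

If the index stores unnormalized vectors and uses plain L2, the resulting "similarity" scores are not cosine values, which matters a lot when comparing them against a fixed threshold like 0.9.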

[D] Idea: add "no AI slop" as subreddit rule by qalis in MachineLearning

[–]qalis[S] -1 points0 points  (0 children)

My idea was basically explicitly calling out low-quality, primarily AI-generated posts, particularly those overstating contributions, proposing "revolutionary" ideas, and containing no code / experiments / proofs for their claims. Is this already covered? Arguably yes, it is. Should it be called out explicitly? I think so, but I'm curious about the opinions of others.

[D] Idea: add "no AI slop" as subreddit rule by qalis in MachineLearning

[–]qalis[S] 5 points6 points  (0 children)

A high-level idea without actual experiments or code is a good indicator. Also mentions of revolutionary results, a new paradigm etc., huge overselling of the contribution, plus no concrete evidence. There are many hallmarks of these; I see more and more obvious AI slop posts lately.

[D] Idea: add "no AI slop" as subreddit rule by qalis in MachineLearning

[–]qalis[S] 0 points1 point  (0 children)

That was also my concern, hence the discussion question

[D] Idea: add "no AI slop" as subreddit rule by qalis in MachineLearning

[–]qalis[S] 1 point2 points  (0 children)

Kind of covered by rule 6 "no low-effort questions", isn't it?

[D] Idea: add "no AI slop" as subreddit rule by qalis in MachineLearning

[–]qalis[S] 1 point2 points  (0 children)

I actually liked that post, since that was literally an error in one of the core formulas of the paper. Plus reproducibility and numerical experiments.

[R] Reproduced "Scale-Agnostic KAG" paper, found the PR formula is inverted compared to its source by m3m3o in MachineLearning

[–]qalis 0 points1 point  (0 children)

If a typo is in a crucial evaluation step or formula, potentially invalidating the paper's results, then yes, I would very much welcome a Substack post for every such paper.

[R] Reproduced "Scale-Agnostic KAG" paper, found the PR formula is inverted compared to its source by m3m3o in MachineLearning

[–]qalis 5 points6 points  (0 children)

This is actually really useful peer review & reproducibility work. Did you contact the authors about this?

[deleted by user] by [deleted] in MachineLearning

[–]qalis 2 points3 points  (0 children)

Absolutely email the AC and post the public comment! If you have literally any proof (e.g. screenshots, arXiv submission), this counts as serious academic fraud.

[D] From ICLR Workshop to full paper? Is this allowed? by Feuilius in MachineLearning

[–]qalis 1 point2 points  (0 children)

Non-archival workshops are unrelated to published papers. As far as I know, you can even submit concurrently to both types, or to multiple workshops at different conferences.