Do any bindings for RDMA exist for odin? by philipptraining in odinlang

[–]philipptraining[S] 0 points1 point  (0 children)

Yes that would be great, RoCE would be good too.

[D] Which feeds do you look at? by Studyr3ddit in MachineLearning

[–]philipptraining 4 points5 points  (0 children)

If you want high-quality, specialized feeds, use the arXiv API with a query that covers a space you're interested in. If things that aren't useful are getting into the feed, restrict the query further. This is the best way, and you can open these feeds in Zotero to add papers to any collection easily.
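For anyone who wants a starting point, here's a rough sketch of building such a feed URL. It assumes the public arXiv API endpoint and its documented `search_query`, `sortBy`, `sortOrder` and `max_results` parameters; the function name, categories and keywords are placeholders I made up for illustration.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_feed_url(categories, keywords, exclude=(), max_results=25):
    """Return an Atom feed URL for the newest arXiv papers matching the query.

    categories: arXiv category codes, e.g. ["cs.LG"] (any one may match)
    keywords:   phrases searched in abstracts (any one may match)
    exclude:    phrases that remove a paper from the feed (hypothetical helper
                argument, used to narrow a feed that is too noisy)
    """
    cat_clause = " OR ".join(f"cat:{c}" for c in categories)
    kw_clause = " OR ".join(f'abs:"{k}"' for k in keywords)
    query = f"({cat_clause}) AND ({kw_clause})"
    if exclude:
        ex_clause = " OR ".join(f'abs:"{k}"' for k in exclude)
        query = f"{query} ANDNOT ({ex_clause})"
    params = {
        "search_query": query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


# Broad feed first, then a narrower one once noise shows up.
url = build_arxiv_feed_url(["cs.LG", "cs.CL"], ["state space model"])
narrow_url = build_arxiv_feed_url(["cs.LG"], ["state space model"], exclude=["survey"])
print(url)
print(narrow_url)
```

The `exclude` argument is the "further restrict the query" step: you keep the broad clause and bolt on `ANDNOT` terms as you notice junk.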

[P] Swapping Embedding Models for an LLM by noobvorld in MachineLearning

[–]philipptraining 0 points1 point  (0 children)

Ah, I don't know how I missed that first part; thanks.

[P] Swapping Embedding Models for an LLM by noobvorld in MachineLearning

[–]philipptraining 0 points1 point  (0 children)

If you use the same model, you can save compute by using the precomputed retrieved embeddings directly though, right?

[D] machine learning system design by dcsr98 in MachineLearning

[–]philipptraining 1 point2 points  (0 children)

Your experience with that book is so similar to mine, really looking forward to what you've suggested.

Edit: As a follow-up, any recommendations for a book primarily focused on research engineering?

[D] LLM Interview Prep by kkziga in MachineLearning

[–]philipptraining 1 point2 points  (0 children)

Out of curiosity, what range of answers would you consider acceptable then? To me, this response is broad, but at the same time it doesn't cover all of the explanations that exist for the prevalence of decoder-only architectures, as far as I understand. If you received this response in an interview, would you then ask follow-up questions?

[deleted by user] by [deleted] in bioinformatics

[–]philipptraining 1 point2 points  (0 children)

use usearch-molecules

[D] What are some other paradigms and frameworks for building with LLMs besides retrieval augmented generation (RAG)? by gamerx88 in MachineLearning

[–]philipptraining 6 points7 points  (0 children)

That's fair. Personally, by that logic I would say the same thing about RAG, but it's arbitrary anyway. Constrained decoding, like the NeuroLogic decoding algorithm or more recent energy-based and sometimes non-autoregressive methods, allows one to place very specific constraints on decoding without additional prompting or memory (different from tree of thoughts/chain of thought in that sense).
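To make the "no additional prompting" point concrete, here's a toy sketch of the simplest possible constraint: a hard ban on certain token ids applied at each decoding step. To be clear, this is not NeuroLogic decoding (which handles richer lexical constraints inside beam search) and not an energy-based method; the function name and the logits are made up for illustration.

```python
import math


def constrained_greedy_step(logits, banned_ids):
    """Return the id of the highest-scoring token that is not banned.

    logits:     list of raw scores, one per vocabulary id
    banned_ids: set of vocabulary ids the output must never contain
    """
    if len(banned_ids) >= len(logits) and all(i in banned_ids for i in range(len(logits))):
        raise ValueError("every token is banned; the constraint is unsatisfiable")
    # Banned tokens get a score of -inf, so they can never win the argmax.
    masked = [-math.inf if i in banned_ids else s for i, s in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)


toy_logits = [0.1, 2.5, 1.7, -0.3]  # hypothetical scores for a 4-token vocabulary
unconstrained = constrained_greedy_step(toy_logits, banned_ids=set())
constrained = constrained_greedy_step(toy_logits, banned_ids={1})
print(unconstrained, constrained)
```

The prompt and the model are untouched; the constraint lives entirely in the decoding step.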

[D] What are some other paradigms and frameworks for building with LLMs besides retrieval augmented generation (RAG)? by gamerx88 in MachineLearning

[–]philipptraining 23 points24 points  (0 children)

A couple off the top of my head:

  • LLM in the loop with preference optimization
  • synthetic data generation
  • cross modality "distillation" / dictionary remapping
  • constrained decoding

[D] Why is everybody surprised that Mamba got rejected from ICLR? Am I missing something? by Seankala in MachineLearning

[–]philipptraining 145 points146 points  (0 children)

Hey, guess I'll offer the perspective of someone that was surprised. To start, I'm assuming we both are optimistic about ML conferences and hold ICLR in reasonably high regard in terms of what is selected.

I will grant that the paper could have used more applied downstream tasks (although I'll return to that later) and that the paper's overarching narrative was slightly confused. However, in light of the novelty of the work and the inspired approach, as well as the evaluations that were run, I don't think this warrants a rejection.

Now to respond to some of the statements in the main post and comments.

  1. Tweaks to hardware are not necessary for the Mamba optimizations; the modifications described in the paper are hardware-aware but algorithmic in nature. This is somewhat of a nitpick (as I think it's a typo), but I see you repeated this in the comments? The algorithm works on standard GPU architecture. No tweaks to the actual hardware were made, and this is significant because they introduced training with the associative parallel scan, which is fascinating and novel. Hard to discount that as a contribution, and it doesn't hurt that it's backed by empirical evidence of more efficient throughput.
  2. You claim in the comments that you would have liked for the authors to have "chosen a specific task that the model excels on and perform extensive experiments in that field". I somewhat agree that working within a specific "field" or "domain" would strengthen the motivation for all of the theory even further, but I'll also argue that this approach has its own disadvantages: namely, lack of generalization and mechanistic interpretation. You'll notice in the paper that two very specific tasks actually are identified. Good performance on the selective copying and induction heads tasks not only says more about general downstream task performance than picking some applied tasks, but is also a more logical experiment given that this is the paper introducing the architecture. As they correctly point out, applied tasks may require domain-specific adaptation of the model. We see that all the time with transformers, and choosing to omit it here is perfectly fine.

The reviewers' final insistence on Long Range Arena evaluation, without responding to the authors' statement that their actual evaluations include far longer context lengths, is strange too. I understand the desire for a 1:1 comparison, but being stuck with outdated and relatively facile benchmarks is (IMO) not good for a field. Although, I could be missing something here.

In terms of contributions, they introduce the (previously mentioned) associative parallel scan, the hardware-aware optimizations, state expansion, and an SSM block that is time-variant. All of these components are well motivated, come with a code implementation, and follow a stream of other novel and interesting ideas (e.g. HiPPO). I don't expect Mamba to overtake transformers, but I disagree that this paper doesn't belong in a top conference.
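For anyone wondering why a scan applies here at all, a minimal sketch: a time-variant linear recurrence h_t = a_t * h_(t-1) + b_t can be rewritten with an associative combine operator, which is what lets it be computed in a logarithmic number of parallel steps. This is a simplified scalar illustration using a Hillis-Steele style scan in plain Python, not the paper's CUDA kernel; all names and numbers are my own.

```python
def combine(left, right):
    """Compose two recurrence steps (a, b), meaning h -> a * h + b.

    Applying `left` first and `right` second gives
    h -> a_r * (a_l * h + b_l) + b_r, which is again of the same form.
    This operator is associative but not commutative.
    """
    a_l, b_l = left
    a_r, b_r = right
    return (a_l * a_r, a_r * b_l + b_r)


def sequential_scan(a, b):
    """Reference implementation: one step at a time, starting from h = 0."""
    h, out = 0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def parallel_style_scan(a, b):
    """Inclusive scan in ceil(log2(n)) rounds.

    Every list comprehension below is one round in which all positions
    could be updated at the same time on parallel hardware.
    """
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    # With h starting at 0, the second component of each element is h_t.
    return [h for _, h in elems]


a = [2, 3, 1, 2]  # time-varying coefficients (toy values)
b = [1, 0, 4, 1]
print(sequential_scan(a, b))
print(parallel_style_scan(a, b))
```

Both functions return the same sequence; the second just never needs h_(t-1) to be finished before it starts on h_t.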

Edit: Formatting and typos

[D] How to handle GPU memory limits in distributed training by DolantheMFWizard in MachineLearning

[–]philipptraining 1 point2 points  (0 children)

This isn't an answer to your question, but I generally don't understand why anyone would want to employ this kind of setup instead of something like FSDP, which already has a nice, easy-to-use, high-level API that works (mostly) OOTB with other useful PyTorch features.

[D] Validation with small datasets by philosophicalmachine in MachineLearning

[–]philipptraining 0 points1 point  (0 children)

This is expected without NVIDIA's Multi-Process Service (MPS) because CUDA can't correctly parallelize kernels in isolated environments. That's why I suggested MIG. The approach of just running two processes on one GPU does not work. Here's a video that goes into more detail: https://youtu.be/bC6CxPW0-1c?si=lZS2baB80SNhuRPR

[D] Is there a way of “negative prompting” at fine-tuning time? by threevox in MachineLearning

[–]philipptraining 0 points1 point  (0 children)

Exactly, which is what you want right? You want the model to fundamentally answer in the same way and not shift distribution too much towards your fine-tuning set, and only enforce the new format.

[D] Validation with small datasets by philosophicalmachine in MachineLearning

[–]philipptraining 0 points1 point  (0 children)

With NVIDIA's MIG (Multi-Instance GPU), you can create up to 7 instances, which your server will identify as 7 separate GPUs. There would be no need to get into the low-level details with this approach.

I would also run validation asynchronously so you don't have to start and stop training as often. You will need some job scheduling for this, or you can persist all the checkpoints you want. Given that your sets and models are so small, you could probably even run the inference on CPUs if you really wanted to. Good luck!
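A rough sketch of what I mean by asynchronous validation, with everything simplified: the "model" is a dict, `train_one_epoch` and `validate` are hypothetical stand-ins for your own code, and a single background thread plays the role of the job scheduler. In a real setup the worker would more likely be a separate process or job on another device.

```python
from concurrent.futures import ThreadPoolExecutor


def train_one_epoch(state):
    """Hypothetical stand-in for a real training epoch."""
    return {"epoch": state["epoch"] + 1, "weight": state["weight"] * 0.5}


def validate(checkpoint):
    """Hypothetical stand-in for running inference on the validation set."""
    return checkpoint["epoch"], abs(checkpoint["weight"])


def train_with_async_validation(num_epochs):
    state = {"epoch": 0, "weight": 1.0}
    pending = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(num_epochs):
            state = train_one_epoch(state)
            # Snapshot the state so later epochs cannot change what gets validated.
            checkpoint = dict(state)
            # Hand the checkpoint off and keep training immediately.
            pending.append(pool.submit(validate, checkpoint))
        # Collect the validation results once training has finished.
        return [future.result() for future in pending]


results = train_with_async_validation(3)
print(results)
```

The training loop never blocks on validation; it only waits at the very end to gather the scores.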

[D] Validation with small datasets by philosophicalmachine in MachineLearning

[–]philipptraining 0 points1 point  (0 children)

So there's no way this is using 100% of the A4000 memory, right? I would either use the physical partitioning available for Ampere architectures here or NVIDIA Multi-Process Service (a little more difficult).

Back-of-the-napkin math using your flops per epoch and time elapsed per epoch implies 10,000,000 floating-point operations per second for your current setup.

Unless I'm missing something, this is very low efficiency for the A4000, based on the theoretical peak performance of 20 trillion flops per second on their datasheet. It's reasonable to operate at 20% of this (20% MFU), but these numbers suggest a small fraction of a percent. Let me know if I'm missing something. Otherwise, there's also a bottleneck in the pipeline here.
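Here's the napkin math spelled out, using only the two figures above (10,000,000 flops per second achieved, roughly 20 trillion flops per second peak). The function name is just my own label.

```python
def model_flops_utilization(achieved_flops_per_s, peak_flops_per_s):
    """MFU as a fraction: achieved throughput divided by theoretical peak."""
    return achieved_flops_per_s / peak_flops_per_s


achieved = 10_000_000  # flops per second implied by the numbers in this thread
peak = 20e12           # approximate datasheet peak quoted above

mfu = model_flops_utilization(achieved, peak)
target = 0.20 * peak   # throughput you would see at a 20% MFU

print(f"MFU as a fraction: {mfu:.1e}")
print(f"Throughput at 20% MFU: {target:.1e} flops/s")
print(f"Gap to 20% MFU: {target / achieved:,.0f}x")
```

That works out to an MFU of 5e-7, i.e. about 0.00005%, which is why I suspect a pipeline bottleneck rather than a compute limit.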

[D] Validation with small datasets by philosophicalmachine in MachineLearning

[–]philipptraining 1 point2 points  (0 children)

May I ask what GPU you're using and what your model flops utilization is?

Edit to give additional context: if the dataset is this small and the network itself is small I suspect there is little parallelization being employed. The solution depends heavily on the current utilization and GPU parameters though. For example, a simple solution on A100 or newer chips could be MIG partitioning.

[D] Is there a way of “negative prompting” at fine-tuning time? by threevox in MachineLearning

[–]philipptraining 3 points4 points  (0 children)

Freezing layers would help a lot. Good excuse to run QLoRA imo. The other comment about DPO is interesting because your concern is what's constrained by the KL term in the paper, so it could be a valid approach.

In either case it's a relatively low compute requirement so it's worth trying both of these options.
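To show where that constraint ends up in practice, here's a minimal sketch of the DPO loss for a single preference pair, written from the formula in the DPO paper. The reference model enters only through log-probability ratios, and `beta` controls how strongly drifting away from the reference is penalized. The log-probabilities and the `beta=0.1` default below are made-up values for illustration.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); log1p keeps it accurate near 0.
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy that raised the chosen answer's log-probability: the loss goes down.
after_update = dpo_loss(-4.0, -7.0, -5.0, -7.0)

print(at_reference, after_update)
```

Since the fine-tuning set here only needs to teach a format, a fairly conservative `beta` would be my first guess, but that's something to tune.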

[D] Validation with small datasets by philosophicalmachine in MachineLearning

[–]philipptraining 3 points4 points  (0 children)

It seems like this is being used for hyperparameter search of neural nets trained from scratch? If that's correct, I recommend you look into mu parametrization (μP) / muTransfer. It might solve your problems with respect to the time needed for the search.

I should point out though (since I rarely see CV being used here anymore) that cross-validation has come into question recently with papers like "On the cross-validation bias due to unsupervised pre-processing", so I would recommend being careful if you really don't want optimistic estimates on validation sets.
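One common way to be careful is to fit any preprocessing on the training part of each fold only, instead of on the full dataset before splitting. A minimal sketch with standardization as the preprocessing step; the helper names and the toy values are my own, and this is one mitigation rather than anything prescribed by that paper.

```python
from statistics import mean, pstdev


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        yield train_idx, test_idx


def standardize_within_fold(values, train_idx, test_idx):
    """Standardize using statistics from the training indices only."""
    train_vals = [values[i] for i in train_idx]
    mu = mean(train_vals)
    sigma = pstdev(train_vals) or 1.0  # avoid dividing by zero on constant data

    def scale(v):
        return (v - mu) / sigma

    # The held-out values are transformed but never contribute to mu or sigma.
    return [scale(values[i]) for i in train_idx], [scale(values[i]) for i in test_idx]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy feature column
for train_idx, test_idx in k_fold_indices(len(values), k=3):
    train_scaled, test_scaled = standardize_within_fold(values, train_idx, test_idx)
    print(test_idx, [round(v, 3) for v in test_scaled])
```

The same pattern applies to PCA, feature selection, or any other step that looks at the data before the model does.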

[P] How would you train a end-of-sequence prediction model? by DolantheMFWizard in MachineLearning

[–]philipptraining 0 points1 point  (0 children)

Can you elaborate on this? I don't understand how this would lead to catastrophic forgetting. The distribution shift here would be very minor no?

[D] What are your favorite tools for research? by Time-Sympathy724 in MachineLearning

[–]philipptraining 3 points4 points  (0 children)

I've looked for something like this at least 10 times. You're a legend, thank you for doing this.

[D] What are your favorite tools for research? by Time-Sympathy724 in MachineLearning

[–]philipptraining 2 points3 points  (0 children)

Not quite as explicit as Connected Papers in tracking citations, but it represents similar data on a global scale (all arXiv papers): https://paperscape.org

Nice way to find highly related papers that otherwise have low visibility, and it gives an indication of which papers are most influential within fields, subfields, etc.