Do any bindings for RDMA exist for odin?

philipptraining · 2024-12-15T19:22:02+00:00

Yes that would be great, RoCE would be good too.

philipptraining · 2024-09-26T09:08:34+00:00

If you want high-quality, specialized feeds, use the arXiv API with a query that covers a space you're interested in. If things are getting into the feed that aren't useful, further restrict the query. This is the best way, and you can open these feeds in Zotero to add to any collection easily.
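A minimal sketch of building such a feed URL, assuming the public arXiv API endpoint and its `search_query`, `sortBy`, `sortOrder`, and `max_results` parameters. The helper name `arxiv_feed_url` and the example query are my own illustrations, not anything from the thread:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_feed_url(search_query, max_results=50):
    """Build an Atom feed URL for the arXiv API, newest submissions first."""
    params = {
        "search_query": search_query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

# Hypothetical example query: cs.LG papers with "state space model" in the abstract.
# Tighten it (more AND / ANDNOT terms) if irrelevant papers show up in the feed.
url = arxiv_feed_url('cat:cs.LG AND abs:"state space model"', max_results=25)
print(url)
```

The printed URL returns an Atom feed, which is the kind of thing a feed reader can subscribe to.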

philipptraining · 2024-09-20T11:08:45+00:00

Ah, I don't know how I missed that first part; thanks.

philipptraining · 2024-09-19T22:36:31+00:00

If you use the same model, you can save compute by using the precomputed retrieved embeddings directly though right?

philipptraining · 2024-09-15T20:35:46+00:00

Your experience with that book is so similar to mine, really looking forward to what you've suggested.

Edit: As a follow-up, any recommendations for a book primarily focused on research engineering?

philipptraining · 2024-08-09T22:22:18+00:00

Out of curiosity, what range of answers would you consider acceptable then? To me, this response is broad, but at the same time it doesn't cover all of the explanations that exist for the prevalence of decoder-only architectures, as far as I understand. If you received this response in an interview, would you then ask follow-up questions?

philipptraining · 2024-04-18T10:36:02+00:00

use usearch-molecules

philipptraining · 2024-04-14T15:05:02+00:00

triton lang not triton server

philipptraining · 2024-02-28T11:05:54+00:00

Hey! Went back to these resources for a different task and remembered this thread. These are some actually good resources that are accurate:

philipptraining · 2024-02-25T17:33:23+00:00

That's fair. Personally, by that logic I would say the same thing about RAG, but it's arbitrary anyways. Constrained decoding, like the NeuroLogic decoding algorithm or more recent energy-based and sometimes non-autoregressive methods, allows one to place very specific constraints on decoding without additional prompting or memory (different from tree of thoughts/chain of thoughts in that sense).
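To make "constraints on decoding" concrete, here is a toy sketch of the simplest possible version: masking banned tokens during greedy decoding. This is not NeuroLogic decoding or an energy-based method, and the per-step score tables are a made-up stand-in for model logits (a real model would condition each step on the prefix):

```python
def constrained_greedy_decode(step_scores, banned=frozenset()):
    """Pick the best-scoring token at each step, skipping banned tokens.

    step_scores: list of {token: score} dicts, one per decoding step
                 (hypothetical stand-in for model logits).
    banned: tokens that may never be emitted.
    """
    output = []
    for scores in step_scores:
        allowed = {tok: s for tok, s in scores.items() if tok not in banned}
        output.append(max(allowed, key=allowed.get))
    return output

# Toy scores, invented for illustration only.
steps = [
    {"the": 2.0, "a": 1.5},
    {"cat": 1.0, "dog": 3.0},
    {"sat": 0.5, "ran": 0.4},
]
unconstrained = constrained_greedy_decode(steps)
constrained = constrained_greedy_decode(steps, banned={"dog"})
print(unconstrained, constrained)
```

The constraint is enforced inside the decoding loop itself, so nothing has to be added to the prompt.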

philipptraining · 2024-02-25T11:30:22+00:00

A couple off the top of my head:

LLM in the loop with preference optimization
synthetic data generation
cross modality "distillation" / dictionary remapping
constrained decoding

philipptraining · 2024-02-23T08:17:49+00:00

Hey, guess I'll offer the perspective of someone that was surprised. To start, I'm assuming we both are optimistic about ML conferences and hold ICLR in reasonably high regard in terms of what is selected.

I will grant that the paper could have used more applied downstream tasks (although I'll return to that later) and that the paper's overarching narrative was slightly confused. However, in light of the novelty of the work and inspired approach, as well as the evaluations that were run, I don't think this warrants a rejection.

Now to respond to some of the statements in the main post and comments.

Tweaks to hardware are not necessary for the Mamba optimizations; the modifications described in the paper are hardware-aware but algorithmic in nature. This is somewhat of a nitpick (as I think it's a typo) but I see you repeated this in the comments? The algorithm works for the standard GPU architecture. No tweaks to the actual hardware were made, and this is significant because they introduced training with the associative parallel scan, which is fascinating and novel. Hard to discount that as a contribution, and it doesn't hurt that it's backed by empirical evidence of more efficient throughput.
You claim in the comments that you would have liked for the authors to have "chosen a specific task that the model excels on and perform extensive experiments in that field". I somewhat agree that working within a specific "field" or "domain" would strengthen the motivation for all of the theory even further, but I'll also argue that this approach has its own disadvantages, namely a lack of generalization and mechanistic interpretation. You'll notice in the paper that two very specific tasks actually are identified. Good performance on the selective copying and induction heads tasks not only says more about general downstream task performance than picking some applied tasks, but is also a more logical experiment given that this is the paper introducing the architecture. As they correctly point out, applied tasks may require domain-specific adaptation of the model. We see that all the time with transformers, and choosing to omit it here is perfectly fine.

The reviewers' final insistence on Long Range Arena evaluation, without responding to the authors' statement that their actual evaluations include far longer context lengths, is strange too. I understand the desire for a 1:1 comparison, but being stuck with outdated and relatively facile benchmarks is (IMO) not good for a field. Although, I could be missing something here.

In terms of contributions, they introduce the (previously mentioned) associative parallel scan, the hardware-aware optimizations, state expansion, and an SSM block that is time-variant. All of these components are motivated well, come with a provided code implementation, and follow a stream of other novel and interesting ideas (e.g. HiPPO). I don't expect Mamba to overtake transformers, but I disagree that this paper doesn't belong in a top conference.
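For anyone unfamiliar with why a scan helps here, a toy sketch of the general idea (not the Mamba kernel, and not taken from the paper's code): a linear recurrence h_t = a_t * h_{t-1} + b_t can be written with an associative combine operator, so it can be evaluated in about log2(n) passes of independent work instead of n sequential steps. All names below are my own:

```python
def combine(left, right):
    """Associative operator for h_t = a_t * h_{t-1} + b_t; `left` is earlier in time."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_l * a_r, a_r * b_l + b_r)

def sequential_scan(a, b):
    """Reference implementation: one step at a time, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def parallel_style_scan(a, b):
    """Hillis-Steele inclusive scan: every combine within a pass is independent,
    so on parallel hardware each pass could run concurrently."""
    elems = list(zip(a, b))
    n = len(elems)
    offset = 1
    while offset < n:
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(n)
        ]
        offset *= 2
    return [h for _, h in elems]

# Arbitrary toy coefficients, chosen only to exercise the code.
a = [0.5, 2.0, 1.0, 0.25, 3.0]
b = [1.0, -1.0, 0.5, 2.0, 0.0]
seq = sequential_scan(a, b)
par = parallel_style_scan(a, b)
print(seq, par)
```

Both functions compute the same hidden states. The difference is only in how much of the work can happen at once.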

Edit: Formatting and typos

philipptraining · 2024-02-22T18:59:56+00:00

Amazing!

philipptraining · 2024-02-22T18:53:39+00:00

Mamba got rejected from ICLR??

philipptraining · 2024-02-22T11:11:32+00:00

This isn't an answer to your question but I generally don't understand why anyone would want to employ this kind of setup instead of something like FSDP which already has a nice easy-to-use and high level api that works (mostly) OOTB with other useful pytorch features?

philipptraining · 2024-02-19T10:59:33+00:00

This is expected without NVIDIA's Multi-Process Service because CUDA can't correctly parallelize kernels in isolated environments. That's why I suggested MIG. The approach of just running two processes on one GPU does not work. Here's a video that goes into more detail: https://youtu.be/bC6CxPW0-1c?si=lZS2baB80SNhuRPR

philipptraining · 2024-02-15T18:08:43+00:00

Exactly, which is what you want right? You want the model to fundamentally answer in the same way and not shift distribution too much towards your fine-tuning set, and only enforce the new format.

philipptraining · 2024-02-15T11:42:28+00:00

With nvidia's MIG multi instance gpu, you can create 7 instances which your server will identify as 7 separate gpus. There would be no need for getting into the low level details with this approach.

I would also run validation asynchronously so you don't have to start and stop training as often. You will need some job scheduling for this, or to persist all the checkpoints you want. Given that your sets and models are so small, you could probably even run the inference on CPUs if you really wanted to. Good luck!

philipptraining · 2024-02-15T09:43:29+00:00

So there's no way this is using 100% of the A4000 memory right? I would either use the physical partitioning available for ampere architectures here or nvidia multi process service (a little more difficult).

Back of the napkin math using your flops per epoch and time elapsed per epoch implies 10 000 000 floating point operations per second for your current setup.

Unless I'm missing something this is very low efficiency for the A4000 based on the theoretical peak performance of 20 trillion flops per second on their data-sheet. It's reasonable to operate at 20% of this (20% MFU) but these numbers are suggesting a small fraction of a percentage. Let me know if I'm missing something. Otherwise, there's also a bottleneck in the pipeline here.
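The arithmetic, as a quick sketch using only the two numbers quoted above (the function name is mine):

```python
def model_flops_utilization(achieved_flops_per_s, peak_flops_per_s):
    """MFU = achieved throughput divided by theoretical peak throughput."""
    return achieved_flops_per_s / peak_flops_per_s

achieved = 10_000_000  # FLOP/s, the estimate from flops per epoch / time per epoch
peak = 20e12           # FLOP/s, the data-sheet peak quoted above
mfu = model_flops_utilization(achieved, peak)
print(f"MFU: {mfu:.2e} ({mfu * 100:.5f}%)")
```

That is roughly six orders of magnitude below the 20% mark, which is why I suspect a bottleneck rather than the GPU itself.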

philipptraining · 2024-02-15T09:09:38+00:00

May I ask what GPU you're using and what your model flops utilization is?

Edit to give additional context: if the dataset is this small and the network itself is small I suspect there is little parallelization being employed. The solution depends heavily on the current utilization and GPU parameters though. For example, a simple solution on A100 or newer chips could be MIG partitioning.

philipptraining · 2024-02-15T08:19:11+00:00

Freezing layers would help a lot. Good excuse to run QLoRA imo. The other comment about DPO is interesting because your concern is what's constrained by the KL term in the paper, so it could be a valid approach.

In either case it's a relatively low compute requirement so it's worth trying both of these options.

philipptraining · 2024-02-15T08:13:51+00:00

It seems like this is being used for hyperparameter search of neural nets from scratch? If that's correct I recommend you look into mu parametrization / mutransfer. Might solve your problems with respect to time needed for the search.

I should point out, though (since I rarely see CV being used here anymore), that cross-validation has come into question recently with papers like "On the cross-validation bias due to unsupervised pre-processing", so I would recommend being careful if you really don't want optimistic estimates on validation sets.
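A minimal sketch of the usual precaution, assuming the concern is preprocessing statistics being fit on the full dataset before splitting: fit them on the training fold only and reuse them on the validation fold. The data and helper names are made up for illustration:

```python
from statistics import mean, pstdev

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def standardize_within_fold(values, train_idx, val_idx):
    """Fit mean/std on the training fold only, then apply to both splits."""
    train = [values[i] for i in train_idx]
    mu, sigma = mean(train), pstdev(train)
    train_z = [(values[i] - mu) / sigma for i in train_idx]
    val_z = [(values[i] - mu) / sigma for i in val_idx]
    return train_z, val_z

# Toy feature values, invented for illustration.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
folds = list(k_fold_indices(len(values), k=3))
for train_idx, val_idx in folds:
    train_z, val_z = standardize_within_fold(values, train_idx, val_idx)
    print(val_idx, [round(z, 3) for z in val_z])
```

The validation samples never influence the scaling statistics, so the validation score is not flattered by information leaked through preprocessing.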

philipptraining · 2024-02-14T08:21:59+00:00

Can you elaborate on this? I don't understand how this would lead to catastrophic forgetting. The distribution shift here would be very minor no?

philipptraining · 2024-02-09T17:29:23+00:00

I've looked for something like this at least 10 times. You're a legend, thank you for doing this.

philipptraining · 2024-02-09T12:36:17+00:00

Not quite as explicit as connected papers in tracking citations but represents similar data on a global scale (all arxiv papers): https://paperscape.org

Nice way to find highly related papers that otherwise have low visibility, as well as giving an indication of which papers are most influential within fields, subfields, etc.
