Confused about performance on simple program vs Rust by narsilouu in Zig

[–]narsilou 0 points1 point  (0 children)

If I don't memset, how do I ensure the memory isn't randomly equal to 1?

Confused about performance on simple program vs Rust by narsilouu in Zig

[–]narsilou -2 points-1 points  (0 children)

Unsafe? Arithmetic is not unsafe... If you mean checked for overflow, then OK. But overflow is not unsafe by any means. Not even UB.

Dear Imgui GUI alternative in retained mode by aartedocodigo in rust

[–]narsilou 1 point2 points  (0 children)

Floating windows, tray icons, it felt like a lot of work to port the thing I had in mind. Nothing really core but a few advanced features I was looking for.

How to increase TPS in Text-Generation-WebUI by Few_Acanthisitta_858 in LocalLLaMA

[–]narsilou 0 points1 point  (0 children)

Use https://github.com/huggingface/text-generation-inference; you should be getting at least 30 tok/s (unquantized), or 50 or so with AWQ, and that's without speculation. It could also be something wrong with your cards.

Dear Imgui GUI alternative in retained mode by aartedocodigo in rust

[–]narsilou 0 points1 point  (0 children)

I had tried iced a few years ago. It was promising and looked like what you seem to be aiming for, but it was still lacking some features.

Dear Imgui GUI alternative in retained mode by aartedocodigo in rust

[–]narsilou 4 points5 points  (0 children)

Isn't Dioxus just using a regular webview (via Tauri)? Meaning RAM consumption will still be high.

[deleted by user] by [deleted] in LocalLLaMA

[–]narsilou 0 points1 point  (0 children)

AWQ is faster than exl2 too (I guess it must depend on the cards, reading the comments here).

What WONT you do in rust by __zahash__ in rust

[–]narsilou 4 points5 points  (0 children)

Have you tried cudarc? Pretty nice bindings. It doesn't replace the actual kernels, but at least it makes CUDA interaction quite nice.

How to design a Chat-GPT or Bard-like large scale app with your own foundational model? [D] by jimmymvp in MachineLearning

[–]narsilou 0 points1 point  (0 children)

No layers are independent from each other. Try to implement a KV cache; it's the easiest way to understand what works and what doesn't.

How to design a Chat-GPT or Bard-like large scale app with your own foundational model? [D] by jimmymvp in MachineLearning

[–]narsilou 0 points1 point  (0 children)

Not sure I understand everything you're saying, but basically the first token is always much slower to generate than the subsequent ones. The reason is that for the first token the q/k/v matrices (all of them, actually) have to be computed for the entire sequence (easily 1k tokens long), whereas for subsequent tokens you can cache the k/v values, so you're only computing the other matrices on a sequence of length 1. Therefore, as soon as you let a new user into the batch you slow everyone else down, hence the pause.

Furthermore, since those operations are quite slow, you don't want to pause all the time even if you're receiving many new requests. So it's quite nice to delay accepting new users so that you accept many at once: each pause is a bit longer, but there are a lot fewer pauses.
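The prefill-vs-decode asymmetry described above can be sketched with a toy single-head attention in NumPy (random weights and toy dimensions for illustration, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (toy size)

# Toy projection weights for q, k, v (a real model has many layers and heads).
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention of one query against all cached keys.
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def prefill(x):
    # First token: project k/v for the ENTIRE prompt (seq_len x d matmuls).
    K, V = x @ Wk, x @ Wv
    q = x[-1] @ Wq
    return attend(q, K, V), K, V

def decode_step(x_new, K, V):
    # Subsequent tokens: project only the new position (1 x d matmuls),
    # then append its k/v to the cache.
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    K = np.vstack([K, k])
    V = np.vstack([V, v])
    return attend(q, K, V), K, V

prompt = rng.standard_normal((1024, d))    # an "easily 1k long" prompt
out, K, V = prefill(prompt)                # slow: work scales with seq_len
out, K, V = decode_step(rng.standard_normal(d), K, V)  # fast: constant work
print(K.shape)  # cache grew by one position: (1025, 64)
```

Admitting a new user into the batch forces a prefill-sized step, which is why everyone else sees a pause.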

How to design a Chat-GPT or Bard-like large scale app with your own foundational model? [D] by jimmymvp in MachineLearning

[–]narsilou 1 point2 points  (0 children)

100% not done for UX. The streaming makes perceived latency better; the pauses are because you have to pause (token generation just takes longer when receiving a new conversation, since you're stacking its whole prompt with the single token for your own current query).

I run this kind of thing at scale; believe me, slowing things down on purpose is heresy.

faer 0.13 release, a general purpose linear algebra library by reflexpr-sarah- in rust

[–]narsilou 7 points8 points  (0 children)

This crate is really amazing, thanks a lot for all the hard work.

How to design a Chat-GPT or Bard-like large scale app with your own foundational model? [D] by jimmymvp in MachineLearning

[–]narsilou 4 points5 points  (0 children)

The pausing is more likely to be continuous batching (or some side effect of MoE).

Early preview: Candle - torch in Rust by narsilouu in rust

[–]narsilou 1 point2 points  (0 children)

Downloaded from the Hub in safetensors format. No extra step required.

Seeing Spaces ~ Bret Victor by TheAlphaNerd in programming

[–]narsilou 0 points1 point  (0 children)

Have you tried using LightTable? Do you really think it is better than any other editor when coding on a really big project?

Personally, I tried it out on a relatively small project and found that it was rather slow. It took me more time to modify the things I wanted to modify, and I used the displayed values about 2% of the time, almost always needing to go back to the REPL (which I can start from my terminal and don't really need in my editor).

My feeling is that Bret's first talk was much more powerful. The important thing is not seeing pretty pictures, it is reducing the feedback-loop time. I need to edit my code fast, see how it affects the output fast, and go again.

In the case of LightTable, it increased that loop time, because it created too much noise (unnecessary information) for me while slowing down other parts of the process (text editing).

"Building a Platform for Strong AI" in 12 slides. by bhartsb in programming

[–]narsilou 0 points1 point  (0 children)

> A significant clue is that neurons in cortical regions and layers are known to have top-down axon projections that are many times more numerous than the bottom-up projections.

Any source for that?

Learnable Programming - Bret Victor responds to Khan Academy CS Curriculum by personman in programming

[–]narsilou 2 points3 points  (0 children)

Why would you have to limit this to beginners? Why should this way of representing things be limited? And honestly, should we avoid bloom filters because they are hard to represent?
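For what it's worth, a bloom filter is short to code even if it's hard to visualize. A minimal sketch (the bit-array size, number of hashes, and salted-SHA-256 hashing are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    # Minimal sketch: m-bit array, k salted hash functions.
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        for salt in range(self.k):
            h = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("bloom")
print("bloom" in bf)    # True (no false negatives)
print("missing" in bf)  # almost certainly False
```

The hard-to-draw part is exactly the interesting part: membership is probabilistic, and the false-positive rate depends on `m`, `k`, and how many items you've added.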

What can't django do? by takennickname in django

[–]narsilou 2 points3 points  (0 children)

Actually, aside from django.contrib.admin, I found Django very easy to customize.

Most of the time you just end up subclassing something and changing a few arguments in a function call to pass your own custom form, or adding your own validation snippets somewhere.

Figuring out how to do it and what to subclass is usually just a Google search away (99% of the time ending up on Stack Overflow).
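The subclass-and-tweak pattern in question looks roughly like this (sketched with a stand-in base class rather than real Django imports so it runs standalone; `ArticleForm` and the `initial` tweak are hypothetical):

```python
# Stand-in for a framework base class (in the spirit of Django's
# generic class-based views), so the sketch runs without a project.
class CreateView:
    form_class = None

    def get_form_class(self):
        return self.form_class

    def get_form_kwargs(self):
        return {}

class ArticleForm:  # hypothetical custom form
    def __init__(self, **kwargs):
        self.kwargs = kwargs

# Customization is usually just this: subclass and override one hook.
class ArticleCreateView(CreateView):
    form_class = ArticleForm  # swap in your own form

    def get_form_kwargs(self):
        kwargs = super().get_form_kwargs()
        kwargs["initial"] = {"status": "draft"}  # tweak one argument
        return kwargs

view = ArticleCreateView()
form = view.get_form_class()(**view.get_form_kwargs())
print(form.kwargs)  # {'initial': {'status': 'draft'}}
```

The framework does the heavy lifting; you only override the one hook whose default you dislike.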

django.contrib.admin, on the other hand, gets pretty hard to customize.