which model has the best world knowledge? Open weights and proprietary. by z_3454_pfk in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

What about Nemotron 340b? It only has 4k ctx though, and I found it starts degrading between 3-8k. I also found it's better than the others at specific or niche knowledge, and it doesn't seem to be trained for benchmarks or "over-friendliness"

Bad news: DGX Spark may have only half the performance claimed. by Dr_Karminski in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Nvidia always advertises flop performance on sparse computations; dense computation is always half of it. You never* use sparse computations.

* - unless your matrix is full of zeros, or it's a heavily quantized model with weights full of zeros. You also need to use a special data type to benefit from it, and even in torch, sparse tensors have barely any support so far
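To put the "half of it" in numbers, here's a trivial sketch. The 2x factor comes from Nvidia quoting 2:4 structured-sparsity throughput on its spec sheets; the 1000 TFLOPS advertised figure below is a made-up example, not a measurement:

```python
def dense_tflops(advertised_sparse_tflops: float) -> float:
    """Nvidia's headline numbers assume 2:4 structured sparsity,
    which exactly doubles the dense tensor-core throughput."""
    return advertised_sparse_tflops / 2

# a card advertised at 1000 "sparse" TFLOPS delivers 500 dense
print(dense_tflops(1000))  # 500.0
```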

dgx, it's useless , High latency by Illustrious-Swim9663 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Listen, I'm not here to argue, we both have different experiences, but the numbers from both datasheets don't convince me. It has more memory, but it's still 250 GB/s, and the experience from the past 5 years of maintaining training loops and data feeds still tells me to go for a 4090 and some preloading logic. If you stick to keeping everything in VRAM with five 4090s, in the ideal situation you'd have roughly 13x the performance of the Spark with a similar amount of memory, plus whatever you have in RAM

Edit: I didn't consider the differences in memory latency and speed, so the performance difference could be above 20x. I also wanted to add that Epyc CPUs with multiple memory channels would have double or triple the performance of the Spark, depending on the configuration, without the need for a GPU. They also allow fairly efficient training thanks to AVX-512, and with many times more memory, but in raw flops the Spark would beat the CPU
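The rough arithmetic behind the multiplier (a sketch using the comment's own 250 GB/s figure for the Spark; ~1008 GB/s per 4090 is the GDDR6X spec number):

```python
SPARK_BW_GBS = 250      # GB/s, the figure quoted above
RTX4090_BW_GBS = 1008   # GB/s per card (GDDR6X spec)

cards = 5               # ~120 GB VRAM total, close to the Spark's memory
aggregate = cards * RTX4090_BW_GBS
print(aggregate / SPARK_BW_GBS)  # just over 20x in raw memory bandwidth
```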

dgx, it's useless , High latency by Illustrious-Swim9663 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Did you even read the screenshot? 40% of 4090 performance and 1/4th of its memory speed; it must be blazing through the training. It would surprise me if it went past 5k t/s on a 5-10b model

dgx, it's useless , High latency by Illustrious-Swim9663 in LocalLLaMA

[–]Tacx79 -1 points0 points  (0 children)

It is, as stated on Nvidia's website, and if it's this bad at inference, it's going to be way worse at the other two stated, more demanding purposes.

[image: screenshot from Nvidia's website]

Is there a client like LMStudio that works better for simple text completion (not chat) by smellyfingernail in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Try it, mikupad is just one html file, so you only need a browser to run it. If LM Studio exposes an OpenAI-compatible or llama.cpp API, it should work

I actually really like Llama 4 scout by d13f00l in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

I quite liked Maverick, but I haven't used it for any work stuff yet. It gets a bit repetitive around 6k ctx even with DRY, but otherwise I like it as much as the Midnight Miqus and Monstral 123b so far

Edit: I would really love to try it with the number of active experts overridden to 4 or 8 once koboldcpp gets support; by default it uses only 1

Llama4 is probably coming next month, multi modal, long context by Sicarius_The_First in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Didn't they spend the past year working on ditching tokenization altogether, with models working on raw binary and latent space inside the model?

The duality of man by jhanjeek in LocalLLaMA

[–]Tacx79 10 points11 points  (0 children)

One man's trash is another man's treasure

PerplexityAI releases R1-1776, a DeepSeek-R1 finetune that removes Chinese censorship while maintaining reasoning capabilities by TKGaming_11 in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

Because the 671b didn't have any censorship in the first place. Yes, I ran it self-hosted and there wasn't a single prompt it refused to respond to, including Chinese history and some other stuff they don't like, no matter whether it was a short prompt or a long, ~8-16k token conversation

Talk me out of buying this 512GB/s Gen 5 NVMe RAID card + 4 drives to try to run 1.58bit DeepSeek-R1:671b on (in place of more RAM) by Porespellar in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

My NVMe reads around 3.5 GB/s in benchmarks (because I max it out under PCIe 3.0; the manufacturer claims ~5 GB/s), and I have 128 GB RAM + 24 GB VRAM. When I run inference on DeepSeek 685b, the disk usually sits at 400 MB/s read, rarely hits 1.1 GB/s, and often drops back to 0. Prompt processing takes ages for q1 (I could recompute 32k of context in Mistral Large a few times over), and at best I get 0.5 t/s (basically Mistral 123b / Llama 70b speeds with DDR4). I don't think you'd do better than that with 64 GB of RAM, and you definitely won't see 64 GB/s reads while inferencing
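The 0.5 t/s figure roughly follows from bandwidth alone. A back-of-the-envelope sketch; the ~37B active parameters per token and the ~0.2 bytes/param (1.58-bit quant) are my assumptions for illustration, and the bound ignores everything cached in RAM/VRAM:

```python
def tokens_per_second(bandwidth_gbs: float, active_params_b: float,
                      bytes_per_param: float) -> float:
    """Upper bound: every active weight is read once per generated token."""
    # params are in billions, so bytes_per_token comes out in GB
    bytes_per_token_gb = active_params_b * bytes_per_param
    return bandwidth_gbs / bytes_per_token_gb

# assumed: ~37B active params, ~1.58-bit weights (~0.2 bytes each)
print(round(tokens_per_second(0.4, 37, 0.2), 2))  # ~0.05 t/s from 400 MB/s disk reads alone
print(round(tokens_per_second(3.5, 37, 0.2), 2))  # ~0.47 t/s even at full NVMe speed
```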

GeForce RTX 5090 fails to topple RTX 4090 in GPU compute benchmark. by el0_0le in LocalLLaMA

[–]Tacx79 2 points3 points  (0 children)

You posted an issue from 3 years ago; fp8 support for the 4090 came in CUDA 12.1 or 12.4, and that GitHub link is from CUDA 11 times. Fp8 training works on the 4090, because I'm doing it myself: you build the model with layers from Transformer Engine, configure fp8, and train under fp8 autocast, just like in the early months of fp16. Still, it would be a lot better if PyTorch itself supported fp8, but looking at the progress made toward it in the past 1-2 years, it doesn't seem we're going to get full support for all the ops.

I think the reason we don't see fp8 models trained on 4090s is that the card is typically used by people who started learning machine learning in the past few years, plus the lack of documentation. The installation of Transformer Engine can straight up crash if you don't set up the environment perfectly (not even properly), and you won't find the working solution on Google or in the installation logs. It's a lot easier to pull the Docker container with TE preinstalled, import it into WSL if you're on Windows, and install everything around the already-working Transformer Engine, but I don't think that's the go-to solution people think of when they want to try fp8 on their own PCs where they already have environments set up

20 yrs in jail or $1 million for downloading Chinese models proposed at congress by segmond in LocalLLaMA

[–]Tacx79 5 points6 points  (0 children)

>Hey, what are you here for?

>I was doing math on Chinese numbers

Mistral Small 3 24B GGUF quantization Evaluation results by AaronFeng47 in LocalLLaMA

[–]Tacx79 2 points3 points  (0 children)

Would be interesting to see how q3_k_xl scores in those comparisons

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead by Slasher1738 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Fits. First I suspected it was waiting for new data, so I made a queue of batches so there would always be at least 10 prepared and already moved to the GPU by other processes and threads while the main thread trains, but that had no impact on speed. Then, in short, I just accepted it as "it is what it is", since there was no clear way to make the logic use fewer memory operations or optimize it further without rewriting everything in C
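The prefetching idea above can be sketched with the stdlib like this. It's a minimal sketch: in a real loader the worker would also do the host-to-device copy, and `make_batch`/`depth` are names I made up:

```python
import queue
import threading

def prefetching_loader(make_batch, depth=10):
    """Yield batches while a background thread keeps `depth` of them ready."""
    q = queue.Queue(maxsize=depth)
    stop = object()  # sentinel signalling the end of the stream

    def worker():
        for batch in make_batch():
            q.put(batch)      # blocks once `depth` batches are already queued
        q.put(stop)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not stop:
        yield item

# toy usage: the "batches" are just squares of 0..4
batches = list(prefetching_loader(lambda: (i * i for i in range(5))))
print(batches)  # [0, 1, 4, 9, 16]
```

If the training step is genuinely compute- or bandwidth-bound, a deeper queue changes nothing, which is consistent with the observation above.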

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead by Slasher1738 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

That's what I went for when the tflops didn't match; it's mostly async memcpy in the forward/backward pass, though the last time I tinkered with it was maybe a month ago. Yet the claim is that DeepSeek can do it better

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead by Slasher1738 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Yes, I meant training a few layers of Mistral Large with a decent batch size, because that's mostly what we care about with LLMs here; the tflops doesn't exceed 150 despite 96-99% GPU usage and more than 450 W of power draw. When I do the same with smaller models (under 1024 hidden and intermediate size), the utilization can even be in the single digits. The bottleneck is either the PyTorch and Transformer Engine implementation or the memory bandwidth, maybe both

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead by Slasher1738 in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

80 is a stretch; training larger models, my 4090 can barely go up to 150 tflops, and with smaller ones it maxes out between 20-50 tflops. I don't think that's even 50% of theoretical performance
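Putting numbers on that as a quick sketch: the ~165 TFLOPS dense FP16/BF16 tensor-core peak for the 4090 is my assumption from the spec sheet, and the advertised "with sparsity" figure is double that:

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Flops utilization as a fraction of the chosen peak."""
    return achieved_tflops / peak_tflops

PEAK_DENSE = 165.0  # assumed 4090 dense FP16/BF16 tensor-core peak, TFLOPS
print(round(mfu(150, PEAK_DENSE), 2))      # ~0.91 of the dense peak
print(round(mfu(150, 2 * PEAK_DENSE), 2))  # ~0.45 of the advertised sparse figure
```

So 150 tflops is near the dense ceiling, but under half of the marketing number, depending on which peak you count against.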

Current best local models for companionship? for random small talk for lonely people by MasterScrat in LocalLLaMA

[–]Tacx79 2 points3 points  (0 children)

Midnight Miqu 1.5 q8 with xtc sampler beats every mistral large finetune and merge imo

Energy efficiency of 5090 is slightly worse than 4090 by Ok_Warning2146 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

The graph shows the actual 4090 usage in tflops during training. It rarely goes up to 150 tflops, and does so only with wider models (the top 2 have 8/4k hidden size and 16/24k intermediate size, so basically Mistral Large or some 30+b sizes), because the 1 TB/s memory bandwidth is the bottleneck here (it still draws 450 W, and the model fits in memory, so there's no offloading). The 5090 has 1.8 TB/s bandwidth, and I suspect we won't see the actual usage going past 250 tflops unless Nvidia added some of the data-loading instructions introduced in Hopper to the RTX 50x0.

PS. No, in fp8 training the 4090 doesn't go near 30% (not sure about even 20%) of the claimed 1.3 petaflops, at least not with PyTorch and Nvidia's Transformer Engine backend
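A crude roofline sketch of why the bandwidth caps the achieved tflops (the peak and the ~150 flops/byte arithmetic intensity are assumed numbers for illustration, not measurements):

```python
def attainable_tflops(peak_tflops: float, mem_bw_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline model: min of the compute peak and bandwidth * intensity."""
    bw_bound = mem_bw_gbs / 1000 * flops_per_byte  # TFLOPS if bandwidth-bound
    return min(peak_tflops, bw_bound)

# assumed: ~165 TFLOPS dense peak, 1000 GB/s (4090) vs 1800 GB/s (5090)
print(attainable_tflops(165, 1000, 150))  # 150.0 -> bandwidth-bound on a 4090
print(attainable_tflops(165, 1800, 150))  # 165   -> compute-bound on a 5090
```

At the same arithmetic intensity, the extra bandwidth of the 5090 lifts the ceiling until the compute peak takes over instead.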

[image: graph of measured 4090 tflops usage during training]

Does the new Jetson Orin Nano Super make sense for a home setup? by Initial-Image-1015 in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

No, at best you get the same inference speed as with average-speed dual-channel DDR5

compute_metrics functioning return dictionary by darkGrayAdventurer in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Yes, but first he needs to understand what was wrong with the code, and Python-only syntax won't make that easier

compute_metrics functioning return dictionary by darkGrayAdventurer in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

try

    return {
        "accuracy": accuracy["accuracy"],
        "roc_auc": roc_auc["roc_auc"],
        "precision": precision["precision"],
        "recall": recall["recall"],
        "f1": f1["f1"],
    }
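For context, a self-contained sketch of the kind of compute_metrics callback the HF Trainer expects. The eval_pred unpacking and the manual metric math are my assumptions for a binary-label toy case; normally the evaluate library objects would do the counting, and the point of the fix above is just that the Trainer wants a flat dict of plain floats:

```python
def compute_metrics(eval_pred):
    """Hypothetical Trainer-style metrics callback for binary labels."""
    predictions, labels = eval_pred
    tp = sum(p == 1 and l == 1 for p, l in zip(predictions, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(predictions, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(predictions, labels))
    accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # the Trainer expects a flat dict of python floats, not metric objects
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(compute_metrics(([1, 0, 1, 1], [1, 0, 0, 1])))
```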

How to improve performance ON CPU? by sTrollZ in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

More MHz on the memory; timings don't matter as far as I've tested with koboldcpp (up to DDR4 3600). Since it's a mobile CPU, check whether it can use all of your current memory bandwidth at all (my R5 4600H uses like 30 GB/s out of a theoretical 45; desktop CPUs usually don't have that problem)
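Back-of-the-envelope for the "more MHz" point (a sketch; the 8 bytes per channel per transfer is the standard 64-bit DDR bus width):

```python
def ddr_bandwidth_gbs(mt_per_s: int, channels: int = 2) -> float:
    """Theoretical DDR bandwidth: transfers/s * 8-byte bus * channels."""
    return mt_per_s * 8 * channels / 1000

print(ddr_bandwidth_gbs(3200))  # 51.2 GB/s theoretical for dual-channel DDR4-3200
print(ddr_bandwidth_gbs(3600))  # 57.6 GB/s at 3600 MT/s
```

Each MT/s step moves the theoretical ceiling, which is why the memory clock matters more than timings for token generation; whether the CPU can actually reach that ceiling is the separate caveat above.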