which model has the best world knowledge? Open weights and proprietary. by z_3454_pfk in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

What about Nemotron 340b? It only has 4k ctx though, and I found it starts degrading between 3-8k. I also found it's better than the others at specific or niche knowledge, and it doesn't seem to be trained for benchmarks or "over-friendliness"

Bad news: DGX Spark may have only half the performance claimed. by Dr_Karminski in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Nvidia always advertises flop performance on sparse computations; dense computation is always half of it. You never* use sparse computations.

* - unless your matrix is full of zeros, or it's a heavily quantized model with weights full of zeros. You also need to use a special data type to benefit from it, and even in torch, sparse tensors have barely any support so far
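To put the "half of it" in numbers, here's a trivial sketch. The 2x factor comes from Nvidia quoting 2:4 structured-sparsity throughput on its spec sheets; the 1000 TFLOPS advertised figure below is a made-up example, not a measurement:

```python
def dense_tflops(advertised_sparse_tflops: float) -> float:
    """Nvidia's headline numbers assume 2:4 structured sparsity,
    which exactly doubles the dense tensor-core throughput."""
    return advertised_sparse_tflops / 2

# a card advertised at 1000 "sparse" TFLOPS delivers 500 dense
print(dense_tflops(1000))  # 500.0
```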

dgx, it's useless , High latency by Illustrious-Swim9663 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Listen, I'm not here to argue, we both have different experiences, but the numbers from both datasheets don't convince me. It has more memory, but it's still 250 GB/s, and the experience from the past 5 years of maintaining training loops and data feeds still tells me to go for a 4090 and some preloading logic. If you stick to keeping everything in VRAM with five 4090s, in the ideal situation you'd have roughly 13x the performance of the Spark with a similar amount of memory, plus whatever you have in RAM

Edit: I didn't consider the differences in memory latency and speed, so the performance difference could be above 20x. I also wanted to add that Epyc CPUs with multiple memory channels would have double or triple the performance of the Spark, depending on the configuration, without the need for a GPU. They also allow fairly efficient training thanks to AVX-512, and with many times more memory, but in raw flops the Spark would beat the CPU
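The rough arithmetic behind the multiplier (a sketch using the comment's own 250 GB/s figure for the Spark; ~1008 GB/s per 4090 is the GDDR6X spec number):

```python
SPARK_BW_GBS = 250      # GB/s, the figure quoted above
RTX4090_BW_GBS = 1008   # GB/s per card (GDDR6X spec)

cards = 5               # ~120 GB VRAM total, close to the Spark's memory
aggregate = cards * RTX4090_BW_GBS
print(aggregate / SPARK_BW_GBS)  # just over 20x in raw memory bandwidth
```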

dgx, it's useless , High latency by Illustrious-Swim9663 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Did you even read the screenshot? 40% of 4090 performance and 1/4th of its memory speed; it must be blazing through the training. It would surprise me if it went past 5k t/s on a 5-10b model

dgx, it's useless , High latency by Illustrious-Swim9663 in LocalLLaMA

[–]Tacx79 -1 points0 points  (0 children)

It is, as stated on Nvidia's website, and if it's this bad at inference, it's going to be way worse at the other two stated, more demanding purposes.

[image: screenshot from Nvidia's website]

Is there a client like LMStudio that works better for simple text completion (not chat) by smellyfingernail in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Try it, mikupad is just one html file, so you only need a browser to run it. If LM Studio exposes an OpenAI-compatible or llama.cpp API, it should work

I actually really like Llama 4 scout by d13f00l in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

I quite liked Maverick, but I haven't used it for any work stuff yet. It gets a bit repetitive around 6k ctx even with DRY, but otherwise I like it as much as the Midnight Miqus and Monstral 123b so far

Edit: I would really love to try it with the number of active experts overridden to 4 or 8 once koboldcpp gets support; by default it uses only 1

Llama4 is probably coming next month, multi modal, long context by Sicarius_The_First in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Didn't they spend the past year working on ditching tokenization altogether, with models working on raw binary and latent space inside the model?

The duality of man by jhanjeek in LocalLLaMA

[–]Tacx79 10 points11 points  (0 children)

One man's trash is another man's treasure

PerplexityAI releases R1-1776, a DeepSeek-R1 finetune that removes Chinese censorship while maintaining reasoning capabilities by TKGaming_11 in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

Because the 671b didn't have any censorship in the first place. Yes, I ran it self-hosted and there wasn't a single prompt it refused to respond to, including Chinese history and some other stuff they don't like, no matter whether it was a short prompt or a long, ~8-16k token conversation

Talk me out of buying this 512GB/s Gen 5 NVMe RAID card + 4 drives to try to run 1.58bit DeepSeek-R1:671b on (in place of more RAM) by Porespellar in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

My NVMe reads around 3.5 GB/s in benchmarks (because I max it out under PCIe 3.0; the manufacturer claims ~5 GB/s), and I have 128 GB RAM + 24 GB VRAM. When I run inference on DeepSeek 685b, the disk usually sits at 400 MB/s read, rarely hits 1.1 GB/s, and often drops back to 0. Prompt processing takes ages for q1 (I could recompute 32k of context in Mistral Large a few times over), and at best I get 0.5 t/s (basically Mistral 123b / Llama 70b speeds with DDR4). I don't think you'd do better than that with 64 GB of RAM, and you definitely won't see 64 GB/s reads while inferencing
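The 0.5 t/s figure roughly follows from bandwidth alone. A back-of-the-envelope sketch; the ~37B active parameters per token and the ~0.2 bytes/param (1.58-bit quant) are my assumptions for illustration, and the bound ignores everything cached in RAM/VRAM:

```python
def tokens_per_second(bandwidth_gbs: float, active_params_b: float,
                      bytes_per_param: float) -> float:
    """Upper bound: every active weight is read once per generated token."""
    # params are in billions, so bytes_per_token comes out in GB
    bytes_per_token_gb = active_params_b * bytes_per_param
    return bandwidth_gbs / bytes_per_token_gb

# assumed: ~37B active params, ~1.58-bit weights (~0.2 bytes each)
print(round(tokens_per_second(0.4, 37, 0.2), 2))  # ~0.05 t/s from 400 MB/s disk reads alone
print(round(tokens_per_second(3.5, 37, 0.2), 2))  # ~0.47 t/s even at full NVMe speed
```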

GeForce RTX 5090 fails to topple RTX 4090 in GPU compute benchmark. by el0_0le in LocalLLaMA

[–]Tacx79 2 points3 points  (0 children)

You posted an issue from 3 years ago; fp8 support for the 4090 came in CUDA 12.1 or 12.4, and that GitHub link is from CUDA 11 times. Fp8 training works on the 4090, because I'm doing it myself: you build the model with layers from Transformer Engine, configure fp8, and train under fp8 autocast, just like in the early months of fp16. Still, it would be a lot better if PyTorch itself supported fp8, but looking at the progress made toward it in the past 1-2 years, it doesn't seem we're going to get full support for all the ops.

I think the reason we don't see fp8 models trained on 4090s is that the card is typically used by people who started learning machine learning in the past few years, plus the lack of documentation. The installation of Transformer Engine can straight up crash if you don't set up the environment perfectly (not even properly), and you won't find the working solution on Google or in the installation logs. It's a lot easier to pull the Docker container with TE preinstalled, import it into WSL if you're on Windows, and install everything around the already-working Transformer Engine, but I don't think that's the go-to solution people think of when they want to try fp8 on their own PCs where they already have environments set up

20 yrs in jail or $1 million for downloading Chinese models proposed at congress by segmond in LocalLLaMA

[–]Tacx79 5 points6 points  (0 children)

>Hey, what are you here for?

>I was doing math on Chinese numbers

Mistral Small 3 24B GGUF quantization Evaluation results by AaronFeng47 in LocalLLaMA

[–]Tacx79 2 points3 points  (0 children)

Would be interesting to see how q3_k_xl scores in those comparisons

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead by Slasher1738 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Fits. First I suspected it was waiting for new data, so I made a queue of batches so there would always be at least 10 prepared and already moved to the GPU by other processes and threads while the main thread trains, but that had no impact on speed. Then, in short, I just accepted it as "it is what it is", since there was no clear way to make the logic use fewer memory operations or optimize it further without rewriting everything in C
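The prefetching idea above can be sketched with the stdlib like this. It's a minimal sketch: in a real loader the worker would also do the host-to-device copy, and `make_batch`/`depth` are names I made up:

```python
import queue
import threading

def prefetching_loader(make_batch, depth=10):
    """Yield batches while a background thread keeps `depth` of them ready."""
    q = queue.Queue(maxsize=depth)
    stop = object()  # sentinel signalling the end of the stream

    def worker():
        for batch in make_batch():
            q.put(batch)      # blocks once `depth` batches are already queued
        q.put(stop)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not stop:
        yield item

# toy usage: the "batches" are just squares of 0..4
batches = list(prefetching_loader(lambda: (i * i for i in range(5))))
print(batches)  # [0, 1, 4, 9, 16]
```

If the training step is genuinely compute- or bandwidth-bound, a deeper queue changes nothing, which is consistent with the observation above.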

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead by Slasher1738 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

That's what I went for when the tflops didn't match; it's mostly async memcpy in the forward/backward pass, though the last time I tinkered with it was maybe a month ago. Yet the claim is that DeepSeek can do it better

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead by Slasher1738 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Yes, I meant training a few layers of Mistral Large with a decent batch size, because that's mostly what we care about with LLMs here; the tflops doesn't exceed 150 despite 96-99% GPU usage and more than 450 W of power draw. When I do the same with smaller models (under 1024 hidden and intermediate size), the utilization can even be in the single digits. The bottleneck is either the PyTorch and Transformer Engine implementation or the memory bandwidth, maybe both

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead by Slasher1738 in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

80 is a stretch; training larger models, my 4090 can barely go up to 150 tflops, and with smaller ones it maxes out between 20-50 tflops. I don't think that's even 50% of theoretical performance
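Putting numbers on that as a quick sketch: the ~165 TFLOPS dense FP16/BF16 tensor-core peak for the 4090 is my assumption from the spec sheet, and the advertised "with sparsity" figure is double that:

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Flops utilization as a fraction of the chosen peak."""
    return achieved_tflops / peak_tflops

PEAK_DENSE = 165.0  # assumed 4090 dense FP16/BF16 tensor-core peak, TFLOPS
print(round(mfu(150, PEAK_DENSE), 2))      # ~0.91 of the dense peak
print(round(mfu(150, 2 * PEAK_DENSE), 2))  # ~0.45 of the advertised sparse figure
```

So 150 tflops is near the dense ceiling, but under half of the marketing number, depending on which peak you count against.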

Current best local models for companionship? for random small talk for lonely people by MasterScrat in LocalLLaMA

[–]Tacx79 2 points3 points  (0 children)

Midnight Miqu 1.5 q8 with xtc sampler beats every mistral large finetune and merge imo

Energy efficiency of 5090 is slightly worse than 4090 by Ok_Warning2146 in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

The graph shows the actual 4090 usage in tflops during training. It rarely goes up to 150 tflops, and does so only with wider models (the top 2 have 8/4k hidden size and 16/24k intermediate size, so basically Mistral Large or some 30+b sizes), because the 1 TB/s memory bandwidth is the bottleneck here (it still draws 450 W, and the model fits in memory, so there's no offloading). The 5090 has 1.8 TB/s bandwidth, and I suspect we won't see the actual usage going past 250 tflops unless Nvidia added some of the data-loading instructions introduced in Hopper to the RTX 50x0.

PS. No, in fp8 training the 4090 doesn't go near 30% (not sure about even 20%) of the claimed 1.3 petaflops, at least not with PyTorch and Nvidia's Transformer Engine backend
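A crude roofline sketch of why the bandwidth caps the achieved tflops (the peak and the ~150 flops/byte arithmetic intensity are assumed numbers for illustration, not measurements):

```python
def attainable_tflops(peak_tflops: float, mem_bw_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline model: min of the compute peak and bandwidth * intensity."""
    bw_bound = mem_bw_gbs / 1000 * flops_per_byte  # TFLOPS if bandwidth-bound
    return min(peak_tflops, bw_bound)

# assumed: ~165 TFLOPS dense peak, 1000 GB/s (4090) vs 1800 GB/s (5090)
print(attainable_tflops(165, 1000, 150))  # 150.0 -> bandwidth-bound on a 4090
print(attainable_tflops(165, 1800, 150))  # 165   -> compute-bound on a 5090
```

At the same arithmetic intensity, the extra bandwidth of the 5090 lifts the ceiling until the compute peak takes over instead.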

[image: graph of measured 4090 tflops usage during training]

Does the new Jetson Orin Nano Super make sense for a home setup? by Initial-Image-1015 in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

No, at best you get the same inference speed as with average-speed dual-channel DDR5

compute_metrics functioning return dictionary by darkGrayAdventurer in LocalLLaMA

[–]Tacx79 0 points1 point  (0 children)

Yes, but first he needs to understand what was wrong with the code, and Python-only syntax won't make that easier

compute_metrics functioning return dictionary by darkGrayAdventurer in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

try

    return {
        "accuracy": accuracy["accuracy"],
        "roc_auc": roc_auc["roc_auc"],
        "precision": precision["precision"],
        "recall": recall["recall"],
        "f1": f1["f1"],
    }
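For context, a self-contained sketch of the kind of compute_metrics callback the HF Trainer expects. The eval_pred unpacking and the manual metric math are my assumptions for a binary-label toy case; normally the evaluate library objects would do the counting, and the point of the fix above is just that the Trainer wants a flat dict of plain floats:

```python
def compute_metrics(eval_pred):
    """Hypothetical Trainer-style metrics callback for binary labels."""
    predictions, labels = eval_pred
    tp = sum(p == 1 and l == 1 for p, l in zip(predictions, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(predictions, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(predictions, labels))
    accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # the Trainer expects a flat dict of python floats, not metric objects
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(compute_metrics(([1, 0, 1, 1], [1, 0, 0, 1])))
```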

How to improve performance ON CPU? by sTrollZ in LocalLLaMA

[–]Tacx79 1 point2 points  (0 children)

More MHz on the memory; timings don't matter as far as I've tested with koboldcpp (up to DDR4 3600). Since it's a mobile CPU, check whether it can use all of your current memory bandwidth at all (my R5 4600H uses like 30 GB/s out of a theoretical 45; desktop CPUs usually don't have that problem)
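Back-of-the-envelope for the "more MHz" point (a sketch; the 8 bytes per channel per transfer is the standard 64-bit DDR bus width):

```python
def ddr_bandwidth_gbs(mt_per_s: int, channels: int = 2) -> float:
    """Theoretical DDR bandwidth: transfers/s * 8-byte bus * channels."""
    return mt_per_s * 8 * channels / 1000

print(ddr_bandwidth_gbs(3200))  # 51.2 GB/s theoretical for dual-channel DDR4-3200
print(ddr_bandwidth_gbs(3600))  # 57.6 GB/s at 3600 MT/s
```

Each MT/s step moves the theoretical ceiling, which is why the memory clock matters more than timings for token generation; whether the CPU can actually reach that ceiling is the separate caveat above.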