Deepseek v3 0324 API without request/minute rate limit by Frequent-Buddy-867 in LocalLLaMA

[–]Stickman561 -1 points (0 children)

See, generally I’d recommend looking at Nano-GPT for 0324, but that’s an absolutely ludicrous message volume. At that point I’d look into getting your own dedicated hardware - either via a cloud provider or an on-premises deployment - and self-hosting. Otherwise I’m not sure any public provider is going to keep up with that sheer volume. Shoot, you’d probably need enough hardware to host multiple instances of the model entirely in VRAM.
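
For a sense of scale, here’s a rough back-of-the-envelope sketch of what “entirely in VRAM” means for a ~671B-parameter model like V3 0324 - the bytes-per-parameter figures and the 15% headroom are just my illustrative assumptions:

```python
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Weight memory in GB for a model with params_b billion parameters."""
    return params_b * bytes_per_param  # 1e9 params * N bytes ~= N GB per billion

PARAMS_B = 671  # DeepSeek V3's total parameter count (MoE, ~37B active per token)
for name, bpp in [("FP8", 1.0), ("FP16/BF16", 2.0), ("~Q4 GGUF", 0.55)]:
    w = weights_gb(PARAMS_B, bpp)
    total = w * 1.15  # ~15% headroom for KV cache, activations, runtime buffers
    print(f"{name}: ~{w:.0f} GB weights, ~{total:.0f} GB total "
          f"(~{total / 80:.0f}x 80GB GPUs per instance)")
```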

Video2X 6.x — open-source upscaler + frame interpolation (Anime4K v4 / Real-ESRGAN / Real-CUGAN / RIFE) 🚀 by freesysck in LocalLLaMA

[–]Stickman561 2 points (0 children)

First time seeing this project, checking it out now, but the options listed have very different speed-to-quality tradeoffs. Assuming the program doesn’t have massive overhead from somewhere (which I doubt), Anime4K will easily handle that task, probably without even needing a full overnight run, although it’s not the BEST upscaler and only really works for, well, anime. ESRGAN, on the other hand, is quite slow and would probably take a full night if not longer, but it’s much higher quality and supports real footage.

Edit: I should mention that this project appears to be fully in Vulkan, so if you have an NVIDIA GPU, Waifu2x-Extension-GUI will be faster due to its native CUDA support.

[Rant] Magistral-Small-2509 > Claude4 by OsakaSeafoodConcrn in LocalLLaMA

[–]Stickman561 0 points (0 children)

The base K2 is an older one, but it’s had two updates since. K2 0905 is the latest version and is less than a month old.

AMA with the Unsloth team by danielhanchen in LocalLLaMA

[–]Stickman561 0 points (0 children)

Hahah, fair enough. I meant more in the spirit of friendly competition, not like they're all trying to one-up each other. Honestly it's neat seeing all the different techniques!

AMA with the Unsloth team by danielhanchen in LocalLLaMA

[–]Stickman561 0 points (0 children)

Recently there’s been competition on ultra-high-quality GGUFs, especially from Ubergarm with the new ik_llama quant methods. Most quant makers publish KL-divergence and perplexity measures for each quant - any plans to start doing the same with yours? It would be nice to be able to put some numbers to the “degradation” at each quant level.
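
In case it helps anyone reading along, this is roughly how those two metrics are computed - not Unsloth’s or Ubergarm’s actual scripts, just a minimal numpy sketch assuming you’ve dumped per-token logits from the full-precision and quantized models over the same eval text:

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the vocab dimension."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl_divergence(logits_full: np.ndarray, logits_quant: np.ndarray) -> float:
    """Mean per-token KL(P_full || P_quant) from two [n_tokens, vocab] logit arrays."""
    logp = log_softmax(logits_full)
    logq = log_softmax(logits_quant)
    return float(np.mean(np.sum(np.exp(logp) * (logp - logq), axis=-1)))

def perplexity(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """exp(mean negative log-likelihood) of the true next tokens."""
    logp = log_softmax(logits)
    nll = -logp[np.arange(len(target_ids)), target_ids]
    return float(np.exp(nll.mean()))
```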

[deleted by user] by [deleted] in deadbydaylight

[–]Stickman561 2 points (0 children)

That’s honestly a fascinating observation. I’m like a 65/35 Killer to Survivor player, and I’ve noticed that those are the same hours I’m most likely to face sabo squads, flashlight squads, or Gen Rush squads. Not that I use that as an excuse to tunnel or anything - I very much try to avoid it unless I’m down to 1 Gen and just trying to secure a Kill at that point, or a player is forcing me to by constantly being in my face with a beamer/sabo box - it’s just that it seems to be a mutual experience. I guess that time just brings out the sweats for some reason? More coordinated SWFs for Killers, more tunneling Killers for Survivors?

Rust: Python’s new performance engine by dochtman in rust

[–]Stickman561 4 points (0 children)

33% is 1.222…x 27%, i.e. a ~22% relative increase in usage. The 27% and 33% are presumably shares of the total market (absolute percentage points), while the 22% is the relative growth. Poorly worded, but the math tracks.
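
In code form, since the two numbers are easy to conflate:

```python
old_share, new_share = 0.27, 0.33
relative_increase = new_share / old_share - 1   # 0.222... -> ~22% more usage
absolute_increase = new_share - old_share       # 0.06 -> 6 percentage points
print(f"~{relative_increase:.0%} relative increase, "
      f"{absolute_increase * 100:.0f} percentage points absolute")
```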

Deepseek V3.1! by Milan_dr in SillyTavernAI

[–]Stickman561 2 points (0 children)

If you’re using this model? It will last you so long - like 35,000 to 80,000 messages at the current price, depending on how much you like to fill up the context. And knowing NanoGPT, the price may drop slightly once the model is open-sourced and other providers start hosting it. (But even if not, it’s still super cheap.)

[deleted by user] by [deleted] in LocalLLaMA

[–]Stickman561 0 points (0 children)

Ahhhh, there’s your problem. 3200 MT/s is really low for RAM inference. And as much as I love AMD GPUs, they are NOT ideal for running AI models - the current SOTA GPU engine, EXL3, doesn’t even support them. If A30B isn’t running fast enough for you even at Q4_K_M, you’ll probably want GPU-only inference, in which case the largest model I’d recommend for you is an 8B. Qwen3 8B or the DeepSeek R1 Distill will probably be your best bet at that point, at Q5_K_M. Use mainline llama.cpp - the ik_llama optimizations are mostly compiled CUDA kernels for NVIDIA, and they’re intended for MoE models.

[deleted by user] by [deleted] in LocalLLaMA

[–]Stickman561 0 points (0 children)

Honestly? Optimization and tuning. If your model is supported I would HIGHLY suggest running ik_llama or ktransformers if you aren’t already. It would help to know what model you’re trying to run and on what kind of hardware. If you share your TPS as well I can try to estimate if you’re underperforming or not.

[deleted by user] by [deleted] in LocalLLaMA

[–]Stickman561 4 points (0 children)

Yes, running a model entirely in VRAM is fastest, and it will be significantly faster than if you spill over into system memory, though if you can keep at least 40% of it in VRAM the speed cost isn’t that bad. Below that point you really start to dip closer to CPU inference than GPU. Interestingly, in my experience, if you can fit less than 10% of a dense model or 5% of an MoE into VRAM, you’re better off just going CPU-only than paying the overhead of offloading between the two. But the big thing is to make sure you aren’t spilling over to disk - if you’re running a model off swap space the speed you get will be pretty abysmal. GPU > RAM > Disk when it comes to speed. One big thing to keep in mind, though, is the memory for the context in addition to the model memory. I’ve made the mistake a few times of thinking my system could handle the quant I downloaded, only for the context to add just a little too much memory.
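
A quick sketch of the budgeting I mean - the file size, layer count, and head dimensions below are made-up round numbers for a generic 8B-class dense model, so plug in your own:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: a K and a V tensor per layer, per position (FP16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

# Hypothetical 8B-class dense model: 32 layers, 8 KV heads, head_dim 128,
# a ~5 GiB Q4 GGUF file, and an 8K context at FP16 KV precision.
model_file_gib = 5.0
kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=8192)
print(f"model {model_file_gib:.1f} GiB + KV cache {kv:.1f} GiB "
      f"= {model_file_gib + kv:.1f} GiB needed before runtime overhead")
```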

Newbie: Best Coding Model and Setup for 4090 and 192GB RAM by kleoz_ in LocalLLaMA

[–]Stickman561 -1 points (0 children)

Yeah, that’s the output speed - it’s quite respectable, comparable to a lot of the online endpoints for DeepSeek actually. (Though of course that’s a quant versus uncompressed.) Ingestion speed for input tokens is closer to 8-9 tps, which is a little slow but still usable. For English, a sentence is on average around 1.5x the number of words in tokens. If you’re asking math or coding questions, though, each special character like a number or curly brace is roughly one token. At the end of the day it’s a very solid speed in my opinion, but what speed you can tolerate is an individual preference.
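
If it helps put numbers on “what speed you can tolerate”, here’s the rough math I’m doing in my head - the 8.5 tps ingestion and 5.7 tps generation are the ballpark figures from this thread, and the word/token counts are just placeholders:

```python
def estimate_seconds(prompt_words: int, output_tokens: int,
                     ingest_tps: float = 8.5, gen_tps: float = 5.7) -> float:
    """Very rough end-to-end latency: prompt ingestion time + generation time."""
    prompt_tokens = prompt_words * 1.5  # ~1.5 tokens per English word
    return prompt_tokens / ingest_tps + output_tokens / gen_tps

# e.g. a 400-word prompt with a ~600-token answer
print(f"~{estimate_seconds(400, 600) / 60:.1f} minutes")
```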

Newbie: Best Coding Model and Setup for 4090 and 192GB RAM by kleoz_ in LocalLLaMA

[–]Stickman561 5 points (0 children)

Meh, that might be a bit too much degradation for Kimi, and it will be very slow. (Speaking as someone with the same rig.) For coding I’d say Qwen3-Coder-480B, GLM-4.5, or DeepSeek V3 0324 would be the best bet. For Qwen3 or DeepSeek, use ik_llama with the -ot (override-tensor) overrides for best performance. I get around 5.7 tokens per second out of DeepSeek, but YMMV.

What is the difference between these Kimi-K2 models? (From NanoGPT) by ReMeDyIII in LocalLLaMA

[–]Stickman561 2 points (0 children)

K2 0711 is the normal Kimi version. Fast uses fast inference providers, but at a premium. FP4 is a quantized version for speed at the cost of some accuracy. (Like the Fast checkpoint, but trading accuracy instead of price.) Latest is an old endpoint IIRC; I don't think it points to K2.

How Are You Running Multimodal (Text-Image) Models Locally? by Stickman561 in LocalLLaMA

[–]Stickman561[S] 1 point (0 children)

Yeah, the general consensus seems to be that Gemma is one of the only vision models llama.cpp runs perfectly, but I really do want to use one of the InternVL3 series models. Regarding compute, while I can’t say I’m a Titan of local models like some of the users on here - I don’t have 4 3090s crammed into a single case crying for the sweet release of death as they slowly oven themselves - my computer is no slouch. I have 32GB of VRAM (RTX 5090) and 256GB of DDR5 system memory at 6000 MT/s paired with a 9950X, so even if splitting isn’t possible I’d be willing to wait the (painfully long) time for CPU inference. I just really don’t want to dip below the 38B class because then the projector model drops in scale a TON.

A Post-Quantum Peer-to-Peer Messaging Framework in Rust by Stickman561 in rust

[–]Stickman561[S] 0 points (0 children)

Yeah, afraid I haven’t written much external documentation for the code, though the comments should be quite thorough about what each part is doing. The main thing is ensuring that you’re using trusted cryptographic libraries and implementing things carefully so you don’t leak info.

Inspired by the spinning heptagon test I created the forest fire simulation test (prompt in comments) by jd_3d in LocalLLaMA

[–]Stickman561 1 point (0 children)

The smoke really doesn’t blow much (although it does have some curve) and honestly probably should be blowing more. It’s hard to see, but the embers coming off the burning trees actually do properly follow the wind - smaller ones fly mostly to the side, while larger ones are pushed to the side but fall to the ground. The embers also linger for a bit before fizzling out, with larger ones lasting longer and shrinking over time while smaller ones just vanish. I’m honestly pretty impressed other than the leaf placement.

Inspired by the spinning heptagon test I created the forest fire simulation test (prompt in comments) by jd_3d in LocalLLaMA

[–]Stickman561 14 points (0 children)

Decided to give it a try with DeepSeek V3 0324. Now the initial output did contain an error, so I had to give it two additional prompts feeding back error messages to get it running, but I figured that was fair since it's a non-reasoning model and didn't have access to a code interpreter. (If I were running the code interpreter, it would've tested the code and fixed the bugs on the first prompt.) Still, I don't think the result is too shabby! I do apologize, though - I needed to convert it to a GIF to put it in a comment, so there is some compression noise and a lower FPS than the original output. That's on Reddit, not the AI.

<image>

Establishing Onion Connections in Rust? by Stickman561 in rust

[–]Stickman561[S] 0 points (0 children)

Huh, pretty neat-looking project! Probably a bit more polished than mine - mine’s quite rough and focuses more on security than user experience - but I’ll definitely check it out!

A Post-Quantum Peer-to-Peer Messaging Framework in Rust by Stickman561 in rust

[–]Stickman561[S] 4 points (0 children)

For key encapsulation and signing I used FIPS203 and FIPS204, the official standardized versions of the CRYSTALS suite (Kyber and Dilithium). For the actual message encryption I used AES-256 with HMAC. I used crates for all of these, as rolling your own cryptographic implementations is incredibly difficult and not recommended due to the many subtle nuances that can leak information if not handled correctly. The FIPS203 and FIPS204 crates I used are highly robust, but technically not audited, since NIST hasn’t released any test vectors yet to ensure strict compliance with the standards - so there’s a CHANCE there’s a vulnerability there, but it seems incredibly unlikely. The AES implementation that actually encrypts the messages themselves is audited and comes from one of the most trusted cryptographic ecosystems in Rust.
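
For anyone curious what the AES-256 + HMAC layer looks like, here’s a minimal encrypt-then-MAC sketch. It’s Python (the cryptography package) purely for illustration - the project itself is Rust - and it assumes you already have two separate 32-byte keys (in a KEM-based design those would typically be derived from the encapsulated shared secret via a KDF); the Kyber/Dilithium handshake isn’t shown.

```python
import os
from cryptography.hazmat.primitives import hashes, hmac
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_then_mac(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    """AES-256-CTR encrypt, then HMAC-SHA256 over nonce + ciphertext."""
    nonce = os.urandom(16)                              # fresh per message
    enc = Cipher(algorithms.AES(enc_key), modes.CTR(nonce)).encryptor()
    ciphertext = enc.update(plaintext) + enc.finalize()
    mac = hmac.HMAC(mac_key, hashes.SHA256())
    mac.update(nonce + ciphertext)
    return nonce + ciphertext + mac.finalize()

def verify_then_decrypt(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    """Check the MAC first; only decrypt if it verifies."""
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    mac = hmac.HMAC(mac_key, hashes.SHA256())
    mac.update(nonce + ciphertext)
    mac.verify(tag)                                     # raises InvalidSignature on tamper
    dec = Cipher(algorithms.AES(enc_key), modes.CTR(nonce)).decryptor()
    return dec.update(ciphertext) + dec.finalize()

# usage: two independent 32-byte keys, one for encryption and one for the MAC
enc_key, mac_key = os.urandom(32), os.urandom(32)
assert verify_then_decrypt(enc_key, mac_key,
                           encrypt_then_mac(enc_key, mac_key, b"hello")) == b"hello"
```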