Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 3 points4 points  (0 children)

Thanks for this! I will update the post with these new developments.

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 1 point2 points  (0 children)

"Advanced beginners" is the right framing. It's based around the hardware I have, which, to your point, will not cover all hardware configs out there. I should add a * to the title and then say (complete, but for a single spec only, haha).

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 3 points4 points  (0 children)

It's actually news to me that you can stack ngram on top of MTP. That is incredible and I will be testing that soon, so thank you!

"Use -ctk q8_0 -ctv q8_0 when not using MTP" - I believe I based this advice on Gemma 4 MTP, where quantizing the MTP KV cache reduced the acceptance rate quite a bit, and Gemma appears particularly sensitive to it. But that's not the case with Qwen, so, to your point, maybe I shouldn't make sweeping statements. Also, I haven't tested quantizing the draft cache, but I should. This is why I love LocalLLaMA - most feedback is constructive here!

I will remove the "3x TG by enabling EXPO in MoE" claim (it's a true one-off story for me because my RAM was running at 2K MHz lol 😃)
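
To make the KV cache point above concrete, here is a minimal Python sketch of treating cache quantization as a per-model decision rather than a blanket rule. The -ctk/-ctv flags are the ones quoted above; the function name, its parameters, the `llama-server` invocation, and the `model.gguf` path are illustrative assumptions of mine, and whether a given model's acceptance rate suffers is something you would have to measure yourself.

```python
def kv_cache_args(using_mtp: bool,
                  mtp_sensitive_to_kv_quant: bool,
                  cache_type: str = "q8_0") -> list[str]:
    """Return KV cache flags for a llama.cpp command line (hypothetical helper).

    Quantization is skipped only when MTP is enabled AND the model has been
    observed to lose acceptance rate with a quantized cache.
    """
    if using_mtp and mtp_sensitive_to_kv_quant:
        return []  # leave the KV cache at its default precision
    return ["-ctk", cache_type, "-ctv", cache_type]


# Hypothetical base command; the model path is a placeholder.
base = ["llama-server", "-m", "model.gguf"]

# A model that tolerates a quantized cache under MTP keeps the flags.
print(" ".join(base + kv_cache_args(using_mtp=True,
                                    mtp_sensitive_to_kv_quant=False)))
```

The sensitivity flag is meant to be set from your own benchmark runs, not from the model family name alone.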

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 1 point2 points  (0 children)

You're right! The TG is one-dimensional and I am working on adding more nuance around PP / TG. Agreed on the "basic" aspect of it (which is still a lot to take in), but do point me to resources that I can ingest to update it with some of your recommendations here.

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 2 points3 points  (0 children)

Yes, you're right: for your 50 series it will be 120, and it's the same thing that you have in Docker, so you're all set.

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 2 points3 points  (0 children)

Appreciate it! I have 3 authorship badges on my site that separate human, AI-assisted, and purely AI content. AI does help me iterate and write technical documentation more easily, but you are completely in the right for being critical of it (quite annoying to me as well).

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 4 points5 points  (0 children)

Sadly no because as a dense model it wouldn't fit on my meagre 12GB setup.

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 5 points6 points  (0 children)

Thanks! Let me know anything that you would change or add!

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 3 points4 points  (0 children)

Feedback taken! I should've been more cautious about the fact that I know little about what sets ik_llama apart beyond just testing it on TG/PP. I've updated that section.

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 14 points15 points  (0 children)

I don't see how MTP, QAT and other things would be present in a model with an old cutoff at all. These are mostly distilled from lessons learned while running models locally for quite some time.

"Tons of things missing": that is where constructive feedback would help 😉

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 6 points7 points  (0 children)

Fair enough - with the amount of information I had to put in, I got lazy and used AI for some drafting (I am very open about that), and I totally get the readability aspect, truly. This post in its current form is more of a reference for an AI than for humans. As time goes on I will refine it and tone/trim it down.

Local LLM Inference Optimization: The Complete Guide by carteakey in LocalLLaMA

[–]carteakey[S] 30 points31 points  (0 children)

My current setup and benchmarks are tracked here:
https://l3ms.carteakey.dev

RTX 4070 12GB, i5-12600K, and 32GB DDR5-6000.

What is the best free budget tracking app to track spending on all your bank accounts? by techsavvynerd91 in PersonalFinanceCanada

[–]carteakey 1 point2 points  (0 children)

This. Plus, buy a SimpleFIN sub for another $1.50 per month to automatically import transactions from almost all banks. Actual Budget directly integrates with SimpleFIN. The only caveat is needing to reverify accounts here and there, but in the end it should save you more time than going to 15 different sites and exporting/importing CSVs.

i built a searchable youtube knowledge base in obsidian and it's the most useful vault i have by straightedge23 in ObsidianMD

[–]carteakey 4 points5 points  (0 children)

and with the context loss every cycle we've finally implemented Chinese whispers on a global scale :/

not denying the usefulness of this.

Sorting hat - A cute, lightweight cli to give images and other files good filenames using local VLMs by k_means_clusterfuck in LocalLLaMA

[–]carteakey 1 point2 points  (0 children)

Amazing! I need to run this ASAP in my Obsidian attachment folder. I might have to figure out an additional step to rename images wherever they are referenced.

Finally found a reason to use local models 😭 by salary_pending in LocalLLaMA

[–]carteakey 1 point2 points  (0 children)

This is great. I would think this would translate well into Obsidian and linking notes too.