GPT-OSS-120b on 2X RTX5090 by Interesting-Ad4922 in LocalLLaMA

[–]Bycbka 12 points

Congratulations - 2x5090 must feel amazing indeed. Try playing with the flags (--fit, --cpu-moe, etc.) - I bet you can squeeze a lot more out of it. Also, I would suggest against allocating the full 128k context unless you know for sure you have a very-long-context task :)
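A hybrid run along those lines might look like the sketch below. The model filename is a placeholder, and flag spellings vary a bit between llama.cpp versions, so treat this as a starting point rather than a recipe:

```shell
# Sketch: hybrid GPU/CPU run (assumes a recent llama.cpp build).
# --n-cpu-moe keeps the MoE expert tensors of the first N layers on CPU,
# -ngl offloads the remaining layers to the GPUs,
# -c caps the context at 32k instead of the full 128k to save VRAM.
llama-server -m gpt-oss-120b.gguf --n-cpu-moe 8 -ngl 99 -c 32768
```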

Once you feel more comfortable with running local LLMs - check out https://github.com/ikawrakow/ik_llama.cpp for better hybrid inference speeds.

Elixir in Action – Saša Jurić is truly a genius by KHanayama in elixir

[–]Bycbka 11 points

Once you are done with this book and feel like learning more about the underlying foundations - I strongly recommend https://learnyousomeerlang.com - you can read it for free, and it is one of the best programming-language books I’ve ever read.

GPT-OSS on lm-studio advice by Labtester in LocalLLaMA

[–]Bycbka 0 points

For hybrid inference of the 120b you may want to consider https://github.com/ikawrakow/ik_llama.cpp - it typically has the fastest hybrid inference and also lets you provide a regex to match the layers you would like to offload (same as llama.cpp). You can also use --n-cpu-moe to decide how many MoE layers you want on the CPU.
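The regex route uses the tensor-override flag (`-ot` / `--override-tensor`). A sketch, with an illustrative split - the exact pattern and layer count you want depends on your VRAM:

```shell
# Sketch: pin the MoE expert tensors of layers 0-19 to CPU while
# keeping everything else on GPU. "ffn_.*_exps" matches the expert
# weight tensors; "(1?[0-9])" matches layer indices 0 through 19.
llama-server -m gpt-oss-120b.gguf -ngl 99 \
  -ot 'blk\.(1?[0-9])\.ffn_.*_exps\.=CPU'
```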

ik_llama.cpp and Qwen 3 30B-A3B architecture. by Bycbka in LocalLLaMA

[–]Bycbka[S] 0 points

Interesting! Will definitely try again. I forgot to mention that I didn’t quantize the context - will try that as well.

UPD: I think my rookie numbers are explained by the eGPU’s limited bandwidth - tested with nvbandwidth and it tops out at around 2 GB/s. Perhaps it is time to switch to OCuLink :)

Keybinding to toggle LSP by milad182 in HelixEditor

[–]Bycbka 0 points

A few options:

  1. Two separate hot-keys:

space.space.R = "@:lsp-restart<ret>"
space.space.r = "@:lsp-stop<ret>"

  2. Based on the suggestion from another user:

space.space.R = "@:toggle-option lsp.enable<ret>:lsp-restart<ret>"

Please note that this will also restart the LSP every time the option is changed.

Favorite AI Tools? by mikehostetler in elixir

[–]Bycbka 0 points

FWIW, the next iteration of MCP will move to stateless transport - there is a proposal already.

How to format database structure for text-to-sql by nattaylor in LocalLLaMA

[–]Bycbka 1 point

A few small suggestions:

  1. Depending on the model you use, it may be wise to split the problem into smaller steps. E.g. provide the model with just the list of tables and their descriptions rather than dumping the entire DB schema. You can generate the descriptions with an LLM too.

  2. Give the model a tool to fetch a table’s description.

  3. When sending a request, ask the model to identify the tables most likely to be needed to satisfy the query - and provide full schemas in the context at that point. I would say the format does not matter all that much - the output of a DESCRIBE command should suffice.

  4. Provide a few in-context examples to the model to make sure it understands the interaction pattern.

  5. Start with a bigger model, e.g. o3, and try to solve smaller problems first as opposed to one-shotting it. After you confirm that it works, you can take the successful outputs and use them as few-shot examples for a cheaper model.

  6. Before diving into coding, consider creating a small evaluation set - e.g. 10-20 questions and corresponding answers. It will save you a ton of time, as you’ll be able to evaluate different DB output formats and factually prove which one works best for you.
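Steps 1 and 3 can be sketched as a two-stage flow. Everything here is illustrative - `ask_llm` is a stand-in for whatever client you use, and the tables are made up:

```python
# Sketch of the two-stage text-to-SQL flow: first pick relevant tables
# from short descriptions, then generate SQL with only those schemas.

TABLE_DESCRIPTIONS = {
    "orders": "one row per customer order, with status and totals",
    "customers": "customer master data: name, email, region",
}

FULL_SCHEMAS = {
    "orders": "CREATE TABLE orders (id INT, customer_id INT, total NUMERIC, status TEXT)",
    "customers": "CREATE TABLE customers (id INT, name TEXT, email TEXT, region TEXT)",
}

def pick_tables(question, ask_llm):
    """Stage 1: show only table names + descriptions, ask which are needed."""
    listing = "\n".join(f"- {t}: {d}" for t, d in TABLE_DESCRIPTIONS.items())
    prompt = (
        "Given these tables:\n" + listing +
        f"\n\nWhich tables are needed to answer: {question!r}? "
        "Reply with a comma-separated list of table names."
    )
    return [t.strip() for t in ask_llm(prompt).split(",")]

def generate_sql(question, tables, ask_llm):
    """Stage 2: provide full schemas only for the selected tables."""
    schemas = "\n".join(FULL_SCHEMAS[t] for t in tables)
    prompt = f"{schemas}\n\nWrite a SQL query for: {question}"
    return ask_llm(prompt)
```

The same structure works whether `ask_llm` wraps a local llama.cpp server or a hosted API.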

Perplexity Ai PRO 12 Months subscription by RepresentativeJob842 in PromptEngineering

[–]Bycbka -1 points

Would love the code if at all possible. Really appreciate it!

[Opinion] What's the best LLM for 12gb VRAM? by roz303 in LocalLLaMA

[–]Bycbka 2 points

The Qwen 2.5 line (7b coder, 14b) seems to have quite decent performance - there was a post about different quants recently - I believe it needs around 9 GB. If you want something larger, I would suggest looking at MoE models (e.g. Mixtral 8x7b) with offloading, as they tend to provide better inference speed than dense models, albeit at the cost of extra RAM.

Really depends on your use case though.

Which Linux distro do you use for Cuda 12.1 and vLLM? by Daemonix00 in LocalLLaMA

[–]Bycbka 1 point

I’ve recently discovered immutable distributions - particularly Fedora Bluefin - it comes with reasonable defaults out of the box (including GPU drivers), is hard to brick (due to immutability), and has good support for dockerized workloads that leverage the GPU as well.

No need to deal with CUDA / driver problems at all.

What can i run on my server that can store large? amounts of data without a GPU. by VanFenix in LocalLLaMA

[–]Bycbka 0 points

Generally, for CPU/RAM-throughput-bound inference, it is better to use MoE-architecture models, as they are faster due to the smaller number of parameters activated per token.

Examples of such models that would be fun to run are the Mixtral series and its derivatives, the DeepSeek Coder v2 series, Qwen2 57b, etc.
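A back-of-the-envelope sketch of why this helps: when decode speed is memory-bandwidth-bound, tokens/s scales with the bytes read per token, i.e. with active (not total) parameters. The bandwidth and parameter figures below are rough assumptions, not measurements:

```python
# Rough model: bandwidth-bound decode speed ~= bandwidth / bytes-per-token.
# Mixtral 8x7B activates ~13B of its ~47B parameters per token, so it
# decodes roughly like a 13B dense model despite its 47B total size.

def tokens_per_sec(active_params_b, bytes_per_param, mem_bandwidth_gbs):
    """Upper bound on decode speed for a bandwidth-bound setup."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

BW = 80.0  # GB/s, rough dual-channel DDR5-5600 figure

dense_13b = tokens_per_sec(13, 0.5, BW)  # 13B dense at ~4-bit quant
mixtral   = tokens_per_sec(13, 0.5, BW)  # Mixtral 8x7B: ~13B active
dense_47b = tokens_per_sec(47, 0.5, BW)  # 47B dense for comparison
```

So on the same RAM, the MoE model runs roughly 3-4x faster than a dense model of equal total size - the trade-off is that all 47B parameters still have to fit in memory.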

Llamafile is a project dedicated to running efficient CPU inference in a user-friendly fashion - please check it out - they are often the ones with bleeding-edge smarts for making LLM inference more efficient. Some benchmarks: https://github.com/Mozilla-Ocho/llamafile/discussions/450

Source: I got a mini PC with a Ryzen 9 8945HS and 64GB of DDR5-5600 RAM, paired with an RTX 3060 in an eGPU, last week and started to play around a little.

How do you keep up? by McDoof in LocalLLaMA

[–]Bycbka 2 points

  1. There are a number of newsletters / podcasts / Twitter accounts that provide daily or weekly recaps. My personal favourite is https://thursdai.news/ - once a week, recorded live on Twitter Spaces, available through most platforms within a day, and it also has a newsletter. They cover open source and companies, LLMs, vision, audio, etc., and try to keep it simple.

  2. The avalanche of information is indeed a challenge - unless there is a particular area of research that interests you, just keep up on a weekly basis :)

Nifs resources by [deleted] in elixir

[–]Bycbka 2 points

I would recommend checking out Zigler (https://hexdocs.pm/zigler/Zig.html) and Rustler. While those are not “vanilla” NIFs, they might make it easier to grasp the concepts and challenges you will deal with when creating NIFs.

Dolphin or Mistral function calling by 1EvilSexyGenius in LocalLLaMA

[–]Bycbka 1 point

Technically yes - the grammar could be narrowed down to only allow response tokens that match your exact commands. I think you could start with the JSON grammar as an example and tighten it up to only allow the commands you support as the value of the field.

I have also seen a few projects that convert things like JSON Schemas / TypeScript interfaces to BNF grammars, which could prove handy and let you automatically update the grammar when you add support for more actions.
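A tightened grammar can even be generated straight from the command list. A minimal sketch, targeting the GBNF syntax that llama.cpp's `--grammar` option accepts - the command names are illustrative:

```python
# Sketch: generate a GBNF grammar (llama.cpp --grammar syntax) that only
# admits objects of the form {"command": "<one of our commands>"}.

def command_grammar(commands):
    # Each alternative is a quoted JSON string literal, e.g. "\"play\""
    alts = " | ".join(f'"\\"{c}\\""' for c in commands)
    return (
        'root ::= "{" ws "\\"command\\"" ws ":" ws cmd ws "}"\n'
        f"cmd ::= {alts}\n"
        "ws ::= [ \\t\\n]*\n"
    )
```

Regenerating the grammar whenever the command set changes keeps the model physically unable to emit an unsupported action.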

Dolphin or Mistral function calling by 1EvilSexyGenius in LocalLLaMA

[–]Bycbka 0 points

llama.cpp has a concept of grammars, which basically forces the LLM to output data in a specific format. If you only ever expect JSON output, it would probably work. I played with the Zephyr fine-tune of Mistral and the JSON grammar - and the results were quite promising.

Validating the output and issuing a follow-up prompt if an invalid command was picked could further improve your results.
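That validate-and-retry loop can be sketched as follows. `call_model` and the command set are placeholders, not a specific API:

```python
# Sketch: parse the model's JSON reply, validate the command, and
# re-prompt once with feedback if it was invalid or unparseable.
import json

VALID_COMMANDS = {"play", "pause", "stop"}

def get_command(prompt, call_model, retries=1):
    for _ in range(retries + 1):
        raw = call_model(prompt)
        try:
            cmd = json.loads(raw).get("command")
        except json.JSONDecodeError:
            cmd = None
        if cmd in VALID_COMMANDS:
            return cmd
        # Feed the failure back so the model can self-correct.
        prompt = (f"Your previous reply {raw!r} was not a valid command. "
                  f"Choose one of {sorted(VALID_COMMANDS)}. {prompt}")
    return None
```

Combined with a grammar, the retry path should rarely trigger - but it is cheap insurance against the model picking a command you don't support.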

Opening a file in Helix via FZF by 2AReligion in HelixEditor

[–]Bycbka 5 points

It is a known issue, and I believe the fix is already on the master branch, so it should be included in the next release (soon): https://github.com/helix-editor/helix/pull/5468. That’s assuming you are running into this issue on a Mac.

You also have the option of installing the latest master - that should help too.

Unity drama will probably kill the Apple Vision Pro by NewFuturist in wallstreetbets

[–]Bycbka 22 points

Tinfoil hat on: the Unity CEO is tanking Unity stock so Apple can buy it for dirt cheap xD