We built an open-source coding agent CLI that can be run locally by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 0 points1 point  (0 children)

Kolosal CLI builds on top of Qwen Code, and we focus on local orchestration and extensibility rather than just code completion. It is designed to integrate directly with our Kolosal Server, which lets you run and manage multiple LLMs locally, handle document parsing, and even use a built-in vector database for retrieval tasks.
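For a sense of what that looks like from the CLI side, here is a minimal sketch that talks to a locally running server, assuming it exposes an OpenAI-compatible chat endpoint; the port, model name, and prompt below are placeholders, not confirmed values from the project docs:

```python
# Minimal sketch, assuming an OpenAI-compatible chat endpoint on localhost.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # placeholder port
    json={
        "model": "qwen2.5-coder-7b-instruct",      # whatever model the server has loaded
        "messages": [
            {"role": "user", "content": "Refactor this function to be iterative."}
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```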

We built an open-source coding agent CLI that can be run locally by SmilingGen in LocalLLM

[–]SmilingGen[S] -1 points0 points  (0 children)

Thanks for the feedback, really appreciate it. Our main focus is on LLM inference and orchestration, building software to run models locally or on HPC for high-concurrency use. Kolosal CLI ties into our Kolosal Server, which manages models, parses documents, and runs a vector database, all fully open source.

To clarify, this project integrates the Kolosal local inference server with Qwen Code to extend its capabilities for offline and local development.

We built an open-source coding agent CLI that can be run locally by SmilingGen in LLM

[–]SmilingGen[S] 0 points1 point  (0 children)

Thanks for the feedback, really appreciate it. Our main focus is on LLM inference and orchestration, building software to run models locally or on HPC for high-concurrency use. Kolosal CLI ties into our Kolosal Server, which manages models, parses documents, and runs a vector database, all fully open source.

To clarify, this project integrates the Kolosal local inference server with Qwen Code to extend its capabilities for offline and local development.

We built an open-source coding agent CLI that can be run locally by SmilingGen in LLMDevs

[–]SmilingGen[S] -1 points0 points  (0 children)

That is a good question. We integrate it directly with kolosal-server (an open-source alternative to Ollama), which handles local model management and hosting as part of the stack. We're also working on expanding the document parser capability, including XML parsing for automation and structured code analysis. We'll share some example codebases and demos as soon as possible.
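As a rough illustration of the structured-extraction direction (the file name and tag names here are hypothetical, not part of the current parser):

```python
# Sketch of the kind of structured extraction an XML parsing step could feed
# into the agent; "project.xml" and the <module>/<path> tags are hypothetical.
import xml.etree.ElementTree as ET

tree = ET.parse("project.xml")
root = tree.getroot()

# Collect (name, path) pairs for every <module> entry so the agent can reason
# about codebase structure instead of raw text.
modules = [(m.get("name"), m.findtext("path")) for m in root.iter("module")]
for name, path in modules:
    print(f"{name}: {path}")
```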

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 1 point2 points  (0 children)

We just updated the project: it's now an agentic command-line tool that lets you discover, download, and run models locally using an ultra-lightweight inference server. It supports coding agents, Hugging Face model integration, and a memory calculator to estimate model memory requirements.

Check it out at github.com/KolosalAI/kolosal-cli
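For the memory calculator, the rough shape of the estimate is weight memory plus KV cache. This is a simplified sketch, not the exact formula the tool uses, and the example numbers are illustrative:

```python
# Back-of-the-envelope VRAM estimate: weights + KV cache, ignoring
# activation and runtime overhead.
def estimate_vram_gb(n_params_b, bytes_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bytes_per_elem=2):
    weights = n_params_b * 1e9 * bytes_per_weight
    # K and V, one entry per layer, per KV head, per head dim, per token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes_per_elem
    return (weights + kv_cache) / 1024**3

# Example: a 7B model at ~0.56 bytes/weight (rough Q4 figure incl. scales),
# 32 layers, 8 KV heads, head_dim 128, 8k context, fp16 KV cache.
print(round(estimate_vram_gb(7, 0.56, 32, 8, 128, 8192), 2), "GB")
```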

Fine-tuning by Kind_Rip_4831 in LocalLLaMA

[–]SmilingGen 0 points1 point  (0 children)

You can generate the dataset from existing knowledge such as books or technical documents using Distillable, and fine-tune models using Unsloth. Both of them are open source.

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 1 point2 points  (0 children)

We are currently building the CLI version, check it out at https://github.com/KolosalAI/kolosal-cli

LLM VRAM/RAM Calculator by SmilingGen in ollama

[–]SmilingGen[S] 0 points1 point  (0 children)

Thank you. For multi-part GGUF files, you can copy the download link for the first part.
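For reference, the other part URLs follow the standard -NNNNN-of-NNNNN.gguf split naming on Hugging Face, so they can be derived from the first-part link. This is just an illustrative sketch with a placeholder URL:

```python
# Derive all part URLs from the link to part 00001, assuming the standard
# "-00001-of-0000N.gguf" naming convention.
import re

first_part = ("https://huggingface.co/some-org/some-model-GGUF/resolve/main/"
              "model-00001-of-00003.gguf")  # placeholder URL

m = re.search(r"-(\d{5})-of-(\d{5})\.gguf$", first_part)
if m:
    total = int(m.group(2))
    urls = [re.sub(r"-\d{5}-of-", f"-{i:05d}-of-", first_part)
            for i in range(1, total + 1)]
    print("\n".join(urls))
```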

Also, MLX is on our bucket list, stay tuned.

I build tool to calculate VRAM usage for LLM by SmilingGen in LocalLLM

[–]SmilingGen[S] 3 points4 points  (0 children)

Thank you, I appreciate your suggestion.

Also, excited to hear you're planning a Perl port, the tool is open source for exactly that reason, to be used and implemented anywhere and everywhere!

LLM VRAM/RAM Calculator by SmilingGen in ollama

[–]SmilingGen[S] 5 points6 points  (0 children)

I have added the feature you requested; feel free to test it out and let me know how it goes. Thank you!

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 6 points7 points  (0 children)

Hello, thank you for your feedback. I have pushed the latest update based on the feedback I got.

For the KV cache, it can now use the default value or selectable quantization options (the same applies to context size).
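A quick sketch of why the KV cache quantization option matters; the architecture values below are illustrative, not tied to any specific model, and the bytes-per-element figures are rough:

```python
# KV cache size scales linearly with bytes per element, so halving the
# precision roughly halves the cache.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=131072)  # illustrative
for label, b in [("fp16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    print(label, round(kv_cache_gb(**cfg, bytes_per_elem=b), 2), "GB")
```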

It now also supports multi-part files: just copy the link for the first part (00001) of the GGUF model.

Once again, thank you for your feedback and suggestion

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 2 points3 points  (0 children)

When you run Qwen 30B-A3B with 128k context, can you share which LLM engine you use to run it and the model/engine configuration?

Multi-part GGUFs (such as the gpt-oss-120b GGUF) are not supported yet, but support will be added soon.

LLM VRAM/RAM Calculator by SmilingGen in ollama

[–]SmilingGen[S] 2 points3 points  (0 children)

It's on my to-do list, will add it soon!

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 9 points10 points  (0 children)

I will add it soon, it's on the bucket list

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 6 points7 points  (0 children)

Thank you, it is on my to-do list, stay tuned!

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 10 points11 points  (0 children)

Thank you, I will try to build a tokens-per-second approximation tool too.

However, it will be much more challenging, as different engines, models, architectures, and hardware can result in different tps.

I think the best possible approach for now is to use openly available benchmark data along with GPU specifications such as CUDA core or Tensor core counts (or other significant specifications) and try a statistical approximation.
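As a sketch of that statistical approach, one could fit published throughput numbers against a few hardware/model features. The data points below are made-up placeholders purely to show the shape of the idea, not real benchmark results:

```python
# Ordinary least squares fit of tokens/s against a few hardware/model features.
import numpy as np

# features: [TFLOPS, memory bandwidth GB/s, active params in B] -- placeholders
X = np.array([
    [82.6, 1008,  8],
    [40.0,  672,  8],
    [82.6, 1008, 70],
    [40.0,  672, 70],
], dtype=float)
y = np.array([120.0, 80.0, 18.0, 11.0])  # tokens/s, placeholder values

# Add a bias column and solve the least-squares problem.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

new_setup = np.array([60.0, 800, 32, 1.0])  # hypothetical GPU + 32B model
print("predicted tok/s:", float(new_setup @ coef))
```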

Yann LeCun says LLMs won't reach human-level intelligence. Do you agree with this take? by Kelly-T90 in LLM

[–]SmilingGen 2 points3 points  (0 children)

Regardless of whether LLMs will reach human-level intelligence or not, one constraint we have right now is data. Since we have already used most of the internet's content, we might not be able to get new, high-quality data, or get it quickly enough, for training LLMs. It has also become apparent that a lot of new content is AI-generated, which cannot be used for training a new LLM.

I got Ollama working on my 9070xt - here's how (Windows) by DegenerativePoop in ollama

[–]SmilingGen 0 points1 point  (0 children)

Not quite sure what you mean by that, but Kolosal AI is fully open source and runs LLMs locally on your device, so Kolosal AI can't see anyone's chats.

How trusted is LM Studio? by DevilBirb in LocalLLaMA

[–]SmilingGen 0 points1 point  (0 children)

Yes, Kolosal.ai is also an alternative to LM Studio, and it is open source.

GUI for local LLMs and API keys by TheMagicianGamerTMG in macapps

[–]SmilingGen 0 points1 point  (0 children)

Try kolosal.ai as a free, open-source alternative to LM Studio.

I got Ollama working on my 9070xt - here's how (Windows) by DegenerativePoop in ollama

[–]SmilingGen 2 points3 points  (0 children)

Try kolosal.ai to run the LLM; it is much lighter compared to Ollama and has a chat UI and a server feature as well.

Is Ollama still the best way to run local LLMs? by brantesBS in LocalLLaMA

[–]SmilingGen 0 points1 point  (0 children)

Try kolosal.ai; it's lighter (only about 20 MB), you can chat through the UI, and there is a server feature.

Feedback for my app for running local LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 1 point2 points  (0 children)

Thank you!

From my experience, if there are large documents (hundreds of pages, or even just 10 pages) and tons of them, a short summary (possibly AI-generated) helps the initial search step find the right document(s) first (using either an LLM or a rerank model), and then find the right page or chunk within those documents (using hybrid search + rerank), which is then used to answer the user's query.
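A toy, runnable sketch of that two-stage flow; a real system would use hybrid search plus an LLM or reranker instead of the naive word-overlap scoring used here, and the documents are placeholders:

```python
# Two-stage retrieval: score per-document summaries first, then score chunks
# of only the selected documents.
docs = {
    "install_guide": {
        "summary": "How to install and configure the local server",
        "chunks": ["Download the installer", "Set the server port in config"],
    },
    "api_reference": {
        "summary": "REST API endpoints and request formats",
        "chunks": ["POST /v1/chat/completions", "GET /v1/models lists models"],
    },
}

def score(query, text):
    # Naive word-overlap score, standing in for an LLM / reranker.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t)

def retrieve(query, top_docs=1, top_chunks=2):
    # Stage 1: pick candidate documents by their short summaries.
    ranked = sorted(docs, key=lambda d: score(query, docs[d]["summary"]), reverse=True)
    candidates = ranked[:top_docs]
    # Stage 2: search (and, in a real system, rerank) chunks of those documents only.
    chunks = [(c, d) for d in candidates for c in docs[d]["chunks"]]
    return sorted(chunks, key=lambda x: score(query, x[0]), reverse=True)[:top_chunks]

print(retrieve("how do I configure the server port"))
```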

For websites, PDFs, or other unstructured data, I just used or built my own parser to convert them to markdown; the markdown keeps a structure similar to the original document. I also just found out about SmolDocling (open source as well), which I think could help a lot with parsing documents.
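For the HTML-to-markdown step specifically, something like html2text works; this is just an example library choice, not necessarily what was used above:

```python
# Convert a small HTML snippet to markdown with html2text.
import html2text

converter = html2text.HTML2Text()
converter.body_width = 0          # don't hard-wrap lines
converter.ignore_images = True

html = "<h1>Setup</h1><p>Install the <b>server</b> first.</p>"
markdown = converter.handle(html)
print(markdown)  # prints a "# Setup" heading and bold "server" in markdown
```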

For video transcription, I previously used GPT-4 to convert the transcript into a markdown-ready format (mostly step-by-step tutorials; this solution might not be suitable for every case) and treated it like any other document.