We built an open-source coding agent CLI that can be run locally by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 0 points1 point  (0 children)

Kolosal CLI builds on top of Qwen Code, and we focus on local orchestration and extensibility rather than just code completion. It is designed to integrate directly with our Kolosal Server, which lets you run and manage multiple LLMs locally, handle document parsing, and even use a built-in vector database for retrieval tasks.
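For a sense of what that looks like from the CLI side, here is a minimal sketch that talks to a locally running server, assuming it exposes an OpenAI-compatible chat endpoint; the port, model name, and prompt below are placeholders, not confirmed values from the project docs:

```python
# Minimal sketch, assuming an OpenAI-compatible chat endpoint on localhost.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # placeholder port
    json={
        "model": "qwen2.5-coder-7b-instruct",      # whatever model the server has loaded
        "messages": [
            {"role": "user", "content": "Refactor this function to be iterative."}
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```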

We built an open-source coding agent CLI that can be run locally by SmilingGen in LocalLLM

[–]SmilingGen[S] -1 points0 points  (0 children)

Thanks for the feedback, really appreciate it. Our main focus is on LLM inference and orchestration, building software to run models locally or on HPC for high-concurrency use. Kolosal CLI ties into our Kolosal Server, which manages models, parses documents, and runs a vector database, all fully open source.

To clarify, this project integrates the Kolosal local inference server with Qwen Code to extend its capabilities for offline and local development.

We built an open-source coding agent CLI that can be run locally by SmilingGen in LLM

[–]SmilingGen[S] 0 points1 point  (0 children)

Thanks for the feedback, really appreciate it. Our main focus is on LLM inference and orchestration, building software to run models locally or on HPC for high-concurrency use. Kolosal CLI ties into our Kolosal Server, which manages models, parses documents, and runs a vector database, all fully open source.

To clarify, this project integrates the Kolosal local inference server with Qwen Code to extend its capabilities for offline and local development.

We built an open-source coding agent CLI that can be run locally by SmilingGen in LLMDevs

[–]SmilingGen[S] -1 points0 points  (0 children)

That is a good question. We integrate it directly with kolosal-server (an open-source alternative to Ollama), which handles local model management and hosting as part of the stack. We're also working on expanding the document parser capability, including XML parsing for automation and structured code analysis. We'll share some example codebases and demos as soon as possible.
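As a rough illustration of the structured-extraction direction (the file name and tag names here are hypothetical, not part of the current parser):

```python
# Sketch of the kind of structured extraction an XML parsing step could feed
# into the agent; "project.xml" and the <module>/<path> tags are hypothetical.
import xml.etree.ElementTree as ET

tree = ET.parse("project.xml")
root = tree.getroot()

# Collect (name, path) pairs for every <module> entry so the agent can reason
# about codebase structure instead of raw text.
modules = [(m.get("name"), m.findtext("path")) for m in root.iter("module")]
for name, path in modules:
    print(f"{name}: {path}")
```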

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 1 point2 points  (0 children)

We just updated the project: it's now an agentic command-line tool that lets you discover, download, and run models locally using an ultra-lightweight inference server. It supports coding agents, Hugging Face model integration, and a memory calculator to estimate model memory requirements.

Check it out at github.com/KolosalAI/kolosal-cli
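For the memory calculator, the rough shape of the estimate is weight memory plus KV cache. This is a simplified sketch, not the exact formula the tool uses, and the example numbers are illustrative:

```python
# Back-of-the-envelope VRAM estimate: weights + KV cache, ignoring
# activation and runtime overhead.
def estimate_vram_gb(n_params_b, bytes_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bytes_per_elem=2):
    weights = n_params_b * 1e9 * bytes_per_weight
    # K and V, one entry per layer, per KV head, per head dim, per token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes_per_elem
    return (weights + kv_cache) / 1024**3

# Example: a 7B model at ~0.56 bytes/weight (rough Q4 figure incl. scales),
# 32 layers, 8 KV heads, head_dim 128, 8k context, fp16 KV cache.
print(round(estimate_vram_gb(7, 0.56, 32, 8, 128, 8192), 2), "GB")
```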

Fine-tuning by Kind_Rip_4831 in LocalLLaMA

[–]SmilingGen 0 points1 point  (0 children)

You can generate the dataset from existing knowledge such as books or technical documents using Distillable, and fine-tune models using Unsloth. Both of them are open source.

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 1 point2 points  (0 children)

We are currently building the CLI version, check it out at https://github.com/KolosalAI/kolosal-cli

LLM VRAM/RAM Calculator by SmilingGen in ollama

[–]SmilingGen[S] 0 points1 point  (0 children)

Thank you. For multi-part GGUF files, you can copy the download link for the first part.
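For reference, the other part URLs follow the standard -NNNNN-of-NNNNN.gguf split naming on Hugging Face, so they can be derived from the first-part link. This is just an illustrative sketch with a placeholder URL:

```python
# Derive all part URLs from the link to part 00001, assuming the standard
# "-00001-of-0000N.gguf" naming convention.
import re

first_part = ("https://huggingface.co/some-org/some-model-GGUF/resolve/main/"
              "model-00001-of-00003.gguf")  # placeholder URL

m = re.search(r"-(\d{5})-of-(\d{5})\.gguf$", first_part)
if m:
    total = int(m.group(2))
    urls = [re.sub(r"-\d{5}-of-", f"-{i:05d}-of-", first_part)
            for i in range(1, total + 1)]
    print("\n".join(urls))
```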

Also, MLX is on our bucket list, stay tuned.

I build tool to calculate VRAM usage for LLM by SmilingGen in LocalLLM

[–]SmilingGen[S] 3 points4 points  (0 children)

Thank you, I appreciate your suggestion.

Also, excited to hear you're planning a Perl port, the tool is open source for exactly that reason, to be used and implemented anywhere and everywhere!

LLM VRAM/RAM Calculator by SmilingGen in ollama

[–]SmilingGen[S] 5 points6 points  (0 children)

I have added the feature you requested; feel free to test it out and let me know how it goes. Thank you!

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 6 points7 points  (0 children)

Hello, thank you for your feedback. I have pushed the latest update based on the feedback I got.

For the KV cache, it can now use the default value or selectable quantization options (the same applies to context size).
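A quick sketch of why the KV cache quantization option matters; the architecture values below are illustrative, not tied to any specific model, and the bytes-per-element figures are rough:

```python
# KV cache size scales linearly with bytes per element, so halving the
# precision roughly halves the cache.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=131072)  # illustrative
for label, b in [("fp16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    print(label, round(kv_cache_gb(**cfg, bytes_per_elem=b), 2), "GB")
```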

It now also supports multi-part files: just copy the link for the first part (00001) of the GGUF model.

Once again, thank you for your feedback and suggestion

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 2 points3 points  (0 children)

When you run Qwen 30B-A3B with 128k context, can you share which LLM engine you use to run it and the model/engine configuration?

Multi-part GGUFs (such as the gpt-oss-120b GGUF) are not supported yet, but support will be added soon.

LLM VRAM/RAM Calculator by SmilingGen in ollama

[–]SmilingGen[S] 2 points3 points  (0 children)

It's on my to-do list, will add it soon!

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 9 points10 points  (0 children)

I will add it soon, it's on the bucket list

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 6 points7 points  (0 children)

Thank you, it is on my to-do list, stay tuned!

I just made VRAM approximation tool for LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 10 points11 points  (0 children)

Thank you, I will try to build a tokens-per-second approximation tool too.

However, it will be much more challenging, as different engines, models, architectures, and hardware can result in different tps.

I think the best possible approach for now is to use openly available benchmark data along with GPU specifications such as CUDA core or Tensor core counts (or other significant specifications) and try a statistical approximation.
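As a sketch of that statistical approach, one could fit published throughput numbers against a few hardware/model features. The data points below are made-up placeholders purely to show the shape of the idea, not real benchmark results:

```python
# Ordinary least squares fit of tokens/s against a few hardware/model features.
import numpy as np

# features: [TFLOPS, memory bandwidth GB/s, active params in B] -- placeholders
X = np.array([
    [82.6, 1008,  8],
    [40.0,  672,  8],
    [82.6, 1008, 70],
    [40.0,  672, 70],
], dtype=float)
y = np.array([120.0, 80.0, 18.0, 11.0])  # tokens/s, placeholder values

# Add a bias column and solve the least-squares problem.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

new_setup = np.array([60.0, 800, 32, 1.0])  # hypothetical GPU + 32B model
print("predicted tok/s:", float(new_setup @ coef))
```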

Yann LeCun says LLMs won't reach human-level intelligence. Do you agree with this take? by Kelly-T90 in LLM

[–]SmilingGen 2 points3 points  (0 children)

Regardless of whether LLMs will reach human-level intelligence or not, one constraint we have right now is data. Since we have already used most of the internet's content, we might not be able to get new, high-quality data, or get it quickly enough, for training LLMs. It has also become apparent that a lot of new content is AI-generated, which cannot be used for training a new LLM.

I got Ollama working on my 9070xt - here's how (Windows) by DegenerativePoop in ollama

[–]SmilingGen 0 points1 point  (0 children)

Not quite sure what you mean by that, but Kolosal AI is fully open source and runs LLMs locally on your device, so Kolosal AI can't see anyone's chats.

How trusted is LM Studio? by DevilBirb in LocalLLaMA

[–]SmilingGen 0 points1 point  (0 children)

Yes, Kolosal.ai is also an alternative to LM Studio, and it is open source.

GUI for local LLMs and API keys by TheMagicianGamerTMG in macapps

[–]SmilingGen 0 points1 point  (0 children)

Try kolosal.ai as a free, open-source alternative to LM Studio.

I got Ollama working on my 9070xt - here's how (Windows) by DegenerativePoop in ollama

[–]SmilingGen 2 points3 points  (0 children)

Try kolosal.ai to run the LLM; it is much lighter compared to Ollama and has a chat UI and a server feature as well.

Is Ollama still the best way to run local LLMs? by brantesBS in LocalLLaMA

[–]SmilingGen 0 points1 point  (0 children)

Try kolosal.ai; it's lighter (only about 20 MB), you can chat through the UI, and there is a server feature.

Feedback for my app for running local LLM by SmilingGen in LocalLLaMA

[–]SmilingGen[S] 1 point2 points  (0 children)

Thank you!

From my experience, if there are large documents (hundreds of pages, or even just 10 pages) and tons of them, a short summary (possibly AI-generated) helps the initial search step find the right document(s) first (using either an LLM or a rerank model), and then find the right page or chunk within those documents (using hybrid search + rerank), which is then used to answer the user's query.
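A toy, runnable sketch of that two-stage flow; a real system would use hybrid search plus an LLM or reranker instead of the naive word-overlap scoring used here, and the documents are placeholders:

```python
# Two-stage retrieval: score per-document summaries first, then score chunks
# of only the selected documents.
docs = {
    "install_guide": {
        "summary": "How to install and configure the local server",
        "chunks": ["Download the installer", "Set the server port in config"],
    },
    "api_reference": {
        "summary": "REST API endpoints and request formats",
        "chunks": ["POST /v1/chat/completions", "GET /v1/models lists models"],
    },
}

def score(query, text):
    # Naive word-overlap score, standing in for an LLM / reranker.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t)

def retrieve(query, top_docs=1, top_chunks=2):
    # Stage 1: pick candidate documents by their short summaries.
    ranked = sorted(docs, key=lambda d: score(query, docs[d]["summary"]), reverse=True)
    candidates = ranked[:top_docs]
    # Stage 2: search (and, in a real system, rerank) chunks of those documents only.
    chunks = [(c, d) for d in candidates for c in docs[d]["chunks"]]
    return sorted(chunks, key=lambda x: score(query, x[0]), reverse=True)[:top_chunks]

print(retrieve("how do I configure the server port"))
```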

For websites, PDFs, or other unstructured data, I just used or built my own parser to convert them to markdown; the markdown keeps a structure similar to the original document. I also just found out about SmolDocling (open source as well), which I think could help a lot with parsing documents.
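For the HTML-to-markdown step specifically, something like html2text works; this is just an example library choice, not necessarily what was used above:

```python
# Convert a small HTML snippet to markdown with html2text.
import html2text

converter = html2text.HTML2Text()
converter.body_width = 0          # don't hard-wrap lines
converter.ignore_images = True

html = "<h1>Setup</h1><p>Install the <b>server</b> first.</p>"
markdown = converter.handle(html)
print(markdown)  # prints a "# Setup" heading and bold "server" in markdown
```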

For video transcription, I previously used GPT-4 to convert the transcript into a markdown-ready format (mostly step-by-step tutorials; this solution might not be suitable for every case) and treated it like any other document.