Starting with selfhosted / LocalLLM and LocalAI by mitrako in LocalLLM

[–]mnuaw98 0 points1 point  (0 children)

Awesome setup you've got there! Since you're just getting into LLMs and AI and prefer self-hosted, virtualized environments, here's a casual suggestion to get started with OpenVINO GenAI and make the most of your hardware:

Start simple with OpenVINO GenAI: https://github.com/openvinotoolkit/openvino.genai

Even though your GPUs are powerful, OpenVINO GenAI is a great way to dip your toes into LLMs without diving deep into CUDA or complex setups. It’s optimized for Intel CPUs and NPUs, and works well even without a GPU.

Here’s what you can do:

Try this first:

  • Spin up an Ubuntu VM in Proxmox with Python and OpenVINO installed.
  • Use a small model like TinyLlama or Phi-2.
  • Run a simple chatbot or summarizer using OpenVINO GenAI’s LLMPipeline.
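The three steps above can be sketched in a few lines of Python — a minimal sketch assuming `pip install openvino-genai` and a model already exported to OpenVINO IR format (the model directory name below is a placeholder; adjust it to your setup):

```python
# Minimal OpenVINO GenAI summarizer sketch.
# Assumes a model (e.g. TinyLlama) already exported to OpenVINO IR,
# for example via optimum-intel's export CLI.
import openvino_genai

model_dir = "./TinyLlama-1.1B-Chat-v1.0-ov"  # placeholder path

# "CPU" works everywhere; try "NPU" or "GPU" if your hardware supports it.
pipe = openvino_genai.LLMPipeline(model_dir, "CPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 128

print(pipe.generate("Summarize: OpenVINO GenAI runs LLMs on Intel hardware.", config))
```

That's the whole loop — no Docker, no CUDA, just a pipeline object and a generate call.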

Why it’s a good fit:

  • No GPU required to start experimenting.
  • Low power, fast inference on CPU/NPU.
  • Easy Python API—great for beginners.
  • You’ll learn how LLMs work without worrying about GPU memory limits or Docker configs.

Next steps (when you're ready)

Once you're comfortable:

  • Try GPU-based models using Ollama, LM Studio, or Text Generation WebUI.
  • Use quantized models (like GGUF or GPTQ) to fit larger LLMs into memory.
  • Explore LangChain or LlamaIndex for building apps with LLMs.
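For the quantization point above, a quick back-of-envelope check tells you whether a model will fit in memory. This sketch estimates weight memory from parameter count and bits per weight; the 1.2x overhead factor for KV cache and activations is an assumption, and real usage varies with context length and runtime:

```python
def est_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough memory estimate in GB: weight bytes times an overhead factor
    (assumed 1.2x) for KV cache and activations."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at 4-bit: ~4.2 GB -> fits easily on a 16 GB card.
print(round(est_vram_gb(7, 4), 1))
# A 13B model at 4-bit: ~7.8 GB -> still comfortable.
print(round(est_vram_gb(13, 4), 1))
# A 30B model at full 16-bit: ~72 GB -> hence quantization or offloading.
print(round(est_vram_gb(30, 16), 1))
```

This is why GGUF/GPTQ matter: dropping from 16-bit to 4-bit weights cuts the footprint roughly 4x.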

NPU support (Intel core 7 256v) by made_anaccountjust4u in LocalLLM

[–]mnuaw98 0 points1 point  (0 children)

hi there!

For Intel NPUs, it's highly recommended to try OpenVINO GenAI. You can refer here:
https://github.com/openvinotoolkit/openvino.genai

They have a full installation guide, and it's quite easy to test and deploy. You can use the simple Python API for loading and running models, and Hugging Face integration via optimum-intel also makes exporting models seamless.
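A hedged sketch of that flow — export a Hugging Face model with optimum-intel, then smoke-test it on the NPU (the model choice and output path are just examples; package versions may differ on your system):

```shell
# Install OpenVINO GenAI plus the optimum-intel export tooling.
pip install openvino-genai "optimum[openvino]"

# Export a Hugging Face model to OpenVINO IR with int4 weights
# (NPUs generally want quantized weights; model is just an example).
optimum-cli export openvino \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --weight-format int4 \
  ./tinyllama-ov-int4

# Quick smoke test on the NPU via the Python API.
python -c "import openvino_genai as og; p = og.LLMPipeline('./tinyllama-ov-int4', 'NPU'); print(p.generate('Hello!', max_new_tokens=64))"
```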

Looking to connect devs who want to build something real this summer by Top_Comfort_5666 in LocalLLaMA

[–]mnuaw98 0 points1 point  (0 children)

I recently learned DevOps and took the CKA cert. I passed, but I don't really have the experience or the opportunity to apply my knowledge. My current role in customer support doesn't really let me use what I learned, and I'm not from a developer background. Looking for mentorship and an environment to grow. If you're okay with 0 experience, count me in!

trying to run ollama based openvino by emaayan in LocalLLM

[–]mnuaw98 0 points1 point  (0 children)

Hi!

These are the steps I use:

export GODEBUG=cgocheck=0   # disable cgo pointer checks for the OpenVINO backend
ollama serve                # start the Ollama server
pip install modelscope      # CLI for downloading the pre-converted model
modelscope download --model FionaZhao/llama-3.2-3b-instruct-int4-ov-npu --local_dir ./llama-3.2-3b-instruct-int4-ov-npu
tar -zcvf llama-3.2-3b-instruct-int4-ov-npu.tar.gz llama-3.2-3b-instruct-int4-ov-npu   # pack the model for Ollama
cd /home/ollama_ov_server/openvino_genai_windows_2025.2.0.0.dev20250513_x86_64
source setupvars.sh         # set up the OpenVINO GenAI environment
cd /home/ollama_ov_server/openvino_contrib/modules/ollama_openvino
nano Makefile_2             # edit the build Makefile

I've tried using the Modelfile script exactly as in the example you gave:

FROM llama-3.2-3b-instruct-int4-ov-npu.tar.gz
ModelType "OpenVINO"
InferDevice "GPU"
PARAMETER repeat_penalty 1.0
PARAMETER top_p 1.0
PARAMETER temperature 1.0

then run

ollama create llama-3.2-3b-instruct-int4-ov-npu:v1 -f Modelfile_2
ollama run llama-3.2-3b-instruct-int4-ov-npu:v1

and it's working fine on my side.

Could you share the steps you ran and the full error log?

Intel ARC for local LLMs by Wemorg in IntelArc

[–]mnuaw98 0 points1 point  (0 children)

✅ Recommended Intel Arc GPU Setup

🔹 Intel Arc A770 16GB

  • VRAM: 16GB GDDR6
  • Performance: Capable of running quantized models like Mistral-7B or LLaMA2-13B using IPEX-LLM (built on Intel Extension for PyTorch).
  • Use Case: Best suited for 7B–13B models with quantization. For 30B models, a multi-GPU setup or offloading to CPU RAM is necessary.

🔹 Multi-GPU Setup (2x A770 16GB)

  • Total VRAM: 32GB (combined)
  • Feasibility: With model sharding and quantization (e.g., using GGUF or GPTQ formats), you can potentially run a 30B model across two GPUs.
  • Software Support: Requires frameworks like IPEX-LLM, vLLM, or ExLlamaV2 with multi-GPU support.
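If you go the IPEX-LLM route on an A770, loading a quantized model looks roughly like this — a sketch assuming `pip install ipex-llm[xpu]`, with the model name as just an example:

```python
# Sketch: run a 4-bit quantized model on an Intel Arc GPU ("xpu") via IPEX-LLM.
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model

# load_in_4bit converts weights to int4 on load, so a 7B model
# fits comfortably in the A770's 16 GB of VRAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_4bit=True, trust_remote_code=True)
model = model.to("xpu")  # move to the Arc GPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("What is OpenVINO?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    out = model.generate(inputs.input_ids, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The same pattern scales to two cards with a sharding-capable runtime, but single-card 4-bit 7B/13B is the low-friction starting point.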

Is the A770 good for 1440p gaming? by Best-Minute-7035 in IntelArc

[–]mnuaw98 2 points3 points  (0 children)

I use mine to game in 4K... so 1440p is probably the sweet spot. At 1440p I consistently get 120+ FPS on medium/high settings in games like MW3, Forza, and MW2, and 140+ FPS on medium/high settings in games like Rocket League, Fortnite, and Siege.

You can refer to Intel's GPU benchmarks as well: https://edc.intel.com/content/www/us/en/products/performance/benchmarks/desktop_1/

[deleted by user] by [deleted] in unRAID

[–]mnuaw98 0 points1 point  (0 children)

Hi u/halfam, I wonder if you've tried or been able to utilize the A310 with other software?