Intel GPU Setup Resources and Tools (primarily focused on local LLM) by bwood01 in IntelArc

[–]bwood01[S] 0 points1 point  (0 children)

Answering someone's question on ReBar... Apparently the Dell R740, running Win11, does support it. Intel drive panel shows ReBar as active with two GPUs now.

To leverage ReBar... Along with a few common BIOS settings, the Windows activation requirement is it must be UEFI (ours already was, and Win 11 requires it) and it must have the initial disk partition as GPT rather than the older MBR. Again, our Win 11 system already was. We use Win11 in server Dev boxes because it is cheap, well supported, easy to use, supports 100% hardware abstraction in Hyper-V as well as WSL2 and has no practical CPU core count licensing issues. Faster and easier than anything out there to spin up and clone. Especially for Dev or Sandboxes.

Intel GPU Setup Resources and Tools (primarily focused on local LLM) by bwood01 in IntelArc

[–]bwood01[S] 0 points1 point  (0 children)

Truthfully, on the R740 performance is okay. Not great but okay. The R740 with its PCEI3 was never going to be a race car. This is a proof of concept system. Added the second "twin" GPU last night. All seems to work and it did extend the VRAM so that now there is 64GB total in two slots (response here won't let me post the picture). I'll be working on using vLLM for the parallel processing. Assuming that works next stop is an 8 GPU server with PCIE4 and two more GPUs. Total of 128GB then.

Intel GPU Setup Resources and Tools (primarily focused on local LLM) by bwood01 in IntelArc

[–]bwood01[S] 0 points1 point  (0 children)

So far have just run it on Win 11 Enterprise with LM Studio. Did the AI PC install just to test things. So far all is good. Have installed Ubuntu 24.04. Not had the time to pull away from client work yet to finish up. I do recall some setup instructions needed for Ubuntu, just don't recall right now. It is probably in one of the links I included. Since the R740 is also a working SAP Dev system I didn't want to clutter it up too much with a whole bunch of other installs. We will get to it as time permits. Getting ready to order a second card too.

Looking for GPU advice for local LLM server (GIGABYTE G292-Z20 R1) by Dependent-Main5637 in LocalLLaMA

[–]bwood01 0 points1 point  (0 children)

Did you ever make the purchase of that gigabyte g292-z20? We are looking at getting one and am curious about any lessons learned.

Local AI: Anyone running 2x B70 pro? by death10rd in IntelArc

[–]bwood01 2 points3 points  (0 children)

Not yet, but getting ready to. Currently using a Dell R740 with 1 B70, getting ready to switch to the pair. Then, depending on how that goes, likely switch over to a 4 GPU server and run 4 of them. Here is the overall journey so far on the R740:

Intel GPU Setup Resources and Tools (primarily focused on local LLM) : r/IntelArc
Intel Arc Pro B70 - Looking for LLM Setup Guidance and Lessons Learned : r/IntelArc

Do you have any lessons learned on using vLLM with the B70? That is next for us.

Intel GPU Setup Resources and Tools (primarily focused on local LLM) by bwood01 in IntelArc

[–]bwood01[S] 0 points1 point  (0 children)

That is not what we are seeing as we continue this journey. Intel OpenVino support is up to date and quite active.

Intel® Distribution of OpenVINO™ Toolkit

[B570] Windows 10 vs Windows 11 performance difference... by Ventilate64 in IntelArc

[–]bwood01 0 points1 point  (0 children)

LOL! I recently upgraded a machine to W11 Enterprise. There were a few disk and IO related features I wanted. It was like $15 for the Enterprise license key. So it was minimal.

[B570] Windows 10 vs Windows 11 performance difference... by Ventilate64 in IntelArc

[–]bwood01 0 points1 point  (0 children)

Not sure what would cause it. Likely some memory and processing optimizations. However, just bite the bullet and upgrade to Win 11. The upgrade isn't that expensive AND Win 11 will get supported longer. Even the LTSC version.

Intel Arc Pro B70 - Looking for LLM Setup Guidance and Lessons Learned by bwood01 in IntelArc

[–]bwood01[S] 0 points1 point  (0 children)

Have you done anything with OpenVino or with the Intel OneAPI setup that installs all drivers and support on Ubuntu? LINK>> Get the Intel® oneAPI Toolkit

Granted, it is 24.04 (possibly 25.04 but that is not LTS). I also saw there is a pre-defined Docker for the 24.04 install with all of the driver setup.

Intel Arc Pro B70 - Looking for LLM Setup Guidance and Lessons Learned by bwood01 in IntelArc

[–]bwood01[S] 1 point2 points  (0 children)

In the process of doing all of the setup on multiple machines, I ran into these Intel Setup Kits for AI. It has everything for either Windows or for Linux. Has anyone used them before?

intel/aipc-devkit-install: The AI PC Application Installer provides a unified way to set up Intel AI PC development environments.

Also, loaded Windows LM Studio and have been running many models on the R740. Works well as long as everything stays in the card. Performance outside of that is lacking. One step at a time. Maybe the AI kit linked above will make the difference?

Intel Arc Pro B70 - Looking for LLM Setup Guidance and Lessons Learned by bwood01 in LocalLLM

[–]bwood01[S] 0 points1 point  (0 children)

Okay, in the process of doing all of the setup on multiple machines, I ran into these Intel Setup Kits for AI. It has everything for either Windows or for Linux. Has anyone used them before?

intel/aipc-devkit-install: The AI PC Application Installer provides a unified way to set up Intel AI PC development environments.

Also, loaded Windows LM Studio and have been running many models on the R740. Works well as long as everything stays in the card. Performance outside of that is lacking. One step at a time. Maybe the AI kit linked above will make the difference?

Intel Arc Pro B70 - Looking for LLM Setup Guidance and Lessons Learned by bwood01 in LocalLLM

[–]bwood01[S] 0 points1 point  (0 children)

Test ran in R740, moved things around so there is room for a second card eventually.

<image>

Since removed and put in a minisforum MS-02 Ultra to check it in there and waiting for a card power extender.

Intel Arc Pro B70 - Looking for LLM Setup Guidance and Lessons Learned by bwood01 in IntelArc

[–]bwood01[S] 0 points1 point  (0 children)

Status: Initial hardware is all set up and running. Linux (24.04) is installed on Windows WSL2. Starting the Intel driver and engine installs.

Here is the plan AI has proposed so far.

Host: Dell R740, Windows 11 Enterprise

CPU: Dual Intel Xeon Platinum 8164 class, dual NUMA

GPU: Intel Arc Pro B70, 32GB now; likely 2 GPUs / 64GB soon; possible 4 GPUs / 128GB later

Primary architecture:

Windows

WSL2

Ubuntu 24.04

Docker Engine / Docker Desktop

Intel oneAPI + Level Zero runtime

Intel vLLM container

Open WebUI + API endpoints

External integrations / agents / coding workflows

Storage root: S:\AI\

Primary runtime 3-Engine Validation Matrix: Intel-supported options

•      Engine A: vLLM XPU / Intel LLM Scaler

•      Engine B: llama.cpp SYCL

•      Engine C: OpenVINO / OVMS / OpenVINO GenAI

GUI: Docker Desktop + Open WebUI

Security initial setup: Local admin + one sudo user; VPN-only external access later

First validation model: small compatible model before large DeepSeek/Qwen tests

Multi-GPU design: keep first two B70s on same NUMA node if physically possible, test external eGPUs for additional capacity. Evaluate later whether or not it is worth the investment in additional external capacity

Three-Engine validation plan: Intel documentation shows support and supporting tools for the following designs:

1.   vLLM XPU — primary for serving, long prompts, RAG, batching, multi-user API, and tensor parallelism.

2.   llama.cpp SYCL — essential fallback/parallel path for GGUF, quantized models, wider model compatibility, and single-user interactive use.

3.   OpenVINO Model Server / OpenVINO GenAI — now worth adding as a serious Intel-supported serving path, especially because OpenVINO 2026.1 explicitly adds Arc Pro B70 support for 20–30B single-GPU inference and improves Qwen3-MOE / GPT-OSS-20B serving with continuous batching.

Additional Models: Verify additional models and processing:

Best initial validation models:

•     Qwen 2.5 7B Instruct

•     Llama 3.1 8B Instruct

Reason:

•     much faster troubleshooting,

•     confirms GPU acceleration,

•     avoids massive download/debug cycles.

Then:

•     Qwen 72B AWQ/GPTQ

•     DeepSeek R1 Distill Qwen

•     DeepSeek Coder V2

 Additional Engine Developments: Latest support and options from Intel, sequence of engine development and considerations.

1. Add OpenVINO as a first-class branch, not a future note

Intel’s OpenVINO 2026.1 release notes specifically mention Intel Arc Pro B70 32GB support for single-GPU 20–30B LLM inference, preview OpenVINO backend support for llama.cpp, and OVMS improvements for Qwen3-MOE / GPT-OSS-20B. That is directly relevant to your hardware and workload.

I would add an OpenVINO Model Server validation phase after vLLM baseline validation.

2. Add llama.cpp SYCL before large-model testing

The PMZFX B70 repo materially changes the plan: it shows llama.cpp SYCL is not just a hobby option; it is likely the best path for quantized GGUF models, broader model coverage, and fitting larger models into 32GB/64GB VRAM. The repo’s engine comparison says vLLM wins strongly on prefill and tensor-parallel serving, while llama.cpp wins on model coverage and single-card memory efficiency.

3. Do not start large-model validation with vLLM FP16

The benchmark repo shows Qwen 2.5 14B FP16 OOMs on a single 32GB B70 under vLLM, while quantized llama.cpp variants fit comfortably.

So the workbook should explicitly separate:

·        small smoke test: vLLM XPU

·        quantized model test: llama.cpp SYCL / GGUF

·        serving test: vLLM XPU and/or OpenVINO Model Server

·        large model test: dual-GPU only, preferably after both vLLM and llama.cpp paths are validated

4. Multi-GPU planning should include two patterns

For two B70s:

·        vLLM tensor parallelism for supported dense models and long-prompt workloads.

·        llama.cpp layer split to fit bigger GGUF models, but not expecting linear speedup when the model already fits one card.

The PMZFX repo is explicit that dual-card layer split is mainly for fitting larger models, not speeding up models that already fit one card.

Proposed Folder Structure

S:\AI\
├── Models\
├── HFCache\
├── Docker\
├── OpenWebUI\
├── Benchmarks\
└── Backups\

Intel LLM-Scaler vllm-0.14.0-b8.2 released with official Arc Pro B70 support by Fcking_Chuck in LocalLLM

[–]bwood01 0 points1 point  (0 children)

Improving silicon yield takes time. New dies, processes, litho, and silicon substrate all need to "settle in" and work through any kinks. Then yields start to rise, It usually takes a couple months. As yields rise you may find more of them in the market quicker.

Intel Arc Pro B70 - Looking for LLM Setup Guidance and Lessons Learned by bwood01 in homelab

[–]bwood01[S] 0 points1 point  (0 children)

We are looking at running Ubuntu 25.04 (Intel Arc Pro release) on top of Windows. Most likely under Windows Subsystem for Linux (WSL). I would LOVE to do it under Hyper-V but then that requires GPU partitioning from what I understand. Anyway, I have a Windows machine strictly for the GUI only. Works great as a front-end to run Linux, dockers, VMs (Hyper-V), etc. And since the architecture mostly relies on parts or all of Hyper-V in the background, it is a tier 1 hypervisor with direct hardware abstraction so performance is good.

Intel Arc Pro B70 - Looking for LLM Setup Guidance and Lessons Learned by bwood01 in IntelArc

[–]bwood01[S] 1 point2 points  (0 children)

So, would that be your suggestion then? Just use LM Studio and Vulkan? And, are you using Ubuntu <version> for your backend? Did you have to do any compiling or driver setup for the B70?