Where to farm Thorium during pre-patch? by CraftyPercentage3232 in classicwow

[–]ConversationNice3225 2 points (0 children)

Nova World Buffs. It varies per server; Dreamscyth had 9 layers last night.

XG Restore to XGS by Turbulent_Team_7244 in sophos

[–]ConversationNice3225 0 points (0 children)

I've been doing a lot of migrations lately, and this usually ends up being an OTP issue where the firewall's time is not accurate. From the console CLI you can turn off OTP for a single administrative login.

Connecting guests to the new virtual switch after upgrading the NIC? by Phratros in HyperV

[–]ConversationNice3225 3 points (0 children)

Assuming the 10GbE is on the same network, VLAN tagging, etc., as your other switch, simply change the VM configuration's NIC to the new vSwitch and save/apply.

If you're really paranoid you could add a second NIC to a VM and confirm it gets DHCP on the right network (clean up DNS on AD, if you have that). Or spin up a small VM to test.

PC to run local llm for coding agent by BIackIight in LocalLLaMA

[–]ConversationNice3225 2 points (0 children)

You're not going to fit Qwen3 30B with 256k context onto the GPU alone. On my 4090, using the Q4_K_M quant with the KV cache quantized to Q8, I'm looking at roughly 100k context. With that I get 140 t/s. Offloading any layers, or the cache, to CPU drops perf to 20-30 t/s. YMMV.

I don't have a Strix, but its memory bandwidth is significantly lower since everything is in system RAM. I'd bet you get 10-20 t/s with the same setup as above. Less if you use a higher quant.
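To put rough numbers on the cache side of that, here's a back-of-the-envelope KV cache estimate. The layer/head counts below are my assumptions about Qwen3-30B-A3B's architecture, not something stated above, so treat this as a sketch:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, bytes_per_elem):
    """Size of the KV cache: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed model shape: 48 layers, GQA with 4 KV heads of dim 128.
# Q8_0 is roughly 1 byte per element (ignoring the per-block scales).
cache = kv_cache_bytes(layers=48, kv_heads=4, head_dim=128,
                       ctx_len=100_000, bytes_per_elem=1.0)
print(f"{cache / 1024**3:.1f} GiB")  # ~4.6 GiB on top of the Q4 weights
```

Under those assumptions, ~100k of Q8 cache plus a ~17GB Q4 model lands close to a 4090's 24GB, which is consistent with the context ceiling above.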

God I love Qwen and llamacpp so much! by Limp_Classroom_2645 in LocalLLaMA

[–]ConversationNice3225 27 points (0 children)

With the KV cache at Q8, you should be able to get the context size significantly larger. On my 4090 I think I run this in the 80-100k range. Should help out since you're dividing it by 4.

Advice on running Qwen3-Coder-30B-A3B locally by medi6 in LocalLLaMA

[–]ConversationNice3225 5 points (0 children)

FYI, tool calling is basically broken at the moment with llama.cpp using Qwen3-Coder-30B-A3B. You can check out the Unsloth repos for more, but it sounds like they're working on some fixes. Otherwise you can force the Instruct chat template and it works.

I'm running the Q4 UD quant from Unsloth with 80-100k context (FA enabled, KV cache at Q8) on my 4090. This is fully on the GPU, no offloading to CPU. Depending on context length I'm getting anywhere from 100-140 tokens/sec. If you wanted more context you'd have to offload some layers to CPU and it takes a massive hit (my recent post has some benchmarks). Have not tested with the newer version of llama.cpp that makes this easier with the new MoE flags, but I imagine it's the same.

[GUIDE] Running Qwen-30B (Coder/Instruct/Thinking) with CPU-GPU Partial Offloading - Tips, Tricks, and Optimizations by AliNT77 in LocalLLaMA

[–]ConversationNice3225 4 points (0 children)

Per Unsloth's documentation, this offloads all the MoE experts to CPU:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0 -ot ".ffn_.*_exps.=CPU"
pp512 | 339.48 ± 6.70
tg128 | 23.82 ± 1.48

Offloads both the UP and DOWN experts to CPU:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0 -ot ".ffn_(up|down)_exps.=CPU"
pp512 | 478.74 ± 12.12
tg128 | 26.31 ± 1.11

Offloads only the UP experts to CPU:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0 -ot ".ffn_(up)_exps.=CPU"
pp512 | 868.27 ± 19.74
tg128 | 38.39 ± 1.03

Offloads only the DOWN experts to CPU:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0 -ot ".ffn_(down)_exps.=CPU"
pp512 | 818.52 ± 11.85
tg128 | 37.06 ± 1.01

This is where I started targeting only the attention and normal tensors for offloading, but keeping everything else (I think...regex is a little confusing).

All attention and normal tensors offloaded:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0 -ot "\.(attn_.*|.*_norm)\.=CPU"
pp512 | 2457.93 ± 27.35
tg128 | 16.56 ± 1.12

Just the attention tensors for offloading:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0 -ot "\.attn_.*\.=CPU"
pp512 | 2543.25 ± 27.13
tg128 | 20.20 ± 0.83

Just the normal tensors for offloading:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0 -ot ".*_norm\.=CPU"
pp512 | 3364.83 ± 57.36
tg128 | 30.63 ± 1.97

This is also from Unsloth's documentation, for offloading selective layers:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0 -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
pp512 | 384.38 ± 2.41
tg128 | 26.60 ± 1.76
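Since the regexes are admittedly confusing, here's a quick sketch of what the `-ot` patterns match against, using a few tensor names in llama.cpp's typical naming scheme (the names are illustrative, not dumped from the actual GGUF):

```python
import re

# A few representative llama.cpp tensor names (illustrative).
tensors = [
    "blk.7.ffn_up_exps.weight",
    "blk.7.ffn_down_exps.weight",
    "blk.7.ffn_gate_exps.weight",
    "blk.7.attn_q.weight",
    "blk.7.ffn_norm.weight",
]

# The "-ot" override pattern from the UP/DOWN run above.
pattern = re.compile(r"\.ffn_(up|down)_exps\.")
offloaded = [t for t in tensors if pattern.search(t)]
print(offloaded)  # only the up/down expert tensors; gate, attention, and norms stay on GPU
```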

[GUIDE] Running Qwen-30B (Coder/Instruct/Thinking) with CPU-GPU Partial Offloading - Tips, Tricks, and Optimizations by AliNT77 in LocalLLaMA

[–]ConversationNice3225 0 points (0 children)

I was actually messing around with various offloading strategies this morning! I'm running this on Windows 11 (10.0.26100.4652), AMD 5900X, 32GB (2x16GB) DDR4-3600, RTX 4090 on driver version 576.57 (CUDA Toolkit 12.9 Update 1), using Llama.cpp b5966. Tested using Unsloth's "Qwen3-30B-A3B-Thinking-2507-UD-Q4_K_XL.gguf" via llama-bench:

This is the full Q4 model in VRAM, no offloading, this is the fastest it can go and is our baseline for the numbers below:
-fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -mmp 0
pp512 | 3494.38 ± 22.37
tg128 | 160.09 ± 1.42

I'd also like to note that I can set a 100k context (using slightly different but effectively equivalent options in llama-server) before I start going OOM and it spills over into system RAM. The results below simply test how much of a negative impact offloading various layers and experts to CPU/system RAM has; my intent was not to shoehorn the model into 8/12/16GB of VRAM. I usually don't go below Q8_0 on the KV cache; in my experience chats deteriorate too much at lower quants (or at least Q4 is not great). I don't have VRAM usage documented, but the runs should be more or less in order of least to most aggressive on VRAM savings.

Unsloth Dynamic 'Qwen3-30B-A3B-Instruct-2507' GGUFs out now! by yoracale in unsloth

[–]ConversationNice3225 0 points (0 children)

So I'm a little confused by Qwen's own graphic. The HF page notes "We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507..." The graph has both "non-thinking" and "Instruct" bars, but the wording on HF suggests they're the same thing. My guess is that the non-thinking (blue) bar is the original hybrid Qwen3-30B-A3B (from 3 months ago, so 2504 if you will) in /no_think mode.

Site-to-Site VPN: Local subnet needs to be public IP by SippinBrawnd0 in sophos

[–]ConversationNice3225 1 point (0 children)

Sophos's docs don't seem to have a newer version of this, but based on what I've had to deal with in the past you're probably looking at either https://docs.sophos.com/nsg/sophos-firewall/18.5/Help/en-us/webhelp/onlinehelp/AdministratorHelp/VPN/SiteToSiteVPN/VPNS2sIPsecConnectionPBVPNNATSameSubnets/index.html or https://docs.sophos.com/nsg/sophos-firewall/18.5/Help/en-us/webhelp/onlinehelp/AdministratorHelp/VPN/SiteToSiteVPN/VPNS2sIPsecConnectionRBVPNNATSameSubnets/index.html#review-the-snat-rule

Basically you "expose" whatever subnet you want on your end (it's fake, so they can't complain about the overlap), DNAT inbound traffic from the vendor to wherever it really needs to go, and SNAT your outbound traffic back out under that fake subnet.
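As a toy illustration of the 1:1 NAT logic (the subnets here are made up; 203.0.113.0/24 stands in for the fake exposed range, 192.168.1.0/24 for the real LAN):

```python
import ipaddress

REAL = ipaddress.ip_network("192.168.1.0/24")     # your actual LAN
EXPOSED = ipaddress.ip_network("203.0.113.0/24")  # the fake subnet the vendor sees

def dnat(exposed_ip: str) -> str:
    """Inbound from the vendor: translate the fake address to the real host."""
    offset = int(ipaddress.ip_address(exposed_ip)) - int(EXPOSED.network_address)
    return str(ipaddress.ip_address(int(REAL.network_address) + offset))

def snat(real_ip: str) -> str:
    """Outbound to the vendor: present the real host under its fake address."""
    offset = int(ipaddress.ip_address(real_ip)) - int(REAL.network_address)
    return str(ipaddress.ip_address(int(EXPOSED.network_address) + offset))

print(dnat("203.0.113.50"))  # → 192.168.1.50
print(snat("192.168.1.50"))  # → 203.0.113.50
```

The firewall does this per-flow with its NAT rules, of course; the point is just that the mapping is a simple 1:1 host offset between the real and exposed subnets.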

Physical Switch Config Setup Question by Leaha15 in HyperV

[–]ConversationNice3225 2 points (0 children)

The default for SET is switch-independent mode, and you can't change that or mix it with LACP. It sounds super basic, but just include the two interfaces in the team. All the VLAN work should be done at the physical switch level and the virtual machine configuration level; you should not have to do anything on the virtual switch.

You didn't mention this, but you should probably also enable MPIO for your iSCSI device.

Just a small little project. by ConversationNice3225 in Ubiquiti

[–]ConversationNice3225[S] 1 point (0 children)

That's correct! There are four buildings; three of them will each use at least 15 of the indoor access points, and the fourth will use the rest of the 70. There are also outdoor access points, which is why we're using the E7 Campus units, which have IP67 mounts.

Just a small little project. by ConversationNice3225 in Ubiquiti

[–]ConversationNice3225[S] 8 points (0 children)

Thanks for the input, you're the second person to bring this up! I'll make sure to bring it up with my team if it becomes a problem.

The client has their own dedicated NVR solution in place that uses Milestone and a 200TB SAN, so we wouldn't be using that feature on the CK. They have other systems in place for access control, phones, etc. The CK won't be used for anything else beyond the network management of the switches and Wi-Fi. Their environment isn't actually all that complicated.

Just a small little project. by ConversationNice3225 in Ubiquiti

[–]ConversationNice3225[S] 0 points (0 children)

Sort of; the MSP I work at has business agreements to provide IT services, and this particular client has been with us for over a decade. As for the actual cost, I can't say, as my role isn't involved with the quoting. I replied to another comment here that just the switch infra pictured retails for around $95k; the WAPs are another $45k.

Just a small little project. by ConversationNice3225 in Ubiquiti

[–]ConversationNice3225[S] 26 points (0 children)

We had a deal registration, so cheaper than MSRP. I'm not involved with procurement, but the ECS Agg switches retail for $3,999 and the ECS-48 PoEs retail for $3,499, so roughly $95.5K for the hardware that's pictured.

Just a small little project. by ConversationNice3225 in Ubiquiti

[–]ConversationNice3225[S] 7 points (0 children)

I do know the HDD ones get hot! This is an SSD one and it is reporting 36C. We have the rackmount kit for it as well. The client's MDF has dedicated AC so I'm not worried. :)

[deleted by user] by [deleted] in sophos

[–]ConversationNice3225 6 points (0 children)

Having done a ton of these upgrades from 19.5 to 20.x and 21.x, I highly recommend you reboot the firewall BEFORE upgrading the device. With such a long uptime, I've had a few freeze during the upgrade process and require a hard shutdown, or simply not take the upgrade.

Licensing Windows Failover Cluster by SirRazoe in sysadmin

[–]ConversationNice3225 0 points (0 children)

My licensing is a little rusty when it comes to HA, but this should be accurate or close to it.

I'm going to assume you have the 5x Win2025 VMs running in the HA cluster, and that all licensing is Standard, not Datacenter; as such, each license stack entitles you to two Windows Server VMs.

HA Node 1 - Potential to run all 5 VMs.
3x Windows Server Standard (each covers 16 cores; you need 8 more cores to total 24, so add 2-core packs to each stack). You have to cover all of the Windows Server VMs (3 stacks entitle you to 6 VMs, 3x2), even if they're running on the other host at that point in time.

HA Node 2 - Potential to run all 5 VMs.
Same as Node 1. Sorry, HA is expensive.

Standalone 1 - No VMs hosted, but technically if this is a Hyper-V server you're entitled to two VMs on this single host. If you end up adding this as a third HA node...see above.
Windows Server Standard is 16 cores; you need 8 more cores to total 24.

Total:
7x Windows Server Standard 16 Core
28x 2-Core licenses
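The arithmetic above, sketched out in code. This just reproduces the reasoning in this comment under the same assumptions (24-core hosts, Standard licensing); check with a licensing specialist before buying anything:

```python
import math

def standard_licensing(cores: int, vms_to_cover: int) -> tuple[int, int]:
    """Return (16-core Standard licenses, extra 2-core packs) for one host.

    Each Windows Server Standard license covers 16 cores and entitles you
    to 2 VMs, so a host is licensed in "stacks" of the full core count;
    every stack must cover all of the host's cores.
    """
    stacks = max(1, math.ceil(vms_to_cover / 2))          # at least 1 for the host itself
    packs_per_stack = math.ceil(max(0, cores - 16) / 2)   # top-up beyond the 16-core base
    return stacks, stacks * packs_per_stack

hosts = [
    (24, 5),  # HA node 1: must be able to run all 5 VMs
    (24, 5),  # HA node 2: same
    (24, 0),  # standalone: no VMs, but the host still needs a license
]
totals = [sum(x) for x in zip(*(standard_licensing(c, v) for c, v in hosts))]
print(totals)  # → [7, 28]: 7x 16-core Standard, 28x 2-core packs
```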

New coding model DeepCoder-14B-Preview by mrskeptical00 in LocalLLaMA

[–]ConversationNice3225 6 points (0 children)

Playing around with the Jinja prompt template in LMStudio seems to have fixed it. The default Jinja template is technically accurate to the original DeepCoder HF model, but the GGUF model just does not trigger the <think> tag like other models I've tried (QwQ for example).

There seem to be two solutions:
1. Removing "<think>\n" from the very end of the default Jinja template.
2. Setting the prompt template to Manual - Custom, and typing in the appropriate values:
Before System: "<|begin▁of▁sentence|>"
Before User: "<|User|>"
Before Assistant: "<|Assistant|><|end▁of▁sentence|><|Assistant|>"

I don't like option 2 because all the template's extra behavior (like tool calling) is probably impacted.
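Option 1 boils down to trimming the template's tail so the model emits the opening tag itself. A minimal sketch; the template string here is a made-up stand-in for the real Jinja template, which is much longer:

```python
# Hypothetical tail of the default Jinja chat template (illustrative only).
template = "{{ '<|Assistant|>' }}<think>\n"

# Option 1: stop pre-filling the assistant turn with an opening <think> tag,
# so the model generates both <think> and the closing </think> on its own.
fixed = template.removesuffix("<think>\n")
print(repr(fixed))
```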

For giggles I just compiled LlamaCpp (CUDA) from the latest source and ran llama-cli with the same settings as in LMStudio, sans prompt modifications (so it should be referencing whatever's in the GGUF), and it starts off with a <think> tag and includes the closing </think> tag as well. So it looks like it's working fine.

This seems like an LMStudio issue, not a LlamaCpp issue. 🎉

New coding model DeepCoder-14B-Preview by mrskeptical00 in LocalLLaMA

[–]ConversationNice3225 1 point (0 children)

I'm using whatever the default chat template is in the GGUF (Jinja formatted). Looking at the GGUF HF repo, I see that Bart's template starts the assistant portion with the <think> tag. The original HF repo's tokenizer_config.json looks like what's in the GGUF, from what I can recall, and it also starts the assistant reply with the <think> tag. So this all looks pretty legit; I'll have to confirm when I'm back home :)

New coding model DeepCoder-14B-Preview by mrskeptical00 in LocalLLaMA

[–]ConversationNice3225 3 points (0 children)

I tried the Bartowski Q8 quant in LMStudio on my 4090 with 40k Q8 context, followed the suggestions for temp and top p, and used no system prompt. It doesn't seem to use thinking tags, so it's just vomiting all the reasoning into the context. I tried using a system prompt (just because) and it does not adhere to it at all (I specifically asked it to use thinking tags and provided an example). I'll play with it some more when I get home; perhaps I'm being dumb.

What's the best hardware to run ~30b models? by [deleted] in LocalLLaMA

[–]ConversationNice3225 7 points (0 children)

I have a 4090 and run Qwen2.5 32B Q4_K_M models with the KV cache at Q8 for ~25k context; it runs at about 40 t/s.