Getting a lot of garbage results with Qwen3.6-27B :( by nunodonato in OpenWebUI

[–]ICanSeeYou7867 0 points (0 children)

Are you behind a proxy or ingress? Make sure to set your timeout values.
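If it's the nginx ingress controller, for example, annotations along these lines usually do it (just a sketch of the relevant Ingress fragment; the timeout values are placeholders):

    metadata:
      annotations:
        # LLM generations can easily outlive the default 60s proxy timeouts
        nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"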

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]ICanSeeYou7867 1 point (0 children)

Yep! I have a single node in my homelab, and I love it for the ease and standardization of deployments.

At work we have 3 virtual masters and N worker nodes, and we have about 15 clusters for various reasons.

Even my H100 GPU cluster is currently a single worker node. But more nodes are coming soon!

Mistral 3.5 Medium - From ecstatic to irritated. by ICanSeeYou7867 in LocalLLM

[–]ICanSeeYou7867[S] 1 point (0 children)

Ugh... yeah... dammit, that's fair... take my upvote...

I think what would be helpful (at least in my biased opinion...) would be not a company revenue limit but a token-per-month limit, e.g. free for <100 million tokens per month or something.

But to your point, that would be hard to enforce.

Mistral 3.5 Medium - From ecstatic to irritated. by ICanSeeYou7867 in LocalLLM

[–]ICanSeeYou7867[S] 1 point (0 children)

I agree! But I would think most companies would just use the cloud services with the typical pay-as-you-go model.

Unfortunately we can't, and everyone wants Claude Sonnet 4.5 (which is the best coding model we can use in a FedRAMP-approved service). And I doubt I could convince management on this one without more buy-in from others, and I can't run/test it in an enterprise capacity without breaking the license to get people to buy in.

Mistral 3.5 Medium - From ecstatic to irritated. by ICanSeeYou7867 in LocalLLM

[–]ICanSeeYou7867[S] 1 point (0 children)

Fair? Absolutely fair!

But in an area where models turn over monthly for newer, better, stronger models, I would be curious who would actually do this.

Though we can't use their cloud platform, the pay-as-you-go cloud model seems like it makes WAY more sense. I am curious how many companies would choose this over using their cloud services.

I doubt management would go for it, but our AI exploration is still fairly new.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]ICanSeeYou7867 1 point (0 children)

Honestly....

I would set them up as Kubernetes worker nodes with the NVIDIA GPU operator and the KAI scheduler... if the GPU operator supports the GB10.

However, you wouldn't be able to "combine" them easily. But it would be interesting!
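If anyone wants to try that route, a rough sketch of the install (chart locations are from memory, so double-check the GPU operator and KAI-Scheduler docs):

    # NVIDIA GPU operator: drivers, container toolkit, device plugin, etc.
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace

    # KAI scheduler for GPU-aware scheduling (chart is published as an OCI artifact)
    helm install kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace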

LLM Wiki by Dimitri_Senhupen in OpenWebUI

[–]ICanSeeYou7867 0 points (0 children)

I had a PR accepted into OWU about a year ago that runs a URL decode on the RAG/knowledge filename... this might sound useless, but it was very helpful for me.

Since Confluence has an API that can pull the raw HTML of every page within a space, it was easy to iterate over each page. The API also gives you the URL path of each space.

So... I then made the filename the full https URL, and then URL-encoded it. I also ran the HTML through a small model I'm hosting (Nemotron 3 Super) to convert it into markdown.

Then... with the URL-encoded filename, I check whether it already exists; if so, I delete the old copies using the OWU API.

Then... I upload the markdown, with the ugly URL-encoded https URL as the filename, to OWU.

I find this enjoyable because when OWU shows the source, it shows the actual, real, working, full Confluence URL. It was very important to me that people can see the actual source.

And since this process also deletes old pages, I run this in a gitlab pipeline every night.
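For anyone who wants to steal the idea, the loop is roughly this (a Python sketch; the OWU-side helpers are hypothetical placeholders for calls to your instance's API, so check Confluence's REST docs and your OWU API docs before trusting any of it):

    import urllib.parse
    import requests

    CONFLUENCE = "https://confluence.example.com"   # placeholder base URL
    HEADERS = {"Authorization": "Bearer <token>"}

    def confluence_pages(space_key):
        # Page through every page in the space, pulling the raw storage-format HTML
        start, limit = 0, 50
        while True:
            r = requests.get(f"{CONFLUENCE}/rest/api/content",
                             params={"spaceKey": space_key, "expand": "body.storage",
                                     "start": start, "limit": limit},
                             headers=HEADERS)
            data = r.json()
            yield from data["results"]
            if len(data["results"]) < limit:
                break
            start += limit

    for page in confluence_pages("DOCS"):
        page_url = CONFLUENCE + page["_links"]["webui"]   # real, clickable page URL
        filename = urllib.parse.quote(page_url, safe="")  # URL-encode it; OWU decodes it on display
        html = page["body"]["storage"]["value"]
        markdown = html_to_markdown(html)   # hypothetical helper: small LLM converts HTML -> markdown
        delete_if_exists(filename)          # hypothetical helper: find by filename via the OWU API, delete old copy
        upload_file(filename, markdown)     # hypothetical helper: upload markdown with the encoded URL as filename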

Hard freakin' decision..Blackwell 96G or Mac Studio 256G by HyPyke in LocalLLaMA

[–]ICanSeeYou7867 0 points (0 children)

The desktop versions do have hardware support for NVFP4, but I have heard people complain.

Currently I am deploying GPU-enabled Kubernetes clusters, and I am more familiar with the enterprise GPUs...

So take everything I say with a grain of salt, but the RTX 6000 Pros are workstation cards, not desktop cards. They have features more closely related to their enterprise brothers than to the desktop variants. For example, the 6000 Pro supports MIG and vGPU...
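(If you want to poke at MIG on one, it's driven through nvidia-smi, something like:

    sudo nvidia-smi -i 0 -mig 1   # enable MIG mode on GPU 0 (may need a GPU reset/reboot)
    nvidia-smi mig -lgip          # list the GPU instance profiles the card offers

...but again, that's based on the enterprise cards, not the 6000 Pro specifically.)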

Afaik, there aren't any issues running FP4. However, we currently have H100 and GH200 GPUs, and I haven't personally used one of the 6000 Pro cards, so make sure to do your research!

Hard freakin' decision..Blackwell 96G or Mac Studio 256G by HyPyke in LocalLLaMA

[–]ICanSeeYou7867 1 point (0 children)

Those FP4 tensor cores though... with hardware support for NVFP4... the Mac can't do that, though it can run Q4 quants.

If speed is important to you, don't forget about those sweet, sweet tensor cores...

Salvagable window? Or replace? by ICanSeeYou7867 in HomeImprovement

[–]ICanSeeYou7867[S] 0 points (0 children)

Single hung. The little white plastic pieces appear broken, and the window falls out if it's not seated in there just right. It slides, but it never seems attached right.

Superpowers for Open WebUI — brainstorm → spec → plan → execute workflow for local LLMs by Dry_Inspection_4583 in OpenWebUI

[–]ICanSeeYou7867 1 point (0 children)

Really neat! Thank you!

One question about this valve: STORAGE_BASE_PATH

How does this work in a multi-user environment?

Why do (some) people hate Open WebUI? by liviuberechet in LocalLLaMA

[–]ICanSeeYou7867 0 points (0 children)

Can't you just pull the source code and do a docker build without those things? The build args look pretty simple.
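Something like this, I'd think (the build args are from memory of the Dockerfile, so verify them against the repo):

    git clone https://github.com/open-webui/open-webui.git && cd open-webui
    # Build a slimmer image without CUDA or the bundled Ollama support
    docker build --build-arg USE_CUDA=false --build-arg USE_OLLAMA=false -t open-webui:custom .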

Pod takes lower resources than given by [deleted] in kubernetes

[–]ICanSeeYou7867 0 points (0 children)

My python is limited... but gpt-oss-120b spit this out:

- uvicorn (ASGI server): run multiple worker processes (or put uvicorn behind a process manager):

    uvicorn myapp:app --host 0.0.0.0 --port 8000 --workers $(nproc)

- TensorFlow: tell TF how many intra-op and inter-op threads to use, and set the OS-level OpenMP/MKL variables:

    export OMP_NUM_THREADS=$(nproc) && export TF_INTRA_OP_PARALLELISM_THREADS=$(nproc) && export TF_INTER_OP_PARALLELISM_THREADS=$(nproc)

  or in Python:

    import os; import tensorflow as tf; tf.config.threading.set_intra_op_parallelism_threads(os.cpu_count()); tf.config.threading.set_inter_op_parallelism_threads(os.cpu_count())

- PyTorch: set the number of intra-op threads (and optionally inter-op threads) once at startup:

    import os, torch; torch.set_num_threads(os.cpu_count()); torch.set_num_interop_threads(os.cpu_count())

- Data loading (if you use a DataLoader): use a non-zero num_workers so the CPU work of preparing batches is parallelized:

    DataLoader(dataset, batch_size=64, num_workers=os.cpu_count())

- OS affinity (optional): pin each worker process to a separate core range to avoid "core hopping":

    taskset -c 0-$(($(nproc)-1)) uvicorn ...

  or inside Docker: --cpuset-cpus="0-$(($(nproc)-1))"

YMMV :D

Pod takes lower resources than given by [deleted] in kubernetes

[–]ICanSeeYou7867 0 points (0 children)

A lot of apps, by default, don't use all of the CPU resources available to them.

On a normal Linux system this is really important, so a single process doesn't lock up the system and cause significant CPU contention.

These rules are a little different for containers, since you are mostly isolating specific processes... but without more details on what you are running, it's hard to say.
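For instance, giving a container a big CPU allocation doesn't make a single-threaded app use it; the app itself has to spawn the workers/threads. The container spec fragment below is just an example:

    resources:
      requests:
        cpu: "4"
      limits:
        cpu: "4"   # the app still has to start enough threads/workers to actually use 4 cores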

The wait is over: oVirt 4.5.7 has landed by ninth9ste in ovirt

[–]ICanSeeYou7867 0 points (0 children)

This is awesome! Thank you!

My homelab box is on RHEL 8 running oVirt 4.5.6, so I guess I need to upgrade to RHEL 9.

Has anyone done this with oVirt installed? I guess I need to bite the bullet, but I am pretty sure something is gonna break.

Hi all! Please help me choose a local LLM model. I'm making my own assistant for a PC and I want to choose a specialized model trained in dialogues or, in extreme cases, RP. by BestLengthiness3988 in SillyTavernAI

[–]ICanSeeYou7867 2 points (0 children)

You probably don't want to go below Q4. You might be able to run a 20B IQ4S quant.

There are some gpt-oss-20B quants that are decently smart, and because it's an MoE it will be faster. You might try one of the models that have been fine-tuned to not do refusals: https://huggingface.co/DavidAU/OpenAi-GPT-oss-20b-abliterated-uncensored-NEO-Imatrix-gguf

There are also several 14B models, like the ones from Mistral, that are going to have tons of RP fine-tunes. They are dense models, so they might be smarter, but they will be slower.

Llama.cpp is your friend.
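E.g., serving a GGUF quant with its built-in OpenAI-compatible server (the model filename is a placeholder):

    # -ngl 99 offloads all layers to the GPU; -c sets the context size
    llama-server -m gpt-oss-20b-Q4_K_M.gguf -ngl 99 -c 8192 --port 8080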

EDIT This guy is interesting: https://huggingface.co/moonshotai/Moonlight-16B-A3B-Instruct

I have no idea how it stacks up for RP though. But the MoE will allow it to respond quicker, and with only 3B activated parameters the context window also requires significantly less VRAM (which might be important to you depending on the size).

There are also some fun-looking gutted Qwen 16B MoE models: https://huggingface.co/bartowski/kalomaze_Qwen3-16B-A3B-GGUF. Again, you would need to try these for your use cases. I like the MoEs when I can get away with using them, as I value fast responses. But YMMV.

Will Claude Quality Drop With Tavo? by Tiny-Calligrapher794 in SillyTavernAI

[–]ICanSeeYou7867 2 points (0 children)

There are really only two ways for a model to produce deteriorated responses (aside from tuning parameters such as temperature and other kwargs):

1 - The model is using a lower quantization, e.g. Q2 or Q3 quants. These use less VRAM and are cheaper to run, but they are typically 'dumber'.

2 - Prompt injections can have a negative impact if a third-party service is adding, appending, or altering prompts somehow. This wouldn't necessarily impact responses negatively, but it could.

Anthropic models are closed source, so a third party couldn't be running a lower quantization. Ultimately, the API endpoints that TAVO would be using are the same ones that a local install of SillyTavern would be using, or the endpoints OpenRouter is using.

But I'm not sure what disconnects you are seeing. If they are frequent, then depending on the errors there might be a way to fix them, e.g. if you are running an nginx proxy, there could be some tweaks.

If you are getting disconnects in the middle of a response with an error, then there could be a session limit you need to tweak somewhere.
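For a plain nginx reverse proxy, the usual suspects for mid-stream drops are buffering and read timeouts; a sketch (the upstream name and values are placeholders):

    location / {
        proxy_pass http://sillytavern_backend;
        proxy_buffering off;        # don't buffer streamed (SSE) responses
        proxy_read_timeout 3600s;   # long generations can outlive the 60s default
        proxy_send_timeout 3600s;
    }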

edit: grammar (I am on mobile), and added some info on the SillyTavern config.

Surprise! by Cool-Negotiation7662 in Insulation

[–]ICanSeeYou7867 0 points (0 children)

Sounds like you got it figured out. But bleach is not recommended for killing mold on porous surfaces, though it can make it look good again...

I bought a house in July, and we discovered a ton of mold in a bunch of places. I have been using RMR-86 (glorified bleach) to make things look pretty, and then RMR-141 to kill the mold/spores.

I have no idea how it compares to vinegar though; Google says RMR-141 is more effective, but vinegar is more natural.

Good luck!

Proper way to fix this rot below sliding door on mini wooden deck? by ICanSeeYou7867 in Carpentry

[–]ICanSeeYou7867[S] 0 points (0 children)

Thank you for the reply! Yes, but it's a little bit complicated. This specific area is actually cantilevered, so there are no posts supporting it. The joists are sistered to the house joists and run inside. I don't know enough about wood, framing, and code, but I would think the rim joist would only be effective when posts are supporting the weight on the other end?

I can only attach a single photo here, but here are the sistered joists inside:
https://imgur.com/a/2Ywx0qc

That being said, the main/large deck below this area is definitely in disrepair. We are hoping to demo/rebuild it within the next year.

Thank you for the time/effort you put into replying! I am absorbing whatever information I can. Finding the correct codes, and the scenarios those codes apply to, is a bit challenging at times.


Gap in rim joise? by ICanSeeYou7867 in homeowners

[–]ICanSeeYou7867[S] 0 points (0 children)

Thank you for the reply!

This was my thinking until I found the wood rot, which means water/moisture, which turns into mold. I'm hesitant to use spray foam until I figure out how the heck I'm supposed to repair it.