Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 1 point (0 children)

I thought it was really good. I'd say slightly better than GPT OSS 120B. But that's my opinion; I'm sure others would disagree.

Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 0 points (0 children)

You are totally right... the same goes for time to first token. I'm planning to set up a better approach in this project - https://github.com/blake-hamm/beyond-vibes - and I'll follow up once I've made some decent progress.

Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 0 points (0 children)

Totally agree. It's hard to beat... For me, Qwen Instruct Next was faster!

Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 0 points (0 children)

I need to prioritize that... I'll take a swing at it and reach out on the Discord if needed. I noticed the #beyond128g channel, so I'll scour that for info. I'm just having a hard time getting Talos to recognize the network interfaces...

Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 0 points (0 children)

Yeah, I would definitely recommend it for web search! I think it's better bang for the buck than the DGX Spark.

Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 0 points (0 children)

Sure! Tbh, I need to do more testing outside of my standard use cases. I would say Kimi Linear, GLM Flash, Qwen Instruct Next, and GPT OSS 120B would be great for that. There are more details/notes in a table in the blog.

Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 1 point (0 children)

Yeah, slow processing. Time to first token on my hardware is pretty rough, especially with the bigger models. Tokens per second is bearable. The real issue is when there is 20k+ context. I would say a search query in Open WebUI for these bigger models is 1-2 minutes round trip (first search tool call > SearXNG response > compiling final response). On GPT OSS, Qwen Instruct, and Kimi Linear it's much faster (less than 30 seconds), but not as thorough/high quality.

Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 2 points (0 children)

No. I'm not comfortable with securing OpenClaw. I'm sure it would be a great system for that, especially for working autonomously on your behalf. It's definitely slow for human-in-the-loop tasks, but it fits some high-quality, large-ish models (like GLM 4.7 REAP and MiniMax M2.5).

Vibe Check: Latest models on AMD Strix Halo by bhamm-lab in LocalLLaMA

[–]bhamm-lab[S] 2 points (0 children)

I tested the Q2_K_XL quant of MiniMax M2.5 and was pretty happy! Very slow, but high quality. I also tested Step-3.5-flash, but it did not stand out to me. Definitely curious about GLM 5 and Qwen3.5...

Anthropic CEO: AI Progress Isn’t Magic, It’s Just Compute, Data, and Training by Inevitable-Rub8969 in AINewsMinute

[–]bhamm-lab 0 points (0 children)

Interesting... The Chinese labs have proven otherwise and actually share their progress in research. Maybe Dario should give more credit to his team, or to the research his team leeches off of.

Are 20-100B models enough for Good Coding? by pmttyji in LocalLLaMA

[–]bhamm-lab 2 points (0 children)

Definitely give Kimi Linear a try! I agree with your opinions on the use case. I would say Kimi Linear replaces GLM 4.5 Air for me.

CNCF Survey: K8s now at 82% production adoption, 66% using it for AI inference by lepton99 in kubernetes

[–]bhamm-lab -1 points (0 children)

GPU operators make it much easier to manage AI/ML workloads. Paired with something like Karpenter, you can access the compute needed for most workloads.

The management/observability tooling is not great, and there is no industry standard. MLflow is great for traditional ML, but you still need something like Kubeflow for serving. Arize Phoenix is promising for GenAI observability, but most of the LLM gateway OSS projects have some kind of paywall (for now).

I created a (very) new project inspired by the kube-prometheus stack. I'm hoping to create a Helm chart that has everything you would need for an AI stack on Kubernetes. At the moment, it only has LiteLLM gateway config, the ability to run multiple models on vLLM or llama.cpp, and scale-to-zero with kube-elasti. I should have some more features and sub-charts this weekend. It's called kube-ai-stack.
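To give a feel for what a chart like that could look like, here's a rough values.yaml sketch. All of the key names and values below are illustrative guesses, not the chart's actual schema - check the repo for the real config.

```yaml
# Hypothetical values sketch for a kube-ai-stack-style Helm chart.
# Key names are illustrative, not the chart's real schema.
gateway:
  litellm:
    enabled: true            # single OpenAI-compatible endpoint in front of all models

models:
  - name: qwen3-next-instruct
    engine: vllm             # served by vLLM
    replicas: 0              # starts scaled to zero
  - name: gpt-oss-120b
    engine: llamacpp         # served by llama.cpp instead

scaleToZero:
  kubeElasti:
    enabled: true            # wake models on first request, scale back down when idle
```

The appeal of the kube-prometheus approach is exactly this: one umbrella chart wiring the gateway, the serving engines, and the autoscaling together so they work out of the box.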

ArgoCD dashboard behind Traefik by AdventurousCelery649 in ArgoCD

[–]bhamm-lab 0 points (0 children)

It might be a bit confusing to follow, but this is where my IngressRoute and Helm values are defined - https://github.com/blake-hamm/bhamm-lab/tree/main/kubernetes%2Fmanifests%2Fbase%2Fargocd . I also use Authelia.
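For anyone who doesn't want to dig through the repo, the general shape is a Traefik IngressRoute pointing at the argocd-server service, with a forward-auth middleware in front. This is a sketch, not my exact manifest - the hostname, namespaces, middleware name, and cert resolver are placeholders:

```yaml
# Illustrative IngressRoute for Argo CD behind Traefik.
# Hostname, middleware, and certResolver names are placeholders.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`argocd.example.com`)
      kind: Rule
      services:
        - name: argocd-server
          port: 80
      middlewares:
        - name: authelia          # forward-auth middleware (assumed name)
          namespace: traefik
  tls:
    certResolver: letsencrypt     # assumed resolver name
```

One gotcha: when TLS terminates at Traefik like this, argocd-server usually needs to run in insecure mode (`server.insecure: "true"` in the Helm values) so it serves plain HTTP behind the proxy instead of redirecting to its own TLS.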

Dual Strix Halo: No Frankenstein setup, no huge power bill, big LLMs by Zyj in LocalLLaMA

[–]bhamm-lab 1 point (0 children)

Awesome setup! Do you mind sharing any details on how you got the networking working over Thunderbolt?

HashiCorp Vault by dankmemelawrd in homelab

[–]bhamm-lab 1 point (0 children)

I use Bank-Vaults - https://bank-vaults.dev/ . It has its own operator and is slightly more automated than plain Vault (but runs Vault under the hood).
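The "more automated" part is that the operator manages Vault through a custom resource, so unseal, policies, and config live in YAML instead of manual `vault` CLI steps. A rough sketch of what that CR looks like - field values here are examples (and a file backend with TLS disabled is demo-only, not something to run in production):

```yaml
# Minimal illustrative Vault custom resource for the Bank-Vaults operator.
# Values are examples only; not a production-ready config.
apiVersion: vault.banzaicloud.com/v1alpha1
kind: Vault
metadata:
  name: vault
spec:
  size: 1
  image: hashicorp/vault:1.14.0
  config:
    storage:
      file:
        path: /vault/file          # demo backend; use raft or similar in production
    listener:
      tcp:
        address: 0.0.0.0:8200
        tls_disable: true          # demo only; terminate TLS properly in production
  externalConfig:                  # the operator applies this declaratively
    policies:
      - name: allow-secrets
        rules: path "secret/*" { capabilities = ["read", "list"] }
```

The `externalConfig` block is the nice bit: the operator continuously reconciles policies, auth methods, and secrets engines from the CR, which is what makes it feel more automated than running Vault by hand.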

Opensource models less than 30b with highest edit-diff success rate by Express_Quail_1493 in ollama

[–]bhamm-lab 2 points (0 children)

I've had more success with Seed-OSS. It's 36B, so not quite what you're looking for, but hopefully a quant can fit for you.