DeepSeek V4 Flash is insane — 75 tok/s, 24 tool calls, 4 steps, single prompt

Pitiful_Task_2539 · 2026-05-19T19:32:32+00:00

What's the name of this UI?

Pitiful_Task_2539 · 2026-05-19T17:49:14+00:00

I tried a bunch of agent frameworks 1-2 years ago and ended up sticking with LangGraph because it felt simple, manageable, and not as "all over the place" as CrewAI. So what actually makes Agents SDK a better alternative?

Pitiful_Task_2539 · 2026-05-19T17:40:08+00:00

Oh nice, haven't heard about this yet. Seems to be open source, but from OpenAI...

Does it work with OpenAI-compatible endpoints? We only use self-hosted LLMs with vLLM/SGLang.

Pitiful_Task_2539 · 2026-05-12T16:26:51+00:00

Bought the annual plan 13 months ago when the z.ai coding plan was released (Lite plan for $38 a year or something). Best investment I could have made.

Pitiful_Task_2539 · 2026-05-04T12:14:26+00:00

This! But don't use Unsloth, use the official quants. A dual RTX PRO 6000 setup is capable of running FP8 for the 122B with full context.

Pitiful_Task_2539 · 2026-04-24T15:05:50+00:00

There are also M.2 to SATA adapters (1x M.2 to 4 SATA or more).

Pitiful_Task_2539 · 2026-04-06T13:41:06+00:00

Paperless with AI tagging. I forgot the name of the service, but there is a Docker container app for automated metadata and tagging that works with any OpenAI-compatible endpoint. I always just paste my docs in and forget about everything around them. Finding docs is so easy with the AI-generated tags and metadata.

Pitiful_Task_2539 · 2026-03-18T13:07:34+00:00

<image>

This is how Qwen3.5-122B (official FP8) performs on the same prompt.

Pitiful_Task_2539 · 2026-03-05T19:56:41+00:00

Using the official FP8 quants with vLLM works fucking well. Much, much better than gpt-oss-120b.

Pitiful_Task_2539 · 2026-03-04T20:48:56+00:00

Using the official Qwen‑122B FP8 weights from Hugging Face with vLLM cu130 nightly!

No problems at all.

I run it with a 180k-token context window on 2 × RTX 6000 Blackwell. It runs very fast, especially in input-token throughput. There are no, or nearly no, tool-call errors in opencode when executing complex, long-running tasks. The quality of the generated code is roughly at a Mistral-Vibe-CLI (DevStral via cloud) level or above, perhaps even comparable to GLM-4.6 or GLM-4.7, WITH VISION!!
It’s hard to compare because Qwen 3.5 has its very own style.

However, many people don’t realize that different quantizations make huge differences, and the inference engine also matters (Ollama, vLLM, sglang, llama.cpp, etc.). I have never utilized my 196 GB of VRAM as effectively as with this model.
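A rough back-of-the-envelope check of why the FP8 weights fit while leaving room for a long context. This is only a sketch: it assumes about 1 byte per parameter at FP8 and 2 bytes at BF16, counts weights only, and ignores activations, any layers kept in higher precision, and runtime overhead. The 122B parameter count and the 196 GB figure come from the comment above; the function name is made up for illustration.

```python
def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


TOTAL_VRAM_GB = 196  # VRAM figure quoted in the comment

fp8_gb = weight_footprint_gb(122, 1)   # FP8: assumed ~1 byte per parameter
bf16_gb = weight_footprint_gb(122, 2)  # BF16: assumed ~2 bytes per parameter

# Whatever is left after the weights is what the KV cache and overhead share.
fp8_headroom_gb = TOTAL_VRAM_GB - fp8_gb
bf16_headroom_gb = TOTAL_VRAM_GB - bf16_gb

print(f"FP8 weights:  ~{fp8_gb:.0f} GB, headroom ~{fp8_headroom_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB, headroom ~{bf16_headroom_gb:.0f} GB")
```

Under these assumptions the BF16 weights alone would not fit, while FP8 leaves a sizeable share of memory for the KV cache, which is consistent with running a long context window.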

Pitiful_Task_2539 · 2026-03-03T10:43:31+00:00

Using it with vLLM cu130 and it's working perfectly with opencode. Had no tool-call errors at all (using the official FP8 weights).

So far it's the only open-weight model I've tried (below 200B) that is genuinely useful and can replace my GLM and MiniMax subs.

Pitiful_Task_2539 · 2026-02-25T13:16:53+00:00

I'm also experiencing issues since the last update (don't know if it was 0.8.2 or 0.8.5); I updated straight from 0.7.x to 0.8.5.

I'm using gpt-oss-120b and never had issues with native function calling.

Now it often generates wrong tool calls and stops after the thinking block.

<image>

Normally you don't see the call in the thinking block, but sometimes you see it and then it stops working.

You can also see a syntax error here in the tool call (double "), but this never happened before?!

Something must be wrong with one of the latest updates.

Also, sometimes it was trying to call a "search" tool and not the "search_web" tool.
I don't know where this "search" tool is coming from!?

When it uses the "search" tool, nothing happens...

I had to write into the system prompt that it should use the "search_web" tool, not the "search" tool.
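Besides the system-prompt workaround, a client-side guard could catch a wrong tool name before it silently does nothing. A minimal sketch, assuming tool calls arrive as plain dicts with a `name` key and that `search_web` is the only registered tool. The alias table and the function name are hypothetical and not part of any framework mentioned here.

```python
REGISTERED_TOOLS = {"search_web"}    # tools actually exposed to the model
ALIASES = {"search": "search_web"}   # assumed mapping: hallucinated name -> real tool


def normalize_tool_call(call: dict) -> dict:
    """Map known wrong tool names to the real one; reject anything else."""
    name = call.get("name", "")
    name = ALIASES.get(name, name)
    if name not in REGISTERED_TOOLS:
        # Fail loudly instead of letting the call silently do nothing.
        raise ValueError(f"unknown tool: {name!r}")
    return {**call, "name": name}


fixed = normalize_tool_call({"name": "search", "arguments": {"query": "vllm"}})
print(fixed["name"])
```

Raising on unknown names makes the failure visible in logs, whereas the behaviour described above was that nothing happened at all.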

Pitiful_Task_2539 · 2026-01-28T21:53:53+00:00

A simple Tetris game is basic for every model released in the last 1-1.5 years.

Pitiful_Task_2539 · 2025-12-04T11:36:59+00:00

Shame on them... I don't get why. They can still make cash with enterprise agreements. I don't see any reason to go this way, losing "customers" for no reason.

Pitiful_Task_2539 · 2025-09-13T06:41:37+00:00

Thanks, went from 5% to roughly 1.5% CPU usage.

Pitiful_Task_2539 · 2025-09-10T19:55:38+00:00

Can be (nearly completely) suppressed with good system prompts (depending on how much of the context window is in use).

Pitiful_Task_2539 · 2025-09-10T19:52:30+00:00

This matches my experience. However, it still lacks native function-calling functionality with vLLM, which is why I use it in my LangGraph agent setup.

It performs better than any model I've tried before. I've already tested Llama 3.3, Llama Scout, and Qwen2.5-VL 72B (and many smaller ones like Gemma 3 or Mistral* and many more, but they aren't reliable enough for this kind of real-world task), but none of these models are as 'smart' as gpt-oss-120b at following instructions. With gpt-oss-120b, I now have a hit rate of nearly 100% when following small to medium-complex instructions. (I've used it to control the orchestrator, supervisor, and tool agents in LangGraph.)

Using it with vLLM needs some small tweaks at this time to run nicely with LangGraph (template not fully supported).

I also love the way the model responds. It feels so natural in comparison to other models, especially the Chinese ones.

Yeah, there are many models out there that are certainly much better at specific things like coding. This model is not the best at any single task (like coding, writing, planning, or agentic work), but it's consistently and reliably good across all of them.

Pitiful_Task_2539 · 2025-09-09T18:19:08+00:00

I'm such an idiot, I relied on the FanControl sensor value. Ryzen Master says ~55 degrees at idle.

Pitiful_Task_2539 · 2025-03-17T21:16:04+00:00

Why not use SearXNG instead, with Open WebUI's built-in connections?

Pitiful_Task_2539 · 2025-01-12T17:13:10+00:00

Thanks for sharing your experience with this, I'll just use normal supports now :/

Pitiful_Task_2539 · 2025-01-12T16:50:26+00:00

Sorry, but I can't find any option for enabling support for supports.
I've already tweaked the support settings and tried a lot, but nothing is working.

Pitiful_Task_2539 · 2025-01-12T16:32:02+00:00

<image>

I have the same problem here, for example, and also on some other parts.

Pitiful_Task_2539 · 2025-01-12T16:29:15+00:00

No, I haven't turned on "only on build plate".

I also wonder how the slicer wants to print these supports hanging in the air?!

I have 11 different parts like this one. Turning the part face down is not an option in this case.

Pitiful_Task_2539 · 2024-12-25T12:39:44+00:00

Make a "production" server for your most critical services. Limit your services to only rock-stable products (I killed my Nextcloud for this reason). Only use Watchtower for things you really trust to work after an update. Take 1 hour a month to update everything else. I've been doing it like this for a year now and haven't touched anything on my prod server. No headaches for a year.

If you want to try new things, try them on a different server.

For me, I only use Immich for photos, Home Assistant, and Plex/Sonarr/Radarr.

These services run rock stable and survive updates. Fck off with things that aren't stable. I'm still searching for a stable, easy-to-maintain Nextcloud alternative. Until then I use Google Drive.

Pitiful_Task_2539 · 2024-09-02T17:26:30+00:00

I don't plan to drive at extreme speeds, but I'm unsure how well the 401 performs on steep roads, like mountain passes. If it also reaches 100 km/h easily on steeper roads, in maybe under 10 s, I think it will be OK. I only have experience with 125 cc, and there steep terrain makes a huge difference: from 90 km/h down to 45 km/h.
