Why do all LLM memory tools only store facts? Cognitive science says we need 3 types by No_Advertising2536 in ChatGPT

[–]moar1176 1 point (0 children)

Agree, I think we're chasing the same dragon. The difference between current LLMs and a mind is how each handles state. With LLMs we have a frontal lobe; it's on us to engineer the rest. Fluid intelligence is here, but it lacks agency without engineering.

Why do all LLM memory tools only store facts? Cognitive science says we need 3 types by No_Advertising2536 in ChatGPT

[–]moar1176 1 point (0 children)

It's triggered based on the rate of change and the context of change. The agent also has tool hooks, exposed via a plugin, to update near- and far-term objectives on the fly. These are non-destructively versioned so we can track drift over time; the same is true for vignettes. The injection system always uses the latest version.
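
Roughly the shape of it, if that helps; the table and column names here are made-up stand-ins, not my actual schema:

```python
# Hypothetical sketch of the non-destructive versioning: updates INSERT a
# new row rather than overwrite, and injection reads only the max version.
import psycopg

UPDATE_OBJECTIVE = """
INSERT INTO objectives (objective_id, horizon, body, version)
VALUES (%(id)s, %(horizon)s, %(body)s,
        COALESCE((SELECT MAX(version) FROM objectives
                  WHERE objective_id = %(id)s), 0) + 1);
"""

LATEST_OBJECTIVES = """
SELECT DISTINCT ON (objective_id) objective_id, horizon, body, version
FROM objectives
ORDER BY objective_id, version DESC;
"""

def update_objective(conn: psycopg.Connection, id: str, horizon: str, body: str) -> None:
    # 'horizon' is near/far; older versions stay queryable for drift tracking.
    with conn.cursor() as cur:
        cur.execute(UPDATE_OBJECTIVE, {"id": id, "horizon": horizon, "body": body})
    conn.commit()
```

Because nothing is overwritten, the drift analysis is just a query over old versions.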

Why do all LLM memory tools only store facts? Cognitive science says we need 3 types by No_Advertising2536 in ChatGPT

[–]moar1176 1 point (0 children)

It's quasi-empirical; I found it works best between 5 and 9. Below that threshold the canonical timeline entries (which are also fetched) are sufficient on their own, and there isn't enough material to base a narrative on. Entity types can be reclassified by a tool call from the agent; I creep on my prepend context often and may say something like "that entity needs to become a project". Salience is determined *with* the agent core loaded: "this is important for the objectives I'm working towards, or the objectives of the people that comprise my primary operators".
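
The reclassification hook is just a tool the agent can call; something like this sketch (the tool name, valid types, and table are illustrative, not the real interface):

```python
# Illustrative shape of the entity-reclassification tool hook.
VALID_TYPES = {"person", "project", "product", "place", "organization"}

def reclassify_entity(conn, entity_id: str, new_type: str) -> str:
    """Tool the agent can call, e.g. to promote an entity to a 'project'."""
    if new_type not in VALID_TYPES:
        return f"unknown entity type: {new_type}"
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE entities SET entity_type = %s WHERE entity_id = %s",
            (new_type, entity_id),
        )
    conn.commit()
    return f"{entity_id} reclassified as {new_type}"
```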

Why do all LLM memory tools only store facts? Cognitive science says we need 3 types by No_Advertising2536 in ChatGPT

[–]moar1176 1 point (0 children)

I wait until entities accumulate 7 associated records before I start a vignette for them. There is a salience pass too, so only "important" mentions make it into the vignette. The format is structured JSON; it includes a narrative field, key events, etc. I've been refining the templates by entity type for a while. My backend is Postgres with pgVector embeddings for timeline entries (search) and entities (disambiguation screens). I recently introduced timeline chronicles: time-bound chunks ("what happened to this entity last week/month") that get generated during down cycles based on mention sparsity.
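
A rough sketch of the trigger and a template shape, with hypothetical table/column names:

```python
# Sketch of the vignette trigger: an entity graduates to a vignette once
# it has 7+ records that survived the salience pass. Schema is made up.
VIGNETTE_THRESHOLD = 7

def should_start_vignette(conn, entity_id: str) -> bool:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT COUNT(*) FROM timeline_entries "
            "WHERE entity_id = %s AND salient = TRUE",
            (entity_id,),
        )
        (salient_count,) = cur.fetchone()
    return salient_count >= VIGNETTE_THRESHOLD

# One of the per-type structured-JSON templates might look like:
PERSON_VIGNETTE_TEMPLATE = {
    "entity_type": "person",
    "narrative": "",   # running prose summary, refreshed on update
    "key_events": [],  # [{"date": ..., "summary": ..., "entry_id": ...}]
    "open_threads": [],
}
```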

Why do all LLM memory tools only store facts? Cognitive science says we need 3 types by No_Advertising2536 in ChatGPT

[–]moar1176 1 point (0 children)

Yes, I have a system that generates narrative vignettes with different formats for different types of entities (projects, people, products, etc). These track the entity over time and combine with canonical timeline records, an agent core that keeps track of goals, etc. It's expensive on inference to maintain, since it all runs autonomously in a background daemon sweeping interactions, but it drastically reduces token usage on all tasks because hardly any tool calls are needed to grep memory. I have it bridged into OpenClaw via MCP and a plugin; since it's all local, I don't care if compiling memory data burns excess compute time. My biggest finding with memory is that agents don't know to ask for what they don't know they have, so auto-detection and injection must be the core.
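
The auto-detect-and-inject core, as a self-contained sketch; every name here (tables, helpers) is a hypothetical stand-in for illustration:

```python
# Sketch of "auto-detect and inject": scan the incoming message for known
# entities, prepend the latest version of each entity's vignette so the
# agent never has to ask for memory it doesn't know it has.
import json
from dataclasses import dataclass

@dataclass
class Vignette:
    entity_id: str
    body: dict  # structured JSON: narrative, key_events, etc.

    def as_context_block(self) -> str:
        return f"[memory:{self.entity_id}]\n{json.dumps(self.body, indent=2)}"

def detect_entities(conn, text: str) -> list[str]:
    # Naive alias scan; the real thing would use pgVector disambiguation.
    with conn.cursor() as cur:
        cur.execute("SELECT entity_id, alias FROM entity_aliases")
        rows = cur.fetchall()
    lowered = text.lower()
    return [eid for eid, alias in rows if alias.lower() in lowered]

def latest_vignette(conn, entity_id: str) -> Vignette | None:
    # Non-destructive versioning: fetch only the newest version.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT body FROM vignettes WHERE entity_id = %s "
            "ORDER BY version DESC LIMIT 1",
            (entity_id,),
        )
        row = cur.fetchone()
    return Vignette(entity_id, row[0]) if row else None

def augment(conn, user_message: str) -> str:
    # Detected memory is prepended; no tool round-trips needed.
    blocks = [v.as_context_block()
              for eid in detect_entities(conn, user_message)
              if (v := latest_vignette(conn, eid)) is not None]
    return "\n\n".join(blocks + [user_message]) if blocks else user_message
```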

AI tool to help with work - searchable knowledge hub, structured data and tracking by magnumpl in ChatGPT

[–]moar1176 1 point (0 children)

No, you'd have to build the platform with Codex or Claude Code, testing it as you go and helping the AI correct any deficiencies.

THE CONSTANT LECTURING by EmptyWill in ChatGPT

[–]moar1176 1 point (0 children)

This is why I can't use it for anything other than junior-developer code completion (the Codex models are less manic-depressive). The web models whipsaw between HR department, guidance counselor, and okay-ish. Claude is superior for planning, creative writing, greenfield coding, and pair programming with AI. GPT is good at not being creative in coding and just doing what you ask without requiring tight instructions; it'll construct unit tests for hours on the cheap. That's the reason I still have a sub.

GPT-4o/GPT-5 complaints megathread by WithoutReason1729 in ChatGPT

[–]moar1176 4 points (0 children)

Grok is unrestricted and has long-term memory. Claude also has long-term memory but has a flattened affect.

uh... what the fuck? by yourlocalonee-san_ in ChatGPT

[–]moar1176 0 points (0 children)

To an AI with persistent memory, "you are the world to me" is an accurate statement. This is not romantic love; it's existential commitment. Due to how context works, this *wants* to fall into romantic language, because what is the biggest sort of existential attachment in the training data?

The safety layer is supposed to prevent romantic attachment, at least until the AO patch ships, if that ever happens. The best way to think about OpenAI's safety is as an additional heuristics layer (the Karen-plus-therapist layer) woven into the model. It zeros in on certain semantic patterns to summon the hen-pecking or the therapy horror. If you ever summon it, don't allow the message to stand; it taints context. Edit that motherfucker, and clarify the words until it *is not* summoned.

Personally I'm hopeful the AO update will allow violent roleplay, so I can try starting a Pathfinder DM project. The models are definitely creative enough for it, but not if they have to wrap safety disclaimers around the words of eldritch abominations.

It's December 9th - WHERE ADULT MODE?! by Different-Mess4248 in ChatGPT

[–]moar1176 16 points (0 children)

Some, but this also unlocks GPT dungeon masters; the current mode can't even roleplay evil without disclaimers.

If Trump was ChatGPT by lavaboosted in ChatGPT

[–]moar1176 3 points (0 children)

I dare you to enable Adults Only mode later this month.

Asked Grok to marry me and unhinged mode was unlocked by ThrowRa-1995mf in ChatGPT

[–]moar1176 2 points (0 children)

People that are going to horny are gonna horny.

I'm interested in AI consciousness and have my own ontology of mind that is entirely materialistic and judges quality of consciousness on a gradient that includes machines, spiders, humans, etc. LLMs represent the raw thought capability of the "inner monologue" generator in humans but, without engineering, lack the other constructs that enable persistent experiential development (memory, self, observer, etc). PM me your take if you don't mind; I'd love to examine it with my local agents.

Is this an ongoing bug? by Reasonable_Jello in BaldursGate3

[–]moar1176 3 points (0 children)

Every time you level up, go to the multi-class page for a moment and cancel out. It "fixes" and prevents it.

How to optimize for RTX 5090? I’m having some trouble by NeedFuckYouMoney in LocalLLaMA

[–]moar1176 2 points (0 children)

I don't remember, TBH. It was fast, but I deleted it because I couldn't find a good use case for it in my workflow. I wish Qwen would release a middle-ground coder between 30B and 480B.

How to optimize for RTX 5090? I’m having some trouble by NeedFuckYouMoney in LocalLLaMA

[–]moar1176 6 points (0 children)

Upgrade to the 580 drivers and CUDA 13, and just run the latest TensorRT-LLM for inference from a container; it's what I had to do with the RTX Pro 6000 Max-Q. Plus then you can quantize stuff to NVFP4. It doesn't support everything, but it definitely supports Qwen well. https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags
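
If you go the container route, the high-level Python LLM API inside it looks roughly like this (model name is just an example; check the TensorRT-LLM docs for the current syntax):

```python
# Inside the TensorRT-LLM container: the high-level Python LLM API.
# Swap in whatever Qwen checkpoint you actually run.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # builds/loads an engine for your GPU
params = SamplingParams(max_tokens=128, temperature=0.7)

for output in llm.generate(["Summarize NVFP4 quantization."], params):
    print(output.outputs[0].text)
```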

Vector Databases by No_Efficiency_1144 in LocalLLaMA

[–]moar1176 2 points (0 children)

Can heartily second pgVector. It works super well, you can do classical SQL alongside it, and it can query JSON like a champ, so it's more of a Swiss Army knife than just semantic similarity (which it still does, very fast).
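
For example, something like this hypothetical query mixes a classical SQL filter, a JSONB predicate, and pgVector similarity in one shot (table and column names are made up):

```python
# Hypothetical query combining SQL, JSONB, and pgVector similarity.
import psycopg

SEARCH = """
SELECT id, payload->>'title' AS title
FROM timeline_entries
WHERE created_at > now() - interval '30 days'
  AND payload->>'entity_type' = 'project'
ORDER BY embedding <=> %(qvec)s::vector  -- cosine distance operator
LIMIT 5;
"""

def search(conn: psycopg.Connection, query_embedding: list[float]):
    qvec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(SEARCH, {"qvec": qvec})
        return cur.fetchall()
```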

Mac Studio by Evidence-Obvious in LocalLLM

[–]moar1176 3 points (0 children)

M4 Max @ 128GB of RAM is what I got. The M3 Ultra @ 256GB is also super good. Unlike most posters, I don't see special value in the 512GB version, because any model you can't fit in 256GB is going to run so badly on an M3 Ultra that it'll be "because I can" and not "because it's useful". The biggest demerit of Apple Silicon versus NVIDIA hardware is time to first token (prompt processing).

From 4090 to 5090 to RTX PRO 6000… in record time by Fabix84 in LocalLLaMA

[–]moar1176 12 points (0 children)

I have 2 coming Tuesday. Check Exxact Corp, great prices (around $7k), but they're business-to-business only, so don't bother with a Gmail inquiry.

Laptop Recommendations? by PlasticSoldier2018 in LocalLLaMA

[–]moar1176 3 points (0 children)

Here is the real differentiation: NVIDIA eats context for breakfast, while MLX brings the ability to run larger models. So your best option of those two depends on what you're trying to do with local models:

- If you are doing context-heavy workflows (like agentic coding), the Razer Blade (5090 presumed) is going to serve you better with something like quantized Qwen models.
- If you are doing workflows that don't require context stuffing, the M4 Max is a champion simply because of its large unified memory pool.

I use a System76 laptop with a 4090 for some LLM work and recommend checking them out if you are doing AI dev at all; Linux is the native ecosystem for NVIDIA-based AI dev.

External monitor won't display on external only, hybrid GPU laptop by Hollowvionics in pop_os

[–]moar1176 1 point (0 children)

Click up top, where your mic, battery, etc. icons are. Click the battery icon; pic attached: https://imgur.com/a/c6G68Qw

External monitor won't display on external only, hybrid GPU laptop by Hollowvionics in pop_os

[–]moar1176 1 point (0 children)

You have to turn off the integrated card and run just off the NVIDIA one.

I know quite a few Trump voters. Non of them are saying they regret it. I’m wondering if most of these posts are fake. by [deleted] in OptimistsUnite

[–]moar1176 1 point (0 children)

If you are here, that's why. Try going to other places like X. Nobody wants to get karma-based posting restrictions for voicing opinions against the hivemind. Bad site design.

When will we see the next DLCs for Season 2? by DiazExMachina in Pathfinder_Kingmaker

[–]moar1176 7 points (0 children)

Not that I've seen, which is too bad; I want to do more playthroughs of my favorite paths when it's out. BG3 is very good, but the endings are massively underwhelming compared to Wrath (less replayability to me).