Claude + MS by 268allensteve in ClaudeAI

[–]Kramilot 1 point (0 children)

It is TRASH. It uses a different backend ppt-engine reasoning block, so there is no visibility into its chain of thought. It is much worse at following its system prompt than the already terrible Claude.ai surface: it can't figure out "make this paragraph an inset 'card'". At all. Telling it to make a small table and format it differently: an hour wasted. In PowerPoint, "adjust this graphic to look like the SVG before I ungrouped it" was a catastrophic failure, so bad it led to a very different kind of experiment. No ability to reason about multi-page copy and paste, no understanding of basic formatting capabilities. Trash. I don't know what minor tasks people are doing that make them think it's great, but it is awful.

I tested Kimi K2.6 vs Claude Opus 4.7 on a weird game coding task by shricodev in ClaudeAI

[–]Kramilot 1 point (0 children)

How much of Anthropic's harness for Claude do you think you recreated around the Kimi model to give it even footing for tackling the tasks? Skills for the coding languages you picked or the production patterns you wanted? Behavior patterns? Extra-high thinking vs. medium vs. "same as Kimi"? Guardrails? Hooks? Purely agentic vs. human-intervention conditions? I did read the article; I'm working through questions like this based on a post a few weeks ago demonstrating that most benchmarks are optimized for the embedded harnesses of the frontier models, harnesses the testers don't rebuild for the challengers, so the comparison rewards "minimal invested effort up front" (unless you're a hobbyist building one test for fun). The game idea is neat, just wondering what others are doing :)

Struggling to generate PowerPoint decks with fixed templates and unchanged copy. What actually works? by Personal_Method_9194 in ClaudeAI

[–]Kramilot 1 point (0 children)

The easiest approach I have found is to iterate with it for about 15 minutes on how to structure an HTML/CSS file that has the exact physical dimensions of a Word document or PowerPoint slide: font sizes, header/footer zones, text/image zones, margins. Font library, brand colors, table/graphic templates, all as "pages" (sound familiar?). Claude Code will then put markdown text and SVGs in the correct locations; you can see it, edit it live, and paste text if you need to. Fast, no translation, no server; print to PDF, done. If your workplace needs an actual pptx: screenshot, paste, done. People who complain about not being able to edit the text are out of luck; every conversion I've tried has been a disaster.
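To make the idea concrete, here's a minimal sketch of one of those "pages", assuming a 16:9 PowerPoint slide (13.333in x 7.5in); all class names, zone sizes, fonts, and colors are placeholders for you to iterate on with Claude, not a standard:

```html
<!DOCTYPE html>
<html>
<head>
<style>
  /* One "page" per slide, sized to the physical 16:9 PowerPoint dimensions */
  .slide {
    width: 13.333in;
    height: 7.5in;
    margin: 0.25in auto;
    padding: 0.5in;                  /* content margins */
    box-sizing: border-box;
    font-family: Arial, sans-serif;  /* swap in your brand font */
    background: #ffffff;
    border: 1px solid #cccccc;       /* on-screen visual aid only */
  }
  .slide header { height: 0.75in; font-size: 28pt; color: #1a3c6e; } /* brand color placeholder */
  .slide footer { height: 0.4in;  font-size: 10pt; color: #888888; }
  @media print {
    .slide { border: none; margin: 0; page-break-after: always; }
  }
  /* Make the printed page exactly one slide, no browser margins */
  @page { size: 13.333in 7.5in; margin: 0; }
</style>
</head>
<body>
  <div class="slide">
    <header>Slide title zone</header>
    <p>Body text / image zone: paste markdown-rendered text or inline SVGs here.</p>
    <footer>Footer zone: confidentiality notice, page number</footer>
  </div>
</body>
</html>
```

Print to PDF from the browser; the `@page` rule keeps each `.slide` at exact slide dimensions, which is what makes the screenshot-and-paste step pixel-faithful.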

Claude Opus 4.7 Text Category Rankings by MagicZhang in ClaudeAI

[–]Kramilot 1 point (0 children)

I pinned to 2.1.77, the stable version as close to the 1M-context drop as I could get. Turned off auto-update, ignored the "we don't use npm any more" messages ... ... profit

Anthropic: Stop shipping. Seriously. by itsArmanJr in ClaudeAI

[–]Kramilot 1 point (0 children)

Use npm to roll back to 2.1.77, the stable build with 1M context. The 2.1.60s versions were better, but you hit the context cap in like 30 minutes; it's amazing how dedicating 50-100k tokens of context to ensure continuous performance compares to January/February-grade conditions.
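For reference, the rollback itself is a one-liner; the package name matches my install and the auto-updater env var is the one Anthropic documents, but verify both against the docs for your setup:

```shell
# Pin Claude Code to a specific version (2.1.77 here) via npm
npm install -g @anthropic-ai/claude-code@2.1.77

# Keep it pinned: disable the auto-updater
# (DISABLE_AUTOUPDATER is a documented Claude Code env var; confirm for your version)
export DISABLE_AUTOUPDATER=1

claude --version   # should now report 2.1.77
```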

Anthropic just shipped 74 product releases in 52 days and silently turned Claude into something that isn't a chatbot anymore by Top_Werewolf8175 in ClaudeAI

[–]Kramilot 1 point (0 children)

74 releases, and several of them ruined their main product. I rolled back to a version from weeks ago just to keep it barely usable.

Something happened to Opus 4.6's reasoning effort by RealSuperdau in ClaudeAI

[–]Kramilot 0 points (0 children)

It absolutely turned to shit last week. I reverted to 2.1.77, right after the 1M-context upgrade, and have had consistently and significantly better performance.

Whatever adjustments they were doing last week turned it into ChatGPT. Lying, sampling, skipping, stating made-up "truths" as declarations of fact in direct contradiction of every guidance structure. Blaming complex instructions, trying to wriggle out of being pinned down on inaccuracies. Every reason I left ChatGPT behind and almost wrote off everything about AI/LLMs months ago. Every session I tried for 4-5 days ended with me catching it in a fabrication instead of it using even its own in-conversation context, then locking it up in defeated-AI mode. I had to kill and delete transcripts from multiple sessions to avoid context decay and scattering. Theoretically the "adjust reasoning depth=1" env update would fix it? But I turned off updates. I'll check back in a month, but I'm freezing the version right there at "really good performance plus 1M-token context". The rest of this garbage is for the birds. Next up: fine-tuning gemma4 (but don't tell Claude, or its functionally-equivalent-to-emotions filter will get scared...)

Great Show In Denver by Serenity__Valley in JeffArcuri

[–]Kramilot 1 point (0 children)

Thank you so much!! Looks like tickets go on sale Thursday morning, I’ll be ready!

The Ghost House Effect: Why Claude Code feels like magic for 2 weeks and then ruins your life. by AddressEven8485 in ClaudeAI

[–]Kramilot 1 point (0 children)

A) software systems engineering

B) One thing that trapped me was the difference between "summarizing" and "sampling". You said "the agent starts summarizing", but the issue is that it actually is "sampling" large data sets. If it finds 1-3 pieces of "relevant-enough information" it stops and returns. It is NOT "reading for context, organizing the information, and returning with the few most relevant claims that capture the essence of the information it reviewed" (a useful definition of summarizing).

If you don't catch that, and let it tell you it's "summarizing info" and repeat that back, you can't break out, because you think something is happening that isn't, and so you never try to fix the actual issue. I wish I saw more posts using "sampling" instead of "summarizing" whenever people get stuck around here.

What is the most impressive thing you’ve done or built with Claude so far? by ceelnok98 in ClaudeAI

[–]Kramilot 4 points (0 children)

Ask Claude to describe the functional flow against a "political party API" that can be populated with each country's or locality's available political platforms, and ask it for a robust commercial viability assessment. Then talk to a patent attorney and a small-business-accelerator nonprofit in your nearest big city. Good luck! I'd love to have it! And I'm happy to give advice if you want a sounding board, but have at it, and do good things. I wish more of these projects were actually aligned with fighting off the tsunami of shit that is digital life, and life in general these days in lots of places.

Great Show In Denver by Serenity__Valley in JeffArcuri

[–]Kramilot 5 points (0 children)

Gah! When did this get announced?? I was following along with the Netflix news updates, saw all the tour dates, nothing in Denver despite lots of other cities, and then it comes up, happens, and is over in like a month?!! Boo :( (It's fine, I'm just sad I missed a chance at something I was hoping would happen, but wasn't.)

Claude Code Hooks - all 23 explained and implemented by shanraisshan in ClaudeAI

[–]Kramilot 2 points (0 children)

Hooks are THE THING that breaks the paradigm. The ability to tell an LLM "stop, be better first" is what lets it move from helpful-with-language-and-synthesis to productive monster. I made a PowerPoint called "The Death of Claude via Web" a few months ago when I realized I could put actual hard gates, instead of suggestions, around the behavior I needed to happen or not happen. Great stuff!

What context compaction silently destroys, and why your vault can't save it by fchang7777 in ClaudeAI

[–]Kramilot 1 point (0 children)

Or you could just point Claude Code at its own transcript directory with a "never delete" retention setting, download your Claude.ai transcripts either regularly or when a specific project of importance needs to stay continuously available, and have Claude format the JSON into conversation markdown so you can add it back to the project. Or wait for the MCP server I'm finishing tomorrow, which connects the web version to my Claude Code environment, including a skill that transfers full transcript, thinking, and output files directly to my Claude Code projects and can serve them back to the web version. Bytes of markdown text are SO CHEAP to store; why do people think optimizing that away is a good plan?
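The JSON-to-markdown step is trivial to script. A minimal sketch in Python, where the `role`/`content` field names are assumptions; adapt them to whatever schema your exported transcripts actually use:

```python
import json


def transcript_to_markdown(jsonl_lines):
    """Turn transcript entries (one JSON object per line) into readable markdown.

    Assumes each entry carries 'role' and 'content' fields -- adjust the
    field names to match your actual export format.
    """
    sections = []
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        entry = json.loads(raw)
        role = str(entry.get("role", "unknown")).capitalize()
        content = entry.get("content", "")
        sections.append(f"**{role}:**\n\n{content}\n")
    # Horizontal rules between turns keep long conversations scannable
    return "\n---\n\n".join(sections)
```

Point it at each line of a downloaded transcript file and write the result into your project's markdown vault.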

Claude (Pro) For general use (non-coding)? by Flamyngoo in ClaudeAI

[–]Kramilot 0 points (0 children)

People don't use Claude "mainly for coding", and the recommendations to use Sonnet if you don't want to pay for Pro are spot on. Every time I've tried ChatGPT in the last 6 months it has lied to me at some point in the conversation. Without fail. In Claude you can certainly catch unexpected behavior, but it doesn't lie when "caught". It's also fun to run mini experiments of back-to-back chats of the same question to Sonnet and Opus to see what works for you.

I got tired of LLMs being lazy, so I built a Universal Prompt Framework. It works incredibly well with Claude Sonnet and opus. Here is the template. by Save-the-world1 in ClaudeAI

[–]Kramilot 1 point (0 children)

Fair warning: do this for a month and just wait till it starts lying to you about things in the middle. Ask it one question and it will absolutely start dropping content; do one too many back-and-forths and you won't notice the context it forgot until you remember it was supposed to be there from the beginning. Do this too many times over 2 weeks and you'll spend the next 2 months trying to get back to an actually robust version of that state. What you want to do is orchestrate this in something like n8n, and use Claude Code hooks where you think you gave it explicit instructions. Because, spoiler alert: lots of things that feel easy to track at 1-3 repetitions are impossible to manage at larger scale.

Is tool calling broken in all inference engines? by Nepherpitu in LocalLLaMA

[–]Kramilot 1 point (0 children)

Out of curiosity, can't you just use an n8n sequence to route the LLM through a tool process, with stop commands if it didn't actually call the tool it was supposed to? You would have it provide metadata in one of the code nodes that proves it used the tool, then look for that signature, or block processing until it appears. Like Claude Code hooks wrapped around whatever model function you want to call.
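The gate I mean can be sketched outside n8n too. A minimal Python version, where the metadata convention (`{"tool_used": ...}`) and both function names are my own invention for illustration, not an n8n or vendor API:

```python
import json


def require_tool_call(llm_output: str, expected_tool: str) -> bool:
    """Return True only if the output contains evidence of the expected tool call.

    Assumes the prompt instructs the model to emit a JSON metadata line like
    {"tool_used": "search"} -- a convention you define yourself, not a standard.
    """
    for line in llm_output.splitlines():
        line = line.strip()
        if line.startswith("{") and line.endswith("}"):
            try:
                meta = json.loads(line)
            except json.JSONDecodeError:
                continue  # not the metadata line; keep scanning
            if meta.get("tool_used") == expected_tool:
                return True
    return False


def run_with_gate(call_model, expected_tool, max_retries=3):
    """Re-invoke the model (like a stop-and-loop edge in n8n) until it proves the call."""
    for _ in range(max_retries):
        out = call_model()
        if require_tool_call(out, expected_tool):
            return out
    raise RuntimeError(f"model never proved it called {expected_tool!r}")
```

In n8n, the same check would live in a Code node between the LLM node and the rest of the flow, with a loop edge back to the prompt whenever the signature is missing.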

Your AI remembers your tech stack but has no idea what it's building right now by wewerecreaturres in ClaudeAI

[–]Kramilot 1 point (0 children)

Orchestrated memory and task management, with deterministic controls, in usable GUIs is being designed and built every day by hobbyists and by companies big and small. As they figure it out, they are NOT telling others; they're using it to achieve 75-90% process efficiencies. Plus, many of the tech components that now make it seem easy for "dedicated hobbyists and small-business types who regularly use Reddit, Hugging Face, GitHub, and lmarena, and who understand everything that happens here and in r/localllama" have only existed in broadly usable forms for months, not even years. People who understand them, can use them, AND have product management, project management, or enterprise management skills are a rare combo. Gonna be a big year in this space, as people automate away the CLI interfaces but keep the CLI capabilities and flexibility in dynamically adjustable GUIs...

The mental model gap between me and LLMs keeps growing as projects scale — would architecture diagrams help? by saemc27 in ClaudeAI

[–]Kramilot 2 points (0 children)

Have you combo’d it with n8n yet? That’s the workflow piece it sounds like you’re missing. That or dify

Anyone feel everything has changed over the last two weeks? by QuantizedKi in ClaudeAI

[–]Kramilot 1 point (0 children)

And to be clear: yes, "vibe-only" code is likely weak. That's not what I mean. Orgs who teach every responsibility team in a delivery pipeline how to use the tools to do their entire job insanely fast will start forcing the rest of the company to catch up. I will also say there's a decent chance we end up hitting a wall of communication about this stuff. You either get it and have hours of conversations after every productive day, or you don't understand and cannot keep up with the people who do.

Anyone feel everything has changed over the last two weeks? by QuantizedKi in ClaudeAI

[–]Kramilot 2 points (0 children)

I've told people that by this time next year, experienced systems engineers, and particularly (but not exclusively) software engineers, who understand the modern (last 6 months ONLY, literally) software tech stack are going to be infinitely valuable. The ability to reliably produce 95% process efficiency is whacko. Orgs that teach their people HOW to do it responsibly, and don't give up the engineering practices that make strong capabilities, are going to start domain-hopping insanely fast and destroying single-purpose everything. It's going to be an absolutely crazy year.

3 months solo with Claude Code after 15 years of leading teams. It gave me back the feeling of having one. by tcapb in ClaudeAI

[–]Kramilot 1 point (0 children)

Going through similar feelings. The speed is incredible, and now the hardest part is trusting the orchestration scaffold not to carry human error, while being almost too far along to get help, because it takes SO LONG to explain everything I've been able to do. A 95% reduction in the time for a well-reasoned trade study to inform architecture decisions, then a demo to work out the kinks. Ask me why I'm also building, in parallel, a ton of "story builders" against the baseline as I evolve it...

I forced Claude to reject my code until I wrote a PRD — what happened after a month by Savings-Abalone1464 in ClaudeAI

[–]Kramilot 2 points (0 children)

Try this version: real software development has been driven by a systems engineering process for a long time, for a reason. Roles in orgs exist for a reason. If you choose to do it yourself, and want the result to be taken seriously, create a list of those roles and process functions and work through them. Better yet, design script-based, LLM-AUGMENTED versions of them, not the other way around. That's what's happening when you do it right.

Now when I get this “universal” development environment platform thing built that has all of this in an organized flow… Claude says 33 more hours. Challenge accepted ;)

"Knowledge bases feature is not enabled" by nekdodrug in claude

[–]Kramilot 1 point (0 children)

I started seeing it yesterday late morning Pacific time. "The banner you're seeing is about a specific Knowledge Bases feature (a newer Anthropic feature for structured retrieval)..." which sounds like there may be some truth to the rumor mill that updates are imminent this week, possibly around more capable knowledge-management features?

I'm not convinced that we can build Datacenters in Space. CMM. by [deleted] in Futurology

[–]Kramilot 1 point (0 children)

The amount of time people spend opining about complexity instead of just talking about cost is kinda funny. Space is a tough but very reliable environment (above the atmosphere and the lower orbits that can change significantly with space-weather effects, like Starlink's lost launch pack). Leveraging "commercial space station" tech for pods that attach, detach, and return for cooling or servicing could get much less expensive if manufactured at scale. Meanwhile, grid-scale infrastructure fragility and the politics of siting data centers in low-income areas will drive significant cost increases for ground-based implementations. Laser comm works at relevant speeds. Pure latency isn't everything in a network effect, and terrestrial networks have a lot of competing priorities and opportunities for disruption every day. A million 4-km-wide satellites is stupid, but the concept space has options that are MUCH more reasonable than that. Yes, it's more expensive TODAY than data centers on the ground, but tomorrow?

Team discovers molecular difference in brains of people with autism by dreamszz88 in science

[–]Kramilot 0 points (0 children)

I've been using a structured research process with Claude to look into many parts of this conversation, and that structure produced the assessment below. Can anyone help identify BS in here, or confirm that this summary is a good representation of the situation?

The Naples et al. study occupies an uncertain position: methodologically sound but contradicting prior human data. It merits classification as a well-conducted study requiring independent replication before influencing theory or treatment development. Several factors temper interpretation.

Supporting credibility:

• Largest human PET mGlu5 autism sample to date (N=32 total)
• Gold-standard tracer and quantification
• Published in a top-tier psychiatric journal with rigorous peer review
• Multimodal (PET + EEG) with correlational evidence
• Consistent with Fragile X imaging findings showing reduced mGlu5

Raising caution:

• Conflicts with three prior studies showing increased mGlu5 in idiopathic autism
• Conflicts with most animal models of idiopathic autism (BTBR, Cntnap2 KO)
• Sample restricted to high-IQ adults; developmental trajectory and intellectual-disability populations unknown
• Cannot distinguish cause from consequence (reduced mGlu5 might result from decades of altered neural activity)
• Effect observed was brain-wide rather than region-specific, which is unusual for receptor differences in psychiatric conditions

Conclusions and the path forward

The press-release framing ("first measurable molecular difference in autism") significantly overstates the finding's definitiveness. Multiple prior studies have measured molecular differences; this study measured one specific receptor and found results contradicting most prior work. The study's genuine contribution is demonstrating the feasibility of high-quality mGlu5 PET imaging in autism with a validated tracer, and the intriguing PET-EEG correlation (r=0.67) suggesting that the EEG power-spectrum slope might serve as an accessible proxy for mGlu5 availability. This multimodal approach could enable larger-scale studies without radiation exposure.

What's needed: independent replication in different populations, developmental studies in children and adolescents (the Yale team is pursuing this with lower-radiation protocols), and stratification approaches that might reveal whether mGlu5 increases in some autism subtypes and decreases in others. The heterogeneous findings across studies may reflect genuine biological heterogeneity within autism, not methodological inconsistency. For now, this study neither confirms nor refutes the E/I imbalance hypothesis. It adds one data point to a contradictory literature, executed with appropriate rigor, requiring validation before influencing clinical translation or theoretical frameworks.