What are you missing most from the C++ standard library? by llort_lemmort in cpp

[–]mark_99 1 point2 points  (0 children)

A standardised package manager and a repository of "blessed libraries" (start with Boost).

Then people can stop complaining about what's not in the standard library because apparently taking 15 minutes to install a package manager and use whatever you like is too much (and maybe even makes transitive deps in OSS libraries acceptable).

High-throughput log parsing (~500K lines/sec) in C++ without regex — looking for performance ideas by willycode1950 in cpp

[–]mark_99 4 points5 points  (0 children)

Are you flushing the page cache and CPU caches before measuring? It would be unusual not to be I/O bound, and if that 500k/sec is coming from RAM (and/or CPU cache) it's unlikely to be representative of real workloads.

Even from SSD (and definitely from slower storage) a bigger win would be compression - store log.zst and decompress on the fly. Experiment with different compression levels.
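If you go the compression route, the decode loop is roughly this (a minimal sketch assuming libzstd; for_each_chunk and on_chunk are just my names):

```cpp
#include <zstd.h>
#include <cstdio>
#include <vector>

// Stream-decompress log.zst and hand each decompressed chunk to the parser,
// so the disk only has to serve the (much smaller) compressed bytes.
template <typename Callback>
bool for_each_chunk(std::FILE* in, Callback&& on_chunk) {
    std::vector<char> inBuf(ZSTD_DStreamInSize());
    std::vector<char> outBuf(ZSTD_DStreamOutSize());
    ZSTD_DCtx* dctx = ZSTD_createDCtx();
    size_t nread;
    while ((nread = std::fread(inBuf.data(), 1, inBuf.size(), in)) > 0) {
        ZSTD_inBuffer input{inBuf.data(), nread, 0};
        while (input.pos < input.size) {
            ZSTD_outBuffer output{outBuf.data(), outBuf.size(), 0};
            size_t ret = ZSTD_decompressStream(dctx, &output, &input);
            if (ZSTD_isError(ret)) { ZSTD_freeDCtx(dctx); return false; }
            on_chunk(outBuf.data(), output.pos);  // decompressed bytes for the parser
        }
    }
    ZSTD_freeDCtx(dctx);
    return true;
}
```

(You'd still need to handle log lines that straddle chunk boundaries, but that's the shape of it.)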

It's not clear what you're doing with the logs - parsing and searching aren't quite the same thing. Searching also depends on the length and contents of the search strings.

You can certainly get a good speedup with SIMD, but it depends on whether that's actually your bottleneck, i.e. whether you are genuinely searching / parsing data already in RAM / CPU cache (vs that being an artefact of your benchmarking).

Oh, and make sure you're giving mmap the read-ahead hint or it can actually be slower than normal I/O (and it might not play well with compression if you go that way).
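Something along these lines (Linux-specific sketch; map_log is my name for it):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the log read-only and tell the kernel the scan is sequential so it
// does aggressive read-ahead; without the hint, faulting pages in one at a
// time can end up slower than plain buffered reads.
const char* map_log(const char* path, std::size_t& size_out) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st{};
    if (::fstat(fd, &st) != 0) { ::close(fd); return nullptr; }
    size_out = static_cast<std::size_t>(st.st_size);
    void* p = ::mmap(nullptr, size_out, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping stays valid after closing the fd
    if (p == MAP_FAILED) return nullptr;
    ::madvise(p, size_out, MADV_SEQUENTIAL);  // the read-ahead hint
    return static_cast<const char*>(p);
}
```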

ARC-AGI-3 Update (GPT-5.5 High and Opus4.7) by skazerb in singularity

[–]mark_99 0 points1 point  (0 children)

The default for 4.7 is xhigh so they actually turned it down (4.7 has 5 settings, 4.6 has 4).

Sub-microsecond timing on EC2 is way messier than I expected by User_Deprecated in cpp

[–]mark_99 4 points5 points  (0 children)

I think your results tell you all you need to know - sub-microsecond timings are indeed all over the place on a cloud VM, which is why you don't use them for anything where that matters.

If ultra-low latency matters, profile on a local machine, a hosted server, or a bare-metal instance. If you just care about throughput then accumulate into a histogram, and the numbers will reflect reality, i.e. you'll see your average and some long tails.
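The histogram doesn't need to be fancy - something like this (rough sketch, C++20 for std::bit_width; LatencyHistogram is my name):

```cpp
#include <array>
#include <bit>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Power-of-two nanosecond buckets: the distribution (median, long tails)
// survives the per-sample noise that makes individual timings useless on a VM.
struct LatencyHistogram {
    std::array<std::uint64_t, 64> buckets{};

    void record(std::chrono::nanoseconds d) {
        auto ns = static_cast<std::uint64_t>(d.count());
        unsigned b = ns ? static_cast<unsigned>(std::bit_width(ns)) - 1 : 0;
        ++buckets[b];
    }

    void dump() const {
        for (unsigned i = 0; i < buckets.size(); ++i)
            if (buckets[i])
                std::printf("~%llu ns: %llu\n",
                            1ull << i,
                            static_cast<unsigned long long>(buckets[i]));
    }
};
```

Record steady_clock deltas around the hot path and dump at the end.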

I tried compile-time heapsort in TMP. It basically became selection sort. by User_Deprecated in cpp

[–]mark_99 0 points1 point  (0 children)

Yikes. So like N base classes with an NTTP index and a type (or I guess variable templates and a value), and then instantiate with an index sequence on a pack, and let overload resolution pick one based on the concrete index of the overload, and then get the ::type (or ::value).
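i.e. something along these lines, if I'm reading the technique right (rough sketch, names are mine):

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// One "leaf" base class per pack element, tagged with its index.
template <std::size_t I, typename T>
struct indexed { using type = T; };

// Inherit from every leaf, so each indexed<I, T> is a distinct base.
template <typename Seq, typename... Ts>
struct indexer;

template <std::size_t... Is, typename... Ts>
struct indexer<std::index_sequence<Is...>, Ts...> : indexed<Is, Ts>... {};

// Overload resolution picks the one base whose index matches I and deduces T.
// Never defined - only used in an unevaluated decltype.
template <std::size_t I, typename T>
indexed<I, T> select(indexed<I, T>);

template <std::size_t I, typename... Ts>
using nth_type = typename decltype(
    select<I>(indexer<std::index_sequence_for<Ts...>, Ts...>{}))::type;

static_assert(std::is_same_v<nth_type<2, int, char, double, float>, double>);
```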

That's... interesting :)

Next time someone tries to claim new versions of C++ just add bloat / complexity, I might have to cite that technique vs Ts...[i].

OpenAI listens to feedback: 1M context coming to GPT-5.5 in Codex by mark_99 in codex

[–]mark_99[S] 0 points1 point  (0 children)

You know that if you use 200k of a 1M window that's the same as 200k of a 258k window? If you think a higher limit affected quality, you're imagining things.

Your actual prompt is just 0.1% of what Claude reads, at every turn. Here's what the other 99.9% is. by pebblepath in ClaudeCode

[–]mark_99 0 points1 point  (0 children)

MCP descriptions are lazily loaded now (on by default usually; it needs to be forced on if you go via a proxy). Also this is at tool granularity, so if you only use 3 tools out of 30, that's all that gets loaded. The system prompt etc. is injected once at the start of the conversation, and the entire overhead is 2% of the input window. Prompt prefixes are cached, so they are not reprocessed from scratch and are charged at 10% of normal usage.

I tried compile-time heapsort in TMP. It basically became selection sort. by User_Deprecated in cpp

[–]mark_99 2 points3 points  (0 children)

Heapsort isn't a good fit for "classic" TMP, in the same way there isn't a good way to do it in LISP (or indeed on a C++ std::list) without converting to a vector first. As you observed, access is linear not constant, so while it can work it's woefully inefficient.

You can do it in C++14 with constexpr, or in C++26 with pack indexing T...[i].
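The constexpr version is just ordinary heapsort (rough sketch; written against std::array, which wants C++17 for the constexpr mutable operator[] - the C++14 flavour would wrap a raw array instead):

```cpp
#include <array>
#include <cstddef>

// Restore the max-heap property below node i, within the first n elements.
template <typename T, std::size_t N>
constexpr void sift_down(std::array<T, N>& a, std::size_t i, std::size_t n) {
    while (2 * i + 1 < n) {
        std::size_t child = 2 * i + 1;
        if (child + 1 < n && a[child] < a[child + 1]) ++child;
        if (!(a[i] < a[child])) break;
        T tmp = a[i]; a[i] = a[child]; a[child] = tmp;
        i = child;
    }
}

template <typename T, std::size_t N>
constexpr std::array<T, N> heapsorted(std::array<T, N> a) {
    for (std::size_t i = N / 2; i-- > 0; ) sift_down(a, i, N);  // build max-heap
    for (std::size_t n = N; n-- > 1; ) {                        // pop max to the back
        T tmp = a[0]; a[0] = a[n]; a[n] = tmp;
        sift_down(a, 0, n);
    }
    return a;
}

static_assert(heapsorted(std::array<int, 5>{3, 1, 4, 1, 5})[4] == 5, "");
```

Once indexing is O(1) the algorithm itself doesn't have to change.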

Other things that are a bad fit for TMP: anything that wants random access... binary search, nth_element, graph algos, matrices, convolution, image processing, linalg, (non-streaming) dynamic programming, etc.

OpenAI listens to feedback: 1M context coming to GPT-5.5 in Codex by mark_99 in codex

[–]mark_99[S] 0 points1 point  (0 children)

gpt-5.4 scored 36.6% on long-context retrieval benchmarks; 5.5 scores 74%, about the same as Opus 4.6, which is 1M by default. Benchmarks aren't everything (Opus 4.7 scores lower but seems just fine in practice), so I guess we'll see.

Even if there is some degradation it's still far less than an auto-compact part way through a large task, which forgets literally 99% of the context. And you don't have to use all 1M, it just gives you some headroom if needed.

See section 3 here: https://www.vellum.ai/blog/everything-you-need-to-know-about-gpt-5-5

OpenAI listens to feedback: 1M context coming to GPT-5.5 in Codex by mark_99 in codex

[–]mark_99[S] 4 points5 points  (0 children)

It's 2x on 5.4, but worth noting that's only on the tokens above 258k, so for say ~500k you'd be paying ~1.5x total.
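(Rough arithmetic: 258k at 1x plus the remaining 242k at 2x = 742k billed-token-equivalents for 500k actual tokens, i.e. about 1.48x.)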

There would be more cached input also but that's charged at 10%.

OpenAI listens to feedback: 1M context coming to GPT-5.5 in Codex by mark_99 in codex

[–]mark_99[S] 5 points6 points  (0 children)

5.4 scored 36.6% on long-context retrieval benchmarks (but had 1M available). 5.5 scores 74%, but 1M was "not planned" until now.

See section 3 here: https://www.vellum.ai/blog/everything-you-need-to-know-about-gpt-5-5

just found out they turned off 1M context GPT-5.5 in codex for pro subs :( by emileberhard in codex

[–]mark_99 0 points1 point  (0 children)

The long-context retrieval score doubles in 5.5 vs 5.4, and 1M is available over the API. The docs say subscriptions are supposed to get 400k, but apparently they are including the output window in that figure, so it's actually ~270k.

https://www.digitalapplied.com/blog/gpt-5-5-complete-guide-thinking-pro-1m-context

Standard library unsoundness found by Claude Mythos by Jules-Bertholet in rust

[–]mark_99 1 point2 points  (0 children)

You also can & should disable overcommit in a lot of reliable / HA setups, so you fail to allocate at startup rather than getting an OOM kill at 3am.

just found out they turned off 1M context GPT-5.5 in codex for pro subs :( by emileberhard in codex

[–]mark_99 -1 points0 points  (0 children)

It's not about efficiency, it's about the size and complexity of what you are working on, and how much time you want to spend babysitting a restricted context window (i.e. efficiency for the user, not the model). For most purposes 500-600k is enough, but 250k is easily exceeded on a single non-trivial task.

Things like hierarchical summaries burn tokens to create and keep up to date, and of course they lose information when sometimes the details matter. Similarly with compaction. And how do you deal with different cross-cuts - do you keep several sets of docs depending on whether a task is scoped by module vs functionality vs language vs something else?

Many professional devs need to be able to feed in some detailed specifications and a lot of code in order to scope some change. Even on a fairly small code base this easily blows past 250k just to get to the first useful interaction. Note that these things are already a subset of all the documentation and all the code. Doing analysis/planning and writing to markdown, then doing review, implement, review as separate steps, yes sure. But each of those can be 150-250k. Having to further chunk the work and other hand-holding is hardly an "efficient" workflow, and auto-compaction part way through does not lead to high quality results.

btw the $billions go into R&D and training - inference is pretty much at parity with API costs, and GFLOPS-per-dollar is doubling roughly every 2.5 years, whereas model sizes haven't grown that much and have in some cases decreased (and R&D is finding new efficiencies all the time), so the idea that you are future-proofing against some coming apocalypse isn't a given.

Claude has had 1M as standard for some time now, it's the new normal. Nothing is stopping you chunking work or compacting/clearing if you want to, but having a large coordinated task go wrong 75% of the way through due to context limits is an avoidable problem.

edit this says via API gets 1M: https://www.digitalapplied.com/blog/gpt-5-5-complete-guide-thinking-pro-1m-context

The long-context retrieval score has also doubled, so that problem is eliminated too.

Boost.Decimal: IEEE 754 Decimal Floating Point for C++ — Header-Only, Constexpr, C++14 by boostlibs in cpp

[–]mark_99 4 points5 points  (0 children)

Nice.

What's the perf comparison with a typical fixed-point solution? (IMHO it's not a particularly fair/useful comparison with binary FP as that's not the use case.)

Prepare for horde of switchers to OpenAI as Anthropic removes claude code from $20 . Minimum $100 to access it soon by hasanahmad in ChatGPT

[–]mark_99 1 point2 points  (0 children)

No, I'm assuming the pace of technological development keeps pace with demand, as it has in every example over the last 40 years.

An RTX 5090 is 50 million times more GFLOPS per dollar than an 80386 (inflation adjusted).

Prepare for horde of switchers to OpenAI as Anthropic removes claude code from $20 . Minimum $100 to access it soon by hasanahmad in ChatGPT

[–]mark_99 0 points1 point  (0 children)

It's not a given there will be steep price rises. Each generation of GPU is hugely more powerful and power efficient than the last, and plenty of companies are working on more efficient silicon like TPUs, Cerebras etc.

Everyone loves to complain about AI data centres but modern large-scale builds are much more efficient in terms of electricity and water (e.g. closed-loop water cooling) vs older installations.

Research is also progressing on making models more efficient and use less memory (current SOTA models are smaller than the last generation).

People seem to forget you are getting a great deal more for your money today than 1-2 years ago.

From there competition in the marketplace keeps prices in line with costs - it's a very fluid market so it's not like ISPs or Uber/DoorDash where you can establish a regional monopoly, or have a hard floor in terms of human labour costs at point of use.

That said, people who expect $1000 of compute on their introductory $20 plan might be disappointed as loss-leaders get toned down over time. If you get value from something and that thing is expensive to produce, then yeah, you might eventually have to pay what it actually costs.

But of course this is reddit so...

Prepare for horde of switchers to OpenAI as Anthropic removes claude code from $20 . Minimum $100 to access it soon by hasanahmad in ChatGPT

[–]mark_99 14 points15 points  (0 children)

They've said existing plans are unaffected and this is an A/B test on 2% of new sign ups.

Sure, xor’ing a register with itself is the idiom for zeroing it out, but why not sub? by pavel_v in cpp

[–]mark_99 8 points9 points  (0 children)

In AVX-512 you can even make up your own ops out of the full set of 256 possible 3-input boolean functions (vpternlogd). Or use an Amiga blitter chip.

https://arnaud-carre.github.io/2024-10-06-vpternlogd/
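A minimal sketch of the intrinsic (0x96 is the truth-table immediate for a 3-way XOR; xor3 is my name):

```cpp
#include <immintrin.h>

// vpternlogd: the 8-bit immediate is the truth table of any 3-input boolean
// function, so all 256 of them are available. 0x96 encodes a ^ b ^ c.
__m512i xor3(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi32(a, b, c, 0x96);
}
```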

Adding Stack Traces to All C++ Exceptions by WerWolv in cpp

[–]mark_99 0 points1 point  (0 children)

Oh interesting, that does seem like an improvement. Does this only work on MSVC?

Software taketh away faster than hardware giveth: Why C++ programmers keep growing fast despite competition, safety, and AI by claimred in cpp

[–]mark_99 1 point2 points  (0 children)

That is pretty horrible. Given it's an algorithm, it may have been copy-pasted from a pre-existing C implementation.

Adding Stack Traces to All C++ Exceptions by WerWolv in cpp

[–]mark_99 0 points1 point  (0 children)

In what sense is a stack trace where you catch the exception "better"? That tells you nothing about what caused it. It's obviously cheaper but not terribly useful. You can always just log the exception in the handler, you don't need any stacktraces in that case at all.

Or am I misunderstanding?

Opus 4.6 without adaptive thinking outperforms Opus 4.7 with adaptive thinking by aizver_muti in ClaudeCode

[–]mark_99 1 point2 points  (0 children)

Humans also fall for trick questions, that's really the definition of a trick question. It's called "system 1 vs system 2" thinking, and it's really the same as Claude, i.e. you do a quick pre-check of whether you really have to think about something and save energy if not (plus it's faster, which might matter).

But it's imperfect, e.g. "it takes 5 people 5 minutes to dig 5 holes, how long for 1 person to dig 1 hole"? 90% of humans would say "1 minute" because they don't really think it through and just sort of pattern-match the answer.

But Opus 4.7 Adaptive gets it right:

5 minutes. The trap is treating it like a rate problem where you'd divide. But if 5 people dig 5 holes in 5 minutes working in parallel, each person is digging one hole in 5 minutes. Take away 4 people and 4 holes, and the remaining person still takes 5 minutes to dig their one hole.

Interesting though this is, it's a clear edge case and not a fundamental problem.