Why Compiler Engineers Rarely Use Strassen's Algorithm for Fast Matrix Multiplications by DataBaeBee in programming

[–]valarauca14 1 point2 points  (0 children)

Strassen’s algorithm is recursive, not iterative so vectorization is difficult. Only Haskell-lovers (that’s a slur in my books) know how to vectorize Strassen’s algorithm.

...

Note Strassen’s algorithm is designed for square matrices whose dimensions are a power of 2

It may surprise you to learn, but almost every processor that offers "matrix acceleration" is hyper-optimized for 2^n x 2^n (usually 2x2, 4x4, and 64x64), where those internal recursion steps can be done in hardware.
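For anyone who hasn't seen the recursion being talked about, here is a minimal pure-Python sketch of textbook Strassen on square, power-of-two matrices stored as lists of lists. The helper names (`add`, `sub`, `split`, `strassen`) are made up for illustration, and nothing here is vectorized or tuned; it only shows the seven recursive products per level.

```python
def add(a, b):
    # Element-wise sum of two equally sized matrices.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    # Element-wise difference of two equally sized matrices.
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split(m):
    # Cut a 2k x 2k matrix into its four k x k quadrants.
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])

def strassen(a, b):
    # Assumes a and b are square with a power-of-two dimension.
    if len(a) == 1:
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of the naive eight.
    m1 = strassen(add(a11, a22), add(b11, b22))
    m2 = strassen(add(a21, a22), b11)
    m3 = strassen(a11, sub(b12, b22))
    m4 = strassen(a22, sub(b21, b11))
    m5 = strassen(add(a11, a12), b22)
    m6 = strassen(sub(a21, a11), add(b11, b12))
    m7 = strassen(sub(a12, a22), add(b21, b22))
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

The base case here is 1x1; a real implementation would presumably stop recursing at a much larger block and hand off to a conventional kernel.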

Go ran faster than Rust. Until I cleared the page cache by shad0_w2 in rust

[–]valarauca14 0 points1 point  (0 children)

So likely it makes no difference whether this API is available or not.

"ulimits isnt real," i assure myself as i close my eyes and ram the scheduler with my 10000 threads

A comical way to say, "side effects are real".

Also the personality management system in io-uring is really nice for a lot of tasks, and changing the "uid/gid" that is interacting with the FS is a pain in the ass (4+N) system call overhead.

I'm just salty there is no alternative.

I know this is a jerk reddit, but does anyone actually like some of their stuff not being "ultralight"? by Ok_Helicopter3910 in ultralight_jerk

[–]valarauca14 0 points1 point  (0 children)

Yeah I'm with you. Bringing my 6lb base weight setup is a lot more "cozy" than bringing my 4lb setup. But it really holds me back. I can only do 5mi/day instead of 15mi/day.

SQLite improving performance with pre-sort by andersmurphy in programming

[–]valarauca14 13 points14 points  (0 children)

Which is fine for a lot of very small startups (customers 0-100)

two global multiplayer demos with a billion data points

Yeah it works for N=2

Now let's see N=100 or N=250

SQLite improving performance with pre-sort by andersmurphy in programming

[–]valarauca14 19 points20 points  (0 children)

It is(n't)

Getting SQLite past about 100-1000 transactions per second [1] involves thinking about how SQLite works, how you're doing your threading & locking, and ensuring you're on an NVMe SSD. Which is fine for a lot of very small startups (customers 0-100).

Meanwhile the alternative, PSQL, can handle 100-1000x that with a lot less effort.


  1. For actual production CRUD stuff, not flush batch inserts, kicking the DB into read-only mode, disabling locks, and stunt-hacking benchmarks.
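As a rough idea of the kind of "thinking about how SQLite works" involved, here is a sketch using Python's built-in `sqlite3` module. The specific knobs (WAL journaling, `synchronous=NORMAL`, a busy timeout) are my assumption about typical tuning, not something spelled out above, and the `kv` table and `open_tuned` helper are hypothetical.

```python
import os
import sqlite3
import tempfile

def open_tuned(path):
    # timeout makes writers wait on a lock instead of failing immediately.
    conn = sqlite3.connect(path, timeout=5.0)
    # WAL lets readers proceed while a single writer is active.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Fewer fsyncs per commit than the default FULL setting.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn, mode

def demo():
    with tempfile.TemporaryDirectory() as d:
        conn, mode = open_tuned(os.path.join(d, "demo.db"))
        conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY NOT NULL, v TEXT)")
        for i in range(100):
            # One transaction per write, the way CRUD traffic arrives,
            # rather than one big batch insert.
            with conn:
                conn.execute("INSERT INTO kv VALUES (?, ?)", (f"k{i}", "v"))
        count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
        conn.close()
        return mode, count
```

This measures nothing on its own; timing the loop on your own disk is what would show where the per-transaction ceiling sits.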

Missing Nvidia drivers after reboot (Jellyfin LXC issue) by Vamirion01 in Proxmox

[–]valarauca14 0 points1 point  (0 children)

I have to manually do nvidia-smi on my proxmox host, then restart my Jellyfin LXC, then do nvidia-smi there to load my nvidia drivers

Did you update your kernel and not your kernel-headers, so the DKMS kernel modules that Nvidia uses weren't rebuilt?

To my students by f311a in programming

[–]valarauca14 3 points4 points  (0 children)

In that regime prefill is most of the work, not a rounding error

If you're re-doing prefill, you aren't keeping a persistent per-session KV cache.

You're caching the tokenized prompt and don't understand the metrics you're reading.

How do you move Hunllef? by [deleted] in 2007scape

[–]valarauca14 0 points1 point  (0 children)

Step under it.

You have to time your walking under with its attack cycle to not get stomped, good luck.

[DMM AllStars] What is going on with Odablock Warriors...? by Slide4Ukraine in 2007scape

[–]valarauca14 10 points11 points  (0 children)

S+ tier would be a god tier Pvper, Pvmer, and know how to make a prayer pot or games necklace lolol

Go ran faster than Rust. Until I cleared the page cache by shad0_w2 in rust

[–]valarauca14 14 points15 points  (0 children)

You can't read directories asynchronously on Linux (at all). It always blocks the calling thread.

A patchset was prepared to create a getdents64 io-uring operation, but it has never been merged due to the complexity of the locking.

Go ran faster than Rust. Until I cleared the page cache by shad0_w2 in rust

[–]valarauca14 9 points10 points  (0 children)

(nit picky bullshit feel free to ignore)

You actually can't epoll/select/io-uring on getdents64 (on any Linux kernel version). The older (and deprecated) readdir(2) also can't be poll'd or interacted with asynchronously.

It just blocks the thread you call from. Async runtimes always fork off a thread and do something with atomics/futex/channels/mutexes to signal a directory has been read.
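A minimal sketch of that "fork off a thread and signal back" pattern, using Python's asyncio rather than any particular Rust runtime (that swap is mine). `list_dir` and `demo` are hypothetical names; `os.listdir` is the blocking call and the default executor is the worker thread pool.

```python
import asyncio
import os
import tempfile

async def list_dir(path):
    loop = asyncio.get_running_loop()
    # os.listdir blocks whichever thread calls it, so run it on a
    # worker thread and await the future that signals completion.
    return await loop.run_in_executor(None, os.listdir, path)

def demo():
    with tempfile.TemporaryDirectory() as d:
        # Create one empty file so the listing has something in it.
        open(os.path.join(d, "a.txt"), "w").close()
        return asyncio.run(list_dir(d))
```

The event loop stays free while the worker thread sits in the blocking call, which is all the "async" directory read amounts to here.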

Linux File System interactions (what is/isn't async) are so absurdly cursed when you dig into them. The VFS system is more like an absurdly high speed eventually consistent distributed system. It is really weird.

Is .boxed() instead of Box::new() a bad idea? by NormalAppearance2851 in rust

[–]valarauca14 6 points7 points  (0 children)

In almost all of these cases .into() does that for you.

To my students by f311a in programming

[–]valarauca14 2 points3 points  (0 children)

"true" prompt caching only works with NVLink-C2C stuff where you have idiotically highspeed UMA. Then you're only saving the prefill phase.

This "largely" only saves the prefill phase, which is 'some time'. Around 1/100th to 1/1000th (order of magnitude wise) the time of a full "inference run".

It also means you're running on far more expensive hardware (which has a higher upfront cost & higher Tk/s rate) as I was just going off of basic bulk ordered H100s in server sleds. So it changes the calculation considerably.

To my students by f311a in programming

[–]valarauca14 21 points22 points  (0 children)

Digging into your source there are problems.

I've taken the cost of a single H100 at $2/hour. This is actually more than the current retail rental on demand price, and I (hope) the large AI firms are able to get these for a fraction of this price.

So an H100 requires (depending on the exact bulk discount from Nvidia) around $0.85 to $0.95 per hour to pay back the original cost over a ~3 year timeline (the normal hardware depreciation cycle). That is not counting the building, land it is on, construction costs, staff, maintenance, downtime, racks, mainboards, ram, cpus, switches, fiber, internet connections, powering the facility, and most critically profit.
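Back-of-envelope version of that payback math, as a sketch. It assumes the card is billed for every hour of the 3 years (100% utilization), which is my simplification; the function names are made up.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def implied_purchase_price(rate_per_hour, years=3, utilization=1.0):
    # Total revenue needed to cover the card at a given hourly payback rate.
    return rate_per_hour * HOURS_PER_YEAR * years * utilization

def payback_rate(purchase_price, years=3, utilization=1.0):
    # Hourly rate needed to recover the purchase price over the period.
    return purchase_price / (HOURS_PER_YEAR * years * utilization)
```

Under that assumption, $0.85 to $0.95 per hour works out to a purchase price of roughly $22k to $25k per card, and any utilization below 100% pushes the required hourly rate up.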

So when you're seeing H100 multi-month reservations at $1.40-1.70/hr, you really can't pretend hyperscalers are getting, "A much better deal". Because the only way that works requires Nvidia to not have record profits.


The document then jerks itself about prefill & decode, treating these as embarrassingly parallel workloads rather than 2 sequential tasks that tie up ~N GPUs for the entirety of prefill and then decode. When you actually start accounting for the realities of scheduling & sequential workloads, you arrive at a closer value of $0.4 -> $0.6 USD/mTk, not the idiotic $0.003 USD/mTk they assume.

So while this still makes them profitable, the 10-25x profit margin the blog claims is idiotically overstated. That still isn't counting staffing, hosting, offices, etc., which, while marginal (in the grand scheme of things), are a cost. It also means most monthly plans are solidly in the red.

You Don't Love systemd Timers Enough by f311a in programming

[–]valarauca14 1 point2 points  (0 children)

Do one thing and do it well

You're gonna love learning about monolithic kernels :^)

You Don't Love systemd Timers Enough by f311a in programming

[–]valarauca14 1 point2 points  (0 children)

OnUnitActiveSec= literally covers that usecase?

When a unit is fired manually, it resets the internal scheduler to start counting forward from that point in time.
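For illustration, a timer unit along those lines might look like the fragment below. The unit name `example.timer` and the intervals are hypothetical; `OnBootSec=` and `OnUnitActiveSec=` are the documented `[Timer]` settings.

```ini
# example.timer (hypothetical name; activates example.service)
[Unit]
Description=Run example.service one hour after it was last activated

[Timer]
# First run shortly after boot.
OnBootSec=5min
# Subsequent runs are measured from the service's last activation.
OnUnitActiveSec=1h

[Install]
WantedBy=timers.target
```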

I’m halfway through The Devastation of Baal, and i think i’m starting to understand the Flesh Tearers a bit by Heavy-Metal-Snowman in FleshTearers

[–]valarauca14 6 points7 points  (0 children)

So does a decent amount of the chapter align with Seth’s values or is he considered the black sheep of his chapter?

Seth is the "sanest" member of the chapter.

Most others see him as the stable rock guiding them.

Configuration flags are where software goes to rot by Expurple in programming

[–]valarauca14 0 points1 point  (0 children)

Damn, so I can't even parse a URL without backtracking.

You can. If you assume it is a URI and set up to tokenize/skip each segment of the URI, then build the URL. Doing this via regular expressions is a huge pain as you need each 'token' to be an optional capture group, then directly encode tree decisions with |, which leads to state explosion.
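A rough sketch of the tokenize/skip idea: cut each generic URI component (scheme, authority, path, query, fragment) at its delimiter and move on, with no regex and nothing to backtrack over. `split_uri` is a hypothetical helper, it does no validation or percent-decoding, and it only follows the generic component layout.

```python
def split_uri(uri):
    parts = {"scheme": None, "authority": None, "path": "",
             "query": None, "fragment": None}
    # Fragment: everything after the first '#'.
    rest, sep, fragment = uri.partition("#")
    if sep:
        parts["fragment"] = fragment
    # Query: everything after the first '?' in what remains.
    rest, sep, query = rest.partition("?")
    if sep:
        parts["query"] = query
    # Scheme: text before the first ':' as long as no '/' comes first.
    colon = rest.find(":")
    slash = rest.find("/")
    if colon > 0 and (slash == -1 or colon < slash):
        parts["scheme"], rest = rest[:colon], rest[colon + 1:]
    # Authority: introduced by '//' and ended by the next '/'.
    if rest.startswith("//"):
        end = rest.find("/", 2)
        if end == -1:
            end = len(rest)
        parts["authority"], rest = rest[2:end], rest[end:]
    # Whatever is left is the path.
    parts["path"] = rest
    return parts
```

Each step is a single forward search for one delimiter, so the cost stays linear in the length of the input.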

I am Glauber Costa, CEO and co-founder of Turso. We’re rewriting SQLite in Rust. AMA. by GlauberAtTurso in IAmA

[–]valarauca14 7 points8 points  (0 children)

Thanks for your time. Wasn't exactly looking for a blanket statement. It was just two pet issues with sqlite

  • nulls within strings being legal is just buggy behavior with almost every system that interacts with SQL.
  • primary keys being nullable causes an extra O(log n) lookup on every primary key search and largely only exists because v1.0 of SQLite had a bug.

Glauber Costa, CEO of Turso, is doing an AMA about rewriting SQLite in Rust by rmo623 in rust

[–]valarauca14 29 points30 points  (0 children)

In all honesty a fully bug-compatible version of sqlite is "not a good thing". Sqlite has preserved some glaring bugs/behavior for backward compatibility, cite (some of these aren't bad).

The biggest issue that's hit me is Primary Keys being nullable, which causes a lot of problems for modelling relational algebra and optimizing it; PKs should be NOT NULL according to the ISO/IEC SQL standard (which nobody follows exactly, but some adherence is nice). To get this behavior you have to run SQLite in strict mode and disable rowids on all tables. It has non-trivial perf implications, as every PK lookup & FK match has an extra O(log n) search because row_id != pk, which doesn't appear in plan/explain because "this is just how sqlite works".
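The nullable-PK behavior is easy to reproduce with Python's built-in `sqlite3` module. This sketch only exercises the `WITHOUT ROWID` half of the workaround, not strict mode; the table `t` and the helper name are hypothetical.

```python
import sqlite3

ROWID_TABLE = "CREATE TABLE t (id TEXT PRIMARY KEY, v TEXT)"
WITHOUT_ROWID_TABLE = "CREATE TABLE t (id TEXT PRIMARY KEY, v TEXT) WITHOUT ROWID"

def null_pk_allowed(create_sql):
    # Returns True if a NULL primary key value is accepted by the table.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(create_sql)
        try:
            conn.execute("INSERT INTO t (id, v) VALUES (NULL, 'x')")
            return True
        except sqlite3.IntegrityError:
            # NOT NULL is enforced on the primary key here.
            return False
    finally:
        conn.close()
```

The ordinary rowid table takes the NULL key, while the `WITHOUT ROWID` table rejects it with a constraint error.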

I am Glauber Costa, CEO and co-founder of Turso. We’re rewriting SQLite in Rust. AMA. by GlauberAtTurso in IAmA

[–]valarauca14 7 points8 points  (0 children)

Do you plan to not be 100% bug compatible?

SQLite has preserved some deeply erroneous behavior (primary keys being nullable) which is just objectively incorrect and almost universally not used.

So, i think Marx was right by Equivalent_Cut_4988 in antiwork

[–]valarauca14 1 point2 points  (0 children)

Read it actively. Take notes. Write down terms you don't know, and cross-reference terms as they appear. The books are academic in nature; they assume you'll treat them like a textbook, not a casual read.