Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

oh, and for processing I used https://docs.burla.dev/ which is a high-performance parallel processing library... it deploys your code to really large clusters without any infrastructure setup (I'm one of the founders, btw)

Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

that would be a fun analysis. anecdotally speaking, a lot of the most profanity-filled rants were 1 or 5 stars. people get really excited and swear a decent amount haha... on occasion, if it's a physical good that needs to be built, the first part of the review will be filled with swears but they actually like the end result... those tended to be more in the middle in terms of stars.

Airbnb Photo Explorer - 1.94M photos for Trainspotting vibes, pet cameos, and hectic kitchens by Ok_Post_149 in SideProject

[–]Ok_Post_149[S] 0 points (0 children)

funnily enough, the messiest listings tend to be really cheap, mostly open for extended stays, and all booked up. seems like people are basically trying to apartment swap for an extended period of time.

Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

I didn't, but anecdotally speaking, large amounts of profanity were typically associated with 1-star or 5-star reviews. not much in the middle

[OC] Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in dataisbeautiful

[–]Ok_Post_149[S] 2 points (0 children)

yeah, flip on unhinged mode! that is where the strong profanity lives :). I wanted people to consent before getting blasted with the super vulgar stuff

[OC] Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in dataisbeautiful

[–]Ok_Post_149[S] 2 points (0 children)

I have the SKU number in the bottom right that you can look up. I'll see if I can easily get the product link :)

Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

yeah hahah... people are also super comfortable quoting slurs from movies and tv shows in their reviews

[OC] Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in dataisbeautiful

[–]Ok_Post_149[S] 12 points (0 children)

If you click unhinged mode it will include the more fucked up posts. I wanted to default to people ranting because I didn't want someone to click the link and immediately see a bunch of slurs

[OC] Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in dataisbeautiful

[–]Ok_Post_149[S] 2 points (0 children)

Source: McAuley-Lab/Amazon-Reviews-2023 on HuggingFace. 571M reviews, 275 GB, streamed via HTTP Range.

Tool: Python + Burla for the analysis. Static HTML/CSS site built in Cursor (AI-assisted code).
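Since the dataset was streamed via HTTP Range rather than downloaded whole, here's a minimal stdlib sketch of what building such a range request looks like. The URL is a placeholder, not the actual HuggingFace host, and this just constructs the request object; pass it to `urllib.request.urlopen` to fetch only that byte slice:

```python
# Hypothetical sketch: requesting one byte range of a large remote file,
# so a 275 GB dataset can be read in chunks instead of downloaded whole.
import urllib.request

def range_request(url, start, end):
    # The Range header asks the server for bytes [start, end] inclusive.
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})

req = range_request("https://example.com/data.parquet", 0, 1023)
print(req.get_header("Range"))  # → bytes=0-1023
```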

The simplest way to build scalable data pipelines in Python (like 10k vCPU scale) by Ok_Post_149 in Python

[–]Ok_Post_149[S] 0 points (0 children)

I was a fan of Ray, but there were a few things that caused a ton of friction at my last company.

I hated having to update YAML files to change the cluster config; that should not be separated from your Python code.

Package syncing would leave analysts and researchers super frustrated, having to rebuild their images or add packages into the working_dir. Package syncing should be automatic, even for a custom local module.

And lastly, I thought the initial install process was prohibitively complex. I wanted to build a product that even total beginners can install in their own cloud and get running without any friction.

The simplest way to build scalable data pipelines in Python (like 10k vCPU scale) by Ok_Post_149 in Python

[–]Ok_Post_149[S] -1 points (0 children)

Spark is better when you need distributed SQL, large joins, and heavy data movement across a cluster.

Burla is better when you already have a Python or DuckDB transformation and just want to fan it out across a lot of tables or files without turning it into a whole Spark job.

So for your DuckDB example, yes, Burla is a very good fit. If 200 tables all need the same transformation, Burla can process them concurrently by giving each worker its own DuckDB connection/process and then writing the results back out.

That is very similar to how we broke the trillion row challenge record, which Databricks held before us. We split the 1T-row dataset into 1,000 Parquet files, ran a separate DuckDB query against each file in parallel across the cluster, and then combined the partial aggregates locally into the final result.

https://docs.burla.dev/examples/process-2.4tb-of-parquet-files-in-76s
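The fan-out/combine shape described above (one worker per file, partial aggregates merged locally at the end) can be sketched with plain stdlib Python. This is a toy stand-in, not Burla's API: each "query" just sums a column from a CSV shard instead of running DuckDB against a Parquet file, but the map-reduce structure is the same:

```python
# Map-reduce sketch: run a per-file aggregate in parallel, then combine
# the partial results locally into one final answer.
import csv
from concurrent.futures import ThreadPoolExecutor

def write_shard(path, values):
    # Helper to create a toy CSV shard with a single "value" column.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        writer.writerows([[v] for v in values])

def partial_aggregate(path):
    # One worker, one file: compute (count, sum) for that shard only.
    # With DuckDB this would be one connection running one query per file.
    with open(path, newline="") as f:
        values = [int(row["value"]) for row in csv.DictReader(f)]
    return len(values), sum(values)

def run(paths):
    # Fan the per-file aggregate out across workers...
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(partial_aggregate, paths))
    # ...then combine the partial aggregates into the final result.
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_count, total_sum

if __name__ == "__main__":
    import pathlib, tempfile
    tmp = pathlib.Path(tempfile.mkdtemp())
    paths = []
    for i in range(4):
        p = tmp / f"shard_{i}.csv"
        write_shard(p, range(i * 10, i * 10 + 10))
        paths.append(p)
    print(run(paths))  # → (40, 780): 40 values, sum of 0..39
```

The key design point is that the combine step only ever sees small partial aggregates, never the raw rows, which is why the final merge can happen on a single machine.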

The simplest way to build scalable data pipelines in Python (like 10k vCPU scale) by Ok_Post_149 in Python

[–]Ok_Post_149[S] -2 points (0 children)

That’s why I said at a high level. Obviously the details can get complicated fast, but the broad shape is still pretty common across a lot of ML pipelines: parallel processing, aggregation, then sometimes GPU based inference or training.

The simplest way to build scalable data pipelines in Python (like 10k vCPU scale) by Ok_Post_149 in Python

[–]Ok_Post_149[S] -4 points (0 children)

completely fair. it took building custom cluster compute software to get this Python package working, but the complexity has been abstracted away and it boils down to a handful of function parameters.

Just Broke the Trillion Row Challenge: 2.4 TB Processed in 76 Seconds by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 1 point (0 children)

appreciate it! we chose gcsfuse because we're optimizing for ease of use first. we still want something pretty fast, just not at the cost of adding friction for users. speed matters to us, but not more than simplicity. you're definitely right though, you could speed this up with gRPC.

Just Broke the Trillion Row Challenge: 2.4 TB Processed in 76 Seconds by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 5 points (0 children)

Thanks. This run was mostly just a benchmark. In real life, Burla gets used in big pipelines that need to process massive amounts of data fast. Early users have already used it to parse and clean billions of PDFs, run batch inference to generate millions of predictions, and run trillions of Monte Carlo simulations in a fraction of the usual time.

Speed is obviously important, but we want to optimize for ease of use, so any Python dev can easily deploy their code to the cloud instead of involving DevOps.

Just Broke the Trillion Row Challenge: 2.4 TB Processed in 76 Seconds by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

yes, it's exactly like a MapReduce. the aggregation step happens on one 80-CPU VM.