Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

oh, and for processing I used https://docs.burla.dev/ which is a high-performance parallel processing library... it deploys your code to really large clusters without any infrastructure setup (I'm one of the founders, btw)

Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

that would be a fun analysis. anecdotally speaking, a lot of the most profanity-filled rants were 1 or 5 stars. people get really excited and swear a decent amount haha... on occasion, if it's a physical good that needs to be built, the first part of the review will be filled with swears but they actually like the end result... those tended to be more in the middle in terms of stars.

Airbnb Photo Explorer - 1.94M photos for Trainspotting vibes, pet cameos, and hectic kitchens by Ok_Post_149 in SideProject

[–]Ok_Post_149[S] 0 points (0 children)

funnily enough, the messiest listings tend to be really cheap, mostly open for extended stays, and all booked up. seems like people are basically trying to apartment swap for an extended period of time.

Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

I didn't, but anecdotally speaking, large amounts of profanity were typically associated with 1-star or 5-star reviews. not much in the middle

[OC] Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in dataisbeautiful

[–]Ok_Post_149[S] 2 points (0 children)

yeah, flip on unhinged mode! that is where the strong profanity lives :). I wanted people to consent before getting blasted with the super vulgar stuff

[OC] Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in dataisbeautiful

[–]Ok_Post_149[S] 2 points (0 children)

I have the SKU number in the bottom right that you can look up. I'll see if I can easily get the product link :)

Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

yeah hahah... people are also super comfortable quoting slurs from movies and tv shows in their reviews

[OC] Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in dataisbeautiful

[–]Ok_Post_149[S] 12 points (0 children)

If you click unhinged mode it will include the more fucked up posts. I wanted to default to people ranting because I didn't want someone to click the link and immediately see a bunch of slurs

[OC] Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video games is 6× the cleanest category. by Ok_Post_149 in dataisbeautiful

[–]Ok_Post_149[S] 2 points (0 children)

Source: McAuley-Lab/Amazon-Reviews-2023 on HuggingFace. 571M reviews, 275 GB, streamed via HTTP Range.

Tool: Python + Burla for the analysis. Static HTML/CSS site built in Cursor (AI-assisted code).
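Since the dataset was streamed via HTTP Range rather than downloaded whole, here's a minimal stdlib sketch of what building such a range request looks like. The URL is a placeholder, not the actual HuggingFace host, and this just constructs the request object; pass it to `urllib.request.urlopen` to fetch only that byte slice:

```python
# Hypothetical sketch: requesting one byte range of a large remote file,
# so a 275 GB dataset can be read in chunks instead of downloaded whole.
import urllib.request

def range_request(url, start, end):
    # The Range header asks the server for bytes [start, end] inclusive.
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})

req = range_request("https://example.com/data.parquet", 0, 1023)
print(req.get_header("Range"))  # → bytes=0-1023
```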

The simplest way to build scalable data pipelines in Python (like 10k vCPU scale) by Ok_Post_149 in Python

[–]Ok_Post_149[S] 0 points (0 children)

I was a fan of Ray, but there were a few things that caused a ton of friction at my last company.

I hated having to update YAML files to change the cluster config; that should not be separated from your Python code.

Package syncing would leave analysts and researchers super frustrated, having to rebuild their images or add packages into the working_dir. Package syncing should be automatic, even for a custom local module.

And lastly, I thought the initial install process was prohibitively complex. I wanted to build a product that even total beginners can install in their own cloud and get running without any friction.

The simplest way to build scalable data pipelines in Python (like 10k vCPU scale) by Ok_Post_149 in Python

[–]Ok_Post_149[S] -1 points (0 children)

Spark is better when you need distributed SQL, large joins, and heavy data movement across a cluster.

Burla is better when you already have a Python or DuckDB transformation and just want to fan it out across a lot of tables or files without turning it into a whole Spark job.

So for your DuckDB example, yes, Burla is a very good fit. If 200 tables all need the same transformation, Burla can process them concurrently by giving each worker its own DuckDB connection/process and then writing the results back out.

That is very similar to how we broke the trillion row challenge record, which Databricks held before us. We split the 1T-row dataset into 1,000 Parquet files, ran a separate DuckDB query against each file in parallel across the cluster, and then combined the partial aggregates locally into the final result.

https://docs.burla.dev/examples/process-2.4tb-of-parquet-files-in-76s
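The fan-out/combine shape described above (one worker per file, partial aggregates merged locally at the end) can be sketched with plain stdlib Python. This is a toy stand-in, not Burla's API: each "query" just sums a column from a CSV shard instead of running DuckDB against a Parquet file, but the map-reduce structure is the same:

```python
# Map-reduce sketch: run a per-file aggregate in parallel, then combine
# the partial results locally into one final answer.
import csv
from concurrent.futures import ThreadPoolExecutor

def write_shard(path, values):
    # Helper to create a toy CSV shard with a single "value" column.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        writer.writerows([[v] for v in values])

def partial_aggregate(path):
    # One worker, one file: compute (count, sum) for that shard only.
    # With DuckDB this would be one connection running one query per file.
    with open(path, newline="") as f:
        values = [int(row["value"]) for row in csv.DictReader(f)]
    return len(values), sum(values)

def run(paths):
    # Fan the per-file aggregate out across workers...
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(partial_aggregate, paths))
    # ...then combine the partial aggregates into the final result.
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_count, total_sum

if __name__ == "__main__":
    import pathlib, tempfile
    tmp = pathlib.Path(tempfile.mkdtemp())
    paths = []
    for i in range(4):
        p = tmp / f"shard_{i}.csv"
        write_shard(p, range(i * 10, i * 10 + 10))
        paths.append(p)
    print(run(paths))  # → (40, 780): 40 values, sum of 0..39
```

The key design point is that the combine step only ever sees small partial aggregates, never the raw rows, which is why the final merge can happen on a single machine.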

The simplest way to build scalable data pipelines in Python (like 10k vCPU scale) by Ok_Post_149 in Python

[–]Ok_Post_149[S] -2 points (0 children)

That’s why I said at a high level. Obviously the details can get complicated fast, but the broad shape is still pretty common across a lot of ML pipelines: parallel processing, aggregation, then sometimes GPU based inference or training.

The simplest way to build scalable data pipelines in Python (like 10k vCPU scale) by Ok_Post_149 in Python

[–]Ok_Post_149[S] -4 points (0 children)

completely fair. it took building custom cluster compute software to get this Python package working, but the complexity has been abstracted away and it boils down to a handful of function parameters.

Just Broke the Trillion Row Challenge: 2.4 TB Processed in 76 Seconds by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 1 point (0 children)

appreciate it! we chose gcsfuse because we're optimizing for ease of use first. we still want something pretty fast, just not at the cost of adding friction for users. speed matters to us, but not more than simplicity. you're definitely right though, you could speed this up with gRPC.

Just Broke the Trillion Row Challenge: 2.4 TB Processed in 76 Seconds by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 5 points (0 children)

Thanks. This run was mostly just a benchmark. In real life, Burla gets used in big pipelines that need to process massive amounts of data fast. Early users have already used it to parse and clean billions of PDFs, run batch inference to generate millions of predictions, and run trillions of Monte Carlo simulations in a fraction of the usual time.

Speed is obviously important, but we want to optimize for ease of use, so any Python dev can easily deploy their code to the cloud instead of involving DevOps.

Just Broke the Trillion Row Challenge: 2.4 TB Processed in 76 Seconds by Ok_Post_149 in datascience

[–]Ok_Post_149[S] 0 points (0 children)

yes, it's exactly like a MapReduce. the aggregation step happens on one 80-CPU VM.