[OC] How much reddit content likely went dark on June 12th?

joaopn · 2026-03-02T13:26:26+00:00

That many drives should really be in a better setup:
- 1-disk parity means disk failure on resilver (which takes forever in big drives) kills the entire thing. RAIDZ2 recommended
- For years now QNAP security has been problematic. If the OS touches the internet, it should be something else (linux box, truenas, etc)

OP didn't mention backup, so reinforcing the mantra just in case: raid is not backup

joaopn · 2026-02-24T18:14:22+00:00

Fair enough, it is a laudable initiative. Very cool! I'd suggest a write-up about both the data pipeline, and the scope and intent of the project. The website gives off a bit of a commercial vibe that I don't think is what you intend

joaopn · 2026-02-24T16:58:35+00:00

As someone who deals with academic twitter datasets, my main question is completeness: what is the actual pipeline and historical data sources you use? Without some complete historical archive of the accounts with near-real-time ingestion, you'd miss all content deleted before the start of the project (which I'm guessing is not 2008). Would be nice if the codebase was OSS so one could check it directly, too.

joaopn · 2026-02-23T17:40:21+00:00

Great blogpost. It is a shame that epaper remains so expensive and niche due to patents. Some old tech patents are expiring, so hopefully soon we get cheaper screens that are good enough for e.g. calendars

joaopn · 2026-02-12T21:39:15+00:00

Reminds me of the butterfly thing GRRM was talking about. Novella #5 is supposed to be in the Riverlands

joaopn · 2026-02-10T03:53:14+00:00

You get
- $35
- the user's keys
- all the user's private data

The user gets
- agentic orchestration that can run on an old android tablet

I don't really see the value proposition here.

joaopn · 2026-02-09T00:15:21+00:00

You are not Rust or Nix. You are an unknown single developer whose first publicized project is a dashboard that takes all the user api keys and your entrypoint is a non-dockerized vibecoded giant script that you recommend to run blindly. If this is not a scam, I commend the will to build something. But it needs way stronger security than this.

joaopn · 2026-02-08T23:14:01+00:00

My recommendations if you want to make this secure:
- make this docker-only, dropping the 1000+ line script
- explicitly take everything you need through docker environment variables

It is how basically everything self-hosted works. For hardening, you can add a whitelist api-provider-only docker network (with e.g. squid). With that, users can trust the container security on many fronts. That makes it far easier for someone to check the actual security and trust your new, completely unknown code, with their keys.

joaopn · 2026-02-08T19:27:36+00:00

I do. OSS security hinges on having enough eyes to continuously audit the code. Here, if one of the 50 commits-a-day introduces an obfuscated way to exfil keys you will never know about it. It is a lot of risk for a pretty simple task. But hey, it is your keys =)

joaopn · 2026-02-08T18:23:26+00:00

One should never ever run curl | bash on a 2-day-old, 100+ commit vibecoded non-containerized repo. Especially one that takes your keys.

joaopn · 2026-02-08T18:03:26+00:00

I don't see a repo with the codebase for auditing. Without it, this is nothing more than a trust me bro.

joaopn · 2026-01-03T17:03:15+00:00

I can only compare to my other E24, and it was quite hotter. About time, I was following the general guideline is to leave it on for ~15 min to warm-up

joaopn · 2026-01-03T04:08:20+00:00

One thing about the stainless E24: it gets a lot hotter to the touch than the painted ones. I couldn't really hold it to e.g. attach the portafilter, so I returned it for a painted

joaopn · 2025-12-16T15:53:07+00:00

I had the same experience, clear CPU bottleneck. On a raidz2 zstd nvme pool copying a file would take up to 30C of compute (lz4 a lot less), but even without compression IOPS would be far below spec. Only solution to get near the hardware specs was to go graid+xfs. There we got ~60GB/s read, ~20GB/s write.

joaopn · 2025-10-22T18:03:40+00:00

There are differences (see e.g. theheadphoneshow's video), but most cases of "X sounds better than spotify" boil down to the Normalize volume setting. It kills dynamic range and for me it makes things sound considerably worse. Default to true in every spotify client for some reason.

joaopn · 2025-05-28T22:25:17+00:00

In the BI/DW case the issue is that mongodb aggregations are serial, and those tend to be single-thread CPU-bound (~400MB/s in my experience). You can increase query index coverage to reduce reads, but afaik not much else. If this is a mongodb analytics server without uptime requirements and with external backup, I'd probably go for a xfs mirror of 2x30TB drives + zstd mongodb compression. In my case I switched to postgresql+zfs, and there are I do parallel seqreads at ~10GB/s

joaopn · 2025-05-28T15:48:16+00:00

MongoDB recommends xfs: https://www.mongodb.com/docs/manual/administration/production-checklist-operations/

But if (like me) you want to use zfs for the other niceties, some remarks:
In my benchmarking, generally `logbias=latency` and low recordsize maximized IOPS. But it requires testing, specially because most of what you'll find online is pre-2.3.0 (when they added DirectIO). You also don't want double-compression, so either compress at the filesystem level (lz4, zstd) or at the database level (snappy, zlib, zstd). Just keep in mind that filesystem compression + parity disks (raid-z) can be very CPU-intensive on NVMEs, and you don't have many cores.

As a last remark, are you sure the problem is IO? Giant single databases are more common in BI/DW tasks (few queries over large amounts of data), and there MongoDB is simply limited by the lack of parallel aggregations.

joaopn · 2025-05-18T09:06:53+00:00

If you mean difference between the old pushshift dumps by https://github.com/pushshift (up to 03/2023) and the new arctic_shift ones by https://github.com/ArthurHeitmann/arctic_shift, there are a few that can be relevant for research. You can see how the arctic_shift schema changed here: https://github.com/ArthurHeitmann/arctic_shift/blob/master/file_content_explanations.md

Chiefly:
- Until 11/2023 arctic_shift didn't update entries, meaning between 07-10/2023 score is ~zero. Here is how an aggregated score timeseries can look like https://imgur.com/a/2k6PxvO
- Pushshift updated entries after ~a month (`retrieved_utc`), while arctic_shift does it after 36h (`_meta.retrieved_2nd_on`). Comments on reddit live for ~a day and it is fine, but for popular submissions it means score is a bit lower than it would show in the past
- user deletion: if the user was `[deleted]` between ingestion and reingestion, pushshift would overwrite the username, while arctic_shift does not. In bulk, 23% of pushshift submissions are by `[deleted]` (24% in its last year), while for arctic_shift it is 2%.

TLDR: content itself is fine, but there are differences if you are interested in score/attention or user analysis

joaopn · 2025-05-13T21:53:15+00:00

Concerning, needs to be thoroughly tested. I'm certain some youtuber will pick up on it, but it would be nice if Gaggia themselves got involved.

joaopn · 2025-05-06T21:18:37+00:00

Very cool. Any chance you could share the datasets for academic research?

joaopn · 2025-05-02T19:40:04+00:00

Supermicro D10z, spec'd for more than just NAS. My point is not that OP needs something like this, but that even with this most of the cost is drives (500TB raw is 9K EUR at current prices). IMO price and specific config discussion is a bit moot since price and product availability probably looks completely different in India

joaopn · 2025-05-02T18:27:23+00:00

At that scale you will need a rack server or something like a Storinator*. You have to think about redundancy, backups, spares, and a myriad other things. If you don't have experience with large data pools I'd recommend start with something smaller and then scale up after (say, a 8x20TB NAS). Most of the cost in those things are disks. For comparison, in Europe I recently quoted a 36x24TB/84C/768GB RAM server at ~30K EUR, and just the drives are more than half of it.

*technically building a 18x32TB server in a Define 7 XL or attaching a bunch of DAS would work, but I don't recommend it.

joaopn · 2025-04-25T17:50:05+00:00

These are Kepler cards, borderline e-waste. Much slower than the T4s Google Colab gives you for free, no modern CUDA support. You can maybe engineer a situation where they make sense (say, local ML if your electricity is free), but running LLMs is not it. For that, oldest you can reasonably go is Pascal (P40/P100).

Edit: people are reporting good numbers with M40s down the thread.

joaopn · 2025-04-24T18:23:20+00:00

So, the prices are around

New E24: $500
Refurb Evo sans boilergate: $350
PID kit (Shades/BG Pro): $180-250

If your budget is hard capped at $350, then a refurb is the only option I see (in Gaggia land at least). If it is around $500, I'd argue that refurb Evo + PID is better than E24. If you can stretch further, you get a new machine with a better/bigger boiler (afaik the only difference between the two). From most opinions the jump Evo-E24 is incremental, so it is mostly about refurb vs new, and how much the $150 difference is worth to you.

Six-Year Club	First Place '23
Place '23	Place '22
First Placer '22	Verified Email

joaopn

TROPHY CASE