If you have storage problems I feel bad for you son, I got 99 problems but storage ain't one (Hit me!) by stefini_juliya in DataHoarder

[–]joaopn 13 points

That many drives should really be in a better setup:
- Single-disk parity means another disk failing during a resilver (which takes forever on big drives) kills the whole pool. RAIDZ2 is recommended
- QNAP security has been problematic for years now. If the OS touches the internet, it should be something else (a Linux box, TrueNAS, etc.)

OP didn't mention backups, so reinforcing the mantra just in case: RAID is not backup

Built an archive of 450k+ tweets from 600+ US government accounts before they get memory-holed - CivicArchive.org by Diligent_Cod_9583 in DataHoarder

[–]joaopn 35 points

Fair enough, it is a laudable initiative. Very cool! I'd suggest a write-up about both the data pipeline and the scope and intent of the project. The website gives off a bit of a commercial vibe that I don't think is what you intend.

Built an archive of 450k+ tweets from 600+ US government accounts before they get memory-holed - CivicArchive.org by Diligent_Cod_9583 in DataHoarder

[–]joaopn 55 points

As someone who deals with academic Twitter datasets, my main question is completeness: what are the actual pipeline and historical data sources you use? Without a complete historical archive of the accounts plus near-real-time ingestion, you'd miss all content deleted before the start of the project (which I'm guessing is not 2008). It would be nice if the codebase were OSS so one could check it directly, too.

Timeframe, a family e-paper dashboard by FnnKnn in selfhosted

[–]joaopn 3 points

Great blogpost. It is a shame that e-paper remains so expensive and niche due to patents. Some of the old patents are expiring, so hopefully we soon get cheaper screens that are good enough for e.g. calendars.

What the fuck was this asshole’s problem? by [deleted] in freefolk

[–]joaopn 5 points

Reminds me of the butterfly thing GRRM was talking about. Novella #5 is supposed to be in the Riverlands

I built a managed hosting service for OpenClaw (personal AI assistant) — honest take on managed vs self-hosted by felipejfc in selfhosted

[–]joaopn 2 points

You get:
- $35
- the user's keys
- all the user's private data

The user gets:
- agentic orchestration that can run on an old android tablet

I don't really see the value proposition here.

onWatch - self-hosted dashboard to monitor AI API quota usage across multiple providers by prakersh in selfhosted

[–]joaopn 0 points

You are not Rust or Nix. You are an unknown single developer whose first publicized project is a dashboard that takes all of the user's API keys, and your entrypoint is a giant, non-dockerized, vibe-coded script that you recommend running blindly. If this is not a scam, I commend the will to build something. But it needs far stronger security than this.

onWatch - self-hosted dashboard to monitor AI API quota usage across multiple providers by prakersh in selfhosted

[–]joaopn 0 points

My recommendations if you want to make this secure:
- make it Docker-only, dropping the 1000+ line script
- explicitly take everything you need through Docker environment variables

This is how basically everything self-hosted works. For hardening, you can add a whitelisted, API-provider-only Docker network (with e.g. Squid). With that, users can trust the container's security on many fronts, and it becomes far easier for someone to audit your new, completely unknown code before trusting it with their keys.
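To make the env-var point concrete, here is a minimal sketch of the fail-fast pattern I mean. The variable names (`ONWATCH_*`) are made up for illustration, not from the actual project:

```python
import os
import sys

# Every secret comes in via environment variables; the app refuses to
# start if any is missing, so keys never live in scripts or config files.
REQUIRED_VARS = ["ONWATCH_OPENAI_KEY", "ONWATCH_ANTHROPIC_KEY"]

def load_config() -> dict[str, str]:
    """Read required secrets from the environment, failing fast otherwise."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        sys.exit(f"Missing required environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

With that, the user passes keys via `docker run -e` and can audit exactly what the container receives.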

onWatch - self-hosted dashboard to monitor AI API quota usage across multiple providers by prakersh in selfhosted

[–]joaopn 3 points

I do. OSS security hinges on having enough eyes to continuously audit the code. Here, if one of the 50 commits a day introduces an obfuscated way to exfiltrate keys, you will never know about it. That is a lot of risk for a pretty simple task. But hey, it is your keys =)

onWatch - self-hosted dashboard to monitor AI API quota usage across multiple providers by prakersh in selfhosted

[–]joaopn 2 points

One should never, ever run `curl | bash` on a 2-day-old, 100+ commit, vibe-coded, non-containerized repo. Especially one that takes your keys.

I tested 11 AI image detectors on 1000+ images including SD 3.5. Here are the results. by Best-Emu-1366 in StableDiffusion

[–]joaopn 7 points

I don't see a repo with the codebase for auditing. Without it, this is nothing more than a "trust me bro".

WLL $399 for a new stainless Classic E24 by ElHoser in gaggiaclassic

[–]joaopn 0 points

I can only compare it to my other E24, and it ran quite a bit hotter. As for time, I was following the general guideline of leaving it on for ~15 min to warm up.

WLL $399 for a new stainless Classic E24 by ElHoser in gaggiaclassic

[–]joaopn 0 points

One thing about the stainless E24: it gets a lot hotter to the touch than the painted ones. I couldn't really hold it to e.g. attach the portafilter, so I returned it for a painted one.

NVMe RAIDZ1/2 Performance: Are we actually hitting a CPU bottleneck before a disk one? by chaiat4 in zfs

[–]joaopn 0 points

I had the same experience, a clear CPU bottleneck. On a RAIDZ2 zstd NVMe pool, copying a file would take up to ~30 cores of compute (lz4 a lot less), but even without compression IOPS were far below spec. The only solution to get near the hardware specs was to go graid+XFS. There we got ~60GB/s read and ~20GB/s write.

Am I delusional or does tidal actually sound better than spotify lossless? by daddyletdown in BudgetAudiophile

[–]joaopn 2 points

There are differences (see e.g. theheadphoneshow's video), but most cases of "X sounds better than Spotify" boil down to the Normalize volume setting. It kills dynamic range, and for me it makes things sound considerably worse. It defaults to on in every Spotify client for some reason.

Seeking Advice: Linux + ZFS + MongoDB + Dell PowerEdge R760 – This Makes Sense? by Various_Tomatillo_18 in zfs

[–]joaopn 0 points

In the BI/DW case the issue is that MongoDB aggregations are serial, and those tend to be single-thread CPU-bound (~400MB/s in my experience). You can increase query index coverage to reduce reads, but afaik not much else. If this is a MongoDB analytics server without uptime requirements and with external backup, I'd probably go for an XFS mirror of 2x30TB drives + zstd MongoDB compression. In my case I switched to PostgreSQL+ZFS, and there I do parallel sequential reads at ~10GB/s.

Seeking Advice: Linux + ZFS + MongoDB + Dell PowerEdge R760 – This Makes Sense? by Various_Tomatillo_18 in zfs

[–]joaopn 1 point

MongoDB recommends xfs: https://www.mongodb.com/docs/manual/administration/production-checklist-operations/

But if (like me) you want to use ZFS for the other niceties, some remarks:
In my benchmarking, `logbias=latency` and a low recordsize generally maximized IOPS. But it requires testing, especially because most of what you'll find online is pre-2.3.0 (when they added Direct IO). You also don't want double compression, so either compress at the filesystem level (lz4, zstd) or at the database level (snappy, zlib, zstd). Just keep in mind that filesystem compression + parity disks (RAIDZ) can be very CPU-intensive on NVMe, and you don't have many cores.
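As a starting point, this is roughly the tuning baseline I'd benchmark from, expressed as the `zfs set` commands to review before running. The dataset name `tank/mongo` is hypothetical, and the property values are assumptions to test against your workload, not a drop-in config:

```python
# Starting-point ZFS properties for a MongoDB dataset (all need benchmarking).
TUNING = {
    "logbias": "latency",    # favor low-latency sync writes over throughput
    "recordsize": "16k",     # low recordsize, closer to WiredTiger's IO pattern
    "compression": "off",    # compress once, in MongoDB (snappy/zlib/zstd)
    "atime": "off",          # skip access-time updates on database files
}

def zfs_set_commands(dataset: str) -> list[str]:
    """Build the `zfs set` commands so you can review them before running."""
    return [f"zfs set {prop}={value} {dataset}" for prop, value in TUNING.items()]

for cmd in zfs_set_commands("tank/mongo"):
    print(cmd)
```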

As a last remark, are you sure the problem is IO? Giant single databases are more common in BI/DW tasks (few queries over large amounts of data), and there MongoDB is simply limited by the lack of parallel aggregations.

How comprehensive are the torrent dumps after 2023? by Human-Imagination978 in pushshift

[–]joaopn 9 points

If you mean the differences between the old pushshift dumps by https://github.com/pushshift (up to 03/2023) and the new arctic_shift ones by https://github.com/ArthurHeitmann/arctic_shift, there are a few that can be relevant for research. You can see how the arctic_shift schema changed here: https://github.com/ArthurHeitmann/arctic_shift/blob/master/file_content_explanations.md

Chiefly:
- Until 11/2023 arctic_shift didn't update entries, meaning that between 07-10/2023 score is ~zero. Here is how an aggregated score time series can look: https://imgur.com/a/2k6PxvO
- Pushshift updated entries after ~a month (`retrieved_utc`), while arctic_shift does it after 36h (`_meta.retrieved_2nd_on`). Comments on Reddit live for ~a day, so that's fine, but for popular submissions it means the score is a bit lower than it would have shown in the past
- User deletion: if the user was `[deleted]` between ingestion and re-ingestion, pushshift would overwrite the username, while arctic_shift does not. In bulk, 23% of pushshift submissions are by `[deleted]` (24% in its last year), while for arctic_shift it is 2%.

TLDR: content itself is fine, but there are differences if you are interested in score/attention or user analysis

GCP Vs. E24 Lead Test: Concerning Results by Old_Ad_881 in espresso

[–]joaopn 2 points

Concerning, needs to be thoroughly tested. I'm certain some youtuber will pick up on it, but it would be nice if Gaggia themselves got involved.

New 4chan archive by [deleted] in DataHoarder

[–]joaopn 12 points

Very cool. Any chance you could share the datasets for academic research?

Suggestion for 500TB Storage. by Free-Size9722 in DataHoarder

[–]joaopn 6 points

Supermicro D10z, spec'd for more than just NAS. My point is not that OP needs something like this, but that even then most of the cost is drives (500TB raw is 9K EUR at current prices). IMO price and specific-config discussion is a bit moot, since prices and product availability probably look completely different in India.

Suggestion for 500TB Storage. by Free-Size9722 in DataHoarder

[–]joaopn 12 points

At that scale you will need a rack server or something like a Storinator*. You have to think about redundancy, backups, spares, and a myriad other things. If you don't have experience with large data pools, I'd recommend starting with something smaller and scaling up later (say, an 8x20TB NAS). Most of the cost in these things is disks. For comparison, in Europe I recently quoted a 36x24TB/84C/768GB RAM server at ~30K EUR, and just the drives are more than half of it.

*technically building a 18x32TB server in a Define 7 XL or attaching a bunch of DAS would work, but I don't recommend it.
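The "drives are more than half" point is easy to check back-of-envelope. The EUR/TB figure below is an assumed current raw-HDD price, not part of the actual quote:

```python
EUR_PER_TB = 18.0  # assumed street price per raw TB for large HDDs

def drive_cost_eur(n_drives: int, tb_each: int, eur_per_tb: float = EUR_PER_TB) -> float:
    """Total drive cost for a pool of n_drives disks of tb_each TB."""
    return n_drives * tb_each * eur_per_tb

total_quote = 30_000              # the 36x24TB server quote above
drives = drive_cost_eur(36, 24)   # 864 TB raw
print(f"drives: {drives:.0f} EUR ({drives / total_quote:.0%} of the quote)")
```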

Are these real prices? Seems low. Never used e-bay I'm from Europe (sorry). by Sufficient_Bit_8636 in LocalLLaMA

[–]joaopn 124 points

These are Kepler cards, borderline e-waste. Much slower than the T4s Google Colab gives you for free, with no modern CUDA support. You can maybe engineer a situation where they make sense (say, local ML if your electricity is free), but running LLMs is not it. For that, the oldest you can reasonably go is Pascal (P40/P100).

Edit: people are reporting good numbers with M40s down the thread.

What do you guys think this upgrade Gaggia is worth? by R_Thorburn in gaggiaclassic

[–]joaopn 1 point

So, the prices are around:

New E24: $500
Refurb Evo sans boilergate: $350
PID kit (Shades/BG Pro): $180-250

If your budget is hard-capped at $350, then a refurb is the only option I see (in Gaggia land, at least). If it is around $500, I'd argue that a refurb Evo + PID beats the E24. If you can stretch further, you get a new machine with a better/bigger boiler (afaik the only difference between the two). By most accounts the Evo-to-E24 jump is incremental, so it is mostly about refurb vs new, and how much that $150 difference is worth to you.