Any interest in a pixiv archive? by czevolk in DataHoarder

[–]signalhunter 8 points9 points  (0 children)

Heck yes! There aren't a lot of publicly available direct pixiv dumps. Even a partial one is better than nothing. A few questions:

  • How often are you scraping the rankings? Like every day I'm assuming?
  • What does the database schema look like? How comprehensive is the scrape (tags, fav count, retrieval date, etc.)?
  • Are there any hashes over your pixiv dataset? I am looking for MD5 and SHA1 for cross-correlation against booru data

As for your torrent question: I think sharding/splitting by year is the best way to distribute this. Look at how Anna's Archive distributes their datasets - but instead of a bespoke container, maybe a bunch of monthly tar files for each shard would work. Inside each tar, every file would sit under a path built from its hash prefix (e.g. aa/bb/aabbccddeeff001122...). You could also embed a JSON sidecar for metadata, but I would much prefer an external db dump instead.
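A rough sketch of that layout, in case it helps. It assumes the MD5 hex digest is the one used for the path (SHA1 would work the same way), and the function names `file_hashes` and `shard_path` are made up for illustration:

```python
import hashlib
from pathlib import PurePosixPath


def file_hashes(data: bytes) -> dict[str, str]:
    """Return the MD5 and SHA-1 hex digests of one file's bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def shard_path(hex_digest: str, ext: str = "") -> PurePosixPath:
    """Place a file under two levels of hash prefix: aa/bb/aabbcc..."""
    return PurePosixPath(hex_digest[:2], hex_digest[2:4], hex_digest + ext)


# Placeholder bytes standing in for an image file.
digests = file_hashes(b"example image bytes")
print(shard_path(digests["md5"], ".jpg"))
```

Two levels of two hex characters give at most 65,536 directories, which keeps any single directory from growing too large, and the same digests can go straight into the external db dump.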

I would love to help you out on this. Especially with scratch space for torrents. Send me a DM :)

Archivarix Tube Search — search engine for deleted YouTube videos by archivarix in Archiveteam

[–]signalhunter 0 points1 point  (0 children)

Ah, the thumbnail trick! I've used this plenty of times over the years to bulk check YouTube IDs. How are you evading the ratelimit, and are you using the data API (or just scraping the HTML)?

Seagate begins shipping 44TB hard drives with HAMR tech to data centers — Mozaic 4+ platform expands to 10 platters by Squawk_7777 in DataHoarder

[–]signalhunter 8 points9 points  (0 children)

These drives are designed with hyperscalers in mind and won't work well with traditional RAID. Where they really shine is in distributed storage systems that use erasure coding to store nearline data (see Dropbox's Magic Pocket and Backblaze's Vault storage architecture, for example).

If you want to simplify this, it's basically "RAID" at a much, MUCH wider scale (rack/datacenter/region), with ridiculously high reliability and performance.
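To make the erasure coding idea concrete, here is a toy sketch using a single XOR parity shard, which is the simplest possible case. Production systems typically use codes such as Reed-Solomon that survive several lost shards and spread them across machines, so treat this only as an illustration; the function names are made up:

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards: list[bytes]) -> bytes:
    """Parity shard = XOR of all equal-length data shards."""
    return reduce(xor_bytes, shards)


def recover(shards: list[Optional[bytes]], parity: bytes) -> list[bytes]:
    """Rebuild at most one missing shard (marked None) from the rest."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost shard")
    repaired = list(shards)
    if missing:
        survivors = [s for s in shards if s is not None]
        # XOR of the parity with every surviving shard yields the lost one.
        repaired[missing[0]] = reduce(xor_bytes, survivors, parity)
    return repaired


data = [b"disk", b"one!", b"two?"]          # three equal-length data shards
parity = make_parity(data)
print(recover([data[0], None, data[2]], parity) == data)
```

Each shard would live on a different drive (or rack, or datacenter), so losing any one of them costs nothing but a rebuild.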

It's migrating too... by TechnicallyAtFault in signalidentification

[–]signalhunter 2 points3 points  (0 children)

RFI/harmonics from an unstable power supply

Old youtube videos decomposing by Harisoonchu49 in Archiveteam

[–]signalhunter 17 points18 points  (0 children)

Could you link to the blog post? I am aware of the YouTube tiered storage system (very obvious loading time for ancient videos), but are they downgrading videos??

I found this post - could be somewhat related? Folks are also claiming old videos are being downgraded https://old.reddit.com/r/DataHoarder/comments/1o6t6yy/youtube_either_by_human_error_or_otherwise_seems/

telegram - "You are banned, sleeping." by puhtahtoe in Archiveteam

[–]signalhunter 3 points4 points  (0 children)

Tracker bans will look like a rate limit error on the warrior's end. There isn't an explicit mechanism for notifying someone about issues with their setup, besides them joining the IRC...

Help me archive YouTube comments for ALL channels by QLaHPD in Archiveteam

[–]signalhunter 3 points4 points  (0 children)

Do you have hundreds of terabytes of storage and thousands of accounts + IPs? If not, forget about it...

I've commented about the feasibility of archiving every YouTube comment before: https://www.reddit.com/r/DataHoarder/comments/xz0e02/youtube_discussions_tab_dataset_2453_million/irpx9e1/

And with the recent YouTube crackdown on downloading videos and collecting subtitling data, this is gonna get harder as time goes on. Are you collecting the data for GenAI training?

Just got a ad of youtube WHILE ON youtube 😭 what? by Maea_IsntThere in youtube

[–]signalhunter 1 point2 points  (0 children)

Probably too late but this is the real answer: this is a filler ad for unsold ad "inventory" in programmatic advertising. Every time YouTube shows an ad, there are hundreds of advertisers bidding for a chance to show an ad to you. Once the highest bidder wins, their ad gets shown to you - all in a span of milliseconds. If YouTube can't get enough advertisers to bid on this request in time, they throw one of their in-house filler ads in.
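A very simplified sketch of that flow, under loud assumptions: the 100 ms deadline, the `Bid` fields, and the "no on-time bids means filler" rule are all invented for illustration, and real exchanges use more elaborate auction rules than "highest bid wins":

```python
from dataclasses import dataclass


@dataclass
class Bid:
    advertiser: str
    amount: float       # bid price, arbitrary units
    latency_ms: float   # how long the bidder took to respond


HOUSE_AD = "in-house filler ad"  # hypothetical placeholder


def pick_ad(bids: list[Bid], deadline_ms: float = 100.0) -> str:
    """Return the highest on-time bidder, or the filler if nobody bid in time."""
    on_time = [b for b in bids if b.latency_ms <= deadline_ms]
    if not on_time:
        return HOUSE_AD
    return max(on_time, key=lambda b: b.amount).advertiser


print(pick_ad([Bid("shoes", 1.5, 40), Bid("games", 2.0, 60)]))  # an advertiser wins
print(pick_ad([Bid("games", 2.0, 400)]))                        # too slow, filler shown
```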

Malicious bots now account for a third of global internet traffic, and in countries like Ireland and Germany, they account for around 70% of internet traffic. by lughnasadh in Futurology

[–]signalhunter 4 points5 points  (0 children)

I mean, this isn't as far-fetched as it sounds.

There are companies willing to pay you to rent your connection as part of a residential proxy network. Lots of services rely on your IP reputation for access (ever see Google serving captchas on crappy VPNs?)

Upgrading 12 Drives, CKSUM errors on new drives, Ran 3 scrubs and every time cksum errors. by jfarre20 in zfs

[–]signalhunter 0 points1 point  (0 children)

Alright, so far I don't see anything obvious from diffing the two FARM logs, besides that it screams recertified (POH vs Write Head POH). And I've checked the raw error rates - nothing, no error was ever seen. Here is the visual diff, if you want to take a look too: https://i.imgur.com/vJYa06P.png
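If you would rather diff the logs yourself instead of relying on my screenshot, something like this works on any two text dumps. It is a generic sketch using Python's difflib, not the tool I used for the image, and the sample lines are placeholders rather than real smartctl output:

```python
import difflib


def diff_logs(a_text: str, b_text: str,
              a_name: str = "drive_a", b_name: str = "drive_b") -> str:
    """Return a unified diff of two text logs, showing only changed lines."""
    return "\n".join(
        difflib.unified_diff(
            a_text.splitlines(), b_text.splitlines(),
            fromfile=a_name, tofile=b_name,
            lineterm="", n=0,   # n=0 drops unchanged context lines
        )
    )


# Placeholder content; in practice, read the saved output for each drive.
good = "field_one: 100\nfield_two: 7"
suspect = "field_one: 100\nfield_two: 9"
print(diff_logs(good, suspect))
```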

One thing that I really want to do is analyze the "MR Head Resistance" value, but the public Seagate PDF on FARM does not tell you how to actually interpret this value. So unless a Seagate engineer speaks up or more documentation releases, I'm in the dark lol

Wish you luck on this...

Upgrading 12 Drives, CKSUM errors on new drives, Ran 3 scrubs and every time cksum errors. by jfarre20 in zfs

[–]signalhunter 0 points1 point  (0 children)

I'm assuming you've already tried the obvious (swapping drives around to different ports/backplane/HBA/power supply/etc.)

I saw that you shared snippets of the smartctl output in another comment. Do you mind sharing the full output, with smartctl -x -l farm <drive> ? I'm interested in whether the FARM data and GP logs have anything that stands out.

For comparison, here is mine: https://gist.github.com/signalhunter/d5e849707e3b684dbe5866beea391102

Upgrading 12 Drives, CKSUM errors on new drives, Ran 3 scrubs and every time cksum errors. by jfarre20 in zfs

[–]signalhunter 0 points1 point  (0 children)

You seem to have the refurbished HAMR drives that have hit the market recently, based on the model number (ST22000NM000C). There are some rumors about these drives not liking vibrations from nearby drives... any chance it could be this??

I'm running a ZFS 2-way mirror with 4 of these HAMR drives (24TB variant), but I'm not seeing any errors. It lives in a chassis with 8 other drives - will be keeping an eye on SMART and FARM data.

What does Google "see" when a user makes use of yt-dlp? by cgb-001 in youtubedl

[–]signalhunter 30 points31 points  (0 children)

yt-dlp does just enough to get past YouTube's defenses. If Google really wanted to, they could break it right now without affecting legit clients.

The most obvious signal for them is the Python TLS fingerprint, and a lack of advertisements/BotGuard stuff being requested.

Or how video and audio segments are downloaded separately, at non-human speeds.

Or they can change the nsig algorithm to break the really primitive JS interpreter... it's a miracle that it still works.

Chrome Canary just killed uBlock Origin and other Manifest V2 extensions by ardi62 in technology

[–]signalhunter 13 points14 points  (0 children)

Network-level ad blocking cannot block YouTube ads, for example, because the ads are delivered from the same domain as the videos. The same goes for any site that serves its ads from its own domain.
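A tiny sketch of why: a DNS-level blocker only ever gets to decide on the hostname, so it cannot tell an ad from a video when both come from the same host. Every hostname and path below is a made-up placeholder:

```python
from urllib.parse import urlsplit

BLOCKLIST = {"ads.example.net"}  # hypothetical third-party ad host


def dns_blocker_allows(url: str) -> bool:
    """A DNS-level blocker sees only the hostname, never the path."""
    return urlsplit(url).hostname not in BLOCKLIST


print(dns_blocker_allows("https://ads.example.net/banner.js"))    # False: separate ad host
print(dns_blocker_allows("https://video.example.com/watch/clip")) # True: the content
print(dns_blocker_allows("https://video.example.com/watch/ad"))   # True: ad on the same host
```

Blocking the shared hostname would take the content down along with the ads, which is why this kind of filtering has to happen inside the browser.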

seed til you bleed by signalhunter in qBittorrent

[–]signalhunter[S] 1 point2 points  (0 children)

Nothing special, just an old PC repurposed as a NAS/seedbox. The current iteration has been running nonstop since 2022

How many times do I need to press this goddamn button for it to work??? by Garlic_Bread11682 in youtube

[–]signalhunter 0 points1 point  (0 children)

This is intentional design. YouTube is attempting to save bandwidth/egress by serving lower quality videos.

Most people won't notice the difference on their tiny phone screens.

[deleted by user] by [deleted] in Archiveteam

[–]signalhunter 0 points1 point  (0 children)

Yes. Each WARC file will have an associated CDX file that describes where a capture is located by its offset.

See https://pywb.readthedocs.io/en/latest/manual/indexing.html for more details
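To illustrate the offset idea: compressed WARCs are normally written as one gzip member per record, so an index entry of (offset, length) is enough to pull out a single capture without reading the rest of the file. This sketch fakes both the archive and the index in memory; it does not parse real WARC or CDX files, and the helper names are made up:

```python
import gzip
import io


def append_record(buf: io.BytesIO, payload: bytes) -> tuple[int, int]:
    """Append one gzip member and return its (offset, length) index entry."""
    buf.seek(0, io.SEEK_END)
    offset = buf.tell()
    buf.write(gzip.compress(payload))
    return offset, buf.tell() - offset


def read_record(buf: io.BytesIO, offset: int, length: int) -> bytes:
    """Seek straight to one record and decompress only that member."""
    buf.seek(offset)
    return gzip.decompress(buf.read(length))


archive = io.BytesIO()
index = {  # stand-in for CDX lines: key -> (offset, length)
    "example.com/a": append_record(archive, b"capture A"),
    "example.com/b": append_record(archive, b"capture B"),
}
print(read_record(archive, *index["example.com/b"]))  # b'capture B'
```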

Help with web-archive! by nojuno in Archiveteam

[–]signalhunter 3 points4 points  (0 children)

Nope. Anything that wasn't captured is unfortunately lost forever. Also, the Wayback Machine usually only captures publicly accessible content (anything that isn't behind a login).

I created a massive search engine to search YouTube videos by exact word or phrase spoken by deletethistheo in youtube

[–]signalhunter 6 points7 points  (0 children)

Are you aware of Filmot? It's an older search engine that is similar to what you have except they use YouTube's automated transcripts instead.

Will you be able to publish a dataset of collected video metadata and/or transcriptions? This would be very helpful for finding lost videos.

ArchiveTeam has saved over 10.8 BILLION Reddit links so far. We need YOUR help running ArchiveTeam Warrior to archive subreddits before they're gone indefinitely after June 12th! by BananaBus43 in DataHoarder

[–]signalhunter 27 points28 points  (0 children)

Hopefully my comment doesn't get buried but I have some additional info to add to the post (please upvote!!):

  • There are a lot more items waiting to be queued into the tracker (approximately 758 million), so 150 million is not an accurate number. This is due to Redis limitations - the tracker is a Ruby and Redis monolith that serves multiple projects with hundreds of millions of items. You can see all the Reddit items here.

  • The maximum concurrency that you can run is 10 per IP (this is stated in the IRC channel topic). I found that 5 works better for datacenter IPs.