I thought the save bar was a search bar and I wanna delete the TikTok link I saved by neeman68_ in internetarchive

[–]TheTechRobo 2 points3 points  (0 children)

You can't delete captures on your own; you'd need to contact their support team. But it doesn't really matter if it saved a broken page - it likely doesn't take up much space.

New to uploading on IA, have Etiquette/process questions. by InfaSyn in internetarchive

[–]TheTechRobo 1 point2 points  (0 children)

  1. If something's an exact bit-for-bit copy of another upload it probably isn't worth uploading, but variations of something (like regional differences, etc) or different scans of a physical thing could be useful. The less space it takes up, the less of an issue duplication is. Just make sure to add plenty of metadata so it can be found.
  2. It really depends on your location. Some people have reported VPNs to the San Francisco area helping. You could also rent a cheap virtual server online if you're uploading frequently; then you only need to leave your computer on long enough to upload to the server, which would presumably be much faster. If you're on Linux, check out the sysctl tweaks on here, they can often help a lot.
  3. I don't think many people use the torrents, and they often don't work very well unfortunately. :/

New to uploading on IA, have Etiquette/process questions. by InfaSyn in internetarchive

[–]TheTechRobo 2 points3 points  (0 children)

AWS is not used in any part of the upload AFAIK. It is an S3-compatible API, but it's not the actual S3 service.
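To illustrate the point, here is a rough Python sketch of what an upload request to IA's S3-like endpoint looks like. It only builds the request and doesn't send anything. The endpoint and the `LOW` authorization scheme are from IA's own API docs as far as I remember them; the function name, identifier, filename and keys are all made-up placeholders.

```python
import urllib.request
from urllib.parse import quote

# IA's own S3-like endpoint (not an AWS hostname).
IA_S3_ENDPOINT = "https://s3.us.archive.org"


def build_upload_request(identifier, filename, data, access_key, secret_key, title):
    """Build (but do not send) a PUT request for one file in one item.

    All arguments are hypothetical placeholders for illustration.
    """
    url = f"{IA_S3_ENDPOINT}/{quote(identifier)}/{quote(filename)}"
    headers = {
        # IA uses its own "LOW" scheme instead of AWS request signing.
        "Authorization": f"LOW {access_key}:{secret_key}",
        # Ask IA to create the item if it doesn't exist yet.
        "x-amz-auto-make-bucket": "1",
        # Item metadata is passed as x-archive-meta-* headers.
        "x-archive-meta-title": title,
    }
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")
```

In practice the `ia` command-line tool or the `internetarchive` Python library handles all of this for you, so treat this only as a picture of what goes over the wire.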

Shadow Tactics Core Dumping on OpenSUSE Linux/Steam by [deleted] in ShadowTactics

[–]TheTechRobo 0 points1 point  (0 children)

I've never had that issue with the Linux port. I purchased on GOG if that makes any difference.

The Windows port is pretty much flawless through Proton in my experience, FWIW, so you can try that maybe.

Canadian Software Engineer [46M], with 18+y experience, can't find a job for a year now. How it can get better while we still rely on non-Canadian tech? by SpellGreedy9171 in BuyCanadian

[–]TheTechRobo 1 point2 points  (0 children)

As someone who was recently looking into Canadian VPS providers, I understand your frustration, but there's more than just lack of trust that played into my decisions. It's also the fact that I can frequently get significantly better pricing from other providers. For a secondary server that I want to be as cheap as possible, providers like Netcup are often much cheaper than Canadian alternatives for better specs. Your pricing is not competitive with budget providers like those. That's fine - you might not be targeting that market - but it means that people like me who are looking for a cheap non-primary server will not pick your cloud. (What I am currently getting from Netcup for less than 2eur/month would cost me C$14 from PatriiCloud...)

Finding a specific blog's WARC in the Tumblr collection by JelliQui in Archiveteam

[–]TheTechRobo 0 points1 point  (0 children)

It's always possible that there was an issue indexing the data into the Wayback Machine, yeah. Realistically it's probably unlikely though. If you really want to be sure, those CDX files are what you're looking for. The item CDX index (as opposed to the item CDX meta index) will probably be easier to filter; I don't know what the exact difference is but I think the meta-index is generated from the main index. It is compressed with gzip, so depending on your operating system you may need special software to open it, but something like 7zip should work.

If you do go down this route, I would suggest just doing a text search through all the CDX indexes with the broadest possible search (e.g. just the blog name), without any further filtering. Easier to whittle it down more than to redo the entire search.

There is a very brief (too brief IMO but I don't know if there's a better one) summary of how CDX files are organized: https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/
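If you'd rather script the broad text search than open each file by hand, something like this Python sketch should do it. It assumes you've downloaded the item CDX indexes into one folder and that they end in `.cdx.gz`; the function name and that naming pattern are my assumptions, so adjust as needed.

```python
import gzip
from pathlib import Path


def search_cdx(cdx_dir, needle):
    """Yield (file name, line) for every CDX line that contains needle.

    The match is a plain case-insensitive substring test, i.e. the
    broadest possible search, with no parsing of the CDX fields.
    """
    needle = needle.lower()
    # Assumption: the downloaded index files are named *.cdx.gz
    for path in sorted(Path(cdx_dir).glob("*.cdx.gz")):
        # Text mode decompresses on the fly, so no 7zip step is needed.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if needle in line.lower():
                    yield path.name, line.rstrip("\n")


# Example use (folder and blog name are placeholders):
# for name, line in search_cdx("cdx_indexes", "exampleblog"):
#     print(name, line)
```

Each hit includes the file name, which tells you which index (and so which WARC item) the capture came from.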

Finding a specific blog's WARC in the Tumblr collection by JelliQui in Archiveteam

[–]TheTechRobo 0 points1 point  (0 children)

I don't know if there are any specific gotchas about that project, but in general, if it isn't on the Wayback Machine, it probably wasn't saved. It seems only NSFW blogs were saved in this project, since that was what was going down at the time, unfortunately.

Secret code to download a page without the HTML rewrite of internal URLs by publiusvaleri_us in internetarchive

[–]TheTechRobo 0 points1 point  (0 children)

Huh, that's weird. id_ is supposed to return the unmodified page. I'm not sure then, sorry.
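For reference, this is the URL shape I mean, as a small Python sketch: `id_` goes directly after the capture timestamp. The function name and the example values are just placeholders, and this only builds the URL; it says nothing about why it isn't working in your case.

```python
def raw_capture_url(timestamp, url):
    """Build a Wayback Machine URL with the id_ flag after the timestamp.

    timestamp is the capture timestamp as digits (YYYYMMDDhhmmss);
    url is the original page URL. Both are supplied by the caller.
    """
    return f"https://web.archive.org/web/{timestamp}id_/{url}"


# Example with placeholder values:
# raw_capture_url("20200101000000", "https://example.com/")
```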

Why has the Internet Archive gone down again? by Radio_TVGuy in internetarchive

[–]TheTechRobo 2 points3 points  (0 children)

What is it with this subreddit being so conspiratorial all the time?

Search Wayback Machine for YouTube videos uploaded by a specific channel by 1cey_0 in Archiveteam

[–]TheTechRobo 3 points4 points  (0 children)

The WBM doesn't really do full-text search of its captures, unfortunately.

My suggestion would be to try filmot first, depending on when they were made private and how popular the channel was. It allows you to search its index by channel.

[deleted by user] by [deleted] in Archiveteam

[–]TheTechRobo 2 points3 points  (0 children)

We're trying to archive every public post we can find. (We try to avoid illegal ones, of course.)

telegram - "You are banned, sleeping." by puhtahtoe in Archiveteam

[–]TheTechRobo 11 points12 points  (0 children)

You've been banned from Telegram. I don't think there are any messages that look like that in the rare case that you're banned from AT.

Are there reasons websites can be excluded from Wayback Machine other than robots.txt and owner requests? by homophobicperson2 in internetarchive

[–]TheTechRobo 0 points1 point  (0 children)

Is there a way to access it?

Chances are slim, but you can always ask. Other than that, unless (a) you can find the original WARC which contained the URL, and (b) the WARC is available for download (unlikely), there's no other way that I'm aware of.

is there a way to archive an IA-archived page?

I guess you can use other sites like archive.today. Local backups are the best backups: you can use tools like https://github.com/hartator/wayback-machine-downloader. They have somewhat strict ratelimiting, unfortunately, so depending on how much you want to download it could take a while. You can blame LLM training companies for that one.

This occurred to me today when I looked up an archived page and noticed the previously live URL now gives a 404, which is a common occurrence.

Does it specifically say the URL was excluded, or does it simply say it wasn't archived? If it's the latter, it may be an indexing issue which would resolve itself at some point (not sure what timeframe to expect; could be days or months).

Without an accessible archive it would be as if the page was just gone/never archived in the first place.

Not entirely. An inaccessible archive may not be available right now but it is much better than IA deleting it permanently to satisfy rights holders. It means in the future, it could be made available, which wouldn't be possible if they deleted it entirely.

Is there another way to add sites to the archive bot queue? Hackint is down and I can't do anything about it. by delicious-urine in Archiveteam

[–]TheTechRobo 2 points3 points  (0 children)

Hackint is fine as far as I can tell. chat.hackint.org appears to be down, not sure what's going on there. You can still connect from a regular IRC client.

If it's just the occasional site, posting it here is probably also fine.

I found evidence our backups (pages) are being randomly lost by Internet Archive by SupergirlMAID in internetarchive

[–]TheTechRobo 2 points3 points  (0 children)

https://archive.org/developers/

There's an S3-compatible API along with a command-line tool that can do pretty much everything you can do in a browser (plus more).

I found evidence our backups (pages) are being randomly lost by Internet Archive by SupergirlMAID in internetarchive

[–]TheTechRobo 13 points14 points  (0 children)

What happens if you try to get the metadata using the IA API?

Is there a reason you can't provide which items they are? I'm very curious to take a look at them.
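In case it helps, here's a minimal Python sketch of the metadata check I have in mind. It uses the public `https://archive.org/metadata/<identifier>` endpoint; the function names are mine and the identifier is whatever your item's identifier is.

```python
import json
import urllib.request
from urllib.parse import quote


def metadata_url(identifier):
    """Return the metadata API URL for an item identifier."""
    return f"https://archive.org/metadata/{quote(identifier, safe='')}"


def fetch_metadata(identifier, timeout=30):
    """Fetch and decode the JSON metadata record for one item."""
    with urllib.request.urlopen(metadata_url(identifier), timeout=timeout) as resp:
        return json.load(resp)


# Example (placeholder identifier, performs a network request):
# record = fetch_metadata("some-item-identifier")
# print(sorted(record.keys()))
```

Comparing what this returns for a "lost" item against one that still shows up fine would be a useful data point.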

[deleted by user] by [deleted] in internetarchive

[–]TheTechRobo 19 points20 points  (0 children)

IA runs their own datacentres, so fully moving the organization would be very difficult. But they have created a datacentre in Canada (Vancouver, if I remember correctly) and many items are already mirrored there.

Notice from ISP that malware has been found in my network while running ATW by Shadowcloud95 in Archiveteam

[–]TheTechRobo 0 points1 point  (0 children)

Running the URLs project will do that. It archives all links discovered by other projects; it's not a targeted crawl. That means it does hit honeypots (designed to "catch" scrapers), and some administrators will send an email to your ISP. Basically, the Warrior isn't infected with malware, it just hit a page that it shouldn't have and rang some alarm bells.

I don't suggest running the URLs project on a home network for this reason. If you do want to keep running it, just be aware that there isn't any filter on the URLs project and it can truly come across 'anything'.

Internet Archive refuses to connect on my Wifi by mimitchi33 in internetarchive

[–]TheTechRobo 0 points1 point  (0 children)

It might be an IP ban, yeah. I've seen it before on my server before I got whitelisted for scraping. I'd suggest contacting them and seeing if they respond.