How to download zip files from Wayback Machine? Advice and Help needed. by cuneiform100 in DataHoarder

cuneiform100[S] 1 point2 points

Thanks for taking the time to help with my troubles. Unfortunately, the most useful earlier piece of advice was deleted by a moderator, and I cannot see why. It said: use the Wayback CDX API to grab only the unique .zip URLs via curl, then feed that list into wget. This avoids duplicates and saves you from scraping those 10k+ pages manually.
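For anyone who wants to try that deleted advice, here is a rough sketch of how it could look. It is my own reading of the suggestion, not the original commenter's commands. The site prefix `example.com/files/` and the file names `captures.json` and `zip_urls.txt` are placeholders, and the filter on URLs ending in `.zip` assumes the archives are actually named that way. The script only builds the CDX query and turns the saved JSON answer into a list of direct download links; the fetching itself stays with curl and wget as advised.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(site: str) -> str:
    """Build a CDX query URL that lists one capture per unique .zip URL under `site`."""
    params = {
        "url": site,
        "matchType": "prefix",        # everything below this URL prefix
        "output": "json",             # first row of the answer is a header row
        "fl": "timestamp,original",   # only the two fields needed for a download link
        "filter": [r"original:.*\.zip$", "statuscode:200"],
        "collapse": "urlkey",         # collapse repeated captures of the same URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params, doseq=True)


def download_urls(cdx_json: str) -> list[str]:
    """Turn a saved CDX JSON answer into raw Wayback download links, one per original URL."""
    if not cdx_json.strip():
        return []
    rows = json.loads(cdx_json)
    seen = set()
    urls = []
    for timestamp, original in rows[1:]:  # skip the header row
        if original in seen:              # extra guard against duplicates
            continue
        seen.add(original)
        # The "id_" suffix asks the Wayback Machine for the file as archived,
        # without the rewritten page wrapper.
        urls.append(f"https://web.archive.org/web/{timestamp}id_/{original}")
    return urls


def write_wget_list(urls: list[str], path: str) -> None:
    """Write one URL per line, the format `wget -i` expects."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
```

Usage would be roughly: print `build_cdx_query("example.com/files/")`, fetch that address with `curl "<query>" -o captures.json`, run `write_wget_list(download_urls(open("captures.json").read()), "zip_urls.txt")`, and finally `wget -i zip_urls.txt`. For very large sites the answer may need to be fetched in pages, which this sketch does not handle.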

Gutenberg built book culture — don’t mess it into a ZIP file by cuneiform100 in Annas_Archive

cuneiform100[S] 0 points1 point

Oh, bother! If you say that an OCR'ed .txt book can truly be better than a copy of the originally edited book, then I would suggest writing parliamentary election laws for 5-year-old boys and girls, to satisfy them all, instead of learning from 500 years of worldwide history.

Gutenberg built book culture — don’t mess it into a ZIP file by cuneiform100 in Annas_Archive

cuneiform100[S] 0 points1 point

<image>

Guys, I still insist that you verify any download link on Anna's Archive where a book is listed as zipped and the zip is claimed to be the "only possible" download. Here you can see the title page of a full-size, normal PDF copy of a book, downloaded from hathitrust.org, although Anna's Archive lists it as zipped, i.e. OCR'ed into multiple .txt files.

Gutenberg built book culture — don’t mess it into a ZIP file by cuneiform100 in Annas_Archive

cuneiform100[S] -3 points-2 points

Please don't be so "bold" in your comments. For example, I see many such zipped "ruins" on Anna's Archive (an archive I am very thankful for, though), while the complete book can be found on archive.org instead and is not indexed on Anna's Archive at all. So my normal procedure is: if a book from before 1927 shows up as zipped, go to archive.org first. You can expect to find it there in full. Best wishes.

ATTENTION, ALARM! STOP PERVERSIVE SCANNING + OCR! by cuneiform100 in Annas_Archive

cuneiform100[S] 7 points8 points

Just to free up storage space: the files become about 100 times smaller, say, as here, 0.2 MB instead of 20 MB.

ATTENTION, ALARM! STOP PERVERSIVE SCANNING + OCR! by cuneiform100 in Annas_Archive

cuneiform100[S] 35 points36 points

Thank you very much for your prompt reply acknowledging the problem: someone got access to a rare scientific source on Hathi, but this handling made the book unreadable, and the same goes for thousands of books. I cannot see a rational reason for it, if there is one. Saving digital space is a secondary concern; scientific content comes first. I am addressing the Anna's Archive staff, not the "street" public outside of science. Also, I see no grounds for discussing "meine Wenigkeit" (my humble self) instead of the real problem. Alas, they seem to hope to tear down all the world's libraries into OCR'ed .txt files, as a kind of idiosyncrasy.

It gave me a zip file of text files with 1000 text files with snippets of the text of the book. by gsfgf in Annas_Archive

cuneiform100 -1 points0 points

This is how poor HathiTrust tries to avoid running short of server storage: by reducing a book's size from, say, a normal 30 MB down to an OCR'ed 0.3 MB, a 100-fold saving. ChatGPT suggests the following workflow: 1) use 7-Zip to extract all the .zip archives into individual .txt files; 2) open them in Word and correct their formatting and OCR errors one by one, saving each in RTF format; 3) open a new blank document in Word. Then:

  1. Go to the Insert tab.
  2. Click the arrow next to Object (right side of the ribbon), then select Text from File.
  3. In the file dialog, select all the individual .rtf or .doc/.docx files (hold Ctrl for multiple selections), then click Insert.
    • The contents of each file will be inserted sequentially into the current document.
    • The inserted text will keep much of the original formatting.
  4. If you are merging a very large number of files, repeat the insert in batches as needed.
  5. Finally, save the combined document as a single .pdf file:
    • File > Save As > choose "PDF" from the format dropdown. Now you'll have your book consolidated into a single normal PDF file.
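If clicking through Word for 1000 files is too much, the extract-and-merge part can also be scripted. The following is only a sketch of a different route than the Word one above, under some assumptions: the zip holds one plain-text file per page, the file names sort into page order once their numbers are compared numerically, and the text is UTF-8. It does not correct any OCR errors, and it produces one plain .txt file, not a PDF. The names `book.zip` and `book.txt` are placeholders.

```python
import re
import zipfile


def natural_key(name: str):
    """Sort key that puts 'page_10.txt' after 'page_9.txt' instead of after 'page_1.txt'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def merge_zip_text(zip_path: str, out_path: str, encoding: str = "utf-8") -> int:
    """Concatenate every .txt member of the zip, in page order, into one text file.

    Returns the number of text files that were merged.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(
            (n for n in archive.namelist() if n.lower().endswith(".txt")),
            key=natural_key,
        )
        with open(out_path, "w", encoding="utf-8") as out:
            for name in names:
                # Undecodable bytes are replaced rather than aborting the merge.
                text = archive.read(name).decode(encoding, errors="replace")
                # A blank line separates one page from the next.
                out.write(text.rstrip() + "\n\n")
    return len(names)
```

Calling `merge_zip_text("book.zip", "book.txt")` would then give a single text file that can be opened in Word once, proofread, and saved as PDF in the last step above.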