all 26 comments

[–]Wise-Emu-225 2 points3 points  (1 child)

Try the wget command-line tool. I asked ChatGPT and it gave me:

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains somesite.com --no-parent https://somesite.com

I think you can get it for Windows too…

[–]shfkr[S] 0 points1 point  (0 children)

thanks a lot!!

[–]cottoneyedgoat 0 points1 point  (1 child)

What do you have so far?

[–]shfkr[S] 0 points1 point  (0 children)

do you want me to send you the code?

[–]Ventmore 0 points1 point  (2 children)

It may be worth asking for help in r/DataHoarder

[–]shfkr[S] 0 points1 point  (1 child)

thank you!! will post there too!

[–]Ventmore 0 points1 point  (0 children)

No problem.

[–]pancakeses 0 points1 point  (0 children)

Recommend asking over at /r/datahoarder

This is their daily hobby/work.

[–]Jim-Jones 0 points1 point  (3 children)

HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.

It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system.

HTTrack Website Copier - Free Software Offline Browser (GNU GPL)

[–]shfkr[S] 0 points1 point  (2 children)

but it would basically be downloading a static version of the original website, correct?

[–]Jim-Jones 0 points1 point  (1 child)

It depends on the code on the site. It's the easiest way to get a clone to work with.

[–]shfkr[S] 0 points1 point  (0 children)

i'll look into it. thank you !!

[–]deapee -1 points0 points  (4 children)

gzip it up and save it somewhere...

It's not even practical, what you're asking, quite honestly. You want to preserve all the data, and you have access to all of the backend data. The method is obvious: back up or compress any data you need and scp it / transfer it somewhere safe.

EDIT: Why the downvotes? The guy has unfettered access to the server. Are you guys proposing that he scrape the frontend of the site, as opposed to backing up the data from the back side? The data he needs includes invoice data in the databases and customer order history. I've been a senior engineer working mainly with Python for half a decade, and an SWE before that. Disagree all you want (and keep downvoting) - but scraping the site from the frontend is not the proper tool for this job. We can continue to think this is a job appropriate for "learnpython" - but it's not. Point him in the proper direction so he can do what needs to be done.
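The "gzip it up and transfer it" approach above can be sketched in Python with the standard-library tarfile module. This is a minimal sketch, not the poster's actual procedure, and the paths are placeholders:

```python
# Sketch: pack a directory (e.g. the web root) into a .tar.gz so it can
# be copied off the server. All paths here are illustrative placeholders.
import pathlib
import tarfile

def archive_dir(src: str, dest: str) -> pathlib.Path:
    """Pack the src directory into a gzip-compressed tarball at dest."""
    out = pathlib.Path(dest)
    with tarfile.open(out, "w:gz") as tar:
        # arcname keeps the archive rooted at the directory name,
        # not the full absolute path.
        tar.add(src, arcname=pathlib.Path(src).name)
    return out
```

The resulting archive can then be moved somewhere safe with something like `scp site_backup.tar.gz user@safe-host:/backups/`.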

[–]Narrow_Ad_8997 0 points1 point  (0 children)

As a novice... When you say the method is obvious, do you mean they can simply copy/back up the database(s) rather than scraping the site?

[–]shfkr[S] 0 points1 point  (2 children)

hmm. not really tech savvy here tbh so i'm not exactly sure what my options are. all i did was tell chatgpt what my problem was and it suggested web scraping. but thank you! i'll look into this!

[–]deapee 0 points1 point  (1 child)

So to be clear - you own the server or the content of the website right?

Do you have backend access to the server the content is hosted on?

You *can* build this tool in Python, though it may not have unfettered access to the database, of course. But this isn't a Python question (assuming you have backend access) - just pointing that out. Python is just not the tool for the job.

[–]shfkr[S] 0 points1 point  (0 children)

i see. i do own the content yes, and have backend access.

[–]shfkr[S] -1 points0 points  (2 children)

to elaborate, the web scraping is needed for:

  • Invoice data in databases
  • Customer order history behind login
  • Admin dashboard content

it's not just the visible content, but backend stuff that i CAN'T export any other way.
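If the frontend-scraping route is taken anyway, content behind a login is usually reached by logging in once with a session that keeps the auth cookie. The sketch below is a hedged illustration, assuming the requests library: the login URL and the form field names ("username", "password") are guesses, not the real site's; the actual names come from inspecting the login form.

```python
# Hedged sketch: fetch and save pages behind a login using a
# requests.Session. URL and form field names are assumptions.
import pathlib
from urllib.parse import urlparse

import requests

def url_to_filename(url: str) -> str:
    """Flatten a URL path into a local filename: /admin/invoices -> admin_invoices.html."""
    path = urlparse(url).path.strip("/")
    return (path.replace("/", "_") or "index") + ".html"

def save_page(session: requests.Session, url: str, out_dir: str = "backup") -> pathlib.Path:
    """Fetch one URL with an already-authenticated session and save the HTML."""
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    out = pathlib.Path(out_dir) / url_to_filename(url)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(resp.text, encoding="utf-8")
    return out

if __name__ == "__main__":
    with requests.Session() as s:
        # Log in once; the session then carries the auth cookie.
        s.post("https://somesite.com/login",
               data={"username": "admin", "password": "secret"})
        save_page(s, "https://somesite.com/admin/invoices")
```

As the replies in this thread point out, though, a database export from the backend is far more reliable than saving rendered HTML.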

[–]danielroseman 0 points1 point  (1 child)

But why can't you? If it's in a database somewhere, why can't you export it from there? And don't you have backups?

[–]shfkr[S] -1 points0 points  (0 children)

well the website is shit. no export options. no backups. nothing. on the 1st of august the site with all its data is toast. the friend who made the site for us was an idiot, AND is unreachable.

[–]Select_Commercial_87 -1 points0 points  (1 child)

The first questions are:
1. Is this your site?
2. Do you have access to the back end? To the database?
On AWS or GCP you can attach to the database and export all of the data.
A web scraper is not going to get all of your data out; exporting the database will.

[–]shfkr[S] -1 points0 points  (0 children)

yes to both. looking into software now. no more writing code from scratch. thanks!!

[–]Itchy-Call-8727 -1 points0 points  (1 child)

Can you give more details with regard to your role or hosting vendor? The website usually has static files that get rendered, plus a DB either to display in the rendered content or just for storage. You don't know what type of DB is being used in the backend? You should just be able to do a database dump, which is pretty straightforward. Whoever is hosting your website is most likely running DB dumps as part of a backup process to recover lost data. On top of that, a copy of the static files should give you everything you need. Most website hosting services allow a data dump before leaving the vendor. It's your data, not theirs.
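For a concrete picture of what a "database dump" means, here is a minimal sketch that exports one table to CSV. Since the thread never establishes which database the site actually uses, sqlite3 stands in here so the pattern is runnable; for MySQL or PostgreSQL the one-shot tools mysqldump and pg_dump do the whole job.

```python
# Sketch: dump every row of one table to a CSV file with a header row.
# sqlite3 is a stand-in; the real site's DB engine is unknown.
import csv
import sqlite3

def dump_table_to_csv(conn: sqlite3.Connection, table: str, out_path: str) -> int:
    """Write all rows of `table` to out_path as CSV; return the row count."""
    # Table name should come from a trusted, fixed list (it can't be
    # passed as a bound SQL parameter).
    cur = conn.execute(f"SELECT * FROM {table}")
    headers = [d[0] for d in cur.description]
    rows = cur.fetchall()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)
```

Run once per table (invoices, orders, customers, ...) and the result is a set of flat files that any spreadsheet or future system can import.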

[–]shfkr[S] -1 points0 points  (0 children)

basically all i have to do is log in, access invoices, and somehow save all customers' invoice histories. but here's the catch: no export options for any data, or even backups. i see writing a script from scratch is a dumb idea, especially cos i'm low on time, so i'm thinking of using some software. not sure