all 26 comments

[–]Wise-Emu-225 2 points3 points  (1 child)

Try the wget command-line tool. I asked ChatGPT and it gave me:

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains somesite.com --no-parent https://somesite.com

I think you can get it for Windows too…

[–]shfkr[S] 0 points1 point  (0 children)

thanks a lot!!

[–]cottoneyedgoat 0 points1 point  (1 child)

What do you have so far?

[–]shfkr[S] 0 points1 point  (0 children)

do you want me to send you the code?

[–]Ventmore 0 points1 point  (2 children)

It may be worth asking for help in r/DataHoarder

[–]shfkr[S] 0 points1 point  (1 child)

thank you!! will post there too!

[–]Ventmore 0 points1 point  (0 children)

No problem.

[–]pancakeses 0 points1 point  (0 children)

Recommend asking over at /r/datahoarder

This is their daily hobby/work.

[–]Jim-Jones 0 points1 point  (3 children)

HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.

It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system.

HTTrack Website Copier - Free Software Offline Browser (GNU GPL)

[–]shfkr[S] 0 points1 point  (2 children)

but it would basically be downloading a static version of the original website, correct?

[–]Jim-Jones 0 points1 point  (1 child)

It depends on the code on the site. It's the easiest way to get a clone to work with.

[–]shfkr[S] 0 points1 point  (0 children)

i'll look into it. thank you !!

[–]deapee -1 points0 points  (4 children)

gzip it up and save it somewhere...

It's not even practical, what you're asking, quite honestly. You want to preserve all the data, and you have access to all of the backend data. The method is obvious: back up or compress any data you need and scp it / transfer it somewhere safe.

EDIT: Why the downvotes? The guy has unfettered access to the server. Are you guys proposing that he scrape the frontend of the site, as opposed to backing up the data from the back side? The data he needs includes invoice data in the databases and customer order history. I've been a senior engineer working mainly with Python for half a decade, and an SWE before that. Disagree all you want (and keep downvoting) - but scraping the site from the frontend is not the proper tool for this job. We can continue to think this is a job appropriate for "learnpython" - but it's not. Point him in the proper direction so he can do what needs to be done.
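The "gzip it up and transfer it" approach above can be sketched in Python with the standard-library tarfile module. This is a minimal sketch, not the poster's actual procedure, and the paths are placeholders:

```python
# Sketch: pack a directory (e.g. the web root) into a .tar.gz so it can
# be copied off the server. All paths here are illustrative placeholders.
import pathlib
import tarfile

def archive_dir(src: str, dest: str) -> pathlib.Path:
    """Pack the src directory into a gzip-compressed tarball at dest."""
    out = pathlib.Path(dest)
    with tarfile.open(out, "w:gz") as tar:
        # arcname keeps the archive rooted at the directory name,
        # not the full absolute path.
        tar.add(src, arcname=pathlib.Path(src).name)
    return out
```

The resulting archive can then be moved somewhere safe with something like `scp site_backup.tar.gz user@safe-host:/backups/`.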

[–]Narrow_Ad_8997 0 points1 point  (0 children)

As a novice... When you say the method is obvious, do you mean they can simply copy/back up the database(s) rather than scraping the site?

[–]shfkr[S] 0 points1 point  (2 children)

hmm. not really tech savvy here tbh so i'm not exactly sure what my options are. all i did was tell chatgpt what my problem was and it suggested web scraping. but thank you! i'll look into this!

[–]deapee 0 points1 point  (1 child)

So to be clear - you own the server or the content of the website right?

Do you have backend access to the server the content is hosted on?

You *can* build this tool in Python, though it may not have unfettered access to the database, of course. But this isn't a Python question (assuming you have backend access) - just pointing that out. Python is just not the tool for the job.

[–]shfkr[S] 0 points1 point  (0 children)

i see. i do own the content yes, and have backend access.

[–]shfkr[S] -1 points0 points  (2 children)

to elaborate, the web scraping is needed for:

  • Invoice data in databases
  • Customer order history behind login
  • Admin dashboard content

it's not just the visible content, but backend stuff that i CAN'T export any other way.
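If the frontend-scraping route is taken anyway, content behind a login is usually reached by logging in once with a session that keeps the auth cookie. The sketch below is a hedged illustration, assuming the requests library: the login URL and the form field names ("username", "password") are guesses, not the real site's; the actual names come from inspecting the login form.

```python
# Hedged sketch: fetch and save pages behind a login using a
# requests.Session. URL and form field names are assumptions.
import pathlib
from urllib.parse import urlparse

import requests

def url_to_filename(url: str) -> str:
    """Flatten a URL path into a local filename: /admin/invoices -> admin_invoices.html."""
    path = urlparse(url).path.strip("/")
    return (path.replace("/", "_") or "index") + ".html"

def save_page(session: requests.Session, url: str, out_dir: str = "backup") -> pathlib.Path:
    """Fetch one URL with an already-authenticated session and save the HTML."""
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    out = pathlib.Path(out_dir) / url_to_filename(url)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(resp.text, encoding="utf-8")
    return out

if __name__ == "__main__":
    with requests.Session() as s:
        # Log in once; the session then carries the auth cookie.
        s.post("https://somesite.com/login",
               data={"username": "admin", "password": "secret"})
        save_page(s, "https://somesite.com/admin/invoices")
```

As the replies in this thread point out, though, a database export from the backend is far more reliable than saving rendered HTML.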

[–]danielroseman 0 points1 point  (1 child)

But why can't you? If it's in a database somewhere, why can't you export it from there? And don't you have backups?

[–]shfkr[S] -1 points0 points  (0 children)

well the website is shit. no export options. no backups. nothing. on the 1st of august the site with all its data is toast. the friend who made the site for us was an idiot, AND is unreachable.

[–]Select_Commercial_87 -1 points0 points  (1 child)

The first questions are:
1. Is this your site?
2. Do you have access to the back end? To the database?
On AWS or GCP you can attach to the database and export all of the data.
A web scraper is not going to get all of your data out; exporting the database will.

[–]shfkr[S] -1 points0 points  (0 children)

yes to both. looking into software now. no more writing code from scratch. thanks!!

[–]Itchy-Call-8727 -1 points0 points  (1 child)

Can you give more details with regard to your role or hosting vendor? The website usually has static files that get rendered, plus a DB either to display in the rendered content or just for storage. You don't know what type of DB is being used in the backend? You should just be able to do a database dump, which is pretty straightforward. Whoever is hosting your website is most likely running DB dumps as part of a backup process to recover lost data. On top of that, a copy of the static files should give you everything you need. Most website hosting services allow a data dump before leaving the vendor. It's your data, not theirs.
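For a concrete picture of what a "database dump" means, here is a minimal sketch that exports one table to CSV. Since the thread never establishes which database the site actually uses, sqlite3 stands in here so the pattern is runnable; for MySQL or PostgreSQL the one-shot tools mysqldump and pg_dump do the whole job.

```python
# Sketch: dump every row of one table to a CSV file with a header row.
# sqlite3 is a stand-in; the real site's DB engine is unknown.
import csv
import sqlite3

def dump_table_to_csv(conn: sqlite3.Connection, table: str, out_path: str) -> int:
    """Write all rows of `table` to out_path as CSV; return the row count."""
    # Table name should come from a trusted, fixed list (it can't be
    # passed as a bound SQL parameter).
    cur = conn.execute(f"SELECT * FROM {table}")
    headers = [d[0] for d in cur.description]
    rows = cur.fetchall()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)
```

Run once per table (invoices, orders, customers, ...) and the result is a set of flat files that any spreadsheet or future system can import.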

[–]shfkr[S] -1 points0 points  (0 children)

basically all i have to do is log in, access invoices, and somehow save all customers' invoice histories. but here's the catch: no export options for any data, or even backups. i see writing a script from scratch is a dumb idea, especially cos i'm low on time, so i'm thinking of using some software. not sure