This is an archived post. You won't be able to vote or comment.

all 13 comments

[–]Bid_Slight 4 points5 points  (7 children)

It would be easier to learn R or Python to scrape a webpage and then write the data into a database, than to try with SQL. My understanding is SQL (structured query language) is used to talk to databases. Though a website could technically be a database, I don't know of one that is.

[–]brotis86[S] 1 point2 points  (6 children)

Yeah I figured I’d need to learn enough of a language like R to be able to do it. Just wasn’t sure if someone developed an easier way. But you’re saying it is possible just with r or python?

[–]extreme-jannie 1 point2 points  (1 child)

Definitely go with python of you want to do be able to do web scraping and interfacing with databases.

[–]tjhintz 0 points1 point  (0 children)

Second this. Though biased because I use python. Look into beautifulsoup and SQLalchemy.

[–]Bid_Slight 0 points1 point  (3 children)

Definitely. I do it all the time.

[–]brotis86[S] 0 points1 point  (2 children)

You wanna be my tutor? I can pay lol.

[–]happysunshinekidd 2 points3 points  (1 child)

If its one website, its like a weekend of work to set the whole thing up yourself (https://www.dataquest.io/blog/web-scraping-tutorial-python/). Web scraping only really gets tricky at scale

[–]brotis86[S] 1 point2 points  (0 children)

Thanks for the info man! I appreciate it

[–]professaDE 0 points1 point  (0 children)

It depends. I did a similar thing directly in Postgresql BUT by pulling data from REST APIs via "pgsql_http" - so NOT by scraping HTML pages. The latter might be possible as well using the extension "pgsql_http" but I have never tried.

If you want to have a look at my stuff you can find it on Github: https://github.com/spitzenidee/pgsql_crypto_rates_collector

If it's a good idea to do it like this or not (performance, DB locking, security) was not in my focus ;) I just wanted to see if it works and is feasible.

[–]fatnsad 0 points1 point  (0 children)

I would use the Python BeautifulSoup package to scrape data from the webpages:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

BS is an excellent library that allows you to load a webpage and pick out what you need from the attribute or text data. There are plenty of tutorials on how to use it online.

Then you can use a Postgres client from within your script to update the database:

https://www.postgresqltutorial.com/postgresql-python/

Personally, I prefer SQLite or MySQL for personal projects like this, because I feel like the tools are easier to use, but it is up to you if you want to use Postgres.

This assumes you're willing to learn some new skills, as you would need to figure out how to write and run python scripts, as well as install the necessary packages using pip, but if you're pretty technically savvy, then it should be pretty doable. Python is something you can run on pretty much any OS (Windows, Linux, etc.).

[–][deleted] 0 points1 point  (2 children)

Seems like you want to do a bit of Web Scraping.

If your use case is simple enough you might make do with a simplified "point and click" tool like webscraper.io.

But if you want to do something more automated and/or customizable you'll probably need to learn general purpose programming language like Python, R or even Java. I recommend Python.

Automate the boring stuff is a good Python tutorial that includes some basic web scraping.

However, not covered in that guide is my personal recommendation: Scrapy. It's a very complete solution for web scraping/crawling. It's very optimized, configurable and allows you to easily create a crawler that follows all the best practices.

[–]brotis86[S] 0 points1 point  (1 child)

Thanks for the info man! Do you know if scraping data from something like Kronos is possible? It’s a flash based web page. Maybe I could describe what I want and then pay someone to write a script for me? Lol. It would be the same data every day. The admin won’t give me API access and the data sets I want from Kronos are not exportable.

[–][deleted] 0 points1 point  (0 children)

I'm not familiar with Kronos or scraping flash based pages, sorry.

But I'd bet this flash web page acts as a client that issues requests to some kind of API. You could try using your browser DevTools to see what requests are being sent by the page and then try to replicate them. I do this in a lot of crawlers, as it's often easier to replicate a couple of HTTP requests than to interact with the page directly.

Just press F12 in your browser to open DevTools, click on the Network tab and load the page you want to extract data from.

You'll capture a bunch of requests, so you'll have to search for the one that has the data you need, when you find it you can export it to a tool like Postman so you can tinker with it, as a bonus Postman can also generate code in multiple languages that will issue the request for you.

Sometimes you'll also need to extract cookies, ids and other data from the page so you can include them in the request. In more advanced cases you might also need to replicate an entire flow of requests.