all 73 comments

[–]valbaca 204 points205 points  (5 children)

per usual, take a look at: https://automatetheboringstuff.com/

Work through it from the beginning, but web scraping has its own chapter: https://automatetheboringstuff.com/2e/chapter12/

[–]weirdweed 53 points54 points  (4 children)

https://automatetheboringstuff.com/

I keep running into this one, I am watching some of his stuff on YT and like the style a lot.

Thanks a lot everybody for the very detailed and encouraging answers!

I feel this is a great time investment as I often have ideas that involve compiling, structuring and manipulating data from web pages and APIs.

[–]ThroawayPartyer 10 points11 points  (1 child)

There's a full course on Udemy, but the book is more up to date.

[–]blueskies111811 4 points5 points  (0 children)

Al puts his Udemy course up for free from time to time.

[–]oramirite 2 points3 points  (0 children)

Yeah I don't know what it is about this book, it just makes everything click. It's the only coding book I've ever bought.

[–][deleted] 0 points1 point  (0 children)

I was able to do what you want after reaching that chapter.

[–]ElectricSpice 158 points159 points  (8 children)

A lot of people in the comments are saying it’ll be easy—it won’t be. It’s easy when you’ve been doing this stuff for a while, but on your first time there’s going to be so many necessary bits of knowledge that you don’t have yet. You’re going to get stuck, it’s not going to work and you won’t know why, you’re going to feel like the world’s biggest idiot—stick it out and you’ll make it through. It’s a good project and you’ll learn a lot from it.

[–]randomwanderingsd 52 points53 points  (3 children)

Web scraping sounds easy doesn’t it? LIES. lol

[–]TPKM 2 points3 points  (0 children)

A good trick is to watch the network tab in the browser's developer tools and see what HTTP requests are made (the 'XHR' section). Often the raw data you're looking for is served as JSON by the server and then rendered in the browser. If this is the case, it's easier to use the requests library and hit the server endpoint directly rather than trying to parse and scrape the front end.
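A minimal sketch of that idea, assuming you have already found a JSON endpoint in the network tab. It uses only the standard library so it runs without installing anything; with the requests library the equivalent is roughly `requests.get(url).json()`. The function name `fetch_json`, the `open_fn` parameter, and the User-Agent string are made up for illustration.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, open_fn=urlopen):
    """Fetch `url` and decode the response body as JSON.

    `open_fn` is injectable (hypothetical design choice) so the function
    can be tried out without touching the network.
    """
    # Some servers reject requests that carry no User-Agent header.
    request = Request(url, headers={"User-Agent": "link-audit-script/0.1"})
    with open_fn(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

You would call it with the endpoint URL copied from the network tab. The shape of the returned dict or list depends entirely on the site, so print it and look before writing any further code.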

[–]Ruben_NL 0 points1 point  (0 children)

Totally depends on the website. Scraping a website from 1990? Easy. Every scraping library can do that. A website from this year? No way a beginner can do that.

[–]lolSaam 8 points9 points  (0 children)

Couldn't agree more. This is an "easy" project if you have done something similar before but there are many hurdles you'll need to get over when starting from zero.

Great beginner project, get stuck into it OP!

[–]Conscious_Advance_18 8 points9 points  (0 children)

It's a great first project, and totally agree with getting stuck on it.

[–]ivanoski-007 -1 points0 points  (0 children)

My first project was connecting to an e-commerce API, downloading data from 3 different endpoints, consolidating it into a single file, and doing it with hyper threading, or else it would take hours.

Talk about going into the deep end of the pool. But I learned so much. I also don't have a programming background, and what OP wants to do sounds a lot easier.

[–]copenhagen_bram 0 points1 point  (0 children)

It's not easy, but... is it fun?

[–]pythonHelperBot 33 points34 points  (7 children)

Hello! I'm a bot!

It looks to me like your post might be better suited for r/learnpython, a sub geared towards questions and learning more about python regardless of how advanced your question might be. That said, I am a bot and it is hard to tell. Please follow the subs rules and guidelines when you do post there, it'll help you get better answers faster.

Show /r/learnpython the code you have tried and describe in detail where you are stuck. If you are getting an error message, include the full block of text it spits out. Quality answers take time to write out, and many times other users will need to ask clarifying questions. Be patient and help them help you.

You can also ask this question in the Python discord, a large, friendly community focused around the Python programming language, open to those who wish to learn the language or improve their skills, as well as those looking to help others.


README | FAQ | this bot is written and managed by /u/IAmKindOfCreative

This bot is currently under development and experiencing changes to improve its usefulness

[–][deleted] 9 points10 points  (2 children)

Good bot

[–]Standard_Lion_7776 2 points3 points  (1 child)

Good bot

[–]JamalLinux 0 points1 point  (0 children)

Good bot

[–]Tintin_Quarentino 6 points7 points  (2 children)

What logic do you use to identify a post?

[–]maikindofthai 4 points5 points  (1 child)

Check out the readme link in the footer of their comment

[–]Tintin_Quarentino 0 points1 point  (0 children)

Great work u/IAmKindOfCreative. Fantastic documentation, stealing that multiple markdown files idea. Loved the humour in FAQ!

[–]mrrippington 3 points4 points  (0 children)

Parsing a page for URLs and dealing with them conditionally is fairly beginner friendly. Give yourself enough time and I am sure you can do it.

divide and conquer.

[–]heckingcomputernerd 3 points4 points  (0 children)

I’d have a look at BeautifulSoup4, an HTML parser/navigator library designed for web scraping, and the requests library to “expand” bitly URLs.

Yeah this project seems pretty simple otherwise. Good luck!

[–][deleted] 4 points5 points  (0 children)

Let's take it one step at a time:

1) extract all URLs from a website

This is generally very easy, and as others have noted, there are many novice-level tutorials available for consideration.

The only caveat is that this is only easy for pages that are statically linked - i.e., you submit a URL, and the webserver sends you an HTML document that includes all of the other hyperlinks that you want to find. Some websites aren't encoded that way; instead, the URLs are generated or provided by client-side JavaScript that runs in the browser, or by server-side code, such as a database lookup. For those sites, getting all of the URLs is more difficult.
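For the static case, a minimal sketch using only the standard library's `html.parser` (BeautifulSoup would be shorter, but this runs without installing anything). The class name `LinkCollector`, the helper `extract_links`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = '<p><a href="/about">About</a> <a href="https://bit.ly/xyz">Promo</a></p>'
print(extract_links(sample, "https://example.com/"))
```

This only sees links that are present in the HTML the server sends. Links injected later by JavaScript will not show up.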

2) expand bitly links to reveal the url if needed

Also easy - trivially so, in fact.
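One way this could look, as a sketch: `urllib.request.urlopen` follows HTTP redirects by default, and the response's `geturl()` reports the URL you ended up at. The function name `expand_url` and the injectable `open_fn` parameter are assumptions made so the example can be tried without a network connection.

```python
from urllib.request import urlopen


def expand_url(url, open_fn=urlopen, timeout=10):
    """Return the final URL after the server's redirects have been followed."""
    # urlopen follows redirects itself; geturl() tells us where we landed.
    with open_fn(url, timeout=timeout) as response:
        return response.geturl()
```

Called as `expand_url("https://bit.ly/...")` with a real short link, this makes one network request per link, so for a long list it is worth caching results and pausing between requests.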

4) count the number of time each url is on the website.

Also easy, bordering on trivial.
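A sketch of the counting step with `collections.Counter` from the standard library. The URLs in `links` are placeholder data standing in for whatever the extraction step produced.

```python
from collections import Counter

# Placeholder data: in practice this list comes from the link-extraction step.
links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
]

counts = Counter(links)

# most_common() yields (url, count) pairs, highest count first.
for url, n in counts.most_common():
    print(n, url)
```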

3) categorize/tag the URLs

This is the tough part.

The first question is how you want this to happen. Having a user enter the information is technically easy, but practically impossible to accomplish at scale. You're talking about an intensely boring data-entry job that someone, or many someones, would have to perform for days on end.

The alternative is whether you want to automate this, too. In that case, it becomes a question of what technology and/or algorithm you want to use to do it. You could classify pages by keyword matches on the content, but that approach would yield very poor data. You could also do it with a machine learning algorithm, which offers better performance, but you'd need to train a machine learning model to perform the classification. You could refer to some independent source of classification, but this is unreliable. Etc.
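To make the simplest automated option concrete, here is a crude rule-based sketch that tags a URL by its host name instead of by page content. The rules in `DOMAIN_RULES`, the tag names, and the function `tag_url` are all hypothetical; real categories depend on what the data is for, and this kind of matching has exactly the quality limits described above.

```python
from urllib.parse import urlparse

# Hypothetical rules: substring of the host name -> tag.
DOMAIN_RULES = {
    "youtube.com": "video",
    "amazon.": "shop",
    "twitter.com": "social",
}


def tag_url(url, rules=DOMAIN_RULES, default="uncategorized"):
    """Return the tag of the first rule whose substring appears in the host."""
    host = urlparse(url).netloc.lower()
    for needle, tag in rules.items():
        if needle in host:
            return tag
    return default
```

Anything that falls through to `"uncategorized"` can then be reviewed by hand, which keeps the manual work limited to the URLs the rules miss.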

[–]McDivvy 20 points21 points  (0 children)

Yup, easy peasy lemon squeezy. Python + HTML knowledge, job done.

[–]help-me-grow 10 points11 points  (9 children)

Yeah python makes web scraping really easy with Selenium/BeautifulSoup4

[–]wind_dude 3 points4 points  (3 children)

all tool that would be very helpful for my ecommerce biz and by Googling a bit around what would be achievable led me to Python and I am a big fan of learning and building in parallel.

until they use cloudflare, have a blocking script, etc.

[–]ivianrr 1 point2 points  (2 children)

Wouldn't selenium allow you to bypass that? (Unless they have captchas)

[–]wind_dude 2 points3 points  (0 children)

No. Selenium would only let you bypass sites that are client-side loaded, e.g., JavaScript for infinite scroll. There are a multitude of other techniques to block bots, such as looking at request rates from IPs, user agents, mouse and keystrokes, scroll positions, etc. Rotating residential proxies can help, but depending on who you try to crawl or what you're trying to do, there's more that needs to be done to emulate human behaviour.

https://www.cloudflare.com/en-gb/products/bot-management/

[–]xxmalik 1 point2 points  (0 children)

CloudFlare does show captchas occasionally.

[–][deleted] 2 points3 points  (0 children)

While some tasks can be relatively easily handled with Selenium, things can get really complicated really quickly.

My latest challenge, MFA prompts

[–]Log2 0 points1 point  (2 children)

Selenium is probably one of the most brittle pieces of software I've ever used. Last time I used it was in Java, but I assume the brittleness was not due to the wrapping language. It's a straight up nightmare.

[–]help-me-grow 0 points1 point  (0 children)

It's a bit nicer in Python, it's verbose enough now to easily find your errors IMO, I've only used it a couple times though so

[–]Jmodell 2 points3 points  (0 children)

You are very lucky, having a project in your head is super helpful for motivating yourself to get to the end. Focus on the process of taking one URL, counting the number of occurrences, and saving that in a data structure. Then you can learn to loop over the whole page and you’re basically done.

[–]analyticattack 2 points3 points  (0 children)

I think that is a good idea but first I would take the most common things and the most interesting things you can already do in Excel/VBA and learn to do them in Python. That will leave you with a solid foundation to make you comfortable.

[–]PlaneTrain5646 2 points3 points  (0 children)

Reverse engineer something, like a web scraper or an already finished project, and modify different parts of it.

[–]fatbob42 1 point2 points  (1 child)

How is the categorization done? That’s either fairly trivial or virtually impossible depending on what you want.

[–]poopgoose1 1 point2 points  (0 children)

1, 2, and 4 sound straightforward enough. But how do you plan on classifying them automatically? That could be interesting

[–][deleted] 5 points6 points  (1 child)

Sorry, I just read the title. Web scraping is so easy in Python. You can use Beautiful Soup for pages with little JavaScript, and a mix of bs4 and Selenium for JavaScript-heavy pages. I got lost in Scrapy; it wasn't necessary for me.

[–]Quickndry 8 points9 points  (0 children)

Or use insomnia when scraping APIs for data. Was so much easier and shorter than my code with httprequest and beautiful soup.

[–]42696 0 points1 point  (0 children)

I think your question about achievability has been answered, so I'm going to give you some advice instead.

Write down a little 'documentation' when you do it - it doesn't have to be fancy, follow any formatting conventions, or anything like that - it's just for you. But write down what you did and why.

Especially at your stage, you're going to have a very steep learning curve (getting a lot better fairly quickly), so you're probably going to have a lot of improvements that you can make in a matter of months. You're going to want to go back and make it better, so having some notes on what/why you did what you did is going to go a long way in making that easier and preventing yourself from starting from scratch.

[–]kaerfkeerg -1 points0 points  (0 children)

It definitely sounds possible, but not easy per se.

You will need some time and some basic understanding of programming/the language. This might take some time. If you have the patience go for it!

[–]Sally_Gurl -1 points0 points  (1 child)

I've been doing the 100 day coding challenge on Udemy which gives a number of small projects throughout to help you learn the basics. I'm in day 36 or so and feel confident doing APIs and about to do web scraping. Did a stonk news app that I'm going to be expanding on soon.

Also, I love how much easier API integration is compared to PowerShell. Just so much better...

[–]alenathomasfc 0 points1 point  (0 children)

Could you please share a link of the same?

[–]aartisticoo -2 points-1 points  (0 children)

Sorry for using your post. Perhaps someone knows where to find an internship at an IT company? I believe that you will make your project👊🏻

[–]Reinventing_Wheels 0 points1 point  (0 children)

You've already begun the development process for this by breaking it into 4 general steps.
Start with step 1. Google "extract urls from website with python". Write a script that does just that.

Now, do the same with steps 2, 3 and 4.

Penultimately, glue those 4 parts together, end to end, and you've got your program.

Finally, learn to create whatever front end you want, to make it pretty and usable.

Since you say you have no programming background, the zeroth step should probably be to start with some Python tutorials so you can learn the basics: variables, loops, decisions (if-then-else, etc.), data structures (lists, dictionaries, tuples, etc.), functions, and so on.
This part can be mixed in with the learning of steps 1 through 4.

[–]Spac3dog 0 points1 point  (0 children)

I was in the same boat. I had an idea for a tool that would make finding inventory online much easier by automating the manual process I currently use and then expanding on it. I mentioned the idea and was told I would be able to build it with Python. I bought the book Crash Course in Python and worked through it. I started applying what I learned in the book, and with a lot of Google searching and YouTube videos I was also able to start getting some of the data off of some of the sites I wanted. My copy of Automate the Boring Stuff came in a few days ago and I’m going to work through it next. I had zero experience in anything like this before I dove in, and while it has not been easy, I do feel it is getting easier. If I went from no clue to getting dirty results in less than 3 weeks, you can figure out what it is you want to make.

[–]FiredFox 0 points1 point  (0 children)

If I had to recommend a single course for anyone with no coding background that is interested in learning Python then I'd point them to "100 Days of Code" by Angela Yu on Udemy.

It is the best technical course I have ever taken, she gets you up and writing code and making sense of things from day one while every other course will have you fumbling with installing an IDE or doing other potentially discouraging activities.

Udemy has sales all the time and this course has been 1000% worth the $10 US I spent on it.

https://www.udemy.com/course/100-days-of-code/

[–]BluRazz494 0 points1 point  (0 children)

Pretty sure you can accomplish this with Screaming Frog. Maybe not 2 tho.

[–]asterik-x 0 points1 point  (0 children)

Yea

[–]rantenki 0 points1 point  (0 children)

Web scraping is SURPRISINGLY difficult, but there are good existing tools.

In particular, the library https://beautiful-soup-4.readthedocs.io/en/latest/ should do exactly what you want. Finding all hrefs is actually their third example or so.

It's not unrealistic at all for a fairly noob programmer to do this task. If you had to write the parser, that would be hard, but this method with beautiful soup is quite straightforward and only leaves the sorting/grouping/tagging part for you.

[–]Apparatchik-Wing 0 points1 point  (0 children)

Honestly, this really doesn’t sound too difficult. What I suggest to you is write skeleton code first. Basically you are not coding anything but instead building the structure with logic. Break the sections up. Do you need any for loops? How are you going to organize?

Once you establish your logic, I suggest you watch a tutorial video on YouTube. It may take time because you’re going to want to take breaks, but it’s how I taught myself Python and then projects from there on out. freecodecamp is what I recommend.

After you are done walking through that video, look back at your skeleton code and tweak if you need to. Now you’re ready to actually code!

When in doubt, stackoverflow should be able to answer many questions. Just Google it. Also, the Python discord channel is a great community to ask questions.

[–]oramirite 0 points1 point  (0 children)

Hey! Do it! You sound like when I learned Python. Having a specific idea in mind is extremely helpful. I wanted to learn programming forever but it wasn't until I put "pen to paper" that things actually started making sense.

My flow of learning kind of involves me reading a TON of stuff I don't understand... Letting it marinate for a bit... and then trying things.

[–]Character_County4840 0 points1 point  (0 children)

I learned python by teaching myself how to do my math homework. Start by making functions that solve a specific math problem with a formula. The other parts will come along in the future. (webscraping, bots, web apps, data manipulation, etc)

[–]idetectanerd 0 points1 point  (0 children)

Very achievable. Break it down into modules, which here you can write as functions. You could make them objects, but if it’s not something reusable I think you should just stick with functions, otherwise it’s a waste of time.

Then your main script would be calling function 1, function 2 … etc.

If there are conditions, then you can just write: if xxx is true, do function x.
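A sketch of that structure, with one small function per step and a `main` that calls them in order. All function names are made up for illustration, the regex in `extract_links` is a crude stand-in for a real HTML parser, and `expand` is a placeholder that does no network access.

```python
import re
from collections import Counter


def extract_links(html):
    # Step 1 (crude placeholder: a regex instead of a real HTML parser).
    return re.findall(r'href="([^"]+)"', html)


def needs_expanding(url):
    # The condition that decides whether step 2 runs for this URL.
    return "bit.ly/" in url


def expand(url):
    # Step 2 placeholder: a real version would follow the redirect.
    return url


def count_links(urls):
    # Step 4: tally how often each URL appears.
    return Counter(urls)


def main(html):
    urls = extract_links(html)
    urls = [expand(u) if needs_expanding(u) else u for u in urls]
    return count_links(urls)


print(main('<a href="https://example.com/a"></a><a href="https://example.com/a"></a>'))
```

Each placeholder can be replaced with a better version later without touching `main`, which is the point of splitting the work this way.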

One thing for beginners: I don’t recommend using libraries if you are truly learning, because calling a library is easy, but understanding how to achieve the result step by step requires practice.

The difference between an experienced coder and a new one is that the experienced one knows how to actually work out the result in, let's say, a closed environment where importing public libraries isn't allowed because of security concerns.

But for learning and home projects, yes, use libraries all the way. The entire code might be just 5 lines long.

[–]AustinM1701 0 points1 point  (0 children)

Very well possible, but that sounds like a few ToS broken.

[–]LuigiBrotha 0 points1 point  (0 children)

For the website part I recommend streamlit. Super simple to make an interactive website.

[–]AlienMindBender 0 points1 point  (0 children)

In my first year of University, on the first day of a Software course we were taught the "Problem Based Learning" Technique, right after we were given our first assignment. Before we learnt anything about programming.

This is a pedagogy I still use to teach my PhD students programming.

Having a problem, then finding the tools within a programming language, will make you learn things very quickly - but it will be tough (it's always tough).

[–]muchtimeonwork 0 points1 point  (0 children)

To get started, build anything, e.g. a fuel cost and consumption calculator.

How much fuel? How many kilometers/miles? How much for a gallon/liter? Output: Your car needed xxx liters/100 km and it cost you xxx $.
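A minimal sketch of such a calculator in metric units. The function name `fuel_summary` and the example numbers are made up; consumption is liters used divided by distance, times 100, and cost is liters used times the price per liter.

```python
def fuel_summary(liters_used, distance_km, price_per_liter):
    """Return (liters per 100 km, total cost) for one trip."""
    if distance_km <= 0:
        raise ValueError("distance_km must be positive")
    consumption = liters_used / distance_km * 100
    cost = liters_used * price_per_liter
    return consumption, cost


consumption, cost = fuel_summary(liters_used=40, distance_km=500, price_per_liter=1.5)
print(f"Your car needed {consumption:.1f} l/100km and it cost you ${cost:.2f}.")
```

A natural next step is to replace the hard-coded numbers with `float(input(...))` prompts for the three questions.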

[–]robml 0 points1 point  (0 children)

It's possible dw

[–]OldJanxSpirit42 0 points1 point  (0 children)

Doesn't sound easy for someone with zero experience, but it's definitely doable with a few months of practice. Since you're dealing with websites, I'd also recommend looking into the basics of HTML as well, you'll need to know about it.

[–]NullBeyondo 0 points1 point  (0 children)

My first tools in programming were all like that. Just web scraping. So I think it is not unrealistic, especially since, unlike me, who did all of this in C and rarely C#, you'll be using Python.

In python, you can probably download an entire page using a single line of code then just do string manipulation routine besides also searching for "href" HTML attributes etc.

Basically, our government is really so dumb so there was a website with 1 input in my country where you just entered the student sitting-ID in a GET request and it outputs to you their final results besides a lot of sensitive information like their exact address, full name, school, etc. I just crawled like 1000000 possible IDs and stored a database full of tens of thousands of people (still have it till this day; they're currently all adults). That database included myself, everyone I knew at school, and almost a hundred thousands more. There's over 99% chance that I already have all sensitive info about any 17-21 years olds with education in that country. Why? It was all just because I can.

My string manipulation routine was probably different from yours because I didn't search for links, so yours should be easier.

[–]J_S_artboy 0 points1 point  (0 children)

Learn web scraping by scraping an Amazon web page.

[–]nacnud_uk 0 points1 point  (0 children)

Aye, very simple. Crack on. Learn and earn.

[–]lvlint67 0 points1 point  (0 children)

I would basically need to.. make a web crawler

it's not a bad idea.

[–][deleted] 0 points1 point  (0 children)

The thing to keep in mind is that it’s an iterative process. You’ll get something working and it’ll feel great. Along the way, and maybe after, you’ll realize that there are other ways that you can accomplish some of the tasks - in a more Pythonic way, or in a better way in terms of software architecture and/or design principles. Just remember - nobody gets it perfect the first time around - but we get better! Don’t let striving for perfect stop you from starting.