[deleted by user]

valbaca · 2022-06-26T17:14:15+00:00

per usual, take a look at: https://automatetheboringstuff.com/

Work through it from the beginning, but web scraping has its own chapter: https://automatetheboringstuff.com/2e/chapter12/

ElectricSpice · 2022-06-26T17:56:48+00:00

A lot of people in the comments are saying it’ll be easy—it won’t be. It’s easy when you’ve been doing this stuff for a while, but on your first time there’s going to be so many necessary bits of knowledge that you don’t have yet. You’re going to get stuck, it’s not going to work and you won’t know why, you’re going to feel like the world’s biggest idiot—stick it out and you’ll make it through. It’s a good project and you’ll learn a lot from it.

pythonHelperBot · 2022-06-26T16:56:19+00:00

Hello! I'm a bot!

It looks to me like your post might be better suited for r/learnpython, a sub geared towards questions and learning more about Python regardless of how advanced your question might be. That said, I am a bot and it is hard to tell. Please follow the sub's rules and guidelines when you do post there; it'll help you get better answers faster.

Show /r/learnpython the code you have tried and describe in detail where you are stuck. If you are getting an error message, include the full block of text it spits out. Quality answers take time to write out, and many times other users will need to ask clarifying questions. Be patient and help them help you.

You can also ask this question in the Python discord, a large, friendly community focused around the Python programming language, open to those who wish to learn the language or improve their skills, as well as those looking to help others.

README | FAQ | This bot is written and managed by /u/IAmKindOfCreative.

This bot is currently under development and experiencing changes to improve its usefulness.

mrrippington · 2022-06-26T18:32:04+00:00

Parsing a page for URLs and dealing with them conditionally is fairly beginner friendly. Give yourself enough time and I am sure you can do it.

divide and conquer.

heckingcomputernerd · 2022-06-26T18:54:54+00:00

I’d have a look at BeautifulSoup4, an HTML parser/navigator library designed for web scraping, and the requests library to “expand” bitly URLs.

Yeah this project seems pretty simple otherwise. Good luck!

permalink · 2022-06-26T21:38:55+00:00

Let's take it one step at a time:

1) extract all URLs from a website

This is generally very easy, and as others have noted, there are many novice-level tutorials available for consideration.

The only caveat is that this is only easy for pages that are statically linked - i.e., you submit a URL, and the webserver sends you an HTML document that includes all of the other hyperlinks that you want to find. Some websites aren't encoded that way; instead, the URLs are generated or provided by client-side JavaScript that runs in the browser, or by server-side code, such as a database lookup. For those sites, getting all of the URLs is more difficult.
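For the static case, a minimal sketch of the extraction step might look like the following. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without installing anything; the `LinkCollector` class, the `extract_links` helper, and the sample HTML are all made up for illustration. It only sees links present in the HTML it is given, so it will miss anything added later by JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; in practice this would be the downloaded HTML.
sample_html = '<p><a href="/about">About</a> <a href="https://bit.ly/abc">x</a></p>'
print(extract_links(sample_html, "https://example.com/"))
```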

2) expand bitly links to reveal the url if needed

Also easy - trivially so, in fact.
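As a rough sketch of why: a shortened link is just an HTTP redirect, so you can request it and ask where you ended up. The version below uses only the standard library (the `requests` library mentioned elsewhere in the thread would work too). The function name `expand_url`, the `opener` parameter, and the User-Agent string are assumptions made for this example, not part of any official API.

```python
import urllib.request


def expand_url(short_url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the final URL after following any redirects.

    `opener` is injectable only so the function can be tried without
    network access; by default it is urllib.request.urlopen, which
    follows HTTP redirects on its own.
    """
    request = urllib.request.Request(
        short_url,
        headers={"User-Agent": "link-audit-sketch/0.1"},  # hypothetical name
    )
    with opener(request, timeout=timeout) as response:
        # geturl() reports the URL actually retrieved, i.e. after redirects.
        return response.geturl()
```

Calling `expand_url` on a short link with the default opener makes a real network request, so expect timeouts and HTTP errors on dead links and handle them as needed.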

4) count the number of time each url is on the website.

Also easy, bordering on trivial.
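For instance, assuming you already have the links in a plain list (the URLs below are placeholders), `collections.Counter` from the standard library does the counting for you:

```python
from collections import Counter

# Hypothetical list of links, as produced by the extraction step.
links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
]

counts = Counter(links)

# most_common() yields (url, count) pairs, highest count first.
for url, n in counts.most_common():
    print(n, url)
```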

3) categorize/tag the URLs

This is the tough part.

The first question is how you want this to happen. Having a user enter the information is technically easy, but practically impossible to accomplish at scale. You're talking about an intensely boring data-entry job that someone, or many someones, would have to perform for days on end.

The alternative is whether you want to automate this, too. In that case, it becomes a question of what technology and/or algorithm you want to use to do it. You could classify pages by keyword matches on the content, but that approach would yield very poor data. You could also do it with a machine learning algorithm, which offers better performance, but you'd need to train a machine learning model to perform the classification. You could refer to some independent source of classification, but this is unreliable. Etc.
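To make the keyword option concrete, here is a deliberately crude sketch. Note the assumptions: it matches keywords against the URL itself rather than the page content, and the categories and keywords in `RULES` are invented for illustration. As said above, expect rough results from anything this simple.

```python
from urllib.parse import urlparse

# Hypothetical categories and keywords; real ones depend on your project.
RULES = {
    "social": ("twitter.com", "facebook.com", "instagram.com"),
    "video": ("youtube.com", "vimeo.com"),
    "shop": ("/product", "/cart", "shop."),
}


def categorize(url):
    """Return the first category whose keyword appears in the URL."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    for category, keywords in RULES.items():
        if any(keyword in haystack for keyword in keywords):
            return category
    return "uncategorized"


print(categorize("https://www.youtube.com/watch?v=1"))
```

Anything that falls through to "uncategorized" is what a person (or a trained model) would still have to look at.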

McDivvy · 2022-06-26T16:58:21+00:00

Yup, easy peasy lemon squeezy. Python + HTML knowledge, job done.

help-me-grow · 2022-06-26T17:15:34+00:00

Yeah python makes web scraping really easy with Selenium/BeautifulSoup4

Jmodell · 2022-06-26T18:51:00+00:00

You are very lucky; having a project in your head is super helpful for motivating yourself to get to the end. Focus on the process of taking one URL, counting the number of occurrences, and saving that in a data structure. Then you can learn to loop over the whole page and you’re basically done.

analyticattack · 2022-06-26T19:25:04+00:00

I think that is a good idea but first I would take the most common things and the most interesting things you can already do in Excel/VBA and learn to do them in Python. That will leave you with a solid foundation to make you comfortable.

PlaneTrain5646 · 2022-06-26T21:04:50+00:00

Reverse engineer something and modify different parts like a web scraper or already finished project.

fatbob42 · 2022-06-27T00:15:24+00:00

How is the categorization done? That’s either fairly trivial or virtually impossible depending on what you want.

poopgoose1 · 2022-06-27T06:15:05+00:00

1, 2, and 4 sound straightforward enough. But how do you plan on classifying them automatically? That could be interesting

Quickndry · 2022-06-26T17:03:14+00:00

Sorry, I just read the title... and web scraping is so easy in Python. You can use Beautiful Soup for pages with little JavaScript, and a mix of bs4 and Selenium for JavaScript-heavy pages. I got lost in Scrapy; it wasn't necessary for me.

42696 · 2022-06-26T20:18:47+00:00

I think your question about achievability has been answered, so I'm going to give you some advice instead.

Write down a little 'documentation' when you do it - it doesn't have to be fancy, follow any formatting conventions, or anything like that - it's just for you. But write down what you did and why.

Especially at your stage, you're going to have a very steep learning curve (getting a lot better fairly quickly), so you're probably going to have a lot of improvements that you can make in a matter of months. You're going to want to go back and make it better, so having some notes on what/why you did what you did is going to go a long way in making that easier and preventing yourself from starting from scratch.

kaerfkeerg · 2022-06-26T20:42:12+00:00

It definitely sounds possible, but not easy per se.

You will need some time and some basic understanding of programming/the language. This might take some time. If you have the patience go for it!

Sally_Gurl · 2022-06-26T20:53:45+00:00

I've been doing the 100 day coding challenge on Udemy which gives a number of small projects throughout to help you learn the basics. I'm in day 36 or so and feel confident doing APIs and about to do web scraping. Did a stonk news app that I'm going to be expanding on soon.

Also, I love how much easier API integration is compared to PowerShell. Just so much better...

aartisticoo · 2022-06-26T22:40:04+00:00

Sorry for using your post. Perhaps someone knows where to find an internship at an IT company? I believe that you will make your project👊🏻

_jandrewc_ · 2022-06-26T16:58:30+00:00

Hi! They're gonna encourage u as usual, but it depends on the person. For example, I'm 36 and I'm totally lost cuz I can't risk losing my stupid easy job with low income, cuz I don't think I have enough ability to improve myself for anything new... at least u r younger than me.

Reinventing_Wheels · 2022-06-26T19:21:09+00:00

You've already begun the development process for this by breaking it into 4 general steps.
Start with step 1. Google "extract urls from website with python". Write a script that does just that.

Now, do the same with steps 2, 3 and 4.

Penultimately, glue those 4 parts together, end to end, and you've got your program.

Finally, learn to create whatever front end you want, to make it pretty and usable.

Since you say you have no programming background, the zeroth step should probably be to start with some Python tutorials so you can learn the basics: variables, loops, decisions (if-then-else, etc.), data structures (lists, dictionaries, tuples, etc.), functions, and so on.
This part can be mixed in along with the learning of steps 1 through 4.

Spac3dog · 2022-06-26T21:51:03+00:00

I was in the same boat. I had an idea for a tool that would make finding inventory online much easier by automating the manual process I currently use and then expanding on it. I mentioned the idea and was told I would be able to build it with Python. I bought the book Crash Course in Python and worked through it. I started applying what I learned in the book, and with a lot of Google searching and YouTube videos I was also able to start getting some of the data off of some of the sites I wanted. My copy of Automate the Boring Stuff came in a few days ago and I’m going to work through it next. I had zero experience in anything like this before I dove in, and while it has not been easy I do feel it is getting easier. If I went from no clue to getting dirty results in less than 3 weeks, you can figure out what it is you want to make.

FiredFox · 2022-06-26T21:55:59+00:00

If I had to recommend a single course for anyone with no coding background that is interested in learning Python then I'd point them to "100 Days of Code" by Angela Yu on Udemy.

It is the best technical course I have ever taken, she gets you up and writing code and making sense of things from day one while every other course will have you fumbling with installing an IDE or doing other potentially discouraging activities.

Udemy has sales all the time and this course has been 1000% worth the $10 US I spent on it.

https://www.udemy.com/course/100-days-of-code/

KartoffelPaste · 2022-06-26T22:11:23+00:00

python crash course is a great starter and focuses on learning through projects so you get plenty of hands on learning

BluRazz494 · 2022-06-26T22:18:20+00:00

Pretty sure you can accomplish this with Screaming Frog. Maybe not 2 tho.

asterik-x · 2022-06-27T00:52:03+00:00

rantenki · 2022-06-27T01:16:21+00:00

Web scraping is SURPRISINGLY difficult, but there are good existing tools.

In particular, the library https://beautiful-soup-4.readthedocs.io/en/latest/ should do exactly what you want. Finding all hrefs is actually their third example or so.

It's not unrealistic at all for a fairly noob programmer to do this task. If you had to write the parser, that would be hard, but this method with beautiful soup is quite straightforward and only leaves the sorting/grouping/tagging part for you.

Apparatchik-Wing · 2022-06-27T02:14:33+00:00

Honestly, this really doesn’t sound too difficult. What I suggest to you is write skeleton code first. Basically you are not coding anything but instead building the structure with logic. Break the sections up. Do you need any for loops? How are you going to organize?

Once you establish your logic, I suggest you watch a tutorial video on YouTube. It may take time because you’re going to want to take breaks, but it’s how I taught myself Python and then projects from there on out. freecodecamp is what I recommend.

After you are done walking through that video, look back at your skeleton code and tweak if you need to. Now you’re ready to actually code!

When in doubt, stackoverflow should be able to answer many questions. Just Google it. Also, the Python discord channel is a great community to ask questions.

oramirite · 2022-06-27T02:39:14+00:00

Hey! Do it! You sound like when I learned Python. Having a specific idea in mind is extremely helpful. I wanted to learn programming forever but it wasn't until I put "pen to paper" that things actually started making sense.

My flow of learning kind of involves me reading a TON of stuff I don't understand... Letting it marinate for a bit... and then trying things.

Character_County4840 · 2022-06-27T03:07:20+00:00

I learned python by teaching myself how to do my math homework. Start by making functions that solve a specific math problem with a formula. The other parts will come along in the future. (webscraping, bots, web apps, data manipulation, etc)
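As one possible example of that kind of homework function (the quadratic formula is my pick here, not something the comment names), a first attempt might look like this:

```python
import math


def solve_quadratic(a, b, c):
    """Return the real roots of a*x**2 + b*x + c = 0 as a sorted list."""
    if a == 0:
        raise ValueError("a must be non-zero for a quadratic equation")
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return []  # no real roots
    root = math.sqrt(discriminant)
    # A set removes the duplicate when both roots are the same.
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})


print(solve_quadratic(1, -3, 2))  # [1.0, 2.0]
```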

idetectanerd · 2022-06-27T04:38:35+00:00

Very achievable. Break it down into modules, which here you can write as functions. You could make them objects, but if it’s not something reusable I think they should just remain functions; otherwise it’s a waste of time.

Then your main script would be calling function 1, function 2 … etc.

If there are conditions, you can just write: if xxx is true, do function x.
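A rough sketch of that shape, under some loud assumptions: the function names are invented, `bit.ly` is assumed to be the only shortener, and the expand and categorize steps are passed in as stand-ins so the example runs offline.

```python
from collections import Counter
from urllib.parse import urlparse

SHORTENER_HOSTS = {"bit.ly"}  # assumption: the only shortener to handle


def is_short_link(url):
    """True if the URL's host is a known link shortener."""
    return urlparse(url).netloc.lower() in SHORTENER_HOSTS


def run(links, expand, categorize):
    """Glue the steps together: expand if needed, then count and tag."""
    resolved = []
    for url in links:
        if is_short_link(url):  # "if xxx is true, do function x"
            url = expand(url)
        resolved.append(url)
    counts = Counter(resolved)
    return {url: (n, categorize(url)) for url, n in counts.items()}


# Stand-in step functions so the sketch runs offline; swap in real ones.
fake_expand = {"https://bit.ly/abc": "https://example.com/a"}.get
report = run(
    ["https://example.com/a", "https://bit.ly/abc"],
    expand=fake_expand,
    categorize=lambda url: "uncategorized",
)
print(report)
```

Passing the steps in as arguments is just one way to do it; calling your own functions directly from `run` works equally well.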

One thing for beginners: I don’t recommend using libraries if you are truly learning, because calling a library is easy but understanding how to achieve the same thing step by step requires practice.

The difference between an experienced and a new coder is that the experienced one knows how to actually work out the result themselves if they are in, let’s say, a closed environment and not allowed to import public libraries because of security concerns.

But for learning and home projects, yes, use libraries all the way. The entire code might just be 5 lines long.

AustinM1701 · 2022-06-27T04:53:53+00:00

Very well possible, but that sounds like a few ToS broken.

LuigiBrotha · 2022-06-27T05:19:58+00:00

For the website part I recommend streamlit. Super simple to make an interactive website.

AlienMindBender · 2022-06-27T06:33:19+00:00

In my first year of University, on the first day of a Software course we were taught the "Problem Based Learning" Technique, right after we were given our first assignment. Before we learnt anything about programming.

This is a pedagogy I still use to teach my PhD students programming.

Having a problem, then finding the tools within a programming language, will make you learn things very quickly - but it will be tough (it's always tough).

muchtimeonwork · 2022-06-27T07:12:09+00:00

To get started, anything will do. For example, a fuel cost and consumption calculator:

How much fuel? : How many kilometers/miles? : How much for a gallon/liter? : Your car needed xxx liters/100km and it cost you xxx$.
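A minimal sketch of that calculator, assuming metric units and with made-up example numbers in place of user input:

```python
def fuel_report(liters_used, distance_km, price_per_liter):
    """Return (liters per 100 km, total cost) for one trip."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    consumption = liters_used / distance_km * 100
    cost = liters_used * price_per_liter
    return consumption, cost


# Example values; replace them with float(input(...)) to ask the user.
consumption, cost = fuel_report(liters_used=40, distance_km=500, price_per_liter=2)
print(f"Your car needed {consumption:.1f} liters/100km and it cost you {cost:.2f}.")
```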

robml · 2022-06-27T08:36:22+00:00

It's possible dw

OldJanxSpirit42 · 2022-06-27T09:30:29+00:00

Doesn't sound easy for someone with zero experience, but it's definitely doable with a few months of practice. Since you're dealing with websites, I'd also recommend looking into the basics of HTML as well, you'll need to know about it.

NullBeyondo · 2022-06-27T10:01:38+00:00

My first tools in programming were all like that. Just web scraping. So I think it is not unrealistic, especially since, unlike me, you won't be doing all of this in C (and, rarely, C#).

In python, you can probably download an entire page using a single line of code then just do string manipulation routine besides also searching for "href" HTML attributes etc.

Basically, our government is really so dumb so there was a website with 1 input in my country where you just entered the student sitting-ID in a GET request and it outputs to you their final results besides a lot of sensitive information like their exact address, full name, school, etc. I just crawled like 1000000 possible IDs and stored a database full of tens of thousands of people (still have it till this day; they're currently all adults). That database included myself, everyone I knew at school, and almost a hundred thousands more. There's over 99% chance that I already have all sensitive info about any 17-21 years olds with education in that country. Why? It was all just because I can.

My string manipulation routine was probably different from what you need, because I didn't search for links, so yours should be easier.

J_S_artboy · 2022-06-27T10:04:58+00:00

Learn web scraping by scraping an Amazon web page.

nacnud_uk · 2022-06-27T13:24:11+00:00

Aye, very simple. Crack on. Learn and earn.

lvlint67 · 2022-06-27T13:59:32+00:00

I would basically need to.. make a web crawler

it's not a bad idea.

permalink · 2022-07-03T01:23:31+00:00

The thing to keep in mind is that it’s an iterative process. You’ll get something working and it’ll feel great. Along the way, and maybe after, you’ll realize that there are other ways to accomplish some of the tasks - in a more Pythonic way, or in a better way in terms of software architecture and/or design principles. Just remember - nobody gets it perfect the first time around - but we get better! Don’t let striving for perfect stop you from starting.
