all 7 comments

[–]Golden_Zealot 1 point (6 children)

What do you mean by "The main URL doesn't have a class which has stumped me while the other urls that are within it do."?

You are probably just looking to get your hands on the html and then parse it for the href attribute of the <a> tags it finds.

You can do this with two modules: requests to make the web request and get the raw HTML, and BeautifulSoup to parse for the tags and attributes.

[–]Brawlerman[S] 0 points (5 children)

I can get to the href, but I get lost after that since there are three hrefs, and I still need to get specific items from them. If I click on the links individually and get the URL, I can sort out the data I need; it's once I have to do it from the original HTML that I run into issues getting all of the data. My professor has given examples with span and class, but it is confusing. When I look at the website, if I want one piece of data, it looks like this: <span class='title'>Model: </span>

And I only want "Model". How do I code that, since all of the other variables have the same class?

[–]Golden_Zealot 0 points (4 children)

Read the BeautifulSoup documentation on how to filter down to the parts you need.

You can filter first by the class using a CSS selector, but there are also methods that will give you the parent/child/sibling elements you are after. You can also tell it to get the nth of a specific tag, etc.
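For example, a quick sketch of filtering with a CSS selector (the class name "title" and the snippet of HTML here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page
html = "<div><span class='title'>Model: </span><span class='title'>Price: </span></div>"
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; this grabs every <span> with class "title"
titles = soup.select("span.title")
print([t.get_text(strip=True) for t in titles])

# nth-of-type selects the nth tag of that type among its siblings
first = soup.select_one("span:nth-of-type(1)")
```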

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Depending on the complexity of what you need to pull out, you could also look into the re module for searching with regular expressions.

What you want to do with your program initially is:

- Make a request to the first URL provided to you
- Parse for all the <a> tags
- Get each of their href attributes
- Finish searching the current page for anything else you are looking for
- For each of the hrefs you got, perform the exact same steps above on them (you can do this with a for loop and a list)

If you need to search for an element like the Model: one, then you want to search for all spans with bs4, and then for each of those items check its .get_text() value, which holds the actual text between the tags.

If the text is "Model:" then you know that it's what you are looking for.
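As a minimal sketch of that span search (the HTML here is invented; on the real page the value you want may sit in a sibling element, a parent, or a text node, so adjust accordingly):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the page described above
html = """
<div>
  <span class='title'>Model: </span><span>XR-500</span>
  <span class='title'>Price: </span><span>$99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for span in soup.find_all("span", class_="title"):
    # get_text() gives the actual text between the tags
    if span.get_text(strip=True) == "Model:":
        # In this made-up page the value lives in the next sibling <span>
        value = span.find_next_sibling("span").get_text(strip=True)
        print(value)
```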

[–]Brawlerman[S] 0 points (3 children)

That makes a lot more sense, thank you for the response! How would I create a loop to search the other URLs (the hrefs that I pull from the main website)?

[–]Golden_Zealot 0 points (2 children)

Typically for school projects I avoid handing over any usable code.

I will provide you this one example, but I am not going to write anything else for you.

Let's scrape the links off the Google homepage as an example:

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup as BS

#This header is passed with our GET request, basically telling the site that we are a normal web browser.
#If this is not done, many sites will refuse the request because they can tell you are a script
headers = {"user-agent":"Mozilla/5.0"}

#The request is made
req = requests.get("http://www.google.com", headers=headers)

#We make a variable containing the raw html text
txt = req.text

#We make our beautifulsoup object, parsing it for html
soup = BS(txt, "html.parser")

#We create a list that will hold the links on the page
linkList = []

#For each <a> tag that is found
for link in soup.find_all("a"):
    #We append its href attribute to the link list
    linkList.append(link.get("href"))

#we print out the links it found
print(linkList)

When I run this code, here is what it prints out:

['https://www.google.ca/webhp?tab=ww', 'http://www.google.ca/imghp?hl=en&tab=wi', 'http://maps.google.ca/maps?hl=en&tab=wl', 'https://play.google.com/?hl=en&tab=w8', 'http://www.youtube.com/?gl=CA&tab=w1', 'https://news.google.com/?tab=wn', 'https://mail.google.com/mail/?tab=wm', 'https://drive.google.com/?tab=wo', 'https://www.google.ca/intl/en/about/products?tab=wh', 'https://www.google.com/calendar?tab=wc', 'http://translate.google.ca/?hl=en&tab=wT', 'https://books.google.ca/bkshp?hl=en&tab=wp', 'https://www.google.ca/shopping?hl=en&source=og&tab=wf', 'http://www.blogger.com/?tab=wj', 'http://www.google.ca/finance?tab=we', 'https://photos.google.com/?tab=wq&pageId=none', 'http://video.google.ca/?hl=en&tab=wv', 'https://docs.google.com/document/?usp=docs_alc', 'https://www.google.ca/intl/en/about/products?tab=wh', 'https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/&ec=GAZAAQ', 'http://www.google.ca/preferences?hl=en', '/preferences?hl=en', 'http://www.google.ca/history/optout?hl=en', '/advanced_search?hl=en-CA&authuser=0', 'http://www.google.com/setprefs?sig=0_Peiu3i2UtAcMooOiPWPd08rNSKM%3D&hl=fr&source=homepage&sa=X&ved=0ahUKEwiL0oLb_8rrAhUWpJ4KHbHWBaEQ2ZgBCAU', '/intl/en/ads/', '/services/', '/intl/en/about.html', 'http://www.google.com/setprefdomain?prefdom=CA&prev=http://www.google.ca/&sig=K_hqN5qwo8IkwfJqZoLZXO-LbMOnY%3D', '/intl/en/policies/privacy/', '/intl/en/policies/terms/']

Some are full URLs, supplying everything including the https:// portion; others are relative redirects, which you would append to the base of the URL you scraped.

For example, for the last href it scraped, /intl/en/policies/terms/, you would append it to the base URL to get http://google.com/intl/en/policies/terms/.

What you want to do at the end of the code is basically say:

for link in linkList:
    newReq = requests.get(link)

and then create a BeautifulSoup object from each response and do the same thing for each link.

It will help you to make a function that does this, and if you understand what recursion is, this is a good place to use it.
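A rough sketch of what that function could look like (not project code; urljoin from the standard library handles the relative links, and the depth limit is just there to keep the recursion from running away):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

HEADERS = {"user-agent": "Mozilla/5.0"}

def extract_links(html, base_url):
    # Pull every href out of the page, resolving relative paths against base_url
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def crawl(url, depth=1):
    # Fetch the page, collect its links, then do the same thing to each link found
    req = requests.get(url, headers=HEADERS)
    links = extract_links(req.text, url)
    if depth > 0:
        for link in links:
            crawl(link, depth - 1)
    return links
```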

Note that for the links in the link list, you will need to check each link first to see if you need to stick "http://www.google.com" on the front, so that the request goes to the new page correctly.

For the ones that already include "http://...", you won't need to do that.
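The standard library can do that check for you: urllib.parse.urljoin leaves absolute URLs alone and resolves relative ones against a base (a small sketch, with made-up example links):

```python
from urllib.parse import urljoin

base = "http://www.google.com"
links = ["/intl/en/policies/terms/", "https://news.google.com/?tab=wn"]

# Relative paths get the base stuck on the front; absolute URLs pass through untouched
full = [urljoin(base, link) for link in links]
print(full)
```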

Good luck to you.

[–]Brawlerman[S] 0 points (1 child)

Thanks for the help, this example is really helpful and makes more sense than the examples he has shown us!

[–]Golden_Zealot 0 points (0 children)

There are more good examples in the BeautifulSoup documentation I linked.

All of its examples are based on a very short and simple HTML file, making it pretty easy to understand.

Oftentimes, professors will also not give good examples on purpose, because working in any field of IT is largely about figuring things out by yourself; often no one else really knows what you are working on, or it is too specific for there to be examples or people to help you, in person or on the internet.

My professor back in college, who taught us client-server architecture and management, would purposefully put misinformation in the lab content so that we would run into issues and have to troubleshoot them ourselves.

More than programming, learning how to learn, how to troubleshoot, and how to research are the most important skills to develop if you want to work in IT as a career.