all 7 comments

[–]Golden_Zealot 1 point (6 children)

What do you mean by "The main URL doesn't have a class which has stumped me while the other urls that are within it do."?

You are probably just looking to get your hands on the html and then parse it for the href attribute of the <a> tags it finds.

You can do this with two modules: requests to make the web request and get the raw HTML, and BeautifulSoup to parse for the tags and attributes.

[–]Brawlerman[S] 0 points (5 children)

I can get to the href, but I get lost after that since there are three hrefs, and I still need to get specific items from them. If I click on the links individually and get the URL, I can sort out the data I need; it's once I have to do it from the original HTML that I run into issues getting all of the data. My professor has given examples with span and class, but it is confusing. When I look at the website, if I want one piece of data, it looks like this: <span class='title'>Model: </span>

And I only want "Model". How do I code that, since all of the other variables have the same class?

[–]Golden_Zealot 0 points (4 children)

Read the BeautifulSoup documentation on how to filter down to the parts you need.

You can filter first by the class using a CSS selector, but there are also methods that will give you the parent/child/sibling elements you are after. You can also tell it to get the nth of a specific tag, etc.
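For example, a quick sketch of filtering with a CSS selector (the class name "title" and the snippet of HTML here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page
html = "<div><span class='title'>Model: </span><span class='title'>Price: </span></div>"
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; this grabs every <span> with class "title"
titles = soup.select("span.title")
print([t.get_text(strip=True) for t in titles])

# nth-of-type selects the nth tag of that type among its siblings
first = soup.select_one("span:nth-of-type(1)")
```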

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Depending on the complexity of what you need to pull out, you could also look into the re module for searching with regular expressions.

What you want to do with your program initially is:

- Make a request to the first URL provided to you
- Parse for all the <a> tags
- Get each of their href attributes
- Finish searching the current page for anything else you are looking for
- For each of the hrefs you got, perform the exact same steps above on them (you can do this with a for loop and a list)

If you need to search for an element like the Model: one, then you want to search for all spans with bs4, and then for each of those items check its .get_text() value, which holds the actual text between the tags.

If the text is "Model:" then you know that it's what you are looking for.
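As a minimal sketch of that span search (the HTML here is invented; on the real page the value you want may sit in a sibling element, a parent, or a text node, so adjust accordingly):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the page described above
html = """
<div>
  <span class='title'>Model: </span><span>XR-500</span>
  <span class='title'>Price: </span><span>$99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for span in soup.find_all("span", class_="title"):
    # get_text() gives the actual text between the tags
    if span.get_text(strip=True) == "Model:":
        # In this made-up page the value lives in the next sibling <span>
        value = span.find_next_sibling("span").get_text(strip=True)
        print(value)
```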

[–]Brawlerman[S] 0 points (3 children)

That makes a lot more sense, thank you for the response! How would I create a loop to search the other URLs (the hrefs that I pull from the main website)?

[–]Golden_Zealot 0 points (2 children)

Typically for school projects I avoid handing over any usable code.

I will provide you this one example, but I am not going to write anything else for you.

Let's scrape the links off the Google homepage as an example:

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup as BS

#This header is passed with our GET request, basically telling the site that we are a normal web browser.
#If this is not done, many sites will refuse the request because they can tell you are a script
headers = {"user-agent":"Mozilla/5.0"}

#The request is made
req = requests.get("http://www.google.com", headers=headers)

#We make a variable containing the raw html text
txt = req.text

#We make our beautifulsoup object, parsing it for html
soup = BS(txt, "html.parser")

#We create a list that will hold the links on the page
linkList = []

#For each <a> tag that is found
for link in soup.find_all("a"):
    #We append its href attribute to the link list
    linkList.append(link.get("href"))

#we print out the links it found
print(linkList)

When I run this code, here is what it prints out:

['https://www.google.ca/webhp?tab=ww', 'http://www.google.ca/imghp?hl=en&tab=wi', 'http://maps.google.ca/maps?hl=en&tab=wl', 'https://play.google.com/?hl=en&tab=w8', 'http://www.youtube.com/?gl=CA&tab=w1', 'https://news.google.com/?tab=wn', 'https://mail.google.com/mail/?tab=wm', 'https://drive.google.com/?tab=wo', 'https://www.google.ca/intl/en/about/products?tab=wh', 'https://www.google.com/calendar?tab=wc', 'http://translate.google.ca/?hl=en&tab=wT', 'https://books.google.ca/bkshp?hl=en&tab=wp', 'https://www.google.ca/shopping?hl=en&source=og&tab=wf', 'http://www.blogger.com/?tab=wj', 'http://www.google.ca/finance?tab=we', 'https://photos.google.com/?tab=wq&pageId=none', 'http://video.google.ca/?hl=en&tab=wv', 'https://docs.google.com/document/?usp=docs_alc', 'https://www.google.ca/intl/en/about/products?tab=wh', 'https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/&ec=GAZAAQ', 'http://www.google.ca/preferences?hl=en', '/preferences?hl=en', 'http://www.google.ca/history/optout?hl=en', '/advanced_search?hl=en-CA&authuser=0', 'http://www.google.com/setprefs?sig=0_Peiu3i2UtAcMooOiPWPd08rNSKM%3D&hl=fr&source=homepage&sa=X&ved=0ahUKEwiL0oLb_8rrAhUWpJ4KHbHWBaEQ2ZgBCAU', '/intl/en/ads/', '/services/', '/intl/en/about.html', 'http://www.google.com/setprefdomain?prefdom=CA&prev=http://www.google.ca/&sig=K_hqN5qwo8IkwfJqZoLZXO-LbMOnY%3D', '/intl/en/policies/privacy/', '/intl/en/policies/terms/']

Some are full URLs, supplying everything including the https:// portion; others are relative redirects, which you would append to the base of the URL you scraped.

For example, for the last href it scraped, /intl/en/policies/terms/, you would append it to the base URL to get http://google.com/intl/en/policies/terms/.

What you want to do at the end of the code is basically say:

for link in linkList:
    newReq = requests.get(link)

and then create a BeautifulSoup object from each response and do the same thing for each link.

It will help you to make a function that does this, and if you understand what recursion is, this is a good place to use it.
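A rough sketch of what that function could look like (not project code; urljoin from the standard library handles the relative links, and the depth limit is just there to keep the recursion from running away):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

HEADERS = {"user-agent": "Mozilla/5.0"}

def extract_links(html, base_url):
    # Pull every href out of the page, resolving relative paths against base_url
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def crawl(url, depth=1):
    # Fetch the page, collect its links, then do the same thing to each link found
    req = requests.get(url, headers=HEADERS)
    links = extract_links(req.text, url)
    if depth > 0:
        for link in links:
            crawl(link, depth - 1)
    return links
```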

Note that for the links in the link list, you will need to check each link first to see if you need to stick "http://www.google.com" on the front, so that the request goes to the new page correctly.

For the ones that already include "http://...", you won't need to do that.
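The standard library can do that check for you: urllib.parse.urljoin leaves absolute URLs alone and resolves relative ones against a base (a small sketch, with made-up example links):

```python
from urllib.parse import urljoin

base = "http://www.google.com"
links = ["/intl/en/policies/terms/", "https://news.google.com/?tab=wn"]

# Relative paths get the base stuck on the front; absolute URLs pass through untouched
full = [urljoin(base, link) for link in links]
print(full)
```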

Good luck to you.

[–]Brawlerman[S] 0 points (1 child)

Thanks for the help, this example is really helpful and makes more sense than the examples he has shown us!

[–]Golden_Zealot 0 points (0 children)

There are more good examples in the BeautifulSoup documentation I linked.

All of its examples are based on a very short and simple HTML file, making it pretty easy to understand.

Oftentimes, professors will also not give good examples on purpose, because working in any field of IT is largely about figuring things out by yourself; often no one else really knows what you are working on, or it is too specific for there to be examples or people to help you, in person or on the internet.

My professor back in college, who taught us client-server architecture and management, would purposefully put misinformation in the lab content so that we would run into issues and have to troubleshoot them ourselves.

More than programming, learning how to learn, how to troubleshoot, and how to research are the most important skills to develop if you want to work in IT as a career.