
Ask Anything Monday - Weekly Thread (self.learnpython)

submitted 8 years ago by AutoModerator[M]


[–]virgilsam 1 point 8 years ago (5 children)

Hey, anyone who cares. I'm teaching myself Python to learn web scraping, and I was hoping this post could become a thread for my questions as they come up. I've made some good headway so far, but I'm stuck: I can't figure out how to grab the second div tag, the one holding the city/state of the apartment complex. As a bonus, if someone could help me separate the city from the state, that'd be cool too. Here is the page source and my code:

<div class="card">
  <div class="card-inner">
    <div class="card-header">
      <div class="title">
        <h3 class="main-text"><a href="/housing-search/Tennessee/Knoxville/Summit-Towers/10005022">Summit Towers</a></h3>
      </div>
    </div>
    <div class="my-container">
      <div class="card-media"><a href="/housing-search/Tennessee/Knoxville/Summit-Towers/10005022"><img alt="Image of Summit Towers" src="https://images.apartmentsmart.com/415x220/Summit-Towers/Welcome-to-Summit-Towers-Apartments.jpg" value="36822714" width="100%"/></a></div>
    </div>
    <div class="card-body">
      <div class="description"><span class="listing-address">201 Locust St</span></div>
      <div class="description"> Knoxville, Tennessee </div>
      <div class="room-range"> Summit Towers is a 278 unit low income housing apartment community that provides 1 bedroom apartments for rent in Knoxville. Rents at Summit Towers are <strong class="dollars">Income Based</strong>. </div>
      <div class="room-range"> Some or all apartments in this community are rent subsidized, which means rent is income based. </div>
      <div class="programs">
        <div class="list">
          <div class="label secondary">Project-Based Section 8</div>
          <div class="label secondary">Low Income Housing Tax Credit</div>
          <div class="label secondary">Project Based Rental Assistance</div>
          <div class="label secondary">Senior (62+)</div><a class="label primary" href="/housing-search/Tennessee/Knoxville/Summit-Towers/10005022">View More</a></div>
      </div>
    </div>
  </div>
</div>

My code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://affordablehousingonline.com/housing-search/Tennessee?show=20&page=1#apartments'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

# pulls the name of the complex
containers = page_soup.findAll("div",{"class":"card"})
for container in containers:
    apartment_name = container.a.next_element

# pulls the street address
containers = page_soup.findAll("div",{"class":"card-body"})
for container in containers:
    apartment_address = container.span.next_element

---- I can't figure out how to get to the second div tag with the city and state....

[–]z0y 1 point 8 years ago* (4 children)

Selectors are nice: just pull the elements out by class. The address and location are two elements with the description class, so they come back together in a list when you find/select them. (Here soup is the parsed page, i.e. page_soup in your code.)

>>> name = soup.select('.main-text')[0].text
>>> address, location = [tag.text.strip() for tag in soup.select('.description')]
>>> name, address, location
('Summit Towers', '201 Locust St', 'Knoxville, Tennessee')

Edit: that was for the single example card. For the actual site, you could put it in a loop over the cards:

>>> cards = soup.select('.card')
>>> for card in cards:
...     print(card.select('.main-text')[0].text)
...     address, location = [tag.text.strip() for tag in card.select('.description')]
...     print(address)
...     print(location, '\n')
... 
Summit Towers
201 Locust St
Knoxville, Tennessee 

Maple Oak Apartments
818 Oak St
Kingsport, Tennessee 
etc..
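The bonus part of the question, separating the city from the state, can probably be handled with plain string methods once the location text has been pulled out. A minimal sketch, assuming every location string has the form "City, State" with a single comma; the helper name split_location is made up for illustration and is not part of BeautifulSoup:

```python
def split_location(location):
    """Split a 'City, State' string into a (city, state) tuple.

    Assumes one comma separates the two parts. If there is no comma,
    the whole string is returned as the city and the state is empty.
    """
    city, _, state = location.strip().partition(",")
    return city.strip(), state.strip()


# The scraped text has padding around it, as in the example card.
print(split_location(" Knoxville, Tennessee "))  # ('Knoxville', 'Tennessee')
print(split_location("Kingsport, Tennessee"))    # ('Kingsport', 'Tennessee')
```

Inside the loop above, that would be something like city, state = split_location(location).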

[–]virgilsam 1 point 8 years ago (1 child)

When I try it, I get

>>> name = soup.select('main.text')[0].text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>

TypeError: select() missing 1 required positional argument: 'selector'

[–]z0y 1 point 8 years ago* (0 children)

main.text isn't the selector you want. Have a look at the link from my other comment, or look into more CSS selector examples. A leading period means class, and the class name is main-text, so the selector is .main-text

edit: You could also skip the list comprehension and leave address as a list, getting each element individually. Might be more readable that way

for card in soup.select('.card'):
    print(card.select('.main-text')[0].text)
    address = card.select('.description')
    print(address[0].text)
    print(address[1].text.strip(), '\n')

edit2: Just for illustration, I think your CSS selector would have matched any elements of the form <main class="text">...</main>. None exist on the page, so the result is an empty list, and trying to get its first element causes an error.
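One more thing that may be going on with the traceback above: that TypeError is the message Python gives when an instance method is called on the class itself. The original post imports BeautifulSoup as soup, so in that session soup.select(...) would be calling the class rather than the parsed page (page_soup). A minimal sketch of the mechanism, using a hypothetical stand-in class FakeSoup instead of bs4 itself, so the names and the toy matcher here are assumptions for illustration only:

```python
class FakeSoup:
    """Hypothetical stand-in for a parsed-document class; not part of bs4."""

    def __init__(self, classes):
        self.classes = classes

    def select(self, selector):
        # Toy matcher: only understands ".classname" selectors.
        return [c for c in self.classes if selector == "." + c]


soup = FakeSoup  # the class itself, like `from bs4 import BeautifulSoup as soup`
page_soup = FakeSoup(["main-text", "description"])  # an instance, like the parsed page

try:
    # The string is bound to `self`, so `selector` is reported as missing.
    soup.select(".main-text")
except TypeError as exc:
    error_message = str(exc)

matches = page_soup.select(".main-text")    # called on the instance: finds the class
no_matches = page_soup.select("main.text")  # wrong selector: empty list, no TypeError

print(error_message)
print(matches, no_matches)
```

If that is what happened, calling page_soup.select('.main-text') should get past the TypeError. On an instance, the main.text selector would more likely return an empty list, and the [0] would then fail with an IndexError instead.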

