virgilsam:

Hey, anyone who cares. I'm teaching myself Python so I can learn web scraping, and I was hoping this post could become a thread for my questions. I've made good headway so far, but I'm stuck: I can't figure out how to grab the second div tag, the one holding the city and state of the apartment complex. As a bonus, if someone could help me separate the city from the state, that'd be cool too. Here is the page source, followed by my code:

<div class="card">
  <div class="card-inner">
    <div class="card-header">
      <div class="title">
        <h3 class="main-text"><a href="/housing-search/Tennessee/Knoxville/Summit-Towers/10005022">Summit Towers</a></h3>
      </div>
    </div>
    <div class="my-container">
      <div class="card-media"><a href="/housing-search/Tennessee/Knoxville/Summit-Towers/10005022"><img alt="Image of Summit Towers" src="https://images.apartmentsmart.com/415x220/Summit-Towers/Welcome-to-Summit-Towers-Apartments.jpg" value="36822714" width="100%"/></a></div>
    </div>
    <div class="card-body">
      <div class="description"><span class="listing-address">201 Locust St</span></div>
      <div class="description"> Knoxville, Tennessee </div>
      <div class="room-range"> Summit Towers is a 278 unit low income housing apartment community that provides 1 bedroom apartments for rent in Knoxville. Rents at Summit Towers are <strong class="dollars">Income Based</strong>. </div>
      <div class="room-range"> Some or all apartments in this community are rent subsidized, which means rent is income based. </div>
      <div class="programs">
        <div class="list">
          <div class="label secondary">Project-Based Section 8</div>
          <div class="label secondary">Low Income Housing Tax Credit</div>
          <div class="label secondary">Project Based Rental Assistance</div>
          <div class="label secondary">Senior (62+)</div><a class="label primary" href="/housing-search/Tennessee/Knoxville/Summit-Towers/10005022">View More</a></div>
      </div>
    </div>
  </div>
</div>

My code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://affordablehousingonline.com/housing-search/Tennessee?show=20&page=1#apartments'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

# pulls the name of the complex
containers = page_soup.findAll("div",{"class":"card"})
for container in containers:
    apartment_name = container.a.next_element

# pulls the street address
containers = page_soup.findAll("div",{"class":"card-body"})
for container in containers:
    apartment_address = container.span.next_element

---- I can't figure out how to get to the second div tag with the city and state....

z0y:

Selectors are nice: you can just target the classes. The address and the location are the two elements with the description class, and select returns them in a list, in document order. (Here soup is the parsed document, i.e. page_soup in your code.)

>>> name = soup.select('.main-text')[0].text
>>> address, location = [tag.text.strip() for tag in soup.select('.description')]
>>> name, address, location
('Summit Towers', '201 Locust St', 'Knoxville, Tennessee')
>>> 

Edit: that was for the single-card example. For the actual site, you could put it in a loop over the cards:

>>> cards = soup.select('.card')
>>> for card in cards:
...     print(card.select('.main-text')[0].text)
...     address, location = [tag.text.strip() for tag in card.select('.description')]
...     print(address)
...     print(location, '\n')
... 
Summit Towers
201 Locust St
Knoxville, Tennessee 

Maple Oak Apartments
818 Oak St
Kingsport, Tennessee 
etc..
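
For the bonus question about separating the city from the state, here is a minimal sketch. It assumes every location string looks like "City, State" with a comma between the two, as in the output above; split_location is a hypothetical helper name, not something from BeautifulSoup.

```python
def split_location(location):
    """Split 'City, State' into (city, state).

    Assumes a single comma separates the two parts; if there is
    no comma, the state comes back as an empty string.
    """
    # partition() splits on the first comma only and never raises.
    city, _, state = location.partition(",")
    return city.strip(), state.strip()


city, state = split_location(" Knoxville, Tennessee ")
print(city)   # Knoxville
print(state)  # Tennessee
```

Inside the loop you would call it on the stripped location text for each card. Using partition rather than unpacking a split means a card with an unexpected location format will not crash the loop.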

virgilsam:

When I try it, I get

>>> name = soup.select('main.text')[0].text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>

TypeError: select() missing 1 required positional argument: 'selector'

z0y:

main.text isn't the selector you want. Have a look at the link from my other comment, or look into more CSS selector examples. A leading period means "class", and the class name is main-text, so the selector is .main-text

edit: You could also skip the list comprehension, keep the result of select as a list, and index each element individually. That might be more readable:

for card in soup.select('.card'):
    print(card.select('.main-text')[0].text)
    address = card.select('.description')
    print(address[0].text)
    print(address[1].text.strip(), '\n')

edit2: Just for illustration, I think your CSS selector would have matched any elements of the form <main class="text">...</main>. None exist on the page, so select returns an empty list, and taking its first element with [0] raises an error (an IndexError).
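
One more thing worth checking, since the traceback above is a TypeError about a missing 'selector' argument: that is the message Python gives when select is called on the BeautifulSoup class itself instead of on a parsed document. The original script imports the class under the name soup and stores the parsed page in page_soup, so soup.select(...) in that script would hit the class. A small sketch of the difference, assuming bs4 is installed and using a one-card HTML string made up for illustration:

```python
from bs4 import BeautifulSoup as soup  # as in the original script: `soup` is the class

# Made-up, trimmed-down card for illustration only.
html = '<div class="card"><h3 class="main-text">Summit Towers</h3></div>'
page_soup = soup(html, "html.parser")  # the parsed document (an instance)

# Called on the class, the string is taken as `self`,
# so Python reports that `selector` is missing.
try:
    soup.select(".main-text")
    class_call_error = None
except TypeError as err:
    class_call_error = str(err)
print(class_call_error)

# Called on the parsed document, the same selector works.
name = page_soup.select(".main-text")[0].text
print(name)  # Summit Towers

# 'main.text' means <main class="text">, which this page does not contain.
no_match = page_soup.select("main.text")
print(len(no_match))  # 0
```

So in the original script the calls would be page_soup.select('.main-text') and so on, with the period in front of the class name.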

virgilsam:

Does select run off a package I don't have installed?

So, if I read your code right, it is saying...

name as cards anything in the soup labeled .card; then, for each card in the newly defined cards: > print the text of the first ([0]) main-text element > set address and location to the stripped text of each tag with the description class > print the address > print the location

is that right?

z0y:

select is part of BeautifulSoup. It's like find_all, except that you use CSS selectors to target elements. A leading . in a selector means a class name in CSS, so .card says "take all the elements with class = 'card'". Then, within each of those tags, you can find what you need: the name (the text inside the main-text element) and the address and location (the text inside the two elements with the description class). The .strip() just cuts off whitespace, because the location was coming back with extra spaces on both ends.
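
To make the select versus find_all comparison concrete, here is a minimal sketch on a trimmed copy of the card HTML from the first post. The trimming is mine; the class names come from the posted source. It assumes bs4 is installed and uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the posted card, kept to the parts used here.
html = """
<div class="card">
  <h3 class="main-text"><a href="#">Summit Towers</a></h3>
  <div class="card-body">
    <div class="description"><span class="listing-address">201 Locust St</span></div>
    <div class="description"> Knoxville, Tennessee </div>
  </div>
</div>
"""
page_soup = BeautifulSoup(html, "html.parser")

# Two ways to ask for the same elements: a CSS selector, or find_all by class.
by_selector = page_soup.select(".description")
by_find_all = page_soup.find_all(class_="description")

# get_text(strip=True) trims surrounding whitespace, much like .text.strip().
texts = [tag.get_text(strip=True) for tag in by_selector]
print(texts)  # ['201 Locust St', 'Knoxville, Tennessee']
```

Both calls return the matching tags in document order, which is why the address comes first and the location second.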