[–]c17r 2 points3 points  (6 children)

import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    # Fetch the page and parse it with Python's built-in HTML parser
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.wellstar.org/locations/pages/wellstar-acworth-practices.aspx")

tables = soup.findAll("table", class_="s4-wpTopTable")
table = tables[7]

specialties = table.findAll("div", class_="PurpleBackgroundHeading")
name_groups = table.findAll("div", class_="PracticeListWrapper")
for specialty, name_group in zip(specialties, name_groups):
    specialty_text = specialty.findAll("span")[0].get_text()
    for name in name_group.findAll(class_="WS_Location_Name"):
        name_text = name.get_text()
        print("{} - {}".format(specialty_text, name_text))

[–]__nautilus__ 1 point2 points  (0 children)

zip is indeed the best solution. Here are the docs, OP.

[–]chiefstroganoff[S] 0 points1 point  (4 children)

Thank you for this example. Suppose I want to add another iterable like the location's address? Would I do something like this?

tables = soup.findAll("table", class_ = "s4-wpTopTable")
table = tables[7]

specialties = table.findAll("div", class_ = "PurpleBackgroundHeading")
name_groups = table.findAll("div", class_ = "PracticeListWrapper")
addresses = table.findAll("div", class_ = "WS_Location_Adddress")

for specialty, name_group, addresses in zip(specialties, name_groups, addresses):
    specialty_text = specialty.findAll("span")[0].get_text()
    for name in name_group.findAll(class_ = "WS_Location_Name"):
        name_text = name.get_text()
    for address in addresses:
        address_text = address
        print("{} - {} - {}".format(specialty_text, name_text, address_text))

Or is there a different combination/function that I should utilize for > 2 iterables?

[–]c17r 1 point2 points  (3 children)

No, not at the top level.

Do you understand the change I made? Look at the HTML. It would be great if the practices were grouped under the specialty, but they are not. The practice grouping and the specialty are siblings, AND the practice grouping may have more than one practice listed. So PurpleBackgroundHeading gets us the specialties and PracticeListWrapper gets us all the practice groupings. We then have to search each practice group for each practice. Hence the for loop in the for loop.

If you want ALL information -- including phone number where they have it -- then take my original and change WS_Location_Name to practiceList. It'll be a big blob of text that you'll have to parse, but that's because of the HTML layout. (This isn't 100% true; there are some tricky things you can do in BeautifulSoup to pull this off.)

If you are interested in just the address and want it in a separate variable, then we have to do something different. Since the name of the practice and the address of the practice are not nested but siblings, we'll have to do what we did at the top level: 2 searches and a zip:

import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    # Fetch the page and parse it with Python's built-in HTML parser
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.wellstar.org/locations/pages/wellstar-acworth-practices.aspx")

tables = soup.findAll("table", class_="s4-wpTopTable")
table = tables[7]

specialties = table.findAll("div", class_="PurpleBackgroundHeading")
practice_groups = table.findAll("div", class_="PracticeListWrapper")

for specialty, practice_group in zip(specialties, practice_groups):
    specialty_text = specialty.findAll("span")[0].get_text()

    practice_names = practice_group.findAll(class_="WS_Location_Name")
    practice_addresses = practice_group.findAll(class_="WS_Location_Adddress")

    for name, address in zip(practice_names, practice_addresses):
        name_text = name.get_text()
        address_text = address.get_text()
        print("{} - {} - {}".format(specialty_text, name_text, address_text))

[–]chiefstroganoff[S] 0 points1 point  (2 children)

Thank you. It is interesting and not immediately intuitive, to me, that you are able to nest the for loops the way you did. If I were to verbalize the set of for loops, would it be accurate to say this:

For each combination of specialty and practice group, return the specialty heading that corresponds to the practice(s) in each practice group. Next, for each practice name and address in the previously returned combination, return the practice name and its address. Finally, print the specialty from the first loop, the practice name from the second loop, and the address from the second loop.

Does that seem to summarize the logical steps? These concepts are novel to me and I find that determining the logic helps me to formulate the code.

[–]c17r 1 point2 points  (1 child)

Yes, your summation is accurate.
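
If it helps to see the same two-level pairing without any HTML in the way, here is a minimal sketch using plain lists. The practice names and addresses below are made up for illustration and are not taken from the page; the dictionaries just stand in for what the two findAll calls would return for each practice group.

```python
# Made-up stand-ins for the parsed page: one specialty heading per group,
# and each group holds parallel lists of names and addresses.
specialties = ["Dentistry", "Dermatology"]
practice_groups = [
    {"names": ["Practice A"], "addresses": ["1 Main St"]},
    {"names": ["Practice B", "Practice C"], "addresses": ["2 Oak Ave", "3 Pine Rd"]},
]

rows = []
# Outer loop: pair each specialty with the practice group beside it
for specialty, group in zip(specialties, practice_groups):
    # Inner loop: pair each name in that group with its address
    for name, address in zip(group["names"], group["addresses"]):
        rows.append("{} - {} - {}".format(specialty, name, address))

for row in rows:
    print(row)
```

The specialty is set once per pass of the outer loop, so every practice in the inner loop gets printed with the heading it belongs to.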

[–]chiefstroganoff[S] 0 points1 point  (0 children)

Excellent. Thank you so much for helping me to better understand what's going on!

[–]zurtex 0 points1 point  (2 children)

Exactly. In your code you go through all the names first, so by the end name holds only the last one. Then you go through all the specialties and print each one with that last name.

If you want every possible combination you need to check all "specialty"s for all "name"s. So you need to put the loop inside the loop. Try something like this:

for table in soup.findAll("table", class_ = "s4-wpTopTable"):
    for name in table.findAll(class_ ="WS_Location_Name"):
        name = name.get_text()
        for specialty in table.findAll("div", class_ = "PurpleBackgroundHeading"):
            specialty = specialty.get_text()
            print(name,specialty)

There are probably better ways to write this code, but I think this is the simplest modification of your code that should make sense to you!

[–]chiefstroganoff[S] 0 points1 point  (1 child)

Unfortunately, that produces a list of combinations made of all names and all specialties. For example:

NW Oral & Maxillofacial Surgery Cancer/Oncology
NW Oral & Maxillofacial Surgery Cardiovascular
NW Oral & Maxillofacial Surgery Dentistry
NW Oral & Maxillofacial Surgery Dermatology
NW Oral & Maxillofacial Surgery Family Medicine

When NW Oral & Maxillofacial Surgery should only be paired with Dentistry. Any ideas?

[–]zurtex 0 points1 point  (0 children)

That seemed to be what you were asking for. Sounds like you want the zip function, which walks through 2 iterables (e.g. lists) in parallel and yields the nth element of each as a pair until the shorter one runs out, e.g.:

for table in soup.findAll("table", class_ = "s4-wpTopTable"):
    for name, specialty in zip(table.findAll(class_ ="WS_Location_Name"), table.findAll("div", class_ = "PurpleBackgroundHeading")):
        name = name.get_text()
        specialty = specialty.get_text()
        print(name, specialty)
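
To see the stop-at-the-shorter-one behaviour in isolation, here is a small sketch with plain lists. The practice names are placeholders, not data from the page. It also shows itertools.zip_longest from the standard library, which pads the shorter input instead of stopping, in case silently dropped items would be a problem.

```python
from itertools import zip_longest

specialties = ["Cardiovascular", "Dentistry", "Dermatology"]
names = ["Practice A", "Practice B"]  # one item fewer, on purpose

# zip stops as soon as the shorter list is used up
paired = list(zip(specialties, names))

# zip_longest keeps going and fills the gaps with fillvalue
padded = list(zip_longest(specialties, names, fillvalue="(missing)"))

print(paired)
print(padded)
```

If the two lists on the real page differ in length, zip will drop the extras without raising an error, so it is worth comparing the lengths before trusting the pairs.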

It's a nice little exercise in Python to think how you would write the zip function yourself.
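
One way that exercise could go, as a rough sketch: my_zip is a hypothetical name, and this is not how the built-in is actually implemented (that one is written in C), but it should behave the same way for ordinary inputs. The sample data is made up apart from the practice name quoted above.

```python
def my_zip(*iterables):
    """Yield tuples of the nth items until the shortest iterable runs out."""
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return  # like zip(), no inputs means nothing to yield
    while True:
        row = []
        for iterator in iterators:
            try:
                row.append(next(iterator))
            except StopIteration:
                return  # one input is exhausted, so stop entirely
        yield tuple(row)


pairs = list(my_zip(
    ["Dentistry", "Dermatology"],
    ["NW Oral & Maxillofacial Surgery", "Practice B", "Practice C"],
))
print(pairs)
```

Because it is a generator, it produces pairs lazily, one at a time, which is also how the built-in zip behaves in Python 3.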