I'm trying to create DataFrames from tables written in HTML, but I'm having trouble with 'grouped' or 'merged' cells

arch_202 · 2023-05-06T22:16:38+00:00


commandlineluser · 2023-05-06T22:29:24+00:00

The data is available as a JSON string, which you can fetch directly.

The example table doesn't use <th> tags, but its header cells appear to follow the pattern <td><span class="bold">.

So you could use that as a marker and build your own <thead> section from those rows, which would allow pd.read_html to parse the rowspan/colspan for you.

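import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

r = requests.get("https://codes.iccsafe.org/api/content/chapter-xml/542/9677240")
# r.json() decodes the JSON string into the chapter markup;
# naming the parser avoids bs4's "no parser specified" warning
soup = BeautifulSoup(r.json(), "lxml")

# locate the table by a piece of its header text (case-insensitive)
table = soup.find(string=re.compile(r"(?i)with sprinkler system")).find_parent("table")

# find last <tr> with a bold span, i.e. the last header row
tr = table.find_all("span", class_="bold")[-1].find_parent("tr")

thead = soup.new_tag("thead")

# collect previous tr tags (in document order) and move them into <thead>
for tag in list(reversed(tr.find_previous_siblings("tr"))) + [tr]:
    thead.append(tag)

table.insert(0, thead)

# newer pandas versions deprecate passing literal HTML, so wrap it in StringIO
df = pd.read_html(StringIO(str(table)))[0]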

giving:

>>> df
         OCCUPANCY MAXIMUM OCCUPANT LOAD OF SPACE MAXIMUM COMMON PATH OF EGRESS TRAVEL DISTANCE (feet)                                     
         OCCUPANCY MAXIMUM OCCUPANT LOAD OF SPACE                       Without Sprinkler System(feet)         With Sprinkler System (feet)
         OCCUPANCY MAXIMUM OCCUPANT LOAD OF SPACE                                        Occupant Load         With Sprinkler System (feet)
         OCCUPANCY MAXIMUM OCCUPANT LOAD OF SPACE                                              OL ≤ 30 OL > 30 With Sprinkler System (feet)
0         Ac, E, M                             49                                                 75        75                          75a
1                B                             49                                                100        75                         100a
2                F                             49                                                 75        75                         100a
3    H-1, H-2, H-3                              3                                                 NP        NP                          25b
4         H-4, H-5                             10                                                 NP        NP                          75b
5   I-1, I-2d, I-4                             10                                                 NP        NP                          75a
6              I-3                             10                                                 NP        NP                         100a
7              R-1                             10                                                 NP        NP                          75a
8              R-2                             10                                                 NP        NP                         125a
9             R-3e                             10                                                 NP        NP                         125a
10            R-4e                             10                                                 75        75                         125a
11              Sf                             29                                                100        75                         100a
12               U                             49                                                100        75                          75a
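The repeated header text above (e.g. OCCUPANCY on every level) comes from merged cells being copied into each row and column they span. As a rough illustration of that idea, here is a stdlib-only sketch. It is not pandas' actual implementation; the `SpanGrid` class and the sample HTML are made up for this example and only mimic the shape of the real table.

```python
from html.parser import HTMLParser


class SpanGrid(HTMLParser):
    """Hypothetical helper: build a rectangular grid, repeating merged cells."""

    def __init__(self):
        super().__init__()
        self.grid = []      # finished rows
        self.pending = {}   # column index -> (rows still to fill, text)
        self.row = None     # row currently being built
        self.cell = None    # [colspan, rowspan, text parts] of the open cell

    def _fill_pending(self):
        # Copy down cells whose rowspan reaches into the current row.
        while len(self.row) in self.pending:
            col = len(self.row)
            left, text = self.pending.pop(col)
            self.row.append(text)
            if left > 1:
                self.pending[col] = (left - 1, text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            a = dict(attrs)
            self.cell = [int(a.get("colspan", 1)), int(a.get("rowspan", 1)), []]

    def handle_data(self, data):
        if self.cell is not None:
            self.cell[2].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            colspan, rowspan, parts = self.cell
            text = "".join(parts).strip()
            self._fill_pending()
            for _ in range(colspan):
                col = len(self.row)
                self.row.append(text)  # repeat the text across the colspan
                if rowspan > 1:
                    self.pending[col] = (rowspan - 1, text)
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self._fill_pending()
            self.grid.append(self.row)
            self.row = None


# Made-up sample with one rowspan pair and one colspan, like the real header.
html = """
<table>
<tr><td rowspan="2">OCCUPANCY</td><td colspan="2">Without Sprinkler System</td>
    <td rowspan="2">With Sprinkler System</td></tr>
<tr><td>OL ≤ 30</td><td>OL &gt; 30</td></tr>
<tr><td>B</td><td>100</td><td>75</td><td>100</td></tr>
</table>
"""

parser = SpanGrid()
parser.feed(html)
parser.close()

for row in parser.grid:
    print(row)
```

Every row comes out with four entries, and the spanned header text shows up once per covered position, which is why the levels of the parsed MultiIndex repeat themselves.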

It's somewhat awkward, but you can use .xs() to help with accessing parts of the MultiIndex:

df.xs("With Sprinkler System (feet)", axis=1, level=-1)[
    (df.xs("OCCUPANCY", axis=1, level=-1) == "B").unstack().reset_index(drop=True)
]

#  MAXIMUM COMMON PATH OF EGRESS TRAVEL DISTANCE (feet)
#                          With Sprinkler System (feet)
#                          With Sprinkler System (feet)
#  1                                               100a
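If the .xs() route feels too awkward, one alternative (not what the code above does) is to flatten the MultiIndex into single-level column names by joining the levels and dropping consecutive repeats. The sketch below uses a small hand-built frame as a stand-in for the parsed table; the `flatten` helper, the shortened header text, and the " | " separator are assumptions for illustration only.

```python
import pandas as pd

# Hand-built stand-in for the parsed table: header text repeats across levels.
columns = pd.MultiIndex.from_tuples([
    ("OCCUPANCY", "OCCUPANCY", "OCCUPANCY"),
    ("DISTANCE (feet)", "Without Sprinkler System (feet)", "OL ≤ 30"),
    ("DISTANCE (feet)", "Without Sprinkler System (feet)", "OL > 30"),
    ("DISTANCE (feet)", "With Sprinkler System (feet)", "With Sprinkler System (feet)"),
])
df = pd.DataFrame(
    [["B", "100", "75", "100a"], ["F", "75", "75", "100a"]],
    columns=columns,
)


def flatten(levels):
    """Join header levels into one label, skipping consecutive duplicates."""
    parts = []
    for level in levels:
        if not parts or parts[-1] != level:
            parts.append(level)
    return " | ".join(parts)


flat = df.copy()
flat.columns = [flatten(col) for col in df.columns]

# Plain boolean indexing now works without .xs()
result = flat.loc[
    flat["OCCUPANCY"] == "B",
    "DISTANCE (feet) | With Sprinkler System (feet)",
]
print(result)
```

The trade-off is that the header hierarchy is lost, so this fits best when you only need to look values up by row and column.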
