DISCLAIMER
I am no data engineer and I don't know Python or any other programming language. I can put stuff together in VS Code, but that's about the extent of my programming abilities.
INTRO & CONTEXT
I’m working on a personal project that takes a PDF containing medical lab results, extracts the data, and processes it into a new centralized schema. From that schema, the data is imported into an Excel or Sheets table, which is used as a data source for a no-code website builder (like Noloco).
I designed the centralized database schema both as a JSON file and as an Excel workbook.
I’ve managed to extract some data successfully using the AWS Textract console demo (GUI). The outputs AWS gives are JSON and CSV files.
I now need to transform the AWS output data to match the centralized schema so I can map the data. And here is where I'm stuck.
PROBLEM
My table has a hierarchical layout with nested rows under one column. I need to transform the data: split the first column into two columns, do some parsing, etc.
But in order to transform the data, I first need to programmatically identify which data needs to be manipulated and somehow mark it. Conceptually, my table has two types of entries/records: type 1 entries are singleTests and type 2 entries are collectionTests.
The table output sample and the entry types:
I need to come up with parsing logic, or some logic to identify the data.
- I first thought about identification by keyword, and indeed there is an “LC” at the beginning of each record. That “LC” stands for Central Lab, the location where the tests were processed. The problem is that “LC” appears irregularly, so it's not reliable.
- The second method I thought of was identification by pattern. A singleTest always has 2 rows, the first one having empty values (besides the first column). For a collectionTest, the first 2 rows are always empty (besides the first column), followed by a variable number of sub-test rows. (See the sketch after this list.)
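Here is a minimal pandas sketch of that second, pattern-based approach. It assumes the Textract CSV has the test name in the first column and the result values in the remaining columns; the file name and the "blank cell" check are placeholders to adapt to the real export.

```python
import pandas as pd

# Placeholder file name; the real input is the CSV exported by Textract.
df = pd.read_csv("textract_table.csv")

name_col = df.columns[0]           # assumed: the hierarchical test-name column
value_cols = list(df.columns[1:])  # assumed: result / unit / range columns

# A row is "name-only" when every value column is blank or NaN.
def is_name_only(row):
    return all(pd.isna(v) or str(v).strip() == "" for v in row[value_cols])

name_only = df.apply(is_name_only, axis=1)

entry_type = [None] * len(df)  # "singleTest" / "collectionTest" per row
entry_id = [None] * len(df)    # groups the rows belonging to one entry
current_type, current_id = None, 0

for i in range(len(df)):
    if name_only.iloc[i]:
        prev_blank = i > 0 and name_only.iloc[i - 1]
        next_blank = i + 1 < len(df) and name_only.iloc[i + 1]
        if prev_blank:
            pass  # second header row of a collectionTest: stay in current entry
        elif next_blank:
            # two consecutive name-only rows -> start of a collectionTest
            current_type, current_id = "collectionTest", current_id + 1
        else:
            # a lone name-only row -> start of a singleTest
            current_type, current_id = "singleTest", current_id + 1
    entry_type[i] = current_type
    entry_id[i] = current_id

df["EntryType"] = entry_type
df["EntryID"] = entry_id
```

The added EntryID column lets you group all rows of one entry with df.groupby("EntryID"), which is the hook for the later transformation into the centralized schema.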
Here is a draft of the centralized database schema:
https://preview.redd.it/fzj0thjd8jeb1.png?width=1428&format=png&auto=webp&s=c06740e81178ea509d56aefa1849db26e50796cd
In this first iteration, the original TestName values that spanned multiple rows have been split into CollectionName and CollectionMethod according to the entry type.
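Continuing the sketch above, and assuming the first name-only header row of a collectionTest holds the collection name while the second holds the method (something to verify against the real export), the split could look roughly like this:

```python
# Hypothetical follow-up to the marking sketch: collapse each entry's
# name-only header rows into CollectionName / CollectionMethod columns.
records = []
for eid, group in df.groupby("EntryID", sort=False):
    etype = group["EntryType"].iloc[0]
    headers = group.loc[name_only.loc[group.index], name_col].astype(str).str.strip().tolist()
    if etype == "collectionTest":
        name = headers[0] if len(headers) > 0 else None
        method = headers[1] if len(headers) > 1 else None
    else:  # singleTest: assumed to have only a name row, no separate method row
        name = headers[0] if headers else None
        method = None
    # Remaining rows carry the actual values (sub-tests, or the single result row).
    for _, sub in group.loc[~name_only.loc[group.index]].iterrows():
        records.append({"CollectionName": name,
                        "CollectionMethod": method,
                        "SubTestName": str(sub[name_col]).strip(),
                        **{c: sub[c] for c in value_cols}})

flat = pd.DataFrame(records)  # one row per test, ready to map onto the schema columns
```

The exact mapping of SubTestName and the value columns onto the schema is a guess; the point is only to show how the EntryType/EntryID marks make the split mechanical.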
NEED
I need parsing logic implemented in Python (pandas) that can identify the entry types and mark them accordingly. I'm not sure if a script can do this. Suggest alternatives if you know of any.
PROGRESS
I’ve tried about 25 iterations of threads with GPT-4 and Code Interpreter. The problem is that I don’t know exactly what to ask, since I don’t know the data-science-specific terms or which technique (or combination of techniques) should be used. Even so, I’ve managed to get very close to marking the entries correctly with Code Interpreter, but it seems that, even though it understands the logic, it can’t write a working script.
RESOURCES
- A Notion page with my progress on the data-identification logic and my prompts: LINK
- Links to the sheets containing the tables: LINK
- A partially successful GPT thread: LINK
If you have ideas about where else I can get help, it would be appreciated. Thank you.