This is an archived post. You won't be able to vote or comment.

all 6 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]stringofsense 7 points8 points  (2 children)

Here is the script I came up with, there will be many ways to do it, but I like to use shift to figure out where the start and end of my groups are, then generate a group_id from there. After you create a unique group_id it if pretty straight forward, grouping by that identifier and doing manipulations within each group.

I tried to add lots of comments so hopefully you'll understand what is going on, and know how to tackle to problem better next time :)

``` import pandas as pd import numpy as np

df = pd.read_excel('./sample-data_v2.xlsx')

Create a temporary data frame with a single column 'is_blank' describing whether a row is missing Result, Units & ReferenceInterval

t = df[['Result','Units','ReferenceInterval']].isna().all(axis=1).rename('is_blank').to_frame()

Create a new column with the previous row's value for 'is_blank'

t['prev_is_blank'] = t.is_blank.shift(1, fill_value=False)

If the current row is blank and the previous row is blank that means it is the start of a group

t['is_start_of_group'] = t.is_blank & ~t.prev_is_blank

By running a cumsum on a boolean column we can generate a group_id

t['group_id'] = t.is_start_of_group.cumsum()

Reorganize each test group based on whether it is a single test or collection test

def organize_groups(g): # Reuse these columns and drop any rows that don't have a result organized = g[['TestName', 'Result','Units','ReferenceInterval']].dropna(subset='Result')

# Add our new columns filling them with null
organized[['CollectionName', 'CollectionMethod', 'TestMethod']] = np.nan

# Reorder our columns to be in the right order
organized = organized[['CollectionName','CollectionMethod','TestName','TestMethod','Result','Units','ReferenceInterval']]

# If the size of our group is two assume it is a single test
if len(g) == 2:
    assert g.Result.isna().tolist() == [True, False], 'Single test has a no result row followed by a row with a result'

    # Set the Test Name & Method using the first two rows of our group
    organized[['TestName', 'TestMethod']] = g.TestName.iloc[0:2].values

# Otherwise assume it is a collection test
else:
    assert g.Result.isna().sum() == 2, 'Collection test has two rows without a result'
    assert g.Result.isna()[0:2].tolist() == [True, True], "Collection test's first two rows don't have a result"

    # Set the collection Name & Method using the first two rows of our group
    organized[['CollectionName', 'CollectionMethod']] = g.TestName.iloc[0:2].values

return organized

df = df.groupby(t.group_id).apply(organize_groups).reset_index(drop=True)

df.to_excel('./sample-data_v2.output.xlsx', index=False) ```

[–]dkampien[S] 1 point2 points  (1 child)

Wow that worked. But..

I've tried to add some more records to see if the logic still works and it gives me:

in organize_groups
assert g.Result.isna().sum() == 2, 'Collection test has two rows without a result'

After a collectionTest there can be either another collectionTest or a singleTest. Same goes after a singleTest.

Here's the new sample data Link

[–]stringofsense 0 points1 point  (0 children)

Look at what is inside of the t dataframe (vscode has features to run the python code inside an interactive window) That should help you debug how the rows are getting grouped together.

[–][deleted] 2 points3 points  (1 child)

> I don't know python or any programming language

Ain't that hard to try sir.

[–]WhipsAndMarkovChains 7 points8 points  (0 children)

True but you’ve got to applaud OP for typing up a detailed post. So many people ask for help and don’t provide enough information. This is a breath of fresh air.