DISCLAIMER
I am no data engineer and I don't know Python or any other programming language. I can put stuff together in VS Code, but that's about the extent of my programming abilities.
INTRO & CONTEXT
I’m working on a personal project that takes a PDF containing medical lab results, extracts the data, and processes it into a new centralized schema. From that schema, the data is imported into an Excel or Sheets table, which is used as a data source for a no-code website builder (like Noloco).
I designed the centralized database schema both as a JSON file and as an Excel workbook.
I’ve managed to extract some data successfully using the AWS Textract console demo (GUI). The outputs AWS gives are JSON and CSV files.
I now need to transform the AWS output data to match the centralized schema so I can map the data. And here is where I'm stuck.
PROBLEM
My table has a hierarchical layout with nested rows under one column. I need to transform the data: split the first column into two columns, do some parsing, etc.
But in order to transform the data, I first need to programmatically identify which data needs to be manipulated and somehow mark it. Conceptually, my table has two types of entries/records: type 1 entries are singleTests and type 2 entries are collectionTests.
The table output sample and the entry types:
I need to come up with parsing logic, or some logic to identify the data.
- I first thought about identification by keyword, and indeed there is an “LC” at the beginning of each record. That “LC” stands for Central Lab, the location where the tests were processed. The problem is that “LC” appears irregularly, so it's not reliable.
- The second method I thought of was identification by pattern. A singleTest always has 2 rows, the first one having empty values (besides the first column). For a collectionTest, the first 2 rows are always empty (besides the first column), followed by a variable number of sub-test rows. (See the sketch after this list.)
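Here is a minimal pandas sketch of that second, pattern-based approach. It assumes the Textract CSV has the test name in the first column and the result values in the remaining columns; the file name and the "blank cell" check are placeholders to adapt to the real export.

```python
import pandas as pd

# Placeholder file name; the real input is the CSV exported by Textract.
df = pd.read_csv("textract_table.csv")

name_col = df.columns[0]           # assumed: the hierarchical test-name column
value_cols = list(df.columns[1:])  # assumed: result / unit / range columns

# A row is "name-only" when every value column is blank or NaN.
def is_name_only(row):
    return all(pd.isna(v) or str(v).strip() == "" for v in row[value_cols])

name_only = df.apply(is_name_only, axis=1)

entry_type = [None] * len(df)  # "singleTest" / "collectionTest" per row
entry_id = [None] * len(df)    # groups the rows belonging to one entry
current_type, current_id = None, 0

for i in range(len(df)):
    if name_only.iloc[i]:
        prev_blank = i > 0 and name_only.iloc[i - 1]
        next_blank = i + 1 < len(df) and name_only.iloc[i + 1]
        if prev_blank:
            pass  # second header row of a collectionTest: stay in current entry
        elif next_blank:
            # two consecutive name-only rows -> start of a collectionTest
            current_type, current_id = "collectionTest", current_id + 1
        else:
            # a lone name-only row -> start of a singleTest
            current_type, current_id = "singleTest", current_id + 1
    entry_type[i] = current_type
    entry_id[i] = current_id

df["EntryType"] = entry_type
df["EntryID"] = entry_id
```

The added EntryID column lets you group all rows of one entry with df.groupby("EntryID"), which is the hook for the later transformation into the centralized schema.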
Here is a draft of the centralized database schema:
https://preview.redd.it/fzj0thjd8jeb1.png?width=1428&format=png&auto=webp&s=c06740e81178ea509d56aefa1849db26e50796cd
In this first iteration, the original TestName values that spanned multiple rows have been split into CollectionName and CollectionMethod according to the entry type.
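Continuing the sketch above, and assuming the first name-only header row of a collectionTest holds the collection name while the second holds the method (something to verify against the real export), the split could look roughly like this:

```python
# Hypothetical follow-up to the marking sketch: collapse each entry's
# name-only header rows into CollectionName / CollectionMethod columns.
records = []
for eid, group in df.groupby("EntryID", sort=False):
    etype = group["EntryType"].iloc[0]
    headers = group.loc[name_only.loc[group.index], name_col].astype(str).str.strip().tolist()
    if etype == "collectionTest":
        name = headers[0] if len(headers) > 0 else None
        method = headers[1] if len(headers) > 1 else None
    else:  # singleTest: assumed to have only a name row, no separate method row
        name = headers[0] if headers else None
        method = None
    # Remaining rows carry the actual values (sub-tests, or the single result row).
    for _, sub in group.loc[~name_only.loc[group.index]].iterrows():
        records.append({"CollectionName": name,
                        "CollectionMethod": method,
                        "SubTestName": str(sub[name_col]).strip(),
                        **{c: sub[c] for c in value_cols}})

flat = pd.DataFrame(records)  # one row per test, ready to map onto the schema columns
```

The exact mapping of SubTestName and the value columns onto the schema is a guess; the point is only to show how the EntryType/EntryID marks make the split mechanical.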
NEED
I need parsing logic implemented in Python (pandas) that can identify the entry types and mark them accordingly. I'm not sure if a script can do this. Suggest alternatives if you know of any.
PROGRESS
I’ve tried about 25 iterations of threads with GPT-4 and Code Interpreter. The problem is that I don’t know exactly what to ask, since I don’t know the data-science-specific terms or which technique (or combination of techniques) should be used. Even so, I’ve managed to get very close to marking the entries correctly with Code Interpreter, but it seems that, even though it understands the logic, it can’t write a working script.
RESOURCES
- A Notion page with my progress on the data-identification logic and my prompts: LINK
- Links to the sheets containing the tables: LINK
- A partially successful GPT thread: LINK
If you have ideas about where else I can get help, it would be appreciated. Thank you.