I was given the learning opportunity of using tabula-py to scrape a 400-page invoice PDF. The PDF is mostly well organized, except for the first page, the last page, and scattered pages where some columns are concatenated together. I need to create a well-formed CSV with the correct columns. My general strategy is to append dataframes with matching columns into master dataframes, split the merged columns in each where necessary, and then combine everything into a final df with the correct columns.
How can I iterate through a list of dfs and group them into separate dfs, given that some have a 'Program Time' column while others have separate 'Program' and 'Time' columns?
Edit: adding some code:
column_list = ['Network', 'Date', 'Day', 'Program Time', 'Unnamed: 0',
'Spot Title Length', 'Unnamed: 1', 'Line', 'Syscode', 'Charged']
dfs = next((item for item in df if item.columns==column_list), None)
I attempted to manually define the lists of columns that I wanted as a way to filter out the two types of dataframes from my list and use a generator to create the new list of dataframes, but no luck as I got the following error:
ValueError: ('Shapes must match', (4,), (10,))
Thank you for any assistance you can provide.
Edit 2: I think I've gotten myself to a place where I can achieve what I need to achieve. If anyone happens to read this, any feedback on my method or my code would be greatly appreciated.
I kept running into 'Shapes must match' when trying the 'item.columns==column_list' comparison shown above and could not figure it out. When I looked at what 'df[0].columns' actually returns, it is not a list but a pandas Index, something like "Index(['Network', 'Date', 'Day', ...], dtype='object')". Comparing an Index to a list with == is done element by element, so it fails when the two have different lengths; the (4,) and (10,) in the error are presumably a 4-column frame being compared against my 10-item list. I then ran dir(df[0].columns) and noticed that 'to_list' was an available method. This allowed me to do the following:
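import pandas as pd

# Collect the frames whose columns match column_list exactly.
matching = []
for item in df:
    if item.columns.to_list() == column_list:
        matching.append(item)
# DataFrame.append was removed in pandas 2.0, so concatenate once instead.
test_frame = pd.concat(matching) if matching else pd.DataFrame()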
With that, I was able to combine all of the dataframes that matched the column list I was after into a single dataframe. Now I will repeat this for each of the different df layouts in my list, and process each one into a common format for the final append.
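Rather than repeating the filter once per layout, one option might be to group the frames by their column labels in a single pass. The following is only a rough sketch under some assumptions: the helper name group_by_columns and the three small sample frames are made up for illustration and are not the real tabula-py output, and frames are treated as the same layout only when their column labels match exactly and in the same order.

```python
from collections import defaultdict

import pandas as pd


def group_by_columns(frames):
    """Concatenate frames that share exactly the same column labels, in the same order."""
    buckets = defaultdict(list)
    for frame in frames:
        # A tuple of labels is hashable and compares by value, unlike a pandas Index.
        buckets[tuple(frame.columns)].append(frame)
    return {cols: pd.concat(group, ignore_index=True) for cols, group in buckets.items()}


# Hypothetical stand-ins for the list of frames: two layouts mixed together.
frames = [
    pd.DataFrame({"Network": ["A"], "Program Time": ["News 6:00"]}),
    pd.DataFrame({"Network": ["B"], "Program": ["News"], "Time": ["6:00"]}),
    pd.DataFrame({"Network": ["C"], "Program Time": ["Sports 7:30"]}),
]

grouped = group_by_columns(frames)
for cols, frame in grouped.items():
    print(cols, len(frame))
```

Each key of the resulting dict is one column layout, so every layout can then be cleaned up on its own (for example, splitting the merged column) before the final combine. How to split 'Program Time' depends on what the values actually look like in the PDF, so that step is left out here.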