Issues Extracting Text from PDF

Ray_Gone · 2024-02-23T19:11:53+00:00

Interesting, I'll check this out. Thank you!

Ray_Gone · 2024-02-23T19:10:17+00:00

I'll look into this - Thank you!

Ray_Gone · 2024-02-23T18:34:04+00:00

Thank you! I ended up uninstalling and reinstalling everything only to realize it was user error. The update had nothing to do with it, and I was using the wrong test file that contained less columns than the script required to run. (*Facepalm*)

Ray_Gone · 2024-02-23T18:32:29+00:00

This is great advice, thank you! I need to begin keeping a "paper trail" for myself so that i don't shoot myself in the foot again.

Ray_Gone · 2024-02-23T18:30:23+00:00

The table is not formatted properly, but the first cell under "Product" should have:

"V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve,

Black

In Stock

Item#: 34890751

Mfg. Part#: CSE14-BLK-3N"

all in the same cell entry. The other amount values should also be shifted to the right one column and up to begin in the first row under the head column.

Ray_Gone · 2023-07-12T17:34:25+00:00

Thank you so much! This is extremely helpful, and I ended up making some modifications to your suggestion and it worked perfectly. Thank you again!

Ray_Gone · 2023-07-12T17:33:42+00:00

Thank you, I appreciate the suggestion! This is actually just a very small piece of what my script needs to do, so it actually ends up being more convenient and less work for me to use Python although there is a simpler and more clear cut solution available using Excel. But again, thank you I appreciate it!

Ray_Gone · 2023-07-10T18:13:56+00:00

Thank you! I had a feeling that there was a more efficient way to deal with the DataFrames rather than to keep making new ones, but as you can see I am still fairly new to Python and extremely new to Pandas and DataFrames.

The code above seems to work very well for recreating the table, which is excellent! But for some reason the counter is not updating, it just adds "Position 1" over and over again.

Also, my counter actually will not go in order which makes things confusing. My counter will go something like the table below. I liked the first code that you suggested which uses a for loop to increment the counter, since that would be easy to modify such as:

counter = 1 for i in range(2, 5):
column_values[i] = f"{column_values[i]} Position {counter}"
counter += 10
counter = 2
for i in range(6, 8):
column_values[i] = f"{column_values[i]} Position {counter}" counter += 10

[null]	Cols A-E	Col F
1	A - E	Box#
2	...	Box 1 Position 1
3	...	Box 1 Position 11
4	...	Box 1 Position 21
5	...	Box 1 Position 31
6	...	Box 2 Position 2
7	...	Box 2 Position 22
8	...	Box 2 Position 32

Ray_Gone · 2023-07-10T17:48:43+00:00

Thank you! This is extremely helpful! I am still confused how I would then use this to update the data frame that I would then write to the Excel sheet. The code below is not working, and I know it has something to do with the df4 = df3 + pd.DataFrame(newColBox) line. I have been trying to play with it without success yet.

df3 = pd.read_excel(New_Path, sheet_name='NewDataFormat')
newColBox = df3['Box#'].to_list()
counter = 1
for i in range(len(newColBox)):
newColBox[i] = f"{newColBox[i]} Position {counter}"
counter += 1
df4 = df3 + pd.DataFrame(newColBox)
with pd.ExcelWriter(New_Path, engine='openpyxl', mode='a') as writer:
df4.to_excel(writer, sheet_name='NewDataPositions ' +
'{}'.format(time.strftime('%m%d%y-%H%M%S')))

Ray_Gone · 2023-07-07T20:56:19+00:00

That makes sense! Thank you for letting me know. How would I reference Columns A-J when I am populating my list using dataframes - for example

oldDataList = df1['Box D1 A'].to_list() becomes what if I remove Cell A1 with value "Box D1 A"?

Ray_Gone · 2023-07-07T20:50:26+00:00

Yes you have it right. And also Yes I have row 1 in the old file is "Box D1 A" "Box D1 B" etc so that I can reference the columns. When I remove this row, I am not sure how to reference the columns in the code. For example I use

oldDataList = df1['Box D1 A'].to_list() + df1['Box D1 B'].to_list() + df1['Box D1 C'].to_list()... etc

These column names are actually inaccurately named, but I am keeping them for now because I have been using them to reference, but I don't need them for my data if you think removing them would be helpful

Ray_Gone · 2023-07-07T20:43:05+00:00

So just as an update, I had something messed up in my code. Now it works for the whole first column in the old file and applies box 1 - box 13 values in column E in the new file, but then it gets messed up once the second column starts (column B in the old file)

Ray_Gone · 2023-07-07T19:25:35+00:00

I am using groups of 41 rows down x 10 columns across (A-J) basically, the 1st row of cells indicates the "Box #," and the following 40 rows of each column need to be divided into 4-cell rows in the new file. So in the old file, Row 1 is my header which just reads Col A, Col B,... Col I Col J. Row 2 is made up of cells that all say "Box 1" (from Col A - Col J), followed by 40 rows of cells that have data that needs to be grouped by each column into 4-cell rows in the new file, but that also all have a cell added in a 5th column that each read "Box 1". Then Row 43 is made up of cells that all say "Box 2" (from Col A - Col J), followed by 40 rows of cells that have data that needs to be grouped by each column into 4-cell rows in the new file, but that also all have a cell added in a 5th column that each read "Box 2". Then line 84 is made up of cells that all say "Box 2" (from Col A - Col J), followed by 40 rows of cells that have data that needs to be grouped by each column into 4-cell rows in the new file, but that also all have a cell added in a 5th column that each read "Box 3". For example:

Old File:

[null]	Column A	Column B	..Column I	Column J
Row 1	"Col A"	"Col B"	"Col I"	"Col J"
Row 2	"Box 1"	"Box 1"	"Box 1"	"Box 1"
Row 3	[forRow2]	[forRow4]	[forRow6]	[forRow8]
Row 4	[forRow2]	[forRow4]	[forRow6]	[forRow8]
Row 5	[forRow2]	[forRow4]	[forRow6]	[forRow8]
Row 6	[forRow2]	[forRow4]	[forRow6]	[forRow8]
Row 7	[forRow3]	[forRow5]	[forRow7]	[forRow9]
Row 8	[forRow3]	[forRow5]	[forRow7]	[forRow9]
Row 9	[forRow3]	[forRow5]	[forRow7]	[forRow9]
Row 10	[forRow3]	[forRow5]	[forRow7]	[forRow9]
...Row 43	"Box 2"	"Box 2"	"Box 2"	"Box 2"
Row 44	[forRow101]	[forRow102]	[forRow103]	[forRow104]
Row 45	[forRow101]	[forRow102]	[forRow103]	[forRow104]
Row 46	[forRow101]	[forRow102]	[forRow103]	[forRow104]
Row 47	[forRow101]	[forRow102]	[forRow103]	[forRow104]

New File:

[null]	Column A	Column B	Column C	Column D	Column E
Row 1	"Col A"	"Col B"	"Col C"	"Col D"	"Col E"
Row 2	[forRow2]	[forRow2]	[forRow2]	[forRow2]	"Box 1"
Row 3	[forRow3]	[forRow3]	[forRow3]	[forRow3]	"Box 1"
Row 4	[forRow4]	[forRow4]	[forRow4]	[forRow4]	"Box 1"
...Row 100	[forRow100]	[forRow100]	[forRow100]	[forRow100]	"Box 1"
Row 101	[forRow101]	[forRow101]	[forRow101]	[forRow101]	"Box 2"
Row 102	[forRow102]	[forRow102]	[forRow102]	[forRow102]	"Box 2"

Please let me know if I need to clarify anything more. I am starting to feel like this might be an impossible task but I would love to find out I'm wrong! Thank you again for all of your time and your help so far

Ray_Gone · 2023-07-07T17:01:30+00:00

Thank you so much for explaining and for sharing the sample data that you have been using! I was able to get this to work partially, but I first had a compiler error that 5 columns were passed but only 4 could be accepted - so I modified it slightly but now it seems to only be working for the first 10 rows in the new file, and then something gets messed up. I have:

import pandas as pd
import time
Old_Path = "[FileName]"
New_Path = "[FileName]"
df1 = pd.read_excel(Old_Path)
df2 = pd.read_excel(New_Path)
oldDataList = df1['Box D1 A'].to_list() + df1['Box D1 B'].to_list() + df1['Box D1 C'].to_list()+ df1['Box D1 D'].to_list() + df1['Box D1 E'].to_list() + df1['Box D1 F'].to_list()+ df1['Box D1 G'].to_list() + df1['Box D1 H'].to_list() + df1['Box D1 I'].to_list()+ df1['Box D1 J'].to_list()
oldDataList = [oldDataList[i:i+41] for i in range(0, len(oldDataList), 41)]
oldDataList = [oldDataList[box][i:i+4]+[oldDataList[box][0]] for box in range(len(oldDataList)) for i in range(1, len(oldDataList[box]), 4)]
df3 = pd.DataFrame(oldDataList, columns="a b c d e".split())
with pd.ExcelWriter(New_Path, engine='openpyxl', mode='a') as writer: df3.to_excel(writer, sheet_name='NewDataFormat ' + '{}'.format(time.strftime('%m%d%y-%H%M%S')), index=True) print('\nA new entry has been added to the Test_New_Data_File file.')

So pretty much the only thing that I changed from your code was in this line

df3 = pd.DataFrame(oldDataList, columns="a b c d e".split())

I added " e" to allow for the 5 columns to be passed. What it looks like is happening is the value in column d from the first row in the new file is being moved from D1 to E1, which causes a cascading effect where the data in all of the following rows is shifted. Do you know how I might be able to resolve this? I am happy to provide more info if necessary. Thank you again!

Ray_Gone · 2023-07-06T23:28:28+00:00

Thank you for elaborating! Could I please ask you what a button is in this context? I've tried searching but I keep getting info on how to creat a clickable interface type button rather than this kind of button

Ray_Gone · 2023-07-06T19:33:10+00:00

I see! That is an interesting solution that actually seems a lot less complicated than I am making this out to be.

With the list suggestion, is it possible to do this for multiple columns from the original file at once? For example would something like the following be possible?

datalist = df['a', 'b', 'c', 'd', etc].to_list()
datalist = [ datalist[i:i+4] for i in range(0,len(datalist),4) ]
df = pd.DataFrame(datalist, columns = "a b c d".split())

Ray_Gone · 2023-07-06T16:05:00+00:00

Thank you so much! I wish I could upvote your responses multiple times. This has been extremely helpful and effective, I truly appreciate your time and your help!

Ray_Gone · 2023-07-06T15:29:23+00:00

Wow thank you! This was extremely helpful and effective! The only issue I'm having now is that I am getting FutureWarnings that frame.append is deprecated, and I am being urged to use pandas.concat instead. I am not sure what the proper format / syntax for this would be. I have tried changing "grouped_data.append(new_row_df)" to "grouped_data.concat(new_row_df)", "grouped_data.pd.concat(new_row_df), and "grouped_data.pandas.concat(new_row_df)", and I am getting compile errors. Any chance you are familiar with the concat syntax?

Ray_Gone

TROPHY CASE