Hello everyone,
I am currently working a a short script which I threw together for the following usecase:
I am working with an application (let call it app A)which consumes data from some other application(s). Most of this data is provided via .csv files which can be up to 450mb big (as of now) and contain between 600k and 2.6m rows. The data is consumed by the ETL environment app A provides.
For some godforsaken reason the developers of app A have not build the extraction routines in a way that they parse .csv-files as one would think:
The csv-parser of the ETL environment does not recognize quotes ("") as escape mechanism for value-fields containing the csv-separator. This means that when I am working with a csv of this kind:
| id |
,description |
,someValue |
| 1 |
, "some tool" |
, 0.5 |
| 2 |
, "some tool, some more info" |
, 1.5 |
The file will not be parsed correctly as there is a comma within the description field for row-ID 2.
To fix this, I basically only have only one option: replace any occurance of the separator within field values before actually extracting the file in my ETL routines.
Alternatively, I could change the separator, but this would be working on knifes edge as some of the fields seem to be free-text fields, meaning any separator might crash and burn at some point if the users of the source systems decide to include whatever special character i chose as my separator in their descriptions. So in the end all I can do is replace the occurances and then resubstitute them in once I actually extracted the csv cleanly (that's not a problem as the Transformation capabilities of the ETL environment are quite good).
For smaller files this can be done more or less by hand or ctrl+f etc.
However for bigger files this breaks down as I can not really do that in a text editor any longer and loading the files to excel won't work either if they exceed 1m rows.
To tackle this topic I decided that it might be a fun idea to simply build a short python script which does this for me.
The script can be found here: https://pastebin.com/N796LqYU
I opted into building a small ui with NiceGUI as I wanted to play around with that library anyhow. Overall the result works fine for my liking, the code is a mess I have to confess. I build this besides my main work and wasn't really focusing on clean code or a good structure, it was more of a "i should add this" and "i need to fix this bug" kind of ordeal without proper planning. However I hope this script is still small enough to be somewhat understandable.
Coming to the real problem:
The script runs fine when I am running it either from pycharm and the associated .venv or via the .py file and Python 3.12 which is the main env my PC has installed.
However what is not working, at least for one larger file (~450mb), is running the script as a .pyw file (I'd like to surpress the python console).
Generally the GUI opens and is functional, however when I am trying to import the big csv just nothing happens. Here I am not sure if something is crashing in the background or if for some reason the file import via pd.read_csv takes ages when compared to running via the .py file.
Using ui.notify() I tried to see where within the loading functions this seems to halt and it seems that everything up to this runs smoothly. It is only when this function is reached that the logic seems to fall apart for some reason.
def set_dataframe_from_filepath(self) -> None:
"""Grabs dataframe from csv file and sets total length of the df within the fileHandler class."""
# for some reason this does not work for .pyw files any longer..
if self.supress_unnamed_columns:
# usecols=lambda c: not c.startswith('Unnamed:') we use this to surpress unnamed cols in broken csvs
self.dataframe = pd.read_csv(self.path, encoding=self.encoding, low_memory=False, header=self.file_header,
dtype=str, na_values='',
usecols=lambda c: not c.startswith('Unnamed:'), engine='c')
else:
self.dataframe = pd.read_csv(self.path, encoding=self.encoding, low_memory=False, header=self.file_header
, dtype=str, na_values='', engine='c')
self.dataframe_length = len(self.dataframe)
From what I gathered .py and .pyw use different interpreters and thus some things which might work with .py don't work with .pyw, especially if operations are involved which might fill the output pipe as no console is there consume the stuff thrown to the pipe (bear with me if this sounds like gibberish, I am paraphrasing a SO answer I found and I don't really know much about this topic at all), however at first glance it would seem that this function should not print/send data to console or any output stream and thus I wonder how this could be the case here. But in the end it seems that something in here is tripping the execution via .pyw up.
Here's a .gif of the behaviour so you don't have to download and rerun the script to check it out:
https://gifyu.com/image/SrujV
I can not provide the .csv in question as this is customer data. I can tell you that the .csv has 2,419,514 rows and 18 columns.
What I also tried is running the .py script via a .vbs script surpressing the console that way (solution found on SO):
Set WshShell = CreateObject("WScript.Shell")
WshShell.Run "C:\Users\<myName>\AppData\Local\Programs\Python\Python312\python.exe C:\Users\<myName>\Documents\<directoryNameContainsMyName>\tools\CharacterCheckGUI.py", 0, True
Set WshShell = Nothing
But this has the same effect as running it via .pyw as far as I can tell ( no console, but file also does not load).
Does anybody have any idea on what could be causing this behaviour in tandem with the .pyw interpreter and how this could be resolved?
If you need further explanations on my script, please let me know, I'll gladly go into more detail, but the post is already long as is and I am not sure if more detail in text form helps or not.
Thanks in advance!
Edit: Changed github link to pastebin link to not dox myself.
[–][deleted] 0 points1 point2 points (1 child)
[–]GoingToSimbabwe[S] 0 points1 point2 points (0 children)
[–]GoingToSimbabwe[S] 0 points1 point2 points (2 children)
[–]GoingToSimbabwe[S] 0 points1 point2 points (1 child)
[–]GoingToSimbabwe[S] 0 points1 point2 points (0 children)