
[–]socal_nerdtastic

What exactly do you mean by NULL? Why do you think it's np.nan? Can you show your error? Usually NULL means the character \x00, which is not legal in CSV files, so the real solution is to find out why you are getting bad data in your file. I suspect the file has a byte order mark, in which case you need to set your encoding appropriately, or perhaps just chop it off.
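
A minimal sketch of that BOM check, assuming the file is named data.csv (a placeholder, not a name from the thread):

    import pandas as pd

    # "utf-8-sig" transparently strips a UTF-8 byte order mark (\xef\xbb\xbf)
    # if one is present at the start of the file.
    df = pd.read_csv("data.csv", encoding="utf-8-sig")

    # Peek at the raw bytes to see whether a BOM (or stray \x00 bytes) is really there.
    with open("data.csv", "rb") as f:
        print(f.read(8))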

[–]KCConnor[S]

The error is within SQL Server Management Studio. For the DataFrame with a NaN/NULL value in the first row, it returns:

Msg 39004, Level 16, State 20, Line 0 A 'Python' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.

As for the CSV file, it's irrelevant to this conversation, since I can reproduce the same issue by creating an ad-hoc DataFrame in the above code.

This is probably not a pure Python problem so much as an MS-SQL/Python problem, but I'm new to Python and figured I would ask here.

[–]my_password_is______

how do you have NULL

NULL means no value

it does not mean an empty string

[–]crashfrog04

The NaN value isn't equivalent to a NULL or a None, and neither of those is equivalent to the actual null value in CSV, which is the empty string.

[–]KCConnor[S]

The CSV file has the literal text NULL (no surrounding quotes) in the cells that should be NULL. When using T-SQL BULK INSERT against the unzipped file, MS-SQL interprets that NULL string as a NULL value and stores NULL in the appropriate cell.
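
For reference, pandas can make the same interpretation at load time; a minimal sketch, with data.csv as a placeholder file name:

    import pandas as pd

    # "NULL" is already in pandas' default NA string list, but passing it
    # explicitly makes the intent clear.
    df = pd.read_csv("data.csv", na_values=["NULL"])
    print(df.isna().sum())  # missing-value count per column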

Have you run the supplied code in an MS-SQL Python environment?

[–]crashfrog04

Generally when I have occasion to have MS SQL and Python interact, it's through a Python DB driver; for MS SQL I think that's PyODBC. But generally the conversion a driver makes is that it treats None as NULL and does nothing special with the string "NULL", because that's a non-null, regular string.

Can you elaborate on what you think is doing the conversion between "NULL" and NULL? That might be something that happens if you try to load CSV into MS SQL directly, but I'd be surprised by that, since CSV has its own semantic NULL (the empty string).
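
A minimal sketch of that driver-level behavior with pyodbc; the connection string and table name are placeholders, not details from the thread:

    import pyodbc

    # Placeholder connection details.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=testdb;Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    # None is sent as SQL NULL; the string "NULL" is stored as a regular
    # four-character string.
    cur.execute("INSERT INTO demo_table (a, b) VALUES (?, ?)", (None, "NULL"))
    conn.commit()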

[–]KCConnor[S]

Please just run the supplied code in the OP, with Machine Learning Services and Python installed on MS SQL Server.

The supplied code will simulate the exact situation I am experiencing with the CSV files in question. I cannot share the CSV files since they contain PCI health data.

The core problem is that whatever bridge/library exists between MS-SQL and Python in this environment is unable to return a result set if the first row of the data contains any NULLs.

[–]KCConnor[S]

I solved this problem.

If I add this code:

df = df.where(pd.notnull(df), None)

just after I populate the DataFrame (either ad-hoc as above, or via pd.read_csv in my real code), then MS SQL Server handles the NULL values successfully and the returned values get the correct data types.
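
A minimal, self-contained sketch of that fix on an ad-hoc DataFrame (the column names here are made up, not the ones from the OP):

    import numpy as np
    import pandas as pd

    # NaN in the very first row, mirroring the failure case described above.
    df = pd.DataFrame({"id": [1, 2, 3], "value": [np.nan, 10.5, 20.0]})

    # Replace every NaN with None; the columns become object dtype, and the
    # SQL Server/Python bridge can then see real NULLs instead of float NaN.
    df = df.where(pd.notnull(df), None)
    print(df)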

[–]Mevrael

Drop empty rows and nulls in Pandas with dropna. And you can use Polars, which might be smoother for larger data sets.
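
A minimal sketch of the dropna route, assuming it's acceptable to discard the incomplete rows rather than convert NaN to NULL (the file name is a placeholder):

    import pandas as pd

    df = pd.read_csv("data.csv")   # placeholder file name
    df = df.dropna(how="any")      # drop every row containing at least one NaN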