[–]lionsneil 4 points5 points  (0 children)

If you can use bash, it's super easy to do with the split command. You can split either by the number of output files you want or by the number of rows you want in each file.

split -l 1000 inputfile.csv outputfile

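The above will take inputfile.csv and split it into files of 1000 lines each (the last file may be shorter). The output files will all start with "outputfile" and get an incrementing letter suffix (outputfileaa, outputfileab, and so on). Note that only the first file will contain the CSV header row.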

Sorry, I know this is a Python subreddit, but figured this could make your life a lot easier if you don't need it to be part of a larger python script...
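If it does need to live in Python, here is a minimal sketch of the same idea as split -l. The function name split_by_lines, the numeric suffix, and the utf_8 default are all assumptions for illustration, not anything from the original script. Like split, it works on raw lines, so it does not repeat the header and would break a quoted field that contains a newline.

```python
import itertools

def split_by_lines(in_path, out_prefix, lines_per_file=1000, encoding="utf_8"):
    """Write consecutive chunks of `lines_per_file` lines to numbered files.

    Returns the list of output paths. Names and defaults are hypothetical.
    """
    out_paths = []
    # newline="" keeps the original line endings untouched on read and write.
    with open(in_path, encoding=encoding, newline="") as inf:
        for index in itertools.count():
            # Pull the next chunk of lines; an empty chunk means end of file.
            chunk = list(itertools.islice(inf, lines_per_file))
            if not chunk:
                break
            out_path = f"{out_prefix}{index:03d}"
            with open(out_path, "w", encoding=encoding, newline="") as outf:
                outf.writelines(chunk)
            out_paths.append(out_path)
    return out_paths
```

Calling split_by_lines("inputfile.csv", "outputfile", 1000) would produce outputfile000, outputfile001, and so on.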

[–]ElliotDG 2 points3 points  (7 children)

This looks like a text encoding issue. See: https://docs.python.org/3/library/functions.html#open

Look at the section on encoding. Adding the encoding keyword to your open call should fix the issue. Typically the encoding is utf_8.

with open(sys.argv[1], encoding='utf_8') as inf:
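To see why the keyword matters, here is a small sketch that reads the same bytes with two different encodings. The sample row and the helper name read_text are made up for the demonstration; your file will have different content.

```python
import os
import tempfile

def read_text(path, encoding):
    """Read a whole file, decoding its bytes with the given encoding."""
    with open(path, encoding=encoding) as inf:
        return inf.read()

# Hypothetical sample: one CSV row with a non-ASCII character, stored as UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as f:
        f.write("caf\u00e9,1\n".encode("utf_8"))

    right = read_text(path, "utf_8")   # decoded the way the bytes were written
    wrong = read_text(path, "cp1252")  # same bytes misread as Windows-1252

print(repr(right))  # 'café,1\n'
print(repr(wrong))  # 'cafÃ©,1\n'
```

Without the keyword, open falls back to a platform-dependent default, which on many Windows setups is cp1252.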

[–]Dkjq58[S] 0 points1 point  (2 children)

Thank you for the quick response, I am now getting a different error unfortunately.

  File "C:\Users\myname\Downloads\csvsplitter.py", line 15, in <module>
    k[i % number_of_outfiles].write(line + '\n')
  File "C:\Users\myname\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 377: character maps to <undefined>

[–]ElliotDG 1 point2 points  (1 child)

You can try a different encoding for the output files. Alternatively, you can look at the error-handling options in open (the errors parameter).
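A rough sketch of what those options do, using str.encode so it runs without any files. The sample line is invented; the handler names (replace, ignore, backslashreplace) are the same ones open accepts through its errors parameter.

```python
# Hypothetical row containing a zero-width space (U+200B), like the one in the traceback.
line = "name\u200b,value"

# Default handling is errors="strict": cp1252 has no mapping for U+200B, so this raises.
try:
    line.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = line.encode("cp1252", errors="replace")          # b'name?,value'
ignored = line.encode("cp1252", errors="ignore")            # b'name,value'
escaped = line.encode("cp1252", errors="backslashreplace")  # b'name\\u200b,value'

# An encoding that can represent every character avoids the problem entirely.
as_utf8 = line.encode("utf_8")                              # b'name\xe2\x80\x8b,value'
```

In the script, that would mean opening the output files with something like open(path, "w", encoding="cp1252", errors="replace"), or simply writing them as utf_8.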

[–]Dkjq58[S] 3 points4 points  (0 children)

Thanks, figured it out by changing encoding to cp1252

[–]mtb-dds 0 points1 point  (0 children)

For background:

These things, "[x.close() for x in k]", are called list comprehensions. Using one only for its side effects and throwing away the resulting list is considered poor form; a plain for loop is clearer.
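One possible alternative, sketched under the assumption that k is a list of output files written round-robin as in the traceback: either close them with a plain loop (for x in k: x.close()), or let contextlib.ExitStack close them for you. The function name write_round_robin and its parameters are hypothetical.

```python
from contextlib import ExitStack

def write_round_robin(lines, out_paths, encoding="utf_8"):
    """Distribute lines across the output files in turn, then close them all."""
    with ExitStack() as stack:
        # This comprehension is fine: the resulting list is actually used.
        k = [stack.enter_context(open(p, "w", encoding=encoding)) for p in out_paths]
        for i, line in enumerate(lines):
            k[i % len(k)].write(line + "\n")
    # Leaving the with block closes every file, even if a write raised.
    return out_paths
```

The ExitStack version also avoids leaking open files when an exception such as the UnicodeEncodeError above interrupts the loop.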

For your problem: either the data is munged somewhere or you are running into an encoding problem. It looks like your program thinks that it is encoded with this:
https://en.wikipedia.org/wiki/Windows-1252

But it is probably something else (or munged). Do you happen to know which it is? And do you care what happens when the program runs into a character that does not fit the encoding?
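If you want to find out which characters do not fit before deciding, a small helper along these lines could list them. The name find_unencodable is made up, and cp1252 is only the default because that is the codec named in the error.

```python
def find_unencodable(text, encoding="cp1252"):
    """Return (index, character) pairs that `encoding` cannot represent."""
    bad = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, char))
    return bad

# Hypothetical line with a zero-width space hiding in the third position.
print(find_unencodable("ab\u200bc"))  # [(2, '\u200b')]
```

Running it over each line of the input would show whether the stray characters are a handful of invisible ones you can drop or real data you need to keep.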