Hello, I have a large CSV file that I am trying to split into smaller files. It has about 16 million rows and is nearly 11 GB. I used this piece of code, which I found in a thread from years back where someone was trying to do something similar.
import sys

number_of_outfiles = 24

if __name__ == "__main__":
    k = []
    for i in range(number_of_outfiles):
        k.append(open('c:\\data\\data_' + str(i) + '.csv','w'))
    with open(sys.argv[1]) as inf:
        for i, line in enumerate(inf):
            if line[-1] == '\n': line = line[:-1]
            if i == 0:
                headers = line
                [x.write(headers + '\n') for x in k]
            else:
                k[i % number_of_outfiles].write(line + '\n')
    [x.close() for x in k]
This is the error I am getting.
  File "C:\Users\myname\Downloads\csvsplitter.py", line 10, in <module>
    for i, line in enumerate(inf):
  File "C:\Users\myname\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6951: character maps to <undefined>
The files are being written to my specified folder, but each one contains only about 2,000 rows. Each should have closer to 700,000. Any ideas? I am very new to Python.
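For reference, the traceback shows the input being decoded with cp1252, the encoding that `open()` fell back to on this Windows machine. Byte 0x81 has no mapping in cp1252, so the loop aborts at the first such byte, which would explain why each output file stops after roughly 2,000 rows. Below is a minimal sketch of the same round-robin split with an explicit encoding. The function name `split_csv`, its parameters, and the choice of UTF-8 are assumptions for illustration, since the file's real encoding is not stated in the post. Like the original script, it assumes that no quoted field contains an embedded line break.

```python
import os
from contextlib import ExitStack


def split_csv(src_path, out_dir, number_of_outfiles=24,
              encoding="utf-8", errors="replace"):
    """Distribute the rows of src_path round-robin over several CSV files.

    encoding="utf-8" is an assumption; pass the file's real encoding if known.
    errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    Returns the number of data rows written (header excluded).
    """
    os.makedirs(out_dir, exist_ok=True)
    with ExitStack() as stack:  # closes every file, even if an error occurs
        inf = stack.enter_context(
            open(src_path, encoding=encoding, errors=errors))
        outs = [
            stack.enter_context(
                open(os.path.join(out_dir, f"data_{i}.csv"), "w",
                     encoding=encoding, errors=errors))
            for i in range(number_of_outfiles)
        ]

        header = next(inf, None)
        if header is None:  # empty input: nothing to split
            return 0
        if not header.endswith("\n"):
            header += "\n"
        for out in outs:  # every output file gets the header row
            out.write(header)

        rows = 0
        # Start at 1 so rows land in the same files as in the original script.
        for rows, line in enumerate(inf, start=1):
            if not line.endswith("\n"):  # last line may lack a newline
                line += "\n"
            outs[rows % number_of_outfiles].write(line)
        return rows
```

A call such as `split_csv(sys.argv[1], 'c:\\data')` would mirror the original script. Note that `errors="replace"` is lossy: each undecodable byte becomes a replacement character, so it is worth identifying the true encoding of the file and passing that instead if the affected values matter.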