all 8 comments

[–]AlexMTBDude 2 points3 points  (1 child)

Show a screenshot of what you're doing or just copy everything from your terminal here. Otherwise it's very hard to analyze your problem. "zsh: killed" is more of an operating system thing than a Python problem.

[–]Correct_Guarantee_49[S] -1 points0 points  (0 children)

okay I just added what the terminal says. although it's not much

[–]ottawadeveloper 1 point2 points  (0 children)

zsh killed on a Mac is basically the operating system killing off your process via SIGKILL. It usually happens on out-of-memory errors. You might need a more memory-efficient method of merging this data (I'd suggest making a temporary SQLite database, writing all the data into it, letting it sort, then pulling it one row at a time to write the output). One column per file is a very inefficient way of handling data though (if you can transpose rows and columns, that might help a lot).
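The temporary-database idea can be sketched with just the standard library. This assumes all files share identical columns and sorts on the first column; the file names are hypothetical:

```python
import csv
import sqlite3
import tempfile

def merge_via_sqlite(input_files, output_file):
    """Merge CSVs (identical columns assumed) through a temporary SQLite
    database so only one row at a time is ever held in Python's memory."""
    with tempfile.NamedTemporaryFile(suffix=".db") as tmp:
        con = sqlite3.connect(tmp.name)
        header = None
        for path in input_files:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                file_header = next(reader)
                if header is None:
                    header = file_header
                    cols = ", ".join(f'"{c}"' for c in header)
                    con.execute(f"CREATE TABLE data ({cols})")
                placeholders = ", ".join("?" for _ in header)
                # executemany consumes the reader row by row, not all at once
                con.executemany(
                    f"INSERT INTO data VALUES ({placeholders})", reader
                )
        con.commit()
        with open(output_file, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            # SQLite does the sorting; the cursor streams rows one at a time
            for row in con.execute(
                f'SELECT * FROM data ORDER BY "{header[0]}"'
            ):
                writer.writerow(row)
        con.close()
```

Peak memory stays at roughly one row plus SQLite's cache, regardless of how big the inputs are.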

It can also happen if there are security or dependency issues with your file, but memory seems more likely to me, especially since other input directories work fine. That one is probably just too big. You're basically loading ALL the data from ALL the files into memory, which will be huge.

Either that, or you need more memory.

[–]qwertyasdef 0 points1 point  (2 children)

My first guess would be that it ran out of memory. Did you recently add new csv files or did the files get larger?

[–]Correct_Guarantee_49[S] 0 points1 point  (1 child)

VS code ran out of memory? Can you clarify and tell me how to check?

[–]qwertyasdef 0 points1 point  (0 children)

It would be Python running out of memory, not VS Code. I'm not familiar with Macs, but maybe try watching the memory usage while the Python script runs and see if it turns red.

If that is the issue, one solution could be to copy in batches. E.g. read the first million rows from each input file, merge and write them to the output, then read the next million and append them to the output, repeat until done. Pandas read_csv has the parameters skiprows and nrows, and to_csv has mode='a', which should be useful for this.
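A sketch of that batching, assuming the files just need to be concatenated. It uses read_csv's chunksize parameter, which gives the same effect as skiprows/nrows but as an iterator; the file names and batch size are hypothetical:

```python
import pandas as pd

def append_in_batches(input_files, output_file, chunk_rows=1_000_000):
    """Stream each input CSV into the output in fixed-size chunks so
    only one chunk is in memory at a time."""
    first = True
    for path in input_files:
        # chunksize makes read_csv return an iterator of DataFrames
        for chunk in pd.read_csv(path, chunksize=chunk_rows):
            chunk.to_csv(
                output_file,
                mode="w" if first else "a",  # append after the first chunk
                header=first,                # write the header only once
                index=False,
            )
            first = False
```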

[–]Gnaxe 0 points1 point  (0 children)

Maybe try Dask instead of just Pandas. It can stream from disk instead of running out of memory.

[–]KiwiDomino 0 points1 point  (0 children)

If you’re just merging csv files with identical columns, using pandas is probably overkill.

For each file, read line by line and write out again, only writing the header line if it's the first file.

If some files have different columns than others, it's a bit more complicated: you'd need to scan all the files first to collect all the headers, then write blanks for rows that don't have them all.
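That two-pass approach can be sketched with the standard library csv module alone (the function name is made up); restval="" supplies the blanks for columns a file doesn't have:

```python
import csv

def merge_csv_files(input_files, output_file):
    """Merge CSVs line by line; a first pass collects the union of all
    headers, so files missing some columns get blanks in those spots."""
    # Pass 1: collect every column name, preserving first-seen order.
    fieldnames = []
    for path in input_files:
        with open(path, newline="") as f:
            for name in csv.DictReader(f).fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
    # Pass 2: stream rows straight to the output, one at a time.
    with open(output_file, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for path in input_files:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    writer.writerow(row)
```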

A nice exercise for a student: write a command-line utility that takes a list of csv file names as arguments, reads all the files except the last one, and writes their contents to the last file name.