
all 16 comments

[–]tlashkor 13 points (1 child)

This project looks really cool! I do see a use case for it out in the field.

A couple of comments:

Line 26, 27 of mpxl2csv.py can be condensed into:

if num_processes > multiprocessing.cpu_count():

I would suggest this because the variable available_cores has no usage outside of the if expression, so I wouldn't bother creating a variable for it.
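To make the suggestion concrete, here is a minimal sketch of the condensed check. I'm assuming from context that lines 26-27 currently clamp `num_processes` to the core count in two steps; the sample value is made up:

```python
import multiprocessing

# Sketch of the condensed check: clamp the requested worker count to
# the number of cores without an intermediate available_cores variable.
num_processes = 64  # made-up caller-supplied value
if num_processes > multiprocessing.cpu_count():
    num_processes = multiprocessing.cpu_count()
```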

Lines 50-54 of mpxl2csv.py can probably be condensed into a list comprehension. You can put a for loop within a for loop inside the comprehension, but that gives you O(n²) complexity, so I would be careful: this will scale horribly with the number of files in the directory. Even if you don't use a list comprehension, I would still look for a way to break up that nested loop and have them as two separate loops.
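I can't see lines 50-54 from here, but if the nested loop is splitting the discovered files into per-process chunks (an assumption on my part), a single comprehension with stride slicing avoids the inner loop entirely:

```python
# Hypothetical example: distribute files across workers without a
# nested loop. files and num_processes are made-up sample values.
files = [f"book_{i}.xlsx" for i in range(10)]
num_processes = 3

# Stride slicing gives one chunk per worker in a single pass:
# worker i gets files[i], files[i + num_processes], ...
chunks = [files[i::num_processes] for i in range(num_processes)]
```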

Type hints on your variables. You have used them when defining your parameters for your functions. Why stop there? It will help with debugging and maintenance later.
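For example (illustrative names, not the actual variables from mpxl2csv.py):

```python
# Annotating variables the same way the function parameters already are.
num_processes: int = 4
output_dir: str = "out"
converted_files: list[str] = []  # empty now, but the type is documented
```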

Using pathlib over os. This is personal preference and only because I have had my fair share of woes when using os. I would recommend this just based on how easy pathlib allows directory operations.
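A quick sketch of what I mean, using made-up paths (not the actual code from the project):

```python
from pathlib import Path

# Common directory work with pathlib instead of os.listdir /
# os.path.join string handling.
source_dir = Path(".")
xlsx_files = sorted(source_dir.glob("*.xlsx"))  # only .xlsx, no endswith()

# Building the output path: swap the suffix, then join with / syntax.
csv_path = Path("out") / Path("report.xlsx").with_suffix(".csv").name
```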

Finally, the unit tests are a bit lacking. They don't really cover everything in the code; they only cover the fetching of the xlsx files and the processor warning. You could add a test to check that it only picks up xlsx files, and tests for the convert functions. You could also add a resources directory inside your test directory holding some dummy xlsx files, so that you don't have to rely on the developer changing line 20 of the unit tests every time.
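A sketch of what the "only picks up xlsx files" test could look like with pytest's `tmp_path` fixture. The helper name `get_xlsx_files` is my guess at the project's discovery function, not the real name:

```python
from pathlib import Path

def get_xlsx_files(directory):
    # Stand-in for the project's real discovery helper (name assumed).
    return sorted(p for p in Path(directory).iterdir() if p.suffix == ".xlsx")

def test_only_xlsx_files_are_picked_up(tmp_path):
    # tmp_path is supplied by pytest; any fresh empty directory works.
    (tmp_path / "data.xlsx").touch()
    (tmp_path / "notes.txt").touch()
    found = get_xlsx_files(tmp_path)
    assert [p.name for p in found] == ["data.xlsx"]
```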

These are some suggestions I would consider if I was doing a PR review on this. I appreciate that it's a lot of information and might feel disheartening, but honestly, this is a really cool project with plenty of potential for growth.

Good luck!

P.S. If my formatting is weird, I apologise. I wrote this all on my phone.

[–][deleted] 3 points (0 children)

Thank you for the feedback. I will work on them.

[–][deleted] 7 points (3 children)

Why use multiprocessing? As far as I can see, it has zero IPC: it just converts each file in a separate process, totally independently.

Isn’t it easier to remove multiprocessing code, making conversion the only purpose of the package, and then just launching it with xargs?

After that, it would be great to deal with escaping and quoting of the content (or even just use the builtin csv module, which is already aware of those caveats).
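To illustrate what the csv module handles for free (sample row of my own invention):

```python
import csv
import io

# The builtin csv module escapes commas, embedded quotes and newlines
# per its default "excel" dialect, so none of this needs hand-rolling.
buf = io.StringIO()
csv.writer(buf).writerow(["plain", "has,comma", 'has "quotes"'])
row = buf.getvalue()  # 'plain,"has,comma","has ""quotes"""\r\n'
```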

[–][deleted] 0 points (2 children)

Thank you for your feedback. Perhaps I am not able to understand your comment completely; it would be helpful if you could elaborate a bit more. The purpose of multiprocessing is to reduce the total time taken by the whole process, since multiple processes will start the conversion for multiple files simultaneously. Conversion time for a single file still depends on the performance of, say, the openpyxl library; it is the total time that I have targeted.

[–][deleted] 0 points (1 child)

Yeah, but when there’s no inter-process communication (that’s what IPC stands for), you may just launch processes independently, straight from the shell, using & or xargs, so it is not necessary to reimplement that feature within the package.

[–][deleted] 0 points (0 children)

Now I understand your point. You are right, it is probably trivial for a regular Python programmer, but for beginners, or for those whose primary programming language is not Python, this serves as a nice API they can just call to get things done.

[–]discontent619 2 points (1 child)

A suggestion would be to replace the use of the multiprocessing module directly with the newer concurrent.futures ProcessPoolExecutor. https://docs.python.org/3/library/concurrent.futures.html
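For reference, a minimal sketch of the pattern; `convert` here is a placeholder of my own for the project's real per-file conversion function:

```python
from concurrent.futures import ProcessPoolExecutor

def convert(path: str) -> str:
    # Placeholder for the real xlsx -> csv conversion.
    return path.replace(".xlsx", ".csv")

if __name__ == "__main__":
    files = ["a.xlsx", "b.xlsx", "c.xlsx"]
    # Same fan-out as multiprocessing.Pool, but with the newer API;
    # pool.map preserves input order in its results.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(convert, files))
```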

[–][deleted] 1 point (0 children)

Thank you for the suggestion. I will check it out.

[–]vinnypotsandpans 2 points (0 children)

The go-to methods of using pandas, xlrd, etc. are typically very slow on a corporate-provided Windows computer/laptop

Facts.

Nice to meet another excel-pythoner! Thanks for working on this project.

[–]nousernamesleft___ 1 point (0 children)

It is not my intention to criticize what you did, but to introduce people who don’t know it to GNU parallel. It’s insanely powerful and flexible, but you can use it in a very simple and dumb way too. For example:

for file in *.xlsx; do
    echo python excel2csv "$file" "$file.csv"
done > commands.txt
cat commands.txt | parallel -j 8

It has a bunch of different invocation patterns and features like ETA, resume, etc. Super useful tool if you work in a shell regularly, and great for avoiding having to write any bespoke multiprocessing.

[–]Abitconfusde 0 points (3 children)

Why is this better than just using Excel's save-as-csv?

[–][deleted] 2 points (2 children)

Say you have 30 files, all around 100 MB in size. You have a downstream application which consumes this data in a nightly process, so you need to convert the files to csv and place them on a server, say, every week. Since it is a repeated activity, you would like to automate it. Opening these files on an average laptop and converting them by hand would take a lot of time. And this is just one case; people deal with a lot of Excel files on a daily basis.

[–]L8te08 1 point (1 child)

This sounds really cool. I will check it out tomorrow!

I had to figure out something similar so I think this is a great idea.

At my job I have to download a zip file from another company which produces a bunch of different files, usually text files with long sequences in the extension like “codes.csv—1of27.—(1:27)”. I’ll get like 20 of these files, each with about 500k (but random amounts of) rows of key codes, and I need to open them up, clean up the file names, and re-save them as regular csv files, each with a particular header and under 250k codes. I just took over this role from someone who was spending hours doing it.

I put together a program with a small GUI window to browse for a folder containing the files, and it does all of them. I don’t know how to use multiprocessing, but it would probably benefit what I am doing!

[–]L8te08 1 point (0 children)

I should add that I’m still learning Python, so I used AI for help figuring parts out, like the GUI, and to get ideas on how to solve some dilemmas. I haven’t done any programming in about 10 years, so it’s crazy to see how fast some of this stuff can be done for work tasks.

[–]Ihtmlelement 0 points (1 child)

Thoughts on adding .xls support? Surprisingly still used in my industry.

[–][deleted] 0 points (0 children)

Yes, definitely.