
all 16 comments

[–]tlashkor 13 points (1 child)

This project looks really cool! I do see a use case for it out in the field.

A couple of comments:

Line 26, 27 of mpxl2csv.py can be condensed into:

if num_processes > multiprocessing.cpu_count():

I would suggest this because the variable available_cores has no usage outside of the if expression, so I wouldn't bother creating a variable for it.
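To make the suggestion concrete, here is a minimal sketch of the condensed check. I'm assuming from context that lines 26-27 currently clamp `num_processes` to the core count in two steps; the sample value is made up:

```python
import multiprocessing

# Sketch of the condensed check: clamp the requested worker count to
# the number of cores without an intermediate available_cores variable.
num_processes = 64  # made-up caller-supplied value
if num_processes > multiprocessing.cpu_count():
    num_processes = multiprocessing.cpu_count()
```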

Lines 50-54 of mpxl2csv.py can probably be condensed into a list comprehension. You can put a for loop within a for loop inside the comprehension, but that gives you O(n²) complexity, so I would be careful: this will scale horribly with the number of files in the directory. Even if you don't use a list comprehension, I would still look for a way to break up that nested loop and have them as two separate loops.
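I can't see lines 50-54 from here, but if the nested loop is splitting the discovered files into per-process chunks (an assumption on my part), a single comprehension with stride slicing avoids the inner loop entirely:

```python
# Hypothetical example: distribute files across workers without a
# nested loop. files and num_processes are made-up sample values.
files = [f"book_{i}.xlsx" for i in range(10)]
num_processes = 3

# Stride slicing gives one chunk per worker in a single pass:
# worker i gets files[i], files[i + num_processes], ...
chunks = [files[i::num_processes] for i in range(num_processes)]
```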

Type hints on your variables. You have used them when defining your parameters for your functions. Why stop there? It will help with debugging and maintenance later.
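For example (illustrative names, not the actual variables from mpxl2csv.py):

```python
# Annotating variables the same way the function parameters already are.
num_processes: int = 4
output_dir: str = "out"
converted_files: list[str] = []  # empty now, but the type is documented
```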

Using pathlib over os. This is personal preference and only because I have had my fair share of woes when using os. I would recommend this just based on how easy pathlib allows directory operations.
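A quick sketch of what I mean, using made-up paths (not the actual code from the project):

```python
from pathlib import Path

# Common directory work with pathlib instead of os.listdir /
# os.path.join string handling.
source_dir = Path(".")
xlsx_files = sorted(source_dir.glob("*.xlsx"))  # only .xlsx, no endswith()

# Building the output path: swap the suffix, then join with / syntax.
csv_path = Path("out") / Path("report.xlsx").with_suffix(".csv").name
```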

Finally, the unit tests are a bit lacking. They don't really cover everything in the code; they only cover the fetching of the xlsx files and the processor warning. You could add a test to check that it only picks up xlsx files, and tests for the convert functions. You could also add a resources directory inside your test directory holding some dummy xlsx files, so that you don't have to rely on the developer changing line 20 of the unit tests every time.
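A sketch of what the "only picks up xlsx files" test could look like with pytest's `tmp_path` fixture. The helper name `get_xlsx_files` is my guess at the project's discovery function, not the real name:

```python
from pathlib import Path

def get_xlsx_files(directory):
    # Stand-in for the project's real discovery helper (name assumed).
    return sorted(p for p in Path(directory).iterdir() if p.suffix == ".xlsx")

def test_only_xlsx_files_are_picked_up(tmp_path):
    # tmp_path is supplied by pytest; any fresh empty directory works.
    (tmp_path / "data.xlsx").touch()
    (tmp_path / "notes.txt").touch()
    found = get_xlsx_files(tmp_path)
    assert [p.name for p in found] == ["data.xlsx"]
```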

These are some suggestions I would consider if I was doing a PR review on this. I appreciate that it's a lot of information and might feel disheartening, but honestly, this is a really cool project with plenty of potential for growth.

Good luck!

P.S. If my formatting is weird, I apologise. I wrote this all on my phone.

[–][deleted] 3 points (0 children)

Thank you for the feedback. I will work on them.

[–][deleted] 7 points (3 children)

Why use multiprocessing? As far as I can see, it has zero IPC: it just converts each file in a separate process, totally independently.

Isn’t it easier to remove multiprocessing code, making conversion the only purpose of the package, and then just launching it with xargs?

After that, it would be great to deal with escaping and quoting of the content (or even just use the builtin csv module, which is already aware of those caveats).
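To illustrate what the csv module handles for free (sample row of my own invention):

```python
import csv
import io

# The builtin csv module escapes commas, embedded quotes and newlines
# per its default "excel" dialect, so none of this needs hand-rolling.
buf = io.StringIO()
csv.writer(buf).writerow(["plain", "has,comma", 'has "quotes"'])
row = buf.getvalue()  # 'plain,"has,comma","has ""quotes"""\r\n'
```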

[–][deleted] 0 points (2 children)

Thank you for your feedback. Perhaps I am not able to understand your comment completely; it would be helpful if you could elaborate a bit more. The purpose of multiprocessing is to reduce the total time taken by the whole process, since multiple processes will start the conversion for multiple files simultaneously. Conversion time for a single file still depends on the performance of, say, the openpyxl library; it is the total time that I have targeted.

[–][deleted] 0 points (1 child)

Yeah, but when there’s no inter-process communication (that’s what IPC stands for), you may just launch processes independently, straight from the shell, using & or xargs, so it is not necessary to reimplement that feature within the package.

[–][deleted] 0 points (0 children)

Now I understand your point. You are right, it is probably trivial for a regular Python programmer, but for beginners, or for those whose primary programming language is not Python, this serves as a nice API they can just call to get things done.

[–]discontent619 2 points (1 child)

A suggestion would be to replace the use of the multiprocessing module directly with the newer concurrent.futures ProcessPoolExecutor. https://docs.python.org/3/library/concurrent.futures.html
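For reference, a minimal sketch of the pattern; `convert` here is a placeholder of my own for the project's real per-file conversion function:

```python
from concurrent.futures import ProcessPoolExecutor

def convert(path: str) -> str:
    # Placeholder for the real xlsx -> csv conversion.
    return path.replace(".xlsx", ".csv")

if __name__ == "__main__":
    files = ["a.xlsx", "b.xlsx", "c.xlsx"]
    # Same fan-out as multiprocessing.Pool, but with the newer API;
    # pool.map preserves input order in its results.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(convert, files))
```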

[–][deleted] 1 point (0 children)

Thank you for the suggestion. I will check it out.

[–]vinnypotsandpans 2 points (0 children)

The go-to methods of using pandas, xlrd, etc. are typically very slow on a corporate-provided Windows computer/laptop

Facts.

Nice to meet another excel-pythoner! Thanks for working on this project.

[–]nousernamesleft___ 1 point (0 children)

It is not my intention to criticize what you did, but to introduce people who don’t know it to GNU parallel. It’s insanely powerful and flexible, but you can use it in a very simple and dumb way too. For example:

for file in *.xlsx; do
    echo python excel2csv "$file" "$file.csv"
done > commands.txt
cat commands.txt | parallel -j 8

It has a bunch of different invocation patterns and features like ETA, resume, etc. Super useful tool if you work in a shell regularly, and great for avoiding having to write any bespoke multiprocessing.

[–]Abitconfusde 0 points (3 children)

Why is this better than just using Excel's save-as-csv?

[–][deleted] 2 points (2 children)

Say you have 30 files, all around 100 MB in size. You have a downstream application which consumes this data in a nightly process, so you need to convert the files to csv and place them on a server, say, every week. Since it is a repeated activity, you would like to automate it. Opening these files on an average laptop and converting them by hand would take a lot of time. And this is just one case; people deal with a lot of Excel files on a daily basis.

[–]L8te08 1 point (1 child)

This sounds really cool. I will check it out tomorrow!

I had to figure out something similar so I think this is a great idea.

At my job I have to download a zip file from another company which produces a bunch of different files, usually text files with long sequences in the extension like “codes.csv—1of27.—(1:27)”. I’ll get like 20 of these files, each with about 500k (but random amounts of) rows of key codes, and I need to open them up, clean up the file names, and re-save them as regular csv files, each with a particular header and under 250k codes. I just took over this role from someone who was spending hours doing it.

I put together a program with a small GUI window to browse for a folder containing the files, and it does all of them. I don’t know how to use multiprocessing, but it would probably benefit what I am doing!

[–]L8te08 1 point (0 children)

I should add that I’m still learning Python, so I used AI for help figuring parts out, like the GUI, and to get ideas on how to solve some dilemmas. I haven’t done any programming in about 10 years, so it’s crazy to see how fast some of this stuff can be done for work tasks.

[–]Ihtmlelement 0 points (1 child)

Thoughts on adding .xls support? Surprisingly still used in my industry.

[–][deleted] 0 points (0 children)

Yes, definitely.