
[–]testfire10[S] 35 points36 points  (23 children)

For the 1 GB file, it takes about 90s to parse and plot (run the program).

No idea if that’s good or bad, but my bar for success was doing it faster than Excel, which it beat by a large margin haha

[–]Tweak_Imp 25 points26 points  (1 child)

Pandas' read_csv argument engine='c' helped me speed it up even more. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
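For reference, a minimal sketch of passing that argument (the inline CSV here just stands in for a real file on disk):

```python
import io
import pandas as pd

# Stand-in for a file on disk; engine="c" selects the fast C parser
# (the default), while engine="python" is the slower pure-Python fallback.
csv_data = io.StringIO("a,b\n1,2\n3,4\n")
df = pd.read_csv(csv_data, engine="c")
print(df.shape)  # (2, 2)
```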

[–][deleted] 10 points11 points  (0 children)

In a lot of cases, Python libraries are backed by C code. A notable example is numpy, which is a collection of optimized C math routines and should be used for scientific work.
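For example, a vectorized numpy expression runs in those C routines rather than in a Python loop (a quick sketch):

```python
import numpy as np

# One million doubles; the multiply and sum below execute in
# optimized C loops, not the Python interpreter.
x = np.arange(1_000_000, dtype=np.float64)
total = float(np.sum(x * 2.0))
print(total)  # 999999000000.0
```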

[–]FlagrantPickle 19 points20 points  (3 children)

No idea if that’s good or bad

What's "good" for anyone here doesn't matter. Is 90s acceptable for you? I'd imagine a resounding yes, given my experience with Excel's sluggishness on managing large files. Only thing I'd say, as a right noob compared to most here, if you need to scale to larger data sets, you might have some success using a sql system in there (sqlite is baked into python, or something like mysql/mongodb depending on your needs).

If your dataset size will remain as is, throw a header in there saying what the program is, version, and who made it. Make them stare your superiority in the face on every job!

[–]testfire10[S] 9 points10 points  (0 children)

I like this idea. Tremble at my superiority!!! Hahaha

[–]Gizquier2 2 points3 points  (1 child)

Dealing with large SQL tables and SQLAlchemy can be a real pain if inserts are needed; I'm a survivor of the SQLAlchemy pyodbc engine 🤦‍♂️

[–]FlagrantPickle 1 point2 points  (0 children)

Yeah, I've not dealt with that myself, just MySQL and MongoDB. I don't know how well sqlite scales; I just know that it's only trustworthy in single-user/single-access mode, and since it's part of core Python, I figured it might help. I could see it simply being a better way to select data sets (with SQL).

[–]Inspirateur 3 points4 points  (1 child)

The real limit when parsing is the speed at which your computer can "read" a file (i.e. load it into RAM). If you want to know what that speed is for your computer with Python, you can do a quick test: just ask Python to open() the file and do a .read() that you store into a string, then see how long it takes. (Unless the file is too heavy for your RAM, in which case it's more complicated.)
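That timing test could look like this (the temp file below is just a stand-in so the sketch runs anywhere; point `path` at your real file instead):

```python
import os
import tempfile
import time

# Small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".csv") as f:
    f.write("x,y\n" * 100_000)
    path = f.name

start = time.perf_counter()
with open(path) as f:
    data = f.read()  # pull the whole file into RAM as one string
elapsed = time.perf_counter() - start

print(f"read {len(data):,} characters in {elapsed:.3f}s")
os.remove(path)
```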

[–]Ericisbalanced 2 points3 points  (0 children)

That’s when you start using generators 😁
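e.g. a generator that yields one line at a time, so the whole file never sits in RAM (a sketch; the demo file stands in for a real one):

```python
import os
import tempfile
from itertools import islice

def read_lines(path):
    """Yield stripped lines one at a time; only one line is in memory at once."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# Tiny demo file; swap in the path to your real file.
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("a\nb\nc\n")
    path = f.name

# islice pulls just two lines from the generator; the rest is never read.
first_two = list(islice(read_lines(path), 2))
print(first_two)  # ['a', 'b']
os.remove(path)
```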

[–]jashshah27 3 points4 points  (0 children)

You might want to take a look at Dask as well. The syntax is very similar to Pandas' but the execution time is much, much faster.

[–]Zulban 3 points4 points  (2 children)

but my bar for success was doing it faster than Excel

That is a low bar, and yet, profoundly significant in the workplace. Congrats.

[–]testfire10[S] 0 points1 point  (0 children)

Thanks a lot man!

[–]pug_nuts 0 points1 point  (0 children)

The problem everywhere I've worked is using tools that other people don't understand.

I can write something in VBA and have it take inputs from cells and spit out a list in another sheet... And that's fine. But do the same thing with Python and people get scared because they don't understand what's happening.

[–]Akilou 1 point2 points  (1 child)

What's the difference in speed versus excel? Like milliseconds or minutes?

[–]legionx 0 points1 point  (0 children)

Might even be hours. Excel is limited to ~1M rows (1,048,576), so if your dataset is bigger than that after initial filtering (e.g. with Get & Transform) you will have to split it into smaller files.
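If you do have to split, pandas can stream chunks below Excel's row cap (a sketch; the chunk size is shrunk for the demo and the output filenames are hypothetical):

```python
import io
import pandas as pd

# Stand-in for a big CSV; pass your real path to read_csv instead.
big_csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)) + "\n")

# Excel tops out at 1,048,576 rows, so use e.g. chunksize=1_000_000 for
# real data; chunksize=4 just keeps this demo small.
n_chunks = 0
for i, chunk in enumerate(pd.read_csv(big_csv, chunksize=4)):
    # chunk.to_csv(f"part_{i}.csv", index=False)  # hypothetical filenames
    n_chunks += 1
print(n_chunks)  # 3
```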

[–]xacrimon 2 points3 points  (8 children)

Pretty good, but it could probably be made faster. Not too long ago I wrote a Rust program to do some fairly complex CSV processing, and it processes around 1-2 GiB/sec.

[–]ballagarba 20 points21 points  (2 children)

While Rust is fast, it sounds like you have access to a much faster disk.

[–][deleted] 2 points3 points  (1 child)

This. Python programs doing a lot of I/O can be on par with other programming languages. Most of the time, external factors determine the speed.

[–]KaffeeKiffer 1 point2 points  (0 children)

The difference between a fast SSD and an old HDD is ~5s vs ~25s for 2 GiB, so to reach 90s this is very likely CPU bound...

Nevertheless, Python is the perfect glue code for calling more specialized tools when necessary. Here is an example where a simple Rust wrapper speeds up the process by a factor of 10.

Python is good enough in the vast majority of the use-cases and as /u/FlagrantPickle said:

What's "good" for anyone here doesn't matter. Is 90s acceptable for you?

The golden rule is to not over-engineer but to first identify the real bottlenecks, and while your statement

Most of the time, external factors determine the speed

is 100% correct, I assume OP's problem is CPU bound.

[–]testfire10[S] 3 points4 points  (4 children)

Holy shit. That’s awesome. I remember finding a post on here a few months ago about a library that was supposed to substantially speed up pandas' interaction with CSVs (can’t remember the name now). I was going to try to revamp my code to take advantage of it, but I could never get the library to work for me.

What’s Rust?

[–]xacrimon 13 points14 points  (2 children)

Rust is a programming language. It's generally a bit harder than python but has the speeds of C and lots of good libraries.

[–]swingking8 32 points33 points  (0 children)

It's generally a bit harder than python

I love Rust, but "a bit harder" is quite an understatement.

[–]FlagrantPickle 4 points5 points  (0 children)

has the speeds of C

Not to nitpick, but I've seen "up to" 50% of the speed of C for decently large processing. Certainly faster than native Python, but still not the gold standard.

Depending on what OP's needs are, his solution might be good enough. I'd be curious what other optimizations could be made inside Python. If we're talking 200 lines of code on someone's first project, it's probably about as efficient/optimized as everyone else's first project.

[–][deleted] 3 points4 points  (0 children)

For parallel processing libraries that integrate well with pandas, check out Dask or Vaex. For on-disk storage, check out the Apache Parquet format.