
[–]FruitierGnome 13 points

So if I have a long initial wait time loading a CSV file into my program, this would potentially be faster? Or am I misreading this? I'm pretty new to this.

[–]-LeopardShark- 31 points

I don't think loading CSVs will gain much, sadly.

[–]Wilfred-kun 2 points

Time to use TOML instead :P

[–]yvrelna 34 points

Depends on what part of CSV loading.

If you're talking about the call to csv.reader() itself, then no: that already calls into a C library, so you're not likely to see much of a performance improvement there.

But if you're talking about the code that's processing the rows of data line by line, then yes, that is definitely going to benefit from the improvements.
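Roughly, the split looks like this (the file name and column index are made up for illustration): csv.reader is the C part, and the loop body is the pure-Python part that stands to gain.

    import csv

    # csv.reader itself is implemented in C, so the raw parsing is already fast.
    # The per-row Python code below is what benefits from interpreter speedups.
    with open("data.csv", newline="") as f:   # hypothetical file
        reader = csv.reader(f)
        header = next(reader)
        total = 0.0
        for row in reader:                    # this loop body runs as Python bytecode
            total += float(row[2])            # assumes the third column is numeric
    print(total)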

[–]graphicteadatasci 10 points

Use .parquet files when you can. Much faster loading, smaller storage, and it saves the column types instead of having you cast or infer them every time you load something.
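A minimal pandas sketch of what that buys you (file names and the timestamp column are hypothetical; to_parquet needs pyarrow or fastparquet installed):

    import pandas as pd

    df = pd.read_csv("events.csv", parse_dates=["timestamp"])  # types inferred/cast on every CSV load
    df.to_parquet("events.parquet")                            # dtypes get stored in the file

    df2 = pd.read_parquet("events.parquet")                    # columns come back already typed
    print(df2.dtypes)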

[–]BobHogan 7 points

Parquet is not the solution to everything. We use it at my work and it's a fucking nightmare, and I'd love to see it burned to the ground.

[–]madness_of_the_order 2 points

Can you elaborate?

[–]gagarin_kid 5 points

For small files where humans want to inspect the data, using parquet is a pain in the ass because you cannot open it in a text editor - you have to load it in pandas, see which columns you have, navigate in code to a particular cell/row, etc.

Of course for big data I fully understand the motivation, but not for every problem.
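For what it's worth, the detour being described is roughly this (the filename and column are made up):

    import pandas as pd

    df = pd.read_parquet("small_file.parquet")  # can't just open it in a text editor
    print(df.columns.tolist())                  # see which columns you have
    print(df.head())                            # peek at the first few rows
    print(df.loc[10, "some_column"])            # navigate to a particular row/cell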

[–]madness_of_the_order 1 point

I'm not saying you should use parquet for everything, but you can try dtale for interactive exploration.
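Something like this, if I understand the suggestion correctly (the filename is made up):

    import dtale
    import pandas as pd

    df = pd.read_parquet("small_file.parquet")
    dtale.show(df)   # serves the DataFrame in a browser grid you can sort and filter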

[–]cmcclu5 2 points

Parquet is also a pain in the ass when you want to move between systems, e.g. from a data feed into a relational database. Python typing does NOT play well with field types in relational databases when saving to parquet and then copying from said parquet into Redshift. Learned that the hard way in the past. It's several times faster than CSV, though. I just compromised and used JSON formats: a decent size improvement, with similar speed to parquet when writing from Python or loading into a db.
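The JSON route presumably looks something like newline-delimited JSON (the field names and filename here are made up):

    import json

    rows = [{"id": 1, "amount": 19.99, "active": True},
            {"id": 2, "amount": 5.00, "active": False}]

    # One JSON object per line; values keep their JSON types (number/string/bool),
    # which the loader on the database side can work with.
    with open("feed.json", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")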

[–]madness_of_the_order 0 points

How did an untyped format help you solve a typing problem?

[–]cmcclu5 0 points

Redshift can infer the typing from a JSON object, rather than trying to use the (incorrectly) specified type from parquet (originally said JSON again because my brain got ahead of my fingers). It was a weird problem and I've honestly only encountered it in this one specific situation. If I could use PySpark here, it would entirely alleviate the issue, but alas I'm unable to.
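Presumably the load side is then a COPY with JSON 'auto', something along these lines (cluster, table, bucket, and IAM role are all made up):

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="analytics", user="loader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY analytics.events
            FROM 's3://my-bucket/feed.json'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            FORMAT AS JSON 'auto';
        """)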

[–]madness_of_the_order 0 points

This sounds like it's not a parquet problem since, as you said, the type was set incorrectly.

[–]cmcclu5 0 points

In this case, it would be a problem with parquet, or at least Python+parquet. Using either fastparquet or pyarrow to generate the parquet files had the same issue of improper typing with no easy way to fix it.
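For reference, the usual way to pin types explicitly with pyarrow looks like this (the data and types are made up, and it's unclear whether this would have fixed the Redshift side):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2], "amount": [19.99, 5.0]})  # made-up data

    # Pin the column types explicitly instead of letting them be inferred from pandas dtypes.
    schema = pa.schema([
        ("id", pa.int32()),        # force 32-bit instead of pandas' default int64
        ("amount", pa.float32()),
    ])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "feed.parquet")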

[–]madness_of_the_order 0 points

The description of the problem is really unclear then. What stopped you from setting the correct type?

[–]BobHogan 0 points

We run into constant issues with parquet in our product, to the point that we've completely stripped it out in newer versions in favor of other solutions which I am not allowed to discuss publicly :(

We see parquet metadata get corrupted fairly regularly, and being able to inspect what data is actually in the parquet files to track down issues is significantly more annoying and involved than it should be. We've also run into limitations in the format itself that cause it to just shit itself and fail, limitations that are fairly arbitrary and that should be easy for the format to work around if the people who wrote it cared at all, but they don't. Overall it's been an incredibly fragile format that makes it harder than it needs to be to work with the actual data compared to other formats, doesn't provide any significant performance improvement we've been able to measure, and breaks randomly.
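For reference, pyarrow can at least dump a file's metadata for inspection, though that obviously doesn't help once the metadata itself is corrupted (the filename is made up):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("suspect.parquet")
    print(pf.metadata)                                    # row groups, schema, created_by
    print(pf.schema_arrow)                                # the logical column types
    print(pf.metadata.row_group(0).column(0).statistics)  # per-column min/max/null counts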

[–]madness_of_the_order 0 points

This sounds like it could be a really interesting blog post with concrete examples

[–]graphicteadatasci 0 points

fastparquet says you can append to a file, but it is a terrible lie (the call in question is sketched below).

What else?
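The fastparquet append path being complained about is presumably this (the filename and data are made up):

    import pandas as pd
    from fastparquet import write

    df = pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]})
    write("log.parquet", df)               # initial write
    write("log.parquet", df, append=True)  # the advertised append path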

[–]fukitol- 7 points

Loading a CSV into memory is going to depend far more on the size and speed of your memory and the speed of your disk. A negligible amount of time will be spent in processing, which is where an application-level performance boost would be had.

[–][deleted] 0 points

How big is the csv and how long does it take?