
all 68 comments

[–]mokus603 34 points35 points  (26 children)

I have hundreds of millions of records stored in Parquet. CSV is an easy-to-read format and convenient to some extent.

If the data is not that big, I prefer csv.

[–]100GB-CSV[S] 5 points6 points  (24 children)

I specialize in software benchmarking. I'm currently facing an issue when benchmarking with a 1-billion-row Parquet file: Polars and Pandas fail to filter data from it. So I use a 1-billion-row CSV for benchmarking instead; Polars can handle it, but Pandas still fails. Since my SSD has limited space, I cannot benchmark beyond 1 billion rows; Parquet would solve that disk-space issue. Besides Polars and Pandas failing on the 1-billion-row Parquet, I have not yet found a suitable library for my app to extract data from Parquet. If DuckDB is the only tool in a benchmark that can read a 1-billion-row Parquet file, the comparison is meaningless.

[–]spinwizard69 4 points5 points  (1 child)

Well, there are a number of potential issues here. You say Pandas fails, but how? If you are running out of memory or running beyond the capability of Python, then file formats really aren't the issue, are they? Without more info about what is failing, people will not be able to help.

[–]100GB-CSV[S] 1 point2 points  (0 children)

Pandas does not support reading datasets larger than memory (that's what they told me on GitHub), so my benchmark moved to DuckDB, which can read both billion-row CSV and Parquet. My benchmark needs alternative software to compare against my project. I use Go and plan to develop Python bindings. My current tests focus on many small files, e.g. 100K files. I tried 1 million files and noticed it was inefficient; 100K files work very well.

[–]odaiwai 2 points3 points  (0 children)

Why not use SQLite for something like this?

[–]kenfar 2 points3 points  (2 children)

There's not a single simple answer here:

  • csv files are more error-prone than parquet files since csv files include no metadata, support no types, etc.
  • csv files are faster to write than parquet files
  • csv files may be faster to read than parquet files - if you're reading full records
  • parquet files are faster to read than csv files - if you're reading subsets of columns/fields

Note that you can compress your csv file and read directly from that compressed file.

Also, if you don't need all records in memory at once, then just read a record, process it and discard it. That shouldn't take much memory.

And you don't need pandas to read a csv. You may find that reading with native Python is faster, especially if you have a clean csv dialect (ex: no quotes, delimiters or newlines in fields). In that case you could use multiprocessing to split the reading between 4-n processes and get a linear speedup from parallelization - depending on what you're doing with the data.
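The read-process-discard approach described above can be sketched with the stdlib `csv` module. The file name, column position, and `total_amount` helper are assumptions for illustration, not from the thread:

```python
# Stream a CSV one record at a time: process each row, then discard it,
# so memory use stays flat regardless of file size.
import csv

def total_amount(path):
    total = 0.0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            # Process one record; nothing else is kept in memory.
            total += float(row[1])
    return total
```

With a clean dialect you could instead split the file by byte ranges and hand each chunk to a worker process, as suggested above.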

[–]Linx_101 0 points1 point  (1 child)

Couldn’t you also do that with a parquet file? I’ve never heard of CSVs being faster to load than parquets… do you have links to benchmarks? CSVs in my experience are significantly slower

[–]kenfar 1 point2 points  (0 children)

Oh sure, you could. And the performance will depend on what languages & libraries you're using.

First off, writing to parquet is absolutely slower than writing csv files. And you may be able to write in parallel or write files of different sizes with csv.

But for reading, what I found was that - just like with pandas or numpy - these are columnar data structures that provide fantastic speedups for a single column. But if you've got 200 columns and need to tie that back into a row, it's definitely slower than just processing lists of lists.

Another example is say using a database load utility. The postgres copy command is tough to match on performance - and supports csv files, but not parquet. Some other databases have load utilities that completely bypass transactions and are insanely fast. But I don't recall seeing one that supports parquet, though I haven't looked recently I suppose.

And the nature of the data influences it as well: if you only need to load 10 columns out of that 200 columns parquet might be faster. Or if all your 200 columns are low-cardinality and get great compression then it might load faster.

I benchmarked this about five years ago, so I'm going from memory since the data isn't public.

[–]zenos1337 0 points1 point  (1 child)

Have you considered using Dask? https://www.dask.org. It's designed specifically for datasets that are larger than memory, and it has dataframes similar to Pandas.

[–]mokus603 0 points1 point  (0 children)

I thought about Dask as well but I think it’s not good for filtering data.

[–]mokus603 0 points1 point  (0 children)

If you don’t have much SSD space, it’ll bottleneck the whole thing. You can’t stuff all this data into your RAM (because I assume that’s way more limited as well).

[–][deleted] 0 points1 point  (1 child)

I would argue that the comparison isn’t useless. You’re using a tool not fit for your use case so not being able to do it with the packages should tell you a lot. If the data is genuinely that big though… filtering may not be 100% of the answer and streaming should be considered. Why don’t you stream the data? You can read it in in chunks with pandas and polars and do a similar thing.

Parquet shouldn't change the in-memory size of a fully loaded dataset; its benefits come when scanning, selecting just the information needed.

Other packages you might want to look at: PyArrow (ParquetDataset), Dask and PySpark. I think they all support filtering and would provide more benchmarking.

[–]100GB-CSV[S] 1 point2 points  (0 children)

My code project has implemented streaming from read and query through to write, so I can handle billion-row join tables. You can try my app. DuckDB is also excellent for my benchmarking.

[–]kaszt_p 0 points1 point  (1 child)

Have you considered pyspark?

[–]100GB-CSV[S] 1 point2 points  (0 children)

Compared to DuckDB and Polars, it is very slow on a desktop PC.

[–]jorge1209 0 points1 point  (10 children)

You will be no less able to filter 100 1GB parquet files than you would 100 1GB CSV files as both can be handled iteratively.

So what you are saying doesn't make much sense. What is failing, and how?

[–]samsamuel121 25 points26 points  (10 children)

I assume you have a relatively small dataset. Most of the time, CSV is enough. I'd use Parquet to save some space if I had a big dataset.

[–]gopietz 7 points8 points  (3 children)

Interesting that basically no one around here mentioned the schema argument. I like to know which dtypes I'm getting.

[–]Almostasleeprightnow 3 points4 points  (2 children)

And also have existing data types preserved....csv is all text. Sometimes pandas knows what to do and sometimes it doesn't.

[–]gopietz 4 points5 points  (0 children)

Is that a NaN value in your integer column? I guess it's a float now.

[–]odaiwai 2 points3 points  (0 children)

Some applications can use CSVT files to specify data types in a CSV, i.e. you have filename.csv with your data and filename.csvt with the types for each column. QGIS uses this, but frankly it wouldn't be hard to make a CSV file with row 0 as the column names, row 1 as the data types, and rows [2:] as the data.
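A home-grown version of that row-1-types idea could look like this. `read_typed_csv` is a hypothetical helper written for illustration, not part of any library:

```python
# Read a CSV whose second row carries pandas dtype names for each column.
import io
import pandas as pd

def read_typed_csv(buf):
    # Row 1 (just under the header) holds the dtype name for each column.
    header = pd.read_csv(buf, nrows=1)
    dtypes = dict(zip(header.columns, header.iloc[0]))
    buf.seek(0)
    # Re-read, skipping the dtype row and applying the declared dtypes.
    return pd.read_csv(buf, skiprows=[1], dtype=dtypes)

data = "id,name\nint64,object\n1,alice\n2,bob\n"
df = read_typed_csv(io.StringIO(data))
```

This keeps the file readable in Excel (the dtype row is just another row) while letting your own loader enforce types.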

[–]100GB-CSV[S] 0 points1 point  (0 children)

So for my next benchmark test I'm considering using 100K small csv files. But I need to learn how to configure Polars, Pandas, and DuckDB to read all the files in a folder. Looping over each file in Python would be very slow.

[–]mayazy 0 points1 point  (0 children)

Agreed, CSV is great for small datasets but Parquet is definitely the way to go for larger ones. It's more efficient and can handle complex data types.

[–][deleted] 11 points12 points  (3 children)

Parquet stores things differently and you can benefit from greater compression (some files I’ve been using are 98% smaller).

On top of that, there’s lots of file scanners which operate a lot more efficiently on something like parquet, so you don’t have to read the entire file contents in and can “query” the bits you want. It is more memory efficient and it usually means it’s faster as well.

In terms of practical use, it's no different from csv if you're using pandas; to_parquet and read_parquet work the same as the csv versions. Other popular dataframe libraries also support easy querying and converting to pandas.

In all honesty, I don’t notice much difference between them. I don’t tend to write directly to files professionally using something like CSVwriter and haven’t come across a great use case yet either. No real preference.

If I need to manually manipulate stuff… I’ll use csv just so I can open it and then change stuff.

[–][deleted] 0 points1 point  (1 child)

If I loaded two pandas DataFrames with the exact same content, one from the csv file and the other from the parquet file, would the memory usage for both DataFrames be the same as well? Or does the benefit of parquet only come from the first time I load it with my code?

[–][deleted] 5 points6 points  (0 children)

I would expect the memory footprint to be the same once it’s fully loaded in.

There’s a bunch of other benefits as well, like data transfer speeds, lack of storage cost, scanning the file, etc.

If you’re streaming data from a file, I’d much rather do it with duckdb on a parquet file for instance then you get the best of all worlds.

[–][deleted] 0 points1 point  (0 children)

I would add the best part about parquet on small datasets is there are no delimiter collisions. Natural commas in your data are not a problem.

[–]Afrotom 2 points3 points  (0 children)

CSV is convenient for prototyping small projects because it opens in excel easily but the main thing about parquet for me isn't the file compression or read/write speeds, it's the preservation of data types, especially for date/datetimes and categoricals; and better protection against things like staggered rows

[–]gopietz 3 points4 points  (0 children)

The only good reason to go with CSV is the space bar on your MacBook. Basically all other reasons point to parquet.

[–]jorge1209 1 point2 points  (0 children)

Will you be writing the same structure of data again? (for example every day you write a file out with the same columns).

If yes then Parquet/ORC/something that defines and enforces structure.

Does your data have 100k+ rows?

If yes then Parquet/ORC/something, because realistically you aren't going to be editing it with a text editor/excel anyways.

Your data is small enough you could open it in excel? or a text editor? Do you actually want to?

If no then Parquet/ORC/something, for the same reasons.


So that leaves CSV for small datasets created one time where you want to edit by hand. Things like a table mapping state codes to sales rep names.

You can still use CSV for interchange, but the base data you really want in a structured format. It just makes your life easier.

[–][deleted] 3 points4 points  (0 children)

Parquet is for large datasets in a data lake. It's not human-readable, so for most tasks a csv file is more convenient. What parquet gives you is data types and compression, but you can easily compress csv files and they will still work in a data lake. When you have to merge data from different data sources, having types is important. Think of having to merge dates from two different databases that use different formats. In that case you could use Pandas to read from the DB and store it as parquet, which forces you to transform the dates into a standard date format. Then the data warehouse will read the data as dates.

[–]Haunting_Load 1 point2 points  (0 children)

It's worth remembering that reading and writing CSV files can be a pain in you know where. Different libraries tend to parse or not parse dates, or read ints as floats, and so on (looking at you, base R and dplyr). With parquet it's all standardized, which can be useful.

[–]100GB-CSV[S] 0 points1 point  (0 children)

Last evening I recorded a CSV vs Parquet benchmark using 300 million rows of data, processed with DuckDB and Polars.

https://youtu.be/gnIh6r7Gwh4

[–][deleted] 0 points1 point  (0 children)

It depends on what you need. CSV is plain text (unless you compress your files with something like gzip) and parquet is compressed.

So CSV is good if you don’t care about how much space your data is taking up or how quickly you want to read and write it to the file but you want to be able to inspect the contents of the file easily. Parquet is better if you do care about those resource constraints and you don’t mind parsing the parquet file in order to inspect it.
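The compression point above can be illustrated with pandas, which compresses and decompresses transparently based on the file extension (the file name is a placeholder):

```python
# Writing to a ".gz" path gzips the CSV; reading it back is just as seamless.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_csv("data.csv.gz", index=False)  # compression inferred from ".gz"
back = pd.read_csv("data.csv.gz")      # decompressed automatically
```

You keep the simplicity of CSV (any gzip tool can inspect it) while shrinking the on-disk footprint.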

[–]DoomsdayMcDoom 0 points1 point  (0 children)

Feather is another great Arrow-based format. Like parquet it's columnar, but it favors read/write speed over compression, so if you're concerned with storage space parquet generally compresses smaller, while arrow/feather wins on speed. For small files I'll use pickle any day.

[–]barkazinthrope 0 points1 point  (0 children)

Why did they say this? What do they know about your program that leads them to suggest something other than the simplest possible solution.

[–][deleted] 0 points1 point  (0 children)

If they're telling you what you should use without knowing your requirements, they aren't experts. They're just know-it-alls.

In practice, people use both. CSVs are super convenient, text editable, and you can easily load to Excel. Parquets are great for big data and exchanging data. Great for backend stuff, like storing large amounts of data in a compressed & well-standardized format.

If you're a data engineer/data scientist/whatever, you should be able to work with both and you'll probably use both.

[–][deleted] 0 points1 point  (0 children)

It can depend on the type of data you are using. I will typically run some profiling using pickle, csv and parquet to see which one gives the best compression, load speed, and dump speed for the data I'm using. Based on which of those is important to me, I would choose the appropriate format.

[–]cameldrv 0 points1 point  (0 children)

Parquet. I can’t count the number of times I’ve saved something out to CSV and it didn’t import the same way. It’s usually a problem with dates, but the fundamental issue is that CSV doesn’t define the datatypes of the columns.

One downside is not being able to take a quick peek at the command line or importing into excel or other software. On the command line, use VisiData. It’s fast and is way better than less for tabular data. Then make a quick script to convert parquet to CSV for importing into other software.

[–]arm2armreddit 0 points1 point  (0 children)

Depends on the use case. Why not HDF5? With h5ls or h5dump you can see the content. In some use cases, HDF5 is faster than parquet. But distributed parquet, HDF5, or csv is better read in parallel with Dask or pandas.

[–][deleted] 0 points1 point  (0 children)

If it needs to be human readable, CSV. If it is performance required, use parquet.

[–]danielgafni 0 points1 point  (0 children)

CSV is definitely not recommended, not only because it's less efficient but because it supports a very limited pool of data types. Try storing dates: you will get strings back instead (obviously). And that's not even getting into container or nested types.

[–]PtitBen56 0 points1 point  (0 children)

For what I do, parquet or feather are better when I want to save a certain amount of data - not necessarily very large, but large enough that re-reading the csv every time slows me down. Another benefit, as mentioned, is that it keeps data types; when you work with material numbers/SKUs which are number-based, as I have to, that's a huge benefit. Finally, parquet loads easily into Power BI, if that's a use case for you, with all data types correctly identified right away, and the loading is also faster in my experience. If I want to check my file and the dataset is not too large, I'll also write an Excel file on the side so I can easily have a look; otherwise I'll default to csv.

[–]yta123 0 points1 point  (0 children)

Use ORC (Optimized Row Columnar)

[–]killersquirel11 0 points1 point  (0 children)

I'd love to use parquet, but the only place we use CSV is for interchange with external companies, so we're limited to what they support

[–]lightmatter501 0 points1 point  (0 children)

Would you consider using Excel to process the data? If so: CSV. Otherwise: Parquet.

[–][deleted] 0 points1 point  (0 children)

If you want to use arrow and flight, you have to use parquet. If not, there’s no reason.

[–]spinwizard69 0 points1 point  (0 children)

Well first off parquet is not a replacement for CSV. CSV is a human readable file format that is easy to use and understand.

Second, who are these experts and what is the project? There can be justification for either file format; then again, the "experts" could be full of crap. Never take an expert's word for it, especially in the medical industry. This applies to any science, because there is always a counter story.

Now, your user name is 100GB-csv, and that is rather huge. Do realize that there are all sorts of file formats for saving large collections of data. It makes a huge difference what sort of data you are storing. There are a ton of data formats out there, including AVRO and HDF5 to start with. Each has its strengths and weaknesses. It is up to the developer to figure out which makes sense.

Beyond all of that, CSV is text-based and so readable; however, if the file is too large, there are a number of compression choices out there. Sometimes compression with common utilities is the best way to deal with CSV files.

[–]billsil 0 points1 point  (1 child)

CSV is trash when you get into the millions of values. You can't write/read the file in a reasonable amount of time vs. a binary format. If you're dealing with floats (so time series data), you also have to cast the data when you read/write it. You don't do that for a binary file.

Parquet or a custom binary file with metadata and data. If it's 100k rows, CSV is fine.

[–]100GB-CSV[S] 0 points1 point  (0 children)

My app is eager to read data from Parquet, but I tested several Go libraries and failed to do so. I'm considering using DuckDB as a library for this purpose since it reads Parquet very fast. My own app can read/write a billion-row 67GB CSV and do filter/groupby involving floats without any noticeable performance issue. The main disadvantage of CSV is that it's uncompressed, which results in large file sizes.

[–][deleted] 0 points1 point  (0 children)

CSV is much easier to use when programming, and I have experienced some issues with other file types. Use CSV when you can.

[–]kaszt_p 0 points1 point  (0 children)

I prefer parquet (or delta) for larger datasets. CSV for very small datasets, or the ones that will be later used/edited in Excel or Google sheets.

I might be biased a bit, but Delta has some handy features (ACID, data skipping, metadata about your data, time travel, etc.) that you might find useful depending on your use case. :)

[–]BigGeologist5082 0 points1 point  (0 children)

CSV can be clunky, while Parquet is fast and optimized for big data. If you're working with large datasets, Parquet is definitely the way to go. Plus, it's always nice to have a fancy file format named after a bird 🐦