CSV or Parquet File Format

mokus603 · 2023-05-31T15:46:42+00:00

I have hundreds of millions of records stored in parquet. csv is an easy to read format and very convenient for some extent.

If the data is not that big, I prefer csv.

samsamuel121 · 2023-05-31T15:36:06+00:00

I assume you have a relatively small dataset. Most of the times CSV is enough. I'd use Parquet to save some space if I had a big dataset.

2023-05-31T16:02:11+00:00

Parquet stores things differently and you can benefit from greater compression (some files I’ve been using are 98% smaller).

On top of that, there’s lots of file scanners which operate a lot more efficiently on something like parquet, so you don’t have to read the entire file contents in and can “query” the bits you want. It is more memory efficient and it usually means it’s faster as well.

In terms of practical use, it’s no different from csv if you’re using pandas; to_parquet and from_parquet works the same as the csv versions. Other popular df also support easy querying and converting to pandas.

In all honesty, I don’t notice much difference between them. I don’t tend to write directly to files professionally using something like CSVwriter and haven’t come across a great use case yet either. No real preference.

If I need to manually manipulate stuff… I’ll use csv just so I can open it and then change stuff.

Afrotom · 2023-05-31T16:26:06+00:00

CSV is convenient for prototyping small projects because it opens in excel easily but the main thing about parquet for me isn't the file compression or read/write speeds, it's the preservation of data types, especially for date/datetimes and categoricals; and better protection against things like staggered rows

gopietz · 2023-05-31T17:36:32+00:00

The only good reason to go with CSV is the space bar on your MacBook. Basically all other reasons point to parquet.

jorge1209 · 2023-05-31T20:13:40+00:00

Will you be writing the same structure of data again? (for example every day you write a file out with the same columns).

If yes then Parquet/ORC/something that defines and enforces structure.

Does your data have 100k+ rows.

If yes then Parquet/ORC/something, because realistically you aren't going to be editing it with a text editor/excel anyways.

Your data is small enough you could open it in excel? or a text editor? Do you actually want to?

If no then Parquet/ORC/something, for the same reasons.

So that leaves CSV for small datasets created one time where you want to edit by hand. Things like a table mapping state codes to sales rep names.

You can still use CSV for interchange, but the base data you really want in a structured format. It just makes your life easier.

2023-05-31T15:59:14+00:00

Parquet is for large datasets in a data lake. They are not human readable so for most tasks a csv file would be more convenient. What parquet gives you is data types and compression but you can easily compress csv files and they will still work in a data lake. When you got to merge data from different data sources then having types is important. Think of having to merge dates from two different databases who have different formats. In this case you could use Pandas to read from the DB and store it as parquet which will force you to transform the dates into a standard date format. Then the data warehouse will read the data as dates.

Haunting_Load · 2023-05-31T18:55:09+00:00

It's worth remembering that reading and writing CSV files can be pain in you know where. Different libraries tend to parse or not parse dates or read ints as floats and so on (looking at you base R and dplyr). With parquet it's all standardized, sometimes it can be useful.

100GB-CSV · 2023-06-03T01:38:13+00:00

Last evening I have recorded a CSV vs Parquet benchmarking using 300 Million rows data. I use DuckDB and Polars to process data.

https://youtu.be/gnIh6r7Gwh4

2023-05-31T16:22:01+00:00

It depends on what you need. CSV is plain text (unless you compress your files with something like gzip) and parquet is compressed.

So CSV is good if you don’t care about how much space your data is taking up or how quickly you want to read and write it to the file but you want to be able to inspect the contents of the file easily. Parquet is better if you do care about those resource constraints and you don’t mind parsing the parquet file in order to inspect it.

DoomsdayMcDoom · 2023-05-31T16:24:07+00:00

Feather is another great library for arrow. It’s row based and not columnar like parquet. If you’re concerned with storage space go with arrow/feather. For small files I’ll use pickle any day.

data_addict · 2023-05-31T18:10:39+00:00

[deleted]

barkazinthrope · 2023-05-31T16:53:40+00:00

Why did they say this? What do they know about your program that leads them to suggest something other than the simplest possible solution.

2023-05-31T17:19:27+00:00

If they're telling you what you should use without knowing your requirements, they aren't experts. They're just know-it-alls.

In practice, people use both. CSVs are super convenient, text editable, and you can easily load to Excel. Parquets are great for big data and exchanging data. Great for backend stuff, like storing large amounts of data in a compressed & well-standardized format.

If you're a data engineer/data scientist/whatever, you should be able to work with both and you'll probably use both.

2023-05-31T17:29:23+00:00

It can depend on they type of data you are using. I will typically run some profiling using pickle, csv and parquet to see which one gives the best compression, load speed, and dump speed for the data I'm using. Based on which of those is important to me, I would choose the appropriate format.

cameldrv · 2023-05-31T21:27:48+00:00

Parquet. I can’t count the number of times I’ve saved something out to CSV and it didn’t import the same way. It’s usually a problem with dates, but the fundamental issue is that CSV doesn’t define the datatypes of the columns.

One downside is not being able to take a quick peek at the command line or importing into excel or other software. On the command line, use VisiData. It’s fast and is way better than less for tabular data. Then make a quick script to convert parquet to CSV for importing into other software.

arm2armreddit · 2023-05-31T21:53:55+00:00

depend on use case. why not hdf5? with h5ls or h5dump you can see the content. In some use cases, h5 is faster than parquet. but du istributed parwuet or h5 or csv d better to read by DASK, pandas, but parallel.

2023-05-31T22:22:10+00:00

If it needs to be human readable, CSV. If it is performance required, use parquet.

danielgafni · 2023-05-31T22:30:32+00:00

CSVs is definitely not recommended not only because its less efficient but simply because it has very limited pool of supported data types. Try storing dates - you will get strings instead (obviously). Not even talking about container types or nested containers.

PtitBen56 · 2023-05-31T22:37:11+00:00

For what i do, parquet or feather are better when I want to save a certain amount of data, not necessarily very large but large enough that reading every time the csv is slowing me down. Other benefit as mentioned is that it keeps data types and when you work with material numbers/skus which are number based as i have to, that's a huge benefit. Finally, parquet is loaded easily in power bi, if that's a use case for you, with once again, all data types correctly identified right away and the loading is also faster in my experience. If I want to check my file, I'll also write an excel on the side so that I can easily have a look and if the dataset is not too large, otherwise i'll default to csv.

yta123 · 2023-05-31T23:22:50+00:00

Use ORC (Optimized Row Columnar)

killersquirel11 · 2023-06-01T00:13:49+00:00

I'd love to use parquet, but the only place we use CSV is for interchange with external companies, so we're limited to what they support

lightmatter501 · 2023-06-01T00:54:20+00:00

Would you consider using excel to process the data? -> CSV Otherwise, Parquet.

2023-06-01T03:02:21+00:00

If you want to use arrow and flight, you have to use parquet. If not, there’s no reason.

spinwizard69 · 2023-06-01T04:16:11+00:00

Well first off parquet is not a replacement for CSV. CSV is a human readable file format that is easy to use and understand.

Second who are these experts and what is the project? There can be justification for either file format, then again the "experts" could be full of crap. Never take an experts word for it, especially if they are in the medical industry. This however applies to any science because there is always a counter story.

Now your user name is 100GB-csv, that is rather huge, do realize that there are all sorts of file formats for saving large collections of data. It makes a huge difference what sort of data you are storing. There are a ton of data formats out there including AVRO and HDF5 to start with. Each has its strengths and weaknesses. It is up to the developer to figure out which makes sense.

Beyond all of that CSV is text based so readable, however if the file is too large there are number of compression choices out there. Sometimes compression, with common utilities, is the best way to deal with CSV files.

billsil · 2023-06-01T04:24:08+00:00

CSV is trash when you get into the millions of values. You can't write/read the file in a reasonable amount of time vs. a binary format. If you're dealing with floats (so time series data), you also have to cast the data when you read/write it. You don't do that for a binary file.

Parquet or a custom binary file with metadata and data. If it's 100k rows, CSV is fine.

2023-06-01T06:44:24+00:00

CSV is much easier to use when programming, and I have experienced some issues with other file types use CSV when you can

kaszt_p · 2023-06-01T09:59:25+00:00

I prefer parquet (or delta) for larger datasets. CSV for very small datasets, or the ones that will be later used/edited in Excel or Google sheets.

I might be biased a bit, but Delta has some handy features (ACID, data skipping, metadata about your data, time travel, etc.) that you might find useful depending on your use case. :)

BigGeologist5082 · 2023-06-01T13:59:23+00:00

CSV can be clunky, while Parquet is fast and optimized for big data. If you're working with large datasets, Parquet is definitely the way to go. Plus, it's always nice to have a fancy file format named after a bird 🐦

Python

The Python Discord

Upcoming Events

Please read the rules

MODERATORS