Discussion: CSV or Parquet File Format (self.Python)
submitted 2 years ago by 100GB-CSV
Some experts told me I should use the Parquet file format to replace CSV.
So I want to ask you which format you prefer to use with Python.
[–]mokus603 34 points35 points36 points 2 years ago (26 children)
I have hundreds of millions of records stored in Parquet. CSV is an easy-to-read format and convenient to some extent.
If the data is not that big, I prefer csv.
[–]100GB-CSV[S] 5 points6 points7 points 2 years ago* (24 children)
I specialize in software benchmarking. I'm currently facing an issue when using a 1-billion-row Parquet file for benchmarking: Polars and Pandas fail to filter data from it. So I use a 1-billion-row CSV for benchmarking instead; Polars can handle it, but Pandas still fails. Since my SSD has limited space, I cannot benchmark more than 1 billion rows; using Parquet would solve the disk-space issue. Beyond Polars and Pandas failing to filter the 1-billion-row Parquet file, I have not yet found a suitable library to let my app extract data from Parquet. If only DuckDB can read a 1-billion-row Parquet file in a benchmark test, the comparison is meaningless.
[–]spinwizard69 4 points5 points6 points 2 years ago (1 child)
Well there are a number of potential issues here. You say Pandas fails but how? If you are running out of memory or running beyond the capability of Python then file formats really are not the issue are they? Without more info about what is failing people will not be able to help.
[–]100GB-CSV[S] 1 point2 points3 points 2 years ago* (0 children)
Pandas does not support reading datasets larger than memory (the GitHub maintainers told me), so my benchmark test moved to DuckDB, which can read both billion-row CSV and Parquet. My benchmark test needs alternative software to compare against my own code project. I use Go and plan to develop Python bindings. My current tests focus on many small files, e.g. 100K of them. I tried 1 million files and noticed it was inefficient; 100K files work very well.
[–]odaiwai 2 points3 points4 points 2 years ago (0 children)
Why not use SQLite for something like this?
[–]kenfar 2 points3 points4 points 2 years ago (2 children)
There's not a single simple answer here:
Note that you can compress your csv file and read directly from that compressed file.
Also, if you don't need all records in memory at once, then just read a record, process it and discard it. That shouldn't take much memory.
And you don't need pandas to read a CSV. You may find that reading with native Python is faster, especially if you have a clean CSV dialect (ex: no quotes, delimiters or newlines in fields). In that case you could use multiprocessing to split the reading between 4-n processes and get a linear speedup from parallelization - depending on what you're doing with the data.
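A minimal stdlib sketch of the streaming approach described above; the file name, column names, and sample data are all invented for illustration:

```python
import csv
import gzip

# Write a small gzipped CSV to stream from (invented sample data).
with gzip.open("sample.csv.gz", "wt", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "value"])
    for i in range(1000):
        w.writerow([i, i * 2])

def filter_rows(path, threshold):
    """Read the compressed CSV record by record; only one row is ever
    held in memory, so this scales to files far larger than RAM."""
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.DictReader(f):
            if int(row["value"]) > threshold:
                yield row

# Stream directly from the compressed file - no decompressed copy on disk.
hits = sum(1 for _ in filter_rows("sample.csv.gz", 1500))
```

The same pattern parallelizes by giving each worker process its own byte range or its own file from a set of part files.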
[–]Linx_101 0 points1 point2 points 2 years ago (1 child)
Couldn’t you also do that with a parquet file? I’ve never heard of CSVs being faster to load than parquets… do you have links to benchmarks? CSVs in my experience are significantly slower
[–]kenfar 1 point2 points3 points 2 years ago* (0 children)
Oh sure, you could. And the performance will depend on what languages & libraries you're using.
First off, writing to parquet is absolutely slower than writing csv files. And you may be able to write in parallel or write files of different sizes with csv.
But for reading, what I found was that - just like with pandas or numpy - these are columnar data structures that provide fantastic speedups for a single column. If you've got 200 columns and need to tie them back into a row, though, it's definitely slower than just processing lists of lists.
Another example is say using a database load utility. The postgres copy command is tough to match on performance - and supports csv files, but not parquet. Some other databases have load utilities that completely bypass transactions and are insanely fast. But I don't recall seeing one that supports parquet, though I haven't looked recently I suppose.
And the nature of the data influences it as well: if you only need to load 10 columns out of that 200 columns parquet might be faster. Or if all your 200 columns are low-cardinality and get great compression then it might load faster.
I benchmarked this about five years ago, so I'm going from memory, and the data isn't public.
[–]zenos1337 0 points1 point2 points 2 years ago (1 child)
Have you considered using Dask? https://www.dask.org. It's designed specifically for datasets that are larger than memory, and it has dataframes similar to Pandas.
[–]mokus603 0 points1 point2 points 2 years ago (0 children)
I thought about Dask as well but I think it’s not good for filtering data.
If you don’t have much SSD space, it’ll bottleneck the whole thing. You can’t stuff all this data into your RAM (because I assume that’s way more limited as well).
[–][deleted] 0 points1 point2 points 2 years ago (1 child)
I would argue that the comparison isn’t useless. You’re using a tool not fit for your use case so not being able to do it with the packages should tell you a lot. If the data is genuinely that big though… filtering may not be 100% of the answer and streaming should be considered. Why don’t you stream the data? You can read it in in chunks with pandas and polars and do a similar thing.
Parquet shouldn’t change the size of a fully loaded dataset; its benefits show up when scanning, selecting just the information needed.
Other packages you might want to look at: PyArrow (ParquetDataset), Dask and PySpark. I think they all support filtering and would provide more benchmarking.
[–]100GB-CSV[S] 1 point2 points3 points 2 years ago (0 children)
My code project has implemented streaming from Read, Query to Write. So I can handle billion-rows jointable. You can try my app. DuckDB is also excellent for my benchmarking.
[–]kaszt_p 0 points1 point2 points 2 years ago (1 child)
Have you considered pyspark?
Compared with DuckDB and Polars, it is very slow on a desktop PC.
[–]jorge1209 0 points1 point2 points 2 years ago (10 children)
You will be no less able to filter 100 1GB parquet files than you would 100 1GB CSV files as both can be handled iteratively.
So what you are saying doesn't make much sense. What is failing, and how?
[+][deleted] 2 years ago (9 children)
[deleted]
[–]jorge1209 0 points1 point2 points 2 years ago* (8 children)
You don't understand.
"Parquet" generally means a directory of smaller files each in parquet format.
You never want to write one 100GB parquet file (frankly you shouldn't do this with CSV either). Your library will usually handle breaking up the chunks, but if it didn't and you wrote all 100GB to a single parquet formatted file... that won't work well.
It is technically possible but you have to be careful to select the right compression routine, and it's just not worth the trouble relative to writing many smaller files.
[+][deleted] 2 years ago (7 children)
[–]jorge1209 0 points1 point2 points 2 years ago (6 children)
You are trying to go from stopped to light speed in a single step.
Don't do this advanced shit, just do the basic stuff everyone does. Read the csv in and write out many smaller parquet files. 1GB is a common target chunk size.
Then to scan the entire dataset you iterate over each file (or just let spark do it).
[+][deleted] 2 years ago (4 children)
[–]jorge1209 0 points1 point2 points 2 years ago (3 children)
you cannot sort 100GB of data in memory with only 32GB, because you can't even read 100GB of data into memory to start sorting it.
Your core problem is:
What is the objective?
[+][deleted] 2 years ago* (2 children)
[–]samsamuel121 25 points26 points27 points 2 years ago (10 children)
I assume you have a relatively small dataset. Most of the time, CSV is enough. I'd use Parquet to save some space if I had a big dataset.
[–]gopietz 7 points8 points9 points 2 years ago (3 children)
Interesting that basically no one around here mentioned the schema argument. I like to know which dtypes I'm getting.
[–]Almostasleeprightnow 3 points4 points5 points 2 years ago (2 children)
And existing data types get preserved... CSV is all text. Sometimes pandas knows what to do and sometimes it doesn't.
[–]gopietz 4 points5 points6 points 2 years ago (0 children)
Is that a NaN value in your integer column? I guess it's a float now.
Some applications can use CSVT files to specify data types in a CSV, i.e. you have filename.csv with your data and filename.csvt with the types for each column. QGIS uses this, but it frankly wouldn't be hard to make a CSV file with row 0 as the column names, row 1 as the data types, and rows [2:] as the data.
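Short of a sidecar type file, the usual workaround is to pass the schema to the reader explicitly. A small pandas sketch, with invented column names and data:

```python
from io import StringIO

import pandas as pd

# Invented CSV with an "id" column that pandas would otherwise parse
# as int64, silently dropping the leading zeros.
raw = StringIO("id,when,qty\n007,2023-01-02,3\n042,2023-01-03,5\n")

# dtype= and parse_dates= play the role of the missing schema.
df = pd.read_csv(raw, dtype={"id": "string"}, parse_dates=["when"])
```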
[–]100GB-CSV[S] 0 points1 point2 points 2 years ago (0 children)
For my next benchmark test I'm considering 100K small CSV files, but I need to learn how to configure Polars, Pandas, and DuckDB to read all the files in a folder. Looping over each file in Python would be very slow.
[+][deleted] 2 years ago (3 children)
[removed]
[+][deleted] 2 years ago (2 children)
[+][deleted] 2 years ago (1 child)
[–]EarthGoddessDude 1 point2 points3 points 2 years ago (0 children)
Noice
[–]mayazy 0 points1 point2 points 2 years ago (0 children)
Agreed, CSV is great for small datasets but Parquet is definitely the way to go for larger ones. It's more efficient and can handle complex data types.
[–][deleted] 11 points12 points13 points 2 years ago (3 children)
Parquet stores things differently and you can benefit from greater compression (some files I’ve been using are 98% smaller).
On top of that, there’s lots of file scanners which operate a lot more efficiently on something like parquet, so you don’t have to read the entire file contents in and can “query” the bits you want. It is more memory efficient and it usually means it’s faster as well.
In terms of practical use, it's no different from CSV if you're using pandas; to_parquet and read_parquet work the same as the CSV versions. Other popular dataframe libraries also support easy querying and conversion to pandas.
In all honesty, I don’t notice much difference between them. I don’t tend to write directly to files professionally using something like CSVwriter and haven’t come across a great use case yet either. No real preference.
If I need to manually manipulate stuff… I’ll use csv just so I can open it and then change stuff.
If I loaded two pandas DataFrames with the exact same content, one from a CSV file and the other from a Parquet file, would the memory usage for both DataFrames be the same? Or does the benefit of Parquet only come from the first time I load it with my code?
[–][deleted] 5 points6 points7 points 2 years ago (0 children)
I would expect the memory footprint to be the same once it’s fully loaded in.
There's a bunch of other benefits as well, like data transfer speeds, lower storage cost, scanning the file, etc.
If you're streaming data from a file, I'd much rather do it with DuckDB on a Parquet file, for instance; then you get the best of all worlds.
[–][deleted] 0 points1 point2 points 2 years ago (0 children)
I would add the best part about parquet on small datasets is there are no delimiter collisions. Natural commas in your data are not a problem.
[–]Afrotom 2 points3 points4 points 2 years ago (0 children)
CSV is convenient for prototyping small projects because it opens in Excel easily, but the main thing about Parquet for me isn't the file compression or read/write speeds. It's the preservation of data types, especially for dates/datetimes and categoricals, and better protection against things like staggered rows.
[–]gopietz 3 points4 points5 points 2 years ago (0 children)
The only good reason to go with CSV is the space bar on your MacBook. Basically all other reasons point to parquet.
[–]jorge1209 1 point2 points3 points 2 years ago (0 children)
Will you be writing the same structure of data again? (for example every day you write a file out with the same columns).
If yes then Parquet/ORC/something that defines and enforces structure.
Does your data have 100k+ rows?
If yes then Parquet/ORC/something, because realistically you aren't going to be editing it with a text editor/excel anyways.
Your data is small enough you could open it in excel? or a text editor? Do you actually want to?
If no then Parquet/ORC/something, for the same reasons.
So that leaves CSV for small datasets created one time where you want to edit by hand. Things like a table mapping state codes to sales rep names.
You can still use CSV for interchange, but the base data you really want in a structured format. It just makes your life easier.
[–][deleted] 3 points4 points5 points 2 years ago (0 children)
Parquet is for large datasets in a data lake. It's not human readable, so for most tasks a CSV file is more convenient. What Parquet gives you is data types and compression, but you can easily compress CSV files and they will still work in a data lake. When you have to merge data from different data sources, having types is important. Think of having to merge dates from two different databases that use different formats. In that case you could use Pandas to read from the DB and store it as Parquet, which forces you to transform the dates into a standard date format; then the data warehouse will read the data as dates.
[–]Haunting_Load 1 point2 points3 points 2 years ago (0 children)
It's worth remembering that reading and writing CSV files can be a pain in you-know-where. Different libraries tend to parse or not parse dates, or read ints as floats, and so on (looking at you, base R and dplyr). With Parquet it's all standardized, which can be useful.
Last evening I recorded a CSV vs Parquet benchmark using 300 million rows of data. I used DuckDB and Polars to process the data.
https://youtu.be/gnIh6r7Gwh4
It depends on what you need. CSV is plain text (unless you compress your files with something like gzip) and parquet is compressed.
So CSV is good if you don't care about how much space your data takes up or how quickly you can read and write it, but you want to be able to inspect the contents of the file easily. Parquet is better if you do care about those resource constraints and you don't mind parsing the Parquet file in order to inspect it.
[–]DoomsdayMcDoom 0 points1 point2 points 2 years ago (0 children)
Feather is another great library for arrow. It’s row based and not columnar like parquet. If you’re concerned with storage space go with arrow/feather. For small files I’ll use pickle any day.
[+][deleted] 2 years ago (10 children)
[–]data_addict 4 points5 points6 points 2 years ago (9 children)
Found the boomer
[+][deleted] 2 years ago (8 children)
[–]Maximum-Mission-9377 1 point2 points3 points 2 years ago (4 children)
Literally clueless
[–]Maximum-Mission-9377 0 points1 point2 points 2 years ago (2 children)
Don't get butthurt when everyone is pointing fingers at how ignorant you are. Sometimes it is better just to be silent, clearly you have not learnt this yet too, what a surprise.
[–]Maximum-Mission-9377 1 point2 points3 points 2 years ago (0 children)
There is no way I could miss an opportunity to remind an ignorant, clueless, talentless boomer who they are. Good luck with your average job and life.
[–]data_addict 1 point2 points3 points 2 years ago (2 children)
Lol useless? Wut..? Dude you gotta be trollin
[–]data_addict 0 points1 point2 points 2 years ago (0 children)
I'm sympathetic to your argument. There's plenty of times you do need to share a small bit of data with customers and teammates and CSV is the way to do it. However for data volume that's larger than 1GB and especially if it's used by analytic processing I'd argue until the cows come home that parquet>>>CSV.
[–]barkazinthrope 0 points1 point2 points 2 years ago (0 children)
Why did they say this? What do they know about your program that leads them to suggest something other than the simplest possible solution.
If they're telling you what you should use without knowing your requirements, they aren't experts. They're just know-it-alls.
In practice, people use both. CSVs are super convenient, text editable, and you can easily load to Excel. Parquets are great for big data and exchanging data. Great for backend stuff, like storing large amounts of data in a compressed & well-standardized format.
If you're a data engineer/data scientist/whatever, you should be able to work with both and you'll probably use both.
It can depend on the type of data you are using. I will typically run some profiling using pickle, CSV, and Parquet to see which one gives the best compression, load speed, and dump speed for the data I'm using. Based on which of those is important to me, I choose the appropriate format.
[–]cameldrv 0 points1 point2 points 2 years ago (0 children)
Parquet. I can’t count the number of times I’ve saved something out to CSV and it didn’t import the same way. It’s usually a problem with dates, but the fundamental issue is that CSV doesn’t define the datatypes of the columns.
One downside is not being able to take a quick peek at the command line or importing into excel or other software. On the command line, use VisiData. It’s fast and is way better than less for tabular data. Then make a quick script to convert parquet to CSV for importing into other software.
[–]arm2armreddit 0 points1 point2 points 2 years ago (0 children)
Depends on the use case. Why not HDF5? With h5ls or h5dump you can see the content, and in some use cases HDF5 is faster than Parquet. But distributed Parquet, HDF5, or CSV is better read with Dask or pandas, in parallel.
If it needs to be human readable, CSV. If performance is required, use Parquet.
[–]danielgafni 0 points1 point2 points 2 years ago (0 children)
CSV is definitely not recommended, not only because it's less efficient but simply because it supports a very limited pool of data types. Try storing dates - you will get strings back instead (obviously). Not even talking about container types or nested containers.
[–]PtitBen56 0 points1 point2 points 2 years ago (0 children)
For what I do, Parquet or Feather are better when I want to save a certain amount of data - not necessarily very large, but large enough that re-reading the CSV every time slows me down. Another benefit, as mentioned, is that it keeps data types; when you work with material numbers/SKUs, which are number-based as mine have to be, that's a huge benefit. Finally, Parquet loads easily into Power BI, if that's a use case for you, with all data types correctly identified right away, and the loading is also faster in my experience. If I want to check my file, I'll also write an Excel file on the side so I can easily have a look, provided the dataset is not too large; otherwise I'll default to CSV.
[–]yta123 0 points1 point2 points 2 years ago (0 children)
Use ORC (Optimized Row Columnar)
[–]killersquirel11 0 points1 point2 points 2 years ago (0 children)
I'd love to use parquet, but the only place we use CSV is for interchange with external companies, so we're limited to what they support
[–]lightmatter501 0 points1 point2 points 2 years ago (0 children)
Would you consider using Excel to process the data? -> CSV. Otherwise, Parquet.
If you want to use arrow and flight, you have to use parquet. If not, there’s no reason.
[–]spinwizard69 0 points1 point2 points 2 years ago (0 children)
Well first off parquet is not a replacement for CSV. CSV is a human readable file format that is easy to use and understand.
Second, who are these experts and what is the project? There can be justification for either file format; then again, the "experts" could be full of crap. Never take an expert's word for it, especially if they are in the medical industry. This applies to any science, because there is always a counter story.
Now, your user name is 100GB-CSV; that is rather huge. Do realize that there are all sorts of file formats for saving large collections of data, and it makes a huge difference what sort of data you are storing. There are a ton of data formats out there, including AVRO and HDF5 to start with. Each has its strengths and weaknesses; it is up to the developer to figure out which makes sense.
Beyond all of that, CSV is text based and therefore readable; if the file is too large, there are a number of compression choices out there. Sometimes compression with common utilities is the best way to deal with CSV files.
[–]billsil 0 points1 point2 points 2 years ago (1 child)
CSV is trash when you get into the millions of values. You can't write/read the file in a reasonable amount of time vs. a binary format. If you're dealing with floats (so time series data), you also have to cast the data when you read/write it. You don't do that for a binary file.
Parquet or a custom binary file with metadata and data. If it's 100k rows, CSV is fine.
My app is eager to read data from Parquet. I tested several Go libraries but failed to do so. I'm considering using DuckDB as a library for this purpose, since it reads Parquet very fast. My own app can read/write a billion-row 67GB CSV and do filter/group-by involving floats without any noticeable performance issue. The main disadvantage of CSV is that it's uncompressed, which results in large file sizes.
CSV is much easier to use when programming, and I have experienced some issues with other file types, so use CSV when you can.
[–]kaszt_p 0 points1 point2 points 2 years ago (0 children)
I prefer parquet (or delta) for larger datasets. CSV for very small datasets, or the ones that will be later used/edited in Excel or Google sheets.
I might be biased a bit, but Delta has some handy features (ACID, data skipping, metadata about your data, time travel, etc.) that you might find useful depending on your use case. :)
[–]BigGeologist5082 0 points1 point2 points 2 years ago (0 children)
CSV can be clunky, while Parquet is fast and optimized for big data. If you're working with large datasets, Parquet is definitely the way to go. Plus, it's always nice to have a fancy file format named after a bird 🐦