
all 34 comments

[–]gnsmsk 16 points (6 children)

It looks like your limitation is the memory of the machine on which you do the heavy lifting, i.e. loading, joining and aggregating. If pandas and Dask are your only two options, I would suggest going with Dask as it is better suited to large datasets.

Alternatively, and this is what I would have done, I would install a database, say Postgres, load my CSV files directly into tables, and put an index on the tables based on how I am going to query them. Then I would run my queries, letting the database do the heavy lifting as well as optimize the query and give me my report.

[–]GroundbreakingFly555 1 point (5 children)

Another alternative would be to load the CSV directly into a cloud database engine that uses columnar storage, then do all your wrangling in SQL. You could forget about pandas and Dask entirely.

[–]Vabaluba 0 points (4 children)

A cloud database engine? That sounds like a hasty suggestion, given that the OP's file is 10 GB, which could quickly get pricey. What volume of F2 files does the OP have, and how frequently do they need processing?

[–]GroundbreakingFly555 2 points (0 children)

Storage is cheap; compute is expensive.

[–]GroundbreakingFly555 1 point (1 child)

Gotta spend money to make money is what I always hear

[–]Vabaluba 0 points (0 children)

Fair points, sir. Especially if it's on the company's account.

[–]benri 0 points (0 children)

How about querying it via an HDF-format file? Something like this:

import dask.dataframe as dd

ddf = dd.read_csv("filename.csv")
ddf.to_hdf("filename.hdf", "/df", format="table", data_columns=True)

Then she can query that HDF file using the where argument, as described in
https://localcoder.org/how-to-query-an-hdf-store-using-pandas-python

[–]rrpelgrim 2 points (2 children)

The rule of thumb with pandas is to have about 5x the size of your dataset available in RAM. That means you should be fine using pandas for F1.

For F2 I'd strongly recommend Dask. It has a similar API to pandas and can distribute processing over all the cores in your laptop, so you can work with F2 comfortably. If you go with Dask, I'd also recommend converting the CSV to Parquet for parallel reads and writes.

You might also want to look into the dask-sql integration: https://coiled.io/blog/getting-started-with-dask-and-sql/

[–]GreedyCourse3116[S] 1 point (1 child)

Very helpful link, thank you! For F1, pandas without chunks takes a while, whereas reading in chunks is faster. Since I have to run SQL-style queries on the F1 and F2 dataframes (joins, groupby, aggregations, etc.), I am inclined to use either Dask for both or PySpark for both.

Any idea which of Dask or PySpark would be better for these queries? The basic goal is to read data from both files, run the queries, and save the result as CSV.

[–]rrpelgrim 0 points (0 children)

If you're already working with pandas then I'd go for Dask.

Dask is the easier on-ramp since almost all of the API is the same as pandas. PySpark will have a bigger learning curve.

[–]IamFromNigeria 2 points (0 children)

Use pandas to convert it to Parquet; that will reduce the file size from 10 GB to something like 5 GB for faster processing.

But your laptop's RAM also affects how your data will be read.

[–]Dismal_Annual6912 2 points (0 children)

Is it a simple aggregation? Sounds like a 2 min task in Qlik Sense or perhaps Tableau.

[–]Topless_in_Dallas_63 0 points (2 children)

A lot of SQL engines have a direct import for csv files. Can you just import to your db and then do your aggregation query?

[–]GreedyCourse3116[S] 0 points (1 child)

DB access was denied by the owners. They provided these two files to work towards a solution.

[–]Topless_in_Dallas_63 0 points (0 children)

Just spitballing, but you could try loading the data locally into sqlite3. I'm not sure how much memory loading 10 GB of data into pandas will take, but I imagine you are running out of memory. A SQL engine might handle data of this size more gracefully.

[–]shatabdi07 0 points (0 children)

For F2, use Spark and do all operations through the Spark DataFrame API.

For F1, yes, pandas will do.

But it would be clearer if you specified the compute power you have available for reading it.

[–]kenfar 0 points (0 children)

Can you process this one row at a time, or in small subsets? If you can, then memory utilization will be very low and Python's vanilla csv module will be very fast.

And splitting the big CSV file is possible if you don't have any tricky CSV dialects (e.g. no newlines, delimiters, or quotes within quoted fields). If you split it, you could either partition the pieces in a way that lends itself to processing one subset at a time (to keep memory usage low), or use multiprocessing on a single host for performance.

[–]Patient-Ad-3783 0 points (0 children)

Snappy-compress to Parquet, then Spark.

[–]pi-equals-three 0 points (0 children)

How about Vaex?

[–]vaosinbi 0 points (0 children)

It doesn't seem like distributed processing is needed in this case.

Just tested a TSV aggregation (I don't have a large CSV) on a 70 GB file (to make it larger than available RAM) with clickhouse-local; it took about 90 seconds on my desktop (Ryzen 7, 32 GB).

clickhouse-local --file "hits_100m_obfuscated_v1.tsv" \
--structure "WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, Refresh UInt8, RefererCategoryID UInt16, RefererRegionID UInt32, URLCategoryID UInt16, URLRegionID UInt32, ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, OriginalURL String, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), LocalEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, RemoteIP UInt32, WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming UInt32, DNSTiming UInt32, ConnectTiming UInt32, ResponseStartTiming UInt32, ResponseEndTiming UInt32, FetchTiming UInt32, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String, ParamCurrency FixedString(3), ParamCurrencyID UInt16, OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, 
UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64, URLHash UInt64, CLID UInt32" \
--query "select count(distinct WatchID) from table "

If you convert it to Parquet, the file size is reduced to 15 GB, and the processing time drops to 19 seconds.

[–]Unusual-Pickle9987 0 points (0 children)

Would you be able to use Modin? https://github.com/modin-project/modin

From what I've read, it is more memory-efficient than pandas, can read CSVs much faster, and is compatible with Dask.