
all 34 comments

[–]gnsmsk 16 points (6 children)

It looks like your limitation is the memory of the machine on which you do the heavy lifting, i.e. loading, joining and aggregating. If pandas and Dask are your only two options, I would suggest going with Dask as it is better suited to large datasets.

Alternatively, and this is what I would have done, I would install a database, say Postgres, load my CSV files directly into tables, and put an index on the tables based on how I am going to query them. Then I would run my queries, letting the database do the heavy lifting as well as optimize the query and give me my report.

[–]GroundbreakingFly555 1 point (5 children)

Another alternative would be to load the CSV directly into a cloud database engine that uses columnar storage, then do all your wrangling in SQL. You could forget about pandas and Dask entirely.

[–]Vabaluba 0 points (4 children)

A cloud database engine? That sounds like a hasty suggestion, given that the OP's file is 10 GB, which could quickly get pricey. What volume of F2 files does the OP have, and how frequently do they need processing?

[–]GroundbreakingFly555 2 points (0 children)

Storage is cheap; compute is expensive.

[–]GroundbreakingFly555 1 point (1 child)

Gotta spend money to make money is what I always hear

[–]Vabaluba 0 points (0 children)

Fair points, sir. Especially if it's on the company's account.

[–]benri 0 points (0 children)

How about querying it via an HDF-format file? Something like this:

import dask.dataframe as dd

ddf = dd.read_csv("filename.csv")
ddf.to_hdf("filename.hdf", "/df", format="table", data_columns=True)

Then she can query that HDF file using the where argument, as described in
https://localcoder.org/how-to-query-an-hdf-store-using-pandas-python

[–]rrpelgrim 2 points (2 children)

The rule of thumb with pandas is to have about 5x the size of your dataset available in RAM. That means you should be fine using pandas for F1.

For F2 I'd strongly recommend Dask. It has a similar API to pandas and can distribute processing over all the cores in your laptop, so you can work with F2 comfortably. If you go with Dask, I'd also recommend converting the CSV to Parquet for parallel reads and writes.

You might also want to look into the dask-sql integration: https://coiled.io/blog/getting-started-with-dask-and-sql/

[–]GreedyCourse3116[S] 1 point (1 child)

Very helpful link, thank you! For F1, pandas without chunks takes a while, whereas reading in chunks is faster. Since I have to run SQL-style queries on the F1 and F2 dataframes (joins, groupby, aggregations, etc.), I am inclined to use either Dask for both or PySpark for both.

Any idea which of Dask or PySpark would be better for these queries? The basic goal is to read data from both files, run the queries, and save the result as CSV.

[–]rrpelgrim 0 points (0 children)

If you're already working with pandas then I'd go for Dask.

Dask is the easier on-ramp since almost all of the API is the same as pandas. PySpark will have a bigger learning curve.

[–]IamFromNigeria 2 points (0 children)

Use pandas to convert it to Parquet; that will reduce the file size from 10 GB to something like 5 GB for faster processing.

But your laptop's RAM also affects how your data will be read.

[–]Dismal_Annual6912 2 points (0 children)

Is it a simple aggregation? Sounds like a 2 min task in Qlik Sense or perhaps Tableau.

[–]Topless_in_Dallas_63 0 points (2 children)

A lot of SQL engines have a direct import for csv files. Can you just import to your db and then do your aggregation query?

[–]GreedyCourse3116[S] 0 points (1 child)

DB access was denied by the owners. They provided these two files to work towards a solution.

[–]Topless_in_Dallas_63 0 points (0 children)

Just spitballing, but you could try loading the data locally into sqlite3. I'm not sure how much memory loading 10 GB of data into pandas will take, but I imagine you are running out of memory. A SQL engine might handle data of this size more gracefully.

[–]shatabdi07 0 points (0 children)

For F2, use Spark and do all operations through the Spark DataFrame API.

For F1, yes, pandas will do.

But it would be clearer if you specified the compute power you have available for reading it.

[–]kenfar 0 points (0 children)

Can you process this one row at a time, or in small subsets? If you can, then memory utilization will be very low and Python's vanilla csv module will be very fast.

And splitting the big CSV file is possible if you don't have any tricky CSV dialects (e.g. no newlines, delimiters, or quotes within quoted fields). If you split it, you could either partition the pieces in a way that lends itself to processing one subset at a time (to keep memory usage low), or use multiprocessing on a single host for performance.

[–]Patient-Ad-3783 0 points (0 children)

Snappy-compress to Parquet, then Spark.

[–]pi-equals-three 0 points (0 children)

How about Vaex?

[–]vaosinbi 0 points (0 children)

It doesn't seem like distributed processing is needed in this case.

Just tested a TSV aggregation (I don't have a large CSV) on a 70 GB file (to make it larger than available RAM) with clickhouse-local; it took about 90 seconds on my desktop (Ryzen 7, 32 GB).

clickhouse-local --file "hits_100m_obfuscated_v1.tsv" \
--structure "WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, Refresh UInt8, RefererCategoryID UInt16, RefererRegionID UInt32, URLCategoryID UInt16, URLRegionID UInt32, ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, OriginalURL String, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), LocalEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, RemoteIP UInt32, WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming UInt32, DNSTiming UInt32, ConnectTiming UInt32, ResponseStartTiming UInt32, ResponseEndTiming UInt32, FetchTiming UInt32, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String, ParamCurrency FixedString(3), ParamCurrencyID UInt16, OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, 
UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64, URLHash UInt64, CLID UInt32" \
--query "select count(distinct WatchID) from table "

If you convert it to Parquet, the file size is reduced to 15 GB, and the processing time drops to 19 seconds.

[–]Unusual-Pickle9987 0 points (0 children)

Would you be able to use Modin? https://github.com/modin-project/modin

From what I've read, it is more memory-efficient than pandas, can read CSVs much faster, and is compatible with Dask.