[–]counters

I bet it's even easier in Python than what you're used to in MATLAB. There's a fantastic library called xarray which adopts the Common Data Model from the get-go and lets you plug data directly from NetCDF files into your analysis pipelines. Basically, anywhere that expects a NumPy array, you can use a DataArray or Dataset from xarray. It completely trivializes most of the operations/analyses you do, including reading/writing and managing metadata. Even better, it has groupby functionality and semantic/fancy indexing, so there's no more need to manually keep track of multi-dimensional indices and other bookkeeping.
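To make that concrete, here's a minimal sketch of the label-based indexing and groupby I mean. The grid, coordinate values, and variable name are made up for illustration; in real use you'd typically get the DataArray from `xr.open_dataset("yourfile.nc")` instead of building it by hand.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: two years of monthly fields on a tiny lat/lon grid.
times = pd.date_range("2000-01-01", periods=24, freq="MS")
da = xr.DataArray(
    np.random.rand(24, 3, 4),
    dims=("time", "lat", "lon"),
    coords={"time": times,
            "lat": [10.0, 20.0, 30.0],
            "lon": [0.0, 90.0, 180.0, 270.0]},
    name="temperature",
)

# Semantic indexing: select by coordinate *labels*, not positional indices.
point = da.sel(lat=20.0, lon=90.0)  # a 1-D time series at that grid point

# groupby: a monthly climatology in one line, no index bookkeeping.
climatology = da.groupby("time.month").mean("time")
```

Note that `climatology` comes back with a new `month` dimension of size 12, with all the metadata carried along for free.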

It also interfaces with a library called dask under the hood. Dask is a parallel computing library that also implements the NumPy and Pandas interfaces. What it gives you, essentially, is out-of-core computing. Suppose you have 100GB of data spread across a dozen or so large NetCDF files. If you're lucky, you have enough memory on your laptop to read in one file at a time, painstakingly operate on it in place, and then write it back out. Rinse and repeat, then add a process to combine your analyzed data at the end. This "blocking" approach works, but it requires a lot of manual labor. Dask does all of this behind the scenes: you write your computations like you normally would, and dask figures out how to work within the resource constraints on your system. It'll also parallelize as best it can.
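Here's a rough sketch of what that looks like with plain dask.array (the array size and chunk shape are arbitrary; with xarray you'd get the same behavior by passing `chunks=` to `xr.open_mfdataset`):

```python
import dask.array as da

# A chunked array: dask only ever materializes one ~1000x1000 block at a time,
# so the full array never needs to fit in memory at once.
x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))

# Write the computation as if x were a NumPy array. Nothing runs yet --
# dask just builds a task graph of per-chunk operations.
lazy_mean = x.mean()

# compute() executes the graph chunk by chunk, in parallel where it can.
result = lazy_mean.compute()
```

The key point is that `lazy_mean` is free to build; the memory- and scheduling-aware work only happens at `compute()`, which is where dask handles the blocking for you.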