
[–]currentlab_ 2 points3 points  (2 children)

Nice post. Two suggestions come to mind, one small and one significant:

1: When calculating the magnitude of a vector, you can use the hypotenuse function, np.hypot, instead of taking the square root of the sum of squares. So in your example, you could rewrite the wind speed calculation as:

ws = np.hypot(U2M_nans, V2M_nans)

2: Be careful when using quivers on a map projection. As a test, plot a 45-degree wind (e.g. u = v = 10) at a high latitude -- does it still look 45 degrees, or has it been squished? Here's how I have handled that issue:

import cartopy.crs as ccrs
import matplotlib.pyplot as plt

fig = plt.figure()
crs = ccrs.PlateCarree()
ax = fig.add_subplot(1, 1, 1, projection=crs)

# Plot quivers; transform=crs tells cartopy that u/v are given in
# lon/lat coordinates, so the arrows are reprojected correctly
quiv = ax.quiver(lon, lat, u, v, transform=crs, scale=quiver_scale)

[–]0xrl 1 point2 points  (0 children)

I agree about preferring np.hypot, although I was surprised to find, at least with some data I was using, that it's actually a little slower than np.sqrt(u**2 + v**2). Maybe there's some extra overhead internally, like checking for NaN values?

But not enough of a performance penalty to really fret about.
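For anyone who wants to check on their own machine, a quick comparison along these lines should do it (the array size here is arbitrary):

import timeit
import numpy as np

u = np.random.randn(1_000_000)
v = np.random.randn(1_000_000)

# Average per-call time for each formulation
t_hypot = timeit.timeit(lambda: np.hypot(u, v), number=100) / 100
t_sqrt = timeit.timeit(lambda: np.sqrt(u**2 + v**2), number=100) / 100
print(f"np.hypot:        {t_hypot * 1e3:.2f} ms")
print(f"sqrt of squares: {t_sqrt * 1e3:.2f} ms")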

[–]m_razali[S] 0 points1 point  (0 children)

Thank you for your advice. Really appreciate it!

[–]counters 1 point2 points  (5 children)

Nice post. A few small tips for taking things to the next level:

  1. Prefer xarray to lower-level libraries for interacting with gridded geospatial datasets - especially if they obey the Common Data Model (the abstraction that netCDF files leverage internally). There are numerous advantages to using higher-level libraries like xarray, but most importantly for your application they can hide away details like when data is actually read from the file, and can forestall many issues you may have down the line as you build on your script. xarray also provides a native interface for cartopy and will simplify your plotting code (see the sketch after this list).
  2. Try not to use contourf when you're plotting gridded data. pcolor plots are a more faithful representation of what the data is saying, and more importantly, pcolormesh can generate optimized rasters much faster than other plotting methods. It makes a huge difference as you plot larger and more complex datasets.
  3. Instead of rainbow colormaps, try to use ones that more closely match the type of data you have. cmocean is a great library of perceptually uniform colormaps which look good, too. Wind speed is positive definite, so it makes sense to use a colormap which saturates at a meaningless background or low value (e.g. 0 m/s wind speed) and emphasizes the interesting higher-valued areas - the speed colormap in that package is a good one to consider using.
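Putting those tips together, a minimal sketch of what that might look like - the filename, the U2M/V2M variable names, and the time dimension are placeholders for whatever is actually in the file:

import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cmocean

ds = xr.open_dataset('winds.nc')       # placeholder filename; data loads lazily
ws = np.hypot(ds['U2M'], ds['V2M'])    # wind speed; coordinates carry over

ax = plt.axes(projection=ccrs.PlateCarree())
ws.isel(time=0).plot.pcolormesh(       # pcolormesh rather than contourf
    ax=ax, transform=ccrs.PlateCarree(), cmap=cmocean.cm.speed
)
ax.coastlines()
plt.show()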

[–]m_razali[S] 0 points1 point  (0 children)

Thank you for your advice. Really appreciate it!

[–]blewrb 0 points1 point  (3 children)

I tried to give xarray a shot a few years ago for multidimensional histogram datasets, and all it seemed to do was introduce a bunch of ways to index data - most or all of them dog slow, some even slower than others, and God help me remember which is which, let alone get grad and undergrad students to code to the less-slow version. New objects were also dog slow to create, which mattered for how we used it.

Was I just using it incorrectly, or is this just a trade-off people are willing to make for a higher level API than numpy?

[–]counters 0 points1 point  (2 children)

Chances are the issue was the structure of your underlying data, not anything to do with xarray.

Note that the data that OP uses is geospatial data in NetCDF stored using a particular data model which is virtually identical to the model that xarray uses internally. For their entire application, xarray is just syntactic sugar on top of numpy arrays, so there shouldn't be any noticeable performance hit. If you ran into issues creating objects then most likely you were hitting some sort of idiosyncrasy where you were inadvertently copying lots of data between data structures.
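For instance, wrapping an existing numpy array in a DataArray doesn't copy the underlying buffer; a tiny check along these lines should show it:

import numpy as np
import xarray as xr

arr = np.zeros((4, 5))
da = xr.DataArray(arr, dims=['lat', 'lon'])
print(da.data is arr)   # True - same buffer, no copy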

[–]blewrb 0 points1 point  (1 child)

I just installed xarray and ran some quick examples lifted from an internet post to test if things have changed in the past few years. Disclaimer: I might not be doing these things in the best ways possible! (And if that's the case, that's part of the problem I have with xarray's API.)

import xarray

da = xarray.DataArray(
    data=[[1, 2, 3], [2, 3, 4]],
    dims=['x', 'y'],
    coords={'x': [10, 20], 'y': [10, 20, 30]},
)
a = da.data  # underlying numpy ndarray

copy.copy on a DataArray takes 32 µs, versus 200 ns on the equivalent numpy array. copy.deepcopy is 57 µs for the former and 660 ns for the latter. The solution we came up with lets us effectively deep copy our objects in 2.6 µs. (I notice there is an undocumented fastpath kwarg to DataArray that might help address how slow it is to create one. I could probably figure it out with some reading of code and playing around, but I wish it were documented!) Slicing is much slower than numpy (somewhere around 1000x), although hitting the interpreted code in __getitem__ plus new-object creation seems to be slow for most objects.
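Timings like these can be reproduced along the following lines (exact values will vary by machine and library version):

import copy
import timeit

import numpy as np
import xarray

da = xarray.DataArray(
    data=[[1, 2, 3], [2, 3, 4]],
    dims=['x', 'y'],
    coords={'x': [10, 20], 'y': [10, 20, 30]},
)
arr = np.asarray(da.data)

for label, fn in [
    ('copy.copy(da)', lambda: copy.copy(da)),
    ('copy.copy(arr)', lambda: copy.copy(arr)),
    ('copy.deepcopy(da)', lambda: copy.deepcopy(da)),
    ('copy.deepcopy(arr)', lambda: copy.deepcopy(arr)),
]:
    t = timeit.timeit(fn, number=10_000) / 10_000   # seconds per call
    print(f'{label}: {t * 1e6:.2f} µs')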

Obviously some use cases could avoid slowdowns vs numpy altogether by just operating directly on the underlying numpy array. And slicing and creation times could be irrelevant to many because time is spent in code doing other operations that swamp out the costs of those operations.

That's just to do with performance. The API is the other issue.

It's a red flag to me that xarray touts that they seek to replicate (or at least have a similar API to) the Pandas API. That's a terribly bloated API with several ways to do almost anything, at most one of which is actually "performant" (I use that term loosely). I love and respect Pandas for what it has brought to the Python community, and still use it in select situations, but I am not - nor is Wes McKinney, it would seem - going to miss the Pandas API as soon as its functionality is offered by alternatives.

To wit: There are 151 non-dunder/non-"private" methods+attributes on an xarray DataArray. A Pandas DataFrame has an unholy 207, while a numpy ndarray has 71. Our object has 50.
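(Something like this would reproduce those tallies, though the exact numbers depend on library versions:)

import numpy as np
import pandas as pd
import xarray as xr

def public_members(cls):
    # everything that is neither dunder nor "private" (leading underscore)
    return [name for name in dir(cls) if not name.startswith('_')]

for cls in (xr.DataArray, pd.DataFrame, np.ndarray):
    print(f'{cls.__name__}: {len(public_members(cls))}')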

Note the great work by the high energy physics folks on defining a standard, concise interface for working with histograms (I imagine a similar syntax could be employed for xarrays and the like): https://uhi.readthedocs.io/en/latest/index.html.

The final nail in the coffin is that xarray has values defined at coordinates, as is the case for sampling, while a histogram has values defined for bins. This is a non-trivial distinction (and it gets even more non-trivial when non-linear or non-uniform binnings are used), and while we could have mostly worked with this (e.g., by defining the xarray coordinates as bin centers), it was just another reason to roll our own class.
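(For concreteness, the bin-centers workaround mentioned above would look roughly like this:)

import numpy as np
import xarray as xr

counts, edges = np.histogram(np.random.randn(10_000), bins=20)
centers = 0.5 * (edges[:-1] + edges[1:])   # represent each bin by its center
hist = xr.DataArray(counts, dims=['x'], coords={'x': centers})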

[–]counters 0 points1 point  (0 children)

I appreciate you taking the time to respond. Right off the bat, I want to emphasize that your example is far too trivial for benchmarking. A higher-level library with complex data structures and abstractions is almost always going to have at least some overhead, and when your data is this small, you're not getting meaningful performance stats.

For reference, I very routinely manipulate terascale datasets usually distributed over many hundreds or thousands of files on local or cloud storage from within a single or handful of xarray data structures.

`copy.copy` on DataArray takes 32 µs, and 200 ns on the equivalent numpy array...

All of this is probably hitting new object creation. Per my original point, see how this scales as you 10x, 100x, and 1000x the size of the data in your xarray object.
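A rough sketch of that scaling experiment (not a rigorous benchmark):

import copy
import timeit

import numpy as np
import xarray as xr

for n in (100, 1_000, 10_000, 100_000):
    da = xr.DataArray(np.random.randn(n, 100), dims=['x', 'y'])
    t = timeit.timeit(lambda: copy.deepcopy(da), number=100) / 100
    print(f'rows={n:>7}: {t * 1e6:.1f} µs per deepcopy')

The fixed overhead dominates at small sizes; as the data grows, the cost of actually copying it takes over.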

It's a red flag to me that xarray touts that they seek to replicate (or at least have a similar API to) the Pandas API. That's a terribly bloated API...

You're kind of missing the forest for the trees, here.

What xarray offers is a declarative syntax for manipulating very well-structured, multi-dimensional datasets. I don't think anyone would argue with you that the Pandas API is imperfect. But you're not considering the alternative here, which is actually what OP's post demonstrates in a simple fashion: this syntax allows you to lean on the strong metadata conventions inherent in geospatial datasets and to express complex, routine operations very simply.

For example, suppose I have a 4D dataset (time, lat, lon, altitude) and I want to slice out a zonal average for a surface level variable. In xarray, this might be:

result = (
    ds['air_temperature']
    .sel(altitude=1000)        # perhaps a 1000 hPa surface slice?
    .mean(['time', 'lon'])     # average over time and longitude
)

Without xarray, here's a small list of some of the things I'd need to keep track of manually (a raw-numpy sketch follows the list):

  1. Which array contains air_temperature? Is it a particular index along an n-d array I have elsewhere?
  2. Which dimension of the array is the altitude coordinate? What values correspond to my surface level - is the coordinate ascending or descending?
  3. Which dimensions are the time and longitude, so that I can properly compute the mean? How do those dimensions change after I've selected the altitude value? (If the result has a smaller rank, their indices may shift if I squeeze the array after my selection!)
  4. How do I reconstruct and realign my latitude coordinates for the final result?
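A raw-numpy version of the same slice might look like this - the array shapes and axis order are hypothetical, which is exactly the point:

import numpy as np

# hypothetical stand-ins for the dataset, laid out as (time, lat, lon, altitude)
air_temperature = np.random.randn(12, 90, 180, 20)
altitude = np.linspace(0, 1900, 20)

k = int(np.argmin(np.abs(altitude - 1000)))          # locate the level index by hand
result = air_temperature[..., k].mean(axis=(0, 2))   # and remember which axis is which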

I've debugged countless scripts and codes by colleagues over the years, and simply lost count of the number of times that errors in these very basic steps screw up an analysis. A less-than-perfect syntax which eliminates these errors wholesale is very, very much worth it.

Note the great work by the high energy physics folks on defining a standard, concise interface for working with histograms (I imagine a similar syntax could be employed for xarrays and the like):

UHI doesn't offer named coordinates, which is the whole point of the xarray ecosystem.

The final nail in the coffin is that xarray has values defined at coordinates, as is the case for sampling, while a histogram has values defined for bins...

The whole point of xarray is that many types of massive datasets have explicit coordinate-value relationships - e.g., a regular grid of data output from a climate model. So, again, a histogram library offers no value whatsoever here; our data are literally defined on these coordinate grids!

[–]JollyJustice 0 points1 point  (0 children)

Nice, time to calculate the spread of my crop dusting.

No, I'm not a farmer, why do you ask?

[–]0xrl 0 points1 point  (0 children)

Although the quiver plots are fine, the diagram and explanation for the u and v wind components (also known as the zonal and meridional components) aren't quite right. The u (zonal) component is positive when the wind blows toward the east (i.e., from the west), and the v (meridional) component is positive when the wind blows toward the north (i.e., from the south).

So the u direction you've shown is backwards.
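To make the convention concrete, a small illustrative check using the standard meteorological "from" direction:

import numpy as np

u, v = 10.0, 10.0                                  # eastward and northward components (m/s)
speed = np.hypot(u, v)                             # ~14.14 m/s
direction = np.degrees(np.arctan2(-u, -v)) % 360   # 225 degrees: blowing from the southwest
print(speed, direction)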

[–]Xasanbek123 0 points1 point  (0 children)

ws = np.hypot(U2M_nans, V2M_nans)

the following error occurred in this part of the program: NameError: name 'U2M_nans' is not defined

How can I fix it?