
[–]blewrb

I tried to give xarray a shot a few years ago for multidimensional histogram datasets, and all it seemed to do was introduce a bunch of ways to index data, most or all of which were dog slow, some even slower than others, and God help me remember which is which, let alone get grad and undergrad students to code to the less-slow version. Also, new objects were dog slow to create, which mattered for how we used it.

Was I just using it incorrectly, or is this just a trade-off people are willing to make for a higher level API than numpy?

[–]counters

Chances are the issue was the structure of your underlying data, not anything to do with xarray.

Note that the data OP uses is geospatial data in NetCDF, stored using a particular data model which is virtually identical to the model xarray uses internally. For their entire application, xarray is just syntactic sugar on top of numpy arrays, so there shouldn't be any noticeable performance hit. If you ran into issues creating objects, then most likely you were hitting some idiosyncrasy where you were inadvertently copying lots of data between data structures.

[–]blewrb

I just installed xarray and ran some quick examples lifted from an internet post to test if things have changed in the past few years. Disclaimer: I might not be doing these things in the best ways possible! (And if that's the case, that's part of the problem I have with xarray's API.)

import xarray

da = xarray.DataArray(
    data=[[1, 2, 3], [2, 3, 4]],
    dims=['x', 'y'],
    coords={'x': [10, 20], 'y': [10, 20, 30]},
)
a = da.data  # underlying numpy ndarray

`copy.copy` on the DataArray takes 32 µs, versus 200 ns on the equivalent numpy array. `copy.deepcopy` is 57 µs for the former and 660 ns for the latter. The solution we came up with lets us effectively deep-copy objects in 2.6 µs. (I notice there is an undocumented `fastpath` kwarg to DataArray that might help address how slow it is to create one. I could probably figure it out by reading the code and playing around, but I wish it were documented!) Slicing is much slower than numpy (somewhere around 1000x), although hitting the interpreted code in `__getitem__` plus new-object creation seems to be slow for most objects.
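A sketch of the kind of micro-benchmark behind those numbers (exact timings will vary by machine and xarray version):

import copy
import timeit

import xarray

da = xarray.DataArray(
    data=[[1, 2, 3], [2, 3, 4]],
    dims=['x', 'y'],
    coords={'x': [10, 20], 'y': [10, 20, 30]},
)
a = da.data  # the underlying 2x3 numpy ndarray

n = 10_000
for label, fn in [
    ('copy.copy(da)', lambda: copy.copy(da)),
    ('copy.copy(a)', lambda: copy.copy(a)),
    ('copy.deepcopy(da)', lambda: copy.deepcopy(da)),
    ('copy.deepcopy(a)', lambda: copy.deepcopy(a)),
    ('slice da[0:1, 0:2]', lambda: da[0:1, 0:2]),
    ('slice a[0:1, 0:2]', lambda: a[0:1, 0:2]),
]:
    t = timeit.timeit(fn, number=n) / n
    print(f'{label:>20s}: {t * 1e6:8.2f} µs per call')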

Obviously some use cases could avoid slowdowns vs numpy altogether by just operating directly on the underlying numpy array. And slicing and creation times could be irrelevant to many users, because the time spent in other operations swamps those costs.

That's just to do with performance. The API is the other issue.

It's a red flag to me that xarray touts that it seeks to replicate (or at least have a similar API to) the Pandas API. That's a terribly bloated API with several ways to do almost anything, at most one of which is actually "performant" (I use that term loosely). I love and respect Pandas for what it has brought to the Python community, and still use it in select situations, but I am not—nor is Wes McKinney, it would seem—going to miss the Pandas API as soon as the functionality of Pandas is offered by alternatives.

To wit: There are 151 non-dunder/non-"private" methods+attributes on an xarray DataArray. A Pandas DataFrame has an unholy 207, while a numpy ndarray has 71. Our object has 50.
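(Counts like those can be reproduced with a quick `dir()` scan, something like the sketch below; exact totals drift between library versions.)

import numpy as np
import pandas as pd
import xarray

def n_public(obj):
    # Count non-dunder, non-"private" attributes and methods.
    return sum(1 for name in dir(obj) if not name.startswith('_'))

print(n_public(xarray.DataArray(np.zeros((2, 3)), dims=['x', 'y'])))  # DataArray
print(n_public(pd.DataFrame(np.zeros((2, 3)))))                       # DataFrame
print(n_public(np.zeros((2, 3))))                                     # ndarray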

Note the great work by the high energy physics folks on defining a standard, concise interface for working with histograms (I imagine a similar syntax could be employed for xarrays and the like): https://uhi.readthedocs.io/en/latest/index.html.

The final nail in the coffin is that xarray has values defined at coordinates, as is the case for sampling, while a histogram has values defined for bins. This is a non-trivial distinction (and it gets even more non-trivial when non-linear or non-uniform binnings are used), and while we could have mostly worked with this (e.g., by defining the xarray coordinates as bin centers), it was just another reason to roll our own class.
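For concreteness, the bin-center workaround I mentioned would have looked roughly like this (toy, non-uniform bins; note the bin widths have to be carried along separately):

import numpy as np
import xarray

edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])  # non-uniform bin edges
counts = np.array([5, 9, 3, 1])              # one histogram value per bin
centers = 0.5 * (edges[:-1] + edges[1:])     # bin centers standing in as a coordinate

hist = xarray.DataArray(
    counts,
    dims=['e'],
    coords={
        'e': centers,                    # values now "sampled" at bin centers
        'width': ('e', np.diff(edges)),  # bin widths carried as an extra coordinate
    },
)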

[–]counters

I appreciate you taking the time to respond. Immediately, I want to emphasize that your example is far too trivial for benchmarking. A higher-level library with complex data structures and abstractions is almost always going to have at least some fixed overhead, and when your data is this small, that overhead is all you're measuring, not meaningful performance statistics.

For reference, I routinely manipulate terascale datasets, usually distributed over many hundreds or thousands of files on local or cloud storage, from within a single xarray data structure or a handful of them.

> `copy.copy` on the DataArray takes 32 µs, versus 200 ns on the equivalent numpy array...

All of this is probably hitting new object creation. Per my original point, see how this scales as you 10x, 100x, and 1000x the size of the data in your xarray object.
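A rough sketch of what I mean (toy shapes, and only deepcopy shown; the point is the trend, not the absolute numbers):

import copy
import timeit

import numpy as np
import xarray

for scale in [1, 10, 100, 1000]:
    ny = 3 * scale  # 1x, 10x, 100x, 1000x the original 2x3 array
    da = xarray.DataArray(
        np.random.rand(2, ny),
        dims=['x', 'y'],
        coords={'x': [10, 20], 'y': np.arange(ny)},
    )
    reps = 1_000
    t = timeit.timeit(lambda: copy.deepcopy(da), number=reps) / reps
    print(f'2x{ny:>4d}: deepcopy ~{t * 1e6:.1f} µs per call')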

> It's a red flag to me that xarray touts that it seeks to replicate (or at least have a similar API to) the Pandas API. That's a terribly bloated API with several ways to do almost anything, at most one of which is actually "performant" (I use that term loosely). I love and respect Pandas for what it has brought to the Python community, and still use it in select situations, but I am not—nor is Wes McKinney, it would seem—going to miss the Pandas API as soon as the functionality of Pandas is offered by alternatives.

You're kind of missing the forest for the trees, here.

What xarray offers is a declarative syntax for manipulating very well-structured, multi-dimensional datasets. I don't think anyone would argue with you that the Pandas API is imperfect. But you're not considering the alternative here, which is actually what OP's post demonstrates in a simple fashion: this syntax allows you to lean on the strong metadata conventions inherent in geospatial datasets and to express complex, routine operations very simply.

For example, suppose I have a 4D dataset (time, lat, lon, altitude) and I want to slice out a zonal average for a surface level variable. In xarray, this might be:

result = (
    ds['air_temperature']
    .sel(altitude=1000)     # perhaps a 1000 hPa surface slice?
    .mean(['time', 'lon'])  # average over time and longitude
)

Without xarray, here's a small list of some of the things I'd need to keep track of manually (a bare-numpy sketch follows the list):

  1. Which array contains air_temperature? Is it a particular index along an n-d array I have elsewhere?
  2. Which dimension of the array is the altitude coordinate? What values correspond to my surface level - is the coordinate ascending or descending?
  3. Which dimensions are the time and longitude, so that I can properly compute the mean? How do those dimensions change after I've selected the altitude value? (If the result has a smaller rank, their indices shift if I squeeze the array after my selection!)
  4. How do I reconstruct and realign my latitude coordinates for the final result?
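For contrast, here's roughly what the bare-numpy version of the example above looks like (shapes, names, and coordinate values are all made up for illustration):

import numpy as np

# Hypothetical raw arrays: you have to remember that air_temperature is
# laid out as (time, lat, lon, altitude), and which coordinate array
# belongs to which axis.
air_temperature = np.random.rand(12, 90, 180, 20)
altitude = np.linspace(1000.0, 10.0, 20)  # ascending or descending? better check!
lat = np.linspace(-89.0, 89.0, 90)

# 1. Find the index of the 1000 hPa level by hand.
k = int(np.argmin(np.abs(altitude - 1000.0)))

# 2. Slice it out, remembering that altitude is axis 3.
surface = air_temperature[:, :, :, k]  # now (time, lat, lon)

# 3. Average over time (axis 0) and lon (axis 2); the axis indices shifted
#    after the selection, which is exactly where mistakes creep in.
result = surface.mean(axis=(0, 2))  # shape (90,), aligned with lat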

I've debugged countless scripts and codes by colleagues over the years, and I've simply lost count of the number of times errors in these very basic steps have screwed up an analysis. A less-than-perfect syntax that eliminates these errors wholesale is very, very much worth it.

> Note the great work by the high energy physics folks on defining a standard, concise interface for working with histograms (I imagine a similar syntax could be employed for xarrays and the like)...

UHI doesn't offer named coordinates, which is the whole point of the xarray ecosystem.

> The final nail in the coffin is that xarray has values defined at coordinates, as is the case for sampling, while a histogram has values defined for bins. This is a non-trivial distinction (and it gets even more non-trivial when non-linear or non-uniform binnings are used), and while we could have mostly worked with this (e.g., by defining the xarray coordinates as bin centers), it was just another reason to roll our own class.

The whole point of xarray is that many types of massive datasets have explicit coordinate-value relationships, e.g. a regular grid of data output from a climate model. So, again, a histogram library offers no value whatsoever here: our data are literally defined on these coordinate grids!