home / github

Menu
  • GraphQL API
  • Search all tables

issue_comments

Table actions
  • GraphQL API for issue_comments

27 rows where issue = 324350248 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: reactions, created_at (date), updated_at (date)

user 9

  • TomNicholas 10
  • shoyer 7
  • ewquon 3
  • jnhansen 2
  • jhamman 1
  • dcherian 1
  • aidanheerdegen 1
  • chiaral 1
  • aluhamaa 1

author_association 3

  • MEMBER 19
  • NONE 6
  • CONTRIBUTOR 2

issue 1

  • Concatenate across multiple dimensions with open_mfdataset · 27 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
531910110 https://github.com/pydata/xarray/issues/2159#issuecomment-531910110 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDUzMTkxMDExMA== ewquon 18267059 2019-09-16T18:54:39Z 2019-09-16T18:54:39Z NONE

Thanks @dcherian. As you suggested, I ended up using v0.12.3 and xr.combine_by_coords() to get the expected behavior.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
531792346 https://github.com/pydata/xarray/issues/2159#issuecomment-531792346 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDUzMTc5MjM0Ng== dcherian 2448579 2019-09-16T14:05:34Z 2019-09-16T14:05:34Z MEMBER

Hi @ewquon I think you need to specify combine="by_coords" or combine="nested" to get the new behaviour.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
528596388 https://github.com/pydata/xarray/issues/2159#issuecomment-528596388 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDUyODU5NjM4OA== ewquon 18267059 2019-09-05T21:29:14Z 2019-09-05T21:29:14Z NONE

I can confirm that this issue persists in v0.12.3 as well.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
528594727 https://github.com/pydata/xarray/issues/2159#issuecomment-528594727 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDUyODU5NDcyNw== ewquon 18267059 2019-09-05T21:24:34Z 2019-09-05T21:24:34Z NONE

I'm running xarray v0.12.1, released in June 5 of this year, which should include @TomNicholas's fix merged back in Dec of last year. However, the original MWE still gives the unwanted result with the repeated coordinates.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
505505980 https://github.com/pydata/xarray/issues/2159#issuecomment-505505980 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDUwNTUwNTk4MA== TomNicholas 35968931 2019-06-25T15:50:33Z 2019-06-25T15:50:33Z MEMBER

Closed by #2616

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
437579539 https://github.com/pydata/xarray/issues/2159#issuecomment-437579539 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQzNzU3OTUzOQ== TomNicholas 35968931 2018-11-10T12:10:00Z 2018-11-10T12:10:00Z MEMBER

@shoyer see my PR trying to implement this (#2553).

Inputting a list of lists into auto_combine() is working, but it wasn't obvious to me how to handle this within open_mfdataset(). A few approaches:

1) I could try to somehow generalise all of the list comprehensions in open_mfdataset(), which would be messy but general

2) Write some kind of recursive iterator function which would allow me to apply the preproccess and dask.compute functions to all the objects in the nested list

3) Separate the logic so that the input is assumed to be a flat list unless infer_order_from_coords=True

4) Always recursively flatten the input before opening the files, but store the structure somehow

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
435733418 https://github.com/pydata/xarray/issues/2159#issuecomment-435733418 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQzNTczMzQxOA== shoyer 1217238 2018-11-05T02:03:00Z 2018-11-05T02:03:00Z MEMBER

This is fine though right? We can do all of this, because it should compartmentalise fairly easily shouldn't it? You end up with logic like:

Yes, this seems totally fine to me.

We don't need to, but I don't think it would be that hard (if the structure above is feasible), and I think it's a common use case. Also there's an argument for putting in special effort to generalize this function as much as possible, because it lowers the barrier to entry for xarray for new users. Though perhaps I'm just biased because it happens to be my use case...

Sure, no opposition from me if you want to do it! 👍

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
435706762 https://github.com/pydata/xarray/issues/2159#issuecomment-435706762 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQzNTcwNjc2Mg== TomNicholas 35968931 2018-11-04T21:10:55Z 2018-11-05T00:56:06Z MEMBER

we probably want to support all permutations of 1/2.

This is fine though right? We can do all of this, because it should compartmentalise fairly easily shouldn't it? You end up with logic like:

```python def auto_combine(ds_sequence, infer_order_from_coords=True, check_alignment=True): if check_alignment: # Check alignment along non-concatenated dimensions (your (2))

if infer_order_from_coords:
    # Use coordinates to determine tile_ID for each dataset in N-D (your (1))
else:
    # Determine tile_IDs by structure of input in N-D (i.e. ordering in list-of-lists)

# Join everything together
return _concat_nd(tile_IDs, ds_sequence)

```

I'm not sure we need to support this yet

We don't need to, but I don't think it would be that hard (if the structure above is feasible), and I think it's a common use case. Also there's an argument for putting in special effort to generalize this function as much as possible, because it lowers the barrier to entry for xarray for new users. Though perhaps I'm just biased because it happens to be my use case...

Also if we know what form the tile_IDs should take then I can write the _concat_nd function now regardless of what happens with the alignment logic.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
435544853 https://github.com/pydata/xarray/issues/2159#issuecomment-435544853 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQzNTU0NDg1Mw== shoyer 1217238 2018-11-03T00:22:46Z 2018-11-03T00:22:46Z MEMBER

@TomNicholas I agree with your steps 1/2/3 for open_mfdataset.

My concern with a single prealigned=True argument is that there are two distinct use-cases: 1. Checking coordinates along the dimension(s) being concatenated to determine the result order 2. Checking coordinates along other non-concatenated dimensions to verify consistency

Currently we always do (2) and never do (1).

We definitely want an option to disable (2) for speed, and also want an option to support (1) (what you propose here). But these are distinct use cases -- we probably want to support all permutations of 1/2.

However this would mean that users wanting to do a multi-dimensional auto_combine on data without monotonic indexes would have to supply their datasets in some way that specifies their desired N-dimensional ordering

I'm not sure we need to support this yet -- it would be enough to have keyword argument for falling back to the existing behavior that only supports 1D concatenation in the order provided.

Also I'm assuming we are not going to provide functionality to handle uneven sub-lists, e.g. [[t1x1, t1x2], [t2x1, t2x2, t2x3]]?

Agreed, not important unless someone really wants/needs it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
435336049 https://github.com/pydata/xarray/issues/2159#issuecomment-435336049 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQzNTMzNjA0OQ== TomNicholas 35968931 2018-11-02T10:29:24Z 2018-11-02T11:07:17Z MEMBER

I was thinking about the general solution to this problem again and wanted to clarify some things.

Currently concat() will concatenate datasets in the order they are supplied, and will not check that the resulting dimensions indexes are monotonic. This behvaiour violates CF conventions (as mentioned by @aluhamaa) but currently passes silently.

I think that any general multi-dimensional version of the auto_combine() function (and therefore open_mfdataset()) should:

1) If possible use the values in the dimension indexes to arrange the datasets so that the indexes are monotonic,

2) Else issue a warning that some of the indexes supplied are not monotonic,

3) Then instead concatenate the supplied datasets in the order supplied (for some N-dimensional definition of "order"). The warning should tell the user that's what it's doing.

This approach would then be backwards-compatible, accommodate users whose data does not have monotonic indexes (they would just have to arrange their datasets into the correct order themselves first), while still doing the obviously correct thing in unambiguous cases.

However this would mean that users wanting to do a multi-dimensional auto_combine on data without monotonic indexes would have to supply their datasets in some way that specifies their desired N-dimensional ordering. This could be done as list-of-lists, combining the inner-most dimensions first, e.g. [[x1y1, x2y1], [x1y2, x2y2]], concat_dims=['y', 'x']. But auto_combine would then have to be able to handle this type of input, and quickly distinguish between the two cases of monotonic & non-monotonic indices. Is this the behaviour which we want?

Also I'm assuming we are not going to provide functionality to handle uneven sub-lists, e.g. [[t1x1, t1x2], [t2x1, t2x2, t2x3]]?

Edit:

I've just realised that there is a lot of related discussion in #2039, #1385, & #1823. I suppose what I'm suggesting here is essentially the N-D generalisation of the approach discussed in those issues, namely an extra argument prealigned for open_mfdataset(), which defaults to False. Then with prealigned=True, the required input would be a nested list of (paths to) datasets, which is nested the same number of times as there are dimensions in concat_dims. Then to recreate the current behaviour for an ordered 1D list of datasets with non-monotonic indexes you would only have to pass prealigned=True.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
427892990 https://github.com/pydata/xarray/issues/2159#issuecomment-427892990 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQyNzg5Mjk5MA== TomNicholas 35968931 2018-10-08T16:12:06Z 2018-10-08T16:12:06Z MEMBER

Thanks @shoyer for the description of how this should be done properly.

In the meantime however, I thought I would describe how I solved the problem in my last comment. My method works but you probably wouldn't want to use it in xarray itself because it's pretty "hacky".

To avoid the issue of numpy reading the __array__ methods of xarray objects and doing weird things, I simply contained each dataset within a single-element dictionary in order to hide the offending methods, i.e.

python data = create_test_data() data_grid = np.array([{'key': data}], dtype='object') With this then creating something which will concatenate the numpy grid-like array of (dicts holding) datasets is quick: ```python from xarray import concat import numpy as np

def _concat_nd(obj_grid, concat_dims=None, data_vars=None, kwargs): # Combine datasets along one dimension at a time, # Have to start with last axis and finish with axis=0 otherwise axes will disappear before the loop reaches them for axis in reversed(range(obj_grid.ndim)): obj_grid = np.apply_along_axis(_concat_dicts, axis, arr=obj_grid, dim=concat_dims[axis], data_vars=data_vars[axis], kwargs)

# Grid should now only contain one dict which contains the concatenated xarray object
return obj_grid.item()['key']

def _concat_dicts(dict_objs, dim, data_vars, kwargs): objs = [dict_obj['key'] for dict_obj in dict_objs] return {'key': concat(objs, dim, data_vars, kwargs)}

``` In case anyone is interested then this is how I've (hopefully temporarily) solved the N-D concatenation problem in the case of my data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
418476055 https://github.com/pydata/xarray/issues/2159#issuecomment-418476055 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQxODQ3NjA1NQ== shoyer 1217238 2018-09-04T18:44:35Z 2018-09-04T22:16:34Z MEMBER

NumPy's handling of object arrays is unfortunately inconsistent. So maybe it isn't the best idea to use NumPy arrays for this.

Python's built-in list/dict might be better choices here. Something like: python def concat_nd(datasets): # find the set of dimensions across which to possibly merge # could possibly use OrderedSet here: # https://github.com/pydata/xarray/blob/v0.10.8/xarray/core/utils.py#L401 all_dims = set(ds.dims for ds in datasets) # Create a map from each dimension to a tuple giving the size of each # dimension on an input dataset. Not all collections of datasets have consistent # sizes along each dimension, but the ones we can automatically concatenate do. # I recommend researching how "chunks" work in dask.array: # http://dask.pydata.org/en/latest/array-design.html # http://dask.pydata.org/en/latest/array-chunks.html chunks = {dim: ... for dim in all_dims} # find the sorted, de-duplicated union of all indexes along those dimensions # np.unique followed by wrapping with pd.Index() # might work OK for the "union" function here combined_indexes = {dim: union([ds.indexes[dim] for ds in datasets]) for dim in all_dims} # create a map mapping from "tile id" to dataset # get_indexes() should use pandas.Index.get_indexer to lookup ds.indexes[dim] # in the combined index, e.g., of type Dict[Tuple[int, ...], xarray.Dataset] indexes_to_dataset = {get_indexes(ds, chunks, combined_coords): ds for ds in datasets} # call concat() in a loop to construct the combined dataset

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
417802225 https://github.com/pydata/xarray/issues/2159#issuecomment-417802225 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQxNzgwMjIyNQ== TomNicholas 35968931 2018-08-31T22:12:28Z 2018-08-31T22:16:15Z MEMBER

I started having a go at writing the second half of this - the "n-dimensional-concatenation" function which would accept a grid of xarray.DataSet/DataArray objects (assumed to be in the correct order along all dimensions), and return a single merged dataset.

However, I think there's an issue with using

something like a NumPy object array of xarray.Dataset/DataArray objects.

My plan was to call np.apply_along_axis to apply the 1D xr.concat() function along each axis in turn, something like ```python from numpy import apply_along_axis from xarray import concat

def concat_nd(obj_grid, concat_dims=None): """ Concatenates a structured ndarray of xarray Datasets along multiple dimensions.

Parameters
----------
obj_grid : numpy array of Dataset and DataArray objects
    N-dimensional numpy object array containing xarray objects in the shape they
    are to be concatenated. Each object is expected to
    consist of variables and coordinates with matching shapes except for
    along the concatenated dimension.
concat_dims : list of str or DataArray or pandas.Index
    Names of the dimensions to concatenate along. Each dimension in this argument
    is passed on to :py:func:`xarray.concat` along with the dataset objects.
    Should therefore be a list of valid dimension arguments to xarray.concat().

Returns
-------
combined : xarray.Dataset
"""

# Combine datasets along one dimension at a time
# Start with last axis and finish with axis=0
for axis in reversed(range(obj_grid.ndim)):
    obj_grid = apply_along_axis(concat, axis, arr=obj_grid, dim=concat_dims[axis])

# Grid should now only contain one xarray object
return obj_grid.item

However, testing this code withpython def test_concat_1d(self): data = create_test_data()

split_data = [data.isel(dim1=slice(3)), data.isel(dim1=slice(3, None))]

# Will explain why I'm forced to create ndarray like this shortly
split_data_grid = np.empty(shape=(2), dtype=np.object)
split_data_grid[0] = split_data[0]
split_data_grid[1] = split_data[1]

reconstructed = concat_nd(split_data_grid, ['dim1'])

xrt.assert_identical(data, reconstructed)

throws an error from within `np.apply_along_axis` TypeError: cannot directly convert an xarray.Dataset into a numpy array. Instead, create an xarray.DataArray first, either with indexing on the Dataset or by invoking the to_array() method. ```

I think this is because even just the idea of having a ndarray containing xarray datasets seems to cause problems - if I do it with a single item then xarray thinks I'm trying to convert the Dataset into a numpy array and throws the same error: python data = create_test_data() data_grid = np.array(data, dtype='object')

and if I do it with multiple items then numpy will dive down and extract the variables in the dataset instead of just storing a reference to the dataset: python data = create_test_data() split_data = [data.isel(dim1=slice(3)), data.isel(dim1=slice(3, None))] split_data_grid = np.array(split_data, dtype='object') print(split_data_grid) returns [['time' 'dim2' 'dim3' 'var1' 'var2' 'var3' 'numbers'] ['time' 'dim2' 'dim3' 'var1' 'var2' 'var3' 'numbers']] when I expected something more like numpy.array([<xarray.Dataset object at 0x7f94efa62278>, <xarray.Dataset object at 0x9f23phj10582>]) (This is why I had to create an empty array and then fill it afterwards in my example test further up.)

Is this the intended behaviour of xarray? Does this mean I can't use numpy arrays of xarray objects at all for this problem? If so then what structure do you think I should use instead (list of lists etc.)?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
416389795 https://github.com/pydata/xarray/issues/2159#issuecomment-416389795 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQxNjM4OTc5NQ== shoyer 1217238 2018-08-27T22:29:22Z 2018-08-27T22:29:22Z MEMBER

@TomNicholas I think your analysis is correct here.

I suspect that in most cases we could figure out how to tile datasets by looking at 1D coordinates along each dimension (e.g., indexes for each dataset), e.g., to find a "chunk id" along each concatenated dimension.

These could be used to build something like a NumPy object array of xarray.Dataset/DataArray objects, which could split up into a bunch of 1D calls to xarray.concat().

I would rather avoid using the positions argument of concat. It doesn't really add any flexibility compared to reordering the inputs with xarray.core.nputils.inverse_permutation.

Final point - this common use case also has the added complexity of having ghost or guard cells around every dataset, which should be thrown away. Clearly some user input is required here (ghost_cells_x=2, ghost_cells_y=2, ghost_cells_z=0, ...), but I'm really not sure what the best way to fit that kind of logic in is. Yet more arguments to open_mfdataset?

We could potentially just encourage using the existing preprocess argument.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
412177726 https://github.com/pydata/xarray/issues/2159#issuecomment-412177726 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQxMjE3NzcyNg== TomNicholas 35968931 2018-08-10T19:08:56Z 2018-08-11T00:09:28Z MEMBER

I've been looking through the functions open_mfdataset, auto_combine, _auto_concat and concat to see how one might go about achieving this in general.

The current behaviour isn't completely explicit, and I would like to check my understanding with a few questions:

1) If you concat two datasets along a dimension which doesn't have a coordinate, then concat will not be able to know what order to concatenate them in, so it just does it in the order they were provided?

2) Although auto_combine can determine the common dimension to concatenate datasets over, it doesn't know anything about insertion order! Even if the datasets have dimension coordinates, the line

python grouped = itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)), datasets).values()

will only organise the datasets into groups according to the set of dimensions they have, it doesn't order the datasets within each group according to the values in the dimension coordinates?

We can show this because this (new) testcase fails:

```python @requires_dask def test_auto_combine_along_coords(self): # drop the third dimension to keep things relatively understandable data = create_test_data() for k in list(data.variables): if 'dim3' in data[k].dims: del data[k]

data_split1 = data.isel(dim2=slice(4))
data_split2 = data.isel(dim2=slice(4, None))
split_data = [data_split2, data_split1]  # Deliberately arrange datasets in wrong order
assert_identical(data, auto_combine(split_data, 'dim2'))

```

with output

E AssertionError: <xarray.Dataset> E Dimensions: (dim1: 8, dim2: 9, dim3: 10, time: 20) E Coordinates: E * time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-20 E * dim2 (dim2) float64 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 E Dimensions without coordinates: dim1, dim3 E Data variables: E var1 (dim1, dim2) float64 1.473 1.363 -1.192 ... 0.2341 -0.3403 0.405 E var2 (dim1, dim2) float64 -0.7952 0.7566 0.2468 ... -0.6822 1.455 0.7314 E <xarray.Dataset> E Dimensions: (dim1: 8, dim2: 9, time: 20) E Coordinates: E * time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-20 E * dim2 (dim2) float64 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5 E Dimensions without coordinates: dim1 E Data variables: E var1 (dim1, dim2) float64 1.496 -1.834 -0.6588 ... 1.326 0.6805 -0.2999 E var2 (dim1, dim2) float64 0.7926 -1.063 0.1062 ... -0.447 -0.8955

3) So the call to _auto_concat just assumes that the datasets are provided in the correct order:

python concatenated = [_auto_concat(ds, dim=dim, data_vars=data_vars, coords=coords) for ds in grouped]

4) Therefore what needs to be done here is the groupby call needs to be replaced with something that actually orders the datasets according to the value in the dimension coordinates, works in N dimensions, and outputs a structure of datasets upon which _auto_concat can be called repeatedly, along every concatenation dimension?

Also, concat has a positions argument, which allows you to manually specify the concatenation order, but it's not used at all by auto_combine. In the main use case imagined here (concatenating the domains of multi-parallel simulation output) then the user will know the desired positions of each dataset, because it will correspond to how they divided up their domain in the first place. Perhaps an easier approach to providing for that use case would be to propagate the positions argument upwards so that the user can do something like

```python

User specifies how they split up their domain

domain_decomposition_structure = how_was_this_parallelized('output.*.nc')

Feeds this info into open_mfdataset

full_domain = xr.open_mfdataset('output.*.nc', positions=domain_decomposition_structure) ``` This approach would be much less general but would dodge the issue of writing generalized N-D auto-concatenation logic.

Final point - this common use case also has the added complexity of having ghost or guard cells around every dataset, which should be thrown away. Clearly some user input is required here (ghost_cells_x=2, ghost_cells_y=2, ghost_cells_z=0, ...), but I'm really not sure what the best way to fit that kind of logic in is. Yet more arguments to open_mfdataset?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
410361639 https://github.com/pydata/xarray/issues/2159#issuecomment-410361639 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQxMDM2MTYzOQ== jnhansen 2622379 2018-08-03T20:03:21Z 2018-08-03T20:14:17Z NONE

Yes, xarray should support that very easily -- assuming you have dask installed: python ds = auto_merge('*.nc') ds.to_netcdf('larger_than_memory.nc') auto_merge conserves the chunk sizes resulting from the individual files. If the single files are still too large to fit into memory individually you can rechunk to smaller chunk sizes. The same goes of course for the original xarray.open_mfdataset.

I tested it on a ~25 GB dataset (on a machine with less memory than that).

Note: ds = auto_merge('*.nc') actually runs in a matter of milliseconds, as it merely provides a view of the merged dataset. Only once you call ds.to_netcdf('larger_than_memory.nc') all the disk I/O happens.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
410357191 https://github.com/pydata/xarray/issues/2159#issuecomment-410357191 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQxMDM1NzE5MQ== TomNicholas 35968931 2018-08-03T19:44:32Z 2018-08-03T19:44:32Z MEMBER

Thanks @jnhansen ! I actually ended up writing my own, much lower level, version of this using the netcdf library. The reason I did that was because I was finding it hard to work out how to merge multiple datasets, then write the data out to a new netcdf file in chunks - I kept accidentally loading the entire merged dataset into memory at once. This might just be because I wasn't using the dask integration properly though.

Have you tried using your function to merge netcdf files, then write out a single file which is larger than RAM? Is that even possible in xarray?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
410348249 https://github.com/pydata/xarray/issues/2159#issuecomment-410348249 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDQxMDM0ODI0OQ== jnhansen 2622379 2018-08-03T19:07:10Z 2018-08-03T19:12:21Z NONE

I just had the exact same problem, and while I didn't yet have time to dig into the source code of xarray.open_mfdataset, I wrote my own function to achieve this:

https://gist.github.com/jnhansen/fa474a536201561653f60ea33045f4e2

Maybe it's helpful to some of you.

Note that I make the following assumptions (which are reasonable for my use case): * the data variables in each part are identical * equality of the first element of two coordinate arrays is sufficient to assume equality of the two coordinate arrays

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
399319465 https://github.com/pydata/xarray/issues/2159#issuecomment-399319465 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5OTMxOTQ2NQ== aidanheerdegen 6063709 2018-06-22T04:47:36Z 2018-06-22T04:47:36Z CONTRIBUTOR

👍 for this feature

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
393626605 https://github.com/pydata/xarray/issues/2159#issuecomment-393626605 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5MzYyNjYwNQ== shoyer 1217238 2018-05-31T18:19:32Z 2018-05-31T18:19:32Z MEMBER

@aluhamaa I don't think you're missing anything here. I agree that it would be pretty straightforward, it just would take a bit of work.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
393479854 https://github.com/pydata/xarray/issues/2159#issuecomment-393479854 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5MzQ3OTg1NA== aluhamaa 10668114 2018-05-31T09:57:01Z 2018-05-31T09:57:01Z NONE

Just wanted to add the same request;)

  • But, I want to add that current behavior, where combined y = 10 20 30 40 50 60 10 20 30 40 50 60 is incorrect in the context of CF, which states that coordinates must be monotonic, and I cannot see many other use cases where such a result would work.
  • Also, currently the documentation is not clear about what will happen if you have multiple dimensions with different ranges.

I also do not understand what is the real complexity of implementing it. As I understand the problem, the initial full dataset is some sort of N-d hypercube and then it is being split into parts along any nr of dimensions. When reading multiple files, which are just parts of this hypercube, it should be enough to just find the possible dimension values, form a hypercube and place each files content into the correct slot? What am I missing here?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
391512018 https://github.com/pydata/xarray/issues/2159#issuecomment-391512018 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5MTUxMjAxOA== TomNicholas 35968931 2018-05-23T22:07:30Z 2018-05-23T22:07:30Z MEMBER

@shoyer At the risk of going off on a tangent - I think that only works if the number of guard cells you want to remove can be determined from the data in the dataset you're loading, because preprocess doesn't accept any further arguments.

For example, say you want to remove all ghost cells except the ones at the edge of your simulation domain. If there's no information in each dataset which marks it as a dataset containing a simulation boundary region, then the preprocess function can't know to treat it differently without further arguments. I might be wrong though?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
391509672 https://github.com/pydata/xarray/issues/2159#issuecomment-391509672 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5MTUwOTY3Mg== shoyer 1217238 2018-05-23T21:57:56Z 2018-05-23T21:57:56Z MEMBER

@TomNicholas I think you could use the existing preprocess argument to open_mfdataset() for that.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
391501504 https://github.com/pydata/xarray/issues/2159#issuecomment-391501504 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5MTUwMTUwNA== TomNicholas 35968931 2018-05-23T21:25:12Z 2018-05-23T21:25:12Z MEMBER

Another suggestion: as one of the obvious uses for this is in collecting the output from parallelized simulations, which always have ghost cells around the domain each processor computes on, would it be worth adding an option to throw those away as the mf dataset is loaded? Or is that a task better dealt with by slicing the resultant array after the fact?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
391499524 https://github.com/pydata/xarray/issues/2159#issuecomment-391499524 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5MTQ5OTUyNA== shoyer 1217238 2018-05-23T21:17:42Z 2018-05-23T21:17:42Z MEMBER

I agree with @jhamman that it would take effort from an interested developer to do this but in principle it's quite doable.

I think our logic in auto_combine (which powers open_mfdataset) could probably be extended to handle concatenation across multiple dimensions. The main implementation would need to look at coordinates along concatenated dimensions to break the operation into multiple calls xarray.concat()

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
390689448 https://github.com/pydata/xarray/issues/2159#issuecomment-390689448 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5MDY4OTQ0OA== jhamman 2443309 2018-05-21T15:29:40Z 2018-05-21T15:29:40Z MEMBER

Since you linked to my SO answer, I will add that I think it is quite possible for us to develop this functionality in xarray. My view is that it will take a concerted effort by an interested developer to come up with an approach to do this but it is possible.

Also, I seem to remember seeing this topic before in our issue tracker but I'm not finding it now.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248
390670582 https://github.com/pydata/xarray/issues/2159#issuecomment-390670582 https://api.github.com/repos/pydata/xarray/issues/2159 MDEyOklzc3VlQ29tbWVudDM5MDY3MDU4Mg== chiaral 8453445 2018-05-21T14:28:08Z 2018-05-21T14:28:08Z CONTRIBUTOR

Thanks for opening up this issue. This would be very helpful for the forecasting community as well, where we usually concatenate along Start time and Lead time dimensions. Here, however, was mentioned that it is quite difficult to generalize it, and he suggested a workaround. I know that some people did it for specific datasets, so maybe it would be helpful to add an example to the documentation that shows how this can be implemented on a case by case basis?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Concatenate across multiple dimensions with open_mfdataset 324350248

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 148.701ms · About: xarray-datasette