issue_comments

34 rows where user = 743508 sorted by updated_at descending

issue 15

  • #1161 WIP to vectorize isel_points 9
  • open_mfdataset too many files 6
  • Generated Dask graph is huge - performance issue? 4
  • Remove caching logic from xarray.Variable 2
  • Huge memory use when using FacetGrid 2
  • Support for Scipy Sparse Arrays 2
  • TypeError: invalid type promotion when reading multi-file dataset 1
  • Dataset variable reference fails after renaming 1
  • open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 1
  • Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions 1
  • CF conventions for time doesn't support years 1
  • Keeping attributes when using DataArray.astype 1
  • DataArray.rolling() does not preserve chunksizes in some cases 1
  • to_dataframe fails if dataarray has dimension 1 1
  • Error when rechunking from Zarr store 1

user 1

  • mangecoeur · 34

author_association 1

  • CONTRIBUTOR 34
Columns: id, html_url, issue_url, node_id, user, created_at, updated_at (sorted descending), author_association, body, reactions, performed_via_github_app, issue
1311919228 https://github.com/pydata/xarray/issues/7280#issuecomment-1311919228 https://api.github.com/repos/pydata/xarray/issues/7280 IC_kwDOAMm_X85OMkx8 mangecoeur 743508 2022-11-11T16:27:57Z 2022-11-11T16:27:57Z CONTRIBUTOR

@keewis using your solution things seem to more or less work, except that every operation of course 'loses' the `__array_namespace__` attr, so anything like slicing only half works; on top of that, a lot of indexing operations are not implemented on scipy sparse arrays.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Support for Scipy Sparse Arrays 1445486904
1311902588 https://github.com/pydata/xarray/issues/7280#issuecomment-1311902588 https://api.github.com/repos/pydata/xarray/issues/7280 IC_kwDOAMm_X85OMgt8 mangecoeur 743508 2022-11-11T16:14:12Z 2022-11-11T16:14:12Z CONTRIBUTOR

OK, I had assumed that scipy would have directly implemented the array interface; I will see if there is already an issue open there. Then we can gradually work out what else does and doesn't work.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Support for Scipy Sparse Arrays 1445486904
795114188 https://github.com/pydata/xarray/issues/4380#issuecomment-795114188 https://api.github.com/repos/pydata/xarray/issues/4380 MDEyOklzc3VlQ29tbWVudDc5NTExNDE4OA== mangecoeur 743508 2021-03-10T09:00:48Z 2021-03-10T09:00:48Z CONTRIBUTOR

Running into the same issue when I:

  1. Load input from a Zarr data source
  2. Queue some processing (delayed dask ufuncs)
  3. Re-chunk using chunk() to get the dask task size I want
  4. Use to_zarr to trigger the calculation (dask distributed backend) and save to a new file on disk

I get the chunk-size mismatch error, which I work around by manually overwriting the encoding['chunks'] value; that seems unintuitive to me. Since I'm going from one Zarr store to another, I assumed that calling chunk() would set the chunk size for both the dask arrays and the Zarr output, given that calling to_zarr on a dask array only works if the dask chunks and the Zarr encoding chunks match.

I didn't realize the overwrite_encoded_chunks option existed, but it's also a bit confusing that to get the right chunk size on the output I need to set the overwrite option on the input.
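
A minimal sketch of the workflow and workaround described above (the store paths, the `time` dimension, and the chunk sizes are hypothetical); dropping the stale `chunks` entry from each variable's encoding is one way to avoid the mismatch when writing back to Zarr:

```python
import xarray as xr

ds = xr.open_zarr("input.zarr")      # chunking comes from the store's encoding
# ... queue delayed processing here ...
ds = ds.chunk({"time": 1000})        # re-chunk the dask arrays to the task size you want

# Drop the Zarr chunk encoding carried over from the input store so it cannot
# conflict with the new dask chunks when writing.
for name in ds.variables:
    ds[name].encoding.pop("chunks", None)

ds.to_zarr("output.zarr", mode="w")  # triggers the computation and writes to disk
```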

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Error when rechunking from Zarr store 686608969
602795869 https://github.com/pydata/xarray/issues/1378#issuecomment-602795869 https://api.github.com/repos/pydata/xarray/issues/1378 MDEyOklzc3VlQ29tbWVudDYwMjc5NTg2OQ== mangecoeur 743508 2020-03-23T19:02:26Z 2020-03-23T19:02:26Z CONTRIBUTOR

Just wondering what the status of this is. I've been running into bugs trying to model symmetric distance matrices using the same dimension twice. Interestingly, it does work very well for selecting, e.g. if I use .sel(nodes=node_list) on a square matrix I correctly get a square matrix subset 👍 But unfortunately a lot of other things seem to break, e.g. concatenating fails with ValueError: axes don't match array :( What would need to happen to make this work?
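
A minimal sketch of the repeated-dimension situation described above (the names are illustrative and the exact behaviour varies across xarray versions): selection returns a square subset, while concatenation raises the quoted error.

```python
import numpy as np
import xarray as xr

nodes = ["a", "b", "c"]
# a symmetric distance matrix that uses the same dimension twice
dist = xr.DataArray(np.zeros((3, 3)), coords={"nodes": nodes}, dims=("nodes", "nodes"))

dist.sel(nodes=["a", "b"])          # works: a 2x2 square subset
xr.concat([dist, dist], dim="run")  # breaks: ValueError: axes don't match array
```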

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions 222676855
584701023 https://github.com/pydata/xarray/issues/2049#issuecomment-584701023 https://api.github.com/repos/pydata/xarray/issues/2049 MDEyOklzc3VlQ29tbWVudDU4NDcwMTAyMw== mangecoeur 743508 2020-02-11T15:47:28Z 2020-02-11T15:48:08Z CONTRIBUTOR

Just ran into this issue: it is present in 0.15 and also does not respect the option keep_attrs=True.
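
A minimal reproduction of the behaviour described, as of the affected releases (later versions changed astype to keep attributes); the data here is illustrative.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3.0), dims="x", attrs={"units": "K"})

with xr.set_options(keep_attrs=True):
    # {} in the affected versions, despite keep_attrs=True
    print(da.astype(np.float32).attrs)
```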

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Keeping attributes when using DataArray.astype 313010564
583488834 https://github.com/pydata/xarray/issues/3761#issuecomment-583488834 https://api.github.com/repos/pydata/xarray/issues/3761 MDEyOklzc3VlQ29tbWVudDU4MzQ4ODgzNA== mangecoeur 743508 2020-02-07T16:37:05Z 2020-02-07T16:37:05Z CONTRIBUTOR

I think it makes sense to support the conversion. Perhaps a better example is with a dataset:

```python
x = np.arange(10)
y = np.arange(10)

data = np.zeros((len(x), len(y)))

ds = xr.Dataset({k: xr.DataArray(data, coords=[x, y], dims=['x', 'y']) for k in ['a', 'b', 'c']})
ds.sel(x=1, y=1)
```

```
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    x        int64 1
    y        int64 1
Data variables:
    a        float64 0.0
    b        float64 0.0
    c        float64 0.0
```

The output is a dataset of scalars, which converts fairly intuitively to a single-row dataframe. But the following throws the same error.

```python
ds.sel(x=1, y=1).to_dataframe()
```

Or think of it another way: isn't it very unintuitive that converting a single-item dataset to a dataframe works only if the item was selected using a length-1 list? To me that seems like a very arbitrary restriction. Following that logic, it also makes sense to have consistent behaviour between Datasets and DataArrays (even if you end up producing a single-element table).
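
A sketch of the contrast described above, rebuilding the dataset from the earlier snippet: selecting with length-1 lists keeps the dimensions, so the conversion succeeds, while the scalar selection hit the error in the versions discussed here.

```python
import numpy as np
import xarray as xr

x = np.arange(10)
y = np.arange(10)
data = np.zeros((len(x), len(y)))
ds = xr.Dataset({k: xr.DataArray(data, coords=[x, y], dims=['x', 'y']) for k in ['a', 'b', 'c']})

print(ds.sel(x=[1], y=[1]).to_dataframe())  # length-1 lists: works, one-row dataframe
# ds.sel(x=1, y=1).to_dataframe()           # scalar selection: raised the error discussed above
```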

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_dataframe fails if dataarray has dimension 1 561539035
460174589 https://github.com/pydata/xarray/issues/2531#issuecomment-460174589 https://api.github.com/repos/pydata/xarray/issues/2531 MDEyOklzc3VlQ29tbWVudDQ2MDE3NDU4OQ== mangecoeur 743508 2019-02-04T09:06:14Z 2019-02-04T09:06:43Z CONTRIBUTOR

Perhaps related - I was running into MemoryErrors with a large array and also noticed that chunk sizes were not respected (basically xarray tried to process the array in one go). It turned out that I'd forgotten to install both bottleneck and numexpr; after installing both (just installing bottleneck was not enough), everything worked as expected.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.rolling() does not preserve chunksizes in some cases 376154741
311621960 https://github.com/pydata/xarray/issues/1467#issuecomment-311621960 https://api.github.com/repos/pydata/xarray/issues/1467 MDEyOklzc3VlQ29tbWVudDMxMTYyMTk2MA== mangecoeur 743508 2017-06-28T10:33:33Z 2017-06-28T10:33:33Z CONTRIBUTOR

I think I do mean 'years' in the CF convention sense; in this case the time dimension is:

```
double time(time=145);
  :standard_name = "time";
  :units = "years since 1860-1-1 12:00:00";
  :calendar = "proleptic_gregorian";
```

This is correctly interpreted by the NASA Panoply NetCDF file viewer. From glancing at the xarray code, it seems to depend on the pandas Timedelta object, which in turn doesn't support years as deltas (although date ranges can be generated at yearly intervals, so it should be possible to implement).
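
A minimal sketch of the decoding problem described (the values are hypothetical; whether this raises or merely leaves the values undecoded depends on the xarray/cftime version):

```python
import xarray as xr

ds = xr.Dataset(
    coords={"time": ("time", [0, 1, 2],
                     {"units": "years since 1860-1-1 12:00:00",
                      "calendar": "proleptic_gregorian"})}
)
decoded = xr.decode_cf(ds)  # "years" is not a supported unit for CF time decoding
```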

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CF conventions for time doesn't support years 238990919
303857073 https://github.com/pydata/xarray/issues/1424#issuecomment-303857073 https://api.github.com/repos/pydata/xarray/issues/1424 MDEyOklzc3VlQ29tbWVudDMwMzg1NzA3Mw== mangecoeur 743508 2017-05-24T21:28:44Z 2017-05-24T21:28:44Z CONTRIBUTOR

Dataset isn't chunked, and yes I am using cartopy to draw coastlines following the example in the docs:

```python
p = heatwaves_pop.plot(x='longitude', y='latitude', col='time', col_wrap=3,
                       cmap='RdBu_r', vmin=-v_both, vmax=v_both, size=2,
                       subplot_kws=dict(projection=crs.PlateCarree()))
for ax in p.axes.flat:
    ax.coastlines()
```

where heatwaves_pop is calculated from a bunch of other xarray datasets. What surprised me is that they should all already have been loaded into memory, so I did not expect a further increase in memory use.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Huge memory use when using FacetGrid 231061878
303748239 https://github.com/pydata/xarray/issues/1424#issuecomment-303748239 https://api.github.com/repos/pydata/xarray/issues/1424 MDEyOklzc3VlQ29tbWVudDMwMzc0ODIzOQ== mangecoeur 743508 2017-05-24T14:51:06Z 2017-05-24T14:51:06Z CONTRIBUTOR

16 maps, although like you say, I'm not sure if this is coming from xarray or matplotlib

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Huge memory use when using FacetGrid 231061878
285052725 https://github.com/pydata/xarray/issues/1301#issuecomment-285052725 https://api.github.com/repos/pydata/xarray/issues/1301 MDEyOklzc3VlQ29tbWVudDI4NTA1MjcyNQ== mangecoeur 743508 2017-03-08T14:20:30Z 2017-03-08T14:20:30Z CONTRIBUTOR

My 2 cents: I've found that with big files any %prun tends to show method 'acquire' of '_thread.lock' among the highest times, but that's not necessarily indicative of where the perf issue comes from, because it's effectively just waiting for IO, which is always slow. One thing that helps is setting the dask scheduler to the non-parallel synchronous option, which gives cleaner profiles.
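
A hedged sketch of the tip above, using current dask configuration names (the comment predates them): running the computation on the single-threaded synchronous scheduler keeps thread-lock waits out of the profile. The dataset here is a stand-in for whatever you are actually profiling.

```python
import dask
import dask.array
import xarray as xr

# a small lazy dataset standing in for the real workload
ds = xr.Dataset({"t2m": (("time", "x"), dask.array.zeros((1000, 100), chunks=(100, 100)))})

with dask.config.set(scheduler="synchronous"):
    result = ds["t2m"].mean("time").compute()  # profile this call, e.g. with %prun
```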

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 212561278
274602298 https://github.com/pydata/xarray/pull/1162#issuecomment-274602298 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI3NDYwMjI5OA== mangecoeur 743508 2017-01-23T20:09:24Z 2017-01-23T20:09:24Z CONTRIBUTOR

Crikey. Fixed the merge, hopefully it works (I hate merge conflicts).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
274567523 https://github.com/pydata/xarray/pull/1162#issuecomment-274567523 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI3NDU2NzUyMw== mangecoeur 743508 2017-01-23T18:04:09Z 2017-01-23T18:04:09Z CONTRIBUTOR

OK added a performance improvements section to the docs

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
274564256 https://github.com/pydata/xarray/pull/1162#issuecomment-274564256 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI3NDU2NDI1Ng== mangecoeur 743508 2017-01-23T17:52:33Z 2017-01-23T17:52:33Z CONTRIBUTOR

Note: waiting for 0.9.0 to be released before updating what's new; I don't want to end up with conflicts in the docs.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
272844516 https://github.com/pydata/xarray/pull/1162#issuecomment-272844516 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI3Mjg0NDUxNg== mangecoeur 743508 2017-01-16T11:59:01Z 2017-01-16T11:59:01Z CONTRIBUTOR

Ok will wait for 0.9.0 to be released

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
272715240 https://github.com/pydata/xarray/pull/1162#issuecomment-272715240 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI3MjcxNTI0MA== mangecoeur 743508 2017-01-15T18:53:26Z 2017-01-15T18:53:26Z CONTRIBUTOR

Completed changes based on recommendations and cleaned up old code and comments.

As for benchmarks, I don't have anything rigorous, but I do have the following example: weather data from the CFSR dataset, 7 variables at hourly resolution, collected in one netCDF3 file per variable per month. In this particular case the difference is striking!

```python
%%time
data = dataset.isel_points(time=np.arange(0, 1000),
                           lat=np.ones(1000, dtype=int),
                           lon=np.ones(1000, dtype=int))
data.load()
```

Results:

```
xarray 0.8.2
CPU times: user 1min 21s, sys: 41.5 s, total: 2min 2s
Wall time: 47.8 s

master
CPU times: user 385 ms, sys: 238 ms, total: 623 ms
Wall time: 288 ms
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
269093854 https://github.com/pydata/xarray/pull/1162#issuecomment-269093854 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI2OTA5Mzg1NA== mangecoeur 743508 2016-12-24T17:49:10Z 2016-12-24T17:49:10Z CONTRIBUTOR

@shoyer Tidied up based on your recommendations; now everything is done in a single loop (I still need to distinguish between variables and coordinates for the output, but it's a lot neater).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
269026887 https://github.com/pydata/xarray/pull/1162#issuecomment-269026887 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI2OTAyNjg4Nw== mangecoeur 743508 2016-12-23T18:13:52Z 2016-12-23T18:25:03Z CONTRIBUTOR

OK I adjusted for the new behaviour and all tests pass locally, hopefully travis agrees...

Edit: Looks like it's green

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
268927305 https://github.com/pydata/xarray/pull/1162#issuecomment-268927305 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI2ODkyNzMwNQ== mangecoeur 743508 2016-12-23T01:42:03Z 2016-12-23T01:42:03Z CONTRIBUTOR

@shoyer I'm down to 1 test failing locally in sel_points but not sure what the desired behaviour is. I get:

```
<xarray.Dataset>
Dimensions:  (points: 3)
Coordinates:
  * points   (points) int64 0 1 2
Data variables:
    foo      (points) int64 0 4 8
```

instead of

```
AssertionError: <xarray.Dataset>
Dimensions:  (points: 3)
Coordinates:
  o points   (points) -
Data variables:
    foo      (points) int64 0 4 8
```

But here I'm not sure whether my code is wrong or the test is. It seems that the test requires sel_points NOT to generate new coordinate values for points; however, I'm pretty sure isel_points does require this (it passes in any case). I don't really see a way in my code to generate subsets without having a matching coordinate array (I don't know how to use the Dataset constructors without one, for instance).

I've updated the test according to how I think it should be working, but please correct me if I misunderstood.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
266995169 https://github.com/pydata/xarray/pull/1162#issuecomment-266995169 https://api.github.com/repos/pydata/xarray/issues/1162 MDEyOklzc3VlQ29tbWVudDI2Njk5NTE2OQ== mangecoeur 743508 2016-12-14T10:10:11Z 2016-12-14T10:10:36Z CONTRIBUTOR

So it seems to work fine in the Dask case, but I don't have a deep understanding of how DataArrays are constructed from arrays and dims, so it fails in the non-dask case. I'm also not sure how you feel about making a special case for the dask backend here (since up till now it was all backend-agnostic).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  #1161 WIP to vectorize isel_points 195125296
266598007 https://github.com/pydata/xarray/issues/1161#issuecomment-266598007 https://api.github.com/repos/pydata/xarray/issues/1161 MDEyOklzc3VlQ29tbWVudDI2NjU5ODAwNw== mangecoeur 743508 2016-12-13T00:29:16Z 2016-12-13T00:29:16Z CONTRIBUTOR

Seems to run a lot faster for me too...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generated Dask graph is huge - performance issue? 195050684
266596464 https://github.com/pydata/xarray/issues/1161#issuecomment-266596464 https://api.github.com/repos/pydata/xarray/issues/1161 MDEyOklzc3VlQ29tbWVudDI2NjU5NjQ2NA== mangecoeur 743508 2016-12-13T00:20:12Z 2016-12-13T00:20:12Z CONTRIBUTOR

Done with PR #1162

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generated Dask graph is huge - performance issue? 195050684
266587849 https://github.com/pydata/xarray/issues/1161#issuecomment-266587849 https://api.github.com/repos/pydata/xarray/issues/1161 MDEyOklzc3VlQ29tbWVudDI2NjU4Nzg0OQ== mangecoeur 743508 2016-12-12T23:32:19Z 2016-12-12T23:33:03Z CONTRIBUTOR

Thanks, I've been looking around and I think I'm getting close; however, I'm not sure of the best way to turn the array slice I get from vindex into a DataArray variable. I'm thinking I might put together a draft PR for comments. This is what I have so far:

```python

def isel_points(self, dim='points', **indexers):
    """Returns a new dataset with each array indexed pointwise along the specified dimension(s).

This method selects pointwise values from each array and is akin to
the NumPy indexing behavior of `arr[[0, 1], [0, 1]]`, except this
method does not require knowing the order of each array's dimensions.

Parameters
----------
dim : str or DataArray or pandas.Index or other list-like object, optional
    Name of the dimension to concatenate along. If dim is provided as a
    string, it must be a new dimension name, in which case it is added
    along axis=0. If dim is provided as a DataArray or Index or
    list-like object, its name, which must not be present in the
    dataset, is used as the dimension to concatenate along and the
    values are added as a coordinate.
**indexers : {dim: indexer, ...}
    Keyword arguments with names matching dimensions and values given
    by array-like objects. All indexers must be the same length and
    1 dimensional.

Returns
-------
obj : Dataset
    A new Dataset with the same contents as this dataset, except each
    array and dimension is indexed by the appropriate indexers. With
    pointwise indexing, the new Dataset will always be a copy of the
    original.

See Also
--------
Dataset.sel
Dataset.isel
Dataset.sel_points
DataArray.isel_points
"""
from .dataarray import DataArray

indexer_dims = set(indexers)

def relevant_keys(mapping):
    return [k for k, v in mapping.items()
            if any(d in indexer_dims for d in v.dims)]

data_vars = relevant_keys(self.data_vars)
coords = relevant_keys(self.coords)

# all the indexers should be iterables
keys = indexers.keys()
indexers = [(k, np.asarray(v)) for k, v in iteritems(indexers)]
# Check that indexers are valid dims, integers, and 1D
for k, v in indexers:
    if k not in self.dims:
        raise ValueError("dimension %s does not exist" % k)
    if v.dtype.kind != 'i':
        raise TypeError('Indexers must be integers')
    if v.ndim != 1:
        raise ValueError('Indexers must be 1 dimensional')

# all the indexers should have the same length
lengths = set(len(v) for k, v in indexers)
if len(lengths) > 1:
    raise ValueError('All indexers must be the same length')

# Existing dimensions are not valid choices for the dim argument
if isinstance(dim, basestring):
    if dim in self.dims:
        # dim is an invalid string
        raise ValueError('Existing dimension names are not valid '
                         'choices for the dim argument in sel_points')
elif hasattr(dim, 'dims'):
    # dim is a DataArray or Coordinate
    if dim.name in self.dims:
        # dim already exists
        raise ValueError('Existing dimensions are not valid choices '
                         'for the dim argument in sel_points')

if not utils.is_scalar(dim) and not isinstance(dim, DataArray):
    dim = as_variable(dim, name='points')

variables = OrderedDict()
indexers_dict = dict(indexers)
non_indexed = list(set(self.dims) - indexer_dims)

# TODO need to figure out how to make sure we get the indexed vs non indexed dimensions in the right order
for name, var in self.variables.items():
    slc = []

    for k in var.dims:
        if k in indexers_dict:
            slc.append(indexers_dict[k])
        else:
            slc.append(slice(None, None))
    if hasattr(var.data, 'vindex'):
        variables[name] = DataArray(var.data.vindex[tuple(slc)], name=name)
    else:
        variables[name] = var[tuple(slc)]

points_len = lengths.pop()

new_variables = OrderedDict()
for name, var in variables.items():
    if name not in self.dims:
        coords = [variables[k] for k in non_indexed]
        new_variables[name] = DataArray(var, coords=[np.arange(points_len)] + coords, dims=[dim] + non_indexed)

return xr.merge([v for k,v in new_variables.items() if k not in selection.dims])
# TODO: This would be sped up with vectorized indexing. This will
# require dask to support pointwise indexing as well.

return concat([self.isel(**d) for d in
               [dict(zip(keys, inds)) for inds in
                zip(*[v for k, v in indexers])]],
              dim=dim, coords=coords, data_vars=data_vars)

```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generated Dask graph is huge - performance issue? 195050684
266519121 https://github.com/pydata/xarray/issues/1161#issuecomment-266519121 https://api.github.com/repos/pydata/xarray/issues/1161 MDEyOklzc3VlQ29tbWVudDI2NjUxOTEyMQ== mangecoeur 743508 2016-12-12T18:59:15Z 2016-12-12T18:59:15Z CONTRIBUTOR

OK, I will have a look. Where is this implemented? (I always seem to have trouble pinpointing the dask-specific bits in the codebase :S)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generated Dask graph is huge - performance issue? 195050684
265966887 https://github.com/pydata/xarray/pull/1128#issuecomment-265966887 https://api.github.com/repos/pydata/xarray/issues/1128 MDEyOklzc3VlQ29tbWVudDI2NTk2Njg4Nw== mangecoeur 743508 2016-12-09T09:08:48Z 2016-12-09T09:08:48Z CONTRIBUTOR

@shoyer thanks, with a little testing it seems lock=False is fine (so you don't necessarily need dask dev for lock=dask.utils.SerializableLock()). Using a spawning pool is necessary; it just doesn't work without one. It also looks like the dask distributed IPython backend works fine (it works similarly to a spawn pool in that the worker engines aren't forked but live in their own little world) - this is really nice because IPython in turn has good support for HPC systems (SGE batch scheduling + MPI for process handling).
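
A hedged sketch of the setup described, translated to current dask configuration names (the comment's era used dask.set_options(get=dask.multiprocessing.get); the file pattern here is hypothetical and the lock keyword has moved around between xarray versions):

```python
import dask
import xarray as xr

# process-based scheduler with a spawn (not fork) worker context - the
# "spawning pool" mentioned above
with dask.config.set({"multiprocessing.context": "spawn"}, scheduler="processes"):
    ds = xr.open_mfdataset("cfsr/*.nc", lock=False)  # lock=False as reported above
    subset = ds.isel(time=slice(0, 24)).compute()
```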

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Remove caching logic from xarray.Variable 189817033
265875012 https://github.com/pydata/xarray/pull/1128#issuecomment-265875012 https://api.github.com/repos/pydata/xarray/issues/1128 MDEyOklzc3VlQ29tbWVudDI2NTg3NTAxMg== mangecoeur 743508 2016-12-08T22:28:25Z 2016-12-08T22:28:25Z CONTRIBUTOR

I'm trying out the latest code to subset a set of netCDF4 files with dask.multiprocessing, using set_options(get=dask.multiprocessing.get), but I'm still getting TypeError: can't pickle _thread.lock objects - is this expected, or is there something specific I need to do to make it work?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Remove caching logic from xarray.Variable 189817033
230289863 https://github.com/pydata/xarray/issues/894#issuecomment-230289863 https://api.github.com/repos/pydata/xarray/issues/894 MDEyOklzc3VlQ29tbWVudDIzMDI4OTg2Mw== mangecoeur 743508 2016-07-04T13:23:53Z 2016-07-04T13:23:53Z CONTRIBUTOR

I think this is also a bug if you load a multi-file dataset: when you rename it you get a new dataset, but when you trigger a read it goes back to the original files, which haven't been renamed on disk.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset variable reference fails after renaming 163414759
223918870 https://github.com/pydata/xarray/issues/463#issuecomment-223918870 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzkxODg3MA== mangecoeur 743508 2016-06-06T10:09:48Z 2016-06-06T10:09:48Z CONTRIBUTOR

So, using a cleaner minimal example, it does appear that the files are closed after the dataset is closed. However, they are all open during dataset loading - this is what blows past the OSX default max-open-files limit.

I think this could be a real issue when using Xarray to handle too-big-for-RAM datasets - you could easily be trying to access thousands of files (especially with weather data), so Xarray should limit the number it holds open at any one time during data load. Not being familiar with the internals, I'm not sure whether this is an issue in Xarray itself or in the Dask backend.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223905394 https://github.com/pydata/xarray/issues/463#issuecomment-223905394 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzkwNTM5NA== mangecoeur 743508 2016-06-06T09:06:33Z 2016-06-06T09:06:33Z CONTRIBUTOR

@shoyer thanks - here's how I'm using open_mfdataset (not using any options). I'm going to try the h5netcdf backend to see if I get the same results. I'm still not 100% confident that I'm tracking open files correctly with lsof, so I'm going to try to make a minimal example to investigate.

``` python

def weather_dataset(root_path: Path, *, start_date: datetime = None, end_date: datetime = None):
    flat_files_paths = get_dset_file_paths(root_path, start_date=start_date, end_date=end_date)
    # Convert Paths to list of strings for xarray
    dataset = xr.open_mfdataset([str(f) for f in flat_files_paths])
    return dataset

def cfsr_weather_loader(db, site_lookup_fn=None, dset_start=None, dset_end=None, site_conf=None):
    # Pull values out of the
    dt_conf = site_conf if site_conf else WEATHER_CFSR
    dset_start = dset_start if dset_start else dt_conf['start_dt']
    dset_end = dset_end if dset_end else dt_conf['end_dt']

    if site_lookup_fn is None:
        site_lookup_fn = site_lookup_postcode_district

    def weather_loader(site_id, start_date, end_date, resample=None):
        # using the tuple because always getting mixed up with lon/lat
        geo_lookup = site_lookup_fn(site_id, db)

        # With statement should ensure dset is closed after loading.
        with weather_dataset(WEATHER_CFSR['path'],
                             start_date=dset_start,
                             end_date=dset_end) as weather:
            data = weighted_regional_timeseries(weather, start_date, end_date,
                                                lon=geo_lookup.lon,
                                                lat=geo_lookup.lat,
                                                weights=geo_lookup.weights)

        # RENAME from CFSR standard
        data = data.rename(columns=WEATHER_RENAME)

        if resample is not None:
            data = data.resample(resample).mean()
        data.irradiance /= 1000.0  # convert irradiance to kW
        return data

    return weather_loader

```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223837612 https://github.com/pydata/xarray/issues/463#issuecomment-223837612 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzgzNzYxMg== mangecoeur 743508 2016-06-05T21:05:40Z 2016-06-05T21:05:40Z CONTRIBUTOR

So on investigation, even though my dataset creation is wrapped in a with block, using lsof to check the file handles held by my iPython kernel suggests that all the input files are still open. Are you certain that the backend correctly closes files in a multifile dataset? Is there a way to explicitly force this to happen?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223810723 https://github.com/pydata/xarray/issues/463#issuecomment-223810723 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzgxMDcyMw== mangecoeur 743508 2016-06-05T12:34:11Z 2016-06-05T12:34:11Z CONTRIBUTOR

I still hit this issue after wrapping my open_mfdataset in a with statement. I suspect it's an OSX problem: macOS has a very low default max-open-files limit for applications started from the shell (like 256). It's not yet clear to me whether my datasets are being correctly closed; investigating...
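
A hedged workaround sketch for the low default limit mentioned above: raising the soft open-file limit for the current process from Python (the target of 4096 is arbitrary, and the hard limit still caps what can be set).

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```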

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223687053 https://github.com/pydata/xarray/issues/463#issuecomment-223687053 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzY4NzA1Mw== mangecoeur 743508 2016-06-03T20:31:56Z 2016-06-03T20:31:56Z CONTRIBUTOR

It seems to happen even with a freshly restarted notebook, but I'll try a with statement to see if it helps.

On 3 Jun 2016 19:53, "Stephan Hoyer" notifications@github.com wrote:

I suspect you hit this in IPython after rerunning cells, because file handles are only automatically closed when programs exit. You might find it a good idea to explicitly close files by calling .close() (or using a "with" statement) on Datasets opened with open_mfdataset.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
223651454 https://github.com/pydata/xarray/issues/463#issuecomment-223651454 https://api.github.com/repos/pydata/xarray/issues/463 MDEyOklzc3VlQ29tbWVudDIyMzY1MTQ1NA== mangecoeur 743508 2016-06-03T18:08:24Z 2016-06-03T18:08:24Z CONTRIBUTOR

I'm also running into this error - but strangely it only happens when using IPython interactive backend. I have some tests which work fine, but doing the same in IPython fails.

I'm opening a few hundred files (about 10Mb each, one per month across a few variables). I'm using the default NetCDF backend.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset too many files 94328498
222995827 https://github.com/pydata/xarray/issues/864#issuecomment-222995827 https://api.github.com/repos/pydata/xarray/issues/864 MDEyOklzc3VlQ29tbWVudDIyMjk5NTgyNw== mangecoeur 743508 2016-06-01T13:42:21Z 2016-06-01T13:42:59Z CONTRIBUTOR

On further investigation, it appears the problem is that the dataset contains a mix of string and float data - the strings are redundant representations of the timestamp, so they don't appear in the index query. When I try to convert to an array, numpy chokes on the mixed types. Explicitly selecting the desired data variable solves this:

```python
selection = cfsr_new.TMP_L103.sel(lon=lon_sel, lat=lat_sel, time=time_sel)
```

I think a clearer error message may be needed: when you do sel without indexing on certain dimensions, those are included in the resulting selection. It's possible for those to be of mixed incompatible types. Clearly to do to_array you need a numpy-friendly uniform type. The error should make this clearer.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  TypeError: invalid type promotion when reading multi-file dataset 157886730

```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
```