issue_comments

37 rows where author_association = "NONE" and user = 2418513 sorted by updated_at descending

issue 15

  • Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 10
  • [WIP] Support nano second time encoding. 6
  • xarray.concat() with compat='identical' fails for DataArray attrs 3
  • Millisecond precision is lost on datetime64 during IO roundtrip 3
  • KeyError on selecting empty time slice from a datetime-indexed Dataset 2
  • DataArray plotting: pyplot compat and passing the style 2
  • mypy --strict fails on scripts/packages depending on xarray; __all__ required 2
  • hardcoded xarray.__all__ 2
  • Creation of an empty DataArray 1
  • xr.concat loses coordinate dtype information with recarrays in 0.9 1
  • Explicit indexes in xarray's data-model (Future of MultiIndex) 1
  • Structured numpy arrays, xarray and netCDF(4) 1
  • keepdims=True for xarray reductions 1
  • Dataset.from_records()? 1
  • Ensure maximum accuracy when encoding and decoding np.datetime64[ns] values 1

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
807126680 https://github.com/pydata/xarray/issues/2857#issuecomment-807126680 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNzEyNjY4MA== aldanor 2418513 2021-03-25T17:17:48Z 2021-03-25T17:18:21Z NONE

OK, we might check if that depends on the data size or on the number of groups, or both.

It seems to scale with data size, but even if you reduce the data size to 1 element, after 50 iterations a single write already goes up to 150 ms (whereas it's a few milliseconds in an empty file). Those 150 ms are the pure 'file traversal' part; the rest (of the 2 seconds) is the part where it seemingly reads stuff, which scales with data. Ideally it should just stay at <10 ms all the time.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806863336 https://github.com/pydata/xarray/issues/2857#issuecomment-806863336 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjg2MzMzNg== aldanor 2418513 2021-03-25T14:35:28Z 2021-03-25T17:15:06Z NONE

> I wonder if it would help to use the same underlying h5py.File or h5netcdf.File when appending.

I don't think it's about what's happening in the current Python process or which instances are being cached; it's about the general logic.

For instance, take the example above: run it once (e.g. with the range set to 50), then run it again with the block that clears the file commented out and the range set to 50-100. The very first dataset written in the second run will already be very slow, slower than the last dataset written in the first run, which means it's not about reusing the same File instance.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806776909 https://github.com/pydata/xarray/issues/2857#issuecomment-806776909 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjc3NjkwOQ== aldanor 2418513 2021-03-25T13:48:04Z 2021-03-25T13:48:29Z NONE

Without digging into implementation details, my logic as a library user would be this:

  • If I write one dataset to file1 and another dataset to file2 using to_netcdf(), to different groups
  • And then I simply combine the two hdf5 files using some external tools (again, datasets stored in different groups)
  • I will be able to read them both perfectly well using open_dataset() or load_dataset()
  • This implies that the datasets can be written just fine independently without knowing about each other
  • Why, then, do those writing functions (flush in particular) traverse and read the entire file every time?
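
The scaling argument in the list above can be made concrete with a toy cost model. This is a sketch, not h5netcdf code: `traversing_write_cost`, `independent_write_cost` and `total_cost` are hypothetical names, and the assumption is that a traversing writer touches every group already in the file once per write.

```python
def traversing_write_cost(existing_groups: int) -> int:
    """Assumed cost of one write that visits every existing group plus the new one."""
    return existing_groups + 1


def independent_write_cost(existing_groups: int) -> int:
    """Assumed cost of one write that only touches the group being written."""
    return 1


def total_cost(n_writes: int, cost_per_write) -> int:
    """Sum the per-write cost over n_writes sequential writes to the same file."""
    return sum(cost_per_write(k) for k in range(n_writes))


for n in (50, 100, 250):
    quadratic = total_cost(n, traversing_write_cost)   # n * (n + 1) / 2
    linear = total_cost(n, independent_write_cost)     # n
    print(f"{n} writes: traversing={quadratic} units, independent={linear} units")
```

Under that assumption the total work for n writes is n(n+1)/2 units, which is where the quadratic growth in the issue title would come from; a writer that only touched its own group would stay at n units.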
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806767981 https://github.com/pydata/xarray/issues/2857#issuecomment-806767981 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjc2Nzk4MQ== aldanor 2418513 2021-03-25T13:44:22Z 2021-03-25T13:45:04Z NONE

Just checked it out.

| Number of datasets in file | netCDF4 (ms/write) | h5netcdf (ms/write) |
| --- | --- | --- |
| 1 | 4 | 11 |
| 250 | 142 | 1933 |
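
For reference, the ratios implied by the table above can be worked out directly. The snippet only restates the four measured numbers; the dictionary layout and variable names are illustrative.

```python
# Timings copied from the table above (milliseconds per write).
ms_per_write = {
    "netCDF4": {1: 4, 250: 142},
    "h5netcdf": {1: 11, 250: 1933},
}

# Growth of a single write's cost between 1 and 250 datasets in the file.
growth = {engine: t[250] / t[1] for engine, t in ms_per_write.items()}

# How much longer h5netcdf takes than netCDF4 at each file size.
relative = {
    n: ms_per_write["h5netcdf"][n] / ms_per_write["netCDF4"][n] for n in (1, 250)
}

for engine, factor in growth.items():
    print(f"{engine}: one write is {factor:.1f}x slower at 250 datasets than at 1")
for n, factor in relative.items():
    print(f"{n} dataset(s): h5netcdf takes {factor:.1f}x as long as netCDF4")
```

Both backends slow down as the file fills up, but the per-write cost grows about 35x for netCDF4 and about 176x for h5netcdf.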

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806740965 https://github.com/pydata/xarray/issues/2857#issuecomment-806740965 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjc0MDk2NQ== aldanor 2418513 2021-03-25T13:27:17Z 2021-03-25T13:27:17Z NONE

Here's the minimal example, try running this:

```python
import time

import h5py
import numpy as np
import xarray as xr

# One 50_000 x 3 integer array, wrapped in a Dataset.
arr = xr.DataArray(np.random.RandomState(0).randint(-100, 100, (50_000, 3)), dims=['x', 'y'])
ds = xr.Dataset({'arr': arr})

filename = 'test.h5'
# Append the same dataset to the file under a new group each time.
save = lambda group: ds.to_netcdf(filename, engine='h5netcdf', mode='a', group=str(group))

# Start from an empty file.
with h5py.File(filename, 'w') as _:
    pass

# Print the wall-clock time of each individual write.
for i in range(250):
    t0 = time.time()
    save(i)
    print(time.time() - t0)
```

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806713825 https://github.com/pydata/xarray/issues/2857#issuecomment-806713825 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjcxMzgyNQ== aldanor 2418513 2021-03-25T13:10:13Z 2021-03-25T13:10:13Z NONE

Is it possible to use .to_netcdf() without h5netcdf.File touching any of the pre-existing data or attempting to read it or traverse it? This will inevitably cause quadratic slowdowns as you write multiple datasets to the file - and that's what seems to be happening.

Or at least, don't traverse anything above the current root group that the dataset is being written into.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806711702 https://github.com/pydata/xarray/issues/2857#issuecomment-806711702 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjcxMTcwMg== aldanor 2418513 2021-03-25T13:08:46Z 2021-03-25T13:08:46Z NONE

@kmuehlbauer Just installed h5netcdf=0.10.0; here are the timings with 200 groups in the file. store.close() takes 92.6% of the time again:

  1078         1          1.0      1.0      0.0      try:
  1079                                                   # TODO: allow this work (setting up the file for writing array data)
  1080                                                   # to be parallelized with dask
  1081         2     221642.0 110821.0      4.2          dump_to_store(
  1082         1          2.0      2.0      0.0              dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
  1083                                                   )
  1084         1          3.0      3.0      0.0          if autoclose:
  1085                                                       store.close()
  1086
  1087         1          1.0      1.0      0.0          if multifile:
  1088                                                       return writer, store
  1089
  1090         1          6.0      6.0      0.0          writes = writer.sync(compute=compute)
  1091
  1092         1          1.0      1.0      0.0          if path_or_file is None:
  1093                                                       store.sync()
  1094                                                       return target.getvalue()
  1095                                               finally:
  1096         1          2.0      2.0      0.0          if not multifile and compute:
  1097         1    4857912.0 4857912.0     92.6              store.close()

And here's _lookup_dimensions(). Note that it only accounts for about half of the time; there's a lot of other time spent in File.flush() that I don't understand:

```
Timer unit: 1e-06 s

Total time: 2.44857 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 92

Line #      Hits         Time  Per Hit   % Time  Line Contents
    92                                               def _lookup_dimensions(self):
    93       400      65513.0    163.8      2.7          attrs = self._h5ds.attrs
    94       400       6175.0     15.4      0.3          if "_Netcdf4Coordinates" in attrs:
    95                                                       order_dim = _reverse_dict(self._parent._dim_order)
    96                                                       return tuple(
    97                                                           order_dim[coord_id] for coord_id in attrs["_Netcdf4Coordinates"]
    98                                                       )
    99
   100       400      44938.0    112.3      1.8          child_name = self.name.split("/")[-1]
   101       400       5006.0     12.5      0.2          if child_name in self._parent.dimensions:
   102                                                       return (child_name,)
   103
   104       400        350.0      0.9      0.0          dims = []
   105       400        781.0      2.0      0.0          phony_dims = defaultdict(int)
   106      1400     166093.0    118.6      6.8          for axis, dim in enumerate(self._h5ds.dims):
   107                                                       # get current dimension
   108      1000     119507.0    119.5      4.9              dimsize = self.shape[axis]
   109      1000       2459.0      2.5      0.1              phony_dims[dimsize] += 1
   110      1000      34345.0     34.3      1.4              if len(dim):
   111      1000    2001071.0   2001.1     81.7                  name = _name_from_dimension(dim)
   112                                                       else:
   113                                                           # if unlabeled dimensions are found
   114                                                           if self._root._phony_dims_mode is None:
   115                                                               raise ValueError(
   116                                                                   "variable %r has no dimension scale "
   117                                                                   "associated with axis %s. \n"
   118                                                                   "Use phony_dims=%r for sorted naming or "
   119                                                                   "phony_dims=%r for per access naming."
   120                                                                   % (self.name, axis, "sort", "access")
   121                                                               )
   122                                                           else:
   123                                                               # get dimension name
   124                                                               name = self._parent._phony_dims[(dimsize, phony_dims[dimsize] - 1)]
   125      1000       1820.0      1.8      0.1              dims.append(name)
   126       400        512.0      1.3      0.0          return tuple(dims)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806680140 https://github.com/pydata/xarray/issues/2857#issuecomment-806680140 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjY4MDE0MA== aldanor 2418513 2021-03-25T12:48:23Z 2021-03-25T12:49:19Z NONE

There are some really obscure things going on here, e.g. in h5netcdf.core.BaseVariable._lookup_dimensions:

For 0 datasets:

```
Timer unit: 1e-06 s

Total time: 0.005034 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 86

Line #      Hits         Time  Per Hit   % Time  Line Contents
    86                                               def _lookup_dimensions(self):
    87         2        633.0    316.5     12.6          attrs = self._h5ds.attrs
    88         2         53.0     26.5      1.1          if '_Netcdf4Coordinates' in attrs:
    89                                                       order_dim = _reverse_dict(self._parent._dim_order)
    90                                                       return tuple(order_dim[coord_id]
    91                                                                    for coord_id in attrs['_Netcdf4Coordinates'])
    92
    93         2        471.0    235.5      9.4          child_name = self.name.split('/')[-1]
    94         2         51.0     25.5      1.0          if child_name in self._parent.dimensions:
    95                                                       return (child_name,)
    96
    97         2          4.0      2.0      0.1          dims = []
    98         7       1671.0    238.7     33.2          for axis, dim in enumerate(self._h5ds.dims):
    99                                                       # TODO: read dimension labels even if there is no associated
   100                                                       # scale? it's not netCDF4 spec, but it is unambiguous...
   101                                                       # Also: the netCDF lib can read HDF5 datasets with unlabeled
   102                                                       # dimensions.
   103         5        355.0     71.0      7.1              if len(dim) == 0:
   104                                                           raise ValueError('variable %r has no dimension scale '
   105                                                                            'associated with axis %s'
   106                                                                            % (self.name, axis))
   107         5       1772.0    354.4     35.2              name = _name_from_dimension(dim)
   108         5         18.0      3.6      0.4              dims.append(name)
   109         2          6.0      3.0      0.1          return tuple(dims)
```

For 200 datasets:

```
Timer unit: 1e-06 s

Total time: 2.34179 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 86

Line #      Hits         Time  Per Hit   % Time  Line Contents
    86                                               def _lookup_dimensions(self):
    87       400      66185.0    165.5      2.8          attrs = self._h5ds.attrs
    88       400       6106.0     15.3      0.3          if '_Netcdf4Coordinates' in attrs:
    89                                                       order_dim = _reverse_dict(self._parent._dim_order)
    90                                                       return tuple(order_dim[coord_id]
    91                                                                    for coord_id in attrs['_Netcdf4Coordinates'])
    92
    93       400      45176.0    112.9      1.9          child_name = self.name.split('/')[-1]
    94       400       5006.0     12.5      0.2          if child_name in self._parent.dimensions:
    95                                                       return (child_name,)
    96
    97       400        317.0      0.8      0.0          dims = []
    98      1400     168708.0    120.5      7.2          for axis, dim in enumerate(self._h5ds.dims):
    99                                                       # TODO: read dimension labels even if there is no associated
   100                                                       # scale? it's not netCDF4 spec, but it is unambiguous...
   101                                                       # Also: the netCDF lib can read HDF5 datasets with unlabeled
   102                                                       # dimensions.
   103      1000      35653.0     35.7      1.5              if len(dim) == 0:
   104                                                           raise ValueError('variable %r has no dimension scale '
   105                                                                            'associated with axis %s'
   106                                                                            % (self.name, axis))
   107      1000    2012597.0   2012.6     85.9              name = _name_from_dimension(dim)
   108      1000       1640.0      1.6      0.1              dims.append(name)
   109       400        400.0      1.0      0.0          return tuple(dims)
```
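
Comparing the two `_lookup_dimensions` profiles above shows two effects compounding: the function is called far more often, and each `_name_from_dimension` call also gets slower. The snippet below only restates figures from those profiles; the dictionary keys are made up for illustration.

```python
# Figures copied from the two _lookup_dimensions profiles above (microseconds).
profiles = {
    0: {"calls": 2, "total_us": 5034.0, "name_hits": 5, "name_us": 1772.0},
    200: {"calls": 400, "total_us": 2341790.0, "name_hits": 1000, "name_us": 2012597.0},
}

empty, full = profiles[0], profiles[200]

# How many more times the function is entered with 200 datasets in the file.
call_growth = full["calls"] / empty["calls"]

# How much slower a single _name_from_dimension call becomes.
per_hit_growth = (full["name_us"] / full["name_hits"]) / (
    empty["name_us"] / empty["name_hits"]
)

# Overall growth of the time spent in _lookup_dimensions.
total_growth = full["total_us"] / empty["total_us"]

print(f"calls: {call_growth:.0f}x, per-call cost: {per_hit_growth:.2f}x, total: {total_growth:.0f}x")
```

The call count grows 200x while the per-call cost of `_name_from_dimension` grows a further 5.7x or so, which is consistent with the total cost growing faster than the number of datasets.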

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806667029 https://github.com/pydata/xarray/issues/2857#issuecomment-806667029 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjY2NzAyOQ== aldanor 2418513 2021-03-25T12:40:18Z 2021-03-25T12:49:00Z NONE
  • All of the time in store.close() is, in turn, spent in CachingFileManager.close()
  • That time is spent in h5netcdf.File.close()
  • All of which is spent in h5netcdf.File.flush()

h5netcdf.File.flush() when there's 0 datasets in file:

```
0.21619391441345215

Timer unit: 1e-06 s

Total time: 0.006862 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: flush at line 689

Line #      Hits         Time  Per Hit   % Time  Line Contents
   689                                               def flush(self):
   690         1          4.0      4.0      0.1          if 'r' not in self._mode:
   691         1        111.0    111.0      1.6              self._set_unassigned_dimension_ids()
   692         1       3521.0   3521.0     51.3              self._create_dim_scales()
   693         1       3224.0   3224.0     47.0              self._attach_dim_scales()
   694         1          2.0      2.0      0.0              if not self._preexisting_file and self._write_ncproperties:
   695                                                           self.attrs._h5attrs['_NCProperties'] = _NC_PROPERTIES
```

h5netcdf.File.flush() when there's 200 datasets in file (about 660 times slower by total time):

```
Timer unit: 1e-06 s

Total time: 4.55295 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: flush at line 689

Line #      Hits         Time  Per Hit   % Time  Line Contents
   689                                               def flush(self):
   690         1          3.0      3.0      0.0          if 'r' not in self._mode:
   691         1    1148237.0 1148237.0     25.2              self._set_unassigned_dimension_ids()
   692         1     462926.0 462926.0     10.2              self._create_dim_scales()
   693         1    2941779.0 2941779.0     64.6              self._attach_dim_scales()
   694         1          2.0      2.0      0.0              if not self._preexisting_file and self._write_ncproperties:
   695                                                           self.attrs._h5attrs['_NCProperties'] = _NC_PROPERTIES
```
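
The per-step slowdown inside `flush()` can be read off the two profiles above. The snippet restates the measured microsecond figures and divides them; nothing in it comes from h5netcdf itself.

```python
# (time with 0 datasets, time with 200 datasets) in microseconds,
# copied from the two flush() profiles above.
flush_us = {
    "_set_unassigned_dimension_ids": (111.0, 1148237.0),
    "_create_dim_scales": (3521.0, 462926.0),
    "_attach_dim_scales": (3224.0, 2941779.0),
}
# Total time of flush() in seconds for the same two runs.
total_s = (0.006862, 4.55295)

slowdown = {step: after / before for step, (before, after) in flush_us.items()}
total_slowdown = total_s[1] / total_s[0]

for step, factor in sorted(slowdown.items(), key=lambda item: -item[1]):
    print(f"{step}: {factor:,.0f}x slower")
print(f"flush() overall: {total_slowdown:,.0f}x slower")
```

By these numbers `_set_unassigned_dimension_ids()` degrades the most in relative terms, while `_attach_dim_scales()` accounts for the largest share of the absolute time.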

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806651823 https://github.com/pydata/xarray/issues/2857#issuecomment-806651823 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjY1MTgyMw== aldanor 2418513 2021-03-25T12:30:39Z 2021-03-25T12:46:26Z NONE

@shoyer This problem has persisted all this time, but since I faced it again, I did a bit of digging. (It's strange no one else has noticed it so far, as it's pretty bad.)

I've line-profiled this snippet (xarray.backends.api.to_netcdf) for various numbers of datasets already written to the file:

https://github.com/pydata/xarray/blob/8452120e52862df564a6e629d1ab5a7d392853b0/xarray/backends/api.py#L1075-L1094

| Number of datasets in file | dump_to_store() | store_open() | store.close() |
| --- | --- | --- | --- |
| 0 | 88% | 1% | 10% |
| 50 | 18% | 2% | 80% |
| 200 | 4% | 2% | 94% |

The above can be measured simply in a notebook via %lprun -f xarray.backends.api.to_netcdf test_func(). The writing was done in mode='a', with blosc:zstd compression. All datasets are written into different groups (i.e. by passing group=...).
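
Outside a notebook, or where line_profiler is not installed, a coarser function-level breakdown is available from the standard library's cProfile. This is a different technique from the `%lprun` run described above, shown only as a sketch: `workload`, `fast_step` and `slow_step` are hypothetical stand-ins for the code that calls `to_netcdf()`.

```python
import cProfile
import io
import pstats
import time


def slow_step():
    """Stand-in for an expensive call such as store.close()."""
    time.sleep(0.02)


def fast_step():
    """Stand-in for a cheap call."""
    time.sleep(0.001)


def workload():
    """Hypothetical workload; replace the body with the real write."""
    fast_step()
    slow_step()


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

cProfile reports time per function rather than per line, so it can show which call dominates but not which line inside it.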

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
745366696 https://github.com/pydata/xarray/pull/4684#issuecomment-745366696 https://api.github.com/repos/pydata/xarray/issues/4684 MDEyOklzc3VlQ29tbWVudDc0NTM2NjY5Ng== aldanor 2418513 2020-12-15T15:29:30Z 2020-12-15T15:29:30Z NONE

Looks great, thanks! Do I understand this correctly - you won't have to specify encoding manually, as int64 encoding will be picked by default for M8[ns] dtype?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Ensure maximum accuracy when encoding and decoding np.datetime64[ns] values 764440458
735851973 https://github.com/pydata/xarray/issues/4045#issuecomment-735851973 https://api.github.com/repos/pydata/xarray/issues/4045 MDEyOklzc3VlQ29tbWVudDczNTg1MTk3Mw== aldanor 2418513 2020-11-30T15:22:09Z 2020-11-30T15:22:09Z NONE

Can we use the encoding["dtype"] field to solve this? i.e. use int64 when encoding["dtype"] is not set and use the specified value when available?

I think a lot of logic needs to be reshuffled, because as of right now it will complain "you can't store a float64 in int64" or something along those lines, when trying to do it with a nanosecond timestamp.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Millisecond precision is lost on datetime64 during IO roundtrip 614275938
735849936 https://github.com/pydata/xarray/issues/4045#issuecomment-735849936 https://api.github.com/repos/pydata/xarray/issues/4045 MDEyOklzc3VlQ29tbWVudDczNTg0OTkzNg== aldanor 2418513 2020-11-30T15:18:55Z 2020-11-30T15:21:02Z NONE

> In principle we should be able to handle this (contributions are welcome)

I don't mind contributing but not knowing the netcdf stuff inside out I'm not sure I have a good vision on what's the proper way to do it. My use case is very simple - I have an in-memory xr.Dataset that I want to save() and then load() without losses.

Should it just be an xr.save(..., m8=True) (or whatever that flag would be called), so that all of numpy's M8[...] and m8[...] would be serialized transparently (as int64, that is) without passing them through the whole cftime pipeline. It would be then nice, of course, if xr.load was also aware of this convention (via some special attribute or somehow else) and could convert them back like .view('M8[ns]') when loading. I think xarray should also throw an exception if it detects timestamps/timedeltas of nanosecond precision that it can't serialize without going through int-float-int routine (or automatically revert to using this transparent but netcdf-incompatible mode).
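
The "serialize transparently as int64" idea can be sketched with plain NumPy. `encode_m8` and `decode_m8` are hypothetical helpers, not xarray API, and returning the dtype string alongside the data is just one assumed way of recording the convention.

```python
import numpy as np


def encode_m8(values):
    """Return the raw int64 payload plus the dtype string needed to restore it."""
    if values.dtype.kind not in "mM":
        raise TypeError(f"expected datetime64/timedelta64, got {values.dtype}")
    # view() reinterprets the same 8 bytes; no arithmetic, so nothing can be lost.
    return values.view("int64"), values.dtype.str


def decode_m8(raw, dtype_str):
    """Reinterpret an int64 payload as the original M8[...]/m8[...] dtype."""
    return raw.view(dtype_str)


stamps = np.array(["2020-11-30T15:18:55.123456789"], dtype="M8[ns]")
raw, tag = encode_m8(stamps)
restored = decode_m8(raw, tag)

print(raw.dtype, tag, restored)
```

Because both directions are a reinterpretation of the same bytes, the round trip is exact for any unit, and NaT values survive as well.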

Maybe this is not the proper way to do it - ideas welcome (there's also an open PR - #4400 - mind checking that out?)

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Millisecond precision is lost on datetime64 during IO roundtrip 614275938
735777126 https://github.com/pydata/xarray/pull/4400#issuecomment-735777126 https://api.github.com/repos/pydata/xarray/issues/4400 MDEyOklzc3VlQ29tbWVudDczNTc3NzEyNg== aldanor 2418513 2020-11-30T13:12:47Z 2020-11-30T13:12:47Z NONE

Yea, well, in this case it's not about Python... M8[ns] datatype is simply an int64 underneath, why not just store it as that, no bells and whistles required, no corruption possible, no funky conversions?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [WIP] Support nano second time encoding. 690546795
735431187 https://github.com/pydata/xarray/pull/4400#issuecomment-735431187 https://api.github.com/repos/pydata/xarray/issues/4400 MDEyOklzc3VlQ29tbWVudDczNTQzMTE4Nw== aldanor 2418513 2020-11-29T17:52:37Z 2020-11-29T17:52:37Z NONE

I'm working on an application where nanosecond resolution is critical, and it took me days to find out why my timestamps were all scrambled or off by one after writing them with xarray and then reading them back... I would much rather it threw an exception instead of silently corrupting the data.
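
A check along the lines asked for above could look like this. It is a sketch under assumptions: `check_float64_safe` is a hypothetical name, not something xarray provides, and it simply tests whether the underlying int64 values survive a trip through float64.

```python
import numpy as np


def check_float64_safe(values):
    """Raise if any tick count would change after an int64 -> float64 -> int64 round trip."""
    ticks = values.view("int64")
    round_tripped = ticks.astype("float64").astype("int64")
    if not np.array_equal(ticks, round_tripped):
        raise ValueError("values would lose precision if routed through float64")


# 2**53 ticks is exactly representable as float64, so this passes silently.
check_float64_safe(np.array([2**53], dtype="int64").view("M8[ns]"))

# 2**60 + 1 ticks is not representable, so this raises instead of corrupting data.
try:
    check_float64_safe(np.array([2**60 + 1], dtype="int64").view("M8[ns]"))
except ValueError as err:
    print("refused:", err)
```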

Non-standard netcdf or not, if it was possible to just store them as plain int64s and read them back as is, that would help a ton...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [WIP] Support nano second time encoding. 690546795
735430231 https://github.com/pydata/xarray/pull/4400#issuecomment-735430231 https://api.github.com/repos/pydata/xarray/issues/4400 MDEyOklzc3VlQ29tbWVudDczNTQzMDIzMQ== aldanor 2418513 2020-11-29T17:45:14Z 2020-11-29T17:45:14Z NONE

I think netcdf lists "nanoseconds" as a valid unit though?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [WIP] Support nano second time encoding. 690546795
734963454 https://github.com/pydata/xarray/pull/4400#issuecomment-734963454 https://api.github.com/repos/pydata/xarray/issues/4400 MDEyOklzc3VlQ29tbWVudDczNDk2MzQ1NA== aldanor 2418513 2020-11-27T19:38:47Z 2020-11-27T19:38:47Z NONE

But the test already passes (i.e. you can at least do a .encoding={.... 'nanoseconds'} and avoid float conversion)?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [WIP] Support nano second time encoding. 690546795
734962866 https://github.com/pydata/xarray/pull/4400#issuecomment-734962866 https://api.github.com/repos/pydata/xarray/issues/4400 MDEyOklzc3VlQ29tbWVudDczNDk2Mjg2Ng== aldanor 2418513 2020-11-27T19:36:02Z 2020-11-27T19:36:02Z NONE

Oh, that requires cftime._cftime support first? :/

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [WIP] Support nano second time encoding. 690546795
734962563 https://github.com/pydata/xarray/pull/4400#issuecomment-734962563 https://api.github.com/repos/pydata/xarray/issues/4400 MDEyOklzc3VlQ29tbWVudDczNDk2MjU2Mw== aldanor 2418513 2020-11-27T19:34:48Z 2020-11-27T19:34:48Z NONE

Is there anything preventing this from being merged?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [WIP] Support nano second time encoding. 690546795
734951187 https://github.com/pydata/xarray/issues/4045#issuecomment-734951187 https://api.github.com/repos/pydata/xarray/issues/4045 MDEyOklzc3VlQ29tbWVudDczNDk1MTE4Nw== aldanor 2418513 2020-11-27T18:47:26Z 2020-11-27T18:51:00Z NONE

Just stumbled upon this as well. Internally, datetime64[ns] is simply an 8-byte int. Why on earth would it be serialized in a lossy way as a float64?...

Simply telling it to encoding={...: {'dtype': 'int64'}} won't work since then it complains about serializing float as an int.

Is there a way out of this, other than not using M8[ns] dtypes at all with xarray?

This is a huge issue, as anyone using nanosecond-precision timestamps with xarray would unknowingly and silently read wrong data after deserializing.
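
Why a float64 detour is lossy for present-day nanosecond timestamps can be shown with NumPy alone. float64 has a 53-bit significand, so integers above 2**53 are not all representable, and nanosecond counts since 1970 are well past that. The timestamp below is an arbitrary example value.

```python
import numpy as np

# An arbitrary nanosecond-precision timestamp, and the int64 tick count numpy stores for it.
stamp = np.datetime64("2020-11-27T18:47:26.000000001", "ns")
ticks = int(stamp.astype("int64"))

# Distance between adjacent float64 values at this magnitude, in nanoseconds.
gap = float(np.spacing(np.float64(ticks)))

# Round trip through float64, as a float-based encoding would do.
recovered = int(float(ticks))

print(f"ticks     = {ticks}")
print(f"recovered = {recovered}")
print(f"float64 spacing here = {gap} ns, error = {recovered - ticks} ns")
```

Near this magnitude adjacent float64 values are 256 ns apart, so any timestamp whose tick count is not a multiple of 256 comes back changed.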

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Millisecond precision is lost on datetime64 during IO roundtrip 614275938
687267764 https://github.com/pydata/xarray/issues/1626#issuecomment-687267764 https://api.github.com/repos/pydata/xarray/issues/1626 MDEyOklzc3VlQ29tbWVudDY4NzI2Nzc2NA== aldanor 2418513 2020-09-04T16:55:48Z 2020-09-04T16:55:48Z NONE

This is an ancient issue, but still - wondering if anyone here managed to hack together some workarounds?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Structured numpy arrays, xarray and netCDF(4) 264582338
575835942 https://github.com/pydata/xarray/pull/3703#issuecomment-575835942 https://api.github.com/repos/pydata/xarray/issues/3703 MDEyOklzc3VlQ29tbWVudDU3NTgzNTk0Mg== aldanor 2418513 2020-01-17T23:39:39Z 2020-01-17T23:39:39Z NONE

Wondering, would it be possible to release a minor version with this stuff anytime soon, or is the plan to wait for the next big 0.15?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  hardcoded xarray.__all__ 551532886
575835720 https://github.com/pydata/xarray/pull/3703#issuecomment-575835720 https://api.github.com/repos/pydata/xarray/issues/3703 MDEyOklzc3VlQ29tbWVudDU3NTgzNTcyMA== aldanor 2418513 2020-01-17T23:38:20Z 2020-01-17T23:38:20Z NONE

Thanks a million!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  hardcoded xarray.__all__ 551532886
575371718 https://github.com/pydata/xarray/issues/3695#issuecomment-575371718 https://api.github.com/repos/pydata/xarray/issues/3695 MDEyOklzc3VlQ29tbWVudDU3NTM3MTcxOA== aldanor 2418513 2020-01-16T22:13:55Z 2020-01-16T22:13:55Z NONE

Any thoughts?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  mypy --strict fails on scripts/packages depending on xarray; __all__ required 549712566
574555353 https://github.com/pydata/xarray/issues/3695#issuecomment-574555353 https://api.github.com/repos/pydata/xarray/issues/3695 MDEyOklzc3VlQ29tbWVudDU3NDU1NTM1Mw== aldanor 2418513 2020-01-15T08:43:10Z 2020-01-15T08:43:10Z NONE

https://mypy.readthedocs.io/en/latest/command_line.html#cmdoption-mypy-no-implicit-reexport

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  mypy --strict fails on scripts/packages depending on xarray; __all__ required 549712566
491231541 https://github.com/pydata/xarray/issues/277#issuecomment-491231541 https://api.github.com/repos/pydata/xarray/issues/277 MDEyOklzc3VlQ29tbWVudDQ5MTIzMTU0MQ== aldanor 2418513 2019-05-10T09:52:35Z 2019-05-10T09:53:36Z NONE

It might also make sense then to implement all numpy-like constructors for DataArray, plus the empty(), which is typically faster for larger arrays:

  • .full() (kind of what's suggested here)
  • .ones()
  • .zeros()
  • .empty()

This should be trivial to implement.
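
A minimal sketch of what such constructors could look like as free functions. The names `zeros`, `ones`, `empty`, `full` and the `sizes` mapping argument are hypothetical, not existing xarray API; xarray does ship `zeros_like`, `ones_like` and `full_like`, but those need an existing object as a template.

```python
import numpy as np
import xarray as xr


def _from_numpy(factory, sizes, **kwargs):
    """Build a DataArray from a numpy constructor and a {dim: length} mapping."""
    dims = tuple(sizes)
    shape = tuple(sizes[dim] for dim in dims)
    return xr.DataArray(factory(shape, **kwargs), dims=dims)


def zeros(sizes, dtype=float):
    return _from_numpy(np.zeros, sizes, dtype=dtype)


def ones(sizes, dtype=float):
    return _from_numpy(np.ones, sizes, dtype=dtype)


def empty(sizes, dtype=float):
    # Uninitialized memory: contents are arbitrary until written.
    return _from_numpy(np.empty, sizes, dtype=dtype)


def full(sizes, fill_value, dtype=None):
    return _from_numpy(np.full, sizes, fill_value=fill_value, dtype=dtype)


print(zeros({"x": 2, "y": 3}))
print(full({"time": 4}, fill_value=7))
```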

{
    "total_count": 9,
    "+1": 9,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Creation of an empty DataArray 48301141
491229992 https://github.com/pydata/xarray/issues/1603#issuecomment-491229992 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ5MTIyOTk5Mg== aldanor 2418513 2019-05-10T09:47:39Z 2019-05-10T09:47:39Z NONE

There's now a good few dozen issues that reference this PR.

Wondering if there's any particular help needed (in the form of coding, discussion, or any other fashion), so as to try and speed it up and unblock those issues?

(I'm personally interested in resolving problems like #934 myself - allowing selection on non-dim coords, which seems to be a major hassle for a lot of use cases.)

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
475605323 https://github.com/pydata/xarray/issues/2836#issuecomment-475605323 https://api.github.com/repos/pydata/xarray/issues/2836 MDEyOklzc3VlQ29tbWVudDQ3NTYwNTMyMw== aldanor 2418513 2019-03-22T12:36:48Z 2019-03-22T12:36:48Z NONE

> Ooh I missed that too! This probably won't serialize well to netcdf, would it?

Prob not, with n-d attrs? It would serialize just fine to plain HDF5 though...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray.concat() with compat='identical' fails for DataArray attrs 423749397
475284043 https://github.com/pydata/xarray/issues/2837#issuecomment-475284043 https://api.github.com/repos/pydata/xarray/issues/2837 MDEyOklzc3VlQ29tbWVudDQ3NTI4NDA0Mw== aldanor 2418513 2019-03-21T15:43:56Z 2019-03-21T15:58:23Z NONE

> matplotlib only knows about numpy arrays so plt.plot(arr, ...) will act like plt.plot(arr.values, ...) by design.

How does it (matplotlib) preserve Series index then?

> style is pandas-only kwarg (xarray lightly wraps matplotlib)

Would it make sense to make it (DA plotting interface) a bit more pandas-compatible by supporting style? Given that it copies pandas syntax like arr.plot.line() anyway...

Also, if plot() is meant to be a thin wrapper around matplotlib, it should support positional arguments, since you can do plt.plot(x, y, '.-') just fine, but da.plot('.-') fails complaining about unexpected positional arguments.

Currently, neither of the two options above works, making the DA plot interface inferior to both raw matplotlib and pandas.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray plotting: pyplot compat and passing the style 423774214
475289244 https://github.com/pydata/xarray/issues/2837#issuecomment-475289244 https://api.github.com/repos/pydata/xarray/issues/2837 MDEyOklzc3VlQ29tbWVudDQ3NTI4OTI0NA== aldanor 2418513 2019-03-21T15:55:13Z 2019-03-21T15:55:13Z NONE

> I think it plots assuming that the index is [0:len(da.values)].

Nope. It plots datetime index just fine.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray plotting: pyplot compat and passing the style 423774214
475285050 https://github.com/pydata/xarray/issues/2836#issuecomment-475285050 https://api.github.com/repos/pydata/xarray/issues/2836 MDEyOklzc3VlQ29tbWVudDQ3NTI4NTA1MA== aldanor 2418513 2019-03-21T15:46:13Z 2019-03-21T15:46:13Z NONE

I could try; what's the most stable way to check equality? Do we want to enforce that types are the same, shape/ndim are the same (dtypes?), plus element-wise comparison? What if one is a DataArray and the other is a NumPy array?
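
One possible answer to the questions above, sketched under stated assumptions: the helper name `attr_values_equal` is hypothetical, and the strict policy it applies (same type, same shape, same dtype, then element-wise comparison) is only one of the options raised here, not what xarray actually does.

```python
import numpy as np


def attr_values_equal(a, b):
    """Hypothetical helper: strict equality check for two attribute values."""
    if isinstance(a, np.ndarray) or isinstance(b, np.ndarray):
        # Mixed types (e.g. ndarray vs. list) are treated as not equal.
        if type(a) is not type(b):
            return False
        # Shape (and therefore ndim) and dtype must match exactly.
        if a.shape != b.shape or a.dtype != b.dtype:
            return False
        # Element-wise comparison collapsed to a single bool.
        return bool(np.array_equal(a, b))
    # Non-array values: require the same type and ordinary equality.
    return type(a) is type(b) and bool(a == b)
```

Under this policy a DataArray compared with a NumPy array is simply "not identical"; a looser variant could coerce both sides with `np.asarray` first, at the cost of ignoring the type difference.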

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray.concat() with compat='identical' fails for DataArray attrs 423749397
475264613 https://github.com/pydata/xarray/issues/2836#issuecomment-475264613 https://api.github.com/repos/pydata/xarray/issues/2836 MDEyOklzc3VlQ29tbWVudDQ3NTI2NDYxMw== aldanor 2418513 2019-03-21T14:59:28Z 2019-03-21T14:59:28Z NONE

@dcherian In the second example that fails, the attr in question is 1-D. Are one-dimensional attributes supposed to be fine?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray.concat() with compat='identical' fails for DataArray attrs 423749397
474909166 https://github.com/pydata/xarray/issues/2825#issuecomment-474909166 https://api.github.com/repos/pydata/xarray/issues/2825 MDEyOklzc3VlQ29tbWVudDQ3NDkwOTE2Ng== aldanor 2418513 2019-03-20T16:16:43Z 2019-03-20T16:16:43Z NONE

IIRC the workaround is to use a slice with neighbouring dates, which is unintuitive and ugly.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  KeyError on selecting empty time slice from a datetime-indexed Dataset 423023519
474908707 https://github.com/pydata/xarray/issues/2825#issuecomment-474908707 https://api.github.com/repos/pydata/xarray/issues/2825 MDEyOklzc3VlQ29tbWVudDQ3NDkwODcwNw== aldanor 2418513 2019-03-20T16:15:47Z 2019-03-20T16:15:47Z NONE

Oh God! Classic pandas...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  KeyError on selecting empty time slice from a datetime-indexed Dataset 423023519
474786687 https://github.com/pydata/xarray/issues/2170#issuecomment-474786687 https://api.github.com/repos/pydata/xarray/issues/2170 MDEyOklzc3VlQ29tbWVudDQ3NDc4NjY4Nw== aldanor 2418513 2019-03-20T11:13:40Z 2019-03-20T11:13:40Z NONE

Please!

It's really painful in the cases where the keepdims option is not available: tons of unneeded boilerplate are required to mimic the same thing.
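
To illustrate what is being asked for, here is a small sketch using plain NumPy, where `keepdims` already exists. The array is made up, and the `np.expand_dims` line shows the kind of manual step needed when the option is missing; it is an analogy, not xarray's own API.

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)

# With keepdims=True the reduced axis is kept with length 1.
kept = data.mean(axis=1, keepdims=True)

# Without it, the axis has to be put back by hand.
manual = np.expand_dims(data.mean(axis=1), axis=1)

# The kept axis lets the result broadcast against the original array.
centered = data - kept
```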

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  keepdims=True for xarray reductions 325436508
474654983 https://github.com/pydata/xarray/issues/2824#issuecomment-474654983 https://api.github.com/repos/pydata/xarray/issues/2824 MDEyOklzc3VlQ29tbWVudDQ3NDY1NDk4Mw== aldanor 2418513 2019-03-20T02:05:55Z 2019-03-20T02:05:55Z NONE

I guess I expected it to “just work” since it’s a part of numpy core functionality. (same as you can just pass a recarray to pandas dataframe constructor and it infers the rest, without you having to create a dict of columns manually - there’s only one way to do it so it can be done automatically)
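
The manual step described above, building a dict of columns from a record array, can be sketched as follows. The helper name `columns_from_records` and the sample records are hypothetical; this only shows that the split is mechanical, not how a `Dataset.from_records()` would be implemented.

```python
import numpy as np

# Made-up structured array standing in for a recarray.
records = np.array(
    [(1, 0.5, b"a"), (2, 1.5, b"b")],
    dtype=[("id", "i8"), ("value", "f8"), ("tag", "S1")],
)


def columns_from_records(arr):
    """Hypothetical helper: split a structured array into {field: 1-D array}."""
    if arr.dtype.names is None:
        raise TypeError("expected a structured array or recarray")
    # Each field keeps its own dtype; field order follows the record dtype.
    return {name: arr[name] for name in arr.dtype.names}


columns = columns_from_records(records)
```

Each column could then be handed to a Dataset constructor as a `(dims, data)` pair, for example with a single dimension named `"index"` (that name is an assumption).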

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset.from_records()? 423016453
474637401 https://github.com/pydata/xarray/issues/1434#issuecomment-474637401 https://api.github.com/repos/pydata/xarray/issues/1434 MDEyOklzc3VlQ29tbWVudDQ3NDYzNzQwMQ== aldanor 2418513 2019-03-20T00:34:12Z 2019-03-20T00:34:12Z NONE

Looks like this is still a problem: I just tested on 0.11.3 and it still results in object dtype...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.concat loses coordinate dtype information with recarrays in 0.9 232350436

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
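
As a rough illustration of how this table can be queried, the sketch below rebuilds a trimmed-down copy of the schema in an in-memory SQLite database and applies the filter shown at the top of the page (author_association = "NONE", user = 2418513, sorted by updated_at descending). The trimmed column list is an assumption made for brevity, the first two rows reuse ids and timestamps from the listing above, and the third row is invented purely to show the filter excluding something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down copy of the schema: only the columns the query needs.
conn.execute(
    """
    CREATE TABLE [issue_comments] (
       [id] INTEGER PRIMARY KEY,
       [user] INTEGER,
       [updated_at] TEXT,
       [author_association] TEXT
    )
    """
)
conn.execute("CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user])")

rows = [
    (475289244, 2418513, "2019-03-21T15:55:13Z", "NONE"),
    (475284043, 2418513, "2019-03-21T15:58:23Z", "NONE"),
    (1, 0, "2019-03-22T00:00:00Z", "MEMBER"),  # invented row, filtered out below
]
conn.executemany("INSERT INTO [issue_comments] VALUES (?, ?, ?, ?)", rows)

# ISO 8601 timestamps stored as TEXT sort correctly as plain strings.
result = [
    row[0]
    for row in conn.execute(
        "SELECT [id] FROM [issue_comments] "
        "WHERE [author_association] = ? AND [user] = ? "
        "ORDER BY [updated_at] DESC",
        ("NONE", 2418513),
    )
]
```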
Powered by Datasette · Queries took 22.14ms · About: xarray-datasette