
issue_comments


92 rows where user = 6042212 sorted by updated_at descending


issue >30

  • WIP: Zarr backend 14
  • Allow fsspec URLs in open_(mf)dataset 12
  • Document writing netcdf from xarray directly to S3 7
  • Dataset to zarr not working with newest s3fs Storage (s3fs > 0.5.0) 7
  • Allow fsspec/zarr/mfdataset 6
  • zarr as persistent store for xarray 5
  • Zarr consolidated 4
  • Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter) 4
  • Check for path-like objects rather than Path type, use os.fspath 3
  • slow performance when storing datasets in gcsfs-backed zarr stores 2
  • to_zarr append with gcsmap does not work properly 2
  • Errors using to_zarr for an s3 store 2
  • Non-HTTPS remote URLs no longer work as input for open_zarr 2
  • requires io.IOBase subclass rather than duck file-like 2
  • ⚠️ Nightly upstream-dev CI failed ⚠️ 2
  • Obscure h5netcdf http serialization issue with python's http.server 2
  • fix zarr chunking bug 1
  • `open_zarr` hangs if 's3://' at front of root s3fs string 1
  • xarray.open_mzar: open multiple zarr files (in parallel) 1
  • Xarray open_mfdataset with engine Zarr 1
  • quick overview example not working with `to_zarr` function with gcs store 1
  • xr.DataArray.from_dask_dataframe feature 1
  • Retries for rare failures 1
  • Implement dask.sizeof for xarray.core.indexing.ImplicitToExplicitIndexingAdapter 1
  • Reading zarr gives unspecific PermissionError: Access Denied when public data has been consolidated after being written to S3 1
  • Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local) 1
  • Checking whether there is a chunk_store passed iterates over all files 1
  • Opening fsspec s3 file twice results in invalid start byte 1
  • Reset file pointer to 0 when reading file stream 1
  • Missing Blocks when loading zarr file 1
  • …

user 1

  • martindurant · 92

author_association 1

  • CONTRIBUTOR 92
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1466255394 https://github.com/pydata/xarray/issues/7574#issuecomment-1466255394 https://api.github.com/repos/pydata/xarray/issues/7574 IC_kwDOAMm_X85XZUgi martindurant 6042212 2023-03-13T14:32:53Z 2023-03-13T14:32:53Z CONTRIBUTOR

Sorry, I really don't know what goes inside xarray's cache layers. It seems that fsspec is doing the right thing if it opens via one route, and parallel=True shouldn't require any serialisation for the in-process threaded scheduler.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_mfdataset doesn't work with fsspec and dask 1605108888
1453911083 https://github.com/pydata/xarray/issues/4122#issuecomment-1453911083 https://api.github.com/repos/pydata/xarray/issues/4122 IC_kwDOAMm_X85WqOwr martindurant 6042212 2023-03-03T18:12:01Z 2023-03-03T18:12:01Z CONTRIBUTOR

> what are the limitations of the netcdf3 standard vs netcdf4

No compression, encoding or chunking except for the one "append" dimension.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Document writing netcdf from xarray directly to S3 631085856
1453902381 https://github.com/pydata/xarray/issues/4122#issuecomment-1453902381 https://api.github.com/repos/pydata/xarray/issues/4122 IC_kwDOAMm_X85WqMot martindurant 6042212 2023-03-03T18:04:29Z 2023-03-03T18:04:29Z CONTRIBUTOR

scipy only reads/writes netcdf2/3 ( https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.netcdf_file.html ), which is a very different and simpler format than netcdf4. The latter uses HDF5 as a container, and h5netcdf as the xarray engine. I guess "to_netcdf" is ambiguous.
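For illustration, a hedged sketch of that distinction (file names are hypothetical): the scipy engine handles the classic netCDF2/3 format, while netCDF4 files, which are HDF5 containers, go through h5netcdf.

```python
import xarray as xr

ds_classic = xr.open_dataset("classic_v3_file.nc", engine="scipy")   # netCDF2/3
ds_hdf5 = xr.open_dataset("netcdf4_file.nc", engine="h5netcdf")      # netCDF4 (HDF5)
```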

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Document writing netcdf from xarray directly to S3 631085856
1453898602 https://github.com/pydata/xarray/issues/4122#issuecomment-1453898602 https://api.github.com/repos/pydata/xarray/issues/4122 IC_kwDOAMm_X85WqLtq martindurant 6042212 2023-03-03T18:01:30Z 2023-03-03T18:01:30Z CONTRIBUTOR

> I use the engine="scipy" one for reading.

This is netCDF3, in that case. If that's fine for you, no problem.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Document writing netcdf from xarray directly to S3 631085856
1453558039 https://github.com/pydata/xarray/issues/4122#issuecomment-1453558039 https://api.github.com/repos/pydata/xarray/issues/4122 IC_kwDOAMm_X85Wo4kX martindurant 6042212 2023-03-03T13:48:09Z 2023-03-03T13:48:09Z CONTRIBUTOR

Maybe it is netCDF3? xarray is supposed to be able to determine the file type with

```python
with fsspec.open("s3://some_bucket/some_remote_destination.nc", mode="rb") as ff:
    ds = xr.open_dataset(ff)
```

but maybe play with the engine= argument.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Document writing netcdf from xarray directly to S3 631085856
1450727551 https://github.com/pydata/xarray/issues/7522#issuecomment-1450727551 https://api.github.com/repos/pydata/xarray/issues/7522 IC_kwDOAMm_X85WeFh_ martindurant 6042212 2023-03-01T19:22:54Z 2023-03-01T19:22:54Z CONTRIBUTOR

I do generally recommend cache_type="first" for reading HDF5 files, because they tend to have most of the metadata in the header area of the file, with short pieces of metadata "elsewhere"; so the default readahead doesn't perform very well.

As to what the two writers might be doing differently, I only have guesses. I imagine xarray leaves it entirely to HDF to make whatever choices it likes. Dask does not write in parallel, since HDF does not support that, but it may order the writes more logically. It does set up the whole set of variables as an initialisation stage before writing any data - I don't know if xarray does this.
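A minimal sketch of that recommendation (bucket and file name are hypothetical): the cache strategy is chosen when the remote file is opened.

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem("s3", anon=True)          # assuming a public bucket
f = fs.open("some-bucket/some-file.nc", mode="rb", cache_type="first")
ds = xr.open_dataset(f, engine="h5netcdf")
```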

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Differences in `to_netcdf` for dask and numpy backed arrays 1581046647
1400583499 https://github.com/pydata/xarray/issues/4122#issuecomment-1400583499 https://api.github.com/repos/pydata/xarray/issues/4122 IC_kwDOAMm_X85TezVL martindurant 6042212 2023-01-23T15:57:24Z 2023-01-23T15:57:24Z CONTRIBUTOR

Would you mind writing out long-hand the version that worked and the version that didn't?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Document writing netcdf from xarray directly to S3 631085856
1400545067 https://github.com/pydata/xarray/issues/4122#issuecomment-1400545067 https://api.github.com/repos/pydata/xarray/issues/4122 IC_kwDOAMm_X85Tep8r martindurant 6042212 2023-01-23T15:31:16Z 2023-01-23T15:31:16Z CONTRIBUTOR

I can confirm that something like the following does work, basically automating the "write local and then push" workflow:

```python
import xarray as xr
import fsspec

ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC'
                     '/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')
outfile = fsspec.open('simplecache::gcs://mdtemp/foo2.nc', mode='wb')
with outfile as f:
    ds.to_netcdf(f)
```

Unfortunately, directly writing to the remote file without a local cached file is not supported, because HDF5 does not write in a linear way.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Document writing netcdf from xarray directly to S3 631085856
1381232564 https://github.com/pydata/xarray/issues/7430#issuecomment-1381232564 https://api.github.com/repos/pydata/xarray/issues/7430 IC_kwDOAMm_X85SU--0 martindurant 6042212 2023-01-13T02:24:24Z 2023-01-13T02:24:24Z CONTRIBUTOR

I recommend turning on logging in the HTTP file system

```python
client = Client(n_workers=1, threads_per_worker=32, memory_limit='64GB')
client.run(fsspec.utils.setup_logging, logger_name="fsspec.http")
fsspec.utils.setup_logging(logger_name="fsspec.http")
```

and looking for errors.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Missing Blocks when loading zarr file 1525802030
1330736962 https://github.com/pydata/xarray/pull/7304#issuecomment-1330736962 https://api.github.com/repos/pydata/xarray/issues/7304 IC_kwDOAMm_X85PUW9C martindurant 6042212 2022-11-29T14:30:43Z 2022-11-29T14:30:43Z CONTRIBUTOR

It looks reasonable to me. I'm not sure if the warning is needed or not - we don't expect anyone to see it, or if they do, necessarily do anything about it. It's not unusual for code interacting with a file-like object to move the file pointer.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Reset file pointer to 0 when reading file stream 1458347938
1244150155 https://github.com/pydata/xarray/issues/6809#issuecomment-1244150155 https://api.github.com/repos/pydata/xarray/issues/6809 IC_kwDOAMm_X85KKDmL martindurant 6042212 2022-09-12T18:45:09Z 2022-09-12T18:45:09Z CONTRIBUTOR

I agree that `is not None` would make sense here. We could implement `bool()` for FSMap, but if we are following the dict API, indeed it should be the same as a test for empty.

@nestabur , if you pass the s3 path directly to xarray/zarr rather than making your own FSMap, you should get a FSStore storage layer (similar but different!). Does this have the same behaviour?
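A hedged sketch of the two routes (bucket and options hypothetical; the storage_options keyword assumes a recent xarray):

```python
import fsspec
import xarray as xr

url = "s3://some-bucket/some-dataset.zarr"          # hypothetical store

# pass the URL straight through: zarr/xarray build the FSStore storage layer
ds1 = xr.open_zarr(url, storage_options={"anon": True})

# or build the FSMap yourself and hand it over
mapper = fsspec.get_mapper(url, anon=True)
ds2 = xr.open_zarr(mapper)
```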

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Checking whether there is a chunk_store passed iterates over all files 1309500528
1204310218 https://github.com/pydata/xarray/issues/6813#issuecomment-1204310218 https://api.github.com/repos/pydata/xarray/issues/6813 IC_kwDOAMm_X85HyFDK martindurant 6042212 2022-08-03T18:10:57Z 2022-08-03T18:10:57Z CONTRIBUTOR

Yes, it is reasonable to always seek(0) or to copy the file. I am not certain why/where xarray is caching the open file, though - I would have thought that a new file instance is made for each open_dataset(). I am not certain whether seeking/reading from the same file in multiple places might have unforeseen consequences, such as when doing open_dataset in multiple threads.

I am mildly against subclassing from RawIOBase, since some file-likes might choose to implement text mode right in the class (as opposed to a text wrapper layered on top). Pretty surprised that it doesn't have read()/write(), though, since all the derived classes do.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening fsspec s3 file twice results in invalid start byte 1310058435
1146159832 https://github.com/pydata/xarray/issues/6662#issuecomment-1146159832 https://api.github.com/repos/pydata/xarray/issues/6662 IC_kwDOAMm_X85EUQLY martindurant 6042212 2022-06-03T16:34:44Z 2022-06-03T16:34:44Z CONTRIBUTOR

Can you please explicitly check the type and __dict__ of fp?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Obscure h5netcdf http serialization issue with python's http.server 1260047355
1146079311 https://github.com/pydata/xarray/issues/6662#issuecomment-1146079311 https://api.github.com/repos/pydata/xarray/issues/6662 IC_kwDOAMm_X85ET8hP martindurant 6042212 2022-06-03T15:30:17Z 2022-06-03T15:30:17Z CONTRIBUTOR

Python's HTTP server does not normally provide content lengths without some extra work, that might be the difference.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
  Obscure h5netcdf http serialization issue with python's http.server 1260047355
1085091126 https://github.com/pydata/xarray/pull/5879#issuecomment-1085091126 https://api.github.com/repos/pydata/xarray/issues/5879 IC_kwDOAMm_X85ArS02 martindurant 6042212 2022-03-31T20:45:54Z 2022-03-31T20:45:54Z CONTRIBUTOR

OK, I get you - so the real problem is that OpenFile can look path-like, but isn't really.

OpenFile is really a file-like factory, a proxy for open file-likes that you can make (and serialise for Dask). Its main purpose is to be used in a context:

```python
with fsspec.open(url) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
```

except that the problem with xarray is that it will want to keep this thing open for subsequent operations, so you either need to put all that in the context, or use .open()/.close() as you have been.
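A hedged sketch of both patterns (URL hypothetical): keep the work inside the context, or manage the lifetime explicitly with .open()/.close().

```python
import fsspec
import xarray as xr

of = fsspec.open("s3://some-bucket/data.nc", mode="rb")   # an OpenFile: a factory, not a file

# pattern 1: everything inside the context, so the file stays open while xarray reads
with of as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    loaded = ds.load()

# pattern 2: explicit lifetime management
f = of.open()
ds = xr.open_dataset(f, engine="h5netcdf")
# ... later ...
f.close()
```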

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Check for path-like objects rather than Path type, use os.fspath 1031275532
1085037801 https://github.com/pydata/xarray/pull/5879#issuecomment-1085037801 https://api.github.com/repos/pydata/xarray/issues/5879 IC_kwDOAMm_X85ArFzp martindurant 6042212 2022-03-31T19:54:26Z 2022-03-31T19:54:26Z CONTRIBUTOR

"s3://noaa-nwm-retrospective-2-1-zarr-pds/lakeout.zarr" is a directory, right? You cannot open that as a file, or maybe there is no equivalent key at all (because s3 is magic like that). No, you should not be able to do this directly - zarr requires a path which fsspec can turn into a mapper, or an instantiated mapper.

To make a bare mapper (i.e., dict-like):

```python
m = fsspec.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/lakeout.zarr", ...)
```

or you could use zarr's FSMapper, meant specifically for this job.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Check for path-like objects rather than Path type, use os.fspath 1031275532
1085022939 https://github.com/pydata/xarray/pull/5879#issuecomment-1085022939 https://api.github.com/repos/pydata/xarray/issues/5879 IC_kwDOAMm_X85ArCLb martindurant 6042212 2022-03-31T19:37:49Z 2022-03-31T19:37:49Z CONTRIBUTOR

isinstance(X, os.PathLike) is very like hasattr(X, __fspath__) because of:

```python
@classmethod
def __subclasshook__(cls, subclass):
    if cls is PathLike:
        return _check_methods(subclass, '__fspath__')
    return NotImplemented
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Check for path-like objects rather than Path type, use os.fspath 1031275532
1020190813 https://github.com/pydata/xarray/issues/6033#issuecomment-1020190813 https://api.github.com/repos/pydata/xarray/issues/6033 IC_kwDOAMm_X848zuBd martindurant 6042212 2022-01-24T15:00:53Z 2022-01-24T15:00:53Z CONTRIBUTOR

It would be interesting to turn on s3fs logging to see the access pattern, if you are interested:

```python
fsspec.utils.setup_logging(logger_name="s3fs")
```

Particularly, I am interested in whether xarray is loading chunk-by-chunk serially versus concurrently. It would be good to know your chunksize versus total array size.

The dask version is interesting:

```python
xr.open_zarr(lookup(f"{path_forecast}/surface"), chunks={})  # uses dask
```

where the dask partition size will be the same as the underlying chunk size. If you find a lot of latency (small chunks), you can sometimes get an order of magnitude download performance increase by specifying the chunksize along some dimension(s) to be a multiple of the on-disk size. I wouldn't normally recommend Dask just for loading the data into memory, but feel free to experiment.
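An illustrative, hedged example (store URL, variable name, and chunk sizes are all hypothetical): inspect the stored chunk shape, then request dask chunks that are a multiple of it.

```python
import xarray as xr

store = "s3://some-bucket/forecast.zarr"            # hypothetical

ds = xr.open_zarr(store, chunks={})                 # dask chunks == on-disk zarr chunks
print(ds["surface"].encoding.get("chunks"))         # the stored chunk shape

# if the store uses, say, 24 points along "time" per chunk, a multiple of that
# lets each dask task fetch several zarr chunks per request
ds_coarse = xr.open_zarr(store, chunks={"time": 240})
```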

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local) 1064837571
970365001 https://github.com/pydata/xarray/issues/5426#issuecomment-970365001 https://api.github.com/repos/pydata/xarray/issues/5426 IC_kwDOAMm_X8451phJ martindurant 6042212 2021-11-16T15:08:03Z 2021-11-16T15:08:03Z CONTRIBUTOR

The conversation here seems to have stalled, but I feel like it was useful. Did we gather any useful actions?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Implement dask.sizeof for xarray.core.indexing.ImplicitToExplicitIndexingAdapter 908971901
961211401 https://github.com/pydata/xarray/issues/5918#issuecomment-961211401 https://api.github.com/repos/pydata/xarray/issues/5918 IC_kwDOAMm_X845SuwJ martindurant 6042212 2021-11-04T16:30:33Z 2021-11-04T16:30:33Z CONTRIBUTOR

Some thoughts:

- fsspec's mapper (or even the filesystem instance) could hold default kwargs to be applied to any open() function, perhaps separate ones for reading and writing. In this case, that would mean supplying acl="public-read".
- the consolidate_zarr could be made to accept extra parameters to define how the .zmetadata file is made, perhaps even accept a file-like object to write into, so the user gets complete control
- the mapper translates some exceptions into KeyError, which is what zarr needs to conclude that .zmetadata is missing. You could include PermissionError in this now (missing_exceptions kwarg to get_mapper, sketched below), but we can talk about what the best defaults are.
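A minimal sketch of that last option, with a hypothetical bucket name: widen the set of exceptions the mapper translates into KeyError, so a PermissionError on .zmetadata behaves like a missing key.

```python
import fsspec

m = fsspec.get_mapper(
    "s3://some-public-bucket/store.zarr",                      # hypothetical store
    missing_exceptions=(FileNotFoundError, PermissionError),   # both become KeyError
)
```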

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Reading zarr gives unspecific PermissionError: Access Denied when public data has been consolidated after being written to S3 1039844354
880695337 https://github.com/pydata/xarray/issues/5600#issuecomment-880695337 https://api.github.com/repos/pydata/xarray/issues/5600 MDEyOklzc3VlQ29tbWVudDg4MDY5NTMzNw== martindurant 6042212 2021-07-15T13:28:53Z 2021-07-15T13:28:53Z CONTRIBUTOR

> should we change that?

Perhaps so? We are releasing pretty frequently, though, and if there is a problem here, we'd be happy to put out a bugfix.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  ⚠️ Nightly upstream-dev CI failed ⚠️ 943923579
880685334 https://github.com/pydata/xarray/issues/5600#issuecomment-880685334 https://api.github.com/repos/pydata/xarray/issues/5600 MDEyOklzc3VlQ29tbWVudDg4MDY4NTMzNA== martindurant 6042212 2021-07-15T13:15:14Z 2021-07-15T13:15:14Z CONTRIBUTOR

There was a release of fsspec, but I don't see why anything would have changed here. Can you see whether the failure is associated with the new version?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  ⚠️ Nightly upstream-dev CI failed ⚠️ 943923579
870777725 https://github.com/pydata/xarray/issues/4591#issuecomment-870777725 https://api.github.com/repos/pydata/xarray/issues/4591 MDEyOklzc3VlQ29tbWVudDg3MDc3NzcyNQ== martindurant 6042212 2021-06-29T17:20:43Z 2021-06-29T17:20:43Z CONTRIBUTOR

I only have vague thoughts.

To be sure: you can pickle the file-system, any mapper (.get_mapper()) and any open file (.open()), right?

The question here is, why msgpack is being invoked. Those items, as well as any internal xarray stuff should only be in tasks, and so pickled. Is there a high-level-graph layer encapsulating things that were previously pickled? The only things that appear in any HLG-layer should be the paths and storage options needed to open a file-system, not the file-system itself.
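A hedged way to run the check suggested above (bucket and key hypothetical): round-trip the filesystem, a mapper, and an OpenFile through pickle.

```python
import pickle
import fsspec

fs = fsspec.filesystem("s3", anon=True)
mapper = fs.get_mapper("some-bucket/store.zarr")
of = fsspec.open("s3://some-bucket/file.nc", mode="rb")   # serialisable file factory

for obj in (fs, mapper, of):
    restored = pickle.loads(pickle.dumps(obj))
    print(type(obj).__name__, "->", type(restored).__name__)
```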

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter) 745801652
809493007 https://github.com/pydata/xarray/issues/5070#issuecomment-809493007 https://api.github.com/repos/pydata/xarray/issues/5070 MDEyOklzc3VlQ29tbWVudDgwOTQ5MzAwNw== martindurant 6042212 2021-03-29T15:52:28Z 2021-03-29T15:52:28Z CONTRIBUTOR

> Unsure whether checking hasattr is better than just trying to read the object and catching an error

Agree, that's fine. An AttributeError in this calling function might look weird, though, so you could have both.

> you could read it into BytesIO and pass the BytesIO instance

This is in general a bad idea, since we are wanting to deal with large files, and we have random access capabilities.

> Ideally xarray would work with fsspec

It does, but this is an edge case of using fsspec for local files; these are normally passed as the filename.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  requires io.IOBase subclass rather than duck file-like 839823306
805900745 https://github.com/pydata/xarray/issues/5070#issuecomment-805900745 https://api.github.com/repos/pydata/xarray/issues/5070 MDEyOklzc3VlQ29tbWVudDgwNTkwMDc0NQ== martindurant 6042212 2021-03-24T15:07:16Z 2021-03-24T15:07:16Z CONTRIBUTOR

xref https://github.com/pangeo-forge/pangeo-forge/pull/87

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  requires io.IOBase subclass rather than duck file-like 839823306
797547241 https://github.com/pydata/xarray/pull/4659#issuecomment-797547241 https://api.github.com/repos/pydata/xarray/issues/4659 MDEyOklzc3VlQ29tbWVudDc5NzU0NzI0MQ== martindurant 6042212 2021-03-12T15:04:34Z 2021-03-12T15:04:34Z CONTRIBUTOR

Ping, can I please ask what the current status is here?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.DataArray.from_dask_dataframe feature 758606082
780127931 https://github.com/pydata/xarray/pull/4823#issuecomment-780127931 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc4MDEyNzkzMQ== martindurant 6042212 2021-02-16T21:26:52Z 2021-02-16T21:26:52Z CONTRIBUTOR

Thank you, @dcherian

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
779870336 https://github.com/pydata/xarray/pull/4823#issuecomment-779870336 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc3OTg3MDMzNg== martindurant 6042212 2021-02-16T14:26:12Z 2021-02-16T14:26:12Z CONTRIBUTOR

Can someone please explain the minimum version policy that is failing

```
Package      Required             Policy               Status
aiobotocore  1.1.2 (2020-08-18)   0.12 (2020-02-24)    > (!) (w)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
778353067 https://github.com/pydata/xarray/pull/4823#issuecomment-778353067 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc3ODM1MzA2Nw== martindurant 6042212 2021-02-12T18:06:49Z 2021-02-12T18:06:49Z CONTRIBUTOR

@raybellwaves , might I paraphrase to "this PR is useful, please merge!" ?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
769247752 https://github.com/pydata/xarray/pull/4823#issuecomment-769247752 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2OTI0Nzc1Mg== martindurant 6042212 2021-01-28T17:27:50Z 2021-01-28T17:27:50Z CONTRIBUTOR

I have decided, on reflection, to back away on the scope here and only implement for zarr for now, since, frankly, I am confused about what should happen for other backends, and they are not tested. Yes, some of them are happy to accept file-like objects, but others either don't do that at all, or want the URL passing through. My code would have changed how things were handled, depending on whether it passed through open_dataset or open_mfdataset. Best would be to set up a set of expectations as tests.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
768393609 https://github.com/pydata/xarray/pull/4823#issuecomment-768393609 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2ODM5MzYwOQ== martindurant 6042212 2021-01-27T16:10:39Z 2021-01-27T16:10:39Z CONTRIBUTOR

Thanks, @kmuehlbauer

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
768385226 https://github.com/pydata/xarray/pull/4823#issuecomment-768385226 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2ODM4NTIyNg== martindurant 6042212 2021-01-27T15:58:15Z 2021-01-27T15:58:15Z CONTRIBUTOR

The RTD failure appears to be:

```
WARNING: failed to reach any of the inventories with the following issues:
intersphinx inventory 'https://scitools.org.uk/iris/docs/latest/objects.inv' not fetchable due to <class 'requests.exceptions.HTTPError'>: 404 Client Error: Not Found for url: https://scitools.org.uk/iris/docs/latest/objects.inv
```

which, I'm afraid, doesn't mean much to me.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
768362931 https://github.com/pydata/xarray/pull/4823#issuecomment-768362931 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2ODM2MjkzMQ== martindurant 6042212 2021-01-27T15:26:57Z 2021-01-27T15:26:57Z CONTRIBUTOR

I am marking this PR as ready, but please ask me for specific test cases that might be relevant and should be included.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
764649377 https://github.com/pydata/xarray/pull/4823#issuecomment-764649377 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2NDY0OTM3Nw== martindurant 6042212 2021-01-21T13:40:33Z 2021-01-21T13:40:33Z CONTRIBUTOR

(please definitely do not merge until I've added documentation)

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
762858956 https://github.com/pydata/xarray/pull/4823#issuecomment-762858956 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2Mjg1ODk1Ng== martindurant 6042212 2021-01-19T14:04:05Z 2021-01-19T14:04:05Z CONTRIBUTOR

Next open question: aside from zarr, few of the other backends will know what to do with fsspec's dict-like mappers. Should we prevent them from passing through? Should we attempt to distinguish between directories and files, and make fsspec file-like objects? We could just allow the backends to fail later on incorrect input.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
762428604 https://github.com/pydata/xarray/pull/4461#issuecomment-762428604 https://api.github.com/repos/pydata/xarray/issues/4461 MDEyOklzc3VlQ29tbWVudDc2MjQyODYwNA== martindurant 6042212 2021-01-18T19:15:25Z 2021-01-18T19:15:25Z CONTRIBUTOR

All interested parties, please see new attempt at https://github.com/pydata/xarray/pull/4823

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec/zarr/mfdataset 709187212
762394713 https://github.com/pydata/xarray/pull/4823#issuecomment-762394713 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2MjM5NDcxMw== martindurant 6042212 2021-01-18T17:52:47Z 2021-01-18T17:55:19Z CONTRIBUTOR

pint errors in xarray/tests/test_units.py::TestVariable appear unrelated

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
762393744 https://github.com/pydata/xarray/issues/4691#issuecomment-762393744 https://api.github.com/repos/pydata/xarray/issues/4691 MDEyOklzc3VlQ29tbWVudDc2MjM5Mzc0NA== martindurant 6042212 2021-01-18T17:50:32Z 2021-01-18T17:50:32Z CONTRIBUTOR

https://github.com/pydata/xarray/pull/4823 working on this. Please try and comment.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Non-HTTPS remote URLs no longer work as input for open_zarr 766826777
762367350 https://github.com/pydata/xarray/pull/4823#issuecomment-762367350 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2MjM2NzM1MA== martindurant 6042212 2021-01-18T16:54:21Z 2021-01-18T16:54:21Z CONTRIBUTOR

Question: should HTTP URLs be passed through unprocessed as before? I think that might be required by some of the netCDF engines, but we probably don't test this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
762350678 https://github.com/pydata/xarray/pull/4823#issuecomment-762350678 https://api.github.com/repos/pydata/xarray/issues/4823 MDEyOklzc3VlQ29tbWVudDc2MjM1MDY3OA== martindurant 6042212 2021-01-18T16:22:53Z 2021-01-18T16:22:53Z CONTRIBUTOR

Docs to be added

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec URLs in open_(mf)dataset 788398518
761147346 https://github.com/pydata/xarray/issues/4691#issuecomment-761147346 https://api.github.com/repos/pydata/xarray/issues/4691 MDEyOklzc3VlQ29tbWVudDc2MTE0NzM0Ng== martindurant 6042212 2021-01-15T19:30:21Z 2021-01-15T19:30:21Z CONTRIBUTOR

I believe https://github.com/pydata/xarray/pull/4461 fixes this

Note that you can still use the "old" method of opening the mapper (e.g., fsspec.get_mapper) beforehand and passing that

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Non-HTTPS remote URLs no longer work as input for open_zarr 766826777
747453674 https://github.com/pydata/xarray/issues/4704#issuecomment-747453674 https://api.github.com/repos/pydata/xarray/issues/4704 MDEyOklzc3VlQ29tbWVudDc0NzQ1MzY3NA== martindurant 6042212 2020-12-17T13:56:40Z 2020-12-17T13:56:40Z CONTRIBUTOR

As far as I can tell, this has only been happening in gcsfs - so my suggestion, to try to collect the set of conditions that should be considered "retryable" but currently aren't, still holds. However, it is also worthwhile discussing where else in the stack retries might be applied, which would affect multiple storage backends.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Retries for rare failures 770006670
743287803 https://github.com/pydata/xarray/pull/4461#issuecomment-743287803 https://api.github.com/repos/pydata/xarray/issues/4461 MDEyOklzc3VlQ29tbWVudDc0MzI4NzgwMw== martindurant 6042212 2020-12-11T16:19:26Z 2020-12-11T16:19:26Z CONTRIBUTOR

> Martin has gained by implementing this PR is transferrable

I'm not sure, it's been a while now...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec/zarr/mfdataset 709187212
741881966 https://github.com/pydata/xarray/pull/4461#issuecomment-741881966 https://api.github.com/repos/pydata/xarray/issues/4461 MDEyOklzc3VlQ29tbWVudDc0MTg4MTk2Ng== martindurant 6042212 2020-12-09T16:20:33Z 2020-12-09T16:20:33Z CONTRIBUTOR

ping again

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec/zarr/mfdataset 709187212
739959248 https://github.com/pydata/xarray/issues/4478#issuecomment-739959248 https://api.github.com/repos/pydata/xarray/issues/4478 MDEyOklzc3VlQ29tbWVudDczOTk1OTI0OA== martindurant 6042212 2020-12-07T14:39:57Z 2020-12-07T14:39:57Z CONTRIBUTOR

Please try with fsspec master.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset to zarr not working with newest s3fs Storage (s3fs > 0.5.0) 712782711
730396711 https://github.com/pydata/xarray/issues/4556#issuecomment-730396711 https://api.github.com/repos/pydata/xarray/issues/4556 MDEyOklzc3VlQ29tbWVudDczMDM5NjcxMQ== martindurant 6042212 2020-11-19T14:03:47Z 2020-11-19T14:03:47Z CONTRIBUTOR

Looks like a special case of a numpy scalar. I can catch this in fsspec - please wait.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  quick overview example not working with `to_zarr` function with gcs store 733201109
729863434 https://github.com/pydata/xarray/issues/4591#issuecomment-729863434 https://api.github.com/repos/pydata/xarray/issues/4591 MDEyOklzc3VlQ29tbWVudDcyOTg2MzQzNA== martindurant 6042212 2020-11-18T18:14:28Z 2020-11-18T18:14:28Z CONTRIBUTOR

The xarray.backends.h5netcdf_.H5NetCDFArrayWrapper seems to keep a reference to the open file, which for HTTP contains the open session. The linked PR fixes the serialization of those files, for the HTTP case.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter) 745801652
729803257 https://github.com/pydata/xarray/issues/4591#issuecomment-729803257 https://api.github.com/repos/pydata/xarray/issues/4591 MDEyOklzc3VlQ29tbWVudDcyOTgwMzI1Nw== martindurant 6042212 2020-11-18T16:42:30Z 2020-11-18T16:42:30Z CONTRIBUTOR

OK, I can see a thing after all... please stand by

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter) 745801652
729795030 https://github.com/pydata/xarray/issues/4591#issuecomment-729795030 https://api.github.com/repos/pydata/xarray/issues/4591 MDEyOklzc3VlQ29tbWVudDcyOTc5NTAzMA== martindurant 6042212 2020-11-18T16:29:18Z 2020-11-18T16:29:18Z CONTRIBUTOR

I don't think it's fsspec, the HTTPFileSystem and file objects are known to serialise.

However

```
distributed.protocol.serialize(dsc.surface.mean().data.dask['open_dataset-27832a1f850736a8d9a11a882ad06230surface-3b6f5b6a90c2cfa65379d3bfae22126f'])
({'serializer': 'error'}, ...)
```

(that's one of the keys I picked from the graph at random, your keys may differ) I can't say why this object is in the graph where perhaps it wasn't before, but it has a reference to a "CopyOnWriteArray", which sounds like a buffer owned by something else and probably the non-serializable part. Digging finds a contained "<xarray.backends.h5netcdf_.H5NetCDFArrayWrapper at 0x17e669ad0>" which is not serializable - so maybe xarray can do something about this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter) 745801652
721365827 https://github.com/pydata/xarray/pull/4461#issuecomment-721365827 https://api.github.com/repos/pydata/xarray/issues/4461 MDEyOklzc3VlQ29tbWVudDcyMTM2NTgyNw== martindurant 6042212 2020-11-03T20:46:57Z 2020-11-03T20:46:57Z CONTRIBUTOR

One completely unrelated failure (test_polyfit_warnings). Can I please get a final say here (@max-sixty @alexamici ?)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec/zarr/mfdataset 709187212
712194464 https://github.com/pydata/xarray/pull/4461#issuecomment-712194464 https://api.github.com/repos/pydata/xarray/issues/4461 MDEyOklzc3VlQ29tbWVudDcxMjE5NDQ2NA== martindurant 6042212 2020-10-19T14:22:23Z 2020-10-19T14:22:23Z CONTRIBUTOR

(failures look like something in pandas dev)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec/zarr/mfdataset 709187212
704353239 https://github.com/pydata/xarray/issues/4478#issuecomment-704353239 https://api.github.com/repos/pydata/xarray/issues/4478 MDEyOklzc3VlQ29tbWVudDcwNDM1MzIzOQ== martindurant 6042212 2020-10-06T15:30:50Z 2020-10-06T15:30:50Z CONTRIBUTOR

That's a lot of data!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset to zarr not working with newest s3fs Storage (s3fs > 0.5.0) 712782711
704285976 https://github.com/pydata/xarray/issues/4478#issuecomment-704285976 https://api.github.com/repos/pydata/xarray/issues/4478 MDEyOklzc3VlQ29tbWVudDcwNDI4NTk3Ng== martindurant 6042212 2020-10-06T13:55:34Z 2020-10-06T13:55:34Z CONTRIBUTOR

Can you confirm that this works ok with fsspec and s3fs master?

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 2,
    "rocket": 0,
    "eyes": 0
}
  Dataset to zarr not working with newest s3fs Storage (s3fs > 0.5.0) 712782711
702846089 https://github.com/pydata/xarray/issues/4478#issuecomment-702846089 https://api.github.com/repos/pydata/xarray/issues/4478 MDEyOklzc3VlQ29tbWVudDcwMjg0NjA4OQ== martindurant 6042212 2020-10-02T16:59:45Z 2020-10-02T16:59:45Z CONTRIBUTOR

I have reproduced it locally (also with moto). Indeed, many threads are trying to stall the event loop at once. This will take a little finesse.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset to zarr not working with newest s3fs Storage (s3fs > 0.5.0) 712782711
702816676 https://github.com/pydata/xarray/issues/4478#issuecomment-702816676 https://api.github.com/repos/pydata/xarray/issues/4478 MDEyOklzc3VlQ29tbWVudDcwMjgxNjY3Ng== martindurant 6042212 2020-10-02T15:59:59Z 2020-10-02T15:59:59Z CONTRIBUTOR

Thanks for the digging, I'll look into it

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset to zarr not working with newest s3fs Storage (s3fs > 0.5.0) 712782711
702213899 https://github.com/pydata/xarray/issues/4478#issuecomment-702213899 https://api.github.com/repos/pydata/xarray/issues/4478 MDEyOklzc3VlQ29tbWVudDcwMjIxMzg5OQ== martindurant 6042212 2020-10-01T15:27:00Z 2020-10-01T15:27:00Z CONTRIBUTOR

File "/usr/local/lib/python3.7/asyncio/base_events.py", line 1771, in _run_once handle = self._ready.popleft()

This looks like it may be a race conditions where multiple threads are calling the event loop at once. I wonder if you could list the event loops in use and the threads (perhaps best run with base python than ipython/jupyter).
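One hedged way to do that inspection in plain Python (not from the original thread):

```python
import asyncio
import threading

print(threading.enumerate())             # every live thread

try:
    print(asyncio.get_event_loop())      # the loop bound to this (main) thread, if any
except RuntimeError as err:
    print("no event loop in this thread:", err)
```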

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset to zarr not working with newest s3fs Storage (s3fs > 0.5.0) 712782711
702124268 https://github.com/pydata/xarray/issues/4478#issuecomment-702124268 https://api.github.com/repos/pydata/xarray/issues/4478 MDEyOklzc3VlQ29tbWVudDcwMjEyNDI2OA== martindurant 6042212 2020-10-01T13:11:32Z 2020-10-01T13:11:32Z CONTRIBUTOR

The following code, modified to the style of the s3fs test suite, works OK:

```python
def test_with_xzarr(s3):
    da = pytest.importorskip("dask.array")
    xr = pytest.importorskip("xarray")
    name = "sample"

    nana = xr.DataArray(da.zeros((1023, 1023, 3)))

    s3_path = f"{test_bucket_name}/{name}"
    s3store = s3.get_mapper(s3_path)

    print("Storing")
    nana.to_dataset().to_zarr(store=s3store, mode="w", consolidated=True, compute=True)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset to zarr not working with newest s3fs Storage (s3fs > 0.5.0) 712782711
699155033 https://github.com/pydata/xarray/pull/4461#issuecomment-699155033 https://api.github.com/repos/pydata/xarray/issues/4461 MDEyOklzc3VlQ29tbWVudDY5OTE1NTAzMw== martindurant 6042212 2020-09-25T21:05:42Z 2020-09-25T21:05:42Z CONTRIBUTOR

Question: to eventually get tests to pass, will need changes only just now going into zarr. Those may be released some time soon, but in the meantime is it reasonable to install from master?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec/zarr/mfdataset 709187212
696766963 https://github.com/pydata/xarray/pull/4187#issuecomment-696766963 https://api.github.com/repos/pydata/xarray/issues/4187 MDEyOklzc3VlQ29tbWVudDY5Njc2Njk2Mw== martindurant 6042212 2020-09-22T14:41:41Z 2020-09-22T14:41:41Z CONTRIBUTOR

Note that zarr.open* now works with fsspec URLs (in master)

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Xarray open_mfdataset with engine Zarr 647804004
639777701 https://github.com/pydata/xarray/issues/4122#issuecomment-639777701 https://api.github.com/repos/pydata/xarray/issues/4122 MDEyOklzc3VlQ29tbWVudDYzOTc3NzcwMQ== martindurant 6042212 2020-06-05T20:17:38Z 2020-06-05T20:17:38Z CONTRIBUTOR

The write feature for simplecache isn't released yet, of course.

It would be interesting if someone could subclass file and write locally with h5netcdf to see what kind of seeks it does. Is it popping back to some file header to update array sizes? Presumably it would need a fixed-size header to do that. Parquet and other cloud formats have the metadata at the footer exactly for this reason, so you only write once you know everything and you only ever move forward in the file.
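A hedged sketch of that experiment, here going through h5py directly (the layer h5netcdf writes with); the wrapper class is hypothetical and simply prints every seek while an HDF5 file is written:

```python
import h5py
import numpy as np

class SeekLoggingFile:
    """Wrap a local binary file and report every seek() call."""
    def __init__(self, f):
        self._f = f
    def seek(self, offset, whence=0):
        print("seek:", offset, whence)
        return self._f.seek(offset, whence)
    def __getattr__(self, name):          # delegate write/read/tell/truncate/flush
        return getattr(self._f, name)

with open("probe.h5", "w+b") as raw:
    with h5py.File(SeekLoggingFile(raw), "w") as h5:
        h5.create_dataset("x", data=np.arange(1000))
```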

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Document writing netcdf from xarray directly to S3 631085856
620151178 https://github.com/pydata/xarray/pull/4003#issuecomment-620151178 https://api.github.com/repos/pydata/xarray/issues/4003 MDEyOklzc3VlQ29tbWVudDYyMDE1MTE3OA== martindurant 6042212 2020-04-27T18:19:54Z 2020-04-27T18:19:54Z CONTRIBUTOR

> the behavior of zarr it appears will rely heavily on fsspec more in the future.

IF we can push on https://github.com/zarr-developers/zarr-python/pull/546 ; but here is also an opportunity to get the behaviour out of the zarr/fsspec interaction most convenient for this work.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray.open_mzar: open multiple zarr files (in parallel) 606683601
605222008 https://github.com/pydata/xarray/issues/3831#issuecomment-605222008 https://api.github.com/repos/pydata/xarray/issues/3831 MDEyOklzc3VlQ29tbWVudDYwNTIyMjAwOA== martindurant 6042212 2020-03-27T19:11:59Z 2020-03-27T19:11:59Z CONTRIBUTOR

Note that s3fs and gcsfs now expose the kwargs skip_instance_cache, use_listings_cache, listings_expiry_time, and max_paths and pass them to fsspec. See https://filesystem-spec.readthedocs.io/en/latest/features.html#instance-caching and https://filesystem-spec.readthedocs.io/en/latest/features.html#listings-caching

(although the new releases for both already include the change that accessing a file, contents or metadata, does not require a directory listing, which is the right thing for zarr, where the full paths are known)
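Hedged example of those kwargs in use (the values are arbitrary):

```python
import fsspec

fs = fsspec.filesystem(
    "s3",
    skip_instance_cache=True,       # don't reuse a cached filesystem instance
    use_listings_cache=False,       # don't cache directory listings at all
    # listings_expiry_time=10,      # alternatively, keep the cache but expire entries
)
```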

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Errors using to_zarr for an s3 store 576337745
595379998 https://github.com/pydata/xarray/issues/3831#issuecomment-595379998 https://api.github.com/repos/pydata/xarray/issues/3831 MDEyOklzc3VlQ29tbWVudDU5NTM3OTk5OA== martindurant 6042212 2020-03-05T18:32:38Z 2020-03-05T18:32:38Z CONTRIBUTOR

https://github.com/intake/filesystem_spec/pull/243 is where my attempt to fix this kind of thing will live.

However, writing or deleting keys should invalidate the appropriate part of the cache as it currently stands, so I don't know why the problem has arisen. If it is a cache problem, then s3.invalidate_cache() can always be called.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Errors using to_zarr for an s3 store 576337745
524910496 https://github.com/pydata/xarray/issues/3251#issuecomment-524910496 https://api.github.com/repos/pydata/xarray/issues/3251 MDEyOklzc3VlQ29tbWVudDUyNDkxMDQ5Ng== martindurant 6042212 2019-08-26T15:38:16Z 2019-08-26T15:38:16Z CONTRIBUTOR

Note that get_mapper is implemented for all file systems, so there should be no need for any gcsfs-specific code.

On August 26, 2019 11:21:00 AM EDT, Justin Minsk notifications@github.com wrote:

> Went back to gcsfs 0.0.4 and the current code still does not work. Going back more requires major changes to environment that conda cannot handle and would not be compatible with the current build of xarray. I think changing mutable maps to strings in this case has never worked with gcsfs and more than likely has not been seriously used until the append to zarr was added.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr append with gcsmap does not work properly  484592018
524663267 https://github.com/pydata/xarray/issues/3251#issuecomment-524663267 https://api.github.com/repos/pydata/xarray/issues/3251 MDEyOklzc3VlQ29tbWVudDUyNDY2MzI2Nw== martindurant 6042212 2019-08-25T21:00:37Z 2019-08-25T21:00:37Z CONTRIBUTOR

I am not sure why str should ever be called on the mapping. For sure, what it returns is not the same as before (perhaps you could go back a version and check?), but I don't know what the string would have been used for. I am on leave at the moment and unlikely to be able to get time to investigate.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr append with gcsmap does not work properly  484592018
460257484 https://github.com/pydata/xarray/issues/2740#issuecomment-460257484 https://api.github.com/repos/pydata/xarray/issues/2740 MDEyOklzc3VlQ29tbWVudDQ2MDI1NzQ4NA== martindurant 6042212 2019-02-04T13:54:31Z 2019-02-04T13:54:31Z CONTRIBUTOR

Do you have any idea what is taking the extra time? s3fs ought to, in theory, treat URLs with and without the s3:// the same, but this may not have been tested with the mapping.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `open_zarr` hangs if 's3://' at front of root s3fs string 406178487
444514608 https://github.com/pydata/xarray/pull/2559#issuecomment-444514608 https://api.github.com/repos/pydata/xarray/issues/2559 MDEyOklzc3VlQ29tbWVudDQ0NDUxNDYwOA== martindurant 6042212 2018-12-05T14:58:58Z 2018-12-05T14:58:58Z CONTRIBUTOR

I like those timings.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr consolidated 382497709
443804859 https://github.com/pydata/xarray/pull/2559#issuecomment-443804859 https://api.github.com/repos/pydata/xarray/issues/2559 MDEyOklzc3VlQ29tbWVudDQ0MzgwNDg1OQ== martindurant 6042212 2018-12-03T17:55:51Z 2018-12-03T17:55:51Z CONTRIBUTOR

LGTM

Do you think there should be more explicit text of how to add consolidation to existing zarr/xarray data-sets, rather than creating them with consolidation turned on?

We may also need some text around updating consolidated data-sets, but that can maybe wait to see what kind of usage people try.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr consolidated 382497709
442581092 https://github.com/pydata/xarray/pull/2559#issuecomment-442581092 https://api.github.com/repos/pydata/xarray/issues/2559 MDEyOklzc3VlQ29tbWVudDQ0MjU4MTA5Mg== martindurant 6042212 2018-11-28T19:49:43Z 2018-11-28T19:49:43Z CONTRIBUTOR

Glad to see this happening, by the way. Once in, catalogs using intake-xarray can be updated and I don't think the code will need to change.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr consolidated 382497709
442580432 https://github.com/pydata/xarray/pull/2559#issuecomment-442580432 https://api.github.com/repos/pydata/xarray/issues/2559 MDEyOklzc3VlQ29tbWVudDQ0MjU4MDQzMg== martindurant 6042212 2018-11-28T19:47:43Z 2018-11-28T19:47:43Z CONTRIBUTOR

Will the default for both options be False for the time being?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr consolidated 382497709
396930603 https://github.com/pydata/xarray/pull/2228#issuecomment-396930603 https://api.github.com/repos/pydata/xarray/issues/2228 MDEyOklzc3VlQ29tbWVudDM5NjkzMDYwMw== martindurant 6042212 2018-06-13T13:07:58Z 2018-06-13T13:07:58Z CONTRIBUTOR

Right now, to_zarr raises an exception in the case of irregular chunks, suggesting that the user should call .rechunk() first.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  fix zarr chunking bug 331752926
365412033 https://github.com/pydata/xarray/pull/1528#issuecomment-365412033 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NTQxMjAzMw== martindurant 6042212 2018-02-13T21:35:03Z 2018-02-13T21:35:03Z CONTRIBUTOR

Yeah, ideally when adding a variable like

```python
ds['myvar'] = xr.DataArray(data=da.zeros(..., chunks=(..)), dims=['l', 'b', 'v'])
ds.to_zarr(mapping)
```

we should be able to apply an optimization strategy in which the zarr array is created without filling in all those unnecessary zeros. This seems doable.

On the other hand, implementing

```python
ds.myvar[slice, slice, slice] = some data
ds.to_zarr(mapping)
```

(which cannot be done currently with dask-arrays at all), in such a way that only partitions with data get updated - this seems really hard.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364817111 https://github.com/pydata/xarray/pull/1528#issuecomment-364817111 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgxNzExMQ== martindurant 6042212 2018-02-12T02:43:43Z 2018-02-12T03:47:48Z CONTRIBUTOR

OK, so the way to do this in pure-zarr appears to be to simply create the appropriate zarr array and set its dimensions attribute:

```python
ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                        'l': np.arange(150, 72, -0.005),
                        'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
ds.to_zarr(mapping)
g = zarr.open_group(mapping)
arr = g.zeros(..., shape like l, b, v)
arr.attrs['_ARRAY_DIMENSIONS'] = ['l', 'b', 'v']
```

xr.open_zarr(mapping) now shows the new array, without having to materialize any data into it, and arr can be written to piecemeal - without the convenience of the coordinate mapping, of course.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364804697 https://github.com/pydata/xarray/pull/1528#issuecomment-364804697 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwNDY5Nw== martindurant 6042212 2018-02-12T00:19:55Z 2018-02-12T00:19:55Z CONTRIBUTOR

It might be enough, in this case, to provide some helper function in zarr to create and fetch arrays that will show up as variables in xarray - this need not be specific to being used via dask. I am assuming with the work done in this PR, that there is an unambiguous way to determine if a zarr group can be interpreted as an xarray dataset, and that zarr then knows how to add things that look like variables (which generally in the zarr case don't involve writing any actual data until the parts of the array are filled in).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364803984 https://github.com/pydata/xarray/pull/1528#issuecomment-364803984 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwMzk4NA== martindurant 6042212 2018-02-12T00:12:36Z 2018-02-12T00:12:36Z CONTRIBUTOR

@jhamman , that partially solves what I mean, I can probably turn my data into dask arrays with some difficulty; but really I was hoping for something like the following:

```python
ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                        'l': np.arange(150, 72, -0.005),
                        'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
arr = ds.create_new_zero_array(dims=['l', 'b', 'v'])
arr[0:10, :, :] = 1
```

and expect to be able to set the values of the new variable in the same way that you can with the equivalent zarr array. I can probably get around this by setting the values with da.zeros, finding the zarr array in the dataset, and then setting its values.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364801073 https://github.com/pydata/xarray/pull/1528#issuecomment-364801073 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwMTA3Mw== martindurant 6042212 2018-02-11T23:35:34Z 2018-02-11T23:35:34Z CONTRIBUTOR

Question: how would one build a zarr-xarray dataset?

With zarr you can open an array that contains no data, and use set-slice notation to fill in the values (which is what dask's store essentially does).

If I have some pre-known coordinates and bigger-than-memory data arrays, how would I go about getting the values into the zarr structure? If this can't be done directly with the xarray interface, is there a way to call zarr's open/create/zeros such that the corresponding array will appear as a variable when the same dataset is opened with xarray?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
351106449 https://github.com/pydata/xarray/issues/1770#issuecomment-351106449 https://api.github.com/repos/pydata/xarray/issues/1770 MDEyOklzc3VlQ29tbWVudDM1MTEwNjQ0OQ== martindurant 6042212 2017-12-12T16:31:55Z 2017-12-12T16:31:55Z CONTRIBUTOR

Yes, dirs exists to prevent the need to query the server for file listings multiple times. There is an outstanding issue to move to prefix/delimited listing, as in s3fs, rather than fetching the complete listing for a bucket. If all the paths are known beforehand, as might be the case for zarr, then it may be of no use at all - though in that case I'm not sure why it would have been populated.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance when storing datasets in gcsfs-backed zarr stores 280626621
351100872 https://github.com/pydata/xarray/issues/1770#issuecomment-351100872 https://api.github.com/repos/pydata/xarray/issues/1770 MDEyOklzc3VlQ29tbWVudDM1MTEwMDg3Mg== martindurant 6042212 2017-12-12T16:15:43Z 2017-12-12T16:15:43Z CONTRIBUTOR

I am puzzled that serializing the mapping is pulling the data. GCSMap does not define get/set-state methods, and its only attributes are the GCSFileSystem and the path. Perhaps __getitem__ gets called? As for the GCSFileSystem, it stores a renewable token, which lives indefinitely, and the refresh API is called upon deserialization. There should probably be a check in _call to ensure that the token hasn't expired.
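A quick diagnostic sketch for checking what serialization of the mapper actually captures (the bucket path is a placeholder, and this uses the current gcsfs/fsspec API rather than the 2017 one):

import pickle
import gcsfs

fs = gcsfs.GCSFileSystem(token='anon')
m = fs.get_mapper('some-bucket/some-data.zarr')   # placeholder path
blob = pickle.dumps(m)       # should capture only the filesystem and root path, not any object data
m2 = pickle.loads(blob)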

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance when storing datasets in gcsfs-backed zarr stores 280626621
345770374 https://github.com/pydata/xarray/pull/1528#issuecomment-345770374 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTc3MDM3NA== martindurant 6042212 2017-11-20T17:37:01Z 2017-11-20T17:37:01Z CONTRIBUTOR

This is, of course, by design :) I imagine there is much that could be done to optimise performance, but for fewer, larger chunks, it should be pretty good.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345104440 https://github.com/pydata/xarray/pull/1528#issuecomment-345104440 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTEwNDQ0MA== martindurant 6042212 2017-11-17T00:10:19Z 2017-11-17T00:10:19Z CONTRIBUTOR

hdfs3 also has a MutableMapping for HDFS. I did not succeed in getting one into azure-datalake-store, but it would not be hard to make. In this way, zarr can become a pretty general array cloud storage mechanism.
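With today's fsspec, all of these backends are reachable through one mapping interface; a sketch with a placeholder URL (assumes the relevant filesystem package, e.g. s3fs, is installed):

import fsspec
import zarr

mapper = fsspec.get_mapper('s3://my-bucket/my-data.zarr', anon=True)   # placeholder URL
g = zarr.open_group(mapper, mode='r')
print(list(g.array_keys()))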

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
333400272 https://github.com/pydata/xarray/pull/1528#issuecomment-333400272 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzMzQwMDI3Mg== martindurant 6042212 2017-10-01T19:26:22Z 2017-10-01T19:26:22Z CONTRIBUTOR

I have not done anything since posting my commit, I'm afraid. Its content is just an example of how you might pass parameters down to zarr, plus a test case which shows that the basic data round-trips properly - although the dataset does not come back with the same structure it started with. We can loop back and decide where to go from here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
327901739 https://github.com/pydata/xarray/pull/1528#issuecomment-327901739 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNzkwMTczOQ== martindurant 6042212 2017-09-07T19:36:15Z 2017-09-07T19:36:15Z CONTRIBUTOR

@shoyer , is https://github.com/martindurant/xarray/commit/6c1fb6b76ebba862a1c5831210ce026160da0065 a reasonable start ?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
327833777 https://github.com/pydata/xarray/pull/1528#issuecomment-327833777 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNzgzMzc3Nw== martindurant 6042212 2017-09-07T15:23:31Z 2017-09-07T15:23:31Z CONTRIBUTOR

@rabernat , is there anything I can do to help push this along?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325728378 https://github.com/pydata/xarray/pull/1528#issuecomment-325728378 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyODM3OA== martindurant 6042212 2017-08-29T17:00:29Z 2017-08-29T17:00:29Z CONTRIBUTOR

A further, rather big advantage of zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but that a given block is not written to disc at all if it contains only null data. This probably meshes perfectly well with most users' understanding of missing data/fill values.
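A small demonstration with zarr-python 2.x (the path, shapes and fill value here are arbitrary):

import zarr

z = zarr.open('sparse.zarr', mode='w', shape=(1000, 1000),
              chunks=(100, 100), dtype='f8', fill_value=float('nan'))
z[0:100, 0:100] = 1.0
print(z.nchunks, z.nchunks_initialized)   # 100 chunks logically, but only 1 stored on disc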

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325727354 https://github.com/pydata/xarray/pull/1528#issuecomment-325727354 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyNzM1NA== martindurant 6042212 2017-08-29T16:57:10Z 2017-08-29T16:57:10Z CONTRIBUTOR

Worth pointing out here that the zarr filter-set is extensible (I suppose hdf5's is too, but I don't think this is ever done in practice), though I don't think it makes any particular claims to performance.

I think both of the options above are reasonable, and there is no particular reason to exclude either: a zarr variable could look to xarray like floats but actually be stored as ints (i.e., arguments are passed to zarr), or it could look like ints which xarray expects to inflate to floats (i.e., stored as an attribute). I mean, if a user stores a float variable, but includes kwargs to zarr for scale/filter (or any other filter arguments), we should make no attempt to interrupt that.

The only question is: when a user wishes to apply scale/offset in xarray, which behaviour do they most likely intend? I would guess the latter - compute in xarray and use attributes - since xarray users probably don't know about zarr and its filters.
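The two options, sketched with present-day APIs (variable name, numbers and paths are invented; FixedScaleOffset is the relevant numcodecs filter):

import numpy as np
import xarray as xr
from numcodecs import FixedScaleOffset

ds = xr.Dataset({'t': ('x', np.linspace(250.0, 320.0, 100))})

# option 1: zarr does the work - floats in, ints on disc, via a filter passed through encoding
ds.to_zarr('scaled1.zarr', mode='w',
           encoding={'t': {'filters': [FixedScaleOffset(offset=250, scale=100, dtype='f8', astype='u2')]}})

# option 2: xarray does the work - CF-style scale_factor/add_offset attributes applied on decode
ds.to_zarr('scaled2.zarr', mode='w',
           encoding={'t': {'dtype': 'int16', 'scale_factor': 0.01, 'add_offset': 250.0, '_FillValue': -9999}})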

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325390391 https://github.com/pydata/xarray/pull/1528#issuecomment-325390391 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTM5MDM5MQ== martindurant 6042212 2017-08-28T15:41:08Z 2017-08-28T15:41:08Z CONTRIBUTOR

@rabernat : on actually looking through your code :) Happy to see you doing exactly what I felt I was not knowledgeable enough to do and poking xarray's guts. If I can help in any way, please let me know, although I don't have a lot of spare hours right now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325220001 https://github.com/pydata/xarray/pull/1528#issuecomment-325220001 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTIyMDAwMQ== martindurant 6042212 2017-08-27T19:46:31Z 2017-08-27T19:46:31Z CONTRIBUTOR

Sorry that I let this slide - there was not a huge upswell of interest around what I had done, and I was not ready to dive into xarray internals. Could you comment more on the difference between your approach and mine? Is the aim to reduce the number of metadata files hanging around? zarr has made an effort with the groups interface to parallel netCDF, which is, after all, what xarray essentially expects of all its data sources.

As in this comment, I have come to the realisation that, although nice to/from-zarr methods can be made relatively easily, they will not get traction unless they are put within a class that mimics the existing xarray infrastructure - i.e., the user would never know, except that magically they have extra encoding/compression options, the file path can be an S3 URL (say), and dask parallel computation suddenly works on a cluster and/or out-of-core. That would raise some eyebrows!
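That is more or less what the integrated interface came to look like; a sketch with a placeholder bucket (assumes s3fs is installed and credentials/permissions are in place):

import numpy as np
import fsspec
import xarray as xr

ds = xr.Dataset({'a': (('x',), np.arange(10))})
mapper = fsspec.get_mapper('s3://my-bucket/example.zarr')   # placeholder URL
ds.to_zarr(mapper, mode='w')
ds2 = xr.open_zarr(mapper)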

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
281990573 https://github.com/pydata/xarray/issues/1223#issuecomment-281990573 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI4MTk5MDU3Mw== martindurant 6042212 2017-02-23T13:25:36Z 2017-02-23T13:25:36Z CONTRIBUTOR

@alimanfoo , do you think this work would make more sense as part of zarr rather than as part of xarray?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275
281860859 https://github.com/pydata/xarray/issues/1223#issuecomment-281860859 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI4MTg2MDg1OQ== martindurant 6042212 2017-02-23T01:25:52Z 2017-02-23T01:25:52Z CONTRIBUTOR

True, xarray_to_zarr is unchanged from before. The dataset functions could supersede it, since a single xarray is just a special case of a dataset; or we could decide that for the special case it is worth having short-cut functions. I was worried about the number of metadata files being created, since on a remote system like S3 there is a large overhead to reading many small files.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275
281813651 https://github.com/pydata/xarray/issues/1223#issuecomment-281813651 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI4MTgxMzY1MQ== martindurant 6042212 2017-02-22T21:42:49Z 2017-02-22T21:43:05Z CONTRIBUTOR

@alimanfoo , in the new dataset save function I do exactly as you suggest, with everything getting put as a dict into the main zarr group attributes, using the special attribute names "attrs" for the dataset root, "coords" for the set of coordinate objects and "variables" for the set of variable objects (all of these have their own attributes in xarray).
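Illustratively, the attribute layout being described might look like this (example names only; this reflects the gist's scheme, not the eventual xarray backend):

import zarr

g = zarr.open_group('example_ds.zarr', mode='w')
g.attrs['attrs'] = {'title': 'my dataset'}                          # dataset-root attributes
g.attrs['coords'] = {'b': {'units': 'deg'}, 'l': {'units': 'deg'}}
g.attrs['variables'] = {'flux': {'dims': ['l', 'b'], 'units': 'K'}}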

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275
279181938 https://github.com/pydata/xarray/issues/1223#issuecomment-279181938 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI3OTE4MTkzOA== martindurant 6042212 2017-02-11T22:56:56Z 2017-02-11T22:56:56Z CONTRIBUTOR

I have developed my example a little to sidestep the subclassing you suggest, which seemed tricky to implement.

Please see https://gist.github.com/martindurant/06a1e98c91f0033c4649a48a2f943390 (dataset_to/from_zarr functions)

I can use the zarr groups structure to mirror at least the typical use of xarray: variables, coordinates, and sets of attributes on each. I have tested this with s3 too, stealing a little code from dask to show the idea.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275
274202189 https://github.com/pydata/xarray/issues/1223#issuecomment-274202189 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI3NDIwMjE4OQ== martindurant 6042212 2017-01-20T22:57:07Z 2017-01-20T22:57:07Z CONTRIBUTOR

3: a JSON-like representation, such as that used by the hidden .xarray item, would also do.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);