issue_comments
7 rows where issue = 1030811490 and user = 48723181 sorted by updated_at descending

jkingslake · 2022-11-15T22:04:20Z · https://github.com/pydata/xarray/issues/5878#issuecomment-1315919098

This is my latest attempt to avoid the cache issue. It is not working. But I wanted to document it here for the next time this comes up.

1. Run the following in a local Jupyter notebook:

```python
import fsspec
import xarray as xr
import json
import gcsfs

# define a mapper to the ldeo-glaciology bucket
# needs a token
with open('/Users/jkingslake/Documents/misc/ldeo-glaciology-bc97b12df06b.json') as token_file:
    token = json.load(token_file)

filename = 'gs://ldeo-glaciology/append_test/test56'

mapper = fsspec.get_mapper(filename, mode='w', token=token)

# define two simple datasets
ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': [1, 2, 3]})
ds1 = xr.Dataset({'temperature': (['time'], [53, 54, 55])}, coords={'time': [4, 5, 6]})

# write the first ds to the bucket
ds0.to_zarr(mapper)
```

2. Run the following in a local terminal, to turn off caching for this zarr store and all the objects associated with it (a Python equivalent is sketched below):

   `gsutil setmeta -h "Cache-Control:no-store" gs://ldeo-glaciology/append_test/test56/**`

3. Run the following in the local notebook

```python
# append the second ds to the same zarr store
ds1.to_zarr(mapper, mode='a', append_dim='time')

ds = xr.open_dataset('gs://ldeo-glaciology/append_test/test56', engine='zarr', consolidated=False)
len(ds.time)
```

Output:

```
3
```

At least, it sometimes behaves this way; other times the append shows up only after a delay, and sometimes it works immediately.
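For anyone who prefers to stay in Python, the same metadata change as the gsutil step can be made with the google-cloud-storage client. This is a minimal sketch, not part of the original workflow; it assumes the service-account JSON above has permission to patch object metadata:

```python
# Sketch: Python equivalent of the "gsutil setmeta" step above, using
# the google-cloud-storage client (assumes the same service-account JSON).
from google.cloud import storage

client = storage.Client.from_service_account_json(
    '/Users/jkingslake/Documents/misc/ldeo-glaciology-bc97b12df06b.json')
bucket = client.bucket('ldeo-glaciology')

# patch Cache-Control on every object under the zarr store's prefix
for blob in client.list_blobs(bucket, prefix='append_test/test56/'):
    blob.cache_control = 'no-store'
    blob.patch()  # sends only the changed metadata field
```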

jkingslake · 2022-11-15T21:45:22Z · https://github.com/pydata/xarray/issues/5878#issuecomment-1315901990

Thanks @rabernat.

Using `consolidated=False` when reading seems to work, but not immediately after the append, and there is very strange behavior where the size of the dataset changes each time you read it. So maybe this is the cache issue again.

It appears from here that the default caching metadata on each object in a bucket overrides any argument you send when loading.

But following this https://stackoverflow.com/questions/52499015/set-metadata-for-all-objects-in-a-bucket-in-google-cloud-storage I can turn off caching for all objects in the bucket with `gsutil setmeta -h "Cache-Control:no-store" gs://ldeo-glaciology/**`

But I don't think this affects new objects.

So when writing new objects that I want to append to, maybe the approach is to write the first one, then turn off caching for that object, then continue to append.
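A minimal sketch of that sequence (write, disable caching, append), assuming `ds0`, `ds1`, and `mapper` are defined as in the snippets elsewhere in this thread and calling gsutil via subprocess; the store path is illustrative:

```python
# Sketch of the proposed order of operations, assuming ds0, ds1 and
# mapper are defined as in the earlier snippets (path illustrative).
import subprocess

ds0.to_zarr(mapper)  # initial write creates the store's objects

# disable caching on the objects that now exist in the store
subprocess.run(
    ['gsutil', 'setmeta', '-h', 'Cache-Control:no-store',
     'gs://ldeo-glaciology/append_test/test56/**'],
    check=True,
)

# append; note that chunk objects first created by this step would
# still carry the default Cache-Control and may need the same treatment
ds1.to_zarr(mapper, mode='a', append_dim='time')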

jkingslake · 2022-11-15T15:57:55Z · https://github.com/pydata/xarray/issues/5878#issuecomment-1315516624

Coming back to this a year later, I am still having the same issue.

Running gsutil locally:

```
gsutil cat gs://ldeo-glaciology/append_test/test30/temperature/.zarray
```

shows shape 6:

```
{
  "chunks": [3],
  "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
  "dtype": "<i8",
  "fill_value": null,
  "filters": null,
  "order": "C",
  "shape": [6],
  "zarr_format": 2
}
```

whereas running fsspec on leap-pangeo shows only shape 3:

```python
import fsspec
import xarray as xr
import json
import gcsfs

mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test30', mode='r')
ds_both = xr.open_zarr(mapper)
len(ds_both.temperature)
```

And trying to append using a new toy dataset written from leap-pangeo has the same issue.

Any ideas on what to try next?
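One possibility worth ruling out (a suggestion, not something tried in the thread so far): gcsfs keeps its own listings/metadata cache, so forcing a fresh filesystem with that cache disabled would show whether the stale shape is client-side or server-side:

```python
# Sketch: rule out gcsfs's client-side caching when reading the metadata.
# cache_timeout=0 disables gcsfs's listings cache; invalidate_cache()
# clears anything already cached on this instance.
import gcsfs

fs = gcsfs.GCSFileSystem(project='pangeo-integration-te-3eea', cache_timeout=0)
fs.invalidate_cache()
print(fs.cat('ldeo-glaciology/append_test/test30/temperature/.zarray').decode())
```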

jkingslake · 2021-11-15T23:18:46Z · https://github.com/pydata/xarray/issues/5878#issuecomment-969427141

But I am now really confused, because test5 from a few days ago shows up as shape [6]:

```python
import fsspec
import xarray as xr
import json
import gcsfs

mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test5', mode='r')
ds_both = xr.open_zarr(mapper)
len(ds_both.time)
```

Output:

```
/tmp/ipykernel_1040/570416536.py:7: RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  ds_both = xr.open_zarr(mapper)

6
```
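For reference, the consolidation step the warning recommends looks roughly like this (a sketch using the zarr v2 API; unlike the read-only mapper above, it needs write credentials, since it writes a `.zmetadata` object into the store):

```python
# Sketch: consolidate metadata as the warning suggests (zarr v2 API).
# Requires a mapper created with write credentials, unlike the
# read-only mapper above.
import zarr
import xarray as xr

zarr.consolidate_metadata(mapper)             # writes .zmetadata into the store
ds = xr.open_zarr(mapper, consolidated=True)  # no fallback warning now
```

One caveat: if the `.zmetadata` object itself gets cached, a stale copy could reproduce exactly this symptom after an append.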

@porterdf did you disable caching when you wrote the first zarr? How did you do that exactly?

jkingslake · 2021-11-15T23:15:38Z · https://github.com/pydata/xarray/issues/5878#issuecomment-969423439

1. In the jupyterhub (pangeo) command line with curl, I get shape [6]:

```
curl https://storage.googleapis.com/ldeo-glaciology/append_test/test30/temperature/.zarray
{
  "chunks": [3],
  "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
  "dtype": "<i8",
  "fill_value": null,
  "filters": null,
  "order": "C",
  "shape": [6],
  "zarr_format": 2
}
```

2. On my local machine using gsutil, I get shape [6]:

```
gsutil cat gs://ldeo-glaciology/append_test/test30/temperature/.zarray
{
  "chunks": [3],
  "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
  "dtype": "<i8",
  "fill_value": null,
  "filters": null,
  "order": "C",
  "shape": [6],
  "zarr_format": 2
}
```

3. When I use `fsspec` in the jupyterhub, I get something different, shape [3]:

```python
import fsspec
import xarray as xr
import json
import gcsfs

mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test30', mode='r')
ds_both = xr.open_zarr(mapper)
len(ds_both.time)
```

Output:

```
3
```

4. Using `gcsfs` in the jupyterhub, I get shape [3]:

```python
gcs = gcsfs.GCSFileSystem(project='pangeo-integration-te-3eea')
gcs.cat('ldeo-glaciology/append_test/test5/temperature/.zarray')
```

Output:

```
b'{\n "chunks": [\n 3\n ],\n "compressor": {\n "blocksize": 0,\n "clevel": 5,\n "cname": "lz4",\n "id": "blosc",\n "shuffle": 1\n },\n "dtype": "<i8",\n "fill_value": null,\n "filters": null,\n "order": "C",\n "shape": [\n 3\n ],\n "zarr_format": 2\n}'
```
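Since curl and gsutil agree on shape [6] while fsspec and gcsfs see [3], one way to look for cache involvement (a sketch, not something from the original comment) is to inspect the HTTP response headers on the public object URL used in item 1:

```python
# Sketch: inspect response headers on the object for cache evidence
# (e.g. Cache-Control, Age). URL pattern as in item 1 above.
import requests

r = requests.get('https://storage.googleapis.com/ldeo-glaciology/'
                 'append_test/test30/temperature/.zarray')
print(r.headers.get('Cache-Control'), r.headers.get('Age'))
print(r.json()['shape'])  # the .zarray object is plain JSON
```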

jkingslake · 2021-11-15T14:59:33Z · https://github.com/pydata/xarray/issues/5878#issuecomment-968994399

Thanks for looking into this.

> To test this hypothesis, you would need to disable caching on the bucket. Do you have privileges to do that?

> I tried to do this last night but did not have permission myself. Perhaps @jkingslake does?

@porterdf you should have full permissions to do things like this. But in any case, I could only see how to change metadata for individual existing objects rather than the entire bucket. How do I edit the cache-control for the whole bucket?

I have tried writing the first dataset, then disabling caching for that object, then appending. I still do not see the full-length (shape = [6]) dataset when I reload it.
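One way to confirm whether the per-object change actually took effect (a sketch, assuming the google-cloud-storage client; the object path is illustrative):

```python
# Sketch: verify an object's current Cache-Control after setting it
# (bucket name from this thread; the object path is illustrative).
from google.cloud import storage

client = storage.Client.from_service_account_json(
    '../secrets/ldeo-glaciology-bc97b12df06b.json')
blob = client.bucket('ldeo-glaciology').get_blob(
    'append_test/test30/temperature/.zarray')
print(blob.cache_control)  # expect 'no-store' if the change took effect
```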

jkingslake · 2021-11-12T02:05:04Z · https://github.com/pydata/xarray/issues/5878#issuecomment-966758709

Thanks for taking a look @rabernat.

The code below writes a new zarr and checks immediately if it's there using gcsfs. It seems to appear within a few seconds.

Is this what you meant?

```python
%%time
import fsspec
import xarray as xr
import json
import gcsfs

# define a mapper to the ldeo-glaciology bucket - needs a token
with open('../secrets/ldeo-glaciology-bc97b12df06b.json') as token_file:
    token = json.load(token_file)

# get a mapper with fsspec for a new zarr
mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test11', mode='w', token=token)

# check what files are in there
fs = gcsfs.GCSFileSystem(project='pangeo-integration-te-3eea', mode='ab', cache_timeout=0)
print('Files in the test directory before writing:')
filesBefore = fs.ls('gs://ldeo-glaciology/append_test/')
print(*filesBefore, sep='\n')

# define a simple dataset
ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': [1, 2, 3]})

# write the simple dataset to zarr
ds0.to_zarr(mapper)

# check to see if the new file is there
print('Files in the test directory after writing:')
filesAfter = fs.ls('gs://ldeo-glaciology/append_test/')
print(*filesAfter, sep='\n')
```

Output:

```
Files in the test directory before writing:
ldeo-glaciology/append_test/test1
ldeo-glaciology/append_test/test10
ldeo-glaciology/append_test/test2
ldeo-glaciology/append_test/test3
ldeo-glaciology/append_test/test4
ldeo-glaciology/append_test/test5
ldeo-glaciology/append_test/test6
ldeo-glaciology/append_test/test7
ldeo-glaciology/append_test/test8
ldeo-glaciology/append_test/test9
Files in the test directory after writing:
ldeo-glaciology/append_test/test1
ldeo-glaciology/append_test/test10
ldeo-glaciology/append_test/test11
ldeo-glaciology/append_test/test2
ldeo-glaciology/append_test/test3
ldeo-glaciology/append_test/test4
ldeo-glaciology/append_test/test5
ldeo-glaciology/append_test/test6
ldeo-glaciology/append_test/test7
ldeo-glaciology/append_test/test8
ldeo-glaciology/append_test/test9
CPU times: user 130 ms, sys: 16.5 ms, total: 146 ms
Wall time: 2.19 s
```
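A possible follow-up (not in the original comment): the listing check above shows the new store appears quickly, but the discrepancies reported elsewhere in this thread concern object contents rather than listings, so it may be worth reading the store's metadata bytes directly as well, reusing `fs` and the store path from the snippet:

```python
# Sketch: check content visibility, not just listing visibility, by
# reading the zarr array metadata object directly ('fs' and the path
# come from the snippet above).
zarray = fs.cat('ldeo-glaciology/append_test/test11/temperature/.zarray')
print(zarray.decode())  # expect "shape": [3] right after the write
```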

