html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/5878#issuecomment-1315919098,https://api.github.com/repos/pydata/xarray/issues/5878,1315919098,IC_kwDOAMm_X85Ob1T6,48723181,2022-11-15T22:04:20Z,2022-11-15T22:04:20Z,NONE,"This is my latest attempt to avoid the cache issue. It is not working, but I wanted to document it here for the next time this comes up.

### 1. Run the following in a local jupyter notebook

```
import fsspec
import xarray as xr
import json
import gcsfs

## define a mapper to the ldeo-glaciology bucket
### needs a token
with open('/Users/jkingslake/Documents/misc/ldeo-glaciology-bc97b12df06b.json') as token_file:
    token = json.load(token_file)

filename = 'gs://ldeo-glaciology/append_test/test56'
mapper = fsspec.get_mapper(filename, mode='w', token=token)

## define two simple datasets
ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': [1, 2, 3]})
ds1 = xr.Dataset({'temperature': (['time'], [53, 54, 55])}, coords={'time': [4, 5, 6]})

## write the first ds to the bucket
ds0.to_zarr(mapper)
```

### 2. Run the following in a local terminal

`gsutil setmeta -h ""Cache-Control:no-store"" gs://ldeo-glaciology/append_test/test56/**`

to turn off caching for this zarr store and all the files associated with it.

### 3. Run the following in the local notebook

```
## append the second ds to the same zarr store
ds1.to_zarr(mapper, mode='a', append_dim='time')

ds = xr.open_dataset('gs://ldeo-glaciology/append_test/test56', engine='zarr', consolidated=False)
len(ds.time)
```

3

At least, it sometimes returns 3 like this; sometimes the append shows up only later, and sometimes it works immediately.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490
https://github.com/pydata/xarray/issues/5878#issuecomment-1315901990,https://api.github.com/repos/pydata/xarray/issues/5878,1315901990,IC_kwDOAMm_X85ObxIm,48723181,2022-11-15T21:45:22Z,2022-11-15T21:45:22Z,NONE,"Thanks @rabernat. Using `consolidated=False` when reading seems to work, but not immediately after the append, and there is very strange behavior where the size of the dataset changes each time you read it. So maybe this is the cache issue again.

It appears from [here](https://cloud.google.com/storage/docs/metadata#cache-control) that the default caching metadata on each object in a bucket overrides any argument you send when loading. But following https://stackoverflow.com/questions/52499015/set-metadata-for-all-objects-in-a-bucket-in-google-cloud-storage I can turn off caching for all objects in the bucket with

```
gsutil setmeta -h ""Cache-Control:no-store"" gs://ldeo-glaciology/**
```

However, I don't think this affects new objects. So when writing new objects that I want to append to, maybe the approach is to write the first one, then turn off caching for that object, then continue to append.
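
Something along these lines might do that per-object step from Python rather than with gsutil. This is an untested sketch; the use of the google-cloud-storage client (rather than gcsfs) is my assumption, and `gsutil setmeta` above achieves the same thing:

```python
# Untested sketch: after writing the first dataset, mark every object of the
# new zarr store as non-cacheable, then append. Bucket name, prefix and the
# service-account file are the ones used above; the client library is assumed.
from google.cloud import storage

client = storage.Client.from_service_account_json(
    '/Users/jkingslake/Documents/misc/ldeo-glaciology-bc97b12df06b.json')
bucket = client.bucket('ldeo-glaciology')

# ds0.to_zarr(mapper) has already written gs://ldeo-glaciology/append_test/test56
for blob in client.list_blobs(bucket, prefix='append_test/test56'):
    blob.cache_control = 'no-store'   # same effect as gsutil setmeta -h Cache-Control:no-store
    blob.patch()                      # push the metadata change to GCS

# then append as before
# ds1.to_zarr(mapper, mode='a', append_dim='time')
```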
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490 https://github.com/pydata/xarray/issues/5878#issuecomment-1315553661,https://api.github.com/repos/pydata/xarray/issues/5878,1315553661,IC_kwDOAMm_X85OacF9,1197350,2022-11-15T16:22:30Z,2022-11-15T16:22:30Z,MEMBER,"Your issue is that the consolidated metadata have not been updated: ```python import gcsfs fs = gcsfs.GCSFileSystem() # the latest array metadata print(fs.cat('gs://ldeo-glaciology/append_test/test30/temperature/.zarray').decode()) # -> ""shape"": [ 6 ] # the consolidated metadata print(fs.cat(''gs://ldeo-glaciology/append_test/test30/.zmetadata'').decode()) # -> ""shape"": [ 3 ] ``` There are two ways to fix this. 1. Don't use consolidated metadatda on read. (This will be a bit slower) ```python ds = xr.open_dataset('gs://ldeo-glaciology/append_test/test30', engine='zarr', consolidated=False) ``` 1. Reconsolidate your metadata after append. https://zarr.readthedocs.io/en/stable/tutorial.html#consolidating-metadata ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490 https://github.com/pydata/xarray/issues/5878#issuecomment-1315516624,https://api.github.com/repos/pydata/xarray/issues/5878,1315516624,IC_kwDOAMm_X85OaTDQ,48723181,2022-11-15T15:57:55Z,2022-11-15T15:57:55Z,NONE,"Coming back to this a year later, I am still having the same issue. Running the gsutil locally ``` gsutil cat gs://ldeo-glaciology/append_test/test30/temperature/.zarray ``` shows shape 6: ``` { ""chunks"": [ 3 ], ""compressor"": { ""blocksize"": 0, ""clevel"": 5, ""cname"": ""lz4"", ""id"": ""blosc"", ""shuffle"": 1 }, ""dtype"": "" >To test this hypothesis, you would need to disable caching on the bucket. Do you have privileges to do that? >I tried to do this last night but did not have permission myself. Perhaps @jkingslake does? @porterdf you should have full permissions to do things like this. But in any case, I could only see how to change metadata for individual existing objects rather than the entire bucket. How do I edit the cache-control for whole bucket? I have tried writing the first dataset, then changing its disabling caching for that object, then appending. I still do not see the full length (shape = [6]) dataset when I reload it. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490 https://github.com/pydata/xarray/issues/5878#issuecomment-968176008,https://api.github.com/repos/pydata/xarray/issues/5878,968176008,IC_kwDOAMm_X845tTGI,7237617,2021-11-13T23:43:17Z,2021-11-13T23:44:27Z,NONE,"Update: my local notebook accessing the public bucket **does** see the appended zarr store exactly as expected, while the 2i2c-hosted notebook still is not (been well over 3600s). Also, I do as @jkingslake does above and set the `cache_timeout=0`. 
From the [GCSFS docs](https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem), `Set cache_timeout <= 0 for no caching` seems like the functionality we desire, yet I continue to see only the un-appended zarr. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490
https://github.com/pydata/xarray/issues/5878#issuecomment-967408017,https://api.github.com/repos/pydata/xarray/issues/5878,967408017,IC_kwDOAMm_X845qXmR,7237617,2021-11-12T19:40:46Z,2021-11-13T23:25:53Z,NONE,"> Right now, it shows the shape is `[6]`, as expected after the appending. However, if you read the file immediately after appending (within the 3600s `max-age`), you will get the cached copy. The cached copy will still be of shape `[3]`--it won't know about the append.

Ignorant question: is this cache on the client (Jupyter) side or the server (GCS) side? It has been well over 3600s and I'm still not seeing the _appended_ zarr when reading it in using Xarray.

> To test this hypothesis, you would need to [disable caching](https://cloud.google.com/storage/docs/metadata) on the bucket. Do you have privileges to do that?

I tried to do this last night but did not have permission myself. Perhaps @jkingslake does?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490
https://github.com/pydata/xarray/issues/5878#issuecomment-967363845,https://api.github.com/repos/pydata/xarray/issues/5878,967363845,IC_kwDOAMm_X845qM0F,1197350,2021-11-12T19:18:38Z,2021-11-12T19:18:38Z,MEMBER,"Ok, I think I may understand what is happening.

```python
## load the zarr store
ds_both = xr.open_zarr(mapper)
```

When you do this, zarr reads a file called `gs://ldeo-glaciology/append_test/test5/temperature/.zarray`. Since the data are public, I can look at it right now:

```
$ gsutil cat gs://ldeo-glaciology/append_test/test5/temperature/.zarray
{
    ""chunks"": [ 3 ],
    ""compressor"": {
        ""blocksize"": 0,
        ""clevel"": 5,
        ""cname"": ""lz4"",
        ""id"": ""blosc"",
        ""shuffle"": 1
    },
    ""dtype"": ""
```

> Can you post the full stack trace of the error you get when you try to append?

In my instance, there is no error, only this returned: `` ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490
https://github.com/pydata/xarray/issues/5878#issuecomment-967142419,https://api.github.com/repos/pydata/xarray/issues/5878,967142419,IC_kwDOAMm_X845pWwT,1197350,2021-11-12T14:05:36Z,2021-11-12T14:05:36Z,MEMBER,Can you post the full stack trace of the error you get when you try to append?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490
https://github.com/pydata/xarray/issues/5878#issuecomment-966758709,https://api.github.com/repos/pydata/xarray/issues/5878,966758709,IC_kwDOAMm_X845n5E1,48723181,2021-11-12T02:05:04Z,2021-11-12T02:05:04Z,NONE,"Thanks for taking a look @rabernat. The code below writes a new zarr and checks immediately if it's there using gcsfs. It seems to appear within a few seconds. Is this what you meant?

```
%%time
import fsspec
import xarray as xr
import json
import gcsfs

# define a mapper to the ldeo-glaciology bucket
# - needs a token
with open('../secrets/ldeo-glaciology-bc97b12df06b.json') as token_file:
    token = json.load(token_file)

# get a mapper with fsspec for a new zarr
mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test11', mode='w', token=token)

# check what files are in there
fs = gcsfs.GCSFileSystem(project='pangeo-integration-te-3eea', mode='ab', cache_timeout = 0)
print('Files in the test directory before writing:')
filesBefore = fs.ls('gs://ldeo-glaciology/append_test/')
print(*filesBefore, sep='\n')

# define a simple dataset
ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': [1, 2, 3]})

# write the simple dataset to zarr
ds0.to_zarr(mapper)

# check to see if the new file is there
print('Files in the test directory after writing:')
filesAfter = fs.ls('gs://ldeo-glaciology/append_test/')
print(*filesAfter, sep='\n')
```

```
Output:
Files in the test directory before writing:
ldeo-glaciology/append_test/test1
ldeo-glaciology/append_test/test10
ldeo-glaciology/append_test/test2
ldeo-glaciology/append_test/test3
ldeo-glaciology/append_test/test4
ldeo-glaciology/append_test/test5
ldeo-glaciology/append_test/test6
ldeo-glaciology/append_test/test7
ldeo-glaciology/append_test/test8
ldeo-glaciology/append_test/test9
Files in the test directory after writing:
ldeo-glaciology/append_test/test1
ldeo-glaciology/append_test/test10
ldeo-glaciology/append_test/test11
ldeo-glaciology/append_test/test2
ldeo-glaciology/append_test/test3
ldeo-glaciology/append_test/test4
ldeo-glaciology/append_test/test5
ldeo-glaciology/append_test/test6
ldeo-glaciology/append_test/test7
ldeo-glaciology/append_test/test8
ldeo-glaciology/append_test/test9
CPU times: user 130 ms, sys: 16.5 ms, total: 146 ms
Wall time: 2.19 s
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490
https://github.com/pydata/xarray/issues/5878#issuecomment-966665066,https://api.github.com/repos/pydata/xarray/issues/5878,966665066,IC_kwDOAMm_X845niNq,1197350,2021-11-11T22:17:32Z,2021-11-11T22:17:32Z,MEMBER,"I think that this is not an issue with xarray, zarr, or anything in the python world, but rather an issue with how caching works on GCS public buckets: https://cloud.google.com/storage/docs/metadata

To test this, forget about xarray and zarr for a minute and just use [gcsfs](https://gcsfs.readthedocs.io/en/latest/) to list the bucket contents before and after your writes. I think you will find that the default cache lifetime of 3600 seconds means that you cannot ""see"" the changes to the bucket or the objects as quickly as needed in order to append.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1030811490
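
A minimal sketch of the listing test described in the last comment above (illustrative only, not code from the thread; the bucket path, anonymous access, and the elided write step are assumptions):

```python
# Untested sketch: use gcsfs directly, with its listings cache disabled,
# to check whether a just-written object becomes visible immediately.
import gcsfs

# cache_timeout <= 0 turns off gcsfs's local listings cache, so every ls()
# goes back to GCS; any remaining delay would then come from GCS-side caching.
fs = gcsfs.GCSFileSystem(token='anon', cache_timeout=0)

before = fs.ls('gs://ldeo-glaciology/append_test/')
# ... write or append the zarr store here, e.g. ds0.to_zarr(mapper) ...
fs.invalidate_cache()  # belt and braces: drop anything gcsfs cached anyway
after = fs.ls('gs://ldeo-glaciology/append_test/')

print(sorted(set(after) - set(before)))  # newly visible objects, if any
```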