issue_comments

10 comments where issue = 1197117301 (pydata/xarray #6456, "Writing a dataset to .zarr in a loop makes all the data NaNs"), sorted by updated_at descending. Commenters: max-sixty (4), tbloch1 (4), delgadom (2).


1099643203 · max-sixty (MEMBER) · 2022-04-14T21:31:37Z
https://github.com/pydata/xarray/issues/6456#issuecomment-1099643203

> @max-sixty could you explain which bit isn't working for you? The initial example I shared works fine in colab for me, so that might be a you problem. The second one required specifying the chunks when making the datasets (I've edited above).

Right, you changed the example after I responded.

> But this bug report was more about the fact that overwriting was converting data to NaNs (in two different ways depending on the code apparently).
>
> In my case there is no longer any need to do the overwriting, but this doesn't seem like the expected behaviour of overwriting, and I'm sure there are some valid reasons to overwrite data - hence me opening the bug report.

Something surprising is indeed going on here. To focus on the surprising part:

```python
print(ds3.low_dim.values)

ds3.to_zarr('zarr_bug.zarr', mode='w')

print(ds3.low_dim.values)
```

returns:

[[2. 3. 2. ... 8. 0. 9.]
 [6. 2. 6. ... 2. 4. 3.]
 [0. 8. 8. ... 6. 5. 4.]
 ...
 [1. 0. 5. ... 2. 0. 3.]
 [5. 5. 7. ... 9. 6. 2.]
 [5. 7. 8. ... 4. 8. 9.]]
[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [ 1.  0.  5. ...  2.  0.  3.]
 [ 5.  5.  7. ...  9.  6.  2.]
 [ 5.  7.  8. ...  4.  8.  9.]]

Similarly:

```python
In [50]: ds3.low_dim.count().compute()
Out[50]:
<xarray.DataArray 'low_dim' ()>
array(1000000)

In [51]: ds3.to_zarr('zarr_bug.zarr', mode='w')
Out[51]: <xarray.backends.zarr.ZarrStore at 0x16a27c6d0>

In [55]: ds3.low_dim.count().compute()
Out[55]:
<xarray.DataArray 'low_dim' ()>
array(500000)
```

So it's changing the result in memory just from writing to the Zarr store. I'm not sure what the cause is.

We can still massively reduce the size of this example — it's currently doing pickling, has a bunch of repeated code, etc. Does it work without the pickling? What if `ds3 = xr.concat([ds1, ds1.copy(deep=True)])`, etc.?
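As a rough illustration of that suggestion, here is a minimal sketch of a pickle-free reproducer. The dataset shape, chunking, and the 'zarr_bug.zarr' path are borrowed from the full example further down the thread; the exact construction is illustrative, not the issue author's code:

```python
import numpy as np
import xarray as xr

# Build a small chunked dataset and write it out once.
ds = xr.DataArray(
    np.random.randint(0, 10, (1000, 500)).astype(np.float32),
    dims=['fname', 'res_dim'],
).to_dataset(name='low_dim').chunk({'fname': 500})
ds.to_zarr('zarr_bug.zarr', mode='w')

# Re-open it lazily, concatenate it with a copy of itself,
# then overwrite the same store it is still backed by.
ds1 = xr.open_zarr('zarr_bug.zarr')
ds3 = xr.concat([ds1, ds1.copy(deep=True)], dim='fname')

print(ds3.low_dim.values)                # values as expected
ds3.to_zarr('zarr_bug.zarr', mode='w')   # overwrite the store ds3 reads from
print(ds3.low_dim.values)                # reportedly comes back with NaNs
```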

1098856530 · tbloch1 (NONE) · 2022-04-14T08:37:11Z
https://github.com/pydata/xarray/issues/6456#issuecomment-1098856530

@delgadom thanks! This did help with my actual code, and I've now done my processing.

But this bug report was more about the fact that overwriting was converting data to NaNs (in two different ways depending on the code apparently).

In my case there is no longer any need to do the overwriting, but this doesn't seem like the expected behaviour of overwriting, and I'm sure there are some valid reasons to overwrite data - hence me opening the bug report.

If overwriting is supposed to convert data to NaNs then I guess we could close this issue, but I'm not sure that's intended?

1098574761 · delgadom (CONTRIBUTOR) · 2022-04-13T23:34:16Z (edited 2022-04-13T23:34:48Z)
https://github.com/pydata/xarray/issues/6456#issuecomment-1098574761

> In the example it's saving every iteration, but in my actual code it's much less frequent

When I said "you're overwriting the file every iteration" I meant to put the emphasis on overwriting. By using `mode='w'` instead of `mode='a'` you're telling zarr to delete the file if it exists and then re-create it every time `to_zarr` is executed.

See the docs on xr.Dataset.to_zarr:

mode ({"w", "w-", "a", "r+", None}, optional) – Persistence mode: “w” means create (overwrite if exists); “w-” means create (fail if exists); “a” means override existing variables (create if does not exist); “r+” means modify existing array values only (raise an error if any metadata or shapes would change). The default mode is “a” if append_dim is set. Otherwise, it is “r+” if region is set and w- otherwise.

This interpretation of mode is consistent across all of Python - see the docs for the Python builtin open.

So I think changing your writes to `ds3.to_zarr('zarr_bug.zarr', mode='a')` as Max suggested will get you a good part of the way there :)
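For concreteness, here is a hedged sketch of what an append-based loop could look like, assuming each batch carries new, non-overlapping 'fname' labels. The variable names, label scheme, and store path are illustrative, not code from the issue:

```python
import numpy as np
import xarray as xr

store = 'zarr_bug.zarr'

# Two illustrative batches with non-overlapping 'fname' labels
# (stand-ins for the per-pickle-file dataframes in the issue).
batches = []
for i in range(2):
    labels = [f'file_{i}_{j}' for j in range(1000)]
    da = xr.DataArray(
        np.random.randint(0, 10, (1000, 500)).astype(np.float32),
        dims=['fname', 'res_dim'],
        coords={'fname': labels, 'res_dim': np.arange(500)},
    )
    batches.append(da.to_dataset(name='low_dim'))

# The first batch creates the store; later batches are appended along
# 'fname' instead of re-creating the whole store with mode='w'.
batches[0].to_zarr(store, mode='w')
for batch in batches[1:]:
    batch.to_zarr(store, mode='a', append_dim='fname')
```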

1096382964 · tbloch1 (NONE) · 2022-04-12T08:47:55Z (edited 2022-04-12T08:48:48Z)
https://github.com/pydata/xarray/issues/6456#issuecomment-1096382964

@max-sixty could you explain which bit isn't working for you? The initial example I shared works fine in colab for me, so that might be a you problem. The second one required specifying the chunks when making the datasets (I've edited above).

Here's a link to the colab (which has both examples).

It's worth noting that the way in which the dataset is broken does seem to be slightly different in each of these examples - in the former example all data becomes NaN, in the latter example only the initially saved data becomes NaN.

1094583214 · tbloch1 (NONE) · 2022-04-11T06:01:44Z (edited 2022-04-12T08:48:13Z)
https://github.com/pydata/xarray/issues/6456#issuecomment-1094583214

@max-sixty - I've tried to slim it down below (no loop, and only one save). From the print statements it's clear that before overwriting the .zarr, ds3 is working correctly, but once ds3 is saved, the data corresponding to the initial save is broken (now all NaNs). I am guessing this is due to trying to read from and save over the same data, but I wouldn't have expected it to be a problem if it was loading the chunks into memory during the saving.

```python
import pandas as pd
import numpy as np
import glob
import xarray as xr
from tqdm import tqdm

# Creating pkl files
[pd.DataFrame(np.random.randint(0,10, (1000,500))).astype(object).to_pickle('df{}.pkl'.format(i)) for i in range(4)]

fnames = glob.glob('*.pkl')

df1 = pd.read_pickle(fnames[0])
df1.columns = np.arange(0,500).astype(object)  # the real pkl files contain all objects
df1.index = np.arange(0,1000).astype(object)
df1 = df1.astype(np.float32)

ds = xr.DataArray(df1.values, dims=['fname', 'res_dim'],
                  coords={'fname': df1.index.values, 'res_dim': df1.columns.values})
ds = ds.to_dataset(name='low_dim').chunk({'fname': 500, 'res_dim': 1})

ds.to_zarr('zarr_bug.zarr', mode='w')
ds1 = xr.open_zarr('zarr_bug.zarr', decode_coords="all")

df2 = pd.read_pickle(fnames[1])
df2.columns = np.arange(0,500).astype(object)
df2.index = np.arange(0,1000).astype(object)
df2 = df2.astype(np.float32)

ds2 = xr.DataArray(df2.values, dims=['fname', 'res_dim'],
                   coords={'fname': df2.index.values, 'res_dim': df2.columns.values})
ds2 = ds2.to_dataset(name='low_dim').chunk({'fname': 500, 'res_dim': 1})

ds3 = xr.concat([ds1, ds2], dim='fname')
ds3['fname'] = ds3.fname.astype(str)

print(ds3.low_dim.values)

ds3.to_zarr('zarr_bug.zarr', mode='w')

print(ds3.low_dim.values)
```

The output:

[[7. 8. 4. ... 9. 6. 7.]
 [0. 4. 5. ... 9. 7. 6.]
 [3. 4. 3. ... 1. 6. 1.]
 ...
 [4. 0. 4. ... 5. 6. 9.]
 [5. 2. 5. ... 1. 7. 1.]
 [8. 9. 7. ... 4. 4. 1.]]
[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [ 4.  0.  4. ...  5.  6.  9.]
 [ 5.  2.  5. ...  1.  7.  1.]
 [ 8.  9.  7. ...  4.  4.  1.]]
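One way to probe the guess above (not something from the issue itself) would be to force ds3 into memory before the overwrite, so the write no longer reads lazily from the store it is replacing. A minimal sketch reusing the names from the example above:

```python
# Load the lazily-backed data into memory first, then overwrite the store.
ds3 = ds3.load()

print(ds3.low_dim.values)
ds3.to_zarr('zarr_bug.zarr', mode='w')
print(ds3.low_dim.values)
```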

1095585081 · max-sixty (MEMBER) · 2022-04-11T21:29:27Z
https://github.com/pydata/xarray/issues/6456#issuecomment-1095585081

@tbloch1 it doesn't copy into someone else's python atm — that's the "C" part of MCVE...

1094587632 · tbloch1 (NONE) · 2022-04-11T06:07:06Z (edited 2022-04-11T10:42:51Z)
https://github.com/pydata/xarray/issues/6456#issuecomment-1094587632

@delgadom - In the example it's saving every iteration, but in my actual code it's much less frequent. I figured there was probably a better way to achieve the same thing, but it still doesn't seem like the expected behaviour, which is why I thought I should raise the issue here.

The files are just sequentially named (as in my example), but the indices of the resulting dataframes are a bunch of unique strings (file-paths, not dates).

1094412198 · max-sixty (MEMBER) · 2022-04-10T23:46:53Z
https://github.com/pydata/xarray/issues/6456#issuecomment-1094412198

> Have you tried asking on stackoverflow with the xarray tag?

Or GH Discussions! But it would need a smaller MCVE

1094411214 · delgadom (CONTRIBUTOR) · 2022-04-10T23:40:49Z · 1 👍
https://github.com/pydata/xarray/issues/6456#issuecomment-1094411214

@tbloch1 following up on Max's suggestion - it looks like you might be overwriting the file with every iteration. See the docs on `ds.to_zarr` - `mode='w'` will overwrite the file while `mode='a'` will append. That said, you still would need your indices to not overlap. How are you distinguishing between the files? Is each one a different point in time?

To me, this doesn't seem likely to be a bug, but is more of a usage question. Have you tried asking on stackoverflow with the xarray tag?

1093253883 · max-sixty (MEMBER) · 2022-04-08T19:05:12Z
https://github.com/pydata/xarray/issues/6456#issuecomment-1093253883

Hi @tbloch1 — thanks for the issue

So I understand — is this loading the existing dataset, adding a slice, and then writing the whole result? Have you considered using mode='a' if you want to write from different processes?

For the example — would it be possible to slim that down a bit further? Does it happen with one read & write after the initial one?


