issue_comments

19 rows where author_association = "CONTRIBUTOR" and user = 145117 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at author_association body reactions performed_via_github_app issue
1269235342 https://github.com/pydata/xarray/pull/6812#issuecomment-1269235342 https://api.github.com/repos/pydata/xarray/issues/6812 IC_kwDOAMm_X85Lpv6O mankoff 145117 2022-10-06T02:48:22Z 2022-10-06T02:48:22Z CONTRIBUTOR

A bit more detail about the existing tests that don't match the CF spec. Per the spec, scale_factor and add_offset should be of the same type. That causes tests throughout https://github.com/pydata/xarray/blob/main/xarray/tests/test_coding.py and https://github.com/pydata/xarray/blob/main/xarray/tests/test_backends.py to fail, because:

https://github.com/pydata/xarray/blob/13c52b27b777709fc3316cf4334157f50904c02b/xarray/tests/test_coding.py#L112-L113

There is 1 test in test_coding, and 9 tests in test_backends that use mixed types. That's a tractable number I can fix.
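For illustration, here is a hypothetical pair of encodings (not the actual test data) showing the distinction the spec draws:

```python
import numpy as np

# CF-compliant: scale_factor and add_offset share a single float type.
compliant = {"scale_factor": np.float32(0.01), "add_offset": np.float32(10.0)}

# Mixed types, as some of the existing tests use - this is what the spec rules out.
mixed = {"scale_factor": np.float64(0.01), "add_offset": np.float32(10.0)}
```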

In addition, the expected dtype returned by many of the tests does not match (my interpretation of) the expected dtype per the CF spec.

I am concerned that this is a significant change and I'm not sure what the process is for making it. I would like to have some idea, even if not a guarantee, that it would be welcomed and accepted before doing all the work. I note that another recent large PR trying to fix CF decoding has also stalled, and I'm not sure why (see #2751).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improved CF decoding 1309966595
1266136366 https://github.com/pydata/xarray/pull/6812#issuecomment-1266136366 https://api.github.com/repos/pydata/xarray/issues/6812 IC_kwDOAMm_X85Ld7Uu mankoff 145117 2022-10-03T22:29:28Z 2022-10-03T22:29:28Z CONTRIBUTOR

Hi @dcherian - I dropped this because I went down a rabbit hole that seemed very, very deep.

Xarray has 10s (100s?) of tests that touch this decoding function and that make assumptions I believe are incorrect after a careful reading of the CF spec. I believe the path forward will take some conversation before coding, so perhaps this should be moved to an issue rather than a pull request? A big decision is whether the decode option strictly follows the CF guidelines. If so, then a lot of tests need to be changed (for example, to follow the simple rule that scale_factor and add_offset must both be of type float or both be of type double).

Enforcing this would probably break xarray backward compatibility for writing files. I assume that may be OK and that there are processes to handle it (start with deprecation warnings, then eventually throw errors?). There are also likely many NetCDF files that are not standards-compliant, and we need to decide how to read them.

Furthermore, the CF conventions are themselves not very clear, and possibly ambiguous. I started a conversation on this at https://github.com/cf-convention/cf-conventions/issues/374, but that is also unresolved at the moment. The CF convention mentions int and float, but not how many bytes those are. What happens when a file is written and packed on one architecture and read and unpacked on another?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improved CF decoding 1309966595
1201464999 https://github.com/pydata/xarray/issues/2304#issuecomment-1201464999 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HnOan mankoff 145117 2022-08-01T16:56:01Z 2022-08-01T16:56:01Z CONTRIBUTOR

Packing Qs

  • If "the variable containing the packed data must be of type byte, short or int", how do we choose what size int?
  • What to do if scale_factor and add_offset are not float or double? What if they are different types?
    • I assume issue a warning and continue?

Unpacking Qs

  • Should the unpacked data just be np.find_common_type([data, add_offset, scale_factor], []), or should we then bump the type up by one level (float16->32, 32->64, 64->128, etc.) to cover overflow? (A rough sketch of that option is below.)
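A rough sketch of that promotion-plus-bump idea (using np.promote_types in place of np.find_common_type, with a hypothetical one-level bump table; this is not xarray's current behaviour):

```python
import numpy as np

# Hypothetical one-level "bump" to guard against overflow after unpacking.
_BUMP = {
    np.dtype("float16"): np.dtype("float32"),
    np.dtype("float32"): np.dtype("float64"),
    np.dtype("float64"): np.dtype("longdouble"),
}

def unpacked_dtype(data_dtype, attr_dtype, bump=False):
    """Common type of the packed data and the packing attributes, optionally bumped."""
    common = np.promote_types(data_dtype, attr_dtype)
    return _BUMP.get(common, common) if bump else common

print(unpacked_dtype(np.dtype("int16"), np.dtype("float32")))             # float32
print(unpacked_dtype(np.dtype("int16"), np.dtype("float32"), bump=True))  # float64
```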
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1201461626 https://github.com/pydata/xarray/issues/2304#issuecomment-1201461626 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HnNl6 mankoff 145117 2022-08-01T16:52:47Z 2022-08-01T16:52:47Z CONTRIBUTOR
  • From: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html

> This standard is more restrictive than the NUG with respect to the use of the scale_factor and add_offset attributes; ambiguities and precision problems related to data type conversions are resolved by these restrictions.

> If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data.

  • What if the result of the operation leads to overflow?

> However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double.

  • What if they are not of the same type?

    • Presumably, use the largest of the three types.
  • Again, this may lead to loss of precision. What if packed data is type int64 and scale_factor is type float16? It seems like the result should be float64, not float16.

> An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int.

  • What to do if packed data is type float or double?

> It is not advised to unpack an int into a float as there is a potential precision loss.

I think this means double is advised? If so, this should be stated. The text should be rephrased to advise what to do (if there is one choice, or only a few) rather than what not to do, or at least include that advice alongside the current wording.
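A quick numpy illustration of that precision concern (a standalone sketch, not from the spec or from xarray):

```python
import numpy as np

packed = np.int64(30001)   # packed data
scale = np.float16(0.01)   # scale_factor stored as float16

# Unpacking in the attribute type (float16) cannot even represent the
# packed integer exactly (30001 rounds to 30000); float64 can.
as_float16 = np.float16(packed) * scale
as_float64 = np.float64(packed) * np.float64(scale)
print(as_float16, as_float64)  # the two results differ
```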

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1201443208 https://github.com/pydata/xarray/pull/6851#issuecomment-1201443208 https://api.github.com/repos/pydata/xarray/issues/6851 IC_kwDOAMm_X85HnJGI mankoff 145117 2022-08-01T16:35:44Z 2022-08-01T16:35:44Z CONTRIBUTOR

> Thanks @mankoff Is there a test we could add?

There's a whole table of tests! https://github.com/pydata/xarray/issues/2304#issuecomment-1200627783

But now I'm building a test for the code as-is, which isn't CF-compliant. Is this worth writing?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix logic bug - add_offset is in encoding, not attrs. 1322645651
1200627783 https://github.com/pydata/xarray/issues/2304#issuecomment-1200627783 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HkCBH mankoff 145117 2022-08-01T02:49:28Z 2022-08-01T05:55:15Z CONTRIBUTOR

Current algorithm

```python
def _choose_float_dtype(dtype, has_offset):
    """Return a float dtype that can losslessly represent `dtype` values."""
    # Keep float32 as-is. Upcast half-precision to single-precision,
    # because float16 is "intended for storage but not computation"
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    # float32 can exactly represent all integers up to 24 bits
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        # A scale factor is entirely safe (vanishing into the mantissa),
        # but a large integer offset could lead to loss of precision.
        # Sensitivity analysis can be tricky, so we just use a float64
        # if there's any offset at all - better unoptimised than wrong!
        if not has_offset:
            return np.float32
    # For all other types and circumstances, we just use float64.
    # (safe because eg. complex numbers are not supported in NetCDF)
    return np.float64
```

Due to the calling bug, has_offset is always None, so this can be simplified to:

```python
def _choose_float_dtype(dtype):
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        return np.float32
    return np.float64
```

Here I call the function twice, once with has_offset False, then True.

```python
import numpy as np

def _choose_float_dtype(dtype, has_offset):
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        if not has_offset:
            return np.float32
    return np.float64

# generic types
for dtype in [np.byte, np.ubyte, np.short, np.ushort, np.intc, np.uintc,
              np.int_, np.uint, np.longlong, np.ulonglong,
              np.half, np.float16, np.single, np.double, np.longdouble,
              np.csingle, np.cdouble, np.clongdouble,
              np.int8, np.int16, np.int32, np.int64,
              np.uint8, np.uint16, np.uint32, np.uint64,
              np.float16, np.float32, np.float64]:
    print("|", dtype, "|",
          _choose_float_dtype(np.dtype(dtype), False), "|",
          _choose_float_dtype(np.dtype(dtype), True), "|")
```

| Input | Output as called | Output as written |
|-------|------------------|-------------------|
| <class 'numpy.int8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.longlong'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.ulonglong'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float32'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float128'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex128'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.complex256'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.int32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.int64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint8'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint16'> | <class 'numpy.float32'> | <class 'numpy.float64'> |
| <class 'numpy.uint32'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.uint64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |
| <class 'numpy.float16'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float32'> | <class 'numpy.float32'> | <class 'numpy.float32'> |
| <class 'numpy.float64'> | <class 'numpy.float64'> | <class 'numpy.float64'> |

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1200266255 https://github.com/pydata/xarray/issues/2304#issuecomment-1200266255 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85HipwP mankoff 145117 2022-07-30T17:58:51Z 2022-07-30T17:58:51Z CONTRIBUTOR

This issue, based on its title and initial post, is fixed by PR #6851. The code to select dtype was already correct, but the outer function that called it had a bug in the call.

Per the CF spec,

> the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.

I find this ambiguous. Is "float" above referring to float16 or float32? Is "double" referring to float64? If so, then they do recommend float64, as requested by the OP, because the test data is short and the scale_factor is float64 (a.k.a. double?).

The broader discussion here is about CF compliance. I find the spec ambiguous and xarray non-compliant. So many tests rely on the existing behavior that I am unsure how best to proceed to improve compliance. I worry it may be a major refactor and possibly break things that rely on the existing behavior. I'd like to discuss architecture. Should this be in a new issue, if this one closes with PR #6851? Should there be a new keyword such as cf_strict?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1189512813 https://github.com/pydata/xarray/pull/6812#issuecomment-1189512813 https://api.github.com/repos/pydata/xarray/issues/6812 IC_kwDOAMm_X85G5oZt mankoff 145117 2022-07-19T20:19:29Z 2022-07-19T20:19:29Z CONTRIBUTOR

I'm reading more in https://github.com/pydata/xarray/blob/2a5686c6fe855502523e495e43bd381d14191c7b/xarray/coding/variables.py and I'm confused about some logic:

https://github.com/pydata/xarray/blob/2a5686c6fe855502523e495e43bd381d14191c7b/xarray/coding/variables.py#L271-L272

pop_to does a pop operation - it removes the key/value pair. So the first line above removes add_offset from attrs if it exists. The second line then checks for "add_offset" in attrs, which should therefore always be False.

I think this is happening based on inspecting with the debugger.

Furthermore, the fix I implemented in this pull request, which returns np.float64, fixes my bug, but only because this bug exists. My dataset has add_offset, so the lines I changed:

```python
if not has_offset:
    return np.float64
```

should not run, but do run because of this issue.
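A minimal stand-in for that sequence (a hypothetical helper, not xarray's actual pop_to implementation) shows why the check can never succeed:

```python
# Hypothetical stand-in for pop_to: move a key from attrs into encoding.
def pop_to(source, dest, key):
    value = source.pop(key, None)
    if value is not None:
        dest[key] = value
    return value

attrs = {"add_offset": 10.0, "scale_factor": 0.01}
encoding = {}

pop_to(attrs, encoding, "add_offset")  # removes "add_offset" from attrs
print("add_offset" in attrs)           # False - so the subsequent check can't succeed
```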

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improved CF decoding 1309966595
1189485451 https://github.com/pydata/xarray/pull/6812#issuecomment-1189485451 https://api.github.com/repos/pydata/xarray/issues/6812 IC_kwDOAMm_X85G5huL mankoff 145117 2022-07-19T19:46:23Z 2022-07-19T19:46:23Z CONTRIBUTOR

Note - I also have not run the "Running the performance test suite" code in https://xarray.pydata.org/en/stable/contributing.html - I assume changing from float32 to float64 would impact performance. I can run that if suggested.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improved CF decoding 1309966595
1188529343 https://github.com/pydata/xarray/issues/2304#issuecomment-1188529343 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85G14S_ mankoff 145117 2022-07-19T02:35:30Z 2022-07-19T03:20:51Z CONTRIBUTOR

I've run into this issue too, and the xarray decision to use float32 is causing problems. I recognize this is a generic floating-point representation issue, but it could be avoided with float64.

The data value is 1395. The scale is 0.0001.

```python
import numpy as np

val = int(1395)
scale = 0.0001
print(val * scale)                               # 0.1395
print(val * np.array(scale).astype(float))       # 0.1395
print(val * np.array(scale).astype(np.float16))  # 0.1395213...
print(val * np.array(scale).astype(np.float32))  # 0.13949999...
print(val * np.array(scale).astype(np.float64))  # 0.1395
```

Because we then multiply by 1E3 and round(), the difference between 0.1395 and 0.1394999 (or 139.5 and 139.49) ends up being quite large in the downstream product.
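For example (a standalone sketch of that rounding step, not the actual downstream code):

```python
import numpy as np

val, scale = 1395, 0.0001
f32 = float(val * np.float32(scale))  # ~0.13949999
f64 = float(val * np.float64(scale))  # 0.1395

print(round(f32 * 1e3), round(f64 * 1e3))  # 139 vs 140 - a full unit apart
```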

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
708594913 https://github.com/pydata/xarray/issues/2139#issuecomment-708594913 https://api.github.com/repos/pydata/xarray/issues/2139 MDEyOklzc3VlQ29tbWVudDcwODU5NDkxMw== mankoff 145117 2020-10-14T18:52:38Z 2020-10-14T18:52:38Z CONTRIBUTOR

The issue is that if you pass names = ['a','b','c'] to pd.read_csv and there are more columns than names, it takes the leading columns without a name and uses them as a multi-index. The bug in my code was that I had more columns than names, didn't want a multi-index, and didn't make use of usecols.
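A minimal reproduction of that pandas behaviour (made-up data, just to illustrate):

```python
import io
import pandas as pd

csv = io.StringIO("1,2,3,4,5\n6,7,8,9,10\n")

# Five data columns but only three names: pandas uses the extra leading
# columns as the index, which here becomes a two-level MultiIndex.
df = pd.read_csv(csv, names=["a", "b", "c"])
print(df.index)  # MultiIndex
print(df)
```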

This multi-index came from a small 12 MB file - 5000 rows and 40 variables. When I then did df.to_xarray() it filled up my RAM. If I ran the code I provided above, it worked.

Now that I've figured all this out, I don't think that any bugs exist in xarray or pandas, just my code. As usual :). But if the fact that I can fill ram with df.to_xarray() but not with the 3 lines shown above sounds like an issue you want to explore, I'm happy to provide an MWE on a new ticket and tag you there. Let me know...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  From pandas to xarray without blowing up memory 323703742
708513119 https://github.com/pydata/xarray/issues/2139#issuecomment-708513119 https://api.github.com/repos/pydata/xarray/issues/2139 MDEyOklzc3VlQ29tbWVudDcwODUxMzExOQ== mankoff 145117 2020-10-14T16:23:36Z 2020-10-14T16:23:36Z CONTRIBUTOR

@max-sixty Sorry for posting this here. This memory blow-up was a byproduct of another bug that it took me a few more hours to track down. This other bug is in Pandas, not xarray.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  From pandas to xarray without blowing up memory 323703742
708339519 https://github.com/pydata/xarray/issues/2139#issuecomment-708339519 https://api.github.com/repos/pydata/xarray/issues/2139 MDEyOklzc3VlQ29tbWVudDcwODMzOTUxOQ== mankoff 145117 2020-10-14T11:25:03Z 2020-10-14T11:25:03Z CONTRIBUTOR

Late reply, but if anyone else finds this issue, I was filling memory with: ds = df.to_xarray(), but if I build the dataset more manually, I have no memory issues:

```python
ds = xr.Dataset({df.columns[0]: xr.DataArray(data=df[df.columns[0]],
                                             dims=['index'],
                                             coords={'index': df.index})})
for c in df.columns[1:]:
    ds[c] = (('index'), df[c])
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  From pandas to xarray without blowing up memory 323703742
706688398 https://github.com/pydata/xarray/issues/4498#issuecomment-706688398 https://api.github.com/repos/pydata/xarray/issues/4498 MDEyOklzc3VlQ29tbWVudDcwNjY4ODM5OA== mankoff 145117 2020-10-11T11:11:47Z 2020-10-11T11:19:56Z CONTRIBUTOR

Thanks for the clarification that this is a real issue not due to just my coding, and the suggestion to solve this elsewhere. For now I just use the fast Pandas version with this code:

```python
df_h = ds.to_dataframe().resample("1H").mean()  # what we want (quickly), but in Pandas form
vals = [xr.DataArray(data=df_h[c], dims=['time'],
                     coords={'time': df_h.index},
                     attrs=ds[c].attrs)
        for c in df_h.columns]
ds_h = xr.Dataset(dict(zip(df_h.columns, vals)), attrs=ds.attrs)
```

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Resample is ~100x slower than Pandas resample; Speed is related to resample period (unlike Pandas) 718436141
706688498 https://github.com/pydata/xarray/issues/4498#issuecomment-706688498 https://api.github.com/repos/pydata/xarray/issues/4498 MDEyOklzc3VlQ29tbWVudDcwNjY4ODQ5OA== mankoff 145117 2020-10-11T11:12:47Z 2020-10-11T11:12:47Z CONTRIBUTOR

The linked issues refer to groupby not resample so this could stay open or be closed as a duplicate - I leave it to you to decide. Thank you for the assistance.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Resample is ~100x slower than Pandas resample; Speed is related to resample period (unlike Pandas) 718436141
706548763 https://github.com/pydata/xarray/issues/4498#issuecomment-706548763 https://api.github.com/repos/pydata/xarray/issues/4498 MDEyOklzc3VlQ29tbWVudDcwNjU0ODc2Mw== mankoff 145117 2020-10-10T13:23:24Z 2020-10-10T13:23:24Z CONTRIBUTOR

The every-4th-or-5th-run lag is not in the creation; it's in the resample:

```python
for i in np.arange(25):
    start = time.time()
    ds_r = ds.resample({'time': "1H"})
    print('xr', str(time.time() - start))
```

Results:

```
xr 0.04479050636291504
xr 0.047682762145996094
xr 0.8904871940612793
xr 0.05605506896972656
xr 0.0452876091003418
xr 0.0467374324798584
xr 0.8709239959716797
xr 0.05595755577087402
xr 0.046492576599121094
xr 0.04648017883300781
xr 0.045223236083984375
xr 0.8187246322631836
xr 0.05060911178588867
xr 0.04763054847717285
xr 0.8156075477600098
xr 0.055490970611572266
xr 0.047312259674072266
xr 0.04651069641113281
xr 0.8001837730407715
xr 0.05546212196350098
xr 0.04549074172973633
xr 0.04680013656616211
xr 0.04383039474487305
xr 0.7662224769592285
xr 0.04914355278015137
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Resample is ~100x slower than Pandas resample; Speed is related to resample period (unlike Pandas) 718436141
706548513 https://github.com/pydata/xarray/issues/4498#issuecomment-706548513 https://api.github.com/repos/pydata/xarray/issues/4498 MDEyOklzc3VlQ29tbWVudDcwNjU0ODUxMw== mankoff 145117 2020-10-10T13:21:19Z 2020-10-10T13:21:19Z CONTRIBUTOR

"performance" is a good tag. My actual use case is a dataset with 500,000 timestamps and 15 variables (10 minute weather station for a decade).

In this case, pandas takes 0.03 seconds and xarray takes 200 seconds - four orders of magnitude. Should I change the title to reflect the larger difference in performance? Here is that MWE:

```python
import numpy as np
import xarray as xr
import pandas as pd
import time

size = 500000
times = pd.date_range('2000-01-01', periods=size, freq="10Min")
ds = xr.Dataset({
    'foo': xr.DataArray(
        data = np.random.random(size),
        dims = ['time'],
        coords = {'time': times}
    )})
for v in 'abcdefghijelm':
    ds[v] = (('time'), np.random.random(size))

start = time.time()
ds_r = ds.resample({'time':"1H"}).mean()
print('xr', str(time.time() - start))

start = time.time()
ds_r = ds.to_dataframe().resample("1H").mean()
print('pd', str(time.time() - start))
```

Result:

```
xr 202.2967929840088
pd 0.03381085395812988
```

The strange thing here is that if I drop the .mean() calls, most of the time I see what you see:

```
xr 0.03333306312561035
pd 0.020237445831298828
```

But every 4th or 5th time that I run this, I get this:

```
xr 0.8518760204315186
pd 0.02686452865600586
```

This is repeatable. I've run this code 100s of times now, and every 4th or 5th run takes ~10x longer. Nothing else is going on on my computer.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Resample is ~100x slower than Pandas resample; Speed is related to resample period (unlike Pandas) 718436141
368456391 https://github.com/pydata/xarray/issues/1917#issuecomment-368456391 https://api.github.com/repos/pydata/xarray/issues/1917 MDEyOklzc3VlQ29tbWVudDM2ODQ1NjM5MQ== mankoff 145117 2018-02-26T10:28:16Z 2018-02-26T10:28:16Z CONTRIBUTOR

Appears fixed. Thank you!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decode times adds micro-second noise to standard calendar 297780998
366382745 https://github.com/pydata/xarray/issues/1917#issuecomment-366382745 https://api.github.com/repos/pydata/xarray/issues/1917 MDEyOklzc3VlQ29tbWVudDM2NjM4Mjc0NQ== mankoff 145117 2018-02-16T22:58:14Z 2018-02-16T22:58:14Z CONTRIBUTOR

foo.nc.zip

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decode times adds micro-second noise to standard calendar 297780998

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);