Summary: I have a dataset that is collected in such a way that the dimensions are not initially available. I would like to take what is essentially a big block of undifferentiated data and add dimensions to it so that it can be queried, subsetted, etc. That is the core of the following question.
Here is the xarray Dataset that I have:
<xarray.Dataset>
Dimensions: (chain: 1, draw: 2000, rows: 24000)
Coordinates:
* chain (chain) int64 0
* draw (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
* rows (rows) int64 0 1 2 3 4 5 6 ... 23994 23995 23996 23997 23998 23999
Data variables:
obs (chain, draw, rows) float64 4.304 3.985 4.612 ... 6.343 5.538 6.475
Attributes:
created_at: 2019-12-27T17:16:13.847972
inference_library: pymc3
inference_library_version: 3.8
The rows dimension here corresponds to a number of subdimensions that I need to restore to the data. In particular, the 24,000 rows correspond to 100 samples each from 240 conditions (these 100 samples are in contiguous blocks). These conditions are combinations of gate, input, growth medium, and od.
I would like to end up with something like this:
<xarray.Dataset>
Dimensions: (chain: 1, draw: 2000, gate: 1, input: 4, growth_medium: 3, sample: 100, rows: 24000)
Coordinates:
* chain (chain) int64 0
* draw (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
* rows *MultiIndex*
* gate (gate) int64 'AND'
* input (input) int64 '00', '01', '10', '11'
* growth_medium (growth_medium) 'standard', 'rich', 'slow'
* sample (sample) int64 0 1 2 3 4 5 6 7 ... 95 96 97 98 99
Data variables:
obs (chain, draw, gate, input, growth_medium, samples) float64 4.304 3.985 4.612 ... 6.343 5.538 6.475
Attributes:
created_at: 2019-12-27T17:16:13.847972
inference_library: pymc3
inference_library_version: 3.8
I have a pandas dataframe that specifies the values of gate, input, and growth medium -- each row gives a set of values of gate, input, and growth medium, and an index that specifies where (in the rows dimension) the corresponding set of 100 samples appears. The intent is that this data frame is a guide for labeling the Dataset.
I looked at the xarray docs on "Reshaping and Reorganizing Data", but I don't see how to combine those operations to do what I need. I suspect I somehow need to combine these with GroupBy, but I don't see how. Thanks!
Later: I have a solution to this problem, but it is so disgusting that I am hoping someone will explain what I am doing wrong and what more elegant approach is possible.
So, first, I extracted all the data in the original Dataset into raw numpy form:
import numpy as np
foo = qm.idata.posterior_predictive['obs'].squeeze('chain').values.T
foo.shape  # (24000, 2000)
Then I reshaped it as needed:
bar = np.reshape(foo, (240, 100, 2000))
This gives me roughly the shape I want: there are 240 different experimental conditions, each has 100 variants, and for each of these variants, I have 2000 Monte Carlo samples in my data set.
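As a sanity check on that reshape, here is a minimal sketch with toy sizes (3 conditions, 2 samples, 4 draws instead of 240, 100, 2000; the names flat and blocks are only illustrative). It assumes, as stated above, that the samples of each condition sit in contiguous rows, and shows that splitting the first axis keeps each block together:

```python
import numpy as np

# Toy sizes standing in for 240 conditions x 100 samples x 2000 draws.
n_conditions, n_samples, n_draws = 3, 2, 4

# Flat layout as in the question: one row per (condition, sample) pair,
# with the samples of each condition stored contiguously.
flat = np.arange(n_conditions * n_samples * n_draws).reshape(
    n_conditions * n_samples, n_draws
)

# Splitting the first axis keeps each contiguous block of rows together.
blocks = flat.reshape(n_conditions, n_samples, n_draws)

# Block i holds exactly rows i*n_samples .. (i+1)*n_samples - 1 of the flat array.
for i in range(n_conditions):
    assert np.array_equal(blocks[i], flat[i * n_samples:(i + 1) * n_samples])
```

If the samples were interleaved rather than contiguous, this reshape would silently mix conditions, so the check is worth running on the real data layout.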
Now, I extract the information about the 240 experimental conditions from the pandas DataFrame:
import pandas as pd
# qdf is the original dataframe with the experimental conditions and some
# extraneous information in other columns
new_df = qdf[['gate', 'input', 'output', 'media', 'od_lb', 'od_ub', 'temperature']]
idx = pd.MultiIndex.from_frame(new_df)
Finally, I reassembled a DataArray from the numpy array and the pandas MultiIndex:
import xarray as xr

xr.DataArray(bar, name='obs', dims=['regions', 'conditions', 'draws'],
             coords={'regions': idx, 'conditions': range(100), 'draws': range(2000)})
The resulting DataArray has these coordinates, as I wished:
Coordinates:
* regions (regions) MultiIndex
- gate (regions) object 'AND' 'AND' 'AND' 'AND' ... 'AND' 'AND' 'AND'
- input (regions) object '00' '10' '10' '10' ... '01' '01' '11' '11'
- output (regions) object '0' '0' '0' '0' '0' ... '0' '0' '0' '1' '1'
- media (regions) object 'standard_media' ... 'high_osm_media_five_percent'
- od_lb (regions) float64 0.0 0.001 0.001 ... 0.0001 0.0051 0.0051
- od_ub (regions) float64 0.0001 0.0051 0.0051 2.0 ... 0.0003 2.0 2.0
- temperature (regions) int64 30 30 37 30 37 30 37 ... 37 30 37 30 37 30 37
* conditions (conditions) int64 0 1 2 3 4 5 6 7 ... 92 93 94 95 96 97 98 99
* draws (draws) int64 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999
That was pretty horrible, though, and it seems wrong that I had to punch through all the nice layers of xarray abstraction to get to this point. Especially since this does not seem like an unusual piece of a scientific workflow: getting a relatively raw data set together with a spreadsheet of metadata that needs to be combined with the data. So what am I doing wrong? What's the more elegant solution?
Given the starting Dataset, similar to:
<xarray.Dataset>
Dimensions: (draw: 2, row: 24)
Coordinates:
* draw (draw) int32 0 1
* row (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
Data variables:
obs (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47
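For anyone who wants to follow along, a Dataset of this shape can be built roughly as below. This is only a sketch: the name ds is what the rest of the answer assumes, and the integer dtype shown in the printed output depends on the platform.

```python
import numpy as np
import xarray as xr

# 2 draws x 24 rows, filled with 0..47 so every element is easy to trace.
ds = xr.Dataset(
    data_vars={"obs": (("draw", "row"), np.arange(48).reshape(2, 24))},
    coords={"draw": np.arange(2), "row": np.arange(24)},
)
print(ds)
```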
You can chain several pure xarray commands to subdivide the dimensions (keeping the data in the same shape but indexing it with a multiindex) or even to reshape the Dataset. To subdivide the dimensions, the following code can be used:
multiindex_ds = ds.assign_coords(
    dim_0=["a", "b", "c"], dim_1=[0, 1], dim_2=range(4)
).stack(
    stacked_dim=("dim_0", "dim_1", "dim_2")
).reset_index(
    "row", drop=True
).rename(
    row="stacked_dim"
)
multiindex_ds
whose output is:
<xarray.Dataset>
Dimensions: (stacked_dim: 24, draw: 2)
Coordinates:
* draw (draw) int32 0 1
* stacked_dim (stacked_dim) MultiIndex
- dim_0 (stacked_dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
- dim_1 (stacked_dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
- dim_2 (stacked_dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Data variables:
obs (draw, stacked_dim) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47
Moreover, the multiindex can then be unstacked, effectively reshaping the Dataset:
reshaped_ds = multiindex_ds.unstack("stacked_dim")
reshaped_ds
with output:
<xarray.Dataset>
Dimensions: (dim_0: 3, dim_1: 2, dim_2: 4, draw: 2)
Coordinates:
* draw (draw) int32 0 1
* dim_0 (dim_0) object 'a' 'b' 'c'
* dim_1 (dim_1) int64 0 1
* dim_2 (dim_2) int64 0 1 2 3
Data variables:
obs (draw, dim_0, dim_1, dim_2) int32 0 1 2 3 4 5 ... 42 43 44 45 46 47
I think that this alone does not completely cover your needs because you want to convert a dimension into two dimensions, one of which is to be a multiindex. All the building blocks are here though.
For example, you can follow these steps (including unstacking) with regions and conditions, and then follow these steps again (no unstacking this time) to convert regions to a multiindex. Another option would be to use all dimensions from the start, unstack them, and then stack them again, leaving conditions outside of the final multiindex.
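A rough sketch of that two-stage idea on toy data follows. Note that it swaps in a different technique from the chain above: it uses set_index on helper coordinates instead of assign_coords plus stack, reset_index and rename. The sizes, the meta DataFrame and its column names are hypothetical stand-ins for the real metadata table, and the sketch assumes the metadata rows are in the same order as the contiguous blocks.

```python
import numpy as np
import pandas as pd
import xarray as xr

n_regions, n_samples, n_draws = 3, 2, 4

# Toy stand-in for the data: the n_samples rows of each region are contiguous.
ds = xr.Dataset(
    {"obs": (("draw", "row"),
             np.arange(n_draws * n_regions * n_samples).reshape(n_draws, -1))}
)

# Hypothetical metadata, one line per region, in the same order as the blocks.
meta = pd.DataFrame({
    "gate": ["AND", "AND", "AND"],
    "input": ["00", "01", "10"],
    "media": ["standard", "rich", "slow"],
})

# Stage 1: split row into (regions, conditions) by position, then unstack.
row = np.arange(n_regions * n_samples)
staged = (
    ds.assign_coords(
        regions=("row", row // n_samples),
        conditions=("row", row % n_samples),
    )
    .set_index(row=["regions", "conditions"])
    .unstack("row")
)

# Stage 2: replace the positional regions labels with a multiindex
# built from the metadata columns.
labeled = (
    staged.drop_vars("regions")
    .assign_coords(**{name: ("regions", meta[name].to_numpy()) for name in meta.columns})
    .set_index(regions=list(meta.columns))
)

# The result can now be queried by any level of the multiindex.
subset = labeled["obs"].sel(input="01")
```

Splitting by integer position first, and attaching the labels only afterwards, is a deliberate choice here: unstack may reorder labels, while the positional regions index stays aligned with the order of the metadata rows.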
The answer combines several quite unrelated commands, and it might be tricky to see what each of them is doing.
assign_coords
The first step is to create new dimensions and coordinates and add them to the Dataset. This is necessary because the next methods need the dimensions and coordinates to already be present in the Dataset.
Stopping right after assign_coords yields the following Dataset:
<xarray.Dataset>
Dimensions: (dim_0: 3, dim_1: 2, dim_2: 4, draw: 2, row: 24)
Coordinates:
* draw (draw) int32 0 1
* row (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
* dim_0 (dim_0) <U1 'a' 'b' 'c'
* dim_1 (dim_1) int32 0 1
* dim_2 (dim_2) int32 0 1 2 3
Data variables:
obs (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47
stack
The Dataset now contains 3 new dimensions whose sizes multiply to 24 elements (3 × 2 × 4). However, the data is currently flat with respect to these 24 elements, so we have to stack them into a single 24-element multiindex to make the shapes compatible.
I find assign_coords followed by stack the most natural solution. However, another possibility would be to generate a multiindex similarly to how it is done above and directly call assign_coords with the multiindex, rendering the stack unnecessary.
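A minimal sketch of that alternative, under a few assumptions: the toy ds is rebuilt here so the snippet is self-contained, the multiindex is generated with pd.MultiIndex.from_product, and the old row coordinate is dropped first so the multiindex can take its place. Recent xarray releases provide xr.Coordinates.from_pandas_multiindex for this; the fallback branch is for older releases where passing the pandas MultiIndex directly was the usual way.

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"obs": (("draw", "row"), np.arange(48).reshape(2, 24))},
    coords={"draw": np.arange(2), "row": np.arange(24)},
)

# One entry per row, in the same order as the flat data.
midx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [0, 1], range(4)], names=["dim_0", "dim_1", "dim_2"]
)

# Remove the plain integer coordinate, then index row with the multiindex.
unlabeled = ds.drop_vars("row")
if hasattr(xr, "Coordinates"):
    labeled = unlabeled.assign_coords(
        xr.Coordinates.from_pandas_multiindex(midx, "row")
    )
else:
    labeled = unlabeled.assign_coords(row=midx)

# Optional: unstack to turn the levels into real dimensions.
reshaped = labeled.unstack("row")
```

With a real metadata table, pd.MultiIndex.from_frame could replace from_product, as long as the index has one entry per row of the data.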
This step combines all 3 new dimensions into a single one:
<xarray.Dataset>
Dimensions: (stacked_dim: 24, draw: 2, row: 24)
Coordinates:
* draw (draw) int32 0 1
* row (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
* stacked_dim (stacked_dim) MultiIndex
- dim_0 (stacked_dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
- dim_1 (stacked_dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
- dim_2 (stacked_dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Data variables:
obs (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47
Note that we now have 2 dimensions of size 24, as desired.
reset_index
Now we have our final dimension present in the Dataset as a coordinate, and we want this new coordinate to be the one used to index the variable obs. set_index seems like the correct choice; however, each of our coordinates indexes itself (unlike the example in the set_index docs, where x indexes both the x and a coordinates), which means that set_index cannot be used in this particular case. The method to use is reset_index, which removes the coordinate row without removing the dimension row.
In the following output it can be seen that row is now a dimension without coordinates:
<xarray.Dataset>
Dimensions: (stacked_dim: 24, draw: 2, row: 24)
Coordinates:
* draw (draw) int32 0 1
* stacked_dim (stacked_dim) MultiIndex
- dim_0 (stacked_dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
- dim_1 (stacked_dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
- dim_2 (stacked_dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Dimensions without coordinates: row
Data variables:
obs (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47
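The effect of this step can also be checked in isolation. The sketch below applies reset_index with drop=True directly to the toy Dataset (rebuilt here so the snippet is self-contained; no_row_coord is just an illustrative name) and confirms that the coordinate is gone while the dimension and the data are untouched:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"obs": (("draw", "row"), np.arange(48).reshape(2, 24))},
    coords={"draw": np.arange(2), "row": np.arange(24)},
)

# drop=True discards the row coordinate instead of keeping it as a variable.
no_row_coord = ds.reset_index("row", drop=True)

print("row" in no_row_coord.coords)  # the coordinate has been removed
print(no_row_coord.sizes["row"])     # the dimension itself is still there
```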
rename
The current Dataset is nearly the final one; the only issue is that the obs variable still has the row dimension instead of the desired one, stacked_dim. This does not really look like an intended usage of rename, but it can be used to get stacked_dim to absorb row, yielding the desired final result (called multiindex_ds above).
Here again, set_index seems to be the method to choose. However, if set_index(row="stacked_dim") is used instead of rename(row="stacked_dim"), the multiindex is collapsed into an index made of tuples:
<xarray.Dataset>
Dimensions: (draw: 2, row: 24)
Coordinates:
* draw (draw) int32 0 1
* row (row) object ('a', 0, 0) ('a', 0, 1) ... ('c', 1, 2) ('c', 1, 3)
Data variables:
obs (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47