pythonpandaspytables

ptrepack sortby needs 'full' index


I am trying to ptrepack a HDF file that was created with pandas HDFStore pytables interface. The main index of the dataframe was time but I made some more columns data_columns so that I can filter for data on-disk via these data_columns.

Now I would like to sort the HDF file by one of those columns (because the selection is too slow for my taste, 84 GB file), using ptrepack with the sortby option like so:

()[maye@luna4 .../nominal]$ ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc --sortby=clat C9.h5 C9_sorted.h5

and I get the error message:

()[maye@luna4 .../nominal]$ Problems doing the copy from 'C9.h5:/' to 'C9_sorted.h5:/' The error was --> : Field clat must have associated a 'full' index in table /df/table (Table(390557601,)) ''. The destination file looks like: C9_sorted.h5 (File) '' Last modif.: 'Fri Jul 26 18:17:56 2013' Object Tree: / (RootGroup) '' /df (Group) '' /df/table (Table(0,), shuffle, blosc(9)) ''

Traceback (most recent call last): File "/usr/local/epd/bin/ptrepack", line 10, in sys.exit(main()) File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 480, in main upgradeflavors=upgradeflavors) File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 225, in copyChildren raise RuntimeError("Please check that the node names are not " RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired. In particular, pay attention that rootUEP is not fooling you.

Does this mean, that I can not sort a HDF file by an index column, because they are not 'full' indexes?


Solution

  • Here is a complete example.

    Create the frame with a data_column. Reset the index to a full index. Use ptrepack to sortby it.

    In [16]: df = DataFrame(randn(10,2),columns=list('AB')).to_hdf('test.h5','df',data_columns=['B'],mode='w',table=True)
    
    In [17]: store = pd.HDFStore('test.h5')
    
    In [18]: store
    Out[18]: 
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    /df            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index],dc->[B])
    
    In [19]: store.get_storer('df').group.table
    Out[19]: 
    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoIndex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
        "B": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
    
    In [20]: store.create_table_index('df',columns=['B'],optlevel=9,kind='full')
    
    In [21]: store.get_storer('df').group.table
    Out[21]: 
    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoIndex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
        "B": Index(9, full, shuffle, zlib(1)).is_CSI=True}
    
     In [22]: store.close()
    
     In [25]: !ptdump -avd test.h5
    / (RootGroup) ''
      /._v_attrs (AttributeSet), 4 attributes:
       [CLASS := 'GROUP',
        PYTABLES_FORMAT_VERSION := '2.0',
        TITLE := '',
        VERSION := '1.0']
    /df (Group) ''
      /df._v_attrs (AttributeSet), 14 attributes:
       [CLASS := 'GROUP',
        TITLE := '',
        VERSION := '1.0',
        data_columns := ['B'],
        encoding := None,
        index_cols := [(0, 'index')],
        info := {'index': {}},
        levels := 1,
        nan_rep := b'nan',
        non_index_axes := [(1, ['A', 'B'])],
        pandas_type := b'frame_table',
        pandas_version := b'0.10.1',
        table_type := b'appendable_frame',
        values_cols := ['values_block_0', 'B']]
    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoindex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "B": Index(9, full, shuffle, zlib(1)).is_csi=True}
      /df/table._v_attrs (AttributeSet), 15 attributes:
       [B_dtype := b'float64',
        B_kind := ['B'],
        CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'values_block_0',
        FIELD_2_FILL := 0.0,
        FIELD_2_NAME := 'B',
        NROWS := 10,
        TITLE := '',
        VERSION := '2.6',
        index_kind := b'integer',
        values_block_0_dtype := b'float64',
        values_block_0_kind := ['A']]
      Data dump:
    [0] (0, [1.10989047288066], 0.396613633081911)
    [1] (1, [0.0981650001268093], -0.9209780702446433)
    [2] (2, [-0.2429293157073629], -1.779366453624283)
    [3] (3, [0.7305529521507728], 1.243565083939927)
    [4] (4, [-0.1480724789512519], 0.5260130757651649)
    [5] (5, [1.2560020435792643], 0.5455842491255144)
    [6] (6, [1.20129355706986], 0.47930635538027244)
    [7] (7, [0.9973598999689721], 0.8602929579025727)
    [8] (8, [-0.40070941088441786], 0.7622228032635253)
    [9] (9, [0.35865804118145655], 0.29939126149826045)
    

    This is a another way to create a completely sorted index (as opposed to writing it this way)

    In [23]: !ptrepack --sortby=B test.h5 test_sorted.h5
    
    In [26]: !ptdump -avd test_sorted.h5
    / (RootGroup) ''
      /._v_attrs (AttributeSet), 4 attributes:
       [CLASS := 'GROUP',
        PYTABLES_FORMAT_VERSION := '2.1',
        TITLE := '',
        VERSION := '1.0']
    /df (Group) ''
      /df._v_attrs (AttributeSet), 14 attributes:
       [CLASS := 'GROUP',
        TITLE := '',
        VERSION := '1.0',
        data_columns := ['B'],
        encoding := None,
        index_cols := [(0, 'index')],
        info := {'index': {}},
        levels := 1,
        nan_rep := b'nan',
        non_index_axes := [(1, ['A', 'B'])],
        pandas_type := b'frame_table',
        pandas_version := b'0.10.1',
        table_type := b'appendable_frame',
        values_cols := ['values_block_0', 'B']]
    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      /df/table._v_attrs (AttributeSet), 15 attributes:
       [B_dtype := b'float64',
        B_kind := ['B'],
        CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'values_block_0',
        FIELD_2_FILL := 0.0,
        FIELD_2_NAME := 'B',
        NROWS := 10,
        TITLE := '',
        VERSION := '2.6',
        index_kind := b'integer',
        values_block_0_dtype := b'float64',
        values_block_0_kind := ['A']]
      Data dump:
    [0] (2, [-0.2429293157073629], -1.779366453624283)
    [1] (1, [0.0981650001268093], -0.9209780702446433)
    [2] (9, [0.35865804118145655], 0.29939126149826045)
    [3] (0, [1.10989047288066], 0.396613633081911)
    [4] (6, [1.20129355706986], 0.47930635538027244)
    [5] (4, [-0.1480724789512519], 0.5260130757651649)
    [6] (5, [1.2560020435792643], 0.5455842491255144)
    [7] (8, [-0.40070941088441786], 0.7622228032635253)
    [8] (7, [0.9973598999689721], 0.8602929579025727)
    [9] (3, [0.7305529521507728], 1.243565083939927)