python pandas dataframe numpy

ValueError: Per-column arrays must each be 1-dimensional when trying to create a pandas DataFrame from a dictionary. Why?


I'm trying to create a very simple pandas DataFrame from a dictionary. The dictionary has 3 items, and so does the resulting DataFrame. Here are my two attempts:

  1. This code succeeds and displays the DataFrame I want:

>>> import numpy as np
>>> import pandas as pd
>>> # from a dictionary
>>> dict1 = {"x": [1, 2, 3],
...         "y": list(
...             [
...                 [2, 4, 6], 
...                 [3, 6, 9], 
...                 [4, 8, 12]
...             ]
...             ),
...         "z": 100}

>>> df1 = pd.DataFrame(dict1)
>>> df1
   x           y    z
0  1   [2, 4, 6]  100
1  2   [3, 6, 9]  100
2  3  [4, 8, 12]  100
  2. But then I assign a NumPy ndarray of shape (3, 3) to the key y and try to create a DataFrame from the dictionary. The line that creates the DataFrame raises an error. Below are the code I run and the error I get (in separate code blocks for ease of reading).

>>> dict2 = {"x": [1, 2, 3],
...         "y": np.array(
...             [
...                 [2, 4, 6], 
...                 [3, 6, 9], 
...                 [4, 8, 12]
...             ]
...             ),
...         "z": 100}

>>> df2 = pd.DataFrame(dict2)  # raises the error shown in the block below

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
d:\studies\compsci\pyscripts\study\pandas-realpython\data-delightful\01.intro.ipynb Cell 10' in <module>
      1 # from a dicitionary
      2 dict1 = {"x": [1, 2, 3],
      3          "y": np.array(
      4              [
   (...)
      9              ),
     10          "z": 100}
---> 12 df1 = pd.DataFrame(dict1)

File ~\anaconda3\envs\dst\lib\site-packages\pandas\core\frame.py:636, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    630     mgr = self._init_mgr(
    631         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    632     )
    634 elif isinstance(data, dict):
    635     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 636     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    637 elif isinstance(data, ma.MaskedArray):
    638     import numpy.ma.mrecords as mrecords

File ~\anaconda3\envs\dst\lib\site-packages\pandas\core\internals\construction.py:502, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    494     arrays = [
    495         x
    496         if not hasattr(x, "dtype") or not isinstance(x.dtype, ExtensionDtype)
    497         else x.copy()
    498         for x in arrays
    499     ]
    500     # TODO: can we get rid of the dt64tz special case above?
--> 502 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~\anaconda3\envs\dst\lib\site-packages\pandas\core\internals\construction.py:120, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    117 if verify_integrity:
    118     # figure out the index, if necessary
    119     if index is None:
--> 120         index = _extract_index(arrays)
    121     else:
    122         index = ensure_index(index)

File ~\anaconda3\envs\dst\lib\site-packages\pandas\core\internals\construction.py:661, in _extract_index(data)
    659         raw_lengths.append(len(val))
    660     elif isinstance(val, np.ndarray) and val.ndim > 1:
--> 661         raise ValueError("Per-column arrays must each be 1-dimensional")
    663 if not indexes and not raw_lengths:
    664     raise ValueError("If using all scalar values, you must pass an index")

ValueError: Per-column arrays must each be 1-dimensional

Why does the second attempt end in an error, even though the y values in both dictionaries have the same dimensions? What is a workaround for this issue?


Solution

  • If you look closely at the error message and take a quick look at the source code here:

        elif isinstance(val, np.ndarray) and val.ndim > 1:
            raise ValueError("Per-column arrays must each be 1-dimensional")
    

    You will find that if a dictionary value is a NumPy array with more than one dimension, as in your example, pandas raises this error. The check applies only to np.ndarray instances, which is why a list passes: pandas only looks at the length of the outer list, so even a list of lists counts as a one-dimensional sequence whose elements happen to be lists.

    lst = [[1,2,3],[4,5,6],[7,8,9]]
    len(lst)  # 3: the outer list has 3 elements, like shape (3,), not (3, 3) as a NumPy array would report
    

    A one-dimensional array such as np.array([1,2,3]) works, because its number of dimensions is 1. You can check this with:

    arr = np.array([1,2,3])
    print(arr.ndim)  # output is 1
    

    If you have to start from a NumPy array, you can call .tolist() on it to convert it to a nested list before passing the dictionary to pd.DataFrame.
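
As a minimal sketch of that workaround, the snippet below rebuilds the dictionary from the question but converts the 2-D array with .tolist() first. The variable names arr and df2 are only illustrative, and it assumes a pandas version that accepts a list of lists as a column, as in the first attempt in the question.

```python
import numpy as np
import pandas as pd

# The same (3, 3) array that triggered the error in the question.
arr = np.array([[2, 4, 6], [3, 6, 9], [4, 8, 12]])

# .tolist() turns the 2-D array into a list of three inner lists,
# so pandas sees a plain list of length 3 rather than a 2-D ndarray.
dict2 = {"x": [1, 2, 3],
         "y": arr.tolist(),
         "z": 100}

df2 = pd.DataFrame(dict2)
print(df2)
```

Each cell of column y then holds one inner list, which means the column has object dtype rather than a numeric one. If you want the values as separate numeric columns instead, that is a different layout from the one asked for in the question.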