
How to correctly convert a datatable of integers (from the Python datatable library) to a pandas DataFrame


I am using Python datatable (https://github.com/h2oai/datatable) to read a CSV file that contains only integer values. I then convert the datatable to a pandas DataFrame. After the conversion, the columns that contain only 0/1 are treated as boolean instead of integer.

Consider the following CSV file (small_csv_file_test.csv):

a1,a2,a3,a4,a5,a6,a7,a8,a9,a10
 1, 1, 1, 1, 1, 1, 1, 0, 1, 1
 2, 2, 2, 2, 2, 2, 2, 1, 0, 1
 3, 3, 3, 3, 3, 3, 3, 0, 0, 1
 4, 4, 4, 4, 4, 4, 4, 1, 0, 0
 5, 5, 5, 5, 5, 5, 5, 0, 0, 0
 6, 6, 6, 6, 6, 6, 6, 0, 0, 0
 7, 7, 7, 7, 7, 7, 7, 1, 1, 0
 8, 8, 8, 8, 8, 8, 8, 1, 1, 1
 9, 9, 9, 9, 9, 9, 9, 1, 1, 1
 0, 0, 0, 0, 0, 0, 0, 1, 0, 1

The source code:

import pandas as pd
import datatable as dt

test_csv_matrix = "small_csv_file_test.csv"

data = dt.fread(test_csv_matrix)
print(data.head(5))

matrix = data.to_pandas()
print(matrix.head())

Result:

   | a1  a2  a3  a4  a5  a6  a7  a8  a9  a10  
-- + --  --  --  --  --  --  --  --  --  ---  
 0 |  1   1   1   1   1   1   1   0   1    1  
 1 |  2   2   2   2   2   2   2   1   0    1  
 2 |  3   3   3   3   3   3   3   0   0    1  
 3 |  4   4   4   4   4   4   4   1   0    0  
 4 |  5   5   5   5   5   5   5   0   0    0  

[5 rows x 10 columns]

   a1  a2  a3  a4  a5  a6  a7     a8     a9    a10  
0   1   1   1   1   1   1   1  False   True   True  
1   2   2   2   2   2   2   2   True  False   True  
2   3   3   3   3   3   3   3  False  False   True  
3   4   4   4   4   4   4   4   True  False  False  
4   5   5   5   5   5   5   5  False  False  False  

Edit 1: The columns a8, a9 and a10 are not correct. I want them as integer values, not booleans.

Thank you for your help.


Solution

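  • The output suggests that columns containing only 0 and 1 are inferred as boolean when the file is read, and to_pandas() keeps that type. You can just coerce every column to int64 after the conversion: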

    matrix = data.to_pandas().astype('int64')
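
If you would rather leave the other columns untouched, one option is to cast only the boolean columns. The following is a minimal sketch, not a definitive fix: it uses a small hand-built DataFrame as a stand-in for the result of data.to_pandas(), and the sample values and the name bool_cols are illustrative assumptions.

```python
import pandas as pd

# Stand-in for the result of data.to_pandas() (illustrative values only):
# a1 is already an integer column, a8 came back as boolean.
matrix = pd.DataFrame({
    "a1": [1, 2, 3],
    "a8": [False, True, False],
})

# Collect the names of the boolean columns only.
bool_cols = matrix.select_dtypes(include="bool").columns

# Cast just those columns to int64; all other columns keep their dtype.
matrix = matrix.astype({col: "int64" for col in bool_cols})

print(matrix.dtypes)
```

This may be preferable when the real file also holds columns (strings or floats, for example) that a blanket astype('int64') would fail on or truncate.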