Here is some random code.
# create df
import pandas as pd
df2 = pd.DataFrame({'var1':['1_0','1_0','1_0','1_0','1_0'],
'var2':['X','y','a','a','a']})
df2.to_json('df2.json')
# import df
df2 = pd.read_json('df2.json')
df2
This is the expected output:
var1 var2
0 1_0 X
1 1_0 y
2 1_0 a
3 1_0 a
4 1_0 a
However it generates:
var1 var2
0 10 X
1 10 y
2 10 a
3 10 a
4 10 a
If I modify an entry inside ['var1']
to a string, then the code it generates when df
is imported is correct.
Here is an example to illustrate it. I replaced one of the entries with 'hello'
df2 = pd.DataFrame({'var1':['1_0','hello','1_0','1_0','1_0'],
'var2':['X','y','a','a','a']})
df2.to_json('df2.json')
# import df
df2 = pd.read_json('df2.json')
df2
Generates this
var1 var2
0 1_0 X
1 hello y
2 1_0 a
3 1_0 a
4 1_0 a
Same problem is observed if file is saved in csv format and then imported.
Has anyone encountered the same issue?
This is due to the fact that underscores are valid separators in python (often used as thousand separator: 1_000
is 1000
). You could force the dtype upon import (or use dtype=False
):
df2 = pd.read_json('df2.json', dtype='str')
If you want to keep dtype detection for the other columns:
df2 = pd.read_json('df2.json', dtype={'var1': 'str'})
Output:
var1 var2
0 1_0 X
1 1_0 y
2 1_0 a
3 1_0 a
4 1_0 a
When you have a string in the json, there is no ambiguity that the values are not numbers and the conversion is not done.