python pandas dataframe julia julia-dataframe

How to convert a Python pandas DataFrame into a Julia DataFrame (using PyJulia) and back to pandas


I want to use PyJulia to speed up part of my code:

import numpy as np
import julia
import pandas as pd
import random
from julia import Base
from julia import Main
from julia import DataFrames

n = 100000
randomlist = []
for i in range(0,n):
    num = random.randint(1,100)
    randomlist.append(num)

data = {
    'Score': list(randomlist),
    'ScoreBin': list(np.zeros(n))
}
df = pd.DataFrame(data, columns = ['Score', 'ScoreBin'])
Main.dfj = df

Main.eval(""" 
for i = 1:10
    #println(i)
    if dfj.Score[i] >= 10
        println(dfj.Score[i])
    end
end
"""
)

However, I get the following error message:

JuliaError: Exception 'TypeError: non-boolean (PyObject) used in boolean context' occurred while calling julia code:

Moreover, the following command:

Main.eval(""" 
println(dfj.Score[1])
"""
)

gives the output (which appears not to be a Julia DataFrame):

PyObject 84

Is there a way to convert a pandas DataFrame into a Julia DataFrame?

Edit 1

Thanks to the answer of @PrzemyslawSzufel, the following code now works:

import numpy as np
import julia
import pandas as pd
import random
import copy
from julia import Base
from julia import Main
from julia import DataFrames
from julia import Pandas
#julia.install(DataFrame)
%load_ext julia.magic

n = 100000
randomlist = []
for i in range(0,n):
    num = random.randint(1,100)
    randomlist.append(num)

data = {
    'Score': list(randomlist),
    'ScoreBin': list(np.zeros(n))
}
df = pd.DataFrame(data, columns = ['Score', 'ScoreBin'])
Main.df = df;

Main.eval("""
dfj = df |> Pandas.DataFrame|> DataFrames.DataFrame;
""")

However, although I put a ; at the end of the line, the contents of dfj are always printed. This output is unwanted: it is long (100000 rows) and takes around a second to produce. Is there a way to avoid it?

Moreover, if I now modify the DataFrame in Julia (which is much faster than doing it in Python, and is the whole point of the question) and try to convert it back to a Python pandas DataFrame, I also get an error:

Main.eval(""" 
for i = 1:length(dfj[:, :Score])
    if dfj[i, :Score] > 50
        dfj[i, :ScoreBin] = 1 
    end
end
"""
)

dfjpy = pd.DataFrame(Main.dfj)
dfjpy


RuntimeError: Julia exception: MethodError: no method matching iterate(::DataFrames.DataFrame)
Closest candidates are:
  iterate(!Matched::Core.SimpleVector) at essentials.jl:568
  iterate(!Matched::Core.SimpleVector, !Matched::Any) at essentials.jl:568
  iterate(!Matched::ExponentialBackOff) at error.jl:199
  ...
Stacktrace:
 [1] jlwrap_iterator(::DataFrames.DataFrame) at /Users/mymac/.julia/packages/PyCall/zqDXB/src/pyiterator.jl:144
 [2] pyjlwrap_getiter(::Ptr{PyCall.PyObject_struct}) at /Users/mymac/.julia/packages/PyCall/zqDXB/src/pyiterator.jl:125

By the way, the command type(dfjpy) gives PyCall.jlwrap as output.

Edit 2

To convert a Julia DataFrame to a Python pandas DataFrame, you first have to convert it to a Pandas.jl DataFrame in Julia. This is the latest working code (with the same imports as in Edit 1):

n = 100000
randomlist = []
for i in range(0,n):
    num = random.randint(1,100)
    randomlist.append(num)

data = {
    'Score': list(randomlist),
    'ScoreBin': list(np.zeros(n))
}
df = pd.DataFrame(data, columns = ['Score', 'ScoreBin'])
Main.df = df;

Main.eval("""
dfj = df |> Pandas.DataFrame|> DataFrames.DataFrame;

for i = 1:length(dfj[:, :Score])
    if dfj[i, :Score] > 50
        dfj[i, :ScoreBin] = 1 
    end
end

dfjp = dfj |> Pandas.DataFrame;
"""
)

dfjpy = Main.dfjp
dfjpy
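
As a sanity check on the Julia result, the same rule (set ScoreBin to 1 wherever Score is greater than 50) can be written in pandas alone. This is only a minimal sketch for comparison; the helper name bin_scores and its threshold parameter are illustrative and do not come from the code above.

```python
import numpy as np
import pandas as pd


def bin_scores(df, threshold=50):
    """Return a copy of df with ScoreBin set to 1.0 where Score > threshold.

    Hypothetical helper mirroring the Julia loop: rows that do not exceed
    the threshold keep whatever ScoreBin value they already had.
    """
    out = df.copy()
    out.loc[out["Score"] > threshold, "ScoreBin"] = 1.0
    return out


# Small reproducible frame with the same two columns as in the question.
rng = np.random.default_rng(0)
n_check = 1000
df_check = pd.DataFrame({
    "Score": rng.integers(1, 101, size=n_check),  # integers in 1..100
    "ScoreBin": np.zeros(n_check),
})
expected = bin_scores(df_check)
```

If the frame passed to Julia is also passed to bin_scores, the ScoreBin column of the result should match the ScoreBin column of dfjpy, which gives a quick way to confirm that the round trip preserved the data.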

Solution

  • You need to have Pandas.jl installed. This library wraps the Python pandas data frame in a form that Julia can work with, and then you can convert it to a DataFrames.jl data frame.

    Here is the Julia code (assuming that dfj holds the pandas data frame passed in from Python):

    import DataFrames
    import Pandas
    juliandf = dfj |> Pandas.DataFrame |> DataFrames.DataFrame;
    

    Note that the last line can be also written as:

    juliandf = DataFrames.DataFrame(Pandas.DataFrame(dfj));
    

    To convert back, Pandas.DataFrame(juliandf) should work.
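
Seen from the Python side, the round trip described above might look like the following sketch. It assumes PyJulia plus the Julia packages DataFrames.jl and Pandas.jl are installed. The function name roundtrip_through_julia and its jl_main parameter are hypothetical; jl_main only exists so that the function can be exercised without a Julia runtime. The Julia code ends with nothing because, as far as I understand, Main.eval returns the value of the last Julia expression, and a notebook displays that value when the call is the last statement in a cell; a trailing ; inside the Julia string does not change what is returned.

```python
import pandas as pd

# Julia code: wrap the pandas frame, convert it to DataFrames.jl, modify it,
# then wrap it again with Pandas.jl so that it can be handed back to Python.
JULIA_ROUNDTRIP = """
import DataFrames
import Pandas

dfj = df |> Pandas.DataFrame |> DataFrames.DataFrame

for i in 1:DataFrames.nrow(dfj)
    if dfj[i, :Score] > 50
        dfj[i, :ScoreBin] = 1
    end
end

dfjp = Pandas.DataFrame(dfj)
nothing
"""


def roundtrip_through_julia(df, jl_main=None):
    """Send df to Julia, run JULIA_ROUNDTRIP, and return the result.

    jl_main is a hypothetical hook; by default PyJulia's Main module is used.
    """
    if jl_main is None:
        from julia import Main as jl_main  # requires PyJulia and Julia
    jl_main.df = df                  # expose the pandas frame to Julia as `df`
    jl_main.eval(JULIA_ROUNDTRIP)    # evaluates to `nothing`, i.e. None
    return jl_main.dfjp              # the Pandas.jl wrapper handed back
```

With a working PyJulia setup, dfjpy = roundtrip_through_julia(df) would play the role of the dfjpy variable from Edit 2. Assigning the return value of Main.eval to a throwaway variable is another way to keep the notebook from echoing it.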