python dataframe julia multi-index

Julia: equivalent of Python's selection by MultiIndex level (especially on columns)


I am new to Julia and to Julia DataFrames. My understanding is that DataFrames does not support multi-indexing. That is usually not a problem, but it makes some pandas habits hard to translate. How can I load a dataset and select a subset of columns by their top-level label, as in the example below?

import numpy as np
import pandas as pd

#generating sample data
nsmpls = 10
smpls = [f'smpl{j}' for j in range(nsmpls)]

nfeats = 5
feats = [f'feat{j}' for j in range(nfeats)]

data = np.random.rand(nfeats, nsmpls)

countries = ['France'] * 2 + ['UK'] * 3 + ['US'] * 5

df = pd.DataFrame(data, index=feats, columns=pd.MultiIndex.from_tuples(zip(countries, smpls)))
df.to_csv('./data.tsv', sep='\t')

#---------------------------------------------------------------------
#loading dataset
df = pd.read_csv('./data.tsv', sep='\t', index_col=0, header=[0,1])

#extracting subset
dg = df.xs('France', level=0, axis=1)
print(dg.shape)

#iterating
for country, group in df.groupby(level=0, axis=1):
    print('#samples: {}'.format(group.shape[1]))

Solution

  • Something like this?

    using DataFrames, CSV
    
    # Used your sample data
    df = DataFrame(CSV.File("data.tsv"))
    
    # Filter the columns by country name
    france_cols = findall(x -> occursin("France", x), names(df))
    
    # Subset the df
    dg = select(df, france_cols)
    
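    # Optional: use the sample names ("smpl0", "smpl1", ...) as column names
    # instead of the country name. They were read as the first data row.
    rename!(dg, collect(dg[1, :]))
    dg = dg[2:end, :]
    # Note: because that first row held text, these columns are read as strings;
    # convert them if you need numeric values.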
    
    display(dg)
    println(size(dg))
    

    By default, duplicate column names are made unique on load by appending a number (France, France_1, and so on), so I selected every column whose name contains "France".
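
The column-matching idea in the answer above does not depend on any DataFrame library, so it may help to see it spelled out. The following is only a minimal sketch in plain Python, assuming the file has two header rows (country, then sample name) followed by one row per feature. `SAMPLE_TSV` and `load_by_country` are hypothetical names made up for this illustration, and the numbers are placeholder values, not the question's random data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for data.tsv: two header rows (country, sample name),
# then one row per feature. The values are made up for illustration.
SAMPLE_TSV = (
    "\tFrance\tFrance\tUK\tUK\tUK\n"
    "\tsmpl0\tsmpl1\tsmpl2\tsmpl3\tsmpl4\n"
    "feat0\t0.1\t0.2\t0.3\t0.4\t0.5\n"
    "feat1\t0.6\t0.7\t0.8\t0.9\t1.0\n"
)


def load_by_country(text):
    """Split a two-header TSV into one sub-table per top-level label."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    # The first cell of each row belongs to the feature-name column.
    countries, samples, body = rows[0][1:], rows[1][1:], rows[2:]

    # Map each country to the positions of its columns.
    cols = defaultdict(list)
    for j, country in enumerate(countries):
        cols[country].append(j)

    features = [row[0] for row in body]
    values = [[float(x) for x in row[1:]] for row in body]

    groups = {}
    for country, idx in cols.items():
        groups[country] = {
            "samples": [samples[j] for j in idx],
            "features": features,
            "values": [[row[j] for j in idx] for row in values],
        }
    return groups


groups = load_by_country(SAMPLE_TSV)
for country, group in groups.items():
    print(country, "#samples:", len(group["samples"]))
```

Keeping the country row and the sample row separate from the data means the values can be parsed as numbers directly, and the same column-position mapping serves both the single-country subset and the per-country iteration from the question.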