pythonpandashistogramdistributionastropy

How to subsample a pandas df so that its variable distribution fits another distribution?


I am having 2 astronomical data tables, df_jpas and df_gaia. They are catalogues of galaxies containing among others the red-shifts z of the galaxies. I can plot the distribution of the redshifts of the 2 catalogs and it looks like this:

enter image description here

What I want now is to create a subsampled df_jpas, so that its distribution of z is as close as possible to the distribution of df_gaia within the z-range 0.8<z<2.3, means I want:

enter image description here

How do I do this?


Solution

  • Here is a solution.

    Let's first cut the dataframes into the desired z-range:

    left_z_edge, right_z_edge = 0.8, 2.3
    stepsize=0.02
    
    df_jpas = df_jpas[(df_jpas.z>left_z_edge)&(df_jpas.z<right_z_edge)]
    df_gaia = df_gaia[(df_gaia.z>left_z_edge)&(df_gaia.z<right_z_edge)]
    

    Next, we want to calculate the distributions (or histograms) of these dataframes:

    jpas_hist, jpas_bin_edges = np.histogram(df_jpas.z, bins = np.arange(left_z_edge,right_z_edge + stepsize, step=stepsize))
    jpas_bin_centers = (jpas_bin_edges + stepsize/2)[:-1] # instead of using the bin edges I create the bin centers and use them later
    
    gaia_hist, gaia_bin_edges = np.histogram(df_gaia.z, bins = np.arange(left_z_edge,right_z_edge + stepsize, step=stepsize))
    gaia_bin_centers = (gaia_bin_edges + stepsize/2)[:-1]
    

    After this is done comes the critical part of the code - dividing gaia_hist by jpas_hist gives us the probability of a galaxy existing in the particular z-bin and this probability is what we will use for subsampling:

    jpas_occup_prob = gaia_hist/jpas_hist
    

    Next, we create a function to be applied on the df_jpas dataframe, it creates an additional column that contains a flag if this particular galaxy should be "activated" (dropped or remained) to provide the desired distribution:

    def activate_QSO(z_val):
        idx = (np.abs(jpas_bin_centers - z_val)).argmin() # find the closest desscrite z-value to the z of the current QSO
        ocup_prob = jpas_occup_prob[idx] # assign to this entry the its probability of occupation
        activation_flag = int(np.random.random() < ocup_prob)# either activate (1) or not (0) this QSO depending on the probability from above
        return(activation_flag)
    
    df_jpas['activation_flag'] = df_jpas['z'].apply(activate_QSO)
    

    Using this flag, we can plot all galaxies containing 1 in this column which gives us the desired distribution:

    plt.hist(df_jpas[df_jpas.activation_flag==1].z, bins=100, alpha=0.5, label='jpas mock, subsampled')
    plt.hist(df_gaia.z, bins=100, alpha=0.5, label='GAIA QSO')
    plt.ylabel('N(z)')
    plt.xlabel('z')
    plt.legend()
    plt.show()
    

    enter image description here