I want to assign two control plots to each of the project plots. In the first iteration, I want to assign the nearest control plot (with the minimum value of dist_project_plot_id*) to the project plot being evaluated. If this control plot has already been assigned to a project plot, we look for the next nearest control plot.
Once all project plots have been assigned a first control plot, we assign a second control plot to each project plot, following the same criterion: find the control plot with the minimum distance as long as it has not been previously assigned to another project plot.
I have a dataframe which looks like:
`data = {
'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
}
df = pd.DataFrame(data)`
Where: control_plot_id represents the identifiers of the control plots dist_project_plot_id1 represents the distance between the control plot and project plot 1 dist_project_plot_id2 represents the distance between the control plot and project plot 2 etc.
I have tried the first search in the following code:
import pandas as pd
df = pd.DataFrame(data)
# Add new columns "PP" and "dist"
df['PP'] = ''
df['dist'] = np.nan
# Get the column names starting with 'project_plot_id'
project_columns = [col for col in df.columns if col.startswith('project_plot_id')]
# Iterate over the project_plot_id columns
for col in project_columns:
# Sort the dataframe by the current column in ascending order
df_sorted = df.sort_values(col)
# Find the k-nearest control plots for the current column
k = 1 # Set the value of k
nearest_control_plots = []
for i in range(k):
min_value = df_sorted.loc[~df_sorted['control_plot_id'].isin(nearest_control_plots)].head(1)[['control_plot_id', col]]
nearest_control_plots.append(min_value['control_plot_id'].values[0])
df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'PP'] = col
df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'dist'] = min_value[col].values[0]
What I am not able to program is that if a control plot has already been selected for a project plot, the code should continue searching for the next nearest control plot, which could be the third, the fourth, or even the last one. Maybe a particular library is doing what I am looking for.
The expected output should have the following columns:
Okay, based on the description this is what you are looking for:
import pandas as pd
data = {
'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
}
df = pd.DataFrame(data)
cols = [c for c in df.columns if c.startswith("dist")]
project_data = []
assignments = []
for c in cols:
for i in range(1, 3):
row = {}
search_df = df.loc[~df.control_plot_id.isin(assignments)]
control_plot = search_df.loc[search_df[c].idxmin()]
row["project_plot"] = c
row["control_plot"] = control_plot.control_plot_id
row["dist"] = control_plot[c]
project_data.append(row)
assignments.append(control_plot.control_plot_id)
out_data = pd.DataFrame(project_data)
print(out_data)
This returns:
project_plot control_plot dist
0 dist_project_plot_id1 1528123.0 1697.68
1 dist_project_plot_id1 1539206.0 2140.41
2 dist_project_plot_id2 1507770.0 427.82
3 dist_project_plot_id2 1526258.0 488.07
4 dist_project_plot_id3 1504105.0 2008.31
5 dist_project_plot_id3 2019722.0 5573.02
This does what you ask for:
I want to assign the nearest control plot (with the minimum value of dist_project_plot_id*) to the project plot being evaluated. If this control plot has already been assigned to a project plot, we look for the next nearest control plot. Once all project plots have been assigned a first control plot, we assign a second control plot to each project plot, following the same criterion: find the control plot with the minimum distance as long as it has not been previously assigned to another project plot.
However your example output is different from what you explained you wanted, not sure why...
NOTE after edit: If instead you wanted one row per project then, as with my answer prior to the edit when you clarified your expected output:
project_data = {}
assignments = []
for c in cols:
project_data[c] = {}
for i in range(1, 3):
search_df = df.loc[~df.control_plot_id.isin(assignments)]
control_plot = search_df.loc[search_df[c].idxmin()]
project_data[c][f"control_plot{i}"] = control_plot.control_plot_id
project_data[c][f"dist{i}"] = control_plot[c]
assignments.append(control_plot.control_plot_id)
out_data = pd.DataFrame.from_dict(project_data, orient='index').reset_index()
out_data.rename(columns={'index': 'project'}, inplace=True)
which would return:
project control_plot1 dist1 control_plot2 dist2
0 dist_project_plot_id1 1528123.0 1697.68 1539206.0 2140.41
1 dist_project_plot_id2 1507770.0 427.82 1526258.0 488.07
2 dist_project_plot_id3 1504105.0 2008.31 2019722.0 5573.02