I am exporting a schedule from P6, and am in the process of cleaning the data in Excel in prep for another application. I created a new column titled 'Parent', and am using that as the basis for my grouping.
The application I am going to use for this dataframe is Milestones Professional, which requires parent rows in the dataframe in order to indent properly. I have a column titled 'Parent', and any group which has a sub-parent is identified like the following: "Parent name > Sub-Parent name> etc."
import pandas as pd
#example of current dateframe
data = [['Milestone 1', 'Major Milestones', '01/01/2022', '01/02/2022'], ['Milestone 2', 'Major Milestones 2', '01/03/2022', '01/04/2022'], ['Milestone 3', 'Major Milestones>April MS', '01/05/2022', '01/06/2022'], ['Milestone 4', 'Major Milestones>April MS', '01/07/2022', '01/07/2022'], ['Milestone 5', 'Late Milestones', '01/08/2022', '01/09/2022'], ['Milestone 7', 'Late Milestones', '01/09/2022', '01/10/2022'], ['Milestones 8', 'Late Milestones>June MS', '01/13/2022', '01/14/2022'], ['Milestone 9', 'Late Milestones>June MS', '01/15/2022', '01/16/2022'], ['Milestone 10', 'Late Milestones>July MS', '01/19/2022', '01/20/2022']]
df = pd.DataFrame(data, columns=['Activity', 'Parent', 'Start', 'Finish'])
df
Activity Parent Start Finish
0 Milestone 1 Major Milestones 01/01/2022 01/02/2022
1 Milestone 2 Major Milestones 2 01/03/2022 01/04/2022
2 Milestone 3 Major Milestones>April MS 01/05/2022 01/06/2022
3 Milestone 4 Major Milestones>April MS 01/07/2022 01/07/2022
4 Milestone 5 Late Milestones 01/08/2022 01/09/2022
5 Milestone 7 Late Milestones 01/09/2022 01/10/2022
6 Milestones 8 Late Milestones>June MS 01/13/2022 01/14/2022
7 Milestone 9 Late Milestones 9 01/15/2022 01/16/2022
8 Milestone 10 Late Milestones>July MS 01/19/2022 01/20/2022
#ideal dataframe
data_1 = [['Major Milestones', '', '01/01/2022', '01/07/2022'], ['Milestone 1', 'Major Milestones', '01/01/2022', '01/02/2022'], ['Milestone 2', 'Major Milestones', '01/03/2022', '01/04/2022'], ['April MS', '', '01/05/2022', '01/07/2022'], ['Milestone 3', 'Major Milestones>April MS', '01/05/2022', '01/06/2022'], ['Milestone 4', 'Major Milestones>April MS', '01/07/2022', '01/07/2022'], ['Late Milestones', '', '01/08/2022', '01/20/2022'], ['Milestone 5', 'Late Milestones', '01/08/2022', '01/09/2022'], ['June MS', '', '01/11/2022', '01/14/2022'], ['Milestone 6', 'Late Milestones>June MS', '01/10/2022', '01/11/2022'], ['Milestone 7', 'Late Milestones>June MS', '01/09/2022', '01/10/2022'], ['Milestones 8', 'Late Milestones>June MS', '01/13/2022', '01/14/2022'], ['Milestone 9', 'Late Milestones>June MS', '01/15/2022', '01/16/2022'], ['July MS', '', '01/17/2022', '01/20/2022'], ['Milestone 10', 'Late Milestones>July MS', '01/19/2022', '01/20/2022']]
df_1 = pd.DataFrame(data_1, columns=['Activity', 'Parent', 'Start', 'Finish'])
df_1
Activity Parent Start Finish
0 Major Milestones 01/01/2022 01/07/2022
1 Milestone 1 Major Milestones 01/01/2022 01/02/2022
2 Milestone 2 Major Milestones 01/03/2022 01/04/2022
3 April MS 01/05/2022 01/07/2022
4 Milestone 3 Major Milestones>April MS 01/05/2022 01/06/2022
5 Milestone 4 Major Milestones>April MS 01/07/2022 01/07/2022
6 Late Milestones 01/08/2022 01/20/2022
7 Milestone 5 Late Milestones 01/08/2022 01/09/2022
8 June MS 01/11/2022 01/14/2022
9 Milestone 6 Late Milestones>June MS 01/10/2022 01/11/2022
10 Milestone 7 Late Milestones>June MS 01/09/2022 01/10/2022
11 Milestones 8 Late Milestones>June MS 01/13/2022 01/14/2022
12 Milestone 9 Late Milestones>June MS 01/15/2022 01/16/2022
13 July MS 01/17/2022 01/20/2022
14 Milestone 10 Late Milestones>July MS 01/19/2022 01/20/2022
I am trying to insert new rows identifying the major parent above the different groups, plant the start and finish date for that group, and also put rows identifying the sub-parents and the start and finish dates for those groups (sorry I know that's probably worded terribly). I tried creating a condition that searches in the parent column and creates a new row with the name of the parent for each group above but with no success.
My apologies for the formatting of the previous code snippet, it didn't come over well from VS.
If you want the original propose format, you could use extract
, slicing and concat
:
s = df['Parent'].str.extract('>(.*)$', expand=False).fillna(df['Parent'])
out = pd.concat([s[s.ne(s.shift())].to_frame(name='Activity'), df]).sort_index()
Output:
Activity Parent Start Finish
0 Major Milestones NaN NaN NaN
0 Milestone 1 Major Milestones 01/01/2022 01/02/2022
1 Milestone 2 Major Milestones 01/03/2022 01/04/2022
2 April MS NaN NaN NaN
2 Milestone 3 Major Milestones>April MS 01/05/2022 01/06/2022
3 Milestone 4 Major Milestones>April MS 01/07/2022 01/07/2022
4 Late Milestones NaN NaN NaN
4 Milestone 5 Late Milestones 01/08/2022 01/09/2022
5 Milestone 7 Late Milestones 01/09/2022 01/10/2022
6 June MS NaN NaN NaN
6 Milestones 8 Late Milestones>June MS 01/13/2022 01/14/2022
7 Milestone 9 Late Milestones>June MS 01/15/2022 01/16/2022
8 July MS NaN NaN NaN
8 Milestone 10 Late Milestones>July MS 01/19/2022 01/20/2022