Tags: r, ggplot2, label, geom-text, geom-col

How to use ggplot2 labels, fills, colors, and symbols to visualize different dimensions in a stacked bar chart?


I have been able to create the following visualization in Python but would like to re-create it in R. The data can be found further down in the R code.

The Python code I wrote to generate the graph below from the data is:

import matplotlib.pyplot as plt
import pandas as pd

# Load the data
portfolio_data = pd.read_excel("Data.xlsx")

# Define colors for each Therapeutic Area (TA)
ta_colors = {
    'Malaria': 'lightblue',
    'HIV': 'lightgreen',
    # Additional colors can be added for other TAs if present in the dataset
}

# Define the width of the bars to adjust the diamond symbol position
bar_width = 0.8

plt.figure(figsize=(12, 8))

# For each phase, plot the projects, label them, color them by TA, add symbol for external funding, and draw border for NME type
for idx, phase in enumerate(portfolio_data['Phase'].unique()):
    phase_data = portfolio_data[portfolio_data['Phase'] == phase]
    
    bottom_offset = 0
    for index, row in phase_data.iterrows():
        edge_color = 'black' if row['Type'] == 'NME' else None  # Add border if project type is NME
        plt.bar(idx, 1, bottom=bottom_offset, color=ta_colors[row['TA']], edgecolor=edge_color, linewidth=1.2)
        plt.text(idx, bottom_offset + 0.5, row['Project'], ha='center', va='center', fontsize=10)
        
        # Add diamond symbol next to projects with external funding, positioned on the right border of the bar
        if row['Funding'] == 'External':
            plt.text(idx + bar_width/2, bottom_offset + 0.5, u'\u25C6', ha='right', va='center', fontsize=10, color='red')
        
        bottom_offset += 1

# Adjust x-ticks to match phase names
plt.xticks(range(len(portfolio_data['Phase'].unique())), portfolio_data['Phase'].unique())

# Create legends for the TAs and external funding separately
legend_handles_ta = [plt.Rectangle((0, 0), 1, 1, color=ta_colors[ta], label=ta) for ta in ta_colors]
legend_external_funding = [plt.Line2D([0], [0], marker='D', color='red', markersize=10, label='External Funding', linestyle='None')]
legend_nme = [plt.Rectangle((0, 0), 1, 1, edgecolor='black', facecolor='none', linewidth=1.2, label='NME Type')]

# Add legends to the plot
legend1 = plt.legend(handles=legend_handles_ta, title="Therapeutic Area (TA)", loc='upper left')
plt.gca().add_artist(legend1)
legend2 = plt.legend(handles=legend_external_funding, loc='upper right')
plt.gca().add_artist(legend2)
plt.legend(handles=legend_nme, loc='upper center')

plt.title('Number of Projects by Phase, Colored by TA, with Symbol on Bar Border for External Funding and Border for NME Type')
plt.xlabel('Phase')
plt.ylabel('Number of Projects')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Here is what the result looks like: [image]

In my attempts to replicate the output in R, I have tried the following code:

library(ggplot2)
library(dplyr)

portfolio_data <- read.table(text = "Project    Phase   Funding TA  Type
Project1    I   Internal    Malaria NME
Project2    I   Internal    Malaria NME
Project3    I   Internal    Malaria NME
Project4    I   External    HIV NME
Project5    I   Internal    HIV NME
Project10   II  Internal    Malaria NME
Project11   II  Internal    Malaria NME
Project12   II  Internal    Malaria NME
Project17   II  External    Malaria LCM
Project18   II  External    HIV LCM
Project19   II  Internal    HIV LCM
Project20   III External    Malaria NME
Project21   III Internal    Malaria NME
Project22   III External    Malaria LCM
Project23   III Internal    HIV LCM
Project24   III External    HIV NME
Project25   III Internal    Malaria LCM
Project26   III External    HIV LCM
Project27   III Internal    HIV NME
", header=TRUE)

portfolio_data <- portfolio_data %>%
  mutate(dummy = 1)


ta_colors <- c(
  Malaria = "lightblue",
  HIV = "lightgreen"
)

type_colors <- c(
  NME = "black",
  LCM = "white"
)

# Create the plot
plot <- ggplot(portfolio_data, aes(x = Phase, y = dummy, fill = TA, label = Project)) +
  
  geom_col() +

  #add project name as labels
  geom_text(aes(label = Project)
            , position = position_stack(vjust = .5)) +
  
  #add borders by Type
  geom_col(aes(color = Type)
           , fill = NA
           , linewidth = 1) +
  
  #add colors for TA and Type
  scale_fill_manual(values = ta_colors) +
  scale_color_manual(values = type_colors) +
  
  #diamonds for projects with external funding
  geom_text(aes(label = if_else(Funding == "External", "\u25C6", NA_character_))
            , vjust = 0.5, hjust = -6.8, color = "red", size = 5
            , position = position_stack(vjust = .5)) +
  
  # Theme and labels
  labs(title = "Number of Projects by Phase, Colored by TA, with Symbol on Bar Border for External Funding and Border for NME Type",
       x = "Phase", 
       y = "Number of Projects") +
  theme_minimal()

print(plot)

I got the following result: [image]

The problem is that the borders are not correct. For example, Project 24 is an NME project. It seems that the second geom_col() call re-orders the projects so that the link between Project and Type is no longer maintained. Is there a way around this? I wanted to use the built-in functionality to draw borders, but maybe I should consider adding a separate layer with boxes around the labels? I also tried geom_bar(), without success. Perhaps there are even better ways. Any help appreciated.


Solution

  • The main issue is the grouping. When using position_stack(), the order of the stack is determined by the group aesthetic. If it is not set explicitly, ggplot2 infers the group from the discrete variables mapped to the other aesthetics of a layer (label is excluded from this), e.g. in your case from Phase, fill and color. Moreover, each layer has its own default grouping: in your second geom_col() you drop the grouping by fill, because fill = NA is a constant that overrides the mapping, and you add a grouping by color. As a consequence, this layer is stacked in a different order than the first one.

    Hence, especially for complex plots like yours, which involve multiple geoms and aesthetics, the default grouping will not always give you the desired result. Instead, you have to set it explicitly. In your case the stack should be ordered by Project and only by Project, i.e. add group = Project to aes().

    Besides that, I made some additional tweaks. First, I reversed the order of the stacks using position_stack(..., reverse = TRUE). Second, I set the outline color to "transparent" for the "LCM" type. Third, I switched to geom_point() to add the diamonds, which makes it possible to use the shape aesthetic and so to get a third (shape) legend as in your Python plot. Finally, I tweaked the legends via theme() and guides().

    library(ggplot2)
    
    type_colors <- c(
      NME = "black",
      LCM = "transparent"
    )
    
    ps <- position_stack(vjust = .5, reverse = TRUE)
    
    ggplot(
      portfolio_data,
      aes(x = Phase, y = dummy, group = Project)
    ) +
      geom_col(aes(fill = TA), position = ps) +
      geom_col(aes(color = Type),
        fill = NA,
        linewidth = 1, position = ps
      ) +
      geom_text(aes(label = Project), position = ps) +
      geom_point(
        aes(
          x = as.numeric(factor(Phase)) + .35,
          shape = Funding == "External"
        ),
        color = "red", size = 5,
        position = ps
      ) +
      scale_shape_manual(
        values = c(18, NA),
        labels = "External",
        breaks = "TRUE"
      ) +
      scale_fill_manual(
        values = ta_colors
      ) +
      scale_color_manual(
        values = type_colors,
        breaks = "NME"
      ) +
      # Theme and labels
      labs(
        title = "Number of Projects by Phase, Colored by TA, with Symbol on Bar Border for External Funding and Border for NME Type",
        x = "Phase",
        y = "Number of Projects",
        shape = "Funding"
      ) +
      theme_minimal() +
      theme(
        legend.position = "top",
        legend.direction = "vertical"
      ) +
      guides(
        color = guide_legend(title.position = "top", order = 2),
        fill = guide_legend(title.position = "top", order = 1),
        shape = guide_legend(title.position = "top", order = 3)
      )
    

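The grouping difference described above can be checked without looking at the rendered plot. The following is only a minimal sketch under a few assumptions: toy is a made-up four-row stand-in for portfolio_data, the column h gives every project a distinct bar height so that it can be recognised in the output of layer_data(), and stack_order() is a hypothetical helper written for this illustration, not part of ggplot2.

```r
library(ggplot2)

# Toy stand-in for portfolio_data (hypothetical values). Each project gets a
# distinct bar height h so it can be identified in the built layer data.
toy <- data.frame(
  Project = c("A", "B", "C", "D"),
  Phase   = "III",
  TA      = c("Malaria", "HIV", "Malaria", "HIV"),
  Type    = c("NME", "NME", "LCM", "LCM"),
  h       = c(1, 2, 3, 4)
)

# Default grouping: layer 1 is grouped by fill, layer 2 by colour
p_default <- ggplot(toy, aes(x = Phase, y = h, fill = TA)) +
  geom_col() +
  geom_col(aes(colour = Type), fill = NA)

# Explicit grouping: every layer is stacked by Project
p_grouped <- ggplot(toy, aes(x = Phase, y = h, group = Project)) +
  geom_col(aes(fill = TA)) +
  geom_col(aes(colour = Type), fill = NA)

# Hypothetical helper: projects of one layer, ordered from bottom to top.
# The bar height (ymax - ymin) is matched back to the h column of toy.
stack_order <- function(plot, layer) {
  d <- layer_data(plot, layer)
  d <- d[order(d$ymin), ]
  toy$Project[match(d$ymax - d$ymin, toy$h)]
}

default_fill   <- stack_order(p_default, 1)
default_border <- stack_order(p_default, 2)
grouped_fill   <- stack_order(p_grouped, 1)
grouped_border <- stack_order(p_grouped, 2)

# One row per layer, one column per stack slot (bottom to top)
print(rbind(default_fill, default_border, grouped_fill, grouped_border))
```

If the sketch behaves as expected, the two rows for p_default list the projects in different orders, while the two rows for p_grouped agree. That disagreement is the same mismatch between fill and border that shows up in the plot from the question.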