Hi all,
I am assigning a label to each sentence in an article, and I am trying to generate a stacked area plot that shows, at each location, the percentage of sentences carrying each label.
The location is calculated as sentence_index / total_number_of_sentences.
The percentage of label A at location X is calculated as (number of sentences with label A at X) / (total number of sentences at X).
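A minimal sketch of these two definitions, using base R only. The data frame `toy`, its columns, and its values are hypothetical and are not the real data; the table-based computation is just one possible way to get the proportions.

```r
# Hypothetical toy data: two articles with four labelled sentences each
toy <- data.frame(
  article        = c(1, 1, 1, 1, 2, 2, 2, 2),
  sentence_index = c(1, 2, 3, 4, 1, 2, 3, 4),
  label          = c("B1", "C1", "B1", "R1", "B1", "B1", "C1", "R1")
)

# Location: sentence index divided by the sentence count of its article
n_sent  <- ave(toy$sentence_index, toy$article, FUN = length)
toy$loc <- toy$sentence_index / n_sent

# Percentage: share of each label among the sentences at each location
pct <- prop.table(table(loc = toy$loc, label = toy$label), margin = 1)
toy_area <- as.data.frame(pct, responseName = "percentage",
                          stringsAsFactors = FALSE)
toy_area$loc <- as.numeric(toy_area$loc)
```

Because the proportions come from a full loc-by-label table, this form also yields a row with percentage 0 for every label that does not occur at a given location.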
Here is an example of my data: a complete subset for loc in (0.24, 0.28]. I have checked that, at each location, the percentages sum to 1.
> area_df[area_df$loc>0.24,]
label percentage loc
186 B1 0.195 0.25
187 C1 0.111 0.25
188 E1 0.006 0.25
189 G1 0.075 0.25
190 H1 0.008 0.25
191 M1 0.125 0.25
192 M2 0.064 0.25
193 M3 0.084 0.25
194 O1 0.070 0.25
195 O2 0.053 0.25
196 R1 0.209 0.25
197 B1 0.500 0.26
198 M2 0.250 0.26
199 M3 0.250 0.26
200 B1 0.166 0.27
201 C1 0.177 0.27
202 E1 0.015 0.27
203 G1 0.100 0.27
204 H1 0.011 0.27
205 M1 0.114 0.27
206 M2 0.048 0.27
207 M3 0.059 0.27
208 O1 0.074 0.27
209 O2 0.026 0.27
210 R1 0.210 0.27
211 B1 0.125 0.28
212 C1 0.250 0.28
213 G1 0.125 0.28
214 H1 0.125 0.28
215 M1 0.125 0.28
216 O1 0.125 0.28
217 O2 0.125 0.28
I want to create a stacked area plot that represents the overall percentages. I expect a solid filled graph whose y axis ranges over [0, 1]. However, in my geom_area plot there are some locations where sum(y) is greater than 1, and when I set ylim(0, 1), strange blank (white) lines appear in the area plot.
I am not sure what causes this problem.
Here is my code (without and with ylim):
# all data stored in area_df
normal_loc_uniq <- sort(unique(normal_loc))
area_df <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(area_df) <- c("loc", "label", "percentage")
# for each location, calculate the percentage
for (one_loc in normal_loc_uniq) {
  subset <- data[data$normal_loc == one_loc, ]
  subset_count <- as.data.frame(round(prop.table(table(subset$normal_label, useNA = "no")), 5))
  names(subset_count) <- c("label", "percentage")
  subset_count$loc <- as.numeric(one_loc)
  subset_count$percentage <- round(subset_count$percentage, 3)
  # test if there are locations with percentage not equal to 1
  if (0.98 > sum(subset_count$percentage) | sum(subset_count$percentage) > 1.02) {
    print("error. total percentage is not 1")
  }
  area_df <- rbind(area_df, subset_count)
}
library(ggplot2)
colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf", "#aaffc3")
ggplot(area_df, aes(x = loc, y = percentage, fill = label)) +
  geom_area(na.rm = TRUE, position = "stack") +
  scale_fill_manual(values = colors) +
  labs(x = "Relative Location", y = "Percentage", fill = "Label") +
  theme_bw()
edit 1: added a complete subset of data
TL;DR: replace the geom_area(...) part of your code with
geom_area(na.rm=TRUE,position="fill")
For more information...
What you are looking for are called position adjustments, which control how plotting layers are handled when they may overlap. You can pass position= as an argument to any geom_*() function, and each geom has its own default behavior. This link explains some of the options with examples, but I'll summarize here:
identity. No adjustment: each value is drawn at its actual position. Default for geoms such as geom_point() and geom_line() (and a few more).
stack. Each successive y value at the same x position is added to the previous one(s). The final appearance is that the geoms are drawn like they are "stacked" on top of one another in the y direction. Default for geoms such as geom_col(), geom_bar(), and geom_area() (and a few more).
fill. Works like position="stack", but the scale and position on the y axis are recalculated so that each value represents a proportion of the total at that x position. In other words, the stacked y values at each x will sum to 1. I don't believe any geoms default to this behavior, but this is what OP is looking to do.
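To make the difference between stack and fill concrete, here is a rough sketch of the arithmetic in base R. The data frame `d` and its values are made up for illustration, and the sketch ignores the order in which ggplot2 actually stacks the groups; it only shows how the band tops are derived.

```r
# Hypothetical values: two x positions with three categories each
d <- data.frame(
  x        = rep(1:2, each = 3),
  category = rep(c("A", "B", "C"), times = 2),
  y        = c(10, 30, 60, 5, 5, 10)
)

# "stack": the top of each band is the running sum of y within each x
d$stack_top <- ave(d$y, d$x, FUN = cumsum)

# "fill": y is first divided by the total at that x, then stacked
d$fill_y   <- d$y / ave(d$y, d$x, FUN = sum)
d$fill_top <- ave(d$fill_y, d$x, FUN = cumsum)

d
```

With stack, the top band ends wherever the raw values happen to add up to at each x; with fill, the top band always ends at 1.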
To apply a position adjustment, set the position= argument in a geom_*() function. To use an adjustment with its default settings, just pass its name, such as position="dodge" or position="fill". To fine-tune the adjustment, refer to the adjustment's function instead, such as position=position_dodge(...) or position=position_stack(...).
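As a small sketch of the two forms, assuming a made-up data frame `d` (the data and the chosen argument values are only illustrative):

```r
library(ggplot2)

# Hypothetical data: two groups at each of three x positions
d <- data.frame(
  x     = rep(c("a", "b", "c"), each = 2),
  group = rep(c("g1", "g2"), times = 3),
  y     = c(3, 5, 2, 4, 6, 1)
)

base <- ggplot(d, aes(x = x, y = y, fill = group))

# String form: the adjustment with its default settings
p_default <- base + geom_col(position = "dodge")

# Function form: the same adjustment with the dodge width tuned
p_tuned <- base + geom_col(position = position_dodge(width = 0.5))

# Function form for stacking, here reversing the stacking order
p_reversed <- base + geom_col(position = position_stack(reverse = TRUE))
```

Printing any of the three objects draws the corresponding plot.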
Here's an example area plot. geom_area() defaults to position="stack", so geom_area(position="stack") does the same thing as writing just geom_area().
library(ggplot2)
set.seed(8675309)
df <- data.frame(
  x=1:20,  # recycled across the three categories
  y=c(runif(20, min=0, max=100), runif(20, min=10, max=50),
      runif(20, min=5, max=20)),
  category=rep(LETTERS[1:3], each=20)
)
p <- ggplot(df, aes(x=x, y=y, fill=category))
p + geom_area()
Using position="fill", we get this:
p + geom_area(position="fill")
Therefore, OP should change geom_area(...) in their code to be:
geom_area(na.rm=TRUE,position="fill")
... oh, and if you want the y-axis labels to show actual percentages instead of 0, 0.25, 0.50, ..., then I would recommend adding:
scale_y_continuous(labels=scales::percent_format())
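Putting the two pieces together, a minimal sketch might look like the following. The data frame `d` is invented for illustration and is not OP's data.

```r
library(ggplot2)

# Hypothetical data: three categories at five x positions
d <- data.frame(
  x        = rep(1:5, times = 3),
  category = rep(c("A", "B", "C"), each = 5),
  y        = c(4, 6, 5, 7, 3, 2, 2, 3, 1, 4, 1, 3, 2, 2, 5)
)

# Filled area plot with the y axis labelled as percentages
p_pct <- ggplot(d, aes(x = x, y = y, fill = category)) +
  geom_area(position = "fill") +
  scale_y_continuous(labels = scales::percent_format())

# The formatter on its own, applied to typical axis breaks
pct_labels <- scales::percent_format()(c(0, 0.25, 0.5, 0.75, 1))
```

Only the axis labels change here; the underlying y values are still proportions between 0 and 1.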