I'm doing a common R ggplot2 graph with boxplot: boxplots supplemented individual samples as points shown by geom_jitter(), to show the individual sample positions and numbers in each group. Normally I have not noticed a problem, but with some recent data, I've noticed substantial inaccuracy and variation in the y position of the jitter. However, the boxplot stays stable with respect to the Y, and so does geom_point() when used to show the same points as jitter is plotting. Error is likely not noticeable when you have many data points, but if required to do something with 5-10 samples in a group, it can produce an obvious error that makes a plot that may mislead you, if you were not aware of the issue.
I first thought this may have always happened and I didn't notice, so I made some random numbers and made a ggplot with geom_jitter(), but at first the problem disappeared. Some example data and plots are given to show the normal and problematic cases.
Data generation and plotting that worked as expected:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 30))
check the plot:
library(ggplot2)
ggplot(df, aes(X, Y)) + geom_boxplot() + geom_jitter(col = "red") + geom_point(col = "blue")
The red and blue dots are almost exactly aligned, and you can just watch the plot come in RStudio preview if you repeat the code 5 times and not notice variation in the jitter point y-position (only horizontally along the X-axis, as expected). In a problematic case like below, you quickly see the y-axis point variation, especially because it sometimes shifts the range of the y-axis.
With more variation in random numbers, I found a difference visible between the red and blue points, which varied each time of plotting the same data:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 400))
The actual numbers to get this problem were:
X Y
1 X 610.78026
2 X -38.58905
3 X -196.00943
4 X 94.37797
5 X 415.58417
In my result, the lowest point, -196, sometimes was about -170, sometimes about -250. The range of the y-axis shifts each time. It's similar to the problem I had happen with real data.
I found with other testing of data having more variance, or a larger range between points, did not explain occurrence variability of the jitter y-position. In some cases with more variance, geom_jitter() again produced near perfect y-positions. So I wondered if it may have something to do with trouble mapping with certain plot areas used by ggplot2. I thought to test that by forcing ggplot to keep the same ylimit using ylim(-206, 621)
but it failed to stop the area with the above problematic case. It gives a mysterious, yet consistent error of: "Warning message: Removed 1 rows containing missing values (geom_point)."
(In the corresponding plot, it lost the red jitter point for the 610.7 value, despite enough pixel space in the plot preview window for about 10 more points between the blue point and the top of the graph. In another attempt, 2 jitter points get lost, because the bottom sometimes goes past the lower limit).
A roundabout solution would be to make random points for the X group, all keeping the same Y and group identity, but it's not efficient. When non-numerical groups are used on X, I found it will have a numerical position of 1 for any labels being added. Adding the following to the last dataframe gives the proper appearance + geom_point(aes(x= rnorm(5, 1, .2), y = Y), col = "yellow")
- but that would become quite cumbersome if there are many groups if there is not some way to automatically get the correct X position for groups of boxplots.
To solve the problem, any input on what the cause of it is would be a great help.
It sounds like you do not want the default geom_jitter
behavior, which adds a uniformly distributed amount of noise separately to the x and y value before plotting, by default "40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins."
For a continuous variable like yours "resolution" is "the smallest non-zero distance between adjacent values."
Try this:
geom_jitter(col = "red", height = 0) +
That will tell ggplot you want no noise applied to the y values before plotting.
Another approach would be to add noise yourself before the plotting step, giving you the ability to control its distribution and range specifically.
e.g. instead of having the jittering fill a uniform rectangle: ...
library(dplyr)
tibble(x = rep(1:2, each = 1000),
y = rep(3:4, each = 1000)) -> point_data
ggplot(point_data, aes(x,y)) + geom_jitter()
We could add whatever noise function we want. Here, for no particular reason, I make donuts around the real data, and compare that to the default jitter:
point_data %>%
mutate(angle = runif(2000, 0, 2*pi),
dist = rnorm(2000, 0.3, 0.05),
x2 = x + dist*cos(angle),
y2 = y + dist*sin(angle)) %>%
ggplot() +
geom_jitter(aes(x,y), color = "red", alpha = 0.2) +
geom_point(aes(x2,y2))