I came across this paper, which presented the plot below. Could someone tell me what this kind of plot is called? And how can I plot a similar chart with Python, specifically Matplotlib? I need to present predictions from a logistic regression as well, hence the question.
Thanks!
This is almost certainly an error bar plot, grouped by category. Since you asked how to plot such a graph, let's cover some basics first.
An error bar is a graphical device used to illustrate variability and uncertainty in data. It lets you visualize the precision of data points, and it can represent standard deviation, standard error, confidence intervals, or range (Cumming, Fidler & Vaux, 2007). This is done by drawing a marker at each data point with lines extending from its center, optionally ending in short perpendicular caps.
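To make those quantities concrete, here is a minimal sketch of how three of them could be computed for a single data point with NumPy. The sample values are made up for illustration, and the 95% interval uses the normal approximation (1.96 standard errors), which is an assumption; for a sample this small a t-distribution critical value would usually be more appropriate.

```python
import numpy as np

# Hypothetical repeated measurements behind one plotted data point
sample = np.array([9.8, 10.4, 10.1, 9.5, 10.2])

mean = sample.mean()
sd = sample.std(ddof=1)             # sample standard deviation
sem = sd / np.sqrt(sample.size)     # standard error of the mean
ci_half = 1.96 * sem                # half-width of an approximate 95% CI

print(f"mean={mean:.3f}  sd={sd:.3f}  sem={sem:.3f}  ci95=+/-{ci_half:.3f}")
```

Any one of `sd`, `sem`, or `ci_half` could then serve as the error bar length for that point, so it is worth stating in the caption which one you used.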
Caps add a touch of visual polish to your plot (subjective opinion) and help you quickly see where each interval ends. The sample you've provided, however (if it is indeed an error bar plot), does not use caps. Omitting them reduces clutter, which can be useful in plots with many closely spaced or overlapping elements.
A relatively short error bar signifies a concentrated distribution of values, meaning that the plotted average is a more reliable estimate. By contrast, a relatively long error bar suggests a wide, spread-out distribution, so the average value is less certain.
Anatomy of an Error Bar (source)
Furthermore, error bars can be symmetrical (the same length above and below the data point) or asymmetrical (different lengths on each side).
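Asymmetrical bars come up naturally with predicted probabilities, for example from a logistic regression: an interval that is symmetric on the log-odds scale is no longer symmetric once it is mapped back to probabilities. Below is a minimal sketch of that case. The log-odds estimates and standard errors are hypothetical numbers, not taken from any paper, and the 1.96 multiplier assumes a normal-approximation 95% interval. `errorbar()` accepts a `yerr` of shape (2, N), where the first row holds the lower errors and the second row the upper errors.

```python
import matplotlib.pyplot as plt
import numpy as np


def expit(z):
    """Logistic function: maps log-odds to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical log-odds estimates and their standard errors
log_odds = np.array([-0.2, 0.4, 1.5])
std_err = np.array([0.25, 0.30, 0.35])

# Interval is symmetric in log-odds, asymmetric in probability (%)
p = 100 * expit(log_odds)
lower = 100 * expit(log_odds - 1.96 * std_err)
upper = 100 * expit(log_odds + 1.96 * std_err)

# Row 0: distance below the point, row 1: distance above the point
yerr = np.vstack([p - lower, upper - p])

fig, ax = plt.subplots()
ax.errorbar([0, 1, 2], p, yerr=yerr, fmt="o", capsize=0)
ax.set_ylabel("Predicted probability (%)")
plt.show()
```

Note that `yerr` holds distances from the point, not the interval endpoints themselves, so both rows must be non-negative.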
Error bars can be applied to scatterplots, dot plots, bar charts, or line graphs, to provide an additional layer of detail that expands on the information presented by the initial data (The Data Visualization Catalogue article).
The error value in such a graph is the amount by which a data point may deviate from the expected value. It can be specified as a fixed value or as a percentage of the data point (the latter, I believe, is what your source image presents).
With regard to your source material, which I have also studied briefly, my interpretation (of the cropped portion in your question) is that the authors are testing a probabilistic model of how likely vaccinated patients are to be hospitalized after a severe Omicron variant infection, against the outcomes actually recorded. The error bars are used there as a measure of the spread in the accuracy of the model's results. In the ML sense, I believe it's a way of asking:
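As a rough illustration of the two ways of specifying the error, the sketch below plots the same made-up points twice: once with a fixed error of 2 units and once with an error equal to 10% of each value. Both numbers are arbitrary choices for the example. A scalar `yerr` is applied symmetrically to every point, while an array gives each point its own length.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration
x = np.array([1, 2, 3])
y = np.array([10.0, 20.0, 30.0])

yerr_fixed = 2.0          # same +/- 2 units for every point
yerr_percent = 0.10 * y   # +/- 10% of each point's value

fig, (ax_fixed, ax_percent) = plt.subplots(1, 2, sharey=True)

ax_fixed.errorbar(x, y, yerr=yerr_fixed, fmt="o", capsize=5)
ax_fixed.set_title("Fixed error (2 units)")

ax_percent.errorbar(x, y, yerr=yerr_percent, fmt="o", capsize=5)
ax_percent.set_title("Percentage error (10%)")

plt.show()
```

With a percentage error the bars grow with the data values, which is one visual cue for telling the two conventions apart in a published figure.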
How well does the model fit the training data?
The official Matplotlib documentation has a rich collection of useful error bar examples, but let's build up from the basics all the way to reproducing your sample.
You can draw a simple error bar plot using matplotlib.pyplot.errorbar() as follows:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3]
y = [10, 20, 30]
yerr = [2, 3, 1] # Error values for y
# Create a plot with error bars
plt.errorbar(x, y, yerr=yerr, fmt="o", capsize=5, label="Data with error bars")
# Label axes
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
# Show legend
plt.legend()
# Show plot
plt.show()
Output:
The key takeaway is that we have our error values readily available and simply pass them to the yerr parameter of errorbar(); everything else is pretty much trivial.
To get a capless error bar plot, set the capsize parameter of errorbar() to 0. Similarly, you can choose the marker drawn at each data point by passing the marker keyword argument to errorbar(), with values such as 'o', 's', '^', 'D', or 'P'. We'll do this in a more sophisticated way when we attempt to reproduce your sample.
The code for generating the plot in your sample would look something like this. It is fairly straightforward, but note that the data points are made up (a rough guess at the values in your image, ultimately bogus), and I have deliberately spaced out the intervals so that the bars are easy to tell apart.
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
# Categories/x-axis data and their positions
categories = ["Primary + booster < 1yr", "Primary + booster >= 1yr", "At most primary"]
positions = np.arange(len(categories))
# Set the y-axis label and ticks
ax.set_ylabel("Probability hospital (%)")
ax.set_yticks(np.arange(0, 101, 25))
# Set the x-axis with categorical labels
ax.set_xticks(positions)
ax.set_xticklabels(categories)
# Define the colors and markers
colors = ['black', 'blue', 'green', 'yellow', 'red']
markers = ['o', 's', '^', 'D', 'P']
# Define the offsets for spacing the error bars
offsets = np.linspace(-0.1, 0.1, len(colors))
# Guesstimations of vertical lines with different ranges for each category
y_ranges = [
    [(45, 50), (50, 55), (52, 58), (48, 53), (47, 56)],
    [(55, 60), (60, 65), (62, 68), (58, 63), (57, 66)],
    [(75, 80), (80, 85), (82, 88), (78, 83), (77, 86)]
]
# Plot the error bars for each category
for i, ranges in enumerate(y_ranges):
    for j, (color, marker, offset, (ymin, ymax)) in enumerate(zip(colors, markers, offsets, ranges)):
        x_position = positions[i] + offset  # Adjust x position with offset
        y = (ymin + ymax) / 2  # Center point for marker
        yerr = (ymax - ymin) / 2  # Half the interval width (symmetric error)
        # capsize=0 draws the bars without caps; use e.g. capsize=5 to add them
        ax.errorbar(x_position, y, yerr=yerr, fmt=marker, color=color, capsize=0, label=f'{categories[i]} - Line {j+1}')
# Add internal text label in the top-left corner
ax.text(0.05, 0.95, "At least 60 yrs", transform=ax.transAxes,
        fontsize=12, verticalalignment='top', bbox=dict(facecolor='white', edgecolor='none'))
# Border around the plot
for spine in ax.spines.values():
    spine.set_edgecolor('black')
# Show the plot
plt.show()
Output:
If you have the interval endpoints for your data points, you can also draw an identical plot manually with Matplotlib's ordinary primitives: vlines() for the bars and scatter() for the markers.
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
# Categories/x-axis data and their positions
categories = ["Primary + booster < 1yr", "Primary + booster >= 1yr", "At most primary"]
positions = np.arange(len(categories))
# Set the y-axis label and ticks
ax.set_ylabel("Probability hospital (%)")
ax.set_yticks(np.arange(0, 101, 25))
# Set the x-axis with categorical labels
ax.set_xticks(positions)
ax.set_xticklabels(categories)
# Colors and markers
colors = ['black', 'blue', 'green', 'yellow', 'red']
markers = ['o', 's', '^', 'D', 'P']
# Offsets for spacing the vertical lines
offsets = np.linspace(-0.1, 0.1, len(colors))
# Guesstimations of vertical lines with different ranges for each category
y_ranges = [
    [(45, 50), (50, 55), (52, 58), (48, 53), (47, 56)],
    [(55, 60), (60, 65), (62, 68), (58, 63), (57, 66)],
    [(75, 80), (80, 85), (82, 88), (78, 83), (77, 86)]
]
# Draw each interval as a vertical line with a marker at its midpoint
for i, ranges in enumerate(y_ranges):
    for j, (color, marker, offset, (ymin, ymax)) in enumerate(zip(colors, markers, offsets, ranges)):
        x_position = positions[i] + offset  # Adjust x position with offset
        y = (ymin + ymax) / 2  # Center point for marker
        ax.vlines(x_position, ymin, ymax, color=color, label=f'{categories[i]} - Line {j+1}')
        ax.scatter(x_position, y, color=color, marker=marker)
# Internal text label in the top-left corner
ax.text(0.05, 0.95, "At least 60 yrs", transform=ax.transAxes,
        fontsize=12, verticalalignment='top', bbox=dict(facecolor='white', edgecolor='none'))
# Border around the plot
for spine in ax.spines.values():
    spine.set_edgecolor('black')
# Voila!
plt.show()
Output: