I am using the prince library to perform Correspondence Analysis:
from prince import CA
My contingency table, dummy_contingency, looks like this:
{'v1': {'0': 4.479591836734694,
'1': 75.08163265306122,
'2': 1.1020408163265305,
'3': 5.285714285714286,
'4': 14.244897959183673,
'5': 0.0,
'6': 94.06122448979592,
'7': 0.5102040816326531,
'8': 87.62244897959184,
'9': 16.102040816326532},
'v2': {'0': 6.142857142857143,
'1': 24.653061224489797,
'2': 0.3979591836734694,
'3': 2.63265306122449,
'4': 18.714285714285715,
'5': 0.0,
'6': 60.92857142857143,
'7': 1.030612244897959,
'8': 71.73469387755102,
'9': 14.76530612244898},
'v3': {'0': 3.642857142857143,
'1': 21.551020408163264,
'2': 0.8061224489795918,
'3': 2.979591836734694,
'4': 14.5,
'5': 0.030612244897959183,
'6': 39.60204081632653,
'7': 0.7551020408163265,
'8': 71.89795918367346,
'9': 11.571428571428571},
'v4': {'0': 6.1020408163265305,
'1': 25.632653061224488,
'2': 0.6938775510204082,
'3': 3.9285714285714284,
'4': 21.581632653061224,
'5': 0.22448979591836735,
'6': 10.704081632653061,
'7': 0.8469387755102041,
'8': 71.21428571428571,
'9': 12.489795918367347}}
A chi-square test reveals dependence:
Chi-square statistic: 69.6630377155341
p-value: 1.2528156966101567e-05
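The question does not show how the test was run. As a minimal sketch, assuming it was Pearson's chi-square test of independence, the statistic can be computed directly from a table stored in the same nested-dictionary layout ({column: {row: value}}). The function name pearson_chi_square and the toy table are hypothetical and only serve as illustration:

```python
def pearson_chi_square(table):
    """Return (statistic, degrees_of_freedom) for a table given as {column: {row: value}}."""
    columns = list(table)
    rows = list(table[columns[0]])
    col_totals = {c: sum(table[c][r] for r in rows) for c in columns}
    row_totals = {r: sum(table[c][r] for c in columns) for r in rows}
    grand_total = sum(col_totals.values())

    statistic = 0.0
    for c in columns:
        for r in rows:
            # Expected value of the cell under independence of rows and columns
            expected = row_totals[r] * col_totals[c] / grand_total
            if expected > 0:  # skip cells whose row or column is entirely zero
                statistic += (table[c][r] - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(columns) - 1)
    return statistic, dof


# Hypothetical 2x2 table in the same layout as dummy_contingency
toy = {"a": {"0": 10, "1": 20}, "b": {"0": 20, "1": 10}}
statistic, dof = pearson_chi_square(toy)
print(statistic, dof)
```

The same call should work on the nested dummy_contingency dictionary as long as it has not yet been converted to a DataFrame. Dividing the statistic by the grand total of the table gives the total inertia that correspondence analysis decomposes into components.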
Now I fit the data:
import pandas as pd
dummy_contingency = pd.DataFrame(dummy_contingency)
ca_dummy = CA(n_components=2)  # number of components for correspondence analysis
ca_dummy.fit(dummy_contingency)
And the plot:
fig = ca_dummy.plot(X=dummy_contingency)
fig
How do I add labels to this plot? The examples posted by others (Using mca package in Python) use the function plot_coordinates(), which has an option for adding labels. But it looks like this function is no longer available in the prince package, and the plot() function that replaces it does not seem to have an option for labels. I would appreciate any help with this.
Edit: Example of an output with labels:
The text next to each point in the plot, like "strawberries", "banana", "yogurt", etc., is the kind of label I am looking for. In this case the labels will be the index values 0,1,2,3,4,5,6,7,8,9 for the blue points and the column names "v1", "v2", "v3", "v4" for the orange points.
For a ca correspondence-analysis plot, .column_coordinates and .row_coordinates must be extracted from the ca model. These are the points on the plot, not those from df.
import pandas as pd
import prince
import altair as alt
# convert the dictionary of data to a dataframe
df = pd.DataFrame(dummy_contingency)
# create the model
ca = prince.CA()
# fit the model
ca = ca.fit(df)
# extract the column coordinate dataframe, and change the column names
cc = ca.column_coordinates(df).reset_index()
cc.columns = ['name', 'x', 'y']
# extract the row coordinates dataframe, and change the column names
rc = ca.row_coordinates(df).reset_index()
rc.columns = ['name', 'x', 'y']
# combine the dataframes
crc_df = pd.concat([cc, rc], ignore_index=True)
# plot and annotate
points = ca.plot(df)
annot = alt.Chart(crc_df).mark_text(
    align='left',
    baseline='middle',
    fontSize=20,
    dx=7
).encode(
    x='x',
    y='y',
    text='name'
)
points + annot
Alternatively, create a separate annot layer for each of cc and rc instead of combining them into a single dataframe.
points = ca.plot(df)
# labels for the column points
annot1 = alt.Chart(cc).mark_text(
    align='left',
    baseline='middle',
    fontSize=20,
    dx=7
).encode(
    x='x',
    y='y',
    text='name'
)
# labels for the row points
annot2 = alt.Chart(rc).mark_text(
    align='left',
    baseline='middle',
    fontSize=20,
    dx=7
).encode(
    x='x',
    y='y',
    text='name'
)
points + annot1 + annot2