I am having problems with hierarchical clustering using the Minkowski distance. Here is my code:
install.packages("eurostat")
install.packages("ggplot2")
install.packages("dplyr")
library(eurostat)
library(ggplot2)
library(dplyr)
data <- get_eurostat("prc_hicp_manr")
unique(data$geo)
data <- filter(data, time >= "2000-03-01" & time <= "2022-09-30" & coicop == "CP00")
country_name <- eurostat::eu_countries
data <- merge(data, country_name, by.x = "geo", by.y = "code")
unique(data$name)
data <- filter(data, name != "" & name != "United Kingdom")
unique(data$name)
sum(is.na(data$name)) # We have 0 NAs, so our data contain only the EU-27 countries
ggplot(data) +
  aes(x = time, y = values, colour = name) +
  geom_line() +
  scale_color_hue(direction = 1) +
  labs(title = "HICP trajectories for EU countries") +
  theme_bw() +
  theme(plot.title = element_text(size = 20L, hjust = 0.5))
data_by_country <- data %>%
  group_by(name) %>%
  summarize(HICP = mean(values)) %>%
  ungroup() %>%
  as.matrix()
any(is.na(data_by_country))
data_by_country <- na.omit(data_by_country)
distance_matrix <- as.dist(dist(data_by_country, p = 1.5, method = "minkowski"))
Everything works up to the data_by_country step. Running it gives me a nice table with two columns: the country names and their mean HICP. There are no NAs in this table. These are its first 6 rows:
name HICP
1 Austria 2.102952
2 Belgium 2.232472
3 Bulgaria 4.036531
4 Croatia 2.410332
5 Cyprus 1.833579
6 Czechia 2.621033
After that, I want to create a distance matrix based on that table. Unfortunately, I get this warning:
Warning message:
In dist(data_by_country, p = 1.5, method = "minkowski") :
NAs introduced by coercion
But there are literally no NAs in my table. Additionally, when I run:
distance_matrix <- as.dist(dist(data_by_country, p = 1.5, method = "minkowski"))
I get a matrix with values, but its rows and columns have no country names, only numbers.
Do you have any idea what is wrong with this code? I am new to R, so the fix may be easy, but I have no idea what to do now.
IMPORTANT: I can use only the dplyr, ggplot2 and eurostat packages in this exercise.
That's the plot I want to get. Is this the right way to do it?
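The warning appears because as.matrix() turns a table that has a text column into a character matrix, so dist() has to coerce the country names to numbers and gets NAs. Keep the table as a data frame instead, add row names to it, pick only the data column, and make sure the result is not reduced to a plain vector (drop = FALSE).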
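# convert to a plain data frame first; tibbles do not reliably keep row names
data_by_country <- as.data.frame(data_by_country)
rownames(data_by_country) <- data_by_country$name
plot(hclust(dist(data_by_country[, "HICP", drop = FALSE])))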
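To see the whole thing in isolation, here is a minimal sketch that uses only base R and the six rows shown in the question. The object names `hicp`, `m`, `hicp_num` and `hc` are made up for this example. It first shows why the warning appears, then builds a labelled Minkowski distance matrix with p = 1.5, as in the question, rather than the default Euclidean distance.

```r
# First six rows of the table from the question (values copied as shown there)
hicp <- data.frame(
  name = c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia"),
  HICP = c(2.102952, 2.232472, 4.036531, 2.410332, 1.833579, 2.621033)
)

# as.matrix() on mixed text/numeric columns yields a character matrix;
# passing that to dist() is what triggers "NAs introduced by coercion"
m <- as.matrix(hicp)
is.character(m)

# Move the country names into row names and keep only the numeric column;
# drop = FALSE keeps a one-column data frame instead of a bare vector
rownames(hicp) <- hicp$name
hicp_num <- hicp[, "HICP", drop = FALSE]

# Minkowski distance with p = 1.5; the labels are taken from the row names
distance_matrix <- dist(hicp_num, method = "minkowski", p = 1.5)

# hclust() uses complete linkage by default; plot() draws the dendrogram
hc <- hclust(distance_matrix)
plot(hc)
```

Note that with a single numeric column every Minkowski distance reduces to the absolute difference between two values, so the choice of p only starts to matter once you cluster on more than one variable.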