Tags: r, knn, lda, kruskal-wallis, pairwise.wilcox.test

How to find meaningful boundaries between two continuous variables in R


To check for a relationship between two columns of the iris dataset, I ran kruskal.test, and the p-value suggests a significant relationship between them.

data(iris)
kruskal.test(iris$Petal.Length, iris$Sepal.Width)

Here are the results:

    Kruskal-Wallis rank sum test

data:  iris$Petal.Length and iris$Sepal.Width
Kruskal-Wallis chi-squared = 41.827, df = 22, p-value = 0.00656
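
Note that kruskal.test(x, g) treats its second argument as a grouping factor, so each distinct Sepal.Width value becomes its own group here (hence df = 22). For two continuous variables, a rank correlation is a more direct check. The following is a minimal sketch, assuming Spearman's rho is an acceptable measure for the data; exact = FALSE is set only because the columns contain ties.

```r
data(iris)

# Spearman rank correlation between the two continuous columns.
# exact = FALSE requests the asymptotic p-value, since ties rule out the exact one.
res <- cor.test(iris$Petal.Length, iris$Sepal.Width,
                method = "spearman", exact = FALSE)

res$estimate  # Spearman's rho
res$p.value   # p-value for H0: rho = 0
```

Being rank-based, this check does not assume normality, which may matter for skewed data.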

The scatter plot also shows some sort of relationship.

plot(iris$Petal.Length, iris$Petal.Width)

[Image: scatter plot of the two variables, with a red line marking the desired boundary]

To find meaningful boundaries between these two variables, I ran pairwise.wilcox.test, but that test requires one of the variables to be categorical. If I pass two continuous variables to it, the results are not what I expect.

pairwise.wilcox.test(x = iris$Petal.Length, g = iris$Petal.Width, p.adjust.method = "BH")

As output, I need a clear cut point showing where the relationship between these two variables holds and where it ends (as shown by the red line in the image above).

I am not sure if there is any statistical test or another programming technique to find these boundaries.

For example, I can mark the boundaries manually like this:

library(data.table)
setDT(iris)[, relationship := ifelse(Petal.Length > 3 & Sepal.Width < 3.5, 1, 0)]

But, is there a programming technique or library in R to find such boundaries?

It is important to note that my actual data is skewed.

Thanks, Saurabh


Solution

  • There is no such thing as the single best split; a split can only be best under the conditions or criteria you specify.

    I think you expected the second plot, although I have also included the first one, which has a single line. Both use Linear Discriminant Analysis. However, this is supervised learning, since we have the Species column. If your data has no labels, you might be interested in unsupervised methods instead. For the decision boundaries of a k-nearest-neighbors classifier (which is itself a supervised method), check this one: https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o.

    data(iris)
    library(MASS)
    
    plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)
    
    # construct the model
    mdl <- lda(Species ~ Petal.Length + Petal.Width, data = iris)
    
    # draw discrimination line
    np <- 300
    nd.x <- seq(from = min(iris$Petal.Length), to = max(iris$Petal.Length), length.out = np)
    nd.y <- seq(from = min(iris$Petal.Width), to = max(iris$Petal.Width), length.out = np)
    nd <- expand.grid(Petal.Length = nd.x, Petal.Width = nd.y)
    
    prd <- as.numeric(predict(mdl, newdata = nd)$class)
    
    plot(iris[, c("Petal.Length", "Petal.Width")], col = iris$Species)
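    # mark the class means, one colour per species (matches col = iris$Species)
    points(mdl$means, pch = "+", cex = 3, col = 1:3)
    # classes are coded 1, 2, 3 on the grid; level 1.5 separates class 1 from the rest
    contour(x = nd.x, y = nd.y, z = matrix(prd, nrow = np, ncol = np), 
            levels = 1.5, add = TRUE, drawlabels = FALSE)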
    
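    # second plot: the same model viewed in discriminant (LD1, LD2) space
    ld <- as.data.frame(predict(mdl)$x)  # LD scores of the observations
    ld$Species <- iris$Species
    # refit on the LD scores; they are an invertible linear map of the two
    # predictors, so the fitted class regions are the same
    mdl.ld <- lda(Species ~ LD1 + LD2, data = ld)
    
    # create LD sequences from min - max values
    p.x <- seq(from = min(ld$LD1), to = max(ld$LD1), length.out = np)  # LD1 scores
    p.y <- seq(from = min(ld$LD2), to = max(ld$LD2), length.out = np)  # LD2 scores
    nd.ld <- expand.grid(LD1 = p.x, LD2 = p.y)
    prd.ld <- as.numeric(predict(mdl.ld, newdata = nd.ld)$class)
    
    plot(ld[, c("LD1", "LD2")], col = iris$Species)
    # levels 1.5 and 2.5 lie between the class codes 1, 2, 3, so both borders are drawn
    contour(x = p.x, y = p.y, z = matrix(prd.ld, nrow = np, ncol = np), 
            levels = c(1.5, 2.5), add = TRUE, drawlabels = FALSE)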
    

    [Images: the two plots produced by the code above]

    Linked to: How to plot classification borders on an Linear Discrimination Analysis plot in R
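
The k-nearest-neighbors boundaries mentioned above can be drawn with the same grid-and-contour approach. This is only a sketch under stated assumptions: it uses knn() from the class package (a recommended package that ships with R), and the value k = 15 and the grid size are arbitrary illustrative choices, not tuned settings. Like LDA, it relies on the Species labels.

```r
library(class)  # provides knn()

data(iris)
train <- iris[, c("Petal.Length", "Petal.Width")]

# regular grid covering the observed range of both variables
np <- 200  # assumed grid resolution
gx <- seq(min(train$Petal.Length), max(train$Petal.Length), length.out = np)
gy <- seq(min(train$Petal.Width),  max(train$Petal.Width),  length.out = np)
grid <- expand.grid(Petal.Length = gx, Petal.Width = gy)

set.seed(1)  # knn() breaks voting ties at random
k <- 15      # assumed number of neighbours
pred <- knn(train = train, test = grid, cl = iris$Species, k = k)

# class codes 1, 2, 3 arranged to match the grid (rows = gx, columns = gy)
z <- matrix(as.numeric(pred), nrow = np, ncol = np)

plot(train, col = iris$Species)
# levels 1.5 and 2.5 fall between the class codes, tracing both borders
contour(x = gx, y = gy, z = z, levels = c(1.5, 2.5),
        add = TRUE, drawlabels = FALSE)
```

Smaller values of k give more jagged borders that follow individual points, while larger values smooth them, so the "boundary" again depends on the criterion chosen.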