To clarify the question, I used a few datasets to illustrate some variants of two-dimensional data.
The datasets can be accessed at: https://drive.google.com/file/d/14-VivVlGSlaJo6BXlYMqn-1leorSU6ET/view?usp=sharing
I also use a helper function:
scatterplot_check <- function(data, dependent_col, x_column, y_column, legend_pos = "topright") {
  # Treat the class column as a factor, so that factor, character, and
  # integer labels (including 0 or negative values) all map to colours 1..k.
  classes <- as.factor(data[[dependent_col]])
  class_codes <- as.integer(classes)

  # Open a new graphics device (portable alternative to x11()).
  dev.new()

  # One point per row, coloured by class.
  plot(data[[x_column]], data[[y_column]],
       col = class_codes, pch = 18,
       xlab = x_column, ylab = y_column)

  # Legend entries are the original labels, in the same order as the colours.
  legend(legend_pos, legend = levels(classes),
         col = seq_along(levels(classes)), pch = 18)
}
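The helper relies on turning class labels into integer colour indices. A minimal sketch of that mapping, using made-up label vectors (`labels` and `int_labels` are hypothetical and not read from the datasets):

```r
# Hypothetical class labels, only for illustration
labels <- c("KK-08", "KK-06", "KK-08", "KK-06", "KK-08")

# as.factor() sorts the distinct labels into levels;
# as.integer() then gives each label's position among those levels.
classes <- as.factor(labels)
colour_codes <- as.integer(classes)

levels(classes)  # "KK-06" "KK-08"  (legend text, in colour order)
colour_codes     # 2 1 2 1 2        (one colour index per point)

# Integer labels such as -1/1 also end up as positive palette indices
int_labels <- c(-1L, 1L, 1L, -1L)
int_codes <- as.integer(as.factor(int_labels))
int_codes        # 1 2 2 1
```

The resulting codes can be passed directly as `col =` to `plot()`, and `levels(classes)` supplies the matching legend text.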
Suppose I read all the data into the environment with:
dataset1 <- read.csv("dataset1.csv")
dataset2 <- read.csv("dataset2.csv")
dataset3 <- read.csv("dataset3.csv")
And here are some variants of the scatterplot:
scatterplot_check(dataset1, "y","x.1","x.2")
(This one is likely to be classifiable with an SVM model.)
scatterplot_check(dataset2, "Purchased","Age","EstimatedSalary")
(This one is also likely to be classifiable with an SVM model.)
scatterplot_check(dataset3, "grades","english","math")
(This one is not likely to be classifiable with an SVM model.)
scatterplot_check(dataset3, "grades","read","math", legend_pos="topleft")
(This one is not likely to be classifiable with an SVM model either.)
Is there a good approach to compute how likely it is that a 2D scatterplot can be modeled with an SVM?
I have given this some thought. Although it may turn out to have weaknesses, my custom approach is to measure how much the groups overlap in the scatterplot. The steps are:
Here are the results when I applied it to those 4 cases:
d1_compare <- dataset_class_comparison(dataset1, "y", "x.1", "x.2")
============================================================================
Class = -1
SeqX(-10,10,1)
SeqY(-10,10,1)
x.1_-2 to -1 (pct) x.1_-1 to 0 (pct) x.1_0 to 1 (pct) x.1_1 to 2 (pct)
0.16 0.38 0.30 0.10
x.2_-2 to -1 (pct) x.2_-1 to 0 (pct) x.2_0 to 1 (pct) x.2_1 to 2 (pct)
0.14 0.28 0.46 0.08
============================================================================
============================================================================
Class = 1
SeqX(-10,10,1)
SeqY(-10,10,1)
x.1_-1 to 0 (pct) x.1_1 to 2 (pct) x.1_2 to 3 (pct) x.1_3 to 4 (pct)
0.08 0.42 0.36 0.08
x.2_-1 to 0 (pct) x.2_0 to 1 (pct) x.2_1 to 2 (pct) x.2_2 to 3 (pct) x.2_3 to 4 (pct)
0.06 0.26 0.38 0.20 0.06
============================================================================
Conclusion: Since each class within a 5% threshold not having similiar distribution from x.1 or x.2
SVM Likely can be modeled
d2_compare <- dataset_class_comparison(dataset2, "Purchased", "Age", "EstimatedSalary")
============================================================================
Class = 0
SeqX(10,100,10)
SeqY(10000,1e+06,10000)
Age_10 to 20 (pct) Age_20 to 30 (pct) Age_30 to 40 (pct) Age_40 to 50 (pct)
0.066 0.325 0.413 0.178
EstimatedSalary_10000 to 20000 (pct) EstimatedSalary_20000 to 30000 (pct) EstimatedSalary_30000 to 40000 (pct)
0.063 0.077 0.059
EstimatedSalary_40000 to 50000 (pct) EstimatedSalary_50000 to 60000 (pct) EstimatedSalary_60000 to 70000 (pct)
0.098 0.182 0.112
EstimatedSalary_70000 to 80000 (pct) EstimatedSalary_80000 to 90000 (pct)
0.210 0.150
============================================================================
============================================================================
Class = 1
SeqX(10,100,10)
SeqY(10000,1e+06,10000)
Age_30 to 40 (pct) Age_40 to 50 (pct) Age_50 to 60 (pct)
0.222 0.392 0.304
EstimatedSalary_20000 to 30000 (pct) EstimatedSalary_30000 to 40000 (pct) EstimatedSalary_40000 to 50000 (pct)
0.123 0.105 0.056
EstimatedSalary_70000 to 80000 (pct) EstimatedSalary_80000 to 90000 (pct) EstimatedSalary_90000 to 1e+05 (pct)
0.080 0.080 0.074
EstimatedSalary_1e+05 to 110000 (pct) EstimatedSalary_110000 to 120000 (pct) EstimatedSalary_120000 to 130000 (pct)
0.093 0.062 0.062
EstimatedSalary_130000 to 140000 (pct) EstimatedSalary_140000 to 150000 (pct)
0.093 0.099
============================================================================
Conclusion: Since each class within a 5% threshold not having similiar distribution from Age or EstimatedSalary
SVM Likely can be modeled
d3_compare <- dataset_class_comparison(dataset3, "grades", "english", "math")
============================================================================
Class = KK-08
SeqX(0,100,10)
SeqY(100,1000,100)
english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct)
0.571 0.162 0.061 0.084 0.056
math_600 to 700 (pct)
0.989
============================================================================
============================================================================
Class = KK-06
SeqX(0,100,10)
SeqY(100,1000,100)
english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct)
0.377 0.262 0.098 0.131 0.066
math_600 to 700 (pct)
0.984
============================================================================
Conclusion: Since each class within a 5% threshold having similiar distribution either from english and math
SVM Unlikely can be modeled
d4_compare <- dataset_class_comparison(dataset3, "grades", "math", "read")
============================================================================
Class = KK-08
SeqX(100,1000,100)
SeqY(100,1000,100)
math_600 to 700 (pct)
0.989
read_600 to 700 (pct)
0.992
============================================================================
============================================================================
Class = KK-06
SeqX(100,1000,100)
SeqY(100,1000,100)
math_600 to 700 (pct)
0.984
read_600 to 700 (pct)
1
============================================================================
Conclusion: Since each class within a 5% threshold having similiar distribution either from math and read
SVM Unlikely can be modeled
dataset_class_comparison is a custom function of over 300 lines, which can be found at https://drive.google.com/file/d/1RmIhbNnKZWS2jFIsS9p4LWjhcbikpOga/view?usp=sharing
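Because the full function is long, here is a much smaller sketch of the idea the printed output suggests: bin each feature per class, keep only the bins holding at least 5% of that class, and compare the classes. This is not the linked `dataset_class_comparison`. The names `binned_share` and `bin_overlap`, the toy data, and the overlap score (the sum of bin-wise minimum shares) are assumptions made only for illustration.

```r
# Share of one class's observations falling in each bin of a feature.
# `breaks` plays the role of the SeqX/SeqY grids in the printed output.
binned_share <- function(values, breaks) {
  bins <- cut(values, breaks = breaks, include.lowest = TRUE)
  as.numeric(prop.table(table(bins)))
}

# Overlap of two classes on one feature: sum of bin-wise minimum shares.
# 0 means no shared bins, 1 means identical binned distributions.
bin_overlap <- function(values, classes, breaks, threshold = 0.05) {
  groups <- split(values, classes)
  stopifnot(length(groups) == 2)
  shares <- lapply(groups, binned_share, breaks = breaks)
  # Zero out sparse bins (below the threshold) before comparing
  shares <- lapply(shares, function(s) ifelse(s >= threshold, s, 0))
  sum(pmin(shares[[1]], shares[[2]]))
}

# Deterministic toy data (hypothetical, not the linked datasets):
# x.1 differs between the classes, x.2 is identical in both.
toy <- data.frame(
  y   = rep(c(-1, 1), each = 10),
  x.1 = c(rep(c(0.5, 1.5, 2.5), times = c(4, 4, 2)),
          rep(c(2.5, 3.5, 4.5), times = c(2, 4, 4))),
  x.2 = rep(rep(c(0.5, 1.5, 2.5), times = c(4, 4, 2)), times = 2)
)
breaks <- seq(0, 5, by = 1)

overlap_x1 <- bin_overlap(toy$x.1, toy$y, breaks)  # 0.2: only one bin is shared
overlap_x2 <- bin_overlap(toy$x.2, toy$y, breaks)  # 1: identical distributions
```

A score like this only looks at one feature at a time, so two classes that overlap on both marginal distributions can still be separable in the joint 2D space (for example along a diagonal). A low marginal overlap is therefore a hint of separability, but a high one does not rule it out.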