# Feature Selection using Voice Gender Data and the caret package


## Feature Selection, Dimensionality Reduction and Random Forests

This post is based on an article by Shirin Glander on feature selection.

Feature selection is the process of selecting a subset of relevant features for use in a classification problem. This allows for:

1. Simplification of models
2. Shorter training times
3. Avoidance of overfitting, so that models can be generalised to new incoming data

Dimensionality reduction is a way of reducing the number of random variables under consideration. In this project, we will use PCA to map the data onto a 2-dimensional frame along the eigenvectors that capture the maximum variance. For classification, we will use the random forests algorithm. This method builds a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees. We will use the voice gender dataset, which can be found on Kaggle. This data set consists of different numerical voice parameters that characterise men and women. To evaluate our model, we will use the AUC (Area Under the ROC Curve). This value falls between 0 and 1 and is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one; ideally, we want an AUC value close to 1.
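That probabilistic reading of the AUC can be checked directly in a few lines of base R. The scores below are simulated, not the voice data; the proportion of positive–negative pairs that are ranked correctly matches the standard rank-sum formula for the AUC.

```r
# AUC as P(score of random positive > score of random negative),
# illustrated with simulated scores (not the voice data).
set.seed(1)
scores_pos <- rnorm(200, mean = 1)  # scores for positive instances
scores_neg <- rnorm(200, mean = 0)  # scores for negative instances

# Probability definition: compare every positive with every negative
auc_prob <- mean(outer(scores_pos, scores_neg, ">"))

# Equivalent rank-sum (Wilcoxon) computation
n_pos <- length(scores_pos); n_neg <- length(scores_neg)
r <- rank(c(scores_pos, scores_neg))
auc_rank <- (sum(r[1:n_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

all.equal(auc_prob, auc_rank)  # both formulations give the same value
```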

## Reading relevant libraries and data

We will load the necessary libraries and tools for data analysis.

library(tidyverse)
library(ROCR)
library(caret)
library(doParallel)
library(ellipse)
library(GGally)
voice <- read.csv("voice.csv",header=T,stringsAsFactors = F)
str(voice)
## 'data.frame':    3168 obs. of  21 variables:
##  $ meanfreq: num  0.0598 0.066 0.0773 0.1512 0.1351 ...
##  $ sd      : num  0.0642 0.0673 0.0838 0.0721 0.0791 ...
##  $ median  : num  0.032 0.0402 0.0367 0.158 0.1247 ...
##  $ Q25     : num  0.0151 0.0194 0.0087 0.0966 0.0787 ...
##  $ Q75     : num  0.0902 0.0927 0.1319 0.208 0.206 ...
##  $ IQR     : num  0.0751 0.0733 0.1232 0.1114 0.1273 ...
##  $ skew    : num  12.86 22.42 30.76 1.23 1.1 ...
##  $ kurt    : num  274.4 634.61 1024.93 4.18 4.33 ...
##  $ sp.ent  : num  0.893 0.892 0.846 0.963 0.972 ...
##  $ sfm     : num  0.492 0.514 0.479 0.727 0.784 ...
##  $ mode    : num  0 0 0 0.0839 0.1043 ...
##  $ centroid: num  0.0598 0.066 0.0773 0.1512 0.1351 ...
##  $ meanfun : num  0.0843 0.1079 0.0987 0.089 0.1064 ...
##  $ minfun  : num  0.0157 0.0158 0.0157 0.0178 0.0169 ...
##  $ maxfun  : num  0.276 0.25 0.271 0.25 0.267 ...
##  $ meandom : num  0.00781 0.00901 0.00799 0.2015 0.71281 ...
##  $ mindom  : num  0.00781 0.00781 0.00781 0.00781 0.00781 ...
##  $ maxdom  : num  0.00781 0.05469 0.01562 0.5625 5.48438 ...
##  $ dfrange : num  0 0.04688 0.00781 0.55469 5.47656 ...
##  $ modindx : num  0 0.0526 0.0465 0.2471 0.2083 ...
##  $ label   : chr  "male" "male" "male" "male" ...

voice$label <- as.factor(voice$label)

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 10, colour = "black", hjust = 0.5),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size = 12),
    axis.title = element_text(size = 8),
    axis.text = element_text(size = 8),
    axis.title.x = element_text(hjust = 1),
    axis.title.y = element_text(hjust = 1),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "bold"),
    legend.text = element_text(colour = "black", face = "bold"))
}

auc_values <- data.frame(feature_selection = c("No Feature Elimination", "Correlation", "Recursive Feature Elimination"),
                         values = c(0, 0, 0))

The data consists of 3168 observations (rows) and 21 variables (columns). The last variable denotes the class to which each row belongs. We convert it from a character variable to a factor variable before building the model.

## How many instances of each class (male/female) are there?

voice %>%
  group_by(label) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = label, y = n)) +
  geom_bar(stat = "identity") + plotTheme() +
  labs(title = "Number of Each Instance")

## How does each numerical variable vary across the labels?

voice %>%
  na.omit() %>%
  gather(type, value, 1:20) %>%
  ggplot(aes(x = value, fill = label)) + geom_density(alpha = 0.3) + plotTheme() +
  facet_wrap(~type, scales = "free") +
  theme(axis.text.x = element_text(angle = 90, vjust = 1)) +
  labs(title = "Density Plots of Data across Variables")

The plot above shows that, for every feature, there is some overlap between the values for rows labelled male and those labelled female.
## Principal Components Analysis

Principal components analysis is a statistical procedure that uses an orthogonal transformation to convert the data to a set of linearly uncorrelated variables.

pca_viz <- function(dataframe, label, heading) {
  data_new <- dataframe[-label]
  pca_temp <- prcomp(data_new, scale = T, center = T)
  pcaOutput_df <- as.data.frame(pca_temp$x)
  pcaOutput_df$label <- dataframe[, label]
  centroids <- pcaOutput_df %>%
    group_by(label) %>%
    summarise(PC1 = mean(PC1), PC2 = mean(PC2))
  # 95% confidence region
  conf.rgn_male <- data.frame(label = "male",
    ellipse(cov(pcaOutput_df[pcaOutput_df$label == "male", c("PC1", "PC2")]),
            centre = as.matrix(centroids[centroids$label == "male", c("PC1", "PC2")]),
            level = 0.95))
  conf.rgn_female <- data.frame(label = "female",
    ellipse(cov(pcaOutput_df[pcaOutput_df$label == "female", c("PC1", "PC2")]),
            centre = as.matrix(centroids[centroids$label == "female", c("PC1", "PC2")]),
            level = 0.95))
  conf.rgn <- bind_rows(conf.rgn_female, conf.rgn_male) %>%
    mutate(label = as.factor(label))
  var <- (pca_temp$sdev)^2 / sum((pca_temp$sdev)^2)
  pcaOutput_df %>%
    ggplot(aes(x = PC1, y = PC2, colour = label)) +
    geom_point(alpha = 0.3) +
    geom_polygon(data = conf.rgn, aes(fill = label), alpha = 0.2) +
    labs(title = heading,
         x = paste("PC1", round(var[1] * 100, 2), "% Variance"),
         y = paste("PC2", round(var[2] * 100, 2), "% Variance")) +
    plotTheme() + theme(legend.position = "bottom")
}

pca_viz(voice, 21, "PCA Visualization")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character

We see an overlap between the 95% confidence regions of the male and female clusters.

## Feature Importance

registerDoParallel()

control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
model <- train(label ~ ., data = voice,
               method = "rf",
               preProcess = c("scale", "center"),
               trControl = control)

importance <- varImp(model, scale = T)
imp_df1 <- importance$importance
imp_df1$group <- rownames(imp_df1)

imp_df1 %>%
  ggplot(aes(x = reorder(group, Overall), y = Overall), size = 2) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(vjust = 1, angle = 90)) +
  labs(x = "Variable", y = "Overall Importance", title = "Scaled Feature Importance") +
  plotTheme()

Feature importance acquaints us with the dataset by assigning scores to features based on how useful they are in building the decision trees. According to the plot above, meanfun, IQR and Q25 are the three most important attributes in the dataset. Looking back at the density plots, we can see that for these attributes there is a clear demarcation between the male and female labels; in other words, the overlap is minimal.

### Random Forests

set.seed(100)
train <- sample(dim(voice)[1], dim(voice)[1] * 0.9)
voice_train <- voice[train, ]
voice_test <- voice[-train, ]

set.seed(100)
model_rf <- train(label ~ .,
                  data = voice_train,
                  method = "rf",
                  preProcess = c("scale", "center"),
                  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = F))

### ROC Curves

voice_test_pred <- predict(model_rf, voice_test[, -21])

predvec <- ifelse(voice_test_pred == "female", 1, 0)
realvec <- ifelse(voice_test$label == "female", 1, 0)
pred <- prediction(predvec,realvec)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for Random Forest Classifier",col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

perf.auc <- performance(pred, measure = "auc")
paste("ROC Value",unlist(perf.auc@y.values))
## [1] "ROC Value 0.968434041875647"
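ROCR's performance() call traces the curve by sweeping a threshold over the prediction scores and recording the true- and false-positive rates at each step. The same construction can be sketched in base R with simulated scores (not the model's predictions); the trapezoidal area under the resulting points reproduces the AUC.

```r
# ROC curve by hand: sweep a threshold over toy scores, record TPR and FPR
set.seed(2)
scores <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))
truth  <- c(rep(1, 50), rep(0, 50))          # 1 = positive class

thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[truth == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[truth == 0] >= t))

# Trapezoidal area under the curve, starting from (0, 0)
x <- c(0, fpr); y <- c(0, tpr)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
```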
auc_values[1,]$values <- unlist(perf.auc@y.values)

## Removing Correlated Variables

Highly correlated features provide redundant information. By removing them, we can avoid any bias they introduce at the prediction step. In this section we find pairwise correlations using the cor function and visualize them with geom_tile().

library(reshape2)

cor_voice <- round(cor(voice[, -21]), 2)
cor_voice_df <- melt(cor_voice)

cor_voice_df %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + geom_tile() + plotTheme() +
  theme(axis.text.x = element_text(vjust = 1, angle = 90), legend.position = "bottom") +
  scale_fill_continuous(low = "#eb0096", high = "#a3742c") +
  labs(title = "Correlation Plots")

We will define a high correlation as greater than or equal to 0.7.

highlyCor <- colnames(voice)[findCorrelation(cor_voice, cutoff = 0.7, verbose = TRUE)]
## Compare row 1 and column 12 with corr 1
## Means: 0.568 vs 0.382 so flagging column 1
## Compare row 12 and column 4 with corr 0.91
## Means: 0.544 vs 0.363 so flagging column 12
## Compare row 4 and column 2 with corr 0.85
## Means: 0.502 vs 0.344 so flagging column 4
## Compare row 2 and column 10 with corr 0.84
## Means: 0.451 vs 0.326 so flagging column 2
## Compare row 3 and column 5 with corr 0.73
## Means: 0.419 vs 0.319 so flagging column 3
## Compare row 10 and column 9 with corr 0.87
## Means: 0.383 vs 0.299 so flagging column 10
## Compare row 18 and column 19 with corr 1
## Means: 0.407 vs 0.28 so flagging column 18
## Compare row 19 and column 16 with corr 0.81
## Means: 0.353 vs 0.263 so flagging column 19
## Compare row 7 and column 8 with corr 0.98
## Means: 0.283 vs 0.252 so flagging column 7
## All correlations <= 0.7

print(highlyCor)
## [1] "meanfreq" "centroid" "Q25"      "sd"       "median"   "sfm"
## [7] "maxdom"   "dfrange"  "skew"

The columns listed above need to be removed to avoid a predictive bias. Removing the columns:

voice_correlation <- voice[, which(!colnames(voice) %in% highlyCor)]

Nine variables that were highly correlated were removed.

### PCA Visualization with Correlated Terms Removed

pca_viz(voice_correlation, which(colnames(voice_correlation) == "label"), "PCA with Correlated Terms Removed")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character

With the correlated variables in the data set, the first principal component contributed around 45% of the total variance. After removing them, its contribution drops to 31%. Correlated variables tend to overemphasize the contribution of the first principal component to the total variance. For a detailed explanation please read this article.

### Training the Random Forest

Now that we have removed the correlated terms, we apply the random forest algorithm again.

voice_train <- voice_correlation[train, ]
voice_test <- voice_correlation[-train, ]

set.seed(100)
model_rf <- train(label ~ .,
                  data = voice_train,
                  method = "rf",
                  preProcess = c("scale", "center"),
                  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = F))

### ROC Curves

voice_test_pred <- predict(model_rf, voice_test[, -which(colnames(voice_test) == "label")])

predvec <- ifelse(voice_test_pred == "female", 1, 0)
realvec <- ifelse(voice_test$label == "female", 1, 0)
pred <- prediction(predvec,realvec)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for Random Forest Classifier With Correlated Terms Removed",col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

perf.auc <- performance(pred, measure = "auc")
paste("ROC Value",unlist(perf.auc@y.values))
## [1] "ROC Value 0.977887907013773"
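As a self-contained illustration of what the correlation filter in the previous section is doing, here is a minimal greedy version in base R on a toy data frame. The column names and the 0.7 cutoff are just for this sketch; caret::findCorrelation is more careful, comparing the mean absolute correlations of a pair before deciding which column to flag.

```r
# Toy correlation filter: drop one column from every pair with |r| > 0.7
set.seed(42)
x1 <- rnorm(100)
toy <- data.frame(a = x1,
                  b = x1 + rnorm(100, sd = 0.1),  # near-duplicate of a
                  c = rnorm(100))                 # unrelated column

cm <- abs(cor(toy))
diag(cm) <- 0                                     # ignore self-correlation
high <- which(cm > 0.7, arr.ind = TRUE)           # offending pairs
drop_cols <- unique(colnames(toy)[pmax(high[, "row"], high[, "col"])])
toy_filtered <- toy[, setdiff(colnames(toy), drop_cols), drop = FALSE]

drop_cols  # "b": the near-duplicate is removed
```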
auc_values[2,]$values <- unlist(perf.auc@y.values)

## Recursive Feature Elimination

Recursive feature elimination repeatedly fits the model, ranks the features by importance and discards the weakest, evaluating each candidate subset size by cross-validation.

set.seed(100)
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
results_1 <- rfe(x = voice[, -21], y = voice$label, sizes = (1:9), rfeControl = control)

predictors(results_1)
## [1] "meanfun" "IQR"     "Q25"     "sd"      "sfm"
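The core loop of recursive feature elimination can be sketched by hand in base R. In this sketch a linear model's t-statistics stand in for the importance ranking and the data are simulated (one informative predictor, two noise columns); this is an assumption-laden illustration of the idea, not what rfFuncs does internally.

```r
# Recursive feature elimination, hand-rolled: repeatedly fit, rank
# features, and drop the weakest until one remains.
set.seed(1)
n <- 200
informative <- rnorm(n)
noise1 <- rnorm(n)
noise2 <- rnorm(n)
y <- 2 * informative + rnorm(n, sd = 0.5)
X <- data.frame(informative, noise1, noise2)

features <- colnames(X)
while (length(features) > 1) {
  fit <- lm(y ~ ., data = X[, features, drop = FALSE])
  tvals <- abs(summary(fit)$coefficients[-1, "t value"])
  weakest <- names(which.min(tvals))   # lowest |t| = least useful
  features <- setdiff(features, weakest)
}
features  # the informative predictor survives
```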

### PCA Visualization

voice_rfe <- voice[,c(which(colnames(voice) %in% predictors(results_1)),21)]
pca_viz(voice_rfe,which(colnames(voice_rfe)=="label"),"PCA with rfe")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
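The pca_viz() calls rely on prcomp(). As a quick sanity check of the earlier claim that the principal components are linearly uncorrelated, we can rotate the built-in iris measurements (standing in for the voice data) and inspect the correlations of the scores:

```r
# PCA scores should be pairwise uncorrelated, and the per-component
# variance fractions should sum to 1 (iris stand-in data).
pca <- prcomp(iris[, 1:4], scale = TRUE, center = TRUE)

score_cor <- cor(pca$x)                        # ~ identity matrix
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # same quantity pca_viz plots

round(score_cor, 10)
round(var_explained, 3)
```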

### Random Forest

voice_train <- voice_rfe[train,]
voice_test <- voice_rfe[-train,]

set.seed(100)

model_rf <- train(label ~ .,
                  data = voice_train,
                  method = "rf",
                  preProcess = c("scale", "center"),
                  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = F))

voice_test_pred <- predict(model_rf, voice_test[,-which(colnames(voice_test)=="label")])

### ROC Curves and AUC

predvec <- ifelse(voice_test_pred=="female", 1, 0)
realvec <- ifelse(voice_test$label == "female", 1, 0)

pred <- prediction(predvec, realvec)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for Random Forest Classifier", col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

perf.auc <- performance(pred, measure = "auc")
paste("ROC Value", unlist(perf.auc@y.values))
## [1] "ROC Value 0.977907809887748"

auc_values[3,]$values <- unlist(perf.auc@y.values)

## Comparing AUC Values and Feature Selection Methods

auc_values %>%
  ggplot(aes(x = feature_selection, y = values, group = 1)) +
  geom_line() + geom_point() + plotTheme() +
  theme(axis.text.x = element_text(vjust = -1, angle = 90)) +
  labs(y = "AUC Values")

## Conclusion

We see that the AUC values are roughly the same across the three approaches. Removing features by recursive elimination and by dropping correlated variables both improve the performance of the classifier slightly. Information on feature selection using the caret package can be found here.