Feature Selection using Voice Gender Data and the caret package

Feature Selection Methods

Feature Selection, Dimensionality Reduction and Random Forests

This post is based on an article by Shirin Glander on feature selection.

Feature selection is the process of selecting a subset of relevant features for use in a classification problem. This allows for:

  1. Simplification of Models
  2. Shorter training times
  3. Avoiding overfitting, so that models can be generalised to other incoming data

Dimensionality reduction is a way of reducing the number of random variables under consideration. In this project, we will use PCA to project the data onto a 2-dimensional frame along the eigenvectors that capture the maximum variance. For classification, we will use the random forests algorithm, which builds a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees.

We will use the voice gender dataset, which can be found on Kaggle and consists of numerical voice parameters measured for male and female speakers. To evaluate our models, we will use the AUC (Area Under the ROC Curve). This value falls between 0 and 1 and is equal to the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one, so ideally we would want an AUC close to 1.
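That rank-based interpretation of the AUC can be checked on a toy example. The sketch below uses made-up scores and labels (not the voice data) to estimate the probability that a randomly chosen positive is scored above a randomly chosen negative, and compares it with the AUC reported by ROCR.

# Illustrative sketch only: synthetic scores and labels
set.seed(1)
scores <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # 50 positives, 50 negatives
labels <- c(rep(1, 50), rep(0, 50))

# Probability that a random positive outranks a random negative (ties count half)
pos <- scores[labels == 1]
neg <- scores[labels == 0]
rank_prob <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# AUC from ROCR agrees with the rank probability
library(ROCR)
auc <- performance(prediction(scores, labels), measure = "auc")@y.values[[1]]
c(rank_probability = rank_prob, ROCR_auc = auc)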

Reading relevant libraries and data

We will load the necessary libraries and tools for data analysis.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.2
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: package 'tibble' was built under R version 3.3.2
## Warning: package 'tidyr' was built under R version 3.3.2
## Warning: package 'readr' was built under R version 3.3.2
## Warning: package 'purrr' was built under R version 3.3.2
## Warning: package 'dplyr' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.2
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.2
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(doParallel)
## Warning: package 'doParallel' was built under R version 3.3.2
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.3.2
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 3.3.2
## Loading required package: parallel
library(ellipse)
## Warning: package 'ellipse' was built under R version 3.3.3
library(GGally)
## Warning: package 'GGally' was built under R version 3.3.3
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
voice <- read.csv("voice.csv",header=T,stringsAsFactors = F)
str(voice)
## 'data.frame':    3168 obs. of  21 variables:
##  $ meanfreq: num  0.0598 0.066 0.0773 0.1512 0.1351 ...
##  $ sd      : num  0.0642 0.0673 0.0838 0.0721 0.0791 ...
##  $ median  : num  0.032 0.0402 0.0367 0.158 0.1247 ...
##  $ Q25     : num  0.0151 0.0194 0.0087 0.0966 0.0787 ...
##  $ Q75     : num  0.0902 0.0927 0.1319 0.208 0.206 ...
##  $ IQR     : num  0.0751 0.0733 0.1232 0.1114 0.1273 ...
##  $ skew    : num  12.86 22.42 30.76 1.23 1.1 ...
##  $ kurt    : num  274.4 634.61 1024.93 4.18 4.33 ...
##  $ sp.ent  : num  0.893 0.892 0.846 0.963 0.972 ...
##  $ sfm     : num  0.492 0.514 0.479 0.727 0.784 ...
##  $ mode    : num  0 0 0 0.0839 0.1043 ...
##  $ centroid: num  0.0598 0.066 0.0773 0.1512 0.1351 ...
##  $ meanfun : num  0.0843 0.1079 0.0987 0.089 0.1064 ...
##  $ minfun  : num  0.0157 0.0158 0.0157 0.0178 0.0169 ...
##  $ maxfun  : num  0.276 0.25 0.271 0.25 0.267 ...
##  $ meandom : num  0.00781 0.00901 0.00799 0.2015 0.71281 ...
##  $ mindom  : num  0.00781 0.00781 0.00781 0.00781 0.00781 ...
##  $ maxdom  : num  0.00781 0.05469 0.01562 0.5625 5.48438 ...
##  $ dfrange : num  0 0.04688 0.00781 0.55469 5.47656 ...
##  $ modindx : num  0 0.0526 0.0465 0.2471 0.2083 ...
##  $ label   : chr  "male" "male" "male" "male" ...
voice$label <- as.factor(voice$label)
plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 10,colour = "black",hjust=0.5),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=8),
    axis.text = element_text(size=8),
    axis.title.x = element_text(hjust=1),
    axis.title.y = element_text(hjust=1),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "bold"),
    legend.text = element_text(colour = "black", face = "bold"))
}


auc_values <- data.frame(feature_selection = c("No Feature Elimination","Correlation","Recursive Feature Elimination"),values=c(0,0,0))

The data consists of 3168 observations (rows) and 21 variables (columns). The last variable denotes the class to which each row belongs; we convert it from a character variable to a factor for building the models.

How many instances of each class (male/female) are there?

voice %>% group_by(label) %>%
  summarise(n=n()) %>%
  ggplot(aes(x=label,y=n))+
  geom_bar(stat="identity")+plotTheme()+labs(title="Number of Each Instance")

 

[Figure: Number of Each Instance]

How does each numerical variable vary across the labels?

voice %>% na.omit() %>%
  gather(type,value,1:20) %>%
  ggplot(aes(x=value,fill=label))+geom_density(alpha=0.3)+plotTheme()+facet_wrap(~type,scales="free")+theme(axis.text.x = element_text(angle = 90,vjust=1))+labs(title="Density Plots of Data across Variables")

 

[Figure: Density Plots of Data across Variables]

The plots above show that, for every feature, the value distributions of the male and female labels overlap to some extent.

Principal Components Analysis

Principal components analysis is a statistical procedure that uses an orthogonal transformation to convert the data into a set of linearly uncorrelated variables.
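As a quick sanity check of that property, the component scores returned by prcomp should be numerically uncorrelated. A minimal sketch, assuming the voice data frame loaded above:

# Sanity check: principal component scores are pairwise uncorrelated
pca_check <- prcomp(voice[, -21], scale. = TRUE, center = TRUE)
round(cor(pca_check$x[, 1:4]), 10)  # off-diagonal entries should be ~0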

pca_viz <- function(dataframe, label, heading)
{
  # Run PCA on every column except the label column
  data_new <- dataframe[-label]
  pca_temp <- prcomp(data_new, scale. = TRUE, center = TRUE)

  # Attach the class labels to the component scores
  pcaOutput_df <- as.data.frame(pca_temp$x)
  pcaOutput_df$label <- dataframe[, label]
  centroids <- pcaOutput_df %>% group_by(label) %>%
    summarise(PC1 = mean(PC1), PC2 = mean(PC2))

  # 95% confidence region around each class centroid
  conf.rgn_male <- data.frame(label = "male",
                              ellipse(cov(pcaOutput_df[pcaOutput_df$label == "male", c("PC1", "PC2")]),
                                      centre = as.matrix(centroids[centroids$label == "male", c("PC1", "PC2")]),
                                      level = 0.95))
  conf.rgn_female <- data.frame(label = "female",
                                ellipse(cov(pcaOutput_df[pcaOutput_df$label == "female", c("PC1", "PC2")]),
                                        centre = as.matrix(centroids[centroids$label == "female", c("PC1", "PC2")]),
                                        level = 0.95))
  conf.rgn <- bind_rows(conf.rgn_female, conf.rgn_male) %>% mutate(label = as.factor(label))

  # Proportion of total variance explained by each component
  var <- (pca_temp$sdev)^2 / sum((pca_temp$sdev)^2)

  pcaOutput_df %>%
    ggplot(aes(x = PC1, y = PC2, colour = label)) +
    geom_point(alpha = 0.3) +
    geom_polygon(data = conf.rgn, aes(fill = label), alpha = 0.2) +
    labs(title = heading,
         x = paste("PC1", round(var[1] * 100, 2), "% Variance"),
         y = paste("PC2", round(var[2] * 100, 2), "% Variance")) +
    plotTheme() + theme(legend.position = "bottom")
}




pca_viz(voice,21,"PCA Visualization")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character

 

[Figure: PCA Visualization]

We see an overlap between the 95% confidence regions of the male and female clusters.

Feature Importance

registerDoParallel()
control <- trainControl(method="repeatedcv",number=10,repeats = 10)
model <- train(label ~ ., data = voice, method = "rf", preProcess = c("scale", "center"), trControl = control)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
importance<- varImp(model,scale=T)
imp_df1 <- importance$importance
imp_df1$group <- rownames(imp_df1)

imp_df1 %>%
  ggplot(aes(x=reorder(group,Overall),y=Overall),size=2)+geom_bar(stat = "identity")+theme(axis.text.x = element_text(vjust=1,angle=90))+
  labs(x="Variable",y="Overall Importance",title="Scaled Feature Importance")+
  plotTheme()

 

[Figure: Scaled Feature Importance]

The feature importance scores help us get acquainted with the dataset by ranking features according to how useful they are for building the decision trees. According to the plot above, meanfun, IQR and Q25 are the three most important attributes. This matches the density plots shown earlier: for these attributes there is a clear demarcation between the male and female labels, i.e. minimal overlap.
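For method = "rf", caret's varImp() is derived from the importance measures of the underlying randomForest fit. A hedged sketch of inspecting those measures directly, without caret (the ntree value is an arbitrary choice for illustration):

# Sketch: random forest importance measures without caret
library(randomForest)
set.seed(100)
rf_fit <- randomForest(label ~ ., data = voice, importance = TRUE, ntree = 500)
head(importance(rf_fit))  # permutation (MeanDecreaseAccuracy) and Gini (MeanDecreaseGini) importance
varImpPlot(rf_fit)        # built-in importance plot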

Random forests

set.seed(100)
train <- sample(dim(voice)[1],dim(voice)[1]*0.9)
voice_train <- voice[train,]
voice_test <- voice[-train,]
set.seed(100)
model_rf <- train(label~.,
                  data=voice_train,
                  method="rf",
            preProcess=c("scale","center"),
            trControl = trainControl(method = "repeatedcv",number=5,repeats = 10,verboseIter = F)
                  )

ROC Curves

voice_test_pred <- predict(model_rf, voice_test[,-21])
predvec <- ifelse(voice_test_pred=="female", 1, 0)
realvec <- ifelse(voice_test$label=="female", 1, 0)
pred <- prediction(predvec,realvec)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for Random Forest Classifier",col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

[Figure: ROC curve for Random Forest Classifier]

perf.auc <- performance(pred, measure = "auc")
paste("ROC Value",unlist(perf.auc@y.values))
## [1] "ROC Value 0.968434041875647"
auc_values[1,]$values <- unlist(perf.auc@y.values)
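Note that the ROC curve above is built from hard class predictions, so it effectively contains a single operating point. A variant that uses the class probabilities from predict(..., type = "prob") traces a fuller curve; a hedged sketch, reusing the model_rf, voice_test and realvec objects from above:

# Sketch: ROC from predicted class probabilities instead of hard labels
voice_test_prob <- predict(model_rf, voice_test[, -21], type = "prob")
pred_prob <- prediction(voice_test_prob$female, realvec)
perf_prob <- performance(pred_prob, measure = "tpr", x.measure = "fpr")
plot(perf_prob, main = "ROC curve from class probabilities", col = "darkgreen", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)
performance(pred_prob, measure = "auc")@y.values[[1]]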

Removing Correlated Variables

Highly correlated features provide redundant information, and removing them helps avoid the bias that this redundancy can introduce into the prediction step. In this section we compute the pairwise correlations with the cor function and visualize them with geom_tile().

library(reshape2)
## Warning: package 'reshape2' was built under R version 3.3.2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
cor_voice <- round(cor(voice[,-21]),2)
cor_voice_df <- melt((cor_voice))
cor_voice_df %>%
  ggplot(aes(x=Var1,y=Var2,fill=value))+geom_tile()+plotTheme()+theme(axis.text.x = element_text(vjust=1,angle=90),legend.position = "bottom")+scale_fill_continuous(low="#eb0096",high="#a3742c")+labs(title="Correlation Plots")

[Figure: Correlation Plots]

We will define a high correlation as greater than or equal to 0.7.

highlyCor <- colnames(voice)[findCorrelation(cor_voice, cutoff = 0.7, verbose = TRUE)]
## Compare row 1  and column  12 with corr  1 
##   Means:  0.568 vs 0.382 so flagging column 1 
## Compare row 12  and column  4 with corr  0.91 
##   Means:  0.544 vs 0.363 so flagging column 12 
## Compare row 4  and column  2 with corr  0.85 
##   Means:  0.502 vs 0.344 so flagging column 4 
## Compare row 2  and column  10 with corr  0.84 
##   Means:  0.451 vs 0.326 so flagging column 2 
## Compare row 3  and column  5 with corr  0.73 
##   Means:  0.419 vs 0.319 so flagging column 3 
## Compare row 10  and column  9 with corr  0.87 
##   Means:  0.383 vs 0.299 so flagging column 10 
## Compare row 18  and column  19 with corr  1 
##   Means:  0.407 vs 0.28 so flagging column 18 
## Compare row 19  and column  16 with corr  0.81 
##   Means:  0.353 vs 0.263 so flagging column 19 
## Compare row 7  and column  8 with corr  0.98 
##   Means:  0.283 vs 0.252 so flagging column 7 
## All correlations <= 0.7
print(highlyCor)
## [1] "meanfreq" "centroid" "Q25"      "sd"       "median"   "sfm"     
## [7] "maxdom"   "dfrange"  "skew"

The columns listed above are the ones that will be removed to avoid this predictive bias.

Removing the columns

voice_correlation <- voice[,which(!colnames(voice) %in% highlyCor)]

Nine highly correlated variables were removed.

PCA Visualization with Correlated Terms Removed

pca_viz(voice_correlation,which(colnames(voice_correlation)=="label"),"PCA with Correlated Terms Removed")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character

[Figure: PCA with Correlated Terms Removed]

With the correlated variables still in the data set, the first principal component contributed around 45% of the total variance. After their removal, its contribution drops to about 31%. Correlated variables tend to overemphasize the contribution of the leading principal component to the total variance. For a detailed explanation, please read this article.
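That effect can be reproduced on synthetic data. The toy sketch below (illustrative only, with made-up variables) shows how adding a near-duplicate of one column inflates the share of variance attributed to PC1:

# Toy illustration: a near-duplicate column inflates PC1's variance share
set.seed(42)
x1 <- rnorm(500); x2 <- rnorm(500); x3 <- rnorm(500)
uncorrelated   <- data.frame(x1, x2, x3)
with_duplicate <- data.frame(x1, x2, x3, x1_copy = x1 + rnorm(500, sd = 0.05))

var_share <- function(df) {
  p <- prcomp(df, scale. = TRUE)
  round(p$sdev[1]^2 / sum(p$sdev^2), 2)  # proportion of variance on PC1
}
c(uncorrelated = var_share(uncorrelated), with_duplicate = var_share(with_duplicate))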

Training the Random Forest

Now that we have removed the correlated terms, we will apply the random forest algorithm again.

#train <- sample(dim(voice)[1],dim(voice)[1]*0.9)


voice_train <- voice_correlation[train,]
voice_test <- voice_correlation[-train,]

set.seed(100)

model_rf <- train(label~.,
                  data=voice_train,
                  method="rf",
            preProcess=c("scale","center"),
            trControl = trainControl(method = "repeatedcv",number=5,repeats = 10,verboseIter = F)
                  )

voice_test_pred <- predict(model_rf, voice_test[,-which(colnames(voice_test)=="label")])

ROC Curves

predvec <- ifelse(voice_test_pred=="female", 1, 0)
realvec <- ifelse(voice_test$label=="female", 1, 0)
pred <- prediction(predvec,realvec)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for Random Forest Classifier With Correlated Terms Removed",col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

[Figure: ROC curve for Random Forest Classifier With Correlated Terms Removed]

perf.auc <- performance(pred, measure = "auc")
paste("ROC Value",unlist(perf.auc@y.values))
## [1] "ROC Value 0.977887907013773"
auc_values[2,]$values <- unlist(perf.auc@y.values)

Recursive Feature Elimination

Recursive feature elimination (RFE) repeatedly fits the model, ranks the features, and discards the weakest ones, using cross-validation to pick the best-performing subset size. caret implements this in the rfe function.

set.seed(100)
control <- rfeControl(functions=rfFuncs,method = "cv",number=10)
results_1 <- rfe(x=voice[,-21],y=voice$label,sizes=(1:9),rfeControl = control)

predictors(results_1)
## [1] "meanfun" "IQR"     "Q25"     "sd"      "sfm"

PCA Visualization

voice_rfe <- voice[,c(which(colnames(voice) %in% predictors(results_1)),21)]
pca_viz(voice_rfe,which(colnames(voice_rfe)=="label"),"PCA with rfe")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character

[Figure: PCA with rfe]

Random Forest

voice_train <- voice_rfe[train,]
voice_test <- voice_rfe[-train,]

set.seed(100)

model_rf <- train(label~.,
                  data=voice_train,
                  method="rf",
            preProcess=c("scale","center"),
            trControl = trainControl(method = "repeatedcv",number=5,repeats = 10,verboseIter = F)
                  )

voice_test_pred <- predict(model_rf, voice_test[,-which(colnames(voice_test)=="label")])

ROC Curves and AUC

predvec <- ifelse(voice_test_pred=="female", 1, 0)
realvec <- ifelse(voice_test$label=="female", 1, 0)
pred <- prediction(predvec,realvec)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for Random Forest Classifier",col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

[Figure: ROC curve for Random Forest Classifier (RFE-selected features)]

perf.auc <- performance(pred, measure = "auc")
paste("ROC Value",unlist(perf.auc@y.values))
## [1] "ROC Value 0.977907809887748"
auc_values[3,]$values <- unlist(perf.auc@y.values)

Comparing AUC Values and Feature Selection Methods

auc_values %>%
  ggplot(aes(x=feature_selection,y=values,group=1))+geom_line()+geom_point()+plotTheme()+theme(axis.text.x = element_text(vjust=-1,angle=90))+labs(y="AUC Values")

[Figure: AUC Values by Feature Selection Method]

Conclusion

We see that the AUC values are roughly the same. Removing features by recursive elimination or by dropping correlated variables slightly improves the performance of the classifier. More information on feature selection using the caret package can be found here.
