Cosine Similarity with spam message Feature Data Leakage #1

@blazysecon

Hi. Great tutorial. Just a quick note on Session 11: when creating the cosine similarity with spam messages feature on the training data, you should exclude the observation itself from the list of spam messages:

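# Mean cosine similarity of each training message to the spam messages,
# excluding the message itself so its self-similarity is not averaged in.
spam.indexes <- which(train$Label == "spam")
train.svd$SpamSimilarity <- rep(0.0, nrow(train.svd))
for (i in seq_len(nrow(train.svd))) {
    # Drop i from the spam list (no effect when message i is ham).
    spam.indexesCV <- setdiff(spam.indexes, i)
    train.svd$SpamSimilarity[i] <- mean(train.similarities[i, spam.indexesCV])
}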
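
To illustrate why the self-term matters, here is a minimal sketch on a made-up 4 x 4 similarity matrix. The names `toy.similarities`, `toy.labels` and `toy.spam.indexes` and all values are hypothetical and not from the tutorial; the only assumption carried over is that each message has cosine similarity 1 with itself.

```r
# Toy similarity matrix for 4 training messages (hypothetical values).
# Messages 1 and 3 are spam, 2 and 4 are ham; the diagonal is 1 because
# every message has cosine similarity 1 with itself.
toy.similarities <- matrix(c(1.0, 0.2, 0.8, 0.1,
                             0.2, 1.0, 0.3, 0.6,
                             0.8, 0.3, 1.0, 0.2,
                             0.1, 0.6, 0.2, 1.0),
                           nrow = 4, byrow = TRUE)
toy.labels <- factor(c("spam", "ham", "spam", "ham"))
toy.spam.indexes <- which(toy.labels == "spam")

# Leaky version: spam rows average in their own similarity of 1.
leaky <- rowMeans(toy.similarities[, toy.spam.indexes, drop = FALSE])

# Leave-one-out version: drop index i before averaging.
loo <- vapply(seq_len(nrow(toy.similarities)), function(i) {
  mean(toy.similarities[i, setdiff(toy.spam.indexes, i)])
}, numeric(1))

print(data.frame(label = toy.labels, leaky = leaky, loo = loo))
```

In this toy case only the spam rows change (0.9 drops to 0.8), while the ham rows are identical in both versions. That asymmetry is the leak: the inflated value is only available for training rows whose label is already known to be spam.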

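Excluding each observation from its own average removes the data leakage that leads to over-fitting: a spam message's cosine similarity with itself is 1, which inflates its SpamSimilarity on the training data in a way that cannot occur for test messages. The RF results on the test data with the updated feature are much better: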

 # Drill-in on results
 confusionMatrix(preds, test.svd$Label)
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1445   32
      spam    2  192
                                      
               Accuracy : 0.98          
                 95% CI : (0.972, 0.986)
    No Information Rate : 0.866         
    P-Value [Acc > NIR] : < 2e-16       
                                        
                  Kappa : 0.907         
 Mcnemar's Test P-Value : 0.000000658   
                                        
            Sensitivity : 0.999         
            Specificity : 0.857         
         Pos Pred Value : 0.978         
         Neg Pred Value : 0.990         
             Prevalence : 0.866         
         Detection Rate : 0.865         
   Detection Prevalence : 0.884         
      Balanced Accuracy : 0.928         
                                        
       'Positive' Class : ham           
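
As a quick sanity check, the headline statistics can be recomputed by hand from the four counts in the confusion matrix. This is only a sketch of the arithmetic, not how `confusionMatrix` computes them internally; the variable names are illustrative, and "ham" is treated as the positive class, as in the output.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(1445,  32,
                  2, 192),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)                  # (1445 + 192) / 1671
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])      # ham correctly kept: 1445 / 1447
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])   # spam correctly caught: 192 / 224
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 3))
```

The rounded values match the reported 0.98, 0.999, 0.857 and 0.928. The specificity of 0.857 corresponds to 32 of the 224 spam messages in the test set still being classified as ham.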
                            
