Cosine Similarity with spam message Feature Data Leakage

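Hi. Great tutorial. Just a quick note on Session 11: when creating the cosine-similarity-with-spam-messages feature on the training data, you should exclude the observation itself from the list of spam messages. Otherwise the mean for each spam row includes that row's similarity with itself (which is always 1), so the feature partly encodes the label: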
``` {r}
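# Mean cosine similarity of each training message with the spam messages,
# leaving the message itself out of the spam list to avoid leaking its label.
spam.indexes <- which(train$Label == "spam")
train.svd$SpamSimilarity <- rep(0.0, nrow(train.svd))
for (i in seq_len(nrow(train.svd))) {
    # For spam rows, drop index i so the self-similarity is not averaged in.
    spam.indexesCV <- setdiff(spam.indexes, i)
    train.svd$SpamSimilarity[i] <- mean(train.similarities[i, spam.indexesCV])
}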
```

This fixes the data leakage that was leading to over-fitting. The random forest (RF) results on the test data with the updated feature are much better:
```
 # Drill-in on results
 confusionMatrix(preds, test.svd$Label)
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1445   32
      spam    2  192
                                      
               Accuracy : 0.98          
                 95% CI : (0.972, 0.986)
    No Information Rate : 0.866         
    P-Value [Acc > NIR] : < 2e-16       
                                        
                  Kappa : 0.907         
 Mcnemar's Test P-Value : 0.000000658   
                                        
            Sensitivity : 0.999         
            Specificity : 0.857         
         Pos Pred Value : 0.978         
         Neg Pred Value : 0.990         
             Prevalence : 0.866         
         Detection Rate : 0.865         
   Detection Prevalence : 0.884         
      Balanced Accuracy : 0.928         
                                        
       'Positive' Class : ham           
                            
```
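
For anyone who wants to see the effect in isolation, here is a minimal sketch on toy data. The `features` matrix, the `labels` vector and the names `naive` and `loo` are made up for illustration and are not part of the tutorial code; the similarity matrix is built by hand rather than with the tutorial's own helper.

```r
# Toy data: 4 messages in a 2-dimensional feature space (hypothetical values).
features <- matrix(c(1, 0,
                     1, 1,
                     0, 1,
                     0, 2),
                   ncol = 2, byrow = TRUE)
labels <- c("spam", "spam", "ham", "ham")

# Cosine similarity matrix: scale rows to unit length, then take dot products.
unit <- features / sqrt(rowSums(features^2))
similarities <- unit %*% t(unit)

spam.indexes <- which(labels == "spam")

# Naive feature: each spam row's mean includes its own self-similarity of 1.
naive <- rowMeans(similarities[, spam.indexes, drop = FALSE])

# Leave-one-out feature: drop the row itself from the spam list first.
loo <- sapply(seq_len(nrow(similarities)), function(i) {
  mean(similarities[i, setdiff(spam.indexes, i)])
})

print(round(cbind(naive, loo), 3))
```

The two ham rows get the same value either way, while both spam rows come out lower in the `loo` column. That gap is the part of the naive feature that only reflects the row's own label, and it cannot exist for test rows because they are never in the training spam list.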
