Hi. Great tutorial. Just a quick note on Session 11: when computing the mean cosine similarity with spam messages as a feature on the training data, you should exclude the observation itself from the list of spam messages:
# cosine similarities with spam messages and vice versa!
spam.indexes <- which(train$Label == "spam")
train.svd$SpamSimilarity <- rep(0.0, nrow(train.svd))
for(i in seq_len(nrow(train.svd))) {
  # Drop row i from the spam list so a spam message is never compared with itself
  spam.indexesCV <- setdiff(spam.indexes, i)
  train.svd$SpamSimilarity[i] <- mean(train.similarities[i, spam.indexesCV])
}
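This removes the data leakage that leads to over-fitting: a spam message's cosine similarity with itself is 1, so including it inflates the feature for spam rows in the training data only. The random forest results on the test data with the updated feature are much better: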
# Drill-in on results
confusionMatrix(preds, test.svd$Label)
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1445   32
      spam    2  192

               Accuracy : 0.98
                 95% CI : (0.972, 0.986)
    No Information Rate : 0.866
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.907
 Mcnemar's Test P-Value : 0.000000658

            Sensitivity : 0.999
            Specificity : 0.857
         Pos Pred Value : 0.978
         Neg Pred Value : 0.990
             Prevalence : 0.866
         Detection Rate : 0.865
   Detection Prevalence : 0.884
      Balanced Accuracy : 0.928

       'Positive' Class : ham
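To make the leakage concrete, here is a minimal sketch on made-up data. The matrix `X`, the vector `label`, and the names `naive` and `loo` are hypothetical and not from the tutorial; the similarity matrix is computed in base R only to keep the example self-contained. It compares the mean similarity to spam with and without the observation itself:

```r
# Toy data (hypothetical): 4 documents in a 2-dimensional feature space.
X <- rbind(
  c(1, 0),   # spam
  c(1, 1),   # spam
  c(0, 1),   # ham
  c(1, 2)    # ham
)
label <- c("spam", "spam", "ham", "ham")

# Pairwise cosine similarities: dot products divided by the product of row norms.
row.norms <- sqrt(rowSums(X^2))
sims <- (X %*% t(X)) / (row.norms %o% row.norms)

spam.indexes <- which(label == "spam")

# Naive feature: each spam row also averages over its own self-similarity of 1.
naive <- rowMeans(sims[, spam.indexes, drop = FALSE])

# Leave-one-out feature: drop the row's own index before averaging.
loo <- sapply(seq_len(nrow(sims)), function(i) {
  mean(sims[i, setdiff(spam.indexes, i)])
})

print(round(cbind(naive, loo), 3))
```

In this toy case the two columns should agree for the ham rows and differ only for the spam rows, where the naive value is pulled up by the self-similarity. Note that with a single spam message the leave-one-out mean would be taken over an empty set and return `NaN`, so that case would need separate handling.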