Description
I'm developing a package for project-specific data processing. One step checks whether a set of names are really distinct, or whether similar names refer to the same person. For this I first generate, from a database, a data.table of pairs that are similar according to a string similarity measure, and compare it to a data.table of pairs for which I have manually checked whether they refer to the same person. If every pair of similar names is covered by my manually compiled list, the test passes.
I do this via a negative join with data.table:
dt_redux <- dt_pairs_from_db[!dt_manually_checked_pairs, on = .(name1, name2)]
expect_true(nrow(dt_redux)==0)
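For readers who want to try this, here is a minimal, self-contained sketch of the same negative join. The table contents and the `same_person` column are made-up toy data, not the real tables. It also shows that the join is sensitive to the order of the names within a pair.

```r
library(data.table)

# Toy stand-ins for the real tables (hypothetical data)
dt_pairs_from_db <- data.table(
  name1 = c("Jon Smith", "Anna Meyer"),
  name2 = c("John Smith", "Anna Meier")
)
dt_manually_checked_pairs <- data.table(
  name1 = c("Jon Smith", "Anna Meyer"),
  name2 = c("John Smith", "Anna Meier"),
  same_person = c(TRUE, FALSE)
)

# Negative (anti) join: keep the rows of dt_pairs_from_db
# that have no match in the manually checked list
dt_redux <- dt_pairs_from_db[!dt_manually_checked_pairs, on = .(name1, name2)]
nrow(dt_redux)

# Swapping the two names of one pair breaks the match,
# because the join compares name1 and name2 column by column
dt_swapped <- copy(dt_pairs_from_db)
dt_swapped[2, `:=`(name1 = "Anna Meier", name2 = "Anna Meyer")]
dt_unmatched <- dt_swapped[!dt_manually_checked_pairs, on = .(name1, name2)]
nrow(dt_unmatched)
```

With the toy data above, the first join leaves no rows and the second leaves the swapped pair.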
This test passed when run via test_all or build_install_test, but failed under R CMD check.
After some searching I tracked it down to the order of the names within each pair in dt_pairs_from_db. The pairs are generated by a string similarity function, which creates two entries for each pair (name1, name2 and name2, name1). To avoid having to check each pair twice, I only keep the cases where name1 > name2. However, for one pair, "İnan Kıraç" and "Suna Kıraç", the alphabetical order differs between the normal R environment and the testing environment: in the normal R environment, expect_true("İnan Kıraç" > "Suna Kıraç") fails, but in the testing environment (in my test_package.R file), expect_true("İnan Kıraç" > "Suna Kıraç") passes.
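A minimal way to probe this, assuming the difference comes from the collation locale (that is a guess, not something established here). R compares strings using the collating sequence of the locale in use, so the sketch below runs the same comparison under the session's own LC_COLLATE and under "C". `compare_under_collation()` is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical helper: evaluate a > b under a given LC_COLLATE setting
# and restore the previous setting afterwards.
compare_under_collation <- function(a, b, collate) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", collate)
  a > b
}

# Collation currently used by the session
Sys.getlocale("LC_COLLATE")

# Comparison under the session's own collation
"İnan Kıraç" > "Suna Kıraç"

# Same comparison with collation forced to "C"
compare_under_collation("İnan Kıraç", "Suna Kıraç", "C")
```

If the two results differ, printing Sys.getlocale("LC_COLLATE") from inside the test file would show whether the testing environment runs under a different collation than the interactive session.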
This difference in alphabetical order led to dt_pairs_from_db containing this pair in the opposite order from the one in my dt_manually_checked_pairs, which caused the test to fail.
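One possible way to make the order within a pair independent of the environment is to fix the collation while choosing which name goes first. This is only a sketch, under the assumption that collation is what differs between the two environments. `canonical_pair()` is a hypothetical helper, and it follows the name1 > name2 convention described above.

```r
# Hypothetical helper: put each pair into a fixed order (name1 > name2),
# comparing under LC_COLLATE = "C" so the result does not depend on
# the collation of the calling session.
canonical_pair <- function(a, b) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  swap <- a < b
  data.frame(
    name1 = ifelse(swap, b, a),
    name2 = ifelse(swap, a, b),
    stringsAsFactors = FALSE
  )
}

# Both orderings of the same two names end up as the same row
pairs <- canonical_pair(
  c("İnan Kıraç", "Suna Kıraç"),
  c("Suna Kıraç", "İnan Kıraç")
)
identical(pairs[1, "name1"], pairs[2, "name1"])
```

Applying the same helper to both the generated pairs and the manually checked pairs before the join would keep the two tables in a consistent order, at the cost of an order that may look less natural than the locale's alphabetical one.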
I've now worked around it by adding this particular pair in both orders to my dt_manually_checked_pairs, but I'm curious what caused this; any ideas?