Hi,
Amazing project! I actually found out about it after building something on the exact same principle, using Swedish data, for my own needs. I just published the code here.
I'm both frustrated and happy I found your project (as well as name-dataset) because I couldn't find anything when I first looked and felt like I had to write my own code. But now that I've done it, I'm bummed someone implemented it better and with more data. Oh well... 😊
Anyway, I'm reaching out since I saw that you seem to be using newborn data for Sweden. I've been using a different dataset which I think works better. SCB (Statistics Sweden) has a list of all the names borne by at least two people living in Sweden (first, middle and last names). They can be found on this page (the files called Namnsök 2021 and 2022).
I did the math and the list covers about 98% of the population (i.e. the remaining 2% have a unique name and are therefore not in the list). So it's far more exhaustive than the lists of newborns, even if you go back a few decades. In total, there are 97386 unique first names, compared with the 1518 in your newborn dataset.
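For reference, here is a rough sketch of the arithmetic behind those two figures. The `coverage` helper and the toy population numbers are made up for illustration and do not reflect SCB's actual file format or the exact way I computed the 98%; only the two name counts come from the datasets mentioned above.

```python
def coverage(listed_people: int, total_people: int) -> float:
    """Share of the population whose name appears in the list.

    Hypothetical helper: both arguments are plain head counts.
    """
    if total_people <= 0:
        raise ValueError("total_people must be positive")
    return listed_people / total_people


# Toy numbers (assumption): 98 out of 100 people share their name with someone.
toy_coverage = coverage(98, 100)

# Name counts quoted above: SCB list vs. the newborn dataset.
name_ratio = 97386 / 1518

print(f"coverage: {toy_coverage:.0%}")
print(f"unique first names: about {name_ratio:.0f}x more")
```

So the SCB list has roughly 64 times as many unique first names as the newborn dataset.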
Would you be interested in a PR to use this dataset instead?