Description
As @audrism and I figured out a couple of weeks ago, we should prominently document two important characteristics of the WoC data. Since I don't know where to put this, I am opening this issue to explain what I would like to document. I look forward to your suggestions on where exactly it should go:
(1) Important notice/disclaimer regarding the commits stored in the WoC database:
World of Code stores commits of a project that might never have ended up in the project's source code. For example, commits that belong to a GitHub pull request that was never merged are still part of the repository, although they never reached the project's codebase. The same holds for the individual commits of squash-merged pull requests and for interactively rebased commits: git keeps all versions of these commits, even though only their final version ends up in the codebase. As World of Code uses git clone --mirror, it extracts all such commits, which could, for instance, lead to counting the same blob multiple times or to tracking unmerged blobs that belong to low-quality garbage pull requests from unrelated developers. Users of World of Code should be aware of this when accessing and analyzing the commits of a project.
(At the moment it is not possible to automatically identify and filter the commits of such garbage pull requests, but this will possibly be discussed at the next hackathon.)
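To illustrate the effect outside of WoC, here is a minimal sketch of how the two groups of commits differ in a local mirror clone, where ref names are still available. It assumes that branches and tags live under refs/heads/ and refs/tags/ and that GitHub exposes pull request heads under refs/pull/. The commit graph, the commit names, and the helper functions `reachable` and `split_commits` are hypothetical and only serve as a toy example; this is not how WoC processes its data.

```python
from collections import deque

def reachable(parents, heads):
    """Return all commits reachable from the given heads via parent links."""
    seen = set()
    queue = deque(heads)
    while queue:
        commit = queue.popleft()
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, ()))
    return seen

def split_commits(parents, refs):
    """Split commits into those reachable from branches/tags and PR-only ones."""
    branch_heads = [c for r, c in refs.items()
                    if r.startswith(("refs/heads/", "refs/tags/"))]
    pull_heads = [c for r, c in refs.items() if r.startswith("refs/pull/")]
    in_codebase = reachable(parents, branch_heads)
    pr_only = reachable(parents, pull_heads) - in_codebase
    return in_codebase, pr_only

# Hypothetical toy history: commit -> list of parent commits.
parents = {
    "a": [],
    "b": ["a"],
    "s": ["b"],    # squash commit on main that replaces p1 and p2
    "p1": ["b"],   # pull request 1, squash-merged as "s"
    "p2": ["p1"],
    "q1": ["a"],   # pull request 2, never merged
}
# Refs as a mirror clone would see them (names are assumptions).
refs = {
    "refs/heads/main": "s",
    "refs/pull/1/head": "p2",
    "refs/pull/2/head": "q1",
}

in_codebase, pr_only = split_commits(parents, refs)
print(sorted(in_codebase))  # ['a', 'b', 's']
print(sorted(pr_only))      # ['p1', 'p2', 'q1']
```

In this toy history, the commits p1, p2, and q1 would be extracted by a mirror clone although none of them is part of the history of main.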
(2) Important notice/disclaimer regarding the data of forks available in the WoC database:
As many forks of GitHub projects appear to be just vehicles for pull requests rather than actual forks, World of Code stopped discovering forks and updating their data around 2021 for convenience, as a huge number of commits would otherwise have been stored multiple times across many of these forks. Therefore, the WoC database might be inconsistent with respect to such forks: forks that existed before 2021 are part of World of Code, but no updates to these repositories (i.e., new commits) have been tracked after that point in time. In contrast, forks that were created after WoC stopped discovering forks won't show up in the WoC data at all.
(I am unsure about the exact date, but while investigating a few example cases, @audrism and I figured out that it definitely happened before March 2022, probably during 2021.)
Any suggestions from the entire WoC team on where to put these two pieces of information in the documentation?
It should be placed prominently, so that new users are certain to notice it. But it shouldn't be the very first note in the documentation, so as not to scare off potential users.