Consolidate and formalize model report for feature analysis #401
base: master
Conversation
```r
)
# Sample for readability
if (nrow(unmatched_wide) > 10000) {
```
If everything is included but the data hasn't been finalized (e.g. school data not yet uploaded), the report will just break.
Do we want to include this, since we should be running it after issues like that have been resolved? Also, I don't think we will look at more than 10,000 observations.
We also randomize the sample so that we don't get only lower-numbered PINs.
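The sampling step under discussion could be sketched as follows (a sketch only; `unmatched_wide` comes from the diff above, and the seed is a hypothetical addition for reproducibility):

```r
# Cap the displayed table at 10,000 rows for readability,
# sampling at random so low-numbered PINs aren't over-represented
set.seed(42) # hypothetical seed, added for reproducibility
if (nrow(unmatched_wide) > 10000) {
  unmatched_wide <- unmatched_wide[sample(nrow(unmatched_wide), 10000), ]
}
```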
@jeancochrane
The new file is in the same location, called model_features.qmd, reflecting the new folder/file name.
…analysis' of github.com:ccao-data/model-res-avm into 393-consolidate-and-formalize-model-report-for-feature-analysis
jeancochrane left a comment
This is just about ready to go! A few small suggestions below.
One higher-level suggestion that we should tackle before we merge: I realized while messing around with the other reports that we typically use the renv environment in the root of the repo for managing the R dependencies that our reports rely on. I checked and it seems like that root renv environment already has all the dependencies that these new reports need. As such, I think we can delete the following renv and R artifacts from this branch, and just use the root renv environment instead:
- `reports/model_features/renv/*`
- `reports/model_features/.Rprofile`
- `reports/model_features/model_features.Rproj`
- `reports/model_features/renv.lock`
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
I rendered the doc and nothing seems to cause issues, but I'll make sure it finishes fine as well.
jeancochrane left a comment
Just kicked off one final render to make sure the output still looks good after our most recent changes, but in the meantime I identified a couple of minor issues!
```r
library(arrow)
library(data.table)
library(dplyr)
library(DT)
library(ggplot2)
library(glue)
library(kableExtra)
library(knitr)
library(leaflet)
library(noctua)
library(stringr)
library(tidyr)

source("_utils.R")
```
[Suggestion, optional] Sorry for yet again recommending a change to the structure of _baseline_query_data.R and _utils.R, but I noticed while testing document rendering that the library() calls in this script are interfering with the caching implemented in the Quarto docs that source it. In short, Quarto can't properly cache library() calls, and since the results of _baseline_query_data.R are cached in all the Quarto docs that source it, those docs will fail during render when _baseline_query_data.R is cached, because package functions are not available in the document namespace. (Happy to talk this through in more detail if it's not clear.)
There are two ways we could deal with this issue:
- Stop caching `_baseline_query_data.R` in downstream data consumers, because the script already implements a form of caching using the `if (!exists(<variable>))` conditional branches.
  - This would be a very easy change to make, but it's suboptimal in that `_baseline_query_data.R` doesn't implement true caching -- it will always need to run all the queries the first time it is called in the context of a Quarto doc, even if it hasn't changed.
- Keep caching `_baseline_query_data.R` in downstream data consumers, but move all the `library()` calls from `_baseline_query_data.R` to `_utils.R`, and switch things up so that we `source("_utils.R")` in the context of the Quarto docs, not in the context of `_baseline_query_data.R`.
  - This is a slightly more complicated change, but it would preserve true caching in the docs.
I would recommend approach 2, but ultimately I'm fine with either one! I tested approach 2 and it seems to work well so far.
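For reference, approach 2 could look something like this as two chunks in each Quarto doc (the chunk labels are hypothetical, and the `cache-file` option name mirrors the hashing convention already used in these docs rather than a built-in knitr option):

````
```{r setup}
#| cache: false
# Attach packages on every render; library() calls can't be cached
source("_utils.R")
```

```{r baseline-data}
#| cache: true
#| cache-file: !expr rlang::hash_file("_baseline_query_data.R")
# Safe to cache: the script now contains only data queries,
# with no library() side effects
source("_baseline_query_data.R")
```
````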
Does this mean we also remove `#| cache-file-2: !expr rlang::hash_file("_utils.R")` from each file as well?
@jeancochrane
Yes @Damonamajor!
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
This merges two existing EI issues to create a consolidated QC testing framework to run once each year upon the final upload of data. It uses 4 reports: 3 derived from this and 1 from this.
The feature missingness report creates a series of thresholds to test whether values change by a significant amount. For example, all ACS values are expected not to match, but the percentages should not change by more than 10%.
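A threshold check of that kind might be sketched as follows (the column names, input frames, and join key are hypothetical; only the 10% tolerance comes from the description above):

```r
library(dplyr)

threshold <- 0.10 # 10% tolerance from the description above

# Compare a hypothetical ACS percentage column across two data pulls
acs_change <- old_data %>%
  inner_join(new_data, by = "pin", suffix = c("_old", "_new")) %>%
  summarise(
    mean_abs_change = mean(abs(acs_pct_new - acs_pct_old), na.rm = TRUE)
  )

if (acs_change$mean_abs_change > threshold) {
  warning("ACS percentages changed by more than 10% on average")
}
```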
I also refactored the data_changes script from the previous EI issue so it runs in about half the time.
Because the file takes a long time to run and has limited impact once data has been finalized, it should be run manually rather than integrated into the model pipeline.
In total, the run time appears to be about 2 hours for the actual script and, for some reason, approximately 1 additional hour for rendering.
The output is too large to upload, so it is stored here:
O:\CCAODATA\tmp\model_features.html