
Conversation

@wagnerlmichael wagnerlmichael commented Feb 6, 2025

This PR showcases a potential refactor that reduces lines of code and may improve readability. The potential costs are reduced modularity and possibly higher computational cost (runtime). I've confirmed that this branch flags sales exactly the same way as master by comparing outputs from both.

Previously, four functions were primarily responsible for the standard deviation calculations:

  • pricing_info
  • price_column
  • which_price
  • get_thresh

The standard deviation information had been held in a nested dictionary structure operated on by the get_thresh helper function. In this PR we switch to vectorized operations and remove the nested dictionary strategy, including get_thresh.
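For illustration, the vectorized pattern looks roughly like this. This is a standalone sketch with made-up data; the real column names, grouping keys, and `permut` multipliers come from the sales-val pipeline:

```python
import pandas as pd

# Toy data standing in for the sales DataFrame; values are invented.
df = pd.DataFrame(
    {
        "township": ["A", "A", "A", "B", "B", "B"],
        "sv_price_deviation_township": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)
groups = ("township",)
permut = (2, 2)  # (lower multiplier, upper multiplier)

col = "sv_price_deviation_township"
grouped = df.groupby(list(groups))[col]
# Per-row thresholds: group mean ± multiplier * group std,
# broadcast back to every row via transform()
df[f"{col}_lower"] = grouped.transform("mean") - permut[0] * grouped.transform("std")
df[f"{col}_upper"] = grouped.transform("mean") + permut[1] * grouped.transform("std")
```

Because `transform()` returns a Series aligned to the original index, the group statistics land on every row without the per-row dictionary lookups that get_thresh performed.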

Edit:
After testing on the same subset of data, the proposed change decreases the runtime of the overarching pricing_info function:

| Spec | Runtime |
| --- | --- |
| Main | 315.28s |
| PR Branch | 255.78s |
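A minimal way to reproduce a comparison like this is to wrap each call in a timer. This is a generic sketch; running it against the actual pipeline would mean passing pricing_info and its real arguments, which aren't reproduced here:

```python
import time


def time_call(fn, *args, **kwargs):
    """Time a single call; returns (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Placeholder workload; against the PR this would be something like:
#   _, elapsed = time_call(pricing_info, ...)
result, elapsed = time_call(sum, range(1_000_000))
```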

@wagnerlmichael wagnerlmichael marked this pull request as ready for review February 21, 2025 21:21

@jeancochrane jeancochrane left a comment


I like this refactor! Can you share the query you ran to check that these results are identical to the existing results? Once I've verified that, we'll be good to go.

Comment on lines +229 to +244
# Vectorized per-row lower and upper thresholds (mean ± std * multiplier)
for col in [f"sv_price_deviation_{group_str}", f"sv_cgdr_deviation_{group_str}"]:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[0] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[1] * df.groupby(list(groups))[col].transform("std")

if not condos:
    df["sv_which_price"] = df.apply(which_price, args=(holds, groups), axis=1)

    col = f"sv_price_per_sqft_deviation_{group_str}"
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[0] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[1] * df.groupby(list(groups))[col].transform("std")


[Suggestion, non-blocking] Nice generalization! We could go one step further and fold the if not condos branch into the for loop that precedes it:

thresh_cols = [
    f"sv_price_deviation_{group_str}",
    f"sv_cgdr_deviation_{group_str}",
] + ([] if condos else [f"sv_price_per_sqft_deviation_{group_str}"])

for col in thresh_cols:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[0] * df.groupby(list(groups))[col].transform("std")
    ...
sqft_val = row[f"sv_price_per_sqft_deviation_{group_str}"]
sqft_lower = row[f"sv_price_per_sqft_deviation_{group_str}_lower"]
sqft_upper = row[f"sv_price_per_sqft_deviation_{group_str}_upper"]
sqft_out = sqft_val > sqft_upper or sqft_val < sqft_lower

[Nitpick, non-blocking] Any reason not to use between_two_numbers() here, the way we do for the rest of the threshold checks?
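For reference, the helper-based version would look something like the sketch below. `between_two_numbers` is written here with an assumed signature and strict inequalities; the real helper in the repo may handle the boundaries differently:

```python
# Assumed stand-in for the repo's between_two_numbers() helper;
# the actual signature and boundary handling may differ.
def between_two_numbers(num, a, b):
    return a < num < b


sqft_val, sqft_lower, sqft_upper = 5.0, -1.0, 3.0

# Equivalent (up to boundary behavior) to the manual check:
#   sqft_out = sqft_val > sqft_upper or sqft_val < sqft_lower
sqft_out = not between_two_numbers(sqft_val, sqft_lower, sqft_upper)
```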

@wagnerlmichael
Member Author

Reminder: There were inconsistencies between the flag outputs in main and this branch, so if we proceed with this work we'll need to figure that out.
