Regression Enhancement Guide

Learn how to improve your valuation models with data enhancements

What Are Enhancements?

Enhancements are data transformations that can improve your regression model's accuracy and interpretability. They address common data issues like skewness, outliers, non-linear relationships, and scale differences.

Log Transform Price (Often Improves R²)

What It Does

Applies the natural logarithm (ln) to all sale prices. This compresses high values more than low values, reducing right-skewness and stabilizing variance.
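
As a minimal sketch of the idea above, the following fits a one-feature regression on ln(price) and converts a prediction back to dollars. The data, the `fit_line` helper, and `predict_price` are hypothetical illustrations, not the tool's actual implementation, and the plain `exp()` back-transform ignores any bias correction the tool may apply.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical sales: gross living area (sq ft) and sale price ($).
gla = [1200, 1500, 1800, 2100, 2600, 3400, 4200]
price = [210_000, 240_000, 275_000, 300_000, 420_000, 800_000, 1_200_000]

# Fit on ln(price) instead of price.
log_price = [math.log(p) for p in price]
slope, intercept = fit_line(gla, log_price)

def predict_price(sq_ft):
    """Predict in log space, then back-transform to dollars with exp()."""
    return math.exp(intercept + slope * sq_ft)

print(f"Predicted price for 2,000 sq ft: ${predict_price(2000):,.0f}")
```

Because the model is linear in ln(price), each additional square foot multiplies the predicted price by a constant factor rather than adding a constant dollar amount.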

When to Use

  • Prices are heavily right-skewed (a few very expensive properties stretch the upper tail of the distribution)
  • Price variance increases with price level (expensive properties vary more)
  • You want to model percentage changes rather than dollar changes

Real-World Example

Scenario: You're valuing homes in a neighborhood where most properties sell for $200k-$300k, but a few luxury homes sell for $800k-$1.2M.

Without log transform: The model tries to fit both ranges with a straight line, resulting in poor predictions for typical homes (R² = 0.42).

With log transform: The model captures the percentage relationship between features and price, improving fit across all price ranges (R² = 0.78).

Important Notes

  • Predictions are automatically back-transformed to dollars (you don't see log values)
  • Coefficients represent the approximate % change in price per unit of a feature (e.g., +100 sq ft → about +2.5% price)
  • Don't combine with Standardize Features (creates tiny, hard-to-interpret coefficients)
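
The percentage reading of a log-price coefficient is an approximation that holds well for small changes. A short sketch of the exact conversion follows; the coefficient value 0.00025 per sq ft is an assumption back-derived from the "+100 sq ft → +2.5%" example, and `percent_change` is a hypothetical helper name.

```python
import math

def percent_change(coef_per_unit, delta):
    """Exact % change in price implied by a log-price coefficient."""
    return (math.exp(coef_per_unit * delta) - 1) * 100

coef = 0.00025  # assumed: change in ln(price) per additional sq ft

small = percent_change(coef, 100)    # close to the 2.5% rule of thumb
large = percent_change(coef, 1000)   # noticeably above the naive 25%

print(f"+100 sq ft:  {small:.2f}%")
print(f"+1000 sq ft: {large:.2f}%")
```

For large feature changes the exact figure exceeds the simple "coefficient × change" estimate, which is worth keeping in mind when explaining adjustments.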

Remove Outliers (Usually Improves R²)

What It Does

Identifies and removes extreme data points using the IQR (Interquartile Range) method, where IQR = Q3 - Q1. Points more than 1.5×IQR below the first quartile (Q1) or above the third quartile (Q3) are excluded from the analysis.
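
The IQR rule can be sketched as follows. This is an illustration under assumptions: the prices are made up, `iqr_bounds` and `remove_outliers` are hypothetical names, and quartile conventions differ between tools, so the tool's exact cutoffs may not match what Python's `statistics.quantiles` produces.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

# Hypothetical comparable sales, including one distressed $150k sale.
prices = [150_000, 240_000, 255_000, 262_000, 270_000,
          278_000, 285_000, 296_000, 310_000]

low, high = iqr_bounds(prices)
kept = remove_outliers(prices)

print(f"Fences: ${low:,.0f} to ${high:,.0f}")
print(f"Kept {len(kept)} of {len(prices)} sales")
```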

When to Use

  • You have obvious data errors (e.g., $50k property with 5,000 sq ft)
  • A few extreme properties are distorting the model
  • You want a model representative of typical properties, not extremes

Real-World Example

Scenario: You're analyzing 25 comparable sales. One property sold for $150k (foreclosure, needs major repairs), while others range from $240k-$310k.

Without outlier removal: The $150k sale pulls the regression line down, undervaluing all typical properties (R² = 0.51, predictions $15k-$20k too low).

With outlier removal: The distressed sale is excluded, and the model accurately represents the typical market (R² = 0.82, predictions within $5k).

Important Notes

  • Usually improves R² by removing noise and distortions
  • May reduce sample size significantly (10-20% of data removed)
  • Use cautiously: "outliers" may be legitimate high-value properties, not errors

Polynomial Terms (May Improve R²)

What It Does

Adds squared terms (e.g., GLA²) to the regression model. This allows the model to capture curved relationships instead of forcing everything into straight lines.
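
A rough sketch of what adding a squared term does, using a small hand-written least-squares solver so the example has no dependencies. The data (chosen so price per sq ft falls from $150 to $110 as size grows) and the helpers `solve`, `fit_ols`, and `r_squared` are assumptions for illustration, not the tool's code.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(rows, ys, coefs):
    """Share of price variance explained by the fitted model."""
    mean_y = sum(ys) / len(ys)
    preds = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: GLA in thousands of sq ft, price in $ thousands.
gla = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
price = [150, 215, 270, 318, 360, 402, 440]

linear_rows = [[1.0, g] for g in gla]          # intercept + GLA
quad_rows = [[1.0, g, g * g] for g in gla]     # intercept + GLA + GLA²

linear = fit_ols(linear_rows, price)
quad = fit_ols(quad_rows, price)
r2_linear = r_squared(linear_rows, price, linear)
r2_quad = r_squared(quad_rows, price, quad)

print(f"Linear R²: {r2_linear:.4f}, with GLA²: {r2_quad:.4f}")
print(f"GLA² coefficient: {quad[2]:.2f} (negative means diminishing returns)")
```

Scaling GLA to thousands of square feet keeps the squared values small, which helps the numerical stability of this simple solver.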

When to Use

  • Price doesn't increase linearly with size (diminishing returns)
  • Scatter plots show curved patterns between features and price
  • You have enough data (>20 properties) to support additional terms

Real-World Example

Scenario: You're valuing properties ranging from 1,000 to 4,000 sq ft. Small homes ($150/sq ft) and large homes ($110/sq ft) have different price-per-sq-ft values.

Without polynomial terms: Linear model assumes constant $130/sq ft, undervaluing small homes and overvaluing large ones (R² = 0.65).

With polynomial terms: Model captures the curved relationship, accurately pricing homes across all sizes (R² = 0.84).

Important Notes

  • Can significantly improve R² when relationships are truly non-linear
  • May cause overfitting with small datasets (<20 properties)
  • Makes coefficient interpretation more complex (not just "per sq ft")

Standardize Features (Does NOT Change R²)

What It Does

Converts all features to the same scale (z-scores) by subtracting the mean and dividing by standard deviation. This makes coefficients directly comparable.
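
The z-score conversion, and the fact that it rescales the coefficient without changing R², can be checked with a small one-feature sketch. The data and the helpers `z_scores`, `fit_line`, and `r_squared` are hypothetical; the population standard deviation is used here as an assumption, and the tool may use the sample version instead.

```python
import statistics

def z_scores(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Share of price variance explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical sales: gross living area (sq ft) and sale price ($).
gla = [1100, 1400, 1750, 2000, 2300, 2650, 3000]
price = [198_000, 221_000, 262_000, 270_000, 305_000, 322_000, 361_000]

raw_slope, raw_intercept = fit_line(gla, price)
z_gla = z_scores(gla)
z_slope, z_intercept = fit_line(z_gla, price)

r2_raw = r_squared(gla, price, raw_slope, raw_intercept)
r2_z = r_squared(z_gla, price, z_slope, z_intercept)

print(f"Raw:          ${raw_slope:,.2f} per sq ft, R² = {r2_raw:.4f}")
print(f"Standardized: ${z_slope:,.0f} per std dev, R² = {r2_z:.4f}")
```

The standardized slope equals the raw slope multiplied by the feature's standard deviation, which is why the fit itself is unchanged.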

When to Use

  • You want to compare feature importance (which matters more: age or size?)
  • Features have very different scales (age: 1-50 vs GLA: 1000-3000)
  • You need to interpret coefficients as "standard deviation units"

Real-World Example

Scenario: You want to know which matters more for price: property age or gross living area.

Without standardization: Coefficients are +$85/sq ft and -$1,200/year. Hard to compare because units are different.

With standardization: Coefficients are +$18,500 per standard deviation of GLA and -$8,200 per standard deviation of age. Now you can see that GLA has about 2.3× the impact of age on price.

Important Notes

  • Improves coefficient interpretability and comparability
  • Does NOT improve R² or model fit; it is purely for interpretation
  • Don't combine with Log Transform (creates tiny coefficients)

Combining Enhancements

Good Combinations

  • Log Transform + Remove Outliers: Addresses both skewness and extreme values. Often produces the best R² improvement.
  • Remove Outliers + Polynomial Terms: Clean data allows polynomial terms to capture true non-linearity without being distorted by outliers.
  • Polynomial Terms + Standardize Features: Helps compare importance of linear vs quadratic terms.
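
As an illustration of the first combination, the sketch below removes price outliers with the 1.5×IQR rule and then fits ln(price) against GLA. The order of the two steps, the data, and the helper names are all assumptions; the tool may apply the enhancements differently.

```python
import math
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Share of variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical (GLA sq ft, sale price $) pairs; the $150k sale is distressed.
sales = [(1500, 240_000), (1600, 252_000), (1700, 150_000), (1750, 266_000),
         (1850, 275_000), (1950, 284_000), (2050, 296_000), (2150, 310_000)]

# Step 1: drop sales whose price falls outside the 1.5×IQR fences.
q1, _, q3 = statistics.quantiles([p for _, p in sales], n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
clean = [(g, p) for g, p in sales if low <= p <= high]

# Step 2: fit ln(price) against GLA and report R² in log space.
def log_fit_r2(pairs):
    xs = [g for g, _ in pairs]
    ys = [math.log(p) for _, p in pairs]
    slope, intercept = fit_line(xs, ys)
    return r_squared(xs, ys, slope, intercept)

print(f"R² with all sales:      {log_fit_r2(sales):.3f}")
print(f"R² after outlier removal: {log_fit_r2(clean):.3f}")
```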

Avoid These Combinations

  • Log Transform + Standardize Features: Creates tiny coefficients (e.g., 0.00012) that are hard to interpret. Choose one or the other.
  • All Four Together: Over-transformation can obscure relationships and make results difficult to explain to clients.

Quick Reference Table

| Enhancement          | Primary Benefit           | Typical R² Impact | Best For                     |
|----------------------|---------------------------|-------------------|------------------------------|
| Log Transform        | Reduces skewness          | +5% to +20%       | Right-skewed prices          |
| Remove Outliers      | Removes noise             | +3% to +15%       | Data with errors/extremes    |
| Polynomial Terms     | Captures curves           | +2% to +10%       | Non-linear relationships     |
| Standardize Features | Improves interpretability | 0% (no change)    | Comparing feature importance |