Regression Enhancement Guide

Learn how to improve your valuation models with data enhancements

What Are Enhancements?

Enhancements are data transformations that can improve your regression model's accuracy and interpretability. They address common data issues like skewness, outliers, non-linear relationships, and scale differences.

Key Principle: Not all enhancements are appropriate for every dataset. Use the AI-powered recommendations to identify which enhancements will help your specific analysis.

Log Transform Price (Often Improves R²)

What It Does

Applies the natural logarithm (ln) to all sale prices. This compresses high values more than low values, reducing right-skewness and stabilizing variance.
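As a rough illustration of the compression effect described above, the sketch below log-transforms a small set of hypothetical sale prices and compares skewness before and after. The price list and the `skewness` helper are assumptions made for this example, not part of the tool.

```python
import math
import statistics

def log_transform(prices):
    """Return the natural log of each sale price (prices must be positive)."""
    return [math.log(p) for p in prices]

def skewness(values):
    """Population skewness: the mean cubed z-score. Positive means a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

# Hypothetical neighborhood: mostly $200k-$300k homes plus two luxury sales.
prices = [210_000, 225_000, 240_000, 255_000, 270_000, 290_000, 850_000, 1_200_000]
log_prices = log_transform(prices)

print("skewness of raw prices:", round(skewness(prices), 2))
print("skewness of ln(prices):", round(skewness(log_prices), 2))
```

On this sample the log-transformed prices are still right-skewed, just less so than the raw prices. How much a real dataset improves depends on its distribution.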

When to Use

Prices are heavily right-skewed (a few very expensive properties stretch the upper tail of the distribution)
Price variance increases with price level (expensive properties vary more)
You want to model percentage changes rather than dollar changes

Real-World Example

Scenario: You're valuing homes in a neighborhood where most properties sell for $200k-$300k, but a few luxury homes sell for $800k-$1.2M.

Without log transform: The model tries to fit both ranges with a straight line, resulting in poor predictions for typical homes (R² = 0.42).

With log transform: The model captures the percentage relationship between features and price, improving fit across all price ranges (R² = 0.78).

Important Notes

Predictions are automatically back-transformed to dollars (you don't see log values)

Coefficients represent approximate percentage changes in price rather than dollar amounts (e.g., +100 sq ft → about +2.5% price)

Don't combine with Standardize Features (creates tiny, hard-to-interpret coefficients)

Remove Outliers (Usually Improves R²)

What It Does

Identifies and removes extreme data points using the IQR (Interquartile Range) method, where IQR = Q3 − Q1. Points more than 1.5×IQR below the first quartile (Q1) or more than 1.5×IQR above the third quartile (Q3) are excluded from the analysis.
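The IQR rule can be sketched in a few lines. This is an illustrative version applied to sale prices only. The function name, the sample data, and the choice of the "inclusive" quartile method are assumptions, since the guide does not specify how the tool computes quartiles or which variables it checks.

```python
import statistics

def remove_iqr_outliers(prices, k=1.5):
    """Split prices into (kept, removed) using fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [p for p in prices if lower <= p <= upper]
    removed = [p for p in prices if p < lower or p > upper]
    return kept, removed

# Hypothetical comps: one distressed $150k sale among typical $240k-$310k sales.
sales = [150_000, 240_000, 252_000, 260_000, 268_000,
         275_000, 283_000, 290_000, 298_000, 310_000]

kept, removed = remove_iqr_outliers(sales)
print("kept:", kept)
print("removed:", removed)
```

Different quartile conventions shift the fences slightly, so borderline sales may be flagged under one method and kept under another.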

When to Use

You have obvious data errors (e.g., $50k property with 5,000 sq ft)
A few extreme properties are distorting the model
You want a model representative of typical properties, not extremes

Real-World Example

Scenario: You're analyzing 25 comparable sales. One property sold for $150k (foreclosure, needs major repairs), while others range from $240k-$310k.

Without outlier removal: The $150k sale pulls the regression line down, undervaluing all typical properties (R² = 0.51, predictions $15k-$20k too low).

With outlier removal: The distressed sale is excluded, and the model accurately represents the typical market (R² = 0.82, predictions within $5k).

Important Notes

Usually improves R² by removing noise and distortions

May reduce sample size significantly (10-20% of data removed)

Use cautiously: "outliers" may be legitimate sales (such as genuine high-value properties), not errors

Polynomial Terms (May Improve R²)

What It Does

Adds squared terms (e.g., GLA², the square of gross living area) to the regression model. This allows the model to capture curved relationships instead of forcing everything into straight lines.
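The sketch below shows what adding a squared term means in practice: the same least-squares fit is run twice, once with gross living area (GLA) alone and once with GLA plus GLA². The data are hypothetical, and the small normal-equations solver is only for illustration. The tool's actual fitting routine is not described in this guide.

```python
import statistics

def solve(a, b):
    """Solve the linear system a·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(rows, y, beta):
    """Share of the variance in y explained by the fitted model."""
    mean_y = statistics.fmean(y)
    pred = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical sales with diminishing returns: $150/sq ft at 1,000 sq ft, $110/sq ft at 4,000.
gla = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]    # thousands of sq ft
price = [150, 212, 268, 318, 362, 402, 440]  # thousands of dollars

linear_rows = [[1.0, g] for g in gla]        # intercept + GLA
quad_rows = [[1.0, g, g * g] for g in gla]   # intercept + GLA + GLA²

beta_linear = fit_ols(linear_rows, price)
beta_quad = fit_ols(quad_rows, price)
r2_linear = r_squared(linear_rows, price, beta_linear)
r2_quad = r_squared(quad_rows, price, beta_quad)

print("linear R²:", round(r2_linear, 4), "quadratic R²:", round(r2_quad, 4))
print("GLA² coefficient:", round(beta_quad[2], 2))
```

A negative GLA² coefficient indicates diminishing returns to size. In-sample R² never decreases when a term is added, so a small gain on a small dataset is weak evidence that the curve is real.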

When to Use

Price doesn't increase linearly with size (diminishing returns)
Scatter plots show curved patterns between features and price
You have enough data (>20 properties) to support additional terms

Real-World Example

Scenario: You're valuing properties ranging from 1,000 to 4,000 sq ft. Small homes ($150/sq ft) and large homes ($110/sq ft) have different price-per-sq-ft values.

Without polynomial terms: Linear model assumes constant $130/sq ft, undervaluing small homes and overvaluing large ones (R² = 0.65).

With polynomial terms: Model captures the curved relationship, accurately pricing homes across all sizes (R² = 0.84).

Important Notes

Can significantly improve R² when relationships are truly non-linear

May cause overfitting with small datasets (<20 properties)

Makes coefficient interpretation more complex (not just "per sq ft")

Standardize Features (Does NOT Change R²)

What It Does

Converts all features to the same scale (z-scores) by subtracting the mean and dividing by standard deviation. This makes coefficients directly comparable.

When to Use

You want to compare feature importance (which matters more: age or size?)
Features have very different scales (age: 1-50 vs GLA: 1000-3000)
You need to interpret coefficients as "standard deviation units"

Real-World Example

Scenario: You want to know which matters more for price: property age or gross living area.

Without standardization: Coefficients are +$85/sq ft and -$1,200/year. Hard to compare because units are different.

With standardization: Coefficients are +$18,500 per standard deviation of GLA and -$8,200 per standard deviation of age. Now you can see that GLA has roughly 2.3× the impact of age on price.

Important Notes

Improves coefficient interpretability and comparability

Does NOT improve R² or model fit; it is purely for interpretation

Don't combine with Log Transform (creates tiny, hard-to-interpret coefficients)

Combining Enhancements

Good Combinations

Log Transform + Remove Outliers: Addresses both skewness and extreme values. Often produces the best R² improvement.
Remove Outliers + Polynomial Terms: Clean data allows polynomial terms to capture true non-linearity without being distorted by outliers.
Polynomial Terms + Standardize Features: Helps compare importance of linear vs quadratic terms.

Avoid These Combinations

Log Transform + Standardize Features: Creates tiny coefficients (e.g., 0.00012) that are hard to interpret. Choose one or the other.
All Four Together: Over-transformation can obscure relationships and make results difficult to explain to clients.

Best Practice: Start with AI recommendations, then experiment with one enhancement at a time. Compare R² values to see what actually helps your specific dataset.

Quick Reference Table

Enhancement	Primary Benefit	Typical R² Impact	Best For
Log Transform	Reduces skewness	+5% to +20%	Right-skewed prices
Remove Outliers	Removes noise	+3% to +15%	Data with errors/extremes
Polynomial Terms	Captures curves	+2% to +10%	Non-linear relationships
Standardize Features	Improves interpretability	0% (no change)	Comparing feature importance