Tags: r, dataset, regression, linear-regression, factoring

Linear model regressing on every level of a numeric field


I am currently trying to run a linear model on a large data set, but am running into issues with some specific variables.

    pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage, data = train)
    summary(pv_model)

That is the code for my regression. SalePrice, MSSubClass, GarageArea, and LotFrontage are all numeric fields, while LotConfig is a factor.

Here is the output of my pv_model:

                      Estimate Std. Error t value Pr(>|t|)    
    (Intercept)       98154.64   17235.51   5.695 1.75e-08 ***
    MSSubClass           50.05      58.38   0.857 0.391539    
    LotConfigCulDSac  69949.50   12740.62   5.490 5.42e-08 ***
    LotConfigFR2      19998.34   14592.31   1.370 0.170932    
    LotConfigFR3      21390.99   34126.44   0.627 0.530962    
    LotConfigInside   21666.04    5597.33   3.871 0.000118 ***
    GarageArea          175.67      10.96  16.035  < 2e-16 ***
    LotFrontage101    42571.20   42664.89   0.998 0.318682    
    LotFrontage102    26051.49   35876.54   0.726 0.467968    
    LotFrontage103    36528.81   35967.56   1.016 0.310131    
    LotFrontage104   218129.42   58129.56   3.752 0.000188 ***
    LotFrontage105    61737.12   27618.21   2.235 0.025673 *  
    LotFrontage106    40806.22   58159.42   0.702 0.483120    
    LotFrontage107    36744.69   29494.94   1.246 0.213211    
    LotFrontage108    71537.30   42565.91   1.681 0.093234 .  
    LotFrontage109   -29193.02   42528.98  -0.686 0.492647    
    LotFrontage110    73589.28   27706.92   2.656 0.008068 ** 

As you can see, the first few variables behave as expected: the factor LotConfig gets one coefficient per level, and the numeric fields MSSubClass and GarageArea each get a single slope. LotFrontage does not. For whatever reason, the model estimates a separate coefficient for every distinct value of LotFrontage, as if it were a factor.

For reference, LotFrontage describes the square footage of the subject's front yard. I have properly cleaned the data and replaced NA values. I really am at a loss for why this particular column is acting so unusually.

Any help is greatly appreciated.


Solution

  • If I download the data from the Kaggle link (or a GitHub copy) and read it in directly, LotFrontage comes in as an integer and gets a single coefficient:

    train <- read.csv("train.csv")
    
    # LotFrontage should be numeric straight out of read.csv()
    class(train$LotFrontage)
    [1] "integer"
    
    pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage,
                   data = train)
    summary(pv_model)
    
    Call:
    lm(formula = SalePrice ~ MSSubClass + LotConfig + GarageArea + 
        LotFrontage, data = train)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -380310  -33812   -4418   24345  487970 
    
    Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
    (Intercept)      11915.866   9455.677   1.260  0.20785    
    MSSubClass         105.699     45.345   2.331  0.01992 *  
    LotConfigCulDSac 81789.113  10547.120   7.755 1.89e-14 ***
    LotConfigFR2     17736.355  11787.227   1.505  0.13266    
    LotConfigFR3     17649.409  31418.281   0.562  0.57439    
    LotConfigInside  13073.201   5002.092   2.614  0.00907 ** 
    GarageArea         208.708      8.725  23.920  < 2e-16 ***
    LotFrontage        722.380     88.294   8.182 7.12e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    

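    I suggest reading the CSV in again as above and checking class(train$LotFrontage) before fitting. If it reports "factor" or "character" rather than "integer" or "numeric", lm() will estimate one coefficient per level, which matches your output. That usually means a cleaning step (such as replacing NA with a non-numeric value) changed the column's type.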
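
If re-reading the file is not convenient, here is a minimal sketch of how this symptom can arise and how the column could be repaired in place. It assumes the cause is that LotFrontage became a character (or factor) column during cleaning. The toy data frame, its values, and the "None" placeholder are invented for illustration and are not taken from your data.

    # Toy data standing in for train.csv (values are made up)
    toy <- data.frame(
      SalePrice   = c(200000, 150000, 180000, 220000, 170000, 210000),
      LotFrontage = c(65, 80, NA, 60, 84, 85)
    )

    # Hypothetical cleaning step: filling NA with a string silently
    # coerces the whole column to character
    bad <- toy
    bad$LotFrontage[is.na(bad$LotFrontage)] <- "None"
    class(bad$LotFrontage)   # "character"

    # lm() treats a character column as a factor: one dummy per level
    names(coef(lm(SalePrice ~ LotFrontage, data = bad)))

    # Repair: convert back to numeric (non-numeric strings become NA),
    # then impute with a number instead of a string
    fixed <- bad
    fixed$LotFrontage <- suppressWarnings(as.numeric(as.character(fixed$LotFrontage)))
    fixed$LotFrontage[is.na(fixed$LotFrontage)] <- median(fixed$LotFrontage, na.rm = TRUE)
    class(fixed$LotFrontage) # "numeric"

    # Now LotFrontage gets a single slope
    names(coef(lm(SalePrice ~ LotFrontage, data = fixed)))

The as.character() step matters if the column is a factor: calling as.numeric() directly on a factor returns the internal level codes rather than the original values. Median imputation is only one possible choice here.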