Tags: python, data-mining, data-analysis, missing-data, data-handling

How to deal with missing values in Real-Estate data?


I am working with a property dataset and trying to deal with missing values in the LandSquareFeet column. The dataset has almost 160,000 records, of which about 70,000 are missing LandSquareFeet. I also have a feature that gives the building type. Most of the missing values occur when the building type is CONDO/Walkup: of the 47k condo records, 44k are missing LandSquareFeet, and the same holds for most Elevator/Walkup apartments. The other building categories have very few records with missing LandSquareFeet.

I am unsure how to handle the missing LandSquareFeet values. If I remove the records where it is missing, I lose almost half of my dataset, and I don't know whether it would be wise to drop the feature for all records instead. I ran Little's MCAR test and got a p-value of 0.000, so the data are not MCAR. Is it MAR? Any leads on how to deal with this would be helpful.


Solution

  • First of all, it is a good idea to study the missingness mechanism in your data, because the tools and methods for handling missing values are usually categorized by mechanism: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random).
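
A quick way to start studying the missingness is to tabulate how often the column is missing within each building type. The sketch below uses a small made-up frame; the column names `BuildingType` and `LandSquareFeet` and all values are assumptions for illustration, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "BuildingType": ["CONDO", "CONDO", "CONDO", "WALKUP", "WALKUP",
                     "ONE FAMILY", "ONE FAMILY", "ONE FAMILY"],
    "LandSquareFeet": [np.nan, np.nan, 1200.0, np.nan, 2500.0,
                       1800.0, 2000.0, 2200.0],
})

# Count and share of missing LandSquareFeet per building type.
summary = (
    df.assign(is_missing=df["LandSquareFeet"].isna())
      .groupby("BuildingType")["is_missing"]
      .agg(["sum", "count", "mean"])
      .rename(columns={"sum": "n_missing",
                       "count": "n_rows",
                       "mean": "share_missing"})
      .sort_values("share_missing", ascending=False)
)
print(summary)
```

If the share of missing values differs sharply between building types, as the question describes, the missingness depends on an observed variable, which rules out MCAR and is consistent with MAR (though it cannot rule out MNAR).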

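    MCAR missingness is the easiest case: even dropping the incomplete records does not bias the results, and imputation techniques work well. You can search for the MICE algorithm (multivariate imputation by chained equations) or MissForest, a closely related iterative method that uses random forests as the imputation model. Both are also commonly applied under MAR.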
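
As a rough illustration of chained-equation imputation, the sketch below uses scikit-learn's `IterativeImputer`, which its documentation describes as experimental and inspired by MICE; it is not the reference MICE implementation. The data are synthetic, and the predictor column `GrossSquareFeet` is an assumption, not something mentioned in the question.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200

# Synthetic data: land area is roughly proportional to gross area.
gross = rng.uniform(500, 5000, n)
land = 0.8 * gross + rng.normal(0, 100, n)
X = pd.DataFrame({"GrossSquareFeet": gross, "LandSquareFeet": land})

# Remove about 30% of the land values at random.
mask = rng.random(n) < 0.3
X.loc[mask, "LandSquareFeet"] = np.nan

# Each column with missing values is regressed on the other columns,
# and the predictions fill the gaps; observed values are left unchanged.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

print(X_imputed.loc[mask].head())
```

The quality of the imputed values depends entirely on how well the other columns predict the missing one, so it is worth checking that relationship on the complete records first.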

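    MAR and MNAR are harder to handle. Strictly speaking, only MNAR is called a non-ignorable mechanism; MAR is ignorable once you condition on the observed variables that drive the missingness. There are techniques such as IP weighting (inverse probability weighting), which under MAR reweights the complete records by the inverse of their probability of being observed. Papers have also been published recently that treat missingness as a causal inference problem.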
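
A minimal sketch of IP weighting, assuming the missingness depends only on building type. In that case the probability of being observed can be estimated as the observed share within each type, and each complete record is weighted by the inverse of that share. The data are synthetic, and the column names and missingness rates are assumptions chosen to mimic the pattern in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic full data: the two building types differ in land area.
df = pd.DataFrame({
    "BuildingType": ["CONDO"] * n + ["ONE FAMILY"] * n,
    "LandSquareFeet": np.concatenate([
        rng.normal(1000, 100, n),
        rng.normal(3000, 100, n),
    ]),
})
true_mean = df["LandSquareFeet"].mean()

# MAR given building type: condos are observed far less often.
p_obs = np.where(df["BuildingType"] == "CONDO", 0.10, 0.95)
keep = rng.random(len(df)) < p_obs
df.loc[~keep, "LandSquareFeet"] = np.nan

# Estimated probability of being observed, per building type.
df["observed"] = df["LandSquareFeet"].notna()
propensity = df.groupby("BuildingType")["observed"].transform("mean")
df["ipw"] = 1.0 / propensity

# Compare the naive complete-case mean with the weighted mean.
cc = df[df["observed"]]
cc_mean = cc["LandSquareFeet"].mean()
ipw_mean = np.average(cc["LandSquareFeet"], weights=cc["ipw"])

print(f"true mean:          {true_mean:.1f}")
print(f"complete-case mean: {cc_mean:.1f}")
print(f"IP-weighted mean:   {ipw_mean:.1f}")
```

The weighting only works if every building type has at least some observed records, and rarely observed types receive large weights, which makes the estimate noisier. With more covariates, the per-group share would typically be replaced by a fitted model of the probability of being observed.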

    The bad news is that in some special cases MNAR missingness is theoretically impossible to recover from. The good news is that several really complex cases have been studied and investigated, and hopefully your case is no more complex than those, so you can apply the existing methods.

    I have tried not to solve your problem outright but to give you the essential keywords with which you can find the relevant material. If you are willing to spend a good deal of time on this, you can read a great book on the subject:

    My final thought: my gut feeling is that, of all the possible methods and approaches, IP weighting is the one most likely to solve your problem. Look it up.