In an ideal world, we would like to work with datasets that are clean, complete and accurate. However, real-world data rarely meets our expectation. We often encounter datasets with noise, inconsistencies, outliers and missingness, which requires careful handling to get effective results. Especially, missing data is an unavoidable challenge, and how we address it has a significant impact on the output of our predictive models or analysis.
Why?
The reason is hidden in the definition. Missing data are the unobserved values that would be meaningful for analysis if observed.
In the literature, we can find several methods to address missing data, but according to the nature of the missingness, choosing the right technique is highly critical. Simple methods such as dropping rows with missing values can cause biases or the loss of important insights. Imputing wrong values can also result in distortions that influence the final results. Thus, it is essential to understand the nature of missingness in the data before deciding on the correction action.
The nature of missingness can simply be classified into three: