Missing data: classification

By Pierre Baudin

Data Manager

Anyone who works with data will tell you the same thing: sooner or later, they have had to deal with missing data. It can be a headache, and sometimes a nightmare, when the data to be mined is riddled with missing values.

As we know, data quality is one of the main keys to successfully carrying out a data project (see: the concept of clean data).

However, before embarking on strategies for dealing with missing data, it is important to be able to identify and understand the reasons behind these information gaps.


Understand and identify missing data

Missing data falls into several categories, defined by the reasons and mechanisms that produce the gaps. In the following paragraphs, I detail the three main types.

#1 Missing Completely At Random (MCAR)

Missing data is classified as MCAR (Missing Completely At Random) if the events that lead to the absence of a particular piece of information are independent of both the observed variables and the unobserved ones. In other words, the missingness is produced entirely at random: its causes are unrelated to the data itself.

💡 An example of MCAR is a scale running out of batteries. Some data will be missing just because of bad luck.

In the context of a company collecting information on its website, MCAR data appears when the site stops working for a reason unrelated to the data being collected (breakdown, temporary stoppage of services, maintenance, etc.).

When data is MCAR, an analysis performed on the observed values remains unbiased, because no variable and no range of values is affected more than another. The cost is a loss of information: with fewer observations, estimates are less precise.

🔎 However, data is rarely MCAR.
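The MCAR case can be illustrated with a small simulation. This is only a sketch under stated assumptions: the weights are invented (normally distributed around 70 kg), and the 20% loss rate stands in for the scale's dead battery. It compares the mean of all weights with the mean of the readings that survive.

```python
import random
import statistics

# Hypothetical setup: 10,000 true weights in kg, drawn from an assumed normal distribution.
rng = random.Random(0)
true_weights = [rng.gauss(70, 10) for _ in range(10_000)]

# MCAR: every reading has the same 20% chance of being lost (the "dead battery"),
# regardless of the weight itself or of any other variable.
P_MISSING = 0.2
observed = [None if rng.random() < P_MISSING else w for w in true_weights]

kept = [w for w in observed if w is not None]
full_mean = statistics.mean(true_weights)
observed_mean = statistics.mean(kept)
missing_share = 1 - len(kept) / len(true_weights)

print(f"share missing:            {missing_share:.1%}")
print(f"mean of all weights:      {full_mean:.2f}")
print(f"mean of observed weights: {observed_mean:.2f}")
```

The two means should land very close to each other: the lost readings are a random subset, so what remains is a smaller but still representative sample.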

#2 Missing At Random (MAR)

Modern statistical methods for handling missing data generally start from the Missing At Random (MAR) assumption.

MAR is a more general and more realistic assumption than MCAR. Data is MAR when the missingness is not purely random but can be fully accounted for by variables for which there is complete information.

💡 For example, when placed on a soft surface, our scale may produce more missing values than when placed on a hard surface. This data is therefore not MCAR, because we know that different surfaces give different results. However, if we know the surface type and can assume that the data is MCAR within each surface type, then the data is considered MAR.

In our business and website information collection context, an example of MAR data may be a difference in the behavior of data flows between desktop browsing (via a computer) and mobile browsing (via a smartphone). In this case, the differences in collection between the two types of navigation can be known. By isolating the data from one of the site access modes, we can then treat that data as MCAR.
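The scale example can be sketched as a simulation to show why the fully observed variable matters. Everything here is an assumption made for illustration: the surface is always recorded, heavier objects tend to be weighed on the soft surface, and the loss rates (50% on soft, 10% on hard) are invented. The sketch compares a naive mean over the observed readings with a mean computed per surface and then recombined.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical setup: the surface is always recorded; heavier objects are assumed
# to be weighed on the soft surface, and the soft surface loses more readings.
SETTINGS = {"soft": {"mean": 80, "p_missing": 0.5},
            "hard": {"mean": 60, "p_missing": 0.1}}

rows = []  # (surface, true weight, observed weight or None)
for _ in range(20_000):
    surface = rng.choice(["soft", "hard"])
    cfg = SETTINGS[surface]
    weight = rng.gauss(cfg["mean"], 10)
    # MAR: the chance of a missing value depends only on the observed surface.
    seen = None if rng.random() < cfg["p_missing"] else weight
    rows.append((surface, weight, seen))

full_mean = statistics.mean(w for _, w, _ in rows)
naive_mean = statistics.mean(s for _, _, s in rows if s is not None)

# Within each surface the data is MCAR, so per-surface means are trustworthy.
# Weight them by each surface's share of all rows (known, since surface is complete).
stratified_mean = 0.0
for surface in SETTINGS:
    group = [r for r in rows if r[0] == surface]
    group_seen = [s for _, _, s in group if s is not None]
    stratified_mean += statistics.mean(group_seen) * len(group) / len(rows)

print(f"mean of all weights: {full_mean:.2f}")
print(f"naive observed mean: {naive_mean:.2f}")
print(f"stratified mean:     {stratified_mean:.2f}")
```

Under these assumptions the naive mean is pulled down, because soft-surface (heavier) readings are under-represented, while the stratified mean stays close to the true one. The correction only works because the surface is known for every row.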

Illustration of the scale example

#3 Missing Not At Random (MNAR)

If the characteristics of the missing data do not match those of MCAR or MAR, they fall under the category of Missing Not At Random (MNAR).

MNAR means that the probability of a value being missing depends on information we have not observed, possibly on the missing value itself.

💡 For example, the mechanism of our scale may wear down over time, producing more and more missing data without our noticing. If the heaviest objects are measured later in our study, the resulting distribution of measurements will be distorted. MNAR also includes the possibility that our scale produces more missing values for heavier objects, a phenomenon that could be difficult to identify and manage.

An example of MNAR data for our company and its website may be changes in site behavior and in data collection components as the frameworks and systems in use are updated. In this case, it can be very complex to identify the mechanisms that generate the missing data.

MNAR cases are problematic. The only way to get an unbiased estimate of the parameters in such a case is to model the mechanism behind the missing data. That model can then be incorporated into a more complex model to estimate the missing values.
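To see why MNAR is harder, here is a sketch of the scale that loses more readings for heavier objects. The weights and the loss mechanism (a logistic curve centered on 70 kg, in the hypothetical `p_missing` function) are assumptions chosen for illustration, not a model taken from real data.

```python
import math
import random
import statistics

rng = random.Random(2)

# Hypothetical true weights in kg, drawn from an assumed normal distribution.
true_weights = [rng.gauss(70, 10) for _ in range(10_000)]

def p_missing(weight):
    """Assumed mechanism: the heavier the object, the likelier the reading is lost."""
    return 1 / (1 + math.exp(-(weight - 70) / 5))

# MNAR: missingness depends on the very value that goes missing.
observed = [None if rng.random() < p_missing(w) else w for w in true_weights]

kept = [w for w in observed if w is not None]
full_mean = statistics.mean(true_weights)
observed_mean = statistics.mean(kept)
missing_share = 1 - len(kept) / len(true_weights)

print(f"share missing:            {missing_share:.1%}")
print(f"mean of all weights:      {full_mean:.2f}")
print(f"mean of observed weights: {observed_mean:.2f}")
```

Here the observed mean falls clearly below the true mean, and no recorded variable explains the gap. In a real study the true weights would not be available for comparison, which is why the missingness mechanism itself has to be modeled.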

Understanding the different kinds of missing data is important for managing and using data successfully. If missing values are not handled correctly, they can lead to inaccurate conclusions about the data. In a future article, I will share the strategies available to deal with missing data.