The ethical approach to data visualization has many faces. One of them is dealing with missing data and how we communicate it to the audience. In the real world, we often face situations in which our databases are incomplete. This happens for many reasons. Some gaps come from technical errors during ETL processes; others appear when data is collected manually, especially in surveys, because people often fail to answer every question.
Statistical procedures often eliminate an entire record when only one variable is missing, which can shrink the sample dramatically. However, even though our data is as full of holes as Swiss cheese, we often have to present it and, what is even worse, draw conclusions from it, because collecting 100% of the data is in many cases unrealistic in terms of cost and time.
To stay honest with our audience and to present the observations or phenomena in the most transparent way, we have only two options: show the gaps in the data, or show imputed values in place of the missing ones. Several imputation methods are widely used in statistics and statistical data modelling. The most common ones are:
- Case deletion – omitting cases with incomplete data and excluding them from the analysis.
- Zero-filling – imputing the value 0 for all missing data.
- Linear interpolation – replacing missing data with values estimated along a straight line between the neighbouring observed values.
- Marginal means – replacing each missing value with the mean of the observed values of that variable.
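To make the four methods concrete, here is a minimal sketch in plain Python. It assumes a single series of evenly spaced measurements in which `None` marks a missing value; the function names and the `sales` sample are made up for illustration and do not come from any particular library.

```python
from statistics import mean


def case_deletion(values):
    """Drop the missing entries entirely."""
    return [v for v in values if v is not None]


def zero_fill(values):
    """Replace every missing entry with 0."""
    return [0 if v is None else v for v in values]


def marginal_mean(values):
    """Replace every missing entry with the mean of the observed entries."""
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]


def linear_interpolation(values):
    """Fill each gap along a straight line between its observed neighbours.

    Assumes evenly spaced points. Gaps at the very start or end have only
    one neighbour, so they are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical sample: six periods, three of them missing.
sales = [10, None, 14, None, None, 20]

print(case_deletion(sales))
print(zero_fill(sales))
print(marginal_mean(sales))
print(linear_interpolation(sales))
```

Note how differently the methods treat the same series: case deletion shortens it, zero-filling introduces artificial dips, the marginal mean flattens the gaps to one constant, and linear interpolation follows the trend between neighbours.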
You can find more explanations of the specific methods here.
Whichever method we use, we need to tell the audience which data comes from observations and which is imputed. This should be communicated both verbally and visually, to strengthen the message and leave no room for presumptions.
Dilemma – show gaps or imputed data?
Many strategic decisions are data-driven, and missing data, if not properly addressed, affects the overall understanding, interpretation and reasoning about a phenomenon.
Recently I found interesting research by Hayeong Song and Danielle Albers Szafir that sheds some light on how the way we visually communicate missing data influences the perception of data quality and the confidence in drawing conclusions. The research emphasizes that visualizations which highlight missing data without breaking visual continuity are perceived by respondents as having higher data quality. The general conclusion is that imputation is a better graphical choice than simply removing information, because it does not reduce perceived data quality as much, and perceived quality has consequences for the decision-making process. However, it is very important to highlight imputed data with different shapes or colours. Another interesting graphical option is to present imputed data as error bars, which gives our audience additional information about the likely range of values.
The research results in Figure 5 (b) clearly show that linear interpolation has the greatest positive impact on the perceived quality and accuracy of the data, while the visualization with the data absent (Figure 4 (a)) scores the lowest.
The research was carried out for two commonly known visualization types: a line chart and a bar chart. Both graphical choices gave similar outcomes.
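One way to prepare data for this kind of chart is to keep a flag next to every point, so the plotting step can style observed and imputed values differently. The sketch below is only an illustration in plain Python: the function name, the record fields and the `readings` sample are hypothetical, and the range used for the error bar (the span between the two neighbouring observations) is an assumption made for the example, not the method used in the research.

```python
def interpolate_with_flags(values):
    """Return one record per point with a status and a likely range.

    status is "observed", "imputed" (linearly interpolated) or "missing"
    (a gap at the start or end that has only one observed neighbour).
    Assumption: the range of an imputed point spans its two neighbours.
    """
    records = [
        {
            "x": i,
            "y": v,
            "status": "observed" if v is not None else "missing",
            "low": v,
            "high": v,
        }
        for i, v in enumerate(values)
    ]
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        y0, y1 = values[left], values[right]
        for i in range(left + 1, right):
            y = y0 + (y1 - y0) * (i - left) / (right - left)
            records[i] = {
                "x": i,
                "y": y,
                "status": "imputed",
                "low": min(y0, y1),
                "high": max(y0, y1),
            }
    return records


# Hypothetical sample: the last gap has no right-hand neighbour.
readings = [5, None, 9, None]

for record in interpolate_with_flags(readings):
    print(record)
```

With records like these, a charting tool can draw observed points with one marker, imputed points with another shape or colour, and use `low` and `high` as the ends of an error bar.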
I have several books that are like a shining star guiding me through the darkness. One of them is “The Little Prince” by Antoine de Saint-Exupéry, and this quote from the book stays with me: “You become responsible, forever, for what you have tamed”. I believe that, as data analysts and data storytellers, we should take exactly the same approach to our analyses and their graphical representations.