Monthly Archives: Nov 2021

Mind the Gap! – How visualizing missing data influences people’s trust in data quality and affects decision-making processes.

The ethical approach to data visualization has many faces. One of them is dealing with missing data and how we communicate it to the audience. In the real world, our databases are often incomplete, and for many reasons. Some gaps are caused by technical errors during ETL processes; others appear when data is collected manually, especially in surveys, where people often fail to answer every question.

Statistical procedures often eliminate an entire record when only one variable is missing, which can shrink the statistical sample dramatically. However, even when our data is as full of holes as Swiss cheese, we often have to present it and, what is even worse, draw conclusions from it, because collecting 100% of the data is in many cases impractical and unrealistic in terms of cost and time.

Statistical approach

To stay honest with our audience and to present observations or phenomena in the most transparent way, we have only two options: to show the gaps in the data, or to show imputed data in place of the missing values. There are several imputation methods widely used in statistics and statistical data modelling. The most common ones are:

  • Case deletion – omitting cases with incomplete data and excluding them from the analysis.
  • Zero-filling – imputing the value 0 for all missing data.
  • Linear interpolation – replacing missing data with values estimated along a straight line between the neighbouring observations.
  • Marginal means – using the mean of the variable’s observed values in place of each missing one.
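
The four methods listed above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the function names and the sample series are hypothetical, missing values are represented as None, and the interpolation assumes equally spaced observations.

```python
def case_deletion(values):
    """Drop every missing observation."""
    return [v for v in values if v is not None]


def zero_fill(values):
    """Replace every missing observation with 0."""
    return [0 if v is None else v for v in values]


def marginal_mean(values):
    """Replace every missing observation with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def linear_interpolation(values):
    """Fill gaps along a straight line between the nearest observed neighbours.

    Assumes equally spaced observations. Gaps before the first or after the
    last observed value have no two neighbours and are left as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


# Hypothetical series with three missing observations.
series = [10, None, 14, None, None, 20]

# Flags that let a chart mark imputed points differently from observed ones.
imputed = [v is None for v in series]

deleted = case_deletion(series)
zeroed = zero_fill(series)
averaged = marginal_mean(series)
interpolated = linear_interpolation(series)
```

Keeping the imputed flags next to the filled series makes it possible to tell the audience which points were observed and which were estimated, whichever method is chosen.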

You can find more explanations of the specific methods here.

Whichever method we use, we need to tell the audience which data comes from observations and which is imputed. This should be communicated both verbally and visually, to strengthen the message and leave no room for presumptions.

Dilemma – show gaps or imputed data?

Many strategic decisions are data-driven, and missing data that is not properly addressed affects how a phenomenon is understood, interpreted and reasoned about.

Recently I found interesting research by Hayeong Song and Danielle Albers Szafir that sheds some light on how the way we visually communicate missing data influences both the perception of data quality and the confidence in drawing conclusions. The research emphasizes that visualizations that highlight missing data but do not break visual continuity are perceived by respondents as having higher data quality. The general conclusion is that imputation is a better graphical choice than simply removing information, because it does not reduce perceived data quality as much, and perceived quality has consequences for the decision-making process. However, it is very important to highlight imputed data with different shapes or colours. Another interesting graphical option is to present imputed data with error bars, which give the audience additional information about the likely range of values.

The research results in Figure 5 (b) clearly show that linear interpolation has the greatest positive impact on the perceived quality and accuracy of the data, while the visualization with the data simply absent (Figure 4 (a)) scores the lowest.

The research was carried out for two common visualization types: a line chart and a bar chart. Both gave similar outcomes.

Conclusion

I have several books that are like a shining star guiding me through the darkness. One of them is “The Little Prince” by Antoine de Saint-Exupéry, which contains this quote: “You become responsible, forever, for what you have tamed”. I believe that, as data analysts or data storytellers, we should take exactly the same approach to our analyses and their graphical representations.

My personal attitude towards data – ethics in data storytelling.

On September 26, 1983, in the middle of the Cold War, Soviet lieutenant colonel Stanislav Petrov was on duty at the command centre of the nuclear early-warning system. The system reported that five missiles had been fired from the US toward the USSR. Based on the information provided, Petrov had to decide whether the alarm was true or false, and whether or not to follow orders. After minutes that seemed like an eternity, Petrov judged that it was a false alarm and saved the world from a third world war, almost certainly a nuclear one. The later investigation revealed that the system had malfunctioned.

But what kind of world would we live in now if Petrov had not considered other explanations for the system’s response? With that historical event in mind, can we trust any information without a doubt?

As data analysts or data storytellers, we are like a nuclear early-warning system. We provide people with the information they need to make critical decisions and shape the future. It is a very responsible role.

Why is it so hard not to lie with data?

Does it sound controversial? I believe so. Does it sound realistic? For sure. Why do I think so? Are you confident that you know all aspects of the subject you want to present to others? Have you considered all possible options and looked at them from the perspective of every stakeholder involved? Are you sure that the time period of the data set is long enough and that the data quality is high? There are more questions than answers. So tell me: which version of the truth do your visualizations hold?

I am not accusing anybody of misleading people on purpose. Most of the time, when we prepare data analyses and data visualizations to communicate information, we have pure intentions. The point is that we hold biases and beliefs, and our brain draws on previous experiences and constantly makes unconscious assumptions. All of that influences our thoughts and perception.

Harmful data visualization

Let’s do a mental exercise and think together about how harmful data visualization can be. Currently, I’m reading an exciting book, “How Charts Lie” by Alberto Cairo, one of the most recognizable authors in the information visualization domain. One of the chapters tells the story of the nationalist Dylann Roof, who killed several African Americans after being influenced by charts that presented crime numbers by ethnicity. That shocked me and opened my eyes to the potential consequences of distributing misleading visual representations of data.

That warning is aimed more at data journalists and other people who juggle data publicly, often to win votes or support or to steer the audience’s line of thinking. However, even in the business environment we must be careful not to make the same mistakes, because the results can be catastrophic and have a real impact on people. All of us should also remember this when we share any data on social media or on other web pages.

The potential negative impact of a badly done analysis and poor data visualizations:

  • Hundreds of people can lose their jobs,
  • A profitable business sector can be shut down,
  • The launch of a new product can miss the target,
  • Thousands or billions of people can be put at risk by the release of a new drug.

This vulnerability is real because people who make decisions make history. There is always a human factor in any success or failure.

Do you feel like an influencer?

Some time ago I had a lot of fun preparing and sharing data visualizations. But currently, I’m not so eager to do that. I don’t have enough confidence in the data that is available, and I don’t have enough time to dive into a specific subject, understand it, and carry out analyses and investigations.

In upcoming posts, I’ll focus on ethics from a data visualization point of view. The first topic is data range.

Data range

Insights can differ greatly when the data scope changes. Anyone who holds shares on the stock market knows that, depending on the selected time range, they can observe a positive or a negative trend. We can experience the same cognitive dissonance when presenting data within our organization. Maybe in the last two years we achieved tremendous revenue growth, but looking at revenue from a longer perspective, it can turn out that we have actually moved closer to the results from the financial crisis (pick your favourite one as an example, they come and go periodically).

Figure 1 shows how differently an investor can understand and feel about the same data when looking at it over different ranges. The left chart suggests that results are declining, but when we look at the right one, we can see that in the longer perspective the trend is positive.

Figure 1

Of course, our narration can be built around the latest two years of growth, but we shouldn’t hide information from the bigger picture. The approach in such a case should be to display the bigger picture first, showing a longer period of data, and then zoom in on the last two years to present the factors behind the recent revenue growth.
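
The effect of the selected range on the perceived trend can be sketched in a few lines of Python. This is only an illustration: the revenue figures are made up, and the helper name `trend` is hypothetical. It fits a least-squares line to a series and returns its slope, so a positive value means growth and a negative value means decline.

```python
def trend(values):
    """Least-squares slope of the values against their position (units per step)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical yearly results: long-term growth followed by a recent decline.
results = [100, 110, 125, 140, 160, 185, 210, 200, 190, 180]

long_range_slope = trend(results)        # all ten years
short_range_slope = trend(results[-4:])  # only the last four years
```

With these made-up numbers the slope over the full range is positive while the slope over the last four years is negative, so the same data supports two opposite stories depending only on where the chart starts.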

Another example, notoriously common in reporting voting results, is presenting people’s support for particular parties while treating only the people who voted as the full population. When I listen to the news in the mass media, people often refer to election results without considering the voter turnout. That narration skews reality. Let’s look at the example below. Figure 2 shows the result of the latest presidential election in Poland. What will most people remember from the chart? That Duda won and had more than 50% of public support.

Figure 2

But this is not true! If we take the voter turnout into account, the real public support for Duda was 34.49% of all eligible voters. The voter turnout in this election was 68.18%, which means that 31.82% of eligible Poles didn’t vote. I would love to see charts in the mass media that present the entire election results, including those who didn’t vote. Then we would have the complete picture of people’s political preferences. However, I still see a truncated data scope.
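
The arithmetic behind this fuller picture can be sketched as follows. The function name and the dictionary keys are hypothetical; the only inputs are the two figures quoted above (34.49% support among all eligible voters and 68.18% turnout). The share among ballots cast is derived here from those two figures, so it may differ slightly from the officially reported share of valid votes.

```python
def electorate_breakdown(winner_share_of_eligible, turnout):
    """Split the whole electorate into three groups, all in percent.

    winner_share_of_eligible: the winner's votes as a percentage of all eligible voters
    turnout: ballots cast as a percentage of all eligible voters
    """
    non_voters = 100.0 - turnout
    other_ballots = turnout - winner_share_of_eligible
    # The same winner's votes measured only against ballots cast, which is
    # the figure a turnout-blind chart effectively shows.
    winner_share_of_ballots = 100.0 * winner_share_of_eligible / turnout
    return {
        "winner": round(winner_share_of_eligible, 2),
        "other_ballots": round(other_ballots, 2),
        "non_voters": round(non_voters, 2),
        "winner_share_of_ballots": round(winner_share_of_ballots, 2),
    }


breakdown = electorate_breakdown(34.49, 68.18)
for group, percent in breakdown.items():
    print(f"{group}: {percent}%")
```

The first three values add up to 100% of the electorate, which is the scope a chart of the entire result would need; the last value shows how the same votes climb above 50% once non-voters are dropped from the denominator.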

Figure 3

By manipulating the data range, whether as a timeline or as included and excluded categories, we tell different stories about the data and evoke different understandings and feelings about the subject in our audience. Let’s remember that, so that the most objective view possible does not get lost in translation.