#Preliminaries:
knitr::opts_chunk$set( message=FALSE, warning=FALSE) #echo = FALSE,

rm(list=ls())
library(readxl)
library(tidyverse)
library(viridis)

Introduction

In our rapidly evolving digital age, the famous quip “I only believe in statistics that I doctored myself”, often attributed (probably apocryphally) to Winston Churchill, takes on new significance. As the amount and accessibility of data continue to grow, turning raw data into actionable insights becomes ever more important. This matters equally to business leaders, policymakers, and individuals navigating an information-saturated world. The good news is that we can now make more evidence-based decisions, thanks to falling barriers to data access (e.g. this blog post or ICPSR), advances in software and hardware, and the democratization of data analysis tools supported by the open source movement. However, the challenge remains: How do we separate valuable insights from the noise? The answer lies, in part, in fostering data literacy and transparency.

Data Exploration

Let’s take the first step together: with just a few lines of R code, we can turn data into insights and uncover intriguing patterns in global happiness. We’ll use data from the World Happiness Report 2023, which compiles self-reported happiness data from various countries. We first download and read the data:

whr_file <- "DataForTable2.1WHR2023.xls"
whr_url  <- "https://happiness-report.s3.amazonaws.com/2023/DataForTable2.1WHR2023.xls"

# download the file only if it is not already available locally
if (!file.exists(whr_file)) {
  download.file(whr_url, whr_file, mode = "wb") # binary mode is needed on Windows and harmless elsewhere
}

data_in <- readxl::read_excel(whr_file)

#show data: 
head(data_in)

Next, with the aid of the popular R-tidyverse framework, we’ll manipulate the data to explore regional and developmental trends in average perceived happiness across the globe. Our analysis focuses on data from 2010 onward, and we order the 163 countries by their “happiness index”, averaged over time:

data_in %>%
  filter(year >= 2010) %>%
  group_by(`Country name`) %>%
  mutate(happiness_mean = mean(`Life Ladder`)) %>% # mean happiness by country
  ungroup() %>%
  mutate(year = as.factor(year),
         `Country name` = fct_reorder(`Country name`, happiness_mean)) %>% # order countries by mean happiness
  ggplot(aes(y = `Country name`, x = year, fill = `Life Ladder`)) + # one tile per country and year
  geom_tile(colour = "white") +            # white tile borders
  scale_fill_viridis(option = "inferno") + # colour-blind-friendly fill scale
  theme_minimal(base_size = 7) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Hierarchy and Development of Happiness", fill = "Happiness Score",
       y = "Country", x = "Year")
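The ordering behind the plot can also be inspected as a plain table. The following is a minimal sketch on a made-up toy data set: the values and the countries "A", "B", and "C" are purely illustrative, and only the column names mirror those used above. It collapses the data to one row per country with `summarise()` instead of `mutate()`:

```r
library(dplyr)

# Toy stand-in for data_in (hypothetical values, not real survey results)
toy <- tibble::tibble(
  `Country name` = c("A", "A", "B", "B", "C", "C"),
  year           = c(2009, 2015, 2015, 2016, 2015, 2016),
  `Life Ladder`  = c(3.0, 7.0, 5.0, 6.0, 4.0, 4.5)
)

# One row per country: mean score from 2010 onward and number of years observed
country_means <- toy %>%
  filter(year >= 2010) %>%
  group_by(`Country name`) %>%
  summarise(happiness_mean = mean(`Life Ladder`),
            n_years        = n(),
            .groups        = "drop") %>%
  arrange(desc(happiness_mean))

print(country_means)
```

Replacing `toy` with `data_in` should give the same ranking that determines the row order of the plot, and the `n_years` column shows how many observations each country's average rests on.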

The resulting graphic, inspired by Healy (2018), is a heatmap of the world’s happiness over time, with one tile per country and year. Nordic countries consistently rank high, while many African and Oriental countries sit toward the bottom. We also see more blank tiles from 2020 onward, that is, country-years without a reported score. Is this phenomenon linked to the COVID-19 pandemic? Examining differences, changes, extreme values, and missing values can be a rich source of insights when exploring a data set.
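One way to follow up on the question of missing values is to count how many countries report a score in each year. The sketch below again uses a made-up toy data set (hypothetical countries and values, same column names as above). The helper `coverage_by_year()` is an illustrative name, not part of any package:

```r
library(dplyr)

# Toy stand-in for data_in (hypothetical values, not real survey results)
toy <- tibble::tibble(
  `Country name` = c("A", "A", "A", "B", "B", "C"),
  year           = c(2019, 2020, 2021, 2019, 2021, 2019),
  `Life Ladder`  = c(7.1, 7.0, 7.2, 5.0, 5.1, 4.2)
)

# For each year: countries with a reported score, and countries without one
coverage_by_year <- function(df) {
  n_total <- n_distinct(df$`Country name`)
  df %>%
    filter(!is.na(`Life Ladder`)) %>%
    group_by(year) %>%
    summarise(n_countries = n_distinct(`Country name`), .groups = "drop") %>%
    mutate(n_missing = n_total - n_countries)
}

coverage <- coverage_by_year(toy)
print(coverage)
```

Applied to `data_in`, a rise in `n_missing` after 2019 would be consistent with the blank tiles in the plot, although the count alone does not explain why the data are missing.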

Data Science in Action

The data journey doesn’t stop at mere observation: knowledge of statistical communication, programming, and modeling provides you with a range of skills, from data wrangling and visualization to predicting likely outcomes for new instances (e.g., Kuhn and Johnson (2013)), assessing causal relationships (e.g., Pearl and Mackenzie (2018)), and creating new data products.
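As a very small illustration of “predicting likely outcomes for new instances”, the sketch below fits a linear model on made-up numbers and predicts the score for an unseen value. The column names `log_gdp` and `life_ladder` and all values are hypothetical, chosen so that the relationship is exactly linear. Real data would be noisier and would call for proper model validation:

```r
# Toy data (hypothetical): the score rises by 0.5 per unit of log GDP
toy <- data.frame(
  log_gdp     = c(7, 8, 9, 11),
  life_ladder = c(4.5, 5.0, 5.5, 6.5)
)

# Fit a simple linear regression of the score on log GDP
fit <- lm(life_ladder ~ log_gdp, data = toy)

# Predict the score for a new, unseen instance
new_country <- data.frame(log_gdp = 10)
pred <- predict(fit, newdata = new_country)
print(pred)
```

Note that such a model describes an association only. Whether higher income causes higher reported happiness is a separate, causal question.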

In the spirit of this discussion, this website gives some examples of reproducible data analytics using various real-world data sets, which may be of interest whether you are new to data science or a seasoned professional. Feel free to get in touch.

References

Franconeri, Steven L, Lace M Padilla, Priti Shah, Jeffrey M Zacks, and Jessica Hullman. 2021. “The Science of Visual Data Communication: What Works.” Psychological Science in the Public Interest 22 (3): 110–61.
Garnier, Simon, Noam Ross, Robert Rudis, et al. 2021. viridis - Colorblind-Friendly Color Maps for R. https://doi.org/10.5281/zenodo.4679424.
Healy, Kieran. 2018. “Visualizing the Baby Boom.” Socius 4: 2378023118777324.
Kahneman, Daniel. 2011. Thinking, Fast and Slow. Macmillan.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
Mousavi, Shabnam, and Gerd Gigerenzer. 2014. “Risk, Uncertainty, and Heuristics.” Journal of Business Research 67 (8): 1671–78.
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Reinhart, Alex. 2015. Statistics Done Wrong: The Woefully Complete Guide. No Starch Press.
Spiegelhalter, David. 2019. The Art of Statistics: Learning from Data. Penguin UK.
Watzlawick, Paul. 2018. Wie Wirklich Ist Die Wirklichkeit?: Wahn, Täuschung, Verstehen. Piper ebooks.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Jennifer Bryan. 2019. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.