I’ve been trying to get back into the habit of coding with R every day, with the goal of compiling a better project portfolio (and reviving my GitHub). This means looking for projects to work on and figuring out where to get the data.

One hurdle: tidying and transforming data has always been a bit of a pain for me. There’s no escaping the fact that these stages take up most of the time in any data analysis project, but I prefer to do my wrangling with Python, which isn’t today’s language of interest. 😅 You could say that the prerequisite for this R exercise was to find a convenient data set that required minimal cleaning and, with luck, removed the need for scraping entirely.

Enter Information is Beautiful, one of my favourite data visualisation sites and a great source of tidy datasets to play with. I was poking around the site this weekend and happened upon the 2018 World Data Visualisation challenge, which includes global governance data from the World Government Summit. The available variables are rich enough to feed quite a number of interesting questions, but those would entail projects of a larger scope.

Today, since this is more of a practice exercise for skills that need dusting off, let’s go with a smaller goal: Plotting out World Happiness Index data on a map and seeing what that tells us.

1. Load R packages

library(tidyverse)
library(ggplot2)
library(stringr)
library(maps)

The tidyverse comes standard for any kind of data manipulation, and it already attaches ggplot2 (for visualisation) and stringr (for string handling), so loading those two separately is harmless but redundant. However, ggplot2 itself doesn’t come with data that will allow us to plot a world map.

That’s where the maps package comes in. This package compiles map data from sources such as the Natural Earth project. It allows us to create data frames of latitudes and longitudes, which ggplot2 can then use for mapping.

2. Import and clean the World Happiness Index data

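## import the challenge dataset
hapindex <- read.csv("./wdvpdata.csv", header=TRUE)

## keep and rename the three columns we need, then drop the four non-data rows up top
hapindex <- hapindex %>% 
  select(region = indicator,
         hapscore = world.happiness.report.score,
         countrycode = ISO.Country.code) %>% 
  slice(-(1:4))

## inspect the column names and data types
glimpse(hapindex)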

Now that we have the libraries loaded up, it’s time to dig into our data. After importing the given data set into R, let’s go step by step:

  • We’ll only need the columns with the country name, ISO code, and the country’s corresponding World Happiness Report score. Use select to extract these variables.
  • When we import the dataset, we get a few extra rows of explanatory text / source attribution up top. We won’t need these, so use slice to remove them.
  • Double-check the data types of each variable by using glimpse.

This is what we get:

Rows: 195
Columns: 3
$ region      <chr> "Afghanistan",~
$ hapscore    <chr> "2.66", "4.64"~
$ countrycode <chr> "AFG", "ALB", ~

Looks like all our variables were imported as the character data type. This isn’t quite what we want for the scores, since R can’t treat text as numbers (for arithmetic, or for a continuous colour scale later on). We’re leaving this be for now, though; first, we need to check our country names.

One thing to remember here is that the maps package already gives us a set of longitudes and latitudes for world regions. Let’s extract this world map data and store it in a variable called world:

## Use ggplot2's map_data() function to create a data frame from the maps package's world data

world <- map_data("world")
glimpse(world)

Here’s what the data looks like:

Rows: 99,338
Columns: 6
$ long      <dbl> -69.89912, -69.8~
$ lat       <dbl> 12.45200, 12.423~
$ group     <dbl> 1, 1, 1, 1, 1, 1~
$ order     <int> 1, 2, 3, 4, 5, 6~
$ region    <chr> "Aruba", "Aruba"~
$ subregion <chr> NA, NA, NA, NA, ~

Our next question is: how do we link this map dataset to the other dataset we have, which carries each country’s happiness index score?

Note that both datasets have a common variable called region. This is important because region can serve as the common key that will allow us to merge both datasets while retaining the links between countries and their corresponding information.

First, though, we need to check: are the entries in our map dataset’s region column completely aligned with the region entries in our happiness index dataset? If not, we’ll have to reconcile the region variables first so that they have the exact same entries. Otherwise, information will get dropped when we merge the full sets.

## check if world$region and hapindex$region need reconciliation
## (this lists the map regions that have no exact match in hapindex$region)
difference <- setdiff(world$region, hapindex$region)

## reconcile differences
hapindex <- hapindex %>% 
  mutate(region = recode(str_trim(region),
                         "United States" = "USA",
                         "United Kingdom" = "UK",
                         "Korea (Rep.)" = "South Korea",
                         "Congo (Dem. Rep.)" = "Democratic Republic of the Congo",
                         "Congo (Rep.)" = "Republic of Congo",
                         "Korea (Dem. People’s Rep.)" = "North Korea"))

hapindex$hapscore <- as.numeric(hapindex$hapscore)

It turns out that our happiness index dataset uses slightly different names for some countries. Inside mutate, we use stringr’s str_trim function to strip any stray whitespace from the names, then dplyr’s recode function to rename these entries and bring them in line with the names used in our map dataset.
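To see the reconciliation step in isolation, here is a minimal sketch on toy data. The names in map_regions and hap are made-up stand-ins, not the real columns. It also runs setdiff in the other direction (happiness names missing from the map), which is the list of rows a join on region would lose.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for world$region and hapindex$region (not the real data).
map_regions <- c("USA", "UK", "South Korea", "France")
hap <- data.frame(
  region = c("United States ", "United Kingdom", "Korea (Rep.)", "France"),
  stringsAsFactors = FALSE
)

# Names in the happiness data with no exact match in the map data.
unmatched_before <- setdiff(hap$region, map_regions)

# Trim whitespace first, then rename; names not listed are left unchanged.
hap_fixed <- hap %>%
  mutate(region = recode(str_trim(region),
                         "United States" = "USA",
                         "United Kingdom" = "UK",
                         "Korea (Rep.)" = "South Korea"))

unmatched_after <- setdiff(hap_fixed$region, map_regions)

print(unmatched_before)
print(unmatched_after)
```

Re-running the check after recoding is a cheap way to confirm that nothing was missed before moving on to the merge.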

Finally, remember how our happiness index score was stored as a character data type? Time for us to convert it to numerics. This will be useful later on, when we want ggplot2 to recognise these scores as values and adjust the look of our plot accordingly.
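One thing worth checking after a conversion like this: as.numeric quietly turns anything it can’t parse into NA (with a warning). Here is a small sketch on a made-up vector; whether the real file contains placeholder entries like these is an assumption on my part.

```r
# Toy scores standing in for the hapscore column (values are made up).
scores_chr <- c("2.66", "4.64", "-", "n/a")

# as.numeric() returns NA for strings it cannot parse; the warning is
# suppressed here only to keep the example output tidy.
scores_num <- suppressWarnings(as.numeric(scores_chr))

# Entries that were present as text but failed to convert.
failed <- scores_chr[is.na(scores_num) & !is.na(scores_chr)]

print(scores_num)
print(failed)
```

If failed comes back non-empty on real data, those are the entries to inspect before plotting.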

3. Build the world map

We’re talking so much about our plot, but we don’t actually have a world map yet. Time to bring back the world variable that we created earlier.

We’ll build the actual shapes of the map using ggplot2’s geom_polygon function. One useful point to remember here: Whenever you’re binding an aspect of the plot to a variable in your dataset (instead of a fixed value), use aes().

worldmap <- ggplot() +
  geom_polygon(data=world, aes(x=long,
                               y=lat,
                               group=group)) +
  coord_fixed(1.3)

worldmap

This gives us a basic world map:

It doesn’t tell us anything yet, but still pretty cool. 🙂
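To make the aes() point from this section concrete, here is a small sketch with two made-up squares instead of countries. The data frame squares and its value column are hypothetical. The only difference between the two plots is whether fill sits outside or inside aes().

```r
library(ggplot2)

# Two made-up squares, each with its own value (not real data).
squares <- data.frame(
  x = c(0, 1, 1, 0, 2, 3, 3, 2),
  y = c(0, 0, 1, 1, 0, 0, 1, 1),
  id = rep(c(1, 2), each = 4),
  value = rep(c(3, 7), each = 4)
)

# Fixed value: fill is set outside aes(), so every polygon gets the same colour.
fixed_fill <- ggplot(squares, aes(x = x, y = y, group = id)) +
  geom_polygon(fill = "steelblue")

# Mapped value: fill is set inside aes(), so the colour is derived from `value`.
mapped_fill <- ggplot(squares, aes(x = x, y = y, group = id)) +
  geom_polygon(aes(fill = value))

# layer_data() shows the aesthetics ggplot2 actually computed for each row.
print(unique(layer_data(fixed_fill)$fill))
print(unique(layer_data(mapped_fill)$fill))
```

The mapped version is the pattern we’ll need later, when each country’s colour should depend on its score.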

4. Merge map and Happiness Index datasets

One last step before we get to the fun part. It’s time for us to link each country’s map data with its happiness index score. We need to do this so that ggplot2 can map the correct score for each country on our world map.

# Combine map data with happiness index data

worldjoin <- inner_join(world, hapindex, by = "region")
worldjoin$hapscore <- as.numeric(worldjoin$hapscore)
glimpse(worldjoin)

We used inner_join to retain only the matching rows present in both datasets. Think of it as keeping the intersection of our map dataset and our happiness index dataset, based on the values in our selected column, region.

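At this point, I noticed that the happiness index score had somehow reverted to a character data type. inner_join itself doesn’t change column types, so this probably came from running the earlier steps out of order. Either way, another round of converting to numerics is harmless.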
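Here is a toy sketch of what the join keeps and drops. The country names and scores below are invented for illustration. It contrasts inner_join with left_join, which I’m swapping in only for comparison: left_join keeps every map row and fills the missing scores with NA.

```r
library(dplyr)

# Toy map rows and toy scores (invented names and values, not the real data).
toy_map <- data.frame(
  region = c("Country A", "Country A", "Country B"),
  long = c(2.0, 3.0, -42.0),
  lat = c(46.0, 47.0, 72.0),
  stringsAsFactors = FALSE
)
toy_scores <- data.frame(
  region = c("Country A", "Country C"),
  hapscore = c(6.4, 5.0),
  stringsAsFactors = FALSE
)

# inner_join: only regions present in both tables survive.
inner <- inner_join(toy_map, toy_scores, by = "region")

# left_join: all map rows survive; unmatched regions get NA for hapscore.
left <- left_join(toy_map, toy_scores, by = "region")

print(inner)
print(left)
print(class(inner$hapscore))
```

With inner_join, a country whose name doesn’t match simply disappears from the map, which is why the name reconciliation earlier matters so much.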

5. Plot the Happiness Index map

Now it’s time to visualise the different happiness index scores all over the world!

Why do we even need to visualise at all, though? Visual representation brings out the contrasts among various countries’ scores and gives us a friendlier entry point for exploring the implications of the data that we have. It sounds like a tall order, but ggplot2 simplifies the process a ton:

## compile all map theme configurations
cleanup <- theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank(),
  panel.background = element_rect(fill = "white"),
  plot.title = element_text(hjust = 0.5)
)

## plot our merged data

worldHappiness <- worldjoin %>% 
  ggplot(mapping = aes(
    x=long,
    y=lat,
    group=group)) +
  scale_fill_distiller(palette = "RdYlBu", direction = -1) +
  coord_fixed(1.3) +
  geom_polygon(aes(fill=hapscore)) +
  ggtitle("World Happiness Report Scores 2017") +
  cleanup

I’m not really looking to have visible x and y axes on our map, nor gridlines, backgrounds, etc. To make life easier, I’ve compiled all of these “non-elements” into a variable called cleanup.

Then, we get to the map itself. To associate happiness index scores with the colours used for each region, we link geom_polygon’s fill aspect to our happiness index score column, hapscore.

Printing worldHappiness gives us the following map:
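Before reading the colours, it helps to confirm which end of the palette means “high”. Below is a rough sketch using two made-up squares (the squares data frame is hypothetical) and the same scale_fill_distiller settings as the map. With direction = -1, the RdYlBu palette is reversed, so it runs from blue for low values to red for high values.

```r
library(ggplot2)

# Two made-up squares: the left one has a low value, the right one a high value.
squares <- data.frame(
  x = c(0, 1, 1, 0, 2, 3, 3, 2),
  y = c(0, 0, 1, 1, 0, 0, 1, 1),
  id = rep(c(1, 2), each = 4),
  value = rep(c(3, 7), each = 4)
)

p <- ggplot(squares, aes(x = x, y = y, group = id)) +
  geom_polygon(aes(fill = value)) +
  scale_fill_distiller(palette = "RdYlBu", direction = -1)

# Pull out the hex colours ggplot2 computed for each square.
ld <- layer_data(p)
low_fill <- unique(ld$fill[ld$x <= 1])
high_fill <- unique(ld$fill[ld$x >= 2])

# Split each colour into its red, green, and blue channels.
low_rgb <- col2rgb(low_fill)
high_rgb <- col2rgb(high_fill)

print(low_fill)
print(high_fill)
```

The low square comes out blue and the high square comes out red, which matches how the warm colours are read in the next section.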

6. What does this tell us?

The highest happiness index scores are bright orange, and unsurprisingly, this colour tracks with developed countries such as the US, Australia, and members of the EU.

One thing to note about the Happiness Index score is that it comes from self-reports: inhabitants of a country are asked to rate their quality of life on a scale from 0 to 10, where higher is better. But the score doesn’t say much about why someone might rate a country a certain way. In future plots, then, it would be interesting to explore that question by juxtaposing the Happiness Index score with other metrics such as the Gini index (which measures income inequality). This would help us probe for any notable correlations between inhabitants’ happiness and factors such as life expectancy or income distribution, which are measured by the other indices.