I’ve been trying to get back into the habit of coding with R every day, with the goal of compiling a better project portfolio (and reviving my GitHub). This means looking for projects to work on and figuring out where to get the data.
One hurdle: tidying and transforming data has always been a bit of a pain for me. There’s no escaping the fact that these stages take up most of the time in any data analysis project, but I prefer to do my wrangling with Python, which isn’t today’s language of interest. 😅 You could say that the prerequisite for this R exercise was to find a convenient data set that required minimal cleaning and, if I was lucky, removed the need for scraping entirely.
Enter Information is Beautiful, one of my favourite data visualisation sites and a great source of tidy datasets to play with. I was poking around the site this weekend and happened upon the 2018 World Data Visualisation challenge, which includes global governance data from the World Government Summit. The available variables are rich enough to feed quite a number of interesting questions, but those would entail projects of a larger scope.
Today, since this is more of a practice exercise for skills that need dusting off, let’s go with a smaller goal: Plotting out World Happiness Index data on a map and seeing what that tells us.
1. Load R packages
library(tidyverse)
library(ggplot2)
library(stringr)
library(maps)
The tidyverse collection of packages comes standard for any kind of data manipulation, and ggplot2 (which the tidyverse already loads, along with stringr) is its natural complement when visualisation is involved. However, ggplot2 itself doesn’t come with data that will allow us to plot a world map.
That’s where the maps package comes in. This package compiles map data from sources such as the Natural Earth project. It allows us to create data frames of latitudes and longitudes, which ggplot2 can then use for mapping.
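Before loading anything, it can help to confirm that the packages are actually installed. The snippet below is a minimal sketch of one way to do that; the variable names needed and missing_pkgs are made up for illustration and are not part of the original walkthrough.

```r
# Packages this walkthrough relies on
needed <- c("tidyverse", "maps")

# requireNamespace() returns FALSE (without raising an error) when a package is absent
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
}
```

If anything is listed, install.packages() with those names should sort it out.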
2. Import and clean the World Happiness Index data
hapindex <- read.csv("./wdvpdata.csv", header=TRUE)
hapindex <- hapindex %>%
  select(region = indicator,
         "hapscore" = `world.happiness.report.score`,
         "countrycode" = `ISO.Country.code`) %>%
  slice(-(1:4))
glimpse(hapindex)
Now that we have the libraries loaded up, it’s time to dig into our data. After importing the given data set into R, let’s go step by step:
- We’ll only need the columns with the country name, ISO code, and the country’s corresponding World Happiness Report score. Use select to extract these variables.
- When we import the dataset, we get a few extra rows of explanatory text / source attribution up top. We won’t need these, so use slice to remove them.
- Double-check the data types of each variable by using glimpse.
This is what we get:
Rows: 195
Columns: 3
$ region <chr> "Afghanistan",~
$ hapscore <chr> "2.66", "4.64"~
$ countrycode <chr> "AFG", "ALB", ~
Looks like all our variables were imported as the character data type. This isn’t quite what we want for the scores, since it keeps R from treating each score as a numeric value. We’re leaving this be for now, though; first, we need to check our country names.
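One likely reason for the character columns (an assumption about this particular file, not something I have verified) is the block of explanatory rows at the top: read.csv picks a single type per column, and a text row in a column of numbers forces the whole column to character. The sketch below reproduces the effect with a tiny made-up CSV in base R; csv_text and toy are illustrative names only.

```r
# Toy CSV with one explanatory row sitting above the real data
csv_text <- "indicator,score
Source: example note,text
Afghanistan,2.66
Albania,4.64"

toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
class(toy$score)   # "character": the note row blocks numeric parsing

# Drop the note row (the base R equivalent of slice(-1)), then convert
toy <- toy[-1, ]
toy$score <- as.numeric(toy$score)
class(toy$score)   # "numeric"
mean(toy$score)    # arithmetic now works on the scores
```

Removing the extra rows does not change the column type on its own, which is why an explicit as.numeric step is still needed afterwards.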
One thing to remember here is that the maps package already gives us a set of longitudes and latitudes for world regions. Let’s extract this world map data and store it in a variable called world:
## Use ggplot2's map_data() function to create a data frame from the maps package's world data
world <- map_data("world")
glimpse(world)
Here’s what the data looks like:
Rows: 99,338
Columns: 6
$ long <dbl> -69.89912, -69.8~
$ lat <dbl> 12.45200, 12.423~
$ group <dbl> 1, 1, 1, 1, 1, 1~
$ order <int> 1, 2, 3, 4, 5, 6~
$ region <chr> "Aruba", "Aruba"~
$ subregion <chr> NA, NA, NA, NA, ~
Our next question is: how do we link this map dataset to the other dataset we have, which carries each country’s happiness index score?
Note that both datasets have a common variable called region. This is important because region can serve as the common key that will allow us to merge both datasets while retaining the links between countries and their corresponding information.
First, though, we need to check: are the entries in our map dataset’s region column completely aligned with the region entries in our happiness index dataset? If not, we’ll have to reconcile the region variables first so that they have the exact same entries. Otherwise, information will get dropped when we merge the full sets.
## check if world$region and hapindex$region need reconciliation
## (this lists map regions that have no exact match in the happiness data)
difference <- setdiff(world$region, hapindex$region)
## reconcile differences: trim stray whitespace, then rename to the map's spellings
hapindex <- hapindex %>%
  mutate(region = recode(str_trim(region),
                         "United States" = "USA",
                         "United Kingdom" = "UK",
                         "Korea (Rep.)" = "South Korea",
                         "Congo (Dem. Rep.)" = "Democratic Republic of the Congo",
                         "Congo (Rep.)" = "Republic of Congo",
                         "Korea (Dem. People’s Rep.)" = "North Korea"))
## convert the scores from character to numeric
hapindex$hapscore <- as.numeric(hapindex$hapscore)
It turns out that our happiness index dataset uses slightly different names for some countries. Inside mutate, we first strip stray whitespace with stringr’s str_trim function, then use dplyr’s recode function to rename these entries and bring them in line with the names used in our map dataset.
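The setdiff call above only looks in one direction: it lists map regions that are missing from the happiness data. To decide which names to recode, the reverse direction is often the more useful one. Here is a minimal sketch on made-up vectors; map_names, data_names, and fixed are illustrative stand-ins for world$region and hapindex$region, not objects from the walkthrough.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for world$region and hapindex$region
map_names  <- c("USA", "UK", "France", "Aruba")
data_names <- c("United States ", "United Kingdom", "France")

# Names in the happiness data that the map does not recognise: candidates for recoding
setdiff(str_trim(data_names), map_names)

# Names on the map with no happiness data: these regions will simply have no score
setdiff(map_names, str_trim(data_names))

# recode() takes old = new pairs and leaves unmatched values untouched
fixed <- recode(str_trim(data_names),
                "United States"  = "USA",
                "United Kingdom" = "UK")

setdiff(fixed, map_names)   # empty once every name has a match on the map
```

Running both directions before and after recoding is a quick way to confirm that nothing is left unmatched by accident.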
Finally, remember how our happiness index score was stored as a character data type? Time for us to convert it to numeric. This will be useful later on, when we want ggplot2 to recognise these scores as values and adjust the look of our plot accordingly.
3. Build the world map
We’re talking so much about our plot, but we don’t actually have a world map yet. Time to bring back the world variable that we created earlier.
We’ll build the actual shapes of the map using ggplot2’s geom_polygon function. One useful point to remember here: whenever you’re binding an aspect of the plot to a variable in your dataset (instead of a fixed value), use aes().
worldmap <- ggplot() +
  geom_polygon(data=world, aes(x=long,
                               y=lat,
                               group=group)) +
  coord_fixed(1.3)
worldmap
This gives us a basic world map:

It doesn’t tell us anything yet, but still pretty cool. 🙂
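To make the aes() point above concrete, here is a small sketch that contrasts a fixed fill with a fill mapped to a data column. It uses a made-up square in the same long/lat/group layout that map_data() returns; square, value, fixed_fill, and mapped_fill are hypothetical names for illustration.

```r
library(ggplot2)

# A made-up square, laid out like the map data (one row per polygon corner)
square <- data.frame(
  long  = c(0, 1, 1, 0),
  lat   = c(0, 0, 1, 1),
  group = 1,
  value = 5
)

# Fixed values go outside aes(): every polygon gets the same grey fill
fixed_fill <- ggplot(square) +
  geom_polygon(aes(x = long, y = lat, group = group),
               fill = "grey70", colour = "white")

# Data columns go inside aes(): the fill colour now depends on `value`
mapped_fill <- ggplot(square) +
  geom_polygon(aes(x = long, y = lat, group = group, fill = value))
```

Printing either object draws the plot; only the second one gets a colour legend, because only there is fill tied to the data.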
4. Merge map and Happiness Index datasets
One last step before we get to the fun part. It’s time for us to link each country’s map data with its happiness index score. We need to do this so that ggplot2 can map the correct score for each country on our world map.
# Combine map data with happiness index data
worldjoin <- inner_join(world, hapindex, by = "region")
worldjoin$hapscore <- as.numeric(worldjoin$hapscore)
glimpse(worldjoin)
We used inner_join to retain only the matching rows present in both datasets. Think of it as keeping the intersection of our map dataset and our happiness index dataset, based on the values in our selected column, region.
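Because an inner join silently drops anything without a match, it can be worth checking what falls out on each side. The sketch below uses made-up rows (toy_map and toy_scores are hypothetical, and the scores are invented) to compare inner_join with left_join and to list the unmatched rows with anti_join.

```r
library(dplyr)

# Made-up rows in the shape of the map data and the happiness data
toy_map <- data.frame(
  region = c("Aruba", "Aruba", "France", "France"),
  order  = 1:4
)
toy_scores <- data.frame(
  region   = c("France", "Atlantis"),
  hapscore = c(5, 7)   # invented values, for illustration only
)

inner <- inner_join(toy_map, toy_scores, by = "region")  # keeps the France rows only
left  <- left_join(toy_map, toy_scores, by = "region")   # keeps all map rows; Aruba gets NA

# Rows that the inner join drops from each side
anti_join(toy_map, toy_scores, by = "region")   # map regions with no score
anti_join(toy_scores, toy_map, by = "region")   # scores with no map region
```

With a left join, regions without a score stay on the map and are drawn in the fill scale's na.value colour instead of disappearing. Whether that is preferable depends on how much the gaps matter for the story you want the map to tell.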
At this point, I noticed that the happiness index score had somehow reverted to a character data type. Another round of converting to numeric, then.
5. Plot the Happiness Index map
Now it’s time to visualise the different happiness index scores all over the world!
Why do we even need to visualise at all, though? Visual representation brings out the contrasts among various countries’ scores and gives us a friendlier entry point for exploring the implications of the data that we have. It sounds like a tall order, but ggplot2 simplifies the process a ton:
## compile all map theme configurations
cleanup <- theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank(),
  panel.background = element_rect(fill = "white"),
  plot.title = element_text(hjust = 0.5)
)
## plot our merged data
worldHappiness <- worldjoin %>%
  ggplot(mapping = aes(
    x=long,
    y=lat,
    group=group)) +
  scale_fill_distiller(palette = "RdYlBu", direction = -1) +
  coord_fixed(1.3) +
  geom_polygon(aes(fill=hapscore)) +
  ggtitle("World Happiness Report Scores 2017") +
  cleanup
## print the plot object to draw the map
worldHappiness
I’m not really looking to have visible x and y axes on our map, nor gridlines, backgrounds, etc. To make life easier, I’ve compiled all of these “non-elements” into a variable called cleanup.
Then, we get to the map itself. To associate happiness index scores with the colours used for each region, we link geom_polygon’s fill aspect to our happiness index score column, hapscore.
This gives us the following map:

6. What does this tell us?
The highest happiness index scores are bright orange, and unsurprisingly, this colour tracks with developed countries such as the US, Australia, and members of the EU.
One thing to note about the Happiness Index score is that it comes from self-reports: inhabitants of a country are asked to rate their quality of life on a scale from 0 to 10, where higher is better. But the score doesn’t delve much into why someone might rate a country a certain way.
In future plots, then, it would be interesting to explore that question by juxtaposing the Happiness Index score with other metrics such as the Gini index (which measures income inequality). This would help us probe for any notable correlations between inhabitants’ happiness and factors such as life expectancy or income distribution, which are measured by the other indices.