I’ve been trying to get back into the habit of coding with R every day, with the goal of compiling a better project portfolio (and reviving my GitHub). This means looking for projects to work on and figuring out where to get the data.
One hurdle: Tidying and transforming data has always been a bit of a pain for me. There’s no escaping the fact that these stages take up most of the time in any data analysis project, but I prefer to do my wrangling with Python, which isn’t today’s language of interest. 😅 You could say that the prerequisite for this R exercise was to find a convenient data set that required minimal cleaning and — if lucky! — omitted the need for scraping entirely.
Enter Information is Beautiful, one of my favourite data visualisation sites and a great source of tidy datasets to play with. I was poking around the site this weekend and happened upon the 2018 World Data Visualisation challenge, which includes global governance data from the World Government Summit. The available variables are rich enough to feed quite a number of interesting questions, but those would entail projects of a larger scope.
Today, since this is more of a practice exercise for skills that need dusting off, let’s go with a smaller goal: Plotting out World Happiness Index data on a map and seeing what that tells us.
1. Load R packages
library(tidyverse)
library(ggplot2)
library(stringr)
library(maps)
The tidyverse package comes standard for any kind of data manipulation, and ggplot2 is its natural complement when visualisation is involved. However, ggplot2 itself doesn’t come with data that will allow us to plot a world map. That’s where the maps package comes in. This package compiles map data from sources such as the Natural Earth project. It allows us to create data frames of latitudes and longitudes, which ggplot2 can then use for mapping.
2. Import and clean the World Happiness Index data
hapindex <- read.csv("./wdvpdata.csv", header = TRUE)
hapindex <- hapindex %>%
  select(region = indicator,
         "hapscore" = `world.happiness.report.score`,
         "countrycode" = `ISO.Country.code`) %>%
  slice(-(1:4))
glimpse(hapindex)
Now that we have the libraries loaded up, it’s time to dig into our data. After importing the given data set into R, let’s go step by step:
- We’ll only need the columns with the country name, ISO code, and the country’s corresponding World Happiness Report score. Use select to extract these variables.
- When we import the dataset, we get a few extra rows of explanatory text / source attribution up top. We won’t need these, so use slice to remove them.
- Double-check the data types of each variable by using glimpse.
This is what we get:
Rows: 195
Columns: 3
$ region      <chr> "Afghanistan",~
$ hapscore    <chr> "2.66", "4.64"~
$ countrycode <chr> "AFG", "ALB", ~
Looks like all our variables were imported as the character data type. This isn’t quite what we want for the scores, since this keeps R from parsing each score as a proper numeric value. We’re leaving this be for now, though; first, we need to check our country names.
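For reference, here is a minimal sketch of why character scores get in the way. The values below are made up for illustration (including the "-" placeholder, which is an assumption and not necessarily how the real file marks missing scores):

```r
# Toy values standing in for the imported column (not the real data)
scores_chr <- c("2.66", "4.64", "-")

# A character vector is not numeric, so R can't average or rank it as numbers
is.numeric(scores_chr)

# as.numeric() parses the digits; anything it can't parse becomes NA
# (R also raises a warning, silenced here to keep the output tidy)
scores_num <- suppressWarnings(as.numeric(scores_chr))

# With proper numbers, summaries work once the NAs are skipped
mean(scores_num, na.rm = TRUE)
```

The same idea applies to plotting: a continuous colour scale needs numbers, not strings.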
One thing to remember here is that the maps package already gives us a set of longitudes and latitudes for world regions. Let’s extract this world map data and store it in a variable called world.
## Use ggplot2's map_data() function to create a data frame from the maps package's world data
world <- map_data("world")
glimpse(world)
Here’s what the data looks like:
Rows: 99,338
Columns: 6
$ long      <dbl> -69.89912, -69.8~
$ lat       <dbl> 12.45200, 12.423~
$ group     <dbl> 1, 1, 1, 1, 1, 1~
$ order     <int> 1, 2, 3, 4, 5, 6~
$ region    <chr> "Aruba", "Aruba"~
$ subregion <chr> NA, NA, NA, NA, ~
Our next question is: how do we link this map dataset to the other dataset we have, which carries each country’s happiness index score?
Note that both datasets have a common variable called region. This is important because region can serve as the common key that will allow us to merge both datasets while retaining the links between countries and their corresponding information.
First, though, we need to check: are the entries in our map dataset’s region column completely aligned with the region entries in our happiness index dataset? If not, we’ll have to reconcile the region variables first so that they have the exact same entries. Otherwise, information will get dropped when we merge the full sets.
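As a rough sketch of that check, setdiff can be run in both directions. The name vectors here are toy values I made up for illustration, not the real columns:

```r
# Toy name vectors (assumptions for illustration only)
map_names   <- c("USA", "UK", "France", "Aruba")
score_names <- c("United States", "United Kingdom", "France")

# Names in the scores data with no match in the map data:
# these are the countries whose scores a merge would drop
unmatched_scores <- setdiff(score_names, map_names)

# Names in the map data with no score: often just territories
# that the scores data doesn't cover
unmatched_map <- setdiff(map_names, score_names)

unmatched_scores
unmatched_map
```

Reading the two lists side by side usually makes the renaming pairs easy to spot (here, "United States" vs "USA").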
## check if world$region and hapindex$region need reconciliation
difference <- setdiff(world$region, hapindex$region)

## reconcile differences
## (the apostrophe in the last entry must match the CSV exactly;
## adjust it if your import shows a different character)
hapindex <- hapindex %>%
  mutate(region = recode(str_trim(region),
                         "United States" = "USA",
                         "United Kingdom" = "UK",
                         "Korea (Rep.)" = "South Korea",
                         "Congo (Dem. Rep.)" = "Democratic Republic of the Congo",
                         "Congo (Rep.)" = "Republic of Congo",
                         "Korea (Dem. People’s Rep.)" = "North Korea"))

## convert the scores from character to numeric
hapindex$hapscore <- as.numeric(hapindex$hapscore)
It turns out that our happiness index dataset uses slightly different names for some countries. We use the str_trim function to strip stray whitespace and recode to rename these entries, bringing them in line with the names used in our map dataset.
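Here is a small sketch of that trim-then-rename step on its own, using a made-up vector of names (the stray spaces are an assumption for illustration):

```r
library(dplyr)
library(stringr)

# Toy names with stray whitespace (illustrative, not the real column)
raw_names <- c(" United States", "France ", "United Kingdom")

# str_trim() removes leading and trailing whitespace;
# recode() swaps old names (left) for new ones (right)
# and leaves unlisted values untouched
clean_names <- recode(str_trim(raw_names),
                      "United States"  = "USA",
                      "United Kingdom" = "UK")
clean_names
```

Trimming first matters: without it, " United States" would not match the "United States" entry and would slip through unrenamed.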
Finally, remember how our happiness index score was stored as a character data type? Time for us to convert it to numeric. This will be useful later on, when we want ggplot2 to recognise these scores as values and adjust the look of our plot accordingly.
3. Build the world map
We’re talking so much about our plot, but we don’t actually have a world map yet. Time to bring back the world variable that we created earlier.
We’ll build the actual shapes of the map using the geom_polygon function. One useful point to remember here: Whenever you’re binding an aspect of the plot to a variable in your dataset (instead of a fixed value), use aes().
worldmap <- ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group)) +
  coord_fixed(1.3)
worldmap
This gives us a basic world map:
It doesn’t tell us anything yet, but still pretty cool. 🙂
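To make the "variable vs fixed value" point concrete, here is a minimal sketch with two toy squares standing in for country outlines. The data frame and its value column are hypothetical:

```r
library(ggplot2)

# Two toy squares (hypothetical data, standing in for country outlines)
squares <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),
  value = rep(c(3.5, 7.2), each = 4)
)

# Fixed value: set outside aes(), so every polygon gets the same fill
fixed_fill <- ggplot(squares, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey40")

# Variable: set inside aes(), so the fill depends on each row's `value`
mapped_fill <- ggplot(squares, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = value))
```

Printing fixed_fill should show two identical grey squares, while mapped_fill colours them differently and adds a legend.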
4. Merge map and Happiness Index datasets
One last step before we get to the fun part. It’s time for us to link each country’s map data with its happiness index score. We need to do this so that ggplot2 can map the correct score for each country on our world map.
# Combine map data with happiness index data
worldjoin <- inner_join(world, hapindex, by = "region")
worldjoin$hapscore <- as.numeric(worldjoin$hapscore)
glimpse(worldjoin)
We use inner_join to retain only the matching rows present in both datasets. Think of it as keeping the intersection of our map dataset and our happiness index dataset, based on the values in our selected column, region.
At this point, I noticed that the happiness index score somehow reverted back to a character data type. Another round of converting to numeric takes care of that.
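A toy version of the join shows what gets kept and what gets dropped. Both data frames below are made-up stand-ins, not the real map or scores data:

```r
library(dplyr)

# Toy stand-ins for the map and scores data (hypothetical values)
map_toy <- data.frame(
  region = c("USA", "USA", "Aruba"),
  long   = c(-100, -99, -69.9),
  lat    = c(40, 41, 12.5)
)
scores_toy <- data.frame(
  region   = c("USA", "France"),
  hapscore = c("6.9", "6.5")   # stored as character, as in the import
)

# inner_join keeps only regions present in both tables:
# Aruba (no score) and France (no map rows) are dropped
joined <- inner_join(map_toy, scores_toy, by = "region")

# Convert after the join and confirm the column type
joined$hapscore <- as.numeric(joined$hapscore)
class(joined$hapscore)
```

If you would rather keep every map region, a left_join from the map side is one alternative: regions without a score would then carry NA, which ggplot2's continuous fill scales draw in grey by default.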
5. Plot the Happiness Index map
Now it’s time to visualise the different happiness index scores all over the world!
Why do we even need to visualise at all, though? Visual representation brings out the contrasts among various countries’ scores and gives us a friendlier entry point for exploring the implications of the data that we have. It sounds like a tall order, but ggplot2 simplifies the process a ton:
## compile all map theme configurations
cleanup <- theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank(),
  panel.background = element_rect(fill = "white"),
  plot.title = element_text(hjust = 0.5)
)

## plot our merged data
worldHappiness <- worldjoin %>%
  ggplot(mapping = aes(x = long, y = lat, group = group)) +
  scale_fill_distiller(palette = "RdYlBu", direction = -1) +
  coord_fixed(1.3) +
  geom_polygon(aes(fill = hapscore)) +
  ggtitle("World Happiness Report Scores 2017") +
  cleanup
I’m not really looking to have visible x and y axes on our map, nor gridlines, backgrounds, etc. To make life easier, I’ve compiled all of these “non-elements” into a variable called cleanup.
Then, we get to the map itself. To associate happiness index scores with the colours used for each region, we link the fill aspect to our happiness index score column, hapscore.
This gives us the following map:
6. What does this tell us?
The highest happiness index scores are bright orange, and unsurprisingly, this colour tracks with developed countries such as the US, Australia, and members of the EU.
One thing to note about the Happiness Index score is that it comes from self-reports: inhabitants of a country are asked to rate their quality of life on a scale from 0 to 10, where higher is better. But the score doesn’t say much about why someone might rate a country a certain way. In future plots, then, it would be interesting to explore that question by juxtaposing the Happiness Index score with other metrics such as the Gini index (which measures income inequality). This would help us probe for any notable correlations between inhabitants’ happiness and factors such as life expectancy or income distribution, which are measured by the other indices.