popular

library(populaR)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

This vignette shows how to:

  • Set up your API key
  • Query data from the WPP API
  • Filter and summarise population indicators

Setting up your API key

To query the WPP API, you require an API key. You can get this by emailing population at un dot org (replace the words with the symbols) with the subject “Data Portal Token Request”.

It might take a few days for the team to get back to you with your API key.

Once you have your key, you can set it in one of two ways:

  • populaR::set_wpp_api_key(YOUR_KEY_HERE): This will set your key for the current session only.
  • Via the .Renviron file, which keeps the key available across sessions. The easiest way to open this file is with usethis::edit_r_environ().

The .Renviron file stores key-value pairs, one per line, so just add a line that says

WPP_API_KEY=YOUR_KEY_HERE

Don’t include any spaces before or after the = sign.

Restart your R session so that the new value in .Renviron is read. You can test whether the key is available by calling get_wpp_api_key(). If this function doesn’t return an error, then you are good to go.
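As a package-independent sanity check, base R can confirm whether the WPP_API_KEY variable from .Renviron reached the current session. This is only a minimal sketch: it tests that the variable is non-empty, not that the API accepts the key.

```r
# Read the environment variable; returns "" if it is not set
key <- Sys.getenv("WPP_API_KEY", unset = "")

# Report whether a value was found, without printing the key itself
if (nzchar(key)) {
  message("WPP_API_KEY is set (", nchar(key), " characters).")
} else {
  message("WPP_API_KEY is not set. Check .Renviron and restart R.")
}
```

If the variable is reported as not set after editing .Renviron, the most likely cause is that the R session has not been restarted.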

Querying data from the WPP API

The WPP API is organised around indicators and locations, both of which are typically referred to by numeric identifiers. Indicators can be thought of as datasets. The API reports a wide range of them, and you can see the full list using

get_base_levels("Indicators")
#> # A tibble: 86 × 33
#>       id name         shortName description displayName dimAge dimSex dimVariant
#>    <int> <chr>        <chr>     <chr>       <chr>       <lgl>  <lgl>  <lgl>     
#>  1     1 Contracepti… CPAnyP    Percentage… Any         FALSE  FALSE  TRUE      
#>  2     2 Contracepti… CPModP    Percentage… CP Modern   FALSE  FALSE  TRUE      
#>  3     3 Contracepti… CPTrad    Percentage… CP Traditi… FALSE  FALSE  TRUE      
#>  4     4 Unmet need … UNMP      Percentage… Unmet need  FALSE  FALSE  TRUE      
#>  5     5 Unmet need … UNMModP   Percentage… Unmet need… FALSE  FALSE  TRUE      
#>  6     6 Total deman… DEMTot    Percentage… Total dema… FALSE  FALSE  TRUE      
#>  7     7 Demand for … DEMAny    Percentage… Demand sat… FALSE  FALSE  TRUE      
#>  8     8 Demand for … DEMMod    Percentage… Demand sat… FALSE  FALSE  TRUE      
#>  9     9 Contracepti… CPAnyN    Number of … Contracept… FALSE  FALSE  TRUE      
#> 10    10 Contracepti… CPModN    Number of … Users of m… FALSE  FALSE  TRUE      
#> # ℹ 76 more rows
#> # ℹ 25 more variables: dimCategory <lgl>, defaultAgeId <int>,
#> #   defaultSexId <int>, defaultVariantId <int>, defaultCategoryId <int>,
#> #   variableType <chr>, valueType <chr>, unitScaling <dbl>, precision <int>,
#> #   isThousandSeparatorSpace <lgl>, formatString <chr>, unitShortLabel <chr>,
#> #   unitLongLabel <chr>, nClassesDefault <int>, downloadFileName <chr>,
#> #   sourceId <int>, sourceName <chr>, sourceYear <int>, …

If you are working on age-related data, then you probably want this indicator:

age_sex_id <- get_id("Population by 5-year age groups and sex", type = "Indicators", search = FALSE, .progress = FALSE)

age_sex_id
#> # A tibble: 1 × 4
#>      id name                                    shortName       description     
#>   <int> <chr>                                   <chr>           <chr>           
#> 1    46 Population by 5-year age groups and sex PopByAge5AndSex Annual populati…

By default, the get_id function performs a fuzzy search for the string you pass in (the call above turns this off with search = FALSE). If you’re not sure of the exact indicator name, have a guess and see what comes back.

The other piece of information you require is a location. This could be a country or a wider region. For example,

australia <- get_id("Australia", type = "locations")

australia
#> # A tibble: 6 × 6
#>      id name                                      iso3  iso2  longitude latitude
#>   <int> <chr>                                     <chr> <chr>     <dbl>    <dbl>
#> 1    36 Australia                                 AUS   AU         134.    -25.3
#> 2   927 Australia/New Zealand                     ANZ   ZL          NA      NA  
#> 3  1834 Australia/New Zealand                     ANZ   ZL          NA      NA  
#> 4  1835 Oceania (excluding Australia and New Zea… OCA   OZ          NA      NA  
#> 5  1837 Eastern and South-Eastern Asia, and Ocea… SP1   S1          NA      NA  
#> 6  5502 Europe, Northern America, Australia and … SDG   SD          NA      NA

The default search returns every location whose name matches "Australia". If I know the exact name of the country I want, I can disable the search using

australia <- get_id("Australia", type = "locations", search = FALSE)

australia
#> # A tibble: 1 × 6
#>      id name      iso3  iso2  longitude latitude
#>   <int> <chr>     <chr> <chr>     <dbl>    <dbl>
#> 1    36 Australia AUS   AU         134.    -25.3

Finally, I can collect the indicator data from the API using get_indicator_data:

aus_data <- get_indicator_data(indicator_id = age_sex_id$id, location_id = australia$id, start_year = 2020, end_year = 2024)

aus_data
#> # A tibble: 252 × 7
#>    locationId location   year sexId      ageStart ageEnd population
#>         <dbl> <chr>     <dbl> <fct>         <dbl>  <dbl>      <dbl>
#>  1         36 Australia  2020 Male              0      4    793764 
#>  2         36 Australia  2020 Female            0      4    749506 
#>  3         36 Australia  2020 Both sexes        0      4   1543270 
#>  4         36 Australia  2020 Male              5      9    836376.
#>  5         36 Australia  2020 Female            5      9    791073 
#>  6         36 Australia  2020 Both sexes        5      9   1627450.
#>  7         36 Australia  2020 Male             10     14    826988 
#>  8         36 Australia  2020 Female           10     14    780265 
#>  9         36 Australia  2020 Both sexes       10     14   1607253 
#> 10         36 Australia  2020 Male             15     19    770618.
#> # ℹ 242 more rows

Working with data from the WPP API

Let’s start by creating a simple time series plot of the total population, stratified by sex.

First, notice that the WPP API returns three levels for sexId:

aus_data |>
  pull(sexId) |>
  unique()
#> [1] Male       Female     Both sexes
#> Levels: Both sexes Female Male
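Before deciding how to handle these levels, it helps to check how they relate to each other. The sketch below copies two age groups (0 to 4 and 10 to 14, year 2020) from the aus_data output printed earlier and compares Male plus Female against "Both sexes". It assumes the printed values are exact, and the object names check and totals are purely illustrative.

```r
library(dplyr)

# Values copied by hand from the first rows of the aus_data output above
check <- tibble(
  sexId      = rep(c("Male", "Female", "Both sexes"), times = 2),
  ageStart   = rep(c(0, 10), each = 3),
  population = c(793764, 749506, 1543270, 826988, 780265, 1607253)
)

# For each age group, compare Male + Female with the "Both sexes" row
totals <- check |>
  group_by(ageStart) |>
  summarise(
    male_plus_female = sum(population[sexId != "Both sexes"]),
    both_sexes       = population[sexId == "Both sexes"]
  )

totals
```

For these rows the two columns agree, which suggests that "Both sexes" is a total of the other two levels rather than a separate group.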

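The "Both sexes" level is the sum of the other two, so we need to filter it out before summing over age groups. Otherwise each person would be counted twice.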

aus_data |>
  filter(sexId != "Both sexes") |>
  group_by(year, sexId) |>
  summarise(population = sum(population)) |>
  ggplot(aes(x = year, y = population, colour = sexId)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Population", colour = "Sex")
#> `summarise()` has regrouped the output.
#> ℹ Summaries were computed grouped by year and sexId.
#> ℹ Output is grouped by year.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(year, sexId))` for per-operation grouping
#>   (`?dplyr::dplyr_by`) instead.

We can see that the female population is consistently larger than the male population in Australia over this period!
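The summarise() message in the output above is informational and does not affect the plot. One way to silence it is to state the grouping of the result explicitly with .groups = "drop", which returns an ungrouped tibble. The sketch below uses a small made-up tibble (toy, with invented numbers) rather than the API data, so it runs without an API key.

```r
library(dplyr)

# Made-up data: two age groups per sex and year, illustrative values only
toy <- tibble(
  year       = rep(c(2020, 2021), each = 4),
  sexId      = rep(c("Male", "Female"), times = 4),
  population = c(10, 12, 20, 22, 11, 13, 21, 23)
)

# Sum over age groups; .groups = "drop" returns an ungrouped result
# and suppresses the regrouping message
toy_totals <- toy |>
  group_by(year, sexId) |>
  summarise(population = sum(population), .groups = "drop")

toy_totals
```

The same argument can be added to the summarise() call in the plotting pipeline above. Dropping the groups is a sensible default here because nothing downstream relies on the result still being grouped by year.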