This tutorial takes you through a simple approach to geocoding addresses using the ggmap package. This package automates the look-up of an address within the Google Geocoding API Application Programming Interface (API) and extracts the latitude and longitude and stores this within a new dataframe.

To be able to use the ggmap library, you will first need to sign up to a Developer account with the Google Cloud Platform and obtain an API key. An API key is a unique token that gives you access to the API.

Theoretically, you will need to pay to use the Google Geocoding API - however, you will have up to $200 in free usage for Maps, Routes and Places APIs each month (and in addition, Google is offering $300 worth of ‘free credit’ for developers when you sign-up to an account). This is a lot of credit on offer - around 2,000 addresses would cost approximately $10 - so for our tutorial, and hopefully your work, using the API shouldn’t cost you anything.

Once you’ve signed up to the API and have your key, geocoding your data will be a simple case of ingesting a csv that contains your addressses and running approximately 10 lines of code, depending on any cleaning or processing you’d need to complete.

This tutorial will walk you through the process.

Here we’ll use a dataset taken from the data.gov.uk website that contains a list of schools in England in 2014. You can find the dataset here: https://www.gov.uk/government/publications/schools-in-england - the data itself is out-of-date, but we’re just using this for geocoding and will not use it for any analysis per se. The dataset was downloaded in a Microsoft Excel spreadsheet format - to enable easy ingestion in RStudio, it was opened and then saved as a csv (just the sheet required for this tutorial). You can find the resulting csv linked here).

There are several other open options for geocoding, which means you would not have to create or use a Google account. We’ll be creating a tutorial on one of these options in the near future, but there will still be (different) limitations with these options!

Getting Started - Acquiring an API key

First, let’s sign up to the Google Geocoding API and get our API key (also known as an Access Token). You’ll need to sign up for a Google Developer account, which is relatively straight forward for those with a Google account already.

If you do not have a Google account, you may be able to use a company/institute email to sign up for an account - but it may be easier to sign up for a dummy account with Google and then delete this once you’ve completed your geocoding (e.g. if you’re not comfortable with having a Google account).

Once you have a Google account, you can follow Google’s instructions in terms of how to acquire an API key here: https://developers.google.com/maps/documentation/geocoding/get-api-key. You may be asked for a credit card to put on file for your account - as long as you stay within your credits, you will not be charged. You can monitor your usage on your Platform page, including the metrics and overview pages.

The Platform page you’ll be navigated to can be quite overwhelming at first to look at - but make sure to follow the steps! You’ll need to create a project for which you want the API key - give it a title that is specific to your project, e.g. ‘Dissertation-geocoding-schools’.

Once you’ve got a project, follow the steps outlined in the link above to create your API key - you’ll be able to find your key in the Credentials section of the platform. The link above also outlines the steps to restrict your key. As your key is linked only to your account, if you share it (accidentally or not), any usage will be applied and billed to your account. Restricting your key helps prevent this from happening - you can at least restrict your key to use within only the Geocoding API. Remember to keep your key private to prevent it from unauthorised use! You can also delete the key once you’ve geocoded your data (and are happy with the results).

Once you’ve got your key, you’re ready to start coding!

Setting up your script

Open up R-Studio and start a new script.

The first step with any script is to load the packages you’ll need to use - here, we’ll be using ggmap for geocoding as well as the sf package to create spatial data (points) from our extracted lat-lon data, and finally the mapview package. This package will let us load our points data onto a zoomable map, which will allow us to check the success and accuracy of our geocoding. We also load the readr package to help read in a text file used as for our API key (explained below).

Make sure you have the packages installed - if you don’t , use the install.packages command within the console to get these packages installed before running your script. And remember to place your libraries in "" within the parenthesis.

# Load our libraries
library("readr")
library("ggmap")
## Loading required package: ggplot2
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library("sf")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library("mapview")

Next, set your working directory.

# Set working directory
# Replace the path below with your file path
setwd("~/R-GIS-Tutorials/")

Then state your API key. Here, we have stored our API key in a text document, which is loaded into this R script and stored as the API_key variable. This prevents us hard-coding our API key into a script that we will share with others and avoid the potential of our API key being used by others.

You can either create a text file and store your API key for use there (recommended) or paste your API key either directly into the function, or as a string stored in the API_key variable.

# Load and register API key for ggmap library

API_key <- read_file("data/API_key.txt")
  
register_google(key = API_key)

Preparing your dataset

Now we’ve got the “logistics” part of our script completed, we’ll load our dataset into RStudio and prepare it for geocoding.

First, we read the csv into Rstudio:

# Load our raw CSV dataset - list of schools in England
# State that the first row is a header, and to not interpret our strings as factors
schools_data <- read.csv("data/school_addresses.csv", header=TRUE, stringsAsFactors = FALSE)

# We print the first five lines of our csv to our console to check that the data has loaded correctly
head(schools_data)
##      URN Local.authority..code. Local.authority..name. Establishment.number
## 1 100000                    201         City of London                 3614
## 2 100001                    201         City of London                 6005
## 3 100002                    201         City of London                 6006
## 4 100003                    201         City of London                 6007
## 5 100005                    202                 Camden                 1048
## 6 100006                    202                 Camden                 1100
##                          Establishment.name                 Street     Locality
## 1 Sir John Cass's Foundation Primary School     St James's Passage Duke's Place
## 2           City of London School for Girls      St Giles' Terrace     Barbican
## 3                St Paul's Cathedral School           2 New Change             
## 4                     City of London School  Queen Victoria Street             
## 5                       Thomas Coram Centre 49 Mecklenburgh Square             
## 6                      CCfL Key Stage 4 PRU         Agincourt Road             
##   Address3   Town County Postcode    Type.of.establishment
## 1          London        EC3A 5DE   Voluntary Aided School
## 2          London        EC2Y 8BB Other Independent School
## 3          London        EC4M 9AD Other Independent School
## 4          London        EC4V 3AL Other Independent School
## 5          London        WC1N 2NY        LA Nursery School
## 6          London         NW3 2NY      Pupil Referral Unit
##   Statutory.highest.age Statutory.lowest.age        Boarders
## 1                    11                    3     No Boarders
## 2                    18                    7     No Boarders
## 3                    13                    4 Boarding School
## 4                    18                   10     No Boarders
## 5                     5                    3     No Boarders
## 6                    16                   14     No Boarders
##                   Sixth.form    UKPRN Phase.of.education Gender
## 1 Does not have a sixth form       NA            Primary  Mixed
## 2           Has a sixth form 10013279     Not applicable  Girls
## 3 Does not have a sixth form 10018890     Not applicable  Mixed
## 4           Has a sixth form 10008165     Not applicable   Boys
## 5             Not applicable       NA            Nursery  Mixed
## 6             Not applicable 10016665     Not applicable  Mixed
##   Religious.character   Religious.ethos Admissions.policy
## 1   Church of England    Does not apply    Not applicable
## 2                None Church of England     Not collected
## 3   Church of England         Christian     Not collected
## 4                None              None     Not collected
## 5      Does not apply    Does not apply    Not applicable
## 6      Does not apply    Does not apply    Not applicable
##                              Website.address Telephone.number
## 1                 www.sirjohncassprimary.org       2072831147
## 2                     http://www.clsg.org.uk       2078475500
## 3 http://www.stpauls.co.uk/school/school.htm       2072485156
## 4                       http://www.clsb.org/       2074890291
## 5      http://www.thomascoram.camden.sch.uk/       2075200385
## 6                  http://ccfl.camden.sch.uk       2079748906
##           Headteacher Establishment.status Reason.establishment.opened
## 1       Mr Tim Wilson                 Open              Not applicable
## 2      Mrs Ena Harrop                 Open              Not applicable
## 3 Mr Neil Chippington                 Open              Not applicable
## 4   Ms Sarah Fletcher                 Open              Not applicable
## 5   Ms Perina Holness                 Open              Not applicable
## 6 Ms Elizabeth Rattue                 Open              Not applicable
##   Opening.date Parliamentary.Constituency..code.
## 1                                      E14000639
## 2   01/01/1920                         E14000639
## 3   01/01/1939                         E14000639
## 4   01/01/1919                         E14000639
## 5                                      E14000750
## 6   01/09/1999                         E14000750
##   Parliamentary.Constituency..name. Region
## 1  Cities of London and Westminster London
## 2  Cities of London and Westminster London
## 3  Cities of London and Westminster London
## 4  Cities of London and Westminster London
## 5            Holborn and St Pancras London
## 6            Holborn and St Pancras London

We can even find out a little more information about the structure of our school data:

# Get the structure of our data frame
str(schools_data)
## 'data.frame':    24302 obs. of  31 variables:
##  $ URN                              : int  100000 100001 100002 100003 100005 100006 100007 100008 100009 100010 ...
##  $ Local.authority..code.           : int  201 201 201 201 202 202 202 202 202 202 ...
##  $ Local.authority..name.           : chr  "City of London" "City of London" "City of London" "City of London" ...
##  $ Establishment.number             : int  3614 6005 6006 6007 1048 1100 1101 2019 2036 2065 ...
##  $ Establishment.name               : chr  "Sir John Cass's Foundation Primary School" "City of London School for Girls" "St Paul's Cathedral School" "City of London School" ...
##  $ Street                           : chr  "St James's Passage" "St Giles' Terrace" "2 New Change" "Queen Victoria Street" ...
##  $ Locality                         : chr  "Duke's Place" "Barbican" "" "" ...
##  $ Address3                         : chr  "" "" "" "" ...
##  $ Town                             : chr  "London" "London" "London" "London" ...
##  $ County                           : chr  "" "" "" "" ...
##  $ Postcode                         : chr  "EC3A 5DE" "EC2Y 8BB" "EC4M 9AD" "EC4V 3AL" ...
##  $ Type.of.establishment            : chr  "Voluntary Aided School" "Other Independent School" "Other Independent School" "Other Independent School" ...
##  $ Statutory.highest.age            : int  11 18 13 18 5 16 11 11 11 11 ...
##  $ Statutory.lowest.age             : int  3 7 4 10 3 14 5 3 3 2 ...
##  $ Boarders                         : chr  "No Boarders" "No Boarders" "Boarding School" "No Boarders" ...
##  $ Sixth.form                       : chr  "Does not have a sixth form" "Has a sixth form" "Does not have a sixth form" "Has a sixth form" ...
##  $ UKPRN                            : int  NA 10013279 10018890 10008165 NA 10016665 NA NA NA NA ...
##  $ Phase.of.education               : chr  "Primary" "Not applicable" "Not applicable" "Not applicable" ...
##  $ Gender                           : chr  "Mixed" "Girls" "Mixed" "Boys" ...
##  $ Religious.character              : chr  "Church of England" "None" "Church of England" "None" ...
##  $ Religious.ethos                  : chr  "Does not apply" "Church of England" "Christian" "None" ...
##  $ Admissions.policy                : chr  "Not applicable" "Not collected" "Not collected" "Not collected" ...
##  $ Website.address                  : chr  "www.sirjohncassprimary.org" "http://www.clsg.org.uk" "http://www.stpauls.co.uk/school/school.htm" "http://www.clsb.org/" ...
##  $ Telephone.number                 : num  2.07e+09 2.08e+09 2.07e+09 2.07e+09 2.08e+09 ...
##  $ Headteacher                      : chr  "Mr Tim Wilson" "Mrs Ena Harrop" "Mr Neil Chippington" "Ms Sarah Fletcher" ...
##  $ Establishment.status             : chr  "Open" "Open" "Open" "Open" ...
##  $ Reason.establishment.opened      : chr  "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
##  $ Opening.date                     : chr  "" "01/01/1920" "01/01/1939" "01/01/1919" ...
##  $ Parliamentary.Constituency..code.: chr  "E14000639" "E14000639" "E14000639" "E14000639" ...
##  $ Parliamentary.Constituency..name.: chr  "Cities of London and Westminster" "Cities of London and Westminster" "Cities of London and Westminster" "Cities of London and Westminster" ...
##  $ Region                           : chr  "London" "London" "London" "London" ...

From this, we’ve got a list of the columns (and examples of their content) and we can see that we have 24,302 schools. This would be quite a lot to geocode - if each school takes 3 seconds to geocode, this could take us up to 20 hours!

As this is a tutorial, we’ll pretend that for our theoretical analysis, we want to focus on the schools within London - hopefully this would mean that we won’t have quite as many schools to process! To obtain only the schools for London, we’ll create a subset of our schools data frame - and we’ll also check the length of our data frame to see how many schools we will end up geocoding.

# Subset to only schools in 'London', i.e. where Town is equal to London
london_schools_data <- subset(schools_data, Town=='London')

# Get the number of observations (note, this is run on a column, not the data frame)
length(london_schools_data$URN)
## [1] 1915

Great, we’re under 2,000 observations - which will be much quicker to process and keep us within our usage limits!

Now we want to get our data ready for geocoding. The way in which the ggmap package works is very simple - it is an automated way to access the Geocoding API we now have access to via the API key.

Essentially the package will take each address you provide it with, enter this address into a Google Maps search, and then scrape the results that the search would return. If you ran this search manually, i.e. directly on the Google Maps website, you would see this as a pop-up where you’d be able to find lots of information about the address you’ve provided - including the latitude and longitude. The ggmap looks at the json file behind this pop-up and extracts the values for the latitude and longitude of your address and then stores this in a data frame for you.

Whilst you could do this yourself by manually populating a csv as you searched for each address, the ggmap package will be substantially faster and less subject to copy and pasting errors! It is however not without fault, and may end up geocoding the wrong location - as a result, you should always double-check your data after geocoding, which we’ll do by mapping our points and checking their locations.

To improve the accuracy of the geocoding, we’ll provide ggmap with as much information as possible. As you can see from the data frame structure above, many of our address components (e.g. school name, street address, postcode) are currently in separate columns. We will therefore create a new column within our data frame that constructs a complete address with these components. This column, gg_address, will then be used for geocoding.

# Create new column, gg_address, that joins the establishment name, street and postcode together
london_schools_data$gg_address <- paste(london_schools_data$Establishment.name, london_schools_data$Street, london_schools_data$Postcode, sep= ", ")

Now we have our column for geocoding, we can now run our geocoding process!

Geocoding our dataset

Using the ggmap package, we can either choose to geocode each address one by one using the geocode function, or use the geocode_mutate function to batch process our data. This latter code will geocode every address provided and then store the results in a dataframe. It means you do not need to write out a complicated for loop to use the geocode function that you might see in other tutorials.

First, we’ll have a quick look at the geocode function to see what the output would be for one address - and check that our API key set up is working. You can also navigate to the Metrics page of the Geocoding API within your Google Cloud Platofrm to see the request register.

# Geocode the first line in our london_schools_data set:
geocode(london_schools_data$gg_address[1])
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Sir+John+Cass's+Foundation+Primary+School,+St+James's+Passage,+EC3A+5DE&key=xxx
## # A tibble: 1 x 2
##       lon   lat
##     <dbl> <dbl>
## 1 -0.0775  51.5

Great, we can see that a longitude and latitude is provided by our code. To check this quickly, you can open https://www.google.com/maps and then enter the latitude followed by the longitude into the search box. We can check the location against the first entry of our schools dataset from earlier - and it looks like we’ve managed to geocode Sir John Cass’s Foundation Primary School to it’s correct location!

Now we know our code works and it’s likely to (fingers crossed!) geocode to the right location, we’ll run this on our overall dataset to produce a new data frame called gg_address_geoc.

We’ll also go make a cup of tea as this might take some time - approximately 15 minutes per 2,000 requests!

# Geocode our london_schools_data data frame, using the gg_address column and the mutate_geocode function
# Store results in a new data frame, gg_address_geoc
gg_address_geoc <- mutate_geocode(london_schools_data, gg_address)

# Tell us the processing is complete
print("Geocoding complete!")
## [1] "Geocoding complete!"

We can check our final output by using the head command again:

#Check the first couple of lines of our new data frame, gg_address_geoc
head(gg_address_geoc)
##   X.8 X.7 X.6 X.5 X.4 X.3 X.2 X.1 X    URN Local.authority..code.
## 1   1   1   1   1   1   1   1   1 1 100000                    201
## 2   2   2   2   2   2   2   2   2 2 100001                    201
## 3   3   3   3   3   3   3   3   3 3 100002                    201
## 4   4   4   4   4   4   4   4   4 4 100003                    201
## 5   5   5   5   5   5   5   5   5 5 100005                    202
## 6   6   6   6   6   6   6   6   6 6 100006                    202
##   Local.authority..name. Establishment.number
## 1         City of London                 3614
## 2         City of London                 6005
## 3         City of London                 6006
## 4         City of London                 6007
## 5                 Camden                 1048
## 6                 Camden                 1100
##                          Establishment.name                 Street     Locality
## 1 Sir John Cass's Foundation Primary School     St James's Passage Duke's Place
## 2           City of London School for Girls      St Giles' Terrace     Barbican
## 3                St Paul's Cathedral School           2 New Change             
## 4                     City of London School  Queen Victoria Street             
## 5                       Thomas Coram Centre 49 Mecklenburgh Square             
## 6                      CCfL Key Stage 4 PRU         Agincourt Road             
##   Address3   Town County Postcode    Type.of.establishment
## 1          London        EC3A 5DE   Voluntary Aided School
## 2          London        EC2Y 8BB Other Independent School
## 3          London        EC4M 9AD Other Independent School
## 4          London        EC4V 3AL Other Independent School
## 5          London        WC1N 2NY        LA Nursery School
## 6          London         NW3 2NY      Pupil Referral Unit
##   Statutory.highest.age Statutory.lowest.age        Boarders
## 1                    11                    3     No Boarders
## 2                    18                    7     No Boarders
## 3                    13                    4 Boarding School
## 4                    18                   10     No Boarders
## 5                     5                    3     No Boarders
## 6                    16                   14     No Boarders
##                   Sixth.form    UKPRN Phase.of.education Gender
## 1 Does not have a sixth form       NA            Primary  Mixed
## 2           Has a sixth form 10013279     Not applicable  Girls
## 3 Does not have a sixth form 10018890     Not applicable  Mixed
## 4           Has a sixth form 10008165     Not applicable   Boys
## 5             Not applicable       NA            Nursery  Mixed
## 6             Not applicable 10016665     Not applicable  Mixed
##   Religious.character   Religious.ethos Admissions.policy
## 1   Church of England    Does not apply    Not applicable
## 2                None Church of England     Not collected
## 3   Church of England         Christian     Not collected
## 4                None              None     Not collected
## 5      Does not apply    Does not apply    Not applicable
## 6      Does not apply    Does not apply    Not applicable
##                              Website.address Telephone.number
## 1                 www.sirjohncassprimary.org       2072831147
## 2                     http://www.clsg.org.uk       2078475500
## 3 http://www.stpauls.co.uk/school/school.htm       2072485156
## 4                       http://www.clsb.org/       2074890291
## 5      http://www.thomascoram.camden.sch.uk/       2075200385
## 6                  http://ccfl.camden.sch.uk       2079748906
##           Headteacher Establishment.status Reason.establishment.opened
## 1       Mr Tim Wilson                 Open              Not applicable
## 2      Mrs Ena Harrop                 Open              Not applicable
## 3 Mr Neil Chippington                 Open              Not applicable
## 4   Ms Sarah Fletcher                 Open              Not applicable
## 5   Ms Perina Holness                 Open              Not applicable
## 6 Ms Elizabeth Rattue                 Open              Not applicable
##   Opening.date Parliamentary.Constituency..code.
## 1                                      E14000639
## 2   01/01/1920                         E14000639
## 3   01/01/1939                         E14000639
## 4   01/01/1919                         E14000639
## 5                                      E14000750
## 6   01/09/1999                         E14000750
##   Parliamentary.Constituency..name. Region
## 1  Cities of London and Westminster London
## 2  Cities of London and Westminster London
## 3  Cities of London and Westminster London
## 4  Cities of London and Westminster London
## 5            Holborn and St Pancras London
## 6            Holborn and St Pancras London
##                                                                gg_address
## 1 Sir John Cass's Foundation Primary School, St James's Passage, EC3A 5DE
## 2            City of London School for Girls, St Giles' Terrace, EC2Y 8BB
## 3                      St Paul's Cathedral School, 2 New Change, EC4M 9AD
## 4                  City of London School, Queen Victoria Street, EC4V 3AL
## 5                   Thomas Coram Centre, 49 Mecklenburgh Square, WC1N 2NY
## 6                           CCfL Key Stage 4 PRU, Agincourt Road, NW3 2NY
##          lon      lat
## 1 -0.0775440 51.51348
## 2 -0.0943486 51.51917
## 3 -0.0968205 51.51386
## 4 -0.0986223 51.51182
## 5 -0.1206182 51.52543
## 6 -0.1601399 51.55386

Checking our geocoded dataset

It looks like we’ve got the structure we expected - but the next question is, did the geocoding work on all of our addresses? To find out, we’ll query whether any observations had NA or NULL values in their latitude column. We can use the which and is.na functions to tell us which observations have an na value:

# Identify rows/observations with NA values in the latitude column
which(is.na(gg_address_geoc$lat))
## [1]  126  648  748  848 1385 1431 1684 1911

So we have 8 entries that were not geocoded - not bad considering we have 1915 schools. With this small number, it’s up to you as the analyst to determine how you would try to fill in these data gaps. One approach is to manually geocode them yourself.

To do this, we can export gg_address_geoc to a csv, which we can then edit ourselves with the correct longitudes and latitudes as you find them manually. You can also do this directly in RStudio, using selection and replacement - but we’ll get onto that another time. For now, we can export the data frame to a csv for use within manual cleaning:

# Export gg_address_geoc dataframe to csv within the data folder
# We will set the row.names function to TRUE so it is easy to identify the 8 observations with NA values (although this would be easy to search in a spreadsheet editor, such as Excel!)
write.csv(gg_address_geoc, "data/london_schools_geocoded.csv", row.names=TRUE)

To keep this tutorial relatively short, we won’t go into the details of editing the exported spreadsheet. In this case, with this few missing entries, it would not take long to do a manual search of Google Maps to try to locate the missing schools.

In addition to the missing values, it would also be a good idea to check the accuracy of the geocoding. To do this, we’ll map the schools and check that they are all located in London.

We’ll use the mapview package to display our schools as it provides an interactive map to navigate and check the distribution and metadata of our schools.

To map our data, we’ll need to turn our data frame into a spatial dataset. Here we will use the sf package, although the sp package also works with mapview. To be able to map all of our points, for now, we’ll also remove those eight that were not geocded by creating a subset of our dataset. The code will not work if there are missing values in these columns!

# Create a subset of our geocoded schools data frame to remove those schools without lat/lon data
london_schools_gc <- subset(gg_address_geoc, lat!="NA")

# Create an sf points spatial object, using the lon and lat columns and stating WGS84 as the crs
school.points <- st_as_sf(london_schools_gc, coords = c("lon", "lat"), crs = 4326)

# Launch the mapview plot to check the spatial accuracy of our geocode:
mapview(school.points)

Oh, yikes! It looks like we’ve ended up with a few schools not exactly where we would want them! But the great thing about the mapview plot is that we can click on each of these schools to find out what might have gone wrong - and make a note of them to clean manually in our spreadsheet. It looks like in total we have 6 schools in the USA and one north of Cambridge to relocate.

In total that makes 7 schools to relocate and 8 to add addresses to geocode manually - considering we have 1915 observations, it’s a much smaller amount to clean/geocode manually then when we started this tutorial! Of course, with these issues, we will need to consider the accuracy of the other 1900 schools, but zooming in on the dataset, it seems to roughly follow the expected shape of London. There are other approaches we could use to check the validity of this final dataset, but we’ll save this for another time.

Next steps

The next steps from here therefore would be to manually clean and geocode the 15 entries within the exported CSV - and then start a new script where you load this final csv and convert it to a new spatial points object for your analysis!