Download the script for this session at the link below (Click Link –> Right Click on Page –> Save Page As):

Data wrangling involves cleaning, (re)structuring and transforming the data. This allows us to analyse and visualise data more easily.

Part 1: Data cleaning

In this section we will look at turning messy data into tidy data. Tidy data allows us easily (re)structure it, should we need to. Cleaning data involves removing unnecessary spaces, periods and duplicate observations in the data. Clean data make it easy for us to select columns and rows in our data set.

Install packages and load libraries

We will start by installing and loading the packages that we will need to clean our data and for the rest of the data wrangling session. *You might already have these packages. If you do, you can skip this step. We will be using the janitor package as well as the dplyr, tidyr and readxl packages from the tidyverse collection of packages.

#### Install packages ----
install.packages('tidyverse') # dplyr, tidyr, readxl  
install.packages('janitor') # cleaning names

Next we need to load our packages (libraries) onto our work spaces. Note that we have to load ‘readxl’ separately because it is not a core tidyverse package like dplyr and tidyr.

#### Load libraries ----
library(readxl)
library(tidyverse)
library(janitor)
library(dplyr)

Import data

Our data that need cleaning will be loaded next. Data can be imported from .txt, .csv or Excel (to name a few common file types) into R. For this workshop we will be importing an excel file. I added another line to show one can import a .csv file too.

#### Loading the data ----
photographer_names <- read_excel("data/raw/photographer_names.xlsx")
#photographer_names <- read.csv("path/filename.csv)

Inspect data

Before we decide how to clean the data, we need to have a clear view of the data and know how they are structured. We want to find out information about the rows(observations) and columns(variables). This will give us an idea of how we can clean, manipulate/transform and (re)structure the data before visualisation and analysis.

#### Inspecting the data
View(photographer_names)
names(photographer_names) # gives you the names of the columns 
## [1] "Site Number"           "Site Location"         "Original Photographer"
## [4] "Date...4"              "Repeat Photographer"   "Date...6"             
## [7] "Latitude"              "Longitude"             "Altitude (m)"
str(photographer_names) # gives you the structure of the data
## tibble [87 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Site Number          : num [1:87] 1100 1101 1103 1104 1105 ...
##  $ Site Location        : chr [1:87] "Welbedacht" "Welbedacht" "Welbedacht" "Welbedacht" ...
##  $ Original Photographer: chr [1:87] "Lambert" "Howes-Howell" "Taylor" "Taylor" ...
##  $ Date...4             : chr [1:87] "Unknown" "11324" "1986/7" "1986/7" ...
##  $ Repeat Photographer  : chr [1:87] "Jack" "Jack" "Jack" "Jack" ...
##  $ Date...6             : POSIXct[1:87], format: "2013-04-01" "2013-04-01" ...
##  $ Latitude             : num [1:87] -32.4 -32.4 -32.4 -32.4 -32.4 ...
##  $ Longitude            : num [1:87] 19.2 19.2 19.2 19.2 19.2 ...
##  $ Altitude (m)         : num [1:87] 1324 1380 1340 1330 1330 ...
head(photographer_names) # gives you the first 6 rows of the data
## # A tibble: 6 × 9
##   `Site Number` `Site Location` `Original Photograph…` Date...4 `Repeat Photog…`
##           <dbl> <chr>           <chr>                  <chr>    <chr>           
## 1          1100 Welbedacht      Lambert                Unknown  Jack            
## 2          1101 Welbedacht      Howes-Howell           11324    Jack            
## 3          1103 Welbedacht      Taylor                 1986/7   Jack            
## 4          1104 Welbedacht      Taylor                 1986/7   Jack            
## 5          1105 Welbedacht      Taylor                 1986/7   Jack            
## 6          1107 Welbedacht      Taylor                 1986/7   Jack            
## # … with 4 more variables: Date...6 <dttm>, Latitude <dbl>, Longitude <dbl>,
## #   `Altitude (m)` <dbl>
tail(photographer_names) # gives you the last 6 rows of the data
## # A tibble: 6 × 9
##   `Site Number` `Site Location` `Original Photograph…` Date...4 `Repeat Photog…`
##           <dbl> <chr>           <chr>                  <chr>    <chr>           
## 1          1064 Middelberg      Forsyth                29068    Puttick         
## 2          1065 Middelberg      Forsyth                29068    Puttick         
## 3          1066 Middelberg      Forsyth                29068    Puttick         
## 4          1067 Middelberg      Forsyth                29068    Puttick         
## 5          1068 Middelberg      Howes-Howell           17533    Puttick         
## 6          1070 Middelberg      Hubbard                12479    Puttick         
## # … with 4 more variables: Date...6 <dttm>, Latitude <dbl>, Longitude <dbl>,
## #   `Altitude (m)` <dbl>

Prepare data

Data preparation is a step that you might not always need if your data have no gaps, extra letter, symbols or other attibutes you might not need. It also allows you to see duplicate information. Preparing the data allows you to clean up the messy parts of your data that may make it hard for you to transform your data.We will start by cleaning the names of the columns, renaming columns that have ambiguous names. Then we will move to changing the column class. Data can be in different classes/formats such as numeric, character or date format.

#### Clean and rename data (columns)
photographers <- clean_names(photographer_names) # the new names are unique
photographers <- rename(photographers, initial_date = date_4)
photographers <- rename(photographers, repeat_date = date_6)
names(photographers) # view the new names of the columns
## [1] "site_number"           "site_location"         "original_photographer"
## [4] "initial_date"          "repeat_photographer"   "repeat_date"          
## [7] "latitude"              "longitude"             "altitude_m"
#### changing column format/class
class(photographers$initial_date) # view the class of the column
## [1] "character"
class(photographers$repeat_date)
## [1] "POSIXct" "POSIXt"
photographers %>% mutate(initial_date = case_when(
  initial_date == "1986/7" ~ 
    as.character(as.Date('86/01/01', format = "%y/%m/%d")),
  initial_date != "1986/7" ~ 
    as.character(as.Date(as.numeric(initial_date), origin="1899-12-30"))
)) -> photographers

View(photographers)

#### remove or add rows
rows_delete (photographers, tibble(site_number = 1100)) # you can match the delete function to any specific row
## # A tibble: 86 × 9
##    site_number site_location original_photographer initial_date repeat_photogra…
##          <dbl> <chr>         <chr>                 <chr>        <chr>           
##  1        1101 Welbedacht    Howes-Howell          1931-01-01   Jack            
##  2        1103 Welbedacht    Taylor                1986-01-01   Jack            
##  3        1104 Welbedacht    Taylor                1986-01-01   Jack            
##  4        1105 Welbedacht    Taylor                1986-01-01   Jack            
##  5        1107 Welbedacht    Taylor                1986-01-01   Jack            
##  6        1108 Welbedacht    Taylor                1986-01-01   Jack            
##  7        1111 Welbedacht    Howes-Howell          1934-11-01   Jack            
##  8        1113 Welbedacht    Manders               1982-09-01   Jack            
##  9        1114 Welbedacht    Howes-Howell          <NA>         Jack            
## 10        1115 Welbedacht    Howes-Howell          1944-10-01   Jack            
## # … with 76 more rows, and 4 more variables: repeat_date <dttm>,
## #   latitude <dbl>, longitude <dbl>, altitude_m <dbl>
rows_insert (photographers, tibble (site_number = 9)) # new row is added.
## # A tibble: 88 × 9
##    site_number site_location original_photographer initial_date repeat_photogra…
##          <dbl> <chr>         <chr>                 <chr>        <chr>           
##  1        1100 Welbedacht    Lambert               <NA>         Jack            
##  2        1101 Welbedacht    Howes-Howell          1931-01-01   Jack            
##  3        1103 Welbedacht    Taylor                1986-01-01   Jack            
##  4        1104 Welbedacht    Taylor                1986-01-01   Jack            
##  5        1105 Welbedacht    Taylor                1986-01-01   Jack            
##  6        1107 Welbedacht    Taylor                1986-01-01   Jack            
##  7        1108 Welbedacht    Taylor                1986-01-01   Jack            
##  8        1111 Welbedacht    Howes-Howell          1934-11-01   Jack            
##  9        1113 Welbedacht    Manders               1982-09-01   Jack            
## 10        1114 Welbedacht    Howes-Howell          <NA>         Jack            
## # … with 78 more rows, and 4 more variables: repeat_date <dttm>,
## #   latitude <dbl>, longitude <dbl>, altitude_m <dbl>
#### remove or add columns
select(photographers,-initial_date) # removes column
## # A tibble: 87 × 8
##    site_number site_location original_photographer repeat_photographer
##          <dbl> <chr>         <chr>                 <chr>              
##  1        1100 Welbedacht    Lambert               Jack               
##  2        1101 Welbedacht    Howes-Howell          Jack               
##  3        1103 Welbedacht    Taylor                Jack               
##  4        1104 Welbedacht    Taylor                Jack               
##  5        1105 Welbedacht    Taylor                Jack               
##  6        1107 Welbedacht    Taylor                Jack               
##  7        1108 Welbedacht    Taylor                Jack               
##  8        1111 Welbedacht    Howes-Howell          Jack               
##  9        1113 Welbedacht    Manders               Jack               
## 10        1114 Welbedacht    Howes-Howell          Jack               
## # … with 77 more rows, and 4 more variables: repeat_date <dttm>,
## #   latitude <dbl>, longitude <dbl>, altitude_m <dbl>
photographers <- mutate (photographers, location = " ") #adds a new column
View(photographers)

Part 2: Data transformation

Data transformation involves changing the structure of data, adding new columns, making new data frames from the existing data or pulling a column from the existing data frame. We performed a few data transformations above and here are a few more.

pull(photographers, latitude) # we pulled the latitude column from the dataframe
##  [1] -32.40899 -32.40719 -32.43596 -32.43659 -32.43647 -32.40108 -32.40224
##  [8] -32.40020 -32.40693 -32.40862 -32.40862 -32.41253 -32.41253 -32.21079
## [15] -32.22125 -32.22133 -32.22163 -32.21944 -32.22029 -32.31575 -32.32013
## [22] -32.32323 -32.32635 -32.32814 -32.33909 -32.34566 -32.34819 -32.34806
## [29] -32.35179 -32.34477 -32.33626 -32.33396 -32.33234 -32.33243 -32.34393
## [36] -32.34039 -32.34191 -32.34203 -32.35600 -32.35546 -32.35418 -32.35741
## [43] -32.35763 -32.35807 -32.35098 -32.36158 -32.36173 -32.35576 -32.35582
## [50] -32.35677 -32.35630 -32.36166 -32.37061 -32.36857 -32.36961 -32.36892
## [57] -32.36937 -32.36919 -32.37169 -32.36862 -32.36839 -32.37067 -32.37078
## [64] -32.36963 -32.37199 -32.37231 -32.37511 -32.37489 -32.37434 -32.37340
## [71] -32.37318 -32.37313 -32.37331 -32.37402 -32.37518 -32.37396 -32.37388
## [78] -32.37385 -32.37342 -32.36967 -32.38276 -32.38267 -32.38255 -32.38125
## [85] -32.38279 -32.35810 -32.35474
select(photographers, site_location, altitude_m) # we selected the site_location and altitude_m columns from the dataframe
## # A tibble: 87 × 2
##    site_location altitude_m
##    <chr>              <dbl>
##  1 Welbedacht          1324
##  2 Welbedacht          1380
##  3 Welbedacht          1340
##  4 Welbedacht          1330
##  5 Welbedacht          1330
##  6 Welbedacht          1426
##  7 Welbedacht          1409
##  8 Welbedacht          1420
##  9 Welbedacht          1378
## 10 Welbedacht          1386
## # … with 77 more rows
New_Data <-select(photographers, site_location, altitude_m) # we selected the site_location and altitude_m columns from the photographers data frame and made a new data frame called New_Data
View(New_Data)