Download the script for this session at the link below (Click Link –> Right Click on Page –> Save Page As):
Data wrangling involves cleaning, (re)structuring and transforming the data. This allows us to analyse and visualise data more easily.
In this section we will look at turning messy data into tidy data. Tidy data allows us easily (re)structure it, should we need to. Cleaning data involves removing unnecessary spaces, periods and duplicate observations in the data. Clean data make it easy for us to select columns and rows in our data set.
We will start by installing and loading the packages that we will need to clean our data and for the rest of the data wrangling session. *You might already have these packages. If you do, you can skip this step. We will be using the janitor package as well as the dplyr, tidyr and readxl packages from the tidyverse collection of packages.
#### Install packages ----
install.packages('tidyverse') # dplyr, tidyr, readxl
install.packages('janitor') # cleaning names
Next we need to load our packages (libraries) onto our work spaces. Note that we have to load ‘readxl’ separately because it is not a core tidyverse package like dplyr and tidyr.
#### Load libraries ----
library(readxl)
library(tidyverse)
library(janitor)
library(dplyr)
Our data that need cleaning will be loaded next. Data can be imported from .txt, .csv or Excel (to name a few common file types) into R. For this workshop we will be importing an excel file. I added another line to show one can import a .csv file too.
#### Loading the data ----
<- read_excel("data/raw/photographer_names.xlsx")
photographer_names #photographer_names <- read.csv("path/filename.csv)
Before we decide how to clean the data, we need to have a clear view of the data and know how they are structured. We want to find out information about the rows(observations) and columns(variables). This will give us an idea of how we can clean, manipulate/transform and (re)structure the data before visualisation and analysis.
#### Inspecting the data
View(photographer_names)
names(photographer_names) # gives you the names of the columns
## [1] "Site Number" "Site Location" "Original Photographer"
## [4] "Date...4" "Repeat Photographer" "Date...6"
## [7] "Latitude" "Longitude" "Altitude (m)"
str(photographer_names) # gives you the structure of the data
## tibble [87 × 9] (S3: tbl_df/tbl/data.frame)
## $ Site Number : num [1:87] 1100 1101 1103 1104 1105 ...
## $ Site Location : chr [1:87] "Welbedacht" "Welbedacht" "Welbedacht" "Welbedacht" ...
## $ Original Photographer: chr [1:87] "Lambert" "Howes-Howell" "Taylor" "Taylor" ...
## $ Date...4 : chr [1:87] "Unknown" "11324" "1986/7" "1986/7" ...
## $ Repeat Photographer : chr [1:87] "Jack" "Jack" "Jack" "Jack" ...
## $ Date...6 : POSIXct[1:87], format: "2013-04-01" "2013-04-01" ...
## $ Latitude : num [1:87] -32.4 -32.4 -32.4 -32.4 -32.4 ...
## $ Longitude : num [1:87] 19.2 19.2 19.2 19.2 19.2 ...
## $ Altitude (m) : num [1:87] 1324 1380 1340 1330 1330 ...
head(photographer_names) # gives you the first 6 rows of the data
## # A tibble: 6 × 9
## `Site Number` `Site Location` `Original Photograph…` Date...4 `Repeat Photog…`
## <dbl> <chr> <chr> <chr> <chr>
## 1 1100 Welbedacht Lambert Unknown Jack
## 2 1101 Welbedacht Howes-Howell 11324 Jack
## 3 1103 Welbedacht Taylor 1986/7 Jack
## 4 1104 Welbedacht Taylor 1986/7 Jack
## 5 1105 Welbedacht Taylor 1986/7 Jack
## 6 1107 Welbedacht Taylor 1986/7 Jack
## # … with 4 more variables: Date...6 <dttm>, Latitude <dbl>, Longitude <dbl>,
## # `Altitude (m)` <dbl>
tail(photographer_names) # gives you the last 6 rows of the data
## # A tibble: 6 × 9
## `Site Number` `Site Location` `Original Photograph…` Date...4 `Repeat Photog…`
## <dbl> <chr> <chr> <chr> <chr>
## 1 1064 Middelberg Forsyth 29068 Puttick
## 2 1065 Middelberg Forsyth 29068 Puttick
## 3 1066 Middelberg Forsyth 29068 Puttick
## 4 1067 Middelberg Forsyth 29068 Puttick
## 5 1068 Middelberg Howes-Howell 17533 Puttick
## 6 1070 Middelberg Hubbard 12479 Puttick
## # … with 4 more variables: Date...6 <dttm>, Latitude <dbl>, Longitude <dbl>,
## # `Altitude (m)` <dbl>
Data preparation is a step that you might not always need if your data have no gaps, extra letter, symbols or other attibutes you might not need. It also allows you to see duplicate information. Preparing the data allows you to clean up the messy parts of your data that may make it hard for you to transform your data.We will start by cleaning the names of the columns, renaming columns that have ambiguous names. Then we will move to changing the column class. Data can be in different classes/formats such as numeric, character or date format.
#### Clean and rename data (columns)
<- clean_names(photographer_names) # the new names are unique
photographers <- rename(photographers, initial_date = date_4)
photographers <- rename(photographers, repeat_date = date_6)
photographers names(photographers) # view the new names of the columns
## [1] "site_number" "site_location" "original_photographer"
## [4] "initial_date" "repeat_photographer" "repeat_date"
## [7] "latitude" "longitude" "altitude_m"
#### changing column format/class
class(photographers$initial_date) # view the class of the column
## [1] "character"
class(photographers$repeat_date)
## [1] "POSIXct" "POSIXt"
%>% mutate(initial_date = case_when(
photographers == "1986/7" ~
initial_date as.character(as.Date('86/01/01', format = "%y/%m/%d")),
!= "1986/7" ~
initial_date as.character(as.Date(as.numeric(initial_date), origin="1899-12-30"))
-> photographers
))
View(photographers)
#### remove or add rows
rows_delete (photographers, tibble(site_number = 1100)) # you can match the delete function to any specific row
## # A tibble: 86 × 9
## site_number site_location original_photographer initial_date repeat_photogra…
## <dbl> <chr> <chr> <chr> <chr>
## 1 1101 Welbedacht Howes-Howell 1931-01-01 Jack
## 2 1103 Welbedacht Taylor 1986-01-01 Jack
## 3 1104 Welbedacht Taylor 1986-01-01 Jack
## 4 1105 Welbedacht Taylor 1986-01-01 Jack
## 5 1107 Welbedacht Taylor 1986-01-01 Jack
## 6 1108 Welbedacht Taylor 1986-01-01 Jack
## 7 1111 Welbedacht Howes-Howell 1934-11-01 Jack
## 8 1113 Welbedacht Manders 1982-09-01 Jack
## 9 1114 Welbedacht Howes-Howell <NA> Jack
## 10 1115 Welbedacht Howes-Howell 1944-10-01 Jack
## # … with 76 more rows, and 4 more variables: repeat_date <dttm>,
## # latitude <dbl>, longitude <dbl>, altitude_m <dbl>
rows_insert (photographers, tibble (site_number = 9)) # new row is added.
## # A tibble: 88 × 9
## site_number site_location original_photographer initial_date repeat_photogra…
## <dbl> <chr> <chr> <chr> <chr>
## 1 1100 Welbedacht Lambert <NA> Jack
## 2 1101 Welbedacht Howes-Howell 1931-01-01 Jack
## 3 1103 Welbedacht Taylor 1986-01-01 Jack
## 4 1104 Welbedacht Taylor 1986-01-01 Jack
## 5 1105 Welbedacht Taylor 1986-01-01 Jack
## 6 1107 Welbedacht Taylor 1986-01-01 Jack
## 7 1108 Welbedacht Taylor 1986-01-01 Jack
## 8 1111 Welbedacht Howes-Howell 1934-11-01 Jack
## 9 1113 Welbedacht Manders 1982-09-01 Jack
## 10 1114 Welbedacht Howes-Howell <NA> Jack
## # … with 78 more rows, and 4 more variables: repeat_date <dttm>,
## # latitude <dbl>, longitude <dbl>, altitude_m <dbl>
#### remove or add columns
select(photographers,-initial_date) # removes column
## # A tibble: 87 × 8
## site_number site_location original_photographer repeat_photographer
## <dbl> <chr> <chr> <chr>
## 1 1100 Welbedacht Lambert Jack
## 2 1101 Welbedacht Howes-Howell Jack
## 3 1103 Welbedacht Taylor Jack
## 4 1104 Welbedacht Taylor Jack
## 5 1105 Welbedacht Taylor Jack
## 6 1107 Welbedacht Taylor Jack
## 7 1108 Welbedacht Taylor Jack
## 8 1111 Welbedacht Howes-Howell Jack
## 9 1113 Welbedacht Manders Jack
## 10 1114 Welbedacht Howes-Howell Jack
## # … with 77 more rows, and 4 more variables: repeat_date <dttm>,
## # latitude <dbl>, longitude <dbl>, altitude_m <dbl>
<- mutate (photographers, location = " ") #adds a new column
photographers View(photographers)
Data transformation involves changing the structure of data, adding new columns, making new data frames from the existing data or pulling a column from the existing data frame. We performed a few data transformations above and here are a few more.
pull(photographers, latitude) # we pulled the latitude column from the dataframe
## [1] -32.40899 -32.40719 -32.43596 -32.43659 -32.43647 -32.40108 -32.40224
## [8] -32.40020 -32.40693 -32.40862 -32.40862 -32.41253 -32.41253 -32.21079
## [15] -32.22125 -32.22133 -32.22163 -32.21944 -32.22029 -32.31575 -32.32013
## [22] -32.32323 -32.32635 -32.32814 -32.33909 -32.34566 -32.34819 -32.34806
## [29] -32.35179 -32.34477 -32.33626 -32.33396 -32.33234 -32.33243 -32.34393
## [36] -32.34039 -32.34191 -32.34203 -32.35600 -32.35546 -32.35418 -32.35741
## [43] -32.35763 -32.35807 -32.35098 -32.36158 -32.36173 -32.35576 -32.35582
## [50] -32.35677 -32.35630 -32.36166 -32.37061 -32.36857 -32.36961 -32.36892
## [57] -32.36937 -32.36919 -32.37169 -32.36862 -32.36839 -32.37067 -32.37078
## [64] -32.36963 -32.37199 -32.37231 -32.37511 -32.37489 -32.37434 -32.37340
## [71] -32.37318 -32.37313 -32.37331 -32.37402 -32.37518 -32.37396 -32.37388
## [78] -32.37385 -32.37342 -32.36967 -32.38276 -32.38267 -32.38255 -32.38125
## [85] -32.38279 -32.35810 -32.35474
select(photographers, site_location, altitude_m) # we selected the site_location and altitude_m columns from the dataframe
## # A tibble: 87 × 2
## site_location altitude_m
## <chr> <dbl>
## 1 Welbedacht 1324
## 2 Welbedacht 1380
## 3 Welbedacht 1340
## 4 Welbedacht 1330
## 5 Welbedacht 1330
## 6 Welbedacht 1426
## 7 Welbedacht 1409
## 8 Welbedacht 1420
## 9 Welbedacht 1378
## 10 Welbedacht 1386
## # … with 77 more rows
<-select(photographers, site_location, altitude_m) # we selected the site_location and altitude_m columns from the photographers data frame and made a new data frame called New_Data
New_Data View(New_Data)