You can use this template script file to work through the lesson.
Sometimes, it is useful to work with multiple datasets to evaluate a question. Maybe, you completed an experiment two different times and want to put the two together. Or maybe you want to add in other information about the conditions or environment. It can be helpful to join the datasets into one.
You can do this using dplyr “join” functions. There are a few different join functions based on what rows to include from the two data sets. For example, you can keep every row from both data sets (full_join), only rows that are in both (inner_join), or all rows from one (left_join or right_join)
When doing a join, you must give at least one column to know how to match the two data sets. Here, we will work on an example using some sample data. Let’s imagine that we are interested in the relationship between growth rates and photosynthesis in plants. First, we measure the growth of 10 plants over the course of a couple months, and record this data in a spreadsheet. Then, we go back and measure the photosynthesis, as the amount of carbon dioxide that the plants take in, a couple weeks later. Unfortunately, some of the plants either died and we were unable to measure the photosythesis for all the plants. We record this data in a different spreadsheet.
First, we will load in our two data files as growth and
co2.
growth <- read.csv('https://maryglover.github.io/bio331/open_data/growth.csv')
photosynthesis <- read.csv('https://maryglover.github.io/bio331/open_data/photosynthesis.csv')
We can look at these two data files.
growth
## Plant_ID Growth_in
## 1 1 6.29
## 2 2 9.92
## 3 3 6.46
## 4 4 8.96
## 5 5 11.75
## 6 6 8.02
## 7 7 8.46
## 8 8 9.84
## 9 9 5.04
## 10 10 10.16
photosynthesis
## Plant_ID CO2
## 1 8 25.94
## 2 10 18.72
## 3 1 25.73
## 4 5 27.73
## 5 6 23.67
## 6 9 27.35
## 7 3 16.13
To look at the growth and the photosynthesis rates together, we need
to join these two files. Here, we will start with the
growth data set and then do an inner_join to just keep the
rows were the plants are present in both data sets. In this
case, we want the match the data by the Plant_ID column. You
can see that the join function does not just add a new
column, but add in the CO2 value, based on the Plant_ID value. So the
Plant_IDs do not have to be in the same order in the two two data
frames.
library(dplyr)
growth |>
inner_join(photosynthesis, by = 'Plant_ID')
## Plant_ID Growth_in CO2
## 1 1 6.29 25.73
## 2 3 6.46 16.13
## 3 5 11.75 27.73
## 4 6 8.02 23.67
## 5 8 9.84 25.94
## 6 9 5.04 27.35
## 7 10 10.16 18.72
Notice: because this is the first time we have used a dplyr function, we first loaded in the package.
Often, the join function can determine what to match the data frames
by, as long as the columns have exactly the same name. So you
will get the same answer if you leave out the by=
argument.
growth |>
inner_join(photosynthesis)
## Joining with `by = join_by(Plant_ID)`
## Plant_ID Growth_in CO2
## 1 1 6.29 25.73
## 2 3 6.46 16.13
## 3 5 11.75 27.73
## 4 6 8.02 23.67
## 5 8 9.84 25.94
## 6 9 5.04 27.35
## 7 10 10.16 18.72
Now see what happens when you do a full_join
instead.
growth |>
full_join(photosynthesis)
## Joining with `by = join_by(Plant_ID)`
## Plant_ID Growth_in CO2
## 1 1 6.29 25.73
## 2 2 9.92 NA
## 3 3 6.46 16.13
## 4 4 8.96 NA
## 5 5 11.75 27.73
## 6 6 8.02 23.67
## 7 7 8.46 NA
## 8 8 9.84 25.94
## 9 9 5.04 27.35
## 10 10 10.16 18.72
You can see here, that all the Plant_IDs are present in the joined
data set. When there is no match in the photosynthesis data frame for a
plant, the CO2 is indicated as “NA”.
Activity:
right_join and a
left_join. What do you notice about the difference?Right and left joins will keep all the rows in either the right or
the left join, respectively. Think back to what the pipe,
|> is doing.
This code
growth |>
right_join(photosynthesis)
is the same as:
right_join(growth, photosynthesis)
In this case, the photosynthesis data is on the right of the
two, so all the Plant_IDs present in the photosynthesis data will remain
in the joined data, but not the rows only present in the growth data. A
left_join would keep all the Plant_ID values in the growth
data frame, even if they are not in the photosynthesis data.
Remember, when we use dplyr functions, the changes we
make are not saved unless we assign the output to a new data frame.
Let’s say we wanted to graph the relationship between the growth and
photosynthesis. We would need to reassign the joined data to a new data
frame and then graph it.
In this example, we will combine what we learned in our last lesson on summary tables with joins. We will use data collected by this class in 2024 for the macroinvertebrate insects they found during a class period in a nearby stream.
First, we will read in the data from our class website. I will call
it invert
invert <- read.csv('https://maryglover.github.io/bio331/open_data/macroinvertebrate_sampling.csv')
Let’s again start by just looking at the data
head(invert)
## Group Taxa N
## 1 Group 1 Caddisfly larvae 10
## 2 Group 1 Gilled Snails 5
## 3 Group 1 Damselfly 5
## 4 Group 1 Sowbugs 2
## 5 Group 1 Aquatic worms 5
## 6 Group 1 Midge larvae 8
In this data, we have the different taxa that two different student groups collected and how many of each they collected.
Let’s look at the species richness or how many
different taxa were collected. Here, we want to tally
the individual taxa in each group. We can do this again with
the n()function.
invert |>
group_by(Group)|>
summarize(richness = n())
## # A tibble: 2 × 2
## Group richness
## <chr> <int>
## 1 Group 1 7
## 2 Group 2 7
Let’s say we want to know how many total insects each group collected to see if one group collected more than the other. Go ahead and try this summary on your own. You should get the following answer.
## # A tibble: 2 × 2
## Group total
## <chr> <int>
## 1 Group 1 46
## 2 Group 2 37
Now, let’s see how many total insects we collected from each taxa. This gives us the abundance of each different invertebrate. Notice: We are only changing the grouping that we are using. First, we wanted to see how many insects each group collected. Now we want to see how many insects in each taxa. So you only need to change one variable from the previous code.
invert |>
group_by(Taxa)|>
summarize(total = sum(N))
## # A tibble: 10 × 2
## Taxa total
## <chr> <int>
## 1 Aquatic worms 18
## 2 Blackfly larvae 11
## 3 Caddisfly larvae 10
## 4 Crayfish 8
## 5 Damselfly 7
## 6 Gilled Snails 5
## 7 Midge larvae 10
## 8 Riffle beetles 8
## 9 Snails 3
## 10 Sowbugs 3
One thing you can do with this data is make a bar graph to show the different insects that were collected in the stream.
First, we want to save the data table we made. Remember, as with
other dplyr functions, the functions do not change
the underlying data.
invert_summary <- invert |>
group_by(Taxa)|>
summarize(total = sum(N))
Then, we can use ggplot to plot the data with the layer
geom_bar.
library(ggplot2)
ggplot(invert_summary, aes(x = Taxa, y = total))+
geom_bar()
You will notice that this code does not work.
geom_bar requires an additional argument that indicates how
to determine the bar height. We just want the bar height to be the value
in the total column. To do this we must set
stat = "identity"
ggplot(invert_summary, aes(x = Taxa, y = total))+
geom_bar(stat = 'identity')
You will notice that this is hard to read with the axis labels. We can
fix this in two ways:
To rotate the axis labels, you must adjust the angle of the
axis.text.x in a theme() layer like so:
ggplot(invert_summary, aes(x = Taxa, y = total))+
geom_bar(stat = 'identity')+
theme(axis.text.x = element_text(angle = 90))
Alternatively, we can just switch the x and the y to plot in the other direction.
ggplot(invert_summary, aes(y = Taxa, x = total))+
geom_bar(stat = 'identity')
Macroinvertebrate sampling is often used in streams to measure water quality by looking at how tolerant the taxa are to pollution. If there are many taxa that are intolerant, or cannot live in streams with lots of pollution, that indicates that the stream is relatively healthy.
Here, we will join in a data file that has the pollution class for different aquatic macroinvertebrates.
pollute_class <- read.csv('https://maryglover.github.io/bio331/open_data/pollution_class.csv')
head(pollute_class)
## Taxa Pollution
## 1 Caddisfly larvae Sensitive
## 2 Hellgrammites Sensitive
## 3 Riffle beetles Sensitive
## 4 Mayfly nymphs Sensitive
## 5 Gilled Snails Sensitive
## 6 Stonefly nymphs Sensitive
Activity:
invert data frame.Another useful tool in R is learning how to work with dates. Dates can be a complicated data format to work with in R. They are often characterized as “characters”, however they do have a quantitative nature and can thus be considered as continuous. We want to be able to put things in order of the data to see how things change over time.
Let’s look back at the Raleigh climate.
climate <- read.csv('data/raleigh_prism_climate.csv')
head(climate)
## year month tmean tmin tmax precip
## 1 2023 7 27.2688 22.1419 32.3958 114.6551
## 2 2023 8 26.3314 20.8735 31.7895 94.6812
## 3 2023 9 22.3231 16.8723 27.7740 125.5386
## 4 2023 10 17.2026 11.0189 23.3865 32.2431
## 5 2023 11 10.1585 3.0728 17.2443 58.3917
## 6 2023 12 8.3325 1.5722 15.0930 185.7540
If we look at the top of this data frame, we can see that we have
separate columns for year and month. This can be helpful when looking at
the differences in months or years, but with this format, we cannot make
a graph to see how the climate variables change over the entire time
frame. We want to have a column that has the full date, with month and
year. We can do this by utilizing the lubridate package.
lubridate works with dplyr functions and
includes many functions to work with dates.
Here, we will use the make_date function, to take the
year and month columns and turn them in to one date. We will also
incorporate the mutate function. We will use
mutate to make a new column using the function
make_date. The make_date function expects a
column, for year, month, and day. We do not have a day column, so will
just provide year and month. It must be in this order. Since we are
using a new package, we will also have to load it in.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
climate |>
mutate(date = make_date(year, month))|>
head()
## year month tmean tmin tmax precip date
## 1 2023 7 27.2688 22.1419 32.3958 114.6551 2023-07-01
## 2 2023 8 26.3314 20.8735 31.7895 94.6812 2023-08-01
## 3 2023 9 22.3231 16.8723 27.7740 125.5386 2023-09-01
## 4 2023 10 17.2026 11.0189 23.3865 32.2431 2023-10-01
## 5 2023 11 10.1585 3.0728 17.2443 58.3917 2023-11-01
## 6 2023 12 8.3325 1.5722 15.0930 185.7540 2023-12-01
Now, you can see a new column date, which we created.
You will also notice that there is a day. Because we did not provide a
column for the day, the default is the first. This is fine for our
analysis, but is important to consider if using in the future. If we
save this data frame with the date column, we can use it to make a line
graph of the tmean over the entire data set.
climate_date<- climate |>
mutate(date = make_date(year, month))
ggplot(climate_date, aes(x = date, y = tmean)) +
geom_line()
Let’s look back at our water quality data. You should have our updated water quality file in your workspace. Go ahead and load that in.
wq <- read.csv('data/raleigh_water_analysis.csv')
We can see in this data frame thehat here is a column with the date in the format yyyy-mm-dd.
head(wq)
## Site Date Time Calcium_mg_L Hardness_total_mg_L Magnesium_mg_L
## 1 BB2 2008-09-30 9:52 NA NA NA
## 2 BBS3 2008-09-30 8:30 NA NA NA
## 3 BDB1 2008-09-30 10:16 NA NA NA
## 4 CC4 2008-09-30 9:40 NA NA NA
## 5 CC5 2008-09-30 9:05 NA NA NA
## 6 HC7 2008-09-30 11:06 NA NA NA
## Salinity_ppt Phosphorus_total_mg_L NH3_mg_L Copper_mg_L E_coli_MPN_100mL
## 1 NA 0.00 0.00 0.000 236
## 2 NA 0.05 0.02 0.005 326
## 3 NA 0.00 0.00 0.000 236
## 4 NA 0.00 0.00 0.000 81
## 5 NA 0.00 0.00 0.000 308
## 6 NA 0.00 0.00 0.000 435
## Conductivity_uS do_percent_sat Temperature_C do_mg_L pH Turbidity_NTU
## 1 106.8 88.5 19.9 8.10 6.50 4.9
## 2 132.5 48.4 20.0 4.38 6.72 22.0
## 3 137.2 77.5 20.3 7.02 6.44 3.0
## 4 145.5 83.3 21.2 7.43 7.20 21.4
## 5 109.5 84.1 21.4 7.42 7.00 15.1
## 6 114.9 82.3 19.9 7.46 6.59 6.6
## TSS_mg_L Nitrogen_total_mg_L NO2_NO3_mg_L TKN_mg_L Zinc_mg_L Salinity_uS
## 1 2.2 0.74 0.41 0.33 0.029 NA
## 2 10.8 0.68 0.03 0.65 0.031 NA
## 3 2.4 1.01 0.62 0.39 0.036 NA
## 4 10.3 1.23 0.40 0.83 0.040 NA
## 5 9.0 0.34 0.34 0.00 0.040 NA
## 6 6.8 0.52 0.25 0.27 0.030 NA
## E_coli_CFU_100mL
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
As is, this is just a string of characters – R does not know that
this is a date and should be treated as such. We will need to format it
as a date in the data frame. The package lubridate does
this easily. Lubridate has a variety of functions to make a variable a
date based on the order of the date objects in the string, for example
dmy(), myd(), ymd().
Our date is in the year, month, date format so we can use the
function ymd() to turn the Date column into a date class. We will again
use mutate as we are changing a column. In this case,
instead of making a new column, we will resave the Date column. We will
also save the wq data frame to keep this format change
wq<- wq |>
mutate(Date = ymd(Date))
head(wq)
## Site Date Time Calcium_mg_L Hardness_total_mg_L Magnesium_mg_L
## 1 BB2 2008-09-30 9:52 NA NA NA
## 2 BBS3 2008-09-30 8:30 NA NA NA
## 3 BDB1 2008-09-30 10:16 NA NA NA
## 4 CC4 2008-09-30 9:40 NA NA NA
## 5 CC5 2008-09-30 9:05 NA NA NA
## 6 HC7 2008-09-30 11:06 NA NA NA
## Salinity_ppt Phosphorus_total_mg_L NH3_mg_L Copper_mg_L E_coli_MPN_100mL
## 1 NA 0.00 0.00 0.000 236
## 2 NA 0.05 0.02 0.005 326
## 3 NA 0.00 0.00 0.000 236
## 4 NA 0.00 0.00 0.000 81
## 5 NA 0.00 0.00 0.000 308
## 6 NA 0.00 0.00 0.000 435
## Conductivity_uS do_percent_sat Temperature_C do_mg_L pH Turbidity_NTU
## 1 106.8 88.5 19.9 8.10 6.50 4.9
## 2 132.5 48.4 20.0 4.38 6.72 22.0
## 3 137.2 77.5 20.3 7.02 6.44 3.0
## 4 145.5 83.3 21.2 7.43 7.20 21.4
## 5 109.5 84.1 21.4 7.42 7.00 15.1
## 6 114.9 82.3 19.9 7.46 6.59 6.6
## TSS_mg_L Nitrogen_total_mg_L NO2_NO3_mg_L TKN_mg_L Zinc_mg_L Salinity_uS
## 1 2.2 0.74 0.41 0.33 0.029 NA
## 2 10.8 0.68 0.03 0.65 0.031 NA
## 3 2.4 1.01 0.62 0.39 0.036 NA
## 4 10.3 1.23 0.40 0.83 0.040 NA
## 5 9.0 0.34 0.34 0.00 0.040 NA
## 6 6.8 0.52 0.25 0.27 0.030 NA
## E_coli_CFU_100mL
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
You can see the data itself didn’t change, but under Date, the type of data has changed.
Activity:
Now that we have the date in the correct format, we can plot it like we did the climate.
Changing the date format is very useful when looking at the data over
time. We can also use other lubridate functions to extract out the date
components like month(), year(), wday(). This will get only the specific
date variable (day, month, year) from the date. For example, let’s say
we want to look at the difference in different years. We don’t want the
entire date, we want to make a group for the year. We can extract the
year using the function year and use mutate to
make a new column. get a column for the year.
wq |>
mutate(year = year(Date))|>
head()
## Site Date Time Calcium_mg_L Hardness_total_mg_L Magnesium_mg_L
## 1 BB2 2008-09-30 9:52 NA NA NA
## 2 BBS3 2008-09-30 8:30 NA NA NA
## 3 BDB1 2008-09-30 10:16 NA NA NA
## 4 CC4 2008-09-30 9:40 NA NA NA
## 5 CC5 2008-09-30 9:05 NA NA NA
## 6 HC7 2008-09-30 11:06 NA NA NA
## Salinity_ppt Phosphorus_total_mg_L NH3_mg_L Copper_mg_L E_coli_MPN_100mL
## 1 NA 0.00 0.00 0.000 236
## 2 NA 0.05 0.02 0.005 326
## 3 NA 0.00 0.00 0.000 236
## 4 NA 0.00 0.00 0.000 81
## 5 NA 0.00 0.00 0.000 308
## 6 NA 0.00 0.00 0.000 435
## Conductivity_uS do_percent_sat Temperature_C do_mg_L pH Turbidity_NTU
## 1 106.8 88.5 19.9 8.10 6.50 4.9
## 2 132.5 48.4 20.0 4.38 6.72 22.0
## 3 137.2 77.5 20.3 7.02 6.44 3.0
## 4 145.5 83.3 21.2 7.43 7.20 21.4
## 5 109.5 84.1 21.4 7.42 7.00 15.1
## 6 114.9 82.3 19.9 7.46 6.59 6.6
## TSS_mg_L Nitrogen_total_mg_L NO2_NO3_mg_L TKN_mg_L Zinc_mg_L Salinity_uS
## 1 2.2 0.74 0.41 0.33 0.029 NA
## 2 10.8 0.68 0.03 0.65 0.031 NA
## 3 2.4 1.01 0.62 0.39 0.036 NA
## 4 10.3 1.23 0.40 0.83 0.040 NA
## 5 9.0 0.34 0.34 0.00 0.040 NA
## 6 6.8 0.52 0.25 0.27 0.030 NA
## E_coli_CFU_100mL year
## 1 NA 2008
## 2 NA 2008
## 3 NA 2008
## 4 NA 2008
## 5 NA 2008
## 6 NA 2008
Now that we know how to get a column for the year, we can use what we have learned about summary tables, to summarize based on the year. Let’s look to see turbidity and total nitrogen for the different years.
wq |>
mutate(year = year(Date)) |>
group_by(year)|>
summarize(turbidity = mean(Turbidity_NTU, na.rm =T), nitrogen = mean(Nitrogen_total_mg_L, na.rm = T))
## # A tibble: 16 × 3
## year turbidity nitrogen
## <dbl> <dbl> <dbl>
## 1 2008 13.7 0.866
## 2 2009 55.7 1.25
## 3 2010 6.99 0.966
## 4 2011 8.94 0.882
## 5 2012 27.2 0.838
## 6 2013 13.3 0.792
## 7 2014 10.4 0.738
## 8 2015 9.40 0.975
## 9 2016 17.9 1.24
## 10 2017 8.48 1.08
## 11 2018 42.2 1.79
## 12 2019 9.33 0.907
## 13 2020 13.9 0.870
## 14 2021 8.20 0.786
## 15 2022 6.61 1.13
## 16 2023 7.56 0.787
In the lessons with water quality data, we have often summarized based on the different stream. The data uses a steam code. However, it would be useful to have the full stream name for our evaluation.
Here, we have a file that matches the code to the stream name and gps coordinates.
stream_codes <- read.csv('https://maryglover.github.io/bio331/open_data/stream_codes.csv')
head(stream_codes)
## Site Stream lat long
## 1 BDB1 Beaverdam Branch 35.82430 -78.65056
## 2 BB2 Big Branch 35.82264 -78.62973
## 3 BBS3 Big Branch South 35.74339 -78.56879
## 4 CC4 Crabtree Creek 35.84380 -78.72660
## 5 CC5 Crabtree Creek 35.79123 -78.58823
## 6 HSC6 Hare Snipe Creek 35.84504 -78.68847
Use what you learned to join with the stream codes to your full water quality data set.
You should get something that looks like this:
## Joining with `by = join_by(Site)`
## Site Stream lat long Date Time Calcium_mg_L
## 1 BDB1 Beaverdam Branch 35.8243 -78.65056 2008-09-30 10:16 NA
## 2 BDB1 Beaverdam Branch 35.8243 -78.65056 2008-12-16 10:15 NA
## 3 BDB1 Beaverdam Branch 35.8243 -78.65056 2009-03-17 10:50 NA
## 4 BDB1 Beaverdam Branch 35.8243 -78.65056 2009-06-16 12:35 NA
## 5 BDB1 Beaverdam Branch 35.8243 -78.65056 2009-09-15 9:40 NA
## 6 BDB1 Beaverdam Branch 35.8243 -78.65056 2009-12-15 9:40 NA
## Hardness_total_mg_L Magnesium_mg_L Salinity_ppt Phosphorus_total_mg_L
## 1 NA NA NA 0.00
## 2 NA NA NA 0.00
## 3 NA NA NA 0.00
## 4 NA NA NA 0.06
## 5 NA NA NA 0.00
## 6 NA NA NA 0.00
## NH3_mg_L Copper_mg_L E_coli_MPN_100mL Conductivity_uS do_percent_sat
## 1 0.00 0 236 137.2 77.5
## 2 0.00 0 228 154.5 90.2
## 3 0.00 0 1050 126.2 97.5
## 4 0.05 0 1990 81.5 94.4
## 5 0.10 0 162 155.0 76.0
## 6 0.00 0 119 151.8 96.4
## Temperature_C do_mg_L pH Turbidity_NTU TSS_mg_L Nitrogen_total_mg_L
## 1 20.3 7.02 6.44 3.00 2.4 1.01
## 2 12.1 9.80 6.62 2.40 1.7 1.15
## 3 10.4 10.90 5.95 17.90 5.6 1.76
## 4 21.3 8.33 6.71 74.60 57.2 1.59
## 5 20.8 6.80 5.91 1.88 1.2 1.55
## 6 10.4 10.77 6.61 2.90 0.0 1.42
## NO2_NO3_mg_L TKN_mg_L Zinc_mg_L Salinity_uS E_coli_CFU_100mL
## 1 0.62 0.39 0.036 NA NA
## 2 1.15 0.00 0.013 NA NA
## 3 1.05 0.71 0.016 NA NA
## 4 0.53 1.06 0.032 NA NA
## 5 0.50 1.05 0.012 NA NA
## 6 1.42 0.00 0.012 NA NA
The North Carolina Department of Environmental Quality also collects data on streams in Raleigh. In this homework assignment, you will use what you learned about working with dates to look at a separate dataset on water in Raleigh.
nc_monitor <- read.csv('https://maryglover.github.io/bio331/open_data/NC_monitoring.csv')
Date columns to the date format using what
you learned from the lubridate function ymd()na.rm =T.The water quality data set includes many water quality metrics. However, you may want to add in other data to use with the water quality data in your analysis. For example, one question the City of Raleigh has is how the precipitation might affect the values collected. To answer this question, you have to “join” two data sets.
stream_precip <- read.csv('https://maryglover.github.io/bio331/open_data/precip_stream_sites.csv')
ymd() function.Turbidity_NTU and precip using your joined
data set.do_percent_sat and the mean of the
Temperature_C columns for each month.make_date()ymd()inner_joinfull_joingeom_bar()