You can use this template script file to work through the lesson.

Joins

Sometimes, it is useful to work with multiple datasets to evaluate a question. Maybe, you completed an experiment two different times and want to put the two together. Or maybe you want to add in other information about the conditions or environment. It can be helpful to join the datasets into one.

You can do this using dplyr “join” functions. There are a few different join functions based on what rows to include from the two data sets. For example, you can keep every row from both data sets (full_join), only rows that are in both (inner_join), or all rows from one (left_join or right_join)

image/svg+xml left_join(x, y) right_join(x, y) inner_join(x, y) full_join(x, y) anti_join(x, y) anti_join(y, x)

When doing a join, you must give at least one column to know how to match the two data sets. Here, we will work on an example using some sample data. Let’s imagine that we are interested in the relationship between growth rates and photosynthesis in plants. First, we measure the growth of 10 plants over the course of a couple months, and record this data in a spreadsheet. Then, we go back and measure the photosynthesis, as the amount of carbon dioxide that the plants take in, a couple weeks later. Unfortunately, some of the plants either died and we were unable to measure the photosythesis for all the plants. We record this data in a different spreadsheet.

First, we will load in our two data files as growth and co2.

growth <- read.csv('https://maryglover.github.io/bio331/open_data/growth.csv')

photosynthesis <- read.csv('https://maryglover.github.io/bio331/open_data/photosynthesis.csv')

We can look at these two data files.

growth
##    Plant_ID Growth_in
## 1         1      6.29
## 2         2      9.92
## 3         3      6.46
## 4         4      8.96
## 5         5     11.75
## 6         6      8.02
## 7         7      8.46
## 8         8      9.84
## 9         9      5.04
## 10       10     10.16
photosynthesis
##   Plant_ID   CO2
## 1        8 25.94
## 2       10 18.72
## 3        1 25.73
## 4        5 27.73
## 5        6 23.67
## 6        9 27.35
## 7        3 16.13

To look at the growth and the photosynthesis rates together, we need to join these two files. Here, we will start with the growth data set and then do an inner_join to just keep the rows were the plants are present in both data sets. In this case, we want the match the data by the Plant_ID column. You can see that the join function does not just add a new column, but add in the CO2 value, based on the Plant_ID value. So the Plant_IDs do not have to be in the same order in the two two data frames.

library(dplyr)
growth |>
  inner_join(photosynthesis, by = 'Plant_ID')
##   Plant_ID Growth_in   CO2
## 1        1      6.29 25.73
## 2        3      6.46 16.13
## 3        5     11.75 27.73
## 4        6      8.02 23.67
## 5        8      9.84 25.94
## 6        9      5.04 27.35
## 7       10     10.16 18.72

Notice: because this is the first time we have used a dplyr function, we first loaded in the package.

Often, the join function can determine what to match the data frames by, as long as the columns have exactly the same name. So you will get the same answer if you leave out the by= argument.

growth |>
  inner_join(photosynthesis)
## Joining with `by = join_by(Plant_ID)`
##   Plant_ID Growth_in   CO2
## 1        1      6.29 25.73
## 2        3      6.46 16.13
## 3        5     11.75 27.73
## 4        6      8.02 23.67
## 5        8      9.84 25.94
## 6        9      5.04 27.35
## 7       10     10.16 18.72

Now see what happens when you do a full_join instead.

growth |>
  full_join(photosynthesis)
## Joining with `by = join_by(Plant_ID)`
##    Plant_ID Growth_in   CO2
## 1         1      6.29 25.73
## 2         2      9.92    NA
## 3         3      6.46 16.13
## 4         4      8.96    NA
## 5         5     11.75 27.73
## 6         6      8.02 23.67
## 7         7      8.46    NA
## 8         8      9.84 25.94
## 9         9      5.04 27.35
## 10       10     10.16 18.72

You can see here, that all the Plant_IDs are present in the joined data set. When there is no match in the photosynthesis data frame for a plant, the CO2 is indicated as “NA”.

Activity:

  • On your own, try a right_join and a left_join. What do you notice about the difference?

Right and left joins will keep all the rows in either the right or the left join, respectively. Think back to what the pipe, |> is doing.

This code

growth |>
  right_join(photosynthesis)

is the same as:

right_join(growth, photosynthesis)

In this case, the photosynthesis data is on the right of the two, so all the Plant_IDs present in the photosynthesis data will remain in the joined data, but not the rows only present in the growth data. A left_join would keep all the Plant_ID values in the growth data frame, even if they are not in the photosynthesis data.

Remember, when we use dplyr functions, the changes we make are not saved unless we assign the output to a new data frame. Let’s say we wanted to graph the relationship between the growth and photosynthesis. We would need to reassign the joined data to a new data frame and then graph it.

Aquatic macroinvertebrate example

In this example, we will combine what we learned in our last lesson on summary tables with joins. We will use data collected by this class in 2024 for the macroinvertebrate insects they found during a class period in a nearby stream.

First, we will read in the data from our class website. I will call it invert

invert <- read.csv('https://maryglover.github.io/bio331/open_data/macroinvertebrate_sampling.csv')

Let’s again start by just looking at the data

head(invert)
##     Group             Taxa  N
## 1 Group 1 Caddisfly larvae 10
## 2 Group 1    Gilled Snails  5
## 3 Group 1        Damselfly  5
## 4 Group 1          Sowbugs  2
## 5 Group 1    Aquatic worms  5
## 6 Group 1     Midge larvae  8

Summarizing the data

In this data, we have the different taxa that two different student groups collected and how many of each they collected.

Let’s look at the species richness or how many different taxa were collected. Here, we want to tally the individual taxa in each group. We can do this again with the n()function.

invert |>
  group_by(Group)|>
  summarize(richness = n())
## # A tibble: 2 × 2
##   Group   richness
##   <chr>      <int>
## 1 Group 1        7
## 2 Group 2        7

Let’s say we want to know how many total insects each group collected to see if one group collected more than the other. Go ahead and try this summary on your own. You should get the following answer.

## # A tibble: 2 × 2
##   Group   total
##   <chr>   <int>
## 1 Group 1    46
## 2 Group 2    37

Now, let’s see how many total insects we collected from each taxa. This gives us the abundance of each different invertebrate. Notice: We are only changing the grouping that we are using. First, we wanted to see how many insects each group collected. Now we want to see how many insects in each taxa. So you only need to change one variable from the previous code.

invert |>
  group_by(Taxa)|>
  summarize(total = sum(N))
## # A tibble: 10 × 2
##    Taxa             total
##    <chr>            <int>
##  1 Aquatic worms       18
##  2 Blackfly larvae     11
##  3 Caddisfly larvae    10
##  4 Crayfish             8
##  5 Damselfly            7
##  6 Gilled Snails        5
##  7 Midge larvae        10
##  8 Riffle beetles       8
##  9 Snails               3
## 10 Sowbugs              3

Graphing the data

One thing you can do with this data is make a bar graph to show the different insects that were collected in the stream.

First, we want to save the data table we made. Remember, as with other dplyr functions, the functions do not change the underlying data.

invert_summary <- invert |>
  group_by(Taxa)|>
  summarize(total = sum(N))

Then, we can use ggplot to plot the data with the layer geom_bar.

library(ggplot2)
ggplot(invert_summary, aes(x = Taxa, y = total))+
  geom_bar()

You will notice that this code does not work. geom_bar requires an additional argument that indicates how to determine the bar height. We just want the bar height to be the value in the total column. To do this we must set stat = "identity"

ggplot(invert_summary, aes(x = Taxa, y = total))+
  geom_bar(stat = 'identity')

You will notice that this is hard to read with the axis labels. We can fix this in two ways:

  • rotate the axis labels
  • plot horizontally

To rotate the axis labels, you must adjust the angle of the axis.text.x in a theme() layer like so:

ggplot(invert_summary, aes(x = Taxa, y = total))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 90))

Alternatively, we can just switch the x and the y to plot in the other direction.

ggplot(invert_summary, aes(y = Taxa, x = total))+
  geom_bar(stat = 'identity')

Joining the data

Macroinvertebrate sampling is often used in streams to measure water quality by looking at how tolerant the taxa are to pollution. If there are many taxa that are intolerant, or cannot live in streams with lots of pollution, that indicates that the stream is relatively healthy.

Here, we will join in a data file that has the pollution class for different aquatic macroinvertebrates.

pollute_class <- read.csv('https://maryglover.github.io/bio331/open_data/pollution_class.csv')

head(pollute_class)
##               Taxa Pollution
## 1 Caddisfly larvae Sensitive
## 2    Hellgrammites Sensitive
## 3   Riffle beetles Sensitive
## 4    Mayfly nymphs Sensitive
## 5    Gilled Snails Sensitive
## 6  Stonefly nymphs Sensitive

Activity:

  • Join the two data frames together. Make sure you keep all the Taxa collected in the invert data frame.
  • Make a bar graph to show the number of insects collected in each Pollution class.

Dealing with Dates

Another useful tool in R is learning how to work with dates. Dates can be a complicated data format to work with in R. They are often characterized as “characters”, however they do have a quantitative nature and can thus be considered as continuous. We want to be able to put things in order of the data to see how things change over time.

Let’s look back at the Raleigh climate.

climate <- read.csv('data/raleigh_prism_climate.csv')
head(climate)
##   year month   tmean    tmin    tmax   precip
## 1 2023     7 27.2688 22.1419 32.3958 114.6551
## 2 2023     8 26.3314 20.8735 31.7895  94.6812
## 3 2023     9 22.3231 16.8723 27.7740 125.5386
## 4 2023    10 17.2026 11.0189 23.3865  32.2431
## 5 2023    11 10.1585  3.0728 17.2443  58.3917
## 6 2023    12  8.3325  1.5722 15.0930 185.7540

If we look at the top of this data frame, we can see that we have separate columns for year and month. This can be helpful when looking at the differences in months or years, but with this format, we cannot make a graph to see how the climate variables change over the entire time frame. We want to have a column that has the full date, with month and year. We can do this by utilizing the lubridate package. lubridate works with dplyr functions and includes many functions to work with dates.

Here, we will use the make_date function, to take the year and month columns and turn them in to one date. We will also incorporate the mutate function. We will use mutate to make a new column using the function make_date. The make_date function expects a column, for year, month, and day. We do not have a day column, so will just provide year and month. It must be in this order. Since we are using a new package, we will also have to load it in.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
climate |>
  mutate(date = make_date(year, month))|>
  head()
##   year month   tmean    tmin    tmax   precip       date
## 1 2023     7 27.2688 22.1419 32.3958 114.6551 2023-07-01
## 2 2023     8 26.3314 20.8735 31.7895  94.6812 2023-08-01
## 3 2023     9 22.3231 16.8723 27.7740 125.5386 2023-09-01
## 4 2023    10 17.2026 11.0189 23.3865  32.2431 2023-10-01
## 5 2023    11 10.1585  3.0728 17.2443  58.3917 2023-11-01
## 6 2023    12  8.3325  1.5722 15.0930 185.7540 2023-12-01

Now, you can see a new column date, which we created. You will also notice that there is a day. Because we did not provide a column for the day, the default is the first. This is fine for our analysis, but is important to consider if using in the future. If we save this data frame with the date column, we can use it to make a line graph of the tmean over the entire data set.

climate_date<- climate |>
  mutate(date = make_date(year, month))

ggplot(climate_date, aes(x = date, y = tmean)) +
  geom_line()

Setting date format

Let’s look back at our water quality data. You should have our updated water quality file in your workspace. Go ahead and load that in.

wq <- read.csv('data/raleigh_water_analysis.csv')

We can see in this data frame thehat here is a column with the date in the format yyyy-mm-dd.

head(wq)
##   Site       Date  Time Calcium_mg_L Hardness_total_mg_L Magnesium_mg_L
## 1  BB2 2008-09-30  9:52           NA                  NA             NA
## 2 BBS3 2008-09-30  8:30           NA                  NA             NA
## 3 BDB1 2008-09-30 10:16           NA                  NA             NA
## 4  CC4 2008-09-30  9:40           NA                  NA             NA
## 5  CC5 2008-09-30  9:05           NA                  NA             NA
## 6  HC7 2008-09-30 11:06           NA                  NA             NA
##   Salinity_ppt Phosphorus_total_mg_L NH3_mg_L Copper_mg_L E_coli_MPN_100mL
## 1           NA                  0.00     0.00       0.000              236
## 2           NA                  0.05     0.02       0.005              326
## 3           NA                  0.00     0.00       0.000              236
## 4           NA                  0.00     0.00       0.000               81
## 5           NA                  0.00     0.00       0.000              308
## 6           NA                  0.00     0.00       0.000              435
##   Conductivity_uS do_percent_sat Temperature_C do_mg_L   pH Turbidity_NTU
## 1           106.8           88.5          19.9    8.10 6.50           4.9
## 2           132.5           48.4          20.0    4.38 6.72          22.0
## 3           137.2           77.5          20.3    7.02 6.44           3.0
## 4           145.5           83.3          21.2    7.43 7.20          21.4
## 5           109.5           84.1          21.4    7.42 7.00          15.1
## 6           114.9           82.3          19.9    7.46 6.59           6.6
##   TSS_mg_L Nitrogen_total_mg_L NO2_NO3_mg_L TKN_mg_L Zinc_mg_L Salinity_uS
## 1      2.2                0.74         0.41     0.33     0.029          NA
## 2     10.8                0.68         0.03     0.65     0.031          NA
## 3      2.4                1.01         0.62     0.39     0.036          NA
## 4     10.3                1.23         0.40     0.83     0.040          NA
## 5      9.0                0.34         0.34     0.00     0.040          NA
## 6      6.8                0.52         0.25     0.27     0.030          NA
##   E_coli_CFU_100mL
## 1               NA
## 2               NA
## 3               NA
## 4               NA
## 5               NA
## 6               NA

As is, this is just a string of characters – R does not know that this is a date and should be treated as such. We will need to format it as a date in the data frame. The package lubridate does this easily. Lubridate has a variety of functions to make a variable a date based on the order of the date objects in the string, for example dmy(), myd(), ymd().

Our date is in the year, month, date format so we can use the function ymd() to turn the Date column into a date class. We will again use mutate as we are changing a column. In this case, instead of making a new column, we will resave the Date column. We will also save the wq data frame to keep this format change

wq<- wq |>
  mutate(Date = ymd(Date)) 

head(wq)
##   Site       Date  Time Calcium_mg_L Hardness_total_mg_L Magnesium_mg_L
## 1  BB2 2008-09-30  9:52           NA                  NA             NA
## 2 BBS3 2008-09-30  8:30           NA                  NA             NA
## 3 BDB1 2008-09-30 10:16           NA                  NA             NA
## 4  CC4 2008-09-30  9:40           NA                  NA             NA
## 5  CC5 2008-09-30  9:05           NA                  NA             NA
## 6  HC7 2008-09-30 11:06           NA                  NA             NA
##   Salinity_ppt Phosphorus_total_mg_L NH3_mg_L Copper_mg_L E_coli_MPN_100mL
## 1           NA                  0.00     0.00       0.000              236
## 2           NA                  0.05     0.02       0.005              326
## 3           NA                  0.00     0.00       0.000              236
## 4           NA                  0.00     0.00       0.000               81
## 5           NA                  0.00     0.00       0.000              308
## 6           NA                  0.00     0.00       0.000              435
##   Conductivity_uS do_percent_sat Temperature_C do_mg_L   pH Turbidity_NTU
## 1           106.8           88.5          19.9    8.10 6.50           4.9
## 2           132.5           48.4          20.0    4.38 6.72          22.0
## 3           137.2           77.5          20.3    7.02 6.44           3.0
## 4           145.5           83.3          21.2    7.43 7.20          21.4
## 5           109.5           84.1          21.4    7.42 7.00          15.1
## 6           114.9           82.3          19.9    7.46 6.59           6.6
##   TSS_mg_L Nitrogen_total_mg_L NO2_NO3_mg_L TKN_mg_L Zinc_mg_L Salinity_uS
## 1      2.2                0.74         0.41     0.33     0.029          NA
## 2     10.8                0.68         0.03     0.65     0.031          NA
## 3      2.4                1.01         0.62     0.39     0.036          NA
## 4     10.3                1.23         0.40     0.83     0.040          NA
## 5      9.0                0.34         0.34     0.00     0.040          NA
## 6      6.8                0.52         0.25     0.27     0.030          NA
##   E_coli_CFU_100mL
## 1               NA
## 2               NA
## 3               NA
## 4               NA
## 5               NA
## 6               NA

You can see the data itself didn’t change, but under Date, the type of data has changed.

Activity:

Now that we have the date in the correct format, we can plot it like we did the climate.

  1. Pick one stream and filter the data with that stream.
  2. Make a line graph with Date on the x-axis. Choose one of the other stream variables on the y axis

Extracting date variables

Changing the date format is very useful when looking at the data over time. We can also use other lubridate functions to extract out the date components like month(), year(), wday(). This will get only the specific date variable (day, month, year) from the date. For example, let’s say we want to look at the difference in different years. We don’t want the entire date, we want to make a group for the year. We can extract the year using the function year and use mutate to make a new column. get a column for the year.

wq |>
  mutate(year = year(Date))|>
  head()
##   Site       Date  Time Calcium_mg_L Hardness_total_mg_L Magnesium_mg_L
## 1  BB2 2008-09-30  9:52           NA                  NA             NA
## 2 BBS3 2008-09-30  8:30           NA                  NA             NA
## 3 BDB1 2008-09-30 10:16           NA                  NA             NA
## 4  CC4 2008-09-30  9:40           NA                  NA             NA
## 5  CC5 2008-09-30  9:05           NA                  NA             NA
## 6  HC7 2008-09-30 11:06           NA                  NA             NA
##   Salinity_ppt Phosphorus_total_mg_L NH3_mg_L Copper_mg_L E_coli_MPN_100mL
## 1           NA                  0.00     0.00       0.000              236
## 2           NA                  0.05     0.02       0.005              326
## 3           NA                  0.00     0.00       0.000              236
## 4           NA                  0.00     0.00       0.000               81
## 5           NA                  0.00     0.00       0.000              308
## 6           NA                  0.00     0.00       0.000              435
##   Conductivity_uS do_percent_sat Temperature_C do_mg_L   pH Turbidity_NTU
## 1           106.8           88.5          19.9    8.10 6.50           4.9
## 2           132.5           48.4          20.0    4.38 6.72          22.0
## 3           137.2           77.5          20.3    7.02 6.44           3.0
## 4           145.5           83.3          21.2    7.43 7.20          21.4
## 5           109.5           84.1          21.4    7.42 7.00          15.1
## 6           114.9           82.3          19.9    7.46 6.59           6.6
##   TSS_mg_L Nitrogen_total_mg_L NO2_NO3_mg_L TKN_mg_L Zinc_mg_L Salinity_uS
## 1      2.2                0.74         0.41     0.33     0.029          NA
## 2     10.8                0.68         0.03     0.65     0.031          NA
## 3      2.4                1.01         0.62     0.39     0.036          NA
## 4     10.3                1.23         0.40     0.83     0.040          NA
## 5      9.0                0.34         0.34     0.00     0.040          NA
## 6      6.8                0.52         0.25     0.27     0.030          NA
##   E_coli_CFU_100mL year
## 1               NA 2008
## 2               NA 2008
## 3               NA 2008
## 4               NA 2008
## 5               NA 2008
## 6               NA 2008

Now that we know how to get a column for the year, we can use what we have learned about summary tables, to summarize based on the year. Let’s look to see turbidity and total nitrogen for the different years.

wq |>
  mutate(year = year(Date)) |>
  group_by(year)|>
  summarize(turbidity = mean(Turbidity_NTU, na.rm =T), nitrogen = mean(Nitrogen_total_mg_L, na.rm = T))
## # A tibble: 16 × 3
##     year turbidity nitrogen
##    <dbl>     <dbl>    <dbl>
##  1  2008     13.7     0.866
##  2  2009     55.7     1.25 
##  3  2010      6.99    0.966
##  4  2011      8.94    0.882
##  5  2012     27.2     0.838
##  6  2013     13.3     0.792
##  7  2014     10.4     0.738
##  8  2015      9.40    0.975
##  9  2016     17.9     1.24 
## 10  2017      8.48    1.08 
## 11  2018     42.2     1.79 
## 12  2019      9.33    0.907
## 13  2020     13.9     0.870
## 14  2021      8.20    0.786
## 15  2022      6.61    1.13 
## 16  2023      7.56    0.787

Join water quality data

In the lessons with water quality data, we have often summarized based on the different stream. The data uses a steam code. However, it would be useful to have the full stream name for our evaluation.

Here, we have a file that matches the code to the stream name and gps coordinates.

stream_codes <- read.csv('https://maryglover.github.io/bio331/open_data/stream_codes.csv')

head(stream_codes)
##   Site           Stream      lat      long
## 1 BDB1 Beaverdam Branch 35.82430 -78.65056
## 2  BB2       Big Branch 35.82264 -78.62973
## 3 BBS3 Big Branch South 35.74339 -78.56879
## 4  CC4   Crabtree Creek 35.84380 -78.72660
## 5  CC5   Crabtree Creek 35.79123 -78.58823
## 6 HSC6 Hare Snipe Creek 35.84504 -78.68847

Use what you learned to join with the stream codes to your full water quality data set.

You should get something that looks like this:

## Joining with `by = join_by(Site)`
##   Site           Stream     lat      long       Date  Time Calcium_mg_L
## 1 BDB1 Beaverdam Branch 35.8243 -78.65056 2008-09-30 10:16           NA
## 2 BDB1 Beaverdam Branch 35.8243 -78.65056 2008-12-16 10:15           NA
## 3 BDB1 Beaverdam Branch 35.8243 -78.65056 2009-03-17 10:50           NA
## 4 BDB1 Beaverdam Branch 35.8243 -78.65056 2009-06-16 12:35           NA
## 5 BDB1 Beaverdam Branch 35.8243 -78.65056 2009-09-15  9:40           NA
## 6 BDB1 Beaverdam Branch 35.8243 -78.65056 2009-12-15  9:40           NA
##   Hardness_total_mg_L Magnesium_mg_L Salinity_ppt Phosphorus_total_mg_L
## 1                  NA             NA           NA                  0.00
## 2                  NA             NA           NA                  0.00
## 3                  NA             NA           NA                  0.00
## 4                  NA             NA           NA                  0.06
## 5                  NA             NA           NA                  0.00
## 6                  NA             NA           NA                  0.00
##   NH3_mg_L Copper_mg_L E_coli_MPN_100mL Conductivity_uS do_percent_sat
## 1     0.00           0              236           137.2           77.5
## 2     0.00           0              228           154.5           90.2
## 3     0.00           0             1050           126.2           97.5
## 4     0.05           0             1990            81.5           94.4
## 5     0.10           0              162           155.0           76.0
## 6     0.00           0              119           151.8           96.4
##   Temperature_C do_mg_L   pH Turbidity_NTU TSS_mg_L Nitrogen_total_mg_L
## 1          20.3    7.02 6.44          3.00      2.4                1.01
## 2          12.1    9.80 6.62          2.40      1.7                1.15
## 3          10.4   10.90 5.95         17.90      5.6                1.76
## 4          21.3    8.33 6.71         74.60     57.2                1.59
## 5          20.8    6.80 5.91          1.88      1.2                1.55
## 6          10.4   10.77 6.61          2.90      0.0                1.42
##   NO2_NO3_mg_L TKN_mg_L Zinc_mg_L Salinity_uS E_coli_CFU_100mL
## 1         0.62     0.39     0.036          NA               NA
## 2         1.15     0.00     0.013          NA               NA
## 3         1.05     0.71     0.016          NA               NA
## 4         0.53     1.06     0.032          NA               NA
## 5         0.50     1.05     0.012          NA               NA
## 6         1.42     0.00     0.012          NA               NA

Exercises

Part 1: NC Monitoring Data

The North Carolina Department of Environmental Quality also collects data on streams in Raleigh. In this homework assignment, you will use what you learned about working with dates to look at a separate dataset on water in Raleigh.

  1. Import the data from our class website using the following code:
nc_monitor <- read.csv('https://maryglover.github.io/bio331/open_data/NC_monitoring.csv')
  1. Convert the Date columns to the date format using what you learned from the lubridate function ymd()
  2. Filter the date to only include the Location “Pigeon_House” and assign the data frame a new name.
  3. Using the data from the Pigeon_House location that you just made, make a line graph to show how the Fecal.Coliform changes over time.
  4. Add a theme to your plot and change the color of the line
  5. Now, make a summary table using the Pigeon House data to get the mean Fecal.Coliform for each year:
    1. Extract the year data and make a new column
    2. Make a summary table. group by the year column
    3. Summarize to get the mean of the Fecal.Coliform column. You will need to specify that na.rm =T.

Part 2: Water quality and precipitation

The water quality data set includes many water quality metrics. However, you may want to add in other data to use with the water quality data in your analysis. For example, one question the City of Raleigh has is how the precipitation might affect the values collected. To answer this question, you have to “join” two data sets.

  1. Import this data set that includes precipitation for each of the dates and sites sampled.
stream_precip <- read.csv('https://maryglover.github.io/bio331/open_data/precip_stream_sites.csv')
  1. Change the date column in the stream_precip data so that it is in the data format. You should have already done this with your water quality data. If you have not, make the date column in your water quality data in the date format as well. You will use the ymd() function.
  2. Join the water quality data with this precipitation data. Make sure you assign a new name to your joined data.
  3. Make a scatterplot that shows the relationship between Turbidity_NTU and precip using your joined data set.
  4. Do you notice anything about the relationship?

Part 3: Season summary

  1. Use the joined data that you used for the Part 2 on water quality.
  2. Add a column to the data that includes only the month that the sample was created. See the section on Extracting date variables in the lesson
  3. Make a summary table that calculates the mean of the do_percent_sat and the mean of the Temperature_C columns for each month.
  4. Is the result what you expected?

List of new functions

  • lubridate functions
    • make_date()
    • ymd()
  • join functions
    • inner_join
    • full_join
  • geom_bar()