Water quality data

In this course, we will be exploring historical water quality data from the City of Raleigh. The data was collected from 2008 to 2023 and includes 19 different water quality variables, including water temperature, E. coli bacteria, and dissolved oxygen, for 18 different stream sites.

Let’s now go back to our City of Raleigh water quality data and use plots to explore the data a bit before we move into more complicated data summaries and analysis.

First, we need to read the data into R. This data is stored on our class website, not in your workspace. You can run this code to “read” the data into R.

wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')

Look at the columns to see the different variables that were measured in the streams.

colnames(wq_clean)
##  [1] "Site"                  "Date"                  "Time"                 
##  [4] "Calcium_mg_L"          "Hardness_total_mg_L"   "Magnesium_mg_L"       
##  [7] "Salinity_ppt"          "Phosphorus_total_mg_L" "NH3_mg_L"             
## [10] "Copper_mg_L"           "E_coli_MPN_100mL"      "Conductivity_uS"      
## [13] "do_percent_sat"        "Temperature_C"         "do_mg_L"              
## [16] "pH"                    "Turbidity_NTU"         "TSS_mg_L"             
## [19] "Nitrogen_total_mg_L"   "NO2_NO3_mg_L"          "TKN_mg_L"             
## [22] "Zinc_mg_L"             "Salinity_uS"           "E_coli_CFU_100mL"

Activity: First, pick one of the different parameters and make a histogram. Record the code in your script and make a note about what you see.
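
As one possible starting point for this activity, here is a minimal sketch of a histogram. The choice of pH and of 30 bins is only an example, not the required answer, and the sketch assumes the pH column was read in as numbers. The name ph_hist is made up for illustration.

```r
library(ggplot2)

# Read the data only if it is not already loaded in this session
if (!exists("wq_clean")) {
  wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')
}

# Histogram of one parameter; pH and bins = 30 are example choices
ph_hist <- ggplot(wq_clean, aes(x = pH)) +
  geom_histogram(bins = 30)

# Print the plot
ph_hist
```

Try changing the variable inside aes() or the number of bins to see how the shape of the histogram changes.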

Now, let’s use the data to explore the relationship between temperature and dissolved oxygen (in percent saturation). What do you predict?

Make a scatterplot with Temperature_C on the x axis and do_percent_sat on the y axis.

library(ggplot2)
ggplot(wq_clean, aes(x = Temperature_C, y = do_percent_sat)) +
  geom_point()

Now, let’s see what we notice about the different sites. We can start with exploring the amount of phosphorus at each site using a boxplot.

ggplot(data = wq_clean, aes(x = Site, y = Phosphorus_total_mg_L )) +
  geom_boxplot()
## Warning: Removed 234 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Now, look at the amount of E. coli in MPN units. What do you notice?

ggplot(data = wq_clean, aes(x = Site, y = E_coli_MPN_100mL )) +
  geom_boxplot()

Here, you can see that there is an outlier! Use dplyr functions to sort the E. coli data to look more closely.
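
One way to do this sorting is sketched below. It assumes the E. coli column was read in as numbers. The name ecoli_sorted is made up for illustration, and keeping only three columns is just a choice to make the output easier to read.

```r
library(dplyr)

# Read the data only if it is not already loaded in this session
if (!exists("wq_clean")) {
  wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')
}

# Keep a few columns and sort from the largest E. coli value to the smallest
# (arrange() always places missing values at the end)
ecoli_sorted <- wq_clean |>
  select(Site, Date, E_coli_MPN_100mL) |>
  arrange(desc(E_coli_MPN_100mL))

# Look at the rows with the largest values
head(ecoli_sorted)
```

The first row shows which site and date the largest value came from, which is useful to know before deciding what to do with it.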

This is an example of a time when good scientific judgement is needed to figure out what to do with this data point. It is not appropriate to remove a data point just because it is very different from the others. However, there are times when an outlier is due to an error in data collection or recording. It is important to understand the parameter to determine what is going on. This is more complicated when we did not collect the data ourselves. In this instance, I was able to talk to the data collector, and they indicated that this value for E. coli is not reasonable and is likely due to a lab issue. In this case, we can remove this data point.

Run the following code to do that.

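library(dplyr)
wq_clean <- wq_clean |>
  # gsub() returns NA for any value that matches '155310' and text for the rest
  mutate(E_coli_MPN_100mL = gsub('155310', NA, E_coli_MPN_100mL)) |>
  # convert the text back to numbers
  mutate(E_coli_MPN_100mL = as.numeric(E_coli_MPN_100mL))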
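
If you would like to confirm that the value is gone, a quick check along these lines may help. This is only a sketch: remaining_outliers is a made-up name, and the check simply counts values that still contain the digits 155310, which should be 0 once the cleaning code above has been run.

```r
# Read the data only if it is not already loaded in this session
if (!exists("wq_clean")) {
  wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')
}

# Count the values that still contain 155310 (missing values count as no match)
remaining_outliers <- sum(grepl('155310', wq_clean$E_coli_MPN_100mL))
remaining_outliers

# Largest remaining E. coli value, ignoring missing values
max(as.numeric(wq_clean$E_coli_MPN_100mL), na.rm = TRUE)

# Number of missing E. coli values
sum(is.na(wq_clean$E_coli_MPN_100mL))
```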

Now, let’s look at the boxplot again.

ggplot(data = wq_clean, aes(x = Site, y = E_coli_MPN_100mL)) +
  geom_boxplot()
## Warning: Removed 295 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

You can use this table to match the Site codes with the stream name.

Site    Stream name           Latitude    Longitude
BDB1    Beaverdam Branch      35.824303   -78.650559
BB2     Big Branch            35.822637   -78.629735
BBS3    Big Branch South      35.743394   -78.568786
CC4     Crabtree Creek        35.843795   -78.726600
CC5     Crabtree Creek        35.791234   -78.588230
HSC6    Hare Snipe Creek      35.845044   -78.688466
HC7     House Creek           35.834072   -78.677525
LBC8    Little Brier Creek    35.889018   -78.796077
MC9     Marsh Creek           35.799051   -78.590593
MC10    Mine Creek            35.841903   -78.662271
PC11    Perry Creek           35.879771   -78.547979
PHB12   Pigeon House Branch   35.805995   -78.615268
RC13    Richland Creek        35.834244   -78.720481
RC14    Richland Creek-WF     35.945004   -78.552641
RB15    Rocky Branch          35.759904   -78.641103
SC16    Sycamore Creek        35.847404   -78.726468
TC17    Turkey Creek          35.848474   -78.722960
WC18    Walnut Creek          35.749248   -78.535679

Because we have made some changes to the water quality data, it is a good idea to save the data set. We can do this with the write.csv() function. We should not overwrite the original data file; instead, we save the changed data under a new file name. Here, we provide the arguments for the data frame we want to save and the name of the file.

write.csv(wq_clean, 'data/raleigh_water_analysis.csv', row.names = FALSE)
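
The write.csv() call above will fail if there is no data folder in your working directory. The sketch below creates the folder when it is missing, saves the file, and reads it back as a quick check. It assumes you want the file in a folder called data, as in the code above. The name wq_check is made up for illustration.

```r
# Read the data only if it is not already loaded in this session
if (!exists("wq_clean")) {
  wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')
}

# Create the data folder if it does not exist yet
if (!dir.exists('data')) {
  dir.create('data')
}

# Save the data under a new name, without an extra column of row numbers
write.csv(wq_clean, 'data/raleigh_water_analysis.csv', row.names = FALSE)

# Read the saved file back in and compare the number of rows and columns
wq_check <- read.csv('data/raleigh_water_analysis.csv')
dim(wq_check)
dim(wq_clean)
```

If the two dim() results match, the saved file has the same number of rows and columns as the data frame in R.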

Figure caption assignment

  1. Look at the data using functions like head() and colnames(). Pick one thing that you are interested in exploring in the data. For example, you can:
    • Look at how a single variable differs at the different stream Sites.
    • Look at the relationship between two of the variables measured in the streams.
    • Look at how a stream metric changes over time in one stream Site.
  2. Make a plot in ggplot to explore your question. Use what you have learned in ggplot to make the figure look nice!
  3. Write a three-part figure caption for the figure you created. It should include:
    1. A descriptive title
    2. A statement of the methods
    3. A statement of the results
  4. Submit the code for your graph and your figure caption in the assignment on Moodle.
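
As an illustration of step 2, here is a minimal sketch of a plot with clearer axis labels and a simpler theme. It reuses the temperature and dissolved oxygen scatterplot from earlier only as an example. The label text, the transparency value, the theme, and the name do_plot are assumptions for illustration, not requirements of the assignment.

```r
library(ggplot2)

# Read the data only if it is not already loaded in this session
if (!exists("wq_clean")) {
  wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')
}

# Scatterplot with readable axis labels and a plain theme
do_plot <- ggplot(wq_clean, aes(x = Temperature_C, y = do_percent_sat)) +
  geom_point(alpha = 0.4) +   # semi-transparent points reduce overplotting
  labs(x = "Water temperature (°C)",
       y = "Dissolved oxygen (% saturation)") +
  theme_bw()

# Print the plot
do_plot
```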