In this course, we will be exploring historic water quality data from the City of Raleigh. The data was collected from 2008 to 2023 and includes 19 different water quality variables, including water temperature, E. coli bacteria, and dissolved oxygen, for 18 different stream sites.
Let’s now go back to our City of Raleigh water quality data and use plots to explore the data a bit before we move into more complicated data summaries and analysis.
First, we need to read the data into R. This data is stored on our class website, not in your workspace. You can run this code to “read” the data into R.
wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')
Look at the columns to see the different variables that were measured in the streams.
colnames(wq_clean)
## [1] "Site" "Date" "Time"
## [4] "Calcium_mg_L" "Hardness_total_mg_L" "Magnesium_mg_L"
## [7] "Salinity_ppt" "Phosphorus_total_mg_L" "NH3_mg_L"
## [10] "Copper_mg_L" "E_coli_MPN_100mL" "Conductivity_uS"
## [13] "do_percent_sat" "Temperature_C" "do_mg_L"
## [16] "pH" "Turbidity_NTU" "TSS_mg_L"
## [19] "Nitrogen_total_mg_L" "NO2_NO3_mg_L" "TKN_mg_L"
## [22] "Zinc_mg_L" "Salinity_uS" "E_coli_CFU_100mL"
Activity: First, pick one of the different parameters and make a histogram. Record the code in your script and make a note about what you see.
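One possible starting point for the activity is sketched below. It uses a few made-up pH values so that it runs on its own; `wq_demo` is a hypothetical stand-in, and in your script you would use `wq_clean` and whichever column you picked.

```r
library(ggplot2)

# Made-up pH readings standing in for wq_clean (not real measurements)
wq_demo <- data.frame(pH = c(6.8, 7.1, 7.3, 6.9, 7.0, 7.4, 7.2, NA, 6.7, 7.1))

# A histogram needs only an x aesthetic; bins sets how many bars are drawn
ph_hist <- ggplot(wq_demo, aes(x = pH)) +
  geom_histogram(bins = 5)

ph_hist
```

Try changing `bins` to see how the shape of the histogram changes.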
Now, let’s use the data to explore the relationship between temperature and dissolved oxygen (in percent saturation). What do you predict?
Make a scatterplot with Temperature_C on the x axis and do_percent_sat on the y axis.
library(ggplot2)
ggplot(wq_clean, aes(x = Temperature_C, y = do_percent_sat)) +
  geom_point()
Now, let’s see what we notice about the different sites. We can start with exploring the amount of phosphorus at each site using a boxplot.
ggplot(data = wq_clean, aes(x = Site, y = Phosphorus_total_mg_L)) +
  geom_boxplot()
## Warning: Removed 234 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
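A warning like this one usually means that some rows have no value (NA) in the column being plotted, so ggplot leaves them out. As a rough check, you can count the missing values in a column yourself. The sketch below uses made-up numbers; `wq_demo` is a hypothetical stand-in for `wq_clean`.

```r
# Made-up phosphorus values standing in for wq_clean (not real measurements)
wq_demo <- data.frame(
  Site = c("A", "A", "B", "B"),
  Phosphorus_total_mg_L = c(0.05, NA, 0.12, NA)
)

# is.na() gives TRUE for each missing value; sum() counts the TRUEs
n_missing <- sum(is.na(wq_demo$Phosphorus_total_mg_L))
n_missing
```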
Now, look at the amount of E. coli in MPN units. What do you notice?
ggplot(data = wq_clean, aes(x = Site, y = E_coli_MPN_100mL)) +
  geom_boxplot()
Here, you can see that there is an outlier! Use dplyr functions to sort the E. coli data to look more closely.
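One way to do the sorting is with `select()` and `arrange()`. This is a minimal sketch on made-up values; `wq_demo` and the site names "A" to "D" are hypothetical, and you would use `wq_clean` in your own script.

```r
library(dplyr)

# Made-up values standing in for wq_clean (not real measurements)
wq_demo <- data.frame(
  Site = c("A", "B", "C", "D"),
  E_coli_MPN_100mL = c(120, 155310, NA, 2400)
)

# Keep only the columns of interest, then sort from largest to smallest.
# arrange() always places missing values (NA) at the end.
ecoli_sorted <- wq_demo |>
  select(Site, E_coli_MPN_100mL) |>
  arrange(desc(E_coli_MPN_100mL))

head(ecoli_sorted)
```

The largest values appear in the first rows, which makes an unusually high reading easy to spot.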
This is a case where good scientific judgement is needed to decide what to do with a data point. It is not appropriate to remove a data point just because it is very different from the others. However, an outlier is sometimes due to an error in data collection or recording. It is important to understand the parameter to determine what is going on, and this is harder when we did not collect the data ourselves. In this instance, I was able to talk to the data collector, who indicated that this E. coli value is not reasonable and is likely due to a lab issue. In this case, we can remove this data point.
Run the following code to do that.
library(dplyr)
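wq_clean <- wq_clean |>
  # gsub() returns NA for any value containing '155310' (the outlier)
  # and returns every other value as text
  mutate(E_coli_MPN_100mL = gsub('155310', NA, E_coli_MPN_100mL)) |>
  # convert the text back to numbers
  mutate(E_coli_MPN_100mL = as.numeric(E_coli_MPN_100mL))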
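As an alternative to the `gsub()` approach above (not the method used in this lesson), dplyr's `na_if()` can replace one specific value with NA without converting the column to text first. The sketch below assumes the column is numeric and uses made-up values; `wq_demo` is a hypothetical stand-in for `wq_clean`.

```r
library(dplyr)

# Made-up values standing in for wq_clean (not real measurements)
wq_demo <- data.frame(
  Site = c("A", "B", "C", "D"),
  E_coli_MPN_100mL = c(120, 155310, NA, 2400)
)

# na_if() turns every value equal to 155310 into NA and leaves the rest alone
wq_demo <- wq_demo |>
  mutate(E_coli_MPN_100mL = na_if(E_coli_MPN_100mL, 155310))

# Check that the outlier is gone: the largest remaining value should be smaller
max(wq_demo$E_coli_MPN_100mL, na.rm = TRUE)
```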
Now, let’s look at the boxplot again.
ggplot(data = wq_clean, aes(x = Site, y = E_coli_MPN_100mL)) +
  geom_boxplot()
## Warning: Removed 295 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
You can use this table to match the Site codes with the stream name.
| Site code and stream name | Coordinates (latitude, longitude) |
|---|---|
| BDB1—Beaverdam Branch | 35.824303, -78.650559 |
| BB2—Big Branch | 35.822637, -78.629735 |
| BBS3—Big Branch South | 35.743394, -78.568786 |
| CC4—Crabtree Creek | 35.843795, -78.726600 |
| CC5—Crabtree Creek | 35.791234, -78.588230 |
| HSC6—Hare Snipe Creek | 35.845044, -78.688466 |
| HC7—House Creek | 35.834072, -78.677525 |
| LBC8—Little Brier Creek | 35.889018, -78.796077 |
| MC9—Marsh Creek | 35.799051, -78.590593 |
| MC10—Mine Creek | 35.841903, -78.662271 |
| PC11—Perry Creek | 35.879771, -78.547979 |
| PHB12—Pigeon House Branch | 35.805995, -78.615268 |
| RC13—Richland Creek | 35.834244, -78.720481 |
| RC14—Richland Creek-WF | 35.945004, -78.552641 |
| RB15—Rocky Branch | 35.759904, -78.641103 |
| SC16—Sycamore Creek | 35.847404, -78.726468 |
| TC17—Turkey Creek | 35.848474, -78.722960 |
| WC18—Walnut Creek | 35.749248, -78.535679 |
Because we have made some changes to the water quality data, it is a good idea to save the data set. We can do this with the write.csv() function. We should not overwrite the original data file, so we save the data under a new name. Here, we provide the data frame we want to save and the name of the file as arguments.
write.csv(wq_clean, 'data/raleigh_water_analysis.csv', row.names = FALSE)
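Note that `write.csv()` does not create folders, so it will give an error if the folder in the file path does not exist yet. The sketch below shows one way to create the folder first and then read the file back in to check it. It uses a made-up data frame (`wq_demo`) and a temporary folder so that it runs anywhere; both are assumptions for illustration, and in your project you would use `wq_clean` and the `data` folder.

```r
# Made-up data frame standing in for wq_clean (not real measurements)
wq_demo <- data.frame(Site = c("A", "B"), pH = c(7.1, 6.8))

# Hypothetical location for this sketch; in your project this would be "data"
out_dir <- file.path(tempdir(), "data")

# Create the folder if it is missing (no warning if it already exists)
dir.create(out_dir, showWarnings = FALSE)

# Save the data under a new file name, without row numbers
out_file <- file.path(out_dir, "raleigh_water_analysis.csv")
write.csv(wq_demo, out_file, row.names = FALSE)

# Read the file back in to confirm it saved as expected
wq_check <- read.csv(out_file)
colnames(wq_check)
```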
Look at the data with head() and colnames(). Pick one thing that you are interested in exploring in the data. Then use ggplot to explore your questions. Use what you have learned in ggplot to make the figure look nice!