You can use this template script file to work through the lesson.

Creating summary tables

In class, we have begun summarizing different variables in data frames. But often we are interested in summaries based on different groups. For example, in our climate data analysis, we looked at the average precipitation value in Raleigh, but maybe we want to see the average for every year. We can do this using the group_by() and summarize() functions together.

  • group_by() puts the data frame in “groups” based on a column or columns. This happens “behind the scenes” and doesn’t change the data output. But when you run other functions, like mutate() or summarize(), the groups are honored.

  • summarize() allows you to make new columns based on summary functions, like the mean, median, or count. summarize() uses a similar format to mutate(): new_column = function(column).

First, let’s load in our climate data and our dplyr package.

climate <- read.csv('data/raleigh_prism_climate.csv')
library(dplyr)
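Before summarizing, it is worth seeing for yourself that group_by() on its own does not change the data; the rows and columns are identical, and only a grouping label is added to the printed output:

```r
# The data itself is unchanged; the printed tibble just gains a "Groups:" line
climate |>
  group_by(year)
```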

Now, I will summarize to get the mean of precipitation (precip), grouping by year:

climate |>
  group_by(year) |>
  summarize(average_precip = mean(precip)) 
## # A tibble: 43 × 2
##     year average_precip
##    <int>          <dbl>
##  1  1981           70.5
##  2  1982          103. 
##  3  1983          102. 
##  4  1984          105. 
##  5  1985           81.8
##  6  1986           76.0
##  7  1987           91.5
##  8  1988           80.9
##  9  1989          126. 
## 10  1990           90.7
## # ℹ 33 more rows

In this code, group_by() groups the data by year, so that when summarize() runs, the average is calculated separately for each year.

You can also make multiple summaries at once by adding more calculations to the summarize() function. Here, we add the standard deviation (sd()). You can add as many summary calculations as you would like. Let’s also add a filter for only years in the 90s to simplify the table.

climate |>
  filter(year %in% 1990:1999) |>
  group_by(year) |>
  summarize(average_precip = mean(precip), sd_precip = sd(precip))
## # A tibble: 10 × 3
##     year average_precip sd_precip
##    <int>          <dbl>     <dbl>
##  1  1990           90.7      49.2
##  2  1991           85.0      50.6
##  3  1992           98.4      43.7
##  4  1993           89.8      37.1
##  5  1994           88.6      39.5
##  6  1995          121.       69.7
##  7  1996          117.       64.5
##  8  1997           84.3      28.1
##  9  1998          109.       48.0
## 10  1999          111.      130.

Activity:

  • What year had the highest average mean temperature (tmean)?
  • Which year had the highest total precipitation in Raleigh?

More examples

Raleigh cycling crashes

The City of Raleigh keeps track of incidents that involve bicycles and vehicles in Raleigh. We will use this data set to practice making summaries.

First, let’s load in the data, which I have stored in our class website. Here, I am assigning it the name cycling.

cycling <- read.csv('https://maryglover.github.io/bio331/open_data/cycling_crashes.csv')

Let’s look at the data, by viewing the first few rows and the column names.

head(cycling)
##   day_of_week day    month year            road       road_feature weather
## 1    Saturday   7 February 2015    CHAPANOKE RD No Special Feature   Clear
## 2     Tuesday  10    March 2015  SANDY FORKS RD No Special Feature   Clear
## 3    Thursday  19    March 2015    NEW BERN AVE No Special Feature   Clear
## 4      Monday  23    March 2015  WAKE FOREST RD No Special Feature   Clear
## 5    Saturday  28    March 2015 KING CHARLES RD No Special Feature   Clear
## 6    Saturday  28    March 2015   W. MORGAN ST. No Special Feature   Clear
##      traffic_control
## 1 Stop and go signal
## 2 No control present
## 3 Stop and go signal
## 4 Stop and go signal
## 5 Stop and go signal
## 6 No control present
colnames(cycling)
## [1] "day_of_week"     "day"             "month"           "year"           
## [5] "road"            "road_feature"    "weather"         "traffic_control"

We can see that this data set includes the date of the incident (day_of_week, day, month, and year), where it occurred (road), and different characteristics (road_feature, weather, and traffic_control).

Let’s say we want to see how many crashes occurred each year. To get the number of crashes, we want to know the number of rows in each group. This data is in “tidy” format, so each crash is one row. In dplyr, we can do this with the function n(), which gives the number of rows in each group.

cycling |>
  group_by(year)|>
  summarise(number = n())
## # A tibble: 10 × 2
##     year number
##    <int>  <int>
##  1  2015     70
##  2  2016     73
##  3  2017     67
##  4  2018     67
##  5  2019     55
##  6  2020     39
##  7  2021     64
##  8  2022     47
##  9  2023     47
## 10  2024      1
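As a side note, dplyr also provides count() as a shorthand for this group_by() plus summarize(n()) pattern; the new column is simply named n by default:

```r
# Equivalent to group_by(year) |> summarise(n = n())
cycling |>
  count(year)
```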

Now, let’s say we want to look at the number of crashes by weather, to see if certain weather conditions are associated with more crashes. Instead of grouping by year, we simply change the group to weather.

cycling |>
  group_by(weather)|>
  summarise(number = n())
## # A tibble: 4 × 2
##   weather          number
##   <chr>             <int>
## 1 Clear               461
## 2 Cloudy               43
## 3 Fog, smog, smoke      2
## 4 Rain                 24

Activity:

  • Summarize the data to determine the number of crashes for each road and then sort to see what road had the most crashes.

Names data from birth records

In our next example, we are using a dataset with names of babies born from 1910 to 2024. Use the following code to read in the dataset, assigning it the name birth.

birth <- read.csv('https://maryglover.github.io/bio331/open_data/nc_birth.csv')

Look at just the first few rows:

head(birth)
##   sex year      name number decade
## 1   F 1910      Mary    837   1910
## 2   F 1910     Annie    401   1910
## 3   F 1910      Ruth    235   1910
## 4   F 1910     Ethel    199   1910
## 5   F 1910 Elizabeth    191   1910
## 6   F 1910  Margaret    171   1910

You can see that it has the number of babies given each name in each year. It also includes a column indicating the decade.

Let’s first find the most common names for boys and girls across the entire data set. In this instance, we will group by both name and sex. You can do this by adding both columns to the group_by() function. Because there are so many names, let’s also arrange the data to sort by the most common.

birth |>
  group_by(sex, name) |>
  summarize(total = sum(number)) |>
  arrange(-total)
## # A tibble: 10,645 × 3
## # Groups:   sex [2]
##    sex   name         total
##    <chr> <chr>        <int>
##  1 M     James       211907
##  2 M     William     161425
##  3 F     Mary        134082
##  4 M     John        125765
##  5 M     Robert      116316
##  6 M     Michael      96018
##  7 M     David        92204
##  8 M     Charles      82400
##  9 M     Christopher  63345
## 10 M     Thomas       60457
## # ℹ 10,635 more rows
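As a side note, arrange(desc(total)) is equivalent to arrange(-total) and reads a little more clearly; desc() also works on character columns, where the minus sign does not:

```r
# desc() sorts in descending order; same result as arrange(-total) here
birth |>
  group_by(sex, name) |>
  summarize(total = sum(number)) |>
  arrange(desc(total))
```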

We can also use the decade column to summarize data. Let’s say we want to know the most common name for girls for each decade. First, we will filter to just get data for girls. Then, we can group for both decade and name. Lastly, we will summarize for the total babies born with each name.

birth |>
  filter(sex == 'F')|>
  group_by(decade, name) |>
  summarize(total = sum(number))
## # A tibble: 19,936 × 3
## # Groups:   decade [12]
##    decade name     total
##     <int> <chr>    <int>
##  1   1910 Abbie       12
##  2   1910 Ada        804
##  3   1910 Addie      837
##  4   1910 Adelaide   107
##  5   1910 Adele       52
##  6   1910 Adeline     21
##  7   1910 Adell      117
##  8   1910 Adelle      17
##  9   1910 Aggie       10
## 10   1910 Agnes      742
## # ℹ 19,926 more rows

We now have the number of times each name was given in every decade. We can use the handy function slice_max(), which gives us just the row with the highest value in a specified column. We only want the single top value, so we set n = 1 in slice_max(). Here, we want the row where total is highest.

birth |>
  filter(sex == 'F')|>
  group_by(decade, name) |>
  summarize(total = sum(number)) |> 
  slice_max(total, n = 1)
## `summarise()` has grouped output by 'decade'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 3
## # Groups:   decade [12]
##    decade name     total
##     <int> <chr>    <int>
##  1   1910 Mary     16575
##  2   1920 Mary     28711
##  3   1930 Mary     24040
##  4   1940 Mary     23554
##  5   1950 Mary     16906
##  6   1960 Lisa     13392
##  7   1970 Jennifer 12523
##  8   1980 Jessica  12615
##  9   1990 Ashley    9128
## 10   2000 Madison   6899
## 11   2010 Emma      5990
## 12   2020 Olivia    2533
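You may also notice the message about grouped output above. When you group by two columns, summarize() drops only the last grouping variable (name), leaving the result grouped by decade; that is why slice_max() returns one row per decade here. If you want an ungrouped result instead, you can set the .groups argument:

```r
# .groups = 'drop' returns an ungrouped tibble (and silences the message)
birth |>
  filter(sex == 'F') |>
  group_by(decade, name) |>
  summarize(total = sum(number), .groups = 'drop')
```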

Activity:

  • Filter the data for the 2000’s decade and determine the top 5 names for boys and girls.

Invertebrate data

For our last example, we will use data collected by this class in 2024 on the macroinvertebrates found during a class period in a nearby stream. First, we will read in the data from our class website. I will call it invert.

invert <- read.csv('https://maryglover.github.io/bio331/open_data/macroinvertebrate_sampling.csv')

Let’s again start by looking at the data:

head(invert)
##     Group             Taxa  N
## 1 Group 1 Caddisfly larvae 10
## 2 Group 1    Gilled Snails  5
## 3 Group 1        Damselfly  5
## 4 Group 1          Sowbugs  2
## 5 Group 1    Aquatic worms  5
## 6 Group 1     Midge larvae  8

In this data, we have the different taxa that two student groups collected and how many of each they found.

Let’s say we want to know how many total insects each group collected to see if one group collected more than the other. Go ahead and try this summary on your own. You should get the following answer.

## # A tibble: 2 × 2
##   Group   total
##   <chr>   <int>
## 1 Group 1    46
## 2 Group 2    37

Now, let’s see how many total insects we collected of each taxon. This gives us the abundance of each different invertebrate.

invert |>
  group_by(Taxa)|>
  summarize(total = sum(N))
## # A tibble: 10 × 2
##    Taxa             total
##    <chr>            <int>
##  1 Aquatic worms       18
##  2 Blackfly larvae     11
##  3 Caddisfly larvae    10
##  4 Crayfish             8
##  5 Damselfly            7
##  6 Gilled Snails        5
##  7 Midge larvae        10
##  8 Riffle beetles       8
##  9 Snails               3
## 10 Sowbugs              3

One thing you can do with this data is make a bar graph to show the different insects that were collected in the stream.

First, we want to save the data table we made. Remember, as with other dplyr functions, the functions do not change the underlying data.

invert_summary <- invert |>
  group_by(Taxa)|>
  summarize(total = sum(N))

Then, we can use ggplot to plot the data with the layer geom_bar.

library(ggplot2)
ggplot(invert_summary, aes(x = Taxa, y = total))+
  geom_bar()

You will notice that this code does not work. geom_bar() requires an additional argument indicating how to determine the bar heights. We just want the bar height to be the value in the total column. To do this, we must set stat = "identity".

ggplot(invert_summary, aes(x = Taxa, y = total))+
  geom_bar(stat = 'identity')

You will notice that this is hard to read with the axis labels. We can fix this in two ways:

  • rotate the axis labels
  • plot horizontally

To rotate the axis labels, you must adjust the angle of the axis.text.x in a theme() layer like so:

ggplot(invert_summary, aes(x = Taxa, y = total))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 90))

Alternatively, we can just switch the x and the y to plot in the other direction.

ggplot(invert_summary, aes(y = Taxa, x = total))+
  geom_bar(stat = 'identity')
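As a side note, ggplot2 also provides geom_col(), which is exactly geom_bar(stat = 'identity') under a shorter name, so this produces the same horizontal bar chart:

```r
# geom_col() uses the supplied value directly as the bar length
ggplot(invert_summary, aes(y = Taxa, x = total)) +
  geom_col()
```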

Lastly, let’s look at species richness, or how many different taxa were collected. Here, we want to tally the taxa in each group. We can do this again with the n() function.

invert |>
  group_by(Group)|>
  summarize(richness = n())
## # A tibble: 2 × 2
##   Group   richness
##   <chr>      <int>
## 1 Group 1        7
## 2 Group 2        7
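Note that n() counts rows, so this works because each taxon appears in only one row per group. If a taxon could show up in multiple rows, n_distinct() would be the safer choice, since it counts unique values:

```r
# n_distinct() counts unique taxa, even if a taxon spans multiple rows
invert |>
  group_by(Group) |>
  summarize(richness = n_distinct(Taxa))
```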

Water quality data

Let’s now go back to our City of Raleigh water quality data and use plots to explore the data a bit before we do some summaries.

First, we need to read the data into R. Last class, you saved the water quality data, removing the E. coli outlier. You will need to read in this data from where you saved it.

wq <- read.csv('data/raleigh_water_analysis.csv')

In the water quality data, let’s look at how the streams differ in Turbidity. We will calculate the turbidity average for each of the different streams, using the column Site.

wq |>
  group_by(Site) |>
  summarize(mean_turb = mean(Turbidity_NTU))

What do you notice?

In this data, and in many other data sets, there are NA values, meaning not applicable or not available. This is used when we do not have information about a value. We can’t put zero when we don’t know, because we cannot say the value is zero. Because there are NAs in our data set, R cannot say for sure what the mean turbidity is: if any value in a group is unknown, the mean of that group is also unknown, so the answer is NA. To ignore the NAs and calculate the mean from only the values that were measured, we must add na.rm = TRUE to the function.
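You can see this behavior with a tiny example:

```r
mean(c(1, 2, NA))                # NA: the result is unknown
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5: the NA is ignored
```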

wq |>
  group_by(Site) |>
  summarize(mean_turb = mean(Turbidity_NTU, na.rm = T))
## # A tibble: 18 × 2
##    Site  mean_turb
##    <chr>     <dbl>
##  1 BB2        6.51
##  2 BBS3      27.6 
##  3 BDB1       7.47
##  4 CC4       17.8 
##  5 CC5       21.0 
##  6 HC7       13.3 
##  7 HSC6       9.63
##  8 LBC8      55.3 
##  9 MC10      11.8 
## 10 MC9       13.7 
## 11 PC11      14.1 
## 12 PHB12      7.45
## 13 RB15       9.94
## 14 RC13      20.6 
## 15 RC14      20.1 
## 16 SC16      13.9 
## 17 TC17      16.3 
## 18 WC18      19.7

Now, what do you notice about the data?

You can use this table to match the Site codes with the stream name.

Site   Stream                 Latitude    Longitude
BDB1   Beaverdam Branch       35.824303   -78.650559
BB2    Big Branch             35.822637   -78.629735
BBS3   Big Branch South       35.743394   -78.568786
CC4    Crabtree Creek         35.843795   -78.726600
CC5    Crabtree Creek         35.791234   -78.588230
HSC6   Hare Snipe Creek       35.845044   -78.688466
HC7    House Creek            35.834072   -78.677525
LBC8   Little Brier Creek     35.889018   -78.796077
MC9    Marsh Creek            35.799051   -78.590593
MC10   Mine Creek             35.841903   -78.662271
PC11   Perry Creek            35.879771   -78.547979
PHB12  Pigeon House Branch    35.805995   -78.615268
RC13   Richland Creek         35.834244   -78.720481
RC14   Richland Creek-WF      35.945004   -78.552641
RB15   Rocky Branch           35.759904   -78.641103
SC16   Sycamore Creek         35.847404   -78.726468
TC17   Turkey Creek           35.848474   -78.722960
WC18   Walnut Creek           35.749248   -78.535679

Homework exercises

Complete the homework exercises in the same R script that you used for class. Submit the R script with both class exercises and homework answers in the Moodle assignment.

Part 1: Climate data

  1. Go back to the climate data. Summarize the data to get the average precipitation for each month. You will group by the month, and summarize to get the mean of the precip column.
  2. Repeat to determine the mean and standard deviation (sd) of the maximum temperature (tmax) for each month.

Part 2: Water quality data

  1. Go back to the City of Raleigh water quality data. Complete a few more summaries to look at the differences between the sites. You choose what summaries you are interested in. Determine what parameters (e.g. do_percent_sat, pH, etc.) and what calculations (e.g. mean, median, max) you want to explore.
  2. Write up a couple of paragraphs of what you notice about the water quality from your summaries.
  3. Write a couple of sentences describing something you would be interested in exploring in the water quality data that you don’t know how to do.

List of new functions

  • group_by()
  • summarize()
  • slice_max()
  • sd()
  • geom_bar()
  • n()
  • na.rm = TRUE (an argument to summary functions, not a function itself)
