You can use this template script file to work through the lesson.
We have begun in class summarizing different variables in data
frames. But often we are interested in summaries based on different
groups. For example, in our climate data analysis, we looked to see what
the average precipitation value was in Raleigh, but maybe we want to see
the average for every year. We can do this utilizing both the
group_by() and summarize() functions.
group_by() puts the data frame in “groups” based on
a column or columns. This is done “behind the scenes” and doesn’t change
the data output. But, when your run other functions, like
mutate() or summarize() the groups are
honored.
summarize() allows you to make new columns based on
a summary functions – like the mean, median, or count.
summarize() uses a similar format to mutate,
new_column = function.
First, let’s load in our climate data and our dplyr package.
climate <- read.csv('data/raleigh_prism_climate.csv')
library(dplyr)
Now, I will summarize to get the mean of
precipitation (precip), grouping for the
different years:
climate |>
group_by(year) |>
summarize(average_precip = mean(precip))
## # A tibble: 43 × 2
## year average_precip
## <int> <dbl>
## 1 1981 70.5
## 2 1982 103.
## 3 1983 102.
## 4 1984 105.
## 5 1985 81.8
## 6 1986 76.0
## 7 1987 91.5
## 8 1988 80.9
## 9 1989 126.
## 10 1990 90.7
## # ℹ 33 more rows
This first code group_by groups the data by the year, so
that when you do the summarize the average will be taken
for each year group.
You can also make multiple summaries at once by adding more
calculations into the summarize() function. Here, we can
add in the standard deviation (sd()). You can add as many
summary calculations as you would like. Let’s also add in a filter for
only years in the 90’s to simply the table.
climate |>
filter(year %in% 1990:1999) |>
group_by(year) |>
summarize(average_precip = mean(precip), sd_precip = sd(precip))
## # A tibble: 10 × 3
## year average_precip sd_precip
## <int> <dbl> <dbl>
## 1 1990 90.7 49.2
## 2 1991 85.0 50.6
## 3 1992 98.4 43.7
## 4 1993 89.8 37.1
## 5 1994 88.6 39.5
## 6 1995 121. 69.7
## 7 1996 117. 64.5
## 8 1997 84.3 28.1
## 9 1998 109. 48.0
## 10 1999 111. 130.
Activity:
tmean?The City of Raleigh keeps track of incidents that involve bicycles and vehicles in Raleigh. We will use this data set to practice making summaries.
First, let’s load in the data, which I have stored in our class
website. Here, I am assigning it the name cycling.
cycling <- read.csv('https://maryglover.github.io/bio331/open_data/cycling_crashes.csv')
Let’s look at the data, by viewing the first few rows and the column names.
head(cycling)
## day_of_week day month year road road_feature weather
## 1 Saturday 7 February 2015 CHAPANOKE RD No Special Feature Clear
## 2 Tuesday 10 March 2015 SANDY FORKS RD No Special Feature Clear
## 3 Thursday 19 March 2015 NEW BERN AVE No Special Feature Clear
## 4 Monday 23 March 2015 WAKE FOREST RD No Special Feature Clear
## 5 Saturday 28 March 2015 KING CHARLES RD No Special Feature Clear
## 6 Saturday 28 March 2015 W. MORGAN ST. No Special Feature Clear
## traffic_control
## 1 Stop and go signal
## 2 No control present
## 3 Stop and go signal
## 4 Stop and go signal
## 5 Stop and go signal
## 6 No control present
colnames(cycling)
## [1] "day_of_week" "day" "month" "year"
## [5] "road" "road_feature" "weather" "traffic_control"
We can see that this data set includes the date of the incident (day_of_week, day, month, and year), where it occurred (road), and different characteristics (road_feature, weather, traffic control).
Let’s say we want to see how many crashes occurred each year. To get
the number of crashes, we can want to know the number of rows in each
group. This data is in the “tidy” format, so each crash is one row. In
dplyr we can simplify this by using the function
n(), which gives the number of items in the specified
group.
cycling |>
group_by(year)|>
summarise(number = n())
## # A tibble: 10 × 2
## year number
## <int> <int>
## 1 2015 70
## 2 2016 73
## 3 2017 67
## 4 2018 67
## 5 2019 55
## 6 2020 39
## 7 2021 64
## 8 2022 47
## 9 2023 47
## 10 2024 1
Now, let’s say we wanted to look at the number of crashes based on the weather, to see if the weather causes more crashes. Instead of grouping by the year, we simply change the group to weather.
cycling |>
group_by(weather)|>
summarise(number = n())
## # A tibble: 4 × 2
## weather number
## <chr> <int>
## 1 Clear 461
## 2 Cloudy 43
## 3 Fog, smog, smoke 2
## 4 Rain 24
Activity:
In our next example, we are using a dataset with names of babies born
from years 1910 to 2024. Use the following code to read in the dataset
and assigning it the name birth
birth <- read.csv('https://maryglover.github.io/bio331/open_data/nc_birth.csv')
Look just at the first few rows
head(birth)
## sex year name number decade
## 1 F 1910 Mary 837 1910
## 2 F 1910 Annie 401 1910
## 3 F 1910 Ruth 235 1910
## 4 F 1910 Ethel 199 1910
## 5 F 1910 Elizabeth 191 1910
## 6 F 1910 Margaret 171 1910
You can see that it has the number of babies born in a year with different names. It also includes a column indicating the decade.
Let’s first summarize to what is the most common name for boys and
girls for the entire data set. In this instance, we will group by
both the name and the sex. You can do this by adding
both of the columns in to the group_by function. Because
there are so many names, let’s also arrange the data to sort by the most
common.
birth |>
group_by(sex, name) |>
summarize(total = sum(number)) |>
arrange(-total)
## # A tibble: 10,645 × 3
## # Groups: sex [2]
## sex name total
## <chr> <chr> <int>
## 1 M James 211907
## 2 M William 161425
## 3 F Mary 134082
## 4 M John 125765
## 5 M Robert 116316
## 6 M Michael 96018
## 7 M David 92204
## 8 M Charles 82400
## 9 M Christopher 63345
## 10 M Thomas 60457
## # ℹ 10,635 more rows
We can also use the decade column to summarize data. Let’s say we want to know the most common name for girls for each decade. First, we will filter to just get data for girls. Then, we can group for both decade and name. Lastly, we will summarize for the total babies born with each name.
birth |>
filter(sex == 'F')|>
group_by(decade, name) |>
summarize(total = sum(number))
## # A tibble: 19,936 × 3
## # Groups: decade [12]
## decade name total
## <int> <chr> <int>
## 1 1910 Abbie 12
## 2 1910 Ada 804
## 3 1910 Addie 837
## 4 1910 Adelaide 107
## 5 1910 Adele 52
## 6 1910 Adeline 21
## 7 1910 Adell 117
## 8 1910 Adelle 17
## 9 1910 Aggie 10
## 10 1910 Agnes 742
## # ℹ 19,926 more rows
We now have the number of times a name was given in every decade. We
can use the useful function slice_max which we give us just
the row that has the highest value in a specific column. We only want
the top max value, so can set n=1 in
slice_max(). We want where the total value is
the highest.
birth |>
filter(sex == 'F')|>
group_by(decade, name) |>
summarize(total = sum(number)) |>
slice_max(total, n = 1)
## `summarise()` has grouped output by 'decade'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 3
## # Groups: decade [12]
## decade name total
## <int> <chr> <int>
## 1 1910 Mary 16575
## 2 1920 Mary 28711
## 3 1930 Mary 24040
## 4 1940 Mary 23554
## 5 1950 Mary 16906
## 6 1960 Lisa 13392
## 7 1970 Jennifer 12523
## 8 1980 Jessica 12615
## 9 1990 Ashley 9128
## 10 2000 Madison 6899
## 11 2010 Emma 5990
## 12 2020 Olivia 2533
Activity:
For our last example, we will use data collected by this class in
2024 for the macroinvertebrate insects they found during a class period
in a nearby stream. First, we will read in the data from our class
website. I will call it invert
invert <- read.csv('https://maryglover.github.io/bio331/open_data/macroinvertebrate_sampling.csv')
Let’s again start by just looking at the data
head(invert)
## Group Taxa N
## 1 Group 1 Caddisfly larvae 10
## 2 Group 1 Gilled Snails 5
## 3 Group 1 Damselfly 5
## 4 Group 1 Sowbugs 2
## 5 Group 1 Aquatic worms 5
## 6 Group 1 Midge larvae 8
In this data, we have the different taxa that two different student groups collected and how many of each they collected.
Let’s say we want to know how many total insects each group collected to see if one group collected more than the other. Go ahead and try this summary on your own. You should get the following answer.
## # A tibble: 2 × 2
## Group total
## <chr> <int>
## 1 Group 1 46
## 2 Group 2 37
Now, let’s see how many total insects we collected from each taxa. This gives us the abundance of each differnt invertebrate.
invert |>
group_by(Taxa)|>
summarize(total = sum(N))
## # A tibble: 10 × 2
## Taxa total
## <chr> <int>
## 1 Aquatic worms 18
## 2 Blackfly larvae 11
## 3 Caddisfly larvae 10
## 4 Crayfish 8
## 5 Damselfly 7
## 6 Gilled Snails 5
## 7 Midge larvae 10
## 8 Riffle beetles 8
## 9 Snails 3
## 10 Sowbugs 3
One thing you can do with this data is make a bar graph to show the different insects that were collected in the stream.
First, we want to save the data table we made. Remember, as with
other dplyr functions, the functions do not change
the underlying data.
invert_summary <- invert |>
group_by(Taxa)|>
summarize(total = sum(N))
Then, we can use ggplot to plot the data with the layer
geom_bar.
library(ggplot2)
ggplot(invert_summary, aes(x = Taxa, y = total))+
geom_bar()
You will notice that this code does not work.
geom_bar requires an additional argument that indicates how
to determine the bar height. We just want the bar height to be the value
in the total column. To do this we must set
stat = "identity"
ggplot(invert_summary, aes(x = Taxa, y = total))+
geom_bar(stat = 'identity')
You will notice that this is hard to read with the axis labels. We can
fix this in two ways:
To rotate the axis labels, you must adjust the angle of the
axis.text.x in a theme() layer like so:
ggplot(invert_summary, aes(x = Taxa, y = total))+
geom_bar(stat = 'identity')+
theme(axis.text.x = element_text(angle = 90))
Alternatively, we can just switch the x and the y to plot in the other direction.
ggplot(invert_summary, aes(y = Taxa, x = total))+
geom_bar(stat = 'identity')
Lastly, let’s look at the species richness or how many
different taxa were collected. Here, we want to tally
the individual taxa in each group. We can do this again with the
n() function.
invert |>
group_by(Group)|>
summarize(richness = n())
## # A tibble: 2 × 2
## Group richness
## <chr> <int>
## 1 Group 1 7
## 2 Group 2 7
Let’s now go back to our City of Raleigh water quality data and use plots to explore the data a bit before we do some summaries.
First, we need to read in the data into R. Last class, you saved the water quality data, removing the E. coli outlier. You will need to read in this data from where you saved it.
wq <- read.csv('data/raleigh_water_analysis.csv')
In the water quality data, let’s look at how the streams differ in
Turbidity. We will calculate the turbidity average for each of the
different streams, using the column Site.
wq |>
group_by(Site) |>
summarize(mean_turb = mean(Turbidity_NTU))
What do you notice?
In this data, and in many other data sets, there are NA values,
meaning not applicable or not available. This is used when we do not
have information about a value. We can’t put zero when we don’t know,
because we cannot say that the value is zero. Because there are NAs in
our data set, we can’t say for sure that the maximum turbidity in the
streams is not in a stream that wasn’t measured. So the answer for the
maximum value is NA. To ignore the NAs to get the maximum turbidity
based on the values that we measured, we must add
na.rm = TRUE to the function.
wq |>
group_by(Site) |>
summarize(mean_turb = mean(Turbidity_NTU, na.rm = T))
## # A tibble: 18 × 2
## Site mean_turb
## <chr> <dbl>
## 1 BB2 6.51
## 2 BBS3 27.6
## 3 BDB1 7.47
## 4 CC4 17.8
## 5 CC5 21.0
## 6 HC7 13.3
## 7 HSC6 9.63
## 8 LBC8 55.3
## 9 MC10 11.8
## 10 MC9 13.7
## 11 PC11 14.1
## 12 PHB12 7.45
## 13 RB15 9.94
## 14 RC13 20.6
## 15 RC14 20.1
## 16 SC16 13.9
## 17 TC17 16.3
## 18 WC18 19.7
Now, what do you notice about the data?
You can use this table to match the Site codes with the stream name.
| V1 | V2 |
|---|---|
| BDB1—Beaverdam Branch | 35.824303, -78.650559 |
| BB2—Big Branch | 35.822637, -78.629735 |
| BBS3—Big Branch South | 35.743394, -78.568786 |
| CC4—Crabtree Creek | 35.843795, -78.726600 |
| CC5—Crabtree Creek | 35.791234, -78.588230 |
| HSC6—Hare Snipe Creek | 35.845044, -78.688466 |
| HC7—House Creek | 35.834072, -78.677525 |
| LBC8—Little Brier Creek | 35.889018, -78.796077 |
| MC9—Marsh Creek | 35.799051, -78.590593 |
| MC10—Mine Creek | 35.841903, -78.662271 |
| PC11—Perry Creek | 35.879771, -78.547979 |
| PHB12—Pigeon House Branch | 35.805995, -78.615268 |
| RC13—Richland Creek | 35.834244, -78.720481 |
| RC14—Richland Creek-WF | 35.945004, -78.552641 |
| RB15—Rocky Branch | 35.759904, -78.641103 |
| SC16—Sycamore Creek | 35.847404, -78.726468 |
| TC17—Turkey Creek | 35.848474, -78.722960 |
| WC18—Walnut Creek | 35.749248, -78.535679 |
Complete the homework exercises in the same R script that you used for class. Submit the R script with both class exercises and homework answers in the Moodle assignment.
precip column.sd) of the maximum temperatue (tmax) for each
month.group_bysummarizeslice_*sdgeom_barna.rm = Tn()