You can use this template script file to work through the lesson.
A main use of R is conducting statistical analyses with data. R has many functions and packages for running different statistical analyses. Just like with the other work you have done in R, different functions perform different tasks, and you specify how a function works with its arguments. You have control over the statistical tests you are performing, but you also have to know what you want to do.
In this lesson, we will assume a basic understanding of hypothesis testing. See the lesson slides for a review.
While there are multiple simple statistical tests to use with
experimental data to compare groups, we will only cover one, the
linear model. We will use the function lm(), which in R conveniently works with both numerical and categorical independent variables.
We will continue using the Raleigh temperature data set.
climate <- read.csv('data/raleigh_prism_climate.csv')
First, let’s say we are interested in learning about the pattern of precipitation in Raleigh. Specifically, we want to see if the mean temperature affects the precipitation.
Before we look at this statistically, it is good to graph the data to see what we are working with.
library(ggplot2)
ggplot(data = climate, aes(x = tmean, y = precip)) +
geom_point()
The graph shows us the relationship we are testing. We will use the
function lm() to make a linear model. The linear model will
“model” the “best fit line” of the relationship. Remember, a line
equation is \(y = mx + b\), where \(m\) is the slope and \(b\) is the intercept.
If the slope is 0, there is no relationship between the two variables. So in our linear regression, we are testing the null hypothesis that there is no relationship and the slope is equal to 0.
The function lm expects the data (data =)
and a formula which specifies the model. The formula is in the format
response ~ predictor(s), where the response and predictor
are columns in the dataset. The response is the dependent variable and
is always a numerical value, at least for this linear model. The
predictor is the independent variable and can be either numeric or
categorical. You can include multiple predictors by adding them,
i.e. response ~ predictor1 + predictor2.
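To make the two formula shapes concrete, here is a small sketch on simulated data (the variable names x1, x2, and y are made up for illustration; we have not loaded the climate data at this point):

```r
# Simulated data with two numeric predictors (made-up variables)
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(100)

# One predictor: response ~ predictor
m1 <- lm(y ~ x1, data = toy)

# Two predictors, combined with +: response ~ predictor1 + predictor2
m2 <- lm(y ~ x1 + x2, data = toy)

coef(m2)  # intercept plus one slope per predictor
```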
To fit the model, assign the result of lm() to a named R object.
precip_lm <- lm(data = climate, precip ~ tmean)
To view the model, use the summary() function.
summary(precip_lm)
##
## Call:
## lm(formula = precip ~ tmean, data = climate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.19 -36.15 -6.83 27.72 406.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.2792 5.3811 14.547 < 2e-16 ***
## tmean 1.2329 0.3053 4.039 6.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.52 on 514 degrees of freedom
## Multiple R-squared: 0.03076, Adjusted R-squared: 0.02887
## F-statistic: 16.31 on 1 and 514 DF, p-value: 6.199e-05
Here, you will find the results of the model. The model has 2 coefficients: the intercept and tmean, the “slope.” Each coefficient is tested against the null hypothesis that it equals 0. We are not concerned with whether the intercept of the best fit line is equal to 0, but with whether the slope of the line is different from 0. In this model, the best fit line has a slope of 1.23, and the p-value for that slope is 6.2e-05.
Using the criterion of p < 0.05, we can reject the null hypothesis that there is no relationship between mean temperature and precipitation. You can see the F statistic, degrees of freedom (DF), and p-value at the bottom. Here, the p-value of the full model and of tmean are the same, since we only included one predictor.
So, here we do see a “significant” relationship between temperature and precipitation. We also see that the relationship is not very strong. A slope of 1.23 is not very steep. You can also look at the R-squared value, which is “the proportion of the variation in the dependent variable that is predictable from the independent variable,” or the strength of the relationship. Here the adjusted R-squared is 0.02887, which is very low.
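If you want these numbers programmatically rather than reading them off the printout, note that summary() of an lm object returns a list you can index. A sketch on simulated data (for our model, you would index summary(precip_lm) the same way):

```r
# Simulated data standing in for the climate example
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 0.5 * toy$x + rnorm(50)
fit <- lm(y ~ x, data = toy)

s <- summary(fit)
s$r.squared        # multiple R-squared
s$adj.r.squared    # adjusted R-squared
s$coefficients     # the coefficient table, including p-values
```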
You can also show the linear model, with the best fit line, on the
plot in ggplot. From a linear model, you can extract the intercept and
slope using precip_lm$coeff[1] and precip_lm$coeff[2] ($coeff is a
shortened name that R matches to the model’s coefficients element).
precip_lm$coeff[1]
## (Intercept)
## 78.27923
precip_lm$coeff[2]
## tmean
## 1.232891
You can also add a line layer using geom_abline, which
expects a value for the intercept and slope.
ggplot(data = climate, aes(x = tmean, y = precip)) +
geom_point()+
geom_abline(intercept = precip_lm$coeff[1], slope = precip_lm$coeff[2], color = 'red')
As a shortcut for graphing, you could also use the layer
geom_smooth and specify method = 'lm', which will fit the linear
model and plot the line for you.
ggplot(data = climate, aes(x = tmean, y = precip)) +
geom_point()+
geom_smooth(method='lm')
## `geom_smooth()` using formula = 'y ~ x'
As a good practice, you should not display the best fit line if the model is not significant. Showing the line implies a relationship, which can be misleading when there is no evidence for one.
In the previous example, both the response and the predictor were
numeric. But what if you have a categorical predictor variable? In this
case, you can still run the function lm() but the
underlying statistical test will be different. In this case, you are
testing the null hypothesis that the means of the group are
equal.
Let’s say we want to see how the average temperature differs based on season.
First, we will add a column for season to the climate data frame, using
the function case_when() inside mutate().
library(dplyr)
climate_season <- climate |>
mutate(season = case_when((month %in% c(12, 1, 2)) ~ 'winter',
(month %in% c(3, 4, 5)) ~ 'spring',
(month %in% c(6, 7, 8)) ~ 'summer',
(month %in% c(9, 10, 11)) ~ 'fall'))
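Whenever you create a new categorical column like this, it is worth checking the result before modeling. table() counts the rows in each group; a sketch with a made-up vector of months:

```r
library(dplyr)

# Made-up months, standing in for climate$month
months <- rep(1:12, each = 2)
season <- case_when((months %in% c(12, 1, 2)) ~ 'winter',
                    (months %in% c(3, 4, 5)) ~ 'spring',
                    (months %in% c(6, 7, 8)) ~ 'summer',
                    (months %in% c(9, 10, 11)) ~ 'fall')

table(season)  # 3 months x 2 repeats = 6 rows per season here
sum(is.na(season))  # any month not matched by a condition would be NA
```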
Now, we can plot the seasons based on average temperature to see the relationships.
ggplot(data = climate_season, aes(x = season, y = tmean)) +
geom_boxplot() +
theme_classic()
To perform the statistical test, the function works the same as in the regression example before (response ~ predictor).
season_lm <- lm(data = climate_season, tmean ~ season)
summary(season_lm)
##
## Call:
## lm(formula = tmean ~ season, data = climate_season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9141 -1.8672 0.0333 2.0292 8.8739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.3551 0.3102 52.733 <2e-16 ***
## seasonspring -0.9190 0.4386 -2.095 0.0367 *
## seasonsummer 9.1542 0.4386 20.870 <2e-16 ***
## seasonwinter -10.2681 0.4386 -23.410 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.523 on 512 degrees of freedom
## Multiple R-squared: 0.7933, Adjusted R-squared: 0.7921
## F-statistic: 655.1 on 3 and 512 DF, p-value: < 2.2e-16
But you can see here that the output is a bit different. In the coefficients, you have the intercept coefficient and coefficients for the different seasons. But only three of the seasons are listed; fall is missing. This is because the intercept estimate is actually the mean for fall, which serves as the reference group for the other levels. The remaining estimates are not means themselves but differences from the fall mean. For example, the mean temperature of spring is 0.9190 less than the mean of fall. With categorical data, R is looking at the “treatment contrasts,” or the differences between the groups, instead of a best fit line.
To interpret the output, you can still look at the p-value in the last line. This is the p-value for the full model. Here, you see that it is less than 0.05, indicating that you can reject the null hypothesis that the means of the seasons are equal. To determine which groups differ from each other, additional pairwise tests would be needed, but we can say that season has an effect on temperature.
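One common follow-up is Tukey's HSD test, which compares every pair of groups and runs on an aov() fit of the same formula (for our data, that would be aov(tmean ~ season, data = climate_season)). A sketch on simulated groups:

```r
# Simulated data: three groups, one with a shifted mean
set.seed(7)
toy <- data.frame(group = factor(rep(c('a', 'b', 'c'), each = 20)),
                  value = c(rnorm(20, mean = 0),
                            rnorm(20, mean = 2),
                            rnorm(20, mean = 0)))

fit <- aov(value ~ group, data = toy)
TukeyHSD(fit)  # adjusted p-value for every pair of groups
```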
Let’s do a few more examples, exploring our class water quality data.
First, we need to import our water quality data that we have in our project workspace.
wq <- read.csv('data/raleigh_water_analysis.csv')
Let’s also go ahead and join in the data with the stream codes, for easier viewing of the different streams.
stream_codes <- read.csv('https://maryglover.github.io/bio331/open_data/stream_codes.csv')
wq <- stream_codes |>
right_join(wq)
## Joining with `by = join_by(Site)`
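If you are unsure what right_join() keeps, a tiny made-up example can help: it returns every row of the right (second) table, filling in NA where the left table has no match.

```r
library(dplyr)

# Made-up tables, standing in for stream_codes and wq
codes <- data.frame(Site = c('A', 'B'),
                    stream = c('Alpha Creek', 'Beta Creek'))
meas <- data.frame(Site = c('A', 'A', 'B', 'C'), value = 1:4)

codes |>
  right_join(meas, by = 'Site')  # all 4 rows of meas kept; Site 'C' gets NA stream
```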
Again, we will explore the relationship between water temperature and dissolved oxygen.
Activity: Model the relationship between water temperature and dissolved oxygen with lm() in R.

There are other relationships that we can explore with this data. While, theoretically, we could analyze every possible correlation, this would not be ethical. This is often called data dredging: searching the data until you find something that is significant. The more tests you run, the more likely you are to find one that is significant by chance.
Therefore, it is important to think about what analyses we want to complete before we perform our statistical tests.
Class activity: Come up with one other relationship to test.
For our last statistical test, let’s look to see if the turbidity levels differ in the different streams. Turbidity is the level of clarity in the water, which is influenced by the physical characteristics, like erosion, urban development/construction, etc.
turb_lm <- lm(data = wq, Turbidity_NTU ~ Site)
summary(turb_lm)
##
## Call:
## lm(formula = Turbidity_NTU ~ Site, data = wq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.33 -11.31 -6.52 -2.68 584.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5131 4.7561 1.369 0.17116
## SiteBBS3 21.0984 6.7262 3.137 0.00175 **
## SiteBDB1 0.9554 6.7262 0.142 0.88707
## SiteCC4 11.2803 6.7262 1.677 0.09382 .
## SiteCC5 14.4967 6.7262 2.155 0.03136 *
## SiteHC7 6.7459 6.7262 1.003 0.31612
## SiteHSC6 3.1164 6.7262 0.463 0.64323
## SiteLBC8 48.8131 6.7262 7.257 7.54e-13 ***
## SiteMC10 5.3213 6.7262 0.791 0.42904
## SiteMC9 7.2344 6.7262 1.076 0.28236
## SitePC11 7.5820 6.7262 1.127 0.25990
## SitePHB12 0.9344 6.7262 0.139 0.88954
## SiteRB15 3.4295 6.7262 0.510 0.61024
## SiteRC13 14.0541 6.7262 2.089 0.03690 *
## SiteRC14 13.5607 6.7262 2.016 0.04404 *
## SiteSC16 7.4131 6.7262 1.102 0.27065
## SiteTC17 9.8082 6.7262 1.458 0.14507
## SiteWC18 13.1672 6.7262 1.958 0.05053 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.15 on 1080 degrees of freedom
## (234 observations deleted due to missingness)
## Multiple R-squared: 0.07892, Adjusted R-squared: 0.06443
## F-statistic: 5.444 on 17 and 1080 DF, p-value: 7.843e-12
Here, we do see that the p-value is quite small, 7.8e-12, providing evidence that the streams do differ in turbidity. Again, we can look at this with a graph.
ggplot(data = wq, aes(x = Site, y = Turbidity_NTU)) +
geom_boxplot() +
theme_classic()
## Warning: Removed 234 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Go back to Part 2 of your homework from the advanced R lesson. In this exercise, you will use the data that you created, where you joined the precipitation data to the water quality data.
Functions used in this lesson: lm(), geom_abline(), geom_smooth()