You can use this template script file to work through the lesson.

Introduction to statistics in R

A main use of R is conducting statistical analyses with data. R has many functions and packages for running different statistical analyses. Just like with the other work you have done in R, different functions perform different tasks, and you specify how each function works with its arguments. You have control over the statistical tests you are performing, but you also have to know what you want to do.

In this lesson, we will assume a basic understanding of hypothesis testing. See the lesson slides for a review.

Linear regressions in R

While there are multiple simple statistical tests for comparing groups in experimental data, we will cover only one: the linear model, fit with the function lm(). Conveniently, lm() can work with both numerical and categorical independent variables.

We will continue using the Raleigh temperature data set.

climate <- read.csv('data/raleigh_prism_climate.csv')

First, let’s say we are interested in learning about the pattern of precipitation in Raleigh. Specifically, we want to see if the mean temperature affects the precipitation.

Before we look at this statistically, it is good to graph the data to see what we are working with.

library(ggplot2)

ggplot(data = climate, aes(x = tmean, y = precip)) + 
  geom_point()

The graph shows us the relationship we are testing. We will use the function lm() to make a linear model. The linear model will “model” the “best fit line” of the relationship. Remember, a line equation is \(y = mx + b\), where:

  • y is the dependent variable
  • x is the independent variable
  • m is the slope of the line
  • b is the intercept of the line

If the slope is 0, there is no relationship between the two variables, so in our linear regression we are testing the null hypothesis that there is no relationship and the slope is equal to 0.
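
In symbols, we are testing the null hypothesis \(H_0: m = 0\) against the alternative \(H_A: m \neq 0\).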

The function lm() expects the data (data =) and a formula which specifies the model. The formula is in the format response ~ predictor(s), where the response and predictor are columns in the dataset. The response is the dependent variable and is always numeric, at least for this linear model. The predictor is the independent variable and can be either numeric or categorical. You can include multiple predictors by adding them, e.g. response ~ predictor1 + predictor2
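
For example, if we also wanted to include the month column (used later in this lesson) as a second predictor, the formula would look like this. This is shown only to illustrate the syntax; below we fit a single-predictor model.

# illustration only: a formula with two predictors
lm(data = climate, precip ~ tmean + month)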

To fit the model, assign the output of lm() to a named R object.

precip_lm <- lm(data = climate, precip ~ tmean)

To view the model, use the summary() function.

summary(precip_lm)
## 
## Call:
## lm(formula = precip ~ tmean, data = climate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -99.19 -36.15  -6.83  27.72 406.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  78.2792     5.3811  14.547  < 2e-16 ***
## tmean         1.2329     0.3053   4.039  6.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.52 on 514 degrees of freedom
## Multiple R-squared:  0.03076,    Adjusted R-squared:  0.02887 
## F-statistic: 16.31 on 1 and 514 DF,  p-value: 6.199e-05

Here, you will find the results of the model. The model has 2 coefficients, the intercept and tmean, or the “slope.” Both of these test whether the coefficient is equal to 0. We are not concerned with whether the intercept of the best fit line is equal to 0, but with whether the slope of the line is different from 0. In this model, the best fit line has a slope of 1.23. The p-value is 6.2e-05.

Using the criterion of p < 0.05, we can reject the null hypothesis that there is no relationship between mean temperature and precipitation. You can see the F statistic, degrees of freedom (DF), and p-value at the bottom. Here, the p-value of the full model and of tmean are the same, as we only included one predictor.

So, here we do see a “significant” relationship between temperature and precipitation. We also see that the relationship is not very strong. A slope of 1.23 is not very steep. You can also look at the R-squared value, which is “the proportion of the variation in the dependent variable that is predictable from the independent variable,” or the strength of the relationship. Here the adjusted R-squared is 0.02887, which is very low.
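
If you want to pull these numbers out of the model programmatically instead of reading them off the printout, the summary object stores them. Here is a minimal sketch using base R accessors:

# extract the slope's p-value and the adjusted R-squared from the summary
summary(precip_lm)$coefficients['tmean', 'Pr(>|t|)']
summary(precip_lm)$adj.r.squared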

Graphing the linear model

You can also show the linear model, with the best fit line, on the plot in ggplot. From a linear model object, you can extract the intercept and slope coefficients using precip_lm$coeff[1] and precip_lm$coeff[2].

precip_lm$coeff[1]
## (Intercept) 
##    78.27923
precip_lm$coeff[2]
##    tmean 
## 1.232891

You can then add a line layer using geom_abline, which expects values for the intercept and slope.

ggplot(data = climate, aes(x = tmean, y = precip)) + 
  geom_point()+
  geom_abline(intercept = precip_lm$coeff[1], slope = precip_lm$coeff[2], color = 'red')

As a shortcut for graphing, you could also use the layer geom_smooth and specify method = 'lm', which will fit the linear model and plot the best fit line for you.

ggplot(data = climate, aes(x = tmean, y = precip)) + 
  geom_point()+
  geom_smooth(method='lm')
## `geom_smooth()` using formula = 'y ~ x'
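
By default, geom_smooth() also draws a shaded confidence band around the line. If you only want the line itself, you can turn the band off with se = FALSE:

ggplot(data = climate, aes(x = tmean, y = precip)) + 
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)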

As a good practice, you should not display the best fit line if the model is not significant. Showing the line implies a relationship, which can be misleading when there is no evidence for one.
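
One way to follow this practice in code is to check the slope's p-value first and only add the line layer when it passes your cutoff. This is a minimal sketch; the object names p_value and temp_plot are just placeholders:

# only add the best fit line when the slope's p-value is below 0.05
p_value <- summary(precip_lm)$coefficients['tmean', 'Pr(>|t|)']
temp_plot <- ggplot(data = climate, aes(x = tmean, y = precip)) +
  geom_point()
if (p_value < 0.05) {
  temp_plot <- temp_plot + geom_smooth(method = 'lm')
}
temp_plot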

Linear models with categorical variables

In the previous example, both the response and the predictor were numeric. But what if you have a categorical predictor variable? You can still use the function lm(), but the underlying statistical test will be different. In this case, you are testing the null hypothesis that the means of the groups are equal.

Let’s say we want to see how the average temperature differs by season.

First, we will add a column for season to the climate data frame. I will use the function case_when() inside mutate() to do this.

library(dplyr)
climate_season <- climate |>
  mutate(season = case_when((month %in% c(12, 1, 2)) ~ 'winter', 
                           (month %in% c(3, 4, 5)) ~ 'spring',
                           (month %in% c(6, 7, 8)) ~ 'summer', 
                           (month %in% c(9, 10, 11)) ~ 'fall'))
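
It is worth checking that every month was assigned a season; any month that matched none of the conditions would get an NA. A quick check is to count the rows in each season with dplyr's count():

# each of the four seasons should appear, with no NA group
count(climate_season, season)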

Now, we can plot average temperature by season to see the relationships.

ggplot(data = climate_season, aes(x = season, y = tmean)) +
  geom_boxplot() +
  theme_classic()

To perform the statistical test, the function works the same as the regression from the example before (response ~ predictor).

season_lm <- lm(data = climate_season, tmean ~ season)
summary(season_lm)
## 
## Call:
## lm(formula = tmean ~ season, data = climate_season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9141 -1.8672  0.0333  2.0292  8.8739 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   16.3551     0.3102  52.733   <2e-16 ***
## seasonspring  -0.9190     0.4386  -2.095   0.0367 *  
## seasonsummer   9.1542     0.4386  20.870   <2e-16 ***
## seasonwinter -10.2681     0.4386 -23.410   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.523 on 512 degrees of freedom
## Multiple R-squared:  0.7933, Adjusted R-squared:  0.7921 
## F-statistic: 655.1 on 3 and 512 DF,  p-value: < 2.2e-16

But, you can see here that the output is a bit different. In the coefficients, you have the intercept coefficient and coefficients for the different seasons. But only three of the seasons are listed – fall is missing. This is because the intercept estimate is actually the mean for fall. The rest of the estimates are the differences between each season’s mean and the fall mean; fall is the reference group for the other levels. For example, the mean temperature of spring is 0.9190 less than the mean of fall. With categorical data, R is looking at the “treatment contrasts,” or the differences between the groups, instead of a best fit line.
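
R picks the reference group alphabetically, which is why fall is the intercept here. If you wanted a different reference group, you could relevel the factor before fitting. A sketch, making winter the reference instead:

# make winter the reference level, then refit the model
climate_season$season <- relevel(factor(climate_season$season), ref = 'winter')
season_lm_winter <- lm(data = climate_season, tmean ~ season)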

To interpret the output, you could still look at the p-value in the last line. This is the p-value for the full model. Here, you see that it is less than 0.05, indicating that you can reject the null hypothesis that the means of the seasons are equal. To determine which seasons differ “significantly” from each other, other tests would be needed, but we can say that season has an effect on the temperature.
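
One common follow-up, which we will not cover in depth, is a post-hoc test that compares every pair of groups, such as Tukey’s HSD. A sketch using base R; TukeyHSD() works on a model refit with aov():

# pairwise comparisons of the season means
TukeyHSD(aov(tmean ~ season, data = climate_season))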

Water Quality Data

Let’s do a few more examples, exploring our class water quality data.

First, we need to import our water quality data that we have in our project workspace.

wq <- read.csv('data/raleigh_water_analysis.csv')

Let’s also go ahead and join the data with the stream codes, for easier viewing of the different streams.

stream_codes <- read.csv('https://maryglover.github.io/bio331/open_data/stream_codes.csv')

wq <- stream_codes |>
  right_join(wq)
## Joining with `by = join_by(Site)`

Again, we will explore the relationship between water temperature and dissolved oxygen.

Activity:

  1. What is the null hypothesis?
  2. Create a lm model in R.
  3. Interpret the results. Is there a relationship between the two?
  4. Graph the two variables using ggplot.
  5. Add the best fit line to the graph.

There are other relationships that we can explore with this data. While, theoretically, we could analyze every possible correlation, this would not be ethical. This is often called data dredging: searching the data until you find something that is significant. The more tests you do, the more likely you are to find that one is significant by chance.
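
To see why, suppose you ran 20 independent tests at \(p < 0.05\) when no real relationships exist. The chance of at least one false positive is \(1 - 0.95^{20}\), which you can compute in R:

# probability of at least one false positive across 20 independent tests
1 - 0.95^20
## [1] 0.6415141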

Therefore, it is important to think about what analyses we want to complete before we run our statistical tests.

Class activity: Come up with one other relationship to test.

For our last statistical test, let’s look to see if the turbidity levels differ in the different streams. Turbidity is the cloudiness of the water (a lack of clarity), which is influenced by physical characteristics like erosion and urban development/construction.

turb_lm <- lm(data = wq, Turbidity_NTU ~ Site)
summary(turb_lm)
## 
## Call:
## lm(formula = Turbidity_NTU ~ Site, data = wq)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52.33 -11.31  -6.52  -2.68 584.39 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.5131     4.7561   1.369  0.17116    
## SiteBBS3     21.0984     6.7262   3.137  0.00175 ** 
## SiteBDB1      0.9554     6.7262   0.142  0.88707    
## SiteCC4      11.2803     6.7262   1.677  0.09382 .  
## SiteCC5      14.4967     6.7262   2.155  0.03136 *  
## SiteHC7       6.7459     6.7262   1.003  0.31612    
## SiteHSC6      3.1164     6.7262   0.463  0.64323    
## SiteLBC8     48.8131     6.7262   7.257 7.54e-13 ***
## SiteMC10      5.3213     6.7262   0.791  0.42904    
## SiteMC9       7.2344     6.7262   1.076  0.28236    
## SitePC11      7.5820     6.7262   1.127  0.25990    
## SitePHB12     0.9344     6.7262   0.139  0.88954    
## SiteRB15      3.4295     6.7262   0.510  0.61024    
## SiteRC13     14.0541     6.7262   2.089  0.03690 *  
## SiteRC14     13.5607     6.7262   2.016  0.04404 *  
## SiteSC16      7.4131     6.7262   1.102  0.27065    
## SiteTC17      9.8082     6.7262   1.458  0.14507    
## SiteWC18     13.1672     6.7262   1.958  0.05053 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.15 on 1080 degrees of freedom
##   (234 observations deleted due to missingness)
## Multiple R-squared:  0.07892,    Adjusted R-squared:  0.06443 
## F-statistic: 5.444 on 17 and 1080 DF,  p-value: 7.843e-12

Here, we see that the p-value is quite small, 7.8e-12, providing evidence that the streams do differ in turbidity. Again, we can look at this with a graph.

ggplot(data = wq, aes(x = Site, y = Turbidity_NTU)) +
  geom_boxplot() +
  theme_classic()
## Warning: Removed 234 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
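
To see which sites drive the difference, it can help to look at a quick summary table of mean turbidity per site. A sketch using dplyr, which we loaded earlier:

# mean turbidity at each site, ignoring missing values
wq |>
  group_by(Site) |>
  summarize(mean_turbidity = mean(Turbidity_NTU, na.rm = TRUE))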

Exercise

Go back to your Part 2 from your homework in the advanced R lesson. In this exercise, you will use the data that you created, where you joined the precipitation data to the water quality data.

  1. Perform a statistical test to determine if there is a relationship between precipitation and turbidity. In this test, the response is turbidity and the predictor is precipitation.
  2. Summarize the result of the statistical test. Is the result significant? If so, is the relationship positive or negative?
  3. Make a graph of the relationship between precipitation and turbidity. Make it pretty!
  4. If the result was significant, add a best fit line to the graph.
  5. Write a 3-part figure caption with 1) A descriptive title 2) a methods statement and 3) a summary of the main result.

New Functions

  • lm()
  • geom_abline()
  • geom_smooth()