Data manipulation

R packages

So far, we have worked with base R functions, which are available as soon as you open R. However, there is a world of other functions out there. You can create your own functions or use ones that others have made. An R package is a group of functions that can be loaded and used in R. Anyone can make an R package, but the Comprehensive R Archive Network (CRAN) hosts a wide variety of packages that go through certain checks. These packages can be installed and loaded into your own R space for you to use. Most packages are a group of functions for a specific use, for example, DNA sequence analysis, diversity analyses, reading maps, etc.

We are going to use the dplyr package to organize and manipulate data in this lesson.

There are two ways to install a package.

One, you can go to the Packages tab in the lower right hand side of RStudio, where your files are. Then, click Install and find the package you want to install. Keep the checkmark checked for “Install dependencies.” Some packages use or depend on functions from other packages. This will install all additional packages you need.

This will run the code install.packages(). The second way is to simply run install.packages() in the console yourself, with the name of the package in quotation marks.

For example, let’s say I want the RColorBrewer package. This package adds different colors and palettes that you can use in R. To install the package, I would run:

install.packages("RColorBrewer")

Now, whenever I want to use functions from this package, they are installed in my R library. You only ever have to install a package once.

Once you have installed a package, it is there on your computer. But R doesn’t “load” packages unless you tell it to. You can do this by using the library() function.

library(RColorBrewer)

Now you can use the functions in RColorBrewer.

RColorBrewer has many different color palettes. Here, I want to look at a combination of 3 colors in the Set2 palette.

display.brewer.pal(n = 3, name = 'Set2')

Every time you open R, you do have to load the packages you want to use using the library() function.
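The install-once, load-every-session pattern can be sketched in a script as follows. This is only an illustration: is_installed() and install_if_missing() are hypothetical helper names made up for this example, not functions from R or any package.

```r
# Hypothetical helper (invented for this sketch): TRUE if a package is
# already installed on this computer.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical helper: install a package only when it is missing,
# so a script does not reinstall it every time it runs.
install_if_missing <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
}

# "stats" ships with R, so this check works on any installation.
is_installed("stats")
```

After a call such as install_if_missing("RColorBrewer"), you would still need library(RColorBrewer) in each new R session.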

Using dplyr

We are going to use the dplyr package, which is widely used for managing, manipulating, and summarizing data.

This package has already been installed for you in the Learning R project.

To use dplyr, you will need to load in the package. Go ahead and do that now, based on what you just learned!

Climate analysis

To highlight the basic features of dplyr, we will continue with the climate data from Raleigh. This data is from the PRISM Climate Group ¹, which is part of Oregon State University and creates datasets using weather stations across the United States. The data sets include total precipitation, maximum temperature, minimum temperature, mean temperature, dew point temperature, vapor pressure deficit, and measurements of global shortwave solar radiation.

The data we are working with is for Raleigh, NC and includes minimum temperature, maximum temperature, mean temperature and precipitation for each month from 1981 to 2023.

First, we will import that data. Again, this data is in the data directory in your Learning R project. I will assign the data the name climate.

climate <- read.csv("data/raleigh_prism_climate.csv")
head(climate)

##   year month   tmean    tmin    tmax   precip
## 1 2023     7 27.2688 22.1419 32.3958 114.6551
## 2 2023     8 26.3314 20.8735 31.7895  94.6812
## 3 2023     9 22.3231 16.8723 27.7740 125.5386
## 4 2023    10 17.2026 11.0189 23.3865  32.2431
## 5 2023    11 10.1585  3.0728 17.2443  58.3917
## 6 2023    12  8.3325  1.5722 15.0930 185.7540
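After importing a file, it can help to confirm its size and column types before doing anything else. The sketch below assumes nothing about the Raleigh file: it writes a tiny made-up table to a temporary file and reads it back, so the numbers are invented for illustration.

```r
# Write a tiny made-up table to a temporary file so the sketch is self-contained.
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(year = c(2023, 2023), month = c(7, 8), precip = c(114.7, 94.7)),
  tmp,
  row.names = FALSE
)

# Read it back in the same way the lesson reads the climate data.
example <- read.csv(tmp)

dim(example)       # number of rows, then number of columns
str(example)       # each column name and the type R chose for it
colnames(example)  # just the column names
```

The same three checks can be run on climate once it has been read in.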

Pipes

The data set is too large to show in full in the R console, so the function head() is useful for showing just the first 6 rows. One very useful tool in R, which dplyr makes good use of, is the “pipe”. Pipes allow you to “pipe” data into a function. This is very helpful when running many functions on a dataset.

The pipe syntax is |>. You may also see %>%, an older pipe that comes from the magrittr package. The pipe takes the output of the previous code and uses it as the first argument in the next function. In dplyr, as with many functions, the first argument is the data.

For example,

climate |> head()

##   year month   tmean    tmin    tmax   precip
## 1 2023     7 27.2688 22.1419 32.3958 114.6551
## 2 2023     8 26.3314 20.8735 31.7895  94.6812
## 3 2023     9 22.3231 16.8723 27.7740 125.5386
## 4 2023    10 17.2026 11.0189 23.3865  32.2431
## 5 2023    11 10.1585  3.0728 17.2443  58.3917
## 6 2023    12  8.3325  1.5722 15.0930 185.7540

Pipes are much easier to read when each function is on its own line.

climate |>
  head()

##   year month   tmean    tmin    tmax   precip
## 1 2023     7 27.2688 22.1419 32.3958 114.6551
## 2 2023     8 26.3314 20.8735 31.7895  94.6812
## 3 2023     9 22.3231 16.8723 27.7740 125.5386
## 4 2023    10 17.2026 11.0189 23.3865  32.2431
## 5 2023    11 10.1585  3.0728 17.2443  58.3917
## 6 2023    12  8.3325  1.5722 15.0930 185.7540
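To see what the pipe is doing, it can help to compare a piped expression with the equivalent nested call. This is a small sketch using base R functions only and a made-up vector; it assumes R 4.1 or later, where |> is available.

```r
# Requires R 4.1 or later, where the native pipe |> was introduced.
values <- c(4, 9, 16)   # made-up numbers for illustration

nested <- sum(sqrt(values))          # read from the inside out
piped  <- values |> sqrt() |> sum()  # read from left to right

identical(nested, piped)

# The piped data fills the first argument of head(),
# so any other argument is passed by name.
first_two <- values |> head(n = 2)
first_two
```

Both forms run the same functions in the same order; the pipe only changes how the code reads.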

Filter

To explore the data a bit, we can use dplyr functions filter and select to view only certain rows or columns.

The filter function takes the data and the criteria for keeping rows, based on the values in a specific column. For example, you could “filter” the climate data to show only rows where the year is 2023 with the argument year == 2023. Here, because the year is a number, you do not need quotation marks. If you are filtering by a character value, you would need quotation marks (ex. city == "Raleigh").

filter(climate, year == 2023)

##    year month   tmean    tmin    tmax   precip
## 1  2023     7 27.2688 22.1419 32.3958 114.6551
## 2  2023     8 26.3314 20.8735 31.7895  94.6812
## 3  2023     9 22.3231 16.8723 27.7740 125.5386
## 4  2023    10 17.2026 11.0189 23.3865  32.2431
## 5  2023    11 10.1585  3.0728 17.2443  58.3917
## 6  2023    12  8.3325  1.5722 15.0930 185.7540
## 7  2023     1  8.9147  3.3799 14.4497  93.7049
## 8  2023     2 10.7567  4.9577 16.5557  72.8692
## 9  2023     3 11.6672  5.2850 18.0496  75.1907
## 10 2023     4 16.8021 10.3899 23.2143 192.0933
## 11 2023     5 18.7094 13.4401 23.9789  79.6373
## 12 2023     6 23.0194 17.5976 28.4413  95.8790

To use the pipe, you could also do the following:

climate |>
  filter(year == 2023)

##    year month   tmean    tmin    tmax   precip
## 1  2023     7 27.2688 22.1419 32.3958 114.6551
## 2  2023     8 26.3314 20.8735 31.7895  94.6812
## 3  2023     9 22.3231 16.8723 27.7740 125.5386
## 4  2023    10 17.2026 11.0189 23.3865  32.2431
## 5  2023    11 10.1585  3.0728 17.2443  58.3917
## 6  2023    12  8.3325  1.5722 15.0930 185.7540
## 7  2023     1  8.9147  3.3799 14.4497  93.7049
## 8  2023     2 10.7567  4.9577 16.5557  72.8692
## 9  2023     3 11.6672  5.2850 18.0496  75.1907
## 10 2023     4 16.8021 10.3899 23.2143 192.0933
## 11 2023     5 18.7094 13.4401 23.9789  79.6373
## 12 2023     6 23.0194 17.5976 28.4413  95.8790

Important syntax

== : equals
> : greater than
< : less than
!= : not equal to
>= : greater than or equal to
<= : less than or equal to
| : or
%in% : matches contents of a vector
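The operators in this list can be tried out on a small table. The sketch below uses a made-up data frame called toy (the cities and temperatures are invented, not taken from the Raleigh data) and assumes dplyr is installed.

```r
library(dplyr)

# Toy data frame invented for illustration (not the Raleigh data).
toy <- data.frame(
  city = c("Raleigh", "Durham", "Raleigh", "Cary"),
  year = c(2021, 2022, 2023, 2023),
  tmax = c(30.1, 31.4, 32.4, 29.8)
)

# Character values need quotation marks; numbers do not.
raleigh_rows <- toy |> filter(city == "Raleigh")
recent_rows  <- toy |> filter(year >= 2022)
not_raleigh  <- toy |> filter(city != "Raleigh")

# Keep a row when either condition is true.
either_rows <- toy |> filter(year == 2021 | tmax > 32)

# Keep a row when the value matches anything in the vector.
nearby_rows <- toy |> filter(city %in% c("Durham", "Cary"))

raleigh_rows
nearby_rows
```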

You can also filter by multiple criteria. Let’s say you want only the January rows for years before 2020.

climate |>
  filter(year < 2020, month == 1)

##    year month    tmean   tmin   tmax  precip
## 1  1981     1 1.562000 -5.320  8.445  29.154
## 2  1982     1 1.113000 -4.439  6.667 146.087
## 3  1983     1 3.644000 -1.235  8.525  48.100
## 4  1984     1 2.811000 -2.922  8.544 112.225
## 5  1985     1 1.411000 -4.049  6.872  99.806
## 6  1986     1 3.948000 -2.252 10.150  46.020
## 7  1987     1 3.571000 -1.830  8.972 186.841
## 8  1988     1 1.177000 -4.130  6.486  87.408
## 9  1989     1 7.110000  1.335 12.885  47.079
## 10 1990     1 9.033000  2.490 15.577  90.624
## 11 1991     1 5.731000  0.352 11.112 122.571
## 12 1992     1 5.908000  0.085 11.732 111.145
## 13 1993     1 6.259000  0.952 11.567 120.780
## 14 1994     1 2.742000 -3.353  8.836 101.831
## 15 1995     1 5.645000  0.157 11.134 134.382
## 16 1996     1 3.401000 -1.898  8.702 120.611
## 17 1997     1 5.218000 -0.588 11.025  79.732
## 18 1998     1 7.138000  2.077 12.202 209.320
## 19 1999     1 7.813000  1.056 14.572 148.250
## 20 2000     1 3.943000 -1.812  9.700 147.929
## 21 2001     1 4.572000 -1.656 10.801  36.725
## 22 2002     1 6.148000  0.006 12.292 137.531
## 23 2003     1 2.740000 -2.845  8.326  59.701
## 24 2004     1 3.375000 -2.477  9.228  33.850
## 25 2005     1 6.443000  0.993 11.894  74.003
## 26 2006     1 8.529000  2.669 14.391  52.648
## 27 2007     1 7.409000  1.556 13.264  75.280
## 28 2008     1 4.879000 -1.096 10.854  26.438
## 29 2009     1 3.961000 -1.455  9.379  68.061
## 30 2010     1 3.337000 -2.505  9.180 117.815
## 31 2011     1 2.982000 -2.651  8.617  37.024
## 32 2012     1 6.749001  0.346 13.154  63.561
## 33 2013     1 6.322001  0.952 11.693  79.784
## 34 2014     1 2.335000 -4.235  8.906  61.532
## 35 2015     1 3.754000 -2.142  9.649 104.963
## 36 2016     1 3.643000 -1.692  8.980  68.965
## 37 2017     1 7.140000  2.029 12.252  94.333
## 38 2018     1 2.235000 -4.070  8.542  81.913
## 39 2019     1 5.972000  0.345 11.599  77.125

You can also use the “|” sign for “or”.

climate |>
  filter(year == 2023 | year == 1985)

##    year month   tmean    tmin    tmax   precip
## 1  2023     7 27.2688 22.1419 32.3958 114.6551
## 2  2023     8 26.3314 20.8735 31.7895  94.6812
## 3  2023     9 22.3231 16.8723 27.7740 125.5386
## 4  2023    10 17.2026 11.0189 23.3865  32.2431
## 5  2023    11 10.1585  3.0728 17.2443  58.3917
## 6  2023    12  8.3325  1.5722 15.0930 185.7540
## 7  1985     1  1.4110 -4.0490  6.8720  99.8060
## 8  1985     2  5.5930 -0.2060 11.3930 123.8080
## 9  1985     3 11.8720  4.4480 19.2970  33.3370
## 10 1985     4 17.6430  9.6020 25.6850  11.7330
## 11 1985     5 20.4780 13.9350 27.0210  78.6880
## 12 1985     6 24.3130 17.7360 30.8910  77.4210
## 13 1985     7 25.3010 20.0430 30.5600 134.8740
## 14 1985     8 24.2630 18.8350 29.6920 122.9370
## 15 1985     9 21.5910 15.2500 27.9330  13.5910
## 16 1985    10 18.2420 12.4870 23.9970  68.5060
## 17 1985    11 15.3670  9.7730 20.9630 183.4270
## 18 1985    12  4.3550 -2.1120 10.8220  33.3070
## 19 2023     1  8.9147  3.3799 14.4497  93.7049
## 20 2023     2 10.7567  4.9577 16.5557  72.8692
## 21 2023     3 11.6672  5.2850 18.0496  75.1907
## 22 2023     4 16.8021 10.3899 23.2143 192.0933
## 23 2023     5 18.7094 13.4401 23.9789  79.6373
## 24 2023     6 23.0194 17.5976 28.4413  95.8790

Lastly, let’s say you only want the odd years in the 1990s. You could use %in%, which keeps any row whose value is “in” a given vector. Remember that c() makes a vector.

climate |>
  filter(year %in% c(1991, 1993, 1995, 1997, 1999))

##    year month     tmean   tmin   tmax  precip
## 1  1991     1  5.731000  0.352 11.112 122.571
## 2  1991     2  8.296000  1.697 14.895  15.668
## 3  1991     3 12.244000  5.843 18.647 128.508
## 4  1991     4 16.723001 10.559 22.887  42.827
## 5  1991     5 22.355001 16.536 28.177  90.951
## 6  1991     6 24.433001 18.354 30.513 113.636
## 7  1991     7 27.254002 22.124 32.385 181.898
## 8  1991     8 25.129002 20.266 29.992 123.393
## 9  1991     9 21.965000 15.755 28.175  56.753
## 10 1991    10 15.955001  9.038 22.873  37.623
## 11 1991    11  9.323001  2.215 16.433  25.796
## 12 1991    12  8.050000  1.058 15.044  80.302
## 13 1993     1  6.259000  0.952 11.567 120.780
## 14 1993     2  4.765000 -1.784 11.314  55.441
## 15 1993     3  8.510000  2.249 14.773 164.502
## 16 1993     4 13.932001  6.625 21.240 119.347
## 17 1993     5 20.693001 14.322 27.065  90.104
## 18 1993     6 24.626001 18.249 31.003  20.048
## 19 1993     7 28.080002 21.872 34.290  74.712
## 20 1993     8 25.542002 19.577 31.509  62.508
## 21 1993     9 23.101002 17.043 29.161 109.959
## 22 1993    10 15.186001  8.829 21.544  99.568
## 23 1993    11 10.550000  3.890 17.212  76.211
## 24 1993    12  4.532000 -1.430 10.494  84.839
## 25 1995     1  5.645000  0.157 11.134 134.382
## 26 1995     2  4.585000 -1.044 10.215 137.375
## 27 1995     3 11.340000  4.228 18.453  98.181
## 28 1995     4 15.916000  8.687 23.145  33.417
## 29 1995     5 20.113001 13.973 26.254  90.784
## 30 1995     6 23.166000 18.247 28.087 258.510
## 31 1995     7 26.499001 20.869 32.131  73.495
## 32 1995     8 26.334002 20.810 31.860 123.741
## 33 1995     9 21.195002 16.302 26.089  89.748
## 34 1995    10 17.236000 10.898 23.575 247.161
## 35 1995    11  8.073000  1.920 14.227 123.340
## 36 1995    12  3.516000 -2.605  9.639  44.453
## 37 1997     1  5.218000 -0.588 11.025  79.732
## 38 1997     2  8.106000  2.146 14.068  75.188
## 39 1997     3 12.844001  5.593 20.097  87.205
## 40 1997     4 13.412001  6.666 20.161 135.245
## 41 1997     5 18.079000 10.812 25.348  53.624
## 42 1997     6 22.141001 16.401 27.882 103.663
## 43 1997     7 26.699001 20.889 32.511 124.724
## 44 1997     8 24.546001 18.329 30.763  32.709
## 45 1997     9 21.785002 16.134 27.436  68.137
## 46 1997    10 15.900001  9.592 22.209  75.728
## 47 1997    11  8.671000  2.670 14.674  89.114
## 48 1997    12  5.570000  0.086 11.055  86.881
## 49 1999     1  7.813000  1.056 14.572 148.250
## 50 1999     2  7.329000  0.451 14.208  44.975
## 51 1999     3  8.551001  1.469 15.636 104.431
## 52 1999     4 16.349001  9.623 23.076  67.906
## 53 1999     5 19.112001 12.644 25.581  32.010
## 54 1999     6 23.180000 17.775 28.586  44.236
## 55 1999     7 26.694002 21.230 32.159  75.205
## 56 1999     8 26.580002 20.564 32.598 108.969
## 57 1999     9 20.672001 15.546 25.799 509.944
## 58 1999    10 14.967001  9.012 20.924  98.421
## 59 1999    11 12.886001  5.815 19.958  39.651
## 60 1999    12  6.398000 -0.079 12.876  60.766

NOTE: Notice in each of the previous examples your data frame climate remains unchanged. You have just been viewing the filtered data. If you want to save the filtered data you would have to assign it to a new variable.

climate2023 <- climate |>
  filter(year == 2023)

climate2023

##    year month   tmean    tmin    tmax   precip
## 1  2023     7 27.2688 22.1419 32.3958 114.6551
## 2  2023     8 26.3314 20.8735 31.7895  94.6812
## 3  2023     9 22.3231 16.8723 27.7740 125.5386
## 4  2023    10 17.2026 11.0189 23.3865  32.2431
## 5  2023    11 10.1585  3.0728 17.2443  58.3917
## 6  2023    12  8.3325  1.5722 15.0930 185.7540
## 7  2023     1  8.9147  3.3799 14.4497  93.7049
## 8  2023     2 10.7567  4.9577 16.5557  72.8692
## 9  2023     3 11.6672  5.2850 18.0496  75.1907
## 10 2023     4 16.8021 10.3899 23.2143 192.0933
## 11 2023     5 18.7094 13.4401 23.9789  79.6373
## 12 2023     6 23.0194 17.5976 28.4413  95.8790
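One way to convince yourself that filtering does not change the original data is to count rows before and after. This sketch uses a made-up data frame (toy is invented for illustration):

```r
library(dplyr)

# Toy data frame invented for illustration.
toy <- data.frame(
  year  = c(2022, 2022, 2023, 2023),
  month = c(11, 12, 1, 2)
)

# Without assignment, the filtered rows are only printed.
toy |> filter(year == 2023)
rows_after_viewing <- nrow(toy)   # toy still has all of its rows

# With assignment, the filtered rows are stored under a new name.
toy_2023 <- toy |> filter(year == 2023)
nrow(toy_2023)
```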

Activity: Filter the data for the year that you were born.

Select

select() is very similar to filter except it works on columns.

Let’s say you only want to look at the precipitation column and not the temperature ones.

climate |>
  select(year, month, precip) |>
  head()

##   year month   precip
## 1 2023     7 114.6551
## 2 2023     8  94.6812
## 3 2023     9 125.5386
## 4 2023    10  32.2431
## 5 2023    11  58.3917
## 6 2023    12 185.7540

You also can use the minus sign (-) to remove a column.

climate |>
  select(-precip) |>
  head()

##   year month   tmean    tmin    tmax
## 1 2023     7 27.2688 22.1419 32.3958
## 2 2023     8 26.3314 20.8735 31.7895
## 3 2023     9 22.3231 16.8723 27.7740
## 4 2023    10 17.2026 11.0189 23.3865
## 5 2023    11 10.1585  3.0728 17.2443
## 6 2023    12  8.3325  1.5722 15.0930

Another useful tip is that you can even rename columns by using the equal sign with a new name.

climate |>
  select(year, month, precipitation = precip) |>
  head()

##   year month precipitation
## 1 2023     7      114.6551
## 2 2023     8       94.6812
## 3 2023     9      125.5386
## 4 2023    10       32.2431
## 5 2023    11       58.3917
## 6 2023    12      185.7540
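Renaming inside select() also drops every column you did not list. If you want to rename a column and keep all the others, dplyr's rename() function is one option. The sketch below compares the two on a made-up data frame (toy and its values are invented) and also shows the minus sign removing more than one column at once.

```r
library(dplyr)

# Toy data frame invented for illustration.
toy <- data.frame(
  year   = c(2023, 2023),
  month  = c(7, 8),
  tmax   = c(32.4, 31.8),
  precip = c(114.7, 94.7)
)

# select() keeps only the listed columns, renaming precip on the way.
renamed_subset <- toy |> select(year, month, precipitation = precip)

# rename() keeps every column and changes only the name.
renamed_all <- toy |> rename(precipitation = precip)

# The minus sign can remove several columns when they are wrapped in c().
dropped <- toy |> select(-c(tmax, precip))

colnames(renamed_subset)
colnames(renamed_all)
colnames(dropped)
```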

Lastly, you can chain the filter and select functions together using the pipe. Here, we will look only at the maximum temperature for January.

climate |>
  filter(month == 1) |>
  select(year, tmax)

##    year    tmax
## 1  1981  8.4450
## 2  1982  6.6670
## 3  1983  8.5250
## 4  1984  8.5440
## 5  1985  6.8720
## 6  1986 10.1500
## 7  1987  8.9720
## 8  1988  6.4860
## 9  1989 12.8850
## 10 1990 15.5770
## 11 1991 11.1120
## 12 1992 11.7320
## 13 1993 11.5670
## 14 1994  8.8360
## 15 1995 11.1340
## 16 1996  8.7020
## 17 1997 11.0250
## 18 1998 12.2020
## 19 1999 14.5720
## 20 2000  9.7000
## 21 2001 10.8010
## 22 2002 12.2920
## 23 2003  8.3260
## 24 2004  9.2280
## 25 2005 11.8940
## 26 2006 14.3910
## 27 2007 13.2640
## 28 2008 10.8540
## 29 2009  9.3790
## 30 2010  9.1800
## 31 2011  8.6170
## 32 2012 13.1540
## 33 2013 11.6930
## 34 2014  8.9060
## 35 2015  9.6490
## 36 2016  8.9800
## 37 2017 12.2520
## 38 2018  8.5420
## 39 2019 11.5990
## 40 2020 13.6700
## 41 2021  9.8236
## 42 2022  9.5421
## 43 2023 14.4497
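The order of the steps in a chain matters: each function only sees what the previous one passed along. In the example above, month is used by filter() before select() drops it. The sketch below, on a made-up data frame (toy is invented), shows what happens when the order is reversed; tryCatch() is used here only so the sketch keeps running after the error.

```r
library(dplyr)

# Toy data frame invented for illustration.
toy <- data.frame(
  year  = c(2022, 2022, 2023, 2023),
  month = c(1, 2, 1, 2),
  tmax  = c(9.5, 14.2, 14.4, 16.6)
)

# Filter first, while the month column still exists.
works <- toy |>
  filter(month == 1) |>
  select(year, tmax)

# Select first, and filter() can no longer find month.
fails <- tryCatch(
  toy |>
    select(year, tmax) |>
    filter(month == 1),
  error = function(e) "error: the month column no longer exists"
)

works
fails
```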

Activity: Filter the data for the year AND month that you were born. Select only the mean temperature column to display.

Tidy Data

So far, you have been working with data that is already “clean” and tidy. There are no mistakes in the data, such as typos, and the data is already in a tidy format. This is not always the case, either due to mistakes in entering your own data or the format of open source data. There are many functions in dplyr and related packages such as tidyr to edit data frames, including some that will change the format of rows and columns to make them tidy.

We will not go into functions to edit the structure of the data frame but there are a few useful functions to edit the cells or columns in your data frame.

How data are organized in a data table is an important part of analyzing data. Keeping your data organized in a consistent and standard way will help you explore the data and use functions in R that expect data to be in a specific format.

We will be working with “tidy” data. Tidy data is set up such that:

- Every column is a variable
- Every row is an observation
- Every cell is a value

For example:

Airplane activity: Is your data tidy? Go back to your Google Sheets entry and make it tidy!

Additional functions

Here are some other functions that can be useful when exploring the data in different columns. For each of these functions, you must specify the column you want to work with.

distinct(): show only the unique values
arrange(): sort the data by a specific column

You can also use other base R functions that we have used before. For example,

max()
min()
mean()
median()

Distinct

distinct() is a helpful function for viewing what unique values are in your data set. Let’s see what different years are in the climate data.

climate |>
  distinct(year)
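As a rough sketch of what distinct() returns, here it is on a small made-up data frame (toy is invented for illustration). Piping the result into nrow() counts the unique values, and listing two columns gives the unique combinations.

```r
library(dplyr)

# Toy data frame invented for illustration.
toy <- data.frame(
  year  = c(2021, 2021, 2022, 2023, 2023),
  month = c(1, 2, 1, 1, 2)
)

# One row per unique year.
unique_years <- toy |> distinct(year)
unique_years

# How many unique years are there?
n_years <- toy |> distinct(year) |> nrow()
n_years

# One row per unique combination of year and month.
unique_pairs <- toy |> distinct(year, month)
```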

Arrange

arrange() is very useful for sorting data by a specific column.

Let’s say we want to sort based on the mean temperature.

climate |>
  arrange(tmean) |>
  head()

##   year month tmean   tmin  tmax  precip
## 1 1989    12 0.961 -4.670 6.594  83.588
## 2 1982     1 1.113 -4.439 6.667 146.087
## 3 1988     1 1.177 -4.130 6.486  87.408
## 4 2015     2 1.262 -5.092 7.617  78.897
## 5 2010    12 1.320 -3.640 6.280  59.538
## 6 1985     1 1.411 -4.049 6.872  99.806

Here, you can see that it is sorting the mean temperature from least to greatest. We can see that December 1989 had the lowest mean temperature of 0.96 degrees C.

If we want to sort from greatest to least, we can specify that we want descending order using desc().

climate |>
  arrange(desc(tmean)) |>
  head()

##   year month  tmean   tmin   tmax  precip
## 1 2012     7 28.353 22.457 34.250 222.759
## 2 2007     8 28.161 21.424 34.900  34.658
## 3 2011     7 28.118 21.627 34.610  64.191
## 4 1993     7 28.080 21.872 34.290  74.712
## 5 1986     7 27.955 21.470 34.442 105.343
## 6 2005     7 27.658 22.194 33.123 112.643

Now, we can see that July 2012 had the highest mean temperature.

Alternatively, we can use the minus sign “-” to show that we want descending order.

climate |>
  arrange(-tmean) |>
  head()

##   year month  tmean   tmin   tmax  precip
## 1 2012     7 28.353 22.457 34.250 222.759
## 2 2007     8 28.161 21.424 34.900  34.658
## 3 2011     7 28.118 21.627 34.610  64.191
## 4 1993     7 28.080 21.872 34.290  74.712
## 5 1986     7 27.955 21.470 34.442 105.343
## 6 2005     7 27.658 22.194 33.123 112.643
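arrange() can also sort by more than one column: rows are ordered by the first column, and ties are broken by the next. A minimal sketch on a made-up data frame (toy and its values are invented):

```r
library(dplyr)

# Toy data frame invented for illustration.
toy <- data.frame(
  year  = c(2023, 2022, 2023, 2022),
  month = c(1, 7, 7, 1),
  tmean = c(8.9, 27.1, 27.3, 6.5)
)

# Sort by year (least to greatest), then within each year
# by mean temperature from greatest to least.
sorted <- toy |> arrange(year, desc(tmean))
sorted
```

desc() also works on text columns, whereas the minus sign only works on numbers.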

Activity: What month had the highest precipitation?

Mutate

In dplyr, you can make new columns based on existing columns using the mutate() function. The mutate() function expects the format new_column = how to modify. This can be very useful for determining rates, proportions, etc.

For example, you could get the difference between the maximum temperature and minimum temperature by doing the following:

climate |>
  mutate(temp_difference = tmax - tmin) |>
  arrange(temp_difference) |>
  head()

##   year month   tmean   tmin   tmax  precip temp_difference
## 1 2018     9 25.2290 20.953 29.507 195.477        8.554001
## 2 2014     9 21.9320 17.543 26.323 152.583        8.780001
## 3 2002    10 16.6560 12.263 21.051 210.451        8.788001
## 4 2020     9 21.8465 17.440 26.253 278.672        8.813000
## 5 2020     8 26.7255 22.289 31.162 141.443        8.873001
## 6 2014     8 23.9320 19.457 28.408 181.368        8.951000
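mutate() can create several columns in one call, and a new column can be used right away by the ones after it. As an illustration, this sketch converts temperatures from degrees Celsius to degrees Fahrenheit on a made-up data frame; toy and the column names tmax_f, tmin_f, and range_f are invented for this example.

```r
library(dplyr)

# Toy data frame invented for illustration (temperatures in degrees C).
toy <- data.frame(
  month = c(1, 7),
  tmin  = c(0, 20),
  tmax  = c(10, 30)
)

converted <- toy |>
  mutate(
    tmax_f  = tmax * 9 / 5 + 32,   # degrees C to degrees F
    tmin_f  = tmin * 9 / 5 + 32,
    range_f = tmax_f - tmin_f      # uses the two columns created just above
  )

converted
```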

Activity: In the climate data, the mean temperature (tmean) is calculated by the PRISM Climate Group as the average of the minimum and maximum temperature. Can you use mutate() to calculate this variable and compare it to the column calculated by PRISM?

Water quality activity

In this course, we will be exploring historic water quality data from the City of Raleigh. The data were collected from 2008 to 2023 and include 19 different water quality variables, including water temperature, E. coli bacteria, and dissolved oxygen, for 18 different stream sites.

You can read the data into R. The file “raleigh_wq_2008_2023.csv” is in your data folder.

wq <- read.csv('data/raleigh_wq_2008_2023.csv')

Activity: Review the data. Is it tidy?

Quality control of data

When working with data, it is important to check that there are no errors or issues. For example, are there typos? Are there numbers that don’t make sense?
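A few quick checks can surface these kinds of problems. The sketch below is one possible set of checks, run on a made-up data frame: toy, its column names, the site code "CC5", and all of the values are invented for illustration and are not the Raleigh water quality data.

```r
library(dplyr)

# Toy data frame invented for illustration, with problems planted on purpose.
toy <- data.frame(
  site       = c("BB2", "BB2", "bb2", "CC5"),
  water_temp = c(18.2, 19.1, 180.5, NA)
)

# Typos and inconsistent spellings show up as extra "unique" values.
unique_sites <- toy |> distinct(site)
unique_sites

# Numbers that don't make sense stand out at the ends of the range.
temp_range <- range(toy$water_temp, na.rm = TRUE)
temp_range

# Count the missing values in each column.
missing_counts <- colSums(is.na(toy))
missing_counts
```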

Let’s take a look at the water quality data to check to see if there are any issues.

First, take a look at the first few rows, using the function head()

head(wq)

##   Site       Date Time        Parameter Result PQL Unit
## 1  BB2 2008-09-30 9:52          Calcium   <NA>  NA mg/L
## 2  BB2 2008-09-30 9:52   Hardness_total   <NA>  NA mg/L
## 3  BB2 2008-09-30 9:52        Magnesium   <NA>  NA mg/L
## 4  BB2 2008-09-30 9:52         Salinity   <NA>  NA  ppt
## 5  BB2 2008-09-30 9:52 Phosphorus_total  <0.05  NA mg/L
## 6  BB2 2008-09-30 9:52              NH3  <0.02  NA mg/L

What do you notice already?

Now, let’s look at a tidy version so that we can use some of the dplyr functions we just learned. Here, I am giving you a copy of the water quality data that I have put into a “tidy” format.

To read in the data, run the following code.

wq_tidy <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_tidy.csv')

First, take a look at what columns are in the dataset using the colnames() function.
Take a look at the first few rows to see what the data look like using head().
Use the distinct() function to see what locations or “Sites” were evaluated in Raleigh.

What do you notice about the data?

Now you can take a look at the data that has been cleaned up!

Use the following code to download the csv.

wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')

Take a look at the first few rows to see what the data look like using head().
Use the function nrow() on the data to see how many observations are in the dataset.

Homework Exercise

Complete the rest of the exercises in an R script. Use comments to separate questions and describe what your code is doing. When you complete the exercises, submit the R script (.R file) in Moodle.

For homework, you will answer questions on a dataset of IMDb movie data.

Set up

Read into R the dataset named “movies.csv” in your data folder. Assign it the name movies.
Run the following code so that you can analyze the data. You don’t need to do anything else for this question :)

library(dplyr)
movies <- movies |>
  mutate(budget = as.numeric(budget), intgross = as.numeric(intgross))

Analyzing the data

There are a lot of columns in this data set. Review the columns using the code colnames(movies).
Use select to only show the following columns. Resave the data with only the selected columns by assigning it a new name. You can use this new data frame for the rest of the assignment. Select these columns:

year
title
budget
intgross
rated
metascore
imdb_rating

Filter the data for movies that were released in your birth year. How many movies are there from that year? (Hint: use a comment in your code to give your answer).
Use the arrange() function to determine what 3 movies had the highest budget.
Use the arrange() function to determine the G rated movie with the highest imdb metascore. (Hint: you will have to filter first!)
Determine the profit of the movies by subtracting the budget from the international gross (intgross), using the mutate() function. (Hint: you will need a name for your new profit column.)
What movie had the highest profit?
What movie had the highest profit in 1990?
Reach: Which year had the highest average IMDb metascore (1980, 1990, or 2000)? Hint: you will have to do each year separately and use the mean function to get the average.

List of new functions

filter()
select()
distinct()
nrow()
arrange()
mutate()

You can always use the “?” before a function to pull up the help page of a function, which also has examples. For example ?mutate

Resources

There are loads of resources out there for dplyr, including simply googling what you want to do. Here are a few:

R for Data Science book. See data transformation and data tidying sections
Stat 454 Intro to dplyr
Data school dplyr tutorial

¹ PRISM Climate Group, Oregon State University, https://prism.oregonstate.edu, accessed 17 Jan 2024.