So far, we have worked with functions that are already loaded into R, or base R functions. However, there is a world of other functions out there. You can create your own functions or use ones that others have made. An R package is a group of functions that can be loaded and used in R. Anyone can make an R package, but the Comprehensive R Archive Network (CRAN) hosts a wide variety of packages that go through certain checks. These packages can be installed and loaded into your own R session for you to use. Most packages are a group of functions for a specific use, for example, DNA sequence analysis, diversity analyses, reading maps, etc.
We are going to use the dplyr package to organize and
manipulate data in this lesson.
There are two ways to install a package.
One, you can go to the Packages tab in the lower right-hand side of RStudio, where your files are. Then, click Install and find the package you want to install. Keep the checkmark checked for “Install dependencies.” Some packages use or depend on functions from other packages, so this will install all additional packages you need.
This will run the code install.packages(). The second
way is to simply run the code install.packages() in the
console, with the name of the package in quotation marks.
For example, let’s say I want the RColorBrewer package.
This package adds different colors and palettes that you can use in R.
To install the package, I would run:
install.packages("RColorBrewer")
Now, whenever I want to use functions from this package, they are installed in my R library. You only ever have to install a package once.
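Because a package only needs to be installed once, a script can check for it before installing. The following is a minimal sketch, not a required step; `install_if_missing` is a hypothetical helper name made up for this example, built on the base R function `requireNamespace()`.

```r
# Hypothetical helper (not part of base R or RStudio): install a package
# only when it is not already in the library.
install_if_missing <- function(pkg) {
  already_installed <- requireNamespace(pkg, quietly = TRUE)
  if (!already_installed) {
    install.packages(pkg)
  }
  # Return (invisibly) whether the package was already there
  invisible(already_installed)
}

# "stats" ships with R, so nothing is downloaded here
was_installed <- install_if_missing("stats")
was_installed
```

Calling `install_if_missing("RColorBrewer")` would download the package the first time and do nothing on later runs.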
Once you have installed a package, it is there on your computer. But
R doesn’t “load” packages unless you tell it to. You can do this by
using the library() function.
library(RColorBrewer)
Now you can use the functions in RColorBrewer.
RColorBrewer has many different color palettes. Here, I want to look at a combination of 3 colors in the Set2 palette.
display.brewer.pal(n = 3, name = 'Set2')
Every time you open R, you do have to load the packages you
want to use using the library() function.
We are going to use the dplyr package, which is widely used for
managing, manipulating, and summarizing data.
This package has already been installed for you in the Learning R project.
To use dplyr, you will
need to load in the package. Go ahead and do that now, based on what you
just learned!
To highlight the basic features of dplyr, we will
continue with the climate data from Raleigh. This data is from the PRISM
Climate Group,[1] which is part of Oregon State University and
creates datasets using weather stations across the United States. The
datasets include total precipitation, maximum temperature, minimum
temperature, mean temperature, dew point temperature, vapor pressure
deficit, and measurements of global shortwave solar radiation.
The data we are working with is for Raleigh, NC and includes minimum temperature, maximum temperature, mean temperature and precipitation for each month from 1981 to 2023.
First, we will import that data. Again, this data is in the data
directory in your Learning R project. I will assign the data the name
climate.
climate <- read.csv("data/raleigh_prism_climate.csv")
head(climate)
## year month tmean tmin tmax precip
## 1 2023 7 27.2688 22.1419 32.3958 114.6551
## 2 2023 8 26.3314 20.8735 31.7895 94.6812
## 3 2023 9 22.3231 16.8723 27.7740 125.5386
## 4 2023 10 17.2026 11.0189 23.3865 32.2431
## 5 2023 11 10.1585 3.0728 17.2443 58.3917
## 6 2023 12 8.3325 1.5722 15.0930 185.7540
As you can see, the data set is too large to show in full in the
R console, so the function head() can
be useful to show just the first 6 rows. One very useful tool in R,
which dplyr makes good use of, is the “pipe”.
Pipes allow you to “pipe” data into a function. This is very helpful
when running many functions on a dataset.
The pipe syntax is |>. You may also see
%>%, an older pipe from the magrittr package. The pipe takes the output
of the previous code and uses it as the first argument in the next
function. In dplyr, as with many functions, the first
argument is the data.
For example,
climate |> head()
## year month tmean tmin tmax precip
## 1 2023 7 27.2688 22.1419 32.3958 114.6551
## 2 2023 8 26.3314 20.8735 31.7895 94.6812
## 3 2023 9 22.3231 16.8723 27.7740 125.5386
## 4 2023 10 17.2026 11.0189 23.3865 32.2431
## 5 2023 11 10.1585 3.0728 17.2443 58.3917
## 6 2023 12 8.3325 1.5722 15.0930 185.7540
A much easier way to read the pipe is by putting functions on different lines.
climate |>
head()
## year month tmean tmin tmax precip
## 1 2023 7 27.2688 22.1419 32.3958 114.6551
## 2 2023 8 26.3314 20.8735 31.7895 94.6812
## 3 2023 9 22.3231 16.8723 27.7740 125.5386
## 4 2023 10 17.2026 11.0189 23.3865 32.2431
## 5 2023 11 10.1585 3.0728 17.2443 58.3917
## 6 2023 12 8.3325 1.5722 15.0930 185.7540
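The pipe becomes more useful as steps are chained. As a small sketch using only base R (the vector `values` is made up for illustration), the two versions below do the same calculation, once with nested calls and once with the pipe:

```r
# A small vector to work with
values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The same steps with the pipe are read top to bottom
piped <- values |>
  sqrt() |>
  sum()

# Both approaches give the same result
identical(nested, piped)
```

Each line of the piped version is one step, which is why longer chains are usually easier to read this way.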
To explore the data a bit, we can use dplyr functions
filter and select to view only certain rows or
columns.
The filter function takes arguments for the data and the
criteria for filtering the data based on a specific column. For example,
you could “filter” the data to show only rows where the year is 2023 in
the climate data with the argument year == 2023. Here,
because the year is a number, you do not need quotation marks. If you are
filtering by a character, you would need quotation marks (ex. city ==
"Raleigh").
filter(climate, year == 2023)
## year month tmean tmin tmax precip
## 1 2023 7 27.2688 22.1419 32.3958 114.6551
## 2 2023 8 26.3314 20.8735 31.7895 94.6812
## 3 2023 9 22.3231 16.8723 27.7740 125.5386
## 4 2023 10 17.2026 11.0189 23.3865 32.2431
## 5 2023 11 10.1585 3.0728 17.2443 58.3917
## 6 2023 12 8.3325 1.5722 15.0930 185.7540
## 7 2023 1 8.9147 3.3799 14.4497 93.7049
## 8 2023 2 10.7567 4.9577 16.5557 72.8692
## 9 2023 3 11.6672 5.2850 18.0496 75.1907
## 10 2023 4 16.8021 10.3899 23.2143 192.0933
## 11 2023 5 18.7094 13.4401 23.9789 79.6373
## 12 2023 6 23.0194 17.5976 28.4413 95.8790
To use the pipe, you could also do the following:
climate |>
filter(year == 2023)
## year month tmean tmin tmax precip
## 1 2023 7 27.2688 22.1419 32.3958 114.6551
## 2 2023 8 26.3314 20.8735 31.7895 94.6812
## 3 2023 9 22.3231 16.8723 27.7740 125.5386
## 4 2023 10 17.2026 11.0189 23.3865 32.2431
## 5 2023 11 10.1585 3.0728 17.2443 58.3917
## 6 2023 12 8.3325 1.5722 15.0930 185.7540
## 7 2023 1 8.9147 3.3799 14.4497 93.7049
## 8 2023 2 10.7567 4.9577 16.5557 72.8692
## 9 2023 3 11.6672 5.2850 18.0496 75.1907
## 10 2023 4 16.8021 10.3899 23.2143 192.0933
## 11 2023 5 18.7094 13.4401 23.9789 79.6373
## 12 2023 6 23.0194 17.5976 28.4413 95.8790
Important syntax
You can also filter by multiple criteria. Let’s say you want only January of years before 2020.
climate |>
filter(year < 2020, month == 1)
## year month tmean tmin tmax precip
## 1 1981 1 1.562000 -5.320 8.445 29.154
## 2 1982 1 1.113000 -4.439 6.667 146.087
## 3 1983 1 3.644000 -1.235 8.525 48.100
## 4 1984 1 2.811000 -2.922 8.544 112.225
## 5 1985 1 1.411000 -4.049 6.872 99.806
## 6 1986 1 3.948000 -2.252 10.150 46.020
## 7 1987 1 3.571000 -1.830 8.972 186.841
## 8 1988 1 1.177000 -4.130 6.486 87.408
## 9 1989 1 7.110000 1.335 12.885 47.079
## 10 1990 1 9.033000 2.490 15.577 90.624
## 11 1991 1 5.731000 0.352 11.112 122.571
## 12 1992 1 5.908000 0.085 11.732 111.145
## 13 1993 1 6.259000 0.952 11.567 120.780
## 14 1994 1 2.742000 -3.353 8.836 101.831
## 15 1995 1 5.645000 0.157 11.134 134.382
## 16 1996 1 3.401000 -1.898 8.702 120.611
## 17 1997 1 5.218000 -0.588 11.025 79.732
## 18 1998 1 7.138000 2.077 12.202 209.320
## 19 1999 1 7.813000 1.056 14.572 148.250
## 20 2000 1 3.943000 -1.812 9.700 147.929
## 21 2001 1 4.572000 -1.656 10.801 36.725
## 22 2002 1 6.148000 0.006 12.292 137.531
## 23 2003 1 2.740000 -2.845 8.326 59.701
## 24 2004 1 3.375000 -2.477 9.228 33.850
## 25 2005 1 6.443000 0.993 11.894 74.003
## 26 2006 1 8.529000 2.669 14.391 52.648
## 27 2007 1 7.409000 1.556 13.264 75.280
## 28 2008 1 4.879000 -1.096 10.854 26.438
## 29 2009 1 3.961000 -1.455 9.379 68.061
## 30 2010 1 3.337000 -2.505 9.180 117.815
## 31 2011 1 2.982000 -2.651 8.617 37.024
## 32 2012 1 6.749001 0.346 13.154 63.561
## 33 2013 1 6.322001 0.952 11.693 79.784
## 34 2014 1 2.335000 -4.235 8.906 61.532
## 35 2015 1 3.754000 -2.142 9.649 104.963
## 36 2016 1 3.643000 -1.692 8.980 68.965
## 37 2017 1 7.140000 2.029 12.252 94.333
## 38 2018 1 2.235000 -4.070 8.542 81.913
## 39 2019 1 5.972000 0.345 11.599 77.125
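Criteria separated by commas inside filter() must all be true for a row to be kept, which is the same as joining them with `&`. A minimal sketch, assuming dplyr is installed and using a made-up data frame called `toy` (not the PRISM data):

```r
library(dplyr)

# Made-up rows for illustration (not the PRISM data)
toy <- data.frame(
  year  = c(2018, 2019, 2020, 2021),
  month = c(1, 2, 1, 1),
  tmean = c(2.2, 7.5, 8.1, 5.0)
)

# Conditions separated by commas must all be true ...
with_comma <- toy |>
  filter(year < 2020, month == 1)

# ... which is the same as joining them with & ("and")
with_and <- toy |>
  filter(year < 2020 & month == 1)

# Compare the two results
identical(with_comma, with_and)
```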
You can also use the “|” sign for “or”.
climate |>
filter(year == 2023 | year == 1985)
## year month tmean tmin tmax precip
## 1 2023 7 27.2688 22.1419 32.3958 114.6551
## 2 2023 8 26.3314 20.8735 31.7895 94.6812
## 3 2023 9 22.3231 16.8723 27.7740 125.5386
## 4 2023 10 17.2026 11.0189 23.3865 32.2431
## 5 2023 11 10.1585 3.0728 17.2443 58.3917
## 6 2023 12 8.3325 1.5722 15.0930 185.7540
## 7 1985 1 1.4110 -4.0490 6.8720 99.8060
## 8 1985 2 5.5930 -0.2060 11.3930 123.8080
## 9 1985 3 11.8720 4.4480 19.2970 33.3370
## 10 1985 4 17.6430 9.6020 25.6850 11.7330
## 11 1985 5 20.4780 13.9350 27.0210 78.6880
## 12 1985 6 24.3130 17.7360 30.8910 77.4210
## 13 1985 7 25.3010 20.0430 30.5600 134.8740
## 14 1985 8 24.2630 18.8350 29.6920 122.9370
## 15 1985 9 21.5910 15.2500 27.9330 13.5910
## 16 1985 10 18.2420 12.4870 23.9970 68.5060
## 17 1985 11 15.3670 9.7730 20.9630 183.4270
## 18 1985 12 4.3550 -2.1120 10.8220 33.3070
## 19 2023 1 8.9147 3.3799 14.4497 93.7049
## 20 2023 2 10.7567 4.9577 16.5557 72.8692
## 21 2023 3 11.6672 5.2850 18.0496 75.1907
## 22 2023 4 16.8021 10.3899 23.2143 192.0933
## 23 2023 5 18.7094 13.4401 23.9789 79.6373
## 24 2023 6 23.0194 17.5976 28.4413 95.8790
Lastly, let’s say you only want the odd years in the 1990s. You could
use %in%, which matches any value “in” a given vector. Remember that
c() makes a vector.
climate |>
filter(year %in% c(1991, 1993, 1995, 1997, 1999 ))
## year month tmean tmin tmax precip
## 1 1991 1 5.731000 0.352 11.112 122.571
## 2 1991 2 8.296000 1.697 14.895 15.668
## 3 1991 3 12.244000 5.843 18.647 128.508
## 4 1991 4 16.723001 10.559 22.887 42.827
## 5 1991 5 22.355001 16.536 28.177 90.951
## 6 1991 6 24.433001 18.354 30.513 113.636
## 7 1991 7 27.254002 22.124 32.385 181.898
## 8 1991 8 25.129002 20.266 29.992 123.393
## 9 1991 9 21.965000 15.755 28.175 56.753
## 10 1991 10 15.955001 9.038 22.873 37.623
## 11 1991 11 9.323001 2.215 16.433 25.796
## 12 1991 12 8.050000 1.058 15.044 80.302
## 13 1993 1 6.259000 0.952 11.567 120.780
## 14 1993 2 4.765000 -1.784 11.314 55.441
## 15 1993 3 8.510000 2.249 14.773 164.502
## 16 1993 4 13.932001 6.625 21.240 119.347
## 17 1993 5 20.693001 14.322 27.065 90.104
## 18 1993 6 24.626001 18.249 31.003 20.048
## 19 1993 7 28.080002 21.872 34.290 74.712
## 20 1993 8 25.542002 19.577 31.509 62.508
## 21 1993 9 23.101002 17.043 29.161 109.959
## 22 1993 10 15.186001 8.829 21.544 99.568
## 23 1993 11 10.550000 3.890 17.212 76.211
## 24 1993 12 4.532000 -1.430 10.494 84.839
## 25 1995 1 5.645000 0.157 11.134 134.382
## 26 1995 2 4.585000 -1.044 10.215 137.375
## 27 1995 3 11.340000 4.228 18.453 98.181
## 28 1995 4 15.916000 8.687 23.145 33.417
## 29 1995 5 20.113001 13.973 26.254 90.784
## 30 1995 6 23.166000 18.247 28.087 258.510
## 31 1995 7 26.499001 20.869 32.131 73.495
## 32 1995 8 26.334002 20.810 31.860 123.741
## 33 1995 9 21.195002 16.302 26.089 89.748
## 34 1995 10 17.236000 10.898 23.575 247.161
## 35 1995 11 8.073000 1.920 14.227 123.340
## 36 1995 12 3.516000 -2.605 9.639 44.453
## 37 1997 1 5.218000 -0.588 11.025 79.732
## 38 1997 2 8.106000 2.146 14.068 75.188
## 39 1997 3 12.844001 5.593 20.097 87.205
## 40 1997 4 13.412001 6.666 20.161 135.245
## 41 1997 5 18.079000 10.812 25.348 53.624
## 42 1997 6 22.141001 16.401 27.882 103.663
## 43 1997 7 26.699001 20.889 32.511 124.724
## 44 1997 8 24.546001 18.329 30.763 32.709
## 45 1997 9 21.785002 16.134 27.436 68.137
## 46 1997 10 15.900001 9.592 22.209 75.728
## 47 1997 11 8.671000 2.670 14.674 89.114
## 48 1997 12 5.570000 0.086 11.055 86.881
## 49 1999 1 7.813000 1.056 14.572 148.250
## 50 1999 2 7.329000 0.451 14.208 44.975
## 51 1999 3 8.551001 1.469 15.636 104.431
## 52 1999 4 16.349001 9.623 23.076 67.906
## 53 1999 5 19.112001 12.644 25.581 32.010
## 54 1999 6 23.180000 17.775 28.586 44.236
## 55 1999 7 26.694002 21.230 32.159 75.205
## 56 1999 8 26.580002 20.564 32.598 108.969
## 57 1999 9 20.672001 15.546 25.799 509.944
## 58 1999 10 14.967001 9.012 20.924 98.421
## 59 1999 11 12.886001 5.815 19.958 39.651
## 60 1999 12 6.398000 -0.079 12.876 60.766
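Typing every year by hand works for a short list, but a regular sequence can also be generated. This sketch uses the base R function `seq()`; the name `odd_years` is just an illustrative choice.

```r
# Build the odd years of the 1990s instead of typing each one
odd_years <- seq(1991, 1999, by = 2)
odd_years

# %in% returns TRUE for each value on the left that appears in the vector
in_odd_years <- c(1990, 1991, 1995, 2001) %in% odd_years
in_odd_years
```

The vector could then be used in place of the typed-out years, as in `filter(year %in% odd_years)`.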
NOTE: Notice in each of the previous examples your
data frame climate remains unchanged. You have just been
viewing the filtered data. If you want to save the filtered data you
would have to assign it to a new variable.
climate2023 <- climate |>
filter(year == 2023)
climate2023
## year month tmean tmin tmax precip
## 1 2023 7 27.2688 22.1419 32.3958 114.6551
## 2 2023 8 26.3314 20.8735 31.7895 94.6812
## 3 2023 9 22.3231 16.8723 27.7740 125.5386
## 4 2023 10 17.2026 11.0189 23.3865 32.2431
## 5 2023 11 10.1585 3.0728 17.2443 58.3917
## 6 2023 12 8.3325 1.5722 15.0930 185.7540
## 7 2023 1 8.9147 3.3799 14.4497 93.7049
## 8 2023 2 10.7567 4.9577 16.5557 72.8692
## 9 2023 3 11.6672 5.2850 18.0496 75.1907
## 10 2023 4 16.8021 10.3899 23.2143 192.0933
## 11 2023 5 18.7094 13.4401 23.9789 79.6373
## 12 2023 6 23.0194 17.5976 28.4413 95.8790
Activity: Filter the data for the year that you were born.
select() is very similar to filter except it works on
columns.
Let’s say you only want to look at the precipitation column and not the temperature ones.
climate |>
select(year, month, precip) |>
head()
## year month precip
## 1 2023 7 114.6551
## 2 2023 8 94.6812
## 3 2023 9 125.5386
## 4 2023 10 32.2431
## 5 2023 11 58.3917
## 6 2023 12 185.7540
You also can use the minus sign (-) to remove a column.
climate |>
select(-precip) |>
head()
## year month tmean tmin tmax
## 1 2023 7 27.2688 22.1419 32.3958
## 2 2023 8 26.3314 20.8735 31.7895
## 3 2023 9 22.3231 16.8723 27.7740
## 4 2023 10 17.2026 11.0189 23.3865
## 5 2023 11 10.1585 3.0728 17.2443
## 6 2023 12 8.3325 1.5722 15.0930
Another useful tip is that you can even rename columns by using the equal sign with a new name.
climate |>
select(year, month, precipitation = precip ) |>
head()
## year month precipitation
## 1 2023 7 114.6551
## 2 2023 8 94.6812
## 3 2023 9 125.5386
## 4 2023 10 32.2431
## 5 2023 11 58.3917
## 6 2023 12 185.7540
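Renaming inside select() keeps only the columns you list. If you want to rename a column and keep all the others, dplyr also has a rename() function that uses the same new_name = old_name format. A short sketch with a made-up data frame `toy`:

```r
library(dplyr)

# Made-up rows for illustration (not the PRISM data)
toy <- data.frame(
  year   = c(2022, 2023),
  month  = c(1, 1),
  precip = c(80.5, 93.7)
)

# rename() changes the column name and keeps every other column
renamed <- toy |>
  rename(precipitation = precip)

colnames(renamed)
```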
Lastly, you can also chain the filter and select functions, again using the pipe. Here, we will look only at maximum temperature for January.
climate |>
filter(month == 1) |>
select(year, tmax)
## year tmax
## 1 1981 8.4450
## 2 1982 6.6670
## 3 1983 8.5250
## 4 1984 8.5440
## 5 1985 6.8720
## 6 1986 10.1500
## 7 1987 8.9720
## 8 1988 6.4860
## 9 1989 12.8850
## 10 1990 15.5770
## 11 1991 11.1120
## 12 1992 11.7320
## 13 1993 11.5670
## 14 1994 8.8360
## 15 1995 11.1340
## 16 1996 8.7020
## 17 1997 11.0250
## 18 1998 12.2020
## 19 1999 14.5720
## 20 2000 9.7000
## 21 2001 10.8010
## 22 2002 12.2920
## 23 2003 8.3260
## 24 2004 9.2280
## 25 2005 11.8940
## 26 2006 14.3910
## 27 2007 13.2640
## 28 2008 10.8540
## 29 2009 9.3790
## 30 2010 9.1800
## 31 2011 8.6170
## 32 2012 13.1540
## 33 2013 11.6930
## 34 2014 8.9060
## 35 2015 9.6490
## 36 2016 8.9800
## 37 2017 12.2520
## 38 2018 8.5420
## 39 2019 11.5990
## 40 2020 13.6700
## 41 2021 9.8236
## 42 2022 9.5421
## 43 2023 14.4497
Activity: Filter the data for the year AND month that you were born. Select only the mean temperature column to display.
So far, you have been working with already “clean”, tidy data. There
are no mistakes in the data, such as typos, and the data is already in
a tidy format. This is not always the case, either due to mistakes in
entering your own data or the format of open source data. There are many
functions in dplyr and related packages (such as tidyr) to edit data
frames, including some that change the format of rows and columns to
make the data tidy.
We will not go into functions to edit the structure of the data frame but there are a few useful functions to edit the cells or columns in your data frame.
How data are organized in a data table is an important part of analyzing data. Keeping your data organized in a consistent and standard way will help you explore the data and use functions in R that expect data to be in a specific format.
We will be working with “tidy” data. Tidy data is set up such that:
- Every column is a variable
- Every row is an observation
- Every cell is a value
For example:
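A small made-up table (hypothetical sites and values, not course data) can illustrate the difference. In the first data frame, the year is hidden in the column names; in the second, every column is a variable and every row is one observation. This sketch uses only base R.

```r
# Made-up temperatures for two hypothetical sites (not course data).
# Not tidy: the variable "year" is stored in the column names.
wide <- data.frame(
  site      = c("A", "B"),
  temp_2022 = c(15.1, 16.3),
  temp_2023 = c(15.8, 16.9)
)

# Tidy version: one row per site-year observation, one column per variable
tidy <- data.frame(
  site = rep(wide$site, times = 2),
  year = rep(c(2022, 2023), each = nrow(wide)),
  temp = c(wide$temp_2022, wide$temp_2023)
)

tidy
```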
Airplane activity: Is your data tidy? Go back to your Google Sheets entry and make it tidy!
Here are some other functions that can be useful when exploring the data in different columns. For each of these functions, you must specify the column you want to work with.
- distinct(): show only the unique values
- arrange(): sort the data by a specific column

You can also use other base R functions that we have used before. For example,
- max()
- min()
- mean()
- median()

distinct() is a helpful function for viewing what unique values are in your data set. Let’s see what different years are in the climate data.
climate |>
distinct(year)
## year
## 1 2023
## 2 1981
## 3 1982
## 4 1983
## 5 1984
## 6 1985
## 7 1986
## 8 1987
## 9 1988
## 10 1989
## 11 1990
## 12 1991
## 13 1992
## 14 1993
## 15 1994
## 16 1995
## 17 1996
## 18 1997
## 19 1998
## 20 1999
## 21 2000
## 22 2001
## 23 2002
## 24 2003
## 25 2004
## 26 2005
## 27 2006
## 28 2007
## 29 2008
## 30 2009
## 31 2010
## 32 2011
## 33 2012
## 34 2013
## 35 2014
## 36 2015
## 37 2016
## 38 2017
## 39 2018
## 40 2019
## 41 2020
## 42 2021
## 43 2022
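When the list of unique values is long, it can help to count them instead of reading them. One possible approach is to pipe the result of distinct() into nrow(). A sketch with a made-up data frame `toy`:

```r
library(dplyr)

# Made-up rows for illustration (not the PRISM data)
toy <- data.frame(
  year  = c(2021, 2021, 2022, 2022, 2023),
  month = c(1, 2, 1, 2, 1)
)

# distinct() keeps one row per unique year; nrow() counts those rows
n_years <- toy |>
  distinct(year) |>
  nrow()

n_years
```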
arrange() is very useful for sorting data by a specific
column.
Let’s say we want to sort based on the mean temperature.
climate |>
arrange(tmean) |>
head()
## year month tmean tmin tmax precip
## 1 1989 12 0.961 -4.670 6.594 83.588
## 2 1982 1 1.113 -4.439 6.667 146.087
## 3 1988 1 1.177 -4.130 6.486 87.408
## 4 2015 2 1.262 -5.092 7.617 78.897
## 5 2010 12 1.320 -3.640 6.280 59.538
## 6 1985 1 1.411 -4.049 6.872 99.806
Here, you can see that it is sorting the mean temperature from least to greatest. We can see that December 1989 had the lowest mean temperature of 0.96 degrees C.
If we want to sort from greatest to least, we can specify that we
want descending order using desc().
climate |>
arrange(desc(tmean)) |>
head()
## year month tmean tmin tmax precip
## 1 2012 7 28.353 22.457 34.250 222.759
## 2 2007 8 28.161 21.424 34.900 34.658
## 3 2011 7 28.118 21.627 34.610 64.191
## 4 1993 7 28.080 21.872 34.290 74.712
## 5 1986 7 27.955 21.470 34.442 105.343
## 6 2005 7 27.658 22.194 33.123 112.643
Now, we can see that July 2012 had the highest mean temperature.
Alternatively, we can use the minus sign “-” to show that we want descending order.
climate |>
arrange(-tmean) |>
head()
## year month tmean tmin tmax precip
## 1 2012 7 28.353 22.457 34.250 222.759
## 2 2007 8 28.161 21.424 34.900 34.658
## 3 2011 7 28.118 21.627 34.610 64.191
## 4 1993 7 28.080 21.872 34.290 74.712
## 5 1986 7 27.955 21.470 34.442 105.343
## 6 2005 7 27.658 22.194 33.123 112.643
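arrange() can also sort by more than one column: rows are sorted by the first column, and ties are broken by the next one. As a sketch with a made-up data frame `toy`, sorting by year and then month puts rows in chronological order:

```r
library(dplyr)

# Made-up rows for illustration (not the PRISM data)
toy <- data.frame(
  year  = c(2023, 2023, 2022, 2022),
  month = c(7, 1, 12, 3),
  tmean = c(27.3, 8.9, 9.1, 12.4)
)

# Sort by year first, then by month within each year
chronological <- toy |>
  arrange(year, month)

chronological
```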
Activity: What month had the highest precipitation?
In dplyr, you can make new columns based on existing columns using
the mutate() function. The mutate() function
expects the format new_column = how to modify. This can be very useful
for determining rates, proportions, etc.
For example, you could get the difference between the maximum temperature and the minimum temperature by doing the following:
climate |>
mutate(temp_difference = tmax - tmin) |>
arrange(temp_difference) |>
head()
## year month tmean tmin tmax precip temp_difference
## 1 2018 9 25.2290 20.953 29.507 195.477 8.554001
## 2 2014 9 21.9320 17.543 26.323 152.583 8.780001
## 3 2002 10 16.6560 12.263 21.051 210.451 8.788001
## 4 2020 9 21.8465 17.440 26.253 278.672 8.813000
## 5 2020 8 26.7255 22.289 31.162 141.443 8.873001
## 6 2014 8 23.9320 19.457 28.408 181.368 8.951000
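mutate() can create several columns in one call, separated by commas. The sketch below uses a made-up data frame `toy` and converts a Celsius temperature to Fahrenheit; the column names `tmax_f` and `above_freezing` are illustrative choices.

```r
library(dplyr)

# Made-up maximum temperatures in degrees C (not the PRISM data)
toy <- data.frame(month = c(1, 4, 7), tmax = c(0, 10, 25))

toy_f <- toy |>
  mutate(
    tmax_f = tmax * 9 / 5 + 32,  # Celsius to Fahrenheit
    above_freezing = tmax > 0    # a logical (TRUE/FALSE) column
  )

toy_f
```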
Activity: In the climate
data, the mean temperature (tmean) is calculated by the PRISM Climate Group
as the average of the minimum and maximum temperature. Can you use
mutate() to calculate this variable and compare it to the
column calculated by PRISM?
In this course, we will be exploring historic water quality data from the City of Raleigh. The data was collected from 2008 to 2023 and includes 19 different water quality variables, including water temperature, E. coli bacteria, and dissolved oxygen, for 18 different stream sites.
You can read the data into R. The file “raleigh_wq_2008_2023.csv” is in your data folder.
wq <- read.csv('data/raleigh_wq_2008_2023.csv')
Activity: Review the data. Is it tidy?
When working with data, it is important to check that there are no errors or issues in the data. For example, are there typos? Are there numbers that don’t make sense?
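A few base R functions can help with these checks. The following is only a sketch on a made-up data frame `check` with two deliberate problems (a site name typed in lower case and an implausible temperature); the column names are hypothetical.

```r
# Made-up measurements with two deliberate problems
check <- data.frame(
  site   = c("BB2", "BB2", "bb2", "CC5"),
  temp_c = c(18.2, 19.0, 17.5, 190.0)
)

# A typo shows up as an extra category
unique(check$site)

# An implausible value stands out at the extremes
range(check$temp_c)

# Count the missing values in a column
sum(is.na(check$temp_c))
```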
Let’s take a look at the water quality data to check to see if there are any issues.
First, take a look at the first few rows, using the function
head().
head(wq)
## Site Date Time Parameter Result PQL Unit
## 1 BB2 2008-09-30 9:52 Calcium <NA> NA mg/L
## 2 BB2 2008-09-30 9:52 Hardness_total <NA> NA mg/L
## 3 BB2 2008-09-30 9:52 Magnesium <NA> NA mg/L
## 4 BB2 2008-09-30 9:52 Salinity <NA> NA ppt
## 5 BB2 2008-09-30 9:52 Phosphorus_total <0.05 NA mg/L
## 6 BB2 2008-09-30 9:52 NH3 <0.02 NA mg/L
What do you notice already?
Now, let’s look at a tidy version so that we can use some of the
dplyr functions we just learned. Here, I am giving you a
copy of the water quality data that I have put into a “tidy” format.
To read in the data, run the following code.
wq_tidy <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_tidy.csv')
- Use the colnames() function.
- Use head().
- Use the distinct() function to see what locations or “Sites” were evaluated in Raleigh.

What do you notice about the data?
Now you can take a look at the data that has been cleaned up!
Use the following code to download the csv.
wq_clean <- read.csv('https://maryglover.github.io/bio331/water_quality_manip/raleigh_wq_clean.csv')
- Use head().
- Use nrow() on the data to see how many observations are in the dataset.

Complete the rest of the exercises in an R script. Use comments to separate questions and describe what your code is doing. When you complete the exercises, submit the R script (.R file) in Moodle.
For homework, you will answer questions on a dataset of IMDb movie data.
… movies.
library(dplyr)
movies <- movies |>
  mutate(budget = as.numeric(budget), intgross = as.numeric(intgross))
colnames(movies)
- Use select to only show the following columns. Resave
the data with only the selected columns by assigning it a new name.
You can use this new data frame for the rest of the assignment.
Select columns
- Use the arrange() function to determine what 3 movies
had the highest budget.
- Use the arrange() function to determine the G rated
movie with the highest IMDb metascore. (Hint: you will have to filter
first!)
- … intgross), using the
mutate() function. (Hint: you will have to have a name for
your new profit column.)
- Use the mean function to get the
average.

- filter()
- select()
- distinct()
- nrow()
- arrange()
- mutate()

You can always use the “?” before a function to pull up the help
page of a function, which also has examples. For example:
?mutate
There are loads of resources out there for dplyr, including simply googling what you want to do. Here are a few:
[1] PRISM Climate Group, Oregon State University, https://prism.oregonstate.edu, accessed 17 Jan 2024.