In this lesson, you will learn how to plot graphs or figures in R.
When working with data, it is good practice to plot the data to evaluate
it. Plots can help you identify trends or even problems in the data. You will
also need plots to display and communicate results. R has basic
plotting functions, but we will use the functions in the package
ggplot2.
The packages used in this class are dplyr and ggplot2.
To use ggplot2, you must first install it. You can use
the Packages tab in the lower right-hand pane of RStudio or run the
code:
install.packages('ggplot2')
You only have to install the ggplot2 package once, but
don’t forget to load it each time you want to use it.
library(ggplot2)
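If you share a script with classmates, it can help to install a package only when it is missing. The sketch below is one possible way to do that. The names needed_packages, is_installed, and missing_packages are just illustrations, and the install step assumes an internet connection.

```r
# Packages this lesson relies on
needed_packages <- c("ggplot2", "dplyr")

# requireNamespace() returns TRUE when a package is already installed
is_installed <- vapply(needed_packages, requireNamespace,
                       logical(1), quietly = TRUE)

# Install only the packages that are missing
missing_packages <- needed_packages[!is_installed]
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

# Loading still has to happen in every new R session
library(ggplot2)
library(dplyr)
```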
In this lesson, let’s start with the movie data that you used in the
homework of the dplyr lesson.
Making graphs is a great way to explore the data.
Go ahead and load the data, then run the code from that lesson to make the budget and gross columns numeric and the rating column a factor.
movies <- read.csv('data/movies.csv')
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
movies <- movies |>
  mutate(budget = as.numeric(budget),
         intgross = as.numeric(intgross),
         rated = as.factor(rated))
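One reason a column may need converting is that it was read in as text. The sketch below uses a small made-up data frame, toy_movies, rather than the real movie file, and assumes a text entry such as "N/A" stands in for a missing budget. It shows that as.numeric() turns text it cannot read into NA, which is one way missing values can end up in a plot's data.

```r
library(dplyr)

# Hypothetical stand-in for the movie data: numbers stored as text
toy_movies <- data.frame(
  title  = c("Movie A", "Movie B", "Movie C"),
  budget = c("13000000", "N/A", "4500000"),
  rated  = c("PG", "R", "PG")
)

# suppressWarnings() hides the "NAs introduced by coercion" warning
toy_movies <- toy_movies |>
  mutate(budget = suppressWarnings(as.numeric(budget)),  # "N/A" becomes NA
         rated  = as.factor(rated))

# Check the new column types
str(toy_movies)

# Count the values that could not be converted
sum(is.na(toy_movies$budget))
```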
We will also simplify the data by selecting only some of the columns
and assigning the result to the name movies_select.
movies_select <- movies |>
  select(year, title, budget, intgross, rated, metascore, imdb_rating)
ggplot2 uses what are called “layers” to plot data.
These layers are added on top of one another to produce the visualization.
The first layer sets the foundation and uses the function
ggplot(). The ggplot() function expects:
The data: the data frame that holds the values to plot.
The aesthetic mapping, set with aes(): this provides the columns
that will be displayed on the graph. For example, which column is the
x axis and which is the y axis.
Let’s say that we wanted to plot a histogram of the number of
movies released in each year. A histogram shows the distribution of
values, with the value on the x axis. In this case, the x axis will be
the year and the y axis will be the number of times each year is found in the data. To
do this, we will set the aesthetic for the x axis as the
year column.
ggplot(data = movies_select, aes(x = year))
You will notice that this plots an empty graph. This code sets up the
foundation of the plot, including the “coordinate system” or scale of
the axes. To add the data, you must add an additional layer. To
plot the release years as a histogram, we will add the
geom_histogram() function to the foundation. Because you
have already stated the data and the axes, you do not need to provide
additional arguments.
ggplot(data = movies_select, aes(x = year)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
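The message above suggests choosing a bin width yourself. The sketch below is one way to see both ideas at once: the foundation has no layers until a geom is added, and binwidth = 1 gives one bar per year. It uses a made-up data frame, toy_years, in place of movies_select. The names base_plot and hist_plot are illustrative, and inspecting the layers element is only a quick check, not something you need in everyday plotting.

```r
library(ggplot2)

# Hypothetical release years standing in for movies_select$year
toy_years <- data.frame(year = c(1990, 1990, 1991, 1993, 1993, 1993, 1995))

# Foundation only: data and aesthetics, no layers yet
base_plot <- ggplot(data = toy_years, aes(x = year))
length(base_plot$layers)

# Add a histogram layer with one bin per year
hist_plot <- base_plot +
  geom_histogram(binwidth = 1)
length(hist_plot$layers)

# Draw the histogram
hist_plot
```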
What do you notice about the movie release date data?
Activity: Make a histogram of the movie metascores.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
To create different types of plots, you would use a different layer. Some examples are:
geom_point: plots points for a scatterplot
geom_bar: barplot
geom_boxplot: boxplot
geom_line: line graph
For most graphs, you would need to specify both the x and y axis in
the aesthetics aes(). For example, let’s say that you wanted
to see the budget of movies for each rating. Here, we would have
rated on the x axis and budget on the y axis.
You could do this with a point for every data point using
geom_point.
ggplot(data = movies_select, aes(x = rated, y = budget)) +
  geom_point()
You could also graph this data with a boxplot, which might make more sense here. A boxplot shows the distribution of the data by displaying summary statistics, including the median, the 25% and 75% quantiles, and outliers. Here, the x aesthetic sets the groups: because rated is a factor, each rating gets its own box.
ggplot(data = movies_select, aes(x = rated, y = budget)) +
  geom_boxplot()
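To connect the boxes to actual numbers, you can compute the same kind of summary statistics with dplyr. This is a sketch on a made-up data frame, toy_budgets, with budgets in arbitrary units. The values it prints should correspond to the bottom edge, middle line, and top edge of each box, assuming the default boxplot settings.

```r
library(dplyr)

# Hypothetical budgets for two ratings (made-up values)
toy_budgets <- data.frame(
  rated  = c("PG", "PG", "PG", "PG", "PG", "R", "R", "R", "R", "R"),
  budget = c(10, 20, 30, 40, 50, 5, 15, 25, 35, 200)
)

# One row of summary statistics per rating
box_stats <- toy_budgets |>
  group_by(rated) |>
  summarise(q25    = quantile(budget, 0.25),
            median = median(budget),
            q75    = quantile(budget, 0.75))

box_stats
```

In this toy data, the budget of 200 in the R group sits far above the rest, which is the kind of value a boxplot would draw as an outlier point.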
You probably don’t care about some of these boxes.
You can combine your dplyr functions with
ggplot. Let’s only look at ratings of G, PG, PG-13, and R.
Here, we will save the filtered data back into the movies_select
object.
movies_select <- movies_select |>
  filter(rated %in% c('G', 'PG', 'PG-13', 'R'))
Now, let’s look at the boxplot.
ggplot(data = movies_select, aes(x = rated, y = budget)) +
  geom_boxplot()
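If you do not want to overwrite an object, another option is to pipe the filtered data straight into ggplot(). The sketch below uses a made-up data frame, toy_movies, and an illustrative name, rating_plot. Note that the dplyr steps are joined with |> while the ggplot layers are still joined with +.

```r
library(dplyr)
library(ggplot2)

# Hypothetical movie data with one rating we want to leave out
toy_movies <- data.frame(
  rated  = c("G", "PG", "PG-13", "R", "Not Rated", "R", "PG"),
  budget = c(20, 35, 90, 40, 1, 60, 25)
)

# Filter first, then hand the result to ggplot() as its data
rating_plot <- toy_movies |>
  filter(rated %in% c("G", "PG", "PG-13", "R")) |>
  ggplot(aes(x = rated, y = budget)) +
  geom_boxplot()

rating_plot
```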
Activity: Make another boxplot, but this time look at the metascore for each movie rating.
So far, the plots have used only the default display settings. In ggplot, you can customize the appearance of the plots using additional code or layers.
First, we will use additional layers to add labels, including axis
labels and titles. To change the axis labels, we can use the
labs() layer. In labs() you can set
the x axis label (x), the y axis label (y), and the title (title).
We will “add” the labs() layer to change the x and y
axis labels.
ggplot(movies_select, aes(x = rated, y = metascore)) +
  geom_boxplot() +
  labs(x = "Movie Rating", y = "IMDB Metascore")
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
You can also add a title with “title =”.
ggplot(movies_select, aes(x = rated, y = metascore)) +
  geom_boxplot() +
  labs(x = "Movie Rating", y = "IMDB Metascore", title = "IMDB scores for different movie ratings")
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
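Besides x, y, and title, labs() also accepts subtitle and caption. The sketch below shows them together on a made-up data frame, toy_scores. The label text and the name labeled_plot are illustrative only.

```r
library(ggplot2)

# Hypothetical scores for two ratings (made-up values)
toy_scores <- data.frame(
  rated     = c("PG", "PG", "PG", "R", "R", "R"),
  metascore = c(55, 62, 70, 48, 66, 81)
)

# One labs() call can hold all of the text labels
labeled_plot <- ggplot(toy_scores, aes(x = rated, y = metascore)) +
  geom_boxplot() +
  labs(x = "Movie Rating",
       y = "Metascore",
       title = "Scores for different movie ratings",
       subtitle = "Toy data for illustration",
       caption = "Values are made up")

labeled_plot
```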
By default, ggplot uses a gray background with
white gridlines. There are layers to edit the background and gridlines,
but the easiest way to change the overall look of the plot is to change
the “theme” of the plot. There are several different themes that can be
“added” to a plot. We will use one called theme_minimal. To
see a full list, open the help file for the themes by typing:
?theme_minimal
ggplot(movies_select, aes(x = rated, y = metascore)) +
  geom_boxplot() +
  labs(x = "Movie Rating", y = "IMDB Metascore", title = "IMDB scores for different movie ratings") +
  theme_minimal()
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Exercise: Play around with the different themes to find one you like.
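One convenient way to compare themes is to save the plot in an object and add a different theme each time. The sketch below does this with a made-up data frame, toy_scores. The names base_plot and themed_plots are illustrative, and the three themes shown are only a few of the ones that come with ggplot2.

```r
library(ggplot2)

# Hypothetical scores for two ratings (made-up values)
toy_scores <- data.frame(
  rated     = c("PG", "PG", "PG", "R", "R", "R"),
  metascore = c(55, 62, 70, 48, 66, 81)
)

# Save the plot once so it can be reused
base_plot <- ggplot(toy_scores, aes(x = rated, y = metascore)) +
  geom_boxplot()

# The same plot with three different built-in themes
themed_plots <- list(
  bw      = base_plot + theme_bw(),
  classic = base_plot + theme_classic(),
  minimal = base_plot + theme_minimal()
)

# Display one of them
themed_plots$classic
```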
You can also change the color in ggplot using the ‘color’ or ‘fill’
argument. Color is used to change the color of the outline, and fill is
used to color the inside of a shape. For geom_boxplot(), you would
change the color of the lines with color = and the inner
color of the boxes with fill = .
ggplot(movies_select, aes(x = rated, y = metascore)) +
  geom_boxplot(color = 'darkgreen', fill = 'lightgreen')
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Let’s say you are plotting a scatterplot of the budget and the
metascore. You can change the color of the points with
color = inside the geom_point() function.
ggplot(data = movies_select, aes(x = budget, y = metascore)) +
  geom_point(color = 'seagreen')
## Warning: Removed 159 rows containing missing values or values outside the scale range
## (`geom_point()`).
To see a full list of colors you can use, see R color chart.
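You can also list the built-in color names from inside R with the base function colors(). The short sketch below checks that the names used in this lesson are valid and searches for other greens. The variable names all_colors and lesson_colors are illustrative.

```r
# All color names that R recognizes
all_colors <- colors()
head(all_colors)

# Check that the colors used in this lesson are valid names
lesson_colors <- c("darkgreen", "lightgreen", "seagreen")
lesson_colors %in% all_colors

# Find every color name that contains "green"
grep("green", all_colors, value = TRUE)
```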
In the layers, you can also change other aspects, including:
size =
shape =
linetype =
For a list of the aesthetics, see https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
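As a rough illustration of these arguments, the sketch below draws a dashed line with large triangular points. The data frame toy_trend is made up, and the particular values (size 3, shape 17, a dashed line) are arbitrary choices.

```r
library(ggplot2)

# Hypothetical yearly budgets (made-up values)
toy_trend <- data.frame(
  year   = 2000:2005,
  budget = c(30, 42, 38, 55, 61, 58)
)

styled_plot <- ggplot(toy_trend, aes(x = year, y = budget)) +
  geom_line(linetype = "dashed") +      # dashed instead of solid line
  geom_point(size = 3, shape = 17)      # larger points, filled triangles

styled_plot
```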
So far we have changed the color of all the data. However, a very useful way to use color is to base the color on a different variable, or column in the data.
To color based on a variable, we need to add color in the aesthetic
mapping, or aes(), to say which column the color should be based on.
Here, we will go back to our budget and metascore
scatterplot from the movies data. We can color each of the points based
on movie rating. We have to indicate in aes() that we want
the color to be based on the “rated” column.
ggplot(data = movies_select, aes(x = budget, y = metascore)) +
  geom_point(aes(color = rated))
## Warning: Removed 159 rows containing missing values or values outside the scale range
## (`geom_point()`).
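The difference between mapping a color and setting a color is easy to mix up, so here is a side-by-side sketch. It uses a made-up data frame, toy_movies, and the names mapped_plot and fixed_plot are illustrative.

```r
library(ggplot2)

# Hypothetical movies (made-up values)
toy_movies <- data.frame(
  budget    = c(20, 35, 90, 40, 60, 25),
  metascore = c(55, 62, 70, 48, 66, 81),
  rated     = c("G", "PG", "PG-13", "R", "R", "PG")
)

# Mapping: color depends on the rated column, so it goes inside aes()
mapped_plot <- ggplot(toy_movies, aes(x = budget, y = metascore)) +
  geom_point(aes(color = rated))

# Setting: one fixed color for every point, so it goes outside aes()
fixed_plot <- ggplot(toy_movies, aes(x = budget, y = metascore)) +
  geom_point(color = "seagreen")

mapped_plot
fixed_plot
```

The mapped version gets a legend because the colors carry information. If a color name such as "seagreen" is placed inside aes(), ggplot treats it as a data value rather than a color, so the points will not come out in that color.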
For homework, continue working with the Raleigh climate data from the
dplyr lesson.
Read in the climate data set. It is in your data folder as “raleigh_prism_climate.csv”.
Make a histogram of the precipitation data.
What do you notice about the precipitation data? Write your answer as a comment in your R script.
Filter the data set to only include data from your birth year. Assign the data a new name.
Make a scatter plot using the data you just made with your birth year. Plot the mean temperature by month.
Connect the dots from the last plot to make a line graph. To do
this, you will add a line plot layer to the previously made graph. The
line plot layer is geom_line().
Do you notice a pattern? Comment your answer in the script.
Using the same graph of your birth year and mean temperature, change the color of the line.
Using the same graph, change the theme of the graph.
Using the same graph, add a title to the graph.
Functions used in this lesson: ggplot(), geom_*(), labs(), theme_*()