Introduction to ggplot

In this lesson, you will learn how to plot graphs or figures in R. When working with data, it is good practice to plot the data to evaluate it. This can help identify trends or even problems in the data. Plots are also needed for displaying and communicating results. There are basic plotting functions in R, but we will use the functions in the package ggplot2.

We will use the package ggplot2 to produce visualizations in this class for a few reasons:

  • ggplot works well with the “tidy” data and data manipulations that you did in the previous lesson with dplyr.
  • ggplot is very customizable, allowing you to alter titles, colors, style, etc. with R code.
  • ggplot produces pretty graphs.

Install ggplot2

To use ggplot2, you must first install it. You can use the Packages tab in the lower right-hand pane of RStudio, or run the code:

install.packages('ggplot2')

You only have to install the ggplot2 package once, but don’t forget to load it each time you start a new R session and want to use it.

library(ggplot2)
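If you share a script with others, one common pattern is to install the package only when it is missing. This is a minimal sketch, and it assumes a CRAN mirror is already configured (RStudio usually sets one for you):

```r
# Install ggplot2 only if it is not already available on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current R session
library(ggplot2)
```
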

Getting the data ready

In this lesson, let’s start with the movie data that you used in the homework of the dplyr lesson. Making graphs is a great way to explore the data.

Go ahead and load in the data and also run the code from the lesson to make the data numeric.

movies <- read.csv('data/movies.csv')

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
movies <- movies |>
  mutate(budget = as.numeric(budget), 
         intgross = as.numeric(intgross), 
         rated = as.factor(rated))
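It helps to know what as.numeric() does with values it cannot read as numbers: it turns them into NA. The sketch below uses a made-up character vector (budget_raw is hypothetical, not taken from movies.csv) to show the behavior:

```r
# Made-up values for illustration only; these are not from movies.csv
budget_raw <- c("1000000", "2500000", "N/A")

# as.numeric() converts what it can and returns NA (with a warning) for the rest
budget_num <- suppressWarnings(as.numeric(budget_raw))

budget_num              # the converted values, with NA where conversion failed
sum(is.na(budget_num))  # how many values could not be converted
```

Missing values like these are what ggplot refers to when it warns that rows were removed from a plot.
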

We will also simplify the data by selecting only some of the columns and assigning the result the name movies_select.

movies_select <- movies |>
  select(year, title, budget, intgross, rated, metascore, imdb_rating)

Layers

ggplot2 uses what are called “layers” to plot data. These layers are added on one another to produce the visualization.

The first layer sets the foundation and uses the function ggplot(). The ggplot() function expects:

  1. the data you are plotting
  2. the aesthetic mapping aes(). This provides the columns that will be displayed on the graph. For example, which column is the x axis and which is the y axis.

Let’s say that we wanted to plot a histogram of the number of movies released in each year. A histogram shows the distribution of values, with the value on the x axis and the number of times that value is found in the data on the y axis. In this case, our x axis will be the year and the y axis will be the number of movies released in that year. To do this, we will set the aesthetic for the x axis as the year column.

ggplot(data = movies_select, aes(x = year))

You will notice that this plots an empty graph. This code sets up the foundation of the plot, including the “coordinate system” or scale of the axes. To add the data, you must add an additional layer with +. To plot the years as a histogram, we will add the geom_histogram() function to the foundation. Because you have already stated the data and the axes, you do not need to provide additional arguments.

ggplot(data = movies_select, aes(x = year)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
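The message above is ggplot suggesting that you choose the bin size yourself. One way to do that is the binwidth argument of geom_histogram(). The sketch below uses a small made-up data frame (toy_movies is hypothetical) so that it runs on its own; in the lesson you would use movies_select instead:

```r
library(ggplot2)

# Made-up years for illustration only; substitute movies_select in the lesson
toy_movies <- data.frame(year = c(1990, 1990, 1991, 1993, 1993, 1993, 1995))

# binwidth = 1 makes each bar cover one year instead of using the default 30 bins
p <- ggplot(data = toy_movies, aes(x = year)) +
  geom_histogram(binwidth = 1)

p
```
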

What do you notice about the movie release date data?

Activity: Make a histogram of the movie metascores.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

To create different types of plots, you would use a different layer. Some examples are:

  • geom_point: plots points for a scatterplot
  • geom_bar: barplot
  • geom_boxplot: boxplot
  • geom_line: line graph.
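Because the foundation and the geom are separate layers, the same foundation can be reused with different geoms. The sketch below uses a made-up data frame (toy_budget is hypothetical) and sets both x and y in aes(), which these two geoms need:

```r
library(ggplot2)

# Made-up yearly values for illustration only; not from movies.csv
toy_budget <- data.frame(
  year = c(2000, 2001, 2002, 2003),
  budget = c(10, 25, 15, 40)
)

# The foundation layer can be saved as an object and reused
base_plot <- ggplot(data = toy_budget, aes(x = year, y = budget))

# Same data and axes, different geom layers
scatter <- base_plot + geom_point()
line <- base_plot + geom_line()

scatter
line
```
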

For most graphs, you need to specify both the x and y axes in the aesthetics aes(). For example, let’s say that you wanted to compare the budgets of movies with different ratings. Here, we would have the rating (the rated column) on the x axis and budget on the y axis. You could plot a point for every movie with geom_point.

ggplot(data = movies_select, aes(x = rated, y = budget)) +
  geom_point()

You could also graph this data with a boxplot, which might make more sense here. A boxplot shows the distribution of the data by displaying summary statistics, including the median, the 25% and 75% quantiles, and outliers. Because rated is a factor, ggplot draws a separate box for each rating. If the x variable were numeric, such as year, you would need to specify the “group” in the aesthetic to say which values go into each box.

ggplot(data = movies_select, aes(x = rated, y = budget)) +
  geom_boxplot()
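The group aesthetic matters when the x variable is numeric rather than a factor. As a minimal sketch with made-up numbers (toy_years is hypothetical), grouping by year gives one box per year:

```r
library(ggplot2)

# Made-up budgets for three years; for illustration only
toy_years <- data.frame(
  year = rep(c(2000, 2001, 2002), each = 4),
  budget = c(10, 12, 15, 30, 8, 9, 20, 22, 5, 14, 18, 25)
)

# group = year tells geom_boxplot() to draw a separate box for each year
p <- ggplot(data = toy_years, aes(x = year, y = budget, group = year)) +
  geom_boxplot()

p
```
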

You can see that there are some boxes that you probably don’t care about. You can combine your dplyr functions with ggplot. Let’s only look at ratings of G, PG, PG-13, and R. Here, we will resave the data into the movies_select object.

movies_select <- movies_select |>
  filter(rated %in% c('G', 'PG', 'PG-13', 'R'))

Now, let’s look at the boxplot.

ggplot(data = movies_select, aes(x = rated, y = budget)) +
  geom_boxplot()
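If you do not want to overwrite your data, another option is to pipe the filtered data straight into ggplot(). This is a sketch with a made-up data frame (toy_movies is hypothetical); note that the dplyr steps are joined with |> while the ggplot layers are still joined with +:

```r
library(dplyr)
library(ggplot2)

# Made-up ratings and budgets for illustration only
toy_movies <- data.frame(
  rated = factor(c("G", "G", "PG", "PG", "Not Rated", "Not Rated")),
  budget = c(20, 35, 40, 60, 5, 8)
)

# Filter first, then pass the result to ggplot() as its data argument
p <- toy_movies |>
  filter(rated %in% c("G", "PG")) |>
  ggplot(aes(x = rated, y = budget)) +
  geom_boxplot()

p
```
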

Activity: Make another boxplot, but this time look at the metascore for each movie rating.

Customizing plots

So far, the plots have just been using the default display settings. In ggplot, you can customize the appearance of the plots using additional code or layers.

Labels

First, we will use additional layers to add labels, including axis labels and titles. To change the labels, we can use the labs() layer. In labs() you can set the x-axis label (x), the y-axis label (y), and the title (title).

We will “add” the labs() layer to change the x and y axis labels.

ggplot(movies_select, aes(x = rated, y = metascore))+
  geom_boxplot()+
  labs(x= "Movie Rating", y = "IMDB Metascore")
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

You can also add a title with “title =”.

ggplot(movies_select, aes(x = rated, y = metascore))+
  geom_boxplot()+
  labs(x= "Movie Rating", y = "IMDB Metascore", title = "IMDB scores for different movie ratings")
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Theme

By default, ggplot uses a gray background with white gridlines. There are layers to edit the background and gridlines, but the easiest way to change the overall look of the plot is to change the “theme” of the plot. There are several different themes that can be “added” to a plot. We will use one called theme_minimal. To see a full list, open the help file for the themes by typing ?theme_minimal in the console.

ggplot(movies_select, aes(x = rated, y = metascore))+
  geom_boxplot()+
  labs(x= "Movie Rating", y = "IMDB Metascore", title = "IMDB scores for different movie ratings")+
  theme_minimal()
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
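To compare themes quickly, you can save the plot as an object and add a different theme each time. The sketch below uses a made-up data frame (toy_scores is hypothetical); theme_bw(), theme_classic(), and theme_light() are three of the built-in themes in ggplot2:

```r
library(ggplot2)

# Made-up scores for illustration only; not from movies.csv
toy_scores <- data.frame(
  rated = c("G", "G", "PG", "PG", "R", "R"),
  metascore = c(70, 82, 55, 64, 48, 90)
)

# Save the plot once, without a theme layer
base_plot <- ggplot(toy_scores, aes(x = rated, y = metascore)) +
  geom_boxplot()

# Add a different theme each time to compare the looks
base_plot + theme_bw()
base_plot + theme_classic()
base_plot + theme_light()
```
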

Exercise: Play around with the different themes to find one you like.

Color

You can also change the color in ggplot using the ‘color’ or ‘fill’ argument. Color is used to change the color of the outline, and fill is used to color the inside of a shape. For geom_boxplot(), you would change the color of the lines with color = and the inner color of the boxes with fill = .

ggplot(movies_select, aes(x = rated, y = metascore))+
  geom_boxplot(color = 'darkgreen', fill = 'lightgreen')
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Let’s say you are plotting a scatterplot of the budget and the metascore. You can change the color of the points with color = inside the geom_point() function.

ggplot(data = movies_select, aes(x = budget, y = metascore )) +
  geom_point(color = 'seagreen')
## Warning: Removed 159 rows containing missing values or values outside the scale range
## (`geom_point()`).

To see a full list of colors you can use, see R color chart.
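You can also list the color names from inside R with the base function colors(). A short sketch:

```r
# colors() returns every color name that R recognizes
head(colors())

# How many named colors are available
length(colors())

# Check whether a particular name is valid before using it in a plot
"seagreen" %in% colors()
```
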

Other customizations

In the layers, you can also change other aspects, including:

  • size: size =
  • shape of points: shape =
  • line type (ex. dashed, solid): linetype =

For a list of the aesthetics, see https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
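As a minimal sketch of these arguments, the example below draws a dashed line with large triangle points. The data frame is made up (toy_budget is hypothetical), and shape = 17 is the code for a filled triangle:

```r
library(ggplot2)

# Made-up yearly values for illustration only
toy_budget <- data.frame(
  year = c(2000, 2001, 2002, 2003),
  budget = c(10, 25, 15, 40)
)

# linetype changes the line style; size and shape change the points
p <- ggplot(toy_budget, aes(x = year, y = budget)) +
  geom_line(linetype = "dashed") +
  geom_point(size = 3, shape = 17)

p
```
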

Color Aesthetics

So far we have changed the color of all the data. However, a very useful way to use color is to base the color on a different variable, or column in the data.

To color based on a variable, we need to add color in the aesthetic mapping, or aes(), to say which column the color should be based on. Here, we will go back to our budget and metascore scatterplot from the movies data. We can color each of the points based on movie rating. We have to indicate in aes() that we want the color to be based on the “rated” column.

ggplot(data = movies_select, aes(x = budget, y = metascore )) +
  geom_point(aes(color = rated))
## Warning: Removed 159 rows containing missing values or values outside the scale range
## (`geom_point()`).
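The difference between putting color inside or outside of aes() is easy to mix up. The sketch below shows both side by side with a made-up data frame (toy_movies is hypothetical):

```r
library(ggplot2)

# Made-up values for illustration only; not from movies.csv
toy_movies <- data.frame(
  budget = c(10, 40, 25, 80, 60, 15),
  metascore = c(55, 70, 62, 48, 81, 90),
  rated = c("PG", "PG", "PG", "R", "R", "R")
)

# Inside aes(): color is mapped to a column, so each rating gets its own color
mapped <- ggplot(toy_movies, aes(x = budget, y = metascore)) +
  geom_point(aes(color = rated))

# Outside aes(): color is a fixed setting applied to every point
fixed <- ggplot(toy_movies, aes(x = budget, y = metascore)) +
  geom_point(color = "seagreen")

mapped
fixed
```

When color is mapped inside aes(), ggplot also adds a legend showing which color belongs to which rating.
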

Exercises

For homework, continue working with the Raleigh climate data from the dplyr lesson.

  1. Read in the climate data set. It is in your data folder as “raleigh_prism_climate.csv”

  2. Make a histogram of the precipitation data.

  3. What do you notice about the precipitation data? Write your answer as a comment in your R script.

  4. Filter the data set to only include data from your birth year. Assign the filtered data a new name.

  5. Make a scatter plot using the data you just made with your birth year. Again, plot the mean temperature data by the month data.

  6. Connect the dots from the last plot to make a line graph. To do this, you will add a line plot layer to the previously made graph. The line plot layer is geom_line().

  7. Do you notice a pattern? Comment your answer in the script.

  8. Using the same graph of your birth year and mean temperature, change the color of the line.

  9. Using the same graph, change the theme of the graph.

  10. Using the same graph, add a title to the graph.

New functions

  • ggplot()
  • geom_*()
  • labs()
  • theme_*()