Set up

In this practice, we will work with a dataset that comes with the ggplot2 package.

To load this data, run the following code.

library(ggplot2)
data(mpg)

You can now see the first few rows using the head() function.

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

This dataset includes fuel economy data for 38 different car models from the years 1999 and 2008. We will work with this dataset throughout this practice set.
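
To check that description against the data itself, a quick sketch like the one below may help. It uses only base R functions on the mpg data frame loaded above; nothing here is required for the exercises.

```r
library(ggplot2)  # provides the mpg data

# Number of rows (one per car configuration) and columns
dim(mpg)

# Number of distinct car models in the data
length(unique(mpg$model))

# Model years covered by the data
sort(unique(mpg$year))
```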

In this activity, I will provide hints throughout. Only use the hints if you need them! Try to see what you can do without them first.

Summary tables

  1. Before you begin the summaries, make sure you load the package to do summaries!
    Hint Load in the dplyr package
    Answer
     library(dplyr)
  2. Make a summary table to show the average city gas mileage, cty, for each vehicle class, class.
    Hint Group by class and summarise based on cty
    Answer
     mpg |>
       group_by(class) |>
       summarize(city_mean = mean(cty))
  3. Make a summary table to show the maximum highway gas mileage (hwy) for each manufacturer (manufacturer).
    Hint Group by the manufacturer and summarize to calculate the max() for the highway mileage.
    Answer
     mpg |> 
       group_by(manufacturer) |>
       summarize(hwy_max = max(hwy))
  4. Make a summary table to show the mean city gas mileage and the mean highway gas mileage for each year.
    Hint Group by the year. Then add two different summaries in summarize()
    Answer
     mpg |>
       group_by(year) |>
       summarize(city_mean = mean(cty), hwy_mean = mean(hwy))
  5. Make a summary table to show the mean city gas mileage for each class type and year.
    Hint Group by the year and class
    Answer
     mpg |>
       group_by(class, year) |>
       summarize(city_mean = mean(cty))
  6. Filter the summary table you just made to only include the year 2008.
    Hint Add a filter to the code you just used to make the table after a pipe.
    Answer
     mpg |>
       group_by(class, year) |>
       summarize(city_mean = mean(cty)) |>
       filter(year == 2008)
  7. Make a summary table to display the median city mileage for every manufacturer. Then, sort the table to find which manufacturer has the highest median city mileage.
    Hint Make the summary table and then arrange() by the column you made in the summarize function
    Answer
     mpg |>
       group_by(manufacturer) |>
       summarize(city_median = median(cty)) |>
       arrange(desc(city_median))
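
One detail worth knowing for the two-column summaries above: when you group by more than one column, summarize() removes only the last grouping level by default, so the result is still grouped. The sketch below illustrates this with the same class and year table; the object names by_class_year and ungrouped are made up for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(ggplot2)  # provides the mpg data
library(dplyr)

# Default behavior: the result stays grouped by the first column (class)
by_class_year <- mpg |>
  group_by(class, year) |>
  summarize(city_mean = mean(cty))
group_vars(by_class_year)

# .groups = "drop" returns an ungrouped table and silences the message
ungrouped <- mpg |>
  group_by(class, year) |>
  summarize(city_mean = mean(cty), .groups = "drop")
group_vars(ungrouped)
```

Both tables hold the same values. The difference only matters for later steps, because functions such as mutate() or a second summarize() would still operate within each class on the grouped version.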

Statistics

  1. Do a statistical test to determine if the year the car was manufactured affected the city gas mileage.
    Hint Use the function lm() with cty as the response and year as the predictor
    Answer
     year_lm <- lm(cty ~ year, data = mpg)
  2. View the statistical test. Is there a significant difference in the gas mileage for the different years?
    Hint Use the function summary() to view the statistical model
    Answer
       summary(year_lm)
    There is no significant difference in city gas mileage between the years.
  3. Do a statistical test to see if the highway gas mileage is predicted by the vehicle class.
    Hint Use the function lm() with hwy as the response and class as the predictor
    Answer
     class_lm <- lm(hwy ~ class, data = mpg)
  4. View the statistical test. Is there a significant difference in the gas mileage based on vehicle class?
    Hint Use the function summary() to view the statistical model
    Answer
       summary(class_lm)
    There is a significant effect of vehicle class on highway gas mileage.
  5. Make a boxplot to show the highway gas mileage for each vehicle class.
    Hint Use your ggplot functions to make a boxplot with the mpg data, with class on x axis and hwy on y axis.
    Answer
       ggplot(mpg, aes(x = class, y = hwy)) +
         geom_boxplot() +
         theme_classic()
  6. Based on the plot, which class had the lowest highway mileage?
  7. Perform a statistical test to see if city gas mileage predicts highway gas mileage.
    Hint Use the lm() function with cty as the predictor and hwy as the response
    Answer
     mpg_lm <- lm(hwy ~ cty, data = mpg)
  8. Is the result significant?
    Hint Use the function summary to view the results.
    Answer
     summary(mpg_lm)
    There is a significant effect of city mileage on highway mileage.
  9. Plot a scatterplot of the city and highway gas mileage and add the best fit line.
    Hint Use your ggplot functions to make a scatterplot with the mpg data, with cty on the x axis and hwy on the y axis. Add a geom_smooth() layer for the best fit line.
    Answer
       ggplot(mpg, aes(x = cty, y = hwy)) +
         geom_point() +
         geom_smooth(method = 'lm') +
         theme_classic()
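
If you would rather read the p-value directly than scan the printed summary, something like the sketch below should work. It refits two of the models from the answers above; the names coefs and year_p are made up for illustration, and the 0.05 cutoff is an assumed significance threshold, not something set by the exercise.

```r
library(ggplot2)  # provides the mpg data

year_lm <- lm(cty ~ year, data = mpg)

# The coefficient table is a matrix with one row per term and the columns
# "Estimate", "Std. Error", "t value", and "Pr(>|t|)"
coefs <- summary(year_lm)$coefficients
year_p <- coefs["year", "Pr(>|t|)"]
year_p < 0.05  # assumed threshold of 0.05

# For a categorical predictor such as class, summary() reports one row per
# class level; anova() gives a single overall F test for the predictor
class_lm <- lm(hwy ~ class, data = mpg)
anova(class_lm)
```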

Joins

Here, we will look at the two dplyr datasets band_instruments and band_members.

Use the function head() to look at these two datasets.

library(dplyr)
head(band_instruments)
## # A tibble: 3 × 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
head(band_members)
## # A tibble: 3 × 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
  1. Join the two data sets to include all names in both data frames.

    Answer

     band_instruments |>
       full_join(band_members)
  2. Join the two data sets to include only names that are found in both data frames.

    Answer

     band_instruments |>
       inner_join(band_members)
  3. Join the two data sets to include only names that are found in the band_instruments data frame.

    Answer

     band_instruments |>
       left_join(band_members)

    or

     band_members |>
       right_join(band_instruments)
  4. Join the two data sets to include only names that are found in the band_members data frame.

    Answer

     band_instruments |>
       right_join(band_members)

    or

     band_members |>
       left_join(band_instruments)
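
The answers above let dplyr guess the key column, which prints a message naming the column it joined on. As a minimal sketch, you can name the key yourself with the by argument and compare row counts to see how many names each join keeps. The object names all_names and both_names are made up for illustration.

```r
library(dplyr)

# Naming the key column explicitly avoids the "Joining with" message
all_names  <- full_join(band_instruments, band_members, by = "name")
both_names <- inner_join(band_instruments, band_members, by = "name")

# Names missing from one table get NA in that table's column
all_names

# Row counts show how many names each join keeps
nrow(all_names)
nrow(both_names)
```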