A Grammar of Graphics

Introduction to ggplot2

Théo Boulakia

18 September 2024

Get started

Workflow

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Demandes de valeurs foncières (real estate transactions)

Import

Simple (base R)

dvf = read.csv("data/dvf_aiglun.csv")

Faster (readr package)

dvf = read_csv("data/dvf_aiglun.csv")

Even faster

dvf = vroom::vroom("data/dvf_aiglun.csv")

Fast and simple

dvf = rio::import("data/dvf_aiglun.csv")

Fast, simple and safe

dvf = rio::import(here::here("data", "dvf_aiglun.csv"))

Inspect

class(dvf)
[1] "data.frame"
dvf
    X       date        type   prix surface pieces latitude longitude
1   1 2019-02-20      Maison 178000      80      4 44.05407  6.140493
2   2 2019-03-08 Appartement 120000      48      2 44.05963  6.147804
3   3 2019-05-24      Maison 320000     172      4 44.06694  6.138814
4   4 2019-06-06      Maison 164350      76      3 44.05272  6.142440
5   5 2019-06-07      Maison 250000      95      4 44.07036  6.133687
6   6 2019-11-08      Maison 290000     127      4 44.05521  6.134305
7   7 2019-11-08      Maison 306600     125      6 44.05431  6.132984
8   8 2019-11-16 Appartement 178000      91      5 44.06010  6.148028
9   9 2019-12-12 Appartement 139000      66      3 44.06010  6.148028
10 10 2020-02-11      Maison  76000      88      4 44.05283  6.133513
11 11 2020-06-18      Maison 250000     166      5 44.05407  6.140493
12 12 2020-08-07      Maison 190000      97      4 44.04928  6.140711
13 13 2020-07-30 Appartement 147500      91      4 44.05963  6.147804
14 14 2020-11-13      Maison 280000     160      4 44.05181  6.146251
15 15 2021-01-15      Maison 240000      81      4 44.05803  6.148892
16 16 2021-06-09 Appartement  44000      26      1 44.05963  6.147804
17 17 2021-07-02      Maison 389500     135      4 44.05583  6.147911
18 18 2021-11-04      Maison 250000      96      4 44.05021  6.138194
19 19 2022-03-25      Maison 180000     100      5 44.05719  6.136185
20 20 2022-05-24      Maison  93000     131      4 44.04278  6.143266
21 21 2022-05-30      Maison 220000     125      6 44.06630  6.140405
22 22 2022-07-06 Appartement 160000     122      4 44.06040  6.110681
23 23 2022-07-25 Appartement 194000      78      3 44.05952  6.147453
24 24 2022-07-19 Appartement  91000      35      2 44.05963  6.147804
25 25 2022-07-19      Maison 256550      99      4 44.06038  6.147991
26 26 2022-08-20      Maison 187000     125      5 44.05235  6.145856
27 27 2022-09-30      Maison 330000     132      4 44.05781  6.134520
28 28 2022-12-05 Appartement 100000      86      4 44.05963  6.147804
29 29 2022-12-19 Appartement  70000      40      2 44.05952  6.147453
30 30 2023-03-09      Maison 257700      77      4 44.05336  6.133104
31 31 2023-08-08      Maison  70000      50      2 44.04813  6.139726
32 32 2023-09-15      Maison 414550     119      6 44.05078  6.140327
33 33 2023-10-25      Maison 291300     126      5 44.06106  6.139871
34 34 2023-12-08      Maison 263600     125      5 44.05235  6.145856

Scatterplot

ggplot(data = dvf, mapping = aes(x = surface, y = prix)) +
  geom_point()

Boxplot

ggplot(data = dvf, mapping = aes(x = type, y = prix)) +
  geom_boxplot()

Size aesthetic

ggplot(data = dvf, mapping = aes(x = surface, y = prix,
                                 size = pieces)) +
  geom_point()

Let’s break it down

ggplot(data = dvf,
       mapping = aes(x = surface, y = prix, size = pieces, colour = type)) +
  geom_point()
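
One way to break the plot down is to look at the data frame that ggplot2 actually hands to the geom. The sketch below is only an illustration: `dvf_head` is a hypothetical name for the first eight rows of `dvf`, retyped from the printout above so that the example runs without the CSV file.

```r
library(ggplot2)

# First eight rows of dvf, copied from the printed table (assumption:
# a small subset is enough to illustrate the idea)
dvf_head = data.frame(
  type    = c("Maison", "Appartement", "Maison", "Maison",
              "Maison", "Maison", "Maison", "Appartement"),
  prix    = c(178000, 120000, 320000, 164350, 250000, 290000, 306600, 178000),
  surface = c(80, 48, 172, 76, 95, 127, 125, 91),
  pieces  = c(4, 2, 4, 3, 4, 4, 6, 5)
)

p = ggplot(data = dvf_head,
           mapping = aes(x = surface, y = prix, size = pieces, colour = type)) +
  geom_point()

# layer_data() returns the computed data for one layer: each mapped
# variable has become an aesthetic column (x, y, size, colour)
drawn = layer_data(p, 1)
drawn[, c("x", "y", "size", "colour")]
```

Each row of `drawn` is one point: `x` and `y` carry the raw values, while `size` and `colour` have already been translated by their scales into point sizes and colour codes.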

Députés (members of the National Assembly)

Import

deputes = readr::read_csv("data/deputes.csv")

Inspect

class(deputes)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
deputes
# A tibble: 577 × 7
   nom        prenom departement circonscription profession groupe groupe_abrege
   <chr>      <chr>  <chr>                 <dbl> <chr>      <chr>  <chr>        
 1 K/Bidi     Émeli… Réunion                   4 Avocate    Gauch… GDR          
 2 Bénard     Édoua… Seine-Mari…               3 Collabora… Gauch… GDR          
 3 Le Feur    Sandr… Finistère                 4 Agriculte… Ensem… EPR          
 4 Tesson     Thier… Nord                     17 (74) - An… Rasse… RN           
 5 Villedieu  Antoi… Haute-Saône               1 Policier   Rasse… RN           
 6 Gernigon   Franç… Maine-et-L…               1 Ancien ca… Horiz… HOR          
 7 Sas        Eva    Paris                     8 Cadre sup… Écolo… EcoS         
 8 Auzanot    Bénéd… Vaucluse                  2 (85) - Pe… Rasse… RN           
 9 Tavernier  Boris  Rhône                     2 (37) - Ca… Écolo… EcoS         
10 Christoph… Paul   Drôme                     1 (37) - Ca… Socia… SOC          
# ℹ 567 more rows

Barplot

ggplot(deputes, aes(x = groupe_abrege)) +
  geom_bar()

Indice de position sociale des collèges (social position index of middle schools)

Import

ips = arrow::read_parquet("data/ips_colleges.parquet")

Inspect

class(ips)
[1] "tbl_df"     "tbl"        "data.frame"
ips
# A tibble: 6,962 × 8
   nom_etablissement       academie   ips sd_ips secteur departement nom_commune
   <chr>                   <chr>    <dbl>  <dbl> <chr>   <chr>       <chr>      
 1 college hutinel         creteil  105.    34.8 public  seine-et-m… gretz arma…
 2 college de l europe     creteil  107.    31.8 public  seine-et-m… dammartin …
 3 college marie laurencin creteil  112.    34.4 public  seine-et-m… ozoir la f…
 4 college marie curie     creteil   86.9   30.3 public  seine-et-m… provins    
 5 college victor schoelc… creteil  115.    33   public  seine-et-m… torcy      
 6 college les bles d or   creteil  130.    31   public  seine-et-m… bailly rom…
 7 college erik satie      creteil  104.    34   public  seine-et-m… mitry mory 
 8 college lucie aubrac    creteil  111.    39.2 public  seine-et-m… montevrain 
 9 college colonel arnaud… creteil  127.    34.6 public  seine-et-m… vulaines s…
10 college flora tristan   versail…  86.9   32.6 public  yvelines    carrieres …
# ℹ 6,952 more rows
# ℹ 1 more variable: code_commune <chr>

Boxplot

ggplot(ips, aes(x = secteur, y = ips)) +
  geom_boxplot()

Scatterplot

ggplot(ips, aes(x = ips, y = sd_ips)) +
  geom_point()

Density 2D

ggplot(ips, aes(x = ips, y = sd_ips)) +
  geom_density2d()

Aesthetic mappings

Aesthetics

Mapping vs setting

Mapping

ggplot(data, aes(x = Var1, y = Var2, size = Var3, colour = Var4))

Setting

ggplot(data, aes(x = Var1, y = Var2)) +
  geom_point(colour = "red", alpha = 0.1, size = 4)
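
The difference between mapping and setting can be checked on the computed layer data. This is a minimal sketch, assuming a small subset of `dvf` retyped from the printout (the name `dvf_head` is hypothetical). It also shows a common slip: writing a constant inside `aes()`.

```r
library(ggplot2)

# First eight rows of dvf, copied from the printed table
dvf_head = data.frame(
  type    = c("Maison", "Appartement", "Maison", "Maison",
              "Maison", "Maison", "Maison", "Appartement"),
  prix    = c(178000, 120000, 320000, 164350, 250000, 290000, 306600, 178000),
  surface = c(80, 48, 172, 76, 95, 127, 125, 91)
)

# Mapping: colour depends on a variable, inside aes()
mapped = ggplot(dvf_head, aes(x = surface, y = prix, colour = type)) +
  geom_point()

# Setting: colour is a constant, outside aes(), in the geom
fixed = ggplot(dvf_head, aes(x = surface, y = prix)) +
  geom_point(colour = "red")

# Slip: a constant inside aes() is treated as a one-value variable
quoted = ggplot(dvf_head, aes(x = surface, y = prix, colour = "red")) +
  geom_point()

unique(layer_data(mapped)$colour)  # one colour per type, chosen by the scale
unique(layer_data(fixed)$colour)   # the constant that was set
unique(layer_data(quoted)$colour)  # a single colour picked by the default scale, not red
```

Only the mapped and quoted versions produce a legend, because only they involve a scale.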

Raw scatterplot

ggplot(ips, aes(x = ips, y = sd_ips)) +
  geom_point()

Setting alpha

ggplot(ips, aes(x = ips, y = sd_ips)) +
  geom_point(alpha = 0.3)

Setting size

ggplot(ips, aes(x = ips, y = sd_ips)) +
  geom_point(size = 0.5)

Setting color

ggplot(ips, aes(x = ips, y = sd_ips)) +
  geom_point(color = "indianred")

Setting shape

ggplot(ips, aes(x = ips, y = sd_ips)) +
  geom_point(shape = 3)

Mapping color

ggplot(ips, aes(x = ips, y = sd_ips, colour = secteur)) +
  geom_point()

Mapping shape

ggplot(dvf, aes(x = surface, y = prix)) +
  geom_point(aes(shape = type))

Mapping and setting

ggplot(dvf, aes(x = surface, y = prix)) +
  geom_point(aes(shape = type), size = 3)

Layers

Components

geom

  • histogram
  • violin
  • rug
  • line
  • point
  • tile
  • smooth
  • raster
  • and more…

stat

  • count
  • density
  • sum
  • identity
  • unique
  • summary
  • function
  • and more…

position

  • dodge
  • jitter
  • stack
  • identity
  • and more…

geom_X()

ggplot(deputes, aes(x = groupe_abrege)) +
  geom_bar()

stat_X()

ggplot(deputes, aes(x = groupe_abrege)) +
  stat_count()

layer()

ggplot(deputes, aes(x = groupe_abrege)) +
  layer(geom = "bar", stat = "count", position = "identity")

Exploring combinations

ggplot(deputes, aes(x = groupe_abrege)) +
  layer(geom = "point", stat = "count", position = "identity")
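
The three spellings above should describe the same computation, which can be verified on the layer data. The sketch below assumes only the ten `groupe_abrege` values visible in the `deputes` printout; the data frame name `groups` is hypothetical.

```r
library(ggplot2)

# groupe_abrege for the first ten rows of deputes, copied from the printout
groups = data.frame(
  groupe_abrege = c("GDR", "GDR", "EPR", "RN", "RN",
                    "HOR", "EcoS", "RN", "EcoS", "SOC")
)

base = ggplot(groups, aes(x = groupe_abrege))

# Same statistic (count) reached through three entry points
via_geom  = layer_data(base + geom_bar())
via_stat  = layer_data(base + stat_count())
via_layer = layer_data(base + layer(geom = "bar", stat = "count",
                                    position = "identity"))

# One row per group, with the count computed by the stat
via_geom[, c("x", "count")]
```

`geom_bar()` uses `position = "stack"` by default, whereas the explicit `layer()` call uses `"identity"`. With a single bar per group the counts are the same either way.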

Adding layers

point

ggplot(dvf, aes(x = surface, y = prix)) +
  geom_point()

point + smooth

ggplot(dvf, aes(x = surface, y = prix)) +
  geom_point() + geom_smooth()

point + smooth + rug

ggplot(dvf, aes(x = surface, y = prix)) +
  geom_point() + geom_smooth() + geom_rug()
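
Because a plot is an ordinary R object, layers can also be added step by step to a stored plot. The following is a sketch under assumptions: `dvf_head` is a hypothetical subset of `dvf` retyped from the printout, and `method = "lm"` is chosen here only to get a simple, deterministic fit (the slides use the default smoother).

```r
library(ggplot2)

# First eight rows of dvf, copied from the printed table
dvf_head = data.frame(
  prix    = c(178000, 120000, 320000, 164350, 250000, 290000, 306600, 178000),
  surface = c(80, 48, 172, 76, 95, 127, 125, 91)
)

# Start with points only
p = ggplot(dvf_head, aes(x = surface, y = prix)) +
  geom_point()

# Add a second layer to the stored plot; p itself is not modified
p2 = p + geom_smooth(method = "lm", formula = y ~ x)

pts   = layer_data(p2, 1)  # layer 1: one row per observation
trend = layer_data(p2, 2)  # layer 2: fitted line evaluated on a grid of x values

nrow(pts)
nrow(trend)
```

Each layer carries its own computed data: the point layer keeps the eight observations, while the smooth layer holds the fitted values and the confidence band (`ymin`, `ymax`).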

Scales

Color aesthetic

ggplot(data = dvf, mapping = aes(x = surface, y = prix,
                                 colour = type)) +
  geom_point(size = 3)

Scale color manual

ggplot(data = dvf, mapping = aes(x = surface, y = prix, colour = type)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Maison" = "red", "Appartement" = "darkblue"))

Scale color viridis

ggplot(data = dvf, mapping = aes(x = surface, y = prix, colour = type)) +
  geom_point(size = 3) +
  scale_color_viridis_d()
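
A scale is the function that turns data values into aesthetic values. As a small check, assuming a hypothetical subset `dvf_head` retyped from the `dvf` printout, the colours produced by the manual scale can be read back from the layer data:

```r
library(ggplot2)

# First eight rows of dvf, copied from the printed table
dvf_head = data.frame(
  type    = c("Maison", "Appartement", "Maison", "Maison",
              "Maison", "Maison", "Maison", "Appartement"),
  prix    = c(178000, 120000, 320000, 164350, 250000, 290000, 306600, 178000),
  surface = c(80, 48, 172, 76, 95, 127, 125, 91)
)

p = ggplot(dvf_head, aes(x = surface, y = prix, colour = type)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Maison" = "red", "Appartement" = "darkblue"))

# The colour column now holds the values supplied to the scale
drawn = layer_data(p)
table(drawn$colour)
```

Naming the elements of `values` ties each colour to a category explicitly, so the result does not depend on the order of the levels.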

Coordinate systems

Cartesian coordinates

ggplot(ips, aes(x = "write anything", fill = secteur)) +
  geom_bar()

Polar coordinates

ggplot(ips, aes(y = "write anything", fill = secteur)) +
  geom_bar() +
  coord_polar()

Spatial data

Simple feature collection with 257 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.65187
Geodetic CRS:  WGS 84
First 10 features:
                      CNTR_NAME                       geometry
1      الإمارات العربية المتحدة MULTIPOLYGON (((56.35462 25...
2           افغانستان-افغانستان MULTIPOLYGON (((74.7055 37....
3           Antigua and Barbuda MULTIPOLYGON (((-61.80237 1...
4                      Anguilla MULTIPOLYGON (((-63.05444 1...
5                     Shqipëria MULTIPOLYGON (((19.831 42.4...
6                      Հայաստան MULTIPOLYGON (((46.45984 39...
7                        Angola MULTIPOLYGON (((23.83831 -1...
8                     Argentina MULTIPOLYGON (((-62.87957 -...
9  American Samoa-Sāmoa Amelika MULTIPOLYGON (((-169.3966 -...
10                   Österreich MULTIPOLYGON (((16.88365 48...

Cartographic coordinates

ggplot(countries) +
  geom_sf(fill = "black") +
  coord_sf(crs = "ESRI:54030")

Another one

ggplot(countries) +
  geom_sf(fill = "black") +
  coord_sf(crs = "+proj=bonne +lat_1=10")

Faceting

First example

ggplot(ips, aes(x = ips, y = sd_ips)) +
  geom_point() +
  facet_wrap(~secteur)
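
Faceting splits the data into panels, one per value of the faceting variable. The sketch below illustrates this on a hypothetical subset `dvf_head` retyped from the `dvf` printout, faceting by `type` rather than `secteur` so that it runs without the parquet file.

```r
library(ggplot2)

# First eight rows of dvf, copied from the printed table
dvf_head = data.frame(
  type    = c("Maison", "Appartement", "Maison", "Maison",
              "Maison", "Maison", "Maison", "Appartement"),
  prix    = c(178000, 120000, 320000, 164350, 250000, 290000, 306600, 178000),
  surface = c(80, 48, 172, 76, 95, 127, 125, 91)
)

p = ggplot(dvf_head, aes(x = surface, y = prix)) +
  geom_point() +
  facet_wrap(~type)

# Every row of the layer data is assigned to a panel
drawn = layer_data(p)
table(drawn$PANEL)
```

Panels follow the sorted values of the faceting variable, so here panel 1 holds the two Appartement rows and panel 2 the six Maison rows.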

Meaningful example

d = fs::path_home("these", "data", "eec") |> 
  arrow::open_dataset() |> 
  filter(rgi == 1,
         !is.na(acteu)) |> 
  select(sexe, ag, acteu, csp, csa, santgen) |> 
  mutate(cs = coalesce(csp, csa),
         cs = if_else(cs %in% c("10", "11", "12", "13"), "10", cs)) |>
  collect()

d |> 
  filter(between(ag, 20, 80)) |> 
  group_by(ag, cs, santgen) |> 
  summarise(n = n(), .groups = "drop_last") |>
  mutate(pct = n / sum(n)) |> 
  ungroup() |> 
  pivot_wider(names_from = santgen, values_from = c(pct, n), values_fill = 0) |> 
  mutate(good = pct_1 + pct_2) |> 
  select(ag, cs, good) |> 
  ggplot() +
  geom_smooth(aes(x = ag, y = good)) +
  facet_wrap(~cs) +
  labs(x = "",
       y = "Proportion de personnes en bonne santé") +
  theme_bw() 

Labels

A plot to dress up

ggplot(ips, aes(y = forcats::fct_reorder(str_to_title(academie), ips),
                x = ips)) +
  stat_summary(fun = "median", geom = "point") +
  coord_cartesian(xlim = c(60, NA))

labs()

ggplot(ips, aes(y = forcats::fct_reorder(str_to_title(academie), ips),
                x = ips)) +
  stat_summary(fun = "median", geom = "point") +
  coord_cartesian(xlim = c(60, NA)) +
  labs(x = "Indice de position sociale",
       y = "",
       title = "Position sociale des académies",
       subtitle = "Pour la rentrée 2021-2022",
       caption = "Données : Indice de position sociale des collèges")

Themes

Save typing

p = ggplot(ips, 
           aes(y = forcats::fct_reorder(str_to_title(academie), ips),
               x = ips)) +
  stat_summary(fun = "median", geom = "point") +
  coord_cartesian(xlim = c(60, NA)) +
  labs(x = "Indice de position sociale",
       y = "",
       title = "Position sociale des académies",
       subtitle = "Pour la rentrée 2021-2022",
       caption = "Données : Indice de position sociale des collèges")

Grey (default)

p + theme_grey()

Classic

p + theme_classic()

Minimal

p + theme_minimal()

WSJ

p + ggthemes::theme_wsj()

Conclusion

Recap

  • Data
  • Aesthetics
  • Scales
  • Coordinate systems
  • Themes

Advantages

  • Reproducibility

  • Not just a collection of special cases

  • Graphs built incrementally

  • Same tool for exploratory analysis and communication

  • High level of control

  • Encourages custom-made graphics

  • Easy to extend