AVED: Data visualization (ggplot2)

Many R packages provide basic or more advanced tools for data visualization. You can use general packages such as graphics, grid, or lattice, or specialized tools such as circlize, among many others.

We focus on packages built on the theoretical background of the so-called “grammar of graphics” (Wilkinson 2005, 2010). These concepts are implemented in the package ggplot2. (There is an even newer package, ggvis, which is very close to ggplot2 in spirit and implementation. We find ggplot2 the more convenient option for a researcher because ggvis is oriented more toward interactive graphics.)

A figure

Each figure produced by ggplot2 consists of three basic components:

The data visualization itself, composed of one or more overlapping layers (usually created by geom_*() functions). Each layer adds one thing to the figure. In our schematic example the first layer contains a scatter plot and the second a smoothing line.
Elements which allow us to understand the visualization: scales (colors, sizes, etc.), legends, and axes.
Data-unrelated elements which define the general appearance of the resulting figure: fonts, grids, background colors, etc.

All components are controlled separately and independently. The handout follows this structure: it first covers data visualization, then the control of the visualization (scales, axes, and legends), and finally the general appearance of figures.
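The three components can be sketched in a single call. This is a minimal illustration rather than code from the handout: the sample size, the seed, the "Blues" palette, and theme_minimal() are assumptions chosen only for the example.

```r
library(ggplot2)

# Assumption: a small random sample, so the sketch draws quickly.
set.seed(1)
d <- diamonds[sample(nrow(diamonds), 500), ]

p <- ggplot(d, aes(x = carat, y = price)) +
  # 1. Data visualization: two overlapping layers
  geom_point(aes(colour = cut)) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  # 2. Elements that explain the visualization: a colour scale and its legend
  scale_colour_brewer(palette = "Blues", name = "Cut") +
  # 3. Data-unrelated appearance
  theme_minimal()

# Only the two geoms count as layers; the scale and the theme are stored separately
length(p$layers)
```

Printing p draws the figure; each of the three parts can be changed without touching the other two.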

Resources for learning

Wickham, H. (2016): ggplot2 : Elegant graphics for Data Analysis. 2nd edition, Springer.
ggplot2 docs website: http://docs.ggplot2.org/current/

This handout borrows a lot from both and from package documentation.

Data visualization: Layer by layer

The basic design of ggplot2 is based on layers. The final figure is a union of multiple layers, each of which adds one feature to the figure.

To illustrate ggplot2 basics we will use the data set diamonds, which contains data on tens of thousands of stones. We will use a random sample of 500 of them for speed and clarity.

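library(ggplot2)
library(dplyr)     # sample_n() and other data-manipulation verbs used below
library(magrittr)  # the %<>% and %$% pipes
data("diamonds")
# Randomly sample 500 rows and overwrite diamonds with the sample
diamonds %<>% sample_n(500)
print(diamonds)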

## # A tibble: 500 × 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.51     Ideal     F     VS2  61.4    56  1569  5.13  5.16  3.16
## 2  1.51   Premium     F     SI1  62.2    58 11374  7.32  7.27  4.54
## 3  0.30 Very Good     D    VVS2  62.9    56   752  4.27  4.29  2.69
## 4  1.12   Premium     H     VS1  59.6    57  5055  6.82  6.77  4.05
## # ... with 496 more rows

The following sequence of figures illustrates construction by layers. Our goal is a smoothed scatter plot of the weight (carat) and price (price) of the stones in the sample.

First the basic layer is generated. The basic layer is a formatted “empty blackboard”, big enough to contain the data:

diamonds %>% 
  ggplot(data = ., mapping = aes(x = carat, y = price))

In the second layer we add dots which represent individual stones (geom_point()):

diamonds %>% 
  ggplot(data = ., mapping = aes(x = carat, y = price)) +
  geom_point()

The last layer in the example adds a smoothing curve to the figure (geom_smooth()):

diamonds %>% 
  ggplot(data = ., mapping = aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess'

Layer properties

geom_*() functions are shortcuts for the more verbose layer(). For example, geom_point() is equivalent to:

layer(
  data = NULL,
  mapping = NULL,
  geom = "point",
  stat = "identity",
  position = "identity"
)

Each layer() argument refers to a property of a layer:

data: a tidy data frame. Each layer must have data, either specified directly or inherited from the ggplot() call. (It is not necessary to specify data in the ggplot() call if it is specified for each layer separately.)
mapping: a set of aesthetic mappings specified using the aes() function. aes() assigns data (specific columns of the input data frame) to visual properties of the geom. For example, in the scatter plot below the values of carat and price are assigned to the point position, the cut of the stones to the outline color of the points, and the color grade of the stones to their fill color. See the following code and the resulting figure:

ggplot(
  data = diamonds,
  mapping = 
    aes(
      x = carat,      # X-coordinate
      y = price,      # Y-coordinate
      color = cut,    # color of the "point" margin
      fill = color,   # color of filling
      size = table    # size of the point
    )
) + 
  geom_point(
    shape = 21,       # shape of "points"
    alpha = 0.2,      # points transparency
    stroke = 2        # thickness of margin
  )

The example uses all aesthetics available for geom_point(). One can notice two important things from the example:

Each layer can set all layer arguments. If they are not set, they are inherited from the initial ggplot() call. Note that geom_point() specifies neither the data nor the mapping.
Aesthetics can be mapped to data or set to a single value (which is then used for all observations in the figure). An aesthetic specification in layer()/geom_*() overrides the specification given in the initial ggplot() call.
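Both points can be checked on a small example. The sketch below is an illustration under assumed settings (the seed, the sample, and the linear smoother are not taken from the handout): the first layer inherits everything, while the second sets colour to a single value and thereby drops the inherited colour mapping for that layer only.

```r
library(ggplot2)

set.seed(1)
d <- diamonds[sample(nrow(diamonds), 500), ]

p <- ggplot(d, aes(x = carat, y = price, colour = cut)) +
  # No data or mapping given: both are inherited from ggplot()
  geom_point() +
  # colour is set to a single value, which overrides the inherited mapping
  # for this layer, so one black line is fitted to all cuts together
  geom_smooth(method = "lm", formula = y ~ x, colour = "black", se = FALSE)

# layer_data() returns the data as drawn by a given layer
unique(layer_data(p, 2)$colour)
```

Removing colour = "black" from geom_smooth() would bring the inherited mapping back and produce one line per cut.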

geom: the name of the geometric object used to draw each observation. The point geom is used in the example discussed above.
stat: the name of the statistical transformation applied to the data before drawing, e.g. the smoothing performed by geom_smooth() in the example above.
position: the method used to adjust overlapping objects, such as jittering, stacking, and dodging.

Remember that geom_*() functions are shortcuts for specific calls of layer(), which may differ in any parameter, not only in geom. We can demonstrate this with the default settings of layer(), geom_point(), and geom_jitter():

# Default layer() setting:
layer(
  data = NULL,
  mapping = NULL,
  geom = NULL,
  stat = NULL,
  position = NULL
)

# Default geom_point() setting (a layer() shortcut):
layer(
  data = NULL,
  mapping = NULL,
  geom = "point",
  stat = "identity",
  position = "identity"
)

# Default geom_jitter() setting (a layer() shortcut):
layer(
  data = NULL,
  mapping = NULL,
  geom = "point",
  stat = "identity",
  position = "jitter"
)

geom_point() and geom_jitter() share the same geom setting but differ in position. (Yes, it is confusing, and it is improved in the ggvis package.) See an example of jittering:

diamonds %>%
  ggplot(data = ., mapping = aes(x = x, y = y)) +
    # x -- length of stones (mm)
    # y -- width of stones (mm)
  geom_point(
    color = "black"
  ) +
  geom_jitter(
    color = "blue",
    width = 1,
    height = 1,
    alpha = 0.3
  )
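That geom_jitter() is only geom_point() with a different position can be verified by inspecting the layer objects. The sketch below assumes that reading the geom and position fields of a layer is acceptable for illustration; these fields are part of the ggplot2 object structure rather than of its documented user interface.

```r
library(ggplot2)

set.seed(1)
d <- diamonds[sample(nrow(diamonds), 500), ]

p_jitter <- ggplot(d, aes(x = x, y = y)) + geom_jitter()
p_point  <- ggplot(d, aes(x = x, y = y)) + geom_point(position = "jitter")

# Both layers use the same geom and the same position adjustment
class(p_jitter$layers[[1]]$geom)[1]
class(p_jitter$layers[[1]]$position)[1]
class(p_point$layers[[1]]$position)[1]
```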

geom_*() functions

ggplot2 allows users to construct many figure types in countless colors, sizes, etc. The following text provides a basic overview of the most common figure types and options.

Description of a variable (or variables)

The first thing that usually catches a researcher's eye is the distribution of a variable. We use different tools for plotting discrete and continuous variables.

Distribution of discrete variable: bar plots with geom_bar()

For a discrete variable it is crucial to see the frequency of each observed value. For example, the data set diamonds contains the column color, which grades the color of the stones from D (best) to J (worst). We can get the frequencies using table():

diamonds %$% table(color)

## color
##  D  E  F  G  H  I  J 
## 70 76 89 99 87 54 25

A common way to visualize the frequencies of a discrete variable is a bar plot.

geom_bar() can produce various bar plots. With the default setting it returns the distribution of a discrete variable (a “histogram”), where the height of each bar equals the number of observations.

geom_bar() understands the following aesthetics: x (required), alpha, colour, fill, linetype, size. As an example we can plot a bar plot of the distribution of stone colors:

diamonds %>% 
  ggplot(
    aes(
      x = color
    )
  ) +
  geom_bar(
    stat = "count",     # default setting
    position = "stack"  # default setting
  )

geom_bar() returned the number of cases at each x position. The numbers were supplied by the function stat_count(), which processes data for geom_bar(stat = "count"). It is possible but not common to use stat_*() functions directly, because each of them is associated with some geom_*() function.
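As a sketch of what using a stat directly looks like, the example below calls stat_count() on the full diamonds data and compares the counts it computes with those from table(). layer_data() is used only to read the computed values back from the plot.

```r
library(ggplot2)

# stat_count() used directly; "bar" is also its default geom
p <- ggplot(diamonds, aes(x = color)) + stat_count(geom = "bar")

# Counts computed by the stat, one row per bar
computed <- layer_data(p)$count

# The same frequencies computed by table()
tabulated <- as.vector(table(diamonds$color))

all(computed == tabulated)
```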

Only the mandatory aesthetic x is used in the example. However, one can use more than one aesthetic. See the examples:

diamonds %>% 
  ggplot(
    aes(
      x = color,
      fill = cut
    )
  ) +
  geom_bar()

diamonds %>% 
  ggplot(
    aes(
      x = color,
      fill = color
    )
  ) +
  geom_bar()

The second example shows that it is possible to map one variable to multiple aesthetics.
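When fill is mapped to a second variable, the position argument decides how the bar segments are arranged. The sketch below contrasts the default stacking with two alternatives; the choice of "dodge" and "fill" is made for illustration and is not part of the original examples.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = color, fill = cut))

p_stack <- base + geom_bar()                    # default: segments stacked
p_dodge <- base + geom_bar(position = "dodge")  # segments side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked and rescaled to sum to 1

# With position = "fill" every bar reaches exactly 1
max(layer_data(p_fill)$ymax)
```

position = "fill" is convenient for comparing shares, while "dodge" keeps the absolute counts comparable across cuts.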

Distribution of continuous variables: histograms, density plots, box, and violin plots

A histogram is designed for plotting the distribution of observed values of a continuous variable. First the continuous scale of observed values is divided into intervals (bins), and then the number of observations in each bin is counted and plotted.

A basic histogram can be created using geom_histogram(). It uses the same aesthetics as geom_bar().

diamonds %>% 
  ggplot(aes(x=price)) +
  geom_histogram()

geom_histogram() allows the user to change the size (argument binwidth, default NULL) or the number (argument bins; 30 bins are used when neither argument is given) of bins. If binwidth is set, bins is ignored.

It is recommended to play a little with binwidth (or bins) to find optimal bin size (number of bins).

diamonds %>% 
  ggplot(
    aes(x=price)
    ) +
  geom_histogram(
    binwidth = 1000
    )
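The effect of binwidth can also be read from the computed data rather than judged by eye. The following sketch uses two arbitrarily chosen widths (500 and 5000, both assumptions for illustration) on the full diamonds data.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = price))

p_fine   <- base + geom_histogram(binwidth = 500)
p_coarse <- base + geom_histogram(binwidth = 5000)

# Each row of layer_data() describes one bin
nrow(layer_data(p_fine))
nrow(layer_data(p_coarse))

# Whatever the width, every observation falls into exactly one bin
sum(layer_data(p_fine)$count) == nrow(diamonds)
```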

geom_histogram() displays the observed values, but sometimes it is useful to estimate the true density from the sample. geom_density() delivers a kernel density estimate of the distribution of the x variable.

geom_density() understands following aesthetics: x, y, alpha, colour, fill, linetype, size, and weight.

diamonds %>% 
  ggplot(
    aes(x=price)
    ) +
  geom_density()

The density estimate is delivered to geom_density() by the stat_density() function, which uses kernel = "gaussian" by default. One way to use a different kernel is to call stat_density() directly:

diamonds %>% 
  ggplot(
    aes(x=price)
    ) +
  stat_density(
    kernel = "optcosine" # for more options see help for density()
  )

How can stat_density() produce a layer (geom)?! The relationship between geom_*() and stat_*() is actually a bit complicated. Most geom_*() functions are associated with a stat function, but stat_*() functions are in turn associated with a geom function. See the default settings of geom_histogram() and the related stat_bin():

geom_histogram(mapping = NULL, data = NULL, stat = "bin",
  position = "stack", ..., binwidth = NULL, bins = NULL, na.rm = FALSE,
  show.legend = NA, inherit.aes = TRUE)

stat_bin(mapping = NULL, data = NULL, geom = "bar", position = "stack",
  ..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL,
  closed = c("right", "left"), pad = FALSE, na.rm = FALSE,
  show.legend = NA, inherit.aes = TRUE)
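The two signatures suggest that a histogram can be requested from either side. The sketch below builds the same layer once through the geom and once through the stat and compares the computed counts; the bin width of 1000 is an assumption carried over from the earlier example.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = price))

p_geom <- base + geom_histogram(binwidth = 1000)          # geom with stat = "bin"
p_stat <- base + stat_bin(geom = "bar", binwidth = 1000)  # stat with geom = "bar"

# Both routes lead to the same stat, the same geom, and the same counts
identical(layer_data(p_geom)$count, layer_data(p_stat)$count)
```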

It might be important to see differences in density for different groups of stones (e.g. different cuts). ggplot2 provides multiple techniques for this task. The first one can be called grouping. Groups are specified by the argument group in aes(). The group argument expects a variable which divides the observations of x into groups; it can be a discrete (or logical) variable. The task specified in the following geom_*() call is performed for each group separately, and the outcomes are plotted in the same figure:

diamonds %>% 
  ggplot(
    aes(x=price, group = cut)
    ) +
  geom_density()

Hm, the outcome is hard to read. Let’s tune it following the same concept of grouping. If we map the grouping variable cut to other aesthetics (which geom_density() understands!), the result can be clear to read and even breathtaking:

diamonds %>% 
  ggplot(
    aes(x=price, 
        fill = cut,
        color = cut # Only one aesthetic would suffice.
        )
    ) +
  geom_density(
    alpha = 0.2  # Just to make it even cooler and easier to read.
  )
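That the computation really runs once per group can be confirmed from the built plot. This is a small check on the full diamonds data, given as an illustration only.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = price, group = cut)) + geom_density()

# One density curve is estimated for every level of cut
length(unique(layer_data(p)$group))
nlevels(diamonds$cut)
```

Mapping cut to fill or color instead of group produces the same grouping, because discrete variables in the mapping define groups implicitly.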

Box and violin plots are designed for comparing distributions across groups. The more common box plot can be created using geom_boxplot(), which plots a stylized distribution for each group.

geom_boxplot() understands the following aesthetics:

lower (lower hinge, 25% quantile),
middle (median, 50% quantile),
upper (upper hinge, 75% quantile),
x (grouping variable),
ymax (upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR),
ymin (lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR),
and alpha, colour, fill, linetype, shape, size, weight

diamonds %>% 
  ggplot(
    aes(
        x = cut,   # Grouping variable
        y = price  # Variable to be plotted
        )
  ) +
  geom_boxplot()
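The statistics listed above are computed by the stat behind geom_boxplot() and can be inspected with layer_data(). The sketch below compares the middle values with medians computed independently and checks the whisker rule; it is an illustration, not part of the original handout.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = cut, y = price)) + geom_boxplot()
stats <- layer_data(p)

# Medians computed without ggplot2, in the order of the levels of cut
medians <- as.vector(tapply(diamonds$price, diamonds$cut, median))
all.equal(stats$middle, medians)

# Whiskers never reach further than 1.5 * IQR beyond the hinges
iqr <- stats$upper - stats$lower
all(stats$ymax <= stats$upper + 1.5 * iqr)
all(stats$ymin >= stats$lower - 1.5 * iqr)
```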

A violin plot is similar to a box plot. While the box plot shows observed quantiles, the violin plot draws a rotated kernel density estimate on each side. See the example produced by geom_violin():

diamonds %>% 
  ggplot(
    aes(x = cut, y = price)
  ) +
  geom_violin()

Distribution of 2D data

ggplot2 provides tools for visualizing 2D data distributions which are analogous to the 1D functions described above.

ggplot2 has two geom_*() functions analogous to geom_histogram() which display the distribution of observed combinations of two variables. Both split the plane into smaller areas and show the number of observations in each. geom_bin2d() splits the plane into rectangles and geom_hex() into hexagons.

geom_bin2d() understands the following aesthetics: x, y, and fill. geom_hex() additionally understands colour and size.

geom_hex() depends on the package hexbin.

diamonds %>% 
  ggplot(
    aes(x = carat, y = price)
  ) +
  geom_bin2d()

diamonds %>% 
  ggplot(
    aes(x = carat, y = price)
  ) +
  geom_hex()

A kernel estimate of the 2D distribution is also available via geom_density2d(). The function draws contours of the estimated distribution.

geom_density2d() understands the following aesthetics: x, y, alpha, colour, linetype, and size.

diamonds %>% 
  ggplot(
    aes(x = carat, y = price)
  ) +
  geom_density2d()

Relationship of variables

Cross-sectional

The basic tool for visualizing the relationship between two continuous variables is without doubt the scatter plot. You can plot it using geom_point().

See examples above.

diamonds %>% 
  ggplot(
    aes(
      x = carat,
      y = price
    )
  ) + 
  geom_point()

It is possible to add more information on individual observations using the many aesthetics available. However, individual observations will always be, at least to some extent, “anonymous”. geom_text() and geom_label() allow the user to replace the shape representing an observation with text (a string) defined in the required aesthetic label.

Both geom_ functions understand aesthetics label, x, y, alpha, angle, colour, family, fontface, hjust, lineheight, size, and vjust.

As an example we can draw a scatter plot of the relationship between price and weight where each stone has a label showing its clarity. As this type of plot is generally more suitable for data sets with a low number of observations, we will further reduce the sub-sample from the diamonds data set.

diamonds %>% sample_n(50) -> sub_diamonds 
  
sub_diamonds %>% 
  ggplot(aes(
    x = carat,
    y = price,
    label = clarity
  )) +
  geom_text()

sub_diamonds %>% 
  ggplot(aes(
    x = carat,
    y = price,
    label = clarity
  )) +
  geom_label()

As is apparent from the figures, the difference between geom_text() and geom_label() is purely aesthetic. geom_label() is also considerably slower.

It is also clear that this type of plot very often suffers from over-plotting. As far as I know, ggplot2 does not provide any automatic, intelligent way to solve it. It is possible to adjust the position of labels manually (see help) or to use check_overlap = TRUE, which is a dirty way:

sub_diamonds %>% 
  ggplot(aes(
    x = carat,
    y = price,
    label = clarity
  )) +
  geom_text(
    check_overlap = TRUE
  )

check_overlap = TRUE just suppresses plotting of labels which would overlap with already plotted text. Oh, dear.

geom_smooth() returns a smoothed line. It supports multiple smoothing methods: lm, glm, gam, loess, and rlm (given in method). The default method depends on the number of observations. It also returns a confidence interval around the smooth. Plotting of the confidence interval can be suppressed by setting se = FALSE.

geom_smooth() understands aesthetics x, y, alpha, colour, fill, linetype, size, and weight.

diamonds %>% 
  ggplot(
    aes(
      x = carat,
      y = price
    )
  ) + 
  geom_smooth(
    fill = "pink",
    colour = "red"
  )

Notice that you can easily have a smoothing curve without plotting the actual observations. You can also compare multiple smoothing methods by simply adding multiple layers:

diamonds %>% 
  ggplot(
    aes(
      x = carat,
      y = price
    )
  ) + 
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", colour = "red", fill = "pink") +
  geom_smooth(method = "loess", colour = "green", fill = "lightgreen")

ggplot2 contains some tools for investigating the relationship between three variables. geom_raster(), geom_tile(), and geom_rect() provide similar functionality for plotting rectangles, which is useful when plotting a surface on a plane.

geom_raster() is the fastest of the three. Together with data on the estimated density of Old Faithful Geyser eruptions, we will use it to demonstrate the use of rectangles.

print(faithfuld)

## # A tibble: 5,625 × 3
##   eruptions waiting     density
##       <dbl>   <dbl>       <dbl>
## 1  1.600000      43 0.003216159
## 2  1.647297      43 0.003835375
## 3  1.694595      43 0.004435548
## 4  1.741892      43 0.004977614
## # ... with 5,621 more rows

faithfuld %>% 
  ggplot(
    aes(
      x = waiting,
      y = eruptions,
      fill = density
    )
  ) + 
  geom_raster()

faithfuld %>% 
  ggplot(
    aes(
      x = waiting,
      y = eruptions,
      fill = density
    )
  ) + 
  geom_raster(
    interpolate = TRUE
  )

The second figure is created with the option interpolate = TRUE, which delivers a smoother outcome.

Rectangles might be a bit difficult to combine with other geoms. In this case one can use geom_contour(), which displays contours of a 3D surface in 2D.

In the following task we want to combine the estimated density of eruptions stored in the table faithfuld with the actual observations from the table faithful.

One way is to combine both data sets and plot the figure with one layer for the density estimates and one for the actual observations, both using the same table. However, ggplot2 allows the user to specify a different data set for each layer, which is the better way to go. We need to:

Plot contours using the table faithfuld
Add a layer with points defined in the table faithful

ggplot(data = faithfuld,
       aes(
         x = waiting, 
         y = eruptions
       )
) + 
  geom_contour(aes(z = density)) +
  geom_point(data = faithful)

It is important to understand the use of aes() in this example. By coincidence both tables share the column names of the variables mapped to x and y, which makes things easier. The first use of aes(), in the ggplot() call, sets aesthetics for all layers (even those using different data!). The second use of aes(), in geom_contour(), adds the required aesthetic z. geom_point() does not need any additional aesthetics, so its mapping argument is not set and its value is inherited from the ggplot() call.

If the names in faithful were different (e.g. obs_eruptions and obs_waiting), we would need to rewrite the call in the following fashion (which would lead to an identical outcome):

ggplot(data = faithfuld,
       aes(
         x = waiting, 
         y = eruptions
       )
) + 
  geom_contour(aes(z = density)) +
  geom_point(data = faithful,
             aes(
               x = obs_waiting,
               y = obs_eruptions
             ))
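The call above cannot be run as printed, because the real faithful table has no obs_ columns. A runnable sketch of the same idea is given below; the table obs is a hypothetical renamed copy of faithful created only for this illustration.

```r
library(ggplot2)

# Hypothetical copy of faithful with renamed columns (names taken from the text)
obs <- data.frame(
  obs_eruptions = faithful$eruptions,
  obs_waiting   = faithful$waiting
)

p <- ggplot(faithfuld, aes(x = waiting, y = eruptions)) +
  geom_contour(aes(z = density)) +
  # x and y must be overridden: obs has no columns named waiting or eruptions
  geom_point(data = obs, aes(x = obs_waiting, y = obs_eruptions))

# The point layer draws one point per observation in the second table
nrow(layer_data(p, 2)) == nrow(faithful)
```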

Time-series

The geom_*() functions described above are primarily designed for cross-sectional data. In economics especially, we often use data with a time dimension. To illustrate the visualization of time series we will use the data set economics, which is supplied with ggplot2:

data("economics")
print(economics)

## # A tibble: 574 × 6
##         date   pce    pop psavert uempmed unemploy
##       <date> <dbl>  <int>   <dbl>   <dbl>    <int>
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## # ... with 570 more rows

The basic tool for visualization is a line plot produced by geom_line().

geom_line() understands following aesthetics: x, y, alpha, colour, linetype, and size.

economics %>% 
  ggplot(aes(
    x = date,
    y = unemploy # No. of unemployed.
  )) + 
  geom_line()

Similar to geom_line() is geom_step() which uses the same aesthetics:

economics %>%
  arrange(desc(date)) %>% slice(1:36) %>% # Keep the 36 most recent months (last three years)
  ggplot(aes(
    x = date,
    y = unemploy # No. of unemployed.
  )) + 
  geom_step()

Areas can be plotted by geom_ribbon() which requires specification of x, ymax, and ymin values which set an interval on y-axis. geom_ribbon() also understands alpha, colour, fill, linetype, and size.

economics %>% 
  mutate(
    pce_min = pce*0.90,   # 90 % of original pce
    pce_max = pce*1.05    # 105 % of original pce
  ) %>% 
  ggplot(
    aes(
      x = date,
      y = pce,
      ymin = pce_min,
      ymax = pce_max
    )
  ) +
    geom_ribbon(
      fill = "lightgreen"
      ) +
    geom_line()

geom_ribbon() is useful e.g. for plotting confidence intervals – not necessarily with time dimension.
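A minimal sketch of the confidence-interval use case follows. The linear model of price on carat and the prediction grid are assumptions made for illustration; the columns fit, lwr, and upr are those returned by predict() for a linear model with interval = "confidence".

```r
library(ggplot2)

# Assumption: a simple linear model, used only to obtain an interval to draw
fit  <- lm(price ~ carat, data = diamonds)
grid <- data.frame(carat = seq(0.2, 3, by = 0.1))

# predict() returns the columns fit, lwr, and upr
pred <- cbind(grid, predict(fit, newdata = grid, interval = "confidence"))

p <- ggplot(pred, aes(x = carat)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "lightgreen") +  # interval
  geom_line(aes(y = fit))                                          # point estimate
```

The ribbon is drawn first so that the line stays visible on top of it, as in the example above.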

Maps: plotting spatial data

Quite often we deal with data with a spatial dimension which we want to visualize on a map. We will illustrate a few examples of spatial data visualization using ggplot2.

Points on the map

In this example we want to get a map of major earthquakes in Japan. The data on earthquakes come from the NOAA database (https://www.ngdc.noaa.gov/). First we need to load them:

library(readr)
read_csv("data/jpn_quakes.csv") %>% 
  filter(long > 0) -> jpn_quakes

## Parsed with column specification:
## cols(
##   year = col_integer(),
##   lat = col_double(),
##   long = col_double(),
##   magnitude = col_double()
## )

# filter() removes a few very remote observations

print(jpn_quakes)

## # A tibble: 395 × 4
##    year   lat  long magnitude
##   <int> <dbl> <dbl>     <dbl>
## 1   684  32.5 134.0       8.4
## 2   701  35.7 135.4       7.0
## 3   704  33.8 136.7       7.0
## 4   744  32.4 130.5       6.4
## # ... with 391 more rows

The major problem is always getting a map. There are some packages with geographical data – e.g. maps and cshapes. In this example we will use cshapes which contains a data set of historical borders.

# This function returns map of the world for 2000-1-1:
cshapes::cshp(date=as.Date("2000-1-1")) -> world2000

# In the next step we need to extract a map of Japan
world2000[world2000$ISO1AL3 == "JPN",] %>% 
# Resulting object is of class "SpatialPolygonsDataFrame". However ggplot2 needs a data.frame. We can convert it using fortify() from ggplot2 package.
  fortify %>% as_data_frame() -> jpn

print(jpn)

## # A tibble: 2,485 × 7
##       long      lat order  hole  piece    id  group
##      <dbl>    <dbl> <int> <lgl> <fctr> <chr> <fctr>
## 1 134.2994 34.70412     1 FALSE      1   143  143.1
## 2 134.2503 34.71527     2 FALSE      1   143  143.1
## 3 134.2421 34.70388     3 FALSE      1   143  143.1
## 4 134.2494 34.68915     4 FALSE      1   143  143.1
## # ... with 2,481 more rows

Plotting itself is relatively easy. In the first step we use geom_polygon() for plotting borders of Japan. In the second step we employ table jpn_quakes to plot individual earthquakes:

ggplot(
  data = jpn,
  aes(
    x = long,
    y = lat
  )
) +
  geom_polygon(
    aes(group=group),
    fill = NA,         # makes polygons transparent
    color = "black"    # plot borders in black
  ) +
  geom_point(
    data = jpn_quakes,
    aes(
      color = year,
      size = magnitude
        ),
    alpha = 0.2
  )

In this way you can add even more layers and data sets.

Thematic maps

In the previous figure the map is just another layer with no direct association to the data. Another common example of spatial data visualization is the thematic map (https://en.wikipedia.org/wiki/Thematic_map). The function meant for plotting it is geom_map().

We are going to construct a thematic map of murder rates in the U.S. First we need to get data on murders:

crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests) %>% as_data_frame()
print(crimes)

## # A tibble: 50 × 5
##      state Murder Assault UrbanPop  Rape
## *   <fctr>  <dbl>   <int>    <int> <dbl>
## 1  alabama   13.2     236       58  21.2
## 2   alaska   10.0     263       48  44.5
## 3  arizona    8.1     294       80  31.0
## 4 arkansas    8.8     190       50  19.5
## # ... with 46 more rows

Column Murder contains the number of murder arrests per 100,000 residents.

The map used comes from the package maps and is extracted using the map_data() function from ggplot2:

states_map <- map_data("state") %>% as_data_frame()
print(states_map)

## # A tibble: 15,537 × 6
##        long      lat group order  region subregion
## *     <dbl>    <dbl> <dbl> <int>   <chr>     <chr>
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
## 4 -87.53076 30.33239     1     4 alabama      <NA>
## # ... with 15,533 more rows

It is crucial that the map data contain an ID (region) which allows us to associate each geographical region with its murder rate (via the column state).

geom_map() has two data arguments:

data contains the data to be plotted on the map
map contains the geographical data

Regions in the map are linked to the data by the aesthetic map_id:

ggplot(crimes, 
       aes(
         map_id = state
         )) +
    geom_map(
      aes(fill = Murder), 
      map = states_map
      ) +
    expand_limits(
      x = states_map$long, 
      y = states_map$lat
      )

When using geom_map() you need to specify the area to plot. This can be done using expand_limits(), which makes sure that all given x and y coordinates are included in the resulting figure.

Plot a route on the map

In transportation economics, for example (or for training purposes), it might be useful to plot a route on a map. It is simple when you have the data. We will use data exported from Strava which were preprocessed by plotKML::readGPX(). It is a short run from Romolslia Øvre to Flatåsen Nedre and back (made just for AVED):

load("data/RomFlat.Rdata")
print(run)

## # A tibble: 585 × 4
##        lon      lat   ele                 time
## *    <dbl>    <dbl> <dbl>                <chr>
## 1 10.35784 63.37758  55.1 2016-07-18T20:57:50Z
## 2 10.35789 63.37755  54.7 2016-07-18T20:57:59Z
## 3 10.35788 63.37752  54.2 2016-07-18T20:58:01Z
## 4 10.35788 63.37747  53.6 2016-07-18T20:58:03Z
## # ... with 581 more rows

The path can be plotted using geom_path(), which links observations in the order given by their position in the input table (i.e. row 1 to row 2, row 2 to row 3, and so forth).

geom_path() understands following aesthetics: x, y, alpha, colour, linetype, and size.

run %>% 
  ggplot(
    aes(
      x = lon,
      y = lat,
      color = ele
    )
  ) +
  geom_path()
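The row-order behaviour is what distinguishes geom_path() from geom_line(), which connects observations in the order of x. The toy data below (a closed square, a hypothetical example not taken from the handout) make the difference visible in the data each layer draws.

```r
library(ggplot2)

# A closed square: the order of rows is the order of travel
loop <- data.frame(
  x = c(0, 1, 1, 0, 0),
  y = c(0, 0, 1, 1, 0)
)

p_path <- ggplot(loop, aes(x = x, y = y)) + geom_path()  # keeps the row order
p_line <- ggplot(loop, aes(x = x, y = y)) + geom_line()  # reorders rows by x

layer_data(p_path)$x  # same order as in loop
layer_data(p_line)$x  # sorted by x
```

For a route such as the run above, geom_line() would therefore scramble the track wherever the runner doubles back in longitude.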

There is a package ggmap which allows you to plot data on various map tiles from various sources.

From visualization back to data: scales, axes, and legends

Let’s listen to Hadley (Wickham, 2016):

Scales control the mapping from data to aesthetics. They take your data and turn it into something that you can see, like size, colour, position or shape. Scales also provide the tools that let you read the plot: the axes and legends. Formally, each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale). The axis or legend is the inverse function: it allows you to convert visual properties back to data.

We haven’t mentioned any of those things so far. But we were using them; recall the simplified example from the introduction:

ggplot(
  data = diamonds,
  mapping = 
    aes(
      x = carat,      # X-coordinate
      y = price,      # Y-coordinate
      color = cut,    # color of the "point" margin
      size = table    # size of the point
    )
) + 
  geom_point()

This example specifies only aesthetics, i.e. which data should be mapped to which aesthetic. How the mapping is done is controlled by scales.

In the initial example the scales were left at their default settings. The call was therefore identical to:

ggplot(
  data = diamonds,
  mapping = 
    aes(
      x = carat,      # X-coordinate
      y = price,      # Y-coordinate
      color = cut,    # color of the "point" margin
      size = table    # size of the point
    )
) + 
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete() +
  scale_size_continuous()

We can use this example to demonstrate some options available via scales:

ggplot(
  data = diamonds,
  mapping = 
    aes(
      x = carat,      # X-coordinate
      y = price,      # Y-coordinate
      color = cut,    # color of the "point" margin
      size = table    # size of the point
    )
) + 
  geom_point() -> p

p + scale_x_continuous(trans = "log")         # Log-transformed x axis
p + scale_y_continuous(limits = c(500,1000))  # Limits on the y axis
p + scale_color_brewer(palette = "Paired")    # Color palette
p + scale_size_continuous(name = "Size")      # Legend title

Notice that you can assign a ggplot object to a variable!

The names of scales are systematic. They consist of three elements joined by _:

scale
the name of the aesthetic (x, y, color, size, alpha, ...)
the name of the scale (discrete, continuous, brewer, hue, ...)

Scales are added to a ggplot() call in the same manner as layers, i.e. by +. This is a bit misleading. With layers, + actually adds an additional layer, but with scales it replaces the default (or previously defined) scale for that aesthetic:

ggplot(
  data = diamonds,
  mapping = 
    aes(
      x = carat,
      y = price,
      color = cut
    )) +
  geom_point() -> p
  
p + scale_color_brewer(palette = "Paired") -> p1

p + scale_color_brewer(palette = "Paired") + scale_color_brewer(palette = "Set3") -> p2
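The replacement can be verified on the built plot. The sketch below is an illustration under two assumptions: only the first 1000 rows of diamonds are used to keep it fast, and the scales field of the plot object is inspected, which is an internal structure rather than a documented interface.

```r
library(ggplot2)

d <- diamonds[1:1000, ]  # assumption: a subset, for speed only

p <- ggplot(d, aes(x = carat, y = price, color = cut)) +
  geom_point()

# The second colour scale replaces the first (ggplot2 prints a message about it)
p2 <- p +
  scale_color_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Set3")

# Number of scales stored in the plot object
p2$scales$n()

# Colours actually drawn, compared with the five-colour "Set3" palette
used <- unique(layer_data(p2)$colour)
set3 <- scales::brewer_pal(palette = "Set3")(5)
all(used %in% set3)
```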

Special attention is paid to color scales, which drive the mapping to the aesthetics fill and color. There are four gradient-based methods of mapping for continuous and two for discrete variables. Let’s start with continuous color scales:

scale_fill_continuous() is the default option, identical to scale_fill_gradient(). It allows the user to set a color for low and high values. ggplot2 will process any choice of colors, but it is very difficult to come up with a combination which is easy for the human eye and brain to understand; therefore you should use some prepared choices:

geyser + scale_fill_gradient()

geyser + scale_fill_gradient(low="white", high="black")

geyser + scale_fill_gradient(
  low = munsell::mnsl("5G 9/2"),
  high = munsell::mnsl("5G 6/8")
)

stones + scale_color_gradient()

stones + scale_color_gradient(low="white", high="black")

stones + scale_color_gradient(
  low = munsell::mnsl("5G 9/2"),
  high = munsell::mnsl("5G 6/8")
)
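The plot objects geyser and stones used in these snippets are not defined in the text. A plausible reconstruction, inferred from how they are used (a fill scale applied to the geyser data, a color scale applied to the diamonds data, with table as the continuous variable), is sketched below. The exact definitions are assumptions and may differ from the ones used to produce the original figures.

```r
library(ggplot2)

# Assumed definition: the raster plot of the Old Faithful density estimate
geyser <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster()

# Assumed definition: a scatter plot with a continuous variable mapped to colour
stones <- ggplot(diamonds, aes(x = carat, y = price, colour = table)) +
  geom_point()

# With these objects the snippets in this section can be run, for example:
geyser + scale_fill_gradient(low = "white", high = "black")
stones + scale_color_gradient(low = "white", high = "black")
```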

scale_fill_gradient2() allows the user to combine two color gradients: from low to the midpoint and from the midpoint to high. The user can manually specify the value of the midpoint (the default is midpoint = 0).

faithfuld$density %>% median -> mid

geyser + scale_fill_gradient2(midpoint = mid)
geyser + scale_fill_gradient2(midpoint = mid, 
                              low = "blue", 
                              high = "red", 
                              mid = "white")

diamonds$table %>% median -> mid

stones + scale_color_gradient2(midpoint = mid)
stones + scale_color_gradient2(midpoint = mid, low = "blue", high = "red", mid = "white")

scale_fill_gradientn() makes it possible to use an n-color gradient specified by a vector in the argument colours. It should be used only if there is a strong reason for it. I also recommend using a gradient prepared by experts; see some examples:

geyser + scale_fill_gradientn(colours = terrain.colors(7))
# Function terrain.colors(n) from grDevices generates gradient (palette) of n colors.

geyser + scale_fill_gradientn(colours = colorspace::heat_hcl(7))
# Similar function from colorspace package

geyser + scale_fill_gradientn(colours = viridis::viridis(7))
# ...and from viridis package. Viridis provides very nice palettes for color-blind people and even its own function scale_fill_viridis.
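Recent versions of ggplot2 (3.0.0 and later, to our knowledge) also ship viridis scales directly, so the extra package is optional. The following is a minimal sketch; geyser_sketch is only an assumed stand-in for the geyser plot used in this handout, built from the faithfuld data that comes with ggplot2:

```r
library(ggplot2)

# Stand-in for the `geyser` plot (an assumption: a raster of the
# 2d density estimate stored in ggplot2::faithfuld).
geyser_sketch <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster()

# Continuous viridis scale; `option` selects the palette variant.
geyser_viridis <- geyser_sketch + scale_fill_viridis_c(option = "viridis")
geyser_magma   <- geyser_sketch + scale_fill_viridis_c(option = "magma")
```

Printing geyser_viridis or geyser_magma draws the figure; only the fill scale differs between the two.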

scale_fill_distiller() applies ColorBrewer colors (see http://colorbrewer2.org/) to continuous data. It allows the user to choose from three types of palettes: seq (sequential), div (diverging), or qual (qualitative). ggplot2 can and will use qual even for continuous data, but there is no good reason to do so.

A particular palette can be set by its name (see the website) or by its number. You can also change the direction of the palette using direction = 1 or direction = -1.

geyser + scale_fill_distiller(type = "seq", palette = "YlOrRd", direction = 1)
geyser + scale_fill_distiller(type = "seq", palette = "Oranges", direction = 1)
geyser + scale_fill_distiller(type = "div", palette = "BrBG", direction = 1)

geyser + scale_fill_distiller(type = "seq", palette = "YlOrRd", direction = -1)
geyser + scale_fill_distiller(type = "seq", palette = "Oranges", direction = -1)
geyser + scale_fill_distiller(type = "div", palette = "BrBG", direction = -1)

There are also four methods for discrete color scales:

The default color scheme is scale_fill_hue(), which picks evenly spaced hues around the HCL color wheel. HCL is the color definition system used by ggplot2. Colors are defined by three components: hue (h, [0, 360]), chroma (c), and luminance (l, [0, 100]). scale_fill_hue() returns evenly spaced hues with equal chroma and luminance.

scale_fill_hue(..., h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1, na.value = "grey50")

User can control the range of hues as well as values of chroma and luminance. See example:

stones_color + scale_color_hue()
stones_color + scale_color_hue(c = 50, l = 10)
stones_color + scale_color_hue(h = c(100,200))

stones_fill + scale_fill_hue()
stones_fill + scale_fill_hue(c = 50, l = 10)
stones_fill + scale_fill_hue(h = c(100,200))
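To see which colors a hue scale would actually produce, the palette can be inspected outside a plot. This is a small sketch using hue_pal() from the scales package (the package ggplot2 builds its scales on); the variable names are illustrative only:

```r
library(scales)

# Colours a default hue scale assigns to 3 and to 5 discrete levels.
default_3 <- hue_pal()(3)
default_5 <- hue_pal()(5)

# A restricted hue range with lower chroma and luminance.
narrow_5 <- hue_pal(h = c(100, 200), c = 50, l = 40)(5)

# show_col(c(default_5, narrow_5))  # uncomment to draw the swatches
```

Each result is a character vector of hex color codes, so the palettes can be compared or reused in other scales.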

It is very difficult to find good colors using hue. Therefore it is good to try some prepared palettes.

scale_fill_brewer() allows you to use palettes from http://colorbrewer2.org/, typically the qualitative ones:

stones_color + scale_color_brewer(type = "qual", palette = "Set1")
stones_color + scale_color_brewer(type = "qual", palette = "Pastel2")
stones_color + scale_color_brewer(type = "seq", palette = "YlOrRd")

stones_fill + scale_fill_brewer(type = "qual", palette = "Set1")
stones_fill + scale_fill_brewer(type = "qual", palette = "Pastel2")
stones_fill + scale_fill_brewer(type = "seq", palette = "YlOrRd")

Wickham recommends using the qualitative palettes Set1 and Dark2 for points and Set2, Pastel1, Pastel2, and Accent for areas. It also makes sense to use sequential palettes for ordered categories.

scale_fill_grey() provides a black-and-white palette for discrete data. Shades run from start to end on a scale from 0 (black) to 1 (white); with the defaults (start = 0.2, end = 0.8) they go from dark to light.

stones_color + scale_color_grey()
stones_color + scale_color_grey(start = 0.5, end = 1)
stones_color + scale_color_grey(start = 0, end = 1)

stones_fill + scale_fill_grey()
stones_fill + scale_fill_grey(start = 0.5, end = 1)
stones_fill + scale_fill_grey(start = 0, end = 1)

scale_fill_manual() allows the user to define a custom palette or to use a palette from another package.
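A minimal sketch of a manual palette follows. stones_fill_sketch is an assumed stand-in for stones_fill (bars of diamond counts by cut), and the hex codes are arbitrary illustrative choices:

```r
library(ggplot2)

# Stand-in for `stones_fill` (an assumption: counts of diamonds by cut).
stones_fill_sketch <- ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar()

# A named vector maps each level to a colour explicitly, so the result
# does not depend on the order of the factor levels.
cut_colours <- c(
  "Fair"      = "#d7191c",
  "Good"      = "#fdae61",
  "Very Good" = "#ffffbf",
  "Premium"   = "#abd9e9",
  "Ideal"     = "#2c7bb6"
)

stones_manual <- stones_fill_sketch + scale_fill_manual(values = cut_colours)
```

A palette generated by another package can be passed through the same values argument, for instance a vector of five colors returned by a palette function.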

Positioning

There are four mechanisms which drive the positioning of observations on the page. Scale transformations were discussed above. A description of the other three follows.

Position adjustments

layer() function has an argument position with options:

identity – no position adjustment (default for most geom_*() functions)
jitter – jitter points to avoid overlapping
dodge – avoid overlapping by dodging on side
stack – put overlaps on the top of each other
nudge – shifts overlaps by set x and y distance
jitterdodge – combines jitter and dodge

Jittering

Jittering is a technique useful for avoiding overlapping (especially) in scatter plots. Actual coordinates of each observation are randomly changed within specified limits.

The most common way to jitter is geom_jitter(), a shortcut for geom_point(position = "jitter"), i.e. for layer(geom = "point", position = "jitter", ...):

geom_jitter(mapping = NULL, data = NULL, stat = "identity",
  position = "jitter", width = NULL, height = NULL,...)

width : Amount of horizontal jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here.
height : Amount of vertical jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here.

Recall an example from the introduction:

diamonds %>%
  ggplot(data = ., mapping = aes(x = x, y = y)) +
    # x -- length of stones (mm)
    # y -- width of stones (mm)
  geom_point(
    color = "black"
  ) +
  geom_jitter(
    color = "blue",
    width = 1,
    height = 1,
    alpha = 0.3
  )
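Because the shifts are random, a jittered figure changes every time it is drawn. A hedged sketch of one way to make it reproducible is to pass a position_jitter() object with a seed (the seed argument is available in ggplot2 3.0.0 and later, to our knowledge); the first 500 rows of diamonds serve only as sample data:

```r
library(ggplot2)

# position_jitter() is the object behind position = "jitter";
# its `seed` argument fixes the random shifts.
jit <- position_jitter(width = 0.1, height = 0.1, seed = 42)

p_jitter <- ggplot(diamonds[1:500, ], aes(x = x, y = y)) +
  geom_point(position = jit, alpha = 0.3, color = "blue")

# Building the plot twice gives the same point positions.
first  <- layer_data(p_jitter, 1)
second <- layer_data(p_jitter, 1)
identical(first$x, second$x)
```

layer_data() returns the coordinates after the position adjustment, which is why it can be used to check the jittered values.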

Stack, dodge and fill

These options are commonly used in bar plots to put geoms (bars) on top of each other (stack), next to each other (dodge), or to show the share of each option (fill). See an example:

mean.price <- diamonds$price %>% mean

diamonds %>% 
  mutate(
    high.price = price > mean.price
  ) %>% 
  ggplot(
    aes(
      x = cut,
      fill = high.price
    )
  ) + scale_fill_brewer(name = "High price", type = "qual", palette = "Set2") -> p
  

p +
geom_bar(
    position = "stack" # default value
  )

The resulting figure does not give a very clear idea of the ratio of low-price to high-price stones in each category. You can get a clearer picture by setting position = "fill":

p +
geom_bar(
    position = "fill"
  )

This figure nicely shows shares within groups (cuts), but it does not allow comparison of counts among groups. For that purpose use position = "dodge":

p +
geom_bar(
    position = "dodge"
  )

It is also possible to specify the width of the dodge:

p +
geom_bar(
    position = position_dodge(width=0.5)
  )
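With position = "fill" the y axis holds proportions between 0 and 1. The sketch below labels them as percentages; it uses base R instead of dplyr to stay self-contained, and the axis title is our own choice:

```r
library(ggplot2)

# Flag stones priced above the mean, as in the example above.
d <- transform(diamonds, high.price = price > mean(price))

p_share <- ggplot(d, aes(x = cut, fill = high.price)) +
  geom_bar(position = "fill") +
  # position = "fill" rescales every bar to height 1, so the axis
  # shows proportions; percent labels make that explicit.
  scale_y_continuous(labels = scales::percent) +
  ylab("Share of stones")
```

The same scale_y_continuous() call would be misleading with position = "stack" or "dodge", where the axis shows counts.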

Faceting

ggplot2 allows you to break single figure into multiple “facets”. See example:

diamonds %>%
  ggplot(data = ., mapping = aes(x = carat, y = price)) +
  geom_point() -> p

p + facet_wrap(~cut, ncol = 3)

In the resulting figure there is a “subfigure” for each cut. All subfigures have identical scales. You can change this behavior using argument scales with options fixed (default), free, free_x, and free_y.
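The effect of the scales argument can be sketched as follows (the alpha value is an arbitrary choice to reduce overplotting):

```r
library(ggplot2)

p_facets <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2)

# Default: all panels share both axes, which eases comparison across cuts.
p_fixed <- p_facets + facet_wrap(~cut, ncol = 3)

# Each panel gets its own y axis; the x axis stays shared.
p_free_y <- p_facets + facet_wrap(~cut, ncol = 3, scales = "free_y")
```

Free scales show more detail within each panel, but values can no longer be compared across panels by eye.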

facet_grid() adds a feature which is not available in facet_wrap(): it organizes facets into a grid along two dimensions:

p + facet_grid(color ~ cut)

Coordinate systems

Linear coordinate systems

ggplot2 supports both linear and non-linear coordinate systems. In most cases we use coord_cartesian(), the default linear system, where the position of an element is given by its x and y coordinates.

The following example uses a simple scatter plot to illustrate the default properties of coord_cartesian(). It uses data from VGAMdata on arrow shots. Each shot is described by its X and Y coordinates:

print(archery)

## # A tibble: 126 × 3
##       X     Y archer
##   <dbl> <dbl>  <dbl>
## 1 24.14 -9.55      1
## 2 28.55  6.57      1
## 3  3.97  0.46      1
## 4 28.57 26.84      1
## # ... with 122 more rows

archery %>% 
ggplot(
  aes(
    y = Y,
    x = X
  )
) + geom_point() -> p

p

coord_cartesian() chooses the aspect ratio to fit the required figure size. This is misleading in this case because the units on both axes are the same. Luckily, coord_fixed() allows the user to fix the ratio:

p + coord_fixed(ratio = 1) # ratio = 1 is default value

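Another function which modifies the linear coordinate system is coord_flip(), which simply swaps the axes:

# A plot holds only one coordinate system: adding coord_flip() after
# coord_fixed() would replace coord_fixed(), so only coord_flip() is added here.
p + coord_flip()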

The linear coord_*() functions accept the arguments xlim and ylim to zoom into a part of the figure. Let's zoom into the 1st quadrant:

p + coord_fixed(xlim = c(0,30), ylim = c(0,30))

Similar functionality is provided by scales. The two might appear identical, but they differ deeply. Let's see an example using simulated data:

tibble( # data_frame() is deprecated in favor of tibble()
  x = seq(from=-10, to=10, by=0.1)
) %>% 
  mutate(
    y = x^2 # ...and we have a parabola
  ) %>% 
  ggplot(
    aes(x=x,y=y)
  ) + geom_point() -> p

We can plot it with the OLS fitted line and then limit the figure to the 1st quadrant, first using coord_cartesian() and then using scales:

p <- p + geom_smooth(method = "lm", se=FALSE)

p1 <- p + coord_cartesian(xlim = c(0,10), ylim = c(0,100))

p2 <- p + coord_cartesian() +
  scale_x_continuous(limits = c(0,10)) +
  scale_y_continuous(limits = c(0,100))

The difference is clear. Limits set by a coord truly zoom the figure: all observations are still taken into account (for the OLS fit in this case). Limits set by scales completely exclude the observations outside the limits before the fit is computed.
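The difference can also be checked numerically. The following sketch rebuilds the parabola with base R and uses layer_data() to extract the fitted line from both variants; the object names are our own:

```r
library(ggplot2)

# The same parabola as above.
parabola <- data.frame(x = seq(from = -10, to = 10, by = 0.1))
parabola$y <- parabola$x^2

base <- ggplot(parabola, aes(x = x, y = y)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

zoomed  <- base + coord_cartesian(xlim = c(0, 10), ylim = c(0, 100))
clipped <- base +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 100))

# layer_data() returns the values computed for a layer, here the fitted line.
fit_zoomed  <- layer_data(zoomed, 1)
fit_clipped <- suppressWarnings(layer_data(clipped, 1)) # warns about dropped rows

range(fit_zoomed$x)   # the fit spans the whole data, including x < 0
range(fit_clipped$x)  # the fit uses only observations inside the limits
```

In the zoomed variant the line is fitted to the full symmetric parabola and is therefore nearly flat; in the clipped variant it is fitted to the right branch only and rises steeply.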

Non-linear coordinate systems

There are also non-linear coordinate systems supported by ggplot2. Two of them, polar coordinates (coord_polar()) and map projections (coord_map()), are used quite rarely.

See an example of points lying close to a 45 degree line (length against width of the stones) in the cartesian and the polar coordinate system:

diamonds %>% 
  ggplot(aes(x = x, y = y)) +
    geom_point() +   # slope and intercept are not aesthetics of geom_point()
    geom_smooth(se=FALSE) -> p

p + coord_cartesian() -> p1     # Just for clarity. coord_cartesian() is default option.
p + coord_polar() -> p2

## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'

The last one, coord_trans(), is used to transform axes after statistical transformations are applied. Axes can also be transformed by scales. However, there is (again) a substantial difference. See an example:

# Generate new data: y = x + white noise
tibble(
  x = seq(from=0, to=100, by=0.1)
) %>% 
  mutate(
    y = x + rnorm(n(), sd=3) # n() gives one noise draw per row, so rowwise() is not needed
  ) %>% 
  filter(y>0,x>0) %>% 
  ggplot(
    aes(x=x,y=y)
  ) +
  geom_point() +
  geom_smooth(method="lm", se = FALSE) -> p

p1 <- p + coord_trans(y = "log")
p2 <- p + scale_y_continuous(trans = "log")

With coord_trans() the transformation is applied after smoothing, so the OLS fit is computed on the original values and closely follows the data. With a scale the transformation is applied before smoothing, i.e. geom_smooth() fits the transformed values. In addition, coord_trans() nicely transforms the background grid.
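The order of operations can be verified by looking at the values the smoother returns. This is a sketch under our own assumptions: the toy data are strictly positive so that logarithms are defined, and log10 is used in both variants:

```r
library(ggplot2)

set.seed(1)
# Strictly positive toy data (an assumption, so that logs are defined).
noisy <- data.frame(x = seq(from = 1, to = 100, by = 0.5))
noisy$y <- noisy$x * exp(rnorm(nrow(noisy), sd = 0.05))

base <- ggplot(noisy, aes(x = x, y = y)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Fit computed on the raw values; only the drawing is log-transformed.
fit_coord <- layer_data(base + coord_trans(y = "log10"), 1)

# Values are log-transformed first; the fit is linear in log10(y).
fit_scale <- layer_data(base + scale_y_log10(), 1)

range(fit_coord$y)  # on the original scale of y
range(fit_scale$y)  # on the log10 scale
```

The fitted values from the coord variant stay in the units of the data, while those from the scale variant are already logarithms.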

Themes

Data-unrelated elements are controlled via themes, which allow the user to completely change the appearance of a figure. ggplot2 supplies eight complete themes…

## `geom_smooth()` using method = 'loess'

…and many more in additional packages. See some examples from ggthemes:

…and it is still ggplot2! Notice that the data-related elements are the same in all figures. Themes control only data-unrelated features such as fonts, font sizes, grids, background colors, and so forth.
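Applying a complete theme takes a single call. The sketch below uses two of the themes shipped with ggplot2 and the first 500 rows of diamonds as sample data, then checks that the plotted positions are untouched:

```r
library(ggplot2)

p_base <- ggplot(diamonds[1:500, ], aes(x = carat, y = price)) +
  geom_point()

# A complete theme is added like any other component.
p_bw      <- p_base + theme_bw()
p_minimal <- p_base + theme_minimal()

# The theme does not move the data: both plots hold the same positions.
identical(
  layer_data(p_bw, 1)[c("x", "y")],
  layer_data(p_minimal, 1)[c("x", "y")]
)
```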

The theming system in ggplot2 consists of the following components:

elements specify the non-data elements which can be controlled, e.g. the plot.title element controls the appearance of the plot title.
each element is associated with an element function which describes the visual properties of the element. There are four basic ones: element_text(), element_line(), element_rect(), and element_blank()

element_text(family = NULL, face = NULL, colour = NULL, size = NULL,
  hjust = NULL, vjust = NULL, angle = NULL, lineheight = NULL,
  color = NULL, margin = NULL, debug = NULL)

element_line(colour = NULL, size = NULL, linetype = NULL,
  lineend = NULL, color = NULL)

element_rect(fill = NULL, colour = NULL, size = NULL, linetype = NULL,
  color = NULL)

the theme() function, which allows the user to set theme elements, e.g. by calling p + theme(plot.title = element_text(size = 20))

As in the case of colors, it is difficult to come up with a nice theme. Therefore there are a number of ready-to-use complete themes; see the examples above.

We will demonstrate the use of themes by mimicking the appearance of a figure from an OECD report:

Let’s create some data first and plot a plain default figure:

expand.grid(
  country = c("MEX","USA","CAN"),
  year = 2005:2014
) %>% rowwise %>%  
  mutate(
    value = rnorm(1, mean = 0.25, sd = 2)
  ) %>% 
  group_by(country) %>% 
  mutate(
    value = cumsum(value)
  ) -> oecd

oecd %>% 
  ggplot(
    aes(x=year,y=value)
  ) +
  geom_line(
    aes(color = country)
  ) +
  ggtitle("QuasiOECD Figure") -> p

p

The default figure is saved in the variable p. In the first step we adjust the data-related elements to match the OECD figure:

p + scale_color_manual(
  # Set name of the scale
  name = "Country",
  # Set colors as a named vector (see help)
  values = c(
    "CAN" = "black",
    "MEX" = "blue",
    "USA" = "grey50"
  ),
  # Set order in the legend -- see help for discrete_scale()
  breaks = c("CAN","MEX","USA"),
  # Set labels in the legend -- see help for discrete_scale()
  labels = c(
    "CAN" = "Canada",
    "MEX" = "Mexico",
    "USA" = "United States"
  )
) +
  scale_x_continuous(
    breaks = unique(oecd$year)
  ) -> p

p

And now we can adjust data-unrelated elements. The easiest way is to modify a complete theme, but we will do it from scratch.

In the first step we will set plot elements: background, margin and title:

p +
  theme(
    # Background properties
    plot.background = element_rect(fill="pink",            # Fill color
                                   linetype = 3,           # Border linetype 
                                   size = 5,               # Border size
                                   color = "yellow"),      # Border color
    plot.title = element_text(family = "times",            # Font family
                              face = "bold",               # Font face
                              color = "red",               # Font color
                              angle = 180                  # Text angle
                              ),                           # ...and even more -- see help for element_text()   
    # Margin is set by a special function margin() -- no element_*()
    plot.margin = margin(t = 20, r = 0, b = 5, l = 5)
  )

No, not even close. Let’s try it again:

p + theme(
  # We will use element_blank() to remove plot.title. element_blank() draws nothing and assigns no space.
  plot.title = element_blank()
) -> p

p

We will proceed with axis elements:

p + theme(
  # axis.line = element_line() # controls lines parallel to axis -- THERE IS A BUG IN GGPLOT2 v2.1.0 -- use axis.line.x and axis.line.y
  axis.title = element_blank(), # There are no axis titles in OECD Figure
  # axis.title.x = element_text()
  axis.ticks = element_blank(), # There are no actual ticks in OECD Figure, normally set by element_line()
  # Length of ticks is again set by a special function unit()
  # axis.ticks.length = unit(10, units="pt")
  axis.text = element_text(
    color = "black", 
    size = 11
  )
) -> p

p

It is time to modify the legend elements:

p + theme(
  legend.background = element_rect(
    fill = "grey90",                     # light grey background
    color = NA                           # no border
  ),
  legend.key = element_rect(
    fill = NA,                            # use no extra fill for keys
    color = NA                            # and no border
  ),
  legend.key.width = unit(30, units = "pt"), # make it a bit longer
  legend.title = element_blank(),         # There is no name in OECD Figure
  legend.position = "top",
  legend.direction = "horizontal",
  legend.text = element_text(
    size = 11
  )
) -> p

p

And finally panel elements:

p + theme(
  panel.background = element_rect(
    fill = "#e1fcfd"                    # Super-light blue as a RGB code
  ),
  panel.border = element_rect(
    color = "black",
    fill = NA
  ),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  aspect.ratio = 1
)

…and it is almost as ugly as the original.
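The shorter route mentioned earlier, modifying a complete theme, could look roughly like this. It is only a sketch: theme_oecd_sketch is a hypothetical name, the toy table stands in for the simulated oecd data, and the result is not meant to match the from-scratch version exactly:

```r
library(ggplot2)

# Start from a complete theme and override only what differs.
# `theme_oecd_sketch` is a hypothetical name, not part of any package.
theme_oecd_sketch <- theme_bw() +
  theme(
    plot.title       = element_blank(),
    axis.title       = element_blank(),
    axis.ticks       = element_blank(),
    legend.position  = "top",
    legend.title     = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#e1fcfd")
  )

# Toy data standing in for the simulated `oecd` table.
toy <- data.frame(
  year    = rep(2005:2014, times = 2),
  country = rep(c("CAN", "MEX"), each = 10),
  value   = c(cumsum(rep(0.5, 10)), cumsum(rep(-0.25, 10)))
)

p_oecd <- ggplot(toy, aes(x = year, y = value, color = country)) +
  geom_line() +
  theme_oecd_sketch
```

Storing the theme in an object makes it reusable: the same theme_oecd_sketch can be added to any other plot.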

Export figures with ggsave()

ggsave() is a function which saves a plot to a file, by default the last plot displayed. It supports export to multiple vector (pdf, svg, eps/ps, and wmf) and bitmap (png, jpeg, tiff, bmp) formats.
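A minimal sketch of a typical call follows. The file names are hypothetical, the files go to a temporary directory, and the sizes are arbitrary choices:

```r
library(ggplot2)

p_export <- ggplot(diamonds[1:500, ], aes(x = carat, y = price)) +
  geom_point()

# Hypothetical output files in a temporary directory.
out_pdf <- file.path(tempdir(), "figure_sketch.pdf")
out_png <- file.path(tempdir(), "figure_sketch.png")

# Passing `plot` explicitly avoids depending on which plot was shown last.
# The output format is chosen from the file extension.
ggsave(out_pdf, plot = p_export, width = 16, height = 10, units = "cm")
ggsave(out_png, plot = p_export, width = 16, height = 10, units = "cm", dpi = 300)
```

The dpi argument matters only for bitmap formats; for a fixed physical size it determines the number of pixels.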

Using vector graphics is highly recommended. Vector formats store information on all elements of a figure (their position and other properties), which allows the user to scale the figure without loss of quality (no blurred edges etc.). On the other hand, this may result in considerable file size, especially for scatter plots with many (often overlapping) observations. In this situation you can consider using bitmaps.

Example

The following figure takes less than 200 KB in PNG (a bitmap), over 11 MB in PDF, and 26 MB in SVG.

tibble(
    x = runif(200000, min=0.01, max = 5)
) %>% 
    mutate(
        y = log(x) + rnorm(n(), sd=1) # vectorized draw; rowwise() would be very slow here
    ) -> sdata

sdata %>% 
    ggplot(
        aes(x=x,y=y)
    ) +
    geom_point(alpha = 0.1) +
    geom_smooth(se=FALSE) +
    xlab("") + ylab("") +
    theme_bw()

## `geom_smooth()` using method = 'gam'

The choice of format also depends on the intended use: PNG and SVG are designed for web sites, PDF and EPS/PS for (e-)printed documents. Make sure that your text-processing tools of choice can handle vector graphics!

Additional features…

plot multiple plots in one figure: cowplot, gridExtra
extra complete themes: ggthemes, xkcd
new geom functions: ggpolypath, ggthemes, ggnetwork

…and many, many more packages – see https://cran.r-project.org/web/packages/ggplot2/index.html

What (not) to do

ggplot2 is a powerful tool which allows you to do pretty much anything you want (incl. pie charts) – but you should try to use it effectively.

Some advice from Tufte (2001) on Principles of graphical excellence:

Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.
Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
Graphical excellence is nearly always multivariate.
Graphical excellence requires telling the truth about the data.

Think of graphical excellence when plotting.

Homework 1

Plot density plots for control and treatment group

We will use simulated data from an experiment. The file HW_trial_data.Rdata contains a table trial_data with the columns x, control, and treatment:

## Source: local data frame [1,000 x 3]
## Groups: <by row>
## 
## # A tibble: 1,000 × 3
##            x   control treatment
##        <dbl>     <dbl>     <dbl>
## 1 -2.6075287 -6.254996 -5.861075
## 2 -0.8257511 -5.875684 -2.458799
## 3 -0.6754555 -2.628165 -2.956889
## # ... with 997 more rows

x represents an exogenous variable; the values in control and treatment are the responses in the control and treatment groups.

I want you to plot a figure like this one:

## `geom_smooth()` using method = 'gam'

Do not forget…

You are supposed to submit code, not a figure
Create smoothing line for each group (control/treatment)
Smoothing lines are below points
Color of smoothing lines is mandatory
Pay attention to labels
Points differ in shape and color. Selection of shapes and colors is up to you (default is just OK).
Points are slightly transparent (say, 50 %)

Homework 2

Use the table VGAMdata::oly12 and compare the weight and height of athletes at the London 2012 Summer Olympic Games with the BMI limits set by the WHO.

Your results should look like one of the following figures. (Feel free to choose your favorite colors. Other features are mandatory.)

Tips:

Use BMI formula and limits defined by WHO: http://apps.who.int/bmi/index.jsp?introPage=intro_3.html
You can add BMI limits as an additional table (1st solution) or as a set of functions (2nd solution).

Easier solution:

Better but more difficult solution:

You are supposed to submit your solution (code only, without any data or pictures) to IS. Your script should load the data from the package VGAMdata and export the resulting figure to results.pdf. The output file should be located in the working directory. Names of all variables and files are mandatory (otherwise you will get 0 points). Remember that Linux and R are case-sensitive (Windows is not).