2016
Each figure produced by ggplot2 consists of three basic components:
Let's listen to Hadley (Wickham, 2016):
Scales control the mapping from data to aesthetics. They take your data and turn it into something that you can see, like size, colour, position or shape. Scales also provide the tools that let you read the plot: the axes and legends. Formally, each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale). The axis or legend is the inverse function: it allows you to convert visual properties back to data.
We haven't mentioned any of those things so far. But we have been using them – recall the simplified example from the introduction:
ggplot(
  data = diamonds,
  mapping = aes(
    x = carat,     # X-coordinate
    y = price,     # Y-coordinate
    color = cut,   # color of the "point" margin
    size = table   # size of the point
  )
) +
  geom_point()
This example specifies only aesthetics – i.e. which data should be mapped to which aesthetics. How it should be done is controlled by scales.
In the initial example the scales were left at their default settings. The call was therefore equivalent to:
ggplot(
  data = diamonds,
  mapping = aes(
    x = carat,     # X-coordinate
    y = price,     # Y-coordinate
    color = cut,   # color of the "point" margin
    size = table   # size of the point
  )
) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete() +
  scale_size_continuous()
We can use this example to demonstrate some options available via scales:
ggplot(
  data = diamonds,
  mapping = aes(
    x = carat,     # X-coordinate
    y = price,     # Y-coordinate
    color = cut,   # color of the "point" margin
    size = table   # size of the point
  )
) +
  geom_point() -> p

p + scale_x_continuous(trans = "log")           # Transformation (log axis)
p + scale_y_continuous(limits = c(500, 1000))   # Limits on axes
p + scale_color_brewer(palette = "Paired")      # Colors
p + scale_size_continuous(name = "Size")        # Name of the scale (legend title)
The underlying (constructor) functions for the different scale_*() functions are discrete_scale() and continuous_scale(). They have a lot of arguments. The most important ones (from the user's point of view) are:
name – name of the scale (printed in the legend)
breaks – positions of major breaks
minor_breaks – positions of minor breaks
labels – labels
limits – limits of the scale (e.g. axes)
trans – transformation (log,…) of the scale
position – position of the axis (new in ggplot2 2.2.0)
guide
See help for more details…
The names of scales are systematic. They consist of three elements joined by _:
scale
the name of the aesthetic (x, y, color, size, alpha,..)
the name of the scale (discrete, continuous, brewer, hue,…)
Scales are added to a ggplot() call in the same manner as layers – i.e. by +. This is a bit misleading: in the case of layers, + actually adds an additional layer, but with scales it replaces the default (or previously defined) one.
Arguments given in ... are passed to the constructor functions.
ggplot(
  data = diamonds,
  mapping = aes(
    x = carat,
    y = price,
    color = cut
  )
) +
  geom_point() -> p

p + scale_color_brewer(palette = "Paired") -> p1
p + scale_color_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Set3") -> p2
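One way to check that a second scale replaces the first one (rather than being stacked on top of it) is to look at the colours ggplot2 actually assigns to the points. The following is only a sketch: it assumes that layer_data() and scales::brewer_pal() behave as in current releases, and the object names are illustrative.

```r
library(ggplot2)

# Base plot: the five levels of `cut` are mapped to colour.
p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()

# Add two colour scales; ggplot2 prints a message and keeps only the last one.
p2 <- suppressMessages(
  p +
    scale_color_brewer(palette = "Paired") +
    scale_color_brewer(palette = "Set3")
)

# Colours that were really used in the built plot.
used <- toupper(unique(layer_data(p2)$colour))

# Reference palettes for five levels.
set3   <- toupper(scales::brewer_pal(palette = "Set3")(5))
paired <- toupper(scales::brewer_pal(palette = "Paired")(5))

all(used %in% set3)    # colours come from the scale added last
any(used %in% paired)  # nothing is left from the replaced scale
```

The message printed without suppressMessages() is a useful reminder that an earlier scale has been overwritten.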
Special attention is paid to the scales that map data to the aesthetics fill and color. There are four gradient-based methods of mapping for continuous variables and two for discrete variables. Let's start with continuous color scales.
scale_fill_continuous() is the default option, identical to scale_fill_gradient(). It allows the user to set a color for low and high values. ggplot2 will process any choice of colors, but it is very difficult to come up with a combination that is easy for the human eye and brain to read, so you should use some prepared choices.
Example of geom_raster() using ggplot2::faithfuld data:
faithfuld %>%
  ggplot(
    aes(
      x = waiting,
      y = eruptions,
      fill = density
    )
  ) +
  geom_raster(interpolate = TRUE) -> geyser
Example of geom_point() using subset of ggplot2::diamonds data:
ggplot(
  data = diamonds,
  mapping = aes(
    x = carat,
    y = price,
    color = table
  )
) +
  geom_point() -> stones
Default, manual, using munsell package
geyser + scale_fill_gradient()
geyser + scale_fill_gradient(low = "white", high = "black")
geyser + scale_fill_gradient(
  low = munsell::mnsl("5G 9/2"),
  high = munsell::mnsl("5G 6/8")
)

stones + scale_color_gradient()
stones + scale_color_gradient(low = "white", high = "black")
stones + scale_color_gradient(
  low = munsell::mnsl("5G 9/2"),
  high = munsell::mnsl("5G 6/8")
)
scale_fill_gradient2() allows the user to combine two color gradients: from low to the midpoint and from the midpoint to high. The user can manually specify the value of midpoint (default is midpoint = 0).
faithfuld$density %>% median -> mid
geyser + scale_fill_gradient2(midpoint = mid)
geyser + scale_fill_gradient2(midpoint = mid, low = "blue", high = "red", mid = "white")

diamonds$table %>% median -> mid
stones + scale_color_gradient2(midpoint = mid)
stones + scale_color_gradient2(midpoint = mid, low = "blue", high = "red", mid = "white")
scale_fill_gradientn() makes it possible to use an n-color gradient specified by a vector in the argument colors. It should be used only if there is a strong reason for it. I also recommend using gradients prepared by experts – see some examples.
Function terrain.colors(n) from grDevices generates a gradient (palette) of n colors:
geyser + scale_fill_gradientn(colours = terrain.colors(7))
Similar function from colorspace package:
geyser + scale_fill_gradientn(colours = colorspace::heat_hcl(7))
…and from the viridis package. Viridis provides very nice palettes that remain readable for color-blind people, and even its own function scale_fill_viridis():
geyser + scale_fill_gradientn(colours = viridis::viridis(7))
scale_fill_distiller() applies ColorBrewer colors (see http://colorbrewer2.org/) to continuous data. It allows the user to choose from three types of palettes: seq (sequential), div (diverging) or qual (qualitative). ggplot2 can and will use qual even for continuous data, but there is no reason to do so.
A particular palette can be set by its name (see the website) or by its number. You can also change the direction of the palette using direction = 1 or direction = -1.
geyser + scale_fill_distiller(type = "seq", palette = "YlOrRd", direction = 1)
geyser + scale_fill_distiller(type = "seq", palette = "Oranges", direction = 1)
geyser + scale_fill_distiller(type = "div", palette = "BrBG", direction = 1)
geyser + scale_fill_distiller(type = "seq", palette = "YlOrRd", direction = -1)
geyser + scale_fill_distiller(type = "seq", palette = "Oranges", direction = -1)
geyser + scale_fill_distiller(type = "div", palette = "BrBG", direction = -1)
Example of geom_point() using subset of ggplot2::diamonds data:
diamonds %>%
  ggplot(aes(
    x = carat,
    y = price,
    color = color
  )) +
  geom_point() +
  theme_classic() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  ) -> stones_color
Example of geom_bar() using subset of ggplot2::diamonds data:
diamonds %>%
  ggplot(aes(
    x = color,
    fill = color
  )) +
  geom_bar() +
  theme_classic() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  ) -> stones_fill
scale_fill_hue() picks evenly spaced hues around the HCL color wheel. HCL is the color definition system used by ggplot2. Colors are defined by three components: hue (h, [0, 360]), chroma (c), and luminance (l, [0, 100]). scale_fill_hue() returns evenly spaced hues with equal chroma and luminance.
The user can control the range of hues as well as the values of chroma and luminance. See example:
stones_color + scale_color_hue()
stones_color + scale_color_hue(c = 50, l = 10)
stones_color + scale_color_hue(h = c(100, 200))

stones_fill + scale_fill_hue()
stones_fill + scale_fill_hue(c = 50, l = 10)
stones_fill + scale_fill_hue(h = c(100, 200))
It is very difficult to find good colors using hue. Therefore it is good to try some prepared palettes.
scale_fill_brewer() allows you to use qualitative palettes from http://colorbrewer2.org/
H.W. recommends using the qualitative palettes Set1 and Dark2 for points and Set2, Pastel1, Pastel2, and Accent for areas. It also makes sense to use sequential palettes in the case of ordered options.
stones_color + scale_color_brewer(type = "qual", palette = "Set1")
stones_color + scale_color_brewer(type = "qual", palette = "Pastel2")
stones_color + scale_color_brewer(type = "seq", palette = "YlOrRd")

stones_fill + scale_fill_brewer(type = "qual", palette = "Set1")
stones_fill + scale_fill_brewer(type = "qual", palette = "Pastel2")
stones_fill + scale_fill_brewer(type = "seq", palette = "YlOrRd")
scale_fill_grey() provides a black-and-white palette for discrete data. Shades run from start to end on a grey scale where 0 is black and 1 is white; with the defaults (start = 0.2, end = 0.8) the first level gets the darkest shade.
stones_color + scale_color_grey()
stones_color + scale_color_grey(start = 0.5, end = 1)
stones_color + scale_color_grey(start = 0, end = 1)

stones_fill + scale_fill_grey()
stones_fill + scale_fill_grey(start = 0.5, end = 1)
stones_fill + scale_fill_grey(start = 0, end = 1)
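The direction of the grey scale can be checked by inspecting the fill colours ggplot2 assigns to each bar. This is a small sketch, assuming that layer_data() returns one row per bar with columns x and fill; the variable names are illustrative.

```r
library(ggplot2)

# Bar chart of diamond cuts with the full grey range, from 0 to 1.
p <- ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  scale_fill_grey(start = 0, end = 1)

# One row per bar; sort by position so that the rows follow the factor levels.
bars <- layer_data(p)
bars <- bars[order(as.numeric(bars$x)), ]

# For a grey colour the red channel (0-255) equals its brightness.
brightness <- unname(col2rgb(bars$fill)["red", ])
brightness
```

Brightness grows with the order of the levels, so `start` is the shade of the first level and `end` the shade of the last one.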
scale_fill_manual() and scale_colour_manual() allow the user to define their own palette or to use a palette from a different package.
scale_fill_identity() and scale_colour_identity() use the values of an already scaled variable, i.e. a variable that already contains colors.
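The difference between the manual and identity scales can be illustrated on a toy data set. This is a minimal sketch with made-up data; the data frame `df` and its columns are assumptions introduced only for this example.

```r
library(ggplot2)

# Toy data: `shade` already contains the colour to be used for each group.
df <- data.frame(
  group = c("a", "b", "c"),
  n     = c(3, 5, 2),
  shade = c("tomato", "steelblue", "grey40"),
  stringsAsFactors = FALSE
)

# Manual scale: colours are assigned to the levels of `group` by a named vector.
p_manual <- ggplot(df, aes(x = group, y = n, fill = group)) +
  geom_col() +
  scale_fill_manual(values = c(a = "tomato", b = "steelblue", c = "grey40"))

# Identity scale: the values stored in `shade` are used as they are.
p_identity <- ggplot(df, aes(x = group, y = n, fill = shade)) +
  geom_col() +
  scale_fill_identity()

# Both plots end up with the same fill colours.
fill_manual   <- layer_data(p_manual)
fill_identity <- layer_data(p_identity)
fill_manual   <- as.character(fill_manual$fill[order(as.numeric(fill_manual$x))])
fill_identity <- as.character(fill_identity$fill[order(as.numeric(fill_identity$x))])
```

Note that the identity scale draws no legend by default, because the data values themselves are the colours.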
There are more aesthetics than fill and color. You can also find specialized functions for them:
continuous – default for continuous data
discrete – default for discrete data
identity – directly uses the values given in a scaled variable
manual – allows the user to define his/her own rules
There are four mechanisms that drive the positioning of observations on the page. Scale transformations were discussed above. A description of the other three options follows.
The layer() function has an argument position with options:
identity – no position adjustment (default for most geom_*() functions)
jitter – jitter points to avoid overlapping
dodge – avoid overlapping by dodging to the side
stack – put overlaps on top of each other
nudge – shift overlaps by set x and y distances
jitterdodge – combines jitter and dodge
Jittering is a technique useful for avoiding overlapping (especially) in scatter plots. Actual coordinates of each observation are randomly changed within specified limits.
The most common use of jittering is via geom_jitter(), a shortcut for layer(geom = "point", position = "jitter", ...):
geom_jitter(mapping = NULL, data = NULL, stat = "identity", position = "jitter", width = NULL, height = NULL,...)
width: Amount of horizontal jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here.
height: Amount of vertical jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here.
Recall an example from the introduction:
diamonds %>%
  ggplot(data = ., mapping = aes(x = x, y = y)) +
  # x -- length of stones (mm)
  # y -- width of stones (mm)
  geom_point(
    color = "black"
  ) +
  geom_jitter(
    color = "blue",
    width = 1,
    height = 1,
    alpha = 0.3
  )
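The meaning of width and height can be verified on artificial data in which all observations share the same coordinates. The sketch below uses a made-up data frame and assumes that layer_data() returns the positions after jittering.

```r
library(ggplot2)

# All points sit at (0, 0), so any spread in the plot is due to jittering.
df <- data.frame(x = rep(0, 200), y = rep(0, 200))

p <- ggplot(df, aes(x = x, y = y)) +
  geom_jitter(width = 1, height = 0.5)

# Positions after the random shift has been applied.
jittered <- layer_data(p)

range(jittered$x)  # stays within -1 and 1: total spread is at most 2 * width
range(jittered$y)  # stays within -0.5 and 0.5: at most 2 * height
```

Because the shift is random, the exact positions differ between runs; only the bounds are fixed.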
The options stack, dodge, and fill are commonly used in bar plots for putting geoms (bars) on top of each other, next to each other, and for showing shares of options. See an example:
mean.price <- diamonds$price %>% mean

diamonds %>%
  mutate(
    high.price = price > mean.price
  ) %>%
  ggplot(
    aes(
      x = cut,
      fill = high.price
    )
  ) +
  scale_fill_brewer(name = "High price", type = "qual", palette = "Set2") -> p

p + geom_bar(
  position = "stack"   # default value
)
The resulting figure does not give a very clear idea of the ratio of low/high price stones in each category. You can get a clearer picture by setting position = "fill".
p + geom_bar( position = "fill" )
This figure nicely shows shares within groups (cuts), but it does not allow a comparison of counts among groups. For that purpose use position = "dodge".
p + geom_bar( position = "dodge" )
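What position = "fill" does to the bars can be seen from the data ggplot2 computes for the layer. The following sketch rebuilds the example with base R only (the column name high_price is an assumption of this example) and assumes that layer_data() exposes the ymax column of the stacked bars.

```r
library(ggplot2)

# Flag stones that are more expensive than the average stone.
d <- diamonds
d$high_price <- d$price > mean(d$price)

p_fill <- ggplot(d, aes(x = cut, fill = high_price)) +
  geom_bar(position = "fill")

# One row per bar segment; ymax is the upper edge of the segment.
segments <- layer_data(p_fill)

# Height of the whole stack for each cut.
stack_top <- tapply(segments$ymax, as.numeric(segments$x), max)
stack_top
```

Every stack is rescaled to a height of one, which is why the figure shows shares but hides the number of stones in each cut.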
ggplot2 allows you to break a single figure into multiple "facets". See example:
There are two functions that can arrange faceting for you:
facet_wrap() (used on the previous slide) takes a variable or a combination of multiple variables and creates a "subfigure" for each level.
facet_grid() creates a matrix of panels defined by row and column facetting variables.
facet_wrap()
facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed", shrink = TRUE, labeller = "label_value", as.table = TRUE, switch = NULL, drop = TRUE, dir = "h", strip.position = "top")
Arguments:
facets – either a formula or a character vector. You can get identical results in the example with + facet_wrap(~cut) and + facet_wrap("cut"). You will learn about formulas in the "Econometrics in R" lecture.
nrow and ncol – number of rows and columns
scales – All subfigures have identical scales by default. You can change this behavior using the argument scales with options fixed (default), free, free_x, and free_y.
strip.position – sets the strip position
facet_grid()
facet_grid(facets, margins = FALSE, scales = "fixed", space = "fixed", shrink = TRUE, labeller = "label_value", as.table = TRUE, switch = NULL, drop = TRUE)
Arguments:
facets – a formula with rows on the LHS and columns on the RHS
margins – if TRUE, adds an extra row and column with all observations for each row/column
diamonds %>%
  sample_n(500) %>%
  ggplot(data = ., mapping = aes(x = carat, y = price)) +
  geom_point() -> p

p + facet_wrap(~cut, ncol = 3)
p + facet_wrap("cut", ncol = 3, scales = "free")
p + facet_wrap(c("cut", "color"))
p + facet_grid(cut ~ color)
p + facet_grid(cut ~ color, margins = TRUE)
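The number of panels created by the two faceting functions can be counted from the built plot. This is a rough sketch using the full diamonds data (not the random sample above) and assuming that layer_data() reports the panel of each observation in the column PANEL.

```r
library(ggplot2)

p_all <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# facet_wrap(): one panel per level of `cut`.
n_wrap <- length(unique(layer_data(p_all + facet_wrap(~cut))$PANEL))

# facet_grid(): one panel per combination of `cut` (rows) and `color` (columns).
n_grid <- length(unique(layer_data(p_all + facet_grid(cut ~ color))$PANEL))

c(wrap = n_wrap, grid = n_grid)
```

With five cuts and seven colors the grid has as many panels as there are combinations, which is worth keeping in mind before faceting by variables with many levels.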
In order to do so you can:
use the labeller option
labeller
labeller is a function which takes the values of the faceting variables and returns the labels displayed in the strips of the panels.
You can use labeller to change strip labels – see example:
aux <- c(
  "Fair" = "Fair cut",
  "Good" = "Good cut",
  "Very Good" = "Very good cut",
  "Premium" = "Premium cut",
  "Ideal" = "Ideal cut"
)

p + facet_wrap("cut", labeller = as_labeller(aux))
ggplot2 supports both linear and non-linear coordinate systems. In most cases we use coord_cartesian(), the default linear system, where the position of an element is given by its x and y coordinates.
The following example uses a simple scatter plot to illustrate the default properties of coord_cartesian(). It uses data from VGAMdata to show arrow shots. Each shot is described by its X and Y coordinates.
## # A tibble: 126 × 3
##         X     Y archer
##     <dbl> <dbl>  <dbl>
##  1  24.14 -9.55      1
##  2  28.55  6.57      1
##  3   3.97  0.46      1
##  4  28.57 26.84      1
##  5  -3.43  8.57      1
##  6   9.68 16.33      1
##  7  -5.95 20.73      1
##  8  17.32  4.59      1
##  9  -0.48 -7.72      1
## 10 -18.42 -5.64      1
## # ... with 116 more rows
coord_fixed()
coord_cartesian() sets the aspect ratio to fit the required figure size. But this is misleading in this case because the units on both axes are the same. Luckily, coord_fixed() allows the user to set the ratio:
p + coord_fixed(ratio = 1) # ratio = 1 is default value
coord_flip()
Another function which modifies the linear coordinate system is coord_flip(), which just flips the axes:
p + coord_fixed(ratio = 1) +   # ratio = 1 is default value
  coord_flip()
The linear coord_*() functions shown here (coord_cartesian(), coord_fixed(), and coord_flip()) accept the arguments xlim and ylim, which zoom in on a part of the figure. Let's zoom in on the 1st quadrant:
p + coord_fixed(xlim = c(0,30), ylim = c(0,30))
Similar functionality is provided by scales. They might appear identical, but they differ deeply. Let's see an example using simulated data:
tibble(   # tibble() replaces the deprecated data_frame()
  x = seq(from = -10, to = 10, by = 0.1)
) %>%
  mutate(
    y = x^2   # ...and we have a parabola
  ) %>%
  ggplot(
    aes(x = x, y = y)
  ) +
  geom_point() -> p
We can plot it with the OLS fitted line and then limit the figure to the 1st quadrant using coord_cartesian() and scales:
p <- p + geom_smooth(method = "lm", se = FALSE)

p1 <- p + coord_cartesian(xlim = c(0, 10), ylim = c(0, 100))

p2 <- p + coord_cartesian() +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 100))
The difference is clear. Limits set by coord truly zoom the figure, and all observations are still taken into account (in the OLS fit in this case), whereas limits set by scales completely exclude the observations outside the limits.
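The effect on the fitted line can also be checked numerically. The sketch below rebuilds the parabola with base R and compares the fitted lines computed by geom_smooth() under both approaches; it assumes that layer_data(plot, 2) returns the points of the fitted line of the second layer.

```r
library(ggplot2)

# Parabola on a symmetric interval: the OLS line through all points is flat.
df <- data.frame(x = seq(from = -10, to = 10, by = 0.1))
df$y <- df$x^2

base_plot <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Zoom with the coordinate system: the fit still uses all observations.
zoomed <- base_plot + coord_cartesian(xlim = c(0, 10), ylim = c(0, 100))

# Limits on scales: observations outside the limits are dropped before fitting.
clipped <- base_plot +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 100))

fit_zoomed  <- layer_data(zoomed, 2)
fit_clipped <- suppressWarnings(layer_data(clipped, 2))

# Vertical extent of each fitted line: close to zero for a flat line.
rise_zoomed  <- diff(range(fit_zoomed$y))
rise_clipped <- diff(range(fit_clipped$y))
c(zoomed = rise_zoomed, clipped = rise_clipped)
```

The warnings suppressed here are the ones ggplot2 prints when scale limits remove observations; in interactive work they are a useful hint that the statistics are computed on a subset.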
There are also non-linear coordinate systems supported by ggplot2. Two of them – polar coordinates (coord_polar()) and map projections (coord_map()) – are quite rarely used.
Data-unrelated elements are controlled via themes, which allow the user to completely change the appearance of a figure.
Data-related elements are still the same in all figures! Themes really control only data-unrelated features such as fonts, font sizes, grids, background colors, and so forth.
The theming system in ggplot2 consists of the following components:
plot.title element controls the appearance of the plot title.
element_text(), element_line(), element_rect(), and element_blank()
element_text(family = NULL, face = NULL, colour = NULL, size = NULL, hjust = NULL, vjust = NULL, angle = NULL, lineheight = NULL, color = NULL, margin = NULL, debug = NULL)
element_line(colour = NULL, size = NULL, linetype = NULL, lineend = NULL, color = NULL)
element_rect(fill = NULL, colour = NULL, size = NULL, linetype = NULL, color = NULL)
theme() function which allows the user to set themes – e.g. by the call:
p + theme(plot.title = element_text(size = 20))
As with colors, it is difficult to come up with a nice theme. Therefore there are a number of ready-to-use complete themes – see examples in the handout.
We will demonstrate using themes by mimicking appearance of a Figure from an OECD report:
Let's create simulated data:
expand.grid(
  country = c("MEX", "USA", "CAN"),
  year = 2005:2014
) %>%
  rowwise %>%
  mutate(
    value = rnorm(1, mean = 0.25, sd = 2)
  ) %>%
  group_by(country) %>%
  mutate(
    value = cumsum(value)
  ) -> oecd

save(oecd, file = "data/oecd_sim.Rdata")
And the very first figure:
oecd %>%
  ggplot(
    aes(x = year, y = value)
  ) +
  geom_line(
    aes(color = country)
  ) +
  labs(
    title = "QuasiOECD Figure",
    subtitle = "This is subtitle",
    caption = "Source: Simulated data"
  ) -> p
New in ggplot2 2.2.0 – You can add a title, subtitle, and caption using the function labs().
In the first step we should adjust the data-related elements to fit the OECD figure:
p + scale_color_manual(
  # Set name of the scale
  name = "Country",
  # Set colors as a named vector (see help)
  values = c(
    "CAN" = "black",
    "MEX" = "blue",
    "USA" = "grey50"
  ),
  # Set order in the legend -- see help for scale_discrete()
  breaks = c("CAN", "MEX", "USA"),
  # Set labels in the legend -- see help for scale_discrete()
  labels = c(
    "CAN" = "Canada",
    "MEX" = "Mexico",
    "USA" = "United States"
  )
) +
  scale_x_continuous(
    breaks = unique(oecd$year)
  ) -> p
And now we can adjust the data-unrelated elements. The easiest way is to modify a complete theme, but we will do it from scratch.
In the first step we will set plot elements: background, margin and title:
p + theme(
  # Background properties
  plot.background = element_rect(
    fill = "pink",       # Fill color
    linetype = 3,        # Border linetype
    size = 5,            # Border size
    color = "yellow"     # Border color
  ),
  plot.title = element_text(
    family = "times",    # Font family
    face = "bold",       # Font face
    color = "red",       # Font color
    angle = 180          # Text angle
  ),
  # Margin is set by a special function margin() -- no element_*()
  plot.margin = margin(t = 20, r = 0, b = 5, l = 5)
) -> pde
No, not even close. Let's try it again:
p + theme(
  # We will use element_blank() to remove
  # plot.title. element_blank() draws nothing
  # and assigns no space.
  plot.title = element_blank(),
  plot.subtitle = element_blank(),
  plot.caption = element_blank()
) -> p
We will proceed with axis elements:
p + theme(
  # axis.line = element_line()    # controls lines parallel to axis
  axis.title = element_blank(),   # There are no axis titles in OECD Figure
  # axis.title.x = element_text()
  axis.ticks = element_blank(),   # There are no actual ticks in OECD Figure,
                                  # normally set by element_line()
  # Length of ticks is again set by a special function unit()
  # axis.ticks.length = unit(10, units="pt")
  axis.text = element_text(
    color = "black",
    size = 11
  )
) -> p
It is time to modify the legend elements:
p + theme(
  legend.background = element_rect(
    fill = "grey90",   # light grey background
    size = NA          # no border
  ),
  legend.key = element_rect(
    fill = NA,         # use no extra fill for keys
    color = NA         # and no border
  ),
  legend.key.width = unit(30, units = "pt"),   # make it a bit longer
  legend.title = element_blank(),              # There is no name in OECD Figure
  legend.position = "top",
  legend.direction = "horizontal",
  legend.text = element_text(
    size = 11
  )
) -> p
And finally panel elements:
p + theme(
  panel.background = element_rect(
    fill = "#e1fcfd"   # Super-light blue as a RGB code
  ),
  panel.border = element_rect(
    color = "black",
    fill = NA
  ),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  aspect.ratio = 1
) -> p
…and it is almost as ugly as the original.
ggsave()
ggsave() is a function which saves the last plot displayed. It supports export to multiple vector (pdf, svg, eps/ps, and wmf) and bitmap (png, jpeg, tiff, bmp) formats.
Using vector graphics can be highly recommended. Vector graphics formats save information on all elements in the figure (their position and other properties), which allows the user to scale them without loss of quality (no blurred edges etc.). On the other hand, it may result in considerable file size. This is especially the case for scatter plots with many (often overlapping) observations. In this situation you can consider using bitmaps.
The following figure has less than 200 KB in PNG (a bitmap), over 11 MB in PDF, and 26 MB in SVG.
Choice of formats also depends on intended use: PNG and SVG are designed for web sites, PDF and EPS/PS for (e-)printed documents. Make sure that you can process vector graphics in text tools of choice!
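A minimal sketch of saving a figure with ggsave() follows. The plot, the file name, and the figure size are made up for this example; the plot is passed explicitly through the plot argument rather than relying on the last plot displayed.

```r
library(ggplot2)

# A small example plot based on the mpg data shipped with ggplot2.
p_example <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# The output format is chosen from the file extension (here a vector PDF).
out_file <- file.path(tempdir(), "scatter.pdf")

ggsave(
  filename = out_file,
  plot     = p_example,
  width    = 12,
  height   = 8,
  units    = "cm"
)

file.size(out_file)  # size of the saved file in bytes
```

Comparing file.size() for the same figure saved with different extensions is a quick way to decide between vector and bitmap output.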
ggplot2 is a powerful tool which allows you to do pretty much anything you want (incl. pie charts) – but you should try to use it effectively.
Some advice from Tufte (2001) on Principles of graphical excellence:
Think of graphical excellence when plotting.
Use the table VGAMdata::oly12 and compare the weight and height of athletes at the London 2012 Summer Olympic Games with the BMI limits set by WHO.
Your results should look like one of following figures. (Feel free to choose your favorite colors. Other features are mandatory.)
tidyr, dplyr, and, of course, ggplot2