There are many packages in R which provide basic or more advanced tools for data visualization. You can use general packages like graphics
, grid
or lattice
or specialized tools like circlize
and many more.
We focus on packages based on the theoretical background of the so-called “grammar of graphics” (Wilkinson 2005, 2010). Wilkinson’s theoretical concepts are implemented in the package ggplot2
. (There is also a newer package ggvis
which is in its spirit and implementation very close to ggplot2
. We found ggplot2
the more convenient option for a researcher because ggvis
is oriented more toward interactive graphics.)
Each figure produced by ggplot2
consists of three basic components:
geom_*()
functions). Each layer adds one thing to the figure. In our schematic example the first layer contains a scatter plot and the second a smoothing line.
All components are controlled separately and independently. The structure of the handout follows this structure. The first chapter deals with data visualization, the second with controlling the visualization, and the last with setting the general appearance of figures.
ggplot2
docs website: http://docs.ggplot2.org/current/
This handout borrows a lot from both and from the package documentation.
The basic design of ggplot2
is based on layers. The final figure is a union of multiple layers, each of which adds one quality to the figure.
To illustrate ggplot2
basics we will use the data set diamonds
which contains data on tens of thousands of stones. We will use a random sample of 500 of them for speed and clarity.
library(ggplot2)
library(dplyr)    # sample_n() and the %>% pipe
library(magrittr) # %<>% and %$% operators
data("diamonds")
# Randomly sample 500 rows
diamonds %<>% sample_n(500)
print(diamonds)
## # A tibble: 500 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.51 Ideal F VS2 61.4 56 1569 5.13 5.16 3.16
## 2 1.51 Premium F SI1 62.2 58 11374 7.32 7.27 4.54
## 3 0.30 Very Good D VVS2 62.9 56 752 4.27 4.29 2.69
## 4 1.12 Premium H VS1 59.6 57 5055 6.82 6.77 4.05
## # ... with 496 more rows
The following sequence of figures illustrates construction by layers. Our goal is a smoothed scatter plot of weight (carat
) and price (price
) of stones in the sample.
First the base layer is generated. The base layer is a formatted “empty blackboard” – big enough to contain the data:
diamonds %>%
ggplot(data = ., mapping = aes(x = carat, y = price))
In the second layer we add dots which represent individual stones (geom_point()
):
diamonds %>%
ggplot(data = ., mapping = aes(x = carat, y = price)) +
geom_point()
The last layer in the example adds a smoothing curve to the figure (geom_smooth()
):
diamonds %>%
ggplot(data = ., mapping = aes(x = carat, y = price)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
geom_*()
functions are shortcuts for the more verbose layer()
. For example, geom_point()
is equivalent to:
layer(
data = NULL,
mapping = NULL,
geom = "point",
stat = "identity",
position = "identity"
)
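The equivalence can be checked directly. The following is a minimal sketch, assuming the full diamonds data shipped with ggplot2 and using layer_data() to extract the data frame that each layer hands over for drawing; the object names are illustrative only.

```r
library(ggplot2)

# Same data and mapping; one layer built with the shortcut, one with layer()
base    <- ggplot(diamonds, aes(x = carat, y = price))
p_geom  <- base + geom_point()
p_layer <- base + layer(geom = "point", stat = "identity", position = "identity")

# layer_data() returns the data a layer passes on for drawing
d_geom  <- layer_data(p_geom, 1)
d_layer <- layer_data(p_layer, 1)

# Compare the point positions of both layers
isTRUE(all.equal(d_geom$x, d_layer$x)) && isTRUE(all.equal(d_geom$y, d_layer$y))
```

If the comparison holds, both calls describe the same layer and differ only in verbosity.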
Each layer()
argument refers to a property of a layer:
data
specified or inherited from the ggplot()
call. (It is not necessary to specify data in the ggplot()
call if they are specified for each layer separately.)
aes()
function. aes()
assigns data (specific columns from the input data frame) to qualities of the geom. For example, in the case of a scatter plot the values of carat
and price
are assigned to the point position, and the color
of stones to the fill color of points – see the following code and the resulting figure:
ggplot(
data = diamonds,
mapping =
aes(
x = carat, # X-coordinate
y = price, # Y-coordinate
color = cut, # color of the "point" margin
fill = color, # color of filling
size = table # size of the point
)
) +
geom_point(
shape = 21, # shape of "points"
alpha = 0.2, # points transparency
stroke = 2 # thickness of margin
)
The example uses all aesthetics available for geom_point()
. One can notice two important things from the example:
ggplot()
call. See that geom_point()
does not specify the data used or the mapping.
layer()
/geom_*()
overrides the specification given in the initial ggplot()
call.
geom
point
is used in the example discussed above.
geom_smooth()
from the example above.
Remember that geom_*()
functions are shortcuts for a specific call of layer() which may differ in all parameters – not only in the geom
parameter. We can demonstrate it on the default settings of layer(), geom_point()
, and geom_jitter()
:
# Default layer() setting:
layer(
data = NULL,
mapping = NULL,
geom = NULL,
stat = NULL,
position = NULL
)
# Default geom_point() setting (a layer() shortcut):
layer(
data = NULL,
mapping = NULL,
geom = "point",
stat = "identity",
position = "identity"
)
# Default geom_jitter() setting (a layer() shortcut):
layer(
data = NULL,
mapping = NULL,
geom = "point",
stat = "identity",
position = "jitter"
)
geom_point()
and geom_jitter()
share the same setting of the geom
parameter but they differ in the position
setting. (Yes, it is confusing, and it is improved in the ggvis
package.) See an example of jittering:
diamonds %>%
ggplot(data = ., mapping = aes(x = x, y = y)) +
# x -- length of stones (mm)
# y -- width of stones (mm)
geom_point(
color = "black"
) +
geom_jitter(
color = "blue",
width = 1,
height = 1,
alpha = 0.3
)
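The shared geom and the differing position can also be read off the layer objects themselves. This is a small sketch that relies on the fact that a layer object exposes its geom and position as ggproto objects; inspecting them this way is meant for exploration, not as a stable interface.

```r
library(ggplot2)

# Layer objects created by the two shortcuts
point_layer  <- geom_point()
jitter_layer <- geom_jitter()

# Both layers draw with the same geom ...
class(point_layer$geom)[1]
class(jitter_layer$geom)[1]

# ... but adjust positions differently
class(point_layer$position)[1]
class(jitter_layer$position)[1]
```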
geom_*()
functions
ggplot2
allows users to construct many figure types in countless colors, sizes, etc. The following text provides a basic overview of the most common figure types and options.
The first thing that usually catches a researcher’s eye is the distribution of a variable. To plot it we use different tools for discrete and continuous variables.
geom_bar()
For a discrete variable it is crucial to see the frequency of the observed options. For example, the data set diamonds
contains the column color
which holds an evaluation of the color of the stones from D
(best) to J
(worst). We can get the frequencies using table()
:
diamonds %$% table(color)
## color
## D E F G H I J
## 70 76 89 99 87 54 25
A common way to visualize the frequencies of a discrete variable is a bar plot.
geom_bar()
can be used to produce various bar plots. The basic (default) setting returns the distribution of a discrete variable (“histogram”), where the height of a bar is equal to the number of observations.
geom_bar()
understands following aesthetics: x
(required), alpha
, colour
, fill
, linetype
, size
. As an example we can draw a bar plot of the color distribution of the stones:
diamonds %>%
ggplot(
aes(
x = color
)
) +
geom_bar(
stat = "count", # default setting
position = "stack" # default setting
)
geom_bar()
returned the number of cases at each x
position. The numbers were supplied by the function stat_count()
which processes data for geom_bar(stat="count")
. It is not common (but it is possible) to use stat_*()
functions directly because each of them is associated with some geom_*()
function.
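To make the role of the stat concrete, here is a hedged sketch that draws the same bars twice: once by calling stat_count() directly and once from counts computed by hand, in which case the stat has to be switched to "identity". It uses the full diamonds data; the helper object names are assumptions for illustration.

```r
library(ggplot2)

# Frequencies computed by hand (columns: color, Freq)
counts <- as.data.frame(table(color = diamonds$color))

# (1) let the stat do the counting, called directly with its default geom
p_stat <- ggplot(diamonds, aes(x = color)) + stat_count()

# (2) pre-computed counts: the stat must leave the data untouched
p_pre <- ggplot(counts, aes(x = color, y = Freq)) + geom_bar(stat = "identity")

# Bar heights carried by both layers
heights_stat <- layer_data(p_stat)$count
heights_pre  <- layer_data(p_pre)$y
all.equal(heights_stat, heights_pre)
```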
Only the mandatory aesthetic x
is used in the example. However, one can use more than one aesthetic. See the examples:
diamonds %>%
ggplot(
aes(
x = color,
fill = cut
)
) +
geom_bar()
diamonds %>%
ggplot(
aes(
x = color,
fill = color
)
) +
geom_bar()
The second example shows that it is possible to map one variable to multiple aesthetics.
A histogram is designed for plotting the distribution of observed values of a continuous variable. First the continuous scale of observed values is divided into intervals (bins), and then the number of observations in each bin is counted and plotted.
A basic histogram can be created using geom_histogram()
. It uses the same aesthetics as geom_bar()
.
diamonds %>%
ggplot(aes(x=price)) +
geom_histogram()
geom_histogram()
allows the user to change the size (argument binwidth
, defaults to NULL
) or the number (argument bins
, defaults to 30) of bins. If binwidth
is set, bins
is ignored.
It is recommended to play a little with binwidth
(or bins
) to find the optimal bin size (number of bins).
diamonds %>%
ggplot(
aes(x=price)
) +
geom_histogram(
binwidth = 1000
)
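The interplay of the two arguments can be explored without looking at the figures. The sketch below, run on the full diamonds data, extracts the binned data with layer_data() and compares the number of bins; the choice of 20 bins and a width of 1000 is arbitrary.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = price))

# Bins set by number, by width, and by both at once
by_number <- layer_data(p + geom_histogram(bins = 20))
by_width  <- layer_data(p + geom_histogram(binwidth = 1000))
both      <- layer_data(p + geom_histogram(bins = 20, binwidth = 1000))

# Number of bins in each variant; the last two should agree
c(by_number = nrow(by_number), by_width = nrow(by_width), both = nrow(both))

# Every observation falls into exactly one bin
sum(by_width$count) == nrow(diamonds)
```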
geom_histogram()
displays observed values, but sometimes it is useful to estimate the true density from the sample. geom_density()
delivers a kernel density estimate of the distribution of the x
variable.
geom_density()
understands following aesthetics: x
, y
, alpha
, colour
, fill
, linetype
, size
, and weight
.
diamonds %>%
ggplot(
aes(x=price)
) +
geom_density()
The density estimate is delivered to geom_density()
by the stat_density()
function which uses kernel = "gaussian"
by default. If you want to use a different kernel, one option is to call stat_density()
directly:
diamonds %>%
ggplot(
aes(x=price)
) +
stat_density(
kernel = "optcosine" # for more options see help for density()
)
How can stat_density()
produce a layer (geom)?! The relationship between geom_*()
and stat_*()
is actually a bit complicated. Most geom_*()
functions have an associated stat
function. On the other hand, stat_*()
functions have an associated geom
function. See the default settings of geom_histogram()
and the related stat_bin()
:
geom_histogram(mapping = NULL, data = NULL, stat = "bin",
position = "stack", ..., binwidth = NULL, bins = NULL, na.rm = FALSE,
show.legend = NA, inherit.aes = TRUE)
stat_bin(mapping = NULL, data = NULL, geom = "bar", position = "stack",
..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL,
closed = c("right", "left"), pad = FALSE, na.rm = FALSE,
show.legend = NA, inherit.aes = TRUE)
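Because each function of the pair fills in the other one as its default, both routes should lead to the same layer. A minimal sketch under that assumption, using the full diamonds data and an arbitrary bin width:

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = price))

# geom first (stat = "bin" by default) versus stat first (geom = "bar" by default)
hist_geom <- layer_data(p + geom_histogram(binwidth = 1000))
hist_stat <- layer_data(p + stat_bin(binwidth = 1000))

# The binned counts of both layers
all.equal(hist_geom$count, hist_stat$count)
```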
It might be important to see differences in density for different groups of stones (e.g. different cuts). ggplot2
provides multiple techniques for this task. The first one can be called grouping. Groups are specified by the argument group
in aes()
. The argument group
expects a variable which divides the variable x
into groups – it can be a discrete (or logical) variable. The task specified in the following geom_*()
call is performed for each group separately and the outcomes are plotted into the same figure:
diamonds %>%
ggplot(
aes(x=price, group = cut)
) +
geom_density()
Hm, the outcome is hard to read. Let’s tune it following the same concept of grouping. If we assign the grouping variable cut
to different aesthetics (which geom_density()
understands!) the result can be clear to read and even breathtaking:
diamonds %>%
ggplot(
aes(x=price,
fill = cut,
color = cut # Only one aesthetics would suffice.
)
) +
geom_density(
alpha = 0.2 # Just to make it even cooler and easy to read.
)
Box and violin plots are designed for comparing the distributions of a variable across groups. The more common box plot can be created using geom_boxplot()
which plots a stylized distribution for each group.
geom_boxplot()
understands the following aesthetics:
lower (lower hinge, 25% quantile),
middle (median, 50% quantile),
upper (upper hinge, 75% quantile),
x (grouping variable),
ymax (upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR),
ymin (lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR),
alpha, colour, fill, linetype, shape, size, weight
diamonds %>%
ggplot(
aes(
x = cut, # Grouping variable
price # Variable to be plotted
)
) +
geom_boxplot()
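The statistics behind the boxes can be inspected and compared with values computed by hand. The following sketch, on the full diamonds data, checks only the middle line of each box against the group median; the other columns (lower, upper, ymin, ymax) can be examined in the same way.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = cut, y = price)) + geom_boxplot()

# One row per box with the computed summary statistics
box <- layer_data(p)

# Group medians computed by hand, in the order of the factor levels
by_hand <- tapply(diamonds$price, diamonds$cut, median)

data.frame(
  cut    = names(by_hand),
  middle = box$middle,
  median = as.numeric(by_hand)
)
```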
A violin plot is similar to a box plot. While a box plot shows observed quantiles, a violin plot draws a rotated kernel density estimate on each side. See an example produced by geom_violin()
:
diamonds %>%
ggplot(
aes(x = cut, price)
) +
geom_violin()
ggplot2
provides tools for visualization of 2D data distribution which are analogous to 1D functions described above.
ggplot2
has two geom_*()
functions analogous to geom_histogram()
which display the distribution of observed combinations of two variables. Both of them split the plane into smaller areas and show the number of observations in each of them. geom_bin2d()
splits the plane into rectangles and geom_hex()
into hexagons.
geom_bin2d()
understands the following aesthetics: x
, y
, and fill
. geom_hex()
adds colour
, fill
, and size
.
geom_hex()
depends on package hexbin
.
diamonds %>%
ggplot(
aes(x = carat, y = price)
) +
geom_bin2d()
diamonds %>%
ggplot(
aes(x = carat, y = price)
) +
geom_hex()
A kernel estimate of a 2D distribution is also available via geom_density2d()
. The function returns contours of the estimated distribution.
geom_density2d()
understands the following aesthetics: x
, y
, alpha
, colour
, linetype
, and size
.
diamonds %>%
ggplot(
aes(x = carat, y = price)
) +
geom_density2d()
The basic tool for visualizing the relationship between two continuous variables is without doubt a scatter plot. You can plot it using geom_point()
.
See the examples above.
diamonds %>%
ggplot(
aes(
x = carat,
y = price
)
) +
geom_point()
It is possible to add more information on individual observations using the many aesthetics available. However, individual observations will always be – at least to some extent – “anonymous”. geom_text()
and geom_label()
allow the user to replace the shape representing an observation by text (a string) defined in the required aesthetic label
.
Both geom_
functions understand aesthetics label
, x
, y
, alpha
, angle
, colour
, family
, fontface
, hjust
, lineheight
, size
, and vjust
.
As an example we can draw a scatter plot of the relationship between price and weight where each stone has a label showing its clarity. As this type of plot is generally more suitable for data sets with a low number of observations, we will further reduce the sub-sample from the diamonds
data set.
diamonds %>% sample_n(50) -> sub_diamonds
sub_diamonds %>%
ggplot(aes(
x = carat,
y = price,
label = clarity
)) +
geom_text()
sub_diamonds %>%
ggplot(aes(
x = carat,
y = price,
label = clarity
)) +
geom_label()
As is apparent from the figure, the difference between geom_text()
and geom_label()
is purely aesthetic. geom_label()
is also considerably slower.
It is also clear that this type of plot very often suffers from over-plotting. As far as I know ggplot2
does not provide any automatic intelligent way to solve it. It is possible to manually adjust the position of labels (see help) or to use check_overlap = TRUE
, which is a crude workaround:
sub_diamonds %>%
ggplot(aes(
x = carat,
y = price,
label = clarity
)) +
geom_text(
check_overlap = TRUE
)
check_overlap = TRUE
just suppresses plotting of labels which would overlap with already plotted text. Oh, dear.
geom_smooth()
returns a smoothed line. It supports multiple smoothing methods: lm, glm, gam, loess, and rlm (given in method
). The default method depends on the number of observations. It also returns a confidence interval around the smooth. Plotting of the confidence interval can be suppressed by setting se = FALSE
.
geom_smooth()
understands aesthetics x
, y
, alpha
, colour
, fill
, linetype
, size
, and weight
.
diamonds %>%
ggplot(
aes(
x = carat,
y = price
)
) +
geom_smooth(
fill = "pink",
colour = "red"
)
Notice that you can easily plot a smoothing curve without plotting the actual observations. You can also compare multiple smoothing methods by simply adding multiple layers:
diamonds %>%
ggplot(
aes(
x = carat,
y = price
)
) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", colour = "red", fill = "pink") +
geom_smooth(method = "loess", colour = "green", fill = "lightgreen")
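The smoothed line is an ordinary model fit evaluated on a grid of x values. As a hedged check, the sketch below fits the linear model by hand with lm() and compares its predictions with the line produced by geom_smooth(method = "lm"); the formula argument is spelled out and the confidence band is switched off to keep the comparison simple.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Grid of x values and fitted y values used for drawing the line
smooth_line <- layer_data(p)

# The same model fitted by hand and evaluated on the same grid
fit     <- lm(price ~ carat, data = diamonds)
by_hand <- predict(fit, newdata = data.frame(carat = smooth_line$x))

all.equal(smooth_line$y, as.numeric(by_hand))
```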
ggplot2
contains some tools for investigating the relationship between three variables. geom_raster()
, geom_tile()
, and geom_rect()
provide similar functionality for plotting rectangles, which is useful when plotting a surface on a plane.
geom_raster()
is the fastest of the three. We will use it, together with data on the estimated density of Old Faithful Geyser eruptions, to demonstrate the use of rectangles.
print(faithfuld)
## # A tibble: 5,625 × 3
## eruptions waiting density
## <dbl> <dbl> <dbl>
## 1 1.600000 43 0.003216159
## 2 1.647297 43 0.003835375
## 3 1.694595 43 0.004435548
## 4 1.741892 43 0.004977614
## # ... with 5,621 more rows
faithfuld %>%
ggplot(
aes(
x = waiting,
y = eruptions,
fill = density
)
) +
geom_raster()
faithfuld %>%
ggplot(
aes(
x = waiting,
y = eruptions,
fill = density
)
) +
geom_raster(
interpolate = TRUE
)
The second figure is created with the option interpolate = TRUE
which delivers a smoother result.
Rectangles might be a bit difficult to combine with other geoms. In this case one can use geom_contour()
which displays contours of a 3D surface in 2D.
In the following task we want to combine estimated density of eruptions stored in table faithfuld
and actual observations from table faithful
:
The first way is to combine both data sets and plot the figure from a single table, using one layer for the density estimates and one for the actual observations. However, ggplot2
allows the user to specify a different data set for each layer. This is the better way to go. We need to:
faithfuld
faithful
ggplot(data = faithfuld,
aes(
x = waiting,
y = eruptions
)
) +
geom_contour(aes(z = density)) +
geom_point(data = faithful)
It is important to understand the use of aes()
in this example. By coincidence both tables share the column names of the variables mapped to x
and y
. This makes things easier. The first use of aes()
in the ggplot()
call sets aesthetics for all layers (even those using different data!). The second use of aes()
in geom_contour()
adds the required aesthetic z
. geom_point()
does not need any additional aesthetics, therefore the argument mapping
is not set – which means that its value is inherited from the ggplot()
call.
If the names in faithful
were different (e.g. obs_eruptions
and obs_waiting
) we would need to rewrite it in the following fashion (which would lead to an identical outcome):
ggplot(data = faithfuld,
aes(
x = waiting,
y = eruptions
)
) +
geom_contour(aes(z = density)) +
geom_point(data = faithful,
aes(
x = obs_waiting,
y = obs_eruptions
))
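The code above cannot be run as it stands because faithful has no columns called obs_waiting and obs_eruptions. A runnable sketch, assuming a hypothetical copy of faithful (here called faithful_obs) with the columns renamed, could look as follows.

```r
library(ggplot2)

# Hypothetical copy of faithful with renamed columns (not part of any package)
faithful_obs <- data.frame(
  obs_eruptions = faithful$eruptions,
  obs_waiting   = faithful$waiting
)

p <- ggplot(faithfuld, aes(x = waiting, y = eruptions)) +
  geom_contour(aes(z = density)) +
  # the layer-level mapping overrides x and y inherited from ggplot()
  geom_point(data = faithful_obs, aes(x = obs_waiting, y = obs_eruptions))

# Data of the second layer: one row per observed eruption
pts <- layer_data(p, 2)
nrow(pts) == nrow(faithful)
```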
The geom_*()
functions described above are primarily designed for cross-sectional data. Especially in economics we often use data with a time dimension. To illustrate the visualization of time series we will use the data set economics
which is supplied together with ggplot2
data("economics")
print(economics)
## # A tibble: 574 × 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <int> <dbl> <dbl> <int>
## 1 1967-07-01 507.4 198712 12.5 4.5 2944
## 2 1967-08-01 510.5 198911 12.5 4.7 2945
## 3 1967-09-01 516.3 199113 11.7 4.6 2958
## 4 1967-10-01 512.9 199311 12.5 4.9 3143
## # ... with 570 more rows
The basic tool for visualization is a line plot produced by geom_line()
.
geom_line()
understands following aesthetics: x
, y
, alpha
, colour
, linetype
, and size
.
economics %>%
ggplot(aes(
x = date,
y = unemploy # No. of unemployed.
)) +
geom_line()
Similar to geom_line()
is geom_step()
which uses the same aesthetics:
economics %>%
arrange(desc(date)) %>% slice(1:36) %>% # Keep the 36 most recent months (last three years)
ggplot(aes(
x = date,
y = unemploy # No. of unemployed.
)) +
geom_step()
Areas can be plotted by geom_ribbon()
which requires specification of x
, ymax
, and ymin
values which set an interval on y-axis. geom_ribbon()
also understands alpha
, colour
, fill
, linetype
, and size
.
economics %>%
mutate(
pce_min = pce*0.90, # 90 % of original pce
pce_max = pce*1.05 # 105 % of original pce
) %>%
ggplot(
aes(
x = date,
y = pce,
ymin = pce_min,
ymax = pce_max
)
) +
geom_ribbon(
fill = "lightgreen"
) +
geom_line()
geom_ribbon()
is useful e.g. for plotting confidence intervals – not necessarily with time dimension.
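As an illustration of the confidence interval use case, the sketch below fits a simple linear trend to unemployment and draws its 95 % confidence band as a ribbon. The linear trend is chosen only for brevity and is not meant as a sensible model of the series; all object names are assumptions.

```r
library(ggplot2)

# Date converted to a number so that it can enter the regression
trend_data <- data.frame(
  date     = economics$date,
  unemploy = economics$unemploy,
  t        = as.numeric(economics$date)
)

fit <- lm(unemploy ~ t, data = trend_data)

# predict() returns the columns fit, lwr and upr
band <- cbind(trend_data, predict(fit, interval = "confidence", level = 0.95))

p_band <- ggplot(band, aes(x = date)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "lightgreen") +
  geom_line(aes(y = fit)) +
  geom_point(aes(y = unemploy), alpha = 0.2)
p_band
```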
Quite often we deal with data with a spatial dimension which we want to visualize on a map. We will illustrate a few examples of spatial data visualization using ggplot2
.
In this example we want to get a map of major earthquakes in Japan. The data on earthquakes come from the NOAA database (https://www.ngdc.noaa.gov/). First we need to load them:
library(readr)
read_csv("data/jpn_quakes.csv") %>%
filter(long > 0) -> jpn_quakes
## Parsed with column specification:
## cols(
## year = col_integer(),
## lat = col_double(),
## long = col_double(),
## magnitude = col_double()
## )
# filter() removes a few very remote observations
print(jpn_quakes)
## # A tibble: 395 × 4
## year lat long magnitude
## <int> <dbl> <dbl> <dbl>
## 1 684 32.5 134.0 8.4
## 2 701 35.7 135.4 7.0
## 3 704 33.8 136.7 7.0
## 4 744 32.4 130.5 6.4
## # ... with 391 more rows
The major problem is always getting a map. There are some packages with geographical data – e.g. maps
and cshapes
. In this example we will use cshapes
which contains a data set of historical borders.
# This function returns map of the world for 2000-1-1:
cshapes::cshp(date=as.Date("2000-1-1")) -> world2000
# In the next step we need to extract a map of Japan
world2000[world2000$ISO1AL3 == "JPN",] %>%
# Resulting object is of class "SpatialPolygonsDataFrame". However ggplot2 needs a data.frame. We can convert it using fortify() from ggplot2 package.
fortify %>% as_data_frame() -> jpn
print(jpn)
## # A tibble: 2,485 × 7
## long lat order hole piece id group
## <dbl> <dbl> <int> <lgl> <fctr> <chr> <fctr>
## 1 134.2994 34.70412 1 FALSE 1 143 143.1
## 2 134.2503 34.71527 2 FALSE 1 143 143.1
## 3 134.2421 34.70388 3 FALSE 1 143 143.1
## 4 134.2494 34.68915 4 FALSE 1 143 143.1
## # ... with 2,481 more rows
Plotting itself is relatively easy. In the first step we use geom_polygon()
for plotting borders of Japan. In the second step we employ table jpn_quakes
to plot individual earthquakes:
ggplot(
data = jpn,
aes(
x = long,
y = lat
)
) +
geom_polygon(
aes(group=group),
fill = NA, # makes polygons transparent
color = "black" # plot borders in black
) +
geom_point(
data = jpn_quakes,
aes(
color = year,
size = magnitude
),
alpha = 0.2
)
In this way you can add even more layers and data sets:
In this figure a map is just another layer with no direct association to data. Another common example of spatial data visualization is a thematic map (https://en.wikipedia.org/wiki/Thematic_map). The function meant for plotting it is geom_map()
.
We are going to construct a thematic map of murder rates in the U.S. First we need to get data on murders:
crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests) %>% as_data_frame()
print(crimes)
## # A tibble: 50 × 5
## state Murder Assault UrbanPop Rape
## * <fctr> <dbl> <int> <int> <dbl>
## 1 alabama 13.2 236 58 21.2
## 2 alaska 10.0 263 48 44.5
## 3 arizona 8.1 294 80 31.0
## 4 arkansas 8.8 190 50 19.5
## # ... with 46 more rows
Column Murder
contains the number of murder arrests per 100,000 residents.
The map used comes from package maps
and it is extracted using map_data
function from ggplot2
:
states_map <- map_data("state") %>% as_data_frame()
print(states_map)
## # A tibble: 15,537 × 6
## long lat group order region subregion
## * <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
## 4 -87.53076 30.33239 1 4 alabama <NA>
## # ... with 15,533 more rows
It is crucial that map data contains ID (region
) which allows us to associate geographical region with murder rate (via column state
).
geom_map()
has two data arguments:
data
contains data to be plotted in the map
map
contains geographical data
Regions in the map are bound to the data by the aesthetic map_id
:
ggplot(crimes,
aes(
map_id = state
)) +
geom_map(
aes(fill = Murder),
map = states_map
) +
expand_limits(
x = states_map$long,
y = states_map$lat
)
When using
geom_map()
you need to specify the area to plot. This can be done using expand_limits()
which includes all x
and y
coordinates which should be plotted in the resulting Figure.
For example in transportation economics (or for training purposes) it might be useful to plot a route on a map. It is simple when you have the data. We will use data exported from Strava which were preprocessed by plotKML::readGPX()
. It is a short run from Romolslia Øvre to Flatåsen Nedre and back (made just for AVED):
load("data/RomFlat.Rdata")
print(run)
## # A tibble: 585 × 4
## lon lat ele time
## * <dbl> <dbl> <dbl> <chr>
## 1 10.35784 63.37758 55.1 2016-07-18T20:57:50Z
## 2 10.35789 63.37755 54.7 2016-07-18T20:57:59Z
## 3 10.35788 63.37752 54.2 2016-07-18T20:58:01Z
## 4 10.35788 63.37747 53.6 2016-07-18T20:58:03Z
## # ... with 581 more rows
The path can be plotted using geom_path()
which plots a link between observations in the order given by their position in input table (i.e. between row 1 and 2, 2 and 3, and so forth).
geom_path()
understands following aesthetics: x
, y
, alpha
, colour
, linetype
, and size
.
run %>%
ggplot(
aes(
x = lon,
y = lat,
color = ele
)
) +
geom_path()
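The importance of row order is easiest to see on a tiny artificial track. The sketch below uses a made-up closed square (not the Strava data) and compares geom_path(), which keeps the input order, with geom_line(), which connects observations in order of the variable on the x-axis.

```r
library(ggplot2)

# A closed square visited corner by corner (illustrative data)
track <- data.frame(
  lon = c(0, 1, 1, 0, 0),
  lat = c(0, 0, 1, 1, 0)
)

path_data <- layer_data(ggplot(track, aes(x = lon, y = lat)) + geom_path())
line_data <- layer_data(ggplot(track, aes(x = lon, y = lat)) + geom_line())

# Order in which the points are connected
path_data$x
line_data$x
```

For a route it is therefore geom_path() that reproduces the movement; geom_line() would scramble any track that turns back on itself.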
There is a package ggmap
which allows you to plot data on various map tiles from various sources. See examples:
Let’s listen to Hadley (Wickham, 2016):
Scales control the mapping from data to aesthetics. They take your data and turn it into something that you can see, like size, colour, position or shape. Scales also provide the tools that let you read the plot: the axes and legends. Formally, each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale). The axis or legend is the inverse function: it allows you to convert visual properties back to data.
We haven’t mentioned any of those things so far. But we were using them – recall the simplified example from the introduction:
ggplot(
data = diamonds,
mapping =
aes(
x = carat, # X-coordinate
y = price, # Y-coordinate
color = cut, # color of the "point" margin
size = table # size of the point
)
) +
geom_point()
This example specifies only aesthetics – i.e. which data should be mapped to which aesthetics. How it should be done is controlled by scales.
In the initial example the scales were left in default setting. The call was therefore identical to:
ggplot(
data = diamonds,
mapping =
aes(
x = carat, # X-coordinate
y = price, # Y-coordinate
color = cut, # color of the "point" margin
size = table # size of the point
)
) +
geom_point() +
scale_x_continuous() +
scale_y_continuous() +
scale_color_discrete() +
scale_size_continuous()
We can use this example to demonstrate some options available via scales:
ggplot(
data = diamonds,
mapping =
aes(
x = carat, # X-coordinate
y = price, # Y-coordinate
color = cut, # color of the "point" margin
size = table # size of the point
)
) +
geom_point() -> p
p + scale_x_continuous(trans = "log") # Transformation (log x-axis)
p + scale_y_continuous(limits = c(500,1000)) # Limits on axes
p + scale_color_brewer(palette = "Paired") # Colors
p + scale_size_continuous(name = "Size") # Legend title
Notice that you can assign a
ggplot
object to a variable!
The names of scales are systematic. They consist of three elements joined by _
:
scale
x
, y
, color
, size
, alpha
, ...)
discrete
, continuous
, brewer
, hue
, ...)
Scales are added to a ggplot()
call in the same manner as layers – i.e. by +
. This is a bit misleading. In the case of layers +
actually adds an additional layer, but with scales it replaces the default (or previously defined) one:
ggplot(
data = diamonds,
mapping =
aes(
x = carat,
y = price,
color = cut
)) +
geom_point() -> p
p + scale_color_brewer(palette = "Paired") -> p1
p + scale_color_brewer(palette = "Paired") + scale_color_brewer(palette = "Set3") -> p2
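That the second scale replaces the first one can be verified from the colours that end up in the plot. A minimal sketch on the full diamonds data; it assumes the RColorBrewer package is installed (scale_color_brewer() relies on it) and uses it to look up the Set3 colours.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()

# Adding a second colour scale prints a message and replaces the first one
p2 <- suppressMessages(
  p + scale_color_brewer(palette = "Paired") + scale_color_brewer(palette = "Set3")
)

# Colours actually assigned to the points
used <- unique(layer_data(p2)$colour)

# First colours of the Set3 palette, one per level of cut
set3 <- RColorBrewer::brewer.pal(nlevels(diamonds$cut), "Set3")

all(used %in% set3)
```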
Special attention is paid to colors in scales, which drive the mapping to the aesthetics fill
and color
. There are four gradient-based methods of mapping for continuous variables and four methods for discrete variables. Let’s start with continuous color scales:
scale_fill_continuous()
is the default option, identical to scale_fill_gradient()
. It allows the user to set a color for low
and high
values. ggplot2
will process any choice of colors, but it is very difficult to come up with a combination which is easy for the human eye and brain to read, therefore you should use some prepared choices:
geyser + scale_fill_gradient()
geyser + scale_fill_gradient(low="white", high="black")
geyser + scale_fill_gradient(
low = munsell::mnsl("5G 9/2"),
high = munsell::mnsl("5G 6/8")
)
stones + scale_color_gradient()
stones + scale_color_gradient(low="white", high="black")
stones + scale_color_gradient(
low = munsell::mnsl("5G 9/2"),
high = munsell::mnsl("5G 6/8")
)
scale_fill_gradient2()
allows the user to combine two color gradients: from low to midpoint and from midpoint to high. The user can manually specify the value of midpoint
(default is midpoint=0
).
faithfuld$density %>% median -> mid
geyser + scale_fill_gradient2(midpoint = mid)
geyser + scale_fill_gradient2(midpoint = mid,
low = "blue",
high = "red",
mid = "white")
diamonds$table %>% median -> mid
stones + scale_color_gradient2(midpoint = mid)
stones + scale_color_gradient2(midpoint = mid, low = "blue", high = "red", mid = "white")
scale_fill_gradientn()
provides the possibility to use an n-element gradient specified by a vector in the argument colors
. It should be used only if there is a strong reason for it. I also recommend using gradients prepared by experts – see some examples:
geyser + scale_fill_gradientn(colours = terrain.colors(7))
# Function terrain.colors(n) from grDevices generates gradient (palette) of n colors.
geyser + scale_fill_gradientn(colours = colorspace::heat_hcl(7))
# Similar function from colorspace package
geyser + scale_fill_gradientn(colours = viridis::viridis(7))
# ...and from viridis package. Viridis provides very nice palettes for color-blind people and even its own function scale_fill_viridis.
scale_fill_distiller()
applies ColorBrewer colors (see http://colorbrewer2.org/) to continuous data. It allows the user to choose from three types of palettes: seq
(sequential), div
(diverging) or qual
(qualitative). ggplot2
can and will use qual
even for continuous data but there is no reason for using it.
A particular palette can be set by its name (see the website) or by its number. You can also change the direction of the palette using direction = 1
or direction = -1
.
geyser + scale_fill_distiller(type = "seq", palette = "YlOrRd", direction = 1)
geyser + scale_fill_distiller(type = "seq", palette = "Oranges", direction = 1)
geyser + scale_fill_distiller(type = "div", palette = "BrBG", direction = 1)
geyser + scale_fill_distiller(type = "seq", palette = "YlOrRd", direction = -1)
geyser + scale_fill_distiller(type = "seq", palette = "Oranges", direction = -1)
geyser + scale_fill_distiller(type = "div", palette = "BrBG", direction = -1)
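The scale examples in this part of the handout build on the plot objects geyser, stones, stones_color and stones_fill, whose definitions are not shown in the text. The following sketch gives one plausible set of definitions so that the examples can be run; the choice of geoms and mappings is an assumption inferred from how the objects are used, not the original code.

```r
library(ggplot2)

# Assumed definitions; the handout does not show how these objects were built
geyser <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster()                       # continuous fill

stones <- ggplot(diamonds, aes(x = carat, y = price, color = table)) +
  geom_point()                        # continuous color

stones_color <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()                        # discrete color

stones_fill <- ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar()                          # discrete fill

# Any of the scale calls from the text can now be added, for example:
geyser_grey <- geyser + scale_fill_gradient(low = "white", high = "black")
geyser_grey
```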
There are also four methods for discrete color scales:
scale_fill_hue()
which picks evenly spaced hues around the HCL color wheel. HCL is the color definition system used by ggplot2
. Colors are defined by three components: hue (h
, [0, 360]), chroma (c
), and luminance (l
, [0, 100]). scale_fill_hue()
returns evenly spaced hues with chroma and luminance held constant.
scale_fill_hue(..., h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1, na.value = "grey50")
The user can control the range of hues as well as the values of chroma and luminance. See the example:
stones_color + scale_color_hue()
stones_color + scale_color_hue(c = 50, l = 10)
stones_color + scale_color_hue(h = c(100,200))
stones_fill + scale_fill_hue()
stones_fill + scale_fill_hue(c = 50, l = 10)
stones_fill + scale_fill_hue(h = c(100,200))
It is very difficult to find good colors using hue
. Therefore it is good to try some prepared palettes.
scale_fill_brewer()
allows the use of qualitative palettes from http://colorbrewer2.org/:
stones_color + scale_color_brewer(type = "qual", palette = "Set1")
stones_color + scale_color_brewer(type = "qual", palette = "Pastel2")
stones_color + scale_color_brewer(type = "seq", palette = "YlOrRd")
stones_fill + scale_fill_brewer(type = "qual", palette = "Set1")
stones_fill + scale_fill_brewer(type = "qual", palette = "Pastel2")
stones_fill + scale_fill_brewer(type = "seq", palette = "YlOrRd")
H.W. recommends using the qualitative palettes Set1
and Dark2
for points and Set2
, Pastel1
, Pastel2
, and Accent
for areas. It also makes sense to use sequential palettes in the case of ordered options.
scale_fill_grey()
provides a grey-scale palette for discrete data. Shades run from the grey value start
to the grey value end
, where 0 is black and 1 is white.
stones_color + scale_color_grey()
stones_color + scale_color_grey(start = 0.5, end = 1)
stones_color + scale_color_grey(start = 0, end = 1)
stones_fill + scale_fill_grey()
stones_fill + scale_fill_grey(start = 0.5, end = 1)
stones_fill + scale_fill_grey(start = 0, end = 1)
scale_fill_manual()
allows the user to define a custom palette or use a palette from a different package.
There are four ways to control the positioning of the representation of observations on the page. Scale transformations were discussed above. A description of the other three options follows.
layer()
function has an argument position
with options:
identity – no position adjustment (default for most geom_*() functions)
jitter – jitter points to avoid overlapping
dodge – place overlapping objects side by side
stack – put overlapping objects on top of each other
fill – stack overlapping objects and rescale each stack to the same height
nudge – shift objects by a set x and y distance
jitterdodge – combines jitter and dodge
Jittering is a technique useful for avoiding overlapping (especially) in scatter plots. Actual coordinates of each observation are randomly changed within specified limits.
The most common use of jittering is via geom_jitter(), a shortcut for layer(geom = "point", position = "jitter", ...):
geom_jitter(mapping = NULL, data = NULL, stat = "identity",
position = "jitter", width = NULL, height = NULL,...)
width
: Amount of horizontal jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here.
height
: Amount of vertical jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here.
Recall an example from the introduction:
diamonds %>%
ggplot(data = ., mapping = aes(x = x, y = y)) +
# x -- length of stones (mm)
# y -- width of stones (mm)
geom_point(
color = "black"
) +
geom_jitter(
color = "blue",
width = 1,
height = 1,
alpha = 0.3
)
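The statement that the total spread is twice the specified value can be checked on a toy data set. The sketch below is an illustration only: the data frame toy is made up, and it assumes a ggplot2 version in which position_jitter() accepts a seed argument.

```r
library(ggplot2)

# Hypothetical toy data: 200 observations stacked on a single coordinate
toy <- data.frame(x = rep(1, 200), y = rep(1, 200))

p_jit <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(position = position_jitter(width = 0.25, height = 0.1, seed = 42))

# layer_data() returns the coordinates that are actually drawn
drawn <- layer_data(p_jit)
print(range(drawn$x))  # stays between 1 - 0.25 and 1 + 0.25
print(range(drawn$y))  # stays between 1 - 0.1 and 1 + 0.1
```

Setting seed makes the jitter reproducible, which helps when the same figure has to be regenerated for a report.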
The stack, fill, and dodge positions are commonly used in bar plots to put bars on top of each other, to show shares within each group, or to place bars next to each other. See an example:
mean.price <- diamonds$price %>% mean
diamonds %>%
mutate(
high.price = price > mean.price
) %>%
ggplot(
aes(
x = cut,
fill = high.price
)
) + scale_fill_brewer(name = "High price", type = "qual", palette = "Set2") -> p
p +
geom_bar(
position = "stack" # default value
)
The resulting figure does not give a clear idea of the ratio of low-price to high-price stones in each category. You can get a clearer picture by setting position = "fill"
:
p +
geom_bar(
position = "fill"
)
This figure nicely shows the shares within groups (cuts), but it does not allow a comparison of counts among groups. For that purpose use position = "dodge"
:
p +
geom_bar(
position = "dodge"
)
It is also possible to specify the width of the dodge:
p +
geom_bar(
position = position_dodge(width=0.5)
)
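The shares shown by a bar plot with position = "fill" can also be computed directly, which is handy for checking the figure. This is a minimal sketch in base R; it uses the full diamonds data rather than the 500-row sample, and the object names are illustrative.

```r
library(ggplot2)  # only needed for the diamonds data set
data("diamonds")

# Same indicator as in the bar plot: is the stone priced above the mean?
high_price <- diamonds$price > mean(diamonds$price)

# Within-cut shares: each row (cut) sums to 1, like a bar under position = "fill"
shares <- prop.table(
  table(cut = diamonds$cut, high_price = high_price),
  margin = 1
)
print(round(shares, 2))
```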
ggplot2
allows you to break a single figure into multiple “facets”. See an example:
diamonds %>%
ggplot(data = ., mapping = aes(x = carat, y = price)) +
geom_point() -> p
p + facet_wrap(~cut, ncol = 3)
In the resulting figure there is a “subfigure” for each cut. All subfigures have identical scales. You can change this behavior using argument scales
with options fixed
(default), free
, free_x
, and free_y
.
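As an illustration of the scales argument, the following sketch frees the y axis only, so every panel gets its own price range while carat stays comparable across panels. It uses the full diamonds data, and the object name p_free is just a placeholder.

```r
library(ggplot2)
data("diamonds")

p_free <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  # each panel gets its own y axis; the x axis remains shared
  facet_wrap(~cut, ncol = 3, scales = "free_y")

# Print p_free to draw the figure
```

Free scales make patterns within each panel easier to see, at the cost of making panels harder to compare with each other.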
facet_grid()
adds a feature that is not available with facet_wrap()
: it lets you arrange facets in a grid along two dimensions:
p + facet_grid(color ~ cut)
ggplot2
supports both linear and non-linear coordinate systems. In most cases we use coord_cartesian()
, the default linear system, where the position of an element is given by x
and y
coordinates.
The following example uses a simple scatter plot to illustrate the default properties of coord_cartesian()
. It uses data from VGAMdata
to show arrow shots. Each shot is described by its X and Y coordinates:
print(archery)
## # A tibble: 126 × 3
## X Y archer
## <dbl> <dbl> <dbl>
## 1 24.14 -9.55 1
## 2 28.55 6.57 1
## 3 3.97 0.46 1
## 4 28.57 26.84 1
## # ... with 122 more rows
archery %>%
ggplot(
aes(
y = Y,
x = X
)
) + geom_point() -> p
p
coord_cartesian()
sets the aspect ratio to fit the required figure size. This is misleading in this case because the units on both axes are the same. Luckily, coord_fixed()
allows the user to set the ratio:
p + coord_fixed(ratio = 1) # ratio = 1 is default value
Another function which modifies the linear coordinate system is coord_flip()
, which simply flips the axes:
p + coord_fixed(ratio = 1) + # ratio = 1 is default value
coord_flip()
The linear coord_*()
functions accept the arguments ylim
and xlim
to zoom in on a part of the figure. Let’s zoom in on the 1st quadrant:
p + coord_fixed(xlim = c(0,30), ylim = c(0,30))
Similar functionality is provided by scales. They might appear identical, but they differ deeply. Let’s see an example using simulated data:
tibble( # data_frame() is deprecated in current tibble/dplyr
x = seq(from=-10, to=10, by=0.1)
) %>%
mutate(
y = x^2 # ...and we have a parabola
) %>%
ggplot(
aes(x=x,y=y)
) + geom_point() -> p
We can plot it with the OLS fitted line and then limit the figure to the 1st quadrant using coord_cartesian()
and scales:
p <- p + geom_smooth(method = "lm", se=FALSE)
p1 <- p + coord_cartesian(xlim = c(0,10), ylim = c(0,100))
p2 <- p + coord_cartesian() +
scale_x_continuous(limits = c(0,10)) +
scale_y_continuous(limits = c(0,100))
The difference is clear. Limits set by coord
truly zoom the figure: all observations are still taken into account (in the OLS fit in this case). Limits set by scales, in contrast, completely exclude the observations outside the limits before the fit is computed.
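The difference can also be inspected numerically by pulling the fitted line out of each plot with layer_data(). The sketch below rebuilds the parabola with base R so that it runs on its own; the object names are illustrative.

```r
library(ggplot2)

# The same parabola as above, built with base R
d <- data.frame(x = seq(from = -10, to = 10, by = 0.1))
d$y <- d$x^2

base_plot <- ggplot(d, aes(x = x, y = y)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

zoomed  <- base_plot + coord_cartesian(xlim = c(0, 10), ylim = c(0, 100))
clipped <- base_plot +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 100))

# Fitted values of the OLS line in each variant
fit_zoomed  <- layer_data(zoomed)
fit_clipped <- suppressWarnings(layer_data(clipped))  # warns about dropped rows

print(range(fit_zoomed$y))                  # nearly flat: fitted on the whole parabola
print(range(fit_clipped$y, na.rm = TRUE))   # clearly increasing: fitted on x >= 0 only
```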
There are also non-linear coordinate systems supported by ggplot2
. Two of them – polar coordinates (coord_polar()
) and map projections (coord_map()
) are quite rarely used.
See an example of a 45 degree line in the Cartesian and polar coordinate systems:
diamonds %>%
ggplot(aes(x = x, y = y)) +
geom_point() + # slope and intercept are not aesthetics of geom_point()
geom_smooth(se=FALSE) -> p
p + coord_cartesian() -> p1 # Just for clarity. coord_cartesian() is default option.
p + coord_polar() -> p2
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
The last one – coord_trans()
– is used to transform axes after statistical transformations are applied. Axis transformations can also be done by scales. However, there is (again) a substantial difference. See the example:
# Generate new data: y = x + white noise
tibble( # data_frame() is deprecated in current tibble/dplyr
x = seq(from=0, to=100, by=0.1)
) %>%
mutate(
y = x + rnorm(n(), sd=3) # vectorized draw, no rowwise() needed
) %>%
filter(y>0,x>0) %>%
ggplot(
aes(x=x,y=y)
) +
geom_point() +
geom_smooth(method="lm", se = FALSE) -> p
p1 <- p + coord_trans(y = "log")
p2 <- p + scale_y_continuous(trans = "log")
With coord_trans()
the transformation is applied after smoothing, so the OLS fit closely follows the data. With a scale, the transformation is applied before smoothing, i.e. geom_smooth()
tries to fit the transformed values. In addition, coord_trans()
transforms the background grid as well.
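The two variants correspond to two different regressions, which can be fitted directly with lm(). This is a sketch on freshly simulated data (seeded, so the numbers differ from the figure above); the object names are illustrative.

```r
set.seed(1)

# y = x + white noise, keeping positive values only so that log(y) is defined
x <- seq(from = 0.1, to = 100, by = 0.1)
y <- x + rnorm(length(x), sd = 3)
d <- data.frame(x = x[y > 0], y = y[y > 0])

# What geom_smooth(method = "lm") fits under coord_trans(y = "log"):
# the model is estimated on the original values and only drawn on a log axis
fit_raw <- lm(y ~ x, data = d)

# What it fits under scale_y_continuous(trans = "log"):
# the model is estimated on the already transformed values
fit_log <- lm(log(y) ~ x, data = d)

print(coef(fit_raw))
print(coef(fit_log))
```

The first slope is close to 1, as the data were generated; the second one describes a straight line through log(y), which is a different model altogether.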
Data-unrelated elements are controlled via themes, which allow the user to completely change the appearance of a figure. ggplot2
supplies eight complete schemes…
…and many more in additional packages. See some examples from ggthemes
:
…and it is still ggplot2
! Notice that the data-related elements are the same in all figures. Themes control only data-unrelated features such as fonts, grids, background colors, font sizes, and so forth.
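The theming system in ggplot2
consists of the following components:
Theme elements, which specify the data-unrelated parts of a figure. For example, the plot.title
element controls the appearance of the plot title.
Element functions, which set the visual properties of elements: element_text()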
, element_line()
, element_rect()
, and element_blank()
element_text(family = NULL, face = NULL, colour = NULL, size = NULL,
hjust = NULL, vjust = NULL, angle = NULL, lineheight = NULL,
color = NULL, margin = NULL, debug = NULL)
element_line(colour = NULL, size = NULL, linetype = NULL,
lineend = NULL, color = NULL)
element_rect(fill = NULL, colour = NULL, size = NULL, linetype = NULL,
color = NULL)
theme()
function, which allows the user to set theme elements, e.g. by calling p + theme(plot.title = element_text(size = 20))
As with colors, it is difficult to come up with a nice theme. Therefore there are a number of ready-to-use complete themes; see the examples above.
We will demonstrate the use of themes by mimicking the appearance of a figure from an OECD report:
Let’s create some data first and plot a figure with all the defaults:
expand.grid(
country = c("MEX","USA","CAN"),
year = 2005:2014
) %>% rowwise %>%
mutate(
value = rnorm(1, mean = 0.25, sd = 2)
) %>%
group_by(country) %>%
mutate(
value = cumsum(value)
) -> oecd
oecd %>%
ggplot(
aes(x=year,y=value)
) +
geom_line(
aes(color = country)
) +
ggtitle("QuasiOECD Figure") -> p
p
The default figure is saved in the variable p
. In the first step we should adjust the data-related elements to fit the OECD figure:
p + scale_color_manual(
# Set name of the scale
name = "Country",
# Set colors as a named vector (see help)
values = c(
"CAN" = "black",
"MEX" = "blue",
"USA" = "grey50"
),
# Set order in the legend -- see help for discrete_scale()
breaks = c("CAN","MEX","USA"),
# Set labels in the legend -- see help for discrete_scale()
labels = c(
"CAN" = "Canada",
"MEX" = "Mexico",
"USA" = "United States"
)
) +
scale_x_continuous(
breaks = unique(oecd$year)
) -> p
p
And now we can adjust the data-unrelated elements. The easiest way is to modify a complete theme, but we will do it from scratch.
In the first step we will set plot elements: background, margin and title:
p +
theme(
# Background properties
plot.background = element_rect(fill="pink", # Fill color
linetype = 3, # Border linetype
size = 5, # Border size
color = "yellow"), # Border color
plot.title = element_text(family = "times", # Font family
face = "bold", # Font face
color = "red", # Font color
angle = 180 # Text angle
), # ...and even more -- see help for element_text()
# Margin is set by a special function margin() -- no element_*()
plot.margin = margin(t = 20, r = 0, b = 5, l = 5)
)
No, not even close. Let’s try it again:
p + theme(
# We will use element_blank() to remove plot.title. element_blank() draws nothing and assigns no space.
plot.title = element_blank()
) -> p
p
We will proceed with axis elements:
p + theme(
# axis.line = element_line() # controls lines parallel to axis -- THERE IS A BUG IN GGPLOT2 v2.1.0 -- use axis.line.x and axis.line.y
axis.title = element_blank(), # There are no axis titles in OECD Figure
# axis.title.x = element_text()
axis.ticks = element_blank(), # There are no actual ticks in OECD Figure, normally set by element_line()
# Length of ticks is again set by a special function unit()
# axis.ticks.length = unit(10, units="pt")
axis.text = element_text(
color = "black",
size = 11
)
) -> p
p
It is time to modify the legend elements:
p + theme(
legend.background = element_rect(
fill = "grey90", # light grey background
color = NA # no border
),
legend.key = element_rect(
fill = NA, # use no extra fill for keys
color = NA # and no border
),
legend.key.width = unit(30, units = "pt"), # make it a bit longer
legend.title = element_blank(), # There is no name in OECD Figure
legend.position = "top",
legend.direction = "horizontal",
legend.text = element_text(
size = 11
)
) -> p
p
And finally panel elements:
p + theme(
panel.background = element_rect(
fill = "#e1fcfd" # Super-light blue as a RGB code
),
panel.border = element_rect(
color = "black",
fill = NA
),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
aspect.ratio = 1
)
…and it is almost as ugly as the original.
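The step-by-step modifications above can also be collected into a single reusable theme object that starts from a complete theme. This is only a sketch of that alternative: the name theme_quasi_oecd and the stand-in data are made up for illustration, and theme_bw() is assumed as the starting point because it already supplies the black panel border.

```r
library(ggplot2)

# Hypothetical reusable theme built on top of the complete theme theme_bw()
theme_quasi_oecd <- theme_bw() +
  theme(
    plot.title        = element_blank(),
    axis.title        = element_blank(),
    axis.ticks        = element_blank(),
    axis.text         = element_text(color = "black", size = 11),
    legend.position   = "top",
    legend.direction  = "horizontal",
    legend.title      = element_blank(),
    legend.background = element_rect(fill = "grey90", color = NA),
    panel.background  = element_rect(fill = "#e1fcfd"),
    panel.grid.major  = element_blank(),
    panel.grid.minor  = element_blank(),
    aspect.ratio      = 1
  )

# Made-up stand-in data, only to show how the theme is applied
demo <- data.frame(
  year    = rep(2005:2014, times = 2),
  value   = c(cumsum(1:10), cumsum(10:1)),
  country = rep(c("CAN", "MEX"), each = 10)
)

p_demo <- ggplot(demo, aes(x = year, y = value, color = country)) +
  geom_line() +
  theme_quasi_oecd
```

Keeping the theme in one object means it can be added to any later figure with a single + theme_quasi_oecd.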
ggsave()
ggsave()
is a function which saves the last plot displayed. It supports export to multiple vector (pdf, svg, eps/ps, and wmf) and bitmap (png, jpeg, tiff, bmp) formats.
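A minimal sketch of a ggsave() call follows. Passing the plot explicitly avoids relying on which plot happened to be displayed last. The file name and the toy plot are placeholders, and the file is written to a temporary directory only to keep the example self-contained.

```r
library(ggplot2)

# Toy plot used only for the export example
p_demo <- ggplot(data.frame(x = 1:10, y = (1:10)^2), aes(x = x, y = y)) +
  geom_point()

# The output format is chosen from the file extension
out_pdf <- file.path(tempdir(), "demo.pdf")
ggsave(out_pdf, plot = p_demo, width = 12, height = 8, units = "cm")

print(file.size(out_pdf))  # size of the exported file in bytes
```

Changing the extension to .png would produce a bitmap instead; in that case the dpi argument controls the resolution.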
Using vector graphics is highly recommended. Vector graphics formats save information on all elements in a figure (their position and other properties), which allows the user to scale them without loss of quality (no blurred edges etc.). On the other hand, this may result in a considerable file size, especially for scatter plots with many (often overlapping) observations. In such a situation you can consider using bitmaps.
The following figure takes less than 200 KB as PNG (a bitmap), over 11 MB as PDF, and 26 MB as SVG.
tibble( # data_frame() is deprecated in current tibble/dplyr
x = runif(200000, min=0.01, max = 5)
) %>%
mutate(
y = log(x) + rnorm(n(), sd=1) # vectorized draw, much faster than rowwise()
) -> sdata
sdata %>%
ggplot(
aes(x=x,y=y)
) +
geom_point(alpha = 0.1) +
geom_smooth(se=FALSE) +
xlab("") + ylab("") +
theme_bw()
## `geom_smooth()` using method = 'gam'
The choice of format also depends on the intended use: PNG and SVG are designed for web sites, PDF and EPS/PS for (e-)printed documents. Make sure that your text tools of choice can process vector graphics!
cowplot
, gridExtra
ggthemes
, xkcd
ggpolypath
, ggthemes
, ggnetwork
…and many, many more packages – see https://cran.r-project.org/web/packages/ggplot2/index.html
ggplot2
is a powerful tool which allows you to do pretty much anything you want (incl. pie charts) – but you should try to use it effectively.
Some advice from Tufte (2001) on Principles of graphical excellence:
Think of graphical excellence when plotting.
We will use simulated data from an experiment. There is a table trial_data
in the file HW_trial_data.Rdata
with columns x
, control
, and treatment
with data observed:
## Source: local data frame [1,000 x 3]
## Groups: <by row>
##
## # A tibble: 1,000 × 3
## x control treatment
## <dbl> <dbl> <dbl>
## 1 -2.6075287 -6.254996 -5.861075
## 2 -0.8257511 -5.875684 -2.458799
## 3 -0.6754555 -2.628165 -2.956889
## # ... with 997 more rows
x
represents an exogenous variable, and the values in control
and treatment
are the responses in the control and treatment groups.
I want you to plot a figure like this one:
## `geom_smooth()` using method = 'gam'
Do not forget…
Use table VGAMdata::oly12
and compare the weight and height of athletes at the London 2012 Summer Olympic Games with the BMI limits set by WHO.
Your results should look like one of the following figures. (Feel free to choose your favorite colors. Other features are mandatory.)
Tips:
Easier solution:
Better but more difficult solution:
You are supposed to submit your solution (code only, without any data or pictures) to IS. Your script should load the data from the package
VGAMdata
and export the resulting figure into results.pdf
. The output file will be located in the working directory. Names of all variables and files are mandatory (otherwise you will get 0 points). Remember that Linux and R are case-sensitive (Windows is not).