AVED: Data visualization (ggplot2)

2016

Before we start…

Install (update) ggplot2 in version 2.2.0 (released 14/11/2016):

https://blog.rstudio.org/2016/11/14/ggplot2-2-2-0/

Changes from 2.1.X:

Subtitles and captions.
A large rewrite of the facetting system.
Improved theme options.
Better stacking.

What is `ggplot2`

One of many packages available for data visualization
Follows a different approach: the so-called "layered grammar of graphics"
Different in implementation: it is a microlanguage within R
Centerpiece of a huge ecosystem of packages: https://cran.r-project.org/web/packages/ggplot2/index.html
The best package you can use for data visualization!

(There is a package similar in spirit, ggvis, which is oriented towards interactive graphics.)

Resources for learning

Wickham, H. (2016): ggplot2 : Elegant graphics for Data Analysis. 2nd edition, Springer.
ggplot2 docs website: http://docs.ggplot2.org/current/
http://ggplot2.tidyverse.org/

Dataset for examples

To illustrate ggplot2 basics we will use the data set diamonds, which contains data on tens of thousands of stones. We will use a random sample of 500 of them for speed and clarity.

## # A tibble: 500 × 10
##    carat     cut color clarity depth table price     x     y     z
##    <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1   0.50   Ideal     E     VS2  62.5    57  1629  5.10  5.04  3.17
## 2   0.32   Ideal     D     VS2  61.0    57   972  4.46  4.42  2.71
## 3   1.01   Ideal     F     SI2  61.5    56  4458  6.42  6.49  3.97
## 4   1.13 Premium     F     VS2  59.8    61  6822  6.81  6.76  4.06
## 5   1.60 Premium     I     VS2  60.2    58  9784  7.63  7.56  4.57
## 6   0.44    Good     H     SI1  63.5    57   733  4.82  4.85  3.07
## 7   0.71   Ideal     D     SI1  60.8    56  2863  5.80  5.77  3.52
## 8   0.36   Ideal     E     VS2  61.7    55   742  4.57  4.60  2.83
## 9   1.09   Ideal     F    VVS2  62.1    56 10246  6.55  6.59  4.08
## 10  0.58   Ideal     E     VS2  62.3    54  1809  5.43  5.39  3.37
## # ... with 490 more rows

Table of contents

Construction of a figure with ggplot2
Data visualization (geoms)
Elements which allow us to understand the data (legends,…)
Data-unrelated elements (fonts,…)
Exporting a figure
Homework 1 and 2

A basic logic of ggplot2

Components of a figure

Each figure produced by ggplot2 consists of three basic components:

Data visualization, which is composed of one or many overlapping layers.
Elements which allow us to understand the visualization: scales (colors, sizes, etc.), legends, and axes.
Data-unrelated elements which define the general appearance of the resulting figure: fonts, grids, background colors, etc.

All three components are controlled independently.

Data visualization: layer by layer

The data visualization is composed of one or many overlapping layers. The final figure is a union of multiple layers, each of which adds one quality to the figure.

Data visualization: layer by layer

In our schematic example the first layer contains a scatter plot and the second a smoothing line.

Layers in motion (1)

The following sequence of figures illustrates the concept of construction by layers. Our goal is to get a smoothed scatter plot of weight (carat) and price (price) of the stones in the sample.

Layers in motion (2)

First the basic layer is generated by a call to the ggplot() function. The basic layer is just an "empty blackboard":

diamonds %>% 
  ggplot(data = ., mapping = aes(x = carat, y = price))

Layers in motion (3)

In the second layer we add dots which represent individual stones (using geom_point() function):

diamonds %>% 
  ggplot(data = ., mapping = aes(x = carat, y = price)) +
  geom_point()

Layers in motion (4)

The last layer in the example adds smoothing curve into the figure (using geom_smooth() function):

diamonds %>% 
  ggplot(data = ., mapping = aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()

Layers in motion (5)

Notice that the smoothing curve is actually drawn over the points! The order of layers really matters.
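A small sketch of the effect of layer order. The seed, the object names (`d`, `p_smooth_on_top`, `p_points_on_top`) and the choice of `method = "lm"` (used only to keep the example light) are assumptions for illustration, not part of the slides:

```r
library(ggplot2)

set.seed(1)                                    # arbitrary seed, only for reproducibility
d <- diamonds[sample(nrow(diamonds), 500), ]   # a 500-row sample, as on the slides

# Points first, smoothing line second: the line is drawn over the points.
p_smooth_on_top <- ggplot(d, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Reversed order: the points are drawn over the line.
p_points_on_top <- ggplot(d, aes(x = carat, y = price)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  geom_point()
```

Printing both objects side by side shows the same data and the same geoms; only the drawing order differs.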

Definition of a layer

We used geom_*() functions to add additional layers. These functions are actually shortcuts for the more verbose layer().
For example, geom_point() is identical to:

layer(
  data = NULL,
  mapping = NULL,
  geom = "point",
  stat = "identity",
  position = "identity"
)

It is very rare to call layer() directly.
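As a hedged check of the equivalence, the sketch below builds the same scatter plot once with the shortcut and once with an explicit layer() call, then compares the data each plot passes to the renderer. The object names are illustrative only:

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price))

# The shortcut ...
p_short <- base + geom_point()

# ... and the verbose form; data and mapping are inherited from ggplot().
p_verbose <- base + layer(
  geom = "point",
  stat = "identity",
  position = "identity"
)

# layer_data() returns the data frame a layer hands over for drawing.
same_data <- isTRUE(all.equal(layer_data(p_short), layer_data(p_verbose)))
```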

Definition of a layer

Each layer() argument refers to a property of a layer:

data is a tidy data frame. Each layer must have data, either specified directly or inherited from the ggplot() call. (It is not necessary to specify data in the ggplot() call if it is specified for each layer separately.)
mapping contains a set of aesthetic mappings specified using the aes() function. aes() assigns data (specific columns from the input data frame) to qualities of the geom.

Aesthetics/mapping (1)

Think about simple scatterplots. How many dimensions (stone qualities) can be plotted in a simple scatterplot?

See ggplot2::diamonds:

## # A tibble: 500 × 10
##   carat     cut color clarity depth table price     x     y     z
##   <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.50   Ideal     E     VS2  62.5    57  1629  5.10  5.04  3.17
## 2  0.32   Ideal     D     VS2  61.0    57   972  4.46  4.42  2.71
## 3  1.01   Ideal     F     SI2  61.5    56  4458  6.42  6.49  3.97
## 4  1.13 Premium     F     VS2  59.8    61  6822  6.81  6.76  4.06
## 5  1.60 Premium     I     VS2  60.2    58  9784  7.63  7.56  4.57
## # ... with 495 more rows

Aesthetics/mapping (2)

ggplot(
  data = diamonds,
  mapping = 
    aes(
      x = carat,      # X-coordinate
      y = price,      # Y-coordinate
      color = cut,    # color of the "point" margin
      fill = color,   # color of filling
      size = table    # size of the point
    )
) + 
  geom_point(
    shape = 21,       # shape of "points"
    alpha = 0.2,      # points transparency
    stroke = 2        # thickness of margin
  )

Aesthetics/mapping (3)

I have cheated – legends are "turned on" by default.

Behold…

The example uses all aesthetics available for geom_point(). One can notice two important things in the example:

Each layer can set all layer arguments. If they are not set, the layer inherits the settings from the initial ggplot() call. Note that geom_point() specifies neither the data nor the mapping.
Aesthetic characteristics can be mapped to data or set to a single value (which is then used for all observations in the Figure). An aesthetic specification in layer()/geom_*() overrides the specification given in the initial ggplot() call.
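The difference between mapping an aesthetic and setting it can be sketched as follows. The colour name "steelblue" and the object names are arbitrary choices for illustration:

```r
library(ggplot2)

# colour is mapped to a column: one colour per level of cut.
p_mapped <- ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +
  geom_point()

# colour is set to a single value inside the layer;
# this overrides the mapping inherited from ggplot().
p_set <- ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +
  geom_point(colour = "steelblue")

# Count the distinct colours that actually reach the drawing stage.
n_mapped <- length(unique(layer_data(p_mapped)$colour))
n_set    <- length(unique(layer_data(p_set)$colour))
```

In the first plot the number of colours equals the number of cuts; in the second every point gets the same colour.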

Other layer properties

geom contains the name of the geometric object used to draw each observation; geom point is used in the example discussed above.
stat is the name of the statistical transformation used, e.g. the smoothing behind geom_smooth() in the example above.
position is the method used to adjust overlapping objects, such as jittering, stacking, and dodging.
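To make the position property concrete, here is a minimal sketch with three position adjustments applied to the same bar layer (geom_bar() is introduced later in the deck; the object names are illustrative):

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = color, fill = cut))

p_stack <- base + geom_bar(position = "stack")  # bars on top of each other (default)
p_dodge <- base + geom_bar(position = "dodge")  # bars side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked and rescaled to a total of 1
```

Only the position argument changes between the three plots; geom and stat stay the same.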

Shortcuts (1)

Remember that geom_*() functions are shortcuts for a specific call of layer(), and the calls may differ in any parameter, not only in the geom parameter.

We can demonstrate this with the default settings of geom_point() and geom_jitter().

Shortcuts (2)

Default geom_point():

layer(
  data = NULL,
  mapping = NULL,
  geom = "point",
  stat = "identity",
  position = "identity"
)

Default geom_jitter():

layer(
  data = NULL,
  mapping = NULL,
  geom = "point",
  stat = "identity",
  position = "jitter"
)
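A short sketch of the same point in runnable form, assuming the built-in mtcars data and two discrete-looking columns (cyl, gear) chosen only because many cars share the same position:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = cyl, y = gear))

p_point  <- base + geom_point()                     # overlapping points hide each other
p_jitter <- base + geom_jitter()                    # the shortcut
p_manual <- base + geom_point(position = "jitter")  # the same idea spelled out
```

The jitter is random, so p_jitter and p_manual will not be pixel-identical, but both add noise to the positions in the same way.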

Exercise 1: Hello diamond!

Plot a scatterplot which depicts the following qualities of cars (datasets::mtcars):

wt – weight (1000 lbs)
qsec – 1/4 mile time
am – transmission (0 = automatic, 1 = manual)

Exercise: dataset

mtcars %<>% as_tibble()
mtcars %>% print(n=5)

## # A tibble: 32 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
## 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
## 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
## 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
## 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
## # ... with 27 more rows

Solutions

You can map variables on different aesthetics. It is up to you to make it look nice.

Solutions

ggplot(
    data = mtcars,
    mapping = aes(x = wt, y = qsec, colour = factor(am))
) +
    geom_point()

ggplot() +
    geom_point(
        data = mtcars,
        mapping = aes(x = wt, y = qsec, colour = factor(am))
    )

Both code snippets produce identical figures. Why?

Data visualization

Distributions
Relationships

`geom_*()` functions

ggplot2 allows users to construct many figure types in countless colors, sizes, etc. The following slides provide a basic overview of the most common figure types and options.

Description of a variable (or variables)

The first thing which usually catches a researcher's eye is the distribution of a variable. To plot it we use different tools for discrete and continuous variables.

Distribution of discrete variable

Bar plots with `geom_bar()`

For a discrete variable it is crucial to see the frequency of each observed value. For example, the data set diamonds contains a column color which holds the color grade of each stone from D (best) to J (worst). We can get the frequencies using table():

diamonds %$% table(color)

## color
##   D   E   F   G   H   I   J 
##  70  85  87 100  84  54  20

A common way to visualize the frequencies of a discrete variable is a bar plot.

Distribution of discrete variable

Bar plots with `geom_bar()`

geom_bar() can be used to produce various bar plots. The basic (default) setting returns the distribution of a discrete variable (a "histogram"), where the height of each bar equals the number of observations.

geom_bar() understands the following aesthetics: x (required), alpha, colour, fill, linetype, size. As an example we can draw a bar plot of the distribution of stone colors.

Distribution of discrete variable

Bar plots with `geom_bar()`

diamonds %>% 
  ggplot(
    aes(
      x = color
    )
  ) +
  geom_bar(
    stat = "count",
    position = "stack"
  )

Distribution of discrete variable

Bar plots with `geom_bar()`

geom_bar() returned the number of cases at each x position. The numbers were supplied by the function stat_count(), which processes the data for geom_bar(stat="count"). It is not common (but it is possible) to use stat_*() functions directly, since each of them is associated with some geom_*() function.

Only the mandatory aesthetic x is used in the example. However, one can use more than one aesthetic.
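The role of stat_count() can be checked with a small sketch that compares the bar heights computed by ggplot2 with the output of table(). It uses the full diamonds data rather than the 500-row sample, and the object names are illustrative:

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = color)) +
  geom_bar()

# layer_data() exposes what stat_count() computed for the layer.
bar_heights  <- layer_data(p)$count
table_counts <- as.vector(table(diamonds$color))
```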

Distribution of discrete variable

Bar plots with `geom_bar()`

diamonds %>% 
  ggplot(
    aes(
      x = color,
      fill = cut
    )
  ) +
  geom_bar()

Distribution of discrete variable

Bar plots with `geom_bar()`

diamonds %>% 
  ggplot(
    aes(
      x = color,
      fill = color
    )
  ) +
  geom_bar()

This example shows that it is possible to map one variable to multiple aesthetics.

Distribution of continuous variables

Histograms

A histogram is designed for plotting the distribution of observed values of a continuous variable. First the continuous scale of observed values is divided into intervals (bins), and then the number of observations in each bin is counted and plotted.

A basic histogram can be created using geom_histogram(). It uses the same aesthetics as geom_bar().

Distribution of continuous variables

Histograms

diamonds %>% 
  ggplot(aes(x=price)) +
  geom_histogram()

geom_histogram() allows the user to change the size (argument binwidth, NULL by default) or the number (argument bins, 30 by default) of bins. If binwidth is set, bins is ignored.
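A minimal sketch of the two arguments and their precedence; the values 10 and 1000 and the object names are arbitrary choices for illustration:

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = price))

# Fix the number of bins ...
h_bins  <- layer_data(base + geom_histogram(bins = 10))
# ... or fix the width of each bin.
h_width <- layer_data(base + geom_histogram(binwidth = 1000))
# When both are given, binwidth takes precedence and bins is ignored.
h_both  <- layer_data(base + geom_histogram(binwidth = 1000, bins = 10))
```

Each of the three data frames has one row per bin, with the bin limits in xmin and xmax and the height in count.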

Distribution of continuous variables

Histograms

It is recommended to play a little with binwidth (or bins) to find optimal bin size (number of bins).

diamonds %>% 
  ggplot(aes(x=price)) +
  geom_histogram(binwidth = 1000)

Distribution of continuous variables

Density plots

geom_histogram() displays observed values, but sometimes it is useful to estimate the true density from the sample. geom_density() delivers a kernel density estimate of the distribution of the x variable.

geom_density() understands the following aesthetics: x, y, alpha, colour, fill, linetype, size, and weight.

Distribution of continuous variables

Density plots

diamonds %>% 
  ggplot(
    aes(x=price)
    ) +
  geom_density()

The density estimate is delivered to geom_density() by the stat_density() function, which uses kernel = "gaussian" by default.

Distribution of continuous variables

Density plots

If you want to use a different kernel, one option is to call stat_density() directly and pass the kernel argument:

diamonds %>% 
  ggplot(
    aes(x=price)
    ) +
  stat_density(
    kernel = "optcosine" 
    # see density() help
  )

stat vs. geom

How can stat_density() produce a layer (geom)?! The relationship between geom_*() and stat_*() is actually a bit complicated. Most geom_*() functions are associated with a default stat. Conversely, stat_*() functions are associated with a default geom.
See the default settings of geom_histogram() and the related stat_bin():

geom_histogram(mapping = NULL, data = NULL, stat = "bin",
  position = "stack", ..., binwidth = NULL, bins = NULL, na.rm = FALSE,
  show.legend = NA, inherit.aes = TRUE)

stat_bin(mapping = NULL, data = NULL, geom = "bar", position = "stack",
  ..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL,
  closed = c("right", "left"), pad = FALSE, na.rm = FALSE,
  show.legend = NA, inherit.aes = TRUE)
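The two-way association can be sketched by building the same histogram from either side. The bin width of 1000 and the object names are illustrative assumptions:

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = price))

# Start from the geom; its default stat is "bin".
p_from_geom <- base + geom_histogram(binwidth = 1000)

# Start from the stat; its default geom is "bar".
p_from_stat <- base + stat_bin(binwidth = 1000, geom = "bar")

counts_geom <- layer_data(p_from_geom)$count
counts_stat <- layer_data(p_from_stat)$count
```

Both calls end up with the same combination of geom, stat, and position, so the computed bin counts agree.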

Distribution of continuous variables

`group`

It might be important to see differences in density for different groups of stones (e.g. different cuts). ggplot2 provides multiple techniques for this task. The first one can be called grouping. Groups are specified by the argument group in aes(). The argument group expects a variable which divides the variable x into groups; it can be a discrete (or logical) variable. The task specified in the following geom_*() call is performed for each group separately, and the outcomes are plotted in the same Figure.

Distribution of continuous variables

Density plots

diamonds %>% 
  ggplot(
    aes(x=price, group = cut)
    ) +
  geom_density()

Distribution of continuous variables

Density plots

Hm, the outcome is hard to read. Let's tune it following the same concept of grouping. If we assign the grouping variable cut to different aesthetics (which geom_density() understands!), the results can be clear to read and even breathtaking:

diamonds %>% 
  ggplot(
    aes(x=price, 
        fill = cut,
        color = cut # Only one aesthetics would suffice. It just looks cool.
        )
    ) +
  geom_density(
    alpha = 0.2  # Just to make it even cooler and easy to read.
  )

Distribution of continuous variables

Box and violin plots

Box and violin plots are designed for comparing the distribution of a variable across groups. The more common box plot can be created using geom_boxplot(), which plots a stylized distribution for each group.

geom_boxplot() understands following aesthetics:

x (grouping variable),
ymax (upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR),
ymin (lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR),
lower, middle, upper (quantiles)
and alpha, colour, fill, linetype, shape, size, weight

Distribution of continuous variables

Box and violin plots

diamonds %>% 
  ggplot(
    aes(x = cut, # Grouping variable
        price)   # Variable to be plotted
  ) + geom_boxplot()
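The slides show only the box plot. A possible violin counterpart, sketched with geom_violin() and a narrow box plot drawn inside each violin (the width of 0.1 and the object names are arbitrary choices):

```r
library(ggplot2)

p_violin <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin() +                # mirrored density estimate for each cut
  geom_boxplot(width = 0.1)      # narrow box plot inside each violin

# The statistics computed for each layer can be inspected with layer_data().
violin_stats <- layer_data(p_violin, 1)
box_stats    <- layer_data(p_violin, 2)
```

The column middle in box_stats holds the median price of each cut, which is the thick line in each box.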

Distribution of 2D data

ggplot2 provides tools for visualizing the distribution of 2D data which are analogous to the 1D functions described above.

ggplot2 has two geom_*() functions analogous to geom_histogram() which display the distribution of observed combinations of two variables. Both of them split the plane into smaller areas and show the number of observations in each of them. geom_bin2d() splits the plane into rectangles and geom_hex() into hexagons.

Distribution of 2D data

`geom_bin2d()`

geom_bin2d() understands the following aesthetics: x, y, and fill. geom_hex() adds colour, fill, and size.

diamonds %>% 
  ggplot(
    aes(x = carat, y = price)
  ) +
  geom_bin2d()

Distribution of 2D data

`geom_hex()`

diamonds %>% 
  ggplot(
    aes(x = carat, y = price)
  ) +
  geom_hex()

Distribution of 2D data

`geom_density2d()`

A kernel estimate of a 2D distribution is also available via geom_density2d(). The function returns contours of the estimated distribution.

geom_density2d() understands the following aesthetics: x, y, alpha, colour, linetype, and size.

diamonds %>% 
  ggplot(
    aes(x = carat, y = price)
  ) +
  geom_density2d()

Relationships of variables

`geom_point()`

The basic tool for visualizing the relationship between two continuous variables is without doubt the scatter plot. You can plot it using geom_point().

It is possible to add more information on individual observations using the many aesthetics available. However, individual observations will always be, at least to some extent, "anonymous". geom_text() and geom_label() allow the user to replace the shape representing an observation with text (a string) defined in the required aesthetic label.

Relationships of variables

`geom_text()` and `geom_label()`

Both geom_ functions understand aesthetics label, x, y, alpha, angle, colour, family, fontface, hjust, lineheight, size, and vjust.

As an example we can draw a scatter plot of the relationship between price and weight where each stone has a label showing its clarity. As this type of plot is generally more suitable for data sets with a low number of observations, we will further reduce the sub-sample from the diamonds data set.
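The slides do not show how the reduced sub-sample was drawn. One possible sketch, where the sample size of 50 and the seed are assumptions and only the name sub_diamonds follows the later slide:

```r
library(ggplot2)

set.seed(42)                                              # assumed seed, for reproducibility only
sub_diamonds <- diamonds[sample(nrow(diamonds), 50), ]    # assumed sample size

base <- ggplot(sub_diamonds, aes(x = carat, y = price, label = clarity))

p_text  <- base + geom_text()    # plain text at each position
p_label <- base + geom_label()   # text on a rounded rectangle
```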

Relationships of variables

`geom_text()` and `geom_label()`

As is apparent from the Figure, the difference between geom_text() and geom_label() is purely aesthetic. geom_label() is also considerably slower.

It is also clear that this type of plot very often suffers from over-plotting. As far as I know ggplot2 does not provide any automatic intelligent way to solve it. It is possible to manually adjust the position of labels (see help) or to use check_overlap = TRUE, which is a dirty way.

Relationships of variables

`geom_text()` and `geom_label()`

sub_diamonds %>% 
  ggplot(aes(
    x = carat,
    y = price,
    label = clarity
  )) +
  geom_text(
    check_overlap = TRUE
  )

check_overlap = TRUE just suppresses the plotting of labels which would overlap with already plotted text. Oh, dear.

Relationships of variables

`geom_smooth()`

geom_smooth() returns a smoothed line. It supports multiple smoothing methods: lm, glm, gam, loess, and rlm (given in method). The default method depends on the number of observations. It also returns a confidence interval around the smooth. Plotting of the confidence interval can be suppressed by setting se = FALSE.
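A hedged sketch of the method and se arguments: a linear smooth without the confidence band, compared with a model fitted by hand. The explicit formula = y ~ x and the object names are choices made for this illustration:

```r
library(ggplot2)

p_lm <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The smoothed line is a set of (x, y) points computed by the stat.
smooth_line <- layer_data(p_lm)

# The same line obtained directly from lm().
fit      <- lm(price ~ carat, data = diamonds)
by_hand  <- unname(predict(fit, newdata = data.frame(carat = smooth_line$x)))
```

The y values in smooth_line should agree with by_hand, which shows that method = "lm" simply plots the fitted regression line.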

geom_smooth() understands aesthetics x, y, alpha, colour, fill, linetype, size, and weight.

Relationships of variables

`geom_smooth()`

diamonds %>% 
  ggplot(
    aes(
      x = carat,
      y = price
    )
  ) + 
  geom_smooth(
    fill = "pink",
    colour = "red"
  )

Notice that you can easily have a smoothing curve without plotting the actual observations.

Relationships of variables

`geom_smooth()`

You can also compare multiple smoothing methods by simply adding multiple layers:

diamonds %>% 
  ggplot(
    aes(x = carat,y = price)
  ) + 
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", 
              colour = "red", 
              fill = "pink") +
  geom_smooth(method = "loess", 
              colour = "green", 
              fill = "lightgreen")

Relationships of variables

3 variables

ggplot2 contains some tools for investigating relationship between three variables. geom_raster(), geom_tile(), and geom_rect() provide similar functionality for plotting rectangles which is useful when plotting surface on a plane.

Relationships of variables

3 variables

geom_raster() is the fastest of the three and together with data on estimated density of Old Faithful Geyser eruptions we will use it to demonstrate use of rectangles.

print(faithfuld, n=5)

## # A tibble: 5,625 × 3
##   eruptions waiting     density
##       <dbl>   <dbl>       <dbl>
## 1  1.600000      43 0.003216159
## 2  1.647297      43 0.003835375
## 3  1.694595      43 0.004435548
## 4  1.741892      43 0.004977614
## 5  1.789189      43 0.005424238
## # ... with 5,620 more rows

Relationships of variables

faithfuld %>% 
  ggplot(aes(x = waiting,
             y = eruptions,
             fill = density)) + geom_raster()

The second Figure is created with the option interpolate = TRUE, which delivers nicer outcomes.
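The code for the second Figure is not shown on the slide; it would presumably look like this (the object name is illustrative):

```r
library(ggplot2)

p_interp <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster(interpolate = TRUE)   # smooth the colour transitions between cells
```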

Relationships of variables

Rectangles might be a bit difficult to combine with other geoms. In this case one can use geom_contour(), which displays contours of a 3D surface in 2D.

In the following task we want to combine the estimated density of eruptions stored in the table faithfuld with the actual observations from the table faithful.

Relationships of variables

The first way is to combine both data sets and plot the Figure with one layer for the density estimates and one for the actual observations, both using the same table.

However, ggplot2 allows the user to specify a different data set for each layer. We need to:

Plot contours using the table faithfuld.
Add a layer with the points defined in the table faithful.

ggplot(data = faithfuld,
       aes(
         x = waiting, 
         y = eruptions
       )
) + 
  geom_contour(aes(z = density)) +
  geom_point(data = faithful)

Relationships of variables

If the names in faithful were different (e.g. obs_eruptions and obs_waiting) we would need to rewrite it in the following fashion (which would lead to identical outcome):

ggplot(data = faithfuld,
       aes(
         x = waiting, 
         y = eruptions
       )
) + 
  geom_contour(aes(z = density)) +
  geom_point(data = faithful,
             aes(
               x = obs_waiting,
               y = obs_eruptions
             ))

Spatial data

Plotting maps

A lot of datasets have a spatial dimension which is useful to visualize on a map. There are two different cases:

the data layer is independent of the map layer (e.g. "points on the map")
the data directly interact with the map and are plotted as a part of the map layer (e.g. thematic maps)

Spatial data

Getting maps

The first thing you need is a map. Some maps are available in R-packages:

maps (a few maps)
cshapes (extremely useful package with historical maps)
…?

However, these packages provide just very basic maps. You will need more than that. You can find a lot of maps on the internet, but those are not constructed for direct usage in ggplot2. You need special tools for loading them and converting them into a data.frame. (ggplot2 can process only data.frames).

Spatial data

ESRI Shapefile (shp)

I recommend using data in so-called ESRI shapefiles. Tools for loading them are in the package rgdal. (rgdal is just a frontend; it needs the GDAL library to be installed.)

You will need two functions from rgdal in particular:

readOGR(), which requires the user to specify the source file (directory) and the map layer name. (One shapefile can contain multiple layers.)
ogrListLayers(), which returns the list of layers available in a shapefile

Here you can download Czech Republic shapefiles with many layers: https://www.arcdata.cz/produkty/geograficka-data/arccr-500 (License!)

Spatial data

ESRI Shapefile (shp)

readOGR returns a special spatial S4 class which you need to transform to data.frame. (Un)fortunately there is a method for it implemented in broom::tidy(). (We will talk about broom later – it is also part of tidyverse.)

broom::tidy() is an alternative and future replacement of ggplot2::fortify()

Spatial data

ESRI Shapefile (shp): Example

Download and unpack ArcČR 500. You will eventually get a directory with a lot of stuff (almost 200 files):

dir("./data/ArcCR500_v33.gdb/") %>% head

## [1] "a0000000a.gdbindexes" "a0000000a.gdbtable"   "a0000000a.gdbtablx"  
## [4] "a0000000a.spx"        "a0000000b.gdbindexes" "a0000000b.gdbtable"

There is no chance of doing anything with it without specialized tools.

Spatial data

ESRI Shapefile (shp): Example

First we need to know which layers are in the shapefile (we need to feed the layer name into readOGR()):

library(rgdal)
ogrListLayers("./data/ArcCR500_v33.gdb/")

##  [1] "Hranice"                "Zeleznice"             
##  [3] "SidlaBody"              "SidlaPlochy"           
##  [5] "VyskoveKoty"            "Silnice_2015"          
##  [7] "BazinyARaseliniste"     "VodniPlochy"           
##  [9] "ZeleznicniStanice"      "VodniToky"             
## [11] "Letiste"                "Lesy"                  
## [13] "ChranenaUzemi"          "Vrstevnice"            
## [15] "KladyTopografickychMap" "KladyZakladnichMap"    
## [17] "SouradnicovaSitJTSK"    "ZemepisnaSitETRS89"    
## [19] "ZemepisnaSitWGS84"      "Silnice_2016"          
## attr(,"driver")
## [1] "OpenFileGDB"
## attr(,"nlayers")
## [1] 20

Spatial data

ESRI Shapefile (shp): Example

Let's assume that one might be a rail-nut:

readOGR("./data/ArcCR500_v33.gdb/","Zeleznice") -> Zeleznice

## OGR data source with driver: OpenFileGDB 
## Source: "./data/ArcCR500_v33.gdb/", layer: "Zeleznice"
## with 3525 features
## It has 5 fields

Spatial data

ESRI Shapefile (shp): Example

Zeleznice %>% class()

## [1] "SpatialLinesDataFrame"
## attr(,"package")
## [1] "sp"

Zeleznice %>% typeof()

## [1] "S4"

Spatial data

ESRI Shapefile (shp): Example

We need to get a data.frame out of Zeleznice. We will use tidy():

library(broom)
Zeleznice %>% tidy() %>% as_tibble() -> Zeleznice_df
print(Zeleznice_df, n=2)

## # A tibble: 20,284 × 6
##        long      lat order  piece  group    id
##       <dbl>    <dbl> <int> <fctr> <fctr> <chr>
## 1 -823494.8 -1070256     1      1    1.1     1
## 2 -823275.9 -1070347     2      1    1.1     1
## # ... with 2.028e+04 more rows

Spatial data

ESRI Shapefile (shp): Example

You can see two major problems here:

long and lat do not look like WGS84 coordinates! You are damn right: it is the S-JTSK coordinate system.
group identifiers are just generic IDs, so it is a problem to match them with data!

Spatial data

ESRI Shapefile (shp): Example

Plot it using geom_path():

Zeleznice_df %>% 
    ggplot(
        aes(x=long, y=lat, group=id)
        ) + 
    geom_path()

geom_path() draws a line between points in the order in which they follow in the data…
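Why the order of rows matters can be sketched with a toy example: a circle traced point by point. The data frame circle and the other names are invented for this illustration:

```r
library(ggplot2)

# A circle traced in order; x first decreases and then increases again.
angle  <- seq(0, 2 * pi, length.out = 50)
circle <- data.frame(x = cos(angle), y = sin(angle))

p_path <- ggplot(circle, aes(x = x, y = y)) +
  geom_path()   # connects points in row order: draws the circle

p_line <- ggplot(circle, aes(x = x, y = y)) +
  geom_line()   # connects points in order of x: draws a zigzag

path_x <- layer_data(p_path)$x
line_x <- layer_data(p_line)$x
```

For map outlines and GPS tracks the row order carries the shape, which is why geom_path() is the right choice there.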

Spatial data

ESRI Shapefile (shp): Example

Zeleznice@data %>% 
    mutate(
        id = row_number() %>% as.character()
    ) %>% 
    left_join(Zeleznice_df,.) -> Zeleznice_df

print(Zeleznice_df, n=5)

## # A tibble: 20,284 × 11
##        long      lat order  piece  group    id ELEKTRIFIKACE KATEGORIE
##       <dbl>    <dbl> <int> <fctr> <fctr> <chr>         <int>     <int>
## 1 -823494.8 -1070256     1      1    1.1     1             1         2
## 2 -823275.9 -1070347     2      1    1.1     1             1         2
## 3 -822975.1 -1070429     3      1    1.1     1             1         2
## 4 -822726.4 -1070459     4      1    1.1     1             1         2
## 5 -822536.8 -1070516     5      1    1.1     1             1         2
## # ... with 2.028e+04 more rows, and 3 more variables: KOLEJNOST <int>,
## #   ROZCHODNOST <int>, SHAPE_Length <dbl>

Spatial data

ESRI Shapefile (shp): Example

Now you can differentiate railways in the picture:

Zeleznice_df %>% 
    ggplot(
        aes(x=long, 
            y=lat, 
            group=id, 
            color = factor(ELEKTRIFIKACE))
        ) + 
    geom_path() +
    coord_fixed()

Spatial data

ESRI Shapefile (shp): Example

It would feel natural to add borders to the figure. But we do not have them in the railway layer.

So we need to access a different layer from a different shapefile and put things together.

ogrListLayers("./data/AdministrativniCleneni_v13.gdb/")

##  [1] "ZakladniSidelniJednotkyBody"       
##  [2] "UzemneTechnickeJednotkyBody"       
##  [3] "UzemneTechnickeJednotkyPolygony"   
##  [4] "KatastralniUzemiBody"              
##  [5] "KatastralniUzemiPolygony"          
##  [6] "MestskeObvodyAMestskeCastiBody"    
##  [7] "MestskeObvodyAMestskeCastiPolygony"
##  [8] "CastiObceBody"                     
##  [9] "CastiObcePolygony"                 
## [10] "ObceBody"                          
## [11] "ObcePolygony"                      
## [12] "ObceSPoverenymUrademBody"          
## [13] "ObceSPoverenymUrademPolygony"      
## [14] "ObceSRozsirenouPusobnostiBody"     
## [15] "ObceSRozsirenouPusobnostiPolygony" 
## [16] "OkresyBody"                        
## [17] "OkresyPolygony"                    
## [18] "KrajeBody"                         
## [19] "KrajePolygony"                     
## [20] "StatBod"                           
## [21] "StatPolygon"                       
## [22] "ZakladniSidelniJednotkyPolygony"   
## attr(,"driver")
## [1] "OpenFileGDB"
## attr(,"nlayers")
## [1] 22

Spatial data

ESRI Shapefile (shp): Example

Read and transform the data:

readOGR("./data/AdministrativniCleneni_v13.gdb/","StatPolygon") %>% 
    tidy() %>% 
    as_tibble() -> CR

## OGR data source with driver: OpenFileGDB 
## Source: "./data/AdministrativniCleneni_v13.gdb/", layer: "StatPolygon"
## with 1 features
## It has 30 fields

Spatial data

ESRI Shapefile (shp): Example

Zeleznice_df %>% 
    ggplot(
        aes(x=long, 
            y=lat, 
            group=id, 
            color = factor(ELEKTRIFIKACE))
        ) + 
    geom_path() +
    geom_path(
        data = CR
    ) +
    coord_fixed()

…and we would end up with an error. Can you tell me why?

Spatial data

ESRI Shapefile (shp): Example

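Zeleznice_df %>% 
    ggplot(
        aes(x=long, 
            y=lat)
    ) + 
    # railways: grouping and color are mapped in this layer only
    geom_path(
        aes(group=id, color = factor(ELEKTRIFIKACE))
    ) +
    # state border: different data, fixed color
    geom_path(
        data = CR,
        aes(group=id),
        color="black"
    ) +
    coord_fixed()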

Exercise 2: Suffer with data

Plot a path of a trail run!

Exercise 2: Data

Transformed data from GPX file produced by Strava.com:

load("data/run_slides.Rdata")
print(run, n=5)

## # A tibble: 476 × 4
##        lon      lat   ele                 time
##      <dbl>    <dbl> <dbl>                <chr>
## 1 16.59912 49.24200 243.4 2016-08-10T16:58:51Z
## 2 16.59916 49.24204 242.7 2016-08-10T16:58:53Z
## 3 16.59922 49.24207 242.2 2016-08-10T16:58:55Z
## 4 16.59926 49.24210 242.1 2016-08-10T16:58:57Z
## 5 16.59930 49.24214 242.3 2016-08-10T16:58:59Z
## # ... with 471 more rows

Exercise 2: Plot the path

run %>% 
  ggplot(
    aes(x = lon, y = lat)
  ) +
  geom_line()

Heart-shaped nonsense…

Exercise 2: Plot the path

run %>% 
  ggplot(
    aes(x = lon, y = lat)
  ) +
  geom_path()

Exercise 2: Plot the path

run %>% 
  ggplot(
    aes(x = lon, 
        y = lat,
        color = ele)
  ) +
  geom_path()

Exercise 2: Cool stuff to do

library(ggmap)

lon <- mean(run$lon)
lat <- mean(run$lat)

get_map(location = c(lon,lat), 
        zoom=16) -> m1

ggmap(m1) -> p

p +
    geom_path(
        data = run,
        aes(
            x = lon,
            y = lat
        ),
        color = "red",
        size = 1
    ) -> p

Homework

Plot density plots for control and treatment group

Data

We will use simulated data from an experiment. The file HW_trial_data.Rdata contains a table trial_data with the columns x, control, and treatment:

## Source: local data frame [1,000 x 3]
## Groups: <by row>
## 
## # A tibble: 1,000 × 3
##            x   control treatment
##        <dbl>     <dbl>     <dbl>
## 1 -2.6075287 -6.254996 -5.861075
## 2 -0.8257511 -5.875684 -2.458799
## 3 -0.6754555 -2.628165 -2.956889
## # ... with 997 more rows

x represents an exogenous variable, and the values in control and treatment are the responses in the control and treatment groups.

Figure

I want you to plot a figure like this one:

Do not forget…

You are supposed to submit code, not a figure
Create a smoothing line for each group (control/treatment)
Smoothing lines are below points
Color of smoothing lines is mandatory
Pay attention to labels
Points differ in shape and color. Selection of shapes and colors is up to you (default is just OK).
Points are slightly transparent (say, 50 %)

All you need is in this presentation + some of tidyr might be useful.
