4. Graphs

Why graphs?

Graphs are a crucial tool in statistical analysis. They are used for exploratory data analysis, for comparing parameters between samples, and for illustrating associations between variables. Many graph types exist, each suited to a different purpose. Graphs are also very useful for communication, including the presentation of data and analysis results: most people simply prefer seeing a graph to studying numbers in a table. Producing good graphs is thus an important part of presenting scientific results [1]. There are no universal rules for what a good graph should look like, but the quality of your graphs will improve quickly with practice and experience. Looking at the graphical presentations of other researchers for inspiration is also very helpful.

An important limitation of graphs is that they cannot display all the information contained in the raw data. An ideal graph should minimize this loss of information while efficiently depicting the patterns of interest. These two requirements are, however, often in conflict. A reasonable solution is often to provide the reader with both the graph and the raw data (attached as supplementary material or deposited in a public repository such as Dryad: https://datadryad.org/stash). Many scientific journals nowadays require disclosure of the original data anyway, which is important for checking the integrity of the analyses presented. By contrast, presenting the same descriptive statistics in both a table and a graph is generally considered superfluous and should be avoided.
Basic graph types

Table 4.1 Summary of basic graph types, their advantages and limitations

Graph type                  Number of variables*             Preservation     Display of sample   Visualization
                                                             of information   parameters          of dependence
Histogram                   1
Boxplot                     1 quantitative + 1 categorical
Barplot (for counts)        1 or 2 categorical
Dotchart (with errorbars)   1 quantitative + 1 categorical                    ++
Scatterplot                 2 quantitative                   ++                                   ++

++ excellent, + good, - (still) adequate, - poor
* Refers to a minimum (typical) number. It may be increased, e.g., by combining multiple categorical predictors or by categorizing the points in a scatterplot.

[1] Note that most readers of scientific papers read only the abstract and then look at the figures, and all of them do this before deciding whether the paper is worth reading further. This also applies to journal editors and submitted manuscripts. Figure quality and attractiveness may thus have a decisive effect on the editor's decision on publication.

Graph plotting in R

There are several ways to plot graphs in R. The two most common are R base graphics and the ggplot2 package. Each approach has its advantages and disadvantages.

R base graphics uses the same syntax as the rest of R, so you do not need to learn another specialized package. However, complicated plots may require quite a lot of programming, and producing good-looking output may require adjusting many parameters in some (albeit not all) graph types.

The ggplot2 package uses its own grammar, which is quite different from base R. This means that you need to learn what is effectively another language. However, with this grammar you can often plot a complicated graph with just a couple of lines of code where base graphics might need twenty.

In this material, I take a pragmatic approach. ggplot2 is generally the focus, as a modern plotting tool. However, if it is easier to produce a particular graph type with base graphics, I choose that way.
The ggplot grammar

Each definition of a ggplot graph starts with the ggplot() function, which defines the data (i.e. the data frame in which to look for variables) and the so-called aesthetic mappings, which specify the variables used for plotting. For a data frame called df with variables x and y used to define a scatterplot, this is:

    ggplot(df, mapping = aes(x = x, y = y))

The scatterplot is then plotted by a geom function:

    geom_point()

The ggplot() definition, the geoms and any further functions are joined by "+". In our case, this would be:

    ggplot(df, aes(x = x, y = y)) + geom_point()

Other elements or arrangements of the plot are specified by additional functions, most of them added with another "+". The most essential among these are:
• theme() - setting visual attributes. There are preset graphical themes; I really prefer theme_classic() or theme_bw() to the default theme_gray().
• xlab(), ylab(), or labs() - specification of axis label text
• facet_wrap() - faceting graphs, i.e. defining multipanel plots with panels based on a variable
• grid.arrange() - creating general multipanel plots (package gridExtra); it is called as a separate function on finished plots rather than added with "+"
• ggsave(file.name, width, height) - saves the most recent plot to disk; it is also called as a separate function. Width and height are in inches by default; add units = "cm" to give them in centimetres. The file type (pdf, svg, png) is set automatically from the file extension.

For ggplot2, abundant reference material is available, such as a free online book (https://ggplot2-book.org/index.html; a printed version can be purchased) and the ggplot2 cheat sheet (https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). You can also find valuable guidance through Google searches such as "scatterplot in ggplot". Such searches are a very efficient way to find guidelines, generally more efficient than using the R help like ?ggplot_function.

R base graphics

R base graphics uses the common R grammar, i.e. plot(df$x, df$y, ...) or plot(y ~ x, data = df, ...), where ... stands for further parameters.
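As a minimal sketch of the two approaches described above, the following code draws the same scatterplot with ggplot2 and with base graphics and then saves the ggplot2 version. The data frame df is simulated here purely for illustration, the axis labels and the output file name are hypothetical, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

# Hypothetical example data: 50 simulated points (not the course data)
set.seed(1)
df <- data.frame(x = rnorm(50))
df$y <- 2 * df$x + rnorm(50)

# ggplot2: data and aesthetics first, then geom, theme and labels joined by "+"
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  theme_classic() +
  labs(x = "Predictor x", y = "Response y")
print(p)

# Base graphics: the same scatterplot using the formula interface
plot(y ~ x, data = df, xlab = "Predictor x", ylab = "Response y")

# Save the ggplot2 version; units = "cm" requests centimetres
# (the file goes to a temporary directory so nothing is overwritten)
out_file <- file.path(tempdir(), "scatter.pdf")
ggsave(out_file, plot = p, width = 12, height = 9, units = "cm")
```

Storing the plot in an object (p) is optional, but it makes it easy to add further layers later and to pass the plot explicitly to ggsave() instead of relying on the most recent plot.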
For base R graphics, there is also a cheat sheet available (http://publish.illinois.edu/iohnrgallagher/files/2015/10/BaseGraphicsCheatsheet.pdf) as well as abundant online resources. The R help is in general quite informative. Information on the parameters of graphical functions can be displayed with ?par.

To export plots produced by base R graphics, you need to open a graphical device that writes to a file, draw the plot there, and close the device, i.e.:

    # Create the file. Width and height are in inches for vector graphics
    # (pdf, svg) and in pixels for raster graphics (png, jpg, tif);
    # 7 and 5 are example values only.
    pdf("filename.pdf", width = 7, height = 5)
    plot(y ~ x, data = df)   # further parameters can be added here
    dev.off()                # closes and saves the file; this is essential

Histogram

The histogram was already introduced in chapter 2. A histogram is constructed in two steps. The range of the values is first divided into a number of intervals, which are plotted on the x-axis. Individual values are then assigned to these intervals, and the resulting frequencies of observations are plotted on the y-axis. Thus, histograms display the data with only minimal loss of information. They are a perfect tool for exploring the distribution of data.

Fig. 4.1. Histogram of the variable xy.2$y2 plotted by base R.

How to do in R

The base R function hist(), applied to the variable to be plotted, produces the histogram. In ggplot, plotting the histogram is rather complicated.

Boxplot

The boxplot was also introduced in chapter 2. Boxplots display a summary of descriptive statistics of samples: the median, quartiles, non-outlier range and outliers. Typically, they are used to study the association between a categorical (factor) and a quantitative (numeric) variable, where they display differences between individual categories (levels). Boxplots do not display means, so they cannot be used for direct comparisons of means.
However, crucial characteristics of the distributions are visible in the plots: variability, symmetry and the presence of outliers. This makes boxplots an important tool of exploratory data analysis
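The histogram and the boxplot described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame dat, its variables group and value, and the simulated distributions are all hypothetical and are not the data used for the figures in this chapter.

```r
# Hypothetical example data: three groups of 40 simulated values each
set.seed(42)
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 40)),
  value = c(rnorm(40, mean = 10, sd = 1),
            rnorm(40, mean = 12, sd = 2),
            rexp(40, rate = 1 / 11))   # skewed group, likely to show outliers
)

# Histogram: hist() chooses the intervals and counts the observations in each
h <- hist(dat$value, xlab = "value", main = "")
h$breaks   # interval boundaries plotted on the x-axis
h$counts   # frequencies plotted on the y-axis

# Boxplot: one box per level of the categorical variable
b <- boxplot(value ~ group, data = dat, xlab = "Group", ylab = "Value")
b$stats    # per group: lower whisker, lower hinge, median, upper hinge, upper whisker

# Boxplots do not show means; they can be overlaid as points if needed
group_means <- tapply(dat$value, dat$group, mean)
points(seq_along(group_means), group_means, pch = 4)
```

Both hist() and boxplot() return the numbers they plot, so the interval counts and the five summary values per group can be inspected directly rather than read off the graph.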