4. Graphs

Why graphs?

Graphs are a crucial tool in statistical analysis. They are used for exploratory data analysis, for comparing parameters between samples and for illustrating associations between variables. A number of graph types exist, each best suited to a particular purpose. Graphs are also very useful for communication, including the presentation of data and analysis results. Most people simply prefer looking at a graph to studying numbers in a table. Producing good graphs is thus an important part of presenting scientific results¹. There are no universal rules for what a good graph should look like, but the good thing is that the quality of your graphs will quickly improve with practice and experience. Getting inspired by the graphical presentations of other researchers is also very helpful.

An important aspect of graphs is that they cannot display all the information contained in the raw data. An ideal graph should minimize this loss of information while efficiently depicting the patterns of interest. These requirements are, however, often in conflict. A reasonable solution often lies in providing the reader with both the graph and the raw data (attached as supplementary material or deposited in a public repository such as Dryad: https://datadryad.org/stash). Many scientific journals nowadays require disclosure of the original data in these ways, which is important for checking the integrity of the analyses presented. By contrast, presenting the same descriptive statistics in both table and graph format is generally considered superfluous and should be avoided.
Basic graph types

Table 4.1 Summary of basic graph types, their advantages and limitations

| Graph type | Number of variables* | Preservation of information | Display of sample parameters | Visualization of dependence |
|---|---|---|---|---|
| Histogram | 1 | ++ | -- | -- |
| Boxplot | 1 quantitative + 1 categorical | + | - | + |
| Barplot | 1 quantitative + 1 categorical | -- | + | ++ |
| Dotchart | 1 quantitative + 1 categorical | -- | ++ | ++ |
| Scatterplot | 2 quantitative | ++ | - | ++ |

++ excellent, + good, - adequate, -- poor
* Refers to the minimum (typical) number. It may be increased, e.g. by combining multiple categorical predictors or by categorizing the points in a scatterplot.

¹ Note that most readers of scientific papers only read the abstract and then look at the figures, and all of them do this before deciding whether the paper is worth further reading. This also applies to journal editors and submitted manuscripts. Figure quality and attractiveness may thus have a decisive effect on the editor's decision on publication.

Histogram

The histogram was already introduced in chapter 2. A histogram is constructed in two steps. The range of values is first divided into a number of intervals, which are plotted on the x-axis. Individual values are then assigned to these intervals, and the resulting frequencies of observations are plotted on the y-axis. Histograms thus display the data with only minimal loss of information. They are a perfect tool for exploring the distribution of data.

Fig. 4.1. Histogram of the variable xy.2$y2

How to do in R

Function hist applied to the variable to be plotted produces the histogram, e.g. hist(xy.2$y2).

Useful tip: Parameter col="grey" makes the histogram more readable/elegant; other colors may be used.

Boxplot

The boxplot was also introduced in chapter 2. Boxplots display a summary of descriptive statistics of samples: the median, quartiles, non-outlier range and outliers.
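As a minimal sketch of the statistics just listed, the values from which a boxplot is drawn can be inspected with the base R function boxplot.stats. The data below are simulated for illustration only (the vector y is an assumption, not the course data); the same sample is also shown as a histogram for comparison.

```r
# Simulated right-skewed sample with one extreme value (illustrative data only)
set.seed(1)
y <- c(rexp(50, rate = 1), 12)

# boxplot.stats returns the values a boxplot is built from
bs <- boxplot.stats(y)
bs$stats  # lower whisker, lower quartile, median, upper quartile, upper whisker
bs$out    # observations beyond the whiskers, drawn as individual points

# The same sample shown as a histogram and as a boxplot
hist(y, col = "grey")
boxplot(y, col = "grey")
```

Comparing the two plots shows how much of the distribution shape survives in the five summary values and the outlier points.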
Typically, boxplots are used to study the association between a categorical (factor) and a quantitative (numeric) variable, where they display differences between individual categories (levels). Boxplots do not display means, so they cannot be used for direct comparison of means. However, crucial characteristics of the distributions are visible in the plots: variability, symmetry and the presence of outliers. This makes boxplots an important tool of exploratory data analysis.

Fig 4.2. Boxplot displaying the values of the variable y2 for individual categories of type.1. Note the non-symmetric distributions and the outliers.

How to do in R

Function boxplot applied to a formula numeric~factor produces the boxplot.

Useful tips:
Parameter col="grey" makes the boxplot more readable/elegant; other colors may be used.
The midpoints of the boxes are located at integer numbers on the x-axis (unless changed by parameters).

Modern alternatives to boxplots

Boxplots have many advantages, which makes them the standard plot type for displaying associations between a categorical and a quantitative variable. However, there are also some issues. An obvious one is that they do not display the mean values. In addition, they may be misleading if the underlying distribution is e.g. bimodal. For these reasons, alternative plot types called the bean plot and the violin plot were developed. While useful, they are still used rather rarely in the biological community. For more information, see e.g. https://cran.r-project.org/web/packages/beanplot/vignettes/beanplot.pdf.

Fig 4.3. Beanplot displaying the values and densities of the variable y2 for individual categories of type.1. The dotted line, bold lines and short narrow lines indicate the global mean, group means and individual observations, respectively.

Barplot

Barplots mostly display means of quantitative variables, in particular differences between the means of individual categories (factor levels).
To judge differences between means, it is also necessary to display a measure of the uncertainty of the mean estimate or of the variability. Barplots are therefore usually supplied with error bars displaying standard errors, confidence intervals or standard deviations. Of these, the best choice is generally the 95% confidence interval, which indicates a range of values that covers the population mean with 95% confidence (more on that in chapter 7). In any case, the graph caption must always specify what the error bars display.

The strength of barplots is that they allow differences between means to be judged. However, this comes with a substantial loss of information: barplots do not display the distribution at all and may even be misleading. The y-axis range in barplots should always start at 0. Displaying negative values (or a combination of negative and positive values) may look awkward. Barplots may also be used to display counts, in which case they display the raw data without any loss of information.

Fig. 4.3. Barplot displaying mean values of variable y2 for individual categories of type.1. Error bars indicate 1 standard error.

How to do in R

Function barplot requires a numeric vector of bar heights (i.e. means).

Error bars are added with function arrows:
    arrows(x0=x.coords, y0=means-err.b, y1=means+err.b, code=3, angle=90, length=0.05)
where err.b is the error-bar parameter (standard error or half-width of the confidence interval). Here code=3 draws a head at both ends of each segment and angle=90 turns the heads into flat caps. The range may also be used, but then the interval is not symmetric around the mean.

The midpoints of bars are not located at integer numbers on the x-axis. To get their coordinates, save the output of barplot in a vector, e.g. x.coords<-barplot(x); this plots the barplot and saves the midpoints in the x.coords vector.

Useful tips:
The barplot function traditionally does not accept the formula input and data parameter (a formula method was added in R 4.0.0); the simplest approach is still to supply precomputed means.
Parameter col="grey" makes the barplot more readable/elegant; other colors may be used.
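The steps above can be sketched as follows. This is only an illustration on simulated data: the names group and y are assumptions, and the error bars show 1 standard error computed with tapply.

```r
# Simulated data: one numeric variable and one factor (illustrative only)
set.seed(42)
group <- factor(rep(c("A", "B", "C"), each = 20))
y <- rnorm(60, mean = c(10, 12, 15)[as.integer(group)], sd = 2)

# Means and standard errors per category
means <- tapply(y, group, mean)
err.b <- tapply(y, group, function(v) sd(v) / sqrt(length(v)))

# barplot returns the x-coordinates of the bar midpoints
x.coords <- barplot(means, col = "grey",
                    ylim = c(0, max(means + err.b) * 1.1),
                    xlab = "group", ylab = "Mean of y")

# Error bars: code = 3 draws heads at both ends, angle = 90 makes them flat caps
arrows(x0 = x.coords, y0 = means - err.b, y1 = means + err.b,
       code = 3, angle = 90, length = 0.05)
```

The ylim setting keeps the y-axis starting at 0 while leaving room for the upper error bars.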
R does not have a dedicated function for plotting a barplot with error bars. I have made one for you: see barchartN.R in IS (Study Materials/Learning Materials/Rfunctions). The barplotN function also implements automatic calculation of means; only the classifying factor and the numeric variable need to be supplied. The type of error bars is specified by parameters.

Dotchart

Dotcharts are closely similar to barplots. They are also very suitable for comparisons between means. However, they display negative values better and allow the y-axis range to be adjusted. They are thus generally considered a superior option to barplots.

Fig. 4.4. Dotchart displaying mean values of variable y2 for individual categories of type.1. Error bars indicate 1 standard error.

How to do in R

Function dotchart requires a numeric vector of means. Using this function is, however, quite awkward. A better option is to use a simple plot:
    plot(1:N, means)
where N is the number of categories and means is the vector containing the means.

Error bars are added with function arrows:
    arrows(x0=1:N, y0=means-err.b, y1=means+err.b, code=3, angle=90, length=0.05)
where err.b is the error-bar parameter (standard error or half-width of the confidence interval). The range may also be used, but then the interval is not symmetric around the mean.

Useful tip: You may change the point symbol of the mean by parameter pch:
pch = 16 creates a filled point
pch = 15 makes a filled box

Because R does not have a dedicated function for plotting a dotchart with error bars, I have made one for you: see dotchartN.R in IS (Study Materials/Learning Materials/Rfunctions). The dotchartN function also implements automatic calculation of means; only the classifying factor and the numeric variable need to be supplied. The type of error bars is specified by parameters.

Scatterplot

A scatterplot is a simple point-based plot illustrating the association between two quantitative variables. The points in a scatterplot usually represent the original data, so there is little loss of information, if any.
Using a scatterplot, it is possible to explore the interdependence between the two variables. A regression line (with confidence intervals) may also be added to the raw scatterplot to visualize a regression model (see chapter 10 for details).

Fig 4.4. Scatterplots displaying the relationships between x and y (a) and between x and y2, with indication of assignment into categories of type.2 (b). Note the different types of relationships between variables: linear (a) and exponential (b).

How to do in R

A scatterplot is produced by function plot(x, y) if both x and y are numeric variables. Alternatively, plot also accepts the formula and data parameters.

Useful tips:
With large data, or data where values are limited to a few integers, points in a scatterplot may overlap and hide each other. A solution is to use a semitransparent point color; the number of overlapping points is then indicated by color intensity. Semitransparent color is specified by parameter alpha in the color-specifying function rgb, e.g. col=rgb(0.2,0.2,0.2,alpha=0.5) produces semitransparent grey.
Function scatterplot of the car package provides many additional functionalities for enhanced scatterplots, such as adding a regression line or drawing categorized scatterplots (i.e. with different types of points). However, its default settings are not very nice.
My friend Pavel Fibich has also scripted a scatterplot function which plots a scatterplot together with a regression line and its confidence intervals. It is called lmconf and is available in IS (Study Materials/Learning Materials/Rfunctions).

General tips for graph creation/adjustments in R

1. Exporting graphs is best done by saving them as separate files. These files may be raster or vector graphics (for an explanation see e.g. https://vector-conversions.com/vectorizing/raster_vs_vector.html).
   a. vectors: functions pdf, svg (svg can easily be postprocessed in Inkscape, https://inkscape.org/).
   b. rasters: functions png, jpeg
   c. general syntax is e.g.
      pdf("file.name.pdf", width.in.inches, height.in.inches)
      plot(x,y)
      dev.off() # Closes the file and saves it to disk.
      In rasters, the width and height are specified in pixels. In addition, raster resolution (in dpi) can also be specified.
2. Graphical parameters are set by function par
   a. ?par provides info on all graphical parameters, which are also used in other functions like plot
   b. Parameter settings made by par affect plots produced afterwards, e.g. par(mar=c(2,2,2,2)) sets all plot margins to 2 text lines. A graph produced by plot afterwards will have such margins.
   c. most important parameters used directly in par:
      mfcol, mfrow: A vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively.
      mar: A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1.
   d. most important graphical parameters used mostly in other functions (like plot):
      xlim: the x limits (x1, x2) of the plot. The same applies to ylim for the y-axis.
      xlab, ylab: axis labels
      main: graph headline
      cex: A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. This starts as 1 when a device is opened, and is reset when the layout is changed.
      las: numeric in {0,1,2,3}; the style of axis labels. 0: always parallel to the axis [default], 1: always horizontal, 2: always perpendicular to the axis, 3: always vertical.
      pch: plotting 'character', i.e., symbol to use. This can either be a single character or an integer code for one of a set of graphics symbols.
      lty: line type (2 for dashed)
      lwd: line width
      log: produces log-scaled axes; use log="x", log="y" or log="xy" for the horizontal, vertical or both axes, respectively.
3. Other useful functions
   a. legend: adds a legend to an existing plot. It is very useful e.g. in scatterplots with multiple point types.
   b. text: adds text at a specified place within the plotting region; mtext adds text to the graph margins
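
Several of the tips above can be combined in one short script. The following is a minimal sketch on simulated data; the variable names, the file name example.pdf and the use of a temporary directory are assumptions chosen for illustration, not part of the course materials.

```r
# Simulated data with two groups and an exponential relationship (illustrative only)
set.seed(7)
x <- runif(100, 0, 10)
type <- factor(sample(c("a", "b"), 100, replace = TRUE))
y <- exp(0.3 * x + ifelse(type == "b", 0.5, 0) + rnorm(100, sd = 0.3))

# Open a PDF device; width and height are in inches
out.file <- file.path(tempdir(), "example.pdf")
pdf(out.file, width = 8, height = 4)

# Two panels side by side with slightly reduced margins
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# Panel 1: semitransparent points, one symbol per group, with a legend
plot(x, y, pch = c(16, 15)[as.integer(type)],
     col = rgb(0.2, 0.2, 0.2, alpha = 0.5),
     xlab = "x", ylab = "y", main = "Raw scale", las = 1)
legend("topleft", legend = levels(type), pch = c(16, 15))
text(x = 8, y = min(y), labels = "n = 100")

# Panel 2: the same data with a log-scaled y-axis and a dashed reference line
plot(x, y, log = "y", xlab = "x", ylab = "y (log scale)",
     main = "Log scale", las = 1)
abline(h = median(y), lty = 2, lwd = 2)
mtext("dashed line: median of y", side = 3, line = 0.2, cex = 0.8)

dev.off()  # closes the file and saves it to disk
```

For a raster file, pdf would be swapped for png or jpeg, with width and height then given in pixels.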