Introduction to the R programming language

Mojmír Vinkler, 16.9.2016

What is R?

R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years.

-- Wikipedia

Why should I care about R?

  • easy to learn & use
  • R is one of the best programming languages for data processing and #1 for statistics
  • used at Google, Microsoft and many others
  • thousands of well maintained open-source libraries on CRAN
  • good support from open-source and enterprise projects (e.g. Spark, Microsoft Azure, Google products, ...)
  • goes well with other languages: use R for prototyping and visual output, then rewrite performance-critical parts in C

Disadvantages of R

  • not ideal for all purposes
  • invented by statisticians, so some language features feel weird to programmers
  • can be slow if used inappropriately (avoid for loops at all costs!)

Alternatives to R for working with data

Python

The only serious "competitor" to R. Python libraries for data analysis like pandas were heavily inspired by R.

SAS, Statistica

Very expensive enterprise solutions. Their intuitive "drag & drop" environments, made for non-programmers, turn out to be hell for programmers.

Excel

Seriously?

Programming environments for R

Interesting libraries

R as a "better" calculator

In [1]:
sqrt((42 + 4.2)^2 + sin(exp(1) * pi))
Out[1]:
46.2083752439454

Variables

In [3]:
a = 1
# the operator <- assigns a value just like = does (and is the more common style in R code)
b <- 2
c = a + b
c
Out[3]:
3

Basic data types

In [44]:
# numerical type
a = 2
a = 2.2

# string
a = 'text'

# true / false
a = TRUE
a = FALSE
a = T
a = F

# vectors
a = c(1,2,3)
a = 1:3
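
To check which type a value has, base R's class() can be called on it. The block below is a minimal sketch, not part of the original notebook. It also illustrates a caveat about the T and F shorthands used above: they are ordinary variables that can be overwritten, whereas TRUE and FALSE are reserved words. The names shadowed and restored are made up for the illustration.

```r
# class() reports the type of a value
class(2.2)        # "numeric"
class('text')     # "character"
class(TRUE)       # "logical"
class(c(1, 2, 3)) # "numeric"
class(1:3)        # "integer" (the colon operator produces whole numbers)

# T and F are ordinary variables that merely default to TRUE and FALSE
T = 0
shadowed = isTRUE(T)   # FALSE, because T now holds 0
rm(T)                  # remove our variable; T means TRUE again
restored = isTRUE(T)   # TRUE
```

For this reason it is usually safer to spell out TRUE and FALSE in scripts.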

Operators

In [20]:
1 == 1
1 != 2
1 < 2
!TRUE
is.na(NaN)
is.null(NULL)
Out[20]:
TRUE
Out[20]:
TRUE
Out[20]:
TRUE
Out[20]:
FALSE
Out[20]:
TRUE
Out[20]:
TRUE

Advanced data types

In [4]:
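# matrix() fills the values column by column (columns are 1:3, 4:6, 7:9)
A = matrix(1:9, 3, 3)
# rbind() stacks vectors as rows; this overwrites A with the transpose of the matrix above
A = rbind(c(1,2,3), c(4,5,6), c(7,8,9))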
A
Out[4]:
1 2 3
4 5 6
7 8 9
In [5]:
# data frame (= table with named columns, where each column can have its own type)
df = data.frame(a=c(1,2,3), b=c('a', 'b', 'c'))
head(df)
Out[5]:
  a b
1 1 a
2 2 b
3 3 c
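
What separates a data frame from a matrix is that each column keeps its own type. The following is a small sketch of that difference using only base R. The argument stringsAsFactors = FALSE is added here as an assumption so the result is the same on R versions before and after 4.0; the names col_types and m are illustrative.

```r
df = data.frame(a = c(1, 2, 3), b = c('a', 'b', 'c'),
                stringsAsFactors = FALSE)

# each column of a data frame has its own class
col_types = sapply(df, class)   # a: "numeric", b: "character"

# a matrix can hold only one type, so the numbers are converted to text
m = as.matrix(df)
class(m[1, 'a'])                # "character"
```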

Operations on vectors

In [6]:
a = c(1,2,3)
b = c(3,4,5)
# addition / multiplication by elements
print(a + b)
print(a * b)
# inner (dot) product, returned as a 1x1 matrix
print(a %*% b)
[1] 4 6 8
[1]  3  8 15
     [,1]
[1,]   26
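
For two plain vectors, %*% yields their inner (dot) product as a 1x1 matrix. As a hedged sketch of how the related base R operations compare (the variable names are illustrative only):

```r
a = c(1, 2, 3)
b = c(3, 4, 5)

dot = sum(a * b)        # inner product as a plain number: 26
dot_mat = a %*% b       # same value, but stored in a 1x1 matrix
outer_ab = a %o% b      # outer product: 3x3 matrix with outer_ab[i, j] == a[i] * b[j]

# despite its name, crossprod(a, b) computes t(a) %*% b, i.e. the inner product again
cp = crossprod(a, b)
```

drop(dot_mat) turns the 1x1 matrix into an ordinary number when that is more convenient.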

Operations with matrices

In [33]:
A = matrix(1, 2, 3)

# element-wise multiplication
A * A
Out[33]:
1 1 1
1 1 1
In [36]:
# matrix multiplication
A %*% t(A)
Out[36]:
3 3
3 3

Data indexing

In [55]:
a = 1:10
a[4]
Out[55]:
4
In [58]:
A = matrix(1:9, 3, 3)
# element
A[1,3]

# first row
A[1,]

# first column
A[,1]
Out[58]:
7
Out[58]:
  1. 1
  2. 4
  3. 7
Out[58]:
  1. 1
  2. 2
  3. 3
In [7]:
A = data.frame(a=c(1,2,3), b=c('a', 'b', 'c'))
# column `a`
A[,'a']
# first row
A[1,]
# columns `a` and `b`
A[,c('a', 'b')]
Out[7]:
  1. 1
  2. 2
  3. 3
Out[7]:
  a b
1 1 a
Out[7]:
  a b
1 1 a
2 2 b
3 3 c

Basic functions

In [29]:
a = c(1,2,3)
sum(a)
mean(a)
max(a)
min(a)
sd(a)
Out[29]:
6
Out[29]:
2
Out[29]:
3
Out[29]:
1
Out[29]:
1

Graphics

In [36]:
options(repr.plot.width=7, repr.plot.height=5)
In [37]:
x = 1:100
y = sin(0.1 * x)
plot(x, y, xlab='x-label', ylab='y-label', main='Main title')
In [38]:
require(datasets)
pairs(iris[1:4], main="Edgar Anderson's Iris Data", pch=21,
       bg = c("red", "green3", "blue")[unclass(iris$Species)])

Functions

In [39]:
add = function(a, b, c=1){
    return(a + b + c)
}

add(3, 4)
add(3, 4, c=2)
Out[39]:
8
Out[39]:
9

If-else conditions

In [40]:
answer = 42
if(answer == 42){
    print('Correct answer')
} else if(abs(answer - 42) <= 2){
    print('Almost...')
} else {
    print('Wrong answer')
}
[1] "Correct answer"

Loops

Always vectorize your code if you can! If you can't, then either implement the core parts in C or take a very long coffee break.

In [17]:
vec = 1:10
for(i in vec){
    print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
In [11]:
i = 0
while (i < 2){
    i = i + 1
    print(i)
}
[1] 1
[1] 2
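
To make the vectorization advice at the start of this section concrete, here is a minimal sketch that computes the same squares once with an explicit loop and once with a single vectorized expression. The variable names are made up for the illustration, and how large the speed difference is depends on the R version and the task.

```r
x = 1:100000

# loop version: square the elements one at a time
squares_loop = numeric(length(x))
for (i in seq_along(x)) {
    squares_loop[i] = x[i]^2
}

# vectorized version: one expression applied to the whole vector
squares_vec = x^2

# both approaches produce the same values
same = isTRUE(all.equal(squares_loop, squares_vec))
```

Wrapping either version in system.time() is one way to compare their running times on your own machine.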

Loading data

In [24]:
data = read.csv('foo.csv', sep=';', header=TRUE)
head(data)
Out[24]:
  id    sex age
1  1   male  25
2  2 female  17
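
The file foo.csv itself is not shown. As a self-contained sketch, the code below writes a temporary file whose contents are an assumption reconstructed from the output above, then reads it back. read.csv2 is a base R variant of read.csv that uses ';' as the separator and ',' as the decimal mark by default.

```r
# write a small semicolon-separated file to a temporary location
# (hypothetical contents, chosen to match the table shown above)
path = tempfile(fileext = '.csv')
writeLines(c('id;sex;age', '1;male;25', '2;female;17'), path)

# same call as above, only with the temporary path
data = read.csv(path, sep = ';', header = TRUE)

# read.csv2 already defaults to sep = ';'
data2 = read.csv2(path)

unlink(path)   # clean up the temporary file
```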

External libraries / scripts

In [ ]:
# install package from CRAN
install.packages('circular')

# attach the package so its functions can be called by name
library(circular)

# execute code from file (load external functions)
source('my-functions.R')
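
To show what source() does without relying on my-functions.R, the sketch below creates a throwaway script in a temporary file and loads it. The function double_it and the file contents are hypothetical and exist only for this example.

```r
# create a temporary script that defines a single function
script = tempfile(fileext = '.R')
writeLines('double_it = function(x) 2 * x', script)

# source() runs the file, so double_it becomes available afterwards
source(script)
result = double_it(21)   # 42

unlink(script)           # clean up the temporary file
```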

Function apply

The apply function applies a given function to every row or every column of a matrix (or data.frame). Its second argument selects the direction: 1 for rows, 2 for columns.

In [27]:
# column mean
apply(data[,c('id', 'age')], 2, mean)
Out[27]:
id   1.5
age  21
In [28]:
# row mean
apply(data[,c('id', 'age')], 1, mean)
Out[28]:
  1. 13
  2. 9.5
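
The same calls can be tried without a CSV file. The sketch below rebuilds the two numeric columns as a small matrix (values taken from the output above) and compares apply with the base R shortcuts colMeans and rowMeans; the variable names are illustrative.

```r
m = rbind(c(1, 25), c(2, 17))
colnames(m) = c('id', 'age')

col_means = apply(m, 2, mean)   # 2 = one result per column
row_means = apply(m, 1, mean)   # 1 = one result per row

# colMeans / rowMeans cover these two common cases directly
same_cols = isTRUE(all.equal(col_means, colMeans(m)))
same_rows = isTRUE(all.equal(row_means, rowMeans(m)))

# any function can be passed, including an anonymous one
col_ranges = apply(m, 2, function(v) max(v) - min(v))
```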

Other functions

In [30]:
# cat() prints its arguments joined by spaces, without quotes or the [1] prefix
cat('Print', 'text', 42, '\n', 'to next line')
Print text 42 
 to next line
In [35]:
# string concatenation
print(paste('Text', 'with', 'spaces'))
print(paste0('Text', 'without', 'spaces'))
[1] "Text with spaces"
[1] "Textwithoutspaces"

To be updated...