Statistics
for Computer Science
doc. PaedDr RNDr. Stanislav Katina, PhD.
Institute of Mathematics and Statistics, Masaryk University
Honorary Research Fellow, The University of Glasgow
Lecture 01
StKa, Oct 6 2015
Syllabus
1. Why computer scientists should study statistics?
2. Computer science related problems with analysed data
3. Why the thought study based on data is useful?
4. Data types
5. Sampling
6. Parametric probabilistic and statistical models
7. Likelihood principle and parameter estimation using numerical methods
8. Descriptive statistics (tables, listings, ﬁgures)
9. From description to statistical inference
10.Hypothesis testing and parameters of a model
11.Goodness-of-ﬁt tests
12.Testing hypotheses about one-sample
13.Testing hypotheses about two-samples
14.Testing hypotheses about more than to sample problems
15.Interpretation of statistical ﬁndings StKa, Oct 6 2015
Why computer scientist should study statistics?
Joke "hot-air balloon"
There is a story about two doctors who are ﬂoating above the
countryside in a hot-air balloon. They are drifting with the wind and
enjoying the scenery, but after couple of hours, they realise that they
are totally lost. They see someone down on the ground, and shout
down "Hello! Can you tell us where we are?" The person on the
ground replies, "You are ﬁfty feet up in the air, in a hot-air balloon."
One doctor turn to the other and says, "That person on the ground
must be a statistician." "How did you know?", came astonished reply
from the ground. "Only a statistician would provide an answer that
was totally accurate and totally useless at the same time." But in this
story, the statistician has the last word. "Very good. But I can also tell
that you two are doctors."
StKa, Oct 6 2015
Why computer scientist should study statistics?
Reality "BBC report"
A UK social atlas suggests that British society is becoming more
segregated by class, researchers have said [. . .] It found that:
• An average child in the wealthiest 10% of neighbourhoods can
expect to inherit at least 40 times as much wealth as a typical child
in the poorest 10%
• In some areas, 16-to-24-year-olds are 50 times more likely to attend
an elite university than in others
• In the most impoverished parts of the country young adults in this
age group are almost 20 times more likely not to be in education,
employment or training than those in the wealthiest
neighbourhoods
• There are no large neighbourhoods where under ﬁve year olds from
the highest social class spend time with any other class of children
other the one just beneath them.
(http://news.bbc.co.uk/2/hi/uk_news/6984707.stm)
StKa, Oct 6 2015
Why computer scientist should study statistics?
Reason "all is about the data"
In any scientiﬁc investigation, there are bound to be some sources
of bias. Perhaps the sample wasn’t quite random, or maybe the
respondents tended to overestimate their income. We do
everything we can to minimize those biases, but we can never
eliminate them entirely. Consequently, small, artifactual relationships
are likely to creep into the data. A large sample [or dataset] is like a
very sensitive measuring instrument. It’s so sensitive that it detects
those artifactual relationships along with the true relationships.
(Allison 1999, pp.58–59)
StKa, Oct 6 2015
Why computer scientist should study statistics?
Don't believe everything you read!
It is good to question what a statistic is telling you. Something can
appear statistically important yet still have no social or scientiﬁc
meaning. Be guided by theory and by prior research ﬁndings, as
well as by what the data are saying. Try not to 'ﬁt the facts' to the
data, just because it is convenient to do so. Statistics are an aid for
your intelligence and judgement, not a replacement for them!
StKa, Oct 6 2015
Three types of statistics
Descriptive statistics
Descriptive statistics is providing a summary of a set of measurements.
Arithmetic averages may be calculated, the most common values
identiﬁed and the spread of values obtained (for example, by ﬁnding the
minimum and maximum).
Descriptive statistics are useful in their own right and as a precursor to more
sophisticated analysis. They are cognitively useful: they help to frame
understanding of the data and of the real-world events and processes they
measure. They are also useful diagnostically, especially for detecting error
(often caused by missing data due to incomplete record keeping)
EXAMPLE: The (mean) average annual rainfall at Darwin airport in Northern
Australia was 1847.1 mm for the period 1971–2007. For Melbourne, on the
south coast, it is lower: 654.4 mm. (Source: Australian Government Bureau
of Meteorology www.bom.gov.au)
StKa, Oct 6 2015
Why computer scientist should study statistics?
Reason "all is about the data"
The beneﬁt of studying statistics is in gaining a skill set that is
transferable to other research methods, disciplines and walks of
life.
Studying statistics encourages an approach to research that is
reﬂective, thoughtful and mindful of the limitations of data and
their analysis. Encouraging the researcher to form a clear and
manageable research question, a means to answer the question,
and awareness of the assumptions, methodological limitations
and the researcher’s own prejudices is a discipline conducive to
all empirical work, quantitative or qualitative.
For statistical research, it is good practice to ask whether the results
have both statistical and substantive meaning. Focusing on the
statistical helps to avoid sensationalising events that are either
potentially random or entirely predictable – to avoid claiming
they are unusual when they are not.
StKa, Oct 6 2015
Three types of statistics
Inferential statistics
Inferential statistics go beyond describing a set of data in its own right and
use the information to say something about the population from which
the data were sampled.
EXAMPLE: The UK Conservative Party (as part of the governing coalition) is
predicted to have the support of 37% of the electorate against 41% for the
Labour Party. How does a polling company know this when it asked only
1500 people from an electorate of approximately 45 million? It does not
know, not for certain, but it can make an inference if it believes the
sample is representative of the electorate as a whole. Being
representative means those who have been polled are not disproportionately
located in, say, Labour-voting industrial areas or Conservative-voting rural
areas. The poll is not biased towards a particular section of the electorate.
StKa, Oct 6 2015
Three types of statistics
Relational statistics and potentially explanatory
At a minimum they identify whether particular events or circumstances
coincide.
EXAMPLE: Does shaving less than once a day really increase a man’s risk of
having a stroke (BBC News online, 7 February 2003)? Unlikely, but they could
well be related, perhaps by lifestyle.
Going further, relational statistics become explanatory when they offer ﬁrm insight
on "what causes what".
EXAMPLE: In an analysis of rural to urban migration in China, Wang and Fan
(2006) identify some of the personal, social and economic reasons that can lead
a migrant not to continue in a city but instead return home to the countryside.
Chief amongst them is age (the older the migrant, the more likely they are to
return) but important also is family responsibility – those who are married and/or
have children are increasingly likely to return.
Going further, if statistics are explanatory then they may also be predictive. A
statistical model might be built to answer ‘what if?’ questions.
StKa, Oct 6 2015
Three types of statistics
There are three types of statistic: those that describe
and summarise a set of data in its own right; those that
'go beyond' the data to infer something more general
about the population from which the data were sampled;
and those which examine relationships.
StKa, Oct 6 2015
Analysis and errors
Analysis
The Oxford English Dictionary gives one deﬁnition of it: ‘a detailed
examination of the elements or structure of something’.
For statistical analysis we examine data – information that can be stored,
retrieved, queried, summarised, classiﬁed and used for calculations by a
statistical package running on a computer. The data will often be numeric
but not always.
EXAMPLE: Use a word processor to count how many times the word
statistics appears in this lecture. The result is numeric but the objects of the
analysis are words.
When we examine data we do so because we are interested in what the
data represent. If they are measurements, then our interest is in what has
been measured.
EXAMPLE: If I measure the heights of 6, 7, 8, 9, 10-years-old children, I am
not merely interested in the numbers. I am interested in what those
numbers tell me about my children’s heights (for example, how much
they have changed since the last time I made a measurement).
StKa, Oct 6 2015
Analysis and errors
Errors
The problem is that any measurement is subject to error.
EXAMPLE: The children not standing straight, the tape measure
warping, myself not aligning it vertically, and so forth.
An analogy is often used from physics and engineering.
EXAMPLE: Consider a radio wave carrying an analogue transmission
of a news broadcast. A number of factors affect the quality of what
you hear, including the quality of the radio, the size of its aerial, the
distance from a transmitter, atmospheric conditions, and so forth.
What you want to hear is the news – the signal. Any background hiss
is unwanted noise.
A challenge of statistical analysis is to separate ‘the noise from the
signal’.
StKa, Oct 6 2015
Analysis and errors
Errors
EXAMPLE: Imagine we used a questionnaire-based survey to gather
data about business and manufacturing for a study about regional
economics and competitiveness. What we have to accept is that the data
obtained are imperfect. People may misunderstand the questions or
make an error when completing a tick box. We may miscode the data
or ask questions that are not fully relevant. Nevertheless, we hope to
take the data and use them to understand why some regions prosper
whilst others decline.
There will always be noise in data.
EXAMPLE: Take a ruler and measure the length of the following: _________.
How long is it? Can you repeatedly get the same exact answer twice?
Probably not, because it depends on exactly where you position the ruler,
where your eye adjudges to be the end of the line, the ruler being perfectly
aligned with the line, a lack of hand shake, no creasing of the paper, and so
forth.
Obviously it has length. And you can probably measure it well, just not
deﬁnitively because you cannot avoid the chance of error. Also, you need
to be wary of systematically repeated error: for example, measuring the line
in centimetres but thinking they are inches; or using a plastic ruler that
has warped in the Sun.
StKa, Oct 6 2015
Analysis and errors
Errors and variability
Because noise – error – creeps into any analysis, no single answer,
measurement or result will be perfectly deﬁnitive (and you would have no way
of knowing even if it was).
Put another way, if an experiment is repeated twice or more, there is no
guarantee the same exact result will be obtained each time. Instead, we
expect the opposite: to ﬁnd variability in the results even if the structure
and properties of that being analysed do not change.
However, error is not the only reason to expect data to vary. Social, economic
and environmental systems themselves create outcomes that are not the
same everywhere. That is science of data!
EXAMPLE: Mean temperature in New York is not the same as in Melbourne.
EXAMPLE: Property prices in Manhattan are not the same as in The Bronx.
EXAMPLE: Atmospheric pollution in Los Angeles is different to that in Hawaii.
Acknowledging this variation (variability) does not mean the issue of error
goes away. It adds to the challenge of explanatory data analysis: to ﬁnd and
explain real variations (variability), not those due to error.
StKa, Oct 6 2015
Analysis and errors
Errors is unavoidable
Error creeps into any data collection and analysis – it cannot be avoided because
there are so many ways it can arise:
1. a faulty instrument,
2. misreading,
3. data wrongly coded,
4. 'rounding' errors (for example, treating 1.7 as 1),
5. tiredness,
6. interference (for example, poor atmospheric conditions);
7. misunderstanding the research question; and so forth.
The hope is that the effects of error will be neutral overall – that they will not
seriously distort what is learned from the data.
The worst errors are those that are
• hard to detect,
• difﬁcult to correct and
• have a large impact on the analysis.
StKa, Oct 6 2015
Key points
1. Statistics have three main purposes: to be descriptive, inferential
or relational.
2. Knowledge of statistics is important because data collection and
analysis are central to the functioning of society.
3. Statistical practice encourages a research rigour and reﬂectivity
that is useful for both quantitative and qualitative research.
4. Users of statistics do so for a variety of reasons and with a range
of motivations, beliefs and philosophical perspectives.
5. Data analysis is about detecting, examining, understanding and
predicting events and phenomena that are biological, medical,
geographical, technical, economical etc. in nature. It tries to
separate what is true of those features from error in the data.
6. It is a good idea to ask whether the results of research have both
statistical and substantive meaning, probing beneath the surface
of what the data appear to be telling you and making links to
academic debate and theories.
StKa, Oct 6 2015
Statistical vocabulary
• Population – large, inﬁnitely large set of statistical units
(individuals)
• Sample – sampling statistical units from a population
• Variable – measuring its values (realisations, observations,
data, information about variable), maximising measurement
precision
• Error (noise) – systematic or random (imprecise measurement;
randomness in the data), minimising errors
• Signal – searching for a signal, maximising signal-to-noise
ratio, separating noise from the signal
StKa, Oct 6 2015
Statistical vocabulary
• Variability – variability (variations) in the data, its size and
direction, searching for sources of variability
• Parameter – a variable characteristics, estimating it from a
sample
• Estimate – estimating parameter (parameter estimate),
underestimate, overestimate, correct estimate
• Bias – sources of bias, minimising bias, estimation bias, biased
estimate
• Description and analysis of the data
• Statistics – descriptive (describing data), inferential
(inferring about parameters, based on data but about situation
in a population, generalisation), relational (distinguishing
artifactual and true relationships)
StKa, Oct 6 2015