Statistics
for Computer Science
doc. PaedDr RNDr. Stanislav Katina, PhD.
Institute of Mathematics and Statistics, Masaryk University
Honorary Research Fellow, The University of Glasgow
Lecture 02
Data and Variables, Introduction to Sampling
StKa, Oct 11 2015
Data and variables
Forming knowledge from data is central to statistical analysis.
Data are information obtained by measuring ‘things’ at different times or locations
across a study region.
Each measurement is called an observation, because it is a record of what has
been observed where and when the measurement was made.
That ‘thing’ – the focus of the study – will be a social, scientiﬁc or environmental
feature, or process, that we want to learn more about.
EXAMPLE:
• geographical – elevation, surface temperature, ozone levels, soil quality,
unemployment, noise, trafﬁc congestion, access to shops and amenities, voting
behaviour, water salinity, slope stability, vegetation cover, life expectancy,
environmental quality, quality of life, happiness, crime rates, etc.
• medical, biological – laboratory variables, immunological variables,
anthropological measurements (length, angles, width, height, circumference,
volume, etc.)
• economic – GDP, exchange rates, ...
Data and variables
A single observation cannot measure change or variation, and there
is the risk of the measurement being wrong.
Therefore a series of observations is made by recording
measurements for different times or places within the study region.
The set of measurements is known as a variable. It is called a variable
for good reason. Looking at the observations will reveal variations
(variability) in the measurements – they are not all the same.
There are two reasons for the variations (variability).
1. Because what is measured is not stationary – it has different
characteristics at different times and places.
EXAMPLE: unemployment is greater in some places, and in some
years, than in others; the same is true of air pollution.
2. Because other factors degrade the quality of the
measurement, creating error, and these factors themselves vary
across time and space.
Given both these considerations it would be surprising if all
observations had exactly the same value.
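The two sources of variability can be illustrated with a small simulation. This is only a sketch: the pollution levels, the size of the measurement error and the random seed are invented for illustration and do not come from any real data set.

```python
import random
from statistics import pvariance

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Hypothetical true air-pollution levels at ten sites: they genuinely differ.
true_levels = [rng.uniform(20, 60) for _ in range(10)]

# Each recorded observation adds a measurement error to the true level.
observed = [level + rng.gauss(0, 2) for level in true_levels]

print(f"variance of true levels:     {pvariance(true_levels):.1f}")
print(f"variance of observed values: {pvariance(observed):.1f}")
print("all observations identical? ", len(set(observed)) == 1)
```

Both the real differences between sites and the added error contribute to the spread of the observed values.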
Key points
1. Data are measurements of something of interest. They are also
called observations because the measurements help us to
observe (and to quantify) an attribute of whatever is being studied.
2. A set of measurements is called a variable because the measurements
are unlikely to have the same value at all times and locations at which
they were made.
3. Descriptive statistics summarise key information about the
variable. Other types of statistics seek to explain the causes of
the variations (variability).
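As a minimal sketch of point 3, a few descriptive statistics can be computed with Python's standard `statistics` module. The unemployment figures below are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical variable: unemployment rate (%) observed in eight districts.
unemployment = [4.2, 5.1, 3.8, 7.4, 6.0, 4.9, 5.5, 8.3]

summary = {
    "n": len(unemployment),
    "min": min(unemployment),
    "max": max(unemployment),
    "mean": statistics.mean(unemployment),
    "median": statistics.median(unemployment),
    "stdev": statistics.stdev(unemployment),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

The minimum, maximum and standard deviation describe the variability; the mean and median describe a typical value.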
Data types
It is important to distinguish between different types of numeric data
because what you can do with data depends on their type. This is true
both for basic descriptions of data and for more advanced statistical
methods.
However, thinking about data types is not only about matching the data
to an appropriate method of analysis – though that is important. To
consider the data type is also to give thought to the properties of the
phenomenon under study and to the sorts of data it can generate, as
well as to how the phenomenon has been conceptualised and
observed by the researcher.
The data type is a product of what is measured, how it is measured
and why it is measured.
One of the most common ways of defining data types is by using
the Stevens scale (Stevens 1946). However, it may be too
strict to apply to real-world data and, in any case,
‘scale type, as defined by Stevens, is not an attribute of the data, but rather depends
upon the questions we intend to ask of the data and upon any additional information
we may have. It may change due to transformation of the data, it may change with
the addition of new information that helps us to interpret the data differently, or it may
change simply because of the questions we choose to ask’ (Velleman and Wilkinson
1993, p. 69).
In general, we distinguish between discrete and continuous data.
Discrete data
Discrete data take one value from a limited (and therefore finite) set of
possible values. Because of this it is possible to count how many times
each specific value appears in the data (the values are countable) and to
produce a tally – a frequency table – of those counts.
Discrete values tend to be whole numbers, also known as integer
numbers. These are values which have no fractional part and are written
without a decimal point.
EXAMPLE: 1 is an integer whereas 1.1 is not. The value -2 is also an
integer but -2.0 is not: the inclusion of the decimal point implies -2.0 is
from a set of non-integer values that might include -2.1 or -2.2, and
-2.0 might itself be an approximation of -2.04 for example.
Why are we speaking about this common situation here?
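One possible answer, offered here as an assumption rather than as the lecturer's own, is that programming languages store whole numbers and decimal-point numbers as different types. The sketch below builds a frequency table for a hypothetical discrete variable and then shows how Python distinguishes -2 from -2.0.

```python
from collections import Counter

# Hypothetical discrete variable: number of days sick last year for ten people.
days_sick = [0, 2, 1, 0, 3, 2, 0, 1, 2, 0]

# Because the set of possible values is limited, a tally is meaningful.
frequency_table = Counter(days_sick)
for value in sorted(frequency_table):
    print(value, frequency_table[value])

# Python keeps whole numbers and decimal-point numbers in different types,
# even when they compare as equal.
print(type(-2).__name__, type(-2.0).__name__)   # int float
print(-2 == -2.0)                               # True
```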
Continuous data
In contrast, continuous data are drawn from an inﬁnite set and can take on
any value – or, at least, any value between a lower and upper limit. They are
often ‘real’ or ‘ﬂoating-point’ numbers, those with a decimal point.
EXAMPLE: 1.001 is a real number, written to three decimal places (there are
three digits after the decimal point). Another is -4.11234, with five decimal
places.
Why are we speaking about this common situation here?
Because there is an infinite number of values continuous data could take, it
is futile to produce a frequency table for them. Many or all of the values that
do appear will do so only once. Instead, we can arrange them into groups (for
example, 0–4.99, 5–9.99, 10–14.99, etc.) and then count the number of
members in each group. The result will not be independent of the
groupings. We will see that the way we group continuous data affects our
portrayals of them. These data are called interval data.
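The dependence on the grouping can be seen in a short sketch. The measurements and the helper `grouped_counts` are hypothetical and serve only to illustrate the idea; the intervals are taken as half-open, so 0–4.99 corresponds to [0, 5).

```python
from collections import Counter

# Hypothetical continuous measurements between 0 and 15.
values = [0.7, 2.3, 4.8, 5.1, 5.2, 7.9, 9.6, 10.4, 12.2, 14.9]

def grouped_counts(data, width):
    """Count how many values fall into each interval [k*width, (k+1)*width)."""
    counts = Counter(int(x // width) for x in data)
    return {(k * width, (k + 1) * width): counts[k] for k in sorted(counts)}

# Every raw value occurs once, so a plain frequency table is uninformative.
print(Counter(values).most_common(1))

# The grouped picture changes with the chosen interval width.
print(grouped_counts(values, 5))
print(grouped_counts(values, 3))
```

With a width of 5 the data fall into three groups of similar size; with a width of 3 the same data fall into five groups with a different shape.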
Continuous data
It could be argued that no data are perfectly continuous since there are
always limits to the precision (the number of digits) by which events can be
measured and recorded.
No measurements are drawn from a truly inﬁnite set. However, the difference
between discrete and continuous data is better understood as a property of
what is being measured than of the data themselves.
EXAMPLE: most light switches have two discrete states, either on or off,
whilst the luminance of energy-saving light bulbs increases, on a continuous
scale, from when they receive an electric current to when they are fully lit.
It is also better to understand the words discrete and continuous as two ends
of a continuum along which different sets of data are placed. It then
becomes a matter of asking
‘which is the better model for the data I have,
for what I want to do with them and
for what I am trying to study?’
Integer data are never really continuous, but unless the set contains only
a few (discrete) values then it is neither unusual nor necessarily
mistaken to treat it as if it were continuous.
Categorical data
An exception to our focus on numeric data will be categorical variables –
those that are labels distinguishing one category from others.
EXAMPLE:
• M and F, for males and females
• 1, 2 and 3 for different types of land use (1 = arable; 2 = forest; etc.)
• ‘left’ and ‘right’ for political allegiance
• low, moderate and high for levels of risk
• ‘good’ and ‘bad’ as qualitative judgements; and so forth
We will use categorical variables to split a set of measurements into
groups that are then compared. We distinguish (Agresti 2007)
1. ordinal data – the categories are ordered in some way
2. nominal data – the categories are not ordered but simply have names
3. binary data – only two categories are possible
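Splitting a set of measurements into groups by a categorical variable might look like the following sketch. The records, the risk labels and their ordering are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (risk level, measurement). The labels are ordinal.
records = [("low", 2.1), ("high", 7.8), ("moderate", 4.4),
           ("low", 1.9), ("high", 8.6), ("moderate", 5.0)]

# An explicit order is needed; sorting the labels as text would be alphabetical.
risk_order = ["low", "moderate", "high"]

# Split the measurements into one list per category.
groups = defaultdict(list)
for level, value in records:
    groups[level].append(value)

# Compare the groups by their means, reported in the ordinal order.
group_means = {level: mean(groups[level]) for level in risk_order}
for level, m in group_means.items():
    print(f"{level:>8}: {m:.2f}")
```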
Variable
  Categorical (qualitative)
    Nominal – categories are mutually exclusive and unordered,
      e.g. gender (males, females), blood groups (A, B, AB, 0)
    Ordinal – categories are mutually exclusive and ordered,
      e.g. disease stage (mild, moderate, severe)
  Numerical (quantitative)
    Discrete – integer values, typically counts,
      e.g. days sick last year
    Continuous – takes any value in a range of values,
      e.g. weight (in kg), height (in cm)
Key points
1. Discrete data are those that take one value from a restricted
set of possible values. They are usually whole (integer)
numbers.
2. Continuous data could take on any value, or any value
between a lower and an upper limit, to a certain level of
precision. They are ‘real’ or ‘floating-point’ numbers.
In practice, the difference between these data types is not ﬁxed
or immutable. Discrete data are often analysed as though they
are continuous, and continuous data can be grouped into
discrete categories (for example, by sorting height data into
'short', 'average' and 'tall').
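Grouping continuous data into discrete categories can be sketched as follows. The function name and the cut-offs of 160 cm and 180 cm are arbitrary assumptions made for illustration, not recommended thresholds.

```python
def height_category(height_cm, short_below=160.0, tall_from=180.0):
    """Map a continuous height to an ordered label; cut-offs are arbitrary."""
    if height_cm < short_below:
        return "short"
    if height_cm < tall_from:
        return "average"
    return "tall"

# Hypothetical heights in centimetres.
heights_cm = [152.4, 167.0, 171.3, 181.9, 159.9, 180.0]
labels = [height_category(h) for h in heights_cm]
print(list(zip(heights_cm, labels)))
```

Moving a cut-off moves observations between categories, so the choice of thresholds is part of the analysis and should be reported.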
Derived data
• Percentages
• Ratios and quotients
• Rates
• Scores
All of these data can be treated as numerical variables for most analyses.
Where the variable is derived using more than one value, it is important to
record all of the values used.
Censored data – either laboratory data below or above the detection limit, or
time-to-failure data, where we follow a subject for a certain period of time
and record a failure if one occurs.
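One way to keep the raw values behind a derived variable, and to mark censored readings, is sketched below. The class names, the detection limit of 0.5 and the counts are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rate:
    """A derived value stored together with the raw values it came from."""
    events: int
    population: int

    def per_1000(self):
        return 1000 * self.events / self.population

@dataclass
class LabValue:
    """A laboratory reading; censored=True means only a bound is known."""
    value: float
    censored: bool = False

DETECTION_LIMIT = 0.5  # hypothetical lower detection limit

def record_reading(raw):
    # Readings below the limit are stored as the limit plus a censoring flag,
    # rather than as if they were exact measurements.
    if raw < DETECTION_LIMIT:
        return LabValue(DETECTION_LIMIT, censored=True)
    return LabValue(raw)

crime_rate = Rate(events=42, population=12000)
print(crime_rate.per_1000(), crime_rate)
print([record_reading(x) for x in (0.2, 0.9, 1.4)])
```

Storing the numerator and denominator means the rate can be recomputed or re-expressed later, and the censoring flag prevents a detection limit from being mistaken for a measured value.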
Sampling – learning objectives
• Know the distinction between a sample and the population.
• Explain and recognise examples of sampling bias and the
notion of a representative sample.
• Outline the stages involved in the process of sample design.
• Distinguish between non-probabilistic and probabilistic
sampling methods.
• Summarise the nature of the following non-probabilistic
sampling methods – judgemental, quota, snowball and
convenience sampling.
• Explain the concept of a sampling frame and be able to apply
this to a study of your design.
Sampling – learning objectives
• Summarise the following probabilistic sampling methods –
systematic, simple random, stratiﬁed random, multi-stage
random and cluster sampling.
• Highlight the effectiveness of different sampling methods and be
able to construct a sample design that is effective and efﬁcient
for a given problem.
• Appreciate the relationship between sample size, precision and
conﬁdence in general terms.
• Weigh up the practicalities of a particular sample design, taking
into account cost, safety, access and environmental/human ethics.
• It is often simply impractical to collect all data relating to a task for
logistical reasons including budgets, time and access.
EXAMPLE: We cannot
- interview every adult within an electoral district,
- measure the girth of every tree within a forest or
- count every grain of sand on a beach.
• Even if a full survey (or census) were feasible, it would not be a very
efficient way of going about your task.
• In most cases, we do not need to gather a complete set of data relating to our
task in order to draw general conclusions about the processes or phenomena
we are investigating.
• There are always exceptions to any generalisation, and there are certain
circumstances where it is possible and feasible to deal with a complete
population, although these situations are rare.
• You could, for example, interview every individual on a planning committee
regarding a particular decision. More commonly in research we gather a
sample (or subset) of data from our target population, from which we may
draw robust conclusions or develop scientific theories.
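The idea of measuring a subset rather than the whole population can be sketched with simulated data. The forest of 10,000 trees, the girth distribution and the sample size of 200 are invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(2015)  # fixed seed so the sketch is reproducible

# Hypothetical population: girth (cm) of every tree in a forest of 10,000 trees.
population = [rng.gauss(120, 25) for _ in range(10_000)]

# Measure only a subset instead of conducting a full census.
sample = rng.sample(population, 200)

print(f"population mean: {mean(population):.1f}")
print(f"sample mean:     {mean(sample):.1f}")
```

In a real study the population mean would be unknown; it is available here only because the population was simulated.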
Sampling
Target population
• The target population can be deﬁned as the complete set of
measurements that might hypothetically be recorded in a
particular study context that are relevant to the study.
EXAMPLE: In a research context this could include
- all people or vegetation within a deﬁned area at a certain time, or
- the distributed activities of a single business unit operating from one
location.
• Often, the notion of population is more theoretical than quantiﬁable.
• A population falling outside either the geographical area or context
of a study is sometimes referred to as an out-of-scope population.
Sampling
• The characteristics of a sample of a population are known as
statistics.
• A sample that matches (or represents) the population is known as
a representative sample.
• Achieving a representative sample is a key issue when adopting
probabilistic statistical methods in your research, but also
applies when adopting other research methodologies whether
qualitative or quantitative in approach.
• However, not all research traditions require that the outcomes are
representative of the whole, as some work is illustrative in nature.
It is important to respect these differences across the discipline;
there is a place for the in-depth case study, particularly in regard to
gaining more detailed knowledge about how a process or
phenomenon emerges.
Representative sample
This is a sample that matches (or represents) the
statistical characteristics of the overall target population.
The characteristics of a sample are known as its statistics.
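A rough sketch of the difference between a representative and a biased sample follows. The two districts, their income distributions and the sample sizes are hypothetical; drawing from a single district stands in here for a convenience sample.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: incomes in two districts of different affluence.
district_a = [rng.gauss(30_000, 4_000) for _ in range(5_000)]
district_b = [rng.gauss(50_000, 6_000) for _ in range(5_000)]
population = district_a + district_b

random_sample = rng.sample(population, 300)        # every unit equally likely
convenience_sample = rng.sample(district_a, 300)   # only one district reached

pop_mean = mean(population)
print(f"population mean:         {pop_mean:.0f}")
print(f"random sample mean:      {mean(random_sample):.0f}")
print(f"convenience sample mean: {mean(convenience_sample):.0f}")
```

The statistic from the random sample should lie close to the population value, whereas the sample restricted to one district is systematically too low, however many units it contains.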
The process of sampling
• Considering which design best meets the goals of your study
forms an important next step in the process, with the underlying
aim behind the process of sample design being to ensure that
the subset of data collected reﬂects the characteristics of the
overall target population.
• Sub-questions here include how the design should be
parameterised, and whether you should ﬁrst conduct a pilot
study.
• Theoretical ideals and practical realities often conflict; practical
considerations may indicate that your sampling strategy needs
modification before you go out into the field.
• And there is something more to consider – any analysis is
dependent on data and whether they are ﬁt for purpose.
The process of sampling
Steps:
1. Define your research question
2. Review the related literature
3. Review the scope of your study
4. Construct a sample frame
5. Select a sample design method
6. Review your design from a practical perspective
7. Implement the design
Phases:
• Scope and scale – spatial, temporal and structural dimensions of your study
• Sampling methods – sample design and parametrisation, practicalities,
implementation
Formulating the research question
• Before jumping into speciﬁc methods for sampling, we need to be sure
that the sample we are collecting is representative for our purpose.
• At this point, we need to stand back from the overall problem.
• First, we need to be absolutely clear about the research question
we are trying to answer.
• Having established this fundamental step in the overall, and much
larger, research design process, we need to ask which environmental
or human factors inﬂuence the process or phenomenon we wish to
investigate.
• We also need to consider the scope of the speciﬁc study.
• Is it feasible, given both the complexity of the question and the
intended geographical coverage, to collect sufficient
data to answer the question in the time you have available?
• At this early stage in the study, it is also important to consider at what
scale you might expect the process to occur or the pattern to materialise,
and what associated assumptions it might be reasonable to
make regarding the data.
Formulating the research question
• What is your research question?
• What processes or phenomena do you expect to inﬂuence that
question?
• What data should you be collecting?
• Have you reﬂected on whether the question is tractable given the
time and resources at your disposal?
• How much do you already know about the data and underlying
geographical process? Rather than making unreasonable and
ungrounded assumptions, it would be appropriate to undertake a
pilot study before continuing with your main sampling campaign if
necessary.
EXAMPLE: Formulate your own research question, and answer the other
questions above. Discuss the answers in class.
Review the relevant literature
There is no substitute for detailed background literature work prior
to developing a research methodology and statistical sampling
design.
EXAMPLE: Commonly encountered factors, subject to the study in
question, might include:
gender, sexual preference, family background, cultural background,
economic background, educational background, computer literacy and
access, age, niche personal interests, political preferences, subsurface
or deep geology, longer-term environmental history of an area (for
example, ﬂooding, ﬁre, uplift), meteorology and/or climate, depth,
elevation, soil type, land cover, land use, basin size, distance from the
sea, aspect, slope.
This long, but hardly comprehensive, list serves to illustrate the need to
constrain the scope of your study.
The tighter your research question, the easier it is to avoid a biased
sample.
The scope of the study
Formulating a research question that is as specific as
possible is the easiest way of managing the scope of your
study.
The scope of your study can be further tightened in two main ways:
• Firstly by reducing the research extent of your study
(narrowing your focus).
• Secondly by choosing sample sites such that a number of
potentially important factors affecting the process you are
observing are held steady. This second practice is known as
controlling your variables.
Research extent
A look at the titles or abstracts of journal papers in areas of the
discipline that particularly interest you can be instructive here.
EXAMPLE: Taking one example of how you might construct a
focused dissertation title, consider the following title: ‘The
geography of shoplifting in a British city: evidence from
Cardiff’ (Bromley and Thomas 1999).
The title shows that the authors are interested in the phenomenon of
British shoplifting as a whole, but indicates with clarity that their
evidence in regard to the complete country is partial and
geographically speciﬁc.
Controlling the variables
EXAMPLE: A solid piece of research looking at the effect of geology
on river incision is better managed by comparing two basins of
similar size and rainfall patterns than by trying to build a
complex model that accounts for varying basin size, rainfall volume
and rainfall intensity in one go. This is because achieving a
representative sample in the latter case would be a sizeable task
before you could achieve any statistical certainty in your results,
owing to the multiple potential interactions between variables.
EXAMPLE: You might choose to investigate the impact of an
environmental variable (for example, distance from a power station
or mobile phone mast) on health by comparing groups of similar
age, economic and genetic background who have lived in the area
for similar lengths of time.
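Choosing sample sites so that other factors are held steady might be sketched as below. The basins, their attributes and the 10% tolerances are hypothetical; the helper `comparable` is only one of many possible ways to express 'similar'.

```python
# Hypothetical candidate river basins:
# (name, geology, area in square km, annual rainfall in mm).
basins = [
    ("A", "limestone", 210, 900),
    ("B", "granite",   205, 920),
    ("C", "granite",   800, 1500),
    ("D", "limestone", 60,  600),
]

def comparable(b1, b2, area_tol=0.10, rain_tol=0.10):
    """True if two basins differ in geology but are within the given relative
    tolerances on area and rainfall (the tolerances are arbitrary)."""
    _, geo1, area1, rain1 = b1
    _, geo2, area2, rain2 = b2
    return (geo1 != geo2
            and abs(area1 - area2) <= area_tol * max(area1, area2)
            and abs(rain1 - rain2) <= rain_tol * max(rain1, rain2))

# Keep only pairs in which geology varies while size and rainfall are controlled.
pairs = [(b1[0], b2[0])
         for i, b1 in enumerate(basins)
         for b2 in basins[i + 1:]
         if comparable(b1, b2)]
print(pairs)
```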
Sampling frame
The sampling frame contains all possible data (or sampling units)
to be selected (the population); it frames or outlines the data set.
Only data within the sampling frame may subsequently be selected
for analysis.
The sampling frame is the actual list from which your sample items
are drawn.
EXAMPLE: It might be the GIS data grid subdividing your study
area in the context of a physical geography example, or the rather
more tangible electoral roll (register of voters) in the case of a
human geography task.
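Drawing a sample from a sampling frame can be sketched as follows. The electoral roll of 1,000 numbered entries is hypothetical, and the two designs shown (simple random and systematic) are minimal illustrations rather than complete sampling procedures.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: an electoral roll with 1,000 numbered entries.
sampling_frame = [f"voter-{i:04d}" for i in range(1, 1001)]

# Simple random sample: each entry in the frame is equally likely to be drawn.
simple_random = rng.sample(sampling_frame, 50)

# Systematic sample: a random start, then every k-th entry of the frame.
k = len(sampling_frame) // 50
start = rng.randrange(k)
systematic = sampling_frame[start::k]

print(simple_random[:5])
print(systematic[:5])
```

Only entries that appear in the frame can ever be selected, so anyone missing from the roll is excluded from both samples by construction.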
Hierarchy of terms in probability sampling
We have outlined the population, target population,
sample frame and sample. These relate as follows:
Population
Target population
Sampling frame
Sample
Sampling frame
EXAMPLES:
Subject-related sampling frames – the electoral roll for an area, the
list of students matriculated at a university or school, or the
complete police record of all shoplifting crime reported in a
particular place or timespan.
Geography-related sampling frames
– the electoral ward or enumeration district, a list of postcodes
in an area or another form of regional boundary
– the list of operating meteorological stations or pollution control
instruments in an area, a watershed relating to a river feature
under investigation or a digital map showing geological or
elevation classiﬁcations for the study area under review
Sample units should be deﬁned by size (for example, quadrat
extent in a bio-geographical study), the location and time of
records.
Scenarios and questions – scope, scale and extent
EXAMPLE: If you are looking at biodiversity indicators, do you
need to record the individual proportions of named species in your
sample notebook? Might it be better to cover a wider geographical
range, noting the number of different species instead?
EXAMPLE: You are seeking data about shoplifting in the UK. Should
you rely on secondary data concerning recorded shoplifting
incidents, given that these are suggested to be but a small sample of
the whole (Bromley and Thomas 1999), or should you weigh up other
strategies to gather your data?
Scenarios and questions – scope, scale and extent
EXAMPLE: You are interested in the impact of green lanes on
butterﬂy distributions. Butterﬂies are in general most active at high
levels of sunshine. Is it really worth undertaking a sampling campaign
at both 9am and 12pm, or would one set of observations at 12pm be
just as effective?
EXAMPLE: You have been investigating the effects of heavy
industry on the quality of river water. While you are there, you
decide it might be interesting to look at levels of phosphorus and
nitrogen also. After you have analysed the different chemicals, at
considerable time and expense, you realise that you have no data
concerning the history of agricultural land use in the area and cannot
make good use of your nitrogen and phosphorus data. Further,
because you have undertaken more analysis than planned, your
project may run late and risks not being completed. Think
back: what might you have done differently?
Scenarios and questions – scope, scale and extent
In general, these scenarios arise for a variety of reasons.
1. First, there is doubt regarding the research question
itself or a lack of consideration concerning the processes
and variables inﬂuencing your research question.
2. Secondly, focus and discipline are required as part of
an effective sampling campaign in the ﬁeldwork. This
second issue is more likely to trap the enthusiastic and
diligent researcher, determined to succeed. There is also
a danger in collecting data because they are what has
been collected in the past, affordable, familiar or easy to
collect.