Mapping and modeling species distributions
Department of Botany and Zoology, Masaryk University
Bi9661 Selected issues in Ecology, Autumn 2013
Borja Jiménez-Alfaro, PhD
Part 3: MAPPING + MODELING
Model evaluation and implementation

MODEL EVALUATION
The question: how to estimate model accuracy?

MODEL EVALUATION
Data preparation
Did you think about model evaluation when sampling?
How did you organize your modeling project?
Main attributes: quantity and quality

MODEL EVALUATION
Calibration versus evaluation dataset
(From Guisan and Zimmermann 2000)

MODEL EVALUATION
Option A – INDEPENDENT DATA
You should test your model using completely different data:
- Using alternative data from different sources
- Or a new sampling design to collect NEW data
- Thus you will have training data for calibration and testing data for evaluation

MODEL EVALUATION
Option B – DATA PARTITION
When option A is not possible, a common procedure is to set aside a subset of your own data for validation (although sampled in a similar way):
- You will again have training data and testing data
- A common procedure is to use 80% of occurrences for training and 20% for testing
- With only two predictors, a 50/50 ratio is recommended

MODEL EVALUATION
Option B – DATA PARTITION
With few samples, you can apply general resampling techniques:
- K-fold cross-validation: if k = 10, you split the data into 10 subsets and compute 10 models, each using 9 subsets for training and 1 for testing; you then average the models and the validation statistics (leave-one-out is the special case where k equals the number of samples)
- Bootstrap sampling: you compute multiple models using random selections of occurrences (sampling with replacement) to estimate prediction accuracy (see the code sketch below)

MODEL EVALUATION
For example, in MaxEnt: the interface lets you set a random % of testing data (Option B), a file of external testing data (Option A), the number of replicates (k), and the resampling type.
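The following is a minimal sketch, not part of the original slides, of how the Option B partition, the k-fold cross-validation and the bootstrap resampling described above could be set up in Python with NumPy; fit_model and evaluate in the usage comment are hypothetical placeholders for whatever SDM algorithm and accuracy measure you use.

    import numpy as np

    rng = np.random.default_rng(2013)

    def train_test_split(records, test_fraction=0.2):
        # Option B: random partition of the occurrence records,
        # e.g. 80% for training and 20% for testing
        idx = rng.permutation(len(records))
        n_test = int(round(test_fraction * len(records)))
        return records[idx[n_test:]], records[idx[:n_test]]   # training, testing

    def k_fold_indices(n_samples, k=10):
        # k-fold cross-validation: each fold is used once for testing,
        # the remaining k-1 folds for training (leave-one-out when k = n_samples)
        folds = np.array_split(rng.permutation(n_samples), k)
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield train_idx, test_idx

    def bootstrap_indices(n_samples, n_replicates=100):
        # Bootstrap: resample the occurrences with replacement for each replicate
        for _ in range(n_replicates):
            yield rng.integers(0, n_samples, size=n_samples)

    # Usage idea (fit_model / evaluate are placeholders for your own model and accuracy measure):
    # scores = [evaluate(fit_model(data[tr]), data[te]) for tr, te in k_fold_indices(len(data), k=10)]
    # print(np.mean(scores))   # average the validation statistics over the folds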
MODEL EVALUATION
The properties of model evaluation
Model predictions can be:
a) Categorical (1/0)
b) Probabilistic (0.01 … 1)
Testing data are categorical (1/0), i.e. presences/absences

MODEL EVALUATION
Measures of accuracy (= model performance)
For categorical models: threshold-dependent measures (e.g. Kappa), where you define a threshold between suitable and unsuitable
For probabilistic models: threshold-independent measures (e.g. AUC), where you assess the complete range of probabilities

MODEL EVALUATION
Threshold-dependent measures: the confusion (error) matrix

                                TESTING DATA
                                (1) Presence                       (0) Absence
THE MODEL  (1) Presence         true positives                     false positives (commission error)
           (0) Absence          false negatives (omission error)   true negatives

Sensitivity = % of true positives (presences correctly predicted)
Specificity = % of true negatives (absences correctly predicted)

MODEL EVALUATION
Evaluating models (from Franklin 2009)

MODEL EVALUATION
Evaluating models
Most common measures of accuracy for categorical models:
KAPPA (from 0 to 1)
- Pros: a widely recognized measure of agreement for categorical data
- Cons: in some cases it is sensitive to the prevalence of the data (better used when prevalence is c. 50%)
TRUE SKILL STATISTIC (TSS) (from -1 to +1)
- Pros: an alternative to Kappa, less sensitive to prevalence
- Cons: sometimes it can be negatively related to prevalence

MODEL EVALUATION
An example of using Kappa for model evaluation

MODEL EVALUATION
Threshold-independent measures
- Are based on continuous probabilistic outputs
- Are independent of prevalence
- Useful for comparing the accuracy of different models (e.g. with different frequencies and prevalences)

MODEL EVALUATION
The ROC plot (ROC = Receiver Operating Characteristic)
Sensitivity (true positive rate) is plotted against 1 – specificity (false positive rate); the points correspond to different probability thresholds (0, 0.1, … 1)

MODEL EVALUATION
AUC (Area Under the Curve) of the ROC plot
The probability that a randomly selected presence is assigned a higher suitability than a randomly selected absence
Model performance (AUC values):
0.9 - 1.0: very good
0.8 - 0.9: good
0.7 - 0.8: moderate
0.6 - 0.7: low
0.5 - 0.6: very low

MODEL EVALUATION
The ROC space: three example classifiers, each tested on 100 presences and 100 absences
A (good):   TP = 63, FP = 28, FN = 37, TN = 72
B (random): TP = 77, FP = 77, FN = 23, TN = 23
C (bad):    TP = 24, FP = 88, FN = 76, TN = 12

MODEL EVALUATION
What happens with presence-only methods?
- Only presences means only sensitivity can be computed directly
- It is necessary to use pseudo-absences or background data
- In MaxEnt, (1 – specificity), i.e. the commission error, is substituted by the fraction of the study area predicted as presence

MODEL EVALUATION
(Example MaxEnt ROC plots: a very high training curve is probably overfitted; a lower curve for independent testing data; the upper-right corner corresponds to all presences predicted as presence)

MODEL EVALUATION
AUC is widely used for assessing model performance
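Below is a small illustrative sketch, not from the slides, of the accuracy measures defined above, written in Python with NumPy: sensitivity, specificity, Kappa and TSS computed from a confusion matrix, plus a rank-based AUC that follows the definition given above (the probability that a random presence scores higher than a random absence). The example call uses classifier A from the ROC-space slide; the argument names in the commented AUC call are hypothetical placeholders.

    import numpy as np

    def confusion_metrics(tp, fp, fn, tn):
        # Threshold-dependent measures from the confusion (error) matrix
        n = tp + fp + fn + tn
        sensitivity = tp / (tp + fn)        # % of true positives
        specificity = tn / (tn + fp)        # % of true negatives
        observed = (tp + tn) / n            # overall agreement
        expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
        kappa = (observed - expected) / (1 - expected)
        tss = sensitivity + specificity - 1  # True Skill Statistic
        return dict(sensitivity=sensitivity, specificity=specificity,
                    kappa=kappa, tss=tss)

    def auc(presence_scores, absence_scores):
        # Threshold-independent measure: probability that a randomly selected
        # presence gets a higher suitability than a randomly selected absence
        # (ties count as 1/2)
        p = np.asarray(presence_scores, dtype=float)[:, None]
        a = np.asarray(absence_scores, dtype=float)[None, :]
        return (p > a).mean() + 0.5 * (p == a).mean()

    # Classifier A from the ROC-space slide (100 presences, 100 absences)
    print(confusion_metrics(tp=63, fp=28, fn=37, tn=72))

    # auc(predictions_at_presences, predictions_at_background)  # placeholders for your model output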
MODEL EVALUATION
Probability thresholds
Thresholds are necessary for:
- Obtaining categorical models (presence/absence)
- Comparing model performance (Kappa, TSS, etc.)
- Documenting model outputs (suitable areas for a species)

MODEL EVALUATION
Probability thresholds
Example maps: without a threshold (continuous values from 0 to 1); with a minimum threshold applied (values from 0.17 to 1); with the 0.17 threshold used for a binary output (0 or 1)

MODEL EVALUATION
(Figure from Peterson et al. 2011)

MODEL EVALUATION
(Figure from Franklin 2009)

MODEL EVALUATION
For example, in MaxEnt: the output reports a set of common thresholds (e.g. minimum training presence) together with their omission rates
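As a closing illustration, not part of the slides, here is a short Python/NumPy sketch of how a probability threshold such as the 0.17 used in the example maps can turn a continuous suitability output into a binary presence/absence map; continuous_model_output is a hypothetical placeholder, and minimum_training_presence shows one common threshold rule (the lowest prediction at any training presence).

    import numpy as np

    def binary_map(suitability, threshold):
        # Convert a continuous suitability surface (0-1) into a categorical
        # presence/absence map: 1 where suitability >= threshold, else 0
        return (np.asarray(suitability, dtype=float) >= threshold).astype(int)

    def minimum_training_presence(suitability_at_presences):
        # One common threshold rule: the lowest predicted value at any training
        # presence, so that every training presence is classified as suitable
        return float(np.min(suitability_at_presences))

    # e.g. the 0.17 threshold from the example maps above:
    # binary = binary_map(continuous_model_output, threshold=0.17)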