Mapping and modeling species distributions
Department of Botany and Zoology, Masaryk University
Bi9661 Selected issues in Ecology, Autumn 2013
Borja Jiménez-Alfaro, PhD
Part 3: MAPPING + MODELING
Model evaluation and implementation

MODEL EVALUATION
The question: how to estimate model accuracy?

MODEL EVALUATION
Data preparation
Did you think about model evaluation when sampling?
How did you organize your modeling project?
Main attributes: quantity and quality

MODEL EVALUATION
Calibration versus evaluation dataset
(From Guisan and Zimmermann 2000)

MODEL EVALUATION
Option A – INDEPENDENT DATA
You should test your model using completely different data:
- Using alternative data from different sources
- Or a new sampling design to collect NEW data
- Thus you will have training data for calibration and testing data for evaluation

MODEL EVALUATION
Option B – DATA PARTITION
When option A is not possible, a common procedure is to set aside a subset of your own data for validation (although sampled in a similar way):
- You will again have training data and testing data
- A common procedure is to use 80% of occurrences for training and 20% for testing
- With only two predictors, a 50/50 ratio is recommended

MODEL EVALUATION
Option B – DATA PARTITION
With few samples, you can apply general resampling techniques:
- K-fold cross-validation: if k = 10, you split the data into 10 subsets and compute 10 models, each using 9 subsets for training and 1 for testing; you then average the models and the validation statistics (leave-one-out is the special case where k equals the number of samples)
- Bootstrap sampling: you compute multiple models using random selections of occurrences (sampling with replacement) to estimate prediction accuracy (see the code sketch below)

MODEL EVALUATION
For example, in MaxEnt: the interface lets you set a random % of testing data (Option B), a file of external testing data (Option A), the number of replicates (k), and the resampling type.
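The following is a minimal sketch, not part of the original slides, of how the Option B partition, the k-fold cross-validation and the bootstrap resampling described above could be set up in Python with NumPy; fit_model and evaluate in the usage comment are hypothetical placeholders for whatever SDM algorithm and accuracy measure you use.

    import numpy as np

    rng = np.random.default_rng(2013)

    def train_test_split(records, test_fraction=0.2):
        # Option B: random partition of the occurrence records,
        # e.g. 80% for training and 20% for testing
        idx = rng.permutation(len(records))
        n_test = int(round(test_fraction * len(records)))
        return records[idx[n_test:]], records[idx[:n_test]]   # training, testing

    def k_fold_indices(n_samples, k=10):
        # k-fold cross-validation: each fold is used once for testing,
        # the remaining k-1 folds for training (leave-one-out when k = n_samples)
        folds = np.array_split(rng.permutation(n_samples), k)
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield train_idx, test_idx

    def bootstrap_indices(n_samples, n_replicates=100):
        # Bootstrap: resample the occurrences with replacement for each replicate
        for _ in range(n_replicates):
            yield rng.integers(0, n_samples, size=n_samples)

    # Usage idea (fit_model / evaluate are placeholders for your own model and accuracy measure):
    # scores = [evaluate(fit_model(data[tr]), data[te]) for tr, te in k_fold_indices(len(data), k=10)]
    # print(np.mean(scores))   # average the validation statistics over the folds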
MODEL EVALUATION
The properties of model evaluation
Model predictions can be:
a) Categorical (1/0)
b) Probabilistic (0.01 … 1)
Testing data are categorical (1/0), i.e. presences/absences

MODEL EVALUATION
Measures of accuracy (= model performance)
For categorical models: threshold-dependent measures (e.g. Kappa), where you define a threshold between suitable and unsuitable
For probabilistic models: threshold-independent measures (e.g. AUC), where you assess the complete range of probabilities

MODEL EVALUATION
Threshold-dependent measures: the confusion (error) matrix

                                TESTING DATA
                                (1) Presence                       (0) Absence
THE MODEL  (1) Presence         true positives                     false positives (commission error)
           (0) Absence          false negatives (omission error)   true negatives

Sensitivity = % of true positives (presences correctly predicted)
Specificity = % of true negatives (absences correctly predicted)

MODEL EVALUATION
Evaluating models (from Franklin 2009)

MODEL EVALUATION
Evaluating models
Most common measures of accuracy for categorical models:
KAPPA (from 0 to 1)
- Pros: a widely recognized measure of agreement for categorical data
- Cons: in some cases it is sensitive to the prevalence of the data (better used when prevalence is c. 50%)
TRUE SKILL STATISTIC (TSS) (from -1 to +1)
- Pros: an alternative to Kappa, less sensitive to prevalence
- Cons: sometimes it can be negatively related to prevalence

MODEL EVALUATION
An example of using Kappa for model evaluation

MODEL EVALUATION
Threshold-independent measures
- Are based on continuous probabilistic outputs
- Are independent of prevalence
- Useful for comparing the accuracy of different models (e.g. with different frequencies and prevalences)

MODEL EVALUATION
The ROC plot (ROC = Receiver Operating Characteristic)
Sensitivity (true positive rate) is plotted against 1 – specificity (false positive rate); the points correspond to different probability thresholds (0, 0.1, … 1)

MODEL EVALUATION
AUC (Area Under the Curve) of the ROC plot
The probability that a randomly selected presence is assigned a higher suitability than a randomly selected absence
Model performance (AUC values):
0.9 - 1.0: very good
0.8 - 0.9: good
0.7 - 0.8: moderate
0.6 - 0.7: low
0.5 - 0.6: very low

MODEL EVALUATION
The ROC space: three example classifiers, each tested on 100 presences and 100 absences
A (good):   TP = 63, FP = 28, FN = 37, TN = 72
B (random): TP = 77, FP = 77, FN = 23, TN = 23
C (bad):    TP = 24, FP = 88, FN = 76, TN = 12

MODEL EVALUATION
What happens with presence-only methods?
- Only presences means only sensitivity can be computed directly
- It is necessary to use pseudo-absences or background data
- In MaxEnt, (1 – specificity), i.e. the commission error, is substituted by the fraction of the study area predicted as presence

MODEL EVALUATION
(Example MaxEnt ROC plots: a very high training curve is probably overfitted; a lower curve for independent testing data; the upper-right corner corresponds to all presences predicted as presence)

MODEL EVALUATION
AUC is widely used for assessing model performance
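Below is a small illustrative sketch, not from the slides, of the accuracy measures defined above, written in Python with NumPy: sensitivity, specificity, Kappa and TSS computed from a confusion matrix, plus a rank-based AUC that follows the definition given above (the probability that a random presence scores higher than a random absence). The example call uses classifier A from the ROC-space slide; the argument names in the commented AUC call are hypothetical placeholders.

    import numpy as np

    def confusion_metrics(tp, fp, fn, tn):
        # Threshold-dependent measures from the confusion (error) matrix
        n = tp + fp + fn + tn
        sensitivity = tp / (tp + fn)        # % of true positives
        specificity = tn / (tn + fp)        # % of true negatives
        observed = (tp + tn) / n            # overall agreement
        expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
        kappa = (observed - expected) / (1 - expected)
        tss = sensitivity + specificity - 1  # True Skill Statistic
        return dict(sensitivity=sensitivity, specificity=specificity,
                    kappa=kappa, tss=tss)

    def auc(presence_scores, absence_scores):
        # Threshold-independent measure: probability that a randomly selected
        # presence gets a higher suitability than a randomly selected absence
        # (ties count as 1/2)
        p = np.asarray(presence_scores, dtype=float)[:, None]
        a = np.asarray(absence_scores, dtype=float)[None, :]
        return (p > a).mean() + 0.5 * (p == a).mean()

    # Classifier A from the ROC-space slide (100 presences, 100 absences)
    print(confusion_metrics(tp=63, fp=28, fn=37, tn=72))

    # auc(predictions_at_presences, predictions_at_background)  # placeholders for your model output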
MODEL EVALUATION
Probability thresholds
Thresholds are necessary for:
- Obtaining categorical models (presence/absence)
- Comparing model performance (Kappa, TSS, etc.)
- Documenting model outputs (suitable areas for a species)

MODEL EVALUATION
Probability thresholds
Example maps: without a threshold (continuous values from 0 to 1); with a minimum threshold applied (values from 0.17 to 1); with the 0.17 threshold used for a binary output (0 or 1)

MODEL EVALUATION
(Figure from Peterson et al. 2011)

MODEL EVALUATION
(Figure from Franklin 2009)

MODEL EVALUATION
For example, in MaxEnt: the output reports a set of common thresholds (e.g. minimum training presence) together with their omission rates
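As a closing illustration, not part of the slides, here is a short Python/NumPy sketch of how a probability threshold such as the 0.17 used in the example maps can turn a continuous suitability output into a binary presence/absence map; continuous_model_output is a hypothetical placeholder, and minimum_training_presence shows one common threshold rule (the lowest prediction at any training presence).

    import numpy as np

    def binary_map(suitability, threshold):
        # Convert a continuous suitability surface (0-1) into a categorical
        # presence/absence map: 1 where suitability >= threshold, else 0
        return (np.asarray(suitability, dtype=float) >= threshold).astype(int)

    def minimum_training_presence(suitability_at_presences):
        # One common threshold rule: the lowest predicted value at any training
        # presence, so that every training presence is classified as suitable
        return float(np.min(suitability_at_presences))

    # e.g. the 0.17 threshold from the example maps above:
    # binary = binary_map(continuous_model_output, threshold=0.17)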