Mapping and modeling species distributions
Department of Botany and Zoology, Masaryk University
Bi9661 Selected issues in Ecology, Autumn 2013
Borja Jiménez-Alfaro, PhD

Part 2: MODELING
An introduction to Maxent

MAXENT

What does “maxent” mean?
- The term refers to MAXIMUM ENTROPY, a modeling approach commonly used to solve problems in systems with restricted information (computer vision, physics, natural language processing, etc.)
- The basic idea follows the principle of maximum entropy: when only limited external information about a system is available, predict the least biased probability distribution
- Maximum entropy models are applied in many fields
- They reached the topic of distribution modeling only “recently”

A project: “Machine Learning for Madagascar Conservation Planning”

Pros
- Free software, transparent method
- Very high predictive power in comparison with other methods
- Allows interactions between variables
- Results report the contribution of each variable to the model
- Very good performance with few samples

Cons
General problems associated with presence-only methods:
- Very sensitive to biased data
- Gives no indication of prevalence (proportion of occupied sites)

MODELING METHODS

TYPES OF (CORRELATIVE) MODELS
- We will use the classification of Hijmans and Elith (2013) for the “dismo” package in R
- A similar classification is provided in Wikipedia

A brief overview of methods and algorithms
- Profile methods (presence data): Bioclim, Domain, Mahalanobis distance
- Regression models (presence/absence data): Generalized Linear Models (GLM), Generalized Additive Models (GAM)
- Machine learning (presence data + background data): Boosted Regression Trees (BRT), Random Forests (RF), MaxEnt

Profile methods
- Only consider presence data
- Based on the environmental distance to known sites
- Simple algorithms, easy to understand
- Bioclim, Domain, Mahalanobis
- Others: Environmental Niche Factor Analysis (ENFA)
BIOCLIM
- The BIOCLIM algorithm (an environmental envelope) computes the similarity of a location by comparing its environmental values to a percentile distribution of the values at known locations of occurrence (“training sites”)
- Implemented in: R (dismo), openModeller, DIVA-GIS
Nix, H.A. (1986) A biogeographic analysis of Australian elapid snakes. In: Atlas of Elapid Snakes of Australia. (Ed.) R. Longmore, pp. 4-15. Australian Flora and Fauna Series Number 7. Australian Government Publishing Service: Canberra

[Figure: example for one species using temperature x precipitation; panels: Cutoff = 0.67, Cutoff = 0.99]
- Each variable has its own envelope, represented by the interval [m - c*s, m + c*s] (m: mean and s: standard deviation of the variable at the known sites; c: cutoff)
- The result is a box in n-dimensional space

DOMAIN
- The Domain algorithm is a distance-based method that uses the Gower distance between the environmental variables at any location and those at any of the known locations (“training sites”)
- Results are not a probability of occurrence but a measure of similarity
- Implemented in: R (dismo), openModeller, DIVA-GIS
Carpenter G., A.N. Gillison and J. Winter, 1993. Domain: a flexible modelling procedure for mapping potential distributions of plants and animals. Biodiversity and Conservation 2: 667-680

Ecological distance methods: multivariate similarity
[Figure: models in the environmental space (temperature x precipitation) generated with the same input (Thalurania furcata boliviana localities) but with different parameters (source: openModeller); panels: Mahalanobis distance to the centroid, Gower distance to the centroid, Euclidean distance to the centroid]

MAHALANOBIS DISTANCE
- Unlike Euclidean distance, Mahalanobis distance takes into account the correlations between the variables in the data set, and it does not depend on the scale of measurement
- It is based on a descriptive statistic developed by P.C. Mahalanobis in 1936
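The two properties of the Mahalanobis distance stated above (it accounts for correlation and does not depend on measurement scale) can be checked with a small sketch. This is an illustration only, not the implementation used by dismo or openModeller: the training sites, the two candidate sites, and all function names are hypothetical, and the code is limited to two variables so that the covariance matrix can be inverted by hand.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance_2d(xs, ys):
    """Sample covariance matrix of two variables: ((sxx, sxy), (sxy, syy))."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return (sxx, sxy), (sxy, syy)

def euclidean_to_centroid(point, xs, ys):
    """Plain Euclidean distance from a point to the centroid of the sites."""
    return math.hypot(point[0] - mean(xs), point[1] - mean(ys))

def mahalanobis_to_centroid(point, xs, ys):
    """Mahalanobis distance from a point to the centroid of the sites."""
    (sxx, sxy), (_, syy) = covariance_2d(xs, ys)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean(xs), point[1] - mean(ys)
    # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse written out
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical training sites: warmer sites are drier (negative correlation)
temperature = [10.0, 12.0, 14.0, 16.0, 18.0]              # degrees C
precipitation = [1500.0, 1350.0, 1300.0, 1100.0, 1000.0]  # mm

# Two hypothetical candidate sites at the same Euclidean distance
site_a = (16.0, 1400.0)  # warm AND wet: against the observed correlation
site_b = (12.0, 1400.0)  # cool and wet: along the observed correlation

d_euc_a = euclidean_to_centroid(site_a, temperature, precipitation)
d_euc_b = euclidean_to_centroid(site_b, temperature, precipitation)
d_mah_a = mahalanobis_to_centroid(site_a, temperature, precipitation)
d_mah_b = mahalanobis_to_centroid(site_b, temperature, precipitation)

# Same data with precipitation expressed in metres instead of millimetres
precipitation_m = [p / 1000 for p in precipitation]
site_a_m = (site_a[0], site_a[1] / 1000)
d_euc_a_m = euclidean_to_centroid(site_a_m, temperature, precipitation_m)
d_mah_a_m = mahalanobis_to_centroid(site_a_m, temperature, precipitation_m)

print("Euclidean   a, b:", d_euc_a, d_euc_b)
print("Mahalanobis a, b:", d_mah_a, d_mah_b)
print("After rescaling (site a): Euclidean", d_euc_a_m, "Mahalanobis", d_mah_a_m)
```

Both candidate sites are equally far from the centroid in Euclidean terms, but the site that contradicts the temperature-precipitation correlation is much farther in Mahalanobis terms. Changing the precipitation units changes the Euclidean distance and leaves the Mahalanobis distance unchanged.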
- Implemented in: R (dismo), openModeller, IDRISI, ArcView
Mahalanobis, P.C., 1936. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2: 49-55.

[Figure: a comparison of PROFILE ALGORITHMS (Tsoar et al. 2007)]

PROFILE ALGORITHMS
Pros
- Models are easy to interpret and represent
- Work with a low number of occurrences
- Widely used for large-scale approaches
Cons
- Lack of procedures for variable selection
- Interactions between variables are not allowed
- All the variables have the same weight
- Very sensitive to biased data and marginal points

Regression models
- Combine presence and absence data
- Models are calibrated with the environmental values extracted at the sampled locations
- Robust methods based on classical statistics (probabilistic inference)
- GLMs, GAMs (implemented in most statistical software)
- Other: Multivariate Adaptive Regression Splines (MARS)

GENERALIZED LINEAR MODELS (GLM)

GENERALIZED ADDITIVE MODELS (GAM)
- “Semi-parametric” extension of GLM
- Very flexible; it can fit very complex models
- Main limitation: GAM cannot be used to calculate species response parameters
- Good for exploratory analysis of responses

Example of applications in SDMs
[Figure: prediction surface for mullein, Lava Beds NM (from Edwards 2009)]

Machine learning techniques
- They consider presence data and absence or background data
- Also called “data mining” methods
- Based on information theory and artificial intelligence
- Probabilistic distributions are inferred from incomplete data
- Boosted Regression Trees (BRT), Random Forests (RF), MaxEnt
- Others: Artificial Neural Networks, Genetic algorithm (GARP)

Why machine learning systems?
- SDMs can be considered a supervised learning problem
- In statistical inference you must decide the distributional form and the parameters
- ML methods learn the function inductively from training data
- Inductive (supervised) machine learning methods are commonly used in artificial intelligence (computer vision, robotics, medical diagnosis, etc.)

Boosted Regression Trees (= “Gradient Boost”, “Stochastic Gradient Boosting”)
- It uses boosting (a regression-like technique from machine learning) to combine large numbers of simple tree models adaptively, in order to optimize predictive performance
- Ensemble models based on a LARGE number of tree models
- Computationally intensive methods
- Implemented in: R (dismo, gbm)
Elith, J., J.R. Leathwick and T. Hastie, 2009. A working guide to boosted regression trees. Journal of Animal Ecology 77: 802-81

Random Forests
- The Random Forest method (Breiman, 2001) is an extension of classification and regression trees (CART; Breiman et al., 1984)
- It builds a multitude of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees
- It does not yield an explicit model equation, unlike logistic regression
- Higher prediction accuracy than ordinary decision trees
- No graphical tree model as in classification trees
- Implemented in: R (randomForest), openModeller
Breiman, L., 2001. Random Forests. Machine Learning 45: 5-32
http://www.ualberta.ca/~drr3/random-forest.html

[Figure: from Edwards (2009)]

Step #4: The Test. Independent Validation: Which is Best?
              Logistic   GAM     CART    RF
PCC           75.5       75.9    77.6    87.7
Specificity   74.5       74.4    76.8    89.4
Sensitivity   84.1       90.3    86.5    71.0
kappa         0.29       0.32    0.35    0.46
AUC           0.88       0.89    0.85    0.88
Edwards et al., ECOCHANGE, Lausanne, Sep 2009. From Edwards (2009)

MAXENT
- MaxEnt (Maximum Entropy; Phillips et al., 2006) is one of the most widely used SDM algorithms
- It is a general-purpose machine learning method with a precise mathematical formulation
- It uses presence data together with background data (a background sample is generated to fit the model)
- Implemented in: R (dismo, others), stand-alone Java program
Phillips SJ et al. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190: 231-259

- Main statement: a probability distribution with maximum entropy (the most spread out, closest to uniform), subject to known constraints, is the best approximation of an unknown distribution
- It combines machine learning with statistical methods
- Exponential output (difficult to interpret)
- Logistic output (0 - 1)

Model comparisons
Elith et al. 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-151

NEXT WEEK… PRACTICE WITH MAXENT
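
As preparation for the practice, the main statement of MaxEnt can be illustrated with a toy example. The sketch below is not the algorithm of the Maxent software (which handles many features and uses regularization); it only shows the principle for one constraint. It assumes five hypothetical grid cells with a rescaled temperature value each, and a hypothetical mean temperature observed at the presence sites. Among all distributions over the cells whose expected temperature equals that mean, the one with maximum entropy has the exponential form p_i ∝ exp(lambda * f_i), so only lambda has to be found.

```python
import math

def maxent_distribution(feature, target_mean, lo=-50.0, hi=50.0, tol=1e-10):
    """Maximum entropy distribution over cells under one mean constraint.

    The solution has the form p_i proportional to exp(lam * feature_i);
    lam is found by bisection, because the expected feature value
    increases monotonically with lam.
    """
    def gibbs(lam):
        # Subtract the largest exponent for numerical stability
        top = max(lam * f for f in feature)
        weights = [math.exp(lam * f - top) for f in feature]
        total = sum(weights)
        return [w / total for w in weights]

    def expected(lam):
        return sum(p * f for p, f in zip(gibbs(lam), feature))

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return gibbs((lo + hi) / 2)

def entropy(p):
    """Shannon entropy (natural logarithm) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Hypothetical landscape: five cells with temperature rescaled to 0-1
cells = [0.0, 0.25, 0.5, 0.75, 1.0]

# Hypothetical constraint: mean temperature at the presence sites
constrained = maxent_distribution(cells, target_mean=0.7)

# If the constraint equals the landscape mean, nothing is learned
# and the result stays uniform
unconstrained = maxent_distribution(cells, target_mean=0.5)

# Another distribution that also satisfies the 0.7 constraint
alternative = [0.0, 0.0, 0.4, 0.4, 0.2]

print("maxent:", [round(p, 3) for p in constrained])
print("uniform case:", [round(p, 3) for p in unconstrained])
print("entropy maxent vs alternative:",
      round(entropy(constrained), 3), round(entropy(alternative), 3))
```

The constrained solution shifts probability towards warmer cells only as much as the constraint requires, and its entropy is higher than that of any other distribution meeting the same constraint, such as the alternative listed in the code.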