University of Redlands
InSPIRe @ Redlands
MS GIS Program Major Individual Projects
Geographic Information Systems
8-2014

Predictive Model of Archaeological Sites of the Hopi Reservation of Northeastern Arizona
Jerry Thomas Holton Jr., University of Redlands
This material may be protected by copyright law (Title 17 U.S. Code).

Recommended Citation
Holton, J. T. (2014). Predictive Model of Archaeological Sites of the Hopi Reservation of Northeastern Arizona (Master's thesis, University of Redlands). Retrieved from https://inspire.redlands.edu/gis_gradproj/222

University of Redlands
Predictive Model of Archaeological Sites of the Hopi Reservation of Northeastern Arizona
A Major Individual Project submitted in partial satisfaction of the requirements for the degree of Master of Science in Geographic Information Systems
by Tom Holton
Mark Kumler, Ph.D., Committee Chair
Douglas Flewelling, Ph.D.
August 2014

Predictive Model of Archaeological Sites of the Hopi Reservation of Northeastern Arizona
Copyright © 2014 by Jerry Thomas Holton, Jr.

Acknowledgements
I would like to thank God first, and then my parents, who have stood strongly by my side since the beginning of the MS GIS program at the University of Redlands and throughout my whole life. I would also like to thank Dr. Mark Kumler, my advisor, for his support, patience, and excellent technical advice throughout the year. I also want to thank my client, Dr. Wes Bernardini, for his flexibility, patience, and for being an all-around excellent resource as I worked through my project. In addition, I would like to thank Michael Heilen, of Statistical Research, Inc., who gave me numerous tips on how to develop and improve the predictive models. I also want to thank the faculty of the MS GIS program, Dr. Fang Ren, Dr. Douglas Flewelling, Dr. Ruijin Ma, and Dr. Russell Weaver. All of them provided me with excellent technical and personal advice and assisted me while developing my project. Finally, I want to thank all my friends from Cohort 24, which is all of them. Without their friendship, support, knowledge of Geographic Information Systems science, and amazing weekend extravaganzas, surviving this year might not have been possible.

Abstract
Predictive Model of Archaeological Sites of the Hopi Reservation of Northeastern Arizona
by Jerry Thomas Holton, Jr.
The Predictive Model for Archaeological Sites in the Hopi Reservation of Northeastern Arizona was developed to assist archaeologists in narrowing the study areas they must search for archaeological sites. Predictive models for locating archaeological sites have been researched extensively since the 1970s because many study areas are too large for archaeologists to cover on foot. The archaeological site types for these models were Habitation, Rock Art, and Scatter, representing sites established between 500 and 1500 C.E. The independent variable categories for these models, developed in ArcGIS 10.2, were based on topography, water resources, and vegetation.
The logistic regression model in the Statistical Package for the Social Sciences (SPSS) was the selected statistical approach for building the predictive models. The final seven archaeological site type predictive surfaces were then created in the ArcGIS 10.2 Raster Calculator based on the coefficients created and statistically significant independent variables determined from the logistic regression models. ix Table of Contents Chapter 1 – Introduction............................................................................................. 1 1.1 Client................................................................................................................. 2 1.2 Problem Statement............................................................................................ 3 1.3 Proposed Solution............................................................................................. 3 1.3.1 Goals and Objectives ........................................................................................ 3 1.3.2 Scope................................................................................................................. 4 1.3.3 Methods............................................................................................................. 4 1.3.4 Audience ........................................................................................................... 5 1.4 Overview of the Rest of this Report ................................................................. 5 Chapter 2 – Background and Literature Review...................................................... 7 2.1 Model Variables................................................................................................ 7 2.2 The Principal Component Analysis .................................................................. 9 2.3 Logistic Regression Approach in Predictive Modeling.................................. 12 2.4 Applications of the Logistic Regression Model in Predictive Modeling........ 14 2.5 Accuracy and Validation................................................................................. 15 2.6 Summary......................................................................................................... 15 Chapter 3 – Systems Analysis and Design................................................................ 17 3.1 Problem Statement.......................................................................................... 17 3.2 Requirements Analysis ................................................................................... 17 3.3 System Design ................................................................................................ 18 3.4 Project Plan..................................................................................................... 19 3.5 Summary......................................................................................................... 21 Chapter 4 – Database Design..................................................................................... 23 4.1 Conceptual Data Model .................................................................................. 23 4.2 Logical Model................................................................................................. 24 4.3 Data Sources ................................................................................................... 26 4.4 Data Collection Methods ................................................................................ 
27 4.5 Data Scrubbing and Loading .......................................................................... 27 4.6 Summary......................................................................................................... 27 Chapter 5 – Implementation...................................................................................... 29 5.1 Data Compilation and Creation Methods ....................................................... 29 5.2 Independent Variable Development ............................................................... 36 5.3 Variable Import............................................................................................... 61 5.4 Testing for Multicollinearity........................................................................... 61 5.5 Logistic Regression Model Development in SPSS......................................... 62 5.6 Model Validation Tests................................................................................... 63 5.7 Model Assessment .......................................................................................... 64 5.8 Predictive Raster Surface Creation ................................................................. 66 5.9 Summary......................................................................................................... 66 Chapter 6 – Results and Analysis.............................................................................. 67 6.1 Model Validity................................................................................................ 67 6.2 Model Assessment .......................................................................................... 72 6.3 Summary......................................................................................................... 87 x Chapter 7 – Conclusions and Future Work............................................................. 89 7.1 Soils Variable.................................................................................................. 89 7.2 Historical-Period Resources Variables ........................................................... 90 7.3 Random Forests Statistical Approach............................................................. 90 7.4 Summary......................................................................................................... 90 Works Cited……………………………………………………………………………..91 xi Table of Figures Figure 1-1: Reference map of the Hopi Reservation. ..................................................2 Figure 2-1: Data points bounded by a fitted ellipse...................................................10 Figure 2-2: First Principal Component. .....................................................................11 Figure 2-3: Second principal component...................................................................12 Figure 3-1: System design of predictive model workflow.........................................19 Figure 3-2: Original project plan ...............................................................................20 Figure 3-3: Revised project plan................................................................................21 Figure 4-1: Conceptual data model............................................................................25 Figure 4-2: The logical data model............................................................................25 Figure 5-1: The attribute table of archaeological sites displaying unstandardized site. names. 
........................................................................................................................31 Figure 5-2: The Minimum Bounding Geometry geoprocessing tool to create convex. Hulls...........................................................................................................................33 Figure 5-3: The Create Random Points geoprocessing tool to create non-site points within the convex hull................................................................................................34 Figure 5-4: Large gap in the mosaicked USGS DEM ...............................................35 Figure 5-5: The slope variable created, which can influence site location................37 Figure 5-6: The north-south aspect variable which can influence site location. .......39 Figure 5-7: The east-west aspect variable which can influence site location............40 Figure 5-8: The Focal Statistics geoprocessing tool utilized to create the local elevation change variable...........................................................................................41 Figure 5-9: The local elevation change variable, which measures surface roughness an indicator of how easy it is to traverse the landscape.............................................42 Figure 5-10: The terrain texture variable created to measure the amount of variability within a 1-kilometer radius........................................................................................44 Figure 5-11: The shelter variable created to measure the amount of protection the landscape can provide sites........................................................................................46 Figure 5-12: The TPI variable created to explain a sites location on the landscape ....................................................................................................................................48 Figure 5-13: The cost to traverse variable created to measure the relative difficulty of moving across the landscape. ................................................................................50 Figure 5-14: The Cost Distance geoprocessing tool used to create the cost to traverse to major streams variable...........................................................................................51 Figure 5-15: The cost to traverse to major streams variable created to measure the difficulty level to traverse the landscape to this water resource. ...............................52 Figure 5-16: The cost to traverse to all streams variable created to measure the difficulty level to traverse the landscape to all streams, excluding the major streams. ....................................................................................................................................53 Figure 5-17: The cost to traverse to springs and seeps variable created to measure the difficulty level to traverse the landscape to this water resource. .........................54 Figure 5-18: The cost to traverse to water bodies variable created to measure the difficulty level to traverse the landscape to this water resource. ...............................55 Figure 5-19: The cost to traverse to stream network nodes variable created to xii measure the difficulty of reaching this water resource.............................................. 56 Figure 5-20: The cost to traverse to major stream network nodes variable created to determine the cost to traverse to the major stream network nodes. .......................... 
57 Figure 5-21: The Focal Statistics geoprocessing tool for creating the vegetation richness variable........................................................................................................ 58 Figure 5-22: The vegetation richness variable created to determine the count of plant species within the specified 100-meter radius.................................................. 59 Figure 5-23: The vegetation richness variable created to determine the count of plant species with the specified 500-meter radius.............................................................. 60 Figure 5-24: Composite Bands geoprocessing tool................................................... 61 Figure 5-25: Forward Likelihood Ratio methodology. ............................................. 63 Figure 5-26: Hosmer-Lemeshow goodness-of-fit test............................................... 64 Figure 5-27: The Raster to Point geoprocessing tool to raster pixels to points......... 65 Figure 6-1: Distribution of current Rock Art sites on the predictive raster surface utilizing slope, east-west aspect, and terrain texture predictive variables................. 75 Figure 6-2: Distribution of current Rock Art sites on the predictive Rock Art surface utilizing east-west aspect and cost to traverse to all stream nodes predictive variables..................................................................................................................... 76 Figure 6-3: Distribution of current Rock Art sites on the predictive Rock Art surface utilizing east-west aspect and terrain texture explanatory variables. ........................ 77 Figure 6-4: Distribution of current Scatter sites on the predictive surface utilizing vegetation richness at 500-meters, shelter, cost surface, and cost to traverse to water bodies explanatory variables. .................................................................................... 80 Figure 6-5: Distribution of current Scatter sites on the predictive surface utilizing vegetation richness with a 500-meter radius, slope, shelter, and cost distance to major stream nodes explanatory variables. ............................................................... 81 Figure 6-6: Distribution of current Habitation sites on the predictive surface utilizing vegetation richness with a 500-meter radius, slope, shelter, and cost distance to water bodies, and cost to traverse to all stream nodes explanatory variables. .................... 84 Figure 6-7: Distribution of current Habitation sites on the predictive surface utilizing slope, shelter, and cost surface, cost to traverse to water bodies, and cost distance to all stream nodes explanatory variables...................................................................... 85 xiii List of Tables Table 1. Functional and Non-functional requirements. .................................................... 18 Table 2. Classification table for assessing model accuracy.............................................. 65 Table 3. Omnibus tests for model coefficients for Rock Art Models 1, 2, and 3. ............ 68 Table 4. Hosmer-Lemeshow goodness-of-fit test for Rock Art Models 1, 2, and 3......... 69 Table 5. Omnibus tests for model coefficients for Scatter site models 1 and 2................ 70 Table 6. Hosmer-Lemeshow goodness-of-fit test for Scatter site models 1 and 2. .......... 71 Table 7. The omnibus tests for model coefficients for Habitation site models 1 and 2.... 72 Table 8. 
The Hosmer-Lemeshow goodness-of fit test for Habitation site models 1 and 2.................................................................................................................................. 72 Table 9. Classification table for Rock Art logistic regression models 1, 2, and 3............ 74 Table 10. Classification table for the Scatter site models 1 and 2. ................................... 79 Table 11. Classification table for the Habitation site models 1 and 2. ............................. 83 Table 12. Percentage of Rock Art site models predictive surface that contained a probability of 0.5 >= of finding a Rock Art site type. ...................................................... 86 Table 13. Percentage of Scatter site models predictive surface that contained a probability of 0.5 >= of finding a Scatter site type. ............................................................................ 87 Table 14. Percentage of Habitation site models predictive surface that contained a probability of 0.5 >= of finding a Scatter site type........................................................... 87 xv List of Acronyms and Definitions DEM – Digital Elevation Model CRM – Cultural Resource Management GAP – Gap Analysis Program GIS – Geographic Information Systems NAD – North American Datum PCA – Principal Component Analysis SPSS – Statistical Package for the Social Sciences SRI – Statistical Research, Inc. SSURGO – Soil Survey Geographic Database STATSGO – State Soil Geographic Database TPI – Topographic Position Index USGS – United States Geological Survey 1 Chapter 1 – Introduction The Predictive Model of Archaeological Sites in the Hopi Reservation of Northeastern Arizona was designed to assist Dr. Wesley Bernardini, Associate Professor of the Department of Sociology and Anthropology at the University of Redlands, to better determine where to place survey transects to locate archaeological sites. The model created raster surfaces indicating the probability of finding three archaeological site types in the Hopi Reservation of Northeastern Arizona [Figure 1-1]. The site types were Habitation, Scatter, and Rock Art. The logistic regression model was the selected statistical approach that was executed in the Statistical Package for the Social Sciences (SPSS) to determine what independent variables and associated weights to use in creating the final predictive raster surfaces for the archaeological site types. In accomplishing this, archaeological survey efforts could be improved in the Hopi Reservation. 2 Figure 1-1: Reference map of the Hopi Reservation. 1.1 Client Dr. Wesley Bernardini was the client for the project. He provided vector data of the current archaeological sites within the Hopi Reservation and its boundaries. He required a predictive model for archaeological sites in the Hopi Reservation to create a raster surface that indicates the likelihood of finding archaeological sites. Dr. Bernardini also 3 required a file geodatabase to house the current data and that accommodates for future data, and to standardize projections for all data. 1.2 Problem Statement Archaeologists are confronted with the problem of locating archaeological sites in extremely large study areas. This requires extensive field work and time in order to cover these study areas. Since the 1970s, developing predictive archaeological site models has become a common strategy among many archaeologists to increase the likelihood of locating sites (Kohler & Parker, 1986). As a result, Dr. 
Bernardini wanted to develop a predictive model for archaeological sites based on environmental variables (e.g., accessibility to water resources, aspect, slope, etc.). The outputs of this model were created to help focus his efforts and reduce the amount of fieldwork in the 1,619,936-acre Hopi Reservation (Bernardini, personal communication, September 30, 2013).

1.3 Proposed Solution
The proposed solution was to develop a predictive model for archaeological sites in the Hopi Reservation. The strategy was influenced by a previous predictive archaeological model created by Statistical Research, Inc. (SRI), based out of Albuquerque, New Mexico. The topographic, water resource, and vegetation variables for the model were created in ArcGIS 10.2. The logistic regression models were conducted in SPSS. Once the weights for each environmental variable were created based on the coefficients from the logistic regression model, these weights were then used in the weighted overlay analysis in ArcGIS 10.2. Outputs were created for the three archaeological site types. According to Fish (2013), writing for The Learning Center of the Southwest, scatter sites consist "entirely of artifacts and lacking associated features. Some artifact scatters may be comprised of a single material, such as a flaked stone or ceramics, whereas others encompass multiple artifact types." Habitation sites range from ephemeral campsites to massive villages. Rock art sites are made up of pictographs and petroglyphs (Fish, 2013). As a result, raster surfaces indicating the likelihood of finding these three archaeological site types were created for the Hopi Reservation.

1.3.1 Goals and Objectives
This project contained one goal and five objectives. The overarching goal of the project was:
1. To create predictive models for three archaeological site types: Habitation, Rock Art, and Scatter.
In order to accomplish the project goal, five objectives were met:
1. Create a file geodatabase to store the appropriate data for building the models and other maps necessary for the project.
2. Conduct analyses for topographic, water resource, and vegetation independent variables to be included in the model.
3. Build logistic regression models in SPSS.
4. Create predictive raster surfaces for three archaeological site types.
5. Create a workflow for the client to update model predictions when new data are added.

1.3.2 Scope
The scope of the project entailed creating a file geodatabase to house the appropriate GIS data to conduct the topographic, water resource, and vegetation independent variable analysis. The logistic regression models were then built in SPSS. The third component of the project was creating the predictive raster surface outputs in ArcGIS 10.2. Finally, a workflow was created that explained in detail how to update and run the model.

1.3.3 Methods
Once the file geodatabase was constructed, the topographic, water resource, and vegetation variables were created to serve as independent variables in the logistic regression model. The topographic variables included in the model were slope, north-south aspect, east-west aspect, local elevation change, terrain texture, shelter, topographic position index (TPI), and cost surface.
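As an illustration of how such variables can be derived, the short sketch below uses the ArcGIS Spatial Analyst Python site package (arcpy) to generate slope, the two aspect components, and local elevation change from a DEM. The file paths, the cosine/sine decomposition of aspect, and the 1-kilometer neighborhood radius are assumptions made for this example; the project's variables were built with the equivalent geoprocessing tools described in Chapter 5.

import math
import arcpy
from arcpy.sa import Slope, Aspect, Cos, Sin, FocalStatistics, NbrCircle

arcpy.CheckOutExtension("Spatial")                 # Spatial Analyst license required
dem = r"C:\HopiModel\usgs_dem_10m"                 # hypothetical mosaicked 10-meter USGS DEM

slope = Slope(dem, "DEGREE")                       # steepness of each cell, in degrees
aspect = Aspect(dem)                               # downslope direction, 0-360 degrees

# One common way to split aspect into north-south and east-west components
ns_aspect = Cos(aspect * (math.pi / 180.0))        # +1 north-facing, -1 south-facing
ew_aspect = Sin(aspect * (math.pi / 180.0))        # +1 east-facing, -1 west-facing

# Local elevation change: range of elevation within a circular neighborhood (radius assumed here)
local_change = FocalStatistics(dem, NbrCircle(1000, "MAP"), "RANGE")

for raster, name in [(slope, "slope"), (ns_aspect, "ns_aspect"),
                     (ew_aspect, "ew_aspect"), (local_change, "local_elev_change")]:
    raster.save(r"C:\HopiModel\topo\%s.tif" % name)    # hypothetical output folder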
The water resource variables utilized were cost to traverse to streams, cost to traverse to major streams, cost to traverse to springs and seeps, cost to traverse to water bodies, cost to traverse to all stream junction nodes, and cost to traverse to major stream junction nodes. Vegetation richness with a 100-meter and 500-meter radius were the only vegetation variables utilized in building models. Various techniques were utilized to create the topographic variables. These variables were all created from a 10-meter United States Geological Survey (USGS) Digital Elevation Model (DEM) downloaded from http://nationalmap.gov/viewer.html. The seven topographic variables were then created in ArcGIS 10.2 utilizing various geoprocessing tools. The water resource variables were the next environmental variables created. The stream flow lines, water bodies, and seeps and springs feature classes were all downloaded from the National Hydrography Dataset Plus website (http://www.horizonsystems.com/nhdplus/NHDPlusV1_CO.php). All water resource variables analyzed the relative difficulty of traveling across the landscape to a particular water resource (e.g., seeps and springs, major streams, water bodies). Major stream lines were identified by employing the Strahler Stream Order Methodology for determining the highest order streams within the reservation (Strahler, 1957). Stream orders 3-5 were then determined to be the highest order streams in the reservation, which allowed them to be designated as the Major Streams feature class. Major stream network nodes were also identified by utilizing various geoprocessing tools. The cost to traverse to major streamlines, all streamlines excluding the major stream lines, springs and seeps, water bodies, all stream junction nodes excluding the major stream junction nodes, and major stream junction nodes were all created in ArcGIS 10.2. Vegetation data were downloaded from the Gap Analysis Program (GAP), a National Land Cover Data program of the USGS (http://gapanalysis.usgs.gov/ 5 gaplandcover/). Vegetation was then examined by analyzing vegetation richness with a 100- and 500-meter radius within the reservation in ArcGIS 10.2. Once the variables were created, the values for each environmental variable associated with each site type were exported out of ArcGIS 10.2 and were input in the logistic regression model. The dependent variables considered were the archaeological site types and the archaeological non-site types. Archaeological non-site types also had to be created for the model because the dependent variable in a logistic regression analysis is binary. The independent variables were topographic, water resources, and vegetation. Once the model was constructed and run, the coefficients generated by the model were used as weights for the statistically significant variables in the weighted raster overlay analysis. From this, the Raster Calculator was run for the three site types, which created various raster surfaces. There were multiple predictive surfaces created for each archaeological site type because the logistic regression analysis was run various ways to employ various strategies to create the most accurate and valid models. 1.3.4 Audience The audience for this report is the client, Dr. Wesley Bernardini. Dr. Bernardini has a strong background in geographic information systems and sciences. 
Therefore, the final report is written in such a manner that he understands the majority of the technical concepts and language, although there is a significant amount of clarification of technical terminology, concepts, and acronyms throughout the report due to the specialized nature of the project.

1.4 Overview of the Rest of this Report
Chapter 2 addresses the project background and reviews the literature relevant to this project. Chapter 3 discusses systems analysis and design. Chapter 4 addresses the overall database design, including the file geodatabase used in creating the models. Chapter 5 discusses project implementation. Chapter 6 discusses the results and analysis of the project. Chapter 7 makes recommendations for future improvements to the predictive model.

Chapter 2 – Background and Literature Review
It is common practice in the United States for federal land managers to contract archaeologists to conduct surveys on federal lands to locate archaeological sites. Because of this, sensitivity models have grown in popularity due to their ability to predict where settlement sites likely exist (Kohler & Parker, 1986). In the article Predictive Models for Archaeological Resource Location, Timothy A. Kohler and Sandra C. Parker put forward two approaches to predictive modeling: the empiric correlative approach and the deductive approach. According to Kohler and Parker (1986), the empiric correlative model factors in environmental variables as the main predictors of settlement location. This approach is designed to predict settlement areas that have similar environmental features, such as proximity to water resources, and generally does not consider social factors in the model. The deductive approach examines spatial behavior and can better answer questions as to why groups determine where they conduct their activities (Kohler & Parker, 1986). The predictive model created in this project followed the empiric correlative approach because it relied on environmental variables (e.g., vegetation richness, local elevation change, shelter, and aspect) and did not consider social variables. Heilen et al. (2012) state that archaeological sensitivity models have been used for Cultural Resource Management (CRM) since the 1970s. The Anthropological Studies Center of Sonoma State University states the following about CRM: …inventorying [cultural] sites, evaluating them, and at times mitigating the adverse effects of development projects and construction. Cultural resources are the remains and sites associated with human activities and include the following: prehistoric and ethnohistoric Native American archaeological sites; historic and archaeological sites; historic buildings; elements or areas of the natural landscape which have traditional cultural significance (Sonoma State University, 2008). Because of this need, archaeological predictive models have been developed to assist in locating these sites in large study areas. This chapter addresses the main components of the models that had to be researched for successful model development. These areas were the model variables, the logistic regression statistical approach to model development, applications of the logistic regression model in predictive modeling, and approaches to model validation and accuracy. This chapter also discusses the principal component analysis (PCA), another method that can be used in building a predictive model, but one that was ultimately not used in this project.
The chapter concludes by providing a high-level understanding of how each of these topic areas was implemented in building the models.

2.1 Model Variables
There are many approaches that have been used for determining which independent variables should be utilized in predictive modeling. Statistical Research, Inc. (SRI) staff developed a predictive archaeological site model in southern New Mexico in 2012 and created variables within five categories:
• Water resources
• Soils
• Topography
• Historical-period resources
• Vegetation
Many of these variables were implemented in this predictive model for archaeological sites in the Hopi Reservation.
The topographic variables utilized for SRI's model were all created from a 10-meter DEM of the state of New Mexico. These variables were slope, north-south aspect, east-west aspect, local elevation change, terrain texture, shelter, elevation above water, and cost surface. Slope is a very common variable utilized in predictive modeling because the slope of the terrain can have a strong impact on archaeological site locations. Aspect is also very important because a site's orientation directly affects its level of exposure to sun and wind (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). According to Heilen et al. (2012), terrain texture is a measure of terrain roughness. Shelter was also an important variable for modeling site location. This is "the degree to which topographic features offer shelter from the weather, sun, or even visibility" (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Local elevation change is important for modeling site locations because rough terrain can "inhibit day-to-day activities and travel to and from sites" (Kvamme, 1988). The cost surface variable was created to measure the relative cost of moving across the landscape. This variable was useful in creating variables that measure the distance between a raster cell and the closest resource of a given type (e.g., water bodies) (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). These variables were used in creating predictive archaeological site models for southern New Mexico.
The model developed by SRI staff also utilized soil variables that were extracted from the Soil Survey Geographic Database (SSURGO) and State Soil Geographic Database (STATSGO). SRI staff used multiple soil-attribute variables in the model because they found that many sites in each modeling unit were located along the edges of soil map units (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Some of these variables were available water capacity, pH level, organic matter, surface-horizon thickness, and sodium absorption ratio. SRI staff also created additional soil variables such as cost to traverse to soil-texture boundary, standard deviation in soil-texture boundary index, and range in organic matter. Cost to traverse to soil-texture boundary examined the cost to traverse to a soil-mapping unit boundary. The standard deviation in soil-texture index variable was created to examine the variation in soil texture between adjacent soil-mapping units. The range in organic-matter content was created because it was observed that sites close to soil-mapping unit boundaries were "located on units that has substantially different organic matter contents than immediately adjacent soil-mapping units" (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012).
Water resource variables were also created to examine their availability on the landscape.
The variables taken into consideration for water resources were cost to traverse to streamlines, cumulative drainage of nearest stream segment, cost to traverse to major streamlines, cost to traverse to springs or seeps, cost to traverse to water bodies, cost to traverse to stream-network nodes, cost to traverse to major stream-network nodes, 9 and elevation relative to water. All these variables measured the relative difficulty of traversing the landscape to reach the specified water resource (e.g., cost to traverse to springs or seeps). Elevation relative to water was another water resource variable that SRI staff utilized in building models. Because these models were developed in the canyon and mesa land of southern New Mexico, it was important to examine not just the horizontal cost to traverse to a water resource but also to examine the vertical distance, for example, an area might horizontally be in close proximity to water but is separated from it by a cliff. For this reason, elevation relative to water was included in the building models (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Another variable taken into consideration by SRI staff was historical period resources. Heilen et al. states, Historical-period sites are often located near transportation routes, such as trails, wagon roads, and railroads because much historical-period land use in the US West was dependent on the use of transportation for exploration and population migration as well as the redistribution of goods and materials. Therefore, proximity to such features of the built environment is often considered a primary determinant of historical-period settlement behavior (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Similar to the water resource variables, all the names of the historical-period variables are self-explanatory in how they influenced site establishment (e.g., cost to traverse to historical-period places). These variables included cost to traverse to historical-period transportation routes, cost to traverse to historical-period places, and cost to traverse to [water] tanks. Vegetation was the last variable that was considered in the variable analysis. SRI staff utilized this variable because they believed there were strong associations between site location and vegetation type. SRI staff only examined two variables, vegetation type and richness. SRI staff considered vegetation richness because they believed sites that were “located near the edges of multiple vegetation types have greater access to a wider variety of resources than sites located in an area with uniform vegetation” (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). The article GIS Reveals Basis for Ancient Settlement Location, by Dr. Terrance L. Winemiller, Associate Professor of Archaeology, Auburn University of Montgomery, Alabama, discusses a similar predictive model development project that was created for covering most of northwestern Yucatán, Mexico (Winemiller, 2014). The underlying theory behind this model was that the Mayas established settlements close to water sources, especially small cenotes (natural wells). There was additional analysis of other environmental variables, such as rainfall, climate, and soil type, but no significant relationships were found between site type and these variables (Winemiller, 2014). Essentially, what was ascertained from this research is that water resource variables were the strongest site determinant in the study. 
This could be due to the fact that the terrain was relatively flat, which caused site location to not be dependent on topography. 2.2 The Principal Component Analysis The principal components analysis (PCA) has been used in predictive models to reduce the number of independent variables that were originally created for building models for simplification and also to eliminate multicollinearity among variables. Multicollinearity 10 arises where there are very strong correlations among independent variables and the independent variables have no significant impact on explaining the dependent variable (Fattah, 2010). According to the Esri ArcGIS Resources Center website, the PCA tool found in the Spatial Analyst toolbox of ArcGIS 10.2 transforms the original raster data “in the input bands to a new multivariate attribute space whose axes are rotated with respect to the original space. The axes (attributes) in the new space are uncorrelated” (Esri, 2011). The purpose of this statistical technique is to eliminate multicollinearity or redundant variables. For example, elevation, slope, and aspect are derived from a DEM, from which most of the variance can be explained. The tool then creates a multiband raster that contains the same amount of bands as the specified number of components. The first principal component raster band contains the greatest amount of variance, the second contains the second greatest variance, and so on (Esri, 2011). Typically, principal component rasters containing 95 % of the variance will be utilized in building models. In order for the Principal Components Analysis tool to be run, the user must enter the following inputs: the input bands (environmental variables), the number of principal components to transform the original input bands, and the name of the statistics text file (Esri, 2011). In order to shift and rotate the axes, the data are plotted onto a scatter plot. An ellipse then bounds the data points in the scatter plot [Figure 2.1] (Esri, 2011). Figure 2-1: Data points bounded by a fitted ellipse. Adapted from ArcGIS 10.2 Help, How Principal Component Works (Esri, 2011). 11 Next, the software program determines the major axis of the ellipse. This component exhibits the most variation “because it is the largest transect that can be drawn through the ellipse. The direction of PC1 is the eigenvector, and its magnitude is the eigenvalue [Figure 2-2]. The angle of the x-axis to PC1 is the angle of rotation that is used in the transformation” (Esri, 2011). Figure 2-2: First Principal Component. Adapted from ArcGIS 10.2 Help, How Principal Component Works (Esri, 2011). The orthogonal line that is directly perpendicular to the first principal component is then calculated (Esri, 2011). This line created is now the second principal component and is also the new axis of the original Y-axis [Figure 2-3]. The axis now describes the greatest variance that was not described by the first principal component (Esri, 2011). 12 Figure 2-3: Second principal component. Adapted from ArcGIS 10.2 Help, How Principal Component Works (Esri, 2011) A line formula is then created that defines the shift and rotation previously mentioned by utilizing the eigenvectors, eigenvalues, and the covariance matrix that was created by the input multiband raster (Esri, 2011). It is important to understand how the percent variance is determined. The percent variance identifies how much variance each eigenvalue captures. This is used to interpret the PCA results. 
The formula to determine the percent variance is the following:
Percent Variance = (eigenvalue * 100) / (sum of all eigenvalues)
It is recommended to use the principal components whose eigenvalues contain the majority of the variance to build models (Esri, 2012).

2.3 Logistic Regression Approach in Predictive Modeling
The logistic regression analysis is the statistical approach that was selected for this project. Logistic regression is a multivariate regression specification for a dichotomous dependent variable and independent variables that are either categorical or continuous (Field, 2005). In essence, categorical dichotomous variables are binary. For this model, the categorical variables were archaeological site presence or absence. The purpose of the logistic regression model was to determine areas in the Hopi Reservation where there is a likelihood of archaeological site presence or absence, based on multiple independent variables. In a linear regression equation where there are multiple independent variables, the relationship between the dependent variable and a given independent variable is captured by an estimated coefficient on that variable. For example,
Yi = b0 + b1X1i + b2X2i + ··· + bnXni + εi
where b0 is the constant, εi is the residual term, and each bn is the regression coefficient corresponding to its independent variable Xn (Field, 2005). In a logistic regression model, the objective is to predict the probability of Y occurring given the known independent variable values. In a logistic regression where there are multiple independent variables, the equation creates a probability range between 0 and 1. This indicates that a value close to 1 suggests that Y is likely to occur and a value close to 0 suggests that Y is not likely to occur (Field, 2005). This is expressed in the following equation:
P(Y) = 1 / (1 + e^-(b0 + b1X1i + b2X2i + ··· + bnXni))
where P(Y) is the probability of Y occurring, e is the base of the natural logarithm, b0 is the constant, and the other coefficients are the same as in the linear regression equation discussed above. When the logistic regression model is run, the values of the coefficients need to be estimated so the equation can be solved (Field, 2005). The objective is that the chosen model will be one in which, when the independent variable values are placed in the model, the resulting Y values will be closest to the observed values (the archaeological sites/non-sites that have already been located) (Field, 2005). The primary goal of the logistic regression model is to fit a model to the observed data that allows the researcher to estimate values of the dependent variable from the known values of the independent variables (Field, 2005). Field then discusses what method to use to determine the most significant variables for the final model, whether it be stepwise or the forced-entry method. He states that most people believe the stepwise method is most appropriate for exploratory work; therefore, it is most suitable for this predictive model, considering no prior research has been done on this topic in the Hopi Reservation. In the forward stepwise method, the model is initially defined only by the constant, b0. This method is then able to detect the independent variable that best predicts the dependent variable, Y. It selects the independent variable based on the variable that has the highest correlation with the dependent variable. The forward stepwise method then selects the next variable that has the next highest correlation, or semi-partial correlation, with the dependent variable.
Essentially, this methodology is able to select the independent variable that explains most of the variation in the dependent variable. Semi-partial correlation can be explained by how the software program correlates each of the remaining independent variables with the dependent variable "while controlling for the effect of the first predictor [independent variable]" that has already been selected (Field, 2005). The backward stepwise methodology begins by including all the independent variables in the model and then determines the level of contribution of each one by examining the significance value of the t-test of each independent variable (Field, 2005). When utilizing the stepwise method, Field recommends using the backward methodology because of suppressor effects. Suppressor effects occur when an independent variable has a significant effect, but only when another variable is held constant. According to Field, the forward stepwise method is more likely to exclude independent variables that are impacted by suppressor effects (Field, 2005). However, in this project the two methodologies produced the same results. These concepts and principles comprise the background of the logistic regression model. This is important to comprehend in order to effectively create a predictive archaeological site model utilizing logistic regression. It also allows one to understand how logistic regression has been applied in other predictive modeling applications.

2.4 Applications of the Logistic Regression Model in Predictive Modeling
In the article Predictive Mapping of Blackberry in the Condamine Catchment Using Logistic Regression and Spatial Analysis, the logistic regression modeling approach was utilized for predictive mapping (Apan, Wells, Reardon-Smith, McDougall, & Basnet, 2008). In this article, the authors posited that a logistic regression model is "used to predict the probability of occurrence of an event as a function of the independent variables. It is useful when the observed outcome is restricted to two values, which usually represent the occurrence or non-occurrence of an outcome event" (Apan et al., 2008). This statistical approach does not assume that the relationship between the independent and dependent variables is linear. It also does not assume that the dependent variable is normally distributed (Apan et al., 2008). This is important for archaeological predictive modeling because archaeological site locations would probably never be normally distributed (Field, 2005), meaning they tend to be clustered due to the environmental conditions that dictated their location. The authors selected 2,277 random sample points that contained either weed or no-weed observations. From this, they chose 1,592 points for the training set of the analysis. The authors utilized the forward likelihood ratio (a forward stepwise) method, which allowed them to determine the variables to be used in the model, as opposed to the SRI staff, who used the PCA approach to limit the number of variables to be used in their model. After each of the variables had been tested with "an assessment of their related statistics, a model was finally selected.
The following equation was implemented using the ArcGIS Spatial Analyst raster calculator to produce the predictive map" (Apan et al., 2008):
Pred_bberry = 1 / (1 + exp(-(-7.751 + (-0.002 * [dist_stream]) + (0.000 * [dist_border]) + (-0.037 * [fpc]) + (0.015 * [elev]))))
By using the forward likelihood ratio method, the authors were able to determine the most important variables to include in the final model. The final output map was a predictive raster surface where each cell value ranged from 0 to 1, with 1 indicating the highest likelihood of finding blackberry (Apan et al., 2008).
In the article Predictive Models of Archaeological Site Distributions in New Zealand, J.R. Leathwick also used the logistic regression modeling approach for predicting archaeological sites with ArcGIS software. Leathwick used the Generalized Additive Model to fit the logistic regression model to the data in order to estimate the independent variables. Leathwick asserts that this approach is advantageous "in that the relationship between the response and the continuous predictor variables are defined from the data using scatter plot smoothers, rather than more inflexible parametric terms used traditionally for such analysis" (Leathwick, 2000). The regression was then fitted using the backwards stepwise procedure, where all variables were initially included and then dropped once their significance was determined using the t-test. Leathwick stated that the advantage of the logistic regression approach, as stated earlier, is that it has the ability to handle non-normally distributed data, such as the presence/absence archaeological site data used in this study (Leathwick, 2000). The idea behind this is that the data points in the scatter plot can be smoothed "by fitting a line to the data" (NetMBA, 2010). Leathwick assessed the individual contributions of each independent variable by "using the residual deviance when dropping each [variable]" (Leathwick, 2000). The predictive archaeological site map was then produced using the same environmental dataset points on a 1-km grid (Leathwick, 2000).

2.5 Accuracy and Validation
It is recommended to validate and analyze the accuracy of model results. In the model developed by Apan et al. (2008), validation and accuracy measures were conducted to analyze these results. This was necessary to determine whether the outputs were valid and could be used for decision-making. The authors used the Hosmer-Lemeshow chi-square test of goodness-of-fit to examine the overall fit of the model. The desired outcome of this test is a finding of non-significance, i.e., a p-value above the .05 significance level (SPSS). This indicates the model adequately fits the data. The authors also suggested the omnibus test of model coefficients to test "whether the model with the predictors is significantly different from the model with only the intercept" (Apan et al., 2008). The desired outcome of this test is a finding of significance, i.e., a p-value below the .05 significance level. This also indicates that the data adequately fit the model. The authors state, "This means that at least one of the predictors is significantly related to the response variable" (Apan et al., 2008). In the same study, the authors used classification tables to assess the accuracy of the model. The idea behind this test is to determine what percentage of cases the model classifies correctly. According to the authors, the goal is a perfect model that scores 100% correct.
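For readers who want to reproduce these checks outside of SPSS, the sketch below shows one generic way to compute a Hosmer-Lemeshow statistic and a percent-correctly-classified figure from observed site/non-site values and predicted probabilities. It is an approximation written in Python/NumPy under the usual ten-group convention, not the exact SPSS routine, and the array names are hypothetical.

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    # Chi-square goodness-of-fit over groups of cases ordered by predicted probability
    order = np.argsort(p)
    y, p = np.asarray(y)[order], np.asarray(p)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(p)), groups):
        n = len(idx)
        obs, exp = y[idx].sum(), p[idx].sum()          # observed vs. expected sites in the group
        stat += (obs - exp) ** 2 / exp + ((n - obs) - (n - exp)) ** 2 / (n - exp)
    # A p-value above .05 (non-significance) suggests the model fits the data adequately
    return stat, chi2.sf(stat, groups - 2)

def percent_correctly_classified(y, p, cutoff=0.5):
    predicted = (np.asarray(p) >= cutoff).astype(int)  # classify each point as site (1) or non-site (0)
    return 100.0 * (predicted == np.asarray(y)).mean() # a perfect model would return 100

In this project, the analogous statistics were read directly from the SPSS output rather than computed by hand.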
These tests were conducted to examine whether the variables selected for building models were valid and could adequately explain the dependent variable, site/non-site locations (Apan et al., 2008).

2.6 Summary
In the course of examining much of the current literature on the topic of predictive models, an overall strategy was formulated. The study conducted by the SRI staff had the greatest influence on which independent variables were created to build models for the project. The work of Field (2005) was also very influential in how the logistic regression model was built in SPSS. Because of this work, the forward stepwise and forced-entry methods were selected to determine which independent variables were used for building the final models for each archaeological site type. The Hosmer-Lemeshow test was applied to examine the models' goodness-of-fit, and the omnibus test of model coefficients was utilized to test whether the independent variables were able to predict the dependent variable "better than by chance alone" (Denis, 2010). In order for a predictive surface to be created in ArcGIS 10.2, the algorithm suggested by Apan et al. (2008) was used in the Raster Calculator geoprocessing tool, with the coefficients created in SPSS, to standardize the outputs of the predictive surface to a range from 0 to 1. Once the model results were created, the models were assessed for accuracy. The percent correctly classified statistic, also utilized by Apan et al. (2008), was calculated to test the accuracy of the output. As a result, this test provided a level of understanding of how accurate the final output maps were and where to focus archaeological survey efforts in the Hopi Reservation.

Chapter 3 – Systems Analysis and Design
Because the client requested a predictive model for archaeological sites, it is necessary to clearly discuss the overall systems analysis and design of the project. This entails revisiting the problem statement in section 3.1 to clarify what was actually needed by the client. After discussing the problem statement, the project functional and non-functional requirements are discussed in section 3.2. The project system design is then addressed in more detail in section 3.3, followed by the project plan in section 3.4.

3.1 Problem Statement
Archaeologists are confronted with the problem of locating archaeological sites in extremely large study areas. This requires extensive fieldwork and time in order to cover these study areas. Since the 1970s, developing predictive archaeological site models has become a common strategy among many archaeologists to increase the likelihood of locating sites (Kohler & Parker, 1986). As a result, Dr. Bernardini wanted to develop a predictive model for archaeological sites based on the following environmental variables: water resources, topography, and vegetation. The outputs of this model were created to help focus his efforts and reduce the amount of fieldwork in the 1.6 million-acre Hopi Reservation (Bernardini, personal communication, September 30, 2013).

3.2 Requirements Analysis
The client had three functional requirements for the project. In order to create the predictive model, the data had to be appropriately housed and organized in a file geodatabase. These feature class data were archaeological site/non-site feature classes, hydrography feature classes, site and non-site values to points, convex hulls, and additional feature classes for reference maps (e.g., roads, political boundaries, and streams).
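A minimal arcpy sketch of how such a file geodatabase and its feature datasets might be created is shown below; the folder path, geodatabase name, and the NAD 1983 UTM Zone 12N spatial reference are illustrative assumptions rather than the project's documented settings.

import arcpy

folder = r"C:\HopiModel"                                   # hypothetical project folder
gdb = arcpy.CreateFileGDB_management(folder, "HopiPredictiveModel.gdb").getOutput(0)

sr = arcpy.SpatialReference(26912)                         # assumed NAD 1983 UTM Zone 12N

# Feature datasets matching the logical model described in Chapter 4
for name in ["ArchaeologicalSites", "Hydrography", "ConvexHulls",
             "Surveys", "MapmakingResources", "ValuesToPoints"]:
    arcpy.CreateFeatureDataset_management(gdb, name, sr)

# Client-supplied shapefiles could then be loaded into the appropriate dataset, e.g.:
arcpy.FeatureClassToGeodatabase_conversion([r"C:\data\Habitation.shp"],
                                           gdb + "\\ArchaeologicalSites")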
These feature class data were archaeological site/non-site feature classes, hydrography feature classes, site and non-site values to points, convex hulls, and additional feature classes for reference maps (e.g., roads, political boundaries, and streams). The raster topographic, water resource, and vegetation independent raster variables for the model were housed outside of the file geodatabase in separate folders because of their large quantity. Because the predictive model was never a Python or ArcGIS ModelBuilder tool that could be run to create predictive archaeological site raster outputs, the predictive model requirements were only predictive raster surface maps indicating the likelihood of locating each of the three archaeological site types. Because it was not possible to create an automated tool, the third functional requirement was a workflow explaining how to update the model when new archaeological sites were needed to be added to increase its accuracy (Table 1). The only non-functional requirement made by the client was that the model be tested for accuracy. The client also visually inspected the final outputs of the model to see whether the results were logical. The model was also tested utilizing the classification tables created in SPSS for model accuracy assessment. 18 Table 1. Functional and Non-functional requirements. Functional Requirements Description File geodatabase File geodatabase that houses feature datasets of archaeological site and non-site point feature classes, hydrography feature classes, feature classes for reference maps (e.g., roads, political boundaries), convex hull polygons for non-site random sampling, and points to value feature class that holds variable values associated with site and non-site points Predictive model Predictive model utilizing topographic, water resource, and vegetation variables to detect archaeological site type locations according to archaeological period Predictive model workflow for updating model Training manual explaining how to update predictive model when new archaeological sites are located Non-Functional Requirements Description Model Assessment & Validation The client visually assessed the model outputs to determine whether the results are acceptable. The model was tested and validated using the Percent Correctly Classified methodology 3.3 System Design The file geodatabase was the first major component of the overall system design. This was developed not only for model development and reference maps, but also for the client’s future use if he decides to update the model and for other GIS purposes. Once the file geodatabase was developed and organized, the overall system design was created [Figure 3-1]. The overall system design comprised the independent and dependent variables created in ArcGIS 10.2. These dependent variables―the archaeological site/non-site points—were used to extract the values of the environmental variables to those points and then were imported into SPSS in order to build models. Once the coefficients were generated for each site type by running the logistic regression analysis, these coefficients were then input to a formula in the Raster Calculator geoprocessing tool to create the predictive raster surfaces [Figure 3-1]. 19 Figure 3-1: System design of predictive model workflow 3.4 Project Plan The original project plan was divided into numerous milestones [Figure 3-2]. 
The geodatabase construction milestone of the project was divided into data scrubbing, data reprojection, and data acquisition and was projected to be completed by the end of February 2014. The model construction milestone was divided into the principal components analysis and cost to traverse analysis and was projected to be completed by May 15, 2014. Model assessment and testing were also divided into many tasks: comparing current archaeological sites to predicted locations, utilizing ArcGIS Data Reviewer, and other testing and validation measures. This milestone was projected to be completed by June 6, 2014. The original project plan also included a web application. This was to be completed by July 5, 2014. Project requirement revisions were projected to be completed by June 30, 2014. The final project report was projected to be completed by July 31, 2014 [Figure 3-2]. 20 Figure 3-2: Original project plan In the actual execution of the project plan, a large amount of time was allocated to geodatabase design and creation, and data collection. The completion of this milestone not only entailed the actual technical aspects of the file geodatabase scrubbing, design, and creation, but also required extensive planning and clarification of file geodatabase requirements with the client. This was expected to be completed by the end of February 2014. This milestone of the project was not actually reached until March 7 [Figure 3-2]. Milestones of the project such as the Principal Component Analysis and Least Cost Surface Analysis were projected to be completed by the end of April [Figure 3-1]. In the updated project plan, these milestones were eliminated and a new milestone was created, called simply Model Construction. The model was projected to be completed by May 15, 2014 [Figure 3-3]. This milestone was actually completed June 11, 2014. There were some technical difficulties during the model construction milestone of the project. The trigger point date, the date to change to the contingency plan, for model construction technical difficulties was May 15. The contingency plan for this was to seek outside technical assistance from experts in the field of spatial statistics and logistic regression analysis. The project scope was also reduced. Originally, the client requested that the Hopi reservation be divided into three modeling units―Great Basin Conifer Woodlands, Great Basin Shrub Grassland, and Great Basin Desert Scrub, which would increase the number 21 of predictive model map outputs significantly. Due to time constraints, the client agreed to the contingency plan to only require the model outputs be created for the whole Hopi Reservation rather than individual modeling units. This then only required model outputs for each site type, which were not based on archaeological period. The web application was also eliminated from the original project scope because it was determined unnecessary for the client’s needs [Figure 3-3]. Figure 3-3: Revised project plan 3.5 Summary In summary, the project plan and timeline changed significantly throughout the course of the project. Project scope was reduced significantly by eliminating the web application and also by creating models for the entire Hopi Reservation at once and not by individual model units, which would have tripled the amount of model outputs. Geodatabase design and model construction also took significantly longer than expected. 
With this said, it was important that there were the necessary trigger point dates set for when problems arose and also contingency plans. It was also necessary that many of the project milestones were completed in advance in order to have time to complete more difficult milestones. 22 23 Chapter 4 – Database Design The overall database design was one of the most crucial components of the project, not only for model building purposes but also for the client’s future use. Section 4.1 describes the conceptual data model and the underlying theory of how and why the model was developed. Section 4.2 discusses the logical model, how the appropriate data were organized within the file geodatabase, and how the independent raster variables were stored. Section 4.3 discusses the data sources of the project and provides justification for why these datasets were used. Section 4.4 addresses outside data collection efforts. Section 4.5 discusses the data scrubbing and loading component of the project using data provided by the client. Section 4.6 summarizes the chapter and emphasizes some of its more salient points. 4.1 Conceptual Data Model The conceptual data model of the predictive model for archaeological sites of the Hopi Reservation of northeastern Arizona represented the relationships between the independent environmental variables and the final model outputs [Figure 4-1]. The independent variables were represented by three categories: topography, water resources, and vegetation. The conceptual data model factor addressed why each variable was used in creating the predictive model. For example, the topographic variables dictated the amount of protection or shelter that was provided, which played a role in site location. Vegetation was an indicator of biodiversity, which provided food resources. Water resources were important in that they provided a water resource for survival and travel. The final model output was a predictive surface indicating the likelihood of finding an archaeological site within the reservation [Figure 4-1]. The topographic and water resource independent variables were created from 10-meter USGS DEMs, while the vegetation independent variable was created from GAP vegetation data, which is also depicted in the model. The water resource feature classes required for the creation of the water resource independent variables were created from the National Hydrologic Dataset Plus (U.S. EPA, n.d.) [Figure 4-1]. 24 Figure 4-1: Conceptual data model 4.2 Logical Model The majority of the data was stored in a file geodatabase in six feature datasets for the logical data model [Figure 4-2]. These feature datasets were Archaeological Sites, Hydrography, Convex Hulls, Surveys, Mapmaking Resources, and Values to Points. Within the archaeological site types feature dataset, there were multiple feature classes. These were Habitation, Scatter, Rock Art, and the Non-Sites. The Hydrography feature dataset contained the Water Bodies, All Streams, Major Streams, and Seeps and Springs feature classes. The surveys feature dataset contained only the All Surveys feature class. The Mapmaking Resources feature dataset contained the following feature classes for mapmaking: Major Cities, Roads, Reservation, Stream Washes, and US States. The Convex Hull feature dataset contained the convex hulls that were used as the constraining boundaries when the random non-site points were created for each archaeological site type [Figure 4-2]. 
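As one illustration of how a structure like this could be assembled, the following arcpy sketch creates a file geodatabase and the feature datasets named above (the workspace path, geodatabase name, and spatial reference are hypothetical placeholders, not the project's actual values):

```python
import arcpy

# Hypothetical workspace and spatial reference; the project's actual values may differ.
workspace = r"C:\HopiProject"
sr = arcpy.SpatialReference(26712)  # NAD 1927 UTM Zone 12N

# File geodatabase that holds the vector data for the model and reference maps.
gdb = arcpy.management.CreateFileGDB(workspace, "HopiPredictiveModel.gdb").getOutput(0)

# One feature dataset per logical grouping described above.
for dataset in ("ArchaeologicalSites", "Hydrography", "ConvexHulls",
                "Surveys", "MapmakingResources", "ValuesToPoints"):
    arcpy.management.CreateFeatureDataset(gdb, dataset, sr)
```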
Figure 4-2: The logical data model.

The independent model raster variables were contained within folders outside the project geodatabase [Figure 4-2]. These variables were the topographic raster variables, the water resource raster variables, and the vegetation raster variable. The topographic raster variables folder contained eight independent model variables: slope, local elevation change, cost surface, shelter, terrain texture, east-west aspect, north-south aspect, and TPI. The water resource variables folder contained six independent model variables: cost to traverse to streams, cost to traverse to major streams, cost to traverse to seeps and springs, cost to traverse to water bodies, cost to traverse to stream nodes, and cost to traverse to major stream nodes. The vegetation raster variables folder contained two independent model variables: vegetation richness with 200- and 500-meter radii. All 16 independent variables were combined into a single composite variable, which was stored separately in the environmental variable raster composites folder [Figure 4-2].

4.3 Data Sources

Archaeological site point data, collected from the client's personal surveys as well as those of his associates, were provided directly by the client. These data came in Excel and Esri shapefile formats. The Hopi Reservation boundary was also provided directly by the client in Esri shapefile format. All hydrography data were downloaded from the National Hydrography Dataset Plus (NHD Plus) website of the USGS. The hydrography data were from Region 15, version 01. The flow lines, seeps and springs, and water bodies were the only datasets used from the NHD Plus website. NHD Plus is the official United States geospatial hydrologic framework dataset, created by the U.S. Environmental Protection Agency (U.S. EPA, n.d.). NHD Plus data were initially developed at the 1:100,000 scale for the whole country. The current NHD Plus data are developed at the 1:24,000/1:12,000 scale, which adds detail to the original dataset created at the 1:100,000 scale (U.S. EPA, n.d.). NHD Plus data were also the hydrography data source utilized by SRI staff in their predictive modeling project. It was also necessary to download four 10-meter resolution USGS DEMs covering the entire Hopi Reservation. These DEMs were downloaded from the USGS National Map Viewer and Download Platform. These data came from the National Elevation Dataset (NED) in a floating-point pixel type and a GRID file format. They were initially unprojected and had to be reprojected into the correct projection for analysis. The NED serves as the elevation layer of the National Map and provides basic elevation information for earth science studies and mapping applications in the United States. The data extracted from this website are the elevation data utilized by the scientific and resource management communities for "research, hydrologic modeling, resource monitoring, mapping and visualization applications" (USGS, 2014).
For this reason, this elevation dataset was utilized for creating topographic and water resource variables. The two vegetation variables, vegetation richness with a 200- and 500-meter radii, were created from the GAP Land Cover Data Portal of the USGS. GAP data are a combination of several projects to create a seamless vegetation data set of the United States. The GAP data utilized for this project came from the Southwest Gap Analysis Project. These data were created from multi-season Landsat ETM+ from 1999 to 2001 in combination with DEMs “to model natural and semi-natural vegetation” (USGS, 2011). The USGS states: The GAP national land cover data, based on the NatureServe Ecological Systems Classification, are the foundation of the most detailed, consistent map of vegetative associations ever available for the United States and will help facilitate the planning and management of biological diversity on a regional and national scale (USGS, 2011). This dataset was used in building models because it is the national standard when utilizing vegetation data in scientific studies and was also utilized by SRI in a similar predictive modeling project. These data originally came in a North American Datum (NAD) 83 Albers projection and had to be reprojected. 27 4.4 Data Collection Methods There were no outside data collection efforts required for this project. All data were either provided by the client or downloaded from the web. 4.5 Data Scrubbing and Loading There was extensive discussion and planning with the client on how the archaeological site type data were to be organized and scrubbed. The original archaeological point files provided by the client came in Microsoft Excel and Esri shapefile formats and in multiple projections. Before these datasets were loaded into the file geodatabase, the data in the attribute table had to be standardized. Initially, there were multiple archaeological site types. For simplicity and modeling purposes, the site types were combined into three types: Habitation, Rock Art, and Scatter. There were also multiple sites that were duplicated. This took time and coordination with the client to determine how to scrub the Esri shapefiles so that none of the archaeological sites were duplicated. There were also many naming convention errors for site types, which created problems when querying the data. For example, a Scatter site type was often called Lithic Scatter, and a Kiva site type had to be categorized under Habitation. Once all site types had been correctly categorized and combined into one Esri shapefile, the archaeological site types were all projected to the NAD 1927 UTM North Zone 12 projected coordinate system and then loaded into the file geodatabase in order to begin building models. 4.6 Summary Chapter 4, Database Design, covered multiple topics of the data utilized for the predictive model for archaeological sites. The conceptual model addressed the underlying theory of how the predictive model for archaeological sites was developed. It addressed the three driving variables that determined archaeological site location: topography, water resources, and vegetation. This was addressed to clarify the logic behind the predictive model [Figure 4-1]. The conceptual model was then addressed to explain how the file geodatabase stored the necessary data and also the environmental raster variables were stored outside the file geodatabase. This was to discuss the logic of how the data for the model was stored and organized [Figure 4-2]. 
The data sources were then discussed, not only to address where the data came from but also to justify why these datasets were used and why they were the best datasets for the predictive model. Scrubbing and loading of the data were also addressed to discuss how the data were initially acquired, organized, scrubbed, and loaded into the file geodatabase.

Chapter 5 – Implementation

Chapter 5 discusses in detail how the project was implemented. The chapter was divided into nine sections to individually address each phase of project implementation. Section 5.1 discusses how the data were organized and prepared. Section 5.2 discusses independent variable development. Section 5.3 discusses how the data created in ArcGIS 10.2 were imported into SPSS. Section 5.4 discusses how the independent variables were tested for multicollinearity. Section 5.5 discusses how the models were built in SPSS. Section 5.6 discusses the model validation tests. Section 5.7 discusses how the models were assessed for accuracy. Section 5.8 discusses how the predictive raster surfaces were created in ArcGIS 10.2. Section 5.9 summarizes the chapter and addresses the salient points regarding the overall implementation of the project. This was done in order for the client or another interested individual who is competent in GIS to be able to replicate the project.

5.1 Data Compilation and Creation Methods

Archaeological site type data in Microsoft Excel and Esri shapefile formats and the reservation polygon in Esri shapefile format were delivered directly by the client at the initiation of the project. This phase of the project required extensive exploratory analysis, organization, and scrubbing before the data were ready to be input into the model. The site types included misspelled and redundant names and were not standardized in how they were entered into the attribute table [Figure 5-1]. The client requested that the following fields be included in the attribute table for the archaeological sites: FID, SiteName, SiteNo1, SiteNo2, SiteNo3, Inst., Project, SiteType1, SiteType2, SiteType3, Area, Paleo, Archaic, BM2, BM3, PI, PII, PIII, PIV, Historic, Depth, and Residential. Some of the data were already populated in these fields, but a large amount of data had to be entered manually. For example, whether a site occurred during a particular archaeological period had to be identified by a binary code (i.e., 1 = Yes and 0 = No). This was not done in the original attribute tables and had to be corrected. There was extensive discussion with the client in order to reach a common understanding as to how the data were to be organized and stored. The client decided he was interested in modeling three archaeological site types: Habitation, Scatter, and Rock Art. Once all archaeological site type Esri shapefiles had been standardized and scrubbed, they were loaded into a single feature class called Archaeological Site Types.

Figure 5-1: The attribute table of archaeological sites displaying unstandardized site names.

Once the archaeological sites had been organized properly in accordance with the client's requests, the archaeological non-site types were created. Initially, the client wanted each archaeological site type modeled by the Basket Maker 3, Period I, Period II, Period III, and Period IV archaeological periods. The issue then arose of spatial dependency among sites: there was extreme clustering among site types in many parts of the Hopi Reservation.
When spatial dependency occurs among features on the landscape, all the site types are occurring in very similar environmental conditions (e.g., slope, aspect, and cost to traverse to water). According to Goodchild, spatial dependency is "the propensity for nearby locations to influence each other and to possess similar attributes" (Goodchild, 32 1992). Consequently, the clustering of sites that are associated with the similar environmental conditions can lead to very strong biases in the model. Because of this, all the same site types (i.e. Rock Art, Habitation, and Scatter) were given 200-meter buffers, and boundaries were dissolved among the newly created buffers that overlapped. The 200-meter buffer distance was determined because with a larger buffer distance, many more sites would be eliminated which would not allow there to be a sufficient amount of sites to run the logistic regression analysis. According to Hosmer and Lemeshow, the general rule for the logistic regression model is that there should be 10 sites for every independent variable (Hosmer & Lemeshow, 2013). Because of this general rule by Hosmer and Lemeshow, the 200-meter buffer distance was chosen because if any larger there were not enough archaeological site points per independent variable to run the logistic regression model. Once the site types were re-created into dissolved buffers, the Feature to Point geoprocessing tool was run on all polygons created by buffers to generate polygon centroids. These were then the new feature classes for each archaeological site type to reduce spatial dependency among sites, which was previously causing the model to indicate bias and generate odd results. Random sampling was then conducted to create archaeological non-site points to be combined with the site points in the logistic regression model. These point types were created to be the dependent, dichotomous variables in the model. The Minimum Bounding Geometry tool was run to create a convex hull polygon around the outermost points for each original archaeological site type dataset [Figure 5-2]. 33 Figure 5-2: The Minimum Bounding Geometry geoprocessing tool to create convex Hulls. This was done to create the smallest sampling polygon possible that contained all the archaeological site types. The Create Random Points geoprocessing tool was then run to create close to the same amount of non-sites as sites for each category [Figure 5-3]. All points that were outside the reservation and within 200-meters of each other were eliminated. This was imperative to try to eliminate the possibility of creating a random non-site point that intersected another site point or else more errors would be introduced into the model. 34 Figure 5-3: The Create Random Points geoprocessing tool to create non-site points within the convex hull. Once an appropriate amount of non-site points were created for each archaeological site type, the independent variables were then created. The topographic and water resource variables were created from a 10-meter USGS DEM. Initially, multiple DEMs were mosaicked together using the Mosaic by Mask geoprocessing tool. When this was completed there was a noticeable gap that was created when the USGS DEMs were mosaicked together that contained null values [Figure 5-4]. 
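A minimal arcpy sketch of the buffering, dissolving, and random non-site sampling workflow described above (the geodatabase path and feature class names are hypothetical; the 200-meter distances follow the text):

```python
import arcpy

arcpy.env.workspace = r"C:\HopiProject\HopiPredictiveModel.gdb"  # hypothetical path

for site_type in ("Habitation", "RockArt", "Scatter"):
    # Buffer each site type by 200 m and dissolve overlapping buffers to thin the clusters.
    buffers = arcpy.analysis.Buffer(site_type, site_type + "_buf200", "200 Meters",
                                    dissolve_option="ALL")
    # Split the dissolved multipart result into single polygons, then take their centroids
    # as the new, less spatially dependent site points.
    singles = arcpy.management.MultipartToSinglepart(buffers, site_type + "_buf_single")
    arcpy.management.FeatureToPoint(singles, site_type + "_pts", "CENTROID")

    # Convex hull around the original points constrains the random non-site sample.
    hull = arcpy.management.MinimumBoundingGeometry(site_type, site_type + "_hull",
                                                    "CONVEX_HULL", "ALL")
    # Create roughly as many non-sites as sites, no closer than 200 m to one another.
    n_sites = int(arcpy.management.GetCount(site_type + "_pts").getOutput(0))
    arcpy.management.CreateRandomPoints(arcpy.env.workspace, site_type + "_nonsites",
                                        hull, "", n_sites, "200 Meters")
```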
Figure 5-4: Large gap in the mosaicked USGS DEM

Because this null-value gap could introduce many errors when the data were used in raster analysis, it was necessary to replace the null values in the USGS DEM by applying the following formula in the Raster Calculator geoprocessing tool to the mosaicked raster:

Con(IsNull("raster"), FocalStatistics("raster", NbrRectangle(27, 27, "CELL"), "MEAN"), "raster")

where, in the conditional statement, if a raster pixel is a null value, the Focal Statistics tool supplies the mean elevation of the neighboring pixels that are not null. The widest gap in the newly mosaicked USGS DEM was determined, by trial and error, to be 27 pixels in height. By selecting a 27-by-27 neighborhood, the Focal Statistics geoprocessing tool was able to apply the mean elevation of the neighboring pixels to all of the null values and completely fill in the gap.

5.2 Independent Variable Development

Once this problem was successfully addressed, the USGS DEM was ready to be used for the creation of the topographic and water resource independent variables. The topographic independent variables consisted of the following:
- Slope
- North-South Aspect
- East-West Aspect
- Local Elevation Change
- Terrain Texture
- Shelter
- Cost Surface
- Topographic Position Index

These were all created in ArcGIS 10.2 using various geoprocessing tools. Slope is known to be one of the most common explanatory variables used in predicting site locations (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Percent rise in slope was calculated with the Slope tool. Neighboring pixels tended to have very different slope values, typically because pixels fell either on, above, or below the contour lines used to create the DEM (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). In order to smooth this extreme variation in slope values, the Focal Statistics tool was utilized to calculate the mean percent slope within a 100-meter radius [Figure 5-5].

Figure 5-5: The slope variable created, which can influence site location.

Aspect was another variable used in model development [Figures 5-6 and 5-7]. Aspect is important, for example, because a southern slope can have excellent sun exposure or provide great protection from strong winds (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Aspect was created using the Aspect geoprocessing tool with the USGS DEM as the input raster. The inherent problem with aspect as calculated in ArcGIS is that it corresponds to the degrees of a compass: 359º and 1º are quantitatively distinct, although there is very little difference between these directions in reality (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). In order to address this problem, Heilen et al. (2012) rescaled aspect so it ranged from 0º to 180º. By doing this, aspect "was then distributed symmetrically along either a north-south or an east-west axis" (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). This allowed aspects to be generalized in either a north-south or east-west fashion.
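Returning to the gap-filling step above, the same expression can be run outside the Raster Calculator as arcpy map algebra (a sketch; the mosaic and output names are hypothetical):

```python
import arcpy
from arcpy.sa import Con, FocalStatistics, IsNull, NbrRectangle, Raster

arcpy.CheckOutExtension("Spatial")

dem = Raster(r"C:\HopiProject\rasters\hopi_dem_mosaic")  # hypothetical mosaicked DEM

# Where the DEM is NoData, substitute the mean of a 27-by-27 cell neighborhood;
# everywhere else keep the original elevation value.
filled = Con(IsNull(dem),
             FocalStatistics(dem, NbrRectangle(27, 27, "CELL"), "MEAN"),
             dem)
filled.save(r"C:\HopiProject\rasters\hopi_dem_filled")
```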
The following equation in the Raster Calculator was applied to the newly created aspect raster to transform all northerly directions to equal 0º and all southerly directions to equal 180º:

Con(("aspect" < 0), 90, Con(("aspect" > 180), (360 - "aspect"), "aspect"))

where, in the first conditional statement, any aspect value less than 0º was assigned 90º, indicating a pixel trending neither north nor south. In the second conditional statement, any aspect value greater than 180º was subtracted from 360º, which standardized those values by converting them to values less than 180º. For example, 181º was converted to 179º, while a pixel that was already 179º kept that value because it was already less than 180º. The final statement in the algorithm, "aspect", kept all values less than 180º unchanged from the original aspect raster.

The following formula was also applied in the Raster Calculator to generalize all east-west directions:

Con(("aspect" < 0), 90, Con(("aspect" < 90), (90 - "aspect"), Con("aspect" < 270, ("aspect" - 90), (270 - ("aspect" - 180)))))

where, in the first conditional statement, any aspect value less than 0º was assigned 90º, the same as in the previous formula. In the second conditional statement, any value less than 90º was subtracted from 90º; for example, 15º became 75º. In the third conditional statement, 90º was subtracted from all values greater than or equal to 90º and less than 270º; for example, 260º became 170º, which is less than 180º. The last part of the third conditional statement handles all values of 270º or greater by subtracting 180º from the aspect value and then subtracting the result from 270º; for example, 350º became 100º. This formula transformed all easterly directions to fall between 0º and 89.9º and all westerly directions to fall between 90.1º and 180º (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012).

Figure 5-6: The north-south aspect variable, which can influence site location.

Figure 5-7: The east-west aspect variable, which can influence site location.

Local elevation change was the next topographic variable utilized in building models. Local elevation change is a measure of roughness and is important in dictating site location (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). It is considered a variable that affects the ease of travel across landscapes. According to SRI, local elevation change is "the range in elevation within a predefined radius around a raster cell" (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Local elevation change within a 1-kilometer radius was calculated with the Focal Statistics geoprocessing tool and was used in building models [Figures 5-8 and 5-9].

Figure 5-8: The Focal Statistics geoprocessing tool utilized to create the local elevation change variable.

Figure 5-9: The local elevation change variable, which measures surface roughness, an indicator of how easy it is to traverse the landscape.

Terrain texture was another variable used in building models [Figure 5-10]. Terrain texture "is the amount of variability in elevation within a predefined radius" and is considered another way to measure surface roughness (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). This variable was created by calculating the standard deviation of elevation within a 1-kilometer radius with the Focal Statistics geoprocessing tool.
The higher the standard deviation indicated a higher level of surface roughness or terrain texture. 44 Figure 5-10: The terrain texture variable created to measure the amount of variability within a 1-kilometer radius. 45 Shelter was another variable that was found to be important in building models [Figure 5-11]. This variable explains the degree to which the surrounding topographic features offer shelter from the environmental elements (e.g., sun, wind) (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Low-lying areas surrounded by hills are considered to provide a high level of shelter. To create this variable the mean elevation within a 1-kilometer radius was created with the Focal Statistics geoprocessing tool. Shelter was then calculated in the Raster Calculator by dividing “the mean elevation within a specified radius by the local elevation of a given raster cell” (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Features on the landscape with values above 1 indicate more shelter, while features corresponding to values less than 1 represent areas with less shelter. 46 Figure 5-11: The shelter variable created to measure the amount of protection the landscape can provide sites. 47 Topographic Position Index (TPI) was another topographic variable utilized in building models [Figure 5-12]. This variable addressed where a site was located or positioned on the landscape in regard to high points such as ridges and low points such as valley floors. According to Weiss, TPI “compares the elevation of each cell in a DEM to the mean elevation of a specified neighborhood around that cell” (Weiss, 2014). Features on the landscape corresponding to values closer to 1 represent high locations, while features corresponding to values closer to 0 represent lower-lying areas. TPI was calculated by creating a focal minimum elevation raster and a focal maximum elevation raster with the USGS DEM and using the Focal Statistics geoprocessing tool. The following equation was then applied to these newly created rasters in the Raster Calculator to create the TPI variable (Cooley, 2014): (DEM – Focal minimum elevation)/(Focal maximum elevation – Focal minimum elevation) As with the variables previously mentioned, TPI can help explain site location, whether the past inhabitants wanted to settle near valley floors, more exposed areas with better views, or areas with better protection with higher surface roughness. 48 Figure 5-12: The TPI variable created to explain a sites location on the landscape. Cost to traverse was the final topographic variable calculated [Figure 5-13]. This was one of the most important variables utilized in building models because it was the basis for how all the water resource variables were calculated. It can be used to “measure the [cost to traverse] between a given raster cell” (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012) and a resource type (e.g., water). This variable “represents the relative 49 cost of moving across the landscape” (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). The following methodology was utilized to calculate a cost to traverse variable: 1. The mean slope within a 100-meter radius of the DEM and the standard deviation within a 100-meter radius of the same original DEM were calculated with the Focal Statistics geoprocessing tool. 2. A value of 1 was then added to each of these newly created variables with the Raster Calculator. 3. The natural logarithm of each variable was then calculated with the Ln geoprocessing tool 4. 
The two resultant rasters were then added together with the Raster Calculator to create the final cost to traverse independent variable (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012).

This variable addressed both the slope and the ruggedness of a specified radius surrounding a raster cell (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Areas with the steepest slopes have the highest cost to traverse, whereas areas with the lowest slopes have a lower cost to traverse (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012) [Figure 5-13].

Figure 5-13: The cost to traverse variable created to measure the relative difficulty of moving across the landscape.

Water resources were the next category considered when selecting independent variables for predictive modeling. Six variables were created in total:
- Cost to traverse to major stream lines
- Cost to traverse to all streams
- Cost to traverse to springs or seeps
- Cost to traverse to water bodies
- Cost to traverse to minor stream network nodes
- Cost to traverse to major stream network nodes

These were created in ArcGIS 10.2 with various geoprocessing tools. Cost to traverse to major stream lines was created by using NHD Plus stream lines data. Canals and pipelines were removed from this original feature class in order to analyze only naturally occurring water resources. The Strahler stream order classification methodology was implemented to determine the highest-order streams within the reservation (Strahler, 1957). Stream orders 3 through 5 were identified as major streams within the reservation. The cost to traverse to major stream lines variable was then created by using the Cost Distance geoprocessing tool with the major stream lines feature class and the cost to traverse variable described above [Figures 5-14 and 5-15].

Figure 5-14: The Cost Distance geoprocessing tool used to create the cost to traverse to major streams variable.

Figure 5-15: The cost to traverse to major streams variable created to measure the difficulty of traversing the landscape to this water resource.

The cost to traverse to all streams variable was also created by using NHD Plus stream lines data. It was calculated with the Cost Distance geoprocessing tool by using this stream lines feature class and the cost to traverse variable described above [Figure 5-16].

Figure 5-16: The cost to traverse to all streams variable created to measure the difficulty of traversing the landscape to all streams, excluding the major streams.

NHD Plus data were also used for analyzing the cost to traverse to springs and seeps. This variable was created using the Cost Distance geoprocessing tool with the springs and seeps feature class and the cost to traverse variable [Figure 5-17]. The cost to traverse to water bodies and cost to traverse to stream network nodes variables were created by following the same procedure [Figures 5-18 and 5-19]. The major stream network nodes were considered to be "stream junctions that were situated along 'major' stream segments…" (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). The major stream network nodes were extracted by utilizing the Feature Vertices to Points geoprocessing tool, with additional visual inspection to ensure that all necessary stream junction nodes were extracted. The same procedure as above was then followed with the major stream network nodes as the input source data to create the cost to traverse to major stream network nodes variable [Figure 5-20].
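A condensed arcpy sketch of the cost to traverse steps listed above and of one Cost Distance run (input names are hypothetical; step 1 is read here as the mean slope and the standard deviation of elevation, both within 100 meters, so adjust if the standard deviation of slope was intended):

```python
import arcpy
from arcpy.sa import CostDistance, FocalStatistics, Ln, NbrCircle, Raster, Slope

arcpy.CheckOutExtension("Spatial")

dem = r"C:\HopiProject\rasters\hopi_dem_filled"  # hypothetical filled DEM

# Steps 1-4: ln(mean slope + 1) + ln(standard deviation + 1), both within a 100 m radius.
slope = Slope(dem, "PERCENT_RISE")
mean_slope = FocalStatistics(slope, NbrCircle(100, "MAP"), "MEAN")
std_elev = FocalStatistics(Raster(dem), NbrCircle(100, "MAP"), "STD")
cost_surface = Ln(mean_slope + 1) + Ln(std_elev + 1)
cost_surface.save(r"C:\HopiProject\rasters\cost_surface")

# Accumulated cost from every cell to the nearest major stream over that cost surface.
major_streams = r"C:\HopiProject\HopiPredictiveModel.gdb\Hydrography\MajorStreams"
cost_to_major_streams = CostDistance(major_streams, cost_surface)
cost_to_major_streams.save(r"C:\HopiProject\rasters\cost_major_streams")
```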
Figure 5-17: The cost to traverse to springs and seeps variable created to measure the difficulty level to traverse the landscape to this water resource. 55 Figure 5-18: The cost to traverse to water bodies variable created to measure the difficulty level to traverse the landscape to this water resource. 56 Figure 5-19: The cost to traverse to stream network nodes variable created to measure the difficulty of reaching this water resource. 57 Figure 5-20: The cost to traverse to major stream network nodes variable created to determine the cost to traverse to the major stream network nodes. 58 Vegetation richness within 100- and 500-meter radii were the final variables developed for model building. Vegetation richness is the count of vegetation types found in a specified area (McGinley, 2011). It is believed that sites located near multiple vegetation types “also have greater access to a wider variety of resources than sites located in an area of uniform vegetation” (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). The variable was created with the Focal Statistics geoprocessing tool by using a 100- and 500-meter radii and selecting the “Variety” statistics type [Figures 5-21, 5.22, and 5.23]. Figure 5-21: The Focal Statistics geoprocessing tool for creating the vegetation richness variable. 59 Figure 5-22: The vegetation richness variable created to determine the count of plant species within the specified 100-meter radius. 60 Figure 5-23: The vegetation richness variable created to determine the count of plant species with the specified 500-meter radius. 61 5.3 Variable Import The variables then had to be further processed to import the values to SPSS. The Composite Bands geoprocessing tool was used to create a single composite raster of all 16 variables (Figure 5-24). This was done so that the raster variables would be properly aligned when the pixel values were extracted from ArcGIS 10.2 to a DBF file. The Site_Non-Site field was created for each archaeological site type feature class and nonsite feature class. This field for the archaeological site type feature classes was populated with 1 to indicate site presence, and the same field for the non-site feature classes was populated with 0 to indicate site absence. Each archaeological site type feature class was then merged with the non-site feature class that had been created for it, creating three distinct feature classes. The Extract Multi Values to Points geoprocessing tool was utilized for each site type to extract the values from the newly created composite raster that was composed of all the independent variables. Once these values were extracted and saved in individual DBF files per site type, the logistic regression model was ready to be built in SPSS. Figure 5-24: Composite Bands geoprocessing tool 5.4 Testing for Multicollinearity Once the variables were created and ready to be imported into SPSS, it was necessary to test for multicollinearity, or redundancy, among the independent variables. This was done by running the Ordinary Least Squares (OLS) regression geoprocessing tool in ArcGIS 10.2. The three feature classes that were previously created with the Extract Multi Values to Points that contain the independent variable values for each archaeological site type were input into the OLS geoprocessing tool. This tool created an output PDF report that provided the Variable Inflation Factor (VIF) metric for each independent variable in the model. 
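A sketch of the OLS diagnostic run described above, used only to obtain the VIF values (the feature class, unique ID field, and variable names are hypothetical):

```python
import arcpy

arcpy.env.workspace = r"C:\HopiProject\HopiPredictiveModel.gdb"  # hypothetical

# Explanatory variable names mirror the values extracted to the site/non-site points.
explanatory = ("slope;ns_aspect;ew_aspect;local_elev_change;terrain_texture;shelter;"
               "cost_surface;tpi;cost_major_streams;cost_all_streams;cost_seeps_springs;"
               "cost_water_bodies;cost_stream_nodes;cost_major_stream_nodes;"
               "veg_richness_100m;veg_richness_500m")

# The optional PDF report includes the VIF for each explanatory variable.
arcpy.stats.OrdinaryLeastSquares("Habitation_values_to_points", "UID",
                                 "Habitation_OLS", "Site_NonSite", explanatory,
                                 None, None,
                                 r"C:\HopiProject\Habitation_OLS_report.pdf")
```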
According to the ArcGIS Resource Center, if the VIF for an independent variable is above 7.5, variables can be removed until all the independent variables remaining in the model are below 7.5 (Esri, 2013). This was done for all the independent variables for each archaeological site type. This was taken into consideration when building models, but many models produced the best results when the logistic regression model was run with all independent variables.

5.5 Logistic Regression Model Development in SPSS

Once the variable values had been extracted from ArcGIS 10.2 and saved in a DBF file, logistic regression models were run for each site type. Determining the best model for each site type took some time. As discussed previously, the site/non-site type was always the dependent variable, which contained a value of either 1 or 0 to indicate site presence or absence. The independent variables were the topographic, water resource, and vegetation variable values that had been imported from ArcGIS. The Forward Likelihood Ratio and Forced Entry methodologies were utilized in the creation of these models to determine which independent variables had the strongest relationship with the dependent variable [Figure 5-25]. Only variables with a p-value less than .10 were selected for building models. A significance level is the probability of committing a Type I error, or in other words, rejecting a null hypothesis that is, in fact, true. It is common to use either a .10, .05, or .01 significance level for hypothesis testing (English, 2011). In this case, only variables that were significant at a 90% level of confidence (p < .10) were selected. The null hypothesis in this scenario is

bi = 0, for i = 1, …, k

where bi is the estimated coefficient for xi and there are k independent variables. If an independent variable is found to be significant (i.e., p < .10), then the null hypothesis is rejected.

Figure 5-25: Forward Likelihood Ratio methodology.

5.6 Model Validation Tests

The models had to pass the Hosmer-Lemeshow test and the omnibus tests of model coefficients to be considered valid. For the omnibus tests of model coefficients, it is desirable for the models to have p-values less than .001. This test compares the model's variables against a null model, that is, a model with no predictor variables and only a constant. In all models, the significance levels were less than .001; therefore, the models were considered valid. Because the models passed this test, at least one of the predictor variables was significantly related to the dependent variable (Apan et al., 2008). The Hosmer-Lemeshow goodness-of-fit test was then utilized to test the overall fitness of the models [Figure 5-26]. A finding of non-significance indicates the model fits the data; therefore, it is desirable for the models' test statistics to have p-values above .05 (SPSS). All models passed this test.

Figure 5-26: Hosmer-Lemeshow goodness-of-fit test

5.7 Model Assessment

Once the models were tested for validity and fitness, it was also pertinent to assess each model's level of accuracy. This was done by measuring the percentage of predicted values that were correctly classified for each model, by examining the classification tables in the SPSS outputs. The percentage correctly classified was calculated by determining the percentage of non-sites that were correctly classified and the percentage of sites that were correctly classified.
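The models themselves were built in SPSS; as a rough open-source analogue of that step (not the project's actual workflow), the same kind of model and classification table could be produced with pandas and statsmodels, assuming the extracted values were exported to a CSV with a Site_NonSite column and hypothetical variable names:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical export of the extracted values for one site type.
df = pd.read_csv("habitation_values.csv")

y = df["Site_NonSite"]                       # 1 = site, 0 = non-site
X = sm.add_constant(df.drop(columns=["Site_NonSite"]))

model = sm.Logit(y, X).fit()
print(model.summary())                       # coefficients and p-values per variable

# Variables significant at the .10 level, mirroring the screening described above.
significant = [name for name, p in model.pvalues.items()
               if name != "const" and p < 0.10]
print("Significant predictors:", significant)

# Classification table at the .50 cutoff, comparable to the SPSS output.
predicted = (model.predict(X) >= 0.5).astype(int)
print(pd.crosstab(y, predicted, rownames=["Observed"], colnames=["Predicted"]))
print("Percent correctly classified:", round((predicted == y).mean() * 100, 1))
```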
This can be done by determining the number of hits that were correctly classified (Table 2). A hit in this scenario is either a "1-1" or a "0-0", with 1 for site and 0 for non-site (Denis, 2010). For non-sites (0), 79.2% were classified correctly. This was calculated by dividing the number of hits, 76, by the total of hits and non-hits, 96. The same formula was then applied to the sites (1). To determine the overall percentage, the numbers of hits (76 and 81) were summed (157) and divided by the total number of cases (188), which equaled 83.5%. It is also noteworthy to mention when a case is considered correctly classified. SPSS uses .50 as the cutoff value; therefore, if a site is classified as .48, it is considered incorrectly classified (Denis, 2010). By understanding the SPSS classification tables, one can quickly and efficiently determine which models are most accurate and worth considering further for research purposes.

Table 2. Classification table for assessing model accuracy.

Observed             Predicted Non-Site (0)   Predicted Site (1)   Percentage Correct
Non-Site (0)         76                       20                   79.2
Site (1)             11                       81                   88.0
Overall Percentage                                                 83.5

The models were further assessed to determine what percentage of the Hopi Reservation had a probability of .5 or greater of containing the archaeological site type that the specified model was predicting. Because the reservation was so large, it was divided into four sections so that the output of the subsequent geoprocessing would not exceed 2 gigabytes, the maximum allowed file size of a geoprocessing tool output. The reservation feature class was divided into four sections by using the Cut Polygons tool in the Editor toolbar and then utilizing the Extract by Mask geoprocessing tool to clip the predictive surface of the reservation into the four sections. The Raster to Point geoprocessing tool was then utilized to convert every pixel of the predictive surface to a point [Figure 5-27]. Each point contained the probability value of the associated pixel. The percentage of the reservation with a probability of .5 or greater of finding the archaeological site type in the specified model was then easily calculated.

Figure 5-27: The Raster to Point geoprocessing tool used to convert raster pixels to points.

5.8 Predictive Raster Surface Creation

Once the models were built in SPSS, the predictive raster surfaces were created with the Raster Calculator in ArcGIS 10.2. For each archaeological site type, the following formula was implemented in the Raster Calculator to create the predictive surfaces:

1 / (1 + Exp( - (b0 + (b1 * "Independent Raster Variable 1") + (b2 * "Independent Raster Variable 2") + … + (bk * "Independent Raster Variable k"))))

in which b0 is the constant and b1 through bk are the coefficients of the independent variables created in SPSS (Apan et al., 2008). This equation creates a predictive raster surface based on the coefficients, and it produces an output probability surface ranging from 0 to 1, with 1 being the most likely location to find an archaeological site. This makes the final outputs standardized and easy to interpret.

5.9 Summary

Chapter 5, Implementation, discussed the major phases of archaeological site predictive model development. This chapter was written to provide the audience with a detailed understanding of how the models were built.
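As one concrete illustration of the Section 5.8 step, the logistic formula above can also be written as arcpy map algebra (a sketch; the coefficient values and raster names are hypothetical placeholders, not the project's fitted coefficients):

```python
import arcpy
from arcpy.sa import Exp, Raster

arcpy.CheckOutExtension("Spatial")

# Hypothetical constant and coefficients taken from an SPSS output for one site type.
b0, b1, b2 = -1.5, 0.25, -0.02

ew_aspect = Raster(r"C:\HopiProject\rasters\aspect_ew")
terrain_texture = Raster(r"C:\HopiProject\rasters\terrain_texture")

# Logistic transform of the linear predictor yields a 0-to-1 probability surface.
surface = 1.0 / (1.0 + Exp(-(b0 + b1 * ew_aspect + b2 * terrain_texture)))
surface.save(r"C:\HopiProject\rasters\rockart_predictive_surface")
```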
The chapter was divided into eight sections: Data Compilation and Creation Methods, Independent Variable Development, Variable Import, Testing for Multicollinearity, Logistic Regression Model Development in SPSS, Model Validation Tests, Predictive Raster Surface Creation, Model Assessment, and Summary. These topics are crucial to address in order for the client or another interested individual to replicate the process. A large amount of time and effort were devoted to data preparation and independent variable development. Once the data was properly created and organized, a sufficient amount of time was devoted to building the models in SPSS and interpreting the results; although, the majority of the project work was devoted to the initial phases of the project, data preparation and independent variable development. It was beneficial to learn the importance of the initial organization and creation of the data. Good project planning and data organization were crucial for efficiency and project success. 67 Chapter 6 – Results and Analysis To conclude the project, seven predictive models were produced for Rock Art, Habitation, and Scatter archaeological site types. Sixteen independent variables were created to be used in building the models. This chapter discusses model validity by addressing the outcomes of the omnibus tests of model coefficients and the HosmerLemeshow goodness-of-fit test. This chapter also assesses model accuracy by examining the classification tables produced in the SPSS logistic regression output reports. Chapter 6 also reflects on what the final predictive raster surfaces communicate, what was successful in the project, and where there might have been errors. By addressing these areas, the audience will be able to see which models were most useful and which ones could be improved. 6.1 Model Validity When the models were generated in SPSS, it was important to conduct the omnibus tests of model coefficients and the Hosmer-Lemeshow goodness-of-fit test to test for model validity. The omnibus tests of model coefficients are conducted to explain whether the predictor variables are “able to predict the dependent variable better than by chance alone” (Denis, 2010). If the results are statistically significant (p < .001), “the model does better than chance at predicting the dependent variable” (Denis, 2010). These tests indicate whether the model is worth investigating further. The Hosmer-Lemeshow goodness-of-fit test is a commonly used test that examines the overall fitness of the model. The idea behind this test is to see how well the model fits the data. In other words, the Hosmer-Lemeshow test examines whether “the model provides a better fit than a null model with no predictors, [only constants]…. If chi-square goodness-of-fit is not significant (p > .05), then the model has a good fit” (University of Strathclyde, n.d.). Three logistic regression models were created for Rock Art. In the first model, the Forward Likelihood Ratio method was employed to determine the independent variables that were significant at a .10 significance level. These independent variables were eastwest aspect, slope, and terrain texture. For this model, the omnibus test for model coefficients validated that at least one independent variable was able to predict the dependent variable with a p-value of < .001 (Table 3). From this, it could be surmised that at least one independent variable in the model could predict the dependent variable, site/non-site, better than by just chance alone. 
The next model, utilizing the Forced Entry methodology for all significant variables, also passed the omnibus tests of model coefficients (i.e. east-west aspect and all stream nodes). This is verified by the fact that the model test was significant with a p-value of < .001 (Table 3). It could then be surmised that at least one of the predictor variables can adequately predict the dependent variable. The final model for Rock Art also utilized the Forced Entry methodology for only the topographic and vegetation variables. The significant variables in this model were east-west aspect and terrain texture. The omnibus tests for model coefficients also validated that the predictor variables in the model adequately explained the dependent variable. This is verified by the fact that the model test with a p-value of < .001 (Table 3). 68 Table 3. Omnibus tests for model coefficients for Rock Art Models 1, 2, and 3. Model 1 Chi Square df Sig. Step 1 Step Block Model 142.552 142.552 142.552 1 1 1 <.001 <.001 <.001 Step 2 Step Block Model 11.812 154.363 154.363 1 2 2 .001 <.001 <.001 Step 3 Step Block Model 9.826 164.189 164.189 1 3 3 .002 <.001 <.001 Model 2 Chi Square df Sig. Step 1 Step Block Model 172.285 172.285 172.285 16 16 16 <.001 <.001 <.001 Model 3 Chi Square df Sig. Step 1 Step Block Model 165.848 165.848 165.848 10 10 10 <.001 <.001 <.001 The Hosmer-Lemeshow indicated positive results for the first Rock Art model. In this test for the first Rock Art model, the chi-square goodness-of-fit test results were not 69 significant, with a p-value of .373 (Table 4). This suggested that the null hypothesis―the independent variables in the model adequately explain the dependent variable, site/nonsite―is not rejected. Therefore, one could assert that the model adequately fits the data. This test also indicated positive results for the second Rock Art model. In this test for Rock Art, the chi-square goodness-of-fit test results was not significant, with a p-value of .572 (Table 4). This indicated the same results, that the null hypothesis―the independent variables in the model adequately explain the dependent variable, site/non-site―is not rejected. This test, again, indicated positive results for the third Rock Art model. In this test for Rock Art, the chi-square goodness-of-fit test results were not significant, with a pvalue of .490 (Table 4). One can then assert that the model adequately fits the data. Table 4. Hosmer-Lemeshow goodness-of-fit test for Rock Art Models 1, 2, and 3. Model 1 Chi-Square df Sig. Step 1 Step 2 Step 3 29.648 8.231 8.561 8 8 8 .490 .411 .373 Model 2 Chi-Square df Sig Step 1 6.676 8 .572 Model 3 Chi-Square df Sig. Step 1 7.445 8 .490 For the first Scatter site model, the omnibus tests for model coefficients utilizing the Forced Entry methodology for all variables verified that they adequately described the dependent variable. This is substantiated by the fact that omnibus tests of model coefficient tests were significant with a p-value of <.001 (Table 5). From this, it was surmised that the four significant variables (i.e. vegetation richness, shelter, cost, and cost to traverse to water bodies) used in the model can predict the dependent variable better than by just chance. The final Scatter site model utilizing the Forward Likelihood methodology also passed the omnibus tests for model coefficients with a p-value of <.001, indicating that at least one of the independent variables can predict the dependent variable better than by chance alone (Table 5). 
The significant variables at a .90 confidence level were vegetation richness with a 500-meter radius, slope, terrain texture, shelter, and major stream nodes. 70 Table 5. Omnibus tests for model coefficients for Scatter site models 1 and 2. Model 1 Chi Square df Sig. Step 1 Step Block Model 55.007 55.007 55.007 10 10 10 <.001 <.001 <.001 Model 2 Chi Square df Sig. Step 1 Step Block Model 14.019 14.019 14.019 1 1 1 <.001 <.001 <.001 Step 2 Step Block Model 8.417 22.436 22.436 1 2 2 .004 <.001 <.001 Step 3 Step Block Model 6.674 29.110 29.110 1 3 3 .010 <.001 <.001 Step 4 Step Block Model 4.712 33.822 33.822 1 4 4 .030 <.001 <.001 Step 5 Step Block Model 6.413 40.236 40.236 1 5 5 .011 <.001 <.001 71 The Hosmer-Lemeshow goodness-of-fit test indicated positive results. In this test for Scatter, the chi-square goodness-of-fit test results were not significant, with a p-value of .121 (Table 7). The final Scatter site model also passed the Hosmer-Lemeshow test. In this test for Scatter, the chi-square goodness-of-fit test results were not significant, with a p-value of .544 (Table 7). These tests also confirms that the two Scatter site models adequately fit the data. Table 6. Hosmer-Lemeshow goodness-of-fit test for Scatter site models 1 and 2. Model 1 Chi-Square df Sig. Step 1 12.755 8 .121 Model 2 Chi-Square df Sig. Step 1 Step 2 Step 3 Step 4 Step 5 4.736 4.083 4.820 9.905 6.923 5 8 8 8 8 .449 .850 .777 .272 .544 The first Habitation site model utilized the Forced Entry methodology for all explanatory variables, but only included the independent variables that did not have a Variance Inflation Factor (VIF) higher than 7.5. Independent variables with a VIF above 7.5 are considered to be redundant and could possibly eliminated from the model (Esri, 2013). The significant variables in this model were vegetation richness with a 500-meter radius, slope, shelter, cost surface, cost to traverse to water bodies, and cost to traverse to all stream nodes. For this model, omnibus tests for model coefficients had positive results for with a p < .001, indicating that at least one of independent variables can predict the dependent variable better than by chance alone (Table 8). The Forced Entry methodology was also used for the final Habitation site model, but only included topographic and vegetation. The significant variables for this model were slope, shelter, cost surface, cost to traverse to water bodies, and cost to traverse to major stream nodes. This model also passed the omnibus test for model coefficients with a p-value < .001, indicating a valid model (Table 8). 72 Table 7. The omnibus tests for model coefficients for Habitation site models 1 and 2. Model 1 Chi-Square Df Sig. Step 1 Step Block Model 88.665 88.665 88.665 13 13 13 <.001 <.001 <.001 Model 2 Chi-Square Df Sig. Step 1 Step Block Model 98.649 98.649 98.649 16 16 16 <.001 <.001 <.001 The first Habitation site model passed the Hosmer-Lemeshow goodness-of-fit test. In this test for the Habitation site model, the chi-square goodness-of-fit test results were not significant, with a p-value of .835 (Table 9). The final Habitation site model also passed the Hosmer-Lemeshow test. In this test for the Habitation site model, the chi-square goodness-of-fit test results were not significant, with a p-value of .110 (Table 9). By passing this test, it is confirmed the two Habitation models are valid models. Table 8. The Hosmer-Lemeshow goodness-of fit test for Habitation site models 1 and 2. Model 1 Chi-Square df Sig. 
Step 1 4.241 8 .835 Model 2 Chi-square df Sig. Step 1 13.066 8 .110 6.2 Model Assessment For the first Rock Art model, which was created utilizing the Forward Likelihood methodology for all predictor variables, 89.4% of the sites and non-sites were correctly 73 classified (Table 9). The significant variables at a 90 % confidence level were slope, eastwest aspect, and terrain texture. The coefficients were .147, .029, and .064, respectively. This was one of the most accurate models based on this metric and also based on visual inspection [Figure 6-1]. The results from this model outputs explain that topography was the primary driver behind the model. For the second Rock Art model which was created utilizing the Forced Entry methodology for all predictor variables, 92% of the sites and non-sites were correctly classified (Table 9). The significant variables in this model were only east-west aspect and cost to traverse to all stream nodes. The coefficients were .033 and .001, respectively. Although, this model had the highest percentage correctly classified, there were only two significant variables in the model and current Rock Art sites did not align very well with the predictive surface [Figure 6-2]. This should not be the primary indicator of what Rock Art model is most accurate, but these are good indicators as to which models are most worthy to further explore. For the third Rock Art model which was created utilizing the Forced Entry methodology for only topographic and vegetation explanatory variables, 91% of the sites and non-sites were correctly classified (Table 9). The two significant variables in this model were only east-west aspect and terrain texture. The coefficients were .031 and .077, respectively. This model did have a very high percentage correctly classified and the current Rock Art sites appeared to align very well with where the predictive surface created from this model indicated a high likelihood of Rock Art sites [Figure 6-3]. This model was driven by only a few topographic variables, east-west aspect and terrain texture, but appeared to indicate a high likelihood of finding Rock Art sites in areas that appeared logical. These indicators can help determine which model is the most accurate, but further field investigation is recommended to truly determine which model is most accurate. 74 Table 9. Classification table for Rock Art logistic regression models 1, 2, and 3. Model 1 Predicted Observed Non-Sites Sites Percentage Correct0 1 Step 1 Sites 1 9 83 90.2 Non-Sites 0 86 10 89.6 Percentage 89.9 Step 2 Site 1 87 83 90.2 Non-Site 0 9 9 90.6 Percentage 90.4 Step 3 Site 1 11 81 88.0 Non-Site 0 87 9 90.6 Final Percentage 89.4 Model 2 Step 1 Site 1 7 85 92.4 Non-Site 0 88 8 91.7 Percentage 92.0 Model 3 Site 1 8 84 91.3 Non-Site 0 87 9 90.6 Final Percentage 91.0 75 Figure 6-1: Distribution of current Rock Art sites on the predictive raster surface utilizing slope, east-west aspect, and terrain texture predictive variables. 76 Figure 6-2: Distribution of current Rock Art sites on the predictive Rock Art surface utilizing east-west aspect and cost to traverse to all stream nodes predictive variables. 77 Figure 6-3: Distribution of current Rock Art sites on the predictive Rock Art surface utilizing east-west aspect and terrain texture explanatory variables. 78 For the first Scatter site model, which was created utilizing the Forced Entry methodology for all explanatory variables, 64.7% of the sites and non-sites were correctly classified (Table 10). 
For the first Scatter site model, which was created utilizing the Forced Entry methodology for all explanatory variables, 64.7% of the sites and non-sites were correctly classified (Table 10). The significant variables in this model were vegetation richness with a 500-meter radius, shelter, cost surface, and cost to traverse to water bodies, with coefficients of .199, -37.01, .46, and -.0000157, respectively. This model had the highest percent correctly classified of the Scatter site models, but that figure was still relatively low compared to the Rock Art site models. The significant variables that contributed to this predictive surface came from all three variable categories: topography, vegetation, and water resources [Figure 6-4]. Although this model scored the highest for Scatter sites, further investigation of the Scatter site model is warranted to determine where it could be improved.

For the final Scatter site model, which was created utilizing the Forward Likelihood methodology for all explanatory variables, 61.6% of the sites and non-sites were correctly classified (Table 10). The significant variables in this model were vegetation richness with a 500-meter radius, slope, terrain texture, shelter, and cost to traverse to major stream nodes, with coefficients of .224, -.051, .031, -33.318, and .0000358, respectively. This model scored the lowest percent correctly classified of the Scatter site models. The significant variables that contributed to this predictive surface also came from all three variable categories: topography, vegetation, and water resources [Figure 6-5]. Like the previous Scatter site model, it warrants further investigation to determine where it could be improved.

Table 10. Classification table for the Scatter site models 1 and 2.

                              Predicted
Model 1              Non-Sites (0)   Sites (1)   Percentage Correct
Step 1   Sites (1)            62         156           71.6
         Non-Sites (0)       119          88           57.5
         Overall percentage                            64.7

Model 2
Step 1   Sites (1)            60         158           72.5
         Non-Sites (0)        93         114           44.9
         Overall percentage                            59.1
Step 2   Sites (1)            84         134           61.5
         Non-Sites (0)       114          93           55.1
         Overall percentage                            58.4
Step 3   Sites (1)            82         136           62.4
         Non-Sites (0)       115          92           55.6
         Overall percentage                            59.1
Step 4   Sites (1)            65         153           70.2
         Non-Sites (0)       115          92           55.6
         Overall percentage                            63.1
Step 5   Sites (1)            76         142           65.1
         Non-Sites (0)       120          87           58.0
         Overall percentage                            61.6

Figure 6-4: Distribution of current Scatter sites on the predictive surface utilizing the vegetation richness with a 500-meter radius, shelter, cost surface, and cost to traverse to water bodies explanatory variables.

Figure 6-5: Distribution of current Scatter sites on the predictive surface utilizing the vegetation richness with a 500-meter radius, slope, shelter, and cost to traverse to major stream nodes explanatory variables.
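Coefficients such as these are what turn a fitted logistic regression model into a predictive surface: each significant variable raster is multiplied by its coefficient, the constant is added, and the result is passed through the logistic function. The project performed this step in the ArcGIS 10.2 Raster Calculator; the lines below are an equivalent sketch using Python and the Spatial Analyst extension, with hypothetical raster names, workspace paths, and coefficient values rather than the study's actual ones.

# Minimal sketch: combine logistic regression coefficients into a probability
# surface with map algebra. Raster names, paths, and coefficients are hypothetical.
import arcpy
from arcpy.sa import Raster, Exp

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\project\predictors.gdb"  # hypothetical workspace

veg_rich = Raster("veg_richness_500m")
shelter = Raster("shelter")
cost_h2o = Raster("cost_to_water_bodies")

# Hypothetical constant and coefficients taken from the SPSS output.
b0, b1, b2, b3 = -1.5, 0.20, -37.0, -0.0000157

linear = b0 + b1 * veg_rich + b2 * shelter + b3 * cost_h2o
probability = 1.0 / (1.0 + Exp(-1.0 * linear))  # logistic (inverse logit) transform

probability.save(r"C:\project\outputs.gdb\scatter_probability")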
For the first Habitation site model, which was created utilizing the Forced Entry methodology, 68.6% of the sites and non-sites were correctly classified (Table 11). The only variables not utilized in this model were those with a Variance Inflation Factor above 7.5, which indicates multicollinearity; these were cost to traverse to major streams, cost to traverse to major stream nodes, and terrain texture. The significant variables in this model were vegetation richness with a 500-meter radius, slope, shelter, cost surface, cost to traverse to water bodies, and cost to traverse to all stream nodes, with coefficients of .131, -.112, -58.913, 1.029, -.0000224, and -.00001486, respectively. Although this model indicated a relatively low percent correctly classified, it was noteworthy that the model indicated high site likelihood on mesa tops [Figure 6-6]. This was most likely influenced by the negatively weighted shelter explanatory variable (-58.913). The indication appears logical because many of the current Habitation sites are located on top of mesas. Even though the percent correctly classified was relatively low, it is worth investigating this model further, possibly through fieldwork, to test how useful it is and whether the significant explanatory variables truly indicate where settlements tended to be located.

For the final Habitation site model, which was also created utilizing the Forced Entry methodology, 68.6% of the sites and non-sites were correctly classified (Table 11). The significant variables in this model were slope, shelter, cost surface, cost to traverse to water bodies, and cost to traverse to major stream nodes, with coefficients of -.114, -60.694, 1.07, -.0000206, and -.0000154, respectively. This model also scored a low percent correctly classified, but it was noteworthy that the model indicated high likelihood near the major stream junctions. This indication appears logical because many of the current Habitation sites are also located along major streams [Figure 6-7]. Like the previous model, even though the percent correctly classified was relatively low, it is worth investigating this model further to test how useful it is and whether the significant explanatory variables truly indicate where settlements tended to be located.

Table 11. Classification table for the Habitation site models 1 and 2.

                              Predicted
Model 1              Non-Sites (0)   Sites (1)   Percentage Correct
Step 1   Sites (1)            62         191           75.5
         Non-Sites (0)       128          84           60.4
         Overall percentage                            68.6

Model 2
Step 1   Sites (1)            65         188           74.3
         Non-Sites (0)       131          81           61.9
         Overall percentage                            68.6

Figure 6-6: Distribution of current Habitation sites on the predictive surface utilizing the vegetation richness with a 500-meter radius, slope, shelter, cost to traverse to water bodies, and cost to traverse to all stream nodes explanatory variables.

Figure 6-7: Distribution of current Habitation sites on the predictive surface utilizing the slope, shelter, cost surface, cost to traverse to water bodies, and cost to traverse to major stream nodes explanatory variables.

The final assessment of each model was to determine the percentage of each model's predictive surface that contained a probability of 0.5 or greater. This was done for all seven models. For the first Rock Art site model, 11.8% of the reservation had a probability of 0.5 or greater of containing a Rock Art site. Rock Art models 2 and 3 contained 0.5% and 0.02% of the reservation, respectively. The average percentage for all three models was 4.1% (Table 12).

Table 12. Percentage of each Rock Art site model's predictive surface with a probability of 0.5 or greater of finding a Rock Art site.

Rock Art Site Models    Percentage of Reservation with Probability of 0.5 or Greater
Model 1                 11.8
Model 2                  0.5
Model 3                  0.0237
Average                  4.1
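These reservation-wide percentages can be computed directly from a predictive surface by counting the cells at or above the 0.5 threshold. The lines below are a minimal sketch of that tally; the raster path and NoData placeholder are hypothetical.

# Minimal sketch: percentage of a predictive surface with probability >= 0.5.
# The raster path is hypothetical; NoData cells are excluded from the count.
import arcpy
import numpy as np

surface = r"C:\project\outputs.gdb\rockart_probability"  # hypothetical path
NODATA = -9999.0

values = arcpy.RasterToNumPyArray(surface, nodata_to_value=NODATA)
valid = values[values != NODATA]

pct_high = 100.0 * np.sum(valid >= 0.5) / valid.size
print("%.1f%% of the reservation has a probability of 0.5 or greater" % pct_high)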
For the Scatter site models, a much larger share of the reservation had a probability of 0.5 or greater of containing a Scatter site: 89.6% for model 1 and 46.16% for model 2, for an average of 67.88% (Table 13).

Table 13. Percentage of each Scatter site model's predictive surface with a probability of 0.5 or greater of finding a Scatter site.

Scatter Site Models    Percentage of Reservation with Probability of 0.5 or Greater
Model 1                89.6
Model 2                46.16
Average                67.88

For the Habitation site models, the percentage of each model's predictive surface with a probability of 0.5 or greater was also considerably higher than for the Rock Art models. Habitation site models 1 and 2 contained 27.16% and 3.7%, respectively, for an average of 15.47% (Table 14).

Table 14. Percentage of each Habitation site model's predictive surface with a probability of 0.5 or greater of finding a Habitation site.

Habitation Site Models    Percentage of Reservation with Probability of 0.5 or Greater
Model 1                   27.16
Model 2                    3.7
Average                   15.47

6.3 Summary

In conclusion, each model produced markedly different outputs, which was expected given the different archaeological site types. Statistically, the Rock Art site type models performed the best of the three site types. The Rock Art model that appeared to be the most successful had three topographic variables that were significant at a 90% confidence level. This is intuitive because Rock Art is expected in areas of steeper slopes, where larger rock surfaces are likely to be found (e.g., large boulders and canyon walls). The Scatter site type model is probably the model that needs the most refinement. In discussion with the client, this archaeological site type may need to be divided into more specific site types (Bernardini, personal communication, July 24, 2014). Many of the Scatter site locations lie very close to the Habitation sites and follow the same clustering pattern [Figure 6-5, Figure 6-6]. This could be one of the reasons this model did not perform as well. Further investigation, or a reassessment of how this site type could be categorized into more specific site types, is necessary.

The Habitation site type models were the final models created. Although the percent correctly classified was relatively low for these models, they both provided interesting results. The two Habitation models received exactly the same percent correctly classified (68.6%) but were driven by slightly different variables that produced different results. The first Habitation model appeared to be driven more by topographic variables. It was notable that some of the areas of highest likelihood were located on the mesa tops. This was probably due to the heavily negatively weighted shelter variable, which is logical because mesa tops provide very little shelter from the elements. Conversely, the second Habitation model indicated that the areas with the highest likelihood of site locations were associated with water resources; many of these areas were in close proximity to the major streams or their junctions. These two model outputs warrant further investigation of their validity and accuracy.

Chapter 7 – Conclusions and Future Work

The central goal of the project was to develop predictive models for the Rock Art, Scatter, and Habitation archaeological site types. This was accomplished by creating 16 independent raster variables in ArcGIS 10.2. These variables were developed under three categories: topography, water resources, and vegetation.
These independent variables were created to assist in determining the possible locations of the three archaeological site types. The dependent variables, the sites and non-sites, were also developed in ArcGIS 10.2 in order to run the logistic regression models in SPSS. Once the independent and dependent variables were created in ArcGIS, logistic regression models were built in SPSS using various methods to select the variables that were significant at a 90% confidence level. Three models were then selected once they passed the Hosmer-Lemeshow goodness-of-fit test and the omnibus tests of model coefficients. The coefficients and constants from the selected models were then entered into the final algorithm used in the ArcGIS 10.2 Raster Calculator, which produced the predictive surfaces for the archaeological site types. All models were assessed for accuracy by examining the classification tables in SPSS, which report the percentage of non-sites and the percentage of sites that were correctly classified. A point was considered correctly classified if it was a site with a predicted probability above .5 or a non-site with a predicted probability below .5 (Denis, 2010). All models indicated a level of predictive capacity, but there are components of the model-building process that could be improved. The following sections address three areas that might improve model success: soils variable development, historical-period resources variable development, and the Random Forests statistical approach.

7.1 Soils Variable

It is quite possible that introducing soils variables would improve the models. Multiple soils variables were utilized by SRI in building their predictive models. SRI staff observed that many sites were located on the edges of soil-mapping units (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Because of this, they used a combination of soils data from the Soil Survey Geographic Database (SSURGO) and the State Soil Geographic Database (STATSGO). This might have to be done at the state level because not all soils in every state have been mapped at the same level of detail. Some of the variables SRI developed were cost to traverse to soil-texture boundary, standard deviation to soil-texture boundary, standard deviation in soil-texture index, and range in organic-matter content (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Other variables SRI created included the following:

• Available water capacity (inches of water per inch of soil profile)
• Bulk density (grams per cubic centimeter)
• Calcareousness (percent calcium carbonate)
• Cation-exchange capacity (molar equivalent per 100 g)
• Electrical conductivity (decisiemens per meter)
• Organic matter (percent organic matter by mass)

These were the majority of the soil variables that SRI considered when building their predictive models. In discussion with one of the principal investigators of the SRI project, Michael Heilen communicated that soils could definitely be worth investigating for improved model performance (Heilen, personal communication, March 5, 2014).

7.2 Historical-Period Resources Variables

Historical-period resources are another set of variables that could be considered. These include transportation routes, such as trails, wagon roads, and railroads (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012).
These could have been important in predictive modeling "because much of the US West was dependent on the use of transportation for exploration and population migration as well as the redistribution for goods and materials" (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). Because of this, it could be useful to create variables for cost to traverse to historical-period resources, cost to traverse to historical-period transportation routes, and cost distance to [water] tanks. These data can be found in the USGS Geographic Names Information System database.

7.3 Random Forests Statistical Approach

In this study, logistic regression was used to build the predictive models. In the SRI study, a relatively new statistical method called Random Forests was used instead. According to Heilen et al. (2012), "this approach samples both cases [dependent variable] and model [independent] variables during the course of the model development and tests model performance with samples not used to train the model." This points out a limitation of logistic regression and might be worth considering for anyone interested in improving model performance. It was also noted that the Random Forests method is "robust in overfitting from intercorrelations from variables" (Heilen, Leckman, Byrd, Homburg, & Heckman, 2012). This is another important consideration because many of the independent variables were derived from a DEM, which may have caused considerable intercorrelation among the variables. These are a few of the reasons why the Random Forests statistical approach to modeling might be a better fit.
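As an illustration of what this alternative might look like outside of SPSS, the lines below sketch a Random Forests classifier fit with scikit-learn and evaluated on its out-of-bag samples. The file name, column names, and parameter values are hypothetical, and this is not a description of SRI's actual implementation.

# Minimal sketch: a Random Forests alternative to the logistic regression models.
# File, column names, and parameters are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("habitation_sites_nonsites.csv")  # hypothetical sites/non-sites table
y = data["site"]                                      # 1 = site, 0 = non-site
X = data.drop(columns=["site"])                       # the independent raster variables

# oob_score=True evaluates the model on samples not used to train each tree,
# which parallels the internal testing described by Heilen et al. (2012).
forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

print("Out-of-bag accuracy:", round(forest.oob_score_, 3))
for name, importance in sorted(zip(X.columns, forest.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(name, round(importance, 3))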
7.4 Summary

In conclusion, there are always strategies to consider for improving model performance. The development of soils and historical-period resources variables could improve the models, and the Random Forests statistical approach was proposed as another overarching strategy that may improve performance. In the models built here, topographic variables were found to be the most important, and vegetation richness with a 500-meter radius was also of primary importance in building the Scatter site model. Overall, the Rock Art and Habitation models were the most successful and of the most use to the client.

Works Cited

Apan, A., Wells, N., Reardon-Smith, K. R., McDougall, K., & Basnet, B. (2008, July). Predictive Mapping of Blackberry in the Condamine Catchment Using Logistic Regression and Spatial Analysis. Retrieved from http://eprints.usq.edu.au/4812/1/Apan_Wells_Reardon-Smith_Richardson_McDougall_Basnet.pdf

Cooley, S. (2014). GIS 4 Geomorphology. Retrieved from http://gis4geomorphology.com/roughness-topographic-position/

Denis, D. (2010). Binary Logistic Regression Using SPSS. Retrieved from University of Montana Department of Psychology: http://psychweb.psy.umt.edu/denis/datadecision/binary_logistic_spss/index.html

English, R. (2011, July 29). Chapter 13 - Hypothesis Testing. Retrieved from San Diego State University, Rohan Academic Computing: http://www-rohan.sdsu.edu/~renglish/470/notes/chapt13/chapter13.htm

Esri. (2011). How Principal Component Analysis Works. Retrieved from ArcGIS Resource Center: http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/How_Principal_Components_works/009z000000qm000000/

Esri. (2012). Principal Components (Spatial Analyst). Retrieved from http://resources.arcgis.com/en/help/main/10.1/index.html#//009z000000pq000000

Esri. (2013). Interpreting OLS Results. Retrieved from ArcGIS Resources: http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000030000000

Fattah, M. A. (2010). Multicollinearity. Retrieved from Central Michigan University, College of Humanities, Social, and Behavioral Sciences: http://www.chsbs.cmich.edu/fattah/courses/empirical/multicollinearity.html

Field, A. (2005). Discovering Statistics Using SPSS. London, England.

Fish, P. R. (2013, September 18). Archaeological Sites. Retrieved from Learning Center of the American Southwest: http://www.southwestlearning.org/topics/archeological-sites

Goodchild, M. (1992). Geographic Information Systems Science. International Journal of Geographic Information Systems, 33.

Heilen, M., Leckman, P. O., Byrd, A., Homburg, J. A., & Heckman, R. A. (2012). Archaeological Sensitivity Modeling in Southern New Mexico: Automated Tools and Models for Planning and Management. Albuquerque: Statistical Research, Inc.

Hosmer, D., & Lemeshow, S. (2013). Applied Logistic Regression. Hoboken, NJ: John Wiley & Sons.

Kohler, T., & Parker, S. (1986). Predictive Models for Archaeological Resource Locations. Advances in Archaeological Methodology and Theory, 9, 397-452.

Kvamme, K. (1988, December). Quantifying the Present and Predicting the Past: Theory, Method, and Application of Archaeological Predictive Modeling. Denver: U.S. Government Printing Office. Retrieved from http://www.blm.gov/pgdata/etc/medialib/blm/wo/Planning_and_Renewable_Resources/coop_agencies/cr_publications.Par.58402.File.dat/quantifying_the_present.pdf

Leathwick, J. (2000, November). Predictive Models of Archaeological Site Distributions in New Zealand. Wellington: New Zealand Department of Conservation. Retrieved from http://www.doc.govt.nz/documents/science-and-technical/ir181.pdf

McGinley, M. (2011, September 10). Species Richness. Retrieved from The Encyclopedia of Earth: http://www.eoearth.org/view/article/156216/

NetMBA. (2010). Scatter Plot. Retrieved from NetMBA: http://www.netmba.com/statistics/plot/scatter/

Sonoma State University. (2008). Cultural Resource Management. Retrieved from http://www.sonoma.edu/asc/aboutus/crm.htm

SPSS. (n.d.). Help Topics.

Strahler, S. M. (1957). Quantitative Analysis of Watershed Geomorphology. Burlington, Vermont, United States.

U.S. EPA. (n.d.). National Hydrography Dataset Plus. Retrieved from NHDPlus Version 2: http://www.horizon-systems.com/nhdplus/NHDPlusV1_CO.php

University of Strathclyde. (n.d.). Goodness of Fit Measures. Retrieved from University of Strathclyde, Humanities and Social Sciences: http://www.strath.ac.uk/aer/materials/5furtherquantitativeresearchdesignandanalysis/unit6/goodnessoffitmeasures/

USGS. (2011, August 3). National Gap Analysis Program (GAP) Land Cover Data Portal. Retrieved from National Gap Analysis Program (GAP), Core Science Analytics and Synthesis: http://gapanalysis.usgs.gov/gaplandcover/

USGS. (2014, April 15). National Elevation Dataset. Retrieved from USGS: Science for a Changing World: http://ned.usgs.gov/

Weiss, A. D. (2014, June 29). Topographic Position and Landforms Analysis. Retrieved from Jenness Enterprises: http://www.jennessent.com/downloads/tpi-poster-tnc_18x22.pdf
Winemiller, D. T. (2014). GIS Reveals Basis for Ancient Settlement Location. ArcNews. Redlands, California, United States: Esri.