Big Data and GIS
Michael F. Goodchild
University of California Santa Barbara

Two sections
• Data
  – Big Data, new geospatial sources
• Software
  – the Cloud, GIS as platform

Overlapping ideas
• Data-driven science
  – an abundance of data
  – complex questions and problems
  – lack of faith in simple theories
    • all the simple discoveries have been made
    • the end of theory
  – the value of successful predictions
• Big Data
  – using multiple sources of data
  – not rigorously sampled
  – little quality control

Volume
• Yes
  – Landsat since the 1970s: order exabytes
  – order 10^4 surveillance cameras in London
    • capturing order 10^6 faces for recognition and tracking
    • at 10 Hz
    • order 10^7 faces or 10^12 bytes per second to be analyzed, identified, and linked into tracks
    • a petabyte computing problem
• But we divide and conquer
  – Landsat analysis by scene
  – partition the surveillance task into regions
  – we also aggregate, abstract, generalize

Environmental sensors in science: the US National Science Foundation’s NEON project: neoninc.org

Underwater observatories: Monterey Bay Aquarium Research Institute: mbari.org

Mapping with kites, balloons, and micro-drones
http://www.instructables.com/id/Mapping-Microbes/#step1

Velocity
• Big Data is also Fast Data
  – real-time sensing
    • submeter images available in near-real time
    • sensor networks
• Why might this be important?
  – science is typically slow, leisurely
  – a discovery that is only true now, and no longer true next week, is not of interest
    • and the same applies to location
  – fast data is not likely to advance science
    • at least as we currently understand science

A broader view of science
• GIScience
  – the study of the fundamental properties of geographic information
    • the knowledge that is implemented in GIS
• The advancement of science in other domains through the application of GIScience knowledge
• The development of methods
  – that can be used to make discoveries
  – that can be used to solve real-world problems
    • that may be time-critical

Tweets about wildfire in the San Diego area by Ming Tsou
http://www.engineering.sdsu.edu/sdsu_newscenter/news.aspx?s=75219

Variety
• Data from disparate sources
  – little or no quality control
  – no systematic experimental design
    • no recognizable sampling scheme
  – little or no documented provenance
• Perhaps a fourth V: veracity or validity?
  – or the lack of it

“The Geography of NFL Fandom”
“This map displays Facebook fans of NFL teams across the United States. Each county is color-coded based on which official team page has the most 'Likes' from people who live in that county.” (Facebook)
http://www.theatlantic.com/technology/archive/2014/09/the-geography-of-nfl-fandom/379729/

Loving County, TX
• Loving County is a county in the U.S. state of Texas. As of the 2010 census, the population was 82, making it the least populous county in the United States. Owing partly to its small and dispersed population, it also has the highest median per capita and household income of any county in Texas. Loving County has no incorporated communities; its county seat and only community is Mentone.

The value of Big Data
• H.J. Miller and M.F. Goodchild (2014) Data-driven geography. GeoJournal. DOI: 10.1007/s10708-014-9602-6.
• Hard science:
  – rigorous experimental design, hypothesis-driven, formal analysis of data quality
• Soft science:
  – data used for exploration, hypothesis generation, practical problem-solving
  – no attempt at formal sampling
    • no generalization from samples to populations
• Does Big Data have more value for soft science?

The successes of Big Data
• Predictions
  – the winner of the Eurovision Song Contest
  – the stock market
  – outcomes of elections
• Early warning from social media
  – flu outbreaks
  – wildfires

Spatial prediction
• Predicting where (and when)
  – something will happen
  – some condition will exist
• Where will this hurricane make landfall?
• Where is this criminal most likely to live?
• What is my best route given traffic conditions?
• Where is a new school most needed?

Hits     Source
595673   Jesusita Fire (Ethan)
188308   SBC Jesusita Fire Santa Barbara, CA (Robert O'Connor - fire news blog)
89214    Jesusita Fire Map (Randy - Independent.com)
67525    Jesusita Fire in Santa Barbara - LA Times map (Los Angeles Times)
27777    Map of burned homes in Santa Barbara (Los Angeles Times)
26330    Jesusita Fire Evacuation Areas: Approximation (COSB)
25454    Santa Barbara 'Jesusita Fire' (ABC7 Eyewitness News)
19592    Jesusita Fire - Santa Barbara (lanewspace)
2446     Santa Barbara Damaged Homes 2008 (Los Angeles Times, note: mapped for comparison with Jesusita)
2048     Jesusita Fire (longhairedhippy)
1314     Santa Barbara Fire Evacuation (Gary)
962      Jesusita Fire in Santa Barbara (ABC30 Action News)
788      Wildfire ~ Santa Barbara (Buffalo)
505      Closure map - Jesusita Fire in Santa Barbara (Los Angeles Times)
461      Untitled (Matthew, note: discovered via google.com.mx)
396      Jesusita Fire Structure Damage (Paul Bartsch)

Neogeography
• Citizens as both users and producers of geospatial data
  – volunteered geographic information
  – crowdsourcing
  – multiple sources, little quality control, not rigorously sampled
• Maps for the individual
  – user-centered
  – transitory
  – the view from the ground
  – delivered on small devices

www.openstreetmap.org

http://www.directrelief.org/Flash/HaitiShipments/Index.html

nationalmap.gov

“Small” data
• Rigorously sampled to provide representation
• Quality control throughout
• Abundant metadata
• Generalizable results
• Census data

Big data
• The Three Vs
  – volume, variety, velocity
• No sampling scheme
  – samples often self-identified
• No generalizability
• No metadata
• No quality control
  – though quality can be excellent
  – more up-to-date

Traditional geographic information
• Maps, atlases, globes
  – highly synthesized, compiled, abstracted
  – often rich quantitative attributes
  – using sometimes millions of observations
  – no memory of observation provenance

Hidden synthesis
• By experts in traditional authoritative production of geographic information
  – the process by which observations are synthesized into statements about points, polylines, or polygons is typically hidden
  – no provenance

The new synthesis
• Software
• Actions of volunteers
• Disparate purported facts of varying quality and reliability
• How to harden?

The crowd solution
• Linus’s Law
  – the more eyes to review, the more accurate
  – works for popular facts
• Corroborating reports
  – but how do we know the information is the same?
  – distinct georeferences
  – distinct descriptions
• Geographic facts may be obscure
  – little-known areas of the world
    • or not so obscure

The social solution
• Who can be trusted?
• A hierarchy of moderators and gate-keepers
  – all volunteered facts referred up the hierarchy
• A social structure
  – promotion based on track record
  – heavy, accurate contributors promoted
  – e.g., Wikipedia, OSM
  – top levels of Google MapMaker reserved for Google staff

The geographic solution
• How can we know if a purported geographic fact is false?
  – because it violates the rules by which the geographic world is constructed
  – the syntactic rules
  – compare language rules, the sentence structure of English
• What are those rules?
  – essential, fundamental geographic knowledge

www.flickr.com
Angle larger than 30 degrees
bergonia.org

Formalizing geographic knowledge
• To enable automated triage of asserted facts
• To enable automated synthesis
  – into products that more closely resemble the traditional ones

GIS Is Transforming Into a Platform
Integrating Software and Services
[Timeline figure, 1960’s to 2010’s: Mainframe, Minicomputer, Workstation, Desktop, Client/Server, Web Server, Cloud/Device; labels: Pervasive, Enterprise]

This New Platform Connects and Leverages Existing GIS Investments
Providing Mapping and GIS As a Service To the Entire Organization
[Figure labels: Server, Desktop, Professional GIS, Executive Access, Public Engagement, Enterprise Integration, Works Anywhere, Knowledge Workers, GIS Servers]

This Platform Leverages Many Trends
Pervasive Geographic Understanding
Enabling Pervasive Access

Integrating Traditional GIS with a Whole New World of Apps
Enabling New User Experiences

The Platform Integrates All Types of Geospatial Information
Using Web Maps to Normalize the Information . . .
[Figure labels: Imagery, DBMS, Services, Sensor Networks, Big Data, Spreadsheets, Maps, Social Media]

Web Maps Are Fundamental
Providing a New Medium for Organizing and Publishing
Making Geographic Information Available Anywhere
[Figure labels: Desktops, Tablets, Smartphones, Websites, Any Device, Browsers, Social Media, Distributed Services, Web Map]
Supporting Visualization, Query, Editing, and Analysis

This Platform Transforms Organizations
Breaking Down Barriers Between Workflows, Disciplines, and Cultures
Enabling Collaboration, Sharing, and Holistic Approaches

GIS Is at a Major Turning Point
Becoming a Platform Enabling Wide Scale Access and Use of GIS
[Figure labels: Public Engagement, Executive Access, Knowledge Workers, Professional GIS, Enterprise Integration, Works Anywhere]

Summary
• Big Data is relevant to GIS:
  – in the soft stages of science
  – in solving time-critical problems
  – in spatial prediction
• Big Data requires a change of scientific perspective
  – science driven by data rather than theory
  – all the data, not just the best data
  – prediction as a legitimate activity

Summary (2)
• We need to develop ways to harden Big Data
  – at electronic speed
• Synthesis may be the most important activity in GIScience in the future
• GIS is becoming a platform
  – an integrated set of Cloud services
  – ubiquitous access across all devices
  – making it easy to develop new applications
  – but with many open questions about privacy, data management
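
Illustration: hardening by corroboration
One way to picture "hardening" at electronic speed is the corroboration idea from the crowd solution: a volunteered report gains weight when independent sources report something nearby. The Python sketch below is only an illustration under stated assumptions, not a method from the talk. The record layout (source, lat, lon), the 1 km radius, the threshold of two corroborating sources, and the coordinates are all hypothetical. It also sidesteps the hard question raised earlier, namely how to tell whether two reports with distinct georeferences and descriptions refer to the same fact; here proximity alone stands in for that judgment.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def corroboration_counts(reports, radius_km=1.0):
    """For each report, count the distinct *other* sources within radius_km.

    reports: list of (source, lat, lon) tuples (hypothetical record layout).
    radius_km: assumed distance within which two reports are treated
               as describing the same event.
    """
    counts = []
    for i, (src_i, lat_i, lon_i) in enumerate(reports):
        supporters = {
            src_j
            for j, (src_j, lat_j, lon_j) in enumerate(reports)
            if j != i
            and src_j != src_i
            and haversine_km(lat_i, lon_i, lat_j, lon_j) <= radius_km
        }
        counts.append(len(supporters))
    return counts


# Made-up reports near Santa Barbara, for illustration only.
reports = [
    ("alice", 34.4600, -119.7200),
    ("bob",   34.4610, -119.7210),
    ("carol", 34.4595, -119.7190),
    ("dave",  34.9000, -120.4000),  # far from the other three
]

counts = corroboration_counts(reports)

# Keep reports backed by at least two other sources (assumed threshold).
hardened = [r for r, c in zip(reports, counts) if c >= 2]
```

With these inputs, the three clustered reports corroborate one another and the distant one is left unconfirmed. The pairwise comparison is quadratic in the number of reports, so a real stream would call for the divide-and-conquer partitioning described under Volume.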