Mining in geographic data
Original slides: Raymond J. Mooney, University of Texas at Austin

What is Learning?
• Herbert Simon: “Learning is any process by which a system improves performance from experience.”
• What is the task?
  – Classification
  – Problem solving / planning / control

Classification
• Assign object/event to one of a given finite set of categories.
  – Medical diagnosis
  – Credit card applications or transactions
  – Fraud detection in e-commerce
  – Worm detection in network packets
  – Spam filtering in email
  – Recommended articles in a newspaper
  – Recommended books, movies, music, or jokes
  – Financial investments
  – DNA sequences
  – Spoken words
  – Handwritten letters
  – Astronomical images

Measuring Performance
• Classification accuracy
• Solution correctness
• Solution quality (length, efficiency)
• Speed of performance

Why Study Machine Learning? Engineering Better Computing Systems
• Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (knowledge engineering bottleneck).
• Develop systems that can automatically adapt and customize themselves to individual users.
  – Personalized news or mail filter
  – Personalized tutoring
• Discover new knowledge from large databases (data mining).
  – Market basket analysis (e.g. diapers and beer)
  – Medical text mining (e.g. migraines to calcium channel blockers to magnesium)

Why Study Machine Learning? The Time is Ripe
• Many basic effective and efficient algorithms available.
• Large amounts of on-line data available.
• Large amounts of computational resources available.
Related Disciplines
• Artificial Intelligence
• Data Mining
• Probability and Statistics
• Information theory
• Numerical optimization
• Computational complexity theory
• Control theory (adaptive)
• Psychology (developmental, cognitive)
• Neurobiology
• Linguistics
• Philosophy

Defining the Learning Task
Improve on task, T, with respect to performance metric, P, based on experience, E.

Designing a Learning System
• Choose the training experience.
• Choose exactly what is to be learned, i.e. the target function.
• Choose how to represent the target function.
• Choose a learning algorithm to infer the target function from the experience.

Training vs. Test Distribution
• Generally assume that the training and test examples are independently drawn from the same overall distribution of data.
  – IID: Independently and identically distributed

Choosing a Target Function
• What function is to be learned and how will it be used by the performance system?
• For checkers, assume we are given a function for generating the legal moves for a given board position and want to decide the best move.
  – Could learn a function: ChooseMove(board, legal-moves) → best-move
  – Or could learn an evaluation function, V(board) → R, that gives each board position a score for how favorable it is. V can be used to pick a move by applying each legal move, scoring the resulting board position, and choosing the move that results in the highest scoring board position.

Ideal Definition of V(b)
• If b is a final winning board, then V(b) = 100
• If b is a final losing board, then V(b) = -100
• If b is a final draw board, then V(b) = 0
• Otherwise, V(b) = V(b′), where b′ is the highest scoring final board position that is achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally as well).
  – Can be computed using complete minimax search of the finite game tree.
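
A minimal sketch of the ideal definition above, using a toy take-away game instead of checkers so that the complete game tree is small enough to search. The game itself (remove 1 to 3 stones, whoever takes the last stone wins) and the helper names `legal_moves` and `choose_move` are assumptions made for illustration. This game has no draws, so the V(b) = 0 case never arises.

```python
from functools import lru_cache

WIN, LOSS = 100, -100  # final-board values; this toy game cannot end in a draw


def legal_moves(stones):
    """Hypothetical move generator: a player may remove 1, 2, or 3 stones."""
    return [k for k in (1, 2, 3) if k <= stones]


@lru_cache(maxsize=None)
def V(stones, our_turn):
    """Ideal value of a position, found by complete minimax search."""
    if stones == 0:
        # The player who just moved took the last stone and won.
        return LOSS if our_turn else WIN
    values = [V(stones - k, not our_turn) for k in legal_moves(stones)]
    # We maximise on our turn; an optimal opponent minimises on theirs.
    return max(values) if our_turn else min(values)


def choose_move(stones):
    """Apply each legal move, score the resulting position, keep the best."""
    return max(legal_moves(stones), key=lambda k: V(stones - k, False))
```

With a pile of 5, `choose_move(5)` removes one stone, which leaves the opponent to move from a pile of 4, a position that scores +100 for us. The same exhaustive recursion applied to checkers would have to visit the whole game tree, which is why it serves only as a definition and not as a practical procedure.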
Approximating V(b)
• Computing V(b) is intractable since it involves searching the complete exponential game tree.
• Therefore, this definition is said to be non-operational.
• An operational definition can be computed in reasonable (polynomial) time.
• Need to learn an operational approximation to the ideal evaluation function.

Representing the Target Function
• Target function can be represented in many ways: lookup table, symbolic rules, numerical function, neural network.
• There is a trade-off between the expressiveness of a representation and the ease of learning.
• The more expressive a representation, the better it will be at approximating an arbitrary function; however, the more examples will be needed to learn an accurate function.

Lessons Learned about Learning
• Learning can be viewed as using direct or indirect experience to approximate a chosen target function.
• Function approximation can be viewed as a search through a space of hypotheses (representations of functions) for one that best fits a set of training data.
• Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques.
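
One way to picture "search through a space of hypotheses" is the sketch below. It assumes the hypothesis space is the set of linear functions of two board features, and it uses gradient descent on squared error (the LMS weight update) as the search technique. Both choices are illustrative and not prescribed by the slides at this point; the features (own pieces, opponent pieces) and the training values are made up.

```python
def predict(weights, features):
    """Linear hypothesis: a weighted sum of the board features."""
    return sum(w * x for w, x in zip(weights, features))


def mean_squared_error(weights, examples):
    """How well a hypothesis fits the training data (lower is better)."""
    return sum((t - predict(weights, f)) ** 2 for f, t in examples) / len(examples)


def lms_fit(examples, n_features, rate=0.01, epochs=1000):
    """Search weight space with the LMS rule: w_i += rate * error * x_i."""
    weights = [0.0] * n_features  # starting hypothesis: V-hat(b) = 0 everywhere
    for _ in range(epochs):
        for features, target in examples:
            error = target - predict(weights, features)
            for i, x in enumerate(features):
                weights[i] += rate * error * x
    return weights


# Made-up training data: (own pieces, opponent pieces) -> training value.
examples = [
    ((3, 0), 30.0),
    ((0, 3), -30.0),
    ((2, 1), 10.0),
    ((1, 2), -10.0),
    ((2, 2), 0.0),
]

initial_error = mean_squared_error([0.0, 0.0], examples)
weights = lms_fit(examples, n_features=2)
final_error = mean_squared_error(weights, examples)
print(weights, initial_error, final_error)
```

Each weight vector is one hypothesis, and every update moves to a nearby hypothesis that fits the current example slightly better. A more expressive representation would enlarge the space being searched, which is the trade-off against ease of learning noted above.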
Various Function Representations
• Numerical functions
  – Linear regression
  – Neural networks
  – Support vector machines
• Symbolic functions
  – Decision trees
  – Rules in propositional logic
  – Rules in first-order predicate logic
• Instance-based functions
  – Nearest-neighbor
  – Case-based
• Probabilistic Graphical Models
  – Naïve Bayes
  – Bayesian networks
  – Hidden Markov Models (HMMs)
  – Probabilistic Context Free Grammars (PCFGs)
  – Markov networks

Various Search Algorithms
• Gradient descent
  – Perceptron
  – Backpropagation
• Dynamic Programming
  – HMM Learning
  – PCFG Learning
• Divide and Conquer
  – Decision tree induction
  – Rule learning
• Evolutionary Computation
  – Genetic Algorithms (GAs)
  – Genetic Programming (GP)
  – Neuro-evolution

Evaluation of Learning Systems
• Experimental
  – Conduct controlled cross-validation experiments to compare various methods on a variety of benchmark datasets.
  – Gather data on their performance, e.g. test accuracy, training time, testing time.
  – Analyze differences for statistical significance.
• Theoretical
  – Analyze algorithms mathematically and prove theorems about their:
    • Computational complexity
    • Ability to fit training data
    • Sample complexity (number of training examples needed to learn an accurate function)

History of Machine Learning
• 1950s:
  – Samuel’s checkers player
  – Selfridge’s Pandemonium
• 1960s:
  – Neural networks: Perceptron
  – Pattern recognition
  – Learning in the limit theory
  – Minsky and Papert prove limitations of Perceptron
• 1970s:
  – Symbolic concept induction
  – Winston’s arch learner
  – Expert systems and the knowledge acquisition bottleneck
  – Quinlan’s ID3
  – Michalski’s AQ and soybean diagnosis
  – Scientific discovery with BACON
  – Mathematical discovery with AM

History of Machine Learning (cont.)
• 1980s:
  – Advanced decision tree and rule learning
  – Explanation-based Learning (EBL)
  – Learning and planning and problem solving
  – Utility problem
  – Analogy
  – Cognitive architectures
  – Resurgence of neural networks (connectionism, backpropagation)
  – Valiant’s PAC Learning Theory
  – Focus on experimental methodology
• 1990s:
  – Data mining
  – Adaptive software agents and web applications
  – Text learning
  – Reinforcement learning (RL)
  – Inductive Logic Programming (ILP)
  – Ensembles: Bagging, Boosting, and Stacking
  – Bayes Net learning

History of Machine Learning (cont.)
• 2000s:
  – Support vector machines
  – Kernel methods
  – Graphical models
  – Statistical relational learning
  – Transfer learning
  – Sequence labeling
  – Collective classification and structured outputs
  – Computer Systems Applications
    • Compilers
    • Debugging
    • Graphics
    • Security (intrusion, virus, and worm detection)
  – Email management
  – Personalized assistants that learn
  – Learning in robotics and vision