Samuel's Checkers Player

Definition
Samuel's Checkers Player is the first machine learning system that received public recognition. It pioneered many important ideas in game playing and machine learning, and the two main papers describing the research (Samuel, 1959, 1967) became landmark papers in Artificial Intelligence. In one game, the resulting program was able to beat one of America's best players of the time.

Description of the Learning System
Samuel's checkers player featured a wide variety of learning techniques. First, it remembered positions that it frequently encountered during play. This simple form of rote learning allowed it to save time and to search deeper in subsequent games whenever a stored position was encountered on the board or in some line of calculation.

Next, it featured the first successful application of what is now known as Reinforcement Learning for tuning the weights of its evaluation function. The program trained itself by playing against a stable copy of itself. After each move, the weights of the evaluation function were adjusted in a way that moved the evaluation of the root position after a quiescence search closer to the evaluation of the root position after searching several moves deep. This technique is a variant of what is nowadays known as Temporal-Difference Learning and is commonly used in successful game-playing programs.

Samuel's program not only tuned the weights of the evaluation function but also employed on-line Feature Selection, constructing the evaluation function from the terms that seemed most significant for evaluating the current board situation. Feature Construction was recognized as the key problem that still needed to be solved. Later, Samuel changed his evaluation function from a linear combination of terms into a structure that closely resembled a multi-layer Neural Network.
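The weight-adjustment scheme just described can be sketched in modern temporal-difference terms. The features, feature values, and learning rate below are hypothetical stand-ins for illustration, not Samuel's actual evaluation terms or coefficients.

```python
# TD(0)-flavored sketch of Samuel-style tuning: nudge the weights of a
# linear evaluation so its shallow estimate moves toward the value found
# by a deeper search. All numbers are illustrative.

def td_update(weights, features, shallow_eval, deep_eval, alpha=0.01):
    """Move the shallow evaluation toward the deeper search result."""
    error = deep_eval - shallow_eval
    return [w + alpha * error * f for w, f in zip(weights, features)]

weights = [0.5, -0.2, 0.1]   # e.g., piece advantage, mobility, king count
features = [1.0, 3.0, 0.0]   # feature values of the current position
shallow = sum(w * f for w, f in zip(weights, features))  # quiescence estimate
deep = 0.4                   # value returned by a deeper search
new_weights = td_update(weights, features, shallow, deep)
```

Because the deep value (0.4) exceeds the shallow estimate, each weight is moved in proportion to its feature value so the shallow evaluation would come out higher next time.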
This network-like structure was trained with Preference Learning from several thousand positions from master games.

Cross References
Machine Learning and Game Playing

Recommended Reading
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. II – Recent progress. IBM Journal of Research and Development, 11(6), 601–617.

Saturation → Bottom Clause
SDP → Symbolic Dynamic Programming
Search Bias → Learning as Search

Claude Sammut & Geoffrey I. Webb (eds.), Encyclopedia of Machine Learning, © Springer Science+Business Media LLC

Search Engines: Applications of ML
Eric Martin
University of New South Wales, Sydney, NSW, Australia

Definition
Search engines provide users with Internet resources – links to web sites, documents, text snippets, images, videos, etc. – in response to queries. They use techniques that are part of the field of information retrieval, and rely on statistical and pattern matching methods. Search engines have to take into account many key aspects and requirements of this specific instance of the information retrieval problem. First, they have to be able to process hundreds of millions of searches a day and answer queries in a matter of milliseconds. Second, the resources on the World Wide Web are constantly updated, with information being continuously added, removed, or changed – a substantial fraction of the overall contents changing each week – in a pool consisting of billions of documents. Third, users express possibly semantically complex queries in a language with limited expressive power, and often make no use, or improper use, of the available syntactic features of that language – for instance, the boolean or operator occurs in only a small fraction of queries.
Motivation and Background
Web searching is technically initiated by sending a query to a search engine, but the whole search process starts earlier, in the mind of the person who conducts the search. To be successful, the process needs to provide users with words, text snippets, images, or movies that fulfill the users' quest for information. Thus, though a search is technically the implementation of a procedure that maps a query to some digital material, it spans a larger spectrum of activities, from a psychological trigger to a psychological reward. For a given set of digital material that, if provided, would be deemed perfectly satisfactory by a number of users looking for the same information, different users will issue different queries. This might be because they have varying skills at conveying what they are after in the form of a few words. It might be because their understanding of the technology prompts them to formulate what they are after in a form that, rightly or wrongly, they consider appropriate for a computing device to process. Whatever the reasons, they all point to the fact that the quality of the search is determined not by its adequacy to the query, but by its adequacy to the psychological trigger that produced the query. This makes Web searching an especially challenging and exciting area in the field of information retrieval.

In Broder (2002), it is suggested that web queries can be classified into three classes.

Navigational queries expect the search to return a particular URL. For instance, http://www.cityrail.info is probably the expected result to the query Cityrail for a Sydneysider.

Transactional queries expect the search to return links to sites that offer further interaction, for example for online shopping or to download music. For instance, http://www.magickeys.com/books/, where books for young children are available for download, is probably a good result to the query children stories.
Informational queries expect the search to reveal the information itself, that is, the correct answer to a question. This information can be immediately provided in the page where the results of the search are displayed, for instance, Bern for the query capital of switzerland. Or it can be provided in the pages accessible from the first links returned by the search, as for instance Italy, which is easily found in the web page accessed from the first link returned in response to the query football world champion 1982. Answering an informational query with the information itself, rather than with links to documents where the information is to be found, is one of the most difficult challenges that search engine developers have started addressing.

Structure of the Learning System
The general structure of a search engine can be illustrated as follows:

[Figure: the query pipeline – parsing, matching against an indexed repository, ranking, postprocessing, and postfiltering – from the user's query to the returned results.]

A string matching algorithm is applied to the parsed query issued by the user and to an indexed representation of a set of documents, resulting in a ranked subset of the latter. This ranked set of documents can be subjected to a postprocessing procedure whose aim is to improve the results, either by refining the query or by analyzing the documents further, possibly over many iterations, until the results stabilize and can be returned to the user, following a postfiltering procedure to display the information appropriately.

Retrieval Methods
The existence of hyperlinks between documents distinguishes search engines from other information retrieval applications. All techniques developed in the field of information retrieval are potentially relevant for extracting information from the web, but web search also benefits from a proper analysis of the cross-reference structure.
That is, to measure the degree of relevance of a document to a given query, one can take advantage of a prior ranking of all documents that is independent of that query or any other, following a sophisticated version of the PageRank (Page, Brin, Motwani, & Winograd, 1999) link analysis algorithm. One of the simplest versions of the algorithm recursively defines the PageRank PR(T) of a page T pointed to by pages T1, ..., Tn, which themselves point to c1, ..., cn pages, respectively, as

PR(T) = (1 − d)/N + d (PR(T1)/c1 + ⋯ + PR(Tn)/cn),

where N is the total number of pages and d, a damping factor, represents the probability that a user decides to follow a link rather than randomly visit another page. Normalizing the solution so that the PageRanks of all pages add up to 1, PR(T) then represents the probability that a user visits T by clicking on a link.

Boolean retrieval is one of the simplest methods to retrieve a set of documents that exactly match a query expressed as a boolean combination of keywords. The match is facilitated by using an inverted file indexing structure, which associates every possible keyword with links to the documents in which it occurs. If extra information is kept on the occurrences of keywords in documents (number of occurrences, part of the document in which they occur, font size and font type used for their display, etc.), then the results can also be ranked. But best match models, as opposed to exact match models, are better suited to producing ranked results. The vector space model is one of the earliest and most studied models of this kind. It represents documents and queries as vectors over a space each of whose dimensions represents a possible keyword, recording for each keyword whether it occurs at least once in the query and in the document, respectively. It measures the similarity between the vectors q⃗ and d⃗ as the cosine of the angle they form, namely q⃗·d⃗ / (∥q⃗∥ ∥d⃗∥), which is all the closer to 1 the more query and document have in common.
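The simple PageRank recurrence above can be sketched as a power iteration over a toy link graph. The three-page graph and iteration count are illustrative; d = 0.85 is the damping value discussed in the original report.

```python
# Power-iteration sketch of the simple PageRank recurrence on a toy
# three-page graph (hypothetical links; no dangling pages).

def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start from a uniform ranking
    for _ in range(iters):
        pr = {p: (1 - d) / n
                 + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
              for p in pages}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # every page has out-links
ranks = pagerank(links)
```

Since every page here has outgoing links, the ranks stay normalized and converge to the stationary visiting probabilities; page C, which receives links from both A and B, ends up ranked highest.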
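A minimal end-to-end sketch of cosine-similarity ranking in the vector space model, using the tf-idf weighting that the next paragraph introduces. The three-document corpus is made up, and a real engine would use an inverted index rather than scanning documents.

```python
# Toy vector-space ranking: tf-idf vectors plus cosine similarity.
import math

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
vocab = sorted({w for d in docs for w in d.split()})
N = len(docs)

def tfidf(text):
    """tf-idf vector of a text over the corpus vocabulary."""
    words = text.split()
    vec = []
    for term in vocab:
        c = words.count(term)                              # term frequency
        df = sum(1 for d in docs if term in d.split())     # document frequency
        vec.append(c * math.log(N / df) if df else 0.0)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_vec = tfidf("cat mat")
scores = [cosine(query_vec, tfidf(d)) for d in docs]
```

The first document shares both query terms and ranks highest; the third shares none (its exact tokens differ) and scores zero, illustrating both the ranking behavior and the vocabulary-mismatch problem discussed below.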
The term-frequency-inverse-document-frequency (tf-idf) model refines the encoding given by d⃗ by replacing a value of 1 in the ith dimension, indicating the existence of an occurrence of the ith keyword in the document, with c log(N/c′), where c is the number of occurrences of the ith keyword in the document, N is the total number of documents, and c′ is the number of documents in the whole collection that contain at least one occurrence of the ith keyword; so more importance is given to keywords that occur more often and that occur "almost exclusively" in the document under consideration.

One of the most obvious issues with this approach is that the number of dimensions is huge and the vectors are sparse. Another important issue is that the set of vectors determined by the set of keywords is not orthogonal, and not even linearly independent, because two given keywords can be synonyms (sick and ill), not semantically related (garlic and manifold), or more or less semantically related (wheel and tire).

The extended vector space model (Robertson, Walker, & Beaulieu, b) addresses this issue by assuming that the similarity between two keywords is captured by the symmetric difference between the set of documents that contain one keyword and the set of documents that contain the other, ranging from identical sets (similar keywords) to disjoint sets (unrelated keywords). Let D1, ..., DN′ be an enumeration of the equivalence classes over the set of all documents, under the relation that makes two documents equivalent if they contain precisely the same keywords (so N′ is at most equal to N, the number of documents in the whole collection). Conceive of an N′-dimensional vector space S of which D1, ..., DN′ is a basis. Associate the ith keyword with the vector v⃗i of S defined as

(1/√(w1² + ⋯ + wN′²)) (w1, ..., wN′),

where for all nonzero k ≤ N′, wk is the number of occurrences of the ith keyword in all documents that belong to class Dk.
Then associate a document with the vector d⃗ of S defined as α1v⃗1 + ⋯ + αN″v⃗N″, where N″ is the number of keywords and for all nonzero k ≤ N″, αk is the number of occurrences of the kth keyword in that document; and associate a query with the vector q⃗ of S defined as β1v⃗1 + ⋯ + βN″v⃗N″, where for all nonzero k ≤ N″, βk is equal to 1 if the kth keyword occurs in the query, and to 0 otherwise. The similarity between q⃗ and d⃗ is then measured as described for the simple vector space method.

The topic-based vector space model (Becker & Kuropka, 2003) also replaces the original vector space with a different vector space of a different dimension, addressing the issue of nonorthogonality between keywords thanks to fundamental topics, assumed to be pairwise independent and obtained using ontologies; the fundamental topics then provide the vector basis, and each keyword is expressed as a linear combination of them. So the topic-based vector space model conceives of the meaning of words as the semantic relationships that emerge from the common use of a language by the members of a given community, whereas the extended vector space model conceives of the meaning of words as the syntactic relationship of term co-occurrence with respect to the repository of documents being processed.

Probabilistic retrieval frameworks aim at estimating the probability that a given document is relevant to a given query. Given a keyword w, denote by p⁺w the probability that w occurs in a relevant document, and denote by p⁻w the probability that w occurs in an irrelevant document. Many probabilistic retrieval frameworks then define the relevance of a document to a query as follows, where w1, ..., wn are the keywords that occur both in the query and in the document:

∑_{i=1}^{n} log( (p⁺wi (1 − p⁻wi)) / (p⁻wi (1 − p⁺wi)) ).
This quantity increases as the document contains more words that are likely to occur in relevant documents, and more words that are unlikely to occur in irrelevant documents. Different frameworks suggest different ways to estimate the values of p⁺wi and p⁻wi. For instance, p⁺wi is sometimes assumed to be constant and p⁻wi defined as ni/N, where N is the total number of documents and ni the number of documents in which wi occurs, capturing the fact that a document containing a keyword appearing in few other documents is likely to be relevant to that keyword; in this case the previous formula can be rewritten as

c ∑_{i=1}^{n} log( (N − ni)/ni )

for some constant c. More sophisticated methods have been developed to better estimate the probabilities, such as the Okapi weighting document score (Robertson, Walker, & Beaulieu, a), which defines the relevance of a document to a query as

∑_{i=1}^{n} log( (N − ni + 0.5)/(ni + 0.5) ) × ( (k1 + 1)ci / (k1((1 − b) + b(l/β)) + ci) ) × ( (k3 + 1)di / (k3 + di) ),

where the notation is as above, with the addition of ci to denote the number of occurrences of wi in the document, di to denote the number of occurrences of wi in the query, l to denote the number of bytes in the document, β to denote the average number of bytes in a document, and b, k1, and k3 to denote constants.

Query Classification
The development of effective methods of information retrieval from web resources requires a good understanding of users' needs and practice. In Markey (2007a), the following questions are identified as being especially relevant towards gaining such an understanding.

▸ What characterizes the queries that end users submit to online IR systems? What search features do people use? What features would enable them to improve on the retrievals they have in hand? What features are hardly ever used? What do end users do in response to the system's retrievals?
That article indicates that many of the basic features of information retrieval systems are poorly used. For instance, only small fractions of queries make use of the and, or, and not boolean operators, and fewer still use enclosing quotes; the wrong syntax is often used, resulting in incorrect use of advanced search features in one third of the cases; and only a small fraction of queries take advantage of relevance feedback. Based on those findings, the second part of the article (Markey, 2007b) suggests two dozen new research questions for researchers in information retrieval, while noting that most users are satisfied with their search experience.

Evaluating search satisfaction has received a lot of attention. In Fox, Karnawat, Mydland, Dumais, and White (2005), both explicit and implicit measures of satisfaction are collected. Explicit measures are obtained by prompting the user to evaluate a search result as satisfying, partially satisfying, or not satisfying, and similarly to evaluate the satisfaction gained from a whole search session. Implicit measures are obtained by recording mouse and keyboard actions, time spent on a page, scrolling actions and durations, number of visits to a page, position of a page in the results list, number of queries submitted, number of results visited, etc. A Bayesian model can be used to infer the relationships between explicit and implicit measures of satisfaction. The study reports on two Bayesian networks that were built to predict satisfaction for individual page visits and satisfaction for entire search sessions – with respect to the feedback obtained from both kinds of prompts – with evidence that a combination of well-chosen implicit satisfaction measures can be a good predictor of explicit satisfaction.
Referring to the categorization of web queries in Broder (2002) as user goals, it is proposed in Lee, Liu, and Cho (2005) to build click distributions by sorting the results to a query according to the number of clicks they received from all users; highly skewed distributions should correspond to navigational queries, while flat distributions should correspond to informational queries. The same kind of considerations are also applied to anchor-link distributions, the anchor-link distribution of a query being defined as the function that maps a URL to the number of times that URL is the destination of an anchor that has the same text as the query.

Finer techniques of query classification are proposed in Beitzel, Jensen, Lewis, Chowdhury, and Frieder (2007), where a rule-based automatic classifier is produced from selectional preferences. A query consisting of at least two keywords is split into a head x and a tail y, and then converted into a forward pair (x, u) and a backward pair (u, y), where u represents a category, that is, a generic term that refers to a list of semantically related words in a thesaurus. For instance, the query "interest rate" can (only) be split into (interest, rate) and converted to the forward pair (interest, personal finance), where "personal finance" denotes the list consisting of the terms "banks," "rates," "savings," etc.; so the first keyword – "interest" – provides context for the second one. Given a large query log, the maximum likelihood estimate (MLE) of P(u | x), the probability that a query decomposed as (x, z) is such that z belongs to category u, is defined as the quotient of the number of queries in the log that have (x, u) as a forward pair by the number of queries in the log that can be decomposed as (x, z). This allows one to write a forward rule of the form "x → classified as u with weight p," where p is the MLE of P(u | x), provided that the selectional preference strength of x is above some given threshold.
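The forward-rule MLE just defined can be sketched over a toy query log: P(u | x) is estimated as the fraction of logged queries with head x whose tail falls in category u. The thesaurus and the query log below are made up for illustration.

```python
# MLE of P(category | head) from a hypothetical query log.
from collections import Counter

categories = {
    "personal finance": {"rate", "rates", "savings", "loan"},
    "science": {"theory", "equation"},
}

def category_of(word):
    for name, terms in categories.items():
        if word in terms:
            return name
    return None

query_log = [
    "interest rate", "interest rates", "interest savings",
    "interest theory", "mortgage rate",
]

head_counts = Counter()   # queries decomposable with head x
pair_counts = Counter()   # queries with forward pair (x, u)
for q in query_log:
    words = q.split()
    head, tail = words[0], words[-1]
    head_counts[head] += 1
    u = category_of(tail)
    if u:
        pair_counts[(head, u)] += 1

# MLE of P(personal finance | interest) over the toy log
p = pair_counts[("interest", "personal finance")] / head_counts["interest"]
```

Here three of the four "interest ..." queries have a personal-finance tail, so the forward rule "interest → classified as personal finance" would carry weight 0.75.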
The rule can then be applied to incoming queries, such as "interest only loan," by matching a final or initial segment of the query – depending on whether forward or backward rules are under consideration – and suggesting possible classifications; with the running example, "interest only loan" would be classified as "personal finance with weight p" if a forward rule of the form "interest → classified as personal finance with weight p" had been discovered. Such a classification can then be used to rewrite the query, or to send it to an appropriate database back-end if many domain-specific databases are available.

Cross References
Bayesian Methods
Classification
Covariance Matrix
Rule Learning
Text Mining

Recommended Reading
Becker, J., & Kuropka, D. (2003). Topic-based vector space model. In W. Abramowicz & G. Klein (Eds.), Proceedings of the sixth international conference on business information systems. Colorado Springs, CO.
Beitzel, S. M., Jensen, E. C., Lewis, D. D., Chowdhury, A., & Frieder, O. (2007). Automatic classification of web queries using very large unlabeled query logs. ACM Transactions on Information Systems, 25(2).
Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2), 3–10.
Fox, S., Karnawat, K., Mydland, M., Dumais, S., & White, T. (2005). Evaluating implicit measures to improve web search. ACM Transactions on Information Systems, 23(2), 147–168.
Lee, U., Liu, Z., & Cho, J. (2005). Automatic identification of user goals in web search. In Proceedings of the 14th international conference on World Wide Web (WWW '05). New York: ACM Press.
Markey, K. (2007a). Twenty-five years of end-user searching, part 1: Research findings. Journal of the American Society for Information Science and Technology, 58(8).
Markey, K. (2007b). Twenty-five years of end-user searching, part 2: Future research directions.
Journal of the American Society for Information Science and Technology, 58(8).
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Technical report. Stanford, CA: Stanford University.
Robertson, S. E., Walker, S., & Beaulieu, M. (a). Okapi at TREC: Automatic ad hoc, filtering, VLC and filtering tracks. In E. Voorhees & D. Harman (Eds.), Proceedings of the seventh text retrieval conference. Gaithersburg, MD.
Robertson, S. E., Walker, S., & Beaulieu, M. (b). On modeling of information retrieval concepts in vector spaces. ACM Transactions on Database Systems.

Self-Organizing Feature Maps → Self-Organizing Maps

Self-Organizing Maps
Samuel Kaski
Helsinki University of Technology, Finland

Synonyms
Kohonen maps; Self-organizing feature maps; SOM

Definition
The self-organizing map (SOM), or Kohonen Map, is a computational data analysis method which produces nonlinear mappings of data to lower dimensions. Alternatively, the SOM can be viewed as a clustering algorithm which produces a set of clusters organized on a regular grid. The roots of the SOM are in neural computation (see neural networks); it has been used as an abstract model for the formation of ordered maps of brain functions, such as sensory feature maps. Several variants have been proposed, ranging from dynamic models to Bayesian variants. The SOM has been used widely as an engineering tool for data analysis, process monitoring, and information visualization, in numerous application areas.

Motivation and Background
The SOM (Kohonen, 1982, 2001) was originally introduced in the context of modeling how the spatial organization of brain functions forms.
The formation of feature detectors selective to certain sensory inputs, such as orientation-selective visual neurons, had earlier been modeled by competitive learning in neural networks, and some models of how the feature detectors become spatially ordered had been published (von der Malsburg, 1973). The SOM introduced an adaptation kernel or neighborhood function that governs the adaptation in such networks: while in plain competitive learning only the winning neuron that best matches the inputs adapts, in the SOM all neurons within a local neighborhood of the winner learn. The neighborhood is determined by the neighborhood function. The SOM is an algorithm for computing such ordered mappings.

While some of the motivation of the SOM comes from neural computation, its main uses have been as a practical data analysis method. The SOM can be viewed as a topographic vector quantizer, a nonlinear projection method, or a clustering method. In particular, it is a clustering-type algorithm that orders the clusters. Alternatively, it is a nonlinear projection-type algorithm that clusters, or more specifically quantizes, the data. The SOM was very popular in the 1990s and still is; it is intuitively relatively easy to understand, yet hard to analyze thoroughly. It connects many research traditions and works well in practice. An impressive set of variants has been published over the years, of which probabilistic variants (e.g., Bishop, Svensén, & Williams (1998) and Heskes (2001)) are perhaps closest to the current mainstream of machine learning. While there are currently excellent alternative choices for many of the specific tasks SOMs have been applied to over the years, even the basic SOM algorithm is still viable as a versatile engineering tool in data analysis tasks.

Structure of the Learning System
The SOM consists of a regular grid of nodes (Fig. 1). A model of the data has been attached to each node. For vector-valued data x = [x1, ..., xd]^T, the models are vectors in the same space; the model at the ith node is mi = [mi1, ..., mid]. The models define a mapping from the grid to the data space. The coordinates on the grid are uniquely determined by the index i of a node, and the model mi gives the location in the data space. The whole grid becomes mapped into an "elastic net" in the data space. While being a mapping from the grid to the input space, the SOM defines a projection from the input space to the discrete grid locations as well: each data point is projected to the node having the closest model.

The original online SOM algorithm updates the model vectors toward the current input vector at time t:

mi(t + 1) = mi(t) + hci(t)(x(t) − mi(t)).

Here c is the index of the unit having the closest model vector to x(t), and hci(t) is the neighborhood function or adaptation kernel. The kernel is a decreasing function of the distance between the units i and c on the grid; it forces neighboring units to adapt toward similar input samples. The height and width of h are decreasing functions of time t. In an iteration over time and over the different inputs, the model vectors become ordered and specialize to represent different regions of the input space.

The online version of K-means clustering is a special case of the SOM learning rule, in which only the closest model vector is adapted; that is, the neighborhood function is hci(t) = α(t) for i = c and hci(t) = 0 otherwise, where α(t) is the adaptation coefficient, a decreasing scalar. In short, K-means and the SOM use the prototypes in the same way, but in the SOM the prototypes have an inherent order that stems from fixing them onto a grid and updating the prototypes to represent both the data mapped to themselves and to their neighbors.

A neural interpretation of the SOM adaptation process is that the nodes are feature detector neurons or processing modules that, in a competitive learning process, become specialized to represent different kinds of inputs.
The neighborhood function is a plasticity kernel that forces neighboring neurons to adapt at the same time. The kernel transforms the discrete set of feature detectors into feature maps analogous to ordered brain maps of sensory inputs, and more generally into maps of more abstract properties of the input data.

[Self-Organizing Maps. Figure 1. A schematic diagram showing how the SOM grid of units (circles on the left, neighbors connected with lines) corresponds to an "elastic net" in the data space. The mapping from the grid locations, determined by the indices i, to the data space is given by the model vectors mi attached to the units i.]

A third interpretation of the SOM is as a vector quantizer. The task of a vector quantizer is to encode inputs with indexes of prototypes, often called codebook vectors, such that a distortion measure is minimized. If there is noise that may change the indexes, the distribution of the noise should be used as the neighborhood function, and then the distortion becomes minimized by a variant of the SOM (Luttrell, 1994). In summary, the SOM can be viewed as an algorithm for producing codebooks ordered on a grid.

While it has turned out to be hard to rigorously analyze the properties of the SOM algorithm (Fort, 2006), its fixed points may be informative. In a fixed point the models must fulfill

mi = (∑x hc(x),i x) / (∑x hc(x),i),

that is, each model vector is the centroid of the data projected to it and its neighbors. The definition of a principal curve (Hastie, Tibshirani, & Friedman, 2001), a nonlinear generalization of principal components (see principal components analysis), is essentially that the curve goes through the centroid of the data projected to it. Hence, one interpretation of the SOM is as a discretized, smoothed, nonlinear generalization of principal components. In short, SOMs aim to describe the variation in the data nonlinearly with their discrete grids.
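The online update rule above can be sketched in a few lines. The one-dimensional grid, Gaussian neighborhood kernel, and linearly decaying learning rate and width below are common illustrative choices, not the only ones.

```python
# Minimal online SOM: a 1-D grid of 10 units trained on 2-D data.
import math
import random

random.seed(0)
GRID = 10  # number of grid units

models = [[random.random(), random.random()] for _ in range(GRID)]

def winner(x):
    """Index c of the unit whose model vector is closest to input x."""
    return min(range(GRID),
               key=lambda i: sum((m - xj) ** 2 for m, xj in zip(models[i], x)))

def train(data, steps=2000, alpha0=0.5, sigma0=3.0):
    for t in range(steps):
        x = random.choice(data)
        c = winner(x)
        alpha = alpha0 * (1 - t / steps)            # decaying learning rate
        sigma = max(sigma0 * (1 - t / steps), 0.5)  # decaying neighborhood width
        for i in range(GRID):
            # neighborhood kernel h_ci(t): large for units near the winner
            h = alpha * math.exp(-((i - c) ** 2) / (2 * sigma ** 2))
            models[i] = [m + h * (xj - m) for m, xj in zip(models[i], x)]

data = [[random.random(), random.random()] for _ in range(200)]
train(data)
```

Each step moves the winner and, to a lesser degree, its grid neighbors toward the input, which is what gradually orders the model vectors along the grid.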
Finally, a popular prototype-based classifier, learning vector quantization (LVQ) (Kohonen, 2001), can be loosely interpreted as a variant of the SOM, although it does not have the neighborhood function and hence the prototypes do not have an order.

Programs and Data
The SOM has been implemented in several commercial packages and as freeware. Two examples, SOM_PAK written in C and the Matlab SOM Toolbox (http://www.cis.hut.fi/research/software), came from Kohonen's group.

Applications
The SOM can be used as a nonlinear dimensionality reduction method, by projecting each data vector onto the grid location having the closest model vector. An image of the grid can be used for information visualization. Since all grid locations are clusters, the SOM display actually visualizes an ordered set of clusters, or a quantized image of the principal manifold in the data. More specifically, the SOM units can be thought of as subclusters, and data clusters may form larger areas on the SOM grid. SOM-based visualizations can be used for illustrating the proximity relationships of data vectors, such as documents in the WEBSOM document maps (Kohonen et al., 2000), or for monitoring the change of a system, such as an industrial process or the utterances of a speaker, as a trajectory on the SOM display. More applications can be found in a collected bibliography (the latest is Pöllä, Honkela, & Kohonen (in press)).

Cross References
ART
Competitive Learning
Dimensionality Reduction
Hebbian Learning
K-means Clustering
Learning Vector Quantization

Recommended Reading
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Fort, J. C. (2006). SOM's mathematics. Neural Networks, 19, 812–816.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer.
Heskes, T. (2001). Self-organizing maps, vector quantization, and mixture modeling.
IEEE Transactions on Neural Networks, 12, 1299–1305.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (2001). Self-organizing maps (3rd ed.). Berlin: Springer.
Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., et al. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11, 574–585.
Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation, 6, 767–794.
Pöllä, M., Honkela, T., & Kohonen, T. Bibliography of self-organizing map (SOM) papers: addendum. Report TKK-ICS-R, Helsinki University of Technology, Department of Information and Computer Science, Espoo, Finland.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.

Semantic Mapping → Text Visualization

Semi-Naive Bayesian Learning
Fei Zheng, Geoffrey I. Webb
Monash University, Clayton, Melbourne, Victoria, Australia

Definition
Semi-naive Bayesian learning refers to a field of Supervised Classification that seeks to enhance the classification and conditional probability estimation accuracy of naive Bayes by relaxing its attribute independence assumption.

Motivation and Background
The assumption underlying naive Bayes is that attributes are independent of each other, given the class. This is an unrealistic assumption for many applications. Violations of this assumption can render naive Bayes' classification suboptimal. There have been many attempts to improve the classification accuracy and probability estimation of naive Bayes by relaxing the attribute independence assumption while at the same time retaining much of its simplicity and efficiency.

Taxonomy of Semi-Naive Bayesian Techniques
Semi-naive Bayesian methods can be roughly subdivided into five high-level strategies for relaxing the independence assumption.
The first strategy forms an attribute subset by deleting attributes to remove harmful interdependencies and applies conventional naive Bayes to this subset. The second strategy modifies naive Bayes by adding explicit interdependencies between attributes. The third strategy accommodates violations of the attribute independence assumption by applying naive Bayes to a subset of the training set. Note that the second and third strategies are not mutually exclusive. The fourth strategy adjusts the output of naive Bayes without altering its direct operation. The fifth strategy introduces hidden variables to naive Bayes. Methods That Apply Naive Bayes to a Subset of Attributes Due to the attribute independence assumption, the accuracy of naive Bayes is often degraded by the presence of strongly correlated attributes. Irrelevant attributes may also degrade its accuracy, in effect increasing variance without decreasing bias. Hence, it is useful to remove both strongly correlated and irrelevant attributes. Backward sequential elimination (BSE) (Kittler, ) is an effective wrapper technique for selecting an attribute subset and has been profitably applied to naive Bayes. It begins with the complete attribute set and iteratively removes attributes. On each iteration, naive Bayes is applied to every subset of attributes that can be formed by removing one further attribute. The attribute whose deletion most improves training set accuracy is then removed, and the process is repeated, terminating when no further attribute deletion improves training set accuracy. Conventional naive Bayes is then applied to the resulting attribute subset. One extreme type of interdependency between attributes arises when one attribute value is a generalization of another. For example, Gender=female is a generalization of Pregnant=yes.
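The greedy wrapper loop of backward sequential elimination can be sketched as follows. Here `accuracy` stands in for whatever evaluation the wrapper uses on a candidate attribute subset (the entry uses training set accuracy of naive Bayes restricted to that subset); the function name and callback interface are illustrative assumptions:

```python
def backward_sequential_elimination(attrs, accuracy):
    """Greedy BSE wrapper. `accuracy(attr_subset)` evaluates naive Bayes
    restricted to that subset (e.g., training set accuracy)."""
    current = set(attrs)
    best = accuracy(current)
    while current:
        # Evaluate every subset formed by removing one further attribute.
        acc, attr = max((accuracy(current - {a}), a) for a in current)
        if acc <= best:
            break                 # no deletion improves accuracy: stop
        best, current = acc, current - {attr}
    return current
```

The same skeleton covers forward selection or BSEJ-style joining by changing the set of candidate moves generated in the loop.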
Subsumption resolution (SR) (Zheng & Webb, ) identifies at classification time pairs of attribute values such that one appears to subsume (be a generalization of) the other, and deletes the generalization. It uses the criterion $T_{x_i} = T_{x_i,x_j} \geq u$ to infer that attribute value $x_j$ is a generalization of attribute value $x_i$, where $T_{x_i}$ is the number of training cases with value $x_i$, $T_{x_i,x_j}$ is the number of training cases with both values, and $u$ is a user-specified minimum frequency. When SR is applied to naive Bayes, the resulting classifier acts as naive Bayes except that it deletes generalization attribute values at classification time whenever a specialization is detected. Methods That Alter Naive Bayes by Allowing Interdependencies Between Attributes Interdependencies between attributes can be addressed directly by allowing an attribute to depend on other non-class attributes. Sahami () introduces the terminology of the z-dependence Bayesian classifier, in which each attribute depends upon the class and at most z other attributes. Figure  depicts methods in this group from the Bayesian Network perspective. In Fig. a, each attribute depends on the class and at most one other attribute. Tree Augmented Naive Bayes (TAN) (Friedman, Geiger, & Goldszmidt, ) is a representative one-dependence classifier. It efficiently finds a directed spanning tree by maximizing the log-likelihood and employs this tree to perform classification. SuperParent TAN (Keogh & Pazzani, ) is an effective variant of TAN. A SuperParent one-dependence classifier (Fig. b) is a special case of one-dependence classifiers, in which a single attribute, called the SuperParent (X in this graph), is selected as the parent of all the other attributes. Averaged One-Dependence Estimators (AODE) (Webb, Boughton, & Wang, ) selects a restricted class of one-dependence classifiers and aggregates the predictions of all qualified classifiers within this class.
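The subsumption criterion can be checked directly from value counts. A sketch, under illustrative assumptions about the data layout (rows are tuples of attribute values, and an attribute value is identified by its (attribute index, value) pair; the function name is hypothetical):

```python
from collections import Counter

def find_generalizations(X, u):
    """Return pairs (x_i, x_j) where x_j appears to be a generalization
    of x_i, i.e. T_{x_i} = T_{x_i,x_j} >= u.  An attribute value is
    identified by its (attribute index, value) pair."""
    single, pair = Counter(), Counter()
    for row in X:
        vals = list(enumerate(row))
        single.update(vals)                         # T_{x_i}
        pair.update((vi, vj) for vi in vals         # T_{x_i, x_j}
                    for vj in vals if vi != vj)
    return {(xi, xj) for (xi, xj), t in pair.items()
            if t == single[xi] >= u}
```

At classification time, SR would then drop each detected generalization from the naive Bayes product whenever its specialization is present in the instance being classified.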
Maximum a posteriori linear mixture of generative distributions (MAPLMG) (Cerquides & Mántaras, ) extends AODE by assigning a weight to each one-dependence classifier. Two z-dependence classifiers (z ≥ ) are NBTree (Kohavi, ) and lazy Bayesian rules (LBR) (Zheng & Webb, ), both of which may add any number of non-class parents for an attribute. In Fig. c, attributes in $\{X_{i_{q+1}}, \ldots, X_{i_n}\}$ depend on all the attributes in $\{X_{i_1}, \ldots, X_{i_q}\}$. The main difference between these two methods is that NBTree builds a single tree for all training instances, while LBR generates a Bayesian rule for each test instance. Methods That Apply Naive Bayes to a Subset of the Training Set Another effective approach to accommodating violations of the conditional independence assumption is to apply naive Bayes to a subset of the training set, as the assumption, although violated in the whole training set, may hold or approximately hold in a subset of it. NBTree and LBR use a local naive Bayes to classify an instance and can also be classified into this group. Locally weighted naive Bayes (LWNB) (Frank, Hall, & Pfahringer, ) applies naive Bayes to a neighborhood of the test instance, in which each instance is assigned a weight decreasing linearly with its Euclidean distance to the test instance. The number of instances in the neighborhood is determined by a user-specified parameter, and only instances whose weights are greater than zero are used for classification. Methods That Calibrate Naive Bayes’ Probability Estimates Methods in this group adjust the distortion in estimated posterior probabilities that results from violations of the independence assumption. Isotonic regression (IR) (Zadrozny & Elkan, ) is a nonparametric calibration method that produces a monotonically increasing transformation of the probability output of naive Bayes. [Semi-Naive Bayesian Learning. Figure: Bayesian networks of (a) a one-dependence classifier, (b) a SuperParent one-dependence classifier, and (c) a z-dependence classifier (z ≥ ).] It uses a pair-adjacent violators algorithm (Ayer, Brunk, Ewing, Reid, & Silverman, ) to perform calibration. To classify a test instance, IR first finds the interval into which the estimated posterior probability falls and predicts the isotonic regression estimate of this interval as the calibrated posterior probability. Adjusted probability naive Bayesian classification (Webb & Pazzani, ) makes adjustments to class probabilities, using a simple hill-climbing search to find adjustments that maximize the leave-one-out cross-validation accuracy estimate. Starting with the conditional attribute-value frequency table generated by naive Bayes, iterative Bayes (Gama, ) iteratively updates the frequency table by cycling through all training instances. Methods That Introduce Hidden Variables to Naive Bayes Creating hidden variables, or joining attributes, is another effective approach to relaxing the attribute independence assumption. Backward sequential elimination and joining (BSEJ) (Pazzani, ) extends BSE by creating new Cartesian product attributes. It considers joining each pair of attributes and creates a new Cartesian product attribute if doing so improves leave-one-out cross-validation accuracy. Both original attributes and new Cartesian product attributes may be deleted during the hill-climbing search. This process of joining and deleting is repeated until there is no further accuracy improvement. Hierarchical naive Bayes (Zhang, Nielsen, & Jensen, ) uses conditional mutual information as a criterion to create a hidden variable whose value set is initialized to the Cartesian product of the value sets of its children. Values of the hidden variable are then collapsed by maximizing conditional log-likelihood via the minimum description length principle (Rissanen, ).
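The pair-adjacent violators algorithm at the heart of isotonic regression calibration can be sketched in a few lines (a minimal illustration, not the cited implementation): given (score, label) pairs, it pools adjacent blocks until the block means are nondecreasing, and each block mean becomes the calibrated probability for the scores it covers.

```python
def pav(scores_and_labels):
    """Pair-adjacent violators: calibrate binary labels, sorted by score,
    into a nondecreasing sequence of probabilities (one per instance)."""
    pairs = sorted(scores_and_labels)
    if not pairs:
        return []
    merged = []                       # blocks of [label_sum, size]
    for _, label in pairs:
        merged.append([label, 1])
        # Pool adjacent blocks while their means violate monotonicity
        # (cross-multiplied to avoid division).
        while len(merged) > 1 and \
                merged[-2][0] * merged[-1][1] > merged[-1][0] * merged[-2][1]:
            s, n = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += n
    return [s / n for s, n in merged for _ in range(n)]
```

To calibrate a new naive Bayes score, IR then looks up the block interval the score falls into and returns that block's mean.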
Selection Between Semi-Naive Bayesian Methods No algorithm is universally optimal in terms of generalization accuracy. General recommendations for selecting between semi-naive Bayesian methods are provided here, based on the bias-variance tradeoff together with the characteristics of the application at hand. Error can be decomposed into bias and variance (see Bias Variance Decomposition). Bias measures how closely a learner is able to approximate the decision surfaces for a domain, and variance measures the sensitivity of a learner to random variations in the training data. Unfortunately, we cannot, in general, minimize bias and variance simultaneously: there is a bias-variance tradeoff such that bias typically decreases when variance increases and vice versa. Data set size usually interacts with bias and variance and in turn affects error. Since differences between samples are expected to decrease with increasing sample size, differences between models formed from those samples are also expected to decrease, and hence variance is expected to decrease. Therefore, the bias proportion of error may be higher on large data sets than on small data sets, and the variance proportion of error may be higher on small data sets than on large data sets. Consequently, low-bias algorithms may have an advantage in error on large data sets, and low-variance algorithms may have an advantage in error on small data sets (Brain & Webb, ). Zheng and Webb () compare eight semi-naive Bayesian methods with naive Bayes. These methods are BSE, FSS, TAN, SP-TAN, AODE, NBTree, LBR, and BSEJ. NBTree, SP-TAN, and BSEJ have relatively high training time complexity, while LBR has high classification time complexity. BSEJ has very high space complexity. NBTree and BSEJ have very low bias and high variance. Naive Bayes and AODE have very low variance. AODE has a significant advantage in error over the other semi-naive Bayesian algorithms tested, with the exceptions of LBR and SP-TAN.
It achieves a lower error on more data sets than LBR and SP-TAN, without SP-TAN’s high training time complexity or LBR’s high test time complexity. Subsequent research (Cerquides & Mántaras, ; Zheng & Webb, ) shows that MAPLMG and SR can in practice significantly improve both the classification accuracy and the precision of the conditional probability estimates of AODE. However, MAPLMG imposes very high training time overheads on AODE, while SR imposes no extra training time overheads and only modest test time overheads. Within the prevailing computational complexity constraints, we suggest using the lowest-bias semi-naive Bayesian method for large training data and the lowest-variance method for small training data. An appropriate tradeoff between bias and variance should be sought for intermediate-size training data. For extremely small data, naive Bayes may be superior, and for large data NBTree and BSEJ may be more appealing options if their computational complexity satisfies the constraints of the application context. AODE achieves very low variance, relatively low bias, and low training time and space complexity. MAPLMG and SR further enhance AODE by substantially reducing bias and error and improving probability prediction with modest time complexity. Consequently, they may prove competitive over a considerable range of classification tasks. Furthermore, MAPLMG may excel if the primary consideration is attaining the highest possible classification accuracy, and SR may have an advantage if efficient classification is required. Cross References Bayesian Network Naive Bayes Recommended Reading Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., & Silverman, E. (). An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, (), –. Brain, D., & Webb, G. I. ().
The need for low bias algorithms in classification learning from large data sets. In Proceedings of the Sixteenth European Conference on Principles of Data Mining and Knowledge Discovery (pp. –). Berlin: Springer-Verlag. Cerquides, J., & Mántaras, R. L. D. (). Robust Bayesian linear classifier ensembles. In Proceedings of the Sixteenth European Conference on Machine Learning (pp. –). Frank, E., Hall, M., & Pfahringer, B. (). Locally weighted naive Bayes. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, Acapulco, Mexico (pp. –). San Francisco, CA: Morgan Kaufmann. Friedman, N., Geiger, D., & Goldszmidt, M. (). Bayesian network classifiers. Machine Learning, (), –. Gama, J. (). Iterative Bayes. Theoretical Computer Science, (), –. Keogh, E. J., & Pazzani, M. J. (). Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In Proceedings of the International Workshop on Artificial Intelligence and Statistics (pp. –). Kittler, J. (). Feature selection and extraction. In T. Y. Young & K. S. Fu (Eds.), Handbook of Pattern Recognition and Image Processing. New York: Academic Press. Kohavi, R. (). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (pp. –). Pazzani, M. J. (). Constructive induction of Cartesian product attributes. In ISIS: Information, Statistics and Induction in Science, Melbourne, Australia (pp. –). Singapore: World Scientific. Rissanen, J. (). Modeling by shortest data description. Automatica, , –. Sahami, M. (). Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery in Databases (pp. –). Menlo Park: AAAI Press. Webb, G. I., & Pazzani, M. J. ().
Adjusted probability naive Bayesian induction. In Proceedings of the Eleventh Australian Joint Conference on Artificial Intelligence, Sydney, Australia (pp. –). Berlin: Springer. Webb, G. I., Boughton, J., & Wang, Z. (). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, (), –. Zadrozny, B., & Elkan, C. (). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada (pp. –). New York: ACM Press. Zhang, N. L., Nielsen, T. D., & Jensen, F. V. (). Latent variable discovery in classification models. Artificial Intelligence in Medicine, (), –. Zheng, Z., & Webb, G. I. (). Lazy learning of Bayesian rules. Machine Learning, (), –. Zheng, F., & Webb, G. I. (). A comparative study of semi-naive Bayes methods in classification learning. In Proceedings of the Fourth Australasian Data Mining Conference, Sydney (pp. –). Zheng, F., & Webb, G. I. (). Efficient lazy elimination for averaged one-dependence estimators. In Proceedings of the Twenty-Third International Conference on Machine Learning (pp. –). New York: ACM Press. Semi-Supervised Learning Xiaojin Zhu University of Wisconsin-Madison, Madison, WI, USA Synonyms Co-training; Learning from labeled and unlabeled data; Transductive learning Definition Semi-supervised learning uses both labeled and unlabeled data to perform an otherwise supervised learning or unsupervised learning task. In the former case, there is a distinction between inductive semi-supervised learning and transductive learning. In inductive semi-supervised learning, the learner has both labeled training data $\{(x_i, y_i)\}_{i=1}^{l} \overset{\mathrm{iid}}{\sim} p(x, y)$ and unlabeled training data $\{x_i\}_{i=l+1}^{l+u} \overset{\mathrm{iid}}{\sim} p(x)$, and learns a predictor $f : \mathcal{X} \mapsto \mathcal{Y}$, $f \in \mathcal{F}$, where $\mathcal{F}$ is the hypothesis space.
Here $x \in \mathcal{X}$ is an input instance, $y \in \mathcal{Y}$ its target label (discrete for classification or continuous for regression), $p(x, y)$ the unknown joint distribution, $p(x)$ its marginal, and typically $l \ll u$. The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone. In transductive learning, the setting is the same, except that one is solely interested in the predictions on the unlabeled training data $\{x_i\}_{i=l+1}^{l+u}$, without any intention to generalize to future test data. In the latter case, an unsupervised learning task is enhanced by labeled data. For example, in semi-supervised clustering (a.k.a. constrained clustering) one may have a few must-links (two instances must be in the same cluster) and cannot-links (two instances cannot be in the same cluster) in addition to the unlabeled instances to be clustered; in semi-supervised dimensionality reduction one might have the target low-dimensional coordinates of a few instances. This entry will focus on the former case of learning a predictor. Motivation and Background Semi-supervised learning was initially motivated by its practical value in learning faster, better, and cheaper. In many real-world applications, it is relatively easy to acquire a large amount of unlabeled data $\{x\}$. For example, documents can be crawled from the Web, images can be obtained from surveillance cameras, and speech can be collected from broadcasts. However, the corresponding labels $\{y\}$ for the prediction task, such as sentiment orientation, intrusion detection, and phonetic transcripts, often require slow human annotation and expensive laboratory experiments. This labeling bottleneck results in a scarcity of labeled data and a surplus of unlabeled data. Therefore, being able to utilize the surplus unlabeled data is desirable. Recently, semi-supervised learning has also found application in cognitive psychology as a computational model for human learning.
In human categorization and concept formation, the environment provides unsupervised data (e.g., a child watching surrounding objects by herself) in addition to labeled data from a teacher (e.g., Dad points to an object and says “bird!”). There is evidence that human beings can combine labeled and unlabeled data to facilitate learning. The history of semi-supervised learning goes back to at least the s, when self-training, transduction, and Gaussian mixtures with the expectation-maximization (EM) algorithm first emerged. It has enjoyed an explosion of interest since the s, with the development of new algorithms like co-training and transductive support vector machines, new applications in natural language processing and computer vision, and new theoretical analyses. More discussion can be found in Section .. of Chapelle, Zien, and Schölkopf (). Theory Unlabeled data $\{x_i\}_{i=l+1}^{l+u}$ by itself does not carry any information on the mapping $\mathcal{X} \mapsto \mathcal{Y}$. How can it help us learn a better predictor $f : \mathcal{X} \mapsto \mathcal{Y}$? Balcan and Blum pointed out that the key lies in an implicit ordering of $f \in \mathcal{F}$ induced by the unlabeled data. Informally, if the implicit ordering happens to rank the target predictor $f^*$ near the top, then one needs less labeled data to learn $f^*$. This idea will be formalized later using PAC learning bounds. In other contexts, the implicit ordering is interpreted as a prior over $\mathcal{F}$ or as a regularizer. A semi-supervised learning method must address two questions: what implicit ordering is induced by the unlabeled data, and how to algorithmically find a predictor that is near the top of this implicit ordering and fits the labeled data well. Many semi-supervised learning methods have been proposed, with different answers to these two questions (Abney, ; Chapelle et al., ; Seeger, ; Zhu & Goldberg, ). It is impossible to enumerate all methods in this entry. Instead, we present a few representative methods.
Generative Models This semi-supervised learning method assumes that the joint probability takes the form $p(x, y \mid \theta) = p(y \mid \theta)\, p(x \mid y, \theta)$. For example, the class prior distribution $p(y \mid \theta)$ can be a multinomial over $\mathcal{Y}$, while the class conditional distribution $p(x \mid y, \theta)$ can be a multivariate Gaussian in $\mathcal{X}$ (Castelli & Cover, ; Nigam, McCallum, Thrun, & Mitchell, ). We use $\theta \in \Theta$ to denote the parameters of the joint probability. Each $\theta$ corresponds to a predictor $f_\theta$ via Bayes rule: $f_\theta(x) \equiv \arg\max_y p(y \mid x, \theta) = \arg\max_y \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}$. Therefore, $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$. What is the implicit ordering of $f_\theta$ induced by the unlabeled training data $\{x_i\}_{i=l+1}^{l+u}$? It is the large-to-small ordering of the log-likelihood of $\theta$ on the unlabeled data: $\log p(\{x_i\}_{i=l+1}^{l+u} \mid \theta) = \sum_{i=l+1}^{l+u} \log \left( \sum_{y \in \mathcal{Y}} p(x_i, y \mid \theta) \right)$. The top-ranked $f_\theta$ is the one whose $\theta$ (or rather the generative model with parameters $\theta$) best fits the unlabeled data. Therefore, this method assumes that the form of the joint probability is correct for the task. To identify the $f_\theta$ that both fits the labeled data well and ranks high, one maximizes the log-likelihood of $\theta$ on both labeled and unlabeled data: $\arg\max_\theta \; \log p(\{(x_i, y_i)\}_{i=1}^{l} \mid \theta) + \lambda \log p(\{x_i\}_{i=l+1}^{l+u} \mid \theta)$, where $\lambda$ is a balancing weight. This is a non-concave problem. A local maximum can be found with the EM algorithm, or other numerical optimization methods. (See also Generative Learning.) Semi-Supervised Support Vector Machines This semi-supervised learning method assumes that the decision boundary $f(x) = 0$ is situated in a low-density region (in terms of unlabeled data) between the two classes $y \in \{-1, 1\}$ (Joachims, ; Vapnik, ). Consider the following hat loss function on an unlabeled instance $x$: $\max(1 - |f(x)|, 0)$, which is positive when −