Inductive Classification
Original slides: Raymond J. Mooney, University of Texas at Austin

Sample Category Learning Problem

• Instance language:
– size ∈ {small, medium, large}
– color ∈ {red, blue, green}
– shape ∈ {square, circle, triangle}
• C = {positive, negative}
• D:

Example   Size    Color   Shape      Category
1         small   red     circle     positive
2         large   red     circle     positive
3         small   red     triangle   negative
4         large   blue    circle     negative

Classification (Categorization)

• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c[1], c[2], …, c[n]}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
– If c(x) is a binary function, C = {0, 1} ({true, false}, {positive, negative}), then it is called a concept.

Learning for Categorization

• A training example is an instance x ∈ X paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
• Given a set of training examples, D.
• Find a hypothesized categorization function, h(x), such that:

    h(x) = c(x) for all <x, c(x)> ∈ D   (consistency with the training data)

Hypothesis Selection

• Many hypotheses are usually consistent with the training data. For the sample problem above, each of the following is:
– red & circle
– (small & circle) or (large & red)
– (small & red & circle) or (large & red & circle)
– not [(red & triangle) or (blue & circle)]
– not [(small & red & triangle) or (large & blue & circle)]
• Bias:
– Any criterion other than consistency with the training data that is used to select a hypothesis.

Generalization

• Hypotheses must generalize to correctly classify instances not in the training data.
• Simply memorizing training examples yields a consistent hypothesis that does not generalize.
• Occam's razor:
– Finding a simple hypothesis helps ensure generalization.

Hypothesis Space

• Restrict learned functions a priori to a given hypothesis space, H, of functions h(x) that can be considered as definitions of c(x).
• For learning concepts on instances described by n discrete-valued features, consider the space of conjunctive hypotheses represented by a vector of n constraints, <c[1], c[2], …, c[n]>, where each c[i] is either:
– ?, a wild card indicating no constraint on the i-th feature
– A specific value from the domain of the i-th feature
– Ø, indicating no value is acceptable
• Sample conjunctive hypotheses:
– <?, ?, ?> (the most general hypothesis)
– <Ø, Ø, Ø> (the most specific hypothesis)
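To tie the last two slides together, here is a minimal Python sketch; it is my addition, not part of Mooney's slides, and the helper names matches and consistent are illustrative. It encodes a conjunctive hypothesis as a tuple of constraints over <size, color, shape> and tests the consistency condition h(x) = c(x) from the Learning for Categorization slide against the sample training data D:

def matches(h, x):
    # True iff instance x satisfies every constraint in hypothesis h.
    # "?" is the wild card; any constraint that equals no feature value
    # (e.g. a stand-in for the Ø symbol) rejects every instance.
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, D):
    # True iff h classifies every training example correctly:
    # h(x) = c(x) for all <x, c(x)> in D.
    return all(matches(h, x) == label for x, label in D)

# Sample training data D from the slide above (positive = True).
D = [(("small", "red", "circle"), True),
     (("large", "red", "circle"), True),
     (("small", "red", "triangle"), False),
     (("large", "blue", "circle"), False)]

print(consistent(("?", "red", "circle"), D))  # True: red & circle fits D
print(consistent(("?", "?", "circle"), D))    # False: covers example 4, a negative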
Inductive Learning Hypothesis

• Any function that is found to approximate the target concept well on a sufficiently large set of training examples will also approximate the target function well on unobserved examples.
• This assumes that the training and test examples are drawn independently from the same underlying distribution.
• It is a fundamentally unprovable hypothesis unless additional assumptions are made about the target concept, and the notion of "approximating the target function well on unobserved examples" is defined appropriately (cf. computational learning theory).

Evaluation of Classification Learning

• Classification accuracy (% of instances classified correctly).
– Measured on independent test data.
• Training time (efficiency of the training algorithm).
• Testing time (efficiency of subsequent classification).

Category Learning as Search

• Category learning can be viewed as searching the hypothesis space for one (or more) hypotheses that are consistent with the training data.
• Consider an instance space consisting of n binary features, which therefore has 2^n instances.
• For conjunctive hypotheses, there are 4 choices for each feature (Ø, T, F, ?), so there are 4^n syntactically distinct hypotheses.
• However, all hypotheses with one or more Øs are equivalent (each covers no instances), so there are 3^n + 1 semantically distinct hypotheses. (These counts are verified by brute force in the first sketch at the end of this section.)
• In principle, the target binary categorization function could be any of the 2^(2^n) possible functions on n input bits.
• Therefore, conjunctive hypotheses are a small subset of the space of possible functions, but both spaces are intractably large.
• All reasonable hypothesis spaces are intractably large or even infinite.

Learning by Enumeration

• For any finite or countably infinite hypothesis space, one can simply enumerate and test hypotheses one at a time until a consistent one is found:

    For each h in H do:
        If h is consistent with the training data D,
        then terminate and return h.

• This algorithm is guaranteed to terminate with a consistent hypothesis if one exists; however, it is obviously computationally intractable for almost any practical problem. (A runnable version for the sample problem appears in the second sketch at the end of this section.)

Efficient Learning

• Is there a way to learn conjunctive concepts without enumerating them?
• How do human subjects learn conjunctive concepts?
• Is there a way to efficiently find an unconstrained Boolean function consistent with a set of discrete-valued training instances?
• If so, is it a useful/practical algorithm?

Sample Category Learning Problem

• Instance language:
– size ∈ {small, medium, large}
– color ∈ {red, blue, green}
– shape ∈ {square, circle, triangle}
• C = {positive, negative}
• D: the four training examples tabulated above.

Sample Generalization Lattice
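The hypothesis counts from the Category Learning as Search slide can be checked by brute force for small n. The sketch below is again my addition, with illustrative helper names (covers, count_hypotheses) and the string "0" standing in for Ø: it enumerates all 4^n syntactic conjunctive hypotheses over n binary features and groups them by the set of instances they cover.

from itertools import product

def covers(h, x):
    # Does conjunctive hypothesis h cover binary instance x?
    for c, v in zip(h, x):
        if c == "0":             # Ø: no value acceptable, so h covers nothing
            return False
        if c == "?":             # wild card: no constraint on this feature
            continue
        if (c == "T") != v:      # a specific value must match exactly
            return False
    return True

def count_hypotheses(n):
    instances = list(product([True, False], repeat=n))  # 2**n instances
    extensions = set()
    syntactic = 0
    for h in product(["0", "T", "F", "?"], repeat=n):   # 4 choices per feature
        syntactic += 1
        extensions.add(frozenset(x for x in instances if covers(h, x)))
    return syntactic, len(extensions)

for n in range(1, 5):
    syn, sem = count_hypotheses(n)
    assert syn == 4**n and sem == 3**n + 1  # e.g. n = 3: 64 syntactic, 28 semantic
    print(n, syn, sem)

All hypotheses containing Ø collapse to the single empty extension, which is where the "+ 1" comes from.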
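Finally, the enumerate-and-test loop from the Learning by Enumeration slide, instantiated for the sample problem. The helper names and the enumeration order (wild cards first, via itertools.product) are my choices, not the slides'; Ø-constraints are omitted because a hypothesis containing Ø covers nothing and so can be consistent only when D has no positive examples.

from itertools import product

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, D):
    return all(matches(h, x) == label for x, label in D)

# Hypothesis space H: conjunctions constraining each feature to one
# value or leaving it unconstrained ("?").
H = product(["?", "small", "medium", "large"],
            ["?", "red", "blue", "green"],
            ["?", "square", "circle", "triangle"])

D = [(("small", "red", "circle"), True),
     (("large", "red", "circle"), True),
     (("small", "red", "triangle"), False),
     (("large", "blue", "circle"), False)]

def learn_by_enumeration(H, D):
    for h in H:                # enumerate and test, one hypothesis at a time
        if consistent(h, D):
            return h           # terminate with the first consistent hypothesis
    return None                # no conjunctive hypothesis is consistent with D

print(learn_by_enumeration(H, D))  # ('?', 'red', 'circle')

On the sample data this returns the red & circle hypothesis, the first one listed on the Hypothesis Selection slide.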