PA196: Pattern Recognition
05. Nonparametric techniques
Dr. Vlad Popovici (popovici@iba.muni.cz)
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Density estimation
  - Histograms
  - Parzen density estimation
2 k-nearest neighbor estimation
3 Nearest neighbor classification rule
  - k-NN decision rule
  - Refinements
  - Distances

Introduction
- let X_1, ..., X_n be i.i.d. d-dimensional random variables
- let p(x) be their continuous density: p(x) ≥ 0 and ∫_{R^d} p(x) dx = 1
- the problem is to estimate p(x), i.e. to find p̂(x)
- Note: a density estimate does not need to be a density itself; it can have negative values or an infinite integral.

Desirable properties
- asymptotic unbiasedness: E[p̂(x)] → p(x) as n → ∞
- consistency, measured by the mean squared error:
    MSE(p̂) = E[(p̂(x) − p(x))^2] = Var(p̂) + [bias(p̂)]^2
  If MSE → 0 for all x ∈ R^d, then p̂ is a pointwise consistent estimator of p in the quadratic mean.
- global measure of accuracy: the mean integrated squared error (averaged over all possible samples):
    MISE = E[ ∫ (p̂(x) − p(x))^2 dx ] = ∫ E[(p̂(x) − p(x))^2] dx

Histograms
- the simplest density estimator: divide the range of values into N equal intervals (cells) and set
    p̂(x) = n_j / (Σ_{j=1}^N n_j · dx),
  where n_j is the number of points falling into the j-th interval, the one straddling the point x, and dx is the cell width (see the code sketch at the end of this subsection)
- in d dimensions:
    p̂(x) = n_j / (Σ_{j=1}^N n_j · dV),
  where dV is the volume of a cell
- Problems:
  - exponential growth of the number of cells (N^d)
  - super-exponential growth of the sample size needed for a proper estimation
  - discontinuities between cells

Modifications
- data-adaptive histograms: allow the location, size and shape of the cells to adapt to the available data
- assume variable independence (naive Bayes): p(x) = ∏_{i=1}^d p(x_i); for each variable one can use a histogram with N cells, which leads to N·d cells instead of N^d
- Lancaster models: assume that interactions above a certain order vanish
- Bayesian networks: p(x) = p(x_d | x_1, ..., x_{d−1}) p(x_{d−1} | x_1, ..., x_{d−2}) · · · p(x_2 | x_1) p(x_1)
- dependence trees: use only pairwise conditional probabilities
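To make the basic histogram estimator concrete, here is a minimal one-dimensional sketch. It is not part of the original slides; the function name histogram_density and the NumPy-based implementation are illustrative choices.

```python
import numpy as np

def histogram_density(x, samples, n_cells=10):
    """Histogram density estimate p_hat(x) = n_j / (n * dx) for 1-D data,
    where n_j is the count in the cell straddling x and dx is the cell width."""
    lo, hi = samples.min(), samples.max()
    dx = (hi - lo) / n_cells                                         # cell width
    counts, _ = np.histogram(samples, bins=n_cells, range=(lo, hi))  # n_j per cell
    # index of the cell straddling each query point (clipped to the valid range)
    j = np.clip(((np.asarray(x) - lo) // dx).astype(int), 0, n_cells - 1)
    return counts[j] / (len(samples) * dx)

# usage: estimate the density of a standard normal sample at a few points
rng = np.random.default_rng(0)
data = rng.normal(size=1000)
print(histogram_density(np.array([-1.0, 0.0, 1.0]), data, n_cells=20))
```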
Parzen estimator (kernel methods)
- fix the volume of the cell and use the number of points falling within it to construct a density estimate
- idea: smooth the histogram with a properly selected kernel function
- the kernels are usually chosen to have a compact support
- the density estimate is
    p̂(x) = (1/(nh)) Σ_{i=1}^n K((x − x_i)/h),
  where K is the kernel function and h is a smoothing parameter (spread, bandwidth); see the sketch at the end of this subsection

Examples of kernel functions
- rectangular: K(x) = 1/2 for |x| < 1, and 0 otherwise
- triangular: K(x) = 1 − |x| for |x| < 1, and 0 otherwise
- normal: K(x) = (1/√(2π)) exp(−x^2/2)
- Bartlett-Epanechnikov: K(x) = (3/4)(1 − x^2/5)/√5 for |x| < √5, and 0 otherwise

[Figure: different levels of smoothing; from Webb, Statistical Pattern Recognition]

k-NN density estimation
- the probability that a point z falls into a volume V centered at x is
    θ = ∫_{V(x)} p(z) dz
- for a small volume, θ ≈ p(x)·V
- on the other hand, θ ≈ k(x)/n, the fraction of points falling within V
- ⇒ k-NN density estimator:
    p̂(x) = k(x) / (nV)
- k-NN: fix k(x)/n or, equivalently (for a given n), fix k and find the volume V centered at x that contains k points
- example: if x_k is the k-th closest point to x, then V can be taken as a sphere of radius ‖x − x_k‖; the volume of a d-dimensional sphere of radius r is
    V_d(r) = 2 r^d π^{d/2} / (d Γ(d/2)),
  where Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx (for n ∈ N, Γ(n) = (n − 1)!)
- this is in contrast with the histogram, where the volume is fixed and k varies
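To illustrate the two estimators above, here is a minimal sketch of a one-dimensional Parzen estimate with the normal kernel and a d-dimensional k-NN estimate using the sphere-volume formula. The function names and the NumPy/SciPy implementation are illustrative choices, not part of the original slides.

```python
import numpy as np
from scipy.special import gamma

def parzen_density(x, samples, h=0.5):
    """1-D Parzen estimate p_hat(x) = 1/(n h) * sum_i K((x - x_i)/h),
    with the normal kernel K(u) = exp(-u^2/2)/sqrt(2*pi)."""
    u = (np.asarray(x)[..., None] - samples) / h      # (x - x_i)/h for every sample
    return np.exp(-u**2 / 2).sum(axis=-1) / (np.sqrt(2 * np.pi) * len(samples) * h)

def knn_density(x, samples, k=5):
    """d-dimensional k-NN estimate p_hat(x) = k / (n V), where V is the volume
    of the smallest ball centered at x containing the k nearest samples:
    V_d(r) = 2 r^d pi^(d/2) / (d Gamma(d/2))."""
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    # radius = Euclidean distance from x to its k-th nearest sample
    r = np.sort(np.linalg.norm(samples - np.asarray(x), axis=1))[k - 1]
    V = 2.0 * r**d * np.pi**(d / 2) / (d * gamma(d / 2))
    return k / (n * V)

# usage: both estimators on a standard normal sample, evaluated at x = 0
rng = np.random.default_rng(0)
data = rng.normal(size=1000)
print(parzen_density(0.0, data, h=0.3))               # true density at 0 is ~0.399
print(knn_density([0.0], data.reshape(-1, 1), k=25))
```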
[Figure: k-NN density estimation with k = 1; from Webb, Statistical Pattern Recognition]

[Figure: k-NN density estimation with k = 2; from Webb, Statistical Pattern Recognition]

Notes
- the density estimate produced is not a density itself (the estimate varies as 1/|x|, leading to an infinite integral)
- it is asymptotically unbiased if
    lim_{n→∞} k(n) = ∞  and  lim_{n→∞} k(n)/n = 0

k-NN decision rule
- k-NN can be used to estimate the densities → apply the MAP rule to obtain a classification rule
- let there be k_i samples of class g_i among the k samples closest to x; Σ_{i=1}^m k_i = k (m is the total number of classes)
- let n_i be the total number of samples from class g_i: Σ_{i=1}^m n_i = n
- then the estimate of the class-conditional density is
    p̂(x | g_i) = k_i / (n_i V)
  and the estimated prior is
    p̂(g_i) = n_i / n
- MAP rule: assign x to g_i if p̂(g_i | x) ≥ p̂(g_j | x) for all j
- from Bayes' theorem: assign x to g_i if
    (k_i / (n_i V)) · (n_i / n) ≥ (k_j / (n_j V)) · (n_j / n) for all j ≠ i
- k-NN decision rule: assign x to g_i if k_i ≥ k_j, ∀ j ≠ i (a short code sketch is given below)

What about the ties? Breaking the ties:
- random assignment among the classes with the same number of neighbors
- assign to the class with the closest mean vector
- assign to the most compact class
- use a weighted distance
- etc.

Error rate (Cover, Hart, 1967)
    e* ≤ e ≤ e* (2 − m e* / (m − 1)),
where e* is the Bayes error rate, m is the number of classes and e is the asymptotic (n → ∞) error rate of the nearest neighbor rule. In particular, e* ≤ e ≤ 2e*: asymptotically, the NN rule is at most twice as bad as the Bayes rule.

Note on implementing k-NN
- as n becomes large, finding the k nearest neighbors incurs more and more computation
- various approximating algorithms exist, e.g. LAESA (linear approximating and eliminating search algorithm); the idea is to use the properties of the metric space to reduce the number of comparisons by working with a set of identified "prototypes"
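Here is a minimal sketch of the plain k-NN decision rule above: majority vote among the k nearest training samples, with random tie-breaking. It is not part of the original slides; the function names and the Euclidean-distance NumPy implementation are illustrative choices, and no speed-up scheme such as LAESA is included.

```python
import numpy as np

def knn_classify(x, samples, labels, k=5):
    """Assign x to the class g_i with the largest count k_i among its k nearest
    training samples (majority vote, Euclidean distance).
    Ties are broken at random among the tied classes."""
    dists = np.linalg.norm(samples - np.asarray(x), axis=1)
    nearest = labels[np.argsort(dists)[:k]]          # labels of the k nearest samples
    classes, counts = np.unique(nearest, return_counts=True)
    tied = classes[counts == counts.max()]           # classes with maximal k_i
    return np.random.default_rng().choice(tied) if len(tied) > 1 else tied[0]

# usage: two Gaussian classes in 2-D; classify a point near the second class
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.repeat([0, 1], 100)
print(knn_classify([2.5, 2.5], X, y, k=7))           # expected: 1
```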
Refinements: editing techniques
- idea: remove misclassified samples to obtain homogeneous regions
- procedure: given a set R and a classification rule η, let S be the set of samples from R misclassified by η; remove these, re-train η on R := R \ S, and so on

Possible implementation:
1. consider a partition of the full set into N subsets R_1, ..., R_N
2. classify the samples in R_i using k-NN trained on the union of the M "next" sets, R_{(i+1) mod N} ∪ · · · ∪ R_{(i+M−1) mod N}, for 1 ≤ M ≤ N − 1
3. remove the misclassified samples and re-partition
4. repeat until a predefined number of iterations removes no more samples

Notes:
- M = N − 1 is similar to cross-validation
- if N equals the number of samples, the procedure becomes leave-one-out
- the result is a set of homogeneous "clusters" of samples

Refinements: condensation
- after editing, the clusters can be "condensed"
- idea: remove the samples in the centers of the clusters, which do not contribute to the decision

Distances
- the choice of distance depends on the (knowledge of the) domain: is the space isotropic? are some variables "more important"? etc.
- general Euclidean distance: d(x, z) = (x − z)^T A (x − z), with A a positive-definite matrix
- alternative (van der Heiden, Groen, 1997; radar applications): d(x, z) = (x^(p) − z^(p))^T (x^(p) − z^(p)), where
    x_i^(p) = (x_i^p − 1)/p   if 0 < p ≤ 1
    x_i^(p) = log x_i         if p = 0

What about k?
- the larger the k, the more robust the procedure; however, k must be smaller than the smallest n_i
- k can be optimized in a cross-validation approach (see the sketch below)
- Enas and Choi (1986) suggest k ≈ n^{2/8} or k ≈ n^{3/8}, where n is the sample size
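As a closing illustration, here is one way k could be tuned by cross-validation. The candidate grid, the number of folds and the plain majority-vote classifier inside the loop are illustrative assumptions, not a prescription from the lecture.

```python
import numpy as np

def cv_choose_k(X, y, candidates=(1, 3, 5, 7, 9, 15, 25), n_folds=5, seed=0):
    """Pick k by cross-validation: for each candidate k, classify every sample of a
    held-out fold by majority vote among its k nearest training samples (Euclidean
    distance) and keep the k with the highest average accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    accuracy = {}
    for k in candidates:
        correct = 0
        for fold in folds:
            train = np.setdiff1d(np.arange(len(y)), fold)
            for i in fold:
                d = np.linalg.norm(X[train] - X[i], axis=1)
                votes = y[train][np.argsort(d)[:k]]
                labels, counts = np.unique(votes, return_counts=True)
                # ties between classes are implicitly broken by label order here
                correct += labels[np.argmax(counts)] == y[i]
        accuracy[k] = correct / len(y)
    return max(accuracy, key=accuracy.get), accuracy

# usage: two overlapping Gaussian classes in 2-D
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)
best_k, scores = cv_choose_k(X, y)
print(best_k, scores)
```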