Mining Co-location Patterns with Rare Events from Spatial Data Sets Yan Huang, Jian Pei, Hui Xiong Petr G\os Masaryk University Faulty of Informatics, Knowledge Discovery Group Institute of Computer Science, Department of Intelligent Building Systems Botanická 68a, 602 00 Brno glos@ics.muni.cz http://maps.muni.cz, http//:www.muni.cz, http://ics.muni.cz Seminar on Knowledge Discovery, April 2011 ****** w*^* > IMI Co-location Patterns Participation Index Participation Ratio MinMax Algorithm Algorithm maxPrune Q<&A Co-Location Patterns Co-Location Pattern - group of spatial features/events that frequently co-located in the same region. Co-Location Pattern - set of spatial features that are frequently located together in spatial proximity. Location based services, Ecology mapping, Road works, Closures, Accidents, Spatial feature is rare if its instances are substantially less than those of other features in a co-location. Questions and tasks How to identify and measure spatial co-location patterns involving rare spatial features? - Measure called maximal participation ratio How to mine the patterns involving rare spatial feature eff ieciently? - Extension of apriori-like solution to do post-procesing Very low participation index treshold to prune Maximal participation ratio treshold to do a postprocessing - Algorithm using weak monotonie property of the maximal participation ratio to push the maximal participation ratio treshold deep into the mining. Frequent pattern x Co-location pattern Mining Item Item set Frequent pattern Support [Transactional database Spatial feature Spatial feature set Co-location pattern Spatial interestigness measures Spatial database Neighbor-set • S - spatial dataset • F = {fi, ... , f J - set of boolean spatial features • i = •■• / i J " se"l" °f n instances in S, • Each instance is a vector (instance-id, location, spatial feature) • i.f - spatial feature f of instance i • R is neighborhood realation over pairwise instances in S. • Neighbor-set L is a set of instances such that all pairwise locations in L are neighbors. if > IMI Co-location pattern {A,B,C,D} Neighbor sets {3,6,17} {6,17} {3,6} {4,5,13} {4,7,10,16} Example Example Dataset 18 ^3 17 Springer Row instance, Participation ratio Co-location pattern C is a set of spatial features, C < F. A neighbor-set L is said to be a row instance of co-location pattern C if every feature in C appears as a feature of an instance in L and there exists no proper subset of L does so. rowset(C) - all row instances of co-location pattern C Participation ratio pr(Cf) = IfrlCrGS^ and (r.1 :=f1 and (r is a row instance of C)\\ |{r (reS) and (r.f=f)}| Wherever the feature f is observed, with probability pr(C,f), all other features in C are also observed in neighbor-set. if > IMI Row instances for ({A,B,C,D}) {2,11,14,15} [2,8,11,1115] rowset({A,B,C,D}) {{4,7,10,16} {2,11,14,15} {8,11,14,15}} rowset({A,B}) «7,10} {2,14} {5,13} {8,14}} Example Example Dataset 18 ^3 17 Springer if > IMI Participation ratios pr({A,B,C,D},A)=2/5 pr({A,B,C,D},B)=3/5 pr({A,B,C,D},C>l/3 pr({A,B,C,D},D)=l PI ({A,B,C,D},A)=l/3 Example Example Dataset 18 ^3 17 Springer Participation index and monotonicity of participation ratio and index PI(C) = minfeC{pr(C,f» Wherever any feature from C is observed, with probability of at least PI(C)/ a'l other features in C can be observed in neignbor-set. A high participation index value indicates that the spatial features in a co-location pattern likely occur together. Given a user-specified participation index treshold min_prev/ a co-location pattern C is called prevalent if PI(C) >= min_prev. Let C and C be two co-location patterns such that C is subset of C. Then, for each feature feC, pr(C/f) >= pr(C\f). Furthemore, PI(C) >= PI(C') Maximal participation ratio Maximal participation ratio maxPR(C) = max feC{pr(C,f)} A high maximal participation ratio value indicates that there are some spatial features strongly imply the pattern. C - {f j, ... , f J co-location pattern, Minimum maximal participation ratio treshold min_maxPR pr(C,f!) => ... => pr(C,f I) => ... => pr(C,f k), fI is the last spatial feature that has participation ration above min_maxPR If spatial feature fi (1 <= i <= I) is observed in some location, then the probability of observing all other spatial feature in C - {f j} in neighbor set is at least min_maxPR. 5 > v. tV.MS/ (Ml Example 18 17 Table 2 Rowsets, Pis and maxPRs of co-locations of dataset in Fig. 2 ID Co-loc Rowsct pr PI max PI 1 {A} {{1},{5},{6},{7},{14}} {1} 1 1 2 {B} ({2},{8},{10}, {13},{18}} (1} 1 1 3 |C} ({3},{9},{12},{15},{16},{17}} {1} 1 1 4 P} mum 11} 1 1 5 {A,B) {{5,13},{7,10},{14,2},{14,8}} {4/5,4/5} 4/5 4/5 6 {A,q ({1,12},{6,3},{6,17},{14,15},{7,16}} (4/5,5/6) 4/5 5/6 7 {A,D} {{5,4},{14,1},{7,4}} {3/5,2/2} 3/5 1 8 {B,C} {{2,9},{2,15},{8,15],{10>16}} {3/5,3/6} 1/2 3/5 9 [B,Dj {{2,H},{8,11},{10,4},{13,4}) {4/5,2/2} 4/5 1 10 (C,D} {{15,11},{16,4}} {2/6,2/2} 1/3 1 11 [A,B,C} {{7,10,16},{14,2,15},{14,8,15}} {2/5,3/5,2/6} 1/3 3/5 12 [A,B,Dj {{5,13,4},{7,10,4},{14,2,11},{14,8,11}} {3/5,4/5,2/2} 3/5 1 13 (A,C,DJ {{7,16,4),{14,15,H}} {2/5,2/6,2/2} 2/5 1 14 (B,C,D} {{2,15,11},{10,16,4],{8,15,11}} {3/5,2/6,2/2} 1/3 1 15 {A,B,CD} {{7,10,16,4},{14,2,15,n},{14,8,15,ll}} {2/5,3/5,2/6,2/2} 1/3 1 Rundimentary Algorithm A spatial database a neighborhood relation 11, a minimum prevalent threshold min_prev, and a minimum maximal participation index threshold minjnaxPR. Co-location patterns P such that PI(P) > min_prev and maxPR{P) > minjnaxPR. 1. let k = 2; generate C2, the set of candidate 2-patterns and their rowsets, by geometric methods; 2. for each C eCk calculate PI(C) and maxPR{C) from C's rowset rowset(C); 3. let ?; be the subset of Ck such that for each P e P'k, PI(P) > min_prev; 4. let Pk be the subset of P'k such that for each P € Pk, maxPR{P) > minjnaxPR] 5. generate the set Ck+\ of candidate (k + l)-patterns, a co-location pattern P with (k+ 1) spatial features is in C^+i if and only if for each feature fzP,(P-{f})& P'k; 6. if Cjt+j # 0, let A: = + 1, go to Step 2; 7. output UjPj ■ Fig. 3 Algorithm Min-Max Input: Output: Method: Rundimentary Algorithm If min_prev = 0 then algorithm can find the complete set of patterns. If min_prev > 0 then some patterns with high maximal participation ratio but low prevalence may be missed. Major disadvantage - If user wants to find the complete answer, the algorithm has to generate a huge number of candidates and test them, even though the maximal participation ration treshold min_maxPR is high. Weak monotonocity of maximal participation ratio Let P be a k-co-location pattern. Then, there exists at most one (k-1) - subpattern P' such that P' is subset of P and maxPR(P') < maxPR(P) If a k-pattern is above the maximal participation ratio treshold, then at least (k-1) out of its k subpatterns with (k-1) features are above the maximal participation ratio treshold. Algorithm maxPrune Example 8: (Candidate generation using weak monotonicity) Suppose the maximal participation ratio values of [A, B, C), {A,C,D} and {B, C, D] are all over the threshold min_maxPR, but that of {A, B, D) is not. We still should generate a candidate P — [A, B, C, D], since it is possible that maxPR(P) passes the threshold. To achieve this, we need a systematic way to generate the candidates. Please note that, in apriori, for the above example, {A, B, C, D] is generated only if {A, B, C) and {A, B, D} (differ only in their last spatial feature) are both frequent. However, in the co-location pattern mining with rare spatial features using maximal participation ratio measure, it is possible that [A, B, D} is below the given threshold minjriaxPR while {A, B, C, D] is above the threshold min_maxPR. In general, for two co-location patterns P and P' from the set Pk of ^-patterns above threshold minjnaxPR, i.e., P e Pk and P' e Pk, P and P' can be joined to generate a candidate (k + 1)-pattern in Q+1 if and only if P and P' have one different feature in the last two features. For example, even {A, B, D} is below threshold min_maxPR, candidate {A, B, C, D] can be generated by {A, B, C] and {A, C, D] since they have the common feature C in their last two features, i.e., they differ one spatial feature in their last two spatial features. ■ We will illustrate the correctness of the above candidate generation method in Lemma 3 and Example 9. Also, with the revised candidate generator, the mining algorithm is presented in Fig. 4. The algorithm does not need a minimum prevalence threshold but still finds all co-location patterns with maximal participation index above threshold min_maxPR. To make sure the candidate generation does not miss any co-location, we need to prove that the candidate (k + 1)-patterns Q+, generated by the maxPrune algorithm Algorithm maxPrune Input: A spatial database S, a neighborhood relation 71, a minimum maximal participation ratio min_maxPR. Output: Co-location patterns P such that maxPR(P) > mf'«„maxPR. Method: 1. let k = 2; generate C2, the set of candidate 2-patterns and their rowsets, by geometric methods; 2. For each C € Ck calculate maxPR(C) from C's rowset rowset(C); Let be the subset of Q such that for each P e Pk, maxPR(P) > minjnaxPR; 3. generate Ct+i, the set of candidates (k + l)-patterns, as illustrated in Example 8 ; if Ck+l ^ 0, let k — k + 1, go to Step 2; 4. output U/P,- ■ Fig. 4 Algorithm maxPrune Springer Thank you for your attention.