Contact searching for business partners
PV115 Laboratory of Knowledge Discovery
Juraj Jurčo
Faculty of Informatics, Masaryk University
October 4, 2011

Outline
• Motivation
• My work
• WePS
  • Definition
  • Problems
  • Business Partners
  • Example
  • Resources
• Documents processing
  • Approaches
  • Main content
  • Named entities
• Disambiguation
  • Methods
• What is done

Why do we want to search?
• Business reasons:
  • Sponsors
  • Partners
  • Looking for people working in a similar area
• Personal reasons:
  • Our contacts are out of date
  • We need contact details for a specific person

My master thesis
• In my master thesis I am trying to find business partners for the Faculty of Informatics of Masaryk University
• The search is limited to persons who own a company or manage one
• I will get a list of graduates (and also current students) of our faculty, and the program will try to find information about them
• If they are successful, the program suggests them as possible business partners for our faculty
• The output of the program will be a list of the contact information found for each person

Informal definition of problem
• We need to find information about a particular person on the Internet, and this information is spread across different pages or
documents.
• Usually we search through a search engine, which returns documents sorted by some criterion (e.g. PageRank)
• The search engine usually pre-processes our query

More formal definition of WePS
Let $q$ be a query for a search engine containing the name of the searched person. Let $S = \{S_1, S_2, \dots, S_m\}$ be the set of documents returned by search engine $E$. Let $PD = \{D_1, D_2, \dots, D_n\}$ be the set of documents containing the name of the searched person, where $PD \subset S$ and $m > n$. We suppose every document $D_i$ contains information about only one real person. The set $PD$ covers $k$ persons. The input of the algorithm is the query $q$ and the output is a set of $k$ clusters of documents $D_i$, each cluster belonging to one real person [1, Vu2007] [2, Yoshida2010].

Problems connected with WePS
• One name can refer to several people at the same time
• One page can mention several people
• Information on a page can be out of date
• Pages use JavaScript or Flash to display their content

Search for business partners
• The problem of searching for business partners is similar to the problem of searching for people on the web (WePS)
• From all the persons found with a given name, we need to choose the most likely one
• We also need to take the scope of the work into account
• Persons or companies that sponsored some event are more relevant
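
The input and output of the formal WePS definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are plain-text records, "containing the name" is taken to mean a case-insensitive substring match, and the names `Document`, `person_documents` and `is_valid_clustering` are hypothetical, not part of the thesis software. The sketch builds $PD$ from $S$ and checks that a proposed set of clusters really partitions $PD$; it does not decide which documents belong together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One search result; the fields are illustrative assumptions."""
    url: str
    text: str


def person_documents(name, results):
    """Build PD: the documents from S (results) whose text mentions the name.

    Assumption: a case-insensitive substring match is good enough here.
    """
    needle = name.casefold()
    return [doc for doc in results if needle in doc.text.casefold()]


def is_valid_clustering(clusters, pd_docs):
    """Check that the clusters partition PD.

    Every document of PD must appear in exactly one cluster, which mirrors
    the assumption that each document describes only one real person.
    """
    seen = [doc for cluster in clusters for doc in cluster]
    no_duplicates = len(seen) == len(set(seen))
    return no_duplicates and set(seen) == set(pd_docs)
```

With three search results of which two mention the name, `person_documents` returns those two documents, and any valid output for k = 2 places each of them in its own cluster.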

Problems connected with searching
• A page can contain only information about the company that sponsored an event and no information about the searched person
  • We have to find information about the company too
• Search engines do not index the whole content of the Internet (robots.txt)
• The searched name can be mentioned on the page of another company
  • Cooperation between companies

Human vs. Computer
Example: Ing. Zdeněk ŠTĚPÁNEK
• Chief executive of ALIMA a.s.
  • Birth date: 22.9.1961
  • Place: Caslava
  • Birth number: 6109220040
  • Address: Zdiměřická 852, 25242 Jesenice
• Project manager
  • Address: Jugoslávská 25, Ostrava-Jih, 700 30
  • Mobile: 777 077 709
  • Email: zstepanek@zeu.cz
  • IC: 27846181
• Chairman of the Czech office for guns and munition testing
  • Address: Jilmova 759/12, 130 00 Praha 3
  • Phone: 271773064
  • Web: www.cuzzs.cz
  • Email: stepanek@cuzzs.cz

Useful information resources
• Trade Register - information about companies and business persons
  • Managers moving between companies
• Social networks
• Job portals
• Company homepages

Used approaches
• Preprocessing of documents is very important
  • It has an impact on accuracy
  • Documents have to be pre-processed "on-the-fly"
• At the beginning the number of persons is not known
  • Clustering algorithms with a fixed number of classes are unusable

Main content extraction
• Web pages also contain irrelevant information
  • Header and footer contain other contact information
  • Menus and advertisements
• Methods used for main content extraction
  • No extraction, just transform the page to plain text [2, Yoshida2010]
    • Then we need other methods to determine the main content of the document
  • Extract a few characters (200, 500, 1000) around the found keyword [4, Jiang2010]
    • Some named entities occur near the searched name
  • Extraction based on similar pages on the server
    • Navigation elements referring to pages on the server
    • Costly in time and data transfer, but maybe more accurate

Named Entities I
• 8 important Named Entities (NE)
  • Name, Company, Address, Profession, Telephone number, Date of birth, E-mail, URL
• Name
  • A name can occur in different forms: John Smith, John Kennedy Smith, John K. Smith, and J.
Smith [3, Lefever2010] [4, Jiang2010]
• Name, Company, Address
  • Can be determined from text by: [4, Jiang2010]
    • Character Language Model - a model previously learned from examples [5, Alias2010]
    • Hidden Markov Model - a dynamic Bayesian network [6, HMM2010]

Named Entities II
• Profession, Telephone number, Date of birth
  • It was found that these entities occur near the searched name more often than other entities [4, Jiang2010]
• E-mail
  • Addresses often contain part of the searched name [4, Jiang2010]
• URL
  • URLs referring to the same page help in person disambiguation
• Compound Key-Words (CKW)
  • They do not occur often, but when they do, they are a strong identifier of a person [2, Yoshida2010]
  • Example: "chief software architect" in connection with "Bill Gates"

Disambiguation model
• The goal of the algorithms is to split the set PD into k clusters, where every cluster represents one real person
• The same person occurs in different contexts, usually with few and often noisy items [7, Han2009]
• Named Entities that occur repeatedly on the same page are more likely to be true [7, Han2009]

Disambiguation methods
• TruthFinder [7, Han2009]
  • Different resources can contain conflicting information
  • Truth analysis should find the true information about every object
• Distinct [7, Han2009]
  • Distinguishes objects with the same name
  • Uses Agglomerative Hierarchical Clustering and repeatedly merges the most similar clusters
• Graph
  • A graph is created based on Named Entities
  • NEs are nodes, and an edge between two nodes represents the occurrence of both NEs in one document
  • After a clustering algorithm runs on the graph, the graph is split into
clusters, where every cluster represents one person

Disambiguation methods
• Fuzzy Ants [3, Lefever2010]
  • Inspired by ants that collect the dead bodies of other ants into heaps
  • The number of heaps is not specified in advance
  • Ants can pick up one item or a heap of items
  • Ants have their own intelligence: each decides whether to pick up or drop an item/heap
• Two-stage Agglomerative Hierarchical Clustering
  • Words are divided into strong and weak
  • Strong words (NE and CKW) can split documents between clusters; weak words (e.g. "algorithm" for a programmer) cannot
  • In the first stage the algorithm splits clusters based on strong words and assigns weights to weak words. In the second stage it also takes the weights of weak words into account.

What is done
Modules
• Search through web search engines: Google, Yahoo! (Bing uses Yahoo! results)
• ARES - Access to Registers of Economic Subjects
  • Search person by name
  • Basic - basic info about a person
  • Address - address standardization
  • Výpis identifikačních údajů (Standard)
  • Obchodní rejstřík (Commercial Register)
  • Registr živnostenského podnikání (Trade Register)
  • Statistický registr RES (Register of Economic Entities, Statistical Office)
  • Registr církví a náboženských společností (Churches and Religious Societies)
  • Registr pojišťovacích zprostředkovatelů a likvidátorů pojistných událostí (Insurance Intermediaries, Loss Adjusters)
  • Seznam devizových míst a licencí (Foreign Exchange Spots and Licences)
  • Seznam občanských sdružení a spolků (Civic Associations, Guilds, Clubs)
  • Přehled ekonomických subjektů (Economic Entities)
  • Registr zdravotnických zařízení (Register of Health)
  • Zemědělský registr (Common Agricultural Register)
  • Registr politických stran a hnutí (Political Parties, Movements)
  • Rejstřík škol (School Register)
  • Insolventní registr (Insolvent Register)

What is done
• Detection of the document encoding (cpDetector):
  http://cpdetector.sourceforge.net/
• Processing of JavaScript (HtmlUnit project)
  • Inline JavaScript, event handlers
  • http://htmlunit.sourceforge.net/javascript-howto.html
• Multi-threaded download manager
  • Downloads documents from web servers
  • Every server can have connection restrictions
    • IP address restrictions (proxy server)
    • Maximum m simultaneous downloads (at any one time the server processes only m requests from one IP address)
    • Request delay (requests must not come too often)
    • Maximum n requests in t time units (seconds, minutes, hours, days)
  • Violating these restrictions can lead to being blocked by the server

I'm working on
Download web pages from different sites
• Why?
  • Use searching on these web pages
  • Not every page is indexed by search engines (robots.txt)
  • Pages use JavaScript (events)
• Automatic transformation of the page to RDF/FOAF format (in case the result is XML or well structured)
• When it is not easy to recognize Named Entities and the document requires linguistic pre-processing, the document is just downloaded
• Sites involved: facebook.com, foaf.sk,

Future
• Extract Named Entities from documents
• Cluster documents
• Suggest the most likely person
• Visualize (time, time, time)
• Any suggestions?

[1] [Vu2007] Vu, Quang Minh, Tomonari Masada, Atsuhiro Takasu, and Jun Adachi. Disambiguation of People in Web Search Using a Knowledge Base.
2007 IEEE International Conference on Research, Innovation and Vision for the Future (March 2007): 185-191
[2] [Yoshida2010] Yoshida, Minoru. Person Name Disambiguation by Bootstrapping. Search (2010): 10-17
[3] [Lefever2010] Lefever, E., T. Fayruzov, V. Hoste, and M. De Cock. Clustering web people search results using fuzzy ants. Information Sciences 180, no. 17 (September 2010): 3192-3209
[4] [Jiang2010] Jiang, Lili, Wei Shen, Jianyong Wang, and Ning An. GRAPE: A System for Disambiguating and Tagging People Names in Web Search. Analysis (2010): 1257-1260
[5] [Alias2010] Alias-i. LingPipe: Character Language Model Tutorial. URL: http://alias-i.com/lingpipe/demos/tutorial/lm/read-me.html
[6] [HMM2010] Hidden Markov model. In Wikipedia, The Free Encyclopedia. URL: http://en.wikipedia.org/w/index.php?title=Hidden_Markov_model&oldid=398697711
[7] [Han2009] Han, Jiawei. Mining Heterogeneous Information Networks by Exploring the Power of Links (2009): 13-30
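
Supplement to the download manager slide: two of the per-server restrictions listed there (a minimum delay between requests, and at most n requests in t time units) can be sketched as a small sliding-window limiter. This is a hedged illustration, not the thesis implementation: the class name `ServerLimiter` and its parameters are hypothetical, time is passed in explicitly so the logic can be checked without sleeping, and the limit of m simultaneous downloads and the proxy handling are left out.

```python
from collections import deque


class ServerLimiter:
    """Tracks requests to one server (hypothetical helper, names assumed).

    max_requests: at most this many requests per window (the n on the slide)
    window:       length of the window in seconds (the t on the slide)
    min_delay:    minimum number of seconds between two requests
    """

    def __init__(self, max_requests, window, min_delay=0.0):
        self.max_requests = max_requests
        self.window = window
        self.min_delay = min_delay
        self._sent = deque()  # timestamps of requests still inside the window

    def wait_time(self, now):
        """Return how many seconds to wait before sending a request at `now`."""
        # Forget requests that have already left the sliding window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        wait = 0.0
        if self._sent:
            # Respect the minimum delay after the most recent request.
            wait = max(wait, self._sent[-1] + self.min_delay - now)
        if len(self._sent) >= self.max_requests:
            # Window is full: wait until the oldest request drops out of it.
            wait = max(wait, self._sent[0] + self.window - now)
        return max(wait, 0.0)

    def record(self, now):
        """Remember that a request was sent at time `now`."""
        self._sent.append(now)
```

A download thread would call `wait_time(time.monotonic())`, sleep for the returned number of seconds, send the request, and then call `record` with the current time. Keeping one limiter per server lets each server have its own values of n, t and delay.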