What Are the Grand Challenges for Data Mining? KDD-2006 Panel Report

Gregory Piatetsky-Shapiro, KDnuggets, gps at acm.org
Robert Grossman, UIC & Open Data Group, rlg at opendatagroup.com
Chabane Djeraba, U. of Lille, Chabane.Djeraba at lifl.fr
Ronen Feldman, U. of Bar-Ilan & ClearForest, ronenf at gmail.com
Lise Getoor, U. of Maryland, getoor at cs.umd.edu
Mohammed Zaki, RPI, zaki at cs.rpi.edu

ABSTRACT
We discuss what makes an exciting and motivating Grand Challenge problem for data mining and propose criteria for a good Grand Challenge. We then consider possible Grand Challenge problems from multimedia mining, link mining, large-scale modeling, text mining, and proteomics. This report is the result of a panel held at the KDD-2006 conference.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining

General Terms
Measurement, Performance, Experimentation.

Keywords
Data mining, bioinformatics, multimedia mining, image mining, video mining, link mining, text mining, web mining, grand challenge, X-Prize.

1. INTRODUCTION
Recently we have seen several major scientific and engineering advances that were stimulated by a grand challenge or prize [CSM06, WSJ06]. The DARPA Grand Challenge produced great advances in robotic car navigation in 2005; the X-Prize led to the first successful commercial spaceflight; and RoboCup (www.robocup.org), whose goal is to develop a team of humanoid robots that can win against the human world soccer champion team by 2050, has greatly advanced robotic performance and created many enthusiasts. Looking further back, the first solo transatlantic flight by Charles Lindbergh in 1927 was also stimulated by a prize competition.

These examples show that a Grand Challenge problem can get researchers, the press, funding agencies, venture capitalists, and the public interested, greatly stimulate research, and produce dramatic advances in science and technology.

What are the grand challenge problems for data mining? The question is timely: the X-Prize Foundation is looking for additional fields in which prizes can be created. We propose the following criteria for a good grand challenge problem for data mining:

1) The problem is hard: very difficult to solve given the current state of the art.
2) Data mining plays an important role in solving the problem.
3) The problem is based on a large, publicly available data set.
4) There is a specific goal, so that it is clear when the problem is solved.
5) The problem is interesting to researchers and understandable to the public, and preferably can be stated in one sentence.
6) There is significant public benefit if it is solved.

Some potential ideas for a grand challenge include:

* Automatic tagging and classification of 1 billion digital photos on the web. A company called Riya (www.riya.com) is already working on a smaller-scale project.
* Identifying all genes and potential therapeutic targets for some specific types of cancer.
* A text-mining and understanding system that can use the web to pass standard tests, e.g. the SAT in World History.
* Literature-based discovery of the side effects of a drug X ([Swan86] is one of the earliest examples).
* Fraud detection based on company financial statements -- can we find another Enron before it collapses?

Perhaps the KDD-06 panel already had some effect: on October 2, 2006, Netflix announced a $1 million prize for a program that substantially increases the accuracy of predictions about how much someone is going to love a movie based on their movie preferences (see www.netflixprize.com/).
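To make "accuracy of predictions" concrete: competitions of this kind score a predictor by its error on held-out ratings, typically root mean squared error (RMSE). The sketch below is only an illustration of such an evaluation harness; the held-out triples and the `predict_rating` stand-in are hypothetical and not part of the Netflix data or rules.

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error between predicted and true ratings."""
    assert len(predictions) == len(actuals)
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)
    )

# Hypothetical held-out (user, movie, true_rating) triples.
held_out = [("u1", "m7", 4.0), ("u2", "m7", 2.0), ("u1", "m9", 5.0)]

def predict_rating(user, movie):
    # Stand-in predictor: a real entry would learn from the training ratings.
    return 3.6

preds = [predict_rating(u, m) for u, m, _ in held_out]
truth = [r for _, _, r in held_out]
print("RMSE:", rmse(preds, truth))
```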
The Netflix Prize satisfies the first five of our proposed criteria. Another related prize is the Genomics X-Prize, recently announced by the X-Prize Foundation for technology that sequences the human genome quickly and cheaply [GenomeX06]. There data mining plays some role, but it is not central to the solution.

In the rest of this report we examine possible Grand Challenge problems in several hot research areas: multimedia mining, link mining, large-scale data mining, text mining, and proteomics.

2. MULTIMEDIA MINING (Djeraba)
The rapid progress of data acquisition and storage technology has led to a tremendous amount of multimedia data stored in databases and files, and the amount of this data continues to grow very rapidly. Although valuable information is contained in multimedia data, the great majority of it is unstructured or semi-structured, which makes it difficult (if not impossible) for human beings to extract that information without powerful tools. Multimedia data is of no use unless we can actually access and mine it [Zai03]. How will users explore the vast and growing multimedia information, including images, video, and audio, at their disposal? There is a need to make sense of multimedia data and to use multimedia content effectively and efficiently.

2.1 Grand Challenges
Let us consider the following grand challenge: annotate 1000 hours of digital video in one hour. 1000 hours is approximately the amount of "rush" (raw, unedited) video produced daily by top news agencies, and annotating it currently requires thousands of man-hours of manual work; the annotation of a single image at the National Geographic Society takes about 20 minutes. The challenge is to automate the entire annotation process.

Advances in pattern recognition, automatic extraction of text from the speech accompanying video (when available), and recognition of text written on images may offer a pragmatic way to bridge the semantic gap. Another pragmatic way to bridge the semantic gap is to:

* Automatically extract primitive (low-level) features from large video databases, including key frames, shots, and other classical primitive features (e.g., colour distributions, Fourier transforms, wavelets, texture histograms, colour histograms, shape primitives, filter primitives).
* Annotate a subset of the video database (e.g., the presence of human faces, a red ball, white, blue, and green clothing, people in the background, a handball game, an attack, a defence, other actions). In certain situations, medium-level features (e.g., faces, clothing) may be extracted automatically.
* Then, on the basis of the frequent patterns relating primitive features to the annotations (semantic features) of the annotated subset, generalize the annotations to the rest of the video database. Most of the complexity of the process lies in this last step (a sketch of it follows Figure 2 below).

Other grand challenges may also be considered, including:

* Predicting user interest in the video lectures of a particular video web site on the basis of the first 5 minutes of browsing.
* Scanning an archive of video broadcasts to find similar interviews with a particular individual, e.g. a person running for political office.
* Extracting from football (soccer) games the patterns that characterize the actions during the minute before a goal.

Extracting low-level features such as colour distribution, texture, and shape from pixels is easy. Extracting medium-level features such as human faces, a red ball, white, blue, and green clothing, and people in the background is possible. However, extracting high-level features such as a handball game, an attack, a defence, or other actions is very difficult without user annotations.

Figure 2: Low-, medium-, and high-level features
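As an illustration of the last step of the approach above (generalizing annotations from a labeled subset to the rest of the video), here is a minimal sketch in which each key frame is summarized by a colour histogram and an off-the-shelf classifier (scikit-learn's RandomForestClassifier) stands in for the frequent-pattern step described in the text; the frames and labels are randomly generated placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def colour_histogram(frame, bins=8):
    """Low-level feature: normalized joint RGB histogram of one key frame (H x W x 3 array)."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

# Hypothetical data: a small annotated subset of key frames and an unannotated remainder.
rng = np.random.default_rng(0)
annotated_frames = [rng.integers(0, 256, (120, 160, 3)) for _ in range(20)]
annotations = rng.choice(["face", "ball", "crowd"], size=20)      # semantic labels
unlabeled_frames = [rng.integers(0, 256, (120, 160, 3)) for _ in range(5)]

X_train = np.array([colour_histogram(f) for f in annotated_frames])
X_rest = np.array([colour_histogram(f) for f in unlabeled_frames])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, annotations)
print(model.predict(X_rest))   # propagated annotations for the unannotated frames
```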
2.2 Great Research Areas
The grand challenges above belong to two broad research areas, involving usage and data respectively.

1) Mining user behaviours during interactions with multimedia data, and using the knowledge extracted in this process to anticipate future behaviours or to diagnose medical or psychological conditions of the users. The difficulty is to mine not only explicit actions (interactions, navigation) but also implicit reactions such as eye/gaze fixation, emotions, heartbeat, respiration rate, stress, etc. A further difficulty is to use non-intrusive sensors (e.g., cameras) rather than intrusive ones, and to make it possible to track actions on multimedia data. This means that the tools for images, videos, and audio should record user operations (e.g., play, pause, visualize, eye fixation) together with the multimedia pieces concerned by these operations, so that user actions can be mined, considering, for example, intra- and inter-video actions.

2) Crossing the semantic gap between multimedia data and semantics. The difficulty is to automatically extract the meaning of multimedia content so that its exploitation (e.g., retrieval) using semantic information can be tailored to individual applications (security, marketing, business, etc.). Multimedia data is the most natural information-conveying vehicle but also the most complex to index and mine. It is a very difficult process considering the high volume (the rapid explosion of available multimedia information), the complexity (videos, 3D models, audio, images), and the heterogeneity of the data (streams, several sources). The difficulty is to generate metadata that describes the content and that can be exploited by applications. Document semantics has been studied for quite some time. What is now needed is to develop approaches to extract semantics from multimedia documents [Gro05] so that retrievals using concept-based queries can be tailored to individual users. The semantic gap, or, as others put it, the semantic chasm, must be crossed. Multimedia usage mining coupled with domain ontologies may be a revolutionary way to deal with the lack of semantics in multimedia information, and will certainly contribute to the hot domain of multimedia semantics.

3. LINK MINING GRAND CHALLENGE PROBLEM (Getoor)
There is an increasing need to both learn and extract structured data. Much of the input to today's data mining and machine learning algorithms is structured, often in the form of a graph or network; examples include social networks, biological networks, and communication networks. At the same time, in many cases there is a desire to learn structured outputs, for example extracting graphs describing entities and relationships from unstructured data. Link mining [Get05] refers both to making use of the observed network structure during learning and inference, and to inferring the (unobserved) link structure from other observations. Examples include using links for ranking nodes, using links for collective classification of nodes, and discovering links by predicting missing links or inferring new kinds of links and relationships.
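To make "using links for ranking nodes" concrete, the sketch below is a minimal, self-contained power-iteration PageRank over a toy link graph; it illustrates the idea only and is not tuned for web-scale graphs.

```python
def pagerank(graph, damping=0.85, iters=50):
    """graph: dict mapping each node to the list of nodes it links to."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, targets in graph.items():
            if not targets:                       # dangling node: spread its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m in targets:
                    new[m] += damping * rank[n] / len(targets)
        rank = new
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(sorted(pagerank(toy_graph).items(), key=lambda kv: -kv[1]))
```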
Link mining tasks can be broken down into the following categories:

* Node-centric
  - Labeling/ranking nodes (a.k.a. collective classification / PageRank)
  - Consolidating nodes (a.k.a. entity resolution)
  - Discovering hidden nodes (a.k.a. group discovery)
* Edge-centric
  - Labeling/ranking edges
  - Predicting the existence of edges
  - Predicting the number of edges
  - Discovering new relations/paths
* Graph/subgraph-centric
  - Discovering frequent sub-patterns
  - Generative models
  - Metadata discovery, extraction, and reformulation

Current research mostly focuses on a single task, such as node ranking or link prediction. In real data analysis scenarios, and particularly for a Grand Challenge, we need a mix of all of these capabilities.

The requirements for a Grand Challenge problem are discussed in Section 1. While there is much structured data available, and even more unstructured data, finding a problem that meets the requirements is non-trivial. There are many problems which match some of the criteria, such as social relevance, but for which the data is not publicly available, or for which the required domain knowledge is quite specialized. One domain for which the data is available, the data mining tasks are difficult yet compelling and socially relevant, the required knowledge is accessible, and not a great number of research groups are already at work, is Wikipedia.

Wikipedia has generated a lot of interest in recent years, ranging from its founder and foremost evangelist, Jimmy Wales, who describes Wikipedia as a project whose "goal [is] to distribute a free encyclopedia to every single person on the planet in their own language", to its detractors, such as Wikipedia co-founder Larry Sanger, who says that "Wikipedia has gone from a nearly perfect anarchy to an anarchy with gang rule" [Schiff06]. Other commentary includes that of open-source movement figure Eric Raymond, who opines that "disaster is not too strong a word for Wikipedia... the site is infested with moonbats" [Schiff06].

Regardless of one's opinion of Wikipedia, it is a great testbed for link mining algorithms. There are interesting studies involving building descriptive models of Wikipedia's growth; see, for example, en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia's_growth. Another useful task is predicting whether a contributor is a "wiki gnome" (a benevolent contributor who makes lots of edits, fixing typos and grammar mistakes) or a "wiki troll" (a destructive user whose edits are malicious). Text classification is also important, for example checking whether an article adheres to the Wikipedia tenet that contributions must maintain a neutral point of view (NPOV). Link prediction is also relevant, e.g. identifying where links should exist. This becomes even more compelling because, as Wikipedia grows, it becomes harder for any given author to know about other relevant information to which they should link. A link prediction method could help with this by doing link suggestion or automatic link construction. Evaluation can be done by generating a dataset of Wikipedia pages, removing some of the existing links, and then seeing if a system can identify those places and suggest appropriate links. Other link mining tasks abound, including trust/reputation analysis, social network analysis and identification of communities, evaluation of accuracy, identification of misuse (including vandalism and self-promotion), and evaluation of coverage (which areas are not covered, or are poorly covered or linked?).
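The link-suggestion evaluation just described (remove some existing links, then check whether the system can recover them) can be sketched as follows; the tiny page-link graph is hypothetical, and the common-neighbours score is only a baseline stand-in for a real link predictor.

```python
from itertools import combinations

def common_neighbour_scores(adj, candidate_pairs):
    """Score each non-linked page pair by how many linked pages the two have in common."""
    return {(u, v): len(adj[u] & adj[v]) for u, v in candidate_pairs}

# Hypothetical page-link graph (made undirected for simplicity), with one edge held out.
adj = {"Paris":  {"France", "Seine"},
       "France": {"Paris", "Seine", "Rouen"},
       "Seine":  {"Paris", "France", "Rouen"},
       "Rouen":  {"France", "Seine"}}
held_out_edge = ("Paris", "Rouen")            # pretend this link was removed before scoring

candidates = [p for p in combinations(adj, 2)
              if p[1] not in adj[p[0]]]       # page pairs with no observed link
scores = common_neighbour_scores(adj, candidates)
best = max(scores, key=scores.get)
print("suggested link:", best,
      "recovered held-out link:", set(best) == set(held_out_edge))
```

Scaled up to all of Wikipedia, this kind of held-out-link evaluation gives the link-suggestion task a concrete score.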
However, while each of these is an interesting research topic, none of them really serves as a Grand Challenge. Instead, we propose the following Wikipedia Test: given a collection of entries, some constructed via participatory journalism (such as the entries in Wikipedia) and some via automatic link mining tools, can you distinguish the real Wikipedia entries from the automatically generated ones? Furthermore, which are better? Evaluation could be done via a panel of human experts. Or one can even automate the evaluation by leaving the entries up on Wikipedia and checking on their eventual page rank!

One compelling aspect of this challenge problem is that its solution will require a variety of integrated link mining capabilities. Another is that funding may already be available: the Hutter Prize (http://prize.hutter1.net/) provides 50,000 EUR for compressing 100 MB of Wikipedia to less than 18 MB.

4. THE GRAND CHALLENGE OF ESTIMATING ONE BILLION PREDICTIVE MODELS (Grossman)
Large data sets can present challenges for data mining for a variety of reasons. One reason is that the data may be a mixture from several different sub-populations, each of which could benefit from a separate statistical or data mining model that is estimated using data just from that sub-population. For some applications, the sub-populations themselves may be unknown, and part of the challenge is to estimate them.

It has been a common practice for some time to build several different models in data mining. Manually segmenting populations and building a separate statistical model for each segment is a standard technique in statistics. For example, dividing a potential target audience into several different segments and estimating the parameters of a separate logistic model for each segment is a very common methodology for building response models in marketing. Another example is provided by ensemble-based modeling techniques: over the past two decades, a variety of ensemble-based techniques have been used that estimate different statistical models either by re-sampling a small data set or by partitioning a large data set.

The challenge we address here is that of automatically estimating the parameters of thousands or millions of individual statistical or data mining models, which can be required for very large or very complex data sets.

Here is an example from [Grossman06]. The data in this example comes from 833 traffic sensors in the Chicago metropolitan region, and the goal is to identify anomalous traffic patterns. In addition to the traffic sensor data, there is also semi-structured data about the weather and text data about any special events that can affect traffic, such as sports events. The goal is to decide whether traffic is unusual or anomalous; it is important to note that the goal is not to detect whether the traffic is congested, which is quite simple. The approach taken was to segment the data by hour of the day (24 hours), by day of the week (7 days), and by small segment of the highway (about 250 highway segments). This produced about 24 x 7 x 250, or 42,000, different segments. For each segment, the parameters of a separate change detection model were estimated using data belonging to that segment. In this way, over 42,000 separate statistical models were automatically created, updated, and used for detecting anomalous changes in the traffic.
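As a toy illustration of the segment-per-(hour, weekday, road segment) scheme described above, the sketch below keys a separate running mean/variance model on each (hour, day, segment) triple and flags readings that deviate sharply from that segment's own history. The real system in [Grossman06] used change-detection models; the simple z-score rule here is only a stand-in, and the sensor readings are made up.

```python
from collections import defaultdict
import math

class SegmentModel:
    """Running mean/variance for one (hour, weekday, road segment) cell."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):                       # Welford's online update
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)
    def is_anomalous(self, x, z=3.0):
        if self.n < 30:                        # not enough history for this cell yet
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return std > 0 and abs(x - self.mean) / std > z

models = defaultdict(SegmentModel)             # one model per key, created on demand

def observe(hour, weekday, road_segment, speed):
    key = (hour, weekday, road_segment)
    flag = models[key].is_anomalous(speed)
    models[key].update(speed)
    return flag

# Hypothetical stream of sensor readings: steady speeds, then one sharp deviation.
for reading in [(8, "Mon", "I-90:12", s) for s in [25, 27, 26, 24] * 10 + [55]]:
    if observe(*reading):
        print("anomalous traffic:", reading)
```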
Due to the size of the data, its complexity, and its heterogeneity, this approach proved to be preferable to building fewer models.

Today, a variety of applications are emerging in which it makes sense to consider estimating a billion separate statistical models. Here are some examples:

* In online marketing, one could build a separate statistical model for each consumer. For large online companies in the near future, this could produce over one billion separate models.
* In detecting network anomalies, one could build a separate statistical model for each IPv4 or IPv6 address.
* As a final example, for modern approaches to therapeutics, one could build a separate model predicting the efficacy of a new drug or treatment based upon a person's genotype. In other words, each genotype could be used to build a separate model. Again, over time, this could yield over a billion different models (cf. [Church05]).

We close with two remarks. First, this challenge is not concerned with estimating a single model with the property that the model scales to a large number of different feature vectors. Today, there are a variety of techniques that can be used to estimate the parameters of a model that will work for a large number of different feature vectors. The challenge addressed here is to estimate the parameters of a very large number of different models, each of which can work with a large number of different feature vectors.

Second, if we think of multiple models as arising from segmenting a large data set into segments D_1, ..., D_m, some care is needed when stating the challenge to exclude using so many segments that the overall accuracy is harmed rather than helped by the segmentation. Here is one way to proceed. Given a data set D and a fixed class of possible models F, we can define the optimal partition number m_optimal > 0 as follows:

1. Consider the number of segments m, where m = 1, 2, 3, ...
2. For each m, consider all ways P of partitioning the data set D into m segments D_1, ..., D_m.
3. Let L_P denote the minimum total misclassification rate over all models f_1, ..., f_m in F built on the data sets D_1, ..., D_m. Note that L_P is a function of m, written L_P(m).
4. Define m_optimal to be the smallest m that minimizes L_P(m), i.e., m_optimal = min { argmin_m L_P(m) }.

As the size and complexity of the data set D grow, so does the optimal partition number m_optimal. One way of stating the challenge is to develop algorithms and an associated infrastructure that scale to large optimal partition numbers, in particular to optimal partition numbers greater than 1,000,000,000.
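The definition above ranges over all partitions of D and is not meant to be computed directly. The sketch below conveys its spirit under strong simplifying assumptions that are ours, not the text's: candidate partitions are restricted to k-means clusterings, the model class F is logistic regression, the error is training misclassification, and the data set is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def total_error(X, y, m, seed=0):
    """Partition X into m segments with k-means, fit one model per segment,
    and return the overall misclassification rate."""
    seg = (np.zeros(len(X), dtype=int) if m == 1
           else KMeans(n_clusters=m, n_init=10, random_state=seed).fit_predict(X))
    wrong = 0
    for s in range(m):
        idx = seg == s
        if len(set(y[idx])) < 2:               # single-class segment: a constant model suffices
            continue
        clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        wrong += (clf.predict(X[idx]) != y[idx]).sum()
    return wrong / len(X)

# Hypothetical mixture data: two sub-populations with opposite decision rules.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2)); X[200:] += 6.0
y = np.r_[(X[:200, 0] > 0).astype(int), (X[200:, 0] < 6.0).astype(int)]

errors = {m: total_error(X, y, m) for m in (1, 2, 4, 8)}
m_optimal = min(m for m in errors if errors[m] == min(errors.values()))
print(errors, "m_optimal over the tried values:", m_optimal)
```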
5. GRAND CHALLENGES FOR TEXT MINING (Feldman)
Text mining is an exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, NLP, IR, and knowledge management. Text mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques for analyzing these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and the visualization of the results. Here are some of the challenges facing the text mining research area.

Challenge 1: Entity Extraction. Most text analytics systems rely on accurate extraction of entities and relations from documents. However, the accuracy of entity extraction systems in some domains reaches only 70-80%, which creates a noise level that prohibits the adoption of text mining systems by a wider audience. We are seeking domain-independent and language-independent NER (named entity recognition) systems that can reach an accuracy of 99-100%. Building on such systems, we are seeking domain-independent and language-independent relation extraction systems that can reach a precision of 98-100% and a recall of 95-100%. Since the systems should work in any domain, they must be totally autonomous and require no human intervention.

Challenge 2: Autonomous Text Analysis. Text analytics systems today are pretty much user-guided, and they enable users to view various aspects of a corpus. We would like to have a text analytics system that is totally autonomous and will analyze huge corpora and come up with truly interesting findings that are not captured by any single document in the corpus and are not previously known. The system can utilize the internet to filter out findings that are already known. The "interest" measure, which is inherently subjective, would be defined by a committee of experts in each domain. Such systems can then be used for alerting purposes in the financial domain, the anti-terror domain, the biomedical domain, and many other commercial domains. The system would receive streams of documents from a variety of sources and send emails to relevant people if an "interesting" finding is detected.

Building on the systems developed in Challenges 1 and 2, we would like to have (and this is our text mining grand challenge) text mining systems that are able to pass standard reading comprehension tests such as the SAT, GRE, GMAT, etc. Systems that achieve above-average scores will win the grand challenge. The systems can utilize the web when answering the test questions. We view this grand challenge as an extension of the classic Turing test.

This grand challenge satisfies most of the criteria that were set out for the various challenges. First, no system today is able to achieve an above-average score on any of the standard tests. Second, the criterion for success is very well defined. Third, we believe that within 5 years researchers will be able to build such systems based on technologies developed for annual competitions such as ACE, TREC, and TIDES. Finally, having such systems will contribute to the advancement of humankind, as the underlying technologies deployed by these systems can be used by children and adults to more rapidly acquire knowledge about various topics.
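The precision and recall targets above can be made concrete with the standard exact-match evaluation of extracted entity spans; the gold and predicted spans below are hypothetical.

```python
def precision_recall(gold, predicted):
    """Exact-match entity evaluation: entities are (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # spans that match exactly, type included
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical document: gold annotations vs. system output for one sentence.
gold = {(0, 14, "ORG"), (34, 39, "LOC"), (52, 63, "PERSON")}
pred = {(0, 14, "ORG"), (34, 39, "GPE"), (52, 63, "PERSON")}

p, r = precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")   # the mislabeled middle span hurts both measures
```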
6. MINING THE PROTEOME (Zaki)
Large-scale databases from sequencing projects, microarray studies, gene-function studies, protein-protein interactions, comparative genomics, structural biology, and open-access journal articles are growing at rapid rates. The challenge in systems biology is to connect all the dots from the diverse molecular, cellular, organism, and environmental data sources to deduce how sub-systems and whole organisms work. We need to decipher the language of life: the language of the genome, protein folding, developmental pathways, and much more. There are numerous computational challenges in collecting, indexing, searching, and mining these vast data sources. Mining the diverse sources of public data will be a crucial component in piecing together the bigger picture. In particular, mining the proteome, especially mining new protein interactions and functionally enriching the proteome, is emerging as a grand challenge for data mining.

6.1 Protein Functions
Proteins are the fundamental molecules of life. A protein has a three-dimensional (3D) fold that determines its roles. Proteins play diverse roles in the cell, such as:

* Proteins as molecular machines: proteins can change their shape, allowing opening/closing movements as well as twists and turns. Thus proteins can function as molecular switches.
* Proteins as catalysts: enzymes are proteins that act as catalysts, speeding up biochemical reactions by several orders of magnitude and making life possible.
* Proteins in pathways: proteins take part in "sequences" of biochemical reactions, or pathways, to enable a wide variety of functions.

Some of the common functions of proteins include:
a) Metabolism: proteins mediate chemical reactions.
b) Signaling: proteins are involved in signaling within and between cells.
c) Regulation: proteins act as gateways in cellular membranes.
d) Cellular structure: proteins help define cell shape and form.
e) Transportation: proteins are involved in moving other proteins, oxygen, sugar, nutrients, and wastes into and around cells.
f) Movement: proteins play a role in muscle contraction and cell movement.
g) DNA transcription: transcription factors are proteins that turn genes on and off.
h) Immunity: proteins identify germs and other foreign substances and mark them for destruction.

6.2 Genomics to Proteomics
Genes, which are contiguous stretches of DNA, encode the information needed to manufacture proteins. Often there is a very complex regulatory network, involving several genes, that controls the production of proteins. Protein formation happens via two main steps: transcription of the gene into an mRNA molecule, and translation of the mRNA molecule into a protein, as illustrated in Figure 3.

Figure 3: From Genes to Proteins

In the traditional view, it was thought that one gene gave rise to one mRNA transcript, which in turn produced one protein. However, the current view is that we have to consider all the genes (the genome), all the mRNA transcripts (the transcriptome), and all the proteins (the proteome) in totality. Thus a single gene can produce many transcripts (via alternative splicing), and these transcripts can produce many proteins. The proteins undergo many post-translational modifications that further increase protein diversity. For example, in humans there are around 30,000 genes, yet there are over a million proteins when one accounts for post-translational modifications. Also note that whereas the genome is a static information repository, both the transcriptome and the proteome are dynamic, since they change in response to the cellular environment.

6.3 Data Mining Challenge: Functional Annotation & Mining of the Proteome
The proteome is the complete set of proteins in a cell under a given set of conditions. It is dynamic and complex, and is characterized in terms of:

* Structure: shape, electrostatic properties, etc.
* Abundance: protein expression level, i.e., the quantity of protein present.
* Localization: sub-cellular location.
* Modifications: post-translational modifications.
* Interactions: protein-protein interactions (PPI; also called the interactome).

The goal of functional annotation of the proteome is to comprehensively catalog the following information:

* Why is a given protein produced (biological process)?
* What kind of molecule is it (molecular function)?
* Where is it found (cellular localization)?

Functional annotation can help characterize unknown or hypothetical proteins via "guilt by association".
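A minimal form of "guilt by association" assigns an unannotated protein the functions that are most common among its annotated interaction partners. In the sketch below, the tiny PPI neighbourhood and the GO-style labels are illustrative placeholders, not curated data.

```python
from collections import Counter

def predict_functions(protein, ppi, annotations, top_k=2):
    """Vote over the annotated interaction partners of `protein`."""
    votes = Counter()
    for partner in ppi.get(protein, ()):
        votes.update(annotations.get(partner, ()))
    return [func for func, _ in votes.most_common(top_k)]

# Hypothetical PPI neighbourhood and known annotations for the partners.
ppi = {"P_unknown": {"TP53", "MDM2", "ATM"}}
annotations = {"TP53": {"apoptosis", "DNA damage response"},
               "MDM2": {"protein ubiquitination", "apoptosis"},
               "ATM":  {"DNA damage response", "cell cycle checkpoint"}}

print(predict_functions("P_unknown", ppi, annotations))
# the most frequent partner functions become candidate annotations for the unknown protein
```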
The data mining grand challenge is to determine which proteins are present, in what quantities, where they are localized, and what they interact with (binary and complex interactions). If we could predict 3D protein shape from sequence alone, we could then directly infer the protein-protein interactions and other interactions involving proteins. However, protein structure prediction is a grand challenge in its own right.

What we do have is a growing amount of publicly available data from protein mass spectrometry, protein arrays, PPI datasets across organisms, PubMed journal articles (requiring text/literature mining), and transcriptomics data (e.g., microarray datasets). The challenge is to integrate all these sources, to mine new protein interactions, and to create a complete functional categorization of all proteins. For example, Figure 4 shows part of the PPI network involving 90 human proteins, with 266 interactions.

Figure 4: A part of the PPI network involving 90 human proteins, with 266 interactions

6.4 Public Data Sources
There is a wealth of data available that has to be integrated and mined to help solve the above grand challenge problem. The sources of data include:

* Protein expression and raw PPI databases: these data come from 2D gel electrophoresis, affinity chromatography, mass spectrometry, and protein chips/arrays.
* Literature-curated PPI databases: there are many publicly available datasets cataloging PPIs across species. Examples include HPRD (Human Protein Reference Database), which has 18,284 proteins and 33,710 interactions; DIP (Database of Interacting Proteins), which contains 19,075 proteins and 55,757 interactions; MINT (Molecular Interaction Database), which catalogs 26,055 proteins and 72,436 interactions; BIND (Biomolecular Interaction Network Database), which notes 205,846 interactions; IntAct (45,888 proteins and 68,269 interactions); BioGrid (305,683 proteins and 158,134 interactions); and many others (MIPS, CYGD for yeast, HPID, etc.).
* Potential interactions (predictome): there are databases that list inferred interactions, such as InterDom (30,037 putative interactions) and OPHID, the Online Predicted Human Interaction Database (49,008 predicted interactions).
* Ortholog databases: orthologs are proteins that are homologs in other organisms. Using orthologs, it is possible to infer new interactions, say in humans, by checking whether the orthologous proteins interact in other organisms such as yeast, fly, or worm (a sketch of this mapping follows this list). For example, the Inparanoid database has ortholog information from 26 organisms, spanning 463,242 sequences. Interactions inferred this way (called interologs) can also be predicted from sequence searches in model organisms, via BlastP (protein-protein BLAST) searches.
* Post-translational modifications: databases like RESID list hundreds of known protein modifications (such as glycosylation, phosphorylation, etc.).
* Literature mining: mining PubMed journal articles, looking for protein interactions, is extremely useful in inferring new relationships among proteins. Databases like iHOP (Information Hyperlinked over Proteins) list such mined data.
* Transcriptomics data: these are databases that catalog gene expression and knockout information. Examples include cDNA libraries, DNA microarrays, mutagenesis and gene knockout experiments, and RNA interference (RNAi) databases. These data also provide clues as to the interacting proteins and the functional modules.
* Gene Ontology (GO): the three categories of the GO hierarchy span biological process, molecular function, and sub-cellular location for many genes. The GO data can be integrated into proteomics studies to check the validity of mined modules.
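The ortholog-based inference mentioned in the list above (interolog mapping) amounts to a simple join between an ortholog map and a model-organism interaction set. The mappings below are hypothetical stand-ins for what one would pull from Inparanoid and a yeast PPI database.

```python
def infer_interologs(orthologs, model_organism_ppi):
    """Map each interacting pair in a model organism onto its human orthologs."""
    inferred = set()
    for a, b in model_organism_ppi:
        for ha in orthologs.get(a, ()):
            for hb in orthologs.get(b, ()):
                inferred.add(tuple(sorted((ha, hb))))
    return inferred

# Hypothetical yeast-to-human ortholog map and yeast interactions.
orthologs = {"yA": {"HUMAN_A"}, "yB": {"HUMAN_B"}, "yC": {"HUMAN_C1", "HUMAN_C2"}}
yeast_ppi = [("yA", "yB"), ("yB", "yC")]

print(infer_interologs(orthologs, yeast_ppi))
# predicted human interactions, to be checked against curated PPI databases
```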
7. SUMMARY
This is an opportune time to consider and propose Grand Challenge problems for data mining. A good Grand Challenge problem should be hard, involve data mining, rely on a large public dataset, have a specific goal, be interesting to researchers and the public, and promise significant public benefit if solved. We offer this discussion of possible grand challenge problems as a first step toward creating such Data Mining Grand Challenges.

8. REFERENCES
[Church05] G. M. Church. The Personal Genome Project. Molecular Systems Biology, 2005, doi:10.1038/msb4100040.
[CSM06] "Grand challenges spur grand results: Private groups are offering big cash prizes to anyone who can solve a range of daunting problems." The Christian Science Monitor, January 12, 2006. www.csmonitor.com/2006/0112/p13s01-stss.html
[GenomeX06] Genomics X Prize home page. www.xprize.org/xprizes/genomics_x_prize.html
[Get05] L. Getoor and C. Diehl (eds.). SIGKDD Explorations, Special Issue on Link Mining, December 2005.
[Gro05] W. I. Grosky, N. Patel, X. Li, and F. Fotouhi. Dynamically Emerging Semantics in an MPEG-7 Image Database. The Computer Journal, 48(5):536-544, 2005.
[Grossman06] R. L. Grossman et al. Real Time Change Detection and Alerts from Highway Traffic Data. ACM/IEEE International Conference for High Performance Computing and Communications (SC '05).
[Schiff06] S. Schiff. Know It All: Can Wikipedia Conquer Expertise? The New Yorker, July 31, 2006.
[Swan86] D. R. Swanson. Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge. Perspectives in Biology and Medicine, 30:7-18, 1986.
[WSJ06] Prize for DNA Decoding Aims to Fuel Innovation. Wall Street Journal, January 27, 2006.
[Zai03] O. Zaiane, S. Simoff, and C. Djeraba (eds.). Knowledge Discovery from Multimedia and Complex Data. LNAI 2797, Springer Verlag, 2003. ISBN 3-540-20305-2.