Clementine Tutorial
Source: Dr. Teh Ying Wah

This tutorial will introduce you to the Clementine toolkit for data mining and show you how to get started with your own data mining project.

The first part provides a tour of the workspace, including an update of what's new in this version of Clementine. The second part is a step-by-step guide to data mining in Clementine. All of the files shown in the examples are installed with Clementine so that you can follow along.

Clementine uses a visual approach to data mining that provides a tangible way to work with data. Each process in Clementine is represented by an icon, or node, that you connect to form a stream representing the flow of data through a variety of processes.

Working in Clementine is essentially like using a visual metaphor to describe the world of data, statistics, and complex algorithms.
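As a rough analogy for the node-and-stream idea described above, the following Python sketch treats each node as a function and a stream as the chain that passes records from one node to the next. It is not Clementine code: the names source_node, select_node, and run_stream are hypothetical, and the records are made up for illustration.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Node = Callable[[Iterable[Record]], Iterable[Record]]


def source_node() -> Iterator[Record]:
    """Hypothetical source node: yields a few made-up records."""
    yield {"Age": 23, "BP": "HIGH"}
    yield {"Age": 47, "BP": "LOW"}
    yield {"Age": 61, "BP": "NORMAL"}


def select_node(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical process node: keep only records with Age over 40."""
    return (r for r in records if r["Age"] > 40)


def run_stream(source: Callable[[], Iterable[Record]], *nodes: Node) -> list:
    """Pass the data from the source through each connected node in order."""
    data = source()
    for node in nodes:
        data = node(data)
    return list(data)


result = run_stream(source_node, select_node)
print(result)
```

Each function stands in for one icon on the canvas, and the order of the arguments to run_stream stands in for the connections between them.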
Although it may take a minute to shift into this paradigm, you will soon find that Clementine's simplicity of use is exceedingly powerful. Let's take a closer look.

To start Clementine, from the Windows Start menu choose:
- Programs
- Clementine

When you first start Clementine, the workspace opens in the default view. The tools here are used to help you create a visual representation of data mining operations.

First, the area in the middle is called the stream canvas. This is the main area you will use to work in Clementine.

Most of the data and modeling tools in Clementine reside in palettes, the area below the stream canvas. Each tab contains groups of nodes that are a graphical representation of data mining tasks, such as accessing and filtering data, creating graphs, and building models.

To add nodes to the canvas, double-click icons from the node palettes or drag and drop them onto the canvas.
You then connect them to create a stream, representing the flow of data. You will learn more about building streams later in this tutorial.

On the top right side of the window are the output and object managers. These tabs are used to view and manage a variety of Clementine objects.

The Streams tab contains all streams open in the current session. You can save and close streams as well as add them to a project.

The Outputs tab contains a variety of files produced by stream operations in Clementine. You can display, rename, and close the tables, graphs, and reports listed here.

The Models tab is a powerful tool that contains all generated models (models that have been built in Clementine) for a session. Models can be examined closely, added to the stream, exported, or annotated.
Note: The Models tab replaces the Generated Models tab from earlier versions of Clementine.

On the bottom right side of the window is the projects tool, used to create and manage data mining projects. There are two ways to view projects you create in Clementine: Classes view and CRISP-DM view.

The CRISP-DM tab provides a way to organize projects according to the Cross-Industry Standard Process for Data Mining, an industry-proven, nonproprietary methodology. For both experienced and first-time data miners, using the CRISP-DM tool will help you to better organize and communicate your efforts.

The Classes tab provides a way to organize your work in Clementine categorically, by the types of objects you create. This view is useful when taking inventory of data, streams, models, etc.

As a data mining application, Clementine offers a strategic approach to finding useful relationships in large data sets.
In contrast to more traditional statistical methods, you do not necessarily need to know what you are looking for when you start. You can explore your data, fitting different models and investigating different relationships, until you find useful information.

This section provides:
- An overview of the types of data-mining problems Clementine can help solve.
- A hands-on demonstration of building streams, deriving fields, using graphs, and modeling in Clementine.

A wide variety of organisations use Clementine to help them mine vast repositories of data. Following is a sample of the types of problems data mining can help solve.
Public sector
Governments around the world use data mining to explore massive data stores, improve citizen relationships, detect occurrences of fraud such as money laundering and tax evasion, detect crime and terrorist patterns, and enhance the expanding realm of e-government.

CRM
Customer relationship management can be improved thanks to smart classification of customer types and accurate predictions of churn. Clementine has successfully helped businesses attract and retain the most valuable customers in a variety of industries.

Web mining
With powerful sequencing and prediction algorithms, Clementine contains the necessary tools to discover exactly what guests do at a Web site and deliver exactly the products or information they desire. From data preparation to modeling, the entire data-mining process can be managed inside of Clementine.
Drug discovery and bioinformatics
Data mining aids both pharmaceutical and genomics research by analyzing the vast data stores resulting from increased lab automation. Clementine's clustering and classification models help generate leads from compound libraries, while sequence detection aids the discovery of patterns.

Clementine provides templates for many of these data-mining applications. Clementine Application Templates, also known as CATs, are available for the following types of activities:
- Web mining
- Fraud detection
- Analytical CRM
- Telecommunications analytical CRM
- Microarray analysis
- Crime detection and prevention

Let's get started learning how Clementine can help you conduct your own data mining project.

This section of the guide will show you how to build and execute simple streams using sample drug demonstration files that are included with Clementine.
You will learn how to work with data in the various phases of data mining, including:
- Visualization, which helps you gain an overall picture of your data. You can create plots and charts to explore relationships among the fields in your data set and generate hypotheses to explore during modeling.
- Manipulation, which lets you clean and prepare the data for modeling. You can sort or aggregate data, filter out fields, discard or replace missing values, and derive new fields.
- Modeling, which gives you the broadest range of insight into the relationships among data fields. Models perform a variety of tasks, such as predicting outcomes, detecting sequences, and grouping similar items. These help your organization grow, streamline processes, detect fraud, and retain the most valuable customers.

For this section, imagine that you are a medical researcher compiling data for a study.
You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications. Part of your job is to use data mining to find out which drug might be appropriate for a future patient with the same illness.

The data fields used in this demo are:

Age          (Number)
Sex          M or F
BP           Blood pressure: HIGH, NORMAL, or LOW
Cholesterol  Blood cholesterol: NORMAL or HIGH
Na           Blood sodium concentration
K            Blood potassium concentration
Drug         Prescription drug to which a patient responded

The first step is to load the data file using a Variable File node. You can add a Variable File node from the palettes: either click the Sources tab to find the node or use the Favorites tab, which includes this node by default. Next, double-click the newly placed node to open its dialog box.

Click the button just to the right of the File box marked with ellipses (...).
This opens a dialog box for browsing to the directory in which Clementine is installed on your computer (or server). Open the demos directory and select the file called DRUG1n.

Select Read field names from file and notice the fields and values that have just been loaded into the dialog box. Before clicking OK to close the dialog box, take a moment to look at the data using the other tabs on the Source node.

Click the Data tab to override and change storage for a field. Note that storage is different from type, or usage of the data field.

The Filter tab can be used to remove any fields from the data that is brought into Clementine. Clicking on a field's arrow will mark it with a red X and filter it out. For this tutorial, though, we want to keep all fields.

The Types tab helps you learn more about the type of fields in your data.
You can also choose Read Values to view the actual values for each field based on the selections that you make from the Values column. This process is known as instantiation.

Now that you have loaded the data file, you may want to glance at the values for some of the records. One way to do this is by building a stream that includes a Table node. To place a Table node in the stream, either double-click the icon in the palette or drag and drop it onto the canvas.

Note: Double-clicking a node from the palette will automatically connect it to the selected node in the stream canvas. However, you cannot connect to terminal nodes like tables and graphs.

Next, if the nodes are not already connected, you can use your middle mouse button to connect the Source node to the Table node.
To simulate a middle mouse button, hold down the Alt key while using the mouse.

Now that you have built a stream, you must execute it in order to view its output. Click the green arrow button on the toolbar to execute the stream and view an output table showing all of the records in the data file.
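For readers who want a code analogy, the two-node stream built above (a source that reads field names from the file, connected to a table) can be sketched in plain Python. This is only an illustration and does not use any Clementine API. The function names are hypothetical, the comma delimiter is an assumption, and the three sample rows are made up to match the field list; they are not taken from the DRUG1n file.

```python
import csv
import io

# Made-up sample rows laid out like the tutorial's field list.
# These values and drug labels are invented, NOT the real DRUG1n data.
SAMPLE = """Age,Sex,BP,Cholesterol,Na,K,Drug
23,F,HIGH,HIGH,0.79,0.03,drugA
47,M,LOW,HIGH,0.74,0.06,drugB
61,F,NORMAL,NORMAL,0.56,0.03,drugA
"""


def read_variable_file(text):
    """Source step: read delimited text, taking field names from the first line."""
    reader = csv.DictReader(io.StringIO(text))
    records = list(reader)
    return list(reader.fieldnames), records


def format_table(fields, records):
    """Terminal step: render every record as one row of a fixed-width text table."""
    widths = [max([len(f)] + [len(r[f]) for r in records]) for f in fields]
    lines = ["  ".join(f.ljust(w) for f, w in zip(fields, widths)).rstrip()]
    for r in records:
        lines.append("  ".join(r[f].ljust(w) for f, w in zip(fields, widths)).rstrip())
    return "\n".join(lines)


# "Executing the stream": the source feeds the table.
fields, records = read_variable_file(SAMPLE)
table = format_table(fields, records)
print(table)
```

To try this on a real file, pass its contents to read_variable_file in place of SAMPLE, adjusting the delimiter if the file is not comma-separated.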