Clementine Tutorial
Source: Dr. Teh Ying Wah
<http://fsktm.um.edu.my/~tehyw/6317_lab1.ppt>
This tutorial will introduce you to the
Clementine toolkit for data mining and
show you how to get started with your own
data mining project.
The first part provides a tour of the
workspace, including an update of what's
new in this version of Clementine.
The second part is a step-by-step guide to
data mining in Clementine. All of the files
shown in the examples are installed with
Clementine so that you can follow along.
Clementine uses a visual approach to data
mining that provides a tangible way to
work with data.
Each process in Clementine is
represented by an icon, or node, that you
connect to form a stream representing the
flow of data through a variety of
processes.
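
To make the node-and-stream idea concrete, here is a minimal Python sketch
of a stream as a chain of connected nodes. It is only an analogy:
Clementine streams are built visually, and the names source_node,
select_node, and run_stream, as well as the two sample records, are
hypothetical and not part of Clementine.

```python
from typing import Callable, Iterable, List

Record = dict
Node = Callable[[Iterable[Record]], Iterable[Record]]


def source_node(records: List[Record]) -> Node:
    """Source node: ignores upstream input and emits the given records."""
    def run(_upstream: Iterable[Record]) -> Iterable[Record]:
        return iter(records)
    return run


def select_node(predicate: Callable[[Record], bool]) -> Node:
    """Process node: passes on only the records that satisfy the predicate."""
    def run(upstream: Iterable[Record]) -> Iterable[Record]:
        return (r for r in upstream if predicate(r))
    return run


def run_stream(nodes: List[Node]) -> List[Record]:
    """Connect the nodes in order and pull the data through all of them."""
    data: Iterable[Record] = []
    for node in nodes:
        data = node(data)
    return list(data)


# Hypothetical records, for illustration only.
records = [{"id": 1, "status": "HIGH"}, {"id": 2, "status": "LOW"}]
stream = [source_node(records), select_node(lambda r: r["status"] == "HIGH")]
result = run_stream(stream)
print(result)
```

Each function returns a node, and the list order plays the role of the
connections you would draw on the canvas.
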
Working in Clementine is essentially like
using a visual metaphor to describe the
world of data, statistics, and complex
algorithms.
Although it may take a minute to shift into
this paradigm, you will soon find that
Clementine's simplicity of use is
exceedingly powerful. Let's take a closer
look.
To start Clementine:
From the Windows Start menu choose:
Programs > Clementine
When you first start Clementine, the
workspace opens in the default view.
The tools here are used to help you create
a visual representation of data mining
operations.
First, the area in the middle is called the
stream canvas. This is the main area you
will use to work in Clementine.
Most of the data and modeling tools in
Clementine reside in palettes, the area
below the stream canvas.
Each tab contains groups of nodes that are a
graphical representation of data mining tasks,
such as accessing and filtering data, creating
graphs, and building models.
To add nodes to the canvas, double-click icons
from the node palettes or drag and drop them
onto the canvas. You then connect them to
create a stream, representing the flow of data.
You will learn more about building streams
later in this tutorial.
On the top right side of the window are the
output and object managers. These tabs
are used to view and manage a variety of
Clementine objects.
The Streams tab contains all streams
open in the current session. You can save
and close streams as well as add them to
a project.
The Outputs tab contains a variety of files
produced by stream operations in
Clementine. You can display, rename, and
close the tables, graphs, and reports listed
here.
The Models tab is a powerful tool that
contains all generated models (models
that have been built in Clementine) for a
session. Models can be examined closely,
added to the stream, exported, or
annotated.
Note: The Models tab replaces the
Generated Models tab from earlier
versions of Clementine.
On the bottom right side of the window is
the projects tool, used to create and
manage data mining projects.
There are two ways to view projects you
create in Clementine: Classes view and
CRISP-DM view.
The CRISP-DM tab provides a way to
organize projects according to the Cross-Industry
Standard Process for Data
Mining, an industry-proven, nonproprietary
methodology. For both experienced and
first-time data miners, using the CRISP-DM
tool will help you to better organize
and communicate your efforts.
The Classes tab provides a way to
organize your work in Clementine
categorically, by the types of objects you
create. This view is useful when taking
inventory of data, streams, models, etc.
As a data mining application, Clementine
offers a strategic approach to finding
useful relationships in large data sets. In
contrast to more traditional statistical
methods, you do not necessarily need to
know what you are looking for when you
start. You can explore your data, fitting
different models and investigating different
relationships, until you find useful
information.
This section provides:
An overview of the types of data-mining
problems Clementine can help solve.
A hands-on demonstration of building
streams, deriving fields, using graphs, and
modeling in Clementine.
A wide variety of organisations use
Clementine to help them mine vast
repositories of data. Following is a sample
of the types of problems data mining can
help solve.
Public sector
Governments around the world use data
mining to explore massive data stores,
improve citizen relationships, detect
occurrences of fraud such as money
laundering and tax evasion, detect crime
and terrorist patterns, and enhance the
expanding realm of e-government.
CRM
Customer relationship management can
be improved thanks to smart classification
of customer types and accurate
predictions of churn. Clementine has
successfully helped businesses attract and
retain the most valuable customers in a
variety of industries.
Web mining
With powerful sequencing and prediction
algorithms, Clementine contains the
necessary tools to discover exactly what
guests do at a Web site and deliver
exactly the products or information they
desire. From data preparation to modeling,
the entire data-mining process can be
managed inside of Clementine.
Drug discovery and bioinformatics
Data mining aids both pharmaceutical and
genomics research by analyzing the vast
data stores resulting from increased lab
automation. Clementine's clustering and
classification models help generate leads
from compound libraries while sequence
detection aids the discovery of patterns.
Clementine provides templates for many of
these data-mining applications. Clementine
Application Templates, also known as CATs, are
available for the following types of activities:
Web mining
Fraud detection
Analytical CRM
Telecommunications analytical CRM
Microarray analysis
Crime detection and prevention
Let's get started learning how Clementine can help you conduct your
own data mining project.
This section of the guide will show you how to build and execute
simple streams using sample drug demonstration files that are
included with Clementine. You will learn how to work with data in the
various phases of data mining, including:
Visualization, which helps you gain an overall picture of your data. You
can create plots and charts to explore relationships among the fields in
your data set and generate hypotheses to explore during modeling.
Manipulation, which lets you clean and prepare the data for modeling.
You can sort or aggregate data, filter out fields, discard or replace
missing values, and derive new fields.
Modeling, which gives you the broadest range of insight into the
relationships among data fields. Models perform a variety of tasks such
as predicting outcomes, detecting sequences, and grouping similar records.
These help your organization grow, streamline processes, detect fraud, and
retain the most valuable customers.
For this section, imagine that you are a medical
researcher compiling data for a study.
You have collected data about a set of patients,
all of whom suffered from the same illness.
During their course of treatment, each patient
responded to one of five medications.
Part of your job is to use data mining to find out
which drug might be appropriate for a future
patient with the same illness.
The data fields used in this demo are:
Age          (Number)
Sex          M or F
BP           Blood pressure: HIGH, NORMAL, or LOW
Cholesterol  Blood cholesterol: NORMAL or HIGH
Na           Blood sodium concentration
K            Blood potassium concentration
Drug         Prescription drug to which a patient responded
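
The field list above can also be written down as a small record type,
which is a handy way to keep the allowed values in view. The following
Python sketch is an illustration only: the class name PatientRecord, the
problems method, and the sample values (including the drug name) are
hypothetical, and the numeric fields are assumed to be plain numbers.

```python
from dataclasses import dataclass
from typing import List

SEX_VALUES = {"M", "F"}
BP_VALUES = {"HIGH", "NORMAL", "LOW"}
CHOLESTEROL_VALUES = {"NORMAL", "HIGH"}


@dataclass
class PatientRecord:
    age: float
    sex: str
    bp: str
    cholesterol: str
    na: float        # blood sodium concentration
    k: float         # blood potassium concentration
    drug: str        # drug to which the patient responded

    def problems(self) -> List[str]:
        """Return a list of fields whose values fall outside the listed sets."""
        issues = []
        if self.sex not in SEX_VALUES:
            issues.append(f"unexpected Sex value: {self.sex!r}")
        if self.bp not in BP_VALUES:
            issues.append(f"unexpected BP value: {self.bp!r}")
        if self.cholesterol not in CHOLESTEROL_VALUES:
            issues.append(f"unexpected Cholesterol value: {self.cholesterol!r}")
        return issues


# Hypothetical values, invented for illustration.
sample = PatientRecord(age=40, sex="F", bp="NORMAL", cholesterol="HIGH",
                       na=0.7, k=0.05, drug="example_drug")
print(sample.problems())
```
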
The first step is to load the data file using
a Variable File node. You can add a
Variable File node from the palettes: either
click the Sources tab to find the node or
use the Favorites tab, which includes this
node by default. Next, double-click the
newly placed node to open its dialog box.
Click the button just to the right of the File
box marked with ellipses (...). This opens a
dialog box for browsing to the directory in
which Clementine is installed on your
computer (or server). Open the demos
directory and select the file called
DRUG1n.
Select Read field names from file and
notice the fields and values that have just
been loaded into the dialog box. Before
clicking OK to close the dialog box, take a
moment to look at the data using the other
tabs on the Source node.
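
For readers who think in code, the effect of reading field names from the
file can be sketched in Python. This is a rough analogy, not how
Clementine works internally. It assumes a comma-delimited text file whose
first line holds the field names; the in-memory sample text, its values,
the function name load_records, and the fallback names field1, field2,
and so on are all hypothetical.

```python
import csv
import io

# Hypothetical stand-in for a demo data file; the values are invented.
SAMPLE_TEXT = (
    "Age,Sex,BP,Cholesterol,Na,K,Drug\n"
    "23,F,HIGH,HIGH,0.79,0.03,example_drug_1\n"
    "47,M,LOW,HIGH,0.74,0.06,example_drug_2\n"
)


def load_records(handle, read_field_names=True):
    """Return (field_names, rows) from a comma-delimited text source."""
    if read_field_names:
        reader = csv.DictReader(handle)
        names = list(reader.fieldnames or [])  # first line supplies the names
        rows = [dict(row) for row in reader]
        return names, rows
    raw_rows = list(csv.reader(handle))
    # Without a header line, fall back to generic names (an assumption here).
    names = [f"field{i + 1}" for i in range(len(raw_rows[0]))] if raw_rows else []
    return names, [dict(zip(names, row)) for row in raw_rows]


names, rows = load_records(io.StringIO(SAMPLE_TEXT))
print(names)
print(rows[0])
```

With read_field_names=False the first line is treated as data, which is
why the option matters when a file does carry a header.
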
Click the Data tab to override and change
storage for a field. Note that storage is
different from type, or usage of the data
field.
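
The idea of overriding storage can be pictured as converting how a value
is held (for example, text versus number) without changing what the field
is used for. The Python sketch below is a loose analogy under that
assumption; override_storage and the sample record are hypothetical.

```python
def override_storage(records, storage):
    """Return copies of the records with the named fields converted.

    `storage` maps a field name to a conversion function such as int or
    float; fields that are not listed are left unchanged.
    """
    def keep(value):
        return value

    return [
        {name: storage.get(name, keep)(value) for name, value in record.items()}
        for record in records
    ]


# Hypothetical record in which every value was read as text.
raw = [{"Age": "23", "Na": "0.79", "BP": "HIGH"}]
converted = override_storage(raw, {"Age": int, "Na": float})
print(converted)
```
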
The Filter tab can be used to remove any
fields from the data that is brought into
Clementine. Clicking on a field's arrow will
mark it with a red X and filter it out. For
this tutorial, though, we want to keep all
fields.
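
In code terms, filtering out a field simply means dropping that column
from every record before it travels further down the stream. A minimal
Python sketch of the idea follows; filter_fields and the sample records
are hypothetical and only illustrate the concept.

```python
def filter_fields(records, drop):
    """Return copies of the records without the fields named in `drop`."""
    dropped = set(drop)
    return [
        {name: value for name, value in record.items() if name not in dropped}
        for record in records
    ]


# Hypothetical records, invented for illustration.
records = [
    {"Age": 23, "Sex": "F", "BP": "HIGH"},
    {"Age": 47, "Sex": "M", "BP": "LOW"},
]
print(filter_fields(records, drop=["Sex"]))
print(filter_fields(records, drop=[]))  # keep all fields, as in this tutorial
```
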
The Types tab helps you learn more about
the type of fields in your data. You can
also choose Read Values to view the
actual values for each field based on the
selections that you make from the Values
column. This process is known as
instantiation.
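
Reading the values of each field can be thought of as one pass over the
data that records a range for numeric fields and a set of distinct values
for the others. The Python sketch below illustrates that idea under this
simplifying assumption; the function name instantiate, the "range" and
"set" labels, and the sample records are hypothetical and do not describe
Clementine's internal behavior.

```python
def instantiate(records):
    """Summarize each field as a numeric range or a set of distinct values."""
    summary = {}
    for name in records[0]:
        values = [record[name] for record in records]
        if all(isinstance(v, (int, float)) for v in values):
            summary[name] = ("range", min(values), max(values))
        else:
            summary[name] = ("set", sorted(set(values)))
    return summary


# Hypothetical records, invented for illustration.
records = [
    {"Age": 23, "BP": "HIGH"},
    {"Age": 47, "BP": "LOW"},
    {"Age": 61, "BP": "HIGH"},
]
summary = instantiate(records)
print(summary)
```
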
Now that you have loaded the data file,
you may want to glance at the values for
some of the records.
One way to do this is by building a stream
that includes a Table node. To place a
Table node in the stream, either double-click
the icon in the palette or drag and
drop it onto the canvas.
Note: Double-clicking a node from the
palette will automatically connect it to the
selected node in the stream canvas.
However, you cannot connect to terminal
nodes like tables and graphs.
Next, if the nodes are not already
connected, you can use your middle
mouse button to connect the Source node
to the Table node. To simulate a middle
mouse button, hold down the Alt key while using
the mouse.
Now that you have built a stream, you
must execute it in order to view its output.
Click the green arrow button on the toolbar
to execute the stream and view an output
table showing all of the records in the data
file.
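
The result of executing a source-to-table stream is, in effect, every
record laid out in rows under the field names. As a rough analogy, the
Python sketch below prints such a table as plain text; format_table and
the sample records are hypothetical, and the layout is not meant to match
Clementine's table window.

```python
def format_table(records):
    """Lay the records out as aligned text columns under a header row."""
    names = list(records[0])
    widths = {
        name: max(len(name), *(len(str(record[name])) for record in records))
        for name in names
    }
    lines = ["  ".join(name.ljust(widths[name]) for name in names)]
    for record in records:
        lines.append("  ".join(str(record[name]).ljust(widths[name]) for name in names))
    return "\n".join(lines)


# Hypothetical records, invented for illustration.
records = [
    {"Age": 23, "BP": "HIGH"},
    {"Age": 47, "BP": "LOW"},
]
table = format_table(records)
print(table)
```
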