Marketing Information Systems: part 3
Course code: PV250
Dalia Kriksciuniene, PhD
Faculty of Informatics, Lasaris lab., Autumn, 2014

Computational methods for marketing
•Business intelligence: analytical reporting (pivoting)
•Statistical methods: probabilistic
•Artificial intelligence: directed learning:
•Neural networks NN
•Memory-Based Reasoning MBR
•Survival analysis
•Artificial intelligence: undirected learning:
•Segmentation
•Clustering
•Association rules
•Fuzzy inference (possibilities, natural language reasoning)
•Web data mining
Dalia Krikščiūnienė, MKIS 2014, Brno

Data Mining Techniques Applications
•Marketing – Predictive DM techniques, like artificial neural networks (ANN), have been used for
target marketing including market segmentation.
•Direct marketing – customers are likely to respond to new products based on their previous
consumer behavior.
•Retail – DM methods have likewise been used for sales forecasting.
•Market basket analysis – uncover which products are likely to be purchased together.

Artificial intelligence (AI):
The subfield of computer science concerned with symbolic reasoning and problem solving

Characteristics of artificial intelligence
•Symbolic processing (versus Numeric)
•Heuristic (versus algorithmic)
•Inferencing
•Machine learning
•Heuristics
• Informal, judgmental knowledge of an application area that constitutes the “rules of good
judgment” in the field. Heuristics also encompasses the knowledge of how to solve problems
efficiently and effectively, how to plan steps in solving a complex problem, how to improve
performance, and so forth.
•It can be transferred as tacit knowledge
•Marketing activities are heuristic to a large extent

•Inferencing
•Reasoning capabilities that can build higher-level knowledge from existing heuristics
•Expert knowledge and experience capturing
•Machine learning
•Learning capabilities that allow systems to adjust their behavior and react to changes in the
outside environment

Designing the Knowledge Discovery System
1.Business Understanding – To obtain the highest benefit from data mining, there must be a clear
statement of the business objectives.
2.Data Understanding – Knowing the data well can permit the designer to tailor the algorithm or
tools used for data mining to his/her specific problem.
3.Data Preparation – Data selection, variable construction and transformation, integration, and
formatting
4.Model building and validation – Building an accurate model is a trial-and-error process. The
process often requires a data mining specialist to iteratively try several options until the best
model emerges.
5.Evaluation and interpretation – Once the model is determined, the validation dataset is fed
through the model.
6.Deployment – Involves implementing the ‘live’ model within an organization to aid the decision
making process.

 CRISP-DM Data Mining Process Methodology

The Iterative Nature of the Knowledge Discovery process

Data Mining Technique categories
1.Predictive Techniques
•Classification: serves to predict a discrete outcome variable.
•Prediction or Estimation: predicts a continuous outcome (as opposed to classification techniques
that predict discrete outcomes).
2.Descriptive Techniques
•Affinity or association: serves to find items closely associated in the data set.
•Clustering: creates clusters according to similarity defined by the set of variables describing the
input objects, rather than by an outcome variable.

Web Data Mining - Types
1.Web structure mining – Examines how the Web documents are structured, and attempts to discover
the model underlying the link structures of the Web.
•Intra-page structure mining evaluates the arrangement of the various HTML or XML tags within a
page
•Inter-page structure refers to hyper-links connecting one page to another.
2.Web usage mining (Clickstream Analysis) – Involves the identification of patterns in user
navigation through Web pages in a domain.
•Processing, Pattern analysis, and Pattern discovery
3.Web content mining – Used to discover what a Web page is about and how to uncover new knowledge
from it.

Barriers to the use of DM
•Two of the most significant barriers that prevented the earlier deployment of knowledge discovery
in business relate to:
•Lack of data to support the analysis
•Limited computing power to perform the mathematical calculations required by the data mining
algorithms.

Variables for consideration in airline planning

Classification of data mining methods for CRM

Neural networks
•They are used for classification, regression, and time series forecasting tasks
•Supervised and unsupervised learning
•Supervised means that you have data samples with a known outcome (e.g. credit success and
failure cases). These samples are used to create the NN model by learning. The outcome for new,
unknown samples is computed according to the NN model
•Unsupervised means that we do not know the outcome for the samples, but we can cluster them
according to their similarity by taking into account all known information, stored in data records
consisting of many variables.

A good NN problem has the following characteristics
•Inputs are well understood. You know which features (indicators) are important, but do not
necessarily know how to combine them
•Outputs are well understood. You know what you are trying to model
•Experience is available: you have enough examples where both input and output are known. These
cases will be used to train the network
•A black box model is acceptable. Explaining and interpreting the model is not necessary

Neural network analysis
•Neural network performance is based on each node’s activation function
•Inputs are combined into a single value, which is then passed to a transfer function to produce the output
•Each input has its own weight
•Usually the combination function is a weighted sum
•Other possibilities exist, e.g. a max function (a radial basis network uses a different combination function)
•The transfer function is a 0-1 step function or a sigmoid (continuous)
•If it is linear, the neural network is the same as linear regression
•The sigmoid is sensitive in the middle range: a small change in input makes a big difference in output
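
The bullets above can be illustrated with a minimal sketch of a single node: a weighted sum as the combination function, followed by a sigmoid transfer function. The input values, weights, and function names are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    """Logistic transfer function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    """Combination function (weighted sum plus constant), then transfer function."""
    combined = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(combined)

# Hypothetical inputs and weights; the combined value is 0.2 - 0.3 + 0.2 = 0.1.
out = neuron_output([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=0.0)

# Sensitivity: the same change of 1.0 in the combined value moves the output
# much more in the middle range than far away from it.
middle_change = sigmoid(0.5) - sigmoid(-0.5)
tail_change = sigmoid(5.5) - sigmoid(4.5)
```

Comparing `middle_change` with `tail_change` shows why the sigmoid is described as sensitive in the middle range.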

Neural network analysis
•The NN behaves almost linearly over large ranges and non-linearly over small ranges
•The power of the NN lies in its non-linear behavior, due to the activation of its constituent units
•This leads to the requirement that inputs have similar ranges (standardized or near 0)
•In this case weight adjustment will have a bigger impact

Neural network models
The generally applied network types for designing neural network models are Multilayer Perceptron,
Radial Basis Function and Probabilistic Neural Network.
The main difference is in their algorithms, used for analysis and grouping of the input cases for
further classification.

The Multilayer Perceptron NN model
•The following diagram illustrates a perceptron network with three layers:
•This network has an input layer (on the left) with three neurons, one hidden layer (in the middle)
with three neurons and an output layer (on the right) with three neurons.
•There is one neuron in the input layer for each predictor variable. In the case of categorical
variables, N-1 neurons are used to represent the N categories of the variable.

Multilayer perceptron
•The hidden layer gets inputs from all nodes in the input layer
•Standardization is important
•In the hidden layer the hyperbolic tangent is preferred, as it gives positive and negative values
•The transfer function of the output depends on the target
•For a continuous target, linear is preferred
•For a binary target, logistic, which behaves as a probability
•One hidden layer is usually sufficient
•The wider it is, the bigger the capacity the NN gains
•The drawback of widening the hidden layer is memorizing instead of generalizing (overfitting)
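
As a rough sketch of the structure described above, the forward pass of a perceptron with one hidden layer can be written in a few lines. It assumes a binary target, so the hidden nodes use the hyperbolic tangent and the output uses the logistic function. All weights below are hypothetical.

```python
import math

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with tanh nodes, one logistic output node."""
    # Every hidden node receives all inputs.
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    # The output combines all hidden nodes plus a weighted constant.
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical network: 3 standardized inputs, 2 hidden nodes, 1 output.
x = [0.2, -0.4, 1.1]
probability = forward(
    x,
    hidden_weights=[[0.5, -0.3, 0.8], [-0.6, 0.1, 0.4]],
    hidden_biases=[0.1, -0.2],
    output_weights=[1.2, -0.7],
    output_bias=0.05,
)
```

For a continuous target the last line of `forward` would simply return `z`, which corresponds to the linear transfer function mentioned in the list.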

Multilayer perceptron
•A small number of hidden layer nodes with non-linear transfer functions is sufficient for very
flexible models
•The output is a weighted linear combination
•Usually the output is one value, calculated from all nodes of the hidden layer
•There is one additional input: a constant, which is weighted as well
•Topologies can vary: an NN can have more outputs (e.g. when calculating the probability that a
customer will buy in each of the departments, the NN has an output for each department)
•The results can be used in different ways, usually selected by experimenting: take the max, take the
top 3, take those above a threshold, take those within a given percentage of the max
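
The four ways of using multiple outputs can be sketched as follows. The department names and scores are made up for illustration, and "within a percentage of the max" is interpreted here as a fraction of the best score, which is an assumption.

```python
def take_max(scores):
    """Department with the single highest output."""
    return max(scores, key=scores.get)

def take_top(scores, n=3):
    """The n departments with the highest outputs, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def take_above(scores, threshold):
    """All departments whose output exceeds a fixed threshold."""
    return [name for name, s in scores.items() if s > threshold]

def take_within(scores, fraction):
    """All departments scoring at least `fraction` of the best output."""
    best = max(scores.values())
    return [name for name, s in scores.items() if s >= fraction * best]

# Hypothetical NN outputs, one per department.
scores = {"toys": 0.62, "books": 0.35, "garden": 0.10, "sports": 0.58, "food": 0.81}

best = take_max(scores)
top3 = take_top(scores, 3)
above = take_above(scores, 0.5)
close = take_within(scores, 0.75)
```

Which rule works best is usually decided by experiment, as the slide notes.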

Multilayer perceptron
•Training is performed on one set in order to test performance with the other
•It is similar to finding the one best-fit line for regression
•In an NN there is no single best-fit solution; it uses optimization
•The goal is to find the set of weights which minimizes the overall error function, e.g. average
squared error

Multilayer perceptron
•The first successful training method was backpropagation, in 3 steps:
•Get data, compute outputs with the existing weights of the system (e.g. random)
•Calculate the overall error by taking the difference between the computed outputs and the actual values
•The error is sent back through the network, and the weights are adjusted
•Blame is then assigned to nodes, and the weights are adjusted for these nodes
•(a complex mathematical procedure of partial derivatives is used)
•After sufficient generations and sufficient training samples the error no longer
decreases: stop

Multilayer perceptron
•The weights are adjusted if their change decreases the overall error (not eliminates it)
•The training set has to be balanced, with enough varied cases, as the goal is to generalize
•This technique is called the generalized delta rule; it has 2 parameters:
•Momentum: a weight remembers which direction it was changing in and tries to keep going in the same
direction. If momentum is high, the NN responds slowly to samples which try to change direction. Low
momentum allows flexibility
•Learning rate controls how quickly the weights change. The best approach is to start big and decrease
it slowly as the NN is being trained.
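
To make the two parameters concrete, here is a minimal sketch of the weight update with momentum and a decaying learning rate. For brevity it trains a single linear unit on made-up data rather than a full network. Backpropagation applies the same kind of update to every weight, using partial derivatives to share the error among nodes. The starting weights are fixed at zero (not random) so the run is reproducible, and all parameter values are assumptions.

```python
def train(xs, ys, epochs=200, learning_rate=0.5, decay=0.99, momentum=0.5):
    """Fit y = w*x + b by minimizing the average squared error."""
    w, b = 0.0, 0.0
    dw_prev, db_prev = 0.0, 0.0
    history = []
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)
        # Partial derivatives of the average squared error.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Momentum keeps part of the previous change in each weight.
        dw = -learning_rate * grad_w + momentum * dw_prev
        db = -learning_rate * grad_b + momentum * db_prev
        w, b = w + dw, b + db
        dw_prev, db_prev = dw, db
        # Start big, then decrease the learning rate slowly.
        learning_rate *= decay
    return w, b, history

# Hypothetical training data generated from y = 2x + 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b, history = train(xs, ys)
```

The `history` list records the error per epoch, which is what one would watch to decide when the error no longer decreases.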

Multilayer perceptron
•Initially the weights are random
•Large oscillations are useful at the start
•When getting closer to the optimum, the learning rate should decrease
•There are more methods; the goal of all of them is to arrive quickly at the optimum

Radial basis function network
•Fitting a curve exactly through a set of points
•Weighted distances are computed between the input x and a set of prototypes
•These scaled distances are then transformed through a set of nonlinear basis functions h, and these
outputs are summed up in a linear combination with the original inputs and a constant.

Radial basis function network RBF
•They differ from MLP in 2 ways:
•Interpretation relies on geometry rather than biology
•The training method is different: in addition to optimizing the weights used to combine the outputs of
the RBF nodes, the nodes themselves have parameters that can be optimized
•As with other types of NN, the data processed is always numeric, so it is possible to interpret
any input record as a point in space

Radial basis function network
•In an RBF network the hidden layer nodes are also points in the same space. Each has an address
specified by a vector whose number of elements equals the number of variables
•Instead of combination and transfer functions, RBF nodes have distance and transfer functions
•The distance function is the standard Euclidean one: the square root of the sum of squared distances along each dimension
•The node's output is a non-linear function of how close the node is to the input: the closer the
input, the stronger the output.

Radial basis function network
•"Radial" refers to the fact that all inputs at the same distance from the node's position produce the
same output
•In two dimensions they form a circle, in 3D a sphere
•RBF nodes are in the hidden layer and also have transfer functions
•Instead of S-shaped (as in MLP) these are bell-shaped Gaussians (multidimensional normal curve)
•Unlike MLP, the RBF network does not have weights associated with the connections between the input
and hidden layers
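
A single RBF node, as described above, can be sketched with a Euclidean distance function and a Gaussian transfer function. The node position, the width parameter, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared differences along each dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rbf_node(x, center, width):
    """Bell-shaped (Gaussian) response: strongest when x sits on the node."""
    d = euclidean(x, center)
    return math.exp(-(d * d) / (2.0 * width * width))

# Hypothetical node located at the origin of a 2D input space.
center = (0.0, 0.0)
on_node = rbf_node((0.0, 0.0), center, width=1.0)
east = rbf_node((1.0, 0.0), center, width=1.0)
north = rbf_node((0.0, 1.0), center, width=1.0)
far = rbf_node((3.0, 0.0), center, width=1.0)
```

Here `east` and `north` are equal because both inputs lie on the same circle around the node, which is the "radial" property. The node position and width are the node parameters that training can optimize.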

Probabilistic NN
[Figure: PNN architecture]

Probabilistic Neural Network model
This type of network copies every training case to the hidden layer of the network, where the
Gaussian kernel-based estimation is further applied. The output layer is then reduced, by making
estimations from each hidden unit.
The training is extremely fast, as it just copies the training cases after their normalization to
the network. But this procedure tends to make the neural network very large, therefore this makes
them slow to execute.

During the testing stage the Probabilistic Neural Network model requires a number of operations
approximately proportional to the square of the number of training cases, therefore for the large
number of cases the total duration of creating model becomes similar to the other network types
that are usually described as being far slower to train (e.g. multilayer perceptrons).
If the prior probabilities (of class distribution) are known and different from the frequency
distribution of the training set, they can be incorporated in training of the network model,
otherwise the distribution is described by frequency (StatSoft Inc.).
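
The idea can be sketched in a few lines, assuming a Gaussian kernel with a single smoothing parameter `sigma` (a hypothetical name and value). "Training" only stores the cases. Classification sums one kernel per stored case and weights each class by its prior, which defaults to the class frequency in the training set.

```python
import math
from collections import defaultdict

def pnn_classify(x, training_cases, sigma=0.5, priors=None):
    """Return class probabilities for x from Gaussian kernels on stored cases."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for features, label in training_cases:
        d2 = sum((a - b) ** 2 for a, b in zip(x, features))
        sums[label] += math.exp(-d2 / (2.0 * sigma ** 2))
        counts[label] += 1
    total = len(training_cases)
    scores = {}
    for label in sums:
        density = sums[label] / counts[label]
        # Use known priors if given, otherwise the training-set frequency.
        prior = priors[label] if priors else counts[label] / total
        scores[label] = prior * density
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical normalized training cases with two classes.
cases = [
    ((0.0, 0.0), "stay"), ((0.1, 0.2), "stay"),
    ((1.0, 1.0), "leave"), ((0.9, 1.1), "leave"),
]
probs = pnn_classify((0.1, 0.1), cases)
with_priors = pnn_classify((0.1, 0.1), cases, priors={"stay": 0.1, "leave": 0.9})
```

Every stored case is visited for each new record, which shows why a large training set makes the network slow to execute.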

Memory-Based Reasoning MBR
•MBR belongs to the class of nearest neighbour techniques
•MBR results are based on analogous situations in the past
•Application:
•Collaborative filtering (not only similarity among neighbours but also their preferences),
customer response to an offer
•Text mining approach
•Acoustic engineering: the mobile app Shazam, which identifies songs from snippets captured on a
mobile phone
•Fraud detection (similarity to known cases)

Memory-Based Reasoning MBR
•MBR uses data as it is. Unlike other DM techniques, it does not care about data formats
•Main components: a distance function between two records and a combination function (combines the
results from several neighbors and gives the result)
•Ability to adapt: new categories can be added
•Does not need long training, e.g. for the Shazam app new songs are added on a daily basis and the app
just works
•Disadvantage: the method requires a large sample database. Classifying a new record requires
processing all historical records
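
The two main components can be sketched as follows, assuming numeric records, Euclidean distance as the distance function, and a majority vote among the k nearest neighbors as the combination function. The records, labels, and the value of k are made up for illustration.

```python
import math
from collections import Counter

def distance(a, b):
    """Distance function between two records (Euclidean, as an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combine(labels):
    """Combination function: majority vote among the neighbors."""
    return Counter(labels).most_common(1)[0][0]

def classify(record, history, k=3):
    """Scan all historical records, keep the k nearest, combine their labels."""
    nearest = sorted(history, key=lambda item: distance(record, item[0]))[:k]
    return combine([label for _, label in nearest])

# Hypothetical historical records: (features, known outcome).
history = [
    ((1.0, 1.0), "responds"), ((1.2, 0.9), "responds"), ((0.9, 1.1), "responds"),
    ((5.0, 5.0), "ignores"), ((5.2, 4.8), "ignores"), ((4.9, 5.1), "ignores"),
]
first = classify((1.1, 1.0), history)
second = classify((5.0, 5.1), history)

# Adapting needs no retraining: a new category is just more stored records.
history += [((9.0, 9.0), "complains"), ((9.1, 8.9), "complains"), ((8.9, 9.2), "complains")]
third = classify((9.0, 9.1), history)
```

The `sorted` call touches every historical record for each new one, which is the disadvantage noted in the last bullet.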

Survival analysis
•It means time-to-event analysis. It tells when to start worrying about customers doing something
important
•It identifies which factors are most correlated with the event
•Survival curves provide snapshots of customers and their life cycles; they take care of a very
important facet of customer behaviour: tenure.
•When a customer is likely to leave
•...or migrate to another customer segment
•The compound effect of other factors on tenure

Survival analysis
•Survival curve plotting: the proportion of customers that are expected to survive up to a particular
point in tenure, based on historical information about how long customers survived in the past. It
starts at 100% and decreases
•Graph procedures: Cox proportional hazards regression model. It shows how many customers are still
here after some time (e.g. 2000 days), the likelihood that they will stay longer, and the differences
between two groups
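
A survival curve of this kind can be estimated from tenure data. The sketch below uses the Kaplan-Meier estimator, which is named here as an assumption since the slides do not say which estimator is used. Customers who are still active are treated as censored: they count as "at risk" up to their current tenure but never as an event. The tenures are made up.

```python
def survival_curve(tenures, stopped):
    """Kaplan-Meier estimate of the share of customers surviving past each tenure.

    tenures: observed tenure in days for each customer
    stopped: True if the customer left at that tenure, False if still active
    """
    event_times = sorted({t for t, s in zip(tenures, stopped) if s})
    curve = [(0, 1.0)]  # the curve starts at 100%
    surviving = 1.0
    for t in event_times:
        at_risk = sum(1 for u in tenures if u >= t)
        events = sum(1 for u, s in zip(tenures, stopped) if u == t and s)
        surviving *= 1.0 - events / at_risk
        curve.append((t, surviving))
    return curve

# Hypothetical customers: three have left, two are still active.
tenures = [100, 200, 200, 300, 400]
stopped = [True, True, False, True, False]
curve = survival_curve(tenures, stopped)
```

Computing one curve per customer group and comparing them is a simple way to see the differences between two groups.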

Association rules
•They allow analysts and researchers to uncover hidden patterns in large data sets, such as
"customers who order product A often also order product B or C" or "employees who said positive
things about initiative X also frequently complain about issue Y but are happy with issue Z."
•STATISTICA Association Rules supports all common types of variables or formats in which categories,
items, or transactions (e.g., information regarding purchases of consumer items) are recorded:
categorical variables, multiple response variables, multiple dichotomies.
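
A rule such as "customers who order product A often also order product B" is usually judged by its support and confidence. These two measures are standard for association rules but are not named on the slide, so treat them as an assumption. The sketch below computes them on a handful of made-up transactions.

```python
def support(transactions, items):
    """Share of transactions that contain all the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical orders, each a set of products.
transactions = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"},
]
support_a_b = support(transactions, {"A", "B"})           # 3 of 5 orders
confidence_a_b = confidence(transactions, {"A"}, {"B"})   # 3 of the 4 orders with A
```

A rule is typically kept only if both values exceed thresholds chosen by the analyst.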

SOM – self organizing maps
•A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural
network that is trained using unsupervised learning to produce a low-dimensional (typically
two-dimensional), discretized representation of the input space of the training samples, called a
map. Self-organizing maps are different from other artificial neural networks in the sense that
they use a neighborhood function to preserve the topological properties of the input space.

SOM – self organizing maps
•For data mining purposes, it has become a standard to approximate the SOM by a two-dimensional
hexagonal grid. The “nodes” on the grid are associated with so-called “reference vectors” which point
to distinct regions in the original data space. Starting with sets of numerical, multivariate data,
these reference vectors on the grid gradually adapt to the intrinsic shape of the data
distribution, whereby the reference vectors of neighboring nodes point to adjacent regions in the
data space. Thus the order on the grid reflects the neighborhood within the data, such that data
distribution features can be read directly from the emerging landscape on the grid.
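
The gradual adaptation of reference vectors can be sketched as below. For simplicity the sketch uses a small rectangular grid instead of a hexagonal one, a Gaussian neighborhood function, and linearly shrinking learning rate and radius. All of these choices, and the parameter values, are assumptions rather than the method of any particular tool.

```python
import math
import random

def best_matching_unit(grid, x):
    """Grid node whose reference vector is closest to the data record x."""
    return min(grid, key=lambda node: sum((a - b) ** 2 for a, b in zip(grid[node], x)))

def train_som(data, rows=3, cols=3, iterations=200,
              start_rate=0.5, start_radius=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # One reference vector per grid node, initially random.
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for step in range(iterations):
        fraction = 1.0 - step / iterations
        rate = start_rate * fraction
        radius = max(start_radius * fraction, 0.5)
        x = rng.choice(data)
        bmu = best_matching_unit(grid, x)
        for node, vector in grid.items():
            # Nodes close to the winner on the grid are pulled along with it.
            grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
            for i in range(dim):
                vector[i] += rate * influence * (x[i] - vector[i])
    return grid

# Hypothetical data scaled to [0, 1]: two groups of customers.
data = [(0.1, 0.2), (0.15, 0.1), (0.2, 0.25), (0.8, 0.9), (0.9, 0.85), (0.85, 0.95)]
som = train_som(data)
winner = best_matching_unit(som, (0.1, 0.2))
```

After training, records that are similar in the data space tend to map to the same or neighboring grid nodes.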

SOM – self organizing maps: cluster differences, influence of single variable to cluster separation

Fuzzy inference
•Basic approach of ANFIS
[Diagram: neural networks are generalized into adaptive networks, which are then specialized into
fuzzy inference systems; the result is ANFIS.]

Our approach of using ANFIS as a neuro-fuzzy modeling tool is like this. First we generalize neural
networks' architectures to obtain adaptive networks, and then we do a specialization to derive fuzzy
inference systems represented by adaptive networks, and that is ANFIS. During the processes of
generalization and specialization, the backpropagation techniques used for training neural networks
can be carried over directly, so we can train ANFIS using the same techniques. (In fact
backpropagation is too slow, so we have some other technique to speed up training in ANFIS.)

Fuzzy Sets
•Sets with fuzzy boundaries
A = Set of tall people
[Figure: membership functions over heights (cm). Crisp set A: membership steps to 1.0 at 170 cm.
Fuzzy set A: membership rises smoothly, with grades .5 at 170 cm and .9 at 180 cm, up to 1.0.]

A fuzzy set is a set with a fuzzy boundary. Suppose that A is the set of tall people. In a
conventional set, or crisp set, an element either belongs or does not belong to a set; there is
nothing in between. Therefore to define a crisp set A, we need to find a number, say 170 cm as on the
slide, such that a person taller than this number is in the set of tall people. For a fuzzy version
of set A, we allow the degree of belonging to vary between 0 and 1. Therefore for a person of exactly
that height, we can say that he or she is tall to the degree of 0.5. And for a 6-foot-tall person, he
or she is tall to the degree of .9. So everything is a matter of degree in fuzzy sets. If we plot the
degree of belonging w.r.t. heights, the curve is called a membership function. Because of its
smooth transition, a fuzzy set is a better representation of our mental model of "tall". Moreover, if
a fuzzy set has a step-function-like membership function, it reduces to the common crisp set.
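
The difference between the two kinds of set can be sketched with two membership functions. The crisp version uses the 170 cm threshold from the slide. The fuzzy version is a straight-line ramp, which is an assumption: the slide only fixes the grades .5 at 170 cm and .9 at 180 cm, and the ramp endpoints (157.5 cm and 182.5 cm) are simply chosen to pass through those two points.

```python
def crisp_tall(height_cm, threshold=170.0):
    """Crisp set: a person is either tall (1.0) or not (0.0)."""
    return 1.0 if height_cm > threshold else 0.0

def fuzzy_tall(height_cm, low=157.5, high=182.5):
    """Fuzzy set: the degree of belonging rises smoothly from 0 to 1."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

# Degrees of membership for a few heights in cm.
grades = {h: (crisp_tall(h), fuzzy_tall(h)) for h in (150, 169, 170, 171, 180, 190)}
```

Around 170 cm the crisp grade jumps from 0 to 1 while the fuzzy grade changes only slightly, which is the smooth transition the notes refer to.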

Membership Functions (MFs)
•Subjective measures
•Not probability functions
[Figure: three MFs for “tall” over heights (cm): “tall” in Taiwan, “tall” in the US, and “tall” in
NBA; the grades marked at 180 cm are .5, .8, and .1.]

Here I like to emphasize some important properties of membership functions. First of all, it
subjective measure; my membership function of all?is likely to be different from yours. Also it
context sensitive. For example, I 5?1? and I considered pretty tall in Taiwan. But in the States,
I only considered medium build, so may be only tall to the degree of .5. But if I an NBA player,
Il be considered pretty short, cannot even do a slam dunk! So as you can see here, we have three
different MFs for all?in different contexts. Although they are different, they do share some
common characteristics --- for one thing, they are all monotonically increasing from 0 to 1.
Because the membership function represents a subjective measure, it not probability function at
all.

Fuzzy Inference System (FIS)
If speed is low then resistance = 2
If speed is medium then resistance = 4*speed
If speed is high then resistance = 8*speed
[Figure: MFs for low, medium, and high over speed; at speed = 2 the membership grades are .3, .8,
and .1.]
Rule 1: w1 = .3; r1 = 2
Rule 2: w2 = .8; r2 = 4*2
Rule 3: w3 = .1; r3 = 8*2
Resistance = Σ(wi*ri) / Σwi
                   = (.3*2 + .8*8 + .1*16) / (.3 + .8 + .1) = 8.6 / 1.2 ≈ 7.17

A single fuzzy rule is not very interesting. But if we have a collection of fuzzy rules, we can use
them to describe a system behavior. This leads to a fuzzy inference system. For instance, we can
describe the resistance experienced by a moving object by the following three rules: .... Then
given a crisp speed value, how do we find the resistance value from these three rules? It's quite
simple and can be done in three steps. In the first step, we find the membership grades for "low",
"medium", and "high". For instance, if speed is 2, the membership grades for "low", "medium", and
"high" are .3, .8, and .1, respectively. These numbers also represent how the given input condition
"speed = 2" satisfies the IF part of the rules. Sometimes these numbers are called the firing
strengths of the rules. In the second step, we find the output of each rule, given speed is 2. In the
third step, we apply a weighted average method to find the overall resistance, where the weighting
factors are equal to the firing strengths of the rules. The whole process to derive the output from
a given input condition is called fuzzy reasoning. For a two-input FIS, the process of fuzzy
reasoning is better represented by the following diagram.
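
The second and third steps can be checked with a few lines of code. The membership grades are passed in directly, because the shapes of the "low", "medium", and "high" membership functions are not given in the text. The function and variable names are hypothetical.

```python
def resistance(speed, grades):
    """Weighted average of rule outputs, weighted by firing strengths."""
    # Step 2: output of each rule for the given speed.
    rule_outputs = {
        "low": 2.0,             # if speed is low then resistance = 2
        "medium": 4.0 * speed,  # if speed is medium then resistance = 4*speed
        "high": 8.0 * speed,    # if speed is high then resistance = 8*speed
    }
    # Step 3: weighted average with the firing strengths as weights.
    weighted = sum(grades[name] * rule_outputs[name] for name in rule_outputs)
    total = sum(grades[name] for name in rule_outputs)
    return weighted / total

# Step 1 result for speed = 2: the membership grades given in the example.
grades_at_2 = {"low": 0.3, "medium": 0.8, "high": 0.1}
result = resistance(2.0, grades_at_2)
```

With these firing strengths and rule outputs the weighted average is 8.6 / 1.2, which is about 7.17.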

Fuzzy inference: surface diagrams for relationship among variables

Fuzzy methods for marketing

Combining methods for exploring customer performance

Web data mining
•Indicators for evaluation
•Opinion mining
•Text mining approaches and process
•Static analytic
•Dynamic analytic
•Sentiment analysis
•Classification
•Social network generation for analysis
•Social network analysis approach

Social media analytics

Analytic types in social media: Opinion mining

Analytic types in social media: text mining

Mining process
•Example  “I like this shoe”

Static analytics (reporting, pivoting)

Dynamic analytics

Sentiment classification (text)

Classification (support vector machine SVM)

Social network generation for analysis

Social network analysis approach

Optimization of quantitative and qualitative values
•Constraints: meet the budget, exceed target sales
•Qualitative goal: to obtain the advertisement which could maximize the effectiveness of reach to the
public
•Experts are invited (or surveys done) in order to define the effectiveness of each medium, and also
its dependency on the number of shows
•The quantification of expert opinion: the change in effectiveness by the 10th show. It is expressed
as the number of people (out of 100) who watch the advertisement
•The goal of the optimization experiment is to maximize the effectiveness

Literature
•Berry, M. J. A., Linoff, G. S. (2011), "Data Mining Techniques: For Marketing, Sales, and Customer
Relationship Management", (3rd ed.), Indianapolis: Wiley Publishing, Inc.
•(Electronic Version): StatSoft, Inc. (2012). Electronic Statistics Textbook. Tulsa, OK: StatSoft.
WEB: http://www.statsoft.com/textbook/
•(Printed Version): Hill, T. & Lewicki, P. (2007). STATISTICS: Methods and Applications. StatSoft,
Tulsa, OK.
•Sugar CRM Implementation http://www.optimuscrm.com/index.php?lang=en
•Statsoft: the creators of Statistica  http://www.statsoft.com
•Viscovery SOMine http://www.viscovery.net/