Credit scoring - model evaluation
Introduction
□ A scoring model cannot be used effectively without knowing how good it is.
□ Usually one has several scoring models and needs to select just one: the best one.
□ Before measuring the quality of models one should know (among other things):
> the good/bad definition
> the expected reject rate
Good/bad client definition
□ A good definition is the basic condition of an effective scoring model.
□ The definition usually depends on:
> days past due (DPD)
> amount past due
> time horizon
□ Generally we consider the following types of client:
> Good
> Bad
> Indeterminate
> Insufficient
> Excluded
> Rejected.
Good/bad client definition
[Figure: breakdown of customers into client types; recoverable labels: Customer, Accepted, Good, Insufficient, Indeterminate.]
Measuring the quality
□ Once the definition of a good / bad client and the client's score are available, it is possible to evaluate the quality of this score. If the score is an output of a predictive model (scoring function), then we evaluate the quality of this model. We can consider two basic types of quality indexes. First, indexes based on the cumulative distribution function, like
> Kolmogorov-Smirnov statistics (KS)
> Gini index
> C-statistics
> Lift.
Second, indexes based on the probability density function, like
> Mean difference (Mahalanobis distance)
> Information statistics/value ($I_{val}$).
Indexes based on distribution function
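$$D_K = \begin{cases} 1, & \text{client is good} \\ 0, & \text{otherwise.} \end{cases}$$
Number of good clients: $n$. Number of bad clients: $m$. Number of all clients: $N = n + m$.
Proportions of good/bad clients:
$$p_G = \frac{n}{n+m}, \qquad p_B = \frac{m}{n+m}$$
Empirical distribution functions of the score $s_i$, for $a \in [L, H]$:
$$F_{n.GOOD}(a) = \frac{1}{n}\sum_{i=1}^{N} I(s_i \le a \wedge D_K = 1)$$
$$F_{m.BAD}(a) = \frac{1}{m}\sum_{i=1}^{N} I(s_i \le a \wedge D_K = 0)$$
$$F_{N.ALL}(a) = \frac{1}{N}\sum_{i=1}^{N} I(s_i \le a)$$
$$I(A) = \begin{cases} 1, & A \text{ is true} \\ 0, & \text{otherwise} \end{cases}$$
Kolmogorov-Smirnov statistics (KS)
$$KS = \max_{a \in [L,H]} \left| F_{m.BAD}(a) - F_{n.GOOD}(a) \right|$$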
			
[Figure: empirical distribution functions of bad clients and good clients plotted against the score; K-S = 0.410.]
Indexes based on distribution function
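> Lorenz curve (LC)
$$x = F_{m.BAD}(a), \qquad y = F_{n.GOOD}(a), \qquad a \in [L, H]$$
> Gini index
[Figure: Lorenz curve; A is the area between the curve and the diagonal, B is the remaining area under the diagonal.]
$$Gini = \frac{A}{A+B} = 2A$$
$$Gini = 1 - \sum_{k=2}^{n+m} \left(F_{m.BAD_k} - F_{m.BAD_{k-1}}\right)\left(F_{n.GOOD_k} + F_{n.GOOD_{k-1}}\right)$$
where $F_{m.BAD_k}$ ($F_{n.GOOD_k}$) is the k-th vector value of the empirical distribution function of bad (good) clients.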
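The KS and Gini definitions above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the input is assumed to be a list of scores with a parallel list of 0/1 good flags, and the point (0, 0) is prepended to the Lorenz curve so that the trapezoid sum starts at the origin.

```python
def empirical_cdfs(scores, is_good):
    """Return (F_bad, F_good) evaluated at each distinct score, in ascending order."""
    n = sum(is_good)                 # number of good clients
    m = len(is_good) - n             # number of bad clients
    pairs = sorted(zip(scores, is_good))
    f_bad, f_good = [], []
    bad_seen = good_seen = 0
    i = 0
    while i < len(pairs):
        a = pairs[i][0]
        # consume all clients sharing the score a
        while i < len(pairs) and pairs[i][0] == a:
            if pairs[i][1]:
                good_seen += 1
            else:
                bad_seen += 1
            i += 1
        f_bad.append(bad_seen / m)
        f_good.append(good_seen / n)
    return f_bad, f_good


def ks_statistic(f_bad, f_good):
    """Maximum absolute distance between the two empirical distribution functions."""
    return max(abs(b - g) for b, g in zip(f_bad, f_good))


def gini_index(f_bad, f_good):
    """Gini = 1 - sum (x_k - x_{k-1}) (y_k + y_{k-1}) along the Lorenz curve."""
    xs = [0.0] + f_bad               # assumption: curve starts at (0, 0)
    ys = [0.0] + f_good
    return 1.0 - sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1])
                     for k in range(1, len(xs)))


f_bad, f_good = empirical_cdfs([1, 2, 3, 4, 5, 6], [0, 0, 1, 0, 1, 1])
print(ks_statistic(f_bad, f_good), gini_index(f_bad, f_good))
```

In this toy sample the bad clients mostly have the lowest scores, so both indexes come out well above zero.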
Indexes based on distribution function
C-statistics:
$$c\text{-}stat = A + C = \frac{1 + Gini}{2}$$
[Figure: Lorenz curve with the areas A and C marked.]
It represents the probability that a randomly selected good client has a higher score than a randomly selected bad client, i.e.
$$c\text{-}stat = P\left(s_1 > s_2 \mid D_{K_1} = 1 \wedge D_{K_2} = 0\right)$$
Indexes based on distribution function
□ Another possible indicator of the quality of a scoring model is the cumulative Lift, which says how many times better than random selection (a random model) the scoring model is at a given level of rejection. More precisely, it is the ratio of the proportion of bad clients among clients with a score of at most $a$, $a \in [L, H]$, to the proportion of bad clients in the whole population. Formally, with $Y = 0$ denoting a bad client:
$$Lift(a) = \frac{CumBadRate(a)}{BadRate} = \frac{\dfrac{\sum_{i=1}^{n+m} I(s_i \le a \wedge Y = 0)}{\sum_{i=1}^{n+m} I(s_i \le a)}}{\dfrac{\sum_{i=1}^{n+m} I(Y = 0)}{\sum_{i=1}^{n+m} I(Y = 0 \vee Y = 1)}} = \frac{\dfrac{\sum_{i=1}^{n+m} I(s_i \le a \wedge Y = 0)}{\sum_{i=1}^{n+m} I(s_i \le a)}}{\dfrac{m}{n+m}}$$
The absolute Lift compares the bad rate at the score level $a$ itself with the overall bad rate:
$$absLift(a) = \frac{BadRate(a)}{BadRate}$$
[Figure: cumulative Lift and absolute Lift curves.]
Indexes based on distribution function
□ Usually it is computed using a table with the numbers of all and bad clients in some bands (deciles).
decile	# clients	# bad clients (abs.)	Bad rate (abs.)	abs. Lift	# bad clients (cum.)	Bad rate (cum.)	cum. Lift
1	100	16	16,0%	3,20	16	16,0%	3,20
2	100	12	12,0%	2,40	28	14,0%	2,80
3	100	8	8,0%	1,60	36	12,0%	2,40
4	100	5	5,0%	1,00	41	10,3%	2,05
5	100	3	3,0%	0,60	44	8,8%	1,76
6	100	2	2,0%	0,40	46	7,7%	1,53
7	100	1	1,0%	0,20	47	6,7%	1,34
8	100	1	1,0%	0,20	48	6,0%	1,20
9	100	1	1,0%	0,20	49	5,4%	1,09
10	100	1	1,0%	0,20	50	5,0%	1,00
All	1000	50	5,0%
[Figure: abs. Lift and cum. Lift by decile.]
[Figure: Lorenz curve with base line; Gini = 0,55.]
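A decile table of this kind can be computed directly from the band counts. The sketch below is illustrative only; `lift_table` is a hypothetical helper name, and the bands are assumed to be ordered from the lowest scores to the highest.

```python
def lift_table(clients_per_band, bads_per_band):
    """Absolute and cumulative Lift per band (bands ordered from worst scores)."""
    overall_bad_rate = sum(bads_per_band) / sum(clients_per_band)
    rows = []
    cum_clients = cum_bads = 0
    for band, (clients, bads) in enumerate(zip(clients_per_band, bads_per_band), start=1):
        cum_clients += clients
        cum_bads += bads
        rows.append({
            "band": band,
            "abs_lift": (bads / clients) / overall_bad_rate,
            "cum_lift": (cum_bads / cum_clients) / overall_bad_rate,
        })
    return rows


# Counts taken from the decile table above: 100 clients per decile, 50 bads in total.
table = lift_table([100] * 10, [16, 12, 8, 5, 3, 2, 1, 1, 1, 1])
for row in table:
    print(row["band"], round(row["abs_lift"], 2), round(row["cum_lift"], 2))
```

The cumulative Lift of the last band is always 1, because the whole population has been accumulated by then.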
Indexes based on distribution function
□ When bad rates are not monotone:
> LC looks fine
> Gini is slightly lowered
> Lift looks strange
[Figure: Lorenz curve with base line; Gini = 0,48.]
[Figure: abs. Lift and cum. Lift by decile.]
decile	# clients	# bad clients (abs.)	Bad rate (abs.)	abs. Lift	# bad clients (cum.)	Bad rate (cum.)	cum. Lift
1	100	8	8,0%	1,60	8	8,0%	1,60
2	100	12	12,0%	2,40	20	10,0%	2,00
3	100	16	16,0%	3,20	36	12,0%	2,40
4	100	5	5,0%	1,00	41	10,3%	2,05
5	100	3	3,0%	0,60	44	8,8%	1,76
6	100	2	2,0%	0,40	46	7,7%	1,53
7	100	1	1,0%	0,20	47	6,7%	1,34
8	100	1	1,0%	0,20	48	6,0%	1,20
9	100	1	1,0%	0,20	49	5,4%	1,09
10	100	1	1,0%	0,20	50	5,0%	1,00
All	1000	50	5,0%
Indexes based on distribution function
□ When the score is reversed, we obtain reversed figures.
Original score:
decile	# clients	# bad clients (abs.)	Bad rate (abs.)	abs. Lift	# bad clients (cum.)	Bad rate (cum.)	cum. Lift
1	100	16	16,0%	3,20	16	16,0%	3,20
2	100	12	12,0%	2,40	28	14,0%	2,80
3	100	8	8,0%	1,60	36	12,0%	2,40
4	100	5	5,0%	1,00	41	10,3%	2,05
5	100	3	3,0%	0,60	44	8,8%	1,76
6	100	2	2,0%	0,40	46	7,7%	1,53
7	100	1	1,0%	0,20	47	6,7%	1,34
8	100	1	1,0%	0,20	48	6,0%	1,20
9	100	1	1,0%	0,20	49	5,4%	1,09
10	100	1	1,0%	0,20	50	5,0%	1,00
All	1000	50	5,0%
Reversed score:
decile	# clients	# bad clients (abs.)	Bad rate (abs.)	abs. Lift	# bad clients (cum.)	Bad rate (cum.)	cum. Lift
1	100	1	1,0%	0,20	1	1,0%	0,20
2	100	1	1,0%	0,20	2	1,0%	0,20
3	100	1	1,0%	0,20	3	1,0%	0,20
4	100	1	1,0%	0,20	4	1,0%	0,20
5	100	2	2,0%	0,40	6	1,2%	0,24
6	100	3	3,0%	0,60	9	1,5%	0,30
7	100	5	5,0%	1,00	14	2,0%	0,40
8	100	8	8,0%	1,60	22	2,8%	0,55
9	100	12	12,0%	2,40	34	3,8%	0,76
10	100	16	16,0%	3,20	50	5,0%	1,00
All	1000	50	5,0%
[Figure: abs. Lift and cum. Lift by decile for the reversed score.]
Indexes based on distribution function
□ The Gini is not enough!!!
SC 1:
decile	# clients	# bad clients	Bad rate
1	100	35	35,0%
2	100	16	16,0%
3	100	8	8,0%
4	100	8	8,0%
5	100	7	7,0%
6	100	6	6,0%
7	100	6	6,0%
8	100	5	5,0%
9	100	5	5,0%
10	100	4	4,0%
All	1000	100	10,0%
SC 2:
decile	# clients	# bad clients	Bad rate
1	100	20	20,0%
2	100	18	18,0%
3	100	17	17,0%
4	100	15	15,0%
5	100	12	12,0%
6	100	6	6,0%
7	100	4	4,0%
8	100	3	3,0%
9	100	3	3,0%
10	100	2	2,0%
All	1000	100	10,0%
[Figure: Lorenz curves of SC 1 and SC 2; both scorecards have Gini = 0.42.]
Indexes based on distribution function
SC 1:
[Figure: abs. Lift and cum. Lift by decile.]
$Lift_{20\%} = 2.55$, $Lift_{50\%} = 1.48$
SC 2:
[Figure: abs. Lift and cum. Lift by decile.]
$Lift_{20\%} = 1.90$, $Lift_{50\%} = 1.64$
At 20% SC 1 has the higher Lift (2.55 > 1.90); at 50% SC 2 has the higher Lift (1.48 < 1.64).
SC 2 is better if the expected reject rate is around 50%.
SC 1 is much better if the expected reject rate is around 20%.
Indexes based on distribution function
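□ Lift can be expressed and computed by the formulae
$$Lift(a) = \frac{F_{m.BAD}(a)}{F_{N.ALL}(a)}, \qquad a \in [L, H]$$
$$Lift_q = \frac{F_{m.BAD}\left(F_{N.ALL}^{-1}(q)\right)}{F_{N.ALL}\left(F_{N.ALL}^{-1}(q)\right)} = \frac{1}{q} F_{m.BAD}\left(F_{N.ALL}^{-1}(q)\right)$$
where
$$F_{N.ALL}^{-1}(q) = \min\{a \in [L, H] : F_{N.ALL}(a) \ge q\}$$
For example,
$$Lift_{10\%} = 10 \cdot F_{m.BAD}\left(F_{N.ALL}^{-1}(0.1)\right)$$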
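The quantile form of Lift can be evaluated on raw scores as in the following sketch. The function name `lift_q` and the 0/1 bad flag are assumptions made for the illustration; the quantile is taken as the smallest score at which the empirical distribution function of all clients reaches q.

```python
def lift_q(scores, is_bad, q):
    """Lift at reject rate q: (1/q) * F_BAD(F_ALL^{-1}(q)), for 0 < q <= 1."""
    n_all = len(scores)
    n_bad = sum(is_bad)
    # F_ALL^{-1}(q): smallest score a with F_ALL(a) >= q
    threshold = None
    for a in sorted(set(scores)):
        f_all = sum(1 for s in scores if s <= a) / n_all
        if f_all >= q:
            threshold = a
            break
    # F_BAD at that score: share of all bad clients with score <= threshold
    f_bad = sum(1 for s, bad in zip(scores, is_bad) if bad and s <= threshold) / n_bad
    return f_bad / q


scores = list(range(1, 11))
is_bad = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # three bad clients among the low scores
print(lift_q(scores, is_bad, 0.2), lift_q(scores, is_bad, 0.5))
```

Rejecting the lowest 20% of this toy sample removes two of the three bad clients, so the Lift at 20% is (2/3)/0.2.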
Indexes based on density function
> Probability density functions: $f_{GOOD}(x)$, $f_{BAD}(x)$
• Kernel estimates:
$$\tilde f_{GOOD}(x, h) = \frac{1}{n}\sum_{i:\, D_K = 1} K_h(x - s_i), \qquad \tilde f_{BAD}(x, h) = \frac{1}{m}\sum_{i:\, D_K = 0} K_h(x - s_i)$$
• Optimal bandwidth (maximal smoothing):
$$h_{OS,k} = \left(\frac{(2k+1)!\,k\,(2k+5)^{k+\frac{3}{2}}}{(2k+3)!}\right)^{\frac{1}{2k+1}} \tilde\sigma\, n^{-\frac{1}{2k+1}}$$
where:
k is the order of the kernel function (e.g. 2 for the Epanechnikov kernel)
n is the number of actual cases
$\tilde\sigma$ is an estimate of the standard deviation
Indexes based on density function
> Mean difference (Mahalanobis distance):
$$D = \frac{M_g - M_b}{S}$$
where S is the pooled standard deviation:
$$S = \left(\frac{n S_g^2 + m S_b^2}{n + m}\right)^{\frac{1}{2}}$$
$M_g$, $M_b$ are the means of good (bad) clients
$S_g$, $S_b$ are the standard deviations of good (bad) clients
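A minimal sketch of the mean difference follows. The slide does not say whether $S_g$ and $S_b$ are population or sample standard deviations; the population form (`statistics.pstdev`) is used here purely as an assumption, and `mean_difference` is a hypothetical name.

```python
from math import sqrt
from statistics import fmean, pstdev


def mean_difference(good_scores, bad_scores):
    """D = (M_g - M_b) / S with S the pooled standard deviation."""
    n, m = len(good_scores), len(bad_scores)
    m_g, m_b = fmean(good_scores), fmean(bad_scores)
    # assumption: population standard deviations within each group
    s_g, s_b = pstdev(good_scores), pstdev(bad_scores)
    pooled = sqrt((n * s_g ** 2 + m * s_b ** 2) / (n + m))
    return (m_g - m_b) / pooled


print(mean_difference([4, 6], [1, 3]))
```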
Indexes based on density function
> Information value ($I_{val}$) - continuous case (divergence):
$$I_{val} = \int_{-\infty}^{\infty} \left(f_{GOOD}(x) - f_{BAD}(x)\right) \ln\left(\frac{f_{GOOD}(x)}{f_{BAD}(x)}\right) dx$$
The two factors of the integrand are
$$f_{diff}(x) = f_{GOOD}(x) - f_{BAD}(x), \qquad f_{LR}(x) = \ln\left(\frac{f_{GOOD}(x)}{f_{BAD}(x)}\right)$$
[Figure: $f_{diff}$ and $f_{LR}$ plotted against the score.]
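Indexes based on density function
> Information value ($I_{val}$) - discretized continuous case:
• Replace the density functions by their kernel estimates and compute the integral numerically (e.g. by the composite trapezoidal rule).
• Using the Epanechnikov kernel, given by $K(x) = \frac{3}{4}(1 - x^2) \cdot I(x \in [-1, 1])$, and the optimal bandwidth $h_{OS,2}$ we have
$$\tilde f_{IV}(x) = \left(\tilde f_{GOOD}(x, h_{OS,2}) - \tilde f_{BAD}(x, h_{OS,2})\right) \ln\left(\frac{\tilde f_{GOOD}(x, h_{OS,2})}{\tilde f_{BAD}(x, h_{OS,2})}\right)$$
• For given M+1 equally spaced points $x_0, \ldots, x_M$ the composite trapezoidal rule gives
$$I_{val} \approx \frac{x_M - x_0}{2M}\left(\tilde f_{IV}(x_0) + 2\sum_{i=1}^{M-1} \tilde f_{IV}(x_i) + \tilde f_{IV}(x_M)\right)$$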
Indexes based on density function
> Information statistics/value ($I_{val}$) - discrete case:
• Create intervals of score, typically deciles. The number of goods (bads) in the i-th interval is denoted by $g_i$ ($b_i$).
• It must hold that $g_i > 0$, $b_i > 0$ for all $i$. Then we have
$$I_{val} = \sum_i \left(\frac{g_i}{n} - \frac{b_i}{m}\right) \ln\left(\frac{g_i\, m}{b_i\, n}\right)$$
score int.	# bad clients	#good clients	% bad [1]	% good [2]	[3] = [2] - [1]	[4] = [2] / [1]	[5] = ln[4]	[6] = [3] * [5]
1	1	10	2,0%	1,1%	-0,01	0,53	-0,64	0,01
2	2	15	4,0%	1,6%	-0,02	0,39	-0,93	0,02
3	8	52	16,0%	5,5%	-0,11	0,34	-1,07	0,11
4	14	93	28,0%	9,8%	-0,18	0,35	-1,05	0,19
5	10	146	20,0%	15,4%	-0,05	0,77	-0,26	0,01
6	6	247	12,0%	26,0%	0,14	2,17	0,77	0,11
7	4	137	8,0%	14,4%	0,06	1,80	0,59	0,04
8	3	105	6,0%	11,1%	0,05	1,84	0,61	0,03
9	1	97	2,0%	10,2%	0,08	5,11	1,63	0,13
10	1	48	2,0%	5,1%	0,03	2,53	0,93	0,03
All	50	950					Info. Value	0,68
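The discrete information value can be recomputed from the interval counts, as in this short sketch. The function name is hypothetical; the counts are the ones in the table above.

```python
from math import log


def information_value(goods, bads):
    """Discrete information value from per-interval counts of goods and bads."""
    n, m = sum(goods), sum(bads)
    if any(g <= 0 or b <= 0 for g, b in zip(goods, bads)):
        raise ValueError("every interval needs at least one good and one bad client")
    return sum((g / n - b / m) * log((g * m) / (b * n))
               for g, b in zip(goods, bads))


bads = [1, 2, 8, 14, 10, 6, 4, 3, 1, 1]
goods = [10, 15, 52, 93, 146, 247, 137, 105, 97, 48]
print(round(information_value(goods, bads), 2))
```

The positivity check mirrors the condition on $g_i$ and $b_i$: an empty cell would make the logarithm undefined, so in practice such intervals are usually merged with a neighbour.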
Indexes based on density function
□ Information value for our example of two scorecards:
> SC 1:
decile	# clients	# bad clients	# good clients	% bad [1]	% good [2]	[3] = [2]-[1]	[4] = [2]/[1]	[5] = ln[4]	[6] = [3]*[5]	cum. [6]
1	100	35	65	35,0%	7,2%	-0,28	0,21	-1,58	0,44	0,44
2	100	16	84	16,0%	9,3%	-0,07	0,58	-0,54	0,04	0,47
3	100	8	92	8,0%	10,2%	0,02	1,28	0,25	0,01	0,48
4	100	8	92	8,0%	10,2%	0,02	1,28	0,25	0,01	0,49
5	100	7	93	7,0%	10,3%	0,03	1,48	0,39	0,01	0,50
6	100	6	94	6,0%	10,4%	0,04	1,74	0,55	0,02	0,52
7	100	6	94	6,0%	10,4%	0,04	1,74	0,55	0,02	0,55
8	100	5	95	5,0%	10,6%	0,06	2,11	0,75	0,04	0,59
9	100	5	95	5,0%	10,6%	0,06	2,11	0,75	0,04	0,63
10	100	4	96	4,0%	10,7%	0,07	2,67	0,98	0,07	0,70
All	1000	100	900					Info. Value	0,70	
> SC 2:
decile	# clients	# bad clients	# good clients	% bad [1]	% good [2]	[3] = [2]-[1]	[4] = [2]/[1]	[5] = ln[4]	[6] = [3]*[5]	cum. [6]
1	100	20	80	20,0%	8,9%	-0,11	0,44	-0,81	0,09	0,09
2	100	18	82	18,0%	9,1%	-0,09	0,51	-0,68	0,06	0,15
3	100	17	83	17,0%	9,2%	-0,08	0,54	-0,61	0,05	0,20
4	100	15	85	15,0%	9,4%	-0,06	0,63	-0,46	0,03	0,22
5	100	12	88	12,0%	9,8%	-0,02	0,81	-0,20	0,00	0,23
6	100	6	94	6,0%	10,4%	0,04	1,74	0,55	0,02	0,25
7	100	4	96	4,0%	10,7%	0,07	2,67	0,98	0,07	0,32
8	100	3	97	3,0%	10,8%	0,08	3,59	1,28	0,10	0,42
9	100	3	97	3,0%	10,8%	0,08	3,59	1,28	0,10	0,52
10	100	2	98	2,0%	10,9%	0,09	5,44	1,69	0,15	0,67
All	1000	100	900					Info. Value	0,67	
Indexes based on density function
□ Using the notation
$$I_{diff_i} = \frac{g_i}{n} - \frac{b_i}{m}, \qquad I_{LR_i} = \ln\left(\frac{g_i\, m}{b_i\, n}\right)$$
we have:
[Figure, SC 1 and SC 2: $I_{diff}$ and $I_{LR}$ by decile.]
[Figure, SC 1 and SC 2: $I_{diff} \cdot I_{LR}$ and cumulative $I_{diff} \cdot I_{LR}$ by decile.]
Index	SC 1	SC 2
K-S	0.34	0.36
Gini	0.42	0.42
Lift 20%	2.55	1.90
Lift 50%	1.48	1.64
I_val	0.70	0.67
I_val 20%	0.47	0.15
I_val 50%	0.50	0.23
Some results for normally distributed scores
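> Assume that the scores of good and bad clients are normally distributed, i.e. we can write their densities as
$$f_{GOOD}(x) = \frac{1}{\sigma_g\sqrt{2\pi}}\, e^{-\frac{(x - \mu_g)^2}{2\sigma_g^2}}, \qquad f_{BAD}(x) = \frac{1}{\sigma_b\sqrt{2\pi}}\, e^{-\frac{(x - \mu_b)^2}{2\sigma_b^2}}$$
Estimates of the parameters $\mu_g$, $\mu_b$, $\sigma_g$ and $\sigma_b$:
$M_g$, $M_b$ are the means of good (bad) clients
$S_g$, $S_b$ are the standard deviations of good (bad) clients
Pooled standard deviation:
$$S = \left(\frac{n S_g^2 + m S_b^2}{n + m}\right)^{\frac{1}{2}}$$
Estimates of the mean and standard deviation of scores for all clients, $\mu_{ALL}$ and $\sigma_{ALL}$:
$$M = M_{ALL} = \frac{n M_g + m M_b}{n + m}$$
$$S_{ALL} = \left(\frac{n S_g^2 + m S_b^2 + n (M_g - M)^2 + m (M_b - M)^2}{n + m}\right)^{\frac{1}{2}}$$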
Some results for normally distributed scores
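> Assume that the standard deviations are equal to a common value $\sigma$:
$$D = \frac{\mu_g - \mu_b}{\sigma}$$
$$KS = \Phi\left(\frac{D}{2}\right) - \Phi\left(-\frac{D}{2}\right) = 2\,\Phi\left(\frac{D}{2}\right) - 1$$
$$Gini = 2\,\Phi\left(\frac{D}{\sqrt{2}}\right) - 1$$
$$Lift_q = \frac{1}{q}\,\Phi\left(\frac{\sigma_{ALL}}{\sigma}\,\Phi^{-1}(q) + p_G\, D\right)$$
$$I_{val} = D^2$$
where $\Phi(\cdot)$ is the standardized normal distribution function, $\Phi_{\mu,\sigma^2}(\cdot)$ is the normal distribution function with parameters $\mu$, $\sigma^2$, and $\Phi^{-1}(\cdot)$ is the standard normal quantile function.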
Some results for normally distributed scores
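> Generally (i.e. without the assumption of equal standard deviations):
$$D^* = \frac{\mu_g - \mu_b}{\sqrt{\sigma_g^2 + \sigma_b^2}}$$
$$KS = \Phi\left(\frac{\sigma_b\, a D^* - \sigma_g \sqrt{a^2 D^{*2} + 2bc}}{b}\right) - \Phi\left(\frac{\sigma_g\, a D^* - \sigma_b \sqrt{a^2 D^{*2} + 2bc}}{b}\right)$$
where $a = \sqrt{\sigma_b^2 + \sigma_g^2}$, $b = \sigma_b^2 - \sigma_g^2$, $c = \ln\left(\frac{\sigma_b}{\sigma_g}\right)$.
In terms of the estimates, with $R = \sqrt{(S_b^2 + S_g^2)\, D^{*2} + 2 (S_b^2 - S_g^2) \ln\left(\frac{S_b}{S_g}\right)}$:
$$KS = \Phi\left(\frac{S_b \sqrt{S_b^2 + S_g^2}\; D^* - S_g R}{S_b^2 - S_g^2}\right) - \Phi\left(\frac{S_g \sqrt{S_b^2 + S_g^2}\; D^* - S_b R}{S_b^2 - S_g^2}\right)$$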
Some results for normally distributed scores
> Generally (i.e. without assumption of equality of standard deviations):
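$$Gini = 2\,\Phi(D^*) - 1$$
$$Lift_q = \frac{1}{q}\,\Phi_{\mu_b, \sigma_b^2}\left(\mu_{ALL} + \sigma_{ALL}\,\Phi^{-1}(q)\right) = \frac{1}{q}\,\Phi\left(\frac{\sigma_{ALL}\,\Phi^{-1}(q) + \mu_{ALL} - \mu_b}{\sigma_b}\right)$$
In terms of the estimates:
$$Lift_q = \frac{1}{q}\,\Phi\left(\frac{S_{ALL}\,\Phi^{-1}(q) + M - M_b}{S_b}\right)$$
$$I_{val} = (A + 1)\, D^{*2} + A - 1, \qquad A = \frac{1}{2}\left(\frac{\sigma_g^2}{\sigma_b^2} + \frac{\sigma_b^2}{\sigma_g^2}\right)$$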
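The closed-form expressions for normally distributed scores can be checked numerically. The sketch below assumes $D^* = (\mu_g - \mu_b)/\sqrt{\sigma_g^2 + \sigma_b^2}$, $a = \sqrt{\sigma_b^2 + \sigma_g^2}$, $b = \sigma_b^2 - \sigma_g^2$ and $c = \ln(\sigma_b/\sigma_g)$; all function names are hypothetical, and the general KS formula needs $\sigma_g \ne \sigma_b$ because it divides by $b$.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal: PHI.cdf is Phi, PHI.inv_cdf is Phi^{-1}


def d_star(mu_g, mu_b, sigma_g, sigma_b):
    return (mu_g - mu_b) / sqrt(sigma_g ** 2 + sigma_b ** 2)


def gini_normal(mu_g, mu_b, sigma_g, sigma_b):
    return 2.0 * PHI.cdf(d_star(mu_g, mu_b, sigma_g, sigma_b)) - 1.0


def ks_normal(mu_g, mu_b, sigma_g, sigma_b):
    """General-case KS; requires sigma_g != sigma_b."""
    a = sqrt(sigma_b ** 2 + sigma_g ** 2)
    b = sigma_b ** 2 - sigma_g ** 2
    c = log(sigma_b / sigma_g)
    ad = a * d_star(mu_g, mu_b, sigma_g, sigma_b)   # equals mu_g - mu_b
    root = sqrt(ad ** 2 + 2.0 * b * c)              # b and c share a sign, so b*c >= 0
    return (PHI.cdf((sigma_b * ad - sigma_g * root) / b)
            - PHI.cdf((sigma_g * ad - sigma_b * root) / b))


def ival_normal(mu_g, mu_b, sigma_g, sigma_b):
    big_a = 0.5 * (sigma_g ** 2 / sigma_b ** 2 + sigma_b ** 2 / sigma_g ** 2)
    return (big_a + 1.0) * d_star(mu_g, mu_b, sigma_g, sigma_b) ** 2 + big_a - 1.0


def lift_normal(q, mu_all, sigma_all, mu_b, sigma_b):
    return PHI.cdf((sigma_all * PHI.inv_cdf(q) + mu_all - mu_b) / sigma_b) / q


# Nearly equal variances should reproduce the equal-variance result 2*Phi(D/2) - 1.
print(ks_normal(1.0, 0.0, 1.0, 1.0001), 2.0 * PHI.cdf(0.5) - 1.0)
```

With equal variances the information value reduces to $D^2$, which gives a quick sanity check of `ival_normal`.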
Some results for normally distributed scores
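[Figure: KS as a function of $\mu_g$ and $\sigma_g^2$, with $\mu_b = 0$, $\sigma_b^2 = 1$.]
[Figure: Gini as a function of $\mu_g$ and $\sigma_g^2$, with $\mu_b = 0$, $\sigma_b^2 = 1$.]
□ KS and Gini react much more to a change of $\mu_g$ and are almost unchanged in the direction of $\sigma_g^2$.
• Gini > KS
[Figure: KS and Gini compared.]
Some results for normally distributed scores
[Figure: $Lift_{10\%}$ as a function of $\mu_g$ and $\sigma_g^2$, with $\mu_b = 0$, $\sigma_b^2 = 1$.]
[Figure: $I_{val}$ as a function of $\mu_g$ and $\sigma_g^2$, with $\mu_b = 0$, $\sigma_b^2 = 1$.]
□ For $Lift_{10\%}$ there is an evident strong dependence on $\mu_g$ and a significantly higher dependence on $\sigma_g^2$ than for KS and Gini.
□ For $I_{val}$ there is again a strong dependence on $\mu_g$. Furthermore, the value of $I_{val}$ rises very quickly to infinity when $\sigma_g^2$ tends to zero.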
ROC (Receiver operating characteristic)
TN (true negative) - number of correctly classified negative cases
TP (true positive) - number of correctly classified positive cases
FP (false positive) - number of incorrectly classified negative cases
FN (false negative) - number of incorrectly classified positive cases
Actual	Predicted G0	Predicted G1	Total
G0	TN	FP	N
G1	FN	TP	P
Total	PNeg	PPos	n
ROC - TPR, FPR
$$TPR = \frac{TP}{P} = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$$
$$tpr(c) = P(X > c \mid G_1) = 1 - F_1(c), \qquad tnr(c) = P(X \le c \mid G_0) = F_0(c)$$
$$fpr(c) = P(X > c \mid G_0) = 1 - F_0(c), \qquad fnr(c) = P(X \le c \mid G_1) = F_1(c)$$
ROC
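[Figure: distributions of the two groups with a cut-off, showing the regions TP, FP, FN and TN.]
[Figure: ROC curve, P(TP) from 0% to 100% against P(FP) from 0% to 100%; example curves for the predictors NetChop C-term 3.0, TAP + ProteaSMM-i and ProteaSMM-i, with the false positive rate on the horizontal axis.]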
ROC - ACC
Accuracy:
$$ACC = \frac{TP + TN}{P + N}$$
Four example classifiers, each evaluated on P = 100 positive and N = 100 negative cases:
Classifier	TP	FP	FN	TN	TPR	FPR	ACC
A	63	28	37	72	0.63	0.28	0.68
B	77	77	23	23	0.77	0.77	0.50
C	24	88	76	12	0.24	0.88	0.18
C'	88	24	12	76	0.88	0.24	0.82
[Figure: ROC space, TPR against FPR (1 - specificity); the top-left corner is the perfect classifier, points above the diagonal are better than random and points below it are worse.]
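The rates for the example classifiers follow directly from the four cells of the confusion matrix. A small sketch, with the hypothetical helper `classification_rates`:

```python
def classification_rates(tp, fp, fn, tn):
    """TPR, FPR and accuracy from the cells of a 2x2 confusion matrix."""
    positives = tp + fn      # P: all actually positive cases
    negatives = fp + tn      # N: all actually negative cases
    return {
        "TPR": tp / positives,
        "FPR": fp / negatives,
        "ACC": (tp + tn) / (positives + negatives),
    }


# Classifier A from the example: TP=63, FP=28, FN=37, TN=72
print(classification_rates(63, 28, 37, 72))
```

Classifier A has an accuracy of 0.675, which the table rounds to 0.68.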
ROC - AUC, Gini
The AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It can be shown that the area under the ROC curve is equivalent to the Mann-Whitney U statistic, which tests for the median difference between scores obtained in the two groups considered, provided the data are continuous. It is also equivalent to the Wilcoxon test of ranks. The AUC is related to the Gini coefficient (G) by the formula G + 1 = 2 × AUC.
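The probabilistic reading of the AUC can be illustrated by counting pairs directly. This is a sketch only: the function names are hypothetical, and counting ties as one half follows the usual Mann-Whitney convention, which is an assumption here rather than something stated in the text.

```python
def auc_pairwise(positive_scores, negative_scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5      # assumption: ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))


def gini_from_auc(auc):
    """Rearranged relation G + 1 = 2 * AUC."""
    return 2.0 * auc - 1.0


auc = auc_pairwise([3, 4, 5], [1, 2, 3.5])
print(auc, gini_from_auc(auc))
```

The double loop is quadratic in the sample size; for large samples a rank-based computation gives the same value more cheaply.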
Further evaluation charts
[Figure: box plot and histogram of scores for BAD and GOOD clients.]
Further evaluation charts
[Figure: PD - absolute, for SC 1 and SC 2; horizontal axis: Score (relative) - the higher the better.]
[Figure: PD - cumulative, for SC 1 and SC 2; horizontal axis: Score (relative) - the higher the better.]
Further evaluation charts
Lift chart:
In this case we have the proportion of all clients ($F_{ALL}$) on the horizontal axis and the proportion of bad clients ($F_{BAD}$) on the vertical axis. The ideal model is now represented by the polyline from [0, 0] through [$p_B$, 1] to [1, 1]. The advantage of this figure is that one can easily read the proportion of rejected bads against the proportion of all rejected clients. For example, we can see that if we want to reject 70% of bads, we have to reject about 40% of all applicants.
Evaluation procedures
> evaluation on training data
Evaluation on the training data used in the learning process is not suitable for assessing model quality and has low informative value, because the model can often be overfitted. An estimate of the model's predictive quality on training data is called a resubstitution or internal estimate. Estimates of quality indicators computed on training data are overestimated, so test data, set aside for this purpose during data preparation, are used instead.
> evaluation on test data
Evaluation on test data does have adequate informative value, because these data were not used to build the model. Certain requirements are placed on the test data. The test set should contain a sufficient amount of data and should represent the characteristics of the training data. The empirically recommended split is 75% of cases for training and 25% for testing. Adequate representativeness is ensured by stratified random sampling.
Evaluation procedures
> cross-validation
When the number of observations is insufficient, so that splitting the data set into training and test data for model evaluation is not possible, cross-validation is appropriate. Unlike a single split of the data set, its advantage is that every case is used to build the model and every case is used at least once for testing. The procedure is as follows:
• The data set is randomly divided into n disjoint subsets so that each subset contains approximately the same number of records. The samples are stratified by class (class membership) to ensure that the class proportions in the subsets are roughly the same as in the whole data set.
• From these n disjoint subsets, n-1 subsets are set aside for building the model (construction subset) and the remaining subset (validation subset) is used to evaluate it. The model is therefore evaluated on a subset of data from which it was not built, and its predictive quality is estimated on this subset.
• The whole procedure is repeated n times and the partial estimates of the quality indicators are averaged. The size of a validation subset is approximately the number of cases divided by the number of validation subsets.
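The cross-validation steps can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, `fit` and `evaluate` are caller-supplied callbacks working on lists of row indices, and stratification is done by dealing the shuffled indices of each class round-robin into the folds.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Split indices 0..len(labels)-1 into n_folds disjoint, class-balanced folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)   # deal round-robin
    return folds


def cross_validate(labels, n_folds, fit, evaluate, seed=0):
    """Average the quality indicator over n_folds construction/validation splits."""
    folds = stratified_folds(labels, n_folds, seed)
    estimates = []
    for k in range(n_folds):
        validation = folds[k]
        construction = [i for j, fold in enumerate(folds) if j != k for i in fold]
        model = fit(construction)
        estimates.append(evaluate(model, validation))
    return sum(estimates) / n_folds


labels = [0] * 20 + [1] * 80          # 20 bad and 80 good clients
folds = stratified_folds(labels, 5)
print([sum(1 for i in fold if labels[i] == 0) for fold in folds])
```

Each of the five folds receives four of the twenty bad clients, so the class proportions match the whole data set.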
Evaluation procedures
> bootstrap method
The bootstrap method examines the characteristics of individual resampled samples drawn from the empirical sample. If the original sample contains m elements, each of them has a chance of appearing in a resampled sample. In complete resampling with sample size n, all possible samples are considered, so there are m^n possible samples. Complete resampling is theoretically feasible but would take a great deal of time. An alternative is Monte Carlo simulation, which approximates complete resampling by drawing B random samples (usually 500 to 10000) with replacement, i.e. each drawn element is returned to the urn. Given data X = {X1, ..., Xn} and a required estimate of a parameter θ, B samples are drawn from the original data and the estimate of θ is computed for each sample. The bootstrap estimate of the parameter is the mean of these partial estimates. In model evaluation, the parameter θ is the chosen indicator of predictive quality.
> jackknife
This method is based on sequentially removing and returning elements in a sample of size n. For a data set with n elements, the procedure generates n samples of n-1 elements. For each reduced sample of size n-1 the value of the parameter is estimated. The partial estimates are then averaged, as in the bootstrap method.
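Both resampling schemes fit in a few lines. The sketch below is illustrative: the function names are hypothetical, `statistic` stands for any quality indicator computed from a list of cases, and the defaults (1000 resamples, fixed seed) are arbitrary choices.

```python
import random
from statistics import fmean


def bootstrap_estimate(data, statistic, n_resamples=1000, seed=0):
    """Mean of the statistic over n_resamples samples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        sample = [data[rng.randrange(n)] for _ in range(n)]   # with replacement
        estimates.append(statistic(sample))
    return sum(estimates) / n_resamples


def jackknife_estimate(data, statistic):
    """Mean of the statistic over the n leave-one-out samples of size n-1."""
    n = len(data)
    estimates = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    return sum(estimates) / n


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(bootstrap_estimate(data, fmean), jackknife_estimate(data, fmean))
```

For model evaluation, `statistic` would be a function that computes, for example, the Gini index or KS on the resampled cases.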