Contents:
1. Credit scoring (CS): history, basic concepts .......... 3
2. Introduction to SAS EG .......... 45
3. Methodology of developing scoring functions .......... 92
4. Data preparation II .......... 164
5. Introduction to cluster analysis. Hierarchical clustering .......... 212
6. Development of a CS model .......... 275
7. Introduction to survival analysis .......... 331
8. Cox regression .......... 382
9. Model evaluation II .......... 398
10. Setting the cut-off. RAROA, CRE. Monitoring .......... 443
11. References .......... 479
1. Credit scoring: history,
basic concepts
3
4
Introduction
Credit scoring is a set of predictive models and underlying techniques that support financial institutions in granting credit.
These techniques decide who gets credit, how large the loan should be, and which further strategies will increase the profitability of borrowers to lenders.
Credit scoring techniques quantify and assess the risks of lending to a particular consumer.
5
Introduction
They do not identify and label "good" or "bad" (where negative behaviour, e.g. default, is expected) credit applications on an individual basis; rather, they provide statistical odds, or probabilities, that an applicant with a given score will turn out "good" or "bad".
These probabilities or scores, together with other business considerations such as the expected approval rate, profit, or losses, are then used as the basis for the decision to grant or refuse credit.
Why do we need a score?
"HISTORICAL EVOLUTION":
Money lender
• lends only to people whom he knows
Operators
• make decisions based on the client's information and their own experience
Automatic scoring
• decisions made on a statistical basis
PAST EXPERIENCE -> ESTIMATION FOR THE FUTURE
6
Why score?
ADVANTAGES:
• Automation of the approval process
• Cost-effective
• Fewer opportunities for fraud
DISADVANTAGES:
• Statistically based: does not consider the client as an individual
7
Introduction
While the history of credit goes back 4,000 years (the first recorded mention of credit comes from ancient Babylon, 2000 BC), the history of credit scoring is only 50–70 years old.
The first approach to the problem of identifying groups in a population was introduced into statistics by Fisher (1936). In 1941, Durand was the first to recognize that these techniques could be used to discriminate between good and bad loans.
8
Introduction
The Second World War was an important milestone in credit assessment.
Until then, the standard was individual assessment of each loan applicant, and the financial sector employed (almost) exclusively men.
The departure of a large part of the male population for military service created the need to pass on the experience of the existing credit assessors to new staff.
This gave rise to a kind of decision rules and to an "automation" of loan application assessment.
9
Introduction
The arrival of credit cards at the end of the 1960s and the growth of computing power caused an enormous expansion in the use of credit scoring techniques.
The event that ensured full acceptance of credit scoring was the passage of the Equal Credit Opportunity Acts and their later amendments in the USA in 1975 and 1976.
These made discrimination in granting credit unlawful, except where the discrimination "was empirically derived and statistically valid".
10
Introduction
In the 1980s, logistic regression, still considered the industry standard in many areas, and linear programming came into use. Somewhat later, artificial intelligence methods such as neural networks appeared on the scene.
Other techniques in use include nearest-neighbour methods, splines, wavelet smoothing, kernel smoothing, Bayesian methods, regression and classification trees, support vector machines, association rules, cluster analysis, and genetic algorithms.
11
History – detail
Source: Anderson (a series of timeline figures, not reproduced here).
Notes on terms from the figures: a "propensity scorecard" models the propensity (to buy); "FI" is the company Fair, Isaac, today FICO.
History – further interesting reading
http://www.fundinguniverse.com/company-histories/Fair-Isaac-and-
Company-Company-History.html
http://www.fico.com/en/Company/News/Pages/03-10-2009.aspx
http://www.directlendingsolutions.com/history_credit_scoring.htm
http://www.pbs.org/wgbh/pages/frontline/shows/credit/more/scores.html
http://en.wikipedia.org/wiki/Credit_score
17
Risk Management – Acquisition
Data acquisition combines internal information (fraud, delinquency, bankruptcy, claims) with external data (credit bureau, other external sources). The strategy layer (policy rules, scorecards) then passes or fails each application.
18
Risk Management – Customer
Credit Line Management
Usage Monitoring
Transaction Fraud
Transaction Approval
Renewal/Reissue
Collections
Claims
Scorecards
Policy Rules
Strategies
.. Lots of analysis
19
Risk Management (Risk Dimensions)
Enterprise risk management spans financial and operational risk. On the financial side: commercial/consumer credit risk (delinquency, fraud, claims, collections) and market risk (market, interest, VaR).
Clients fail to repay the loans granted.
Changes in interest rates, share prices, exchange rates.
21
22
Risk Management
Commercial/consumer risk: delinquency, fraud, claims, collections.
Fraud: applicant, transaction, claims, internet (application + transaction).
Claims: P&C, life, health; mortgage insurance; export financing insurance.
Delinquency: late payments, bankruptcy, write-off.
Collections: payment projection (recovery), outsourcing to an agency.
P&C: Property & Casualty Insurance
Why Manage Risk?
Reduce exposure to high-risk accounts.
Decrease bad debt and claims payouts.
Ensure better pricing to reflect risk.
Detect fraud early on.
Increase approval rates (the “right kind” – potentially increasing
revenue).
Handle most approvals/declines quickly (customer service).
Analysts/investigators only focus on difficult accounts.
Ensure consistent, equal and objective treatment of each
applicant across the organization.
Offer more efficient marketing initiatives.
23
Users of Risk Management
Banks
Citibank, Royal Bank, CIBC, BankOne
Finance Companies
GE Capital, HFC, GMAC
Insurance
Life, Property and Casualty, Health
Government
Ministries/Departments of Health (Medicare), Ministries of
Finance (IRS), Workers Compensation.
24
Users of Risk Management
Utilities
Hydro/Power/Energy, Water
Communications
Bell, Sprint, AT&T (land lines and cellular)
Retail
JC Penney, Sears, Hudson's Bay Company, Target
Manufacturers/Industrials
Those who give credit to small businesses.
25
Risk Management “Toolbox”
Risk Data Mart/Data Warehouse
Risk prediction models (scorecards)
Reporting
Analysis tools
Operational/strategy implementation software
(for example, FICO™ Blaze Advisor®, FICO® TRIAD®
Customer Manager, Experian Probe SM, Experian
NBSM, Cardpac, VisionPlus, Pro-Logic Ovation).
26
FICO™ Blaze Advisor®
Zdroj: http://www.fico.com/account/resourcelookup.aspx?theID=430
27
Scorecards
Predict the probability of a negative event.
Custom – based on clients own data
Generic – based on pooled industry or bureau data (Beacon,
Empirica)
Application – new applicants
Behavioral – current customers
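The slides mention scaling later but give no formula. A common industry convention (an assumption here, not taken from the deck) is points-to-double-odds (PDO) scaling, which maps good:bad odds to a score. The sketch below uses Python rather than SAS, with illustrative base values: 600 points at 30:1 odds and 20 points to double the odds.

```python
import math

def score_from_odds(odds, base_score=600, base_odds=30.0, pdo=20.0):
    """Map good:bad odds to a scaled score: base_odds maps to
    base_score, and every pdo points the odds double."""
    factor = pdo / math.log(2)                       # points per doubling of odds
    offset = base_score - factor * math.log(base_odds)
    return offset + factor * math.log(odds)

def p_bad(odds):
    """Good:bad odds of 30:1 correspond to P(bad) = 1/31."""
    return 1.0 / (1.0 + odds)

print(round(score_from_odds(30.0)))  # base point of the scale
print(round(score_from_odds(60.0)))  # odds doubled -> pdo more points
```

With these (hypothetical) parameters, 30:1 odds score 600 and 60:1 odds score 620.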
28
Scorecard Types
Mktg/CRM
Response
Churn
Revenue
Cross sell
Risk
30/60/90 Delinquency
Bankruptcy
Write-off
Claim
Fraud
Collections
Combination
Resp/approve/delq
Response/profit
Risk/churn/profit
Profit
29
Scoring in approval process
A new client passes through hard checks, then scoring on fraud and default (with cutoffs set on RAROA), then verifications (depending on risk group). Failing any step means rejection; passing all steps means approval.
Policy declines: low age, insufficient length of employment, a "terrorist"-list match, etc.
Questions answered along the way: What is the probability that the client will pay? Will the contract be profitable? Is the client's phone number valid? Etc.
30
Fraud Risk
Fraud risk is one of the fastest growing areas in risk
management.
Examples include bank/retail card fraud, insurance fraud,
health care fraud, welfare fraud, franchise fraud, internet
fraud, mortgage fraud, investment fraud, tax fraud, merchant
fraud.
E-commerce presents new opportunities for fraud.
The F.B.I. estimates that between 10–15% of loan applications
contain material misrepresentations.
31
Reporting and Analysis
Scorecard and portfolio performance
Approval rates, applicant profile, loss rates,
high risk segments
Behavior tracking to develop better strategies
Capturing fraud, approval/decline, pricing,
credit line management, collections, cross sells qualification,
claims.
32
Risk Applications
Retail/banking (consumer and commercial)
Application and behavior scorecards for all credit products.
Strategy design for credit limit setting, authorizations and
collections/reissue/suspension.
Fraud application and transaction detection
Pricing/down payment
ATM limits, check holds
Pre-qualifying direct marketing lists.
Automotive/finance
Loans and leasing
Application, behavioral, fraud, collection scorecards
Pricing/down payment.
33
Risk Applications
Government
Fraud detection (for example, Welfare, health insurance)
Entitlement/claims assessment (for example, Workers
compensation)
Communications
Security deposit
International call access
Contract/”pay as you go”
Telephone fraud
“Shadow limit” setting
Suspension of service
Collections.
34
Risk Applications
Insurance
Rate setting
Fraud detection
Claims management
Risk control for CRM initiatives.
Utilities
Security deposit
Collections.
35
Risk Applications
Manufacturers/pharmaceuticals/industrials
Assessing credit risk of business clients
Credit risk assessment of franchisees
(for example, gas stations)
Payment terms
Collections
Merchant fraud.
36
Risk Applications
Optimizing work flow in adjudication departments
Evaluating/pricing portfolios
Securitization
Setting economic/regulatory capital allocation
Reducing turnaround time (automated scoring)
Comparing quality of business from different
channels/regions/suppliers.
37
Resources
www.ftc.gov/bcp/conline/pubs/credit/scoring.htm
www.creditscoring.com
www.my-credit-score.com
www.fairisaac.com, www.myfico.com
www.experian.com
www.creditinfocenter.com
www.consumersunion.org/finance/scorewc200.htm
www.phil.frb.org/files/br/brso97lm.pdf
www.nacm.org
www.rmahq.org
www.riskmail.org
www.occ.treas.gov
38
Resources
Credit Scoring & Its Applications, by Lyn Thomas, Jonathan Crook, David Edelman
Credit Risk Modeling: Design and Application, by Elizabeth Mays (Editor)
Internal Credit Risk Models: Capital Allocation and Performance Measurement, by Michael K. Ong
Handbook of Credit Scoring, by Elizabeth Mays
Applications of Performance Scoring to Accounts Receivables Management in Consumer Credit, by John Y. Coffman
Introduction to Credit Scoring, by E.M. Lewis
39
Scorecard Development Roles – Objectives
Understand the critical resources needed to successfully
complete a scorecard development and implementation
project.
Understand some of the operational considerations that go
into scorecard design.
40
Major Roles
Scorecard Developer
Data miner, data issues
Credit Scoring Manager/Risk Manager
Strategic view, corporate policies, implementation
Product Manager
Client base, target market, marketing direction.
41
Major Roles
Operational Managers
Customer Service, Adjudication, Collections
Strategy execution, impact on customers
IT/IS Managers
external/internal data, implementation platforms.
42
Minor Roles
Project Manager
Coordination, time lines
Corporate Risk staff
Corporate policies, capital allocation
Legal.
43
Why All of These Roles?
Can I use this variable?
Legal, technical (derived variables, implementation platform),
future application form design
Segmentation
Marketing, application form design, systems
What is the impact on this segment?
Operational, marketing, risk manager, corporate risk.
44
2. Introduction to SAS EG
45
Introduction to SAS Enterprise Guide
SAS Enterprise Guide provides a point-and-click interface
for managing data and generating reports.
46
SAS Enterprise Guide Interface
SAS Enterprise Guide also includes a full programming
interface that can be used to write, edit, and submit
SAS code.
47
SAS Enterprise Guide Interface: The Project
A project serves as
a collection of
data sources
SAS programs
and logs
tasks and queries
results
informational notes
for documentation.
You can control the contents, sequencing, and
updating of a project. 48
data work.clubmembers work.nonclub;   /* DATA step */
   set orion.customer;
   if Customer_Type_ID = 3010
      then output work.nonclub;
      else output work.clubmembers;
run;

proc print data=work.nonclub;         /* PROC step */
   title "Non Club Members";
   var Country Gender Customer_Name;
run;

SAS Programs (ep02d01.sas)
49
PROC PRINT Output
50
Saving SAS Programs
The SAS program in the project is a shortcut to the physical storage location of the .sas file. Select the program icon and then select File → Save <program name> to save the program under the same name, or Save <program name> As… to choose a different name or storage location.
51
Embedding Programs in a Project
A SAS program can
also be embedded in
a project so that the
code is stored as part
of the project .epg
file.
Right-click the Code icon in a project and select Properties → Embed.
52
How Do You Include Data in a Project?
Selecting File → Open → Data adds a shortcut to a SAS data source in the project.
53
Assigning a Libref
You can use the Assign Project Library task to define a
SAS library for an individual project.
54
Browsing a SAS Library
During an interactive SAS Enterprise Guide session, the Server List
window enables you to manage your files in the windowing environment.
In the Server List window,
you can do the following:
view a list of all the servers
and libraries available during
your current SAS Enterprise
Guide session
drill down to see all tables
in a specific library
display the properties of a table
delete tables
move tables between libraries
55
Applying Formats
Display formats can be applied in a SAS Enterprise Guide
task or query by modifying the properties of a variable.
56
Query Builder Join
When you use the Query Builder to join tables
in SAS Enterprise Guide, SQL code is generated.
SQL does not require sorted
data.
SQL can easily join multiple
tables on different key variables.
SQL provides straightforward
code
to join tables
based on a non-equal
comparison of common columns
(greater than, less than,
between).
57
Sort Data Task
The Sort Data task enables you to create a new data set
sorted by one or more variables from the original data.
58
Business Scenario
Orion Star wants to send information about a specific
promotion to female customers in Germany. The report can be
created by querying the orion.customer data set to
include only the desired customers, and then by producing a
report with the List Data task.
59
Business Scenario
The same report can be generated more efficiently by subsetting the
data directly within the List Data task. This requires modification of
the code generated by SAS Enterprise Guide.
60
Understanding Generated Task Code
There are many situations where task results created
by SAS Enterprise Guide can be further enhanced or
customized by modifying the code.
However, before you can effectively modify the code, you
must first understand the code that SAS Enterprise Guide
generates.
61
List Data Task
The Preview code button enables
you to view and modify the code
generated by the task.
62
List Data Task – Code Preview
63
Using the List Data Task to Generate Code
This demonstration illustrates building a List Data
task and examining the code generated
by SAS Enterprise Guide.
64
List Data Task – Generated Code
The initial comment block shows information about the task.
65
List Data Task – Generated Code
The first line uses a macro to delete temporary tables or views if they already exist.
If the Group by role is used in the task, the data must be ordered by the grouping
variable. PROC SORT is used by default. Only variables assigned to roles are kept in
the new data set.
66
List Data Task – Generated Code
If the Group by role is not used, SQL creates a temporary view of the required
data. Again, only variables assigned to roles in the task are included in the view.
This comment incorrectly states that sorting occurs.
67
List Data Task – Generated Code
The main part of the code includes the titles, footnotes, and procedure code to
generate the report. PROC PRINT is the procedure used with the List Data task.
TITLE and FOOTNOTE are examples of global statements and can be
included anywhere in a SAS program.
68
List Data Task – Generated Code
At the end, the final lines of code delete any temporary
tables created to build the task, and delete any assigned
titles and footnotes.
69
Techniques to Modify Code
Three methods can be used to modify code generated by
SAS Enterprise Guide:
1. Edit the last submitted task code in a separate Code window.
2. Automatically submit custom code before or after every task and query.
3. Insert custom code in a task.
70
Edit Last Submitted Code
After a task runs, the code can be viewed from either the
Project Tree or Process Flow.
71
Edit Last Submitted Code
The task code is read-only and cannot be edited directly. To create a copy of the code from the Last Submitted Code window, press any key while in the SAS program window; SAS Enterprise Guide then offers to make a copy.
After the code is copied, there is no link between the task and the
new code. Any changes in the task are not reflected in the copied
code, and modifications to the code do not affect the task.
72
Summary of Editing Last Submitted Code
Custom code linked to task? No.
Can be used to modify query code? Yes.
Extent of modification allowed: anything in the program can be changed.
Custom code included when exported? Yes; you must export the edited program and select the option in the Export wizard.
73
Automatically Submit Custom Code Before
or After Every Task and Query
There are times when you might need to run a SAS
statement or program before or after any task or query is
executed. The Custom Code option enables you to insert
custom code before or after all tasks and queries.
74
Automatically Submit Custom Code Before
or After Every Task and Query
To run code before tasks
and queries, select the
first check box and select
Edit… to type the code.
75
Automatically Submit Custom Code
Before or After Every Task and Query
Global statements or complete program steps can be entered.
Example: Set the LOCALE= option to Great Britain.
76
Insert Code Before or After SAS Programs
Similar options exist to automatically submit code before or after
SAS programs written and submitted in Code windows in SAS
Enterprise Guide.
77
Summary of Submitting Custom Code
Before or After Every Task and Query
Custom code linked to task? Yes.
Can be used to modify query code? Yes.
Extent of modification allowed: statements can only be submitted before or after the task code.
Custom code included when exported? Yes; select the option in the Export wizard.
78
Insert Custom Code in a Task
In most task dialog boxes, you have the ability to insert custom code within the
generated SAS program. This technique has the significant benefit that the task
interface can still be used to modify the report.
79
Insert Custom Code in a Task
In the Code Preview window, select Insert Code… to add
custom code in predefined locations in the SAS program.
80
Insert Custom Code in a Task
In any of these
predefined
locations, you
can double-click
on a line to insert
custom code.
81
Insert Custom Code in a Task
Some insert points enable custom options to be added to existing statements.
Insert options in
the PRINT statement.
Insert options in
the VAR statement.
82
Insert Custom Code in a Task
Other insert points enable entire statements to be added
inside a step in the program.
Statements
inside the
PRINT step
83
Insert Custom Code in a Task
Additional locations enable global statements or additional steps to be
inserted before or after the main code.
Locations for
global statements
or additional steps
84
Default SAS Enterprise Guide Footnote
The default footnote includes macro references to the SAS server name, operating system, and date and time that the task runs:
Generated by the SAS System version &SYSVER (&_SASSERVERNAME, &SYSSCPL)
on %TRIM(%QSYSFUNC(DATE(), NLDATE20.)) at %TRIM(%SYSFUNC(TIME(), NLTIMAP20.))
85
ODS and SAS Enterprise Guide
Default result formats can be set under Tools → Options.
86
ODS and SAS Enterprise Guide
Additional settings can be made for each result format.
87
ODS and SAS Enterprise Guide
Task properties can
be used to override
the default for an
individual task.
Generated output
can be switched
off completely
and handled by
inserting code.
Right-click on
a task icon and
select Properties.
88
SAS Enterprise Guide Help (Review)
If Help files were installed along with SAS Enterprise Guide,
you can select Help to access the Help facility regarding both
the point-and-click functionality of SAS Enterprise Guide as
well as SAS syntax.
89
Task and Procedure Help
To find information
regarding the syntax
of the code behind
the scenes of a
particular task, type
the name of the task
in the Index tab.
The task help
indicates the
procedure name
to search in the
SAS syntax
help.
90
Procedure Syntax Help
91
3. Methodology of developing
scoring functions
92
Objectives
Understand how scorecards to predict credit risk are
developed.
Understand the analyses and issues for implementation of
scorecards.
93
Main Stages – Development
Stage 1: Preliminaries and Planning
Create Business Plan
Identify organizational objectives
Internal versus External development,
and scorecard type
Create Project Plan
Identify project risks
Identify project team.
94
Main Stages – Development
Stage 2: Data Review and Project Parameters
Data availability and quality
Data gathering for definition of project parameters
Definition of project parameters
Performance window and sample window
Performance categories definition (target)
Exclusions
Segmentation
Methodology
Review of implementation plan.
95
Main Stages – Development
Stage 3: Development Database Creation
Development sample specification
Sampling
Development data collection and construction
Adjusting for prior probabilities.
96
Main Stages – Development
Stage 4: Scorecard Development
Missing values and outliers
Initial characteristic analysis
Preliminary scorecard
Reject inference
Final scorecard production
Scaling
Points allocation
Misclassification
Scorecard strength
Validation.
97
Main Stages – Development
Stage 5: Scorecard Management Reports
Gains tables and charts
Characteristic reports.
98
Main Stages – Implementation
Stage 1: Pre-Implementation Validation
Stage 2: Strategy Development
Scoring strategy
Setting cutoffs
Strategy considerations
Policy rules
Overrides.
99
Main Stages – Post Implementation
Post-Implementation
Scorecard and Portfolio Monitoring Reports
Review.
100
Development
Stage 1: Preliminaries and Planning
101
Objectives
Create a business plan to ensure a viable and smooth project.
"All models are wrong. Some are useful."
George Box
102
Create Business Plan
Identify organizational objectives.
Reasons for model development
Profit, revenue, loss, automation, operational efficiency
Role of scorecards in decision making
sole arbiter or decision support tool?
103
Create Business Plan
Internal/External Development and Scorecard Type
Capability and resources
Staff, tools, expertise, data
Market segment
Custom, generic, judgmental
segment, data, time.
104
Create Project Plan
Scope and timelines
Deliverables (scorecard format and documentation,…)
Implementation strategy
Testing, coding
Strategy development
FYI list.
Seamless process from planning to development and
implementation.
105
Create Project Plan
Identify Project Risks
Data risks
Availability, quality, quantity
Weak data
Operational risks
Organizational priority
Implementation delays
System interpretation of data.
106
Create Project Plan
Identify Project Team
Roles clearly defined
Signoff, executor, advisor, FYI
Critical path.
107
Development
Stage 2: Data Review and Project Parameters
108
Objectives
Identify data requirements.
Perform pre-modeling analysis.
Understand the business
Exclusions
What is a “bad”? – target definition
Sample Window/ Performance Window.
109
Data Availability and Quality
Number of “goods”, “bads” and “rejects”
Initial idea at this stage, estimated from performance reports
Internal data
Reliable, accessible
External data
Accessible, format
Retro pull.
110
Data Gathering
To determine “bad” definition and exclusions:
All applications over the last 2–5 years
(or a large sample)
account/ID number
Date opened/applied
Accept/reject indicator
Arrears/payment history
Product/channel and other identifiers
Account status
Other items to understand the business.
111
Exclusions
"Include those whom you would score during normal day-to-day operations."
VIP
Staff
Fraud
Pre-approved
Underage
Cancelled (sometimes).
112
Performance
A new account is booked during the "sample window"; its behaviour is then observed over the "performance window" to determine whether it ends up good or bad.
113
Parameters
Performance Window
How far back do I go to get my sample?
Sample Window
Time frame from which sample will be taken.
Definition of “bad”
Bad and approval rates (when oversampling).
114
Parameters
Seasonality
Plot approval rate/applications across time
Establish any ‘abnormal’ zones (for example, talk to marketing).
Sample used in development must be from a normal
business period, to get as accurate a picture as possible of
the target population.
115
Parameters – “Bad”
Plot “bad” rate by “month opened” (cohort)
For different definitions of bad
30/60/90 days past due
Charge off/write-off
Bankrupt
Claim
Profit based
Less than x% owed collected
“Ever” versus “Current” bad
Ever bad should be used where possible
Considered “bad” if you reach status
anytime during performance window.
116
Cohort Analysis – Example
Bad = 90 days
Open Date 1 Qtr 2 Qtr 3 Qtr 4 Qtr 5 Qtr
Jan-99 0.00% 0.44% 0.87% 1.40% 2.40%
Feb-99 0.00% 0.37% 0.88% 1.70% 2.30%
Mar-99 0.00% 0.42% 0.92% 1.86% 2.80%
Apr-99 0.00% 0.65% 1.20% 1.90%
May-99 0.00% 0.10% 0.80% 1.20%
Jun-99 0.00% 0.14% 0.79% 1.50%
Jul-99 0.00% 0.23% 0.88%
Aug-99 0.00% 0.16% 0.73%
Sep-99 0.00% 0.13% 0.64%
Oct-99 0.20% 0.54%
Nov-99 0.00% 0.46%
Dec-99 0.00% 0.38%
Jan-00 0.30%
Feb-00 0.00%
Mar-00 0.00%
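A minimal sketch of how such a cohort table is used: only cohorts observed long enough for the bad rate to level off should enter the development sample. The code below is Python for illustration, with a handful of rates copied from the table; the five-quarter maturity threshold is an illustrative assumption, not a rule from the slides.

```python
# Cumulative 90+ day bad rates by quarter on book for a few cohorts
# from the table above (one value per observed quarter).
cohorts = {
    "Jan-99": [0.0000, 0.0044, 0.0087, 0.0140, 0.0240],
    "Apr-99": [0.0000, 0.0065, 0.0120, 0.0190],
    "Jul-99": [0.0000, 0.0023, 0.0088],
    "Jan-00": [0.0030],
}

def mature_cohorts(cohorts, min_quarters):
    """Keep only cohorts observed for at least min_quarters,
    i.e. long enough for the bad rate to level off."""
    return [m for m, rates in cohorts.items() if len(rates) >= min_quarters]

print(mature_cohorts(cohorts, 5))  # prints ['Jan-99']
```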
117
Current versus Ever – Example
Current bad definition: no delinquency at the end of the window.
Ever bad definition: ever 3 months delinquent.
Month 1 2 3 4 5 6 7 8 9 10 11 12
Delq 0 0 1 1 0 0 0 1 2 3 0 0
Month 13 14 15 16 17 18 19 20 21 22 23 24
Delq 0 0 1 2 0 0 0 1 0 1 0 0
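The distinction can be sketched in a few lines (Python used here for illustration). For the history above, the account is bad under the "ever" definition (it reaches 3 months delinquent in month 10) but good under the "current" definition (it ends the window with zero delinquency):

```python
# Monthly delinquency status from the example above
# (0 = up to date, n = n payments behind).
delq = [0, 0, 1, 1, 0, 0, 0, 1, 2, 3, 0, 0,
        0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 0, 0]

def bad_current(history):
    """'Current' definition: bad only if delinquent at the
    end of the performance window."""
    return history[-1] > 0

def bad_ever(history, threshold=3):
    """'Ever' definition: bad if the status reached the threshold
    at any time during the performance window."""
    return max(history) >= threshold

print(bad_current(delq))  # False: good under the current definition
print(bad_ever(delq))     # True: bad under the ever definition
```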
118
Determining Parameters
Bad Rate Development
[Chart: bad rate, 0% to 7%, by month opened, Jan-00 through Mar-02.]
119
Notes: the chart plots month opened, from earliest to latest, against the bad rate as of that month (for simplicity, straight delinquency, no profit). Notice that at one point the bad rate levels off: everyone who was going to go bad has gone bad, i.e. the accounts have been given enough time. This tells us that for this bad definition, accounts from Jan–March are mature enough.
Lesson 1: you need a sample that is mature enough, so that you don't label an account good just because it hasn't been given enough time to go bad. If you take accounts from the middle of the chart, some have not matured yet, so the bad rate is understated.
Example: response scoring. How long do you wait for the responses to come in? The period of measurement is the "performance window".
Determining Parameters
Bad Rate Development
[The same bad-rate-by-month-opened chart, with the "sample window" (mature cohorts to be sampled) and the "performance window" (time allowed for performance to mature) marked.]
120
So for each definition of "bad" you get a sample window of mature accounts and a performance window indicating the time taken for the bad rate to mature, plus the approval rate for that sample window.
A couple of notes on this "maturing" process:
A 30-day definition matures quicker than a 90-day one, because it takes people less time to go 30 days past due than 90 days; charge-off takes even longer.
For the same bad definition, credit cards mature quicker than mortgages (18–24 months vs. 3–5 years).
Why do all this for the different definitions? Because each one produces different counts, and based on the considerations on the next slide we determine the best set of parameters.
Determining Parameters – Bad
Organizational objectives/purpose
Tighter definition – more precise, low counts
Looser definition – differentiation sub-optimal
Interpretable and trackable
Consistency
Reality – the best definition under the circumstances (lack of
data, history).
121
Let's look at the considerations.
Objectives: this may seem obvious, but it is not to a lot of people. If you're building a scorecard to predict profit, then use profit. Some organizations want a delinquency-based definition but also bring in profit: e.g. if an account is chronically 2 months late but still profitable, you can't set 2 months as "bad", whereas in a pure delinquency scorecard this may be possible.
Tighter/looser: tighter means 90-day, 120-day, write-off; better differentiation, but low counts (remember the 2,000 bads). Looser means higher counts but sub-optimal differentiation.
Interpretable: e.g. bad = 2 x 60 days, 3 x 30 days, or 1 x 90 days sounds good, but is hell to track and interpret. Keep it simple.
Consistency across other scorecards and products; also, if accounting writes off accounts at 7 months, keep it consistent with that. Typically most delinquency scorecards use 90 days.
Reality: you take what you've got. If the lack of history allows only a 30-day definition, take it. If you can't measure the real bad rate, use a proxy (the example given: a line of credit treated like an account).
Sample Definitions – Bad
Ever 90 days delinquent
Bankrupt
Claim over $1000
3 x 30 days, or 2 x 60 days, or 1 x 90 days
Negative NPV
Not profitable
50% recovered within 3 months
Fraud over $500
Closed within 6 months.
122
Confirming “Bad” Definition
Analytical
“Roll rate” analysis
Current versus worst delinquency comparison
Profitability analysis
Consensus.
123
Roll Rate Analysis
Compare worst delinquency: for example, previous 12 months versus next 12 months.
Month 1 2 3 4 5 6 7 8 9 10 11 12
Arrears 0 0 1 2 0 0 0 1 2 3 0 0
Month 13 14 15 16 17 18 19 20 21 22 23 24
Arrears 1 2 3 3 3 4 3 0 0 0 0 0
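For a single account, the roll-rate inputs are just the worst delinquency in each of the two windows; across a portfolio these pairs are then cross-tabulated. A minimal sketch using the history above (Python for illustration):

```python
# The example arrears history above, months 1-24.
arrears = [0, 0, 1, 2, 0, 0, 0, 1, 2, 3, 0, 0,   # previous 12 months
           1, 2, 3, 3, 3, 4, 3, 0, 0, 0, 0, 0]   # next 12 months

worst_prev = max(arrears[:12])   # worst delinquency in months 1-12
worst_next = max(arrears[12:])   # worst delinquency in months 13-24

print(worst_prev, worst_next)    # 3 4: this account rolled forward (got worse)
```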
124
Roll Rate Analysis
[Stacked bar chart, 0% to 100%: worst delinquency in the previous 12 months (current/x-day, 30-day, 60-day, 90+) against worst delinquency in the next 12 months.]
125
This is how you find out which "bad definition" is truly bad, also known as the POINT OF NO RETURN.
Look at 30 days: of everyone whose worst status was 30 days, the majority became current and only a few became worse, so this is not a good bad definition. Of those at 60 days, some went further over, but most went back, i.e. improved. But of those who were at 90 days, the majority did not improve. This confirms our definition.
In general, once you hit 90 days, you're not coming back: that's a true bad. Remember this is based on a delinquency objective; with a different objective there may be a different point of no return.
Roll Rate Analysis
Look for ‘point of no return’.
Consider objectives.
Consider sample counts.
Typically for delinquency, after 90 days most accounts do not
cure.
126
Current versus Worst Comparison
Columns: worst delinquency ever reached. Rows: current delinquency status (column percentages).

Worst:       Current  30 days  60 days  90 days  120 days  write-off
Current       100%     68%      34%      15%       4%        .
30 days         .      16%      22%       8%       5%        .
60 days         .       8%      19%      17%       8%        .
90 days         .       4%      14%      32%      11%        .
120 days        .       2%       8%      18%      54%        .
write-off       .       2%       3%      10%      18%      100%

Of accounts whose worst delinquency was 30, 60, 90, or 120 days, 32%, 44%, 60%, and 72% respectively are still at that level or worse; the slide also highlights 56%, 40%, and 18%.
Parameters – Goods/Indeterminates
Good
Never delinquent
Ever x days delinquent
No claims
Profitable, positive NPV
No fraud
No bankruptcy
Recovery > 75%, $ value
Must be good throughout
performance window
Indeterminate
Mild delinquency, roll rate not
conclusive either way
Inactive
Offer declined
Voluntary cancellations*
High balance < $50
128
Default – definition of the target variable (good/bad)
This definition is usually based on the client's number of days past due (DPD) and on the amount past due. The amount past due requires setting a certain tolerance, i.e. deciding what counts as a significant debt and what does not. For example, it may not make sense to treat amounts below CZK 100 as debt.
It is also necessary to set the time horizon (performance window) over which these two parameters are observed.
A good client can, for example, be defined as one who:
is less than 60 days past due (with a tolerance of CZK 100) in the first 6 months after the first installment,
is less than 90 days past due (with a tolerance of CZK 30) over their entire payment history (ever).
129
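The example definition above can be expressed as a small check function. The DPD limit, tolerance, and window below mirror the first bullet, the payment records are hypothetical, and Python is used only for illustration:

```python
# A client is "good" if he is never 60+ DPD with more than CZK 100 past due
# during the first 6 months on book (the slide's first example definition).
# Each record is a hypothetical (month_on_book, dpd, amount_due) triple.
def is_good(payment_history, dpd_limit=60, tolerance=100, window_months=6):
    for month_on_book, dpd, amount_due in payment_history:
        if month_on_book <= window_months and dpd >= dpd_limit and amount_due > tolerance:
            return False
    return True

history = [(1, 0, 0), (2, 30, 450), (3, 65, 80), (4, 0, 0)]
# Month 3 is 65 DPD but only CZK 80 overdue, which is within tolerance,
# so this client still counts as good.
print(is_good(history))
```

The "ever" variant of the definition is the same check with the window restriction dropped.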
Default – definition of the target variable
The choice of these parameters depends to a large extent on the type of
financial product (the choice will certainly differ between consumer loans
for small amounts with a maturity of about one year and mortgages, which
usually involve a very high amount and a maturity of up to several decades)
and on the further use of the definition (risk management, marketing, ...).
130
Default – definition of the target variable
Another practical problem of the good-client definition is the concurrence
of several contracts of one client. For example, a customer may be past due
on several contracts, but with different days past due and different
amounts. In that case, the amounts the client owes at one specific point in
time are usually summed, and the maximum of the days past due over the
individual contracts is taken. This approach can be applied only in some
cases, especially when complete accounting data are available. The
situation is considerably more complicated with aggregated data, e.g. on a
monthly basis.
131
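The sum-the-amounts, take-the-maximum-DPD aggregation can be sketched as follows. The contract records are hypothetical and Python is used only for illustration:

```python
# Client-level snapshot from several concurrent contracts at one point in
# time: overdue amounts are summed, days past due are maxed per client.
contracts = [
    {"client": 1, "dpd": 30, "amount_due": 1200},
    {"client": 1, "dpd": 90, "amount_due": 300},
    {"client": 2, "dpd": 0,  "amount_due": 0},
]

clients = {}
for c in contracts:
    snap = clients.setdefault(c["client"], {"dpd": 0, "amount_due": 0})
    snap["dpd"] = max(snap["dpd"], c["dpd"])
    snap["amount_due"] += c["amount_due"]

print(clients[1])  # client 1 ends up with dpd 90 and amount_due 1500
```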
Default – definition of the target variable
In general, we consider the following types of clients:
good,
bad,
indeterminate,
with insufficient credit history (insufficient),
excluded,
rejected.
132
The first two types have been discussed. The third type, indeterminate, lies
on the border between a good and a bad client, and its use directly affects
the definition of good/bad clients. If we consider only DPD, clients with
high DPD (e.g. 90+) are typically labelled bad, and non-delinquent clients
(DPD equal to zero) are labelled good. Delinquent clients who do not exceed
the given DPD threshold are then labelled indeterminate.
The fourth type are typically clients with a very short payment history,
for whom a correct definition of the target variable is impossible.
Excluded clients are clients whose data are so poor that they would bias
the model (e.g. frauds). Another such group are clients who are not scored
by the given model in the standard way (VIP clients).
The last type are clients whose loan application was rejected.
Default – definition of the target variable
133
Definition of a good/bad client (schematic):
Customer: Rejected or Accepted
Accepted: Insufficient, Not default, or Default (60 or 90 DPD)
Default: Fraud (first payment delayed, 90 DPD), Early default (2nd-4th payment delayed, 60 DPD), Late default (5th+ payment delayed, 60 DPD)
Resulting labels: GOOD, BAD, INDETERMINATE
134
Performance Definitions
“Goods” and “bads” (and rejects) are used for model
development.
Indeterminates included for Gains chart and forecasting.
135
Segmentation
Can one scorecard work efficiently for all the different
populations within your portfolio?
Or would more than one scorecard be better?
Segmentation maximizes predictiveness for unique
segments within your population.
136
Segmentation
Experience (Heuristic)
Knowledge/experience, operational/industry based, common sense.
Statistical
Let the data speak.
“Distinct applicant/account sub-populations”
“Better predictive power than single model”.
137
Experience Based Segmentation
Product
Card type, loan type (auto, home, unsecured), lease, used
versus new, brand
Demographics
Geographical (region, urban/rural, state/province, internal
definition, neighborhood), age, time at bureau
Source of business
Channel (net, branch, store-front, ‘take one’, brokers)
Applicant type
new/existing, first time home buyer, groups (retired,
students, engineers), thin/thick file, clean/dirty file
Product Owned
Credit Card for existing mortgage/loan
holders.
138
Experience Based Segmentation
Consider future plans, not just historic operations
How do we detect new segments?
Marketing/risk analysis:
Bad rates
Approval rate
Profit, and so on.
Look for significant performance difference.
139
Experience Based Segmentation
Need to confirm experience using analytics.
Definition of segments
What is a thin file?
What is ‘young’ versus ‘old’?
What is the best demographic split?
What break is best for ‘tenure at bank’?
140
Confirming Experience
Rule of thumb:
“When the same information predicts differently across
unique segments”
Bad Rate
                Age > 30   Age < 30   Unseg
Res Status
  Rent            2.1%       4.8%      2.9%
  Own             1.3%       1.8%      1.4%
  Parents         3.8%       2.0%      3.2%
Trades
  0               5.0%       2.0%      4.0%
  1-3             2.0%       3.4%      2.5%
  4+              1.4%       5.8%      2.3%
141
Confirming Experience
Attributes Bad Rates
Age
Over 40 yrs 1.80%
30-40 yrs 2.50%
Under 30 6.90%
Source of business
Internet 20%
Branch 3%
Broker 8%
Phone 14%
Applicant Type
First Time buyer 5%
Renewal Mortgage 1%
142
That Is the Easy Way
You can also build full segmented models, and compare
“lift”, sensitivity, and so on, with a base model.
It is best to perform this analysis for both experience and
statistically based segmentation.
143
Comparing Improvement
Use different methods to measure improvement
(lift, KS, c-stat, precision, and so on.)
Segment Total c-stat Seg c-stat Improvement
Age < 30 0.65 0.69 6.15%
Age > 30 0.68 0.71 4.41%
Tenure < 2 0.67 0.72 7.46%
Tenure > 2 0.66 0.75 13.64%
Gold card 0.68 0.69 1.47%
Platinum card 0.67 0.68 1.49%
144
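The improvement percentages in the table above can be reproduced directly. A minimal Python sketch using three of the rows (Python is used only for illustration):

```python
# Relative c-stat lift of segmented models over the base (total) model,
# using figures from the "Comparing Improvement" table.
segments = {
    "Age < 30":   (0.65, 0.69),
    "Tenure > 2": (0.66, 0.75),
    "Gold card":  (0.68, 0.69),
}
lift = {name: (seg - base) / base for name, (base, seg) in segments.items()}
for name, value in lift.items():
    print(f"{name}: {value:.2%}")
```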
Comparing Improvement
Portfolio stats will put improvements into measurable
portfolio terms.
After Segmentation Before Segmentation
Segment Size Approve Bad Approve Bad
Total 100% % % % %
Age < 30 65% % % % %
Age > 30 35% % % % %
Tenure < 2 12% % % % %
Tenure > 2 88% % % % %
Gold card 23% % % % %
Platinum card 77% % % % %
145
Choosing Segmentation
Cost of scorecards (internal/external)
Implementation
Processing
Data storage
Monitoring/strategy development
Segment size
Do I have to?
146
Statistically Based Segmentation
Less preconceived notions
Clustering
Decision Trees.
147
Clustering

Figure: scatter plot of the observations, showing 3 distinct groups and one outlier.
148
Clustering

Figure: profile of one cluster, plotting the cluster mean against the overall mean (scale 0 to 1.4) for the characteristics Age, Claims, Region A, Region B, Married:1, and Auto:Sports.
149
Here is an insurance example of one cluster. What do we see here? Lower than
average age, more claims, living in region A only, likely to be single and
driving a sports car. This is obviously a high-risk segment (confirm this
group with claims analysis).
Clusters are groups that are similar according to characteristics, not
performance, so confirm performance for the clusters and combine those with
similar risk behavior. We are not building a marketing profile, but a RISK
PROFILE.
Clustering
Defining characteristics for each group
From previous example,
Young males region A
Young females region A, and so on.
Performance analysis to confirm segmentation.
150
Decision Trees
Isolates segments based on performance (target)
Easily interpretable and differentiates between goods and
bads.
All Good/Bads (bad rate = 3.8%)
  Existing Customer (bad rate = 1.2%)
    Customer > 2 yrs (bad rate = 0.3%)
    Customer < 2 yrs (bad rate = 2.2%)
  New Applicant (bad rate = 6.3%)
    Age > 30 (bad rate = 4.2%)
    Age < 30 (bad rate = 11.7%)
151
So Now We Know ...
the business
sample and performance windows
“bad”, “good”, “indeterminate”
exclusions
bad rate, approval rate
number of scorecards needed, and their segments.
152
Methodology/Format
Implementation platform and format
Interpretability, implementation
Legal compliance
Data quality, sample size, target type
Tracking and diagnosis
Specify parameters for scorecard (range of scores, “points to
double the odds”).
153
Why ‘Scorecard’ Format?
Easiest to interpret, justify, implement
Reasons for decline/low scores can be explained to auditors,
Mgmt, regulators, adjudicators
No black box
Diagnosis, tracking, monitoring
Development process fairly simple to understand.
154
Review Implementation Plan
Number of scorecards
Data requirements
Manage expectations
Continuity.
155
Everyone is aware of what's going on.
This is a business process, not a mystery novel. You'd be surprised how many
people in companies like to spring surprises on other departments.
156
The following data are available:
Accepts.sas7bdat (64,589 rows)
Rejects.sas7bdat (35,411 rows)
Applicants.sas7bdat (100,000 rows)
... 24 columns:
ID of applicant, Date of application/opening,
Accept / Reject, 30-days delinquency, 30-days
delinquency date, 60-days delinquency, 60-days
delinquency date, 90-days delinquency, 90-days
delinquency date, Worst previous delinquency,
Current delinquency, Age, Age groups, Sex,
Existing client?, Phone member?, Region,
Income, Income groups, Debt, Income/Debt
ratio, Income/Debt ratio groups, Probability of
60-days delinquency (old), Score (old).
title 'Accepts';
proc means data=indata.accepts n nmiss min median mean
max;
var age income debt idratio;
run;
title 'Accepts';
proc freq data=indata.accepts;
table sex client phone region;
table (sex client phone region)*bad60;
table bad30*(bad60 bad90) bad60*bad90;
run;
title 'All applicants';
goptions ftext='arial';
proc catalog c=gseg kill;
quit;
proc gchart data=indata.applicants;
vbar age / midpoints=18 to 75 name='_1data_a';
vbar income / name='_1data_b';
vbar debt / name='_1data_c';
vbar idratio / name='_1data_d';
vbar type / name='_1data_e';
vbar scoreold / levels=10 name='_1data_f';
vbar pbad60old / levels=30 name='_1data_f';
run;
quit;
proc univariate data=indata.applicants normal;
var age income debt idratio;
histogram age income debt idratio;
run;
Exercise: basic description of the data
157
Exercise
Selected outputs of the code above:
158
/* 2a. Bad rate development, roll rate analysis */
%let performancewindow='31dec2002'd>=datappl;
%let deliq=worstdeliq;
proc freq data=indata.accepts /*noprint*/;
table datappl*&deliq / out=&deliq (keep=datappl &deliq pct_row
where=(&deliq ne '0')) outpct missing;
format datappl yyqs7.;
where &performancewindow;
run;
ods html path="&appl_root" file="2.&deliq..html";
goptions reset=all ftext='arial';
symbol1 i=j v=dot;
axis1 label=('Bad rate');
proc catalog c=gseg kill;
quit;
title 'Bad rate development - current deliquency';
proc gplot data=&deliq;
plot pct_row*datappl=&deliq / name='_2curdel' grid hreverse
vaxis=axis1 hminor=0;
run;
quit;
ods html close;
Exercise
159
/* cohort analysis */
%let target=bad30;
%let date=dat30;
data cohorts;
set indata.accepts (keep=datappl bad: dat:);
if &target then qtr=int(yrdif(datappl,&date,'act/act')*4)+1;
datappl=intnx('month',datappl,0);
format datappl mmyys7.;
run;
proc freq data=cohorts noprint;
table datappl / out=cohorts1 (drop=percent
rename=(count=counttotal));
table datappl*qtr / out=cohorts (drop=percent);
run;
data cohorts;
merge cohorts cohorts1;
by datappl;
if first.datappl then cumpct=.;
if qtr ne . then do;
cumpct+(count/counttotal);
output;
end;
run;
ods html path="&appl_root" file='2.cohorts.html';
title "Cohort analysis for &target";
proc tabulate data=cohorts missing format=percent8.4;
class datappl qtr;
var cumpct;
table datappl,qtr*cumpct=''*sum='';
run;
ods html close;
Exercise
160
/* performance window */
%let performancewindow='31dec2002'd>=datappl;
proc tabulate data=indata.accepts out=brdev
(drop=_type_ _table_ _page_);
class datappl;
var bad90 bad60 bad30;
table datappl,(bad90 bad60 bad30)*mean*format=percent8.2;
format datappl yyqs7.;
where &performancewindow;
label datappl='Month opened';
run;
ods html path="&appl_root" file='2.perf.html';
goptions reset=all ftext='arial';
symbol1 i=j v=dot;
axis1 label=('Bad rate');
proc catalog c=gseg kill;
quit;
title 'Bad rate development';
proc gplot data=brdev;
plot (bad:)*datappl / name='_2perf' grid overlay legend hreverse
vaxis=axis1 hminor=0;
run;
quit;
ods html close;
Exercise
161
/* bad rate development */
%let samplewindow='30jun2001'd>=datappl>='01apr2001'd;
%let samplewindow='31dec2001'd>=datappl;
proc freq data=indata.accepts noprint;
table dat60 / out=development missing;
format dat60 mmyys7.;
where &samplewindow;
run;
data development;
set development;
if _n_>1 then do;
dat60=intnx('month',dat60,0);
cum_pct+percent;
output;
end;
label datappl='Month of opening';
run;
ods html path="&appl_root" file='2.badratedev.html';
goptions reset=all ftext='arial';
symbol1 i=j v=dot;
axis1 label=('Bad rate');
proc catalog c=gseg kill;
quit;
title 'Bad rate development';
proc gplot data=development;
plot cum_pct*dat60 / name='_2brd' grid;
run;
quit;
ods html close;
Exercise
162
/* BRDEV macro */
%macro brdev(data,out,datevar,targetvar,samplewindow);
proc freq data=&data noprint;
table &datevar / out=&out missing;
format &datevar mmyys7.;
where &samplewindow;
run;
data &out (keep=date cum_pct);
set &out;
if _n_>1 then do;
date=intnx('month',&datevar,0);
cum_pct+percent;
output;
end;
format date mmyys7.;
run;
%mend brdev;
%let samplewindow='30jun2001'd>=datappl>='01apr2001'd;
%brdev(indata.accepts,development,dat60,bad60,&samplewindow)
/* several bad rate development */
%let samplewindow='30jun2001'd>=datappl>='01apr2001'd;
%brdev(indata.accepts,development30,dat30,bad30,&samplewindow)
%brdev(indata.accepts,development60,dat60,bad60,&samplewindow)
%brdev(indata.accepts,development90,dat90,bad90,&samplewindow)
data developmentsev;
set development30 (in=__30) development60 (in=__60) development90;
if __30 then type='30';
else if __60 then type='60';
else type='90';
Run;
Exercise
data anno;
function='label';x=20;y=2;text='Sample window';output;
size=2;function='move';x=10;y=2.5;output;
function='draw';x=30;y=2.5;output;
function='move';x=20;y=3.5;output;
function='draw';x=140;y=3.5;output;
run;
ods html path="&appl_root" file='2.badratedev_several.html';
goptions reset=all ftext='arial';
symbol1 i=j v=dot;
axis1 label=('Bad rate');
proc catalog c=gseg kill;
quit;
title 'Several bad rates development';
proc gplot data=developmentsev annotate=anno;
plot cum_pct*date=type / grid vminor=0 name='_2brds' vaxis=axis1;
format date mmyys5.;
label date='Performance window';
run;
quit;
ods html close;
163
Exercise
/* Roll rate analysis */
ods html path="&appl_root" file='2.roll_rate.html';
proc format;
value $deliq (notsorted)
'0'=' no deliquency'
'3'='30 days'
'6'='60 days'
'9'='90+ days';
run;
proc tabulate data=indata.accepts out=rollrate missing;
class curdeliq worstdeliq;
tables worstdeliq,curdeliq*rowpctn;
format curdeliq $deliq. worstdeliq $deliq.;
title 'Roll rate analysis';
run;
proc gchart data=rollrate;
hbar3d worstdeliq / sumvar=pctn_01 subgroup=curdeliq nostats
clipref autoref raxis=axis1;
axis1 label=none minor=none;
run;
quit;
ods html close;
4. Data Preparation II
164
Development
Stage 3: Development Database Creation
165
Development Sample Specification
Development sample spec. means specifying what we need in
the database we will use for development. We are not going to
take a dump of everything from the CDW or datamart.
Make the development process manageable and efficient:
list of characteristics (or “variables” to be considered for
devp. You don’t want to have the entire DW.)
sample sizes (for each segment and category. No point
regressing on 100k when 3k will suffice.)
parameters from previous section.
Do all this bearing in mind the number of scorecards you want
developed and for which segments.
166
Characteristic Selection
Expected predictive power
Reliability (is this manipulated, or prone to be manipulated? e.g. salary. Check against
historical data; some items cannot be confirmed or are too expensive to confirm. Can it be interpreted? e.g.
occupation/industry type is among the worst cases. Do people usually leave this blank?)
manipulation (non-confirmable)
interpretation (present and future)
missing
Legal issues (can't ask for or get some info? Might get into trouble with some?)
167
How do you select characteristics? Reinforce: some thought needs to be put into the
process of selecting characteristics.
You get together with risk, marketing, and product, and get operations areas such as collections
aboard (WHO knows your bad guys better than anyone else?)
Characteristic Selection
Ease in collection
Do you want to spend time chasing missing info for a credit card?… may be OK
for a mortgage. How easy it is to get this piece of info?
Policy rules
Don't include anything covered by an unchangeable policy rule, e.g. bankruptcy. If you are
going to decline all bankruptcies, there is no need to use bankruptcy in the scorecard.
Derived variables – ratios
Can do a lot of ratios .. But put some business thought into it.
Future direction.
Will this info be collected in the future (e.g. app form redesign)?
Industry direction - not relevant today but will change. can include in card or
collect for future e.g. higher credit lines. Talk to credit bureaus industry trend
and how they affect the scorecard.
168
What are you doing: you’re looking at objectives, company
operations, business knowledge, ground realities etc.
This is not just a stats exercise!!!
Sampling
Development, validation
70:30, 80:20
If sample is small, do 100%, but validate with several 50–80%.
Good, bad, reject
2000 of each (or higher)
Oversampling (oversampling is common when modeling rare
events … it leads to better predictions)
Proportional sample – not recommended for low bad rates.
Take what you got for bads and sample the goods.
Ensure that each group has sufficient numbers for
meaningful analysis.
169
Data Collection and Database
Construction
Random and representative
for each segment applicants (and accounts)
One for unsegmented (to measure lift from segmentation)
Data quirks, changes (preferably documented)
e.g. code for renters changed from R to E .. Stopped collecting some
data item, new data fields, started collecting data recently etc. etc.
Objective: Data collected, as specified.
170
Adjusting for Prior Probabilities
When oversampling
Adjust to actual:
Approval rate
Bad rate
Analysis and reports reflect
reality
Do not need if you only
want to know relationships
or rank ordering.
Rejects
2,950
Bads
874
Goods
6,176
Accepts
7,050
Through-the-door
10,000
171
Adjusting for Oversampling
Separate sampling is standard practice (helps when you just
did ‘bad’ definition)
Prior probabilities must be known
Can adjust before fitting the model or after.
Two ways:
Offset
Sampling weights (frequency variable).
172
Offset Method
logit(p_i) = β0 + β1x1 + ... + βkxk
When oversampling, the logits are shifted by the offset:
logit(p*_i) = ln(ρ1π0 / ρ0π1) + β0 + β1x1 + ... + βkxk
where
ρ1 and ρ0 = proportions of the target classes in the sample,
π1 and π0 = proportions of the target classes in the population.
173
Offset Method
Adjustment post-model (after model development):
p̂_i = p̂*_i ρ0 π1 / [(1 − p̂*_i) ρ1 π0 + p̂*_i ρ0 π1]
where p̂*_i is the unadjusted estimate of the posterior
probability.
174
SAS Programs – Pre-model Adjustment
data develop;
  set develop;
  off=(offset calc);   /* off = ln(rho1*pi0 / rho0*pi1) */
run;
proc logistic data=develop ...;
  model ins=... / offset=off;
run;
proc score ...;
p=1 / (1+exp(-ins));
proc print;
  var p ...;
run;
175
SAS Program – Post-model Adjustment
proc logistic data=develop ...;
run;
proc score ... out=scored ...;
run;
data scored;
  set scored;
  off=(offset calc);
  p=1 / (1+exp(-(ins-off)));
run;
proc print data=scored ...;
  var p ...;
run;
176
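For illustration, both the offset and the post-model correction can be sketched numerically and checked against each other. The 4% population bad rate and the 50/50 oversample below are assumed values, and Python is used instead of SAS only for brevity:

```python
import math

# Assumed priors: 4% bad in the population, 50/50 oversampled dev sample.
pi1, pi0 = 0.04, 0.96    # population proportions of bad / good
rho1, rho0 = 0.50, 0.50  # sample proportions of bad / good

# Offset added to the intercept when fitting on the oversampled data.
offset = math.log((rho1 * pi0) / (rho0 * pi1))

def adjust_posterior(p_star):
    """Post-model correction of an oversampled probability estimate p*_i."""
    return (p_star * rho0 * pi1) / ((1 - p_star) * rho1 * pi0 + p_star * rho0 * pi1)

# A 50% score in the balanced sample maps back to the 4% population rate.
print(round(adjust_posterior(0.5), 6))
```

The two routes are algebraically equivalent: subtracting the offset from the logit gives the same corrected probability as the post-model formula.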
Sampling Weights
Adjusts data to reflect true population
Weights: π1/ρ1 and π0/ρ0
Or set weight of bad=1 and weight of good =
p(good)/p(bad) for population.
For example, p(bad)=4%, 2000 goods,
2000 bads. Sample will show 2000 bads
and 48,000 goods.
Normalization causes less distortion in
p values and standard errors.
Use FREQ variable in EM or calculate
sample weight and use weight=sampwt
in the LOGISTIC procedure.
177
SAS Program
When using the WEIGHT statement, some output is not
correct.
data develop;
  set develop;
  sampwt=(π0/ρ0)*(ins=0) + (π1/ρ1)*(ins=1);
run;
proc logistic data=develop ...;
  weight sampwt;
  model ins=...;
run;
178
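A numeric sketch of the weights, assuming the slide's example of a 4% population bad rate and a 2,000/2,000 oversample (Python is used only for illustration):

```python
# Frequency weights that restore the population good/bad mix.
pi1, pi0 = 0.04, 0.96            # assumed population proportions of bad / good
n_bad, n_good = 2000, 2000       # oversampled development counts
rho1 = n_bad / (n_bad + n_good)  # sample proportion of bads
rho0 = n_good / (n_bad + n_good) # sample proportion of goods

# Weights pi/rho per class, as on the slide.
w_bad, w_good = pi1 / rho1, pi0 / rho0

# Normalised alternative: weight(bad)=1, weight(good)=p(good)/p(bad).
# The 2,000 goods then stand in for 48,000, matching the slide's example.
w_good_norm = pi0 / pi1
print(w_bad, w_good, round(w_good_norm * n_good))
```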
What Is the Difference?
The parameter estimates will be different.
When linear-logistic model is correctly specified, offset is
better.
When logistic model is an approximation of some non-linear
model, weights are better.
For scorecards, weighting is better since it corrects the
parameter estimates used to derive scores (prior probabilities
only affect the predicted probabilities).
179
Development
Stage 4: Scorecard Development
180
Objective
Understand a methodology for developing and assessing risk
scorecards.
Grouped attributes
Logistic regression
Reject inference
Scaled points.
181
Process Flow – Application Scorecard
Explore Data
Data Cleansing
Initial Characteristic
Analysis (Known
Good Bad)
Preliminary
Scorecard (KGB)
Reject Inference
Initial Characteristic
Analysis (All Good
Bad)
Final
Scorecard (AGB)
• Scaling
• Assessment
Validate
182
Process Flow – Behavior Scorecard
Explore Data
Data Cleansing
Initial Characteristic
Analysis (Known
Good Bad)
Final
Scorecard
• Scaling
• Assessment
Validate
183
Before you start …
Explore the data, visualize (Insight in SAS EM)
Distributions
mean, max/min, range, missing
Compare with overall portfolio distributions
Data integrity (any garbage, outliers)
Ensure data meets the data specifications done earlier.
Check that ‘0’s mean zero, not missing values.
Population stability check:
Month by month table of distribution for each predictor
(e.g. 200701 men 55%, women 45%, 200702 men 57%,
women 43%)
184
Missing Values and Outliers
Missing (ALL financial data has missing and garbage values)
Complete Case Analysis: exclude everything with missing data. In credit scoring,
you'll end up with nothing.
Exclude characteristics or records with significant missing values.
Group 'missing' as a distinct attribute; the weight of evidence of missing will tell you
what missing contains. If it is close to neutral, good, since the data are missing at random.
Recommended: recognize that missing data has information value and may
not be randomly missing. Find the value and use it. Also, including 'missing'
points in the scorecard will take care of people who leave fields blank.
Impute missing values: don't use the mean or most likely value; a model based on a
decision tree may be better.
Outliers (and mis-keys)
Exclude/replace records.
185
Missing Values
Missing data is not usually random
Missing data can be related to the target
New at job may leave yrs at empl blank
Low income or commercial customers leave income blank
Do bad customers leave certain fields blank?
Including and grouping missing data can answer this
question.
186
Initial Characteristic Analysis
Analyze individual characteristics
Identify strong characteristics
Best differentiators between ‘good’ and ‘bad’
Screening
Select characteristics for regression (variable selection).
187
Initial Characteristic Analysis
Start by performing initial grouping for each characteristic
and rank order Information Value (PROC DMSPLIT or SPLIT,
or EM node)
Alternate: rank order characteristics by
Chi Square or other method
Fine tune grouping for stronger characteristics
May want to perform other analysis prior to this (for example,
use PC to identify collinear characteristics)
Some people use principal components (PROC VARCLUS) to
identify which characteristics they need from each cluster.
And then concentrate on the best out of each.
188
Criteria for Variable Selection
Predictive power of attribute:
Weight of Evidence
Range and trend of WOE across attributes
Predictive power of characteristic:
Information Value, Gini index(coefficient)
Operational/business considerations.
189
Weight of Evidence
Age       Count  Distr Count  Goods  Distr Good  Bads  Distr Bad  Bad rate    Weight
Missing      50       3.00%      43      2.40%      8     4.10%       16%     -55.497
18-22       200      10.00%     152      8.40%     48    24.90%       24%    -108.405
23-26       300      15.00%     246     13.60%     54    28.00%       18%     -72.039
27-29       450      23.00%     405     22.40%     45    23.30%       10%      -3.951
30-35       500      25.00%     475     26.30%     25    13.00%        5%      70.771
35-44       350      18.00%     349     19.30%     11     5.70%        3%     122.044
44+         150       8.00%     147      8.10%      3     1.60%        2%     165.509
Total     2,000               1,807             193              9.65%
Information Value = 0.662
Weight = ln(Distr Good / Distr Bad) x 100
190
Weight of Evidence
Measures strength of each (grouped) attribute in separating
goods and bads
(Distr Good / Distr Bad) = odds of being good
Negative weight: more bads than goods
Logical trend
For age 23-26:
WOE = ln (0.136 / 0.28) = -0.722 (x 100 = -72.2)
191
Information Value (Strength)
Age       Count  Distr Count  Goods  Distr Good  Bads  Distr Bad  Bad rate    Weight
Missing      50       3.00%      43      2.40%      8     4.10%       16%     -55.497
18-22       200      10.00%     152      8.40%     48    24.90%       24%    -108.405
23-26       300      15.00%     246     13.60%     54    28.00%       18%     -72.039
27-29       450      23.00%     405     22.40%     45    23.30%       10%      -3.951
30-35       500      25.00%     475     26.30%     25    13.00%        5%      70.771
35-44       350      18.00%     349     19.30%     11     5.70%        3%     122.044
44+         150       8.00%     147      8.10%      3     1.60%        2%     165.509
Total     2,000               1,807             193              9.65%
Information Value = 0.662
IV contribution per attribute = (Distr Good - Distr Bad) x Weight, with weights in decimal form
Kullback, S., Information Theory and Statistics (1959)
192
Information Value
IV = Σ [(Distr Good - Distr Bad) x ln(Distr Good / Distr Bad)]
when the figures are used in decimal format
(for example, 0.136).
Rule of thumb:
< 0.02: unpredictive
0.02 – 0.1: weak
0.1 – 0.3: medium
0.3 +: strong
Too strong? (IV > 0.5): use it in a controlled way (add such characteristics at the
end of the regression to see whether they add any incremental
value)
193
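A minimal sketch of the IV computation plus the rule-of-thumb labelling, reusing the hypothetical age counts (note that summed over the table's own rows, the IV works out to about 0.66; Python is used only for illustration):

```python
import math

# Grouped good/bad counts for the hypothetical "age" example.
groups = {
    "Missing": (43, 8), "18-22": (152, 48), "23-26": (246, 54),
    "27-29": (405, 45), "30-35": (475, 25), "35-44": (349, 11), "44+": (147, 3),
}
total_good = sum(g for g, _ in groups.values())
total_bad = sum(b for _, b in groups.values())

# IV = sum of (Distr Good - Distr Bad) * ln(Distr Good / Distr Bad).
iv = sum(
    (g / total_good - b / total_bad) * math.log((g / total_good) / (b / total_bad))
    for g, b in groups.values()
)

def iv_label(value):
    # Rule-of-thumb bands from the slide.
    if value < 0.02:
        return "unpredictive"
    if value < 0.1:
        return "weak"
    if value < 0.3:
        return "medium"
    return "strong"

print(round(iv, 3), iv_label(iv))
```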
Grouping
Groups with similar WOE are put together
For continuous variables, groups are created so as to
maximize difference from one group to next – and
maintain logical trend for WOE
Why Group?
Easier way to deal with outliers with interval variables, and
for rare classes
Format of the scorecard
Easy to understand relationships
Model non-linear dependencies with linear models
Control the process
194
Grouping of the demographic scorecard variable "age". The left picture shows the dependence of bad
rate (smoothed using a normal probability density function) on the variable. The right picture shows
the cumulative distribution function. Vertical lines represent the borders between categories,
horizontal red lines in the left picture represent the mean bad rate in the categories, and
horizontal blue lines in the right picture represent the relative distribution of observations in
the categories. 195
Grouping
Logical Trend
Predictive Strength
-150
-100
-50
0
50
100
150
200
Missing 18-22 23-26 27-29 30-35 35-44 44 +
Age
Weight
196
Logical Trend
Final weightings make sense.
Enables buy-in from risk managers.
Confirms business experience
young people are higher risk
higher debt service means higher risk
Reduces overfitting if done right – model overall trend, not
quirks. Remember how long the scorecard has to last. This is
not going to be used for the next campaign and then
discarded.
Linear relationship not always true, but need trend to
confirm, and back up with business experience. E.g. revolving
open burden shows a ‘banana curve’ everywhere and is now
accepted as that. People don’t try to make it straight. 197
Logical Trend
Predictive Strength
-80
-60
-40
-20
0
20
40
60
80
100
Missing 18-22 23-26 27-29 30-35 35-44 44 +
Age
Weight
198
Obviously not a logical trend!!!
Logical Trend
Predictive Strength
-150
-100
-50
0
50
100
150
200
Missing 18-22 23-26 27-29 30-35 35-44 44 +
Age
Weight
199
Which line shows a logical
trend?
Both are logical. What's the
difference?
The blue line shows good
differentiation.
The red line is flat, so this
characteristic is likely very
weak, which will be reflected in
the IV.
200
Stability check
Check the stability of grouping throughout the whole development
time window:
Business Factors
Nominal values
group based on similar weight (for example, postal code,
occupation)
investigate splits on urban/rural, regional
Breaks concurrent with policy rules
Sanity check.
201
List of information values of variables (predictors)
No   Characteristic                                             IV Rank   Information Value
1 Max delinq L9M 1 0.176
2 Months since delinquent 2 0.176
3 Active contract (Y/N) 3 0.045
4 Average Delinquency L9M 4 0.087
5 Months since >10 dpd 5 0.144
6 Max delinq L3M 6 0.117
7 Average Delinquency L3M 7 0.108
8 Age of oldest contract 8 0.013
9 Number of months on collections as % total time on book 9 0.132
10 Months since >20 dpd 10 0.091
11 Months since >30 dpd 11 0.054
12 Num rejected applications L9M 12 0.033
13 Times 30+ dpd L9M 13 0.042
14 Total Payment L3M 14 0.018
15 Months since >40 dpd 15 0.030
16 Current balance as % of highest ever balance 16 0.048
17 Times 30+ dpd L3M 17 0.024
18 Payment Method 18 0.001 202
Variable Selection
203
Exercise – profile
/* 2b. Profiles */
%let input=income;
%let groups=yes;
%let n_groups=4;
/* grouping 1 - kvantily */
proc rank data=indata.accepts (keep=&input) groups=&n_groups
out=bins;
var &input;
ranks bin;
run;
proc summary data=bins nway missing;
class bin;
output out=bins (drop=_type_) min(&input)=start max(&input)=end;
run;
data bins;
set bins;
label=compress(put(start,best.))||' - '||compress(put(end,best.));
fmtname='__bin';
type='N';
run;
proc format cntlin=bins;
run;
%macro profile(input,groups);
/* Profile of &input according to BAD60 */
proc summary data=indata.accepts;
class &input;
output out=__bins (drop=_type_ rename=(_freq_=__n))
sum(bad60)=__n1;
%if %upcase(&groups)=YES %then %do;
format &input __bin.;
%end;
run;
data __bins;
set __bins end=__finish;
if _n_=1 then do;
__all_n=__n;
__all_n1=__n1;
__all_n0=__n-__n1;
retain __all_n:;
end;
else do;
__p=__n/__all_n;
__n0=__n-__n1;
__p1=__n1/__all_n1;
__p0=__n0/__all_n0;
__r1=__n1/__n;
__r0=__n0/__n;
__woe=log((__p0)/(__p1))*100;
__all_iv+(__p0-__p1)*__woe/100;
output;
end;
if __finish then do;
call symput('groups',compress(put(_n_-1,best.)));
call symput('iv',compress(put(__all_iv,8.4)));
call symput('br',compress(put(__all_n1/__all_n,best.)));
end;
attrib
__n label='N'
__p label='%' format=percent8.1
__n1 label="N of Bad"
__n0 label="N of Good"
__p1 label="% of Bad" format=percent8.1
__p0 label="% of Good" format=percent8.1
__r1 label="Bad rate" format=percent8.1
__r0 label="Good rate" format=percent8.1
__woe label='WOE' format=8.2
&input label="Group of &input"
;
drop __all:;
Run;
...
204
data __chart (keep=&input __sub __n __p __r);
set __bins (keep=&input __n0 __p0 __r0 __n1 __p1 __r1);
length __sub $4;
__sub="Good";
__n=__n0;
__p=__p0;
__r=__r0;
output;
__sub="Bad";
__n=__n1;
__p=__p1;
__r=__r1;
output;
attrib
__n label='N' format=8.0
__p label='%' format=percent8.1
__r label='Rate' format=percent8.1
__sub label='Target'
;
run;
proc datasets nolist;
delete gseg / memtype=catalog;
quit;
ods listing close;
goptions reset=all ftext='arial' htext=1.5 ftitle='arial' htitle=2;
proc gchart data=__chart;
axis1 style=0;
axis2 minor=none order=(0 to 1 by .25) label=none;
axis3 minor=none label=none;
axis4 minor=(n=4) label=none;
where __sub="Bad";
hbar &input / discrete sumvar=__r noframe nostats
maxis=axis1 raxis=axis3 autoref cref=graya0 clipref
name="__1";
title "Bad rates";
run;
where;
hbar &input / discrete subgroup=__sub sumvar=__n noframe nostats
maxis=axis1 raxis=axis3 autoref cref=graya0 clipref
name="__2";
title "Bad / Good frequencies";
run;
Quit;
proc gchart data=__bins;
hbar &input / discrete sumvar=__woe noframe nostats
maxis=axis1 raxis=axis4 autoref cref=graya0 clipref
name="__3";
title "Weight of evidence";
run;
hbar &input / discrete sumvar=__p1 noframe nostats
maxis=axis1 raxis=axis4 autoref cref=graya0 clipref
name="__4";
title "Bad distribution";
run;
quit;
ods html path="&appl_root" file="5.profile.html" style=statdoc;
proc report data=__bins nofs style(summary)=[htmlclass="Header"];
columns ("Attributes of &input" &input) ('Total' __n __p)
("Good" __n0 __p0) ("Bad" __n1 __p1) ('Measures' __r1 __woe);
define &input / group;
compute after;
__r1.sum=&br;
__woe.sum=.;
endcomp;
rbreak after / summarize;
title "Bad / Good by &input";
footnote "IV=&iv (<0.02 unpredictive, <0.1 weak, <0.3 medium, <0.5 strong, >0.5 over)";
run;
goptions device=gif;
proc greplay nofs;
footnote;
igout gseg;
tc sashelp.templt;
template l2r2;
treplay 1:__1 2:__2 3:__3 4:__4 name="5_profil";
run;
quit;
title;
footnote;
ods html close;
ods listing;
%mend profile;
%profile(&input,&groups)
205
Exercise
/*profile multiple characteristics at once*/
%model_profilevar
(
data=data.accepts,
interval=age income idratio ,
binary=sex phone client,
ordinal=age_grp income_grp region,
groups=5,
target=bad30,
rep_out=&appl_root
)
206
Exercise
207
Exercise
208
Exercise
209
Exercise
210
Exercise
211
Exercise
5. Introduction to cluster analysis (CA).
Hierarchical CA
212
213
Introduction
Cluster analysis is a method that uses the information contained in
multivariate observations to sort a base set of objects into several
relatively homogeneous clusters. We consider a data matrix of type
n x p, where n is the number of objects and p is the number of variables.
We examine various partitions of the n objects into g clusters and look
for the partition that is most advantageous from a given point of view.
The goal is a state in which objects within a cluster are as similar as
possible, while objects from different clusters are as dissimilar as
possible.
Unsupervised Learning
Cluster analysis methods belong to the so-called "unsupervised
learning" methods.
“Learning without a priori knowledge about the
classification of samples; learning without a teacher.”
Kohonen (1995), “Self-Organizing Maps”
214
Cluster Profiling
Cluster profiling can be defined as the derivation of a class
label from a proposed cluster solution.
The objective is to identify the features, or combination of
features, that uniquely describe each cluster.
215
We distinguish several basic clustering approaches:
Hierarchical clustering,
Clustering with a number of clusters not known in advance - with possible
cluster overlap (overlapping clusters),
Clustering into a predefined number of clusters (partitive/partitioning
methods),
Fuzzy clustering - fuzzy clusters are defined by the degree of
membership of objects in the given clusters.
Types of clustering
216
Hierarchical clustering algorithms:
Agglomerative
Divisive
Partitioning algorithms:
K-means
K-medoids
Probabilistic
Density-based
Grid-based algorithms
Constraint-based clustering
Evolutionary algorithms
Scalable clustering algorithms
Classification of clustering algorithms
217
Hierarchical Clustering
Agglomerative Divisive
218
Problems with Hierarchical Clustering
[Figure: dendrograms illustrating typical errors in hierarchical cluster solutions]
219
Partitive Clustering
[Figure: initial state with reference vectors (seeds) marked X among the
observations, and the final state after observations are assigned to the
nearest seed]
• The goal of partitive clustering is to minimize or maximize some
specified criterion.
220
Problems with Partitive Clustering
Many partitive clustering methods
a. make you guess the number of clusters present,
b. make assumptions about the shape of the clusters, usually that
they are (hyper)spherical, and
c. are influenced by seed location, by outliers, even by the order
the observations are read in.
It is impossible to determine the optimal grouping, due to
the combinatorial explosion of potential solutions.
221
Problems with Partitive Clustering
222
The number of possible partitions of n objects into g groups
is given by:

$$N(n,g) = \frac{1}{g!}\sum_{k=0}^{g}(-1)^{g-k}\binom{g}{k}k^{n}$$

For example, the number of partitions of 50 observations
into 4 clusters, N(50,4), is equal to 5.3 x 10^28. N(100,4)
generates 6.7 x 10^58 partitions. Complete enumeration of
every possible partition, therefore, is generally impossible.
Heuristic Search
1. Generate an initial partitioning (based on the seeds)
of the observations into clusters.
2. Calculate the change in error produced by moving each
observation from its own cluster to another.
3. Make the move that produces the greatest reduction.
4. Repeat steps 2 and 3 until no move reduces error.
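The four steps above can be sketched in plain Python. This is an illustrative sketch only (function and variable names are my own, and it is not part of the course's SAS material):

```python
def heuristic_partition(points, seeds):
    """Exchange heuristic: start from a seed-based partition, then move
    single observations between clusters as long as a move reduces the
    total within-cluster squared error."""
    def err(assign):
        # total squared distance of each point to its cluster centroid
        total = 0.0
        for g in set(assign):
            members = [p for p, a in zip(points, assign) if a == g]
            cx = sum(p[0] for p in members) / len(members)
            cy = sum(p[1] for p in members) / len(members)
            total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
        return total

    # 1. initial partition: assign each observation to its nearest seed
    assign = [min(range(len(seeds)),
                  key=lambda s: (p[0] - seeds[s][0]) ** 2
                              + (p[1] - seeds[s][1]) ** 2)
              for p in points]
    improved = True
    while improved:                          # 4. repeat until no move helps
        improved = False
        for i in range(len(points)):         # 2. try moving each observation
            current = assign[i]
            if sum(1 for a in assign if a == current) == 1:
                continue                     # do not empty a cluster
            base, best = err(assign), current
            for g in range(len(seeds)):
                if g == current:
                    continue
                trial = assign[:]
                trial[i] = g
                if err(trial) < base:        # 3. keep the best reduction
                    base, best = err(trial), g
            if best != current:
                assign[i] = best
                improved = True
    return assign
```

With two well-separated blobs and one seed in each, the heuristic simply confirms the seed-based partition.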
223
224
Hierarchical cluster analysis
You need to choose:
how to measure the distance/(dis)similarity between objects
(Euclidean, ...)
the type of data must be taken into account (interval, nominal, ...)
the commensurability of the data also plays a major role -> standardization
(z-score, ...)
how to measure the distance/(dis)similarity between clusters
(Ward's, ...)
how to determine the final partition of objects into clusters
Example of agglomerative hierarchical
clustering
225
obj X1 X2 X3 X4
A 100 80 70 60
B 80 60 50 40
C 80 70 40 50
D 40 20 20 10
E 50 10 20 10
We consider 5 objects A, B, C, D, and E described by four variables X1-X4.
We perform no standardization.
We measure the distance between objects using the Euclidean distance.
We measure the distances between clusters using the average linkage
method.
Data: Distance matrix:
A B C D E
A 0 0 0 0 0
B 40,00 0 0 0 0
C 38,73 17,32 0 0 0
D 110,45 70,71 78,10 0 0
E 111,36 72,11 80,62 14,14 0
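For illustration, the distance matrix can be recomputed in plain Python (a sketch for checking the numbers, not part of the course's SAS code):

```python
import math

# the five objects and four variables from the slide
data = {
    "A": (100, 80, 70, 60),
    "B": (80, 60, 50, 40),
    "C": (80, 70, 40, 50),
    "D": (40, 20, 20, 10),
    "E": (50, 10, 20, 10),
}

def euclid(x, y):
    """Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# full pairwise distance table, e.g. dist[("D", "E")] is about 14.14
dist = {(p, q): euclid(data[p], data[q]) for p in data for q in data}
```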
Example of agglomerative hierarchical
clustering
226
Step 1:
We look for the smallest value in the distance matrix. In our case it is 14.14 (the
distance between D and E).
We merge objects D and E into cluster D', then shrink and recompute the distance matrix.
We use the average linkage method, so:

$$D_{KL} = \frac{1}{n_K n_L}\sum_{i \in C_K}\sum_{j \in C_L} d(x_i, x_j)$$

$$D_{A,D'} = \tfrac{1}{1\cdot 2}(110.45 + 111.36) = 110.90$$
$$D_{D',B} = \tfrac{1}{2}(70.71 + 72.11) = 71.41$$
$$D_{D',C} = \tfrac{1}{2}(78.10 + 80.62) = 79.36$$
A B C D E
A 0 0 0 0 0
B 40,00 0 0 0 0
C 38,73 17,32 0 0 0
D 110,45 70,71 78,10 0 0
E 111,36 72,11 80,62 14,14 0
A B C D'
A 0
B 40,00 0
C 38,73 17,3 0
D' 110,90 71,41 79,36 0
Example of agglomerative hierarchical
clustering
227
Step 2:
We look for the smallest value in the reduced distance matrix. In our case it is
17.32 (the distance between B and C).
We merge objects B and C into cluster B', then shrink and recompute the distance matrix.

$$D_{A,B'} = \tfrac{1}{2}(40.00 + 38.73) = 39.36$$
$$D_{D',B'} = \tfrac{1}{2\cdot 2}(70.71 + 78.10 + 72.11 + 80.62) = 75.39$$
A B C D'
A 0
B 40,00 0
C 38,73 17,3 0
D' 110,90 71,41 79,36 0
A B' D'
A 0
B' 39,36 0
D' 110,90 75,39 0
Example of agglomerative hierarchical
clustering
228
Step 3:
We again look for the smallest value in the reduced distance matrix. In our case
it is 39.36 (the distance between A and B').
We merge A and B' into cluster A', then shrink and recompute the distance matrix.

$$D_{A',D'} = \tfrac{1}{3\cdot 2}(110.45 + 111.36 + 70.71 + 72.11 + 78.10 + 80.62) = 87.23$$

Equivalently, as a weighted average of the previous cluster distances:

$$D_{A',D'} = \tfrac{1}{3}(1\cdot 110.90 + 2\cdot 75.39) = 87.23$$
A B' D'
A 0
B' 39,36 0
D' 110,90 75,39 0
A' D'
A' 0
D' 87,23 0
Caution! Two clusters of unequal size are being merged, so a simple average of the averages cannot be used!
Example of agglomerative hierarchical
clustering
229
proc distance data=aaa method=euclid out=dist;
var interval(X1 X2 X3 X4);
id obj;
run;
proc cluster data=dist method=ave outtree=tree nonorm;
id obj;
run;
proc tree data=tree horizontal;
id obj;
run;
Example of agglomerative hierarchical
clustering
230
[Dendrogram: fusion levels $D_{DE} = 14.14$, $D_{BC} = 17.32$, $D_{AB'} = 39.36$, $D_{A'D'} = 87.23$]
Example of agglomerative hierarchical
clustering
231
Gaps between successive fusion levels: $87.2 - 39.4 = 47.8$ and $39.4 - 17.3 = 22.1$.
We have thus determined two clusters, A' = {A, B, C} and D' = {D, E}.
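The whole merge sequence can be replayed with a short Python sketch (names are my own; this reproduces the fusion levels 14.14, 17.32, 39.36, and 87.23 from the worked example):

```python
import math

data = {"A": (100, 80, 70, 60), "B": (80, 60, 50, 40),
        "C": (80, 70, 40, 50), "D": (40, 20, 20, 10),
        "E": (50, 10, 20, 10)}

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# clusters as frozensets of object names; average linkage = mean of all
# pairwise object distances between two clusters
clusters = [frozenset([k]) for k in data]
merges = []
while len(clusters) > 1:
    pairs = [(sum(euclid(data[i], data[j]) for i in p for j in q)
              / (len(p) * len(q)), p, q)
             for idx, p in enumerate(clusters) for q in clusters[idx + 1:]]
    d, p, q = min(pairs, key=lambda t: t[0])   # closest pair of clusters
    merges.append((round(d, 2), sorted(p | q)))
    clusters = [c for c in clusters if c not in (p, q)] + [p | q]

# merges now records each fusion level and the cluster it produced
```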
What Is Similarity?
To illustrate the difficulties involved in judging similarity,
consider your answer to the following question:
Which is more similar to a duck,
a crow or a penguin?
The answer to this question largely depends on how you
choose to define similarity.
The choice of (dis)similarity measure depends on the variable type
(nominal, ordinal, ratio, interval, binary).
232
Principles of a Good Similarity Metric
The following principles have been suggested as the
foundation of a good similarity metric:
1. symmetry: d(x,y) = d(y,x).
2. non-identical distinguishability: if d(x,y) ≠ 0 then x ≠ y.
3. identical non-distinguishability: if d(x,y) = 0 then x = y.
Most good metrics are also consistent with the triangle
inequality: d(x,y) ≤ d(x,z) + d(y,z).
233
The DISTANCE Procedure
General form of the DISTANCE procedure:
A distance method must be specified (no default), and all
input variables are identified by level.
PROC DISTANCE DATA=SAS-data-set
METHOD=similarity-metric <options>;
VAR level (variables < / option-list >);
RUN;
234
More at:
http://support.sas.com/documentation/cdl/en/statugdistance/61780/PDF/default/statugdistance.pdf
The DISTANCE Procedure
235
Distance measurement methods in SAS:
Method | Range | Type | Accepts variables
GOWER | 0 to 1 | sim | all
DGOWER | 0 to 1 | dis | all
EUCLID | >= 0 | dis | ratio, interval, ordinal
SQEUCLID | | dis | ratio, interval, ordinal
SIZE | | dis | ratio, interval, ordinal
SHAPE | | dis | ratio, interval, ordinal
COV | | sim | ratio, interval, ordinal
CORR | -1 to 1 | sim | ratio, interval, ordinal
DCORR | 0 to 2 | dis | ratio, interval, ordinal
SQCORR | 0 to 1 | sim | ratio, interval, ordinal
DSQCORR | 0 to 1 | dis | ratio, interval, ordinal
L(p) | | dis | ratio, interval, ordinal
CITYBLOCK | | dis | ratio, interval, ordinal
CHEBYCHEV | | dis | ratio, interval, ordinal
POWER(p,r) | | dis | ratio, interval, ordinal
SIMRATIO | 0 to 1 | sim | ratio
DISRATIO | 0 to 1 | dis | ratio
NONMETRIC | 0 to 1 | dis | ratio
CANBERRA | 0 to 1 | dis | ratio
COSINE | 0 to 1 | sim | ratio
DOT | | sim | ratio
OVERLAP | | sim | ratio
DOVERLAP | | dis | ratio
CHISQ | | dis | ratio
CHI | | dis | ratio
PHISQ | | dis | ratio
PHI | | dis | ratio
HAMMING | 0 to n | dis | nominal
MATCH | 0 to 1 | sim | nominal
DMATCH | 0 to 1 | dis | nominal
DSQMATCH | 0 to 1 | dis | nominal
HAMANN | -1 to 1 | sim | nominal
RT | 0 to 1 | sim | nominal
SS1 | 0 to 1 | sim | nominal
SS3 | 0 to 1 | sim | nominal
DICE | 0 to 1 | sim | asymmetric nominal
RR | 0 to 1 | sim | asymmetric nominal
BLWNM | 0 to 1 | dis | asymmetric nominal
K1 | | sim | asymmetric nominal
JACCARD | 0 to 1 | sim | asymmetric nominal, ratio
DJACCARD | 0 to 1 | dis | asymmetric nominal, ratio
Euclidean Distance
Euclidean distance gives the linear distance between any
two points in n-dimensional space.
It is a generalization of the Pythagorean theorem.
$$D_E(x,y) = \sqrt{\sum_{i=1}^{k}(x_i - y_i)^2}$$

[Figure: the Euclidean distance from the origin (0, 0) to the point
(x1, x2) is the hypotenuse $h = \sqrt{x_1^2 + x_2^2}$]
236
City Block (Manhattan) Distance
The distance between two points is measured along the
sides of a right-angled triangle.
It is the distance that you would travel if you had to walk
along the streets of a city laid out in a rectangular grid.

$$D_M(x,y) = \sum_{i=1}^{d}|x_i - y_i|$$

[Figure: the city-block path between (x1, x2) and (y1, y2)]
237
Hamming Distance
1 2 3 4 5 … 17
Gene A 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1
Gene B 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1
DH = 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 = 5
Gene expression levels under 17 conditions
(low=0, high=1)
$$D_H(x,y) = \sum_{i=1}^{d}|x_i - y_i|$$
238
Power Distance

$$D_P(x,y) = \left(\sum_{i=1}^{d}|x_i - y_i|^{q}\right)^{1/r}$$

239
Minkowski metric (r = q)
Hamming/Manhattan distance (r = q = 1)
Euclidean distance (r = q = 2)
Chebyshev distance (r = q -> infinity)
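A minimal Python sketch of the power distance with r = q and its special cases (illustrative only, not the SAS implementation):

```python
def minkowski(x, y, q):
    """Power/Minkowski distance with r = q: (sum |x_i - y_i|^q)^(1/q).
    q = 1 gives the Manhattan (Hamming) distance, q = 2 the Euclidean
    distance; as q grows it approaches the Chebyshev distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def chebyshev(x, y):
    """Limiting case q -> infinity: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))
```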
Correlation
[Figure: three scatterplots illustrating correlation as a similarity
measure - similar (+1), dissimilar (-1), and no similarity (0)]
240
Density-Based Similarity
Density-based methods define similarity as the distance
between derived density “bubbles” (hyper-spheres).
[Figure: two density "bubbles" (density estimate 1 for cluster 1,
density estimate 2 for cluster 2) with the similarity measured between
them]
241
Gower’s Metric
Gower’s is the only similarity metric that accepts any measurement level.
$$s_{Gower}(x,y) = \frac{\sum_{j=1}^{v} w_j\,\delta_j(x,y)\,d_j(x,y)}{\sum_{j=1}^{v} w_j\,\delta_j(x,y)}$$

242
Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 623-637.
For a nominal, ordinal, interval, or ratio variable: $\delta_j(x,y) = 1$.
For an asymmetric nominal variable: $\delta_j(x,y) = 0$ if both $x_j$ and $y_j$ are absent, $\delta_j(x,y) = 1$ if either $x_j$ or $y_j$ is present.
For a nominal or asymmetric nominal variable: $d_j(x,y) = 1$ if $x_j = y_j$, $d_j(x,y) = 0$ if $x_j \ne y_j$.
For an ordinal, interval, or ratio variable (range-scaled): $d_j(x,y) = 1 - |x_j - y_j|$.
$w_j$ ... weight of the j-th variable
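A hedged Python sketch of Gower's coefficient for mixed data (the `kinds`/`ranges` encoding is my own illustration, not the PROC DISTANCE interface):

```python
def gower_similarity(x, y, kinds, ranges=None, weights=None):
    """Gower similarity for mixed variables.
    kinds[j] is "nominal", "numeric", or "asym" (asymmetric nominal 0/1);
    numeric variables are divided by their range ranges[j]; pairs of
    asymmetric nominal values that are both absent are skipped
    (delta_j = 0)."""
    num, den = 0.0, 0.0
    for j, kind in enumerate(kinds):
        w = 1.0 if weights is None else weights[j]
        if kind == "asym" and x[j] == 0 and y[j] == 0:
            continue                      # delta_j = 0: variable ignored
        if kind == "numeric":
            s = 1.0 - abs(x[j] - y[j]) / ranges[j]
        else:                             # nominal or asym: match/mismatch
            s = 1.0 if x[j] == y[j] else 0.0
        num += w * s
        den += w
    return num / den
```

For example, two records that agree on a nominal variable but lie a full range apart on a numeric one get similarity 0.5.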
Other similarity measures
243
• Jaccard coefficient

$$D_J = \frac{\sum_{j=1}^{v} x_j y_j}{\sum_{j=1}^{v} x_j^2 + \sum_{j=1}^{v} y_j^2 - \sum_{j=1}^{v} x_j y_j}$$

• Dice coefficient

$$D_D = \frac{2\sum_{j=1}^{v} x_j y_j}{\sum_{j=1}^{v} x_j^2 + \sum_{j=1}^{v} y_j^2}$$

• Czekanowski coefficient

$$D_C = \frac{2\sum_{j=1}^{v} \min(x_j, y_j)}{\sum_{j=1}^{v}(x_j + y_j)}$$
Similarity measures for binary data
244
Contingency table of the categories of objects x and w:

          w = 1   w = 0
x = 1       a       b
x = 0       c       d

• matching coefficient: (a + d) / (a + b + c + d)
• Jaccard coefficient: a / (a + b + c)
• Dice (Czekanowski) coefficient: 2a / (2a + b + c)
• Yule coefficient: (ad - bc) / (ad + bc)
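The four coefficients above follow directly from the 2x2 table; a small Python sketch (illustrative names):

```python
def binary_coefficients(a, b, c, d):
    """Similarity coefficients from the 2x2 agreement table:
    a = both objects 1, b = x only, c = w only, d = both 0."""
    n = a + b + c + d
    return {
        "matching": (a + d) / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "yule": (a * d - b * c) / (a * d + b * c),
    }
```

Note that the Jaccard and Dice coefficients ignore d, the joint absences, which is exactly why they suit asymmetric binary variables.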
(Dis)similarity measures for binary data
245
• Goodman-Kruskal lambda

$$\lambda = \frac{\max(a,b) + \max(c,d) + \max(a,c) + \max(b,d) - \max(a+c,\,b+d) - \max(a+b,\,c+d)}{2(a+b+c+d) - \max(a+c,\,b+d) - \max(a+b,\,c+d)}$$

• binary Lance-Williams dissimilarity: (b + c) / (2a + b + c)
• Euclidean distance: $\sqrt{b + c}$
• binary squared Euclidean distance (= Hamming distance): b + c
Standardization/normalization
Before computing the distances themselves, it is highly advisable to
standardize (normalize) the variables.
The reason is to unify the scales and thereby balance the influence of
the individual variables.
Typically:

$$x_{standard.} = \frac{x - location}{scale}$$

In general:

result = add + multiply * (original - location) / scale

246
result = final output value
add = constant to add (ADD= option)
multiply = constant to multiply by (MULT= option)
original = original input value
location = location measure
scale = scale measure
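The general formula can be sketched in Python for two common (location, scale) choices (an illustrative sketch of the idea, not PROC STDIZE itself; STD is assumed to use the sample standard deviation):

```python
def standardize(values, add=0.0, mult=1.0, method="STD"):
    """result = add + mult * (original - location) / scale, with
    (location, scale) chosen by method: STD -> (mean, std deviation),
    RANGE -> (minimum, range)."""
    n = len(values)
    if method == "STD":
        loc = sum(values) / n
        scale = (sum((v - loc) ** 2 for v in values) / (n - 1)) ** 0.5
    elif method == "RANGE":
        loc = min(values)
        scale = max(values) - min(values)
    else:
        raise ValueError("unsupported method: " + method)
    return [add + mult * (v - loc) / scale for v in values]
```

RANGE maps the data into [0, 1]; STD produces the classical z-score.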
The STDIZE Procedure
General form of the STDIZE procedure:
PROC STDIZE DATA=SAS-data-set
METHOD=method ;
VAR variables;
RUN;
247
Standardization Methods
METHOD LOCATION SCALE
MEAN mean 1
MEDIAN median 1
SUM 0 sum
EUCLEN 0 Euclidean Length
USTD 0 standard deviation about origin
STD mean standard deviation
RANGE minimum range
MIDRANGE midrange range/2
MAXABS 0 maximum absolute value
IQR median interquartile range
MAD median median absolute deviation from median
ABW(c) biweight 1-step M-estimate biweight A-estimate
AHUBER(c) Huber 1-step M-estimate Huber A-estimate
AWAVE(c) Wave 1-step M-estimate Wave A-estimate
AGK(p) mean AGK estimate (ACECLUS)
SPACING(p) mid minimum-spacing minimum spacing
L(p) L(p) L(p) (Minkowski distances)
IN(ds) read from data set read settings from data set "ds"
248
(METHOD=STD corresponds to the classical z-score.)
The Problem with Z-Score Standardization
Standardization using the reciprocal of the variance can
actually dilute the differences between groups!
Source: Everitt et al. (2001)
Before After
249
Cluster Preprocessing
Before ACECLUS After ACECLUS
250
A solution to this problem can be the ACECLUS procedure
(approximate covariance estimation for clustering).
More at: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_aceclus_sect002.htm
The ACECLUS Procedure
General form of the ACECLUS procedure:
PROC ACECLUS DATA=SAS-data-set
;
VAR variables;
RUN;
251
Distance between clusters
Besides deciding how to measure the distances between objects within
clusters, we must define how to measure the distances between the
clusters themselves.
The basic methods include:
single linkage (nearest neighbor),
complete linkage (farthest neighbor),
average linkage,
the centroid method,
Ward's method.
252
The CLUSTER Procedure
The general form of the CLUSTER procedure:
The required METHOD= option specifies the hierarchical
technique to be used to cluster the observations.
PROC CLUSTER DATA=SAS-data-set
METHOD=method ;
VAR variables;
FREQ variable;
RMSSTD variable;
RUN;
253
The CLUSTER Procedure, method=…
The METHOD= specification determines the clustering method used by the procedure. Any one of the
following 11 methods can be specified for name:
AVERAGE | AVE requests average linkage (group average, unweighted pair-group method using arithmetic averages, UPGMA).
Distance data are squared unless you specify the NOSQUARE option.
CENTROID | CEN requests the centroid method (unweighted pair-group method using centroids, UPGMC, centroid sorting,
weighted-group method). Distance data are squared unless you specify the NOSQUARE option.
COMPLETE | COM requests complete linkage (furthest neighbor, maximum method, diameter method, rank order typal analysis).
To reduce distortion of clusters by outliers, the TRIM= option is recommended.
DENSITY | DEN requests density linkage, which is a class of clustering methods using nonparametric probability density
estimation. You must also specify either the K=, R=, or HYBRID option to indicate the type of density estimation to be used. See
also the MODE= and DIM= options in this section.
EML requests maximum-likelihood hierarchical clustering for mixtures of spherical multivariate normal distributions with equal
variances but possibly unequal mixing proportions. Use METHOD=EML only with coordinate data. See the PENALTY= option for
details. The NONORM option does not affect the reported likelihood values but does affect other unrelated criteria. The EML
method is much slower than the other methods in the CLUSTER procedure.
FLEXIBLE | FLE requests the Lance-Williams flexible-beta method. See the BETA= option in this section.
MCQUITTY | MCQ requests McQuitty’s similarity analysis (weighted average linkage, weighted pair-group method using
arithmetic averages, WPGMA).
MEDIAN | MED requests Gower’s median method (weighted pair-group method using centroids, WPGMC). Distance data are
squared unless you specify the NOSQUARE option.
SINGLE | SIN requests single linkage (nearest neighbor, minimum method, connectedness method, elementary linkage analysis, or
dendritic method). To reduce chaining, you can use the TRIM= option with METHOD=SINGLE.
TWOSTAGE | TWO requests two-stage density linkage. You must also specify the K=, R=, or HYBRID option to indicate the type of
density estimation to be used. See also the MODE= and DIM= options in this section.
WARD | WAR requests Ward’s minimum-variance method (error sum of squares, trace W). Distance data are squared unless you
specify the NOSQUARE option. To reduce distortion by outliers, the TRIM= option is recommended. See the NONORM option.
254
Supported Data Types
Hierarchical Method Coordinate Data Distance Data
Average Linkage Yes Yes
Centroid Linkage Yes Yes
Complete Linkage No Yes
Density Linkage No Some Options
EML Yes No
Flexible-Beta Method No Yes
McQuitty’s Similarity No Yes
Median Linkage No Yes
Single Linkage No Yes
Two-Stage Linkage No Some Options
Ward’s Method Yes Yes
255
Average Linkage
The distance between clusters is the average distance
between pairs of observations.
$$D_{KL} = \frac{1}{n_K n_L}\sum_{i \in C_K}\sum_{j \in C_L} d(x_i, x_j)$$

[Figure: clusters $C_K$ and $C_L$ with a pairwise distance $d(x_i, x_j)$]
256
Centroid Linkage
The distance between clusters is the squared Euclidean
distance between the cluster centroids $\bar{x}_K$ and $\bar{x}_L$:

$$D_{KL} = \lVert \bar{x}_K - \bar{x}_L \rVert^2$$
257
Complete Linkage
The distance between clusters is the maximum distance
between two observations, one in each cluster.
$$D_{KL} = \max_{i \in C_K,\, j \in C_L} d(x_i, x_j)$$
258
Density Linkage
1. Calculate a new distance metric, d*, using k-nearest
neighbor, uniform kernel, or Wong’s hybrid method.
2. Perform single linkage clustering with d*.
$$d^{*}(x_i, x_j) = \frac{1}{2}\left(\frac{1}{\hat f(x_i)} + \frac{1}{\hat f(x_j)}\right)$$
259
Equal Variance Maximum Likelihood
The distance between clusters CK and CL is given by a
penalized maximum-likelihood variant.
$$D_{KL} = n_M \ln\!\left(1 + \frac{W_M - W_K - W_L}{P_G}\right) - 2\left(n_M \ln n_M - n_K \ln n_K - n_L \ln n_L\right)$$

(schematic reconstruction; see the SAS/STAT CLUSTER documentation for
the exact definitions of the penalty term $P_G$ and the within-cluster
sums of squares $W_K$, $W_L$, $W_M$)
260
Flexible-Beta
The distance between clusters CK and CL is a BETA scaled
measure of the component distances.
$$D_{JM} = \frac{1-b}{2}\left(D_{JK} + D_{JL}\right) + b\,D_{KL}$$
261
McQuitty’s
The distance is the average distance between an external cluster J
and each of the component clusters ($C_K$ and $C_L$):

$$D_{JM} = \frac{D_{JK} + D_{JL}}{2}$$
262
Median Linkage
The average distance between an external cluster and each
of the component clusters, minus the distance between the
component clusters.
$$D_{JM} = \frac{D_{JK} + D_{JL}}{2} - \frac{D_{KL}}{4}$$
263
Single Linkage
The distance between clusters is the distance between the
two nearest observations, one in each cluster.
$$D_{KL} = \min_{i \in C_K,\, j \in C_L} d(x_i, x_j)$$
264
Two-Stage Density Linkage
The same as density linkage except that a cluster must have
at least “n” members before it can be fused.
1. Form 'modal' clusters.
2. Apply single linkage to the modal clusters.
265
Ward’s
Ward’s method uses ANOVA at each fusion point to
determine if the proposed fusion is warranted.
$$D_{KL} = \frac{\lVert \bar{x}_K - \bar{x}_L \rVert^2}{\dfrac{1}{n_K} + \dfrac{1}{n_L}}$$
266
The TREE Procedure
General form of the TREE procedure:
The TREE procedure either
displays the dendrogram (LEVEL= option) or
assigns the observations to a specified number
of clusters (NCLUSTERS= option).
PROC TREE DATA=SAS-data-set
<options>;
RUN;
267
Interpreting Dendrograms
[Figure: dendrogram annotated with a change in fusion level]
268
Determining the final clusters
269
The final clusters are obtained by a suitable 'cut' of the dendrogram.
There is no universal procedure; it always depends on the specific data
and on the interpretability of the result.
One possible procedure, however, is the following:
Denote by μi the cluster distances at which objects/clusters were joined
during the clustering algorithm.
Compute ri = μi+1 - μi.
Compute max(ri), which determines where to 'cut'.
Zdroj obrázků : L.Žák, Shluková analýza (II), http://www.volny.cz/elzet/Libor/Aut_cl_2.pdf
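The cut rule above can be sketched in Python; applied to the fusion levels from the worked example it suggests the two-cluster solution (function name is my own):

```python
def suggest_clusters(fusion_levels):
    """Largest-gap rule: with gaps r_i = mu_{i+1} - mu_i, cut the
    dendrogram in the widest gap. With k merges of n = k + 1 objects,
    stopping after merge i leaves n - (i + 1) = k - i clusters."""
    gaps = [fusion_levels[i + 1] - fusion_levels[i]
            for i in range(len(fusion_levels) - 1)]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    return len(fusion_levels) - i
```

For the example's fusion levels 14.14, 17.32, 39.36, 87.23 the largest gap lies before the final merge, so the suggested cut yields 2 clusters.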
270
Exercise
Generování dat: c_data.sas
• COMPACT : Three well-separated, compact clusters.
Source : SAS/STAT User's Guide (Introduction to Clustering Procedures).
• DERMATOLOGY : Differential diagnosis of erythemato-squamous disease.
Source : Ilter, N. and Altay Guvenir, H. (1998)
• ELONGATED : Two parallel elongated clusters in which the variation in one dimension is 6 times the
variation of the other dimension. There are 150 members in each of the clusters, for a total of 300
observations.
Source : SAS/STAT User's Guide (Introduction to Clustering Procedures).
• FISH : Seven species of fish caught off the coast of Finland.
Source : Data Archive of the Journal of Statistics Education
• INVESTORS : ...(training data)
• OUTLIERS: Create two clusters with severe outliers.
• PIZZA : Nutrient levels of various brands of frozen pizza.
Source: D.E. Johnson (1998), Applied Multivariate Methods for Data Analysis, Duxbury Press, Cole
Publishing Company, Pacific Grove, CA. (Example 9.2)
• RING : A normal cluster surrounded by a ring cluster.
Source : SAS/STAT User's Guide (The MODECLUS Procedure - Examples).
• STOCK : Dividend yields for 15 utility stocks in the U.S. for 1986-1990.
Source : SAS/STAT User's Guide (The DISTANCE Procedure - Examples).
• TINVESTORS : Investors data set (test data)
• UNEQUAL : Generate three unequal variance and unequal size clusters.
Source : SAS/STAT User's Guide.
271
Exercise
/*
clus01d01: Generating distances.
The sasuser.stock data set contains the dividend yields for 15 utility stocks in the
U.S.
The observations are names of the companies, and the variables correspond to
the annual
dividend yields over the period 1986-1990.
*/
options nodate nonumber;
goptions reset=all;
%let inputs = div_1986 div_1987 div_1988 div_1989 div_1990;
/* display the input data set */
title 'Stock Dividends';
title2 'The STOCK Data Set';
proc print data=sasuser.stock;
var company &inputs;
run;
/* calculate the range standardized Euclidean distance */
proc distance data=sasuser.stock method=euclid out=dist;
var interval(&inputs/std=range);
id company;
run;
/* display the distance matrix generated */
title2 'Euclidean Distance Matrix';
proc print data=dist;
id company;
run;
272
Exercise
/* generate hierarchical clustering solution (Ward's method)*/
proc cluster data=dist method=ward outtree=tree noprint;
id company;
run;
/* display the EUCLID dendrogram horizontally */
title2 "Cluster Solution";
proc tree data=tree horizontal;
id company;
run;
/* calculate the range standardized city block distance */
proc distance data=sasuser.stock method=cityblock out=dist;
var interval(&inputs/std=range);
id company;
run;
/* display the distance matrix generated */
title2 'City Block Distance Matrix';
proc print data=dist;
id company;
run;
/* generate hierarchical clustering solution (Ward's method)*/
proc cluster data=dist method=ward outtree=tree noprint;
id company;
run;
/* display the CITYBLOCK dendrogram horizontally */
title2 "Cluster Solution";
proc tree data=tree horizontal;
id company;
run;
273
Exercise
/* clus02d4: Impact of input standardization on clustering.
This demonstration evaluates the impact on cluster performance of
changing the method of input standardization. Several methods are
ranked according to their Cramer's V value and their misclassification
rate. PROC FASTCLUS is used to cluster the observations. The input
data set is the pizza data set. The input variables are the three inputs
recommended by the PROC VARCLUS 1-R**2 criterion.
*/
options nodate nonumber;
%let group = brand;
%let inputs = carb mois sodium;
data results;
length method$ 12;
length misclassified 8;
length chisq 8;
length pchisq 8;
length cramersv 8;
stop;
run;
%macro standardize(dsn=, nc=, method=);
…
%mend standardize;
%standardize(dsn=sasuser.pizza,nc=10,method=ABW(11));
%standardize(dsn=sasuser.pizza,nc=10,method=AGK(1));
274
Exercise
%standardize(dsn=sasuser.pizza,nc=10,method=AHUBER(.1));
%standardize(dsn=sasuser.pizza,nc=10,method=AWAVE(.2));
%standardize(dsn=sasuser.pizza,nc=10,method=EUCLEN);
%standardize(dsn=sasuser.pizza,nc=10,method=IQR);
%standardize(dsn=sasuser.pizza,nc=10,method=L(1));
%standardize(dsn=sasuser.pizza,nc=10,method=L(1.5));
%standardize(dsn=sasuser.pizza,nc=10,method=L(2));
%standardize(dsn=sasuser.pizza,nc=10,method=MAD);
%standardize(dsn=sasuser.pizza,nc=10,method=MAXABS);
%standardize(dsn=sasuser.pizza,nc=10,method=MEAN);
%standardize(dsn=sasuser.pizza,nc=10,method=MEDIAN);
%standardize(dsn=sasuser.pizza,nc=10,method=MIDRANGE);
%standardize(dsn=sasuser.pizza,nc=10,method=NONE);
%standardize(dsn=sasuser.pizza,nc=10,method=RANGE);
%standardize(dsn=sasuser.pizza,nc=10,method=SPACING(.9));
%standardize(dsn=sasuser.pizza,nc=10,method=STD);
%standardize(dsn=sasuser.pizza,nc=10,method=SUM);
%standardize(dsn=sasuser.pizza,nc=10,method=USTD);
/* sort by number of misclassifications within Cramer's V */
proc sort data=results;
by descending cramersv misclassified;
run;
/* display Cramer's V and misclassifications for each method */
title1 'Results';
proc print data=results;
var method cramersv misclassified ;
run;
quit;
6. Development of the CS model
275
Process Flow
Explore Data
Data Cleansing
Initial Characteristic
Analysis (KGB)
Preliminary
Scorecard (KGB)
Reject Inference
Initial Characteristic
Analysis (AGB)
Final
Scorecard (AGB)
• Scaling
• Assessment
Validate
276
Preliminary Scorecard (Known
Good/Bad)
A group of characteristics that, together, offers the most
predictive power
Logistic Regression (forward, backward, stepwise)
8-20 characteristics
Stability.
277
Logistic Regression
logit(pi) = β0 + β1x1 + ... + βkxk
p - posterior probability of 'event' given the inputs
x - input variables
β - parameters
The logit transformation is the log of the odds; it linearizes the
posterior probability while keeping the outcome between 0 and 1.
Maximum likelihood is used to estimate the parameters.
Parameter estimates measure the rate of change of the logit for a
one-unit change in the input variable (adjusted for the other inputs).
This depends on the unit of the input, hence the need to standardise (e.g.
WOE)
278
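The logit relationship can be illustrated with a small Python sketch (hypothetical coefficients and names, not the course's scorecard):

```python
import math

def score_logit(betas, x):
    """Scorecard-style logistic scoring: the linear predictor is the
    logit (log-odds); exp/(1+exp) maps it back to a probability in
    (0, 1). betas[0] is the intercept."""
    logit = betas[0] + sum(b * v for b, v in zip(betas[1:], x))
    p = 1.0 / (1.0 + math.exp(-logit))   # inverse logit
    return logit, p
```

A zero logit corresponds to even odds, i.e. a probability of exactly 0.5.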
Logistic Regression
Binary target (good/bad)
Variables
Raw data
Grouped data (for example, mid value of each group)
Weight of evidence
279
Logistic Regression
Forward Stepwise
Select best variable, add it to the model, and then add/subtract
variables until no improvement in indicator.
Efficient, but weak when too many variables or high correlation
Backward Elimination
Start with all variables in the model, then eliminate least important
variables.
Correlation is better taken care of
Better than stepwise, but can be computationally intensive.
280
Preliminary Scorecard
Choose the best – and build the most comprehensive risk
profile
With as many independent data items as possible
independent data items representing different data types e.g. demog, financials, inquiries, trades info
10 characteristics with ‘100’ each preferred to 4 with ‘250’
each.
Correlation, collinearity, etc. considered
Scorecard coherent with decision support structure
Sole arbiter or decision support tool: model needs to be coherent with overall decision
support structure
Interpretability, implementability, and other business
considerations.
281
Example of a Good Scorecard
Age
Residential status
Time at address
Inquiries 12 months 1)
Inquiries 3 months
Trades 90 days+ as % of total
Revolving balance/Total
Utilization
Number of products
at bank
Delinquency at bank
Total Debt Service
Ratio
282
1) Number of credit inquiries in the last 12 months
contains some demographics, some inquiries, some trade, some utilization,
internal bank performance, and capacity to pay.
How Do We Get There?
Try statistically optimal approach (let the data speak)
“Design” a scorecard using stepwise/backward
Force characteristics in, or fix at each loop and adjust the hurdle rate
Consider:
“must have”
Weaker/stronger
Similar
283
Weaker, Similar
Weaker – consider first
Can 2 characteristics worth 40 points
each model behavior better than one worth 70?
Same strength, broader base
Similar – put together
Time related, inquiries, trades, debt capacity, demographic
Takes care of correlation
284
Putting It Together
Try different combinations of characteristics in
regression
Instead of putting all characts in, separate into categories, and try
combinations.
Leave very strong characteristics out, or use at the end
(for example, bureau scores)
Example “levels”
Weaker application info
Stronger application info
Weaker bureau
Stronger bureau
Mix and adjust with experience.
285
Putting It Together
Age, time at address, time at employment, time at bank
Region, postal code, province
TDSR, GDSR, capacity, Loan To Value
Time at bureau, current customer (Y/N)
Inq 3 months, inq 6 months, inq 12 months, inq 3/12 months
Trades delq, trades 3 mth/total, current trades
Utilization, public records
Bureau score, bankruptcy.
286
GDSR( Gross Debt Service Ratio) = (Annual Mortgage Payments + Property Taxes + Other Shelter Costs)/(Gross Family Income)
TDSR (Total Debt Service Ratio) = (Annual Mortgage Payments + Property Taxes + Other Shelter Costs + Other Debt Payments)/(Gross Family Income)
Logistic Regression
Use stepwise or backward
Stepwise means a dominating variable will stay in.
Backward: a set of weak variables that together add value may end up
staying (sometimes better than stepwise); backward also handles
correlation better than the other methods.
Modify to consider only selected characteristics at each “level”
series of regression runs, each as one “level”, force selected
characteristics from previous “levels” in.
EM nodes in series.
287
• It is strongly recommended that all coefficients be logical.
If some of them are not, include comments explaining why it is
good to keep them in the scorecard. Include a column where it is
easy to see the contribution of each category (either scaled
score points for linearization, or simply bi*xi*1000). Order the
categories in each predictor according to bad rate (WOE) so that
the worst come first.
288
Logistic Regression
289
LR – scorecard example
[Flattened sample scorecard table: characteristics Own/Rent, Years at
address, Years on job, Occupation, Dept St / Major CC, Bank reference,
Debt ratio, No. of recent inquiries, Years in file, # Rev trades
outstanding, % Credit line utilization, and Worst reference, each with
attribute bins and score points (e.g. Rent = 15, Prof = 50,
>50% utilization = -18, NI = no information)]
Obs colnamew colvalue Bi_x_Xi_x_1000 bad_rate freq_rate Bi Xi
1 Intercept 3129 . . 3.129451 .
3 age_fr_w 20 -359 10.3 3.5 0.377612 -0.950967
4 age_fr_w 29 -154 6.3 28 0.377612 -0.408556
5 age_fr_w 32 -47 4.8 8.4 0.377612 -0.123244
6 age_fr_w 36 20 4.1 10 0.377612 0.05253
7 age_fr_w 41 48 3.8 11 0.377612 0.12723
8 age_fr_w 51 173 2.7 23 0.377612 0.458154
9 age_fr_w 60 327 1.8 16 0.377612 0.865979
10 car_owner_fr_w 0 -60 4.5 76 1.179055 -0.051044
11 car_owner_fr_w 1 211 3.6 24 1.179055 0.179078
12 child_num_fr_w 99 -59 4.9 1.9 0.424104 -0.138759
13 child_num_fr_w 0 -33 4.6 60 0.424104 -0.078474
14 child_num_fr_w 1 34 4 27 0.424104 0.080184
15 child_num_fr_w 2 134 3.2 11 0.424104 0.315117
16 education_fr_w 5 -174 6.1 4.3 0.453663 -0.384073
17 education_fr_w 4 -79 5 3.3 0.453663 -0.174838
18 education_fr_w 2 -10 4.4 73 0.453663 -0.021929
19 education_fr_w 36 112 3.4 19 0.453663 0.247396
290
LR – scorecard example
For all predictors in the scorecard, report the estimated coefficient,
Wald chi-square, and p-value (e.g. in the SAS output 'Analysis of
Maximum Likelihood Estimates'), and a summary of predictor selection
(the order in which predictors entered the model, e.g. in the SAS
output 'Summary of Stepwise Selection').
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Wald Pr > ChiSq
Error Chi-Square
Intercept 1 3.1295 0.0106 86436.032 <.0001
age_fr_w 1 0.3776 0.0254 221.2018 <.0001
car_owner_fr_w 1 1.1791 0.1093 116.2664 <.0001
education_fr_w 1 0.4537 0.066 47.2419 <.0001
fam_state_fr_w 1 0.4167 0.0297 196.864 <.0001
goods_group_fr_w 1 0.2716 0.0311 76.3139 <.0001
child_num_fr_w 1 0.4241 0.0867 23.9081 <.0001
ident_card_age_fr_w 1 0.5677 0.0269 446.5471 <.0001
291
Logistic Regression
Summary of Stepwise Selection

Step  Effect Entered         DF  Number In  Score Chi-Square  Pr > ChiSq
 1    price1_fr_w             1      1         3858.056        <.0001
 2    age_fr_w                1      2         3007.7932       <.0001
 3    init_pay_by_price1_f    1      3         1434.1863       <.0001
 4    ident_card_age_fr_w     1      4          868.4661       <.0001
 5    type_suite_fr_w         1      5          800.2294       <.0001
 6    time_on_job_fr_w        1      6          554.571        <.0001
 7    mobile_phone_fr_w       1      7          342.7936       <.0001
 8    sex_fr_w                1      8          357.5026       <.0001
 9    fam_state_fr_w          1      9          331.6358       <.0001
10    ident_type2_fr_w        1     10          358.4905       <.0001
11    weekend_fr_w            1     11          323.0631       <.0001
12    region_fr_w             1     12          299.3854       <.0001
292
Summary of predictor selection
Good Scorecard?
Eye Ball Test
Point allocation logical, no flips (after scaling)
“Flips” occur for several reasons: low count, correlation.
Scorecard characteristics make sense
what went in, what didn't; does it cover all the major categories of information?
Misclassification
Strength
Validation.
293
Scorecard Development
Build more complex models and compare predictiveness – if
difference not significant, then scorecard is OK
Examine findings – is there a valid business reason?
Build several ‘different’ scorecards
294
What Have I Just Done?
“Designed” a scorecard
Used regression, with business considerations
Stable, represents strong major/independent information categories
Measurable strength and impact
Something a risk manager can buy and use.
Used only known goods and bads (that is, approves)
But need to apply scorecard on all applicants.
295
Reject Inference
[Figure: total applicants split into goods, bads, and declined applicants.]
296
Everything to this point has been for known performance - e.g.
approval rate is 60%, building a model for 100% of the population
based on 60% sample is not accurate.
Reject Inference
Inferring the behavior of declined applicants
297
This is where you need to get to: so need to create a sample
representative of the “through the door” or entire applicant pop
performance - 100% approval rate.
The Known Good Bad Picture
Through-the-door: 10,000
Accepts: 7,050 (Goods: 6,176, Bads: 874)
Rejects: 2,950 (performance unknown: ?)
298
Reject Inference
Make the scorecard relevant
ignoring rejects distorts model
Influence of past decision making
For decision making
Get population odds
Expected performance
Swap set.
                         Old scorecard
                         Approve   Decline
New scorecard  Approve      A         B
               Decline      C         D
299
A – is approved goods
B – is rejected goods
C – is approved bads
D – is rejected bads
Where?
Medium/low approval rates
a 95% approval rate is close to “through the door”
Manual adjudication environment
Incorporates experience/intuition based overriding
“cherry picking” distorts performance.
300
Reject Inference Techniques
“True” Performance
“Nearly True” Performance
Statistical Inference
Or ignore the problem
Assume accepts = total population
not recommended unless previous credit granting was random or
scorecard was perfect ( assume all rejects = bad).
301
“True” Performance
Approve every applicant
Or random sample
Expensive
… but the only true way to
determine performance of
below cutoff applicants.
302
“Nearly True” Performance 1
Bureau data
performance of declined apps on similar products with other companies
legal issues
difficult to implement in practice – timings, definitions, programming
• Need consent to get bureau at any time
• data - if you rejected them, they probably were rejected elsewhere
• timings - performance window, sample window must be consistent
• bad definition must be closely replicated
• product must be similar - credit cards, unsecured line of credit with similar limit and conditions as you
would have given
• Experience - Programming effort is tremendous, depending on how detailed credit bureau reports are
[Timeline: declined applicants who got credit elsewhere, Jan 99; analyze their performance through Dec 00.]
303
“Nearly True” Performance 2
In-house data
performance of declined apps on similar products, for example, credit
cards/line of credit
timings, definitions may cause problems.
• data - if you rejected them for a lower-level product, they probably were rejected for a higher one
.. HOWEVER, in multiple product environments, scorecards are not always aligned and
there is “ARBITRAGE”.
• timings - performance window, sample window must be consistent
• bad definition must be closely replicated
• product must be similar - credit cards, unsecured line of credit with similar limit and
conditions as you would have given
[Timeline: declined applicants who got similar products, Jan 99; analyze their performance through Dec 00.]
304
Bureau Score Migration
Analyze bureau score migration of existing accounts with
below cutoff scores
Identify accounts whose scores migrate to ‘above cutoff’
within specified time frame
305
Reclassification
Build an accept/reject model
Score all rejects and designate worst as accepted ‘bad’
Can use score or “serious derogatory” information to select
accounts
Analyze Accept/Reject vs. Good/Bad cross tabs
Add to accepts and Re-model
306
Simple Augmentation
Simple Augmentation
Build good/bad model
Score rejects – establish a p(bad) to assign class
Add to Accepts and re-model
Simple
Arbitrary cutoff to assign goods and bads
Good/Bad model needs to be very good
No adjustment for p(approve).
307
Augmentation 2
Augmentation 2 (Coffman, Chandler 1977)
Build accept/reject model, obtain p(accept)
Build good/bad model
Adjust case weights of good/bad model to reflect probability of
acceptance
Recognizes need to adjust for p(approve).
308
Parceling
Parceling (also called re-weighting)
score rejects with G/B model
split (randomly) rejects into proportional G and B groups.
Score     # Bad  # Good  % Bad   % Good  Reject  Rej - Bad  Rej - Good
0-99         24      10  70.3%   29.7%      342        240         102
100-199      54     196  21.6%   78.4%      654        141         513
200-299      43     331  11.5%   88.5%      345         40         305
300-399      32     510   5.9%   94.1%      471         28         443
400+         29   1,232   2.3%   97.7%      778         18         760
continued...
309
Parceling
But ..
Reject bad proportion cannot be the same as approved?
Allocate higher proportion of bads from reject
Rule of thumb: bad rate for rejects should be
2–4 times that of approves.
Quick and simple
Good/Bad model better be good
May understate rejected bad rate.
310
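The parceling step above can be sketched in Python (the dictionary keys and the helper name are illustrative; the `factor` parameter implements the 2-4x rule of thumb by inflating the accept bad rate in each score band):

```python
import random

def parcel_rejects(bands, factor=2.0, seed=42):
    """Randomly assign inferred good/bad labels to rejects in each score band.

    bands: list of dicts with keys 'n_bad', 'n_good' (accepts) and 'n_reject'.
    factor: rule-of-thumb multiplier -- the inferred reject bad rate is taken
    as `factor` times the accept bad rate in the same band (capped at 1.0).
    Fills in 'rej_bad'/'rej_good' counts for each band and returns the list.
    """
    rng = random.Random(seed)
    for b in bands:
        accept_bad_rate = b["n_bad"] / (b["n_bad"] + b["n_good"])
        rej_bad_rate = min(1.0, factor * accept_bad_rate)
        # sample each reject independently as bad with the inflated rate
        rej_bad = sum(rng.random() < rej_bad_rate for _ in range(b["n_reject"]))
        b["rej_bad"], b["rej_good"] = rej_bad, b["n_reject"] - rej_bad
    return bands

# first two score bands from the table above
bands = [
    {"score": "0-99",    "n_bad": 24, "n_good": 10,  "n_reject": 342},
    {"score": "100-199", "n_bad": 54, "n_good": 196, "n_reject": 654},
]
for b in parcel_rejects(bands):
    print(b["score"], b["rej_bad"], b["rej_good"])
```

In the lowest band the doubled bad rate hits the cap, so every reject is inferred bad; in practice the inferred counts would then be reviewed against the slide's rule of thumb.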
Iterative Reclassification
Iterative Reclassification (McLachlan, 1975)
Build good/bad model using accepts
Score rejects and assign class based on chosen p(bad) cutoff
Rebuild model with combined dataset
Score rejects and re-assign class
Repeat until parameter estimates (and p(bad)) converge.
Can be modified for p(good) and p(bad) target assignment.
311
Iterative Reclassification
can be done as a plot of ln (odds) versus score.
[Figure: ln(odds) versus score for the KGB (known good/bad) model and successive iterations.]
312
Fuzzy Augmentation
Step 1: Classification
Build good/bad model
Score rejects with G/B model
Do not assign a reject to a class
Create 2 weighted cases for each reject, using p(good) and p(bad).
313
Fuzzy Augmentation
Step 2: Augmentation
Combine rejects with accepts, adjusting for approval rate
For this, weigh rejects again: weight determines how much
more frequent an actual case is compared to an inferred
case in the augmented dataset
Freq of a ‘Good’ from rejects = p(good)
x weight
Step 3: Remodel.
314
EM users: This is in the EM RI node.
Freq= p(good) x (reject rate/approval rate) x (#accepts/#rejects)
# rejects / # accepts are proportional to the actual population, i.e. weighted, not raw counts
Fuzzy Augmentation
No need for arbitrary classification cut-off
Augmentation step: better approach for choosing the
importance of rejects.
315
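The two weighting steps can be sketched in Python (all names are illustrative; `p_good` is the probability from scoring a reject with the known good/bad model, and the rates/counts follow the frequency formula quoted in the notes above):

```python
def fuzzy_augment(rejects, reject_rate, approval_rate, n_accepts, n_rejects):
    """Create two weighted cases per reject and adjust for the approval rate.

    rejects: list of dicts with a 'p_good' field from the good/bad model.
    reject_rate / approval_rate: population rates; n_accepts / n_rejects are
    the (weighted) sample counts, so that
    freq = p(good) x (reject rate / approval rate) x (# accepts / # rejects).
    """
    adj = (reject_rate / approval_rate) * (n_accepts / n_rejects)
    cases = []
    for r in rejects:
        cases.append({**r, "label": "good", "freq": r["p_good"] * adj})
        cases.append({**r, "label": "bad",  "freq": (1 - r["p_good"]) * adj})
    return cases

# one scored reject; 30% population reject rate, equal sample counts
cases = fuzzy_augment([{"id": 1, "p_good": 0.8}],
                      reject_rate=0.3, approval_rate=0.7,
                      n_accepts=1000, n_rejects=1000)
print(cases[0]["freq"], cases[1]["freq"])
```

The two weighted cases of each reject sum to the adjustment factor, so inferred cases count for less than actual accepted cases in the remodeling step.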
Nearest Neighbor (Clustering)
Clustering
Create 2 sets of clusters: goods and bads
Run rejects through both clusters
Compare Euclidean distances to assign most likely performance
Combine accepts and rejects and re-model
Measures are relative
Adjustment for p(approve) can be added at augmentation
step.
Can also use Memory-based Reasoning.
316
Other Techniques
Heckman’s Correction
http://ewe3.sas.com/techsup/download/stat/heckman.html
Heckman, James. "Sample Selection Bias as a Specification Error", Econometrica,
Vol 47, No 1., January 1979, pp. 153-161.
Greene, William. "Sample Selection Bias as a Specification
Error: Comment", {\sl Econometrica}, Vol. 49, No. 3, May 1981,
pp. 795-798.
Mixture Decomposition
B.S Everitt and D.J. Hand, Finite Mixture Distributions (London:Chapman & Hall,
1981)
317
Verification
Compare bad rates/odds for known versus inferred, and use
rule of thumb.
Review bad rates/weight of evidence of
pre- versus post inference groupings.
Create “fake rejects” and test.
assign some accepted accounts as rejects with an artificial cutoff and
test methods.
318
Factoring – Post Inference
Through-the-door: 10,000
Accepts: 7,050 (Goods: 6,176, Bads: 874), bad rate = 12.4%
Rejects: 2,950 (inferred Goods: 2,036, inferred Bads: 914), bad rate = 30.98%
319
After rejects have been inferred, we build the post-inference data sets for the final scorecard
production.
So the sample bias is solved and you can apply the scorecard on the entire population.
Process Flow
Explore Data
Data Cleansing
Initial Characteristic
Analysis (KGB)
Preliminary
Scorecard (KGB)
Reject Inference
Initial Characteristic
Analysis (AGB)
Final
Scorecard (AGB)Validate
320
Final Scorecard
Repeat Exploration, Initial Characteristics Analysis and
Regression for “All Good Bad” data set
Scaling
Assessment
Misclassification
Strength.
321
Scorecard Scaling (conversion into points)
Why scale?
Implementation software – batch versus on-line
Marketing uses (off line selection, build retention model, score and
isolate account numbers) vs. online decision support and app
processing software
Ease of understanding and interpretation
End user can deal with points easier than weights
Continuity
previous scorecards were grouped/scaled . .and you want to have the
same format and scaling.
Legal requirements
legal requirements to identify characteristics and reasons for decline
Components
Odds at a score
Points to double the odds
Example: Odds of 20:1 at 200, and odds double every 20 points.
322
Scorecard Scaling
This is the transformation
from parameter estimates
to scores.
Result: get a score card with
discrete points, related to
each other and the final
score related to odds.
odds doubling every 20
points
Score Odds
200 20
201 23
202 25
203 26
.
.
220 40
.
240 80
323
Age
18-24 10
25-29 15
30-37 25
38-45 28
46+ 35
Time at Res
0-6 12
7-18 25
19-36 28
37+ 40
Region
Major Urban 20
Minor Urban 25
Rural 15
Inq 6 mth
0 40
1-3 30
4-5 15
6+ 10
Scorecard Scaling
In general:
Score = A + B log (odds)
Score +PDO = A + B log (2*odds)
Offset A and Factor B are to be calculated
• Odds = odds at which to fix a score
• Score = score at point x
• PDO = points to double the odds
324
Scorecard Scaling
Solving for PDO:
PDO = B Log (2), therefore
B = PDO/log(2) ; A = Score – {B log (Odds)}
Example, Odds of 50:1 at 600 and 20 pdo
B = 20/log(2) = 28.8539
A = 600 – {28.8539 log (50)} = 487.123
Score = 487.123 + 28.8539 log (odds)
Or log (odds) = (-16.88239) + 0.03465*Score
325
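The two scaling equations can be checked with a few lines of Python (note that "log" here is the natural logarithm, which reproduces the factor 28.8539 from the example):

```python
import math

def scaling_params(odds, score, pdo):
    """Solve for offset A and factor B given fixed odds at a score and PDO.

    score = A + B*ln(odds); adding PDO points doubles the odds,
    so B = PDO / ln(2) and A = score - B*ln(odds).
    """
    b = pdo / math.log(2)
    a = score - b * math.log(odds)
    return a, b

def scale(odds, a, b):
    return a + b * math.log(odds)

# odds of 50:1 at score 600, 20 points to double the odds
a, b = scaling_params(odds=50, score=600, pdo=20)
print(round(b, 4), round(a, 3))   # 28.8539 487.123
print(round(scale(100, a, b)))    # doubling the odds adds 20 points: 620
```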
Scorecard Scaling

points_i = ( -(WOE_i × β_i + a/n) ) × factor + offset/n

The points for each attribute are calculated by multiplying
the Weight of Evidence of the attribute with the regression
coefficient of the characteristic, then adding a fraction of
the regression intercept, then multiplying this by -1 and by
the factor and finally adding a fraction of the offset.
326
The negative sign is there because we switch from bad/good in modeling
(regression) to good/bad in scaling (high scores being better than low
scores).
Scorecard Scaling

score = ln(odds) × factor + offset
      = ( Σ_{i=1..n} (WOE_i × β_i) + a ) × factor + offset
      = Σ_{i=1..n} ( (WOE_i × β_i + a/n) × factor + offset/n )

and, with the bad/good-to-good/bad sign switch from the previous slide, each
attribute contributes

points_i = -(WOE_i × β_i + a/n) × factor + offset/n

where
β_i = the regression coefficient of the characteristic
WOE_i = weight of evidence for the attribute
n = number of characteristics
a = intercept
(factor = B and offset = A from the previous slide)
327
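The per-attribute point allocation as a sketch in Python; the WOE, coefficient and intercept values below are illustrative, not from a real scorecard:

```python
import math

def attribute_points(woe, beta, intercept, n_chars, factor, offset):
    """Points for one attribute: -(WOE*beta + a/n)*factor + offset/n."""
    return -(woe * beta + intercept / n_chars) * factor + offset / n_chars

# hypothetical two-characteristic scorecard scaled to odds 50:1 at 600, PDO 20
factor = 20 / math.log(2)               # 28.8539
offset = 600 - factor * math.log(50)    # 487.123
pts = attribute_points(woe=0.37, beta=0.4537, intercept=3.1295,
                       n_chars=2, factor=factor, offset=offset)
print(round(pts))
```

Summing such per-attribute points over all characteristics reproduces the total scaled score.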
Check Points Allocation

Age       Weight   Scorecard 1  Scorecard 2
Missing    -55.50       16           16
18-22     -108.41       12           12
23-26      -72.04       18           18
27-29       -3.95       26           14
30-35       70.77       35           38
35-44      122.04       43           44
44+        165.51       51           52
328
Scorecard 1 looks OK - logical distribution - as age increases, points increase according to weight
But Scorecard 2 doesn’t.
Why?
Correlation? Quirk in the data? Grouping? Maybe weights were too close together and not enough
differentiation - repeat grouping with more distinct groups, and repeat regression.
FICO is a unified score that can be obtained from your score by a linear transformation. The aim is to compute
these transformation coefficients for every scorecard, because then you can compare the quality of portfolios. If
your development data are old enough that you can observe ever-90-DPD at 12 months on book (90DPD @12MOB), take a
random sample (30,000 observations) from them; if not, take older data and score them with your new scorecard.
Build a table as in the example below, compute the FICO score for each category as the linear transformation
ln(G/B) -> FICO, defined as FICO = (x + 7.58)/0.0157, and apply linear regression of FICO on the median score.
Median of the  Lower bound   Upper bound   Numb     Ever 90   Good/Bad   ln(G/B)    Fico
category       of the score  of the score  @12 MOB  @12 MOB
0.716          0             0.752399981   1497     943       0.5874867  -0.5319    449
0.778          0.7524        0.7968        1504     733       1.0518418   0.050543  486
0.8132         0.7968        0.8268        1496     630       1.3746032   0.318165  503
0.8371         0.8268        0.8457        1508     564       1.6737589   0.515072  516
0.8532         0.8457        0.8596        1495     510       1.9313726   0.658231  525
0.8654         0.8596        0.8703        1500     474       2.164557    0.772216  532
0.875          0.8703        0.8792        1513     447       2.3847875   0.86911   538
0.8833         0.8792        0.8869        1489     414       2.5966184   0.95421   544
0.8901         0.8869        0.8934        1496     393       2.8066158   1.031979  549
0.8968         0.8934        0.8996        1521     378       3.0238095   1.106517  553
0.9024         0.8996        0.9051        1491     351       3.2478633   1.177997  558
329
FICO score
(1497-943)/943 = 0.5874
(-0.5319 + 7.58)/0.0157 = 449
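The two worked computations above can be combined into one helper (a sketch using the slide's constants 7.58 and 0.0157):

```python
import math

def fico_from_counts(n_total, n_bad):
    """Map ln(good/bad) to the FICO scale via FICO = (x + 7.58) / 0.0157."""
    good_bad = (n_total - n_bad) / n_bad   # e.g. (1497 - 943) / 943 = 0.5875
    return round((math.log(good_bad) + 7.58) / 0.0157)

print(fico_from_counts(1497, 943))  # 449, first row of the table
```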
FICO transformation graph: linear regression of the FICO score on the median score (sc1) gives y = 678.99x - 49.314 with R² = 0.9613.
330
7. Introduction to
Survival Analysis
331
What Is Survival Analysis?
Survival analysis is a class of statistical methods for
which the outcome variable of interest is time until an
event occurs.
Time is measured from the beginning of follow-up
until the event occurs or a reason occurs for the
observation of time to end.
332
Examples of Survival Analysis
Follow-up of patients undergoing surgery to measure
how long they survived after the surgery
Follow-up of leukemia patients in remission to measure
how long they remain in remission
Follow-up of clients to measure how long they stay
non-defaulted
333
What Is Survival Analysis?
Time
Subjects
A
B
C
D
E
F
G
Event
End of Study
Withdrew
Event
Lost to follow-up
Event
Event
1 2 3 4 5 6
334
Data Structure
Subject Survival Time Status
A 4.0 1 (event)
B 6.0 0 (censored)
C 3.0 0
D 5.0 1
E 3.0 0
F 3.0 1
G 2.0 1
335
Problems with Conventional Methods
Logistic regression
ignores information on the timing of events
cannot handle time-dependent covariates.
Linear regression
cannot handle censored observations
cannot handle time-dependent covariates
is not appropriate because time to event can have unusual
distribution.
336
Right-Censoring
An observation is right-censored if the observation is
terminated before the event occurs.
Time
Subjects
End of Study
Withdrew
Lost to follow-up
337
Left-Censoring
Start of
Study
End of
Study
A
B
Time before
Study
Event
Event
An observation is left-censored when the observation
experiences the event before the start of the follow-up period.
338
Interval-Censoring
A
B
Event
Event
Time
a b?
An observation is interval-censored if the only information you
know about the survival time is that it is between the values a and b.
339
Types of Right-Censoring
Type I subjects survived until end of the study. Censoring
time is fixed.
Type II subjects survived until end of the study. Censoring
time occurs when a pre-specified number of events have
occurred.
Random observations are terminated for reasons that are
not under the control of the investigator.
340
Uninformative Censoring
Censoring is uninformative if it
occurs when the reasons for termination are unrelated to the
risk of the event
assumes that subjects who are censored at time X should be
representative of all those subjects with the same values of the
predictor variables who survive to time X
does not bias the parameter estimates and statistical inference.
341
Informative Censoring
Censoring is informative if it
occurs when the reasons for termination of the observation are
related to the risk of the event
results in biased parameter estimates and inaccurate statistical
inference about the survival experience.
342
Recommendations Regarding
Informative Censoring
When designing and conducting studies, reduce the amount
of random censoring.
Always analyze the pattern of censoring to see whether it is
related to a subset of subjects.
Include in your study any explanatory variables that may
affect the rate of censoring.
343
Time Origin Recommendations
Choose a time origin that marks the onset of continuous
exposure to the risk of the event.
Choose the time of randomization to treatment
as the time origin in experimental studies.
If there are several time origins available, consider
controlling for the other time origins by including
them as covariates.
344
Survival Analysis
The goals of survival analysis might be to
estimate and interpret survival and hazard functions from
survival data
compare survival and hazard functions among different groups
assess the relationship of time-independent and
time-dependent explanatory variables to survival time
predict the remaining time until the event.
345
Survival Function (funkce přežití)

• T … random variable denoting the survival time (time until the event of interest or until censoring).
• δ … event indicator (δ = 1 if the event occurred, δ = 0 if the observation is censored).
• S(t) … the survival function gives the probability that an individual is still event-free at time t:

S(t) = P(T > t)

• It holds that S(0) = 1 and lim_{t→∞} S(t) = 0.
346
Kaplan-Meier estimation

Time  Number   Number  Number    Cumulative
      At Risk  Events  Censored  Survival
0        7       0        0      1.00
1        7       0        0      1.00
2        7       1        0      (7-1)/7 = .86
3        6       1        2      .86*5/6 = .71
4        3       1        0      .71*2/3 = .48
5        2       1        1      .48*1/2 = .24
6        0       0        0      -------

Ŝ(t_j) = Ŝ(t_{j-1}) × (n_j − d_j)/n_j = Π_{i=1}^{j} (n_i − d_i)/n_i

n_j … number of subjects at risk just before time t_j
d_j … number of subjects with the event at time t_j
347
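The table above can be reproduced with a small from-scratch Kaplan-Meier implementation in Python (using the seven subjects A-G from the data-structure slide; censored subjects at a tied time are kept at risk for the events at that time):

```python
def kaplan_meier(times, events):
    """Product-limit estimate S(t) at each distinct event time.

    times: observed times; events: 1 = event, 0 = censored.
    Returns a list of (time, survival) pairs at event times.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, out, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at t
        n_t = sum(1 for tt, _ in data if tt == t)           # leaving the risk set at t
        if d > 0:
            surv *= (at_risk - d) / at_risk
            out.append((t, round(surv, 2)))
        at_risk -= n_t
        i += n_t
    return out

# subjects A-G from the data-structure slide
times  = [4, 6, 3, 5, 3, 3, 2]
events = [1, 0, 0, 1, 0, 1, 1]
print(kaplan_meier(times, events))  # [(2, 0.86), (3, 0.71), (4, 0.48), (5, 0.24)]
```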
Kaplan-Meier Curve
348
Other estimation methods
349
Life Table Method
The life table method
is useful when there are a large number of observations
groups the event times into intervals
can produce estimates and plots of the hazard function.
350
Differences between KM and Life
Table Methods
In the Kaplan-Meier method,
time interval boundaries are determined by the event
times themselves
censored observations are assumed to be at risk for the
whole event time period.
In the life table method,
time interval boundaries are determined by the user
censored observations are censored at the midpoint
of the time interval.
352
Standard error of KM estimate
• The corresponding estimate of the standard error is
computed using Greenwood's formula (Kalbfleisch and
Prentice, 1980) as

σ̂( Ŝ(t_j) ) = Ŝ(t_j) × sqrt( Σ_{i=1}^{j} d_i / ( n_i (n_i − d_i) ) )
353
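Greenwood's formula as a sketch in Python (reusing the (n_i, d_i) pairs from the Kaplan-Meier table above):

```python
import math

def greenwood_se(risk_event_pairs):
    """Standard error of S(t_j) via Greenwood's formula.

    risk_event_pairs: [(n_i, d_i), ...] for the event times up to t_j.
    """
    surv, var_sum = 1.0, 0.0
    for n, d in risk_event_pairs:
        surv *= (n - d) / n                 # product-limit survival
        var_sum += d / (n * (n - d))        # Greenwood variance term
    return surv * math.sqrt(var_sum)

# event times 2, 3, 4 from the example: (n, d) = (7,1), (6,1), (3,1)
print(round(greenwood_se([(7, 1), (6, 1), (3, 1)]), 3))  # 0.225
```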
Pointwise Confidence Limits
354
Simultaneous Confidence Intervals
Confidence bands show with a given confidence level that the
survival function falls within the interval for all time points.
There are two approaches in SAS for constructing
simultaneous confidence intervals.
Equal precision (CONFBAND=EP) confidence intervals are
proportional to the pointwise confidence intervals.
Hall-Wellner (CONFBAND=HW) confidence intervals are not
proportional to the pointwise confidence intervals.
Transformations that are used to improve the pointwise
confidence bands can be used to improve the simultaneous
confidence bands.
357
Comparing Survival Functions
361
Likelihood-Ratio Test
The likelihood-ratio test
is a parametric test that assumes that the distribution of event
times follows an exponential distribution
can be verified if the plot of the negative log of the survival
function by time follows a linear trend with
an origin of 0.
362
Nonparametric Tests
363
Log-Rank Test
The log-rank test
tests whether the survival functions are statistically
equivalent
is a large-sample chi-square test that uses the observed and
expected cell counts across the event times
has maximum power when the ratio of hazards is constant
over time
loses power in the presence of interactions.
364
Log-Rank Test for Two Groups

χ² = ( Σ_{j=1}^{r} (d_{1j} − e_{1j}) )² / Σ_{j=1}^{r} var(d_{1j} − e_{1j})

where d_{1j} is the number of events that occur in group 1 at
time j, and e_{1j} is the expected number of events in group 1
at time j.
365
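A from-scratch sketch of the two-group log-rank statistic; the expectation and variance at each event time come from the hypergeometric distribution (the data below are made up):

```python
def log_rank(times1, events1, times2, events2):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    pooled = sorted({t for t, e in zip(times1 + times2, events1 + events2) if e == 1})
    num, var = 0.0, 0.0
    for t in pooled:
        n1 = sum(1 for x in times1 if x >= t)   # at risk in group 1
        n2 = sum(1 for x in times2 if x >= t)
        d1 = sum(1 for x, e in zip(times1, events1) if x == t and e == 1)
        d2 = sum(1 for x, e in zip(times2, events2) if x == t and e == 1)
        n, d = n1 + n2, d1 + d2
        e1 = d * n1 / n                          # expected events in group 1
        num += d1 - e1
        if n > 1:
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return num * num / var

# group 1 fails early, group 2 late
stat = log_rank([1, 2], [1, 1], [3, 4], [1, 1])
print(round(stat, 2))  # 2.88
```

For identical groups the observed and expected counts match at every event time and the statistic is zero.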
Wilcoxon Test
The Wilcoxon test
is also known as the Gehan test or the Breslow test
can be biased if the pattern of censoring is different between
the groups
loses power in the presence of interactions.
366
Wilcoxon Test for Two Groups

χ² = ( Σ_{j=1}^{r} n_j (d_{1j} − e_{1j}) )² / Σ_{j=1}^{r} n_j² var(d_{1j} − e_{1j})

where n_j is the total number at risk at each time point.
367
Log-Rank versus Wilcoxon Test
Log-rank test
is more sensitive than the Wilcoxon test to differences
between groups in later points in time.
Wilcoxon test
is more sensitive than the log-rank test to differences between
groups that occur in early points in time.
368
New Tests in SAS®9
Tarone-Ware test uses a weight equal to the square root of the
number at risk. This gives more weight to differences between
the observed and expected number of events at time points
where there is the most data.
Peto-Peto and Modified Peto-Peto tests use weights that
depend on the observed survival experience of the combined
sample. The principal advantage of these tests is that they do
not depend on the censoring experience of the groups.
Harrington-Fleming test incorporates features of both the
log-rank and Peto-Peto tests.
369
Stratified Tests
Stratified tests are used when you want to compare survival
functions across k populations while controlling for other
covariates.
They are different than the k-sample tests which only
compare survival functions across k populations.
Stratified tests are available in SAS®9 with the use of the
GROUP= option in the STRATA statement.
370
Syntax for Stratified Tests
STRATA variable1 / GROUP=variable2 TEST=(list);
Distinct values
represent the m
strata
Distinct values
represent the k
populations
371
Multiple Comparison Methods
Bonferroni correction to the raw p-values
Dunnett’s two-tailed comparisons of the control
group with all other groups
Scheffe’s multiple-comparison adjustment
Sidák correction to the raw p-values
Paired comparisons based on the studentized maximum
modulus test
Tukey’s studentized range test
Adjusted p-values from the simulated distribution
372
Specification of Comparisons
DIFF=ALL requests all paired comparisons.
DIFF=CONTROL <(’string’ <...’string’>)> requests
comparisons of the control curve with all other curves.
To specify the control curve, you specify the quoted strings of
formatted values that represent the curve
in parentheses.
373
LIFETEST Procedure
General form of the LIFETEST procedure:
PROC LIFETEST DATA=SAS-data-set ;
TIME variable <*censor(list)>;
STRATA variable <(list)> <...variable <(list)>>
;
TEST variables;
RUN;
• The simplest use of PROC LIFETEST is to request the nonparametric estimates of the survivor
function for a sample of survival times. In such a case, only the PROC LIFETEST statement and the
TIME statement are required. You can use the STRATA statement to divide the data into various strata.
A separate survivor function is then estimated for each stratum, and tests of the homogeneity of strata
are performed. 378
Hazard Function (riziková funkce)
The hazard function
is the instantaneous risk or potential that an event will occur
at time t, given that the individual has survived up to time t
takes the form number of events per interval of time
is a rate, not a probability, that ranges from zero to infinity.
379
Hazard Function

h(t) = lim_{Δt→0} P( t ≤ T < t + Δt | T ≥ t ) / Δt

(the numerator is a conditional probability, Δt is an interval of time, and
h(t) is the instantaneous risk or potential, okamžité riziko/potenciál)

It holds that

h(t) = f(t) / S(t) = − d/dt ln( S(t) ), where f(t) is the density of the random variable T,

S(t) = exp( −H(t) ), where H(t) = ∫₀ᵗ h(x) dx is the so-called cumulative hazard function.
380
8. Cox model
382
Survival Models
Models in survival analysis
are written in terms of the hazard function
assess the relationship of predictor variables to survival time
can be parametric or nonparametric models.
383
Parametric versus Nonparametric
Models
Parametric models require that
the distribution of survival time is known
the hazard function is completely specified except
for the values of the unknown parameters.
Examples include the Weibull model, the exponential
model, and the log-normal model.
384
Parametric versus Nonparametric
Models
Properties of nonparametric models are
the distribution of survival time is unknown
the hazard function is unspecified.
An example is the Cox proportional hazards model.
385
Cox Proportional Hazards Model

h_i(t) = h_0(t) × exp( β_1 X_{i1} + … + β_k X_{ik} )

h_0(t) is the baseline hazard function (involves time but not the predictor
variables); the exponent is a linear function of a set of predictor variables
(does not involve time).
386
Popularity of the Cox Model
The Cox proportional hazards model
provides the primary information desired from a survival
analysis, hazard ratios and adjusted survival curves, with a
minimum number of assumptions
is a robust model where the regression coefficients closely
approximate the results from the correct parametric model.
387
Measure of Effect

Hazard ratio = hazard in group A / hazard in group B = exp( Σ_i β̂_i ( X_{iA} − X_{iB} ) )
388
Properties of the Hazard Ratio

Hazard ratio < 1: group B has the higher hazard.
Hazard ratio = 1: no association.
Hazard ratio > 1: group A has the higher hazard.
389
Proportional Hazards Assumption
[Figure: log-hazard versus time for females and males; under proportional hazards the two curves are parallel.]
390
Nonproportional Hazards
391
Cox model in credit scoring
Credit-scoring systems were built to answer the question, "How likely is a credit
applicant to default by a given time in the future?" The methodology is to take a sample
of previous customers and classify them into good or bad depending on their
repayment performance over a given fixed period. Poor performance just before the
end of this fixed period means that customer is classified as bad; poor performance
just after the end of the period does not matter and the customer is classified as good.
This arbitrary division can lead to less-than-robust scoring systems. Also, if one wants
to move from credit scoring to profit scoring, then it matters when a customer defaults.
One asks not if an applicant will default but when will they default. This is a more
difficult question to answer because there are lots of answers, not just the yes or no of
the "if" question, but it is the question that survival analysis tools address when
modeling the lifetime of equipment, constructions, and humans.
Zdroj: Thomas, Edelman, Crook – Credit scoring and its application.
392
Cox model in credit scoring
Using survival analysis to answer the "when" question has several advantages over
standard credit scoring. For example,
• it deals easily with censored data, where customers cease to be borrowers (either
by paying back the loan, death, changing lender) before they default;
• it avoids the instability caused by having to choose a fixed period to measure
satisfactory performance;
• estimating when there is a default is a major step toward calculating the
profitability of an applicant;
• these estimates will give a forecast of the default levels as a function of time,
which is useful in debt provisioning;
• this approach may make it easier to incorporate estimates of changes in the
economic climate into the scoring system.
Zdroj: Thomas, Edelman, Crook – Credit scoring and its application.
393
Cox model in credit scoring
Let T be the time until a loan defaults. Then there are three standard ways to
describe the randomness of T in survival analysis (Collett 1994): S(t), f(t) and h(t).
Zdroj: Thomas, Edelman, Crook – Credit scoring and its application.
394
Cox model in credit scoring
In standard credit scoring, one assumes that the application characteristics affect
the probability of default. Similarly, in this survival analysis approach, we want
models that allow these characteristics to affect the probability of when a customer
defaults. Two models have found favor in connecting explanatory variables to failure
times in survival analysis:
• proportional hazard models
• accelerated life models.
If x = (x1,..., xp) are the application (explanatory) characteristics, then an
accelerated life model assumes that

S(t, x) = S_0( t · e^{w·x} )  (equivalently h(t, x) = e^{w·x} · h_0( t · e^{w·x} )),

where h_0 and S_0 are baseline functions, so the x can speed up or slow down the
aging of the account. The proportional hazards model assumes that

h(t, x) = e^{w·x} · h_0(t),

so the application variables x have a multiplier effect on the baseline hazard.
Zdroj: Thomas, Edelman, Crook – Credit scoring and its application. 395
Cox model in credit scoring
Cox (1972) pointed out that in proportional hazards one can estimate the weights w
without knowing h0(t) using the ordering of the failure times and the censored
times. If ti and xi are the failure (or censored) times and the application variables for
each of the items under test, then the conditional probability that customer i
defaults at time t_i, given that R(i) are the customers still operating just before t_i, is
given by

exp( w · x_i ) / Σ_{j ∈ R(i)} exp( w · x_j ),

which is independent of h_0.
Zdroj: Thomas, Edelman, Crook – Credit scoring and its application.
396
PHREG Procedure
PROC PHREG DATA=SAS-data-set ;
CLASS variable <(options)><...variable <(options)>>;
MODEL response<*censor(list)>=variables ;
STRATA variable<(list)><…variable<(list)>> ;
CONTRAST <'label'> effect values <,..., effect values> ;
ASSESS keyword ;
HAZARDRATIO <'label'> variable ;
TEST equation1 <,..., equationk> < /options>;
WEIGHT variable;
OUTPUT ;
programming statements;
RUN;
397
9. Measuring the quality (strength) of the model;
model validation.
398
How Good Is the Scorecard?
And which one is the best?
Combination of statistical measures and business
objectives
Misclassification (Confusion) matrix
Scorecard strength measures
399
Misclassification
Confusion matrix
Accuracy
(TP+TN)/total
Error rate
(FP+FN)/total
Sensitivity ; Specificity
(TP)/Actual Positives ; (TN)/Actual Negatives
Positive ; Negative predicted value
TP/predicted positives ; TN/predicted negatives
                      Predicted
                 Good              Bad
Actual  Good     True Positive     False Negative
        Bad      False Positive    True Negative
“Good”/”Bad” is above/below chosen cutoff.
Want to max accuracy and min error rate.
400
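The ratios above as a small Python helper (the counts below are illustrative):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, error rate, sensitivity, specificity and predicted values."""
    total = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / total,
        "error_rate":  (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # of actual goods, share predicted good
        "specificity": tn / (tn + fp),   # of actual bads, share predicted bad
        "ppv":         tp / (tp + fp),   # positive predicted value
        "npv":         tn / (tn + fn),   # negative predicted value
    }

m = confusion_metrics(tp=800, fn=50, fp=100, tn=50)
print(round(m["accuracy"], 2), round(m["sensitivity"], 3))  # 0.85 0.941
```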
Misclassification
Confusion matrix
Acceptance of bads (FP)
Acceptance of goods (TP)
Decline Goods (FN)
Decline Bads (TN)
                      Predicted
                 Good              Bad
Actual  Good     True Positive     False Negative
        Bad      False Positive    True Negative
Want to min rejection of goods and max rejection of bads.
401
Misclassification
Approval rate: bad rate relationship
Objective:
Minimize the rejection of goods or acceptance of bads
Best option for desired bad rate, approval rate
Compare scorecards and cutoff choices.
“I’d rather approve some bads than reject good customers” vs. “the cost of approving bads is
too high, we can deal with PR”.
Generate these stats for different cutoff choices and compare with the base, i.e. the current
approval and bad rates.
If several models are being compared, generate these for the same bad rate or approval rate,
i.e. choose different cutoffs to get the same bad rate.
402
Misclassification: Oversampling
Need to adjust for oversampling if have not done so before
this step
Sensitivity/specificity unaffected by oversampling
Multiply cell counts by sample weights (π0 and π1)
                      Predicted
                 Good                    Bad
Actual  Good     n × Sens × π1           n × (1 − Sens) × π1
        Bad      n × (1 − Spec) × π0     n × Spec × π0
403
Scorecard Strength
Akaike’s Information Criterion (AIC)
Schwartz Bayesian Criterion (SBC)
-(score test statistic) + penalty term
• Penalty term = (k + 1) × ln(n)
• k = number of variables
• n = sample size
Penalise for adding parameters to the model ...
Smaller values are better.
404
Kolmogorov–Smirnov (KS) Statistic
Maximum difference between the cumulative distributions of goods and bads across the score range.

[Figure: cumulative % of goods and bads vs. score for scorecards A and B (legend: Good-A, Bad-A, Good-B, Bad-B, KS-A, KS-B); the KS statistic is the largest vertical gap between the good and bad curves.]
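The KS statistic can be computed from per-band counts of goods and bads as the maximum gap between the two cumulative shares; the counts used in the test below are the SC 1 deciles from the Information value example later in this section:

```python
# Sketch: KS as the max gap between the cumulative % of bads and
# the cumulative % of goods across score bands (ordered by score).
def ks_statistic(goods, bads):
    G, B = sum(goods), sum(bads)
    cg = cb = 0.0
    ks = 0.0
    for g, b in zip(goods, bads):
        cg += g / G
        cb += b / B
        ks = max(ks, abs(cb - cg))
    return ks
```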
Scorecard Strength
C-statistic
Area under the ROC curve; equivalent to the Wilcoxon–Mann–Whitney test.

[Figure: ROC curve — Sensitivity vs. (1 − Specificity) for Scorecard A, Scorecard B and the random (diagonal) model.]
You may be wondering where the name "Receiver Operating Characteristic" came from. ROC analysis is part of a field called "Signal Detection Theory" developed during World War II for the analysis of radar images. Radar operators had to decide whether a blip on the screen represented an enemy target, a friendly ship, or just noise. Signal detection theory measures the ability of radar receiver operators to make these important distinctions; their ability to do so was called the Receiver Operating Characteristic. It was not until the 1970s that signal detection theory was recognized as useful for interpreting medical test results.
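The equivalence with the Wilcoxon–Mann–Whitney statistic gives a direct way to compute the c-statistic without building the ROC curve: it is the probability that a randomly chosen good scores above a randomly chosen bad (ties counted as 1/2). The scores in the test are illustrative:

```python
# Sketch: c-statistic (AUC) as the Wilcoxon-Mann-Whitney probability
# that a random good client outscores a random bad client.
def c_statistic(good_scores, bad_scores):
    wins = 0.0
    for g in good_scores:
        for b in bad_scores:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5   # ties count as half a win
    return wins / (len(good_scores) * len(bad_scores))
```

This O(n·m) double loop is fine for a sketch; in practice a rank-based O(n log n) formulation is used.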
406
Scorecard Strength
• Gains chart: cumulative positive predicted value versus the distribution of predicted positives (depth)
• Lift/concentration curve: sensitivity versus depth
• Lift = positive predicted value / % of positives in the sample
• Misclassification costs (losses assigned to false positives and false negatives)
• Bayes rule (minimizes expected cost)
• Cost ratio (at what cutoff do I break even given the prior bad rate, i.e. if bad odds are 9:1, you need a cutoff where 1 bad is balanced by 9 goods)
• Somers’ D, Gamma, Tau-a
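Cumulative lift at a given depth can be sketched from banded counts; the test reproduces Lift20% = 2.55 reported for SC 1 later in this section (bads treated as the positives, bands ordered from worst score):

```python
# Sketch: cumulative lift at depth = (bad rate among the selected
# worst-scoring bands) / (overall bad rate).
def lift_at_depth(bads, totals, depth_bands):
    sel_bad = sum(bads[:depth_bands])
    sel_tot = sum(totals[:depth_bands])
    overall = sum(bads) / sum(totals)
    return (sel_bad / sel_tot) / overall
```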
407
Information value
Information value (Ival) – continuous case (divergence):
The special case of the Kullback–Leibler divergence given by

I_{val} = \int \big(f_{GOOD}(x) - f_{BAD}(x)\big) \ln\frac{f_{GOOD}(x)}{f_{BAD}(x)}\, dx

where f_{GOOD}, f_{BAD} are the densities of the scores of good and bad clients. We also write

f_{diff}(x) = f_{GOOD}(x) - f_{BAD}(x), \qquad f_{LR}(x) = \ln\frac{f_{GOOD}(x)}{f_{BAD}(x)}.
409
Information value
Information value (Ival) – discretized continuous case:
• Replace the densities by their kernel estimates and compute the integral numerically (e.g. by the composite trapezoidal rule).
• We use the Epanechnikov kernel K(x) = \frac{3}{4}(1 - x^2)\, I_{[-1,1]}(x) and an optimal smoothing bandwidth h_{OS,k}.
• For given M + 1 points x_0 < x_1 < \dots < x_M we obtain

\hat I_{val} = \frac{x_M - x_0}{2M}\Big[\tilde f_{IV}(x_0) + 2\sum_{i=1}^{M-1}\tilde f_{IV}(x_i) + \tilde f_{IV}(x_M)\Big],

where

\tilde f_{IV}(x) = \big(\tilde f_{GOOD}(x, h_{GOOD,OS,2}) - \tilde f_{BAD}(x, h_{BAD,OS,2})\big)\, \ln\frac{\tilde f_{GOOD}(x, h_{GOOD,OS,2})}{\tilde f_{BAD}(x, h_{BAD,OS,2})}

and \tilde f_{GOOD}, \tilde f_{BAD} are the kernel density estimates of the scores of good and bad clients.
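The discretized estimate above can be sketched as follows. The bandwidths are plain inputs here (the slides use an optimal-bandwidth rule), and the off-support guard is an assumption of this sketch, since f_IV is undefined where either density is zero:

```python
import math

# Sketch: Epanechnikov KDE for goods and bads, then composite
# trapezoidal integration of f_IV over a grid of M+1 points.
def epan_kde(data, h):
    def f(x):
        s = 0.0
        for d in data:
            u = (x - d) / h
            if abs(u) <= 1.0:
                s += 0.75 * (1.0 - u * u)
        return s / (len(data) * h)
    return f

def iv_kernel(goods, bads, h_g, h_b, grid):
    fg, fb = epan_kde(goods, h_g), epan_kde(bads, h_b)

    def f_iv(x):
        g, b = fg(x), fb(x)
        if g <= 0.0 or b <= 0.0:     # guard: skip points off either support
            return 0.0
        return (g - b) * math.log(g / b)

    vals = [f_iv(x) for x in grid]
    step = (grid[-1] - grid[0]) / (len(grid) - 1)
    return step * (vals[0] / 2 + sum(vals[1:-1]) + vals[-1] / 2)
```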
410
Information value
Information statistic/value (Ival) – discrete case:
• Create score intervals – typically deciles. Denote the number of good (bad) clients in the i-th interval by g_i (b_i), with n = \sum_i g_i and m = \sum_i b_i.
• It must hold that g_i > 0 and b_i > 0 for all i.
• Then we obtain

I_{val} = \sum_i \Big(\frac{g_i}{n} - \frac{b_i}{m}\Big) \ln\frac{g_i/n}{b_i/m}.
411
Information value
Information value for two example scoring models:

SC 1:
decile # clients # bad clients # good % bad [1] % good [2] [3] = [2] - [1] [4] = [2] / [1] [5] = ln[4] [6] = [3] * [5] cum. [6]
1 100 35 65 35,0% 7,2% -0,28 0,21 -1,58 0,44 0,44
2 100 16 84 16,0% 9,3% -0,07 0,58 -0,54 0,04 0,47
3 100 8 92 8,0% 10,2% 0,02 1,28 0,25 0,01 0,48
4 100 8 92 8,0% 10,2% 0,02 1,28 0,25 0,01 0,49
5 100 7 93 7,0% 10,3% 0,03 1,48 0,39 0,01 0,50
6 100 6 94 6,0% 10,4% 0,04 1,74 0,55 0,02 0,52
7 100 6 94 6,0% 10,4% 0,04 1,74 0,55 0,02 0,55
8 100 5 95 5,0% 10,6% 0,06 2,11 0,75 0,04 0,59
9 100 5 95 5,0% 10,6% 0,06 2,11 0,75 0,04 0,63
10 100 4 96 4,0% 10,7% 0,07 2,67 0,98 0,07 0,70
All 1000 100 900 Info. Value 0,70
SC 2:
decile # clients # bad clients # good % bad [1] % good [2] [3] = [2] - [1] [4] = [2] / [1] [5] = ln[4] [6] = [3] * [5] cum. [6]
1 100 20 80 20,0% 8,9% -0,11 0,44 -0,81 0,09 0,09
2 100 18 82 18,0% 9,1% -0,09 0,51 -0,68 0,06 0,15
3 100 17 83 17,0% 9,2% -0,08 0,54 -0,61 0,05 0,20
4 100 15 85 15,0% 9,4% -0,06 0,63 -0,46 0,03 0,22
5 100 12 88 12,0% 9,8% -0,02 0,81 -0,20 0,00 0,23
6 100 6 94 6,0% 10,4% 0,04 1,74 0,55 0,02 0,25
7 100 4 96 4,0% 10,7% 0,07 2,67 0,98 0,07 0,32
8 100 3 97 3,0% 10,8% 0,08 3,59 1,28 0,10 0,42
9 100 3 97 3,0% 10,8% 0,08 3,59 1,28 0,10 0,52
10 100 2 98 2,0% 10,9% 0,09 5,44 1,69 0,15 0,67
All 1000 100 900 Info. Value 0,67
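The decile computation in the tables above can be reproduced directly from the discrete-case formula; the zero-cell replacement constant is the common practical fix mentioned later in this section:

```python
import math

# Sketch: decile-based Information value from goods/bads counts.
def info_value(goods, bads):
    n, m = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        # common practical fix: replace a zero cell by a small constant
        g, b = max(g, 1e-4), max(b, 1e-4)
        iv += (g / n - b / m) * math.log((g / n) / (b / m))
    return iv
```

Applied to the SC 1 counts above it reproduces Ival ≈ 0.70.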
412
Information value
Denoting

I_{diff,i} = \frac{g_i}{n} - \frac{b_i}{m}, \qquad I_{LR,i} = \ln\frac{g_i/n}{b_i/m},

we obtain

I_{val} = \sum_i I_{diff,i} \cdot I_{LR,i}.
[Figure: per-decile values of I_diff and I_LR, and of I_diff · I_LR with its cumulative sum, for SC 1 and SC 2.]

SC 1: K-S = 0.34, Gini = 0.42, Lift20% = 2.55, Lift50% = 1.48, Ival = 0.70, Ival20% = 0.47, Ival50% = 0.50
SC 2: K-S = 0.36, Gini = 0.42, Lift20% = 1.90, Lift50% = 1.64, Ival = 0.67, Ival20% = 0.15, Ival50% = 0.23
413
Information value
Ival for normally distributed scores

Assume that the scores of good and bad clients are normally distributed, i.e. we can write their densities as

f_1(x) = \frac{1}{\sigma_g \sqrt{2\pi}}\, e^{-\frac{(x-\mu_g)^2}{2\sigma_g^2}}, \qquad f_0(x) = \frac{1}{\sigma_b \sqrt{2\pi}}\, e^{-\frac{(x-\mu_b)^2}{2\sigma_b^2}}.

Assume first that the standard deviations are equal to a common value \sigma. Then

I_{val} = D^2, \qquad D = \frac{\mu_g - \mu_b}{\sigma}.

Generally (i.e. without the assumption of equal standard deviations):

I_{val} = \frac{1}{2}\Big(\frac{\sigma_g^2}{\sigma_b^2} + \frac{\sigma_b^2}{\sigma_g^2} - 2\Big) + \frac{(\mu_g - \mu_b)^2}{2}\Big(\frac{1}{\sigma_g^2} + \frac{1}{\sigma_b^2}\Big).
414
We can see a quadratic dependence on the difference of the means.
Ival takes quite high values when both variances are approximately equal and smaller than or equal to 1, and it grows to infinity if the ratio of the variances tends to infinity or to zero.

Ival for normally distributed scores
Set \mu_b = 0 and \sigma_b^2 = 1. There is a very strong dependence on \mu_g; moreover, the value of Ival tends to infinity very quickly as \sigma_g^2 approaches zero.
Ival for normally distributed scores
Empirical estimate of Ival
In practice, however, computational problems can occur. The Information value index becomes infinite when some of the n_{0j} or n_{1j} are equal to 0. When this arises, there are several practical procedures for preserving finite results; for example, one can replace the zero count of goods or bads by a small constant such as 0.0001. The choice of the number of bins is also very important. In the literature, and also in many applications in credit scoring, the value r = 10 is preferred.
418
Empirical estimate of Ival
The score bands are constructed using the empirical quantile function \hat F_0^{-1} appropriate to the empirical cumulative distribution function of the scores of bad clients.

Empirical estimate with supervised interval selection (ESIS)
We want to avoid zero values of n_{0j} or n_{1j}. We propose to require at least k observations of the scores of both good and bad clients in each interval, where k is a positive integer. The interval boundaries are set as q_j = \hat F_0^{-1}(jk/n_0), j = 1, \dots, n_0/k - 1.
419
Using the quantile function of the scores of bad clients is motivated by the assumption that the number of bad clients is less than the number of good clients.
If n_0 is not divisible by k, the intervals have to be adjusted, because the last interval would contain fewer than k scores of bad clients. In this case we merge the last two intervals.
Furthermore, we need to ensure that the number of scores of good clients is as required in each interval. To do so, we compute n_{1j} for all current intervals. If we obtain n_{1j} < k for the j-th interval, we merge this interval with its neighbour on the right side.
This can be done for all intervals except the last one. If n_{1j} < k for the last interval, then we merge it with its neighbour on the left side, i.e. we merge the last two intervals.
420
Empirical estimate with supervised interval selection
The choice of k is very important. If we choose too small a value, we get an overestimated Information value, and vice versa. A reasonable compromise appears to be an adjusted square root of the number of bad clients.
The estimate of the Information value is then given by the discrete-case formula, where n_{0j} and n_{1j} correspond to the observed counts of good and bad clients in the intervals created according to the described procedure.
421
Empirical estimate with supervised interval selection
Simulation results
Consider n clients, 100·p_B % of bad clients with f_0 : N(\mu_0, \sigma_0^2) and 100·(1 − p_B) % of good clients with f_1 : N(\mu_1, \sigma_1^2). Because of normality (and \sigma_0 = \sigma_1 = 1) we know I_{val} = (\mu_1 - \mu_0)^2.
Consider the following values of the parameters:
n = 100 000, n = 1000
μ0 = 0
σ0 = σ1 = 1
μ1 = 0.5, 1, 1.5
pB = 0.02, 0.05, 0.1, 0.2

1) Scores of bad and good clients were generated according to the given parameters.
2) The estimates \hat I_{val,DEC}, \hat I_{val,KERN}, \hat I_{val,ESIS} were computed.
3) Squared errors were computed.
4) Steps 1)–3) were repeated one thousand times.
5) The MSE was computed.
423
Simulation results
MSE of the estimates (columns: p_B = 0.02, 0.05, 0.1, 0.2):

n = 100 000, μ1 − μ0 = 0.5
IV_decil 0,000546 0,000310 0,000224 0,000168
IV_kern 0,000487 0,000232 0,000131 0,000076
IV_esis 0,000910 0,000384 0,000218 0,000127

n = 100 000, μ1 − μ0 = 1.0
IV_decil 0,006286 0,004909 0,004096 0,002832
IV_kern 0,003396 0,001697 0,001064 0,000646
IV_esis 0,002146 0,000973 0,000477 0,000568

n = 100 000, μ1 − μ0 = 1.5
IV_decil 0,056577 0,048415 0,034814 0,020166
IV_kern 0,019561 0,010789 0,006796 0,004862
IV_esis 0,013045 0,008134 0,007565 0,027943

n = 1000, μ1 − μ0 = 0.5
IV_decil 0,025574 0,040061 0,026536 0,009074
IV_kern 0,038634 0,017547 0,009281 0,004737
IV_esis 0,038331 0,021980 0,016280 0,008028

n = 1000, μ1 − μ0 = 1.0
IV_decil 0,186663 0,084572 0,043097 0,029788
IV_kern 0,117382 0,072381 0,045344 0,032131
IV_esis 0,150881 0,071088 0,036503 0,023609

n = 1000, μ1 − μ0 = 1.5
IV_decil 1,663859 1,037778 0,535180 0,200792
IV_kern 0,529367 0,349783 0,266912 0,196856
IV_esis 0,609193 0,352151 0,172931 0,194676

(Highlighting in the original slide marked the worst, average and best performance.)
424
Simulation results
Adjusted empirical estimate with supervised interval selection (AESIS)
Clearly, the choice of the parameter k is absolutely crucial. The questions therefore are:
Is the choice optimal (with respect to the MSE)?
What influence does n_0 have on the optimal k?
And what influence, if any, does the difference of the means \mu_1 - \mu_0 have?

Consider 10 000 clients, 100·p_B % of bad clients with f_0 : N(\mu_0, 1) and 100·(1 − p_B) % of good clients with f_1 : N(\mu_1, 1). Set \mu_0 = 0 and consider \mu_1 = 0.5, 1 and 1.5. Define MSE = E((\hat I_{val} - I_{val})^2) and let k_{MSE} denote the value of k minimizing the MSE.
426
Simulation results
Dependence of the MSE on k for \hat I_{val,AESIS}, \mu_1 - \mu_0 = 1. The highlighted circles correspond to the values of k where the minimal MSE is obtained; the diamonds correspond to the values of k given by the proposed rule.

k_{MSE} (rows: μ1 − μ0; columns: p_B = 0.02, 0.05, 0.1, 0.2)
0.5: 29 42 62 84
1:   12 18 23 32
1.5:  6  9  8  9

k from the proposed rule (same layout)
0.5: 31 45 61 84
1:   12 17 24 32
1.5:  7 10 14 19
427
Simulation results
ESIS.1
Algorithm for the modified ESIS:
1) Start with an empty sequence of boundaries, q = [].
2) Compute q_{j1} = \hat F_1^{-1}(k/n_1).
3) Compute q_{j0} = \hat F_0^{-1}(k/n_0).
4) Add s_{max} = \max(q_{j0}, q_{j1}) to the sequence, i.e. q = [q, s_{max}].
5) Erase all scores less than or equal to s_{max}.
6) While n_0 and n_1 are greater than 2k, repeat steps 2)–5).
7) Set q = [\min(score) - 1, q, \max(score)] and compute \hat I_{val,ESIS.1} by the discrete-case formula on the resulting intervals.
428
ESIS.2
With the original ESIS, the computed intervals are often merged in the second phase of the algorithm. Only \hat F_0^{-1} is used for the construction. For the condition n_{11} > k to be satisfied, the boundary of the first interval clearly has to be greater than \hat F_1^{-1}(k/n_1). This leads to the idea of constructing the intervals using \hat F_1^{-1} first and then, from some score value s onward, using \hat F_0^{-1}.
A suitable score value for this purpose appears to be the value s_0 at which the score densities intersect, the difference of the score distribution functions attains its maximal value, and the function f_IV takes the value zero:
point of intersection of the densities = point of maximal difference of the CDFs = point of zero value of f_IV.
429
ESIS.2
Algorithm for the modified ESIS:
1) Find s_0 = \arg\max_s |\hat F_0(s) - \hat F_1(s)|.
2) Compute q_{1j} = \hat F_1^{-1}(jk/n_1) for j = 1, \dots, \lfloor n_1 \hat F_1(s_0)/k \rfloor.
3) Compute q_{0j} = \hat F_0^{-1}(jk/n_0) for j = \lceil n_0 \hat F_0(s_0)/k \rceil, \dots, n_0/k - 1.
4) Set q = [\min(score) - 1, q_1, q_0, \max(score) + 1].
5) Merge intervals given by q_1 where the number of bads is less than k.
6) Merge intervals given by q_0 where the number of goods is less than k.
Then compute \hat I_{val,ESIS.2} by the discrete-case formula on the resulting intervals.
ESIS.2
AESIS.2 – Simulation results
Consider 1000, 10 000 and 100 000 clients, 100·p_B % of bad clients with f_0 : N(\mu_0, 1) and 100·(1 − p_B) % of good clients with f_1 : N(\mu_1, 1). Set \mu_0 = 0, consider \mu_1 = 0.5, 1 and 1.5, and MSE = E((\hat I_{val} - I_{val})^2).

k_{MSE} (rows: μ1 − μ0; columns: p_B = 0.02, 0.05, 0.1, 0.2)

n = 1000
0.5: 29 51 77 112
1:   15 24 28  45
1.5:  6 11 11  14
15 23 32 45

n = 10 000
0.5: 15 19 22 45
1:    3  8 11 16
1.5:  2  3  6  7
5 8 10 15

n = 100 000
0.5: 118 198 298 371
1:    50  61 106 141
1.5:  17  28  32  48
5 8 10 15
Simulation results
Dependence of the MSE on k for \hat I_{val,AESIS.2}, n = 10 000, together with the k given by the proposed adjusted-square-root rule.

k_{MSE} (rows: μ1 − μ0; columns: p_B = 0.02, 0.05, 0.1, 0.2)
0.5: 38 60 85 120
1:   15 23 32  45
1.5:  8 13 18  26

[Figure: MSE as a function of k for (n, p_B) = (1000, 0.2), (100 000, 0.05) and (10 000, 0.2).]
Process Flow
Explore Data → Data Cleansing → Initial Characteristic Analysis (KGB) → Preliminary Scorecard (KGB) → Reject Inference → Initial Characteristic Analysis (AGB) → Final Scorecard (AGB) → Validate
435
Validation
Why?
To confirm that the model is robust and applicable to the subject population.
Holdout sample
70/30, 80/20, or random samples of 50–80%
Two methods
Compare statistics for development versus validation.
Compare the distributions of goods and bads for development versus validation.
436
Validation – Comparing Statistics
Fit Statistic Label Training Validation Test
_AIC_ Akaike's Information Criterion 6214.0279153 . .
_ASE_ Average Squared Error 0.0301553132 0.0309774947 .
_AVERR_ Average Error Function 0.1312675287 0.1355474611 .
_DFE_ Degrees of Freedom for Error 23609 . .
_DFM_ Model Degrees of Freedom 7 . .
_DFT_ Total Degrees of Freedom 23616 . .
_DIV_ Divisor for ASE 47232 45768 .
_ERR_ Error Function 6200.0279153 6203.7361993 .
_FPE_ Final Prediction Error 0.0301731951 . .
_MAX_ Maximum Absolute Error 0.9962871546 0.9959395534 .
_MSE_ Mean Square Error 0.0301642541 0.0309774947 .
_NOBS_ Sum of Frequencies 23616 22884 .
_NW_ Number of Estimate Weights 7 . .
_RASE_ Root Average Sum of Squares 0.1736528525 0.1760042464 .
_RFPE_ Root Final Prediction Error 0.1737043324 . .
_RMSE_ Root Mean Squared Error 0.1736785944 0.1760042464 .
_SBC_ Schwarz's Bayesian Criterion 6270.5156734 . .
_SSE_ Sum of Squared Errors 1424.295752 1417.777979 .
_SUMW_ Sum of Case Weights Times Freq 47232 45768 .
_MISC_ Misclassification Rate 0.0320121951 0.0325117986 .
_PROF_ Total Profit for GB 3430000 2730000 .
_APROF_ Average Profit for GB 145.24051491 119.29732564 .
If stats are similar, then scorecard is validated. 437
Validation – Compare Distributions

[Figure: validation chart — cumulative % of goods and bads vs. score for the development and validation samples (legend: Good-Dev, Good-Val, Bad-Dev, Bad-Val).]

The scorecard is valid if there is no significant difference.
Validation
Common reasons for failing validation:
Characteristics with large score ranges,
Concentration of a certain type of attribute in one sample (for example, non-random sampling),
Small sample sizes.
Validation
Comparison with the old scorecard
Month-by-month comparison of the performance of the old and the new scorecard, both for the development and the hold-out sample – on a given segment.

[Figure: “Mobiles: Model performance” — Gini coefficient (higher is better) by month of first due date, 2005–2007, for the actual and the new scorecard, fraud part and defaulter part.]
Validation
Power on fresh data
Use fresh data and compute "softer" good/bad definitions (e.g. 1_30, 1_60 instead of 1_90). Measure the power of the scorecard on the development sample according to these definitions and compare it with the performance on the fresh data.
Comparison with real default
Month-by-month comparison of the average PD predicted by the new scorecard and the real default rate, for both the development and hold-out samples. Diagonal test – score on the x-axis and real default rate on the y-axis; the graph of average default should ideally be monotonic (the higher the score, the lower the default rate).
[Figure: graph of the diagonal test — number of contracts (log scale), average default and average score by score band.]
Validation
Comparison with real default
Graph of the predicted PD versus the real default rate.

[Figure: segment “1 – audio-video”, monthly 200606–200706 — number of contracts, default rate and score.]
10. Cutoff, RAROA, Monitoring

Possible rejection scales – cutoff
The cutoff value determines the threshold at which a credit application is approved/rejected.
The following rejection scales can be used:
PD – Probability of Default
CRE – Credit Risk Expenses (Czech: KRN, Kreditní Rizikové Náklady)
Margin
RAROA
…
444
Cutoff on the PD scale
cutoff = 0.1 (i.e. all applicants with a probability of default greater than 10% are rejected)

[Figure: score distributions of SC1 and SC2 with the cutoff marked.]
• For SC1 the reject rate is 22%.
• For SC2 the reject rate is 33%.
445
Strategy curve

Acceptance rate = 1 - F(s)
Bad acceptance rate = p_B (1 - F_B(s))
Actual bad rate = \frac{p_B (1 - F_B(s))}{1 - F(s)}

When a new scoring function is deployed, the current setting of the approval process (the current cutoff) is typically represented by a point O lying above the new strategy curve. The question then is which direction to take when setting the new cutoff. If we move to point A, we keep the proportion of approved bad clients but increase the overall proportion of approved clients. Moving to point B, we approve the same proportion of clients but reduce the proportion of approved bad clients, and hence also the bad rate. Moving to point C, we keep the bad rate while increasing the proportion of approved clients.
446
Setting the cutoff to maximize profit
Profit is a random variable defined as:

0 if the loan is rejected,
L if the loan is approved and the client turns out to be good,
−D if the loan is approved and the client turns out to be bad.

Denote by p_G and p_B the proportions of good and bad clients in the population. Let q(G|s) (q(B|s)) denote the conditional probability that a client with score s will be good (bad), where q(G|s) + q(B|s) = 1, and let p(s) be the proportion of the population with score s.
The expected profit when approving clients with score s is

E(\text{profit} \mid s) = p(s)\,\big(L\, q(G|s) - D\, q(B|s)\big).

Hence, to maximize profit, we have to approve those clients whose score satisfies the condition

\frac{q(G|s)}{q(B|s)} > \frac{D}{L}.
447
Setting the cutoff to maximize profit
Let A denote the set of scores where the preceding condition is satisfied. The expected profit per client is then

E(\text{profit}) = \sum_{s \in A} p(s)\,\big(L\, q(G|s) - D\, q(B|s)\big).

If L and D moreover depend on the score s, the situation is somewhat more complicated; see Thomas et al. (2002) for details.
448
Setting the cutoff to maximize profit
Points on the lower part of the curve correspond to higher cutoff values, and hence to fewer accepted bad clients, while points on the upper part correspond to lower cutoff values, i.e. more accepted bad clients. The efficient frontier is therefore the lower part of the curve from point C to point D.
If the current setting of the approval process corresponds to point O, we again have the option of moving onto the curve corresponding to the new scoring function. The first option is to keep the proportion of approved bad clients, i.e. to move to point A. The second option is to keep the overall proportion of approved clients, i.e. to move to point B. Clearly, moving to point A is not a good choice, because this point does not lie on the efficient frontier and the same expected profit can easily be achieved with a lower expected loss.
449
Definition of CRE (Czech: KRN)

CRE = ((1 − Recovery) * SUM(PD * Loss)) / (Expected Average Volume)
Profit = (Interest rate − CRE) * Expected Average Volume

[Figure: lent volume and the loss remaining at each instalment, with the probability of default at that instalment number: 1 (.06), 2 (.02), 3 (.02), 4 (.02), 5 (.02), 6 (.02), 7 (.02), 8 (.02), 9 (.02), 10 (.03).]

The probability of default depends strongly on the scoring function.
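The two formulas above translate directly into code; all inputs in the test (recovery, per-instalment PDs and losses, volume, rate) are illustrative numbers, not from the slides:

```python
# Sketch of the CRE and profit formulas from the slide.
def cre(recovery, pds, losses, expected_avg_volume):
    # (1 - Recovery) * SUM(PD * Loss) / Expected Average Volume
    expected_loss = sum(p * l for p, l in zip(pds, losses))
    return (1.0 - recovery) * expected_loss / expected_avg_volume

def profit(interest_rate, cre_value, expected_avg_volume):
    # (Interest rate - CRE) * Expected Average Volume
    return (interest_rate - cre_value) * expected_avg_volume
```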
Recovery (= Late collection (LC))

Recovery by number of the defaulted instalment and score band:
Defaulted instalment   band1  band2  band3  band4
1st                     20%    25%    30%    35%
2nd–4th                 50%    55%    60%    65%
5th and later           75%    80%    85%    90%

[Figure: LC in month 60 — late-collection curves by defaulted instalment and score band, including the estimated part.]
Cutoff on the CRE scale

[Figure: % of rejected contracts vs. CRE, and CRE of production — legend: rejection level, number of contracts, number of contracts rejected at TK, original rejection level, cumulative CRE, original CRE.]
Cutoff on the CRE scale

[Figure: % of rejected contracts vs. CRE, and CRE of production — legend: rejection level, original rejection level, original CRE, cumulative CRE. Annotations: cutoff set to keep the rejection level – 18% of contracts rejected; cutoff set to keep the CRE level – 15.1% of contracts rejected.]
(Expected) Margin
(Expected) Margin = Interest rate (incl. fees) − CRE − OPEX
Interest rate
The effective rate of the ideal cash flow (−loan amount − fees; annuity; annuity; …; annuity).
CRE
See above.
OPEX
The cost of funds.
Overhead costs, variable costs, support of the sales network.
Costs of administrators – own employees handling the loan processing.

Margin
The optimal cutoff: margin = 0.
455
RAROA
(Risk Adjusted Return On Assets)
Advantages of RAROA

        Case A                     Case B
      Ideal flow  Expected flow   Ideal flow  Expected flow
      -1000       -1000           -1000       -1000
 1     400         200             150         110
 2     400         180             150         100
 3     400         170             150          90
 4     400         160             150          80
 5                                 150          70
 6                                 150          60
 7                                 150          50
 8                                 150          40
 9                                 150          30
10                                 150          16
11                                 150          10
12                                 150           0

• A – a short-term loan with a high risk of fraud
• B – a long-term loan with a high risk of default
Interest rate (A) = 22%
Interest rate (B) = 10%
Loan A is better, because it yields a higher profit (710 > 656), which is moreover achieved much earlier.
CRE (A) = 44%
CRE (B) = 20%
→ a cutoff on the CRE scale prefers B
Margin (A) = −22%
Margin (B) = −10%
→ a cutoff on the margin scale prefers B
RAROA (A) = −0.29
RAROA (B) = −0.36
→ a cutoff on the RAROA scale prefers A
460
Cutoff segmentation
Possible segmentation by:
Sales network (group of points of sale)
Product profitability
Quality of the point of sale
Type of goods (for consumer loans)
Loan amount
…
461
Cutoff scenarios

Cutoff impact evaluation
Evaluation of the reject rate, profitability, default and loss rates before and after a cutoff change, according to the distribution channel or the segment of the scorecard.
Cutoff impact evaluation table
Before Christmas (approved credits) After Christmas (approved credits)
Reject
rate
RAROA Loss rate Profit (per year) Reject rate RAROA Loss rate Profit (per year)
Segment 1 24.7% 3.65% 11.33% 414 363 110 24.3% 3.75% 11.19% 428 757 430
Segment 2 12.1% 4.01% 8.22% 160 364 072 12.9% 3.95% 8.29% 159 917 943
Segment 3 45.1% 9.64% 9.69% 747 636 468 45.1% 9.8% 9.5% 758 966 512
Segment 4 22.2% 5.80% 4.89% 52 213 720 20.1% 5.62% 5.05% 51 715 263
Segment 5 20.9% 6.77% 5.41% 54 312 614 19.7% 6.61% 5.48% 53 975 903
Segment 6 33.4% 7.04% 7.22% 212 090 365 32.6% 7.04% 7.16% 211 684 371
Segment 7 49.3% 9.30% 8.93% 36 840 287 49.2% 9.4% 8.8% 37 140 165
Segment 8 19.3% 4.68% 2.96% 15 668 962 14.9% 4.54% 3.16% 15 636 910
Segment 9 32.0% 8.41% 5.06% 3 679 430 27.2% 7.97% 5.26% 3 535 809
Segment 10 33.4% 7.14% 6.69% 1 823 050 341 33.4% 7.2% 6.6% 1 832 986 599
Segment 11 28.5% 6.34% 7.36% 2 633 609 071 28.6% 6.47% 7.24% 2 651 352 740
ALL 32.6% 6.64% 8.37% 6 153 828 440 32.6% 6.96% 8.17% 6 205 669 645
Cutoff impact evaluation
463
Cutoff sensitivity analysis
Profitability, default and loss rates according to the reject rate, in one graph.

[Figure: characteristics of approved credits according to the reject rate — profit (per year), RAROA and loss rate versus the reject rate.]

Decision
Reasoning why the final cutoffs were chosen.
464
Monitoring

[Figure: “Stability of the scoring function – weeks” — Gini and K-S on the development sample and in weeks 2006-13 to 2006-22.]

           dev. sample [1]  week 1 [2]  [3]=[2]-[1]  [4]=[2]/[1]  [5]=ln[4]  [6]=[3]*[5]
score_1    10,00%           5,63%       -0,044       0,563        -0,574     0,025
score_2    10,00%          11,21%        0,012       1,121         0,114     0,001
score_3    10,00%          11,00%        0,010       1,100         0,095     0,001
score_4    10,00%          10,97%        0,010       1,097         0,092     0,001
score_5    10,00%          10,31%        0,003       1,031         0,031     0,000
score_6    10,00%          10,12%        0,001       1,012         0,012     0,000
score_7    10,01%           9,62%       -0,004       0,961        -0,039     0,000
score_8    10,00%           9,89%       -0,001       0,989        -0,011     0,000
score_9    10,00%          10,31%        0,003       1,031         0,030     0,000
score_10   10,00%          10,94%        0,009       1,095         0,091     0,001
PSI 0,030
465
Monitoring of scoring models
It is not surprising that predictive models behave best, in the statistical sense, on the development data sample. The outputs of these models, e.g. a client's score or rating, are computed by formulas whose coefficients for the independent variables (predictors) were derived on the development sample. A shift in the distribution of a model's output is then caused precisely by a change of the model's input values, i.e. the predictors, over time. Essentially immediately (at least in most cases) after a predictive model is deployed in practice, its predictive power drops somewhat, caused by some change of the model's inputs. What is crucial in practice is to set up processes that reveal that this is happening, why it is happening, and how serious a problem it ultimately is.
466
Monitoring of scoring models
There are several factors causing a shift in the distribution of the predictors, and consequently a shift in the distribution of the predictive model's output:
A natural shift in the data / a change of its demographic structure
Database errors
A change of the data source
A change of the definition/format of the input data
A change of the data universe
Others
467
Monitoring of scoring models
A typical example of the first reason is the client's income (a general trend is that the population's income grows). A change of the definition/format of the input data means, for example, a situation where the list of values an input variable can take is extended. A change of the data universe means the case where the developed predictive model is used, for example, for a different/new segment of the portfolio or a different/new product.
468
Monitoring of scoring models
K-S, Gini:

[Figure: “Stability of the scoring function – weeks” — Gini and K-S on the development sample and in weeks 2006-13 to 2006-22.]
469
Monitoring of scoring models
The steeper the curve, the better.
It flattens over time – the question is by how much.

[Figure: “Dependence of default on score” — default rate by score decile, on the development sample and in weeks 2006-13 to 2006-22.]

Monitoring of scoring models
c-statistic:

[Figure: c-statistic over time.]
471
Monitoring of scoring models
We want to assess whether the distribution of scores on the development sample differs from the distribution of scores in a given time interval:

\chi^2 = \sum_{i=1}^{r} \frac{(O_i - E_i)^2}{E_i}, \qquad PSI = \sum_{i=1}^{r} (O_i - E_i) \ln\frac{O_i}{E_i}.
472
Monitoring of scoring models

           dev. sample [1]  week 1 [2]  [3]=[2]-[1]  [4]=[2]/[1]  [5]=ln[4]  [6]=[3]*[5]
score_1    10,00%           5,63%       -0,044       0,563        -0,574     0,025
score_2    10,00%          11,21%        0,012       1,121         0,114     0,001
score_3    10,00%          11,00%        0,010       1,100         0,095     0,001
score_4    10,00%          10,97%        0,010       1,097         0,092     0,001
score_5    10,00%          10,31%        0,003       1,031         0,031     0,000
score_6    10,00%          10,12%        0,001       1,012         0,012     0,000
score_7    10,01%           9,62%       -0,004       0,961        -0,039     0,000
score_8    10,00%           9,89%       -0,001       0,989        -0,011     0,000
score_9    10,00%          10,31%        0,003       1,031         0,030     0,000
score_10   10,00%          10,94%        0,009       1,095         0,091     0,001
PSI 0,030
473
Monitoring of scoring models
PSI < 0,1 indicates no or only a very small difference between the score distributions.
0,1 ≤ PSI ≤ 0,25 means that some shift of the distribution has occurred, but not a significant one.
PSI > 0,25 signals a significant shift in the score distribution, i.e. we reject the hypothesis that the distributions are equal.
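The PSI formula can be sketched directly from the band shares; the test reproduces the PSI ≈ 0,030 from the monitoring table above (expected = development-sample shares, observed = shares in the monitored week):

```python
import math

# Sketch: Population Stability Index over r score bands.
# expected = band shares on the development sample,
# observed = band shares in the monitored period.
def psi(expected, observed):
    return sum((o - e) * math.log(o / e)
               for e, o in zip(expected, observed))

dev = [0.10] * 10
week = [0.0563, 0.1121, 0.1100, 0.1097, 0.1031,
        0.1012, 0.0962, 0.0989, 0.1031, 0.1094]
```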
474
Monitoring of scoring models

[Figure: PSI and the chi-squared statistic in weeks 2006-13 to 2006-22.]
475
Monitoring of scoring models
Analogously for default rates per score band:

PSI_{DR} = \sum_{i=1}^{r} (DR_{i2} - DR_{i1}) \ln\frac{DR_{i2}}{DR_{i1}},

where DR_{i1} and DR_{i2} are the default rates in band i on the development sample and in the monitored period.

         def_rate  Gini   PSI_DR  PSI    chi-squared
sample    7,69%    0,643
200613    9,38%    0,564  0,120   0,030  0,024
200614    9,35%    0,542  0,131   0,034  0,027
200615    8,70%    0,537  0,093   0,032  0,025
200616    8,57%    0,523  0,089   0,033  0,026
200617    8,59%    0,540  0,071   0,030  0,025
200618    9,19%    0,544  0,111   0,030  0,024
200619    8,03%    0,558  0,063   0,034  0,026
200620    8,52%    0,552  0,055   0,023  0,019
200621    8,05%    0,555  0,043   0,027  0,022
200622    7,76%    0,539  0,039   0,045  0,034
Monitoring of scoring models

[Figure: default rate, PSI_DR, PSI, chi-squared and Gini on the development sample and in weeks 2006-13 to 2006-22.]
477
Champion–challenger
The champion–challenger strategy became widely used in the 1990s. The principle is very simple. Suppose there is some way of doing something (e.g. the currently used scoring model for approving/rejecting loan applications). We call this way the champion. However, there exist one or more alternative ways of achieving the same (or a very similar) goal. These we call the challengers. On a random sample we test the challengers and compare them with the champion. This allows us not only to compare the effectiveness of the challengers and the champion, but also to identify the existence and extent of side effects. The result may then be the finding that one of the challengers is better than the champion, and this challenger becomes the new champion.
478
11. References

Literature – books
Anderson, R. (2007). The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk
Management and Decision Automation, Oxford: Oxford University Press.
Giudici, P. (2003). Applied Data Mining: statistical methods for business and industry,
Chichester : Wiley.
Han, J., Kamber, M. (2006). Data mining: Concepts and Techniques, 2nd ed. San Francisco:
Morgan Kaufmann.
Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, New York: Springer-Verlag.
Hosmer, D. W., Lemeshow S. (2000). Applied Logistic Regression, Textbook and Solutions
Manual , 2nd ed., New York: John Wiley and Sons.
481
Literature – books
Siddiqi, N. (2006). Credit Risk Scorecards: developing and implementing intelligent
credit scoring, New Jersey: Wiley.
Thomas, L.C. (2009). Consumer Credit Models: Pricing, Profit, and Portfolio, Oxford:
Oxford University Press.
Thomas, L.C., Edelman, D.B., Crook, J.N. (2002). Credit Scoring and Its
Applications, Philadelphia: SIAM Monographs on Mathematical Modeling and
Computation.
Wilkie, A.D. (2004). Measures for comparing scoring systems, In: Thomas, L.C.,
Edelman, D.B., Crook, J.N. (Eds.), Readings in Credit Scoring. Oxford: Oxford
University Press, pp. 51-62.
Witten, I.H., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and
Techniques, San Francisco: Morgan Kaufmann.
482
Literature – journals
Crook, J.N., Edelman, D.B., Thomas, L.C. (2007). Recent developments in
consumer credit risk assessment. European Journal of Operational Research,
183 (3), 1447-1465
Hand, D.J. and Henley, W.E. (1997). Statistical Classification Methods in
Consumer Credit Scoring: a review. Journal. of the Royal Statistical Society,
Series A., 160,No.3, 523-541.
Harrell, F.E., Lee, K.L. and Mark, D.B. (1996). Multivariate prognostic models:
issues in developing models, evaluating assumptions and adequacy, and
measuring and reducing errors. Statistics in Medicine, 15, 361-387.
Lilliefors, H.W. (1967). On the Komogorov-Smirnov test for normality with
mean and variance unknown. Journal of the American Statistical Association, 62,
399-402.
Nelsen, R.B. (1998). Concordance and Gini's measure of association. Journal
of Nonparametric Statistics, 9(3), 227-238.
Newson, R. (2006). Confidence intervals for rank statistics: Somers' D and
extensions. The Stata Journal, 6(3), 309-334.
Somers, R.H. (1962). A new asymmetric measure of association for ordinal
variables. American Sociological Review, 27, 799-811.
Thomas, L.C. (2000). A survey of credit and behavioural scoring: forecasting
financial risk of lending to consumers. International Journal of Forecasting,
16(2), 149-172.
Literature - web
Coppock, D.S. (2002). Why Lift?, DM Review Online,
www.dmreview.com/news/53291.html
Xu, K. (2003). How has the literature on Gini's index evolved in the past 80 years?,
www.economics.dal.ca/RePEc/dal/wparch/howgini.pdf
Xin Ming Tu, Wan Tang (2006). Categorical Data Analysis.
http://www.urmc.rochester.edu/smd/biostat/people/faculty/TuSite/bst466/handouts.htm
Jiawei Han and Micheline Kamber (2006). Data Mining: Concepts and Techniques.
http://www.cs.illinois.edu/~hanj/bk2/
Jens Peter Dittrich (2007). Data warehousing.
http://www.dbis.ethz.ch/education/ss2007/07_dbs_datawh/Data_Mining.pdf
Joe Carthy (2006). Data Warehousing.
http://www.csi.ucd.ie/staff/jcarthy/home/DataMining/DM-Lecture02-01.ppt
Jan Spousta (?). Přednášky k data miningu [lectures on data mining]. [cited 19.03.2009] http://samba.fsv.cuni.cz/~soukup
Other interesting sources of information
http://www.cs.uiuc.edu/homes/hanj/
http://www-users.cs.umn.edu/~kumar/
http://www.kdnuggets.com/
http://www.kdnuggets.com/datasets/competitions.html
http://www.crc.man.ed.ac.uk/conference/
http://www.crc.man.ed.ac.uk/conference/archive/
http://www.kmining.com/info_conferences.html
http://en.wikipedia.org/wiki/Data_mining
http://cs.wikipedia.org/wiki/Data_mining
http://en.wikipedia.org/wiki/Credit_scorecards
Useful data sources
http://archive.ics.uci.edu/ml/
http://kdd.ics.uci.edu/
http://sede.neurotech.com.br:443/PAKDD2009/
http://www.dataminingbook.com/
http://www.stat.uni-muenchen.de/service/datenarchiv/welcome_e.html
www.kaggle.com