Matemat»-1 *
Martin R
Contents:
1. Introduction to data mining: basic concepts, CRISP-DM, SEMMA.
2. Data organization, introduction to SQL.
3. Data preparation: cleaning, categorization, aggregation, transformation (WOE), introduction to the SAS data step.
4. Exploratory analysis, data visualization, contingency tables.
5. Regression, logistic regression I.
6. Credit scoring (CS): history, basic concepts.
7. Methodology of scoring function development.
8. Data preparation II.
9. Evaluation of a predictive model: LC (ROC), Gini, KS, Lift.
10. Setting the cut-off. RAROA, CRE. Monitoring.
11. References.
1. Introduction to data mining
What is data mining?
• Data mining (DM) is an analytical methodology for extracting non-trivial, hidden, and potentially useful information from data.
Applications
• Banking: approval of loans/credit cards
  • Prediction of good customers.
• Insurance: approval of insurance contracts
  • Estimating the probability of an insured event/the amount of the claim.
• CRM (marketing):
  • Identifying customers who intend to switch to a competitor.
  • Cross-selling.
  • Up-selling.
• Targeted marketing:
  • Identifying likely respondents to an offer.
• Fraud detection: telecommunications, financial transactions, insurance fraud
  • Online/offline identification of fraudulent behavior.
• Medicine: effectiveness of medical care
  • Analysis of a patient's history (previous illnesses and their course): finding relationships between illnesses.
• Pharmacy: identification of new drugs
• Scientific data analysis:
  • Identification of new galaxies.
• Web design:
  • Finding the relationship between a visitor and the pages, and changing the appearance of the pages accordingly.
• Recognition of handwritten text, speech, and images.
• Supermarkets:
  • Identification of goods purchased together.
• Industry:
  • Automatic readjustment of controls when process parameters change.
• Sport:
  • NBA: optimization of game strategy.
• and others...
Applications: placement of goods in supermarkets
• Goal: identify goods that are bought together by a sufficient number of customers.
• Result: if a customer buys diapers and milk, they will very likely buy beer as well.
[Figure: one possible interpretation.]
Only an experienced analyst is able to interpret the results of an analysis correctly.
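A rule such as "diapers and milk, therefore beer" is usually judged by its support (how often all the items occur together) and its confidence (how often the consequent appears when the antecedent does). The sketch below shows both measures on a handful of invented baskets. The data and the function names `support` and `confidence` are illustrative assumptions, not part of the lecture.

```python
# Hypothetical toy baskets; the items echo the example on the slide.
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from basket counts."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"diapers", "milk", "beer"}, baskets)
rule_confidence = confidence({"diapers", "milk"}, {"beer"}, baskets)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A high confidence alone says nothing about why the items co-occur, which is where the analyst's interpretation comes in.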
Data mining and the principle of induction
• Deduction preserves valid relationships:
  1. Horses are mammals.
  2. All mammals have lungs.
  3. Therefore all horses have lungs.
• Induction adds information:
  1. All horses observed so far have lungs.
  2. Therefore all horses have lungs.
The problem with induction
• From valid facts we can derive a false statement (model).
• Example:
  • European swans are white.
  • Induction: "Swans are white" as a general rule.
  • With the discovery of Australia, black swans were discovered as well...
• Problem: the set of observations was not random and therefore not representative.
http://cs.wikipedia.org/wiki/Labu%C5%A5_%C4%8Dern%C3%A1
Data mining: support for business decisions
[Figure: pyramid of layers; the potential to support business decisions increases towards the top. Layers from top to bottom:]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from the top of the pyramid down: End User, Business Analyst, Data Analyst, DBA
History of the name
1960 Data Fishing, Data Dredging:
• used by statisticians
1989 Knowledge Discovery (KD, KDD):
• used by the artificial intelligence and machine learning community
1990 Data Mining (DM):
• used in the commercial sphere and by the database community
Other names: Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Data mining: a necessity?
The largest databases in the world in 2005:
• Max Planck Inst. for Meteorology
• Yahoo
• AT&T
In 2008:
• Max Planck Inst. for Meteorology
• Yahoo
Data mining: a necessity?
• Terabytes, 10^12 bytes: data of retail chains, banks, ...
• Petabytes, 10^15 bytes: geographic data
• Exabytes, 10^18 bytes: national databases of health records
• Zettabytes, 10^21 bytes: databases of meteorological images
• Yottabytes, 10^24 bytes: video databases
Why data mining? Why now?
• Data are being produced.
• Data are being stored.
• Computing power is available.
• Computing power is affordable.
• Competitive pressure is very strong.
• Commercial products (DM software) are available.
Data mining vs. statistical analysis
Data mining
• Originally developed for expert systems that automatically solve given problems.
• Places less emphasis on a precise understanding of the method used.
  • If something makes sense, let's use it!
• No assumptions about the data.
• Works even for very large data sets.
• Requires understanding the problem from both the data and the business point of view.
Statistical analysis
• The statistical correctness of the model is tested.
  • Are the statistical assumptions of the model satisfied?
• Hypothesis testing.
• Interval estimates.
• Works with a sample of values.
• Standard methods are not optimized for large data sets.
• Requires advanced statistical knowledge.
Data mining
• A process of (semi-)automatic analysis of (large) databases to identify relationships that are:
  • valid: they hold on new data with some degree of certainty
  • novel: previously unknown
  • useful: they can be put to some practical use
  • understandable: the relationship found can (always) be explained somehow
What data mining is not
• Brute-force bulk processing of data.
• Blind application of algorithms.
• Searching for relationships where none exist.
Known ≠ interesting
Interesting relationships are those that differ from general expectations.
Data mining pays off precisely by discovering previously unknown and surprising relationships.
Sell milk and cereal together!
Relationship with other disciplines
[Figure: data mining at the intersection of the following fields.]
• Database technologies
• Machine learning
• Visualization
• Information technologies
• Other scientific disciplines
Data mining: the process
[Figure: the process as a sequence of steps.]
Databases → data cleaning, data integration → Data Warehouse → data selection, data transformation → relevant data → data mining → verification of relationships
Data Mining Methodology (2007)
Which methodology do you use for data mining?
Methodology (votes)	Share
CRISP-DM (63)	42%
Own (29)	19%
SEMMA (19)	13%
KDD Process (11)	7%
Company-specific (8)	5%
Other (20)	14%
CRISP-DM
(CRoss Industry Standard Process for Data Mining)
1. business understanding
2. data understanding
3. data preparation
4. modeling
5. model evaluation
6. deployment of the model into the business process
http://community.udayton.edu/provost/it/training/documents/SPSS_CRISPWPlr.pdf
SEMMA
(Sample, Explore, Modify, Model, Assess)
• Sample: identify suitable training data and determine an appropriate extent of the data, both in terms of the time window and the number of cases. It is also recommended to split the data into 3 groups:
  • Training: used to develop the model.
  • Validation: used to evaluate the model and to guard against overfitting.
  • Test: used for the final evaluation of the model. We are mainly interested in how well the model behaves on data disjoint from the data on which it was developed.
• Explore: prepare descriptive statistics that give a basic idea of the content and quality of the underlying data. Use visualization techniques to reveal hidden trends and dependencies in the data.
• Modify: based on the previous step, consolidate the data and derive new variables. Then transform the data into a form suitable for modeling.
• Model: build the model. Frequently used techniques include neural networks, decision trees, and logistic models.
• Assess: evaluate the model's performance and, where appropriate, deploy the model in practice.
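The three-way split recommended in the Sample step can be sketched in a few lines. The 60/20/20 proportions, the fixed seed, and the function name `split_dataset` are assumptions chosen for illustration; the lecture does not prescribe them.

```python
import random

def split_dataset(rows, train=0.6, valid=0.2, seed=42):
    """Shuffle the rows and cut them into training, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_train = round(len(rows) * train)
    n_valid = round(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])    # whatever is left becomes the test set

cases = list(range(100))                 # stand-in for 100 customer records
train_set, valid_set, test_set = split_dataset(cases)
print(len(train_set), len(valid_set), len(test_set))
```

The three parts are disjoint by construction, which is what makes the test set usable for the final evaluation.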
Phases of the DM process (1 & 2)
Business understanding:
• Setting the business goals.
• Setting the data mining goals.
• Setting the success criteria.
Data understanding:
• Exploring the data and verifying their quality.
• Finding outliers.
Phases of the DM process (3)
Data preparation:
• Usually takes over 90% of the total time.
• Data collection
• Consolidation and cleaning
  • Linking tables, aggregation, missing values, ...
• Selection
  • Ignore useless data?
  • Outlying observations?
  • Data sampling?
  • Visualization tools.
• Transformation: creating new derived variables
Phases of the DM process (4)
Modeling (model building)
• The choice of suitable modeling techniques depends on the data mining goals that were set.
• Modeling is mostly an iterative process intertwined with data preparation.
• The approach differs for "supervised" and "unsupervised" learning.
Basic approaches to modeling
• Predictive: a mathematical model that predicts (with a certain accuracy) the future value/behavior of some quantity (entity).
  • Regression/classification
  • Time series analysis
• Descriptive: a mathematical model that describes historical events and the assumed or real links between them.
  • Cluster analysis
  • Association rules
  • Detection of deviations/breaks
  • Factor analysis/principal component analysis
Classification
Based on known data about "old" customers and their payment behavior, we are to predict the creditworthiness of a new loan applicant.
[Figure: data on previous customers (age, income, occupation, residence, customer type) feed a classifier; the classifier yields a decision rule (e.g. income > x), which is applied to the data of a new applicant to label the applicant as good or bad.]
Classification methods
• Goal: predict the class $C_i = f(x_1, x_2, \dots, x_n)$
• Regression (linear or polynomial):
  • $a x_1 + b x_2 + c = C_i$
• Nearest neighbor methods (KNN)
• Decision trees
• Probabilistic models (GLM), e.g. logistic regression
• Discriminant analysis (LDA, ...)
• Neural networks
• Support vector machines (SVM)
• Bayesian models
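Of the methods listed, the nearest neighbor classifier is the easiest to write down in full. The following is a minimal sketch, assuming Euclidean distance and simple majority voting. The applicant data (age, monthly income in thousands) and the name `knn_predict` are invented for the example.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Return the majority class among the k training points closest to `query`."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical past applicants: (age, monthly income in thousands) -> class
past = [
    ((25, 20), "bad"), ((30, 22), "bad"), ((28, 25), "bad"),
    ((45, 60), "good"), ((50, 55), "good"), ((40, 65), "good"),
]
print(knn_predict(past, (42, 58)))
print(knn_predict(past, (27, 21)))
```

In practice the features would be rescaled first, since raw Euclidean distance lets the variable with the largest range dominate.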
Descriptive modeling
• The basic goal is to obtain comprehensive and easily understandable information from the available data.
• Sometimes it is part of the exploratory analysis that precedes predictive modeling; sometimes building a descriptive model is the main goal of the DM project.
Cluster analysis
• We are to find groups/clusters of existing customers based on payment history so that similar clients end up in the same group/cluster.
• Basic requirement: a good similarity measure (http://cs.wikipedia.org/wiki/ShluKOva_analyza).
[Figure: clusters of customers; x-axis: months since card activation (1 to 24).]
Source: NEPIL, M. Data mining v praxi. Brno: MU v Brně, 2007, pp. 25-38.
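To make the role of the similarity measure concrete, here is a minimal k-means sketch that uses Euclidean distance as the (dis)similarity measure. The algorithm choice, the two-dimensional toy points, and the starting centroids are assumptions made for illustration; the lecture does not say which clustering method was used for the figure.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by two numeric features
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Swapping `math.dist` for another distance changes which clients count as "similar", and with it the clusters.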
Supervised vs. unsupervised learning
Supervised learning:
• Supervision: the data (observations, measurements, etc.) are labeled with predefined/known classes.
• New/test data are then assigned to these classes.
• From the point of view of causality, the model defines a relationship between input data and output data.
Unsupervised learning:
• No classes are defined in advance.
• The goal is to establish that some classes exist in the given data.
• From the point of view of causality, all data are treated as outputs. We model the dependence of the data on some unknown hidden variables.
Phases of the DM process (5)
Model evaluation:
• Evaluating the model: how it behaves on test data.
  • Methods and criteria depend on the type of model:
  • E.g. a confusion matrix for classification models, mean error for regression models, ...
• Interpreting the model: the importance and difficulty of interpretation depend heavily on the chosen modeling algorithm.
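The two evaluation criteria mentioned for this phase can be illustrated on toy data. This is a sketch only: the labels and numbers are invented, and mean absolute error is used here as one common reading of "mean error".

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical test-set labels for a good/bad classifier
actual    = ["good", "good", "bad", "bad",  "good", "bad"]
predicted = ["good", "bad",  "bad", "good", "good", "bad"]
cm = confusion_matrix(actual, predicted)
accuracy = (cm[("good", "good")] + cm[("bad", "bad")]) / len(actual)

# Hypothetical regression targets and predictions
mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(dict(cm), accuracy, mae)
```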
Phases of the DM process (6)
Deployment
• It must be determined how the results are to be used.
  • Who will use them?
  • How often will they be used?
• Deployment of data mining results by means of:
  • Scoring a database.
  • Using the results through business rules.
  • Interactive on-line scoring.
SAS: a brief introduction
2 basic SAS interfaces:
• SAS windowing environment
• SAS Enterprise Guide (GUI)
[Screenshot: SAS Enterprise Guide with the Project Tree and a Process Flow containing two queries and a List Data task.]
[Screenshot: SAS windowing environment with the Explorer window, the Results Viewer showing a map titled "Okresy Středních Čech" (Districts of Central Bohemia), and the Program editor window with the following code.]
/* simple example */
/* compute the population density */
data c_sobec;
   set czdata.c_sobec;
   hustota = obyvatel / (plocha + 1);
run;

/* set the graphics parameters and call the GMAP procedure */
goptions reset=all
         colors=(grayf0 graye0 grayd0 grayc0 grayb0 graya0 gray90 gray80
                 gray70 gray60 gray50 gray40 gray30 gray20 gray10)
         ftitle='arial' ctext=black;

proc gmap data=c_sobec map=czdata.c_sobec_map;
   id idobec;
   choro obyvatel;
run;
quit;
[Screenshot: SAS Enterprise Guide task list with tasks for creating and importing data, graphs (bar chart, box plot, pie chart, scatter plot, ...), time series (ARIMA modeling, basic forecasting, ...), and multivariate analysis (canonical correlation, cluster analysis, factor analysis, principal components).]
SAS Enterprise Guide (EG) Interface
• EG automatically generates code, which can then be edited further.
[Screenshot: program "BonusReport" in the EG program editor.]
data work.comp;
   set orion.sales;
   Bonus=500;
   Compensation=sum(Salary,Bonus);
   BonusMonth=month(Hire_Date);
   drop Gender Salary Job_Title Country Birth_Date;
   format Bonus Compensation dollar8. Hire_Date date9.;
   label Employee_ID="Employee ID"
         First_Name="First Name"
         Last_Name="Last Name"
         BonusMonth="Month of Bonus"
         Hire_Date="Hire Date";
run;

proc print data=work.comp label;
   title 'Bonus report for 2009';
run;
SAS Help
• Use the SAS Enterprise Guide Help facility or SAS OnlineDoc for additional direction on SAS Enterprise Guide or the SAS programming language. Go to support.sas.com and select Product Documentation → Base SAS.
Reproduced with permission of SAS Institute Inc., Cary, NC, USA.
SAS on the web
• Michal Kulich: Malý manuál uživatele SASu, http://www.karlin.m
• Phil Spector: An Introduction to the SAS System, http://www.stat.berkeley.edu/classes/s100/sas.pdf
• Patric McLeod: Introduction to SAS 9, http://www.unt.edu/rss/class/sas1/
• http://en.wikipedia.org/wiki/SAS_%28software%29
2. Data organization, introduction to SQL
History of data storage
In the past, data were stored in one large file that was accessed by indexed sequential methods. The file was indexed according to the expected kinds of queries. The major drawbacks were that information was repeated across records and that the types of queries were fixed in advance.
History of data storage
datum	jmeno	prijmeni	adresa_ulice	adresa_mesto	cislo_uctu	platba	zůstatek
980103	Jan	Novak	Dlouhá 5	Praha 1	9945371	100,00	100,00
980105	Jan	Novak	Dlouhá 5	Praha 1	9945371	1500,00	1600,00
980106	Jan	Novak	Dlouhá 5	Praha 1	9945371	-1500,00	50,00
980106	Karel	Nemec	Lucni 4	Praha 2	24867134	3000,00	6000,00
980107	Karel	Nemec	Lucni 4	Praha 2	24867134	-4000,00	2000,00
980108	Jan	Novak	Dlouhá 5	Praha 1	9945371	-150,00	-100,00
980111	Karel	Nemec	Lucni 4	Praha 2	24867134	5000,00	7000,00
Relational database
[Figure: three linked tables.]
klient: id_klient, jmeno, prijmeni, adresa_ulice, adresa_mesto
ucet: id_ucet, id_klient
transakce: id_transakce, id_ucet, datum, platba, zůstatek
SELECT klient.jmeno, klient.prijmeni, klient.adresa_ulice,
       klient.adresa_mesto, ucet.cislo_uctu, transakce.zůstatek
FROM klient, ucet, transakce
WHERE klient.id_klient = ucet.id_klient
  AND transakce.id_ucet = ucet.id_ucet
  AND transakce.zůstatek < 100
ORDER BY klient.adresa_mesto;
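The query above can be tried out without a database server. The sketch below uses Python's built-in sqlite3 module, loads part of the flat-file example into the three tables, and runs the same kind of join. Several details are assumptions made for the example: the surrogate key values (1, 2, 10, 20), the ASCII column name zustatek in place of zůstatek, the text type for cislo_uctu, and ordering by date.

```python
import sqlite3

con = sqlite3.connect(":memory:")   # throwaway in-memory database
con.executescript("""
CREATE TABLE klient (id_klient INTEGER PRIMARY KEY, jmeno TEXT, prijmeni TEXT,
                     adresa_ulice TEXT, adresa_mesto TEXT);
CREATE TABLE ucet (id_ucet INTEGER PRIMARY KEY,
                   id_klient INTEGER REFERENCES klient(id_klient),
                   cislo_uctu TEXT);
CREATE TABLE transakce (id_transakce INTEGER PRIMARY KEY,
                        id_ucet INTEGER REFERENCES ucet(id_ucet),
                        datum TEXT, platba REAL, zustatek REAL);
INSERT INTO klient VALUES (1, 'Jan', 'Novak', 'Dlouhá 5', 'Praha 1'),
                          (2, 'Karel', 'Nemec', 'Lucni 4', 'Praha 2');
INSERT INTO ucet VALUES (10, 1, '9945371'), (20, 2, '24867134');
INSERT INTO transakce VALUES
    (1, 10, '980103', 100, 100),   (2, 10, '980105', 1500, 1600),
    (3, 10, '980106', -1500, 50),  (4, 20, '980106', 3000, 6000),
    (5, 20, '980107', -4000, 2000), (6, 10, '980108', -150, -100);
""")

# Each client's details are now stored once; the join reassembles the flat view.
rows = con.execute("""
    SELECT klient.jmeno, klient.prijmeni, ucet.cislo_uctu, transakce.zustatek
    FROM klient
    JOIN ucet ON klient.id_klient = ucet.id_klient
    JOIN transakce ON transakce.id_ucet = ucet.id_ucet
    WHERE transakce.zustatek < 100
    ORDER BY transakce.datum
""").fetchall()
print(rows)
```

The explicit JOIN ... ON form is equivalent to listing the tables in FROM and putting the key comparisons in WHERE.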
Relational database
• A relational database is a database based on the relational model. The term often denotes not only the database itself but also its specific software implementation.
• A relational database is built on tables whose rows are usually understood as records; some of their columns (so-called foreign keys) are understood as storing information about relations, in the mathematical sense, between individual records.
• The term relational database was defined by Edgar F. Codd in 1970.
• Ways of posing queries:
  • QBE (query by example)
  • SQL (structured query language)
Relational database
• According to relational theory, all operations on data can be carried out using the basic operations (union, Cartesian product, difference, selection, and projection); other operations, such as join, are merely combinations of these five.
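How a join arises as a combination of simpler operations can be shown with relations modeled as lists of dictionaries. This is a sketch under stated assumptions: the helper names `select`, `project`, and `product` and the prefixed attribute names are invented for the example, and the attribute names of the two relations are assumed not to clash.

```python
def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection: keep only the named attributes and drop duplicate rows."""
    seen, result = set(), []
    for row in relation:
        reduced = tuple((a, row[a]) for a in attributes)
        if reduced not in seen:
            seen.add(reduced)
            result.append(dict(reduced))
    return result

def product(left, right):
    """Cartesian product: every row of `left` paired with every row of `right`."""
    return [{**l, **r} for l in left for r in right]

klient = [{"klient.id": 1, "jmeno": "Jan"}, {"klient.id": 2, "jmeno": "Karel"}]
ucet = [{"ucet.id_klient": 1, "cislo_uctu": "9945371"},
        {"ucet.id_klient": 2, "cislo_uctu": "24867134"}]

# Join = selection applied to the Cartesian product
joined = select(product(klient, ucet),
                lambda r: r["klient.id"] == r["ucet.id_klient"])
result = project(joined, ["jmeno", "cislo_uctu"])
print(result)
```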
Relational database
Database tables are the foundation of relational databases. Their columns are called attributes or fields; the rows of a table are records. Each attribute has a specific data type, its domain. A row is a cut across the columns of the table and holds the actual data. A specific table thus realizes a subset of the Cartesian product of the possible values of all columns: a relation.
Primary key
• A primary key is a unique identifier of a record, i.e. of a table row. It can be a single column or a combination of several columns, chosen so that uniqueness is guaranteed. The key fields must contain a value, i.e. the undefined empty value NULL must not occur in them. In practice, artificial keys are often used today: numeric or alphabetic identifiers where every new record receives an identifier different from those of all previous records (the uniqueness requirement). Usually these are integer sequences in which each new record gets, as a rule fully automatically, a number one higher than the last inserted record, so the numbering of records increases over time.
Foreign key
• Another important concept is the foreign key. Foreign keys express relationships between database tables. A foreign key is a field or group of fields that lets us identify which records from different tables belong together.
Relational database: relationships between tables
• Relationships tie together data that are related and stored in different database tables. In principle we distinguish four types of relationships.
• None: there is no connection between the data in the tables, so we define no relationship.
• 1:1 is used when a record corresponds to exactly one record in another database table and vice versa. Such a relationship is used only rarely, because there is usually no compelling reason not to put such records into a single table. One of its few uses is to make very wide tables easier to read. An illustration is the relationship driver - car: at a given moment (a discrete point in time) exactly one driver drives a car, and a car is driven by exactly one driver.
• 1:N assigns several records from another table to one record. This is the most widely used type of relationship, because it corresponds to many real-life situations. A real example is the relationship bus - passenger: at a given moment a passenger rides in exactly one bus, while several passengers can ride in one bus at the same time.
• M:N is less common. It allows several records from one table to be assigned to several records from the other table. In database practice this relationship is usually implemented, for practical reasons, as a combination of two relationships, 1:N and 1:M, that point into an auxiliary table made up of the combination of both keys (a third, so-called linking table). A real-life example could be the relationship product - property: a product can have several properties, and one property can belong to several products. In real life there are nevertheless a great many M:N relationships, partly because there is often a practical need to keep the history of these relationships over time (over a longer period one driver drives several different cars, and one car can have several different drivers).
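The M:N relationship product - property and its linking table can be sketched with sqlite3. The table and column names (`product`, `property`, `product_property`) and the sample rows are assumptions invented for the example. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite checks foreign keys only when asked to
con.executescript("""
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE property (property_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Linking table: two 1:N relationships plus a composite primary key
CREATE TABLE product_property (
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    property_id INTEGER NOT NULL REFERENCES property(property_id),
    PRIMARY KEY (product_id, property_id)
);
INSERT INTO product  VALUES (1, 'kettle'), (2, 'toaster');
INSERT INTO property VALUES (1, 'steel'), (2, 'cordless');
INSERT INTO product_property VALUES (1, 1), (1, 2), (2, 1);
""")

rows = con.execute("""
    SELECT p.name
    FROM product p
    JOIN product_property pp ON pp.product_id = p.product_id
    JOIN property r ON r.property_id = pp.property_id
    WHERE r.name = 'steel'
    ORDER BY p.name
""").fetchall()
print(rows)

# A link to a non-existent product violates the foreign key.
try:
    con.execute("INSERT INTO product_property VALUES (3, 1)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```

The composite primary key also prevents the same product-property pair from being stored twice.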
Glossary
□ ODS: Operational Data Store
□ DWH: Data Warehouse
□ Data Mart
□ Meta Data
□ BI: Business Intelligence
□ OLAP: On Line Analytical Processing
□ OLTP: On Line Transaction Processing
□ ETL: Extract, Transform, Load
□ ELT: Extract, Load, Transform
□ EAI: Enterprise Application Integration
□ ERP: Enterprise Resource Planning
□ DBMS: Database Management System
□ SQL: Structured Query Language
Glossary
ODS: Short for operational data store, a type of database that serves as an interim area for a data warehouse in order to store time-sensitive operational data that can be accessed quickly and efficiently. In contrast to a data warehouse, which contains large amounts of static data, an ODS contains small amounts of information that is updated through the course of business transactions. An ODS will perform numerous quick and simple queries on small amounts of data, such as acquiring an account balance or finding the status of a customer order, whereas a data warehouse will perform complex queries on large amounts of data. An ODS contains only current operational data while a data warehouse contains both current and historical data.
Data Mart: A database, or collection of databases, designed to help managers make strategic decisions about their business. Whereas a data warehouse combines databases across an entire enterprise, data marts are usually smaller and focus on a particular subject or department. Some data marts, called dependent data marts, are subsets of larger data warehouses.
Meta Data: Data about data. Metadata describes how and when and by whom a particular set of data was collected, and how the data is formatted. Metadata is essential for understanding information stored in data warehouses and has become increasingly important in XML-based Web applications.
SQL (sometimes pronounced letter by letter, "ess-cue-el", sometimes "sequel") is a standardized query language used for working with data in relational databases. SQL stands for Structured Query Language.
DWH: Abbreviated DW, a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. Development of a data warehouse includes development of systems to extract data from operating systems plus installation of a warehouse database system that provides managers flexible access to the data.
The term data warehousing generally refers to the combination of many different databases across an entire enterprise. Contrast with data mart.
BI: Most companies collect a large amount of data from their business operations. To keep track of that information, a business would need to use a wide range of software programs, such as Excel, Access and different database applications for various departments throughout their organization. Using multiple software programs makes it difficult to retrieve information in a timely manner and to perform analysis of the data.
The term Business Intelligence (BI) represents the tools and systems that play a key role in the strategic planning process of the corporation. These systems allow a company to gather, store, access and analyze corporate data to aid in decision-making. Generally these systems will illustrate business intelligence in the areas of customer profiling, customer support, market research, market segmentation, product profitability, statistical analysis, and inventory and distribution analysis to name a few.
A Database Management System (DBMS) is a set of computer programs that controls the creation, maintenance, and the use of a database. Details on http://en.wikipedia.org/wiki/Database_management_system
Glossary
OLAP: Short for Online Analytical Processing, a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, it provides time series and trend analysis views. OLAP often is used in data mining.
The chief component of OLAP is the OLAP server, which sits between a client and a database management system (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems.
OLTP: Short for On-Line Transaction Processing. Same as transaction processing.
Transaction processing: A type of computer processing in which the computer responds immediately to user requests. Each request is considered to be a transaction. Automatic teller machines for banks are an example of transaction processing.
The opposite of transaction processing is batch processing, in which a batch of requests is stored and then executed all at one time. Transaction processing requires interaction with a user, whereas batch processing can take place without a user being present.
ETL: Short for extract, transform, load, three database functions that are combined into one tool to pull data out of one database and place it into another database.
• Extract: the process of reading data from a database.
• Transform: the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
• Load: the process of writing the data into the target database.
ETL is used to migrate data from one database to another, to form data marts and data warehouses and also to convert databases from one format or type to another.
EAI: Acronym for enterprise application integration. EAI is the unrestricted sharing of data and business processes throughout the networked applications or data sources in an organization. Early software programs in areas such as inventory control, human resources, sales automation and management were designed to run independently, with no interaction between the systems. They were custom built in the technology of the day for a specific need being addressed and were often proprietary systems. As enterprises grow and recognize the need for their information and applications to have the ability to be transferred across and shared between systems, companies are investing in EAI in order to streamline processes and keep all the elements of the enterprise interconnected.
ERP: Short for enterprise resource planning, a business management system that integrates all facets of the business, including planning, manufacturing, sales, and marketing. As the ERP methodology has become more popular, software applications have emerged to help business managers implement ERP in business activities such as inventory control, order tracking, customer service, finance and human resources.
Data warehouse
• Definition (W.H. Inmon 1996): A data warehouse is a
> subject-oriented
> integrated
> time-variant
> non-volatile
collection of data that serves to support decision making.
Data warehouse
• the original concept dates from the early 1980s
• arose from the need for simple access to a structured store of high-quality data
• helps to obtain answers for better decision making
• allows the data to be used for querying, reporting, and analysis
Structure of a data warehouse
• three-tier architecture:
> data warehouse
> application layer
> presentation layer
• physically centralized or distributed
Data warehouse
[Figure: data flow from the pre-data-warehouse stage (OLTP server) through data cleansing into the data repositories (data warehouse, data mart, ODS) and on to front-end analytics (OLAP, data mining, data visualization, reporting).]
[Figure: data sources → data storage (data marts) → OLAP engine → front-end tools. © etl-tools.info]
[Figure: EAI & DWH System Configuration Diagram.]
SOHO: Abbreviation for small office/home office: a small or home office environment and the business culture associated with it.
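Data models
□ Star: a fact table F linked directly to the dimension tables D1, D2, D3.
□ Snowflake: a fact table F whose dimensions are split into further tables (D1.1, D1.2, D2.1, D2.2, D3.1, D3.2).
□ Starflake: a combination of the two; some dimensions stay flat (D1, D2), others are split (D3.1, D3.2).
□ Constellation: several fact tables (F1, F2) that share dimension tables.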
Example of a star schema
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country
Example of a snowflake schema
Sales Fact Table
   keys: time_key, item_key, branch_key, location_key
   measures: units_sold, dollars_sold, avg_sales
Dimension tables
   time: time_key, day, day_of_the_week, month, quarter, year
   item: item_key, item_name, brand, type, supplier_key (references supplier)
   supplier: supplier_key, supplier_type
   branch: branch_key, branch_name, branch_type
   location: location_key, street, city_key (references a separate city table)
Example of a constellation schema
Sales Fact Table
   keys: time_key, item_key, branch_key, location_key
   measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
   keys: time_key, item_key, shipper_key, from_location, to_location
   measures: dollars_cost, units_shipped
Dimension tables (time, item, and location are shared by both fact tables)
   time: time_key, day, day_of_the_week, month, quarter, year
   item: item_key, item_name, brand, type, supplier_type
   branch: branch_key, branch_name, branch_type
   location: location_key, street, city, province_or_street, country
   shipper: shipper_key, shipper_name, location_key, shipper_type
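Example of a data cube
[Figure: three-dimensional sales data cube with a date axis (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), a product axis (TV and other products, sum), and a country axis (USA, Canada, Mexico, sum). One highlighted cell is the total annual TV sales in the USA; the apex cell is (All, All, All).]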
Cuboids corresponding to the data cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base) cuboid: product, date, country
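The lattice of cuboids can be sketched in a few lines of Python. This is only an illustration: the fact rows, the measure `units_sold`, and the helper names `cuboid` and `all_cuboids` are made up for the example and do not come from the slides. Each cuboid is one GROUP BY over a subset of the dimensions, so three dimensions give 2^3 = 8 cuboids.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical base-cuboid rows: (product, date, country, units_sold)
FACTS = [
    ("TV", "1Qtr", "USA", 10),
    ("TV", "2Qtr", "USA", 12),
    ("TV", "1Qtr", "Canada", 4),
    ("PC", "1Qtr", "USA", 7),
    ("PC", "2Qtr", "Mexico", 3),
]
DIMENSIONS = ("product", "date", "country")


def cuboid(dims):
    """Sum units_sold grouped by the given subset of dimensions."""
    positions = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in FACTS:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


def all_cuboids():
    """Build every cuboid, from the 0-D apex to the 3-D base cuboid."""
    return {
        dims: cuboid(dims)
        for k in range(len(DIMENSIONS) + 1)
        for dims in combinations(DIMENSIONS, k)
    }


lattice = all_cuboids()
for dims, cells in lattice.items():
    print(len(dims), dims, cells)
```

The empty tuple of dimensions is the apex cuboid (a single grand total); the full tuple is the base cuboid with one cell per fact row.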
Typical OLAP Operations
□ Roll-up (drill-up): summarizing data
• Moving one level up in a hierarchy, or reducing a dimension (e.g. from a cube to a square).
□ Drill-down (roll-down): the opposite of roll-up; we want more detail
• Moving from a higher level of summarization to a lower one, or introducing new data dimensions.
□ Slice and dice:
• Selecting a subspace of the data.
□ Other operations:
• drill across: involving several data tables (cubes)
• drill through: going through the base level of the data cube back to the underlying relational tables (using SQL)
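As a rough illustration of roll-up, slice, and dice, the sketch below stores a tiny cube as a Python dictionary. The cell values and the function names `roll_up`, `slice_cube`, and `dice` are assumptions made for this example; a real OLAP server implements these operations very differently.

```python
from collections import defaultdict

AXES = ("product", "quarter", "country")

# Hypothetical cube cells: (product, quarter, country) -> units sold
CUBE = {
    ("TV", "1Qtr", "USA"): 10,
    ("TV", "2Qtr", "USA"): 12,
    ("TV", "1Qtr", "Canada"): 4,
    ("PC", "1Qtr", "USA"): 7,
    ("PC", "2Qtr", "Mexico"): 3,
}


def roll_up(cube, drop_axis):
    """Roll up by removing one dimension and summing over it."""
    i = AXES.index(drop_axis)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)


def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a single value."""
    i = AXES.index(axis)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Dice: keep the cells whose coordinates lie in the allowed sets."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[AXES.index(a)] in values for a, values in allowed.items())
    }


print(roll_up(CUBE, "quarter"))
print(slice_cube(CUBE, "country", "USA"))
print(dice(CUBE, product={"TV"}, quarter={"1Qtr"}))
```

Drill-down is the reverse direction of `roll_up`: it needs the more detailed cells, which is why the base cuboid (or the underlying relational tables) must be kept.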
OLAP Server Architectures
• Relational OLAP (ROLAP)
• Uses a relational or extended-relational DBMS to store and manage the warehouse data, plus an OLAP middle layer to supply the missing pieces.
• Includes the optimization options of the DBMS, an implementation of aggregation navigation logic, and additional tools and services.
• Multidimensional OLAP (MOLAP)
• Technology based on multidimensional arrays (including sparse-matrix techniques).
• Fast indexing of precomputed summarized data.
• Hybrid OLAP (HOLAP)
• Flexible for the user, i.e. low level: relational, high level: arrays.
• Specialized SQL servers
• Specialized support for SQL queries over star/snowflake schemas.
ROLAP
• The data are stored in a relational database. They are not duplicated, but they cannot be accessed without a connection to the source database.
• OLAP queries are translated into ordinary SQL queries, which can be a drawback (limited expressiveness of SQL, slower response).
• Suitable only for limited amounts of data.
MOLAP
• "Traditional" OLAP.
• The data are stored in multidimensional cubes outside the relational database. They are therefore duplicated and can be accessed even without a connection to the original data source.
• The main advantage is fast query response. Everything is precomputed and stored when the cubes are built.
HOLAP
• keeps the original data in relational tables and stores the aggregations in a multidimensional format
• links the large volumes of data held in relational tables
• gains the faster performance of multidimensionally stored aggregations
Building a Data Warehouse
• "big bang" method:
> analysis of the company's requirements
> creation of the enterprise data warehouse
> creation of data marts
• incremental (evolutionary) method
Loading the Data Warehouse
• initial load + regular updates
• loading by means of data pumps
• ETL procedures:
> extraction
> transformation
> loading
What is SQL?
The SQL procedure uses Structured Query Language to perform the following tasks:
• retrieve and manipulate SAS data sets
• create and delete SAS data sets
• generate reports
• add or modify values in a SAS data set
• add, modify, or drop columns in a SAS data set
Reproduced with permission of SAS Institute Inc., Cary, NC, USA.
Introduction to SQL
General form of an SQL procedure query to generate output:
PROC SQL;
   SELECT variables
      FROM SAS-data-set;
Introduction to SQL
• Create a listing report of product activity.
• Step 1: Invoke the SQL procedure.
proc sql;
• Step 2: Identify the variables to display on the report.
proc sql;
   select CustomerID, CustomerFirstName,
          CustomerLastName
Introduction to SQL
• Step 3: Identify the input data set.
proc sql;
   select CustomerID, CustomerFirstName, CustomerLastName
      from univ.mastercustomers;
• Step 4: End the procedure with a QUIT statement.
proc sql;
   select CustomerID, CustomerFirstName, CustomerLastName
      from univ.mastercustomers;
quit;
Introduction to SQL
• SQL joins have the following characteristics:
• They do not require sorted data.
• They can be performed on up to 32 data sets at one time.
• They allow complex matching criteria using the WHERE clause.
Introduction to SQL
• General form of an SQL procedure join to generate output:
PROC SQL;
   SELECT variables
      FROM SAS-data-set1 AS alias1,
           SAS-data-set2 AS alias2
      WHERE alias1.variable=alias2.variable;
Introduction to SQL
• Create a listing report by joining data sets univ.mastercustomers and univ.customerorders by CustomerID.
• Step 1: Invoke the SQL procedure and list the variables to display.
proc sql;
   select CustomerID, CustomerFirstName,
          CustomerLastName, OrderID,
          UnitPrice, Quantity
Introduction to SQL
• Step 2: Identify the data sets to join and provide a table alias for each.
Because CustomerID exists in both data sets, qualify it with a table alias to say which CustomerID to use.
   select m.CustomerID, CustomerFirstName,
          CustomerLastName, OrderID, UnitPrice, Quantity
      from univ.mastercustomers as m,
           univ.customerorders as c
Introduction to SQL
• Step 3: State the condition on which observations are matched and terminate the query.
proc sql;
   select m.CustomerID, CustomerFirstName,
          CustomerLastName, OrderID, UnitPrice, Quantity
      from univ.mastercustomers as m,
           univ.customerorders as c
      where m.CustomerID=c.CustomerID;
quit;
Introduction to SQL
Create a new variable named TotSale by multiplying Quantity by UnitPrice.
proc sql;
   select m.CustomerID, CustomerFirstName,
          CustomerLastName, OrderID, UnitPrice, Quantity,
          Quantity * UnitPrice as TotSale
      from univ.mastercustomers as m,
           univ.customerorders as c
      where m.CustomerID=c.CustomerID;
quit;
Introduction to SQL
• General form of a PROC SQL query to create a SAS data set:
PROC SQL;
   CREATE TABLE SAS-data-set AS
      SELECT ...
      other SQL clauses;
Introduction to SQL
• Join the tables univ.mastercustomers and univ.customerorders to create a new data set.
proc sql;
   create table work.ordertotals as
      select m.CustomerID,
             CustomerFirstName, CustomerLastName,
             OrderID, UnitPrice, Quantity,
             Quantity*UnitPrice as TotSale
         from univ.mastercustomers as m,
              univ.customerorders as c
         where m.CustomerID=c.CustomerID;
quit;
Introduction to SQL
• General form of an SQL procedure query using labels and formats:
PROC SQL;
   SELECT variable LABEL='column-header'
                   FORMAT=format
      FROM SAS-data-set;
Introduction to SQL
Enhance the previous report.
proc sql;
   select m.CustomerID,
          CustomerFirstName format=$10.,
          CustomerLastName format=$15.,
          OrderID,
          UnitPrice format=dollar7.2,
          Quantity,
          Quantity * UnitPrice as TotSale
             format=dollar8.2
             label='Total Sale Amount'
      from univ.mastercustomers as m,
           univ.customerorders as c
      where m.CustomerID=c.CustomerID;
quit;
Introduction to SQL
• Partial Output
Customer ID	Customer First Name	Customer Last Name	OrderID	Unit Price	Quantity	Total Sale Amount
062096	Craig	Knapmeyer	1240062267	$36.00	3	$108.00
062096	Craig	Knapmeyer	1240832690	$27.00	4	$108.00
062284	Robert	Britt	1238409388	$15.00	1	$15.00
062284	Robert	Britt	1238409388	$33.00	1	$33.00
064810	Randall	Goodman	1238248877	$175.00	4	$700.00
064810	Randall	Goodman	1238248877	$283.00	1	$283.00
064810	Randall	Goodman	1238273875	$220.00	1	$220.00
064810	Randall	Goodman	1238768955	$52.00	1	$52.00
064810	Randall	Goodman	1238842450	$24.00	1	$24.00
064810	Randall	Goodman	1239353817	$59.00	2	$118.00
064810	Randall	Goodman	1239489696	$11.00	2	$22.00
064810	Randall	Goodman	1239608721	$22.00	3	$66.00
064810	Randall	Goodman	1239608721	$46.00	3	$138.00
064810	Randall	Goodman	1240590287	$21.00	2	$42.00
Introduction to SQL
• General form of an SQL procedure query to generate summary output:
PROC SQL;
   SELECT group-variable,
          SUM(analysis-variable)
      FROM SAS-data-set
      GROUP BY group-variable;
• If a summary function is used in the SELECT clause with only one argument, then an overall statistic is calculated down the column.
Introduction to SQL
• Step 1: Identify the variables to display, the input data sets, and the matching criteria.
proc sql;
   select m.CustomerID,
          CustomerFirstName format=$10.,
          CustomerLastName format=$15.,
          sum(Quantity) label='Total Quantity',
          sum(Quantity*UnitPrice) as TotSale
             format=dollar12.2 label='Total Sale Amount'
      from univ.mastercustomers as m,
           univ.customerorders as c
      where m.CustomerID=c.CustomerID;
Introduction to SQL
• Step 2: Identify the grouping variable(s).
proc sql;
   select m.CustomerID,
          CustomerFirstName format=$10.,
          CustomerLastName format=$15.,
          sum(Quantity) label='Total Quantity',
          sum(Quantity*UnitPrice) as TotSale
             format=dollar12.2 label='Total Amount Purchased'
      from univ.mastercustomers as m,
           univ.customerorders as c
      where m.CustomerID=c.CustomerID
      group by m.CustomerID, CustomerFirstName, CustomerLastName;
quit;
Introduction to SQL
General form of an SQL procedure query to generate ordered output:
PROC SQL;
   SELECT group-variable,
          SUM(analysis-variable)
      FROM SAS-data-set
      GROUP BY group-variable
      ORDER BY variable-1 <, variable-2>;
The default is ascending order.
Introduction to SQL
• Order the report by total sale.
proc sql;
   select m.CustomerID,
          CustomerFirstName format=$10.,
          CustomerLastName format=$15.,
          sum(Quantity) label='Total Quantity',
          sum(Quantity*UnitPrice) as TotSale
             format=dollar12.2 label='Total Amount Purchased'
      from univ.mastercustomers as m,
           univ.customerorders as c
      where m.CustomerID=c.CustomerID
      group by m.CustomerID, CustomerFirstName, CustomerLastName
      order by TotSale;
quit;
Introduction to SQL
• Order the report by total sale, in descending order.
proc sql;
   select m.CustomerID,
          CustomerFirstName format=$10.,
          CustomerLastName format=$15.,
          sum(Quantity) label='Total Quantity',
          sum(Quantity*UnitPrice) as TotSale
             format=dollar12.2 label='Total Amount Purchased'
      from univ.mastercustomers as m,
           univ.customerorders as c
      where m.CustomerID=c.CustomerID
      group by m.CustomerID, CustomerFirstName, CustomerLastName
      order by TotSale desc;
quit;
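The join, GROUP BY, and ORDER BY steps above can be tried without SAS. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module instead of PROC SQL, two in-memory tables that only mimic univ.mastercustomers and univ.customerorders, and four order rows copied from the partial output shown earlier. SAS-specific options such as FORMAT= and LABEL= have no SQLite equivalent and are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mastercustomers (
        CustomerID TEXT, CustomerFirstName TEXT, CustomerLastName TEXT);
    CREATE TABLE customerorders (
        CustomerID TEXT, OrderID TEXT, UnitPrice REAL, Quantity INTEGER);
    INSERT INTO mastercustomers VALUES
        ('062096', 'Craig', 'Knapmeyer'),
        ('062284', 'Robert', 'Britt');
    INSERT INTO customerorders VALUES
        ('062096', '1240062267', 36.0, 3),
        ('062096', '1240832690', 27.0, 4),
        ('062284', '1238409388', 15.0, 1),
        ('062284', '1238409388', 33.0, 1);
""")

# Join on CustomerID, summarize per customer, largest total first.
rows = conn.execute("""
    SELECT m.CustomerID, CustomerFirstName, CustomerLastName,
           SUM(Quantity)             AS TotalQuantity,
           SUM(Quantity * UnitPrice) AS TotSale
    FROM mastercustomers AS m, customerorders AS c
    WHERE m.CustomerID = c.CustomerID
    GROUP BY m.CustomerID, CustomerFirstName, CustomerLastName
    ORDER BY TotSale DESC
""").fetchall()

for row in rows:
    print(row)
```

Every non-aggregated column in the SELECT list also appears in GROUP BY, mirroring Step 2 of the slides.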
Inner JOIN
• The INNER JOIN keywords can be used to join tables. The ON clause replaces the WHERE clause for specifying columns to join. PROC SQL provides these keywords primarily for compatibility with the other joins (OUTER, RIGHT, and LEFT JOIN). Using INNER JOIN with an ON clause provides the same functionality as listing tables in the FROM clause and specifying join columns with a WHERE clause.
proc sql;
   select p.country, barrelsperday 'Production', barrels 'Reserves'
      from sql.oilprod p inner join sql.oilrsrvs r
         on p.country = r.country
      order by barrelsperday desc;
proc sql outobs=6;
   title 'Oil Production/Reserves of Countries';
   select p.country, barrelsperday 'Production', barrels 'Reserves'
      from sql.oilprod p, sql.oilrsrvs r
      where p.country = r.country
      order by barrelsperday desc;
Left JOIN
• Outer joins are inner joins that are augmented with rows from one table that do not match any row from the other table in the join. The resulting output includes rows that match and rows that do not match from the joins source tables. Nonmatching rows have null values in the columns from the unmatched table. Use the ON clause instead of the WHERE clause to specify the column or columns on which you are joining the tables. However, you can continue to use the WHERE clause to subset the query result.
• A left outer join lists matching rows and rows from the left-hand table (the first table listed in the FROM clause) that do not match any row in the right-hand table. A left join is specified with the keywords LEFT JOIN and ON.
proc sql;
   select Capital format=$20., Name 'Country' format=$20.,
          Latitude, Longitude
      from sql.countries a left join sql.worldcitycoords b
         on a.Capital = b.City and a.Name = b.Country;
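To see what "nonmatching rows have null values" means in practice, here is a small sketch that again uses Python's sqlite3 module in place of PROC SQL. The two tables and their four rows are invented for the illustration (including the fictional country); only the join condition follows the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (Name TEXT, Capital TEXT);
    CREATE TABLE worldcitycoords (
        City TEXT, Country TEXT, Latitude INTEGER, Longitude INTEGER);
    -- Made-up rows: the second country has no matching city record.
    INSERT INTO countries VALUES
        ('Austria', 'Vienna'), ('Atlantis', 'Poseidonis');
    INSERT INTO worldcitycoords VALUES
        ('Vienna', 'Austria', 48, 16), ('Graz', 'Austria', 47, 15);
""")

JOIN_TEMPLATE = """
    SELECT a.Capital, a.Name, b.Latitude, b.Longitude
    FROM countries AS a
    {kind} JOIN worldcitycoords AS b
      ON a.Capital = b.City AND a.Name = b.Country
    ORDER BY a.Name
"""

# LEFT JOIN keeps every row of the left table; INNER JOIN keeps matches only.
left_rows = conn.execute(JOIN_TEMPLATE.format(kind="LEFT")).fetchall()
inner_rows = conn.execute(JOIN_TEMPLATE.format(kind="INNER")).fetchall()

print(left_rows)
print(inner_rows)
```

The unmatched capital survives the left join with `None` (SQL NULL) in the coordinate columns, while the inner join drops it.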
Right JOIN
• A right join, specified with the keywords RIGHT JOIN and ON, is the opposite of a left join: nonmatching rows from the right-hand table (the second table listed in the FROM clause) are included with all matching rows in the output.
proc sql outobs=10;
   select City format=$20., Country 'Country' format=$20.,
          Population
      from sql.countries right join sql.worldcitycoords
         on Capital = City and Name = Country
      order by City;
Inner/Full Outer/Left/Right JOIN
• A full outer join, specified with the keywords FULL JOIN and ON, selects all matching and nonmatching rows.
proc sql outobs=10;
   select City '#City#(WORLDCITYCOORDS)' format=$20.,
          Capital '#Capital#(COUNTRIES)' format=$20.,
          Population, Latitude, Longitude
      from sql.countries full join sql.worldcitycoords
         on Capital = City and Name = Country;
3. Data preparation: cleaning, categorization, aggregation, data transformation, introduction to the SAS DATA step
[Figure: chart by age (x-axis: age, 18 to 68); N = 7098 (g = 5559, b = 1539).]
Data Cleaning: Practical Experience
> If your new data contain more than 30 numbers, they almost certainly contain an error.
> Cleaning and preparing data usually takes 80-90% of the analyst's time.
> If you are VERY careful at this stage, you will save far more time and nerves later; otherwise you are building a house on sand.
GIGO
> Garbage in, garbage out
> Even the best model (process) can turn garbage into nothing but garbage.
What Poor-Quality Data Cause
> Overhead of managing poor-quality/redundant data
> Undelivered mail (marketing, invoicing)
> Incorrect processing results (reporting, analyses, data mining)
> Malfunctioning systems (incompatibility)
> Damaged image, dissatisfied clients
What Poor-Quality Data Cause
> In a mailing campaign run by a British retailer, one fifth of the addressees turned out to be already dead. Even so (or because of it?), they were sent mail with the greeting "Dear Mr Deceased". 1)
> An insurance company found that most of its customers had the occupation "Astronaut". Further investigation showed that "Astronaut" is the first option in the list in its CRM system. 1)
> Each year 44,000-98,000 Americans die from avoidable medical errors such as a slip when writing a prescription, a mislabeled blood test result, or illegible information in patient records. It is the eighth most common cause of death in the USA. 2)
> On 7 May 1999 US armed forces bombed the Chinese embassy in Yugoslavia. The investigation found that the CIA was using outdated maps and that, on top of this, a staff member supplied a wrong address because of an error in the data: "He literally drew the X in the wrong place". 3)
1) Peel, M.: Letters to the dead and other data dereliction. © 2007 Financial Times Deutschland. http://www.ftd.de, issue of 2 October 2007
2) Oash, J. (1999): IT Can Reduce Medical Errors. In: Wang, Pierce, Madnick: Information Quality, 2005
3) BBC: Americas Chinese embassy warning ignored. © 1999 BBC. http://news.bbc.co.uk/1/hi/world/americas/37775.stm, issue of 2 October 2007
Data Quality
> Profiling, DQ assessment: finding out what state the data are in
> Deduplication, clustering, unification, consolidation
> Prevention:
> Data governance: continuous care for data
> Master data management: a solution for managing key data
Data Cleaning: Verifying the File
□ Verifying the data file / data sources
> Are these the right data (time of origin, survey...)?
> Are they complete, free of duplicates, and readable?
□ Examining the cases
> Do they have identifiers?
• Are these IDs correct?
> Are any of them repeated (duplicates)?
• There are also "near" duplicates: two similar but not identical records about the same subject.
> Are any of them missing?
Data Cleaning: Verifying the Variables
□ Examining the metadata about the variables
> Are all variables present and correctly labeled?
> Is it clear what they mean (codebooks, definitions...)? Is the documentation in order?
• Beware of international studies, products of agency consortia, and repeated survey waves. Subtle differences in method can cause gross inconsistency!
> Is any variable repeated?
Data Cleaning: Exploring the Variables
□ Does the variable take admissible values (x out of range)?
□ "Odd" codes ("xxx", "9999", ...)
□ Duplicate codes for the same thing ("Ž", "ž", "žena", "zena", ...)
□ Character encoding of Czech/Russian/...
Data Cleaning: Exploring the Variables
□ Typos and similar errors
> Edit distances (the Levenshtein distance, named after Vladimir Iosifovich Levenshtein, and others) help reveal typos.
> Edit distance = the number of elementary editing steps needed to change one string into another. See also the material on the Levenshtein distance, which includes an applet that computes it.
> Clustering strings by edit distance
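For readers who want to experiment, the Levenshtein distance can be written as a short dynamic program. The sketch below is a generic textbook version, not the applet mentioned on the slide; the helper `near_duplicates` and the threshold of one edit are assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def near_duplicates(values, max_distance=1):
    """Pairs of distinct codes that differ by at most max_distance edits."""
    unique = sorted(set(values))
    return [
        (x, y)
        for i, x in enumerate(unique)
        for y in unique[i + 1:]
        if levenshtein(x, y) <= max_distance
    ]


codes = ["žena", "zena", "muž", "žena"]
print(near_duplicates(codes))
```

Pairs flagged this way are only candidates for merging; whether "zena" really is a miscoded "žena" still has to be confirmed against the codebook.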
Data Cleaning: Exploring the Variables
□ Merging similar categories (prodavač - prodejce - prodavačka, i.e. variants of "salesperson");
□ Sparse categories (e.g. Brazilian nationality): merge them with, or assign them to, one or more larger categories according to a suitable criterion.
□ Is the distribution in line with our expectations (range of values, variance, skewness, kurtosis, modal values...)? Is it, for example, too "truncated" or, conversely, too "stretched"?
> This is sometimes hard to recognize: e.g. age may be coded in one part of the data as the last two digits of the year of birth and in another part as 200y minus the year of birth.
Data Cleaning: Exploring the Variables
□ Clusters (clumping), typically around rounded values
> Income: people like to round upward.
> Or around the boundaries of age quotas, created when interviewers "adjust" respondents' ages so that they fit the quotas.
□ Missing values (causes, share, ...)!
□ Beware of codes for times (American vs. European convention), regions, etc.!
Data Cleaning: Relationships in the Data
□ Several variables
> Contingency tables, box plots by category, scatter plots and scatter-plot matrices, correlation coefficients
> Logical constraints (e.g. a 10-year-old cannot be married, a 30-year-old cannot have worked for 20 years, ...)
• Searching by program: express the conditions by means of mathematical logic and let the computer find the cases where they are not satisfied.
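Such logical checks are easy to automate. The following sketch encodes the two examples from the slide as rules over dictionaries; the field names, the sample records, and in particular the assumed minimum working age of 15 are illustrative assumptions, not part of the lecture.

```python
MIN_WORKING_AGE = 15  # assumption made for this example only

# Each rule returns True when a record VIOLATES the constraint.
RULES = {
    "child_married": lambda r: r["age"] <= 10
    and r["marital_status"] == "married",
    "worked_longer_than_possible": lambda r: r["years_worked"]
    > max(r["age"] - MIN_WORKING_AGE, 0),
}


def violations(records):
    """Return (record index, rule name) for every broken constraint."""
    return [
        (i, name)
        for i, record in enumerate(records)
        for name, is_broken in RULES.items()
        if is_broken(record)
    ]


records = [
    {"age": 10, "marital_status": "married", "years_worked": 0},
    {"age": 30, "marital_status": "single", "years_worked": 20},
    {"age": 45, "marital_status": "married", "years_worked": 20},
]
print(violations(records))
```

Keeping the rules in one table makes it simple to add further constraints and to report which rule each suspicious case broke.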
Data Cleaning: Relationships in the Data
□ Several variables
> Extreme values of a multivariate distribution
• Scatter plot
• Mahalanobis distance from the centroid: $[(x-t)^{T} S^{-1} (x-t)]^{1/2}$, where $t$ is the centroid vector, $x$ the examined point, and $S$ the covariance matrix
• e.g. P. Filzmoser (2004), A multivariate outlier detection method,
http://www.statistik.tuwien.ac.at/public/filz/papers/minsk04.pdf
> Other properties; e.g. do the expected correlations exist?
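A minimal two-dimensional sketch of the Mahalanobis distance follows. It assumes the sample covariance matrix (divisor n - 1) and inverts the 2x2 matrix by hand so that only the standard library is needed; the function names and the four sample points are made up for the illustration.

```python
import math


def centroid_and_covariance(points):
    """Centroid t and sample covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    tx = sum(p[0] for p in points) / n
    ty = sum(p[1] for p in points) / n
    sxx = sum((p[0] - tx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - ty) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - tx) * (p[1] - ty) for p in points) / (n - 1)
    return (tx, ty), (sxx, sxy, syy)


def mahalanobis(x, points):
    """sqrt((x - t)^T S^{-1} (x - t)) for a 2-D point x."""
    (tx, ty), (sxx, sxy, syy) = centroid_and_covariance(points)
    det = sxx * syy - sxy ** 2            # determinant of S, must be nonzero
    dx, dy = x[0] - tx, x[1] - ty
    # S^{-1} = 1/det * [[syy, -sxy], [-sxy, sxx]]
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


sample = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(mahalanobis((3, 1), sample))
```

Unlike the Euclidean distance, this measure rescales each direction by the spread of the data, so a point can be flagged even when each coordinate looks ordinary on its own.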
Data Cleaning: Relationships in the Data
□ Correct entry of data into the database
> a free-text field with the product name vs. a drop-down list with the product type
> the order of values in the drop-down list: the problem of the first (default) value
[Figure: stacked chart, 0% to 100%, of the categories VT, OT, NA, MT, FK, CT, BT by month (3 to 12, then 1 and 2).]
Data Cleaning: Outliers
[Figure: box plot legend: outlier; upper inner fence or maximum value; upper quartile; median; lower quartile; lower inner fence or minimum value; extreme value.]
> interquartile range: $q = x_{0.75} - x_{0.25}$
> inner fences: $x_{0.25} - 1.5q$, $x_{0.75} + 1.5q$
> outer fences: $x_{0.25} - 3q$, $x_{0.75} + 3q$
> An outlier lies between the inner and the outer fences, i.e. in the interval $(x_{0.75} + 1.5q,\ x_{0.75} + 3q)$ or in the interval $(x_{0.25} - 3q,\ x_{0.25} - 1.5q)$.
> An extreme value lies beyond the outer fences, i.e. in the interval $(x_{0.75} + 3q,\ \infty)$ or in the interval $(-\infty,\ x_{0.25} - 3q)$.
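The fence rules translate directly into code. The sketch below relies on `statistics.quantiles` with the "inclusive" method, which is only one of several common quartile definitions, so the fences may differ slightly from those of other software; the function names and the sample are assumptions for the example.

```python
import statistics


def fences(data):
    """Inner and outer fences built from the quartiles of the data."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    q = q3 - q1                                   # interquartile range
    return {
        "inner": (q1 - 1.5 * q, q3 + 1.5 * q),
        "outer": (q1 - 3 * q, q3 + 3 * q),
    }


def classify(value, data):
    """Label a value as 'regular', 'outlier', or 'extreme'."""
    limits = fences(data)
    lower_inner, upper_inner = limits["inner"]
    lower_outer, upper_outer = limits["outer"]
    if value < lower_outer or value > upper_outer:
        return "extreme"
    if value < lower_inner or value > upper_inner:
        return "outlier"
    return "regular"


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
print(fences(sample))
print({value: classify(value, sample) for value in (5, 20, 40)})
```

The label is only a flag for inspection; whether a flagged value is an error or a genuine observation has to be decided from the source of the data.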
Data Cleaning: Correcting Errors
□ Back to the sources!
□ Dropping suspicious cases:
> Deliberate fraud, e.g. unreliable interviewers (cluster analysis!).
> Unverifiable data.
□ Dropping suspicious values.
□ Recoding to correct values (imputation):
> imputation by the mean, median, max./min. value, or by a model.
Data Transformation
□ Binarization (dummy variables)
> Dummy variables are a technique that uses dichotomous variables (coded 0 or 1) to represent the individual values of nominal variables.
> The name "dummy" reflects the fact that the presence of the attribute coded 1 stands for a factor, or a set of factors, that cannot be measured in any better way within the given analysis.
Dummy Variables
□ A dummy variable assigns the value 1 to observations in the selected category of a variable and 0 to all other observations.
□ For sex (2 categories) it assigns, e.g., 1 for a woman and 0 for a man. In this case exactly one dummy variable is enough.
□ For race (4 categories), several dummy variables are needed.
P1 = 1 if race = "white" and 0 otherwise. P2 = 1 if race = "black" and 0 otherwise. P3 = 1 if race = "Asian" and 0 otherwise. P4 = 1 if race = "other" and 0 otherwise.
□ Important: not all 4 variables are included in the regression (that would cause perfect multicollinearity, P4 = 1 - P3 - P2 - P1).
□ Number of dummy variables = number of categories - 1.
□ The omitted variable is the "reference" variable.
□ The constant carries the information about this reference variable.
□ The coefficients of the included variables are interpreted relative to the constant.
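A small sketch of dummy coding with a reference category is shown below. The function name `dummy_encode`, the `P_` prefix, and the lowercase category labels are choices made for this example; statistical packages usually create such columns automatically.

```python
def dummy_encode(values, reference):
    """One 0/1 column per category except the reference category."""
    categories = sorted(set(values) - {reference})
    return [
        {f"P_{category}": int(value == category) for category in categories}
        for value in values
    ]


race = ["white", "black", "asian", "other", "white"]
rows = dummy_encode(race, reference="other")
for value, row in zip(race, rows):
    print(value, row)
```

An observation in the reference category gets zeros in every column, which is how the regression constant comes to represent that category.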
Data Transformation
□ Categorization of continuous variables
> deciles
□ Aggregation
□ Segmentation
Categorization of predictors
Every variable should be categorized (divided into a reasonable number of categories).
Best separation (default rates within categories differ as much as possible).
Time stability (the ordering of categories by default rate is the same in different periods of the development sample).
Variable: age_def
age_def	count	share	bad rate
21	35 059	8.2%	13.11%
23	32 401	7.5%	9.81%
26	41 807	9.7%	8.61%
29	38 510	9.0%	8.07%
32	36 271	8.4%	6.79%
36	44 648	10.4%	6.11%
41	50 015	11.6%	5.74%
45	40 099	9.3%	5.21%
51	54 526	12.7%	4.52%
60	56 551	13.2%	3.71%
Total	429 887	100.0%	6.79%
Info. Value: 0.1558
[Figure: share (bars) and bad rate (line) by age_def category; the values 0.2212 and 12.094 also appear on the slide without legible labels.]
Categorization of predictors
• We want to find out real statistical dependencies, not random differences in default.
Data Transformation - WOE
□ Good: total number of good clients in the sample
□ Bad: total number of bad clients in the sample
□ $good_i^s$, $bad_i^s$: number of good and bad clients, respectively, in the $i$-th category of the $s$-th variable
□ overall odds: $odds_{all} = \frac{good}{bad}$
□ odds of the $i$-th category of the $s$-th variable: $odds_i^s = \frac{good_i^s}{bad_i^s}$
□ odds ratio (OR): $odds\_ratio_i^s = \frac{odds_i^s}{odds_{all}}$
□ WOE (weights of evidence): $WOE_i^s = \ln(odds\_ratio_i^s) = \ln\left(\frac{good_i^s / bad_i^s}{good / bad}\right) = \ln\left(\frac{good_i^s / good}{bad_i^s / bad}\right)$
Data Transformation - WOE
cat.	# bad clients	#good clients	Def rate	odds	OR	% bad [1]	% good [2]	[3] = [2]/[1]	WOE = ln[3]
1	4	1	80,0%	0,25	0,03	40,0%	1,1%	0,03	-3,58
2	2	6	25,0%	3,00	0,33	20,0%	6,7%	0,33	-1,10
3	2	18	10,0%	9,00	1,00	20,0%	20,0%	1,00	0,00
4	1	12	7,7%	12,00	1,33	10,0%	13,3%	1,33	0,29
5	1	53	1,9%	53,00	5,89	10,0%	58,9%	5,89	1,77
All	10	90	10,0%	9,00					
All clients: 100
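The WOE column of the table can be recomputed from the counts of bad and good clients alone. The sketch below uses the five categories from the table; the names `COUNTS` and `woe_by_category` are introduced only for this example.

```python
import math

# category: (number of bad clients, number of good clients), from the table
COUNTS = {1: (4, 1), 2: (2, 6), 3: (2, 18), 4: (1, 12), 5: (1, 53)}


def woe_by_category(counts):
    """WOE_i = ln((good_i / good) / (bad_i / bad)) for every category i."""
    bad_total = sum(bad for bad, _good in counts.values())
    good_total = sum(good for _bad, good in counts.values())
    return {
        category: math.log((good / good_total) / (bad / bad_total))
        for category, (bad, good) in counts.items()
    }


woe = woe_by_category(COUNTS)
for category, value in woe.items():
    print(category, round(value, 2))
```

Negative WOE marks categories with a higher share of bad clients than the sample as a whole, positive WOE the opposite. A category with zero good or zero bad clients would make the logarithm undefined and has to be merged with a neighbor first.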
The SORT Procedure
• The SORT procedure rearranges the observations in work.qtr1salesrep and places them in order by descending Last_Name within Country.
libname orion 's:\workshop';
data work.qtr1salesrep;   /* DATA step body is collapsed on the slide */
proc sort data=work.qtr1salesrep;
   by Country descending Last_Name;
run;
• The OUT= option in the SORT procedure can be used to create an output data set, instead of overwriting the input data set.
The FORMAT Procedure
• The FORMAT procedure creates user-defined formats and informats, and stores them in the SAS catalog work.formats by default.
libname orion 's:\workshop';
data work.qtr1salesrep;   /* collapsed on the slide */
proc sort data=work.qtr1salesrep;   /* collapsed on the slide */
proc format;
   value $ctryfmt 'AU' = 'Australia'
                  'US' = 'United States';
run;
• More at: http://www2.sas.com/proceedings/sugi27/p056-27.pdf
The FORMAT Procedure
•Range(s) can be
• single values
• ranges of values
• lists of values.
•Labels
• can be up to 32,767 characters in length
• are typically enclosed in quotation marks, although it is not required.
Character User-Defined Format
proc format;
   value $ctryfmt 'AU' = 'Australia'
                  'US' = 'United States'
                  other = 'Miscoded';
run;
Annotations: $ctryfmt is the character format name; 'AU' and 'US' are discrete character values; OTHER is a keyword; the quoted strings on the right are the labels.
• The OTHER keyword matches all values that do not match any other value or range.
Character User-Defined Format
Iproc format;
value $ctryfmt    fAUf  =  fAustralia1
fUSf =  fUnited States other =  fMiscoded1; run ;
proc print data=orion.sales label; var Employee_ID Job_Title Salary Country Birth_Date Hire_Date; label Employee_ID= f Sales ID f Job_Title=fJob Title1 Salary= f Annual Salaryf Birth_Date=fDate of Birth1 Hire_Date=fDate of Hire1; format Salary dollarlO.O
Birth_Date Hire_Date monyy7. Country $ctryfmt.;
run;
Reprodukováno se svolením společnosti SAS Institute Inc., Cary, NC, USA.
129
Character User-Defined Format
• Partial PROC PRINT Output
Obs	Sales ID	Job Title	Annual Salary	Country	Date of Birth	Date of Hire
60	120178	Sales Rep. II	$26,165	Australia	NOV1954	APR1974
61	120179	Sales Rep. III	$28,510	Australia	MAR1974	JAN2004
62	120180	Sales Rep. II	$26,970	Australia	JUN1954	DEC1978
63	120198	Sales Rep. III	$28,025	Australia	JAN1988	DEC2006
64	120261	Chief Sales Officer	$243,190	United States	FEB1969	AUG1987
65	121018	Sales Rep. II	$27,560	United States	JAN1944	JAN1974
66	121019	Sales Rep. IV	$31,320	United States	JUN1986	JUN2004
67	121020	Sales Rep. IV	$31,750	United States	FEB1984	MAY2002
68	121021	Sales Rep. IV	$32,985	United States	DEC1974	MAR1994
69	121022	Sales Rep. IV	$32,210	United States	OCT1979	FEB2002
70	121023	Sales Rep. I	$26,010	United States	MAR1964	MAY1989
71	121024	Sales Rep. II	$26,600	United States	SEP1984	MAY2004
72	121025	Sales Rep. II	$28,295	United States	OCT1949	SEP1975
Numeric User-Defined Format
proc format;
   value tiers 20000-49999   = 'Tier 1'
               50000-99999   = 'Tier 2'
               100000-250000 = 'Tier 3';
run;
Annotations: tiers is the numeric format name; the values on the left are numeric ranges; the quoted strings are the labels.
Numeric User-Defined Formats
• The less than (<) symbol excludes values from ranges.
Put < after the value if you want to exclude the first value in a range.
Put < before the value if you want to exclude the last value in a range.
Range	First value	Last value
50000 - 100000	Includes 50000	Includes 100000
50000 - < 100000	Includes 50000	Excludes 100000
50000 < - 100000	Excludes 50000	Includes 100000
50000 < - < 100000	Excludes 50000	Excludes 100000
Numeric User-Defined Format
proc format;
   value tiers low-<50000   = 'Tier 1'
               50000-100000 = 'Tier 2'
               100000<-high = 'Tier 3';
run;
LOW and HIGH are keywords. LOW encompasses the lowest possible value. HIGH encompasses the highest possible value.
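To check how the three ranges treat their boundary values, the same logic can be written as a plain function. This Python sketch is only an analogy to the tiers format above; it ignores SAS missing values, and the function name `tier` is an assumption.

```python
def tier(value: float) -> str:
    """Mimic the ranges low-<50000, 50000-100000, and 100000<-high."""
    if value < 50000:        # low-<50000: 50000 itself is excluded
        return "Tier 1"
    if value <= 100000:      # 50000-100000: both end points included
        return "Tier 2"
    return "Tier 3"          # 100000<-high: 100000 itself is excluded


for salary in (49999.99, 50000, 100000, 100000.01):
    print(salary, tier(salary))
```

Because each boundary belongs to exactly one range, no value falls into a gap or into two tiers at once, which is the point of the < notation.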
Other User-Defined Format Examples
proc format;
   value $grade 'A'     = 'Good'
                'B'-'D' = 'Fair'
                'F'     = 'Poor'
                'I','U' = 'See Instructor'
                other   = 'Miscoded';
run;
proc format;
   value mnthfmt 1,2,3    = 'Qtr 1'
                 4,5,6    = 'Qtr 2'
                 7,8,9    = 'Qtr 3'
                 10,11,12 = 'Qtr 4'
                 .        = 'missing'
                 other    = 'unknown';
run;
Multiple User-Defined Formats
• Multiple VALUE statements can be in a single PROC FORMAT step.
proc format;
   value $ctryfmt 'AU' = 'Australia'
                  'US' = 'United States'
                  other = 'Miscoded';
   value tiers low-<50000   = 'Tier 1'
               50000-100000 = 'Tier 2'
               100000<-high = 'Tier 3';
run;
The FORMAT Procedure
proc format;
   value $goods_t 'BT' = 'A'
                  'BZ' = 'D'
                  ' '  = 'missing'
                  .    = 'missing'
   ;
run;
proc tabulate data=lib1.tab1 missing;
   title "D vs. goods_type";
   class goods_type D;
   table (goods_type all), (D all)*(n colpctn='c%' rowpctn='r%');
   format goods_type $goods_t.;
run;
The FORMAT Procedure
proc format;
   value good_typ
      1 = 1
      2 = 3
      3 = 10
   ;
run;
data lib1.tab1;
   set lib1.tab1;
   goods_type3 = goods_type2;
   format goods_type3 good_typ.;
run;
The FORMAT Procedure
proc format;
   invalue good_t2e
      'BT'  = 4
      'BZ'  = 5
      'CK'  = 5
      other = -1
   ;
run;
data lib1.tab1;
   set lib1.tab1;
   goods_type1 = upcase(goods_type);
   goods_type3n = input(goods_type1, good_t2e.);
   evid_id = put(evid_id, z10.);
run;
Replacing Missing Values
The COALESCE function enables you to replace missing values in a column with a new value that you specify. For every row that the query processes, the COALESCE function checks each of its arguments until it finds a nonmissing value, then returns that value. If all of the arguments are missing values, then the COALESCE function returns a missing value. For example, the following query replaces missing values in the LowPoint column in the SQL.CONTINENTS table with the words Not Available:
proc sql;
title 'Continental Low Points';
select Name, coalesce(LowPoint, 'Not Available') as LowPoint
from sql.continents;
quit;
Output 2.14  Using the COALESCE Function to Replace Missing Values
Continental Low Points
Name	LowPoint
Africa	Lake Assal
Antarctica	Not Available
Asia	Dead Sea
Australia	Lake Eyre
Central America and Caribbean	Not Available
Europe	Caspian Sea
North America	Death Valley
Oceania	Not Available
South America	Valdes Peninsula
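To try COALESCE without access to the SQL.CONTINENTS table, a small stand-in table can be built first. This is only a sketch: the table name low_points and its three rows are assumptions that mirror the output above.

```sas
/* Hypothetical stand-in for SQL.CONTINENTS with one missing low point. */
data low_points;
   length Name $ 12 LowPoint $ 12;
   Name = 'Africa';     LowPoint = 'Lake Assal'; output;
   Name = 'Antarctica'; LowPoint = ' ';          output;
   Name = 'Asia';       LowPoint = 'Dead Sea';   output;
run;

/* COALESCE returns its first nonmissing argument for every row. */
proc sql;
   select Name,
          coalesce(LowPoint, 'Not Available') as LowPoint
   from low_points;
quit;
```

Only the Antarctica row should show Not Available; the other rows keep their original values.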
The DATA Step
The SAS DATA step
• is the original SAS programming language for data manipulation
• can be used as a complete programming language
• is generated by SAS Enterprise Guide when data is imported or in support of other tasks.
Advantages of the DATA Step over SQL
DATA Step	SQL
Can read data from many different sources	Can only read from SAS database tables
Can create multiple tables in a single pass of the data	Can only output one table at a time
Has comprehensive conditional processing	Only has the CASE clause
Can deal with repetitive programming using loops and arrays	Does not support loops or arrays
Advantages of SQL over the DATA Step
SQL	DATA Step
Is very flexible when joining multiple tables with non-common key variables	Can require several steps to join multiple tables with different key variables
Can, in some cases, replace multiple SAS steps	Can require several steps
Is the native language of databases	Might need to generate SQL to get to data that is not SAS data
Choose the right tool for the task to be completed.
The DATA Statement
The DATA statement begins a DATA step and provides the name of the SAS data set being created.
General form of the DATA statement:
DATA output-SAS-data-set;
   SET input-SAS-data-set;
   <additional SAS statements>
RUN;
The DATA statement can create temporary or permanent data sets.
The SET Statement
•The SET statement reads observations from a SAS data set for further processing in the DATA step.
General form of the SET statement:
DATA output-SAS-data-set;
   SET input-SAS-data-set;
   <additional SAS statements>
RUN;
By default, the SET statement does the following:
• names the SAS data set(s) to be read
• reads all observations and all variables from the input data set
• can read temporary or permanent data sets
Business Scenario: Reading a SAS Data Set
This program does the following:
■ reads all the rows and all the columns from the sales data set in the orion library
■ writes all the rows and all the columns to a data set named comp in the Work library
data work.comp;
   set orion.sales;
run;
Partial Listing of comp
Employee_ID	First_Name	Last_Name	Gender	Salary	Job_Title	Country	Birth_Date	Hire_Date
120102	Tom	Zhou	M	108255	Sales Manager	AU	3510	10744
120103	Wilson	Dawes	M	87975	Sales Manager	AU	-3996	5114
120121	Irenie	Elvish	F	26600	Sales Rep. II	AU	-5630	5114
120122	Christina	Ngan	F	27475	Sales Rep. II	AU	-1984	6756
120123	Kimiko	Hotstone	F	26190	Sales Rep. I	AU	1732	9405
120124	Lucian	Daymond	M	26480	Sales Rep. I	AU	-233	6999
Selecting Variables
•You can control the variables written out to SAS data sets using the following:
• the DROP statement to specify the variables that you want excluded
• the KEEP statement to specify the variables that you want included
•General form of DROP and KEEP statements:
DROP variable1 variable2 ...;
KEEP variable1 variable2 ...;
Business Scenario: Selecting Variables
data work.comp;
   set orion.sales;
   drop Gender Salary Job_Title
        Country Birth_Date Hire_Date;
run;
This program can do these tasks:
■ read all the rows and columns from orion.sales
■ write all the rows and the three columns not excluded via the DROP statement to a data set called comp in the Work library
Partial Listing of comp
Employee_ID	First_Name	Last_Name
120102	Tom	Zhou
120103	Wilson	Dawes
120121	Irenie	Elvish
120122	Christina	Ngan
120123	Kimiko	Hotstone
120124	Lucian	Daymond
Selecting Rows
Partial Listing of austemp
	Employee_ID	First_Name	Last_Name	Gender	Salary	Job_Title	Country
1	120102	Tom	Zhou	M	108255	Sales Manager	AU
2	120103	Wilson	Dawes	M	87975	Sales Manager	AU
3	120125	Fong	Hofmeister	M	32040	Sales Rep. IV	AU
4	120128	Monica	Kletschkus	F	30890	Sales Rep. IV	AU
5	120129	Alvin	Roebuck	M	30070	Sales Rep. III	AU
6	120135	Alexei	Platts	M	32490	Sales Rep. IV	AU
7	120144	Viney	Barbis	M	30265	Sales Rep. III	AU
8	120154	Caterina	Hayawardhana	F	30490	Sales Rep. III	AU
9	120158	Daniel	Pilgrim	M	36605	Sales Rep. III	AU
10	120159	Lynelle	Phoumirath	F	30765	Sales Rep. IV	AU
11	120161	Rosette	Martines	F	30785	Sales Rep. III	AU
12	120166	Fadi	Nowd	M	30660	Sales Rep. IV	AU
Orion wants to subset the data to only include Australian employees with a salary greater than $30,000.
Selecting Rows with the WHERE Statement
You can control which rows are read from a SAS data set by using the WHERE statement.
General form of the WHERE statement:
WHERE expression;
• Only one WHERE statement can be included in a DATA step.
• The expressions that can be used are the same as expressions built in the Filter Data tab using either the Edit Filter window or the Advanced Expression Editor.
Comparison Operators -examples
where Gender = 'M';
where Gender eq ' ';
where Salary ne .;
where Salary >= 50000;
where Country in ('AU','US');
where Country in ('AU' 'US');
Values must be separated by commas or blanks.
Arithmetic Operators - examples
where Salary / 12 < 6000;
where (Salary / 12) * 1.10 >= 7500;
where Salary + Bonus <= 10000;
Logical Operators - examples
where Gender ne 'M' and Salary >= 50000;
where Gender ne 'M' or Salary >= 50000;
where Country = 'AU' or Country = 'US';
where Country not in ('AU' 'US');
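The operators above can be combined in a single WHERE statement. The sketch below applies the austemp business scenario (Australian employees earning more than 30000) to a tiny in-line table; the data set name sales and the last row (ID 999999) are hypothetical, the other rows are taken from the listings above.

```sas
/* Small stand-in for orion.sales; the US row is made up for illustration. */
data sales;
   length First_Name $ 12 Country $ 2;
   input Employee_ID First_Name $ Salary Country $;
   datalines;
120102 Tom 108255 AU
120121 Irenie 26600 AU
120125 Fong 32040 AU
999999 Example 45000 US
;
run;

/* Keep only Australian employees with a salary above 30000. */
data austemp;
   set sales;
   where Country = 'AU' and Salary > 30000;
run;

proc print data=austemp noobs;
run;
```

With this input, only the rows for Tom and Fong should remain in austemp.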
Multiple Choice Poll - Correct Answer
•Which WHERE statement correctly subsets for numeric months May, June, or July and character names with a missing value?
a. where Months in (5-7) and Names = .;
b. where Months in (5,6,7) and Names = ' ';   (correct answer)
c. where Months in ('5','6','7') and Names = '.';
Creating New Variables
•Assignment statements are used in the DATA step to update existing variables or create new variables.
•An assignment statement does the following:
• evaluates an expression
• assigns the resulting value to a variable
General form of an assignment statement:
variable=expression;
DATA output-SAS-data-set;
   SET input-SAS-data-set;
   variable = expression;
RUN;
SAS Expressions
• An expression contains operands and operators that form a set of instructions that produce a value.
Operands are
■ variable names
■ constants.
Operators are
■ symbols that request arithmetic calculations
■ SAS functions.
• An expression entered in an assignment statement is identical to an expression built using the SAS Enterprise Guide Advanced Expression Editor.
Operands
•Operands are constants (character, numeric, or date) and variables (character or numeric).
•Examples:
Bonus = 500;	numeric constant
NewSalary = 1.1 * Salary;	variable
Hire_Date = '01APR2008'd;	date constant
SAS Date Constants
The constant 'ddMMMyyyy'd (example: '14dec2000'd) creates a SAS date value from the date enclosed in quotation marks.
dd	is a one- or two-digit value for the day.
MMM	is a three-letter abbreviation for the month (JAN, FEB, MAR, and so on).
yyyy	is a four-digit value for the year.
d	is required to convert the quoted string to a SAS date.
Operators
•Operators are symbols that request an arithmetic calculation, and SAS functions.
•Examples:
Revenue = Quantity * Price;
NewCountry = upcase(Country);
Arithmetic Operators
•Arithmetic operators indicate that an arithmetic calculation is performed.
Symbol	Definition	Priority
**	exponentiation	I
*	multiplication	II
/	division	II
+	addition	III
-	subtraction	III
•If a missing value is an operand for an arithmetic operator, the result is a missing value.
Multiple Choice Poll - Correct Answer
•What is the result of the assignment statement given the values of var1 and var2?
var1	var2
.	10
num = var1 + var2 / 2;
Answer: . (missing)
If an operand is missing for an arithmetic operator, the result is missing.
Using SAS Functions
•SAS functions can do the following:
• perform arithmetic operations
• compute sample statistics (for example: sum, mean, and standard deviation)
• manipulate SAS dates
• process character values
• perform many other tasks
Sample statistics functions ignore missing values.
• SAS functions can be used in the DATA step or in the Advanced Expression Editor of the Query Builder to create new columns or filter data.
Multiple Choice Poll - Correct Answer
•What is the result of the assignment statement given the values of Var1, Var2, and Var3?
Var1	Var2	Var3
9	.	3
Average = mean(Var1,Var2,Var3);
a. . (missing)
b. 0
c. 4
d. 6   (correct answer)
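The two polls contrast arithmetic operators, which propagate missing values, with sample statistics functions, which ignore them. The following sketch reproduces both cases in one DATA step; the variable names total and average are illustrative.

```sas
data _null_;
   var1 = .;
   var2 = 10;
   var3 = 3;
   num     = var1 + var2 / 2;       /* operator: a missing operand gives a missing result */
   total   = sum(var1, var2 / 2);   /* function: the missing argument is ignored          */
   average = mean(9, ., 3);         /* mean of the two nonmissing values                  */
   put num= total= average=;
run;
```

The log should show num=. total=5 average=6, along with a note that missing values were generated.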
Using Date Functions
You can use SAS date functions to do the following:
• create SAS date values
• extract information from SAS date values
Calendar Date	SAS Date Value
01JAN1959	-365
01JAN1960	0
01JAN1961	366
Date Functions: Creating SAS Dates
TODAY()	obtains the date value from the system clock.
MDY(month,day,year)	uses numeric month, day, and year values to return the corresponding SAS date value.
Example:
Days_Since_Order = today() - Order_Date;
Date Functions: Extracting Information
YEAR(SAS-date)	extracts the year from a SAS date and returns a four-digit value for year.
QTR(SAS-date)	extracts the quarter from a SAS date and returns a number from 1 to 4.
MONTH(SAS-date)	extracts the month from a SAS date and returns a number from 1 to 12.
DAY(SAS-date)	extracts the day of the month from a SAS date and returns a number from 1 to 31.
WEEKDAY(SAS-date)	extracts the day of the week from a SAS date and returns a number from 1 to 7, where 1 represents Sunday, and so on.
Example:
BonusMonth = month(Hire_Date);
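The date functions can be tried on a fixed date so that the result does not depend on the system clock. This is a small sketch; the date 14 December 2000 is the example used for date constants, and all variable names are illustrative.

```sas
data _null_;
   d     = mdy(12, 14, 2000);        /* build a SAS date from month, day, year  */
   same  = (d = '14DEC2000'd);       /* 1 if it equals the date constant        */
   y     = year(d);
   q     = qtr(d);
   m     = month(d);
   dd    = day(d);
   wd    = weekday(d);               /* 1 = Sunday, ..., 7 = Saturday           */
   start = mdy(1, 1, 1961);          /* days since 01JAN1960                    */
   put d= date9. same= y= q= m= dd= wd= start=;
run;
```

The log should show d=14DEC2000 same=1 y=2000 q=4 m=12 dd=14 wd=5 (a Thursday) start=366.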
The LABEL Statement
•Permanent labels can also be assigned in the DATA step.
•General form of the LABEL statement:
LABEL variable = 'label'
      variable = 'label'
      variable = 'label';
• A label can be up to 256 characters.
• Any number of variables can be associated with labels in a single LABEL statement.
• Using a LABEL statement in a DATA step permanently associates labels with variables by storing the label in the descriptor portion of the SAS data set.
Business Scenario: Formats and Labels
data work.comp;
   set orion.sales;
   Bonus=500;
   Compensation=sum(Salary,Bonus);
   BonusMonth=month(Hire_Date);
   drop Gender Salary Job_Title Country Birth_Date;
   format Bonus Compensation dollar8.
          Hire_Date date9.;
   label Employee_ID="Employee ID"
         First_Name="First Name"
         Last_Name="Last Name"
         BonusMonth="Month of Bonus"
         Hire_Date="Hire Date";
run;
4. Exploratory analysis, data visualization, contingency tables
Exploratory analysis - WHY?
□ The data must be understood:
> to find errors in the data
> to find patterns in the data
> to find violations of statistical assumptions, to test hypotheses
> ...and above all because, if we skip this step, we will have big problems later.
Data exploration - univariate
Frequency tables, histograms:
	count	share	bad rate
Male	248 768	55.0%	13.08%
Female	203 194	45.0%	7.69%
Total	451 962	100.0%	10.66%
length of employment	count	share	bad rate
0	20 825	4.6%	4.69%
1	163 144	36.1%	13.43%
2	67 462	14.9%	12.80%
3	43 778	9.7%	10.97%
4	26 256	5.8%	10.01%
5	27 526	6.1%	9.32%
6	15 893	3.5%	8.16%
8	18 036	4.0%	8.39%
10	17 195	3.8%	6.72%
20	33 641	7.4%	5.60%
24	5 176	1.1%	4.48%
48	12 934	2.9%	4.28%
666	96	0.0%	3.13%
Total	451 962	100.0%	10.66%
[Charts: share (bars) and bad rate (line) by gender and by length of employment]
Data exploration - univariate
□ loan amount vs. bad rate
OK? Or is it caused by another factor???
[Chart: share (bars) and bad rate (line) by loan amount (vyse_uveru)]
Data exploration - univariate
□ continuous variables:
> Mean
> Mode
> Quantiles
> Variance
> Min./max. value
□ categorization is advisable
Data exploration - univariate
□ Histograms, box plots
□ Stability over time
[Charts: number of contract proposals by goods type, split by BAD/GOOD and by goods type (VT, OT, NA, MT, FK, CT), per week from 27.2.-5.3. to 3.4.-9.4.]
Data exploration - multivariate
□ Contingency tables
	up to 5 000	5 000 - 10 000	10 000 - 15 000	more than 15 000
BT	4 291	8 581	9 176	9 044
CT	7 587	12 493	6 500	7 236
FK	258	1 017	851	557
MT	27 191	39 551	16 524	5 992
NA	426	1 088	1 114	2 737
OT	2 478	3 689	2 103	3 475
VT	384	1 001	963	9 086
row%	up to 5 000	5 000 - 10 000	10 000 - 15 000	more than 15 000
BT	13.8%	27.6%	29.5%	29.1%
CT	22.4%	36.9%	19.2%	21.4%
FK	9.6%	37.9%	31.7%	20.8%
MT	30.5%	44.3%	18.5%	6.7%
NA	7.9%	20.3%	20.8%	51.0%
OT	21.1%	31.4%	17.9%	29.6%
VT	3.4%	8.8%	8.4%	79.5%
col%	up to 5 000	5 000 - 10 000	10 000 - 15 000	more than 15 000
BT	10.1%	12.7%	24.6%	23.7%
CT	17.8%	18.5%	17.5%	19.0%
FK	0.6%	1.5%	2.3%	1.5%
MT	63.8%	58.7%	44.4%	15.7%
NA	1.0%	1.6%	3.0%	7.2%
OT	5.8%	5.5%	5.6%	9.1%
VT	0.9%	1.5%	2.6%	23.8%
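Row and column percentages like those above can be obtained from the cell counts with PROC FREQ and a WEIGHT statement. The sketch below assumes the counts are typed in by hand; the data set name proposals and the variables goods, band, and count are illustrative, and band 1 to 4 stands for the four amount bands from left to right.

```sas
/* One input line per goods type; reshape to one row per cell. */
data proposals;
   input goods $ n1-n4;
   array cnt{4} n1-n4;
   do band = 1 to 4;
      count = cnt{band};
      output;
   end;
   keep goods band count;
   datalines;
BT 4291 8581 9176 9044
CT 7587 12493 6500 7236
FK 258 1017 851 557
MT 27191 39551 16524 5992
NA 426 1088 1114 2737
OT 2478 3689 2103 3475
VT 384 1001 963 9086
;
run;

/* WEIGHT tells PROC FREQ that each row represents count observations. */
proc freq data=proposals;
   weight count;
   tables goods*band / nopercent;
run;
```

The NOPERCENT option leaves the cell frequency, row percentage, and column percentage in each cell, so the output can be compared directly with the three tables above.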
Data exploration - multivariate
[Charts: number of contract proposals by goods type (BT, CT, FK, MT, NA, OT, VT), stacked by amount band (up to 5 000; 5 000 - 10 000; 10 000 - 15 000; more than 15 000) and by a second banding (4-5, 6-7, 8-9, 10-11, 12-16, 17 and more), shown both as counts and as 100% stacked bars]
Data exploration - multivariate
□ Age vs. length of employment
[Chart: number of observations by age and length of employment. Annotation at 5 years: ...a default value???]
Data exploration - multivariate
□ Age vs. length of employment vs. default
[Chart: probability of default (goodE) by age and length of employment]
Correlation
[Figure: correlation coefficient scale, from -1 (strong negative) through weak correlation to strong positive]
Extreme Data Values
Outlying (extreme) values can completely distort the results of an analysis.
Discriminatory power of variables for predictive models
Weight of evidence, information value
$r$ ... number of levels (categories) of the categorical variable
$g_i$ ... number of "goods" in the $i$-th category
$b_i$ ... number of "bads" in the $i$-th category
$G := \sum_{i=1}^{r} g_i$ ... total number of "goods"
$B := \sum_{i=1}^{r} b_i$ ... total number of "bads"
Weight of evidence for the $i$-th category: $\mathrm{woe}_i = \ln(g_i/G) - \ln(b_i/B)$
Information value for the $i$-th category: $\mathrm{Inf\_val}_i = [(g_i/G) - (b_i/B)] \cdot \mathrm{woe}_i$
Total information value for the corresponding variable: $\mathrm{Inf\_val} = \sum_{i=1}^{r} \mathrm{Inf\_val}_i$
Discriminatory power of variables
Incorporation Date
Raw	RegVar	Percent	B	G	TOT	G/B Odds	%Good	%Bad	Bad Rate	WoE	IV
O&NOI	inc_1	12%	139	952	1091	7	11%	19%	12.7%	-0.557	0.046116
1	inc_2	13%	133	1073	1206	8	12%	19%	11.0%	-0.394	0.023731
2-7	miss	42%	299	3601	3900	12	42%	42%	7.7%	0.007	2.04E-05
8-15	inc_3	22%	108	1942	2050	18	23%	15%	5.3%	0.408	0.030887
16+	inc_4	11%	39	1019	1058	26	12%	5%	3.7%	0.781	0.050288
Total			718	8587	9305	12			7.7%		0.151
Interpretation of the information value:
>	< 0.02	unpredictive
>	0.02 - 0.1	weak
>	0.1 - 0.3	medium
>	0.3 - 0.5	strong
>	> 0.5	too high ...needs to be checked, something is probably wrong
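The WoE and IV columns of the Incorporation Date table can be recomputed from the B and G counts alone. The following is a minimal sketch using PROC SQL; the data set names bins and woe_iv and the column names are assumptions, only the counts come from the table.

```sas
/* Bad (B) and good (G) counts per category, copied from the table. */
data bins;
   input bin $ bad good;
   datalines;
inc_1 139 952
inc_2 133 1073
miss 299 3601
inc_3 108 1942
inc_4 39 1019
;
run;

proc sql;
   /* sum() without GROUP BY is remerged, giving the totals G and B on every row. */
   create table woe_iv as
   select bin, good, bad,
          log(good / sum(good)) - log(bad / sum(bad))           as woe,
          (good / sum(good) - bad / sum(bad)) * calculated woe  as iv
   from bins;

   select bin, woe format=8.3, iv format=10.6 from woe_iv;
   select sum(iv) as total_iv format=8.3 from woe_iv;
quit;
```

For inc_1 this should give a WoE of about -0.557, and the total information value should come out at about 0.151, which falls in the "medium" band.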
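Discriminatory power of variables
□ Lorenz curve, Gini index
$x = F_{m.\mathrm{BAD}}(a)$, $y = F_{n.\mathrm{GOOD}}(a)$, $a \in [L, H]$.
$\mathrm{Gini} = \frac{A}{A+B} = 2A$, where $A$ is the area between the Lorenz curve and the diagonal and $A + B = 1/2$.
$\mathrm{Gini} = 1 - \sum_{k=2}^{n+m} \left(F_{m.\mathrm{BAD}\,k} - F_{m.\mathrm{BAD}\,k-1}\right) \cdot \left(F_{n.\mathrm{GOOD}\,k} + F_{n.\mathrm{GOOD}\,k-1}\right)$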
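For a categorized variable, the sum in the Gini formula runs over the categories instead of individual score values. The sketch below is an illustration under stated assumptions: it uses the gender frequency table from the univariate exploration (counts and bad rates), sorts the categories by descending bad rate, and starts the cumulative distributions at zero; the data set and variable names are hypothetical.

```sas
/* Counts and bad rates per category; bad and good counts are derived. */
data gender;
   input category $ n badrate;
   bad  = n * badrate;
   good = n - bad;
   datalines;
Male 248768 0.1308
Female 203194 0.0769
;
run;

/* Shares of all bads and all goods per category, worst category first. */
proc sql;
   create table shares as
   select category, badrate,
          bad  / sum(bad)  as p_bad,
          good / sum(good) as p_good
   from gender
   order by badrate desc;
quit;

/* p_bad is the step F_bad(k) - F_bad(k-1); f_good accumulates F_good(k). */
data _null_;
   set shares end=last;
   prev_good = f_good;
   f_good + p_good;
   s + p_bad * (f_good + prev_good);
   if last then do;
      gini = 1 - s;
      put gini= 6.4;
   end;
run;
```

With these inputs the log should show a Gini of approximately 0.1401. Small differences can arise because the bad rates in the table are rounded.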
Discriminatory power of variables
□ Lorenz curve ...a check that the dependent variable (default rate) is monotone in the given explanatory variable
Categorization (WOE)
Discriminatory power of variables
□ Gini index
> < 0.05 unpredictive
> 0.05 - 0.1 weak
> 0.1 - 0.2 medium
> 0.2 - 0.5 strong
> > 0.5 too high ...needs to be checked, something is probably wrong
Discriminatory power of variables
gender (pohlavi)
Gini: 0.1401
Info. Value: 0.0828
	count	share	bad rate
Male	248 768	55.0%	13.08%
Female	203 194	45.0%	7.69%
Total	451 962	100.0%	10.66%
[Chart: share (bars) and bad rate (line) by gender]
length of employment, coarse categories (delka_zamestnani_hrube)
Gini: 0.1611
Info. Value: 0.1100
	count	share	bad rate
0	20 825	4.6%	4.69%
1	163 144	36.1%	13.43%
5	165 022	36.5%	11.29%
666	102 971	22.8%	6.45%
Total	451 962	100.0%	
[Chart: share (bars) and bad rate (line) by coarse length-of-employment category]
length of employment, fine categories (delka_zamestnani_jemne)
Gini: 0.1762
Info. Value: 0.1285
length of employment	count	share	bad rate
0	20 825	4.6%	4.69%
1	163 144	36.1%	13.43%
2	67 462	14.9%	12.80%
3	43 778	9.7%	10.97%
4	26 256	5.8%	10.01%
5	27 526	6.1%	9.32%
6	15 893	3.5%	8.16%
8	18 036	4.0%	8.39%
10	17 195	3.8%	6.72%
20	33 641	7.4%	5.60%
24	5 176	1.1%	4.48%
48	12 934	2.9%	4.28%
666	96	0.0%	3.13%
Total	451 962	100.0%	10.66%
[Chart: share (bars) and bad rate (line) by fine length-of-employment category]
The One-Way Frequencies Task
[Screenshot: One-Way Frequencies task for Local:SASUSER.SALES. Selection pane: Data, Statistics, Plots, Results, Titles, Properties. Variables to assign: Purchase, Gender, Income, Age. Task roles: Analysis variables, Frequency count, Group analysis by.]
The selection pane enables you to choose different sets of options for the task.
The "Analysis variables" role must have at least one variable assigned to it.
The Table Analysis Task
[Screenshot: Table Analysis task for Local:SASUSER.SALES_INCLEVEL. Selection pane: Data, Tables, Cell Statistics, Table Statistics (Association, Agreement, Ordered Differences, Trend Test, Computation Options), Results (Cell Stat Results, Table Stat Results), Titles, Properties. Variables to assign: IncLevel, Purchase, Gender, Income, Age. Task roles: Frequency count (Limit: 1), Group analysis by, Table variables.]
The selection pane enables you to choose different sets of options for the task.
You must define at least one table on the Tables page.
The FREQ Procedure
•The FREQ procedure can do the following:
• produce one-way to n-way frequency and crosstabulation (contingency) tables
• compute chi-square tests for one-way to n-way tables and measures of association and agreement for contingency tables
• automatically display the output in a report and save the output in a SAS data set
•General form of the FREQ procedure:
PROC FREQ DATA=SAS-data-set <option(s)>;
   TABLES variable(s) </ option(s)>;
RUN;
The FREQ Procedure
A FREQ procedure with no TABLES statement generates one-way frequency tables for all data set variables.
proc freq data=orion.sales;
run;
This PROC FREQ step creates a frequency table for the following nine variables:
• Employee_ID
• First_Name
• Last_Name
• Gender
• Salary
• Job_Title
• Country
• Birth_Date
• Hire_Date
The TABLES Statement
The TABLES statement specifies the frequency and crosstabulation tables to produce.
proc freq data=orion.sales;
   tables Gender Country;
run;
This step requests two one-way frequency tables.
An asterisk between variables requests an n-way crosstabulation table.
proc freq data=orion.sales;
   tables Gender*Country;
run;
This step requests a two-way frequency table.
The TABLES Statement
A one-way frequency table produces frequencies, cumulative frequencies, percentages, and cumulative percentages.
proc freq data=orion.sales;
   tables Gender Country;
run;
		The FREQ Procedure		
			Cumulative	Cumulative
Gender	Frequency	Percent	Frequency	Percent
F	68	41.21	68	41.21
M	97	58.79	165	100.00
			Cumulative	Cumulative
Country	Frequency	Percent	Frequency	Percent
AU	63	38.18	63	38.18
US	102	61.82	165	100.00
The TABLES Statement
An n-way frequency table produces cell frequencies, cell percentages, cell percentages of row frequencies, and cell percentages of column frequencies, plus total frequency and percent.
proc freq data=orion.sales;
   tables Gender*Country;
run;
In Gender*Country, the first variable (Gender) defines the rows and the second (Country) defines the columns.
The TABLES Statement
The FREQ Procedure
Table of Gender by Country
Cell contents, in order: Frequency, Percent, Row Pct, Col Pct
Gender	AU	US	Total
F	27 16.36 39.71 42.86	41 24.85 60.29 40.20	68 41.21
M	36 21.82 37.11 57.14	61 36.97 62.89 59.80	97 58.79
Total	63 38.18	102 61.82	165 100.00
Additional SAS Statements
Additional statements can be added to enhance the report.
proc format;
   value $ctryfmt 'AU'='Australia'
                  'US'='United States';
run;
options nodate pageno=1;
ods html file='p112d01.html';
proc freq data=orion.sales;
   tables Gender*Country;
   where Job_Title contains 'Rep';
   format Country $ctryfmt.;
   title 'Sales Rep Frequency Report';
run;
ods html close;
Additional SAS Statements
•HTML Output
Sales Rep Frequency Report
The FREQ Procedure
Table of Gender by Country
Cell contents, in order: Frequency, Percent, Row Pct, Col Pct
Gender	Country		
	Australia	United States	
F	27	40	67
	16.98	25.16	42.14
	40.30	59.70	
	44.26	40.82	
M	34	58	92
	21.38	36.48	57.86
	36.96	63.04	
	55.74	59.18	
Total	61	98	159
	38.36	61.64	100.00
Options to Suppress Display of Statistics
•Options can be placed in the TABLES statement after a forward slash to suppress the display of the default statistics.
Option	Description
NOCUM	suppresses the display of cumulative frequency and cumulative percentage.
NOPERCENT	suppresses the display of percentage, cumulative percentage, and total percentage.
NOFREQ	suppresses the display of the cell frequency and total frequency.
NOROW	suppresses the display of the row percentage.
NOCOL	suppresses the display of the column percentage.
Additional TABLES Statement Options
•Additional options can be placed in the TABLES statement after a forward slash to control the displayed output.
Option	Description
LIST	displays n-way tables in list format.
CROSSLIST	displays n-way tables in column format.
FORMAT=	formats the frequencies in n-way tables.
LIST and CROSSLIST Options
tables Gender*Country / list;
Gender	Country	Frequency	Percent	Cumulative Frequency	Cumulative Percent
F	Australia	27	16.36	27	16.36
F	United States	41	24.85	68	41.21
M	Australia	36	21.82	104	63.03
M	United States	61	36.97	165	100.00
tables Gender*Country / crosslist;
Table of Gender by Country
Gender	Country	Frequency	Percent	Row Percent	Column Percent
F	Australia	27	16.36	39.71	42.86
	United States	41	24.85	60.29	40.20
	Total	68	41.21	100.00	
M	Australia	36	21.82	37.11	57.14
	United States	61	36.97	62.89	59.80
	Total	97	58.79	100.00	
Total	Australia	63	38.18		100.00
	United States	102	61.82		100.00
	Total	165	100.00		
PROC FREQ Statement Options
Options can also be placed in the PROC FREQ statement.
Option	Description
NLEVELS	displays a table that provides the number of levels for each variable named in the TABLES statement.
PAGE	displays only one table per page.
COMPRESS	begins the display of the next one-way frequency table on the same page as the preceding one-way table if there is enough space to begin the table.
NLEVELS Option
proc freq data=orion.sales nlevels;
   tables Gender Country Employee_ID;
run;
Partial PROC FREQ Output
The FREQ	Procedure
Number of Variable Levels	
Variable	Levels
Gender	2
Country	2
Employee_ID	165
Output Data Sets
•PROC FREQ produces output data sets using two different methods.
• The TABLES statement with an OUT= option is used to create a data set with frequencies and percentages.
TABLES variables / OUT=SAS-data-set <options>;
• The OUTPUT statement with an OUT= option is used to create a data set with specified statistics such as the chi-square statistic.
OUTPUT OUT=SAS-data-set <options>;
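Both methods can be used in the same PROC FREQ step. The sketch below is a self-contained illustration: the data set counts and the weight variable n are assumptions, the cell frequencies are those of the Gender by Country table shown earlier, and the output data set names are arbitrary.

```sas
/* Cell counts of the Gender by Country table, one row per cell. */
data counts;
   input Gender $ Country $ n;
   datalines;
F AU 27
F US 41
M AU 36
M US 61
;
run;

proc freq data=counts;
   weight n;
   /* OUT= on TABLES stores frequencies and percentages (COUNT, PERCENT). */
   tables Gender*Country / chisq out=work.cell_counts;
   /* OUTPUT OUT= stores the requested statistics, here the chi-square tests. */
   output out=work.chisq_stats chisq;
run;

proc print data=work.cell_counts;
run;
proc print data=work.chisq_stats;
run;
```

The CHISQ option on the TABLES statement is needed so that the chi-square statistics are computed and can be written by the OUTPUT statement.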
The MEANS Procedure
•The MEANS procedure provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations.
General form of the MEANS procedure:
PROC MEANS DATA=SAS-data-set <statistic(s)> <option(s)>;
   VAR analysis-variable(s);
   CLASS classification-variable(s);
RUN;
The MEANS Procedure
•By default, the MEANS procedure reports the number of nonmissing observations, the mean, the standard deviation, the minimum value, and the maximum value of all numeric variables.
proc means data=orion.sales;
run;
The MEANS Procedure
Variable	N	Mean	Std Dev	Minimum	Maximum
Employee_ID	165	120713.90	450.0866939	120102.00	121145.00
Salary	165	31160.12	20082.67	22710.00	243190.00
Birth_Date	165	3622.58	5456.29	-5842.00	10490.00
Hire_Date	165	12054.28	4619.94	5114.00	17167.00
The VAR Statement
The VAR statement identifies the analysis variables and their order in the results.
proc means data=orion.sales;
   var Salary;
run;
The MEANS Procedure
Analysis Variable : Salary
N	Mean	Std Dev	Minimum	Maximum
165	31160.12	20082.67	22710.00	243190.00
The CLASS Statement
The CLASS statement identifies variables whose values define subgroups for the analysis.
proc means data=orion.sales;
   var Salary;
   class Gender Country;
run;
The MEANS Procedure
Analysis Variable : Salary
Gender	Country	N Obs	N	Mean	Std Dev	Minimum	Maximum
F	AU	27	27	27702.41	1728.23	25185.00	30890.00
	US	41	41	29460.98	8847.03	25390.00	83505.00
M	AU	36	36	32001.39	16592.45	25745.00	108255.00
	US	61	61	33336.15	29592.69	22710.00	243190.00
The CLASS Statement
proc means data=orion.sales;
   var Salary;
   class Gender Country;
run;
The MEANS Procedure
Analysis Variable : Salary
Gender	Country	N Obs	N	Mean	Std Dev	Minimum	Maximum
F	AU	27	27	27702.41	1728.23	25185.00	30890.00
	US	41	41	29460.98	8847.03	25390.00	83505.00
M	AU	36	36	32001.39	16592.45	25745.00	108255.00
	US	61	61	33336.15	29592.69	22710.00	243190.00
In this output, Gender and Country are the classification variables, Salary is the analysis variable, and N, Mean, Std Dev, Minimum, and Maximum are the statistics for the analysis variable.
The CLASS statement adds the N Obs column, which is the number of observations for each unique combination of the class variables.
PROC MEANS Statistics
The statistics to compute and the order to display them can be specified in the PROC MEANS statement.
proc means data=orion.sales sum mean range;
   var Salary;
   class Country;
run;
The MEANS Procedure
Analysis Variable : Salary
Country	N Obs	Sum	Mean	Range
AU	63	1900015.00	30158.97	83070.00
US	102	3241405.00	31778.48	220480.00
PROC MEANS Statistics
Descriptive Statistic Keywords
clm	css	cv	lclm	max
mean	min	mode	n	nmiss
kurtosis	range	skewness	stddev	stderr
sum	sumwgt	uclm	uss	var
Quantile Statistic Keywords
median | p50	p1	p5	p10	q1 | p25
q3 | p75	p90	p95	p99	qrange
Hypothesis Testing Keywords
probt	t
PROC MEANS Statement Options
•Options can also be placed in the PROC MEANS statement.
Option	Description
MAXDEC=	specifies the number of decimal places to use in printing the statistics.
FW=	specifies the field width to use in displaying the statistics.
NONOBS	suppresses reporting the total number of observations for each unique combination of the class variables.
MAXDEC= Option
proc means data=orion.sales maxdec=0;
Analysis Variable : Salary
Country	N Obs	N	Mean	Std Dev	Minimum	Maximum
AU	63	63	30159	12699	25185	108255
US	102	102	31778	23556	22710	243190
proc means data=orion.sales maxdec=1;
Analysis Variable : Salary
Country	N Obs	N	Mean	Std Dev	Minimum	Maximum
AU	63	63	30159.0	12699.1	25185.0	108255.0
US	102	102	31778.5	23555.8	22710.0	243190.0
FW= Option
proc means data=orion.sales;
Analysis Variable : Salary
Country	N Obs	N	Mean	Std Dev	Minimum	Maximum
AU	63	63	30158.97	12699.14	25185.00	108255.00
US	102	102	31778.48	23555.84	22710.00	243190.00
proc means data=orion.sales fw=15;
Analysis Variable : Salary
Country	N Obs	N	Mean	Std Dev	Minimum	Maximum
AU	63	63	30158.96825397	12699.13932690	25185.00000000	108255
US	102	102	31778.48039216	23555.84171928	22710.00000000	243190
NONOBS Option
proc means data=orion.sales;
Analysis Variable : Salary
Country	N Obs	N	Mean	Std Dev	Minimum	Maximum
AU	63	63	30158.97	12699.14	25185.00	108255.00
US	102	102	31778.48	23555.84	22710.00	243190.00
proc means data=orion.sales nonobs;
Analysis Variable : Salary
Country	N	Mean	Std Dev	Minimum	Maximum
AU	63	30158.97	12699.14	25185.00	108255.00
US	102	31778.48	23555.84	22710.00	243190.00
Output Data Sets
•PROC MEANS produces output data sets using the following method:
OUTPUT OUT=SAS-data-set <options>;
The output data set contains the following variables:
• BY variables
• class variables
• the automatic variables _TYPE_ and _FREQ_
• the variables requested in the OUTPUT statement
OUTPUT Statement OUT= Option
The statistics in the PROC statement impact only the MEANS report, not the data set.
proc means data=orion.sales sum mean range;
   var Salary;
   class Gender Country;
   output out=work.means1;
run;
proc print data=work.means1;
run;
p112d06
OUTPUT Statement OUT= Option
Partial PROC PRINT output (the data set holds the default statistics):
Obs	Gender	Country	_TYPE_	_FREQ_	_STAT_	Salary
1			0	165	N	165.00
2			0	165	MIN	22710.00
3			0	165	MAX	243190.00
4			0	165	MEAN	31160.12
5			0	165	STD	20082.67
6		AU	1	63	N	63.00
7		AU	1	63	MIN	25185.00
8		AU	1	63	MAX	108255.00
9		AU	1	63	MEAN	30158.97
10		AU	1	63	STD	12699.14
11		US	1	102	N	102.00
12		US	1	102	MIN	22710.00
13		US	1	102	MAX	243190.00
14		US	1	102	MEAN	31778.48
15		US	1	102	STD	23555.84
16	F		2	68	N	68.00
17	F		2	68	MIN	25185.00
18	F		2	68	MAX	83505.00
19	F		2	68	MEAN	28762.72
20	F		2	68	STD	6974.15
OUTPUT Statement OUT= Option
•The OUTPUT statement can also do the following:
• specify the statistics for the output data set
• select and name variables
proc means data=orion.sales noprint;
   var Salary;
   class Gender Country;
   output out=work.means2
          min=minSalary max=maxSalary
          sum=sumSalary mean=aveSalary;
run;
proc print data=work.means2;
run;
•The NOPRINT option suppresses the display of all output.
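The same pattern can be tried without the orion library. The sketch below assumes a small hypothetical data set (work.toy_sales, with invented values) and names the requested statistics in the OUTPUT statement:

```sas
/* Hypothetical toy data, invented for illustration only. */
data work.toy_sales;
   input Gender $ Country $ Salary;
   datalines;
F AU 27000
F US 29000
M AU 31000
M US 33000
M US 35500
;
run;

/* NOPRINT suppresses the report; the OUTPUT statement names */
/* the statistics that are written to work.toy_means.        */
proc means data=work.toy_sales noprint;
   var Salary;
   class Gender Country;
   output out=work.toy_means
          min=minSalary max=maxSalary mean=aveSalary;
run;

proc print data=work.toy_means noobs;
run;
```

With two class variables that each take two values, the output data set should hold nine rows: one overall row, two per-Country rows, two per-Gender rows, and four Gender-by-Country rows.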
OUTPUT Statement OUT= Option
•PROC PRINT Output
Obs	Gender	Country	_TYPE_	_FREQ_	minSalary	maxSalary	sumSalary	aveSalary
1			0	165	22710	243190	5141420	31160.12
2		AU	1	63	25185	108255	1900015	30158.97
3		US	1	102	22710	243190	3241405	31778.48
4	F		2	68	25185	83505	1955865	28762.72
5	M		2	97	22710	243190	3185555	32840.77
6	F	AU	3	27	25185	30890	747965	27702.41
7	F	US	3	41	25390	83505	1207900	29460.98
8	M	AU	3	36	25745	108255	1152050	32001.39
9	M	US	3	61	22710	243190	2033505	33336.15
OUTPUT Statement OUT= Option
•_TYPE_ is a numeric variable that shows which combination of class variables produced the summary statistics in that observation.
_TYPE_	Type of summary	Obs
0	overall summary	1
1	summary by Country only	2-3
2	summary by Gender only	4-5
3	summary by Country and Gender	6-9
OUTPUT Statement OUT= Option
Within each _TYPE_ value, the _FREQ_ values add up to the total number of observations:
_TYPE_	Type of Summary	_FREQ_
0	overall summary	165
1	summary by Country only	63 AU + 102 US = 165
2	summary by Gender only	68 F + 97 M = 165
3	summary by Country and Gender	27 F AU + 41 F US + 36 M AU + 61 M US = 165
OUTPUT Statement OUT= Option
•Options can be added to the PROC MEANS statement to control the output data set.
Option	Description
NWAY	specifies that the output data set contain only statistics for the observations with the highest _TYPE_ value.
DESCENDTYPES	orders the output data set by descending _TYPE_ value.
CHARTYPE	specifies that the _TYPE_ variable in the output data set is a character representation of the binary value of _TYPE_.
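As a hedged illustration of how two of these options combine, the sketch below runs PROC MEANS with NWAY and CHARTYPE on a hypothetical toy data set (work.toy_sales and its values are assumptions, not course data):

```sas
/* Hypothetical toy data, invented for illustration only. */
data work.toy_sales;
   input Gender $ Country $ Salary;
   datalines;
F AU 27000
F US 29000
M AU 31000
M US 33000
M US 35500
;
run;

/* NWAY keeps only the rows with the highest _TYPE_ value       */
/* (every class variable contributes); CHARTYPE stores _TYPE_   */
/* as a character string of binary digits instead of a number.  */
proc means data=work.toy_sales noprint nway chartype;
   var Salary;
   class Gender Country;
   output out=work.toy_nway mean=aveSalary;
run;

proc print data=work.toy_nway noobs;
run;
```

Here the output data set should contain only the four Gender-by-Country rows, each with _TYPE_ equal to the character value 11.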
OUTPUT Statement OUT= Option
with NWAY
Obs	Gender	Country	_TYPE_	_FREQ_	minSalary	maxSalary	sumSalary	aveSalary
1	F	AU	3	27	25185	30890	747965	27702.41
2	F	US	3	41	25390	83505	1207900	29460.98
3	M	AU	3	36	25745	108255	1152050	32001.39
4	M	US	3	61	22710	243190	2033505	33336.15
p112d06
OUTPUT Statement OUT= Option
with DESCENDTYPES
Obs	Gender	Country	_TYPE_	_FREQ_	minSalary	maxSalary	sumSalary	aveSalary
1	F	AU	3	27	25185	30890	747965	27702.41
2	F	US	3	41	25390	83505	1207900	29460.98
3	M	AU	3	36	25745	108255	1152050	32001.39
4	M	US	3	61	22710	243190	2033505	33336.15
5	F		2	68	25185	83505	1955865	28762.72
6	M		2	97	22710	243190	3185555	32840.77
7		AU	1	63	25185	108255	1900015	30158.97
8		US	1	102	22710	243190	3241405	31778.48
9			0	165	22710	243190	5141420	31160.12
OUTPUT Statement OUT= Option
with CHARTYPE
Obs	Gender	Country	_TYPE_	_FREQ_	minSalary	maxSalary	sumSalary	aveSalary
1			00	165	22710	243190	5141420	31160.12
2		AU	01	63	25185	108255	1900015	30158.97
3		US	01	102	22710	243190	3241405	31778.48
4	F		10	68	25185	83505	1955865	28762.72
5	M		10	97	22710	243190	3185555	32840.77
6	F	AU	11	27	25185	30890	747965	27702.41
7	F	US	11	41	25390	83505	1207900	29460.98
8	M	AU	11	36	25745	108255	1152050	32001.39
9	M	US	11	61	22710	243190	2033505	33336.15
The SUMMARY Procedure
•The SUMMARY procedure provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations.
General form of the SUMMARY procedure:
PROC SUMMARY DATA=SAS-data-set <statistic(s)> <option(s)>;
   VAR analysis-variable(s);
   CLASS classification-variable(s);
RUN;
The SUMMARY Procedure
The SUMMARY procedure uses the same syntax as the MEANS procedure.
The only differences between the two procedures are the following:
PROC MEANS	PROC SUMMARY
The PRINT option is set by default, which displays output.	The NOPRINT option is set by default, which displays no output.
Omitting the VAR statement analyzes all the numeric variables.	Omitting the VAR statement produces a simple count of observations.
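To see the first difference in practice, PROC SUMMARY can be asked for a report explicitly with the PRINT option. A minimal sketch, assuming a hypothetical toy data set (work.toy_sales is invented for illustration):

```sas
/* Hypothetical toy data, invented for illustration only. */
data work.toy_sales;
   input Country $ Salary;
   datalines;
AU 27000
AU 31000
US 29000
US 33000
US 35500
;
run;

/* Without PRINT, this step would produce no displayed report, */
/* because NOPRINT is the default for PROC SUMMARY.            */
proc summary data=work.toy_sales print;
   var Salary;
   class Country;
run;
```

With PRINT specified, the report should match what PROC MEANS displays for the same VAR and CLASS statements.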
The TABULATE Procedure
•The TABULATE procedure displays descriptive statistics in tabular format.
General form of the TABULATE procedure:
PROC TABULATE DATA=SAS-data-set <options>;
   CLASS classification-variable(s);
   VAR analysis-variable(s);
   TABLE page-expression, row-expression, column-expression </ option(s)>;
RUN;
Dimensional Tables
•The TABULATE procedure produces one-, two-, or three-dimensional tables.
	page dimension	row dimension	column dimension
one-dimensional			X
two-dimensional		X	X
three-dimensional	X	X	X
The TABLE Statement
•The TABLE statement describes the structure of the table.
TABLE page-expression, row-expression, column-expression;
The page, row, and column expressions are the dimension expressions.
Commas separate the dimension expressions.
Every variable that is part of a dimension expression must be specified as a classification variable (CLASS statement) or an analysis variable (VAR statement).
The TABLE Statement
TABLE page-expression, row-expression, column-expression;
•Examples:
table Country;
table Gender, Country;
table Job_Title, Gender, Country;
The CLASS Statement
The CLASS statement identifies variables to be used as classification, or grouping, variables.
General form of the CLASS statement:
CLASS classification-variable(s);
• N, the number of nonmissing values, is the default statistic for classification variables.
• Examples of classification variables: Job_Title, Gender, and Country
The VAR Statement
•The VAR statement identifies the numeric variables for which statistics are calculated.
•General form of the VAR statement:
VAR analysis-variable(s);
• SUM is the default statistic for analysis variables.
• Examples of analysis variables: Salary and Bonus
One-Dimensional Table
proc tabulate data=orion.sales;
   class Country;
   table Country;
run;
Country AU (N)	Country US (N)
63.00	102.00
Two-Dimensional Table
proc tabulate data=orion.sales;
   class Gender Country;
   table Gender, Country;
run;
Gender	Country AU (N)	Country US (N)
F	27.00	41.00
M	36.00	61.00
Three-Dimensional Table
proc tabulate data=orion.sales;
   class Job_Title Gender Country;
   table Job_Title, Gender, Country;
run;
p112d08
Job_Title Sales Rep. I
[Table of Gender (F, M) by Country (AU, US); cell values not preserved]
Job_Title Sales Rep. II
Gender	Country AU (N)	Country US (N)
F	10.00	14.00
M	8.00	14.00
Dimension Expression
•Elements that can be used in a dimension expression:
• classification variables
• analysis variables
• the universal class variable ALL
• keywords for statistics
Operators that can be used in a dimension expression:
• blank, which concatenates table information
• asterisk *, which crosses table information
• parentheses (), which group elements
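The three operators can be combined in a single TABLE statement. The following sketch uses a hypothetical toy data set (work.toy_sales and its values are assumptions for illustration) to show concatenation, crossing, and grouping together:

```sas
/* Hypothetical toy data, invented for illustration only. */
data work.toy_sales;
   input Gender $ Country $ Salary;
   datalines;
F AU 27000
F US 29000
M AU 31000
M US 33000
M US 35500
;
run;

proc tabulate data=work.toy_sales;
   class Gender Country;
   var Salary;
   /* Row dimension: Gender levels followed by a total row        */
   /*   (the blank between Gender and ALL concatenates).          */
   /* Column dimension: (Country ALL) is grouped by parentheses,  */
   /*   then crossed (*) with Salary and with the statistics      */
   /*   N and MEAN, which are grouped as well.                    */
   table Gender all, (Country all)*Salary*(n mean);
run;
```

The resulting table should have one column pair (N, Mean) for each country plus a pair for all countries together, and one row per gender plus a total row.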
Dimension Expression
proc tabulate data=orion.sales;
   class Gender Country;
   var Salary;
   table Gender all, Country*Salary;
run;
Gender	Country AU, Salary (Sum)	Country US, Salary (Sum)
F	747965.00	1207900.00
M	1152050.00	2033505.00
All	1900015.00	3241405.00
PROC TABULATE Statistics
Descriptive Statistic Keywords
	CSS	CV	LCLM	MAX
MEAN	MIN	MODE	N	NMISS
KURTOSIS	RANGE	SKEWNESS	STDDEV	STDERR
SUM	SUMWGT	UCLM	USS	VAR
PCTN	REPPCTN	PAGEPCTN	ROWPCTN	COLPCTN
PCTSUM	REPPCTSUM	PAGEPCTSUM	ROWPCTSUM	COLPCTSUM
Quantile Statistic Keywords
MEDIAN | P50	P1	P5	P10	Q1 | P25
Q3 | P75	P90	P95	P99	QRANGE
Hypothesis Testing Keywords
PROBT	T
PROC TABULATE Statistics
proc tabulate data=orion.sales;
   class Gender Country;
   var Salary;
   table Gender all, Country*Salary*(min max);
run;
Gender	AU Salary Min	AU Salary Max	US Salary Min	US Salary Max
F	25185.00	30890.00	25390.00	83505.00
M	25745.00	108255.00	22710.00	243190.00
All	25185.00	108255.00	22710.00	243190.00
Additional SAS Statements
•Additional statements can be added to enhance the report.
proc format;
   value $ctryfmt 'AU'='Australia'
                  'US'='United States';
run;
options nodate pageno=1;
ods html file='p112d08.html';
proc tabulate data=orion.sales;
   class Gender Country;
   var Salary;
   table Gender all, Country*Salary*(min max);
   where Job_Title contains 'Rep';
   label Salary='Annual Salary';
   format Country $ctryfmt.;
   title 'Sales Rep Tabular Report';
run;
ods html close;
p112d08
Additional SAS Statements
•HTML Output
Sales Rep Tabular Report
Gender	Australia, Annual Salary Min	Australia, Annual Salary Max	United States, Annual Salary Min	United States, Annual Salary Max
F	25185.00	30890.00	25390.00	32985.00
M	25745.00	36605.00	22710.00	35990.00
All	25185.00	36605.00	22710.00	35990.00
Output Data Sets
•PROC TABULATE produces output data sets using the following method:
PROC TABULATE DATA=SAS-data-set
              OUT=SAS-data-set <options>;
•The output data set contains the following variables:
• BY variables
• class variables
• the automatic variables _TYPE_, _PAGE_, and _TABLE_
• calculated statistics
PROC Statement OUT= Option
proc tabulate data=orion.sales
              out=work.tabulate;
   where Job_Title contains 'Rep';
   class Job_Title Gender Country;
   table Country;
   table Gender, Country;
   table Job_Title, Gender, Country;
run;
proc print data=work.tabulate;
run;
p112d09
PROC Statement OUT= Option
•Partial PROC PRINT Output
Obs	Job_Title	Gender	Country	_TYPE_	_PAGE_	_TABLE_	N
1			AU	001	1	1	61
2			US	001	1	1	98
3		F	AU	011	1	2	27
4		F	US	011	1	2	40
5		M	AU	011	1	2	34
6		M	US	011	1	2	58
7	Sales Rep. I	F	AU	111	1	3	8
8	Sales Rep. I	F	US	111	1	3	13
9	Sales Rep. I	M	AU	111	1	3	13
10	Sales Rep. I	M	US	111	1	3	29
11	Sales Rep. II	F	AU	111	2	3	10
12	Sales Rep. II	F	US	111	2	3	14
13	Sales Rep. II	M	AU	111	2	3	8
14	Sales Rep. II	M	US	111	2	3	14
15	Sales Rep. III	F	AU	111	3	3	7
16	Sales Rep. III	F	US	111	3	3	8
17	Sales Rep. III	M	AU	111	3	3	10
18	Sales Rep. III	M	US	111	3	3	9
PROC Statement OUT= Option
•_TYPE_ is a character variable that shows which combination of class variables produced the summary statistics in that observation.
•Each digit of _TYPE_ corresponds to one class variable. For example, in observations 3 to 6 the value 011 means 0 for Job_Title, 1 for Gender, and 1 for Country.
PROC Statement OUT= Option
•_PAGE_ is a numeric variable that shows the logical page number that contains that observation.
Obs	Job_Title	_PAGE_
7-10	Sales Rep. I	1
11-14	Sales Rep. II	2
15-18	Sales Rep. III	3
PROC Statement OUT= Option
•_TABLE_ is a numeric variable that shows the number of the TABLE statement that contains that observation.
Obs	_TYPE_	_TABLE_	TABLE statement
1-2	001	1	first TABLE statement
3-6	011	2	second TABLE statement
7-18	111	3	third TABLE statement
More on PROC TABULATE:
• In the SUGI 28 proceedings:
• "The Simplicity and Power of the TABULATE Procedure" by Dan Bruns, http://www2.sas.com/proceedings/sugi28/197-28.pdf
• Online (from the SUGI 27 proceedings):
• "Anyone Can Learn PROC TABULATE" by Lauren Haworth, http://www2.sas.com/proceedings/sugi27/po60-27.pdf
The UNIVARIATE Procedure
•The UNIVARIATE procedure produces summary reports that display descriptive statistics.
•General form of the UNIVARIATE procedure:
PROC UNIVARIATE DATA=SAS-data-set;
   VAR variable(s);
RUN;
•The VAR statement specifies the analysis variables and their order in the results.
The UNIVARIATE Procedure
The following PROC UNIVARIATE step shows default descriptive statistics for Salary.
proc univariate data=orion.nonsales;
   var Salary;
run;
•Without the VAR statement, SAS will analyze all numeric variables.
The UNIVARIATE Procedure
The UNIVARIATE procedure can produce the following sections of output:
• Moments
• Basic Statistical Measures
• Tests for Location
• Quantiles
• Extreme Observations
• Missing Values
Visualization - sources
The books of Prof. Tufte are usually cited first, e.g. Tufte E.R. (1983) The Visual Display of Quantitative Information, Graphics Press, Cheshire, Conn.
• Websites on visualization, e.g.
• http://www.math.yorku.ca/SCS/Gallery/noframes.html - a gallery with instructive commentary and examples, including failed or misleading graphs
• http://www.agocg.ac.uk/ - John Lansdown (1992) Aspects of Design in Computer Graphics: Some Notes - http://www.agocg.ac.uk/train/hitch/hitch.htm
• Other websites, e.g. the pages of various visualization programs and organizations
• http://www.cybergeography.org/atlas/atlas.html or http://miner3d.com/products/gallery.html
Visualization - history
□ William Playfair, 1786: the first published presentation graphics
□ Dr. John Snow, 1854: cholera epidemic in London
□ Florence Nightingale, 1858: causes of death during the Crimean War (1853-1856)
[Figure: diagram "The Causes of Mortality in the Army in the East"]
□ Harry Beck, 1931: diagram of the London Underground
Visualization - investigative analysis
http://www.i2inc.com/
Law Enforcement
» Counterterrorism » Narcotics investigations » Organized crime » Intelligence analysis » Fraud
» Missing persons » Major investigations » Counterfeiting » Immigration control » Major event security » Money laundering » Gang investigations
Government
» Criminal prosecutions » National security » Military intelligence » Embassy security » Postal inspection and fraud » Prison investigations » Park and wildlife services » Antitrust investigations » Tax fraud investigations » Customs investigations
Commercial
» Forensic accounting
» Money laundering
» Insider trading violations
» Corporate security
» Anti-pirating investigations
» Entertainment copyright violations
» Competitive intelligence
» Civil lawsuits
» Fraud:
» Credit card
» Insurance
» Retail
» Health care
» Commercial
» Telephone
Visualization - investigative analysis
□ personal contacts, insurance fraud
□ money laundering, criminal gangs
[Figure: screenshot of a link chart titled "Outlaw Motorcycle Gang Hierarchy & Activity"]
Visualization - risk management
Visualization - dendrogram
Credit ranking (1=default)
Node	Parent	Branch	Bad % (n)	Good % (n)	Total % (n)
0		(root)	52.01 (168)	47.99 (155)	100.00 (323)
1	0	Weekly pay	86.67 (143)	13.33 (22)	51.08 (165)
2	0	Monthly salary	15.82 (25)	84.18 (133)	48.92 (158)
3	1	Management; Professional	71.11 (32)	28.89 (13)	13.93 (45)
4	1	Clerical; Skilled Manual	97.56 (80)	2.44 (2)	25.39 (82)
5	1	Unskilled	81.58 (31)	18.42 (7)	11.76 (38)
6	2	Young (< 25)	48.98 (24)	51.02 (25)	15.17 (49)
7	2	Middle (25-35); Old (> 35)	0.92 (1)	99.08 (108)	33.75 (109)
Splits:
• Node 0 by Paid Weekly/Monthly: Adj. P-value=0.0000, Chi-square=179.6665, df=1
• Node 1 by Social Class: Adj. P-value=0.0004, Chi-square=20.3674, df=2
• Node 2 by Age Categorical: Adj. P-value=0.0000, Chi-square=58.7255, df=1
Visualization - economics
[Figure: "S&P Composite Index: Regression to Trend", real (inflation-adjusted) price since 1871 with regression, variance from trend shown below on an arithmetic scale; dshort.com, February 2010]
This log-scale chart illustrates regression to the trend across 139 years of market history. The peak in 2000 was an unprecedented 162% above trend, double the peak in 1929. The index had been above trend for 17 years. The latest daily close was 3454 above trend.
Choropleth map (cartogram)
□ Municipalities with 500 or more inhabitants with a high-speed connection to …
Diagram map (cartodiagram)
[Figure: interventions of fire protection units against insects in the districts of the Czech Republic, 1997-2000]
Graphs - other types
Graph scale
□ Which line rises more steeply?
[Figure: two line plots over the same horizontal range (100 to 115) drawn with different vertical scales]
x	y
103	567
105	577
107	587
109	597
110	602
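The effect of the vertical scale can be reproduced with the table above. The sketch below plots the same five points twice with PROC GPLOT, once on a wide vertical axis and once on a narrow one; the axis ranges and the data set name work.scale_demo are assumptions chosen for illustration:

```sas
/* The five (x, y) pairs from the table above. */
data work.scale_demo;
   input x y;
   datalines;
103 567
105 577
107 587
109 597
110 602
;
run;

symbol1 interpol=join value=dot;

/* Assumed axis ranges: a wide one and a narrow one. */
axis1 order=(0 to 700 by 100);    /* the line looks almost flat */
axis2 order=(560 to 610 by 10);   /* the same line looks steep  */

proc gplot data=work.scale_demo;
   plot y*x / vaxis=axis1;
   plot y*x / vaxis=axis2;
run;
quit;
```

Both plots show the same slope of 5 units of y per unit of x; only the chosen axis range changes the visual impression.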
Graph scale
□ The graph author's view:
> Emphasizing the trend - positive results.
> Suppressing the trend - negative results.
□ The graph reader's view:
> Graphs without a stated scale are highly suspect.
> Do not give in to the implied message about growth or decline.
What Is SAS/GRAPH Software?
SAS/GRAPH software is a component of SAS software that enables you to create the following types of graphs:
• bar, block, and pie charts
• two-dimensional scatter plots and line plots
• three-dimensional scatter and surface plots
• contour plots
• maps
• text slides
• custom graphs
Basic graph types
• Bar Charts (GCHART Procedure)
[Figure: "Frequency of Job Title, Broken Down by Gender"; axes: Employee Job Title, FREQUENCY; legend: Employee Gender F, M]
• Pie Charts (GCHART Procedure)
[Figure: "Frequency Distribution of Job Titles", 3-D pie chart]
Basic graph types
• Scatter and Line Plots (GPLOT Procedure)
[Figure: "Plot of Budget by Month for 2006 and 2007"; axes: Month, Budget]
• Bar Charts with Line Plot Overlay (GBARLINE Procedure)
[Figure: "Costs and Personnel for Western Regions"; axes: REGION, Total Cost, # of Employees]
• Three-Dimensional Surface and Scatter Plots (G3D Procedure)
• Maps (GMAP Procedure)
[Figure: "Distribution of Jobs" (an empty state indicates no jobs); legend: Number of Jobs]
• Multiple graphs on a page (GREPLAY Procedure)
[Figure panels: "Total Equipment Costs by Region", "Regional Office Locations", "Equipment and Personnel Costs in Canada", "Number of Contracts by Pollution Type"]
Producing Bar and Pie Charts with the GCHART Procedure
•General form of the PROC GCHART statement:
PROC GCHART DATA=SAS-data-set;
•Use one of these statements to specify the chart type:
HBAR chart-variable . . . </ options>;
HBAR3D chart-variable . . . </ options>;
VBAR chart-variable . . . </ options>;
VBAR3D chart-variable . . . </ options>;
PIE chart-variable . . . </ options>;
PIE3D chart-variable . . . </ options>;
Producing Plots with the GPLOT Procedure
You can use the GPLOT procedure to plot one variable against another within a set of coordinate axes.
•General form of a PROC GPLOT step:
PROC GPLOT DATA=SAS-data-set;
   PLOT vertical-variable*horizontal-variable </ options>;
RUN;
QUIT;
5. Regression. Logistic regression
Overview
Type of Predictors
Type of Response	Categorical	Continuous	Categorical and Continuous
Continuous	Analysis of Variance	Linear Regression	Analysis of Covariance (Regression with dummy variables)
Categorical	Logistic Regression or Contingency Tables	Logistic Regression	Logistic Regression
Overview of SAS procedures for regression
• SAS/STAT: CATMOD, GAM, GENMOD, GLIMMIX, GLM, LIFEREG, LOESS, LOGISTIC (logistic regression), MIXED, NLIN, NLMIXED, ORTHOREG, PHREG, PLS, PROBIT, REG ("classical" linear regression), ROBUSTREG, RSREG, SURVEYLOGISTIC, SURVEYPHREG, SURVEYREG, TRANSREG.
• SAS/ETS: AUTOREG, COUNTREG, MODEL, PANEL, PDLREG, SYSLIN.
Simple Linear Regression Model
[Figure: plot with Predictor (X) on the horizontal axis]
The REG Procedure
•General form of the REG procedure:
PROC REG DATA=SAS-data-set <options>;
   MODEL dependent(s)=regressor(s) </ options>;
RUN;
Description and a simple example:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect003.htm
Linear regression - PROC REG
PROC REG <options>;
   <label:> MODEL dependents=<regressors> </ options>;
   BY variables;
   FREQ variable;
   ID variables;
   VAR variables;
   WEIGHT variable;
   ADD variables;
   DELETE variables;
   <label:> MTEST <equation, ..., equation> </ options>;
   OUTPUT <OUT=SAS-data-set> <keyword=names> <... keyword=names>;
   PAINT <condition | ALLOBS> </ options> | <STATUS | UNDO>;
   PLOT <yvariable*xvariable> <=symbol> <... yvariable*xvariable> <=symbol> </ options>;
   PRINT <options> <ANOVA> <MODELDATA>;
   REFIT;
   RESTRICT equation, ..., equation;
   REWEIGHT <condition | ALLOBS> </ options> | <STATUS | UNDO>;
   <label:> TEST equation, <, ..., equation> </ option>;
More at: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm
Modeling a categorical response
Will a default occur?
No.	X	Y
1	2.6	1
2	1.4	0
3	.65	1
4	4.1	1
5	.25	0
6	1.9	0
"Classical" regression is not suitable here, so logistic regression is used.
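A binary logistic model for the six observations above can be fitted with PROC LOGISTIC. This is a minimal sketch; the data set name work.defaults is an assumption, and with only six cases the estimates are purely illustrative:

```sas
/* The six (X, Y) observations from the table above; Y=1 marks a default. */
data work.defaults;
   input X Y;
   datalines;
2.6 1
1.4 0
0.65 1
4.1 1
0.25 0
1.9 0
;
run;

/* EVENT='1' asks PROC LOGISTIC to model the probability that Y=1. */
proc logistic data=work.defaults;
   model Y(event='1') = X;
run;
```

Without the EVENT= option PROC LOGISTIC models the lower ordered response value (here Y=0) by default, so the signs of the estimates would be reversed.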
Types of Logistic Regression
Response Variable	Type of Logistic Regression
Two categories	Binary
Three or more categories	Nominal or Ordinal
Why Not Ordinary Least Squares Regression?
$Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$
• If the response variable is categorical, then how do you code the response numerically?
• If the response is coded (1=Yes and 0=No) and your regression equation predicts 0.5 or 1.1 or -0.4, what does that mean practically?
• If there are only two (or a few) possible response levels, is it reasonable to assume constant variance and normality?
What About a Linear Probability Model?
$p_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$
• Probabilities are bounded, but linear functions can take on any value. (Once again, how do you interpret a predicted value of -0.4 or 1.1?)
• Given the bounded nature of probabilities, can you assume a linear relationship between X and p throughout the possible range of X?
• Can you assume a random error with constant variance?
• What is the observed probability for an observation?
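Measuring the probability of success
• Probability is measured by the odds of success (of the event).
• If $P$ is the probability of the event, then $(1-P)$ is the probability that it does not occur.
• Odds of the event = $P / (1-P)$
Logistic regression
The simultaneous effect of the independent (explanatory) variables on the odds:
$\text{Odds} = \frac{P}{1-P} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}$
Taking the logarithm of both sides:
$\log\frac{P}{1-P} = \log e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}$
$\operatorname{logit} P = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$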
Logit Transformation
Logistic regression models transformed probabilities called logits*.
$\operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right)$
where
$i$ indexes all cases (observations)
$p_i$ is the probability the event (a default, for example) occurs in the $i$th case
$\ln$ is the natural log (to the base $e$).
* The logit is the natural log of the odds.
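As a quick numeric check of this definition (the probability 0.8 is an illustrative assumption, not a value from the course data):

```latex
\[
p = 0.8 \;\Rightarrow\; \text{odds} = \frac{p}{1-p} = \frac{0.8}{0.2} = 4,
\qquad \operatorname{logit}(p) = \ln 4 \approx 1.386 .
\]
\[
\text{Inverse: } p = \frac{1}{1 + e^{-\operatorname{logit}(p)}}
  = \frac{1}{1 + e^{-1.386}} \approx \frac{1}{1 + 0.25} = 0.8 .
\]
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0; probabilities below 0.5 give negative logits.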
Logit link function
[Figure: the logit transform, mapping probabilities on the vertical scale 0 to 1.0 to logit scores]
The logit link function transforms probabilities (between 0 and 1) to logit scores (between $-\infty$ and $+\infty$).
Logistic Regression Model
$\operatorname{logit}(p_i) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$
where
• $\operatorname{logit}(p_i)$ = logit of the probability of the event
• $\beta_0$ = intercept of the regression equation
• $\beta_k$ = parameter estimate of the $k$th predictor variable
Logistic Regression Curve
Logistic Regression - example
$\operatorname{logit}(\hat{p}) = w_0 + w_1 x_1 + w_2 x_2$
$\hat{p} = \frac{1}{1 + e^{-\operatorname{logit}(\hat{p})}}$
Find parameter estimates by maximizing the log-likelihood function
$\sum_{\text{primary outcome training cases}} \log(\hat{p}_i) + \sum_{\text{secondary outcome training cases}} \log(1 - \hat{p}_i)$
Logistic Regression - example
$\text{logit}(\hat p) = -0.81 + 0.92\,x_1 + 1.11\,x_2$
$\hat p = \frac{1}{1 + e^{-\text{logit}(\hat p)}}$
Using the maximum likelihood estimates, the prediction formula assigns a logit score to each x1 and x2.
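The prediction formula can be evaluated directly. The sketch below uses the three estimates quoted on the slide; the input point (0.5, 0.5) is a hypothetical value chosen only to show the arithmetic.

```python
import math

# Maximum likelihood estimates quoted on the slide.
W0, W1, W2 = -0.81, 0.92, 1.11

def predict(x1, x2):
    """Return (logit score, predicted probability) for inputs x1 and x2."""
    score = W0 + W1 * x1 + W2 * x2
    return score, 1.0 / (1.0 + math.exp(-score))

score, p_hat = predict(0.5, 0.5)  # hypothetical input point
print(round(score, 3), round(p_hat, 3))
```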
Another example at:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_logistic_sect002.htm
Parameter estimation
The maximum likelihood method leads to a system of nonlinear equations.
• We solve this system with the Newton-Raphson iterative method.
More at:
• http://www.stat.cmu.edu/~cshalizi/402/lectures/14-logistic-regression/lecture-14.pdf
• http://czep.net/stat/mlelr.pdf
• http://www.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf
Maximum likelihood estimation (MLE)
MLE is a general-purpose method for parametric model estimation. We will make use of it to estimate the logistic regression.
If we have a model with parametric structure $\theta$, we can compute the likelihood that the model will generate a sequence of n observations $D$:
$L(\theta|D) = P(D|\theta)$
The model which best fits the data is selected as the one which maximizes this likelihood.
$\hat\theta = \arg\max_\theta L(\theta|D)$
If we assume independence between the observations, this then gives
$\hat\theta = \arg\max_\theta \prod_{i=1}^{n} P(d_i|\theta)$
Source: http://www2.imperial.ac.uk/~abellott/Credit%20Scoring%202.pdf
Maximum likelihood estimation
This MLE can be expressed more conveniently in terms of log-likelihoods (since log is monotonic in its argument):
$\hat\theta = \arg\max_\theta \sum_{i=1}^{n} \log P(d_i|\theta)$
Remember:
• We do not know the true value of the parameter $\theta$, but we want to estimate it.
• To distinguish the estimate from the true value, in our notation, we put a "hat" on the estimate: $\hat\theta$.
MLE has several nice asymptotic properties:
• Consistency
• Asymptotic normality
• Efficiency
Maximum likelihood estimation
Consider the training data set $D_{train}$ with n observations (borrowers). Remember
• $x_i$ denotes values for predictor variables for observation i.
• $y_i$ denotes the outcome for observation i, either 0 or 1.
Then the likelihood of the outcome for each observation i is given by
$P(y_i = 0|x_i, \beta)$ if $y_i = 0$,
$1 - P(y_i = 0|x_i, \beta)$ if $y_i = 1$,
which is
$P(y_i = 0|x_i, \beta)^{1-y_i}\,(1 - P(y_i = 0|x_i, \beta))^{y_i}$
giving log-likelihood for each observation:
$(1 - y_i)\log P(y_i = 0|x_i, \beta) + y_i \log(1 - P(y_i = 0|x_i, \beta))$
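Maximum likelihood estimation
Assuming independence between observations, and with $P(y_i = 0|x_i, \beta) = \frac{1}{1 + e^{-(\beta_0 + \beta \cdot x_i)}}$, this gives the log-likelihood function for $\beta$:
$\log L(\beta|D_{train}) = -\sum_{i=1}^{n} \left[ (1 - y_i)\log(1 + e^{-(\beta_0 + \beta \cdot x_i)}) + y_i \log(1 + e^{\beta_0 + \beta \cdot x_i}) \right]$
Differentiating by each coefficient in $\beta$ and setting the derivative equal to zero to find the maxima gives
$\sum_{i=1}^{n} \left(1 - y_i - \frac{1}{1 + e^{-(\beta_0 + \beta \cdot x_i)}}\right) = 0$
and
$\sum_{i=1}^{n} x_{ij}\left(1 - y_i - \frac{1}{1 + e^{-(\beta_0 + \beta \cdot x_i)}}\right) = 0$
for each attribute j = 1 to m.
These are non-linear equations that can be solved by computer intensive processes such as Newton-Raphson methods.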
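The log-likelihood and the left-hand sides of the first-order conditions can be written out as a short sketch. It assumes the convention used in this derivation, namely that the logistic function gives P(y = 0 | x); the function names and the toy data are hypothetical.

```python
import math

def p_zero(beta0, beta, x):
    """P(y = 0 | x, beta) = 1 / (1 + exp(-(beta0 + beta . x)))."""
    s = beta0 + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-s))

def log_likelihood(beta0, beta, xs, ys):
    """Sum of the per-observation log-likelihoods."""
    total = 0.0
    for x, y in zip(xs, ys):
        p0 = p_zero(beta0, beta, x)
        total += (1 - y) * math.log(p0) + y * math.log(1.0 - p0)
    return total

def score_equations(beta0, beta, xs, ys):
    """Left-hand sides of the first-order conditions: intercept first, then each attribute."""
    g0 = 0.0
    g = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        r = 1 - y - p_zero(beta0, beta, x)
        g0 += r
        for j, v in enumerate(x):
            g[j] += v * r
    return [g0] + g

# Hypothetical toy data: one attribute per borrower, y = 1 marks a default.
xs = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
ys = [1, 1, 0, 1, 0, 0]
print(log_likelihood(0.3, [-0.4], xs, ys))
print(score_equations(0.3, [-0.4], xs, ys))
```

At the maximum likelihood estimate every entry returned by `score_equations` is zero; away from it, the entries are the partial derivatives of the log-likelihood.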
Standard errors on the MLE
Since $\hat\theta$ is only an estimate of the best model to explain the data, it is possible to derive standard errors s on the estimates.
Asymptotic normality for MLE is such that
$\frac{\hat\theta_j - \theta_j}{s_j} \to N(0,1)$ as $n \to \infty$
where $\hat\theta_j$, $\theta_j$ and $s_j$ are the jth components of $\hat\theta$, $\theta$ and s respectively and N(0,1) is the standard normal distribution.
This property then allows us to:
• generate hypothesis tests using the Wald chi-square statistic;
• generate confidence intervals around the estimate.
MLE - hypothesis testing
We test the hypothesis that an estimated coefficient is not zero against the null hypothesis that it is zero. That is, we are testing whether a parameter has a genuine effect in the model.
• Null hypothesis: $H_0: \theta_j = 0$
• Alternative hypothesis: $H_1: \theta_j \neq 0$
The Wald test says reject $H_0$ if $\frac{|\hat\theta_j|}{s_j} > z_{\alpha/2}$ for some significance level $\alpha$,
where $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ and $\Phi$ is the CDF for the standard normal distribution.
MLE - confidence intervals
The asymptotic normality property also allows us to compute confidence intervals (CIs):
$P(\hat\theta_j - z_{\alpha/2}\,s_j < \theta_j < \hat\theta_j + z_{\alpha/2}\,s_j) \to 1 - \alpha$
as $n \to \infty$.
This is a range of possible values of the parameter within a given confidence level 1 — a.
Note: the larger the confidence level, the broader the confidence interval.
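Both the Wald test and the Wald confidence interval need only an estimate and its standard error. The sketch below uses Python's standard library; the estimate 0.5 and standard error 0.2 are hypothetical values for illustration.

```python
import math
from statistics import NormalDist

def wald_test(estimate, std_error):
    """Return (z statistic, Wald chi-square, two-sided p-value)."""
    z = estimate / std_error
    chi_square = z * z
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, chi_square, p_value

def wald_confidence_interval(estimate, std_error, alpha=0.05):
    """Interval estimate -/+ z_{alpha/2} * std_error."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z_crit * std_error
    return estimate - half_width, estimate + half_width

# Hypothetical estimate and standard error.
z, chi_square, p_value = wald_test(0.5, 0.2)
lo95, hi95 = wald_confidence_interval(0.5, 0.2, alpha=0.05)
lo99, hi99 = wald_confidence_interval(0.5, 0.2, alpha=0.01)
print(round(chi_square, 2), round(p_value, 4))
print((round(lo95, 3), round(hi95, 3)), (round(lo99, 3), round(hi99, 3)))
```

The 99% interval comes out wider than the 95% interval, which is the point of the note about confidence level and interval width.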
Likelihood Ratio Test
The maximized likelihood gives a measure of how well the model fits the data (1 = perfect fit, 0 = no fit). The ratio of likelihoods between two models, one nested in the other, can be used to test whether the fit of the larger model improves on the smaller one.
Definitions
Suppose we have two models A and B with the same structure except A has more parameters than B:
$\theta_A = (\theta_1, \dots, \theta_{m+r})$ and $\theta_B = (\theta_1, \dots, \theta_m)$. Then B is nested in A.
The likelihood ratio statistic is $\lambda = -2\log\frac{L(\hat\theta_B|D)}{L(\hat\theta_A|D)}$.
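Newton-Raphson method
• Basic principle of the method:
$p(x,\beta) = \frac{1}{1 + e^{-\beta^T x}}$,  $L(\beta) = \sum_i \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right]$,  $\beta^{new} = \beta^{old} - \left(\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1} \frac{\partial L(\beta)}{\partial\beta}$
• Matrix form:
$\beta^{new} = (X^T W X)^{-1} X^T W (X\beta^{old} + W^{-1}(y - p))$
y ...     vector of observations of the response variable
X ...     design matrix, of type n × (p + 1)
p ...     vector of probabilities $p(x_i, \beta^{old})$
W ...     n × n diagonal matrix of weights, with diagonal elements $p(x_i, \beta^{old})\,(1 - p(x_i, \beta^{old}))$
• This is a numerical iterative method, so it is necessary to check whether the convergence criterion was met (whether the method converged to the optimal solution).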
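The iteration can be sketched without any matrix library for the smallest case, an intercept plus one predictor, where the matrix $X^T W X$ is 2 × 2 and can be inverted by hand. The update used here, $\beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)$, is algebraically the same as the matrix form above. The function name, the toy data and the tolerance are assumptions made for this illustration, and p denotes P(y = 1 | x) as on this slide.

```python
import math

def fit_logistic_newton(xs, ys, max_iter=25, tol=1e-10):
    """Newton-Raphson for logit(p) = b0 + b1 * x, where p = P(y = 1 | x)."""
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        # Gradient X'(y - p) and matrix X'WX with W = diag(p * (1 - p)).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += x * (y - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve (X'WX) * step = X'(y - p) for the 2x2 case.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            return b0, b1, iteration, True
    return b0, b1, max_iter, False

# Hypothetical toy data; the two classes overlap, so the estimates are finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1, iterations, converged = fit_logistic_newton(xs, ys)
print(round(b0, 4), round(b1, 4), iterations, converged)
```

The returned flag makes the convergence check explicit: if it is False after `max_iter` steps, the estimates should not be trusted. This can happen, for example, when the classes are perfectly separated.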
Advantages of logistic regression
• Few parameters
• Easy to use and to interpret
• Discrete predictors can be included easily
• Works well even on data that differ considerably from Gaussian mixtures
• And above all, it usually works well if appropriate attention is paid to data preparation
• Practical experience: in four cases out of five, logistic regression on the data I analyse is either the best method or roughly as good as the other methods
Interpretation, differences from OLS
• Regression coefficients b: a positive coefficient means that an increase in the variable raises the odds of belonging to the group coded as 1; a negative coefficient indicates a decrease in these odds
• $\exp(b_i)$ is often used: it is the factor by which the odds p/(1−p) are multiplied when $x_i$ increases by one unit and the other $x_k$ are held fixed
• Beware of the different scales on which the $x_i$ may be measured
• Instead of the F-test of overall validity we now have a chi-square test for the same purpose
• Instead of the t-test of significance of the variables in the model there are Wald statistics; they are essentially the same thing and are read in the same way
• Instead of R² there are only pseudo-R² measures
Example
The following logistic regression output was produced on a data set of 40,000 credit cards.
Likelihood Ratio = 1819 (p-value < 0.001)
Variable	Coefficient	Estimate	Standard error	Wald chi-square	P > chi-square
Intercept	β0	-0.181	0.084	4.6	0.032
Age	β1	+0.0353	0.0013	757.6	<0.001
Income (log)	β2	-0.0164	0.0100	2.67	0.10
Residential phone	β3	+0.622	0.030	430.8	<0.001
Home owner *		0			
Renter	β4	-0.155	0.039	15.6	<0.001
Lives with parents	β5	+0.256	0.045	32.1	<0.001
Months in residence	β6	-0.00025	0.00011	5.4	0.020
Months in current job	β7	+0.00210	0.00025	72.9	<0.001
* Notice that the Home owner category is set as base residency category and so has no coefficient estimate. We will discuss this in a later lecture.
Source: http://www2.imperial.ac.uk/~abellott/Credit%20Scoring%202.pdf
Example
We have used logistic regression to model the negative outcome (ie y = 0).
• This may seem odd given that the outcome of interest is the positive one (eg default).
• However, this model ensures the log-odds scores are the right way round: ie increasing scores imply increasing creditworthiness.
• There is no material difference. If we had modelled y = 1, the signs on the coefficient estimates would be reversed but everything else would be the same.
Interpretations:
• The estimates (highlighted) form the scorecard.
• Estimates greater than 0 indicate relative decrease in risk.
• Estimates less than 0 indicate relative increase in risk.
• Small p-values indicate coefficients that are statistically significantly different to zero (how small?).
• Large p-values indicate coefficients that have a good chance of actually being zero.
Example
Remember in the exercise in Chapter 1 we gave details of six borrowers. You were asked to select three to accept and three to reject.
Here the scores assigned by the model above are shown. The observations with the three lowest scores are rejected by the model. The actual outcome in each case is also shown. How does your performance compare with the model?
Age	Monthly income (£)	Residential phone?	Residence type?	Months in residence	Months in current job	Score	Model accept or reject?	Actual outcome
22	1,145	Yes	Home owner	48	12	1.11	Reject	Good
46	15,500	Yes	Renter	48	192	2.14	Accept	Good
71	900	Yes	Renter	96	12	2.68	Accept	Good
32	5,000	Yes	Renter	48	168	1.61	Accept	Bad
25	1,385	Yes	Renter	12	0	1.05	Reject	Bad
43	3,145	No	Home owner	96	36	1.25	Reject	Bad
Example
Variable	Value	Coefficient	Estimate	Value x Estimate
Intercept	n/a	β0	-0.181	-0.181
Age	22	β1	+0.0353	+0.777
Income (log)	log(1145) = 7.04	β2	-0.0164	-0.116
Residential phone	1	β3	+0.622	+0.622
Home owner *	1		0	0
Renter	0	β4	-0.155	0
Lives with parents	0	β5	+0.256	0
Months in residence	48	β6	-0.00025	-0.012
Months in current job	12	β7	+0.00210	+0.025
Score (sum)				+1.115
Compute the PD of the borrower.
Score s = 1.115, and because the score is the log-odds of y = 0: $P(y = 1|s) = \frac{1}{1 + e^{s}} \approx 0.25$
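The score and the probability of default (PD) for this borrower can be recomputed from the table. The sketch below assumes, as stated for this example, that the model was fitted to the outcome y = 0, so the score is the log-odds of not defaulting and PD = 1 / (1 + e^score). The dictionary layout and variable names are chosen here for illustration.

```python
import math

# (estimate, value) pairs from the scorecard table for the first borrower.
terms = {
    "Intercept": (-0.181, 1.0),
    "Age": (0.0353, 22),
    "Income (log)": (-0.0164, math.log(1145)),
    "Residential phone": (0.622, 1),
    "Home owner": (0.0, 1),            # base category, no coefficient
    "Renter": (-0.155, 0),
    "Lives with parents": (0.256, 0),
    "Months in residence": (-0.00025, 48),
    "Months in current job": (0.00210, 12),
}

# Score = sum of value x estimate; it is the log-odds of y = 0 (no default).
score = sum(estimate * value for estimate, value in terms.values())
pd_borrower = 1.0 / (1.0 + math.exp(score))
print(round(score, 3), round(pd_borrower, 3))
```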
Multinomial logistic regression
• Also called polytomous regression
• The dependent variable has M categories, more than two. For example: which party does the respondent vote for?
• Basic idea:
• Declare one category to be the reference category
• Fit M−1 ordinary logistic models, one for each of the remaining categories against the reference category
• Predict the category with the highest probability across all models
Model building
□ Forward
- starts with an empty model
- variables are added step by step
□ Backward
- starts with the full model (all variables)
- variables are removed step by step
□ Stepwise
- starts with an empty model
- variables are added and removed step by step
□ Enter
- the list of variables in the model is prescribed
Logistic Regression with Sequential Steps
• Forward regression
• starts with a baseline model (intercept-only)
• searches all variables and finds the strongest one
• keeps adding variables in order of strength until no significant improvement is achieved in the model.
• Backwards regression
• starts with a full model using all variables
• removes the weakest input variable provided that taking it out does not cause a significant reduction in the fit of the model
• continues removing the weakest input variables in order unless there is a significant reduction in the fit of the model; at which point the algorithm stops.
Logistic Regression with Sequential Steps
• Stepwise regression
• is a combination of forward and backward regression
• begins the same way as forward
• re-evaluates the statistical significance of all included variables after each new variable is added.
• If a previously included variable becomes statistically insignificant when a new variable is added, that variable is then removed.
• The algorithm stops when no more variables can be found that add significantly to the fit of the model and all variables remaining in the model are statistically significant.
The Logistic Regression Task
[Screenshot: Logistic Regression task dialog for Local:SASUSER.SALES_INCLEVEL, Data pane. Selection pane: Data, Model (Response, Effects, Selection, Options), Plots, Predictions, Titles, Properties. Variables to assign: IncLevel, Purchase, Gender, Income, Age. Task roles: Dependent variable, Quantitative variables, Classification variables, Group analysis by, Frequency count, Relative weight.]
The selection pane enables you to choose different sets of options for the task.
The "Dependent variable" role must have a variable assigned to it.
Which Link Function, Which Response Level to Model?
[Screenshot: Logistic Regression task dialog for Local:SASUSER.SALES_INCLEVEL, Model > Response pane, with the choice of link function highlighted.]
Specify the level of the response variable that you want to model. For example, do you want to model the probability of a 0 or a 1?
LOGISTIC Procedure
General form of the LOGISTIC procedure:
PROC LOGISTIC DATA=SAS-data-set <options>;
CLASS variables </ options>;
MODEL response=predictors </ options>;
UNITS independent=list ... </ options>;
ODDSRATIO <'label'> variable </ options>;
OUTPUT OUT=SAS-data-set keyword=name </ options>;
RUN;
More, for example, at: http://www.okstate.edu/sas/v8/sashtml/onldoc.htm
http://www.okstate.edu/sas/v8/saspdf/stat/chap39.pdf
LOGISTIC Procedure - example
ods html file="logistic_vyvoj.html" style=sasweb;
proc logistic data=dm1.data_vyvoj descending;
model good4=goods_type_w phone_w a_uver_w
fam_state_w income_w credit_w vek_w;
run;
ods html close;
LOGISTIC Procedure - example
proc logistic data=dm1.score_base outest=work.model_def;
CLASS AGE_d EDUCATION_d CAR_AGE_d / param=glm;
MODEL def_bad = AGE_d EDUCATION_d CAR_AGE_d total_income_d(init_pay_by_INCOME_d)
/ SELECTION=FORWARD HIERARCHY=MULTIPLECLASS;
score out=work.tab_scored_def;
run;
LOGISTIC Procedure - example
proc logistic
data=dm1.score_base outest=work.model_def namelen=200;
where client_type="1-Novy";
CLASS sex_k child_num_k fam_state_k age_k;
MODEL def_bad = AGE_w EDUCATION_w
AGE_w*EDUCATION_w
sex_k | child_num_k | fam_state_k | age_k @4
/ selection=stepwise slentry=0.6 slstay=0.1 details corrb;
run;
LOGISTIC Procedure - example
/* inest= supplies stored estimates; with maxiter=0 they are used as is, without refitting */
proc logistic
data=dm1.score_base inest=hc.modelSU namelen=200;
CLASS sex_k child_num_k fam_state_k age_k;
MODEL def_bad = AGE_w EDUCATION_w
AGE_w*EDUCATION_w
sex_k | child_num_k | fam_state_k | age_k @4
/ selection=none maxiter=0;
output out=dm1.data_all_scr (keep=id_credit score def_bad
compress=yes) prob=score;
run;
What Happens to Classification Variables?
• The Logistic Regression task assumes a linear relationship between predictors and the logit for the response.
• For categorical variables, that assumption cannot be met.
• Specification as a Classification variable creates "design variables" representing the information in the categorical variables.
• The design variables are the ones actually used in model calculations.
• There are many possible "parameterizations" of the design variables.
Effects (Default) Coding: Three Levels
			Design Variables
CLASS	Value	Label	1	2
IncLevel	1	Low Income	1	0
	2	Medium Income	0	1
	3	High Income	-1	-1
Effects Coding: An Example
$\text{logit}(p) = \beta_0 + \beta_1 D_{\text{Low income}} + \beta_2 D_{\text{Medium income}}$
$\beta_0$ =     the average value of the logit across all categories
$\beta_1$ =     the difference between the logit for Low income and the average logit
$\beta_2$ =     the difference between the logit for Medium income and the average logit
Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-0.5363	0.1015	27.9143	<.0001
IncLevel	1	1	-0.2259	0.1481	2.3247	0.1273
IncLevel	2	1	-0.2200	0.1447	2.3111	0.1285
Reference Cell Coding: Three Levels
Design Variables
CLASS	Value	Label	1	2
IncLevel	1	Low Income	1	0
	2	Medium Income	0	1
	3	High Income	0	0
Reference Cell Coding: An Example
$\text{logit}(p) = \beta_0 + \beta_1 D_{\text{Low income}} + \beta_2 D_{\text{Medium income}}$
$\beta_0$ =    the value of the logit when income is High
$\beta_1$ =    the difference between the logits for Low and High income
$\beta_2$ =    the difference between the logits for Medium and High income
Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-0.0904	0.1608	0.3159	0.5741
IncLevel	1	1	-0.6717	0.2465	7.4242	0.0064
IncLevel	2	1	-0.6659	0.2404	7.6722	0.0056
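The two parameterizations describe the same fitted model. A small sketch makes this visible by rebuilding the logit for each income level from both sets of estimates quoted in the two output tables. The dictionaries and the function name are illustrative and not part of SAS.

```python
# Design variables (D1, D2) for IncLevel 1 = Low, 2 = Medium, 3 = High.
EFFECTS = {1: (1, 0), 2: (0, 1), 3: (-1, -1)}    # effects coding
REFERENCE = {1: (1, 0), 2: (0, 1), 3: (0, 0)}    # reference cell coding, High as reference

def logit_for_level(level, coding, intercept, b1, b2):
    """Logit implied by a coding scheme and its parameter estimates."""
    d1, d2 = coding[level]
    return intercept + b1 * d1 + b2 * d2

# Estimates (intercept, IncLevel 1, IncLevel 2) from the two output tables.
effects_fit = (-0.5363, -0.2259, -0.2200)
reference_fit = (-0.0904, -0.6717, -0.6659)

for level in (1, 2, 3):
    e = logit_for_level(level, EFFECTS, *effects_fit)
    r = logit_for_level(level, REFERENCE, *reference_fit)
    print(level, round(e, 4), round(r, 4))
```

The two columns agree up to rounding of the published estimates, so the choice of coding changes the meaning of the individual coefficients but not the predictions.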
Odds Ratio Calculation from the Current Logistic Regression Model
• Logistic regression model:
$\text{logit}(p) = \log(\text{odds}) = \beta_0 + \beta_1 \cdot (\text{gender})$
• Odds ratio (females to males):
$\text{odds}_{\text{females}} = e^{\beta_0 + \beta_1}$
$\text{odds}_{\text{males}} = e^{\beta_0}$
$\text{odds ratio} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$
Odds Ratios for Categorical Predictors
Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
Gender Female vs Male	1.549	1.040	2.305
Profile Likelihood Confidence Interval for Odds Ratios
Effect	Unit	Estimate	95% Confidence Limits
Gender Female vs Male	1.0000	1.549	1.043	2.312
Odds Ratio Plot
[Figure: odds ratios with 95% profile-likelihood confidence limits, Gender Female vs Male; horizontal axis: Odds Ratio]
Odds Ratios for Continuous Predictors
Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
Age	1.052	1.016	1.090
Profile Likelihood Confidence Interval for Odds Ratios
Effect	Unit	Estimate	95% Confidence Limits
Age	10.0000	1.663	1.176	2.373
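The two Age rows differ only in the unit of change: the first table reports the odds ratio for one additional year, the second for ten. For a coefficient β the odds ratio for a change of c units is exp(c·β), which the sketch below illustrates. The coefficient is backed out from the rounded per-unit odds ratio 1.052, so the result only approximates the 1.663 in the table.

```python
import math

def odds_ratio(beta, units=1.0):
    """Multiplicative change in the odds when the predictor increases by `units`."""
    return math.exp(beta * units)

# Coefficient implied by the rounded per-unit odds ratio in the table.
beta_age = math.log(1.052)

or_1 = odds_ratio(beta_age)             # one-unit change
or_10 = odds_ratio(beta_age, units=10)  # ten-unit change
print(round(or_1, 3), round(or_10, 3))
```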
Predicted Probability Plots - Continuous
[Figure: predicted probability plot for a continuous predictor]
Model Fit versus Complexity
[Figure: model fit statistic by sequence step]
Evaluate each sequence step.
Select Model with Optimal Validation Fit
[Figure: model fit statistic by sequence step]
Choose simplest optimal model.
Model Assessment: Comparing Pairs
• Counting concordant, discordant, and tied pairs is a way to assess how well the model predicts its own data and therefore how well the model fits.
• In general, you want a high percentage of concordant pairs and low percentages of discordant and tied pairs.
• An example follows of identifying these pairs for a model predicting whether a given person buys goods worth more than $100.
Comparing Pairs
To find concordant, discordant, and tied pairs, compare everyone who had the outcome of interest against everyone who did not.
[Figure: customers split into two groups, < $100 and $100+]
Concordant Pair
Compare a woman who bought more than $100 worth of goods from the catalog and a man who did not.
< $100: P(100+) = .32     $100+: P(100+) = .42
The actual sorting agrees with the model. This is a concordant pair.
Discordant Pair
Compare a man who bought more than $100 worth of goods from the catalog and a woman who did not.
< $100: P(100+) = .42     $100+: P(100+) = .32
The actual sorting disagrees with the model.
This is a discordant pair.
Tied Pair
Compare two women. One bought more than $100 worth of goods from the catalog, and the other did not.
< $100: P(100+) = .42     $100+: P(100+) = .42
The model cannot distinguish between the two.
This is a tied pair.
Model: Concordant, Discordant, and Tied Pairs
• By default, PROC LOGISTIC reports the (relative) frequencies of the individual pair types and the model-quality statistics derived from them:
Association of Predicted Probabilities and Observed Responses
Percent Concordant	30.1	Somers' D	0.107
Percent Discordant	19.5	Gamma	0.215
Percent Tied	50.4	Tau-a	0.050
Pairs	43578	c	0.553
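Pair counting and the derived statistics can be sketched as follows. The formulas are the usual definitions of Somers' D, Gamma, Tau-a and c in terms of concordant, discordant and tied pairs. The function name and the five-customer data set are hypothetical, and the sketch compares predicted probabilities exactly rather than reproducing any binning a particular software version may apply.

```python
def pair_statistics(probabilities, outcomes):
    """Count event/non-event pairs and derive rank-association statistics."""
    events = [p for p, y in zip(probabilities, outcomes) if y == 1]
    nonevents = [p for p, y in zip(probabilities, outcomes) if y == 0]
    concordant = discordant = tied = 0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1   # model ranks the event higher: agrees with outcome
            elif pe < pn:
                discordant += 1   # model ranks the non-event higher
            else:
                tied += 1         # model cannot distinguish the two
    pairs = len(events) * len(nonevents)
    n = len(outcomes)
    diff = concordant - discordant
    return {
        "pairs": pairs,
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "somers_d": diff / pairs,
        "gamma": diff / (concordant + discordant),  # assumes at least one untied pair
        "tau_a": diff / (n * (n - 1) / 2),
        "c": (concordant + 0.5 * tied) / pairs,
    }

# Hypothetical customers: predicted P(100+) and whether they actually bought $100+.
stats = pair_statistics([0.42, 0.42, 0.32, 0.42, 0.32], [1, 1, 1, 0, 0])
print(stats)
```

The same relation can be checked against the output table: c = 0.301 + 0.5 × 0.504 ≈ 0.553.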
6. Credit scoring - history, basic concepts
[Figure labels: Good customers, Bad customers, Low risk, High risk, Business Loan]
Introduction
• Credit scoring is a set of predictive models and their underlying techniques that support financial institutions in granting credit.
• These techniques decide who gets credit, how much credit they should get, and which further strategies will increase the profitability of borrowers to lenders.
• Credit scoring techniques quantify and assess the risk of lending to a particular consumer.
Introduction
• They do not identify or label credit applications as "good" or "bad" (bad meaning that negative behaviour, e.g. default, is expected) on an individual basis; instead they provide statistical odds, or probabilities, that an applicant with a given score will become "good" or "bad".
• These probabilities or scores, together with other business considerations such as the expected approval rate, profit or losses, are then used as the basis for deciding whether or not to grant credit.
Why do we need a score?
HISTORICAL EVOLUTION:
Money lender
• lends only to people he knows
Operators
• make decisions based on the client's information and their own experience
Automatic scoring
• makes decisions on a statistical basis
PAST EXPERIENCE -> ESTIMATION FOR FUTURE
ADVANTAGES:
• Automation of the approval process
• Cost-effective
• Fewer opportunities for fraud
DISADVANTAGES:
• Statistically based; does not take the client into account as an individual
Introduction
• While the history of credit goes back 4000 years (the first recorded mention of credit comes from ancient Babylon, 2000 BC), the history of credit scoring is only 50-70 years old.
• The first approach to the problem of identifying groups in a population was introduced in statistics by Fisher (1936). In 1941, Durand was the first to recognise that these techniques could be used to discriminate between good and bad loans.
Introduction
• The Second World War was an important milestone in credit assessment.
• Until then, individual assessment of each credit applicant was the standard. It was also standard that (almost) exclusively men were employed in the financial sector.
• The departure of a considerable part of the male population into military service created the need to pass on the experience of the existing credit assessors to new staff.
• As a result, a set of decision rules emerged and the assessment of credit applications became "automated".
Introduction
• The arrival of credit cards at the end of the 1960s and the growth of computing power led to an enormous expansion in the development and use of credit scoring techniques. The event that secured full acceptance of credit scoring was the passage of the "Equal Credit Opportunity Acts" and their later amendments, adopted in the USA in 1975 and 1976. These made discrimination in granting credit illegal, except where such discrimination "was empirically derived and statistically valid".
Introduction
In the 1980s, logistic regression, still regarded as the industry standard in many areas, and linear programming came into use. Somewhat later, artificial intelligence methods such as neural networks appeared. Other techniques in use include nearest-neighbour methods, splines, wavelet smoothing, kernel smoothing, Bayesian methods, regression and classification trees, support vector machines, association rules, cluster analysis and genetic algorithms.
History - detail
Date	Event
2000 BC	First use of credit in Assyria, Babylon, and Egypt.
1100s	First pawnshops in Europe established by charitable institutions, and by 1350 they were being run as commercial concerns.
1536	Charging of interest deemed acceptable by the Protestant church.
1730	First advertisement for credit placed by Christopher Thornton of Southwark, London, who offered furniture that could be paid off weekly.
1780s	First use of cheques in England.
1803	First consumer reports by Mutual Communications Society in London.
1832	First publication of the American Railroad Journal.
1841	Mercantile Agency is first American credit reporting agency.
1849	Harrod's established as one of the world's first department stores.
1851	First use of credit ratings for trade creditors by John M. Bradstreet.
1856	Singer Sewing Machines offers consumer credit.
1862	Poor's Publishing publishes Manual of the Railroads of the United States.
1869	First American consumer bureau is Retailers Commercial Agency (RCA) in Brooklyn.
1886	Sears established, and launches its catalogue in 1893.
Source: Anderson
Glossary (English = Czech):
deemed acceptable = považován za přijatelný
Advertisement for credit = reklama na úvěr
Mercantile agency = obchodní agentura
History - detail
Date	Event
1906	National Association of Retail Credit Agencies formed in the USA.
1909	John M. Moody publishes first credit rating grades for publicly traded bonds.
1913	Henry Ford uses production lines to produce affordable automobiles.
1927	Establishment of Schufa Holdings AG, first credit bureau in Germany.
1934	First public credit registry (PCR) established in Germany.
1936	R.A. Fisher's use of statistical techniques to discriminate between iris species.
1941	David Durand writes report, suggesting statistics can assist credit decisions.
194?	Henry Wells uses credit scoring at Spiegel Inc.
1950	Diners Club and American Express launch first charge cards.
1950s	Sears uses propensity scorecards for catalogue mailings.
1956	FI consultancy established in California, USA.
1958	First use of application scoring by American Investments.
1960s	Widespread adoption of credit scoring by credit card companies.
1966	Credit Data Corp. becomes first automated credit bureau.
1970	Fair Credit Reporting Act governs credit bureaus.
1974	Equal Credit Opportunity Act causes widespread adoption of credit scoring.
1975	FI implements first behavioural scoring system for Wells Fargo.
1978	Stannic implements first vehicle finance scorecards in South Africa.
1982	CCN offers Credit Account Information Sharing (CAIS), its consumer credit bureau service.
1984	FI develops first bureau scores used for pre-screening.
1987	MDS develops first bureau scores used for bankruptcy prediction.
1995	Mortgage securitisers Freddie Mac and Fannie Mae adopt credit scoring.
2000	Moody's KMV introduces RiskCalc for financial ratio scoring (FRS).
2000s	Basel II implemented by many banks.
Source: Anderson
Glossary (English = Czech):
affordable = dostupný
iris species = druhy kosatců
Charge card = kreditní karta
Propensity scorecard = scoringová karta pro modelování náchylnosti (k nákupu)
FI = the company Fair, Isaac ... today FICO
Mortgage = hypotéka
History - detail
Table 2.4. Genealogies and milestones - credit cards
Date	Event
1914	Western Union introduces embossed metal plate, first charge card in the United States.
1920s	Introduction of 'shoppers plates', early version of modern store cards.
1950	Diners Club and American Express launch first charge cards.
[year illegible]	Diners Club launches first credit card in New York city.
1960	Bank Americard established, later to become Visa.
1966	Master Charge established, later to become MasterCard.
1966	Barclaycard established in the United Kingdom.
Table 2.5. Genealogies and milestones - credit scoring consultancies
Name	Year	Notes
Fair Isaac (FI)		
FI	1956	Founded San Francisco CA, by Bill Fair and Earl Isaac
	1958	First scorecard development, for American Investments
	1984	Develops first bureau score for pre-screening
	1995	First use of scoring by mortgage securitisers
Experian-Scorex		
Management Decision Systems (MDS)	1974	Founded by John Coffman and Gary Chandler
	1982	MDS purchased by CCN
Scorex	1984	Founded in Monaco by Jean-Michel Trousse
MDS	1987	MDS develops first monthly bureau score, for bankruptcy
Experian-Scorex	2003	Created as subsidiary of Experian, after purchase of Scorex
Source: Anderson
History - detail
Table 2.7. Genealogies and milestones - credit bureaux
Name	Year	Notes
Dun & Bradstreet		
Mercantile Agency	1841	Founded, New York NY, by Lewis Tappan.
	1849	Benjamin Douglass takes over, and expands.
John M. Bradstreet Co.	1849	Founded, Cincinnati OH.
	1851	First use of credit rating grades.
R.G. Dun & Co.	1859	Robert G. Dun incorporates Mercantile Agency.
Dun & Bradstreet	1933	Merger orchestrated by Arthur Whiteside.
Experian		
Manchester Guardian Society	1827	Founded, Manchester, UK.
Chilton Corp.	1897	Founded, Dallas TX. Publishes 'Red Book'.
Michigan Merchants	1932	Founded, later to become Credit Data Corp.
TRW	1968	Purchases Credit Data Corp., and changes name to TRW-Credit Data.
TRW	1976	Information Systems and Services (IS&S) division produces first business credit report.
CCN	1980	Founded, when Great Universal Stores (GUS) spins off information services division.
	1984	Purchases Manchester Guardian Society.
TRW	1989	Purchases Chilton Corp.
Experian	1996	Founded, through TRW divestiture of TRW-CD & IS&S. Purchased by GUS, who merges it with CCN.
Equifax		
London Assn. for the Protection of Trade	1842	Founded, London, UK.
RCA	1869	Founded, Brooklyn, NY.
RCC	1899	Founded, Atlanta, GA.
	1934	Purchases RCA.
United Assn. for the Protection of Trade	1965	LAPT renamed.
Equifax	1975	RCC renamed to Equifax.
	1994	Purchases UAPT-Infolink and Canadian Bonded Credits.
TransUnion		
TransUnion	1968	Founded, as holding company for Union Tank Car Company (UTCC).
	1969	Purchases the Credit Bureau of Cook County.
Source: Anderson
History - detail
Table 2.8. Genealogies and milestones - credit rating agencies
Name	Year	Notes
Standard & Poor's (S&P)		
Poor's Publishing Co.	1862	Founded, by Henry Varnum Poor.
S&P	1941	Poor's Publishing and Standard Statistics merge.
Moody's Investor Services (MIS)		
John Moody & Co.	1900	Founded, by John Moody, but fails in 1907.
John Moody	1909	First use of rating grades for bonds.
Moody's Investor Services	1914	Incorporation of MIS.
	1962	MIS purchased by D&B.
Moody's KMV	2002	Created as MIS subsidiary after merger of Risk Management Services and KMV.
Fitch IBCA		
Fitch Publishing Co.	1913	Founded, by John Knowles Fitch.
IBCA	1973	Founded.
Fitch IBCA	1997	Merger of Fitch Publishing and IBCA.
Source: Anderson
History - further interesting reading
http://www.fundinguniverse.com/company-h^ Com pa ny-Com pa ny- H istory. htm I
http://www.fico.com/en/Company/News/Pages/03-10-2009.aspx
http://www.directlendingsolutions.com/history_credit_scoring.htm
http://www.pbs.org/wgbh/pages/frontline/shows/credit/more/scores.html
http://en.wikipedia.org/wiki/Credit_score
Risk Management - Acquisition
Data Acquisition
• Credit Bureau
• Other External Data
Policy Rules, Scorecards
• Fraud
• Delinquency
• Bankruptcy
• Claims
Strategy
• Fail
• Pass
Risk Management - Customer
• Credit Line Management
• Usage Monitoring
• Transaction Fraud
• Transaction Approval
• Renewal/Reissue
• Collections
• Claims
✓ Scorecards ✓ Policy Rules ✓ Strategies ... Lots of analysis
Risk Management
[Diagram: Risk Management tree with the nodes Financial, Operational, Commercial/Consumer and Enterprise]
• Commercial/Consumer: Delinquency, Fraud, Claim, Collections
• Enterprise: Market, Interest, VaR (Risk Dimensions)
Risk Management and types of risk
Failures of processes and systems, fraud, robbery, company reputation
Risk Management
Commercial/Consumer: Delinquency, Fraud, Claim, Collections
• Delinquency: late payments, bankruptcy, write-off
• Fraud: applicant, transaction, claims, Internet (app+trans)
• Claim: P&C, Life, Health; mortgage insurance; export financing insurance
• Collections: payment projection (recovery), outsourcing to agency
P&C: Property & Casualty Insurance
Why Manage Risk?
• Reduce exposure to high-risk accounts.
• Decrease bad debt and claims payouts.
• Ensure better pricing to reflect risk.
• Detect fraud early on.
• Increase approval rates (the "right kind", potentially increasing revenue).
• Handle most approvals/declines quickly (customer service).
• Analysts/investigators focus only on difficult accounts.
• Ensure consistent, equal and objective treatment of each applicant across the organization.
• Offer more efficient marketing initiatives.
Users of Risk Management
• Banks
• Citibank, Royal Bank, CIBC, BankOne
• Finance Companies
• GE Capital, HFC, GMAC
• Insurance
• Life, Property and Casualty, Health
• Government
• Ministries/Departments of Health (Medicare), Ministries of Finance (IRS), Workers Compensation.
Users of Risk Management
• Utilities
• Hydro/Power/Energy, Water
• Communications
• Bell, Sprint, AT&T (land lines and cellular)
• Retail
• JC Penneys, Sears, Hudsons Bay Company, Target
• Manufacturers/Industrials
• Those who give credit to small businesses.
Risk Management "Toolbox"
• Risk Data Mart/Data Warehouse
• Risk prediction models (scorecards)
• Reporting
• Analysis tools
• Operational/strategy implementation software (for example, FICO™ Blaze Advisor®, FICO® TRIAD® Customer Manager, Experian Probe SM, Experian NBSM, Cardpac, VisionPlus, Pro-Logic Ovation).
FICO™ Blaze Advisor®
[Screenshots: FICO Blaze Advisor ruleflow editor showing a "Lending Decision Flow" (retrieve customer info, validation rules, student / non-student loan selection, set loan limit, funding process, assign promo, display results), a surcharge decision table, and the Rule Maintenance Application.]
Source: http://www.fico.com/account/resourcelookup.aspx?theID=43o
Scorecards
• Predict the probability of a negative event.
• Custom - based on the client's own data
• Generic - based on pooled industry or bureau data (Beacon, Empirica)
• Application - new applicants
• Behavioral - current customers
Scorecard Types
• Risk: 30/60/90 Delinquency, Bankruptcy, Write-off, Claim, Fraud, Collections
• Mktg/CRM: Response, Churn, Revenue, Cross sell
• Combination: Resp/approve/delq, Response/profit, Risk/churn/profit, Profit
Scoring in approval process
• Client (new)
• Hard checks (failure leads to rejection)
  • Policy declines - low age, insufficient length of employment, "terrorist" etc.
• Scoring
  • What is the probability that the client will pay?
  • Will the contract be profitable?
• Verifications (dependent on risk group; failure leads to rejection)
  • Is the client's phone number valid? Etc.
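The approval flow above can be sketched as a single DATA step. This is only an illustration under stated assumptions: the data set work.applicants, every variable name (age, months_employed, score, phone_valid) and every threshold is made up for the example and does not come from the lecture.

```sas
/* Hypothetical applicants; names, values and thresholds are illustrative only. */
data work.applicants;
   input app_id age months_employed score phone_valid;
   datalines;
1 17 24 610 1
2 35  2 640 1
3 42 60 480 1
4 29 36 620 0
5 51 96 700 1
;
run;

data work.decisions;
   set work.applicants;
   length decision $ 8 reason $ 32;

   /* 1. Hard checks (policy declines) */
   if age < 18 then do;
      decision = "Reject";  reason = "Policy: low age";
   end;
   else if months_employed < 6 then do;
      decision = "Reject";  reason = "Policy: short employment";
   end;
   /* 2. Scoring: assumed cut-off of 500 points */
   else if score < 500 then do;
      decision = "Reject";  reason = "Score below cut-off";
   end;
   /* 3. Verifications, applied only to the riskier group (assumed: score < 650) */
   else if score < 650 and phone_valid = 0 then do;
      decision = "Reject";  reason = "Verification failed";
   end;
   else do;
      decision = "Approve"; reason = " ";
   end;
run;

proc print data=work.decisions noobs;
run;
```

Ordering the checks from cheapest to most expensive mirrors the diagram: policy declines need no model, and verifications are run only for applicants who survive the earlier steps.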
Fraud Risk
• Fraud risk is one of the fastest growing areas in risk management.
• Examples include bank/retail card fraud, insurance fraud, health care fraud, welfare fraud, franchise fraud, internet fraud, mortgage fraud, investment fraud, tax fraud, merchant fraud.
• E-commerce presents opportunities.
• The F.B.I. estimates that between 10% and 15% of loan applications contain material misrepresentations.
Reporting and Analysis
• Scorecard and portfolio performance
• Approval rates, applicant profile, loss rates, high risk segments
• Behavior tracking to develop better strategies
• Capturing fraud, approval/decline, pricing, credit line management, collections, cross-sell qualification, claims.
Risk Applications
• Retail/banking (consumer and commercial)
• Application and behavior scorecards for all credit products.
• Strategy design for credit limit setting, authorizations and collections/reissue/suspension.
• Fraud application and transaction detection
• Pricing/down payment
• ATM limits, check holds
• Pre-qualifying direct marketing lists.
• Automotive/finance
• Loans and leasing
• Application, behavioral, fraud, collection scorecards
• Pricing/down payment.
Risk Applications
• Government
• Fraud detection (for example, Welfare, health insurance)
• Entitlement/claims assessment (for example, Workers compensation)
• Communications
• Security deposit
• International call access
• Contract/"pay as you go"
• Telephone fraud
• "Shadow limit" setting
• Suspension of service
• Collections.
Risk Applications
• Insurance
• Rate setting
• Fraud detection
• Claims management
• Risk control for CRM initiatives.
• Utilities
• Security deposit
• Collections.
Risk Applications
• Manufacturers/pharmaceuticals/industrials
• Assessing credit risk of business clients
• Credit risk assessment of franchisees (for example, gas stations)
• Payment terms
• Collections
• Merchant fraud.
Risk Applications
• Optimizing work flow in adjudication departments
• Evaluating/pricing portfolios
• Securitization
• Setting economic/regulatory capital allocation
• Reducing turnaround time (automated scoring)
• Comparing quality of business from different channels/regions/suppliers.
Resources
www.ftc.gov/bcp/conline/pubs/credit/scoring.htm
www.creditscoring.com
www.my-credit-score.com
www.fairisaac.com, www.myfico.com
www.experian.com
www.creditinfocenter.com
www.consumersunion.org/finance/scorewc200.htm
www.phil.frb.org/files/br/brs097lm.pdf
www.nacm.org
www.rmahq.org
www.riskmail.org
www.occ.treas.gov
Resources
• Credit Scoring & Its Applications
by Lyn Thomas, Jonathan Crook, David Edelman
• Credit Risk Modeling: Design and Application
by Elizabeth Mays (Editor)
• Internal Credit Risk Models: Capital Allocation and Performance Measurement
by Michael K Ong
• Handbook of Credit Scoring
by Elizabeth Mays
• Applications of Performance Scoring to Accounts Receivables Management in Consumer Credit
by John Y. Coffman
• Introduction to Credit Scoring,
by E.M. Lewis
Scorecard Development Roles - Objectives
• Understand the critical resources needed to successfully complete a scorecard development and implementation project.
• Understand some of the operational considerations that go into scorecard design.
Major Roles
• Scorecard Developer
• Data miner, data issues
• Credit Scoring Manager/Risk Manager
• Strategic view, corporate policies, implementation
• Product Manager
• Client base, target market, marketing direction.
Major Roles
• Operational Managers
• Customer Service, Adjudication, Collections
• Strategy execution, impact on customers
• IT/IS Managers
• external/internal data, implementation platforms.
Minor Roles
• Project Manager
• Coordination, time lines
• Corporate Risk staff
• Corporate policies, capital allocation
• Legal.
Why All of These Roles?
• Can I use this variable?
• Legal, technical (derived variables, implementation platform), future application form design
• Segmentation
• Marketing, application form design, systems
• What is the impact on this segment?
• Operational, marketing, risk manager, corporate risk.
Introduction to SAS Enterprise Guide
• SAS Enterprise Guide provides a point-and-click interface for managing data and generating reports.
[Screenshot: Summary Statistics task for S:\workshop\customers.sas7bdat, with Customer_Age as the analysis variable and Customer_Type as the classification variable. The dialog's help text reads: "The variables that you assign to this role are character or discrete numeric variables that are used to divide the data into categories or subgroups. The statistics will be calculated on all selected analysis variables for each combination of classification variables."]
[Screenshot: Bar Chart wizard for S:\workshop\orion_profit.sas7bdat, "Specify appearance" step.]
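For orientation, the Summary Statistics task on this slide corresponds roughly to a PROC MEANS step. The sketch below is hand-written, not the code Enterprise Guide actually generates; it assumes the libref orion points to s:\workshop (the path that appears on the slides) and that the customers table contains Customer_Age and Customer_Type as in the dialog.

```sas
/* Assumed library assignment; adjust the path to your environment. */
libname orion "s:\workshop";

/* Analysis variable: Customer_Age; classification variable: Customer_Type. */
proc means data=orion.customers n mean std min max;
   class Customer_Type;
   var Customer_Age;
run;
```

The CLASS statement plays the part of the "Classification variables" role: statistics are computed separately for each value of Customer_Type.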
SAS Enterprise Guide Interface
SAS Enterprise Guide also includes a full programming interface that can be used to write, edit, and submit SAS code.
[Screenshot: SAS Enterprise Guide window with the Project Tree (Process Flow, orion_profit, Bar Chart, customers, Programs, ep02d01) and the program ep02d01 open in the Program editor:]
libname orion "s:\workshop";
data work.clubmembers work.nonclub;
   set orion.customers;
   if Customer_Type_ID = 3010
      then output work.nonclub;
      else output work.clubmembers;
run;
proc print data=work.nonclub noobs;
   title "Non Club Members";
   var Country Gender Customer_Name;
run;
SAS Enterprise Guide Interface: The Project
• A project serves as a collection of
• data sources
• SAS programs and logs
• tasks and queries
• results
• informational notes for documentation.
[Screenshot: Process Flow of a project built on ORION_PROFIT, with tasks such as Product Frequencies, Profit by Category Pie Chart and Profit by Country/AgeGroup Report, their report outputs, and the Tasks by Category pane.]
You can control the contents, sequencing, and updating of a project.
SAS Programs
/* DATA step */
data work.clubmembers work.nonclub;
   set orion.customer;
   if Customer_Type_ID = 3010
      then output work.nonclub;
      else output work.clubmembers;
run;

/* PROC step */
proc print data=work.nonclub;
   title "Non Club Members";
   var Country Gender Customer_Name;
run;
ep02d01.sas
PROC PRINT Output
Non Club Members
Obs	Country	Gender	Customer_Name
1	DE	M	Ulrich Heyde
2	US	M	Tulio Devereaux
3	US	F	Robyn Klem
4	US	F	Cynthia Mccluney
5	AU	F	Candy Kinsey
6	US	M	Phenix Hill
7	IL	M	Avinoam Zweig
8	CA	F	Lauren Marx
Saving SAS Programs
• The SAS program in the project is a shortcut to the physical storage location of the .sas file. Select the program icon and then select File > Save program-name to save the program under the same name, or Save program-name As... to choose a different name or storage location.
[Screenshot: File menu with Save Project (Ctrl+S), Save Project As..., Save ep02d01 (Ctrl+Shift+S) and Save ep02d01 As...]
Embedding Programs in a Project
• A SAS program can also be embedded in a project so that the code is stored as part of the project .epg file.
• Right-click on the Code icon in a project and select Properties > Embed.
[Screenshot: Properties for ep02d01, General page (label, server, last execution time, file path, location) with the Embed button. Help text: "Embeds the code in the SAS Enterprise Guide project so that any changes that you make to the code in SAS Enterprise Guide are not applied to the original code file. This option is available only for existing code files that you have inserted into ..."]
How Do You Include Data in a Project?
• Selecting File > Open > Data adds a shortcut to a SAS data source in the project.
[Screenshot: File menu and a Process Flow containing the order_item data shortcut.]
Assigning a Libref
• You can use the Assign Project Library task to define a SAS library for an individual project.
[Screenshot: Tools menu with the Assign Project Library... item.]
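In code, a libref is assigned with a LIBNAME statement, as in the ep02d01 program shown earlier. A minimal sketch, assuming the same s:\workshop path used on the slides:

```sas
/* Assign the libref orion to a folder of SAS data sets (path assumed from the slides). */
libname orion "s:\workshop";

/* List the members of the library to confirm that the assignment worked. */
proc contents data=orion._all_ nods;
run;

/* Release the libref when it is no longer needed. */
libname orion clear;
```

A libref assigned this way lasts for the SAS session only, which is why a project-level assignment is convenient when the same library is needed every time the project is opened.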
Browsing a SAS Library
• During an interactive SAS Enterprise Guide session, the Server List window enables you to manage your files in the windowing environment.
In the Server List window, you can do the following:
■ view a list of all the servers and libraries available during your current SAS Enterprise Guide session
■ drill down to see all tables in a specific library
■ display the properties of a table
■ delete tables
■ move tables between libraries
[Screenshot: Server List window showing the Local server, its libraries (GISMAPS, MAPS, MYDATA, ORION) and tables in ORION such as ALL_SUPPLIERS, AUSTRALIA_SUPPLIERS, COMBINED_PRODUCT, CUSTOMER, CUSTOMER_PROFIT, CUSTOMER_TYPE, NONSALES and NORTH_AMERICAN_SUPPLIERS.]
Applying Formats
Display formats can be applied in a SAS Enterprise Guide task or query by modifying the properties of a variable.
[Screenshot: Properties dialog of a variable in the List variables role, Formats page. The Date category lists formats such as MMDDYYw.d and MMYYw.d; overall width can be set from 2 to 10 and decimal places from 0 to 7. Example value: 14245 (01Jan1999), output: 01/01/99.]
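The same effect can be obtained in code with a FORMAT statement or by naming a format in a PUT statement. The sketch below reuses the example value 14245 from the dialog; the data set work.format_demo and the variable Start_Date are hypothetical names chosen for illustration.

```sas
/* Write the SAS date value 14245 with three different date formats. */
data _null_;
   d = 14245;
   put d= date9.;      /* d=01JAN1999  */
   put d= mmddyy8.;    /* d=01/01/99   */
   put d= mmddyy10.;   /* d=01/01/1999 */
run;

/* Attach a display format to a variable; the stored value stays 14245. */
data work.format_demo;          /* hypothetical data set */
   Start_Date = 14245;          /* hypothetical variable */
   format Start_Date mmddyy10.;
run;

proc print data=work.format_demo noobs;
run;
```

A format changes only how a value is displayed; calculations and comparisons still use the underlying number of days since 01JAN1960.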
Query Builder Join
• When you use the Query Builder to join tables in SAS Enterprise Guide, SQL code is generated.
• SQL does not require sorted data.
• SQL can easily join multiple tables on different key variables.
• SQL provides straightforward code to join tables based on a non-equal comparison of common columns (greater than, less than, between).
[Screenshot: Query Builder for Local:ORION.ORDER_ITEM, query name "Join for Product List", output name SASUSER.QUERY_FOR_ORDER_IT... The tables ORDER_ITEM (t1: Order_ID, Order_Item_Num, Product_ID, Quantity, Total_Retail_Price, CostPrice_Per_Unit, Discount) and PRODUCT_LIST (t2: Product_ID, Product_Name, Supplier_ID, Product_Level, Product_Ref_ID) are joined, and all columns of both tables are selected on the Select Data tab.]
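A hand-written approximation of such a join is sketched below. It is not the exact code the Query Builder generates: the output table names are invented, the orion libref is assumed to be assigned, and the lookup table work.qty_bands exists only to illustrate a non-equal (BETWEEN) join.

```sas
/* Equi-join of order items with the product list on Product_ID. */
proc sql;
   create table work.order_products as      /* hypothetical output name */
   select t1.Order_ID, t1.Order_Item_Num, t1.Product_ID, t1.Quantity,
          t1.Total_Retail_Price, t2.Product_Name, t2.Supplier_ID
   from orion.order_item as t1
        inner join orion.product_list as t2
        on t1.Product_ID = t2.Product_ID;
quit;

/* Hypothetical lookup table used to demonstrate a non-equal join. */
data work.qty_bands;
   input Band $ Min_Qty Max_Qty;
   datalines;
Small 1 2
Medium 3 5
Large 6 999
;
run;

/* Non-equal join: each order item is matched to the band that contains its quantity. */
proc sql;
   create table work.order_bands as         /* hypothetical output name */
   select t1.Order_ID, t1.Product_ID, t1.Quantity, b.Band
   from orion.order_item as t1, work.qty_bands as b
   where t1.Quantity between b.Min_Qty and b.Max_Qty;
quit;
```

Neither input needs to be sorted beforehand, which is the main practical difference from a DATA step merge.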
Sort Data Task
• The Sort Data task enables you to create a new data set sorted by one or more variables from the original data.
[Screenshots: Sort Data task for Local:ORION.PRODUCT_LIST. On the Data page, Product_ID is assigned to the Sort by role with ascending order; a "Columns to be dropped" role is also available. On the Results page, the location to save output data is Local:WORK.SORT PRODUCT.]
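In code, the task corresponds to a PROC SORT step with an OUT= data set. A minimal sketch, assuming the orion libref is assigned; the output names work.sort_product and work.sort_product_desc and the dropped column are assumptions made for the example.

```sas
/* Create a sorted copy; the original orion.product_list is left unchanged. */
proc sort data=orion.product_list
          out=work.sort_product;             /* assumed output name */
   by Product_ID;                            /* ascending is the default */
run;

/* Variant: descending order, dropping a column that is not needed. */
proc sort data=orion.product_list(drop=Product_Ref_ID)
          out=work.sort_product_desc;        /* hypothetical output name */
   by descending Product_ID;
run;
```

Without OUT=, PROC SORT replaces the input data set in place, so naming a separate output is the safer habit when sorting a shared library.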
Business Scenario
• Orion Star wants to send information about a specific promotion to female customers in Germany. The report can be created by querying the orion.customer data set to include only the desired customers, and then by producing a report with the List Data task.
[Diagram: CUSTOMER > Female German (query) > F_Germany > List Data > HTML]
Female Customers in Germany
Customer Country	Customer First Name	Customer Last Name	Customer Birth Date
DE	Cornelia	Krahl	27FEB1974
DE	Elke	Wallstab	16AUG1974
DE	Ines	Deisser	20JUL1969
Business Scenario
• The same report can be generated more efficiently by subsetting the data directly within the List Data task. This requires modification of the code generated by SAS Enterprise Guide.
[Diagram: CUSTOMER > List Data > HTML - List Data]
PROC PRINT DATA=WORK.SORTTempTableSorted
   NOOBS
   LABEL
   ;
   /* Start of custom user code. */
   where Country = 'DE' and Gender = 'F';
   /* End of custom user code. */
   VAR Country Customer_FirstName Customer_LastName Birth_Date;
RUN;
/* -------------------------------------------------------------------
   End of task code.
   ------------------------------------------------------------------- */
RUN; QUIT;
Understanding Generated Task Code
• There are many situations where task results created by SAS Enterprise Guide can be further enhanced or customized by modifying the code.
• However, before you can effectively modify the code, you must first understand the code that SAS Enterprise Guide generates.
List Data Task
[Screenshot: List Data task for Local:ORION.CUSTOMER, Data page. Country, Customer_FirstName, Customer_LastName and Birth_Date are assigned to the List variables role; Gender is used for grouping, with ascending sort order.]
The Preview code button enables you to view and modify the code generated by the task.
List Data Task - Code Preview
[Screenshot: Code Preview for Task window with the Insert Code... button, showing the generated code: the header comment (generated on Wednesday, April 29, 2009, by task List Data, input data ORION.CUSTOMER, server Local), the %_eg_conditional_dropds call, the PROC SORT step with BY Gender, and the TITLE and FOOTNOTE statements.]
Using the List Data Task to Generate Code
• This demonstration illustrates building a List Data task and examining the code generated by SAS Enterprise Guide.
Customer Listing
Customer Gender=F
Customer Country	Customer First Name	Customer Last Name	Customer Birth Date
US	Sandrina	Stephano	09JUL1979
DE	Cornelia	Krahl	27FEB1974
US	Karen	Ballinger	18OCT1984
DE	Elke	Wallstab	16AUG1974
Customer Gender=M
Customer Country	Customer First Name	Customer Last Name	Customer Birth Date
US	James	Kvarniq	27JUN1974
US	David	Black	12APR1969
DE	Markus	Sepke	21JUL1988
DE	Ulrich	Heyde	16JAN1939
List Data Task - Generated Code
• The initial comment block shows information about the task.
/* -------------------------------------------------------------------
   Code generated by SAS Task

   Generated on: Wednesday, April 29, 2009 at 1:13:33 PM
   By task: List Data

   Input Data: ORION.CUSTOMER
   Server: Local
   ------------------------------------------------------------------- */
List Data Task - Generated Code
• The first line uses a macro to delete temporary tables or views if they already exist. If the Group by role is used in the task, the data must be ordered by the grouping variable. PROC SORT is used by default. Only variables assigned to roles are kept in the new data set.
%_eg_conditional_dropds(WORK.SORTTempTableSorted);
/* -------------------------------------------------------------------
   Sort data set ORION.CUSTOMER
   ------------------------------------------------------------------- */
PROC SORT
   DATA=ORION.CUSTOMER(KEEP=Country Customer_FirstName
                            Customer_LastName Birth_Date Gender)
   OUT=WORK.SORTTempTableSorted
   ;
   BY Gender;
RUN;
List Data Task - Generated Code
If the Group by role is not used, SQL creates a temporary view of the required data. Again, only variables assigned to roles in the task are included in the view. Note that the generated comment incorrectly states that sorting occurs.
%_eg_conditional_dropds(WORK.SORTTempTableSorted);
/* -------------------------------------------------------------------
   Sort data set ORION.CUSTOMER
   ------------------------------------------------------------------- */
PROC SQL;
   CREATE VIEW WORK.SORTTempTableSorted AS
      SELECT T.Country, T.Customer_FirstName, T.Customer_LastName, T.Birth_Date
      FROM ORION.CUSTOMER as T
   ;
QUIT;
List Data Task - Generated Code
The main part of the code includes the titles, footnotes, and procedure code to generate the report. PROC PRINT is the procedure used with the List Data task.
TITLE;
TITLE1 "Customer Listing";
FOOTNOTE;
FOOTNOTE1 "Generated by the SAS System (&_SASSERVERNAME, &SYSSCPL) on %TRIM(%QSYSFUNC(DATE(), NLDATE20.)) at %TRIM(%SYSFUNC(TIME(), NLTIMAP20.))";
PROC PRINT DATA=WORK.SORTTempTableSorted
   OBS="Row number"
   LABEL
   ;
   VAR Country Customer_FirstName Customer_LastName Birth_Date;
   BY Gender;
RUN;
Note: TITLE and FOOTNOTE are examples of global statements and can be included anywhere in a SAS program.
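Because TITLE and FOOTNOTE are global statements, they stay in effect across steps until they are changed or cleared, which is why the generated code begins and ends with empty TITLE; and FOOTNOTE; statements. A small sketch of that behaviour, using the SASHELP.CLASS sample table that ships with SAS; the title and footnote texts are made up.

```sas
/* Global statements: set once, outside any step. */
title "Customer Listing";
footnote "Internal use only";

proc print data=sashelp.class(obs=3);
run;

/* The same title and footnote still apply to this second step. */
proc means data=sashelp.class;
   var Height;
run;

/* Clear all titles and footnotes so later output is not affected. */
title;
footnote;
```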
List Data Task - Generated Code
• At the end, the final lines of code delete any temporary tables created to build the task, and delete any assigned titles and footnotes.
/* -------------------------------------------------------------------
   End of task code.
   ------------------------------------------------------------------- */
RUN; QUIT;
%_eg_conditional_dropds(WORK.SORTTempTableSorted);
TITLE; FOOTNOTE;
Techniques to Modify Code
• Three methods can be used to modify code generated by SAS Enterprise Guide:
1. Edit the last submitted task code in a separate Code window.
2. Automatically submit custom code before or after every task and query.
3. Insert custom code in a task.
Edit Last Submitted Code
• After a task runs, the code can be viewed from either the Project Tree or Process Flow.
[Screenshots: right-clicking the List Data task in the Project Tree or in the Process Flow and selecting Open > Open Last Submitted Code; the same submenu also offers Open List Data and Open Log.]
Edit Last Submitted Code
The task code is read-only and cannot be edited directly. To create a copy of the code from the Last Submitted Code window, press any key while in the SAS program window. SAS Enterprise Guide offers to make a copy.
[Dialog: "This code is read-only. Do you want to create a copy of this code that can be modified?" Yes / No]
After the code is copied, there is no link between the task and the new code. Any changes in the task are not reflected in the copied code, and modifications to the code do not affect the task.
Summary of Editing Last Submitted Code
Custom code linked to task?	No
Can be used to modify query code?	Yes
Extent of modification allowed?	Anything in the program can be changed.
Custom code included when exported?	Yes. You must export the edited program and select the option in the Export wizard.
Automatically Submit Custom Code Before or After Every Task and Query
• There are times when you might need to run a SAS statement or program before or after any task or query is executed. The Custom Code option enables you to insert custom code before or after all tasks and queries.
Automatically Submit Custom Code Before or After Every Task and Query
[Screenshot: Options dialog, Tasks > Custom Code page, with the check boxes "Insert custom SAS code before task and query code" and "Insert custom SAS code after task and query code", each with an Edit... button.]
To run code before tasks and queries, select the first check box and select Edit... to type the code.
Automatically Submit Custom Code Before or After Every Task and Query
Global statements or complete program steps can be entered. Example: Set the LOCALE= option to Great Britain.
/* Insert custom code before task code here */
OPTIONS LOCALE=en_GB;
Insert Code Before or After SAS Programs
• Similar options exist to automatically submit code before or after SAS programs written and submitted in Code windows in SAS Enterprise Guide.
[Screenshot: Options dialog, SAS Programs page, with the options "Insert custom SAS code before submitted code", "Insert custom SAS code after submitted code" and "Submit SAS code when server is connected", each with an Edit... button.]
Summary of Submitting Custom Code Before or After Every Task and Query
Custom code linked to task?	Yes
Can be used to modify query code?	Yes
Extent of modification allowed?	Statements can only be submitted before or after the task code.
Custom code included when exported?	Yes, select the option in the Export wizard.
Insert Custom Code in a Task
In most task dialog boxes, you have the ability to insert custom code within the generated SAS program. This technique has the significant benefit that the task interface can still be used to modify the report.
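[Screenshot: Code Preview for Task window with the Insert Code... button, showing custom code inserted into the generated step:]
PROC PRINT DATA=WORK.SORTTempTableSorted
   OBS="Row number"
   LABEL
   ;
   /* Start of custom user code. */
   where country="DE";
   /* End of custom user code. */
   VAR Country Customer_FirstName Customer_LastName Birth_Date;
RUN;
/* -------------------------------------------------------------------
   End of task code
   ------------------------------------------------------------------- */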
Insert Custom Code in a Task
•In the Code Preview window, select Insert Code... to add custom code in predefined locations in the SAS program.
[Screenshot: Code Preview for Task window with the Insert Code... button]
/* -------------------------------------------------------
   Code generated by SAS Task

   Generated on: Thursday, March 13, 2008 at 10:57:50 PM
   By task: List Data

   Input Data: ORION.CUSTOMER
   Server: Local
   ------------------------------------------------------- */
PROC SQL;
%_SASTASK_DROPDS(WORK.SORTTempTableSorted);
QUIT;
Insert Custom Code in a Task
User Code
Positions where user code may be inserted are indicated by the icons. Double-click on a marked line to add user code or change existing user code.
PROC SQL;
CREATE VIEW WORK.SORTTempTableSorted
AS SELECT Country, Customer_FirstName, Customer_LastName, Birth_Date FROM ORIO... [cut off in the screenshot]
QUIT;
TITLE;
TITLE1 "Report Listing";
FOOTNOTE;
FOOTNOTE1 "Generated by the SAS System (&_SASSERVERNAME, &SYSSCPL) on %SYSFUNC(DATE(), ... [cut off in the screenshot]
<double-click to insert code>
PROC PRINT DATA=WORK.SORTTempTableSorted OBS="Row number" LABEL
<double-click to insert code>
<double-click to insert code>
VAR Country Customer_FirstName Customer_LastName Birth_Date <double-click to insert code>
In any of these predefined locations, you can double-click on a line to insert custom code.
Insert Custom Code in a Task
• Some insert points enable custom options to be added to existing statements.
[Screenshot: User Code dialog]
PROC PRINT DATA=WORK.SORTTempTableSorted OBS="Row number" LABEL
<double-click to insert code>
<double-click to insert code>
VAR Country Customer_FirstName Customer_LastName Birth_Date <double-click to insert code>
<double-click to insert code>
Callouts: insert options in the PRINT statement; insert options in the VAR statement.
Insert Custom Code in a Task
•Other insert points enable entire statements to be added inside a step in the program.
[Screenshot: User Code dialog]
PROC PRINT DATA=WORK.SORTTempTableSorted OBS="Row number" LABEL
<double-click to insert code>
<double-click to insert code>
VAR Country Customer_FirstName Customer_LastName Birth_Date <double-click to insert code>
<double-click to insert code>
Callout: statements inside the PRINT step.
Insert Custom Code in a Task
•Additional locations enable global statements or additional steps to be inserted before or after the main code.
[Screenshot: User Code dialog]
<double-click to insert code>
PROC PRINT DATA=WORK.SORTTempTableSorted OBS="Row number" LABEL
<double-click to insert code>
RUN;
<double-click to insert code>
/* End of task code. */
RUN; QUIT;
<double-click to insert code>
PROC SQL;
%_SASTASK_DROPDS(WORK.SORTTempTableSorted); QUIT;
TITLE; FOOTNOTE;
<double-click to insert code>
Callout: locations for global statements or additional steps.
Default SAS Enterprise Guide Footnote
[Screenshot: Tools > Options dialog, Tasks > Tasks General page, showing the default title and footnote text for task output.]
The default footnote includes macro references to the SAS server name, operating system, and date and time that the task runs.
Default footnote text for task output:
Generated by the SAS System version &SYSVER (&_SASSERVERNAME, &SYSSCPL) on %TRIM(%QSYSFUNC(DATE(), NLDATE20.)) at %TRIM(%SYSFUNC(TIME(), NLTIMAP20.))
ODS and SAS Enterprise Guide
Default result formats can be set under Tools > Options.
[Screenshot: Options dialog, Results > Results General page. Result formats: SAS Report, HTML, RTF, PDF, Text output; default: SAS Report. Managing results: replace results (prompt before replacing); display SAS log when errors occur; automatically open data or results when generated; link handcoded ODS results; change task icon when warnings occur; show generated wrapper code in SAS log; prompt before opening results larger than a given size in MB; maximum number of output data sets to add to the project: 50.]
ODS and SAS Enterprise Guide
•Additional settings can be made for each result format.
[Screenshot: Options dialog, Results > PDF page. Appearance: style (printer), columns, color, add bookmarks. File header: author, keywords, subject, title. Advanced: additional options for the ODS PDF statement.]
ODS and SAS Enterprise Guide
Task properties can be used to override the default for an individual task.
Generated output can be switched off completely and handled by inserting code.
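As an illustration of handling output in code instead of through the task defaults, the following is a minimal sketch. It assumes the SASHELP.CLASS sample table is available; the file name class_listing.pdf is a placeholder, and the SAS session must be able to write to its current working directory.

```sas
/* Route one listing to a PDF file using the 'printer' style.   */
/* The file name is hypothetical and is resolved relative to    */
/* the working directory of the SAS session.                    */
ods pdf file="class_listing.pdf" style=printer;

title "Class listing";
proc print data=sashelp.class label;
    var name sex age;
run;

/* Close the destination so that the file is actually written. */
ods pdf close;
title;
```

In a task, a block like this would typically be split between the insert points before and after the generated step.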
Right-click on a task icon and select Properties.
[Screenshot: Properties for List Data dialog, Results page. Choose either "Use preferences from Tools -> Options" (here HTML - EGDefault) or "Customize result formats, styles, and behavior". Customizable formats: SAS Report, HTML, PDF, RTF, Text, each with its own style (EGDefault, printer, Rtf); graph format: ActiveX; option "Automatically open data or results when generated".]
SAS Enterprise Guide Help (Review)
•If Help files were installed along with SAS Enterprise Guide, you can select Help to access the Help facility regarding both the point-and-click functionality of SAS Enterprise Guide as well as SAS syntax.
[Screenshot: SAS Enterprise Guide Help menu with the items SAS Enterprise Guide Help, SAS Syntax Help, Getting Started Tutorial, SAS on the Web, and About SAS Enterprise Guide.]
Task and Procedure Help
[Screenshot: SAS Enterprise Guide Help window, Index tab, with the keyword "list data" typed in and the entry "List Data task" selected.]
To find information regarding the syntax of the code behind the scenes of a particular task, type the name of the task in the Index tab.
List Data
About the List Data task
The List Data task prints the observations in a SAS data set, using all or some of the variables. You can create a variety of reports, ranging from a simple listing to a highly customized report that groups the data and calculates totals and subtotals for numeric variables.
[The remainder of the help text is cut off in the screenshot.]
The task help indicates the procedure name to search in the SAS syntax help.
SAS procedures used	PRINT
Required SAS products	Base SAS
Recommended additional SAS products	none
Procedure Syntax Help
[Screenshot: SAS Documentation window. The Contents tab lists the procedures from PLOT to UNIVARIATE; the PRINT Procedure entry is expanded to show Overview, PROC PRINT Statement, BY Statement, ID Statement, PAGEBY Statement, SUM Statement, SUMBY Statement, VAR Statement, Results, and Examples.]
The PRINT Procedure
Syntax: PRINT Procedure
Tip: Supports the Output Delivery System. See Output Delivery System: Basic Concepts in SAS Output Delivery System: User's Guide for details.
Tip: You can use the ATTRIB, FORMAT, LABEL, and WHERE statements. See Statements with the Same Function in Multiple Procedures for details. You can also use any global statements. See Global Statements for a list.
PROC PRINT <option(s)>;
    BY <DESCENDING> variable-1 <...<DESCENDING> variable-n> <NOTSORTED>;
    PAGEBY BY-variable;
    SUMBY BY-variable;
    ID variable(s) <option>;
    SUM variable(s) <option>;
    VAR variable(s) <option>;
Task	Statement
Print observations in a data set.	PROC PRINT
Produce a separate section of the report for ... [cut off in the screenshot]	BY
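The statements listed in the syntax above can be combined as in the following minimal sketch. It assumes the SASHELP.CLASS sample table; the data set name work.class_sorted is chosen for illustration only.

```sas
/* BY-group processing requires sorted input. */
proc sort data=sashelp.class out=work.class_sorted;
    by sex;
run;

title "Students by sex";
proc print data=work.class_sorted label;
    by sex;                 /* separate report section per BY group */
    id name;                /* identify rows by name instead of Obs */
    var age height weight;  /* columns to print, in this order      */
    sum weight;             /* subtotal per BY group and grand total */
run;
title;
```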
7. Methodology for Developing Scoring Functions
Objectives
• Understand how scorecards to predict credit risk are developed.
• Understand the analyses and issues for implementation of scorecards.
Main Stages - Development
• Stage 1: Preliminaries and Planning
• Create Business Plan
• Identify organizational objectives
• Internal versus External development, and scorecard type
• Create Project Plan
• Identify project risks
• Identify project team.
Main Stages - Development
• Stage 2: Data Review and Project Parameters
• Data availability and quality
• Data gathering for definition of project parameters
• Definition of project parameters
• Performance window and sample window
• Performance categories definition (target)
• Exclusions
• Segmentation
• Methodology
• Review of implementation plan.
Main Stages - Development
• Stage 3: Development Database Creation
• Development sample specification
• Sampling
• Development data collection and construction
• Adjusting for prior probabilities.
Main Stages - Development
• Stage 4: Scorecard Development
• Missing values and outliers
• Initial characteristic analysis
• Preliminary scorecard
• Reject inference
• Final scorecard production
• Scaling
• Points allocation
• Misclassification
• Scorecard strength
• Validation.
Main Stages - Development
• Stage 5: Scorecard Management Reports
• Gains tables and charts
• Characteristic reports.
Main Stages - Implementation
• Stage 1: Pre-Implementation Validation
• Stage 2: Strategy Development
• Scoring strategy
• Setting cutoffs
• Strategy considerations
• Policy rules
• Overrides.
Main Stages - Post Implementation
• Post-Implementation
• Scorecard and Portfolio Monitoring Reports
• Review.
Development
Stage 1: Preliminaries and Planning
Objectives
• Create a business plan to ensure a viable and smooth project.
• "All models are wrong. Some are useful." (George Box)
Create Business Plan
• Identify organizational objectives.
• Reasons for model development
• Profit, revenue, loss, automation, operational efficiency
• Role of scorecards in decision making
• sole arbiter or decision support tool?
Create Business Plan
• Internal/External Development and Scorecard Type
• Capability and resources
• Staff, tools, expertise, data
• Market segment
• Custom, generic, judgmental
• segment, data, time.
Create Project Plan
• Scope and timelines
• Deliverables (scorecard format and documentation,...)
• Implementation strategy
• Testing, coding
• Strategy development
• FYI list.
• Seamless process from planning to development and implementation.
Create Project Plan
• Identify Project Risks
• Data risks
• Availability, quality, quantity
• Weak data
• Operational risks
• Organizational priority
• Implementation delays
• System interpretation of data.
Create Project Plan
• Identify Project Team
• Roles clearly defined
• Signoff, executor, advisor, FYI
• Critical path.
Development
Stage 2: Data Review and Project Parameters
Objectives
• Identify data requirements.
• Perform pre-modeling analysis.
• Understand the business
• Exclusions
• What is a "bad"? - target definition
• Sample Window/ Performance Window.
Data Availability and Quality
• Number of "goods", "bads", and "rejects"
• Initial idea at this stage, estimated from performance reports
• Internal data
• Reliable, accessible
• External data
• Accessible, format
• Retro pull.
Data Gathering
• To determine "bad" definition and exclusions
• All applications over the last 2-5 years (or a large sample)
• account/ID number
• Date opened/applied
• Accept/reject indicator
• Arrears/payment history
• Product/channel and other identifiers
• Account status
• Other items to understand the business.
Exclusions
• "Include those whom you would score during normal day-to-day operations"
• VIP
• Staff
• Fraud
• Pre-approved
• Underage
• Cancelled (sometimes).
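The exclusions above can be applied as a simple filter before sampling. The following is a minimal sketch on invented data; the flag names (vip, staff, fraud, preapproved) and the age limit of 18 are assumptions, not fields of any data set used in this course.

```sas
/* Hypothetical application data; flags are 1 = yes, 0 = no. */
data applications;
    input app_id vip staff fraud preapproved age;
    datalines;
1 0 0 0 0 34
2 1 0 0 0 51
3 0 0 1 0 29
4 0 0 0 0 17
5 0 1 0 0 42
6 0 0 0 1 38
7 0 0 0 0 45
;
run;

/* Keep only applications that would be scored in normal operations; */
/* excluded cases are kept separately so that they can be reported.  */
data dev_base excluded;
    set applications;
    if vip or staff or fraud or preapproved or age < 18 then output excluded;
    else output dev_base;
run;
```

Writing the excluded cases to their own data set makes it easy to document how many records each exclusion removed.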
Performance
[Figure: timeline. A new account is opened during the "Sample Window"; whether it is good or bad is determined over the following "Performance Window".]
Parameters
• Performance Window
• How far back do I go to get my sample?
• Sample Window
• Time frame from which sample will be taken.
• Definition of "bad"
• Bad and approval rates (when oversampling).
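The two windows can be expressed directly in code. The sketch below uses invented accounts; the variable names, the sample window (January to March 2001), and the 18-month performance window are all assumptions made for illustration.

```sas
/* Hypothetical accounts: open date and date of first 90-day */
/* delinquency (missing if the account never reached it).    */
data accounts;
    input account_id open_date :date9. first_bad_date :date9.;
    format open_date first_bad_date date9.;
    datalines;
1 15JAN2001 20NOV2001
2 03FEB2001 .
3 28MAR2001 10OCT2003
4 12JUL2001 05JAN2002
;
run;

%let sample_start = '01JAN2001'd;  /* assumed sample window start       */
%let sample_end   = '31MAR2001'd;  /* assumed sample window end         */
%let perf_months  = 18;            /* assumed performance window length */

data sample;
    set accounts;
    /* Sample window: only accounts opened in this period are used. */
    where open_date between &sample_start and &sample_end;
    /* Performance window: same calendar day, perf_months later. */
    perf_end = intnx('month', open_date, &perf_months, 'same');
    /* Bad only if the delinquency falls inside the performance window. */
    bad = (not missing(first_bad_date) and first_bad_date <= perf_end);
    format perf_end date9.;
run;
```

Account 3 goes bad only after its performance window has ended, so it is not counted as bad here; account 4 lies outside the sample window and is dropped.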
Parameters
• Seasonality
• Plot approval rate/applications across time
• Establish any 'abnormal' zones (for example, talk to marketing).
• Sample used in development must be from a normal business period, to get as accurate a picture as possible of the target population.
Parameters - "Bad"
• Plot "bad" rate by "month opened" (cohort)
• For different definitions of bad
• 30/60/90 days past due
• Charge off/write-off
• Bankrupt
• Claim
• Profit based
• Less than x% owed collected
• Ever versus Current bad
• Ever bad should be used where possible
• Considered "bad" if you reach status anytime during performance window
Cohort Analysis - Example
Bad = 90 days
Open Date	1 Qtr	2 Qtr	3 Qtr	4 Qtr	5 Qtr
Jan-99	0.00%	0.44%	0.87%	1.40%	2.40%
Feb-99	0.00%	0.37%	0.88%	1.70%	2.30%
Mar-99	0.00%	0.42%	0.92%	1.86%	2.80%
Apr-99	0.00%	0.65%	1.20%	1.90%	
May-99	0.00%	0.10%	0.80%	1.20%	
Jun-99	0.00%	0.14%	0.79%	1.50%	
Jul-99	0.00%	0.23%	0.88%		
Aug-99	0.00%	0.16%	0.73%		
Sep-99	0.00%	0.13%	0.64%		
Oct-99	0.20%	0.54%			
Nov-99	0.00%	0.46%			
Dec-99	0.00%	0.38%			
Jan-00	0.30%				
Feb-00	0.00%				
Mar-00	0.00%				
Current versus Ever - Example
• Current bad definition: No Delinquency
• Ever bad definition: 3 months delinquent.
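Month	1	2	3	4	5	6	7	8	9	10	11	12
Delq	0	0	1	1	0	0	0	1	2	3	0	0

Month	13	14	15	16	17	18	19	20	21	22	23	24
Delq	0	0	1	2	0	0	0	1	0	1	0	0
(Circled in the original: the 3 in month 10, which is the worst status ever reached, and the 0 in month 24, which is the current status.)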
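The difference between the two definitions can be checked on the 24-month history above. This is a minimal sketch; the data set and variable names are illustrative, and "bad" is taken to mean 3 or more months delinquent.

```sas
/* Delinquency history from the example (month, months delinquent). */
data delq_history;
    input month delq @@;
    datalines;
1 0 2 0 3 1 4 1 5 0 6 0 7 0 8 1 9 2 10 3 11 0 12 0
13 0 14 0 15 1 16 2 17 0 18 0 19 0 20 1 21 0 22 1 23 0 24 0
;
run;

data status (keep=current_delq worst_delq current_bad ever_bad);
    set delq_history end=last;
    retain worst_delq 0;
    worst_delq = max(worst_delq, delq);    /* running worst status   */
    if last then do;
        current_delq = delq;               /* status in month 24     */
        current_bad  = (current_delq >= 3);
        ever_bad     = (worst_delq >= 3);
        output;
    end;
run;

proc print data=status noobs;
run;
```

For this account the current definition gives good (current_bad = 0), while the ever definition gives bad (ever_bad = 1), because of month 10.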
Determining Parameters
Bad Rate Development
[Figure: bad rate development chart; x-axis: month opened.]
- The x-axis shows the month opened, from earliest to latest, and the y-axis shows the "bad rate" as of this month. For simplicity, this is straight delinquency, with no profit component.
- Notice that at one point the bad rate levels off. This means that everyone who was going to go bad has gone bad, that is, they have been given enough time. For this bad definition, accounts from January to March are mature enough.
- Lesson 1: you need a sample that is mature enough, so that you are not classifying a "bad" as a good just because you have not given the account enough time.
- If you take accounts from the middle of the chart, some of them have not matured yet, so your bad rate is understated.
- Example: in response scoring, how long do you wait for the responses to come in? The period of measurement is the performance window.
Determining Parameters
Bad Rate Development
[Figure: bad rate development chart with the sample window marked; x-axis: month opened.]
For each definition of "bad" you get a sample window of mature accounts and a performance window indicating the time taken for the bad rate to mature, as well as the approval rate for this sample window. A couple of notes on this "maturing" process:
- A 30-day definition matures more quickly than a 90-day definition, because it takes people less time to become 30 days delinquent than 90 days delinquent. Charge-off takes even longer.
- For the same bad definition, credit cards mature more quickly than mortgages (18-24 months versus 3-5 years).
- Why do all this for the different definitions? Because each one produces different counts, and the considerations on the next slide determine the best set of parameters.
Determining Parameters - Bad
• Organizational objectives/purpose
• Tighter definition - more precise, low counts
• Looser definition - differentiation sub-optimal
• Interpretable and trackable
• Consistency
• Reality - the best definition under the circumstances (lack of data, history).
Let's look at the considerations.
- Objectives: this may seem obvious, but it is not obvious to many people. If you are building a scorecard to predict profit, then use profit. Some organizations want a delinquency-based definition that also takes profit into account. For example, if an account is chronically 2 months late but still profitable, you cannot set 2 months as "bad", whereas in a pure delinquency scorecard this may be possible.
- Tighter: a tighter definition means 90 days, 120 days, or write-off. It gives better differentiation but low counts. Remember: 2000 bads.
- Looser: a looser definition gives higher counts but sub-optimal differentiation.
- Interpretable: for example, "bad" is 2 times 60 days, 3 times 30 days, or 1 time 90 days. It sounds good, but it is very hard to track and interpret. Keep it simple.
- Consistency: stay consistent across other scorecards and products. Also, if accounting writes off an account at 7 months, stay consistent with that.
- Typically, most delinquency scorecards use 90 days.
- Reality: you take what you have. If the lack of history allows only a 30-day definition, take it. If you cannot measure the real bad rate, use a proxy (example: LOC like an account).
Sample Definitions - Bad
• Ever 90 days delinquent
• Bankrupt
• Claim over $1000
• 3 x 30 days, or 2 x 60 days, or 1 x 90 days
• Negative NPV
• Not profitable
• 50% recovered within 3 months
• Fraud over $500
• Closed within 6 months.
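A compound definition such as "3 x 30 days, or 2 x 60 days, or 1 x 90 days" translates into a single logical expression. The sketch below uses invented counts; n30, n60, and n90 are assumed to hold the number of times an account reached each delinquency level within the performance window.

```sas
/* Hypothetical counts of delinquency episodes per account. */
data accounts;
    input account_id n30 n60 n90;
    datalines;
1 0 0 0
2 3 0 0
3 1 2 0
4 2 1 0
5 0 0 1
;
run;

data accounts_flagged;
    set accounts;
    /* Logical expressions evaluate to 1 (true) or 0 (false). */
    bad = (n30 >= 3 or n60 >= 2 or n90 >= 1);
run;

proc print data=accounts_flagged noobs;
run;
```

Accounts 2, 3, and 5 meet the definition; account 4 comes close on two criteria but meets none of them.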
Confirming "Bad" Definition
• Analytical
• "Roll rate" analysis
• Current versus worst delinquency comparison
• Profitability analysis
• Consensus.
Roll Rate Analysis
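• Compare worst delinquency
• for example, Previous 12 months versus Next 12 Months
Month	1	2	3	4	5	6	7	8	9	10	11	12
Arrears	0	0	1	2	0	0	0	1	2	3	0	0

Month	13	14	15	16	17	18	19	20	21	22	23	24
Arrears	1	2	3	3	3	?	?	0	0	0	0	
(The 3 in month 10 is circled in the original. Two values after month 17 are illegible in the source and are shown as "?".)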
Roll Rate Analysis
[Figure: "Roll Rate" chart. Horizontal stacked bars, one per worst delinquency in the previous 12 months (Curr/x day, 30 day, 60 day, 90+), each split by worst delinquency in the next 12 months (Curr/x day, 30 day, 60 day, 90+). x-axis: Worst - Next 12 Mths, 0% to 100%.]
The chart shows which "bad" definition is truly bad, also known as the POINT OF NO RETURN.
- 30 day: of everyone whose worst status was 30 days, the majority became current and only a few became worse. This is not a good "bad" definition.
- 60 day: of those at 60 days, some rolled forward, but most went back, that is, became better.
- 90 day: of those at 90 days, the majority did not become better. This confirms our definition.
- In general, once you hit 90 days, you are not coming back. That is a true bad. Remember: this is based on the "bad" objective. With a different objective, the point of no return may lie elsewhere.
Roll Rate Analysis
• Look for the 'point of no return'.
• Consider objectives.
• Consider sample counts.
• Typically for delinquency, after 90 days most accounts do not cure.
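A roll rate table is a cross-tabulation of the worst status in one period against the worst status in the next. The following is a minimal sketch on invented accounts; worst_prev and worst_next are assumed to hold the worst number of months in arrears in the previous and next 12 months.

```sas
proc format;
    value delq 0        = 'Current'
               1        = '30 day'
               2        = '60 day'
               3 - high = '90+';
run;

/* Hypothetical worst delinquency per account in each period. */
data worst;
    input account_id worst_prev worst_next @@;
    datalines;
1 0 0  2 1 0  3 1 0  4 1 1  5 2 0  6 2 3  7 3 3  8 3 3  9 3 2  10 0 1
;
run;

/* Row percentages are the roll rates: of the accounts at a given */
/* worst status in the previous period, the share ending up at    */
/* each worst status in the next period.                          */
proc freq data=worst;
    tables worst_prev * worst_next / nocol nopercent;
    format worst_prev worst_next delq.;
run;
```

The point of no return is the first row in which most of the row percentage stays at the same status or worse.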
Current versus Worst Comparison
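Current Delinquency (rows) by Worst Delinquency (columns)
	Current	30 days	60 days	90 days	120 days	writeoff
Current	100%	68%	34%	15%	4%	
30 days		16%	22%	8%	5%	
60 days		8%	19%	17%	8%	
90 days		4%	14%	32%	11%	
120 days		2%	8%	18%	54%	
writeoff		2%	3%	10%	18%	100%
Bracket annotations in the original give, for each worst-delinquency column, the share of accounts that are currently below that level versus at that level or worse: 68% / 32% (30 days), 56% / 44% (60 days), 40% / 60% (90 days), and 72% at that level or worse for 120 days.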
Parameters - Goods/Indeterminates
Good
• Never delinquent
• Ever x-days delinquent
• No claims
• Profitable, positive NPV
• No fraud
• No bankruptcy
• Recovery > 75%, $ value
• Must be good throughout performance window
Indeterminate
• Mild delinquency, roll rate not conclusive either way
• Inactive
• Offer declined
• Voluntary cancellations*
• High balance < $50
Default - Definition of the Target Variable (good/bad)
• This definition is usually based on the client's number of days past due (DPD) and the amount past due. The amount past due requires setting a tolerance level, that is, deciding what is considered a material debt and what is not. For example, it may not make sense to treat amounts below CZK 100 as debt.
• A time horizon (performance window) over which these two parameters are observed must also be set.
• A good client can be defined, for example, as a client who:
• is less than 60 days past due (with a tolerance of CZK 100) in the first 6 months after the first instalment,
• is less than 90 days past due (with a tolerance of CZK 30) over the entire payment history (ever).
Default - Definition of the Target Variable
□ The choice of these parameters depends largely on the type of financial product (it will certainly differ between small consumer loans with a maturity of about one year and mortgages, which usually involve very large amounts and maturities of up to several decades) and on the further use of the definition (risk management, marketing, ...).
Default - Definition of the Target Variable
□ Another practical problem in defining a good client is the concurrence of several contracts held by one client. For example, a customer may be past due on several contracts, but with different days past due and different amounts. In that case, the amounts the client owes at a given point in time are usually summed, and the maximum of the days past due across the individual contracts is taken. This approach can be applied only in some cases, mainly when complete accounting data are available. The situation is considerably more complicated for aggregated data, for example on a monthly basis.
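The aggregation rule described above (sum of amounts, maximum of days past due) can be sketched with PROC SQL. The contract data, the variable names, and the tolerance of CZK 100 are illustrative assumptions.

```sas
/* Hypothetical contract-level snapshot at one point in time. */
data contracts;
    input client_id contract_id dpd amount_due;
    datalines;
101 1 35 1200
101 2 95 300
102 3 0 0
103 4 10 50
103 5 0 0
;
run;

%let tolerance = 100;  /* assumed materiality threshold in CZK */

proc sql;
    create table client_level as
    select client_id,
           sum(amount_due)                as total_due,
           max(dpd)                       as max_dpd,
           (sum(amount_due) > &tolerance) as material_debt
    from contracts
    group by client_id;
quit;
```

Client 101 ends up with a total of 1500 past due and 95 DPD, while client 103 stays below the tolerance and is not treated as materially in debt.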
Default - Definition of the Target Variable
• In general, we consider the following types of clients:
> good,
> bad,
> indeterminate,
> insufficient (insufficient credit history),
> excluded,
> rejected.
Default - Definition of the Target Variable
• The first two types have been discussed. The third type, indeterminate, lies on the border between good and bad clients, and using it directly affects the definition of good/bad clients. If we consider DPD only, clients with high DPD (e.g. 90+) are typically labelled bad, and non-delinquent clients (DPD equal to zero) are labelled good. Delinquent clients who do not exceed the given DPD threshold are then labelled indeterminate.
• The fourth type are typically clients with a very short payment history, for whom the target variable cannot be defined correctly.
• Excluded clients are clients whose data are so bad that they would distort the model (e.g. frauds). Another group consists of clients who are not normally scored by the given model (VIP clients).
• The last type are clients whose loan application was rejected.
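The six client types can be assigned in one pass once the required flags exist. This is a minimal sketch under stated assumptions: the input variables are invented, "insufficient" is taken to mean fewer than 6 months on book, and "bad" is taken to mean a maximum DPD of 90 or more.

```sas
/* Hypothetical client-level data. */
data clients;
    input client_id accepted excluded months_on_book max_dpd;
    datalines;
1 1 0 24 0
2 1 0 24 120
3 1 0 24 35
4 1 0 3 0
5 1 1 24 0
6 0 0 . .
;
run;

data clients_typed;
    set clients;
    length client_type $13;
    /* The order of the tests matters: rejected and excluded clients */
    /* are removed before performance is evaluated.                  */
    if not accepted then client_type = 'rejected';
    else if excluded then client_type = 'excluded';
    else if months_on_book < 6 then client_type = 'insufficient';
    else if max_dpd >= 90 then client_type = 'bad';
    else if max_dpd = 0 then client_type = 'good';
    else client_type = 'indeterminate';
run;

proc freq data=clients_typed;
    tables client_type;
run;
```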
Definition of a Good/Bad Client
[Figure: tree diagram. Customer splits into Accepted and Rejected. Labels under Accepted: GOOD; INDETERMINATE; Insufficient; Default (60 or 90 DPD), further divided into Fraud (first delayed payment, 90 DPD), Early default, and Late default (5+ delayed payments, 60 DPD).]
Performance Definitions
• "Goods" and "bads" (and rejects) are used for model development.
• Indeterminates included for Gains chart and forecasting.
Segmentation
• Can one scorecard work efficiently for all the different populations within your portfolio?
• Or would more than one scorecard be better?
• Segmentation maximizes predictiveness for unique segments within your population.
Segmentation
• Experience (Heuristic)
• Knowledge/experience, operational/industry based, common sense.
• Statistical
• Let the data speak.
• "Distinct applicant/account sub-populations"
• "Better predictive power than single model".
Experience Based Segmentation
• Product
• Card type, loan type (auto, home, unsecured), lease, used versus new, brand
• Demographics
• Geographical (region, urban/rural, state/province, internal definition, neighborhood), age, time at bureau
• Source of business
• Channel (net, branch, store-front, 'take one', brokers)
• Applicant type
• new/existing, first time home buyer, groups (retired, students, engineers), thin/thick file, clean/dirty file
• Product Owned
• Credit Card for existing mortgage/loan holders.
Experience Based Segmentation
• Consider future plans, not just historic operations
• How do we detect new segments?
• Marketing/risk analysis:
• Bad rates
• Approval rate
• Profit, and so on.
• Look for significant performance difference.
Experience Based Segmentation
• Need to confirm experience using analytics.
• Definition of segments
• What is a thin file?
• What is 'young' versus 'old'?
• What is the best demographic split?
• What break is best for 'tenure at bank'?
Confirming Experience
Rule of thumb:
• "When the same information predicts differently across unique segments"
Bad Rate
	Age > 30	Age < 30	Unseg
Rent	2.1%	4.8%	2.9%
Own	1.3%	1.8%	1.4%
Parents	3.8%	2.0%	3.2%

[label illegible]	5.0%	2.0%	4.0%
1-3	[illegible]	3.4%	2.5%
4+	1.4%	5.8%	2.3%
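A table of this kind can be produced by averaging a 0/1 bad flag within each attribute and segment. The sketch below uses a handful of invented applications, so the resulting rates are meaningless; the variable names and the age split at 30 are assumptions.

```sas
/* Hypothetical applications: age, residential status, bad flag. */
data apps;
    input age res_status $ bad;
    length age_group $8;
    if age < 30 then age_group = 'Under 30';
    else age_group = '30 plus';
    datalines;
25 Rent 1
27 Rent 0
45 Rent 0
52 Rent 0
24 Own 0
28 Own 0
41 Own 0
60 Own 1
22 Parents 0
26 Parents 0
35 Parents 1
48 Parents 0
;
run;

/* The mean of a 0/1 flag is the bad rate. The TYPES statement     */
/* requests the unsegmented rate per attribute and the rate per    */
/* attribute within each age segment.                              */
proc means data=apps mean n maxdec=3;
    class res_status age_group;
    types res_status res_status*age_group;
    var bad;
run;
```

Segmentation is worth considering when the ranking of the attributes differs between the segments, as in the rule of thumb above.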
Confirming Experience
Attributes	Bad Rates
Age	
Over 40 yrs	1.80%
30-40 yrs	2.50%
Under 30	6.90%
Source of business	
Internet	20%
Branch	3%
Broker	8%
Phone	14%
Applicant Type	
First Time buyer	5%
Renewal Mortgage	1%
That Is the Easy Way
• You can also build full segmented models, and compare "lift", sensitivity, and so on, with a base model.
It is best to perform this analysis for both experience and statistically based segmentation.
Comparing Improvement
• Use different methods to measure improvement (lift, KS, c-stat, precision, and so on).
Segment	Total c-stat	Seg c-stat	Improvement
			
Age < 30	0.65	0.69	6.15%
Age > 30	0.68	0.71	4.41%
			
Tenure < 2	0.67	0.72	7.46%
Tenure > 2	0.66	0.75	13.64%
			
Gold card	0.68	0.69	1.47%
Platinum card	0.67	0.68	1.49%
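The improvement column is the relative change of the segment c-statistic against the total c-statistic. A minimal sketch that recomputes it from the first two columns of the table (the data set and variable names are illustrative):

```sas
data cstat;
    input segment :$13. total_cstat seg_cstat;
    /* Relative improvement of the segmented model over the base model. */
    improvement = (seg_cstat - total_cstat) / total_cstat;
    format improvement percent8.2;
    datalines;
Age_under_30  0.65 0.69
Age_over_30   0.68 0.71
Tenure_under2 0.67 0.72
Tenure_over2  0.66 0.75
Gold_card     0.68 0.69
Platinum_card 0.67 0.68
;
run;

proc print data=cstat noobs;
run;
```

For the first row this gives (0.69 - 0.65) / 0.65, which is the 6.15% shown in the table.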
Comparing Improvement
• Portfolio stats will put improvements into measurable portfolio terms.
		After Segmentation		Before Segmentation	
Segment	Size	Approve	Bad	Approve	Bad
Total	100%	%	%	%	%
Age < 30	65%	%	%	%	%
Age > 30	35%	%	%	%	%
Tenure < 2	12%	%	%	%	%
Tenure > 2	88%	%	%	%	%
Gold card	23%	%	%	%	%
Platinum card	77%	%	%	%	%
Choosing Segmentation
• Cost of scorecards (internal/external)
• Implementation
• Processing
• Data storage
• Monitoring/strategy development
• Segment size
• Do I have to?
Statistically Based Segmentation
• Less preconceived notions
• Clustering
• Decision Trees.
Clustering
[Figure: scatter plot showing 3 distinct groups and one outlier.]
Clustering
Here is an insurance example of one cluster.
- What do we see here?
- Lower than average age
- More claims
- Live in region A only
- Likely to be single and drive a sports car
- This is obviously a high-risk segment (confirm this group with claims analysis).
- Clustering groups accounts by similar characteristics, not by performance, so confirm performance for the clusters and combine those with similar risk behavior. We are not building a marketing profile, but a RISK PROFILE.
Clustering
• Defining characteristics for each group
• From previous example,
• Young males region A
• Young females region A, and so on.
• Performance analysis to confirm segmentation.
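Clustering followed by a performance check can be sketched as below. This assumes that SAS/STAT is licensed (PROC STDIZE and PROC FASTCLUS); the policyholder data, the variable names, and the choice of three clusters are invented for illustration.

```sas
/* Hypothetical policyholders: age, number of claims, bad flag. */
data policyholders;
    input age claims bad;
    datalines;
22 3 1
24 4 1
23 2 0
45 0 0
47 1 0
50 0 0
68 1 0
70 2 1
72 1 0
;
run;

/* Standardize so that no variable dominates the distance measure. */
proc stdize data=policyholders out=std method=std;
    var age claims;
run;

/* k-means style clustering on characteristics only, not on performance. */
proc fastclus data=std maxclusters=3 out=clustered noprint;
    var age claims;
run;

/* Confirm the segmentation: bad rate (mean of the flag) per cluster. */
proc means data=clustered n mean maxdec=3;
    class cluster;
    var bad;
run;
```

Clusters with similar bad rates are candidates for being combined, since the aim is a risk profile rather than a marketing profile.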
Decision Trees
• Isolates segments based on performance (target)
• Easily interpretable and differentiates between goods and bads.
[Figure: decision tree]
All Good/Bads: bad rate = 3.8%
    Existing Customer: bad rate = 1.2%
        Customer > 2 yrs: bad rate = 0.3%
        Customer < 2 yrs: bad rate = 2.2%
    New Applicant: bad rate = 6.3%
        Age < 30: bad rate = 11.7%
        Age > 30: bad rate = 4.2%
So Now We Know ...
• the business
• sample and performance windows
• "bad", "good", "indeterminate"
• exclusions
• bad rate, approval rate
• number of scorecards needed, and their segments.
Methodology/Format
• Implementation platform and format
• Interpretability, implementation
• Legal compliance
• Data quality, sample size, target type
• Tracking and diagnosis
• Specify parameters for scorecard (range of scores, "points to double the odds").
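The "points to double the odds" parameter can be made concrete with the commonly used log-odds scaling, Score = Offset + Factor * ln(odds). The sketch below is an illustration under assumed values (600 points at good:bad odds of 50:1, 20 points to double the odds); it is not necessarily the scaling used later in this course.

```sas
%let base_score = 600;  /* assumed score at the base odds            */
%let base_odds  = 50;   /* assumed good:bad odds at the base score   */
%let pdo        = 20;   /* assumed points to double the odds         */

data scaling;
    factor = &pdo / log(2);                           /* log() is the natural log */
    offset = &base_score - factor * log(&base_odds);
    do odds = 25, 50, 100, 200;
        score = offset + factor * log(odds);
        output;
    end;
run;

proc print data=scaling noobs;
run;
```

By construction the score equals the base score at the base odds, and each doubling of the odds adds exactly pdo points.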
Why 'Scorecard' Format?
• Easiest to interpret, justify, implement
• Reasons for decline/low scores can be explained to auditors, Mgmt, regulators, adjudicators
• No black box
• Diagnosis, tracking, monitoring
• Development process fairly simple to understand.
Review Implementation Plan
• Number of scorecards
• Data requirements
• Manage expectations
• Continuity.
Everyone is aware of what's going on.
This is a business process, not a mystery novel. You'd be surprised how many people in companies like to spring surprises on other departments.
Exercise
Basic description of the data:
The following data sets are available:
Accepts.sas7bdat (64,589 rows), Rejects.sas7bdat (35,411 rows), Applicants.sas7bdat (100,000 rows)
... 24 columns:
ID of applicant, Date of application/opening, Accept / Reject, 30-days delinquency, 30-days delinquency date, 60-days delinquency, 60-days delinquency date, 90-days delinquency, 90-days delinquency date, Worst previous delinquency, Current delinquency, Age, Age groups, Sex, Existing client?, Phone member?, Region, Income, Income groups, Debt, Income/Debt ratio, Income/Debt ratio groups, Probability of 60-days delinquency (old), Score (old).
/* Descriptive statistics for accepted applications */
title 'Accepts';
proc means data=indata.accepts n nmiss min median mean max;
    var age income debt idratio;
run;

/* Frequency tables and cross-tabulations with the delinquency flags */
title 'Accepts';
proc freq data=indata.accepts;
    table sex client phone region;
    table (sex client phone region)*bad60;
    table bad30*(bad60 bad90) bad60*bad90;
run;

/* Bar charts for all applicants; empty the graph catalog first */
title 'All applicants';
goptions ftext='arial';
proc catalog c=gseg kill;
quit;
proc gchart data=indata.applicants;
    vbar age / midpoints=18 to 75 name='_1data_a';
    vbar income / name='_1data_b';
    vbar debt / name='_1data_c';
    vbar idratio / name='_1data_d';
    vbar type / name='_1data_e';
    vbar scoreold / levels=10 name='_1data_f';
    vbar pbad60old / levels=30 name='_1data_g';
run;
quit;

/* Distribution details and normality tests */
proc univariate data=indata.applicants normal;
    var age income debt idratio;
    histogram age income debt idratio;
run;
Exercise
Selected outputs of the code above:
Accepts
The MEANS Procedure
Variable	Label	N	N Miss	Minimum	Median	Mean	Maximum
Age	Age	64589	0	18.0000000	43.0000000	43.3129945	74.0000000
Income	Income	64589	0	15000.07	19631.47	19854.56	35790.94
Debt	Debt	64589	0	100444.85	560744.83	576945.05	1611457.12
IDRatio	Income/Debt ratio	64589	0	0.0175377	0.0345483	0.0500680	0.2994807
The FREQ Procedure
Sex
Sex	Frequency	Percent	Cumulative Frequency	Cumulative Percent
M	45138	69.88	45138	69.88
Z	19451	30.12	64589	100.00
Existing client?
Client	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	60188	93.19	60188	93.19
1	4401	6.81	64589	100.00
Phone member?
Phone	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	8081	12.51	8081	12.51
1	56508	87.49	64589	100.00
Region
Region	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1	12537	19.41	12537	19.41
2	16335	25.29	28872	44.70
3	10679	16.53	39551	61.23
4	10797	16.72	50348	77.95
5	7199	11.15	57547	89.10
6	7042	10.90	64589	100.00
All applicants
The UNIVARIATE Procedure
Variable: IDRatio (Income/Debt ratio)
Moments
N	100000	Sum Weights	100000
Mean	0.04766914	Sum Observations	4766.91379
Std Deviation	0.03680037	Variance	0.00135427
Skewness	2.10159641	Kurtosis	4.8128053
Uncorrected SS	362.660032	Corrected SS	135.425362
Coeff Variation	77.1995591	Std Error Mean	0.00011637
Basic Statistical Measures
Location		Variability	
Mean	0.047669	Std Deviation	0.03680
Median	0.033093	Variance	0.00135
Mode	.	Range	0.28194
		Interquartile Range	0.03334
Exercise
/* 2a. Bad rate development, roll rate analysis */
%let performancewindow='31dec2002'd>=datappl;
%let deliq=worstdeliq;
/* row percentages of each delinquency class by quarter of application */
proc freq data=indata.accepts /*noprint*/;
  table datappl*&deliq / out=&deliq (keep=datappl &deliq pct_row
        where=(&deliq ne '0')) outpct missing;
  format datappl yyqs7.;
  where &performancewindow;
run;
ods html path="&appl_root" file="2.&deliq..html";
goptions reset=all ftext='arial';
symbol1 i=j v=dot;
axis1 label=('Bad rate');
proc catalog c=gseg kill;
quit;
title 'Bad rate development - current delinquency';
proc gplot data=&deliq;
  plot pct_row*datappl=&deliq / name='_2curdel' grid hreverse
       vaxis=axis1 hminor=0;
run;
quit;
ods html close;
[Figure: Bad rate development by worst previous delinquency (3, 6, 9). Horizontal axis: date of application/opening by quarter, from 2002/4 back to 2000/1 (reversed); vertical axis: bad rate.]
Exercise
/* cohort analysis */
%let target=bad30; %let date=dat30;
data cohorts;
  set indata.accepts (keep=datappl bad: dat:);
  /* quarter (counted from application) in which the account went bad */
  if &target then qtr=int(yrdif(datappl,&date,'act/act')*4)+1;
  datappl=intnx('month',datappl,0);
  format datappl mmyys7.;
run;
proc freq data=cohorts noprint;
  table datappl / out=cohorts1 (drop=percent
        rename=(count=counttotal));
  table datappl*qtr / out=cohorts (drop=percent);
run;
data cohorts;
  merge cohorts cohorts1; by datappl;
  if first.datappl then cumpct=.;
  if qtr ne . then do;
    cumpct+(count/counttotal);
    output;
  end;
run;
ods html path="&appl_root" file='2.cohorts.html';
title "Cohort analysis for &target";
proc tabulate data=cohorts missing format=percent8.4;
  class datappl qtr; var cumpct;
  table datappl,qtr*cumpct=' '*sum=' ';
run;
ods html close;
Cohort analysis for bad30
Date of application/opening	qtr 1	qtr 2	qtr 3
01/2000	5.087%	6.652%	
02/2000	8.327%	6.637%	7.055%
03/2000	5.481%	6.441%	?
04/2000	5.456%	6.2?4%	6.387%
05/2000	8.000%	7.643%	
06/2000	5.218%	6.724%	
07/2000	5.437%	6.120%	
08/2000	8.321%	7.401%	7.455%
09/2000	8.345%	7.010%	
10/2000	8.023%	6.7?2%	
11/2000	5.613%	6.104%	6.213%
12/2000	8.400%	7.346%	
01/2001	8.064%	7.109%	
02/2001	5.?3?%	6.809%	
03/2001	8.271%	6.766%	6.078%
04/2001	8.035%	7.870%	7.025%
05/2001	8.201%	6.884%	
06/2001	5.301%	6.996%	6.051%
07/2001	1.058%	6.720%	
08/2001	8.338%	7.377%	
09/2001	8.376%	7.118%	
10/2001	8.446%	7.200%	
11/2001	5.050%	6.415%	
12/2001	5.521%	6.704%	6.781%
01/2002	5.666%	6.812%	
02/2002	5.613%	6.684%	6.812%
03/2002	5.024%	6.408%	
04/2002	5.778%	6.426%	
05/2002	5.304%	5.974%	
06/2002	8.070%	6.000%	
07/2002	5.374%	5.817%	
08/2002	5.405%	5.622%	
09/2002	5.483%	5.880%	
10/2002	?.6?6%		
11/2002	2.563%		
Exercise
/* performance window */
%let performancewindow='31dec2002'd>=datappl;
/* mean of each bad flag (= bad rate) by quarter of opening */
proc tabulate data=indata.accepts out=brdev (drop=_type_ _table_ _page_);
  class datappl;
  var bad90 bad60 bad30;
  table datappl,(bad90 bad60 bad30)*mean*format=percent8.2;
  format datappl yyqs7.;
  where &performancewindow;
  label datappl='Month opened';
run;
ods html path="&appl_root" file='2.perf.html';
goptions reset=all ftext='arial';
symbol1 i=j v=dot;
axis1 label=('Bad rate');
proc catalog c=gseg kill;
quit;
title 'Bad rate development';
proc gplot data=brdev;
  plot (bad:)*datappl / name='_2perf' grid overlay legend hreverse
       vaxis=axis1 hminor=0;
run;
quit;
ods html close;
[Figure: Bad rate development. Mean of Bad90, Bad60 and Bad30 by quarter opened, from 2002/4 back to 2000/1 (reversed axis).]
Exercise
/* bad rate development */
%let samplewindow='30jun2001'd>=datappl>='01apr2001'd;
%let samplewindow='31dec2001'd>=datappl; /* alternative window; this second definition overrides the first */
proc freq data=indata.accepts noprint;
  table dat60 / out=development missing;
  format dat60 mmyys7.;
  where &samplewindow;
run;
data development;
  set development;
  /* row 1 holds the missing dat60 values (no 60-days delinquency), so it is skipped */
  if _n_>1 then do;
    dat60=intnx('month',dat60,0);
    cum_pct+percent;
    output;
  end;
  label datappl='Month of opening';
run;
ods html path="&appl_root" file='2.badratedev.html';
goptions reset=all ftext='arial';
symbol1 i=j v=dot;
axis1 label=('Bad rate');
proc catalog c=gseg kill;
quit;
title 'Bad rate development';
proc gplot data=development;
  plot cum_pct*dat60 / name='_2brd' grid;
run;
quit;
ods html close;
[Figure: Bad rate development. Cumulative percent (cum_pct) against 60-days delinquency date, 01/2000 to 10/2002.]
Exercise
/* BRDEV macro */
%macro brdev(data,out,datevar,targetvar,samplewindow);
proc freq data=&data noprint;
  table &datevar / out=&out missing;
  format &datevar mmyys7.;
  where &samplewindow;
run;
data &out (keep=date cum_pct);
  set &out;
  if _n_>1 then do;
    date=intnx('month',&datevar,0);
    cum_pct+percent;
    output;
  end;
  format date mmyys7.;
run;
%mend brdev;
%let samplewindow='30jun2001'd>=datappl>='01apr2001'd;
%brdev(indata.accepts,development,dat60,bad60,&samplewindow)
/* several bad rate developments */
%let samplewindow='30jun2001'd>=datappl>='01apr2001'd;
%brdev(indata.accepts,development30,dat30,bad30,&samplewindow)
%brdev(indata.accepts,development60,dat60,bad60,&samplewindow)
%brdev(indata.accepts,development90,dat90,bad90,&samplewindow)
data developmentsev;
  set development30 (in=_30) development60 (in=_60) development90;
  if _30 then type='30';
  else if _60 then type='60';
  else type='90';
run;
/* annotation: label and two horizontal bars marking the windows */
data anno;
  function='label'; x=20; y=2; text='Sample window'; output;
  size=2; function='move'; x=10; y=2.5; output;
  function='draw'; x=30; y=2.5; output;
  function='move'; x=20; y=3.5; output;
  function='draw'; x=140; y=3.5; output;
run;
ods html path="&appl_root" file='2.badratedev_several.html';
goptions reset=all ftext='arial';
symbol1 i=j v=dot;
axis1 label=('Bad rate');
proc catalog c=gseg kill;
quit;
title 'Several bad rates development';
proc gplot data=developmentsev annotate=anno;
  plot cum_pct*date=type / grid vminor=0 name='_2brds' vaxis=axis1;
  format date mmyys5.;
  label date='Performance window';
run;
quit;
ods html close;
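[Figure: Several bad rates development. Cumulative percent by month for 30, 60 and 90 days delinquency, with the sample window and the performance window marked.]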
Exercise
/* Roll rate analysis */
ods html path="&appl_root" file='2.roll_rate.html';
proc format;
  value $deliq (notsorted)
    '0'=' no delinquency'
    '3'='30 days'
    '6'='60 days'
    '9'='90+ days';
run;
/* row percentages: current delinquency within each worst previous delinquency */
proc tabulate data=indata.accepts out=rollrate missing;
  class curdeliq worstdeliq;
  tables worstdeliq,curdeliq*rowpctn;
  format curdeliq $deliq. worstdeliq $deliq.;
  title 'Roll rate analysis';
run;
axis1 label=none minor=none;
proc gchart data=rollrate;
  hbar3d worstdeliq / sumvar=pctn_01 subgroup=curdeliq nostats
         clipref autoref raxis=axis1;
run;
quit;
ods html close;
[Figure: Roll rate analysis. Stacked horizontal bars, one per worst previous delinquency (no delinquency, 30 days, 60 days, 90+ days), split by current delinquency (no delinquency, 30 days, 60 days, 90+ days); horizontal axis 0 to 100 percent.]
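The roll rate table built by PROC TABULATE above is a cross-tabulation with row percentages. As a language-neutral illustration, here is a minimal Python sketch of the same calculation; the function name `roll_rate_table` and the toy accounts are hypothetical and are not part of the course data.

```python
from collections import Counter

def roll_rate_table(accounts):
    """Row percentages of current delinquency within each worst-delinquency class.

    accounts: iterable of (worst, current) codes, e.g. ("3", "0").
    Returns {worst: {current: percent}}.
    """
    accounts = list(accounts)
    cell_counts = Counter(accounts)
    row_totals = Counter(worst for worst, _ in accounts)
    table = {}
    for (worst, current), count in cell_counts.items():
        table.setdefault(worst, {})[current] = 100.0 * count / row_totals[worst]
    return table

# Hypothetical toy portfolio (codes as in the $deliq format: 0, 3, 6, 9).
toy_accounts = (
    [("0", "0")] * 6
    + [("3", "0")] * 3 + [("3", "3")] * 1
    + [("6", "0")] * 1 + [("6", "6")] * 1
)
rates = roll_rate_table(toy_accounts)
```

In this toy example 75% of the accounts whose worst status was 30 days are currently not delinquent, which is the kind of "cure" share a roll rate analysis is meant to expose when choosing the bad definition.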
8. Data Preparation II
Development
Stage 3: Development Database Creation
Development Sample Specification
The development sample specification states what we need in the database used for development. We are not going to take a dump of everything from the CDW or data mart.
• Make the development process manageable and efficient:
• list of characteristics (the "variables" to be considered for development; you don't want the entire DW)
• sample sizes (for each segment and category; there is no point regressing on 100k records when 3k will suffice)
• parameters from the previous section.
• Do all this bearing in mind the number of scorecards you want developed and for which segments.
Characteristic Selection
How do you select characteristics? Some thought needs to go into the selection process.
Get together with risk, marketing and product, and bring operations areas such as collections aboard (who knows your bad guys better than anyone else?).
• Expected predictive power
• Reliability: is the item manipulated, or prone to manipulation (e.g. salary)? Check historical data: some items cannot be confirmed or are too expensive to confirm. Can the item be interpreted (occupation/industry type is among the worst cases)? Do people usually leave it blank?
• manipulation (non-confirmable)
• interpretation (present and future)
• missing
• Legal issues (can't ask for or get some information? Might some of it get you into trouble?)
Characteristic Selection
• Ease in collection
• Do you want to spend time chasing missing information for a credit card? It may be OK for a mortgage. How easy is it to get this piece of information?
• Policy rules
• Don't include anything that is an unchangeable policy rule, e.g. bankruptcy. If you are going to decline all bankruptcies, there is no need to use it in the scorecard.
• Derived variables - ratios
• You can build a lot of ratios, but put some business thought into it.
• Future direction.
• Will this information be collected in the future (e.g. application form redesign)?
• Industry direction: something not relevant today may change; you can include it in the card or collect it for the future, e.g. higher credit lines. Talk to credit bureaus about industry trends and how they affect the scorecard.
What you are doing: looking at objectives, company operations, business knowledge, ground realities etc. This is not just a stats exercise!
Sampling
• Development, validation
• 70:30, 80:20
• If the sample is small, develop on 100%, but validate on several 50-80% samples.
• Good, bad, reject
• 2000 of each (or more)
• Oversampling (common when modeling rare events; it leads to better predictions)
• Proportional sample - not recommended for low bad rates.
• Take what you have for bads and sample the goods.
• Ensure that each group has sufficient numbers for meaningful analysis.
Data Collection and Database Construction
• Random and representative
• for each segment applicants (and accounts)
• One for unsegmented (to measure lift from segmentation)
• Data quirks, changes (preferably documented)
• e.g. the code for renters changed from R to E, some data item stopped being collected, new data fields were added, data started being collected only recently etc.
• Objective: data collected as specified.
Adjusting for Prior Probabilities
When oversampling, adjust to the actual:
• Approval rate
• Bad rate
so that analysis and reports reflect reality.
The adjustment is not needed if you only want to know relationships or rank ordering.
Example population:
• Through-the-door: 10,000, split into Rejects: 2,950 and Accepts: 7,050
• Accepts: 7,050, split into Bads: 874 and Goods: 6,176
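The "actual" approval rate and bad rate that the sample has to be adjusted to follow directly from the through-the-door counts on this slide. A small Python sketch of the arithmetic (the variable names are illustrative only):

```python
# Counts taken from the through-the-door example on the slide.
through_the_door = 10_000
rejects = 2_950
accepts = 7_050
bads = 874
goods = 6_176

# Population (prior) rates to which an oversampled development sample is adjusted.
approval_rate = accepts / through_the_door   # share of applicants accepted
bad_rate = bads / accepts                    # share of bads among accepts
good_bad_odds = goods / bads                 # goods per one bad among accepts
```

With these counts the approval rate is 70.5% and the bad rate among accepts is about 12.4%.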
Adjusting for Oversampling
• Separate sampling is standard practice (helps when you just did 'bad' definition)
• Prior probabilities must be known
• Can adjust before fitting the model or after.
• Two ways:
• Offset
• Sampling weights (frequency variable).
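Offset Method
• Logistic model: $\operatorname{logit}(p_i) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
• When oversampling, the logits are shifted by the offset:
• $\operatorname{logit}(p_i^*) = \ln\left(\frac{\rho_1 \pi_0}{\rho_0 \pi_1}\right) + \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
• Where
• $\rho_1$ and $\rho_0$ = proportions of the target classes in the sample
• $\pi_1$ and $\pi_0$ = proportions of the target classes in the population.
Offset Method
• Adjustment post-model (after model development):
• $\hat{p}_i = \dfrac{\hat{p}_i^* \rho_0 \pi_1}{(1 - \hat{p}_i^*)\, \rho_1 \pi_0 + \hat{p}_i^* \rho_0 \pi_1}$
• Where $\hat{p}_i^*$ is the unadjusted estimate of the posterior probability.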
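To make the offset adjustment concrete, the following Python sketch evaluates the offset and applies the post-model correction in two equivalent ways: with the closed formula and by subtracting the offset on the logit scale. The function names and the example priors (a 50:50 sample, 4% events in the population) are assumptions chosen for illustration.

```python
import math

def offset(rho1, pi1):
    """Offset ln(rho1*pi0 / (rho0*pi1)) for sample share rho1 and population share pi1."""
    rho0, pi0 = 1.0 - rho1, 1.0 - pi1
    return math.log((rho1 * pi0) / (rho0 * pi1))

def adjust_posterior(p_star, rho1, pi1):
    """Closed-form post-model adjustment of an unadjusted probability p_star."""
    rho0, pi0 = 1.0 - rho1, 1.0 - pi1
    numerator = p_star * rho0 * pi1
    return numerator / ((1.0 - p_star) * rho1 * pi0 + numerator)

def adjust_via_logit(p_star, rho1, pi1):
    """Same adjustment done on the logit scale: subtract the offset, transform back."""
    logit = math.log(p_star / (1.0 - p_star)) - offset(rho1, pi1)
    return 1.0 / (1.0 + math.exp(-logit))

# Assumed example: balanced sample (rho1 = 0.5), 4% events in the population.
p_closed = adjust_posterior(0.8, rho1=0.5, pi1=0.04)
p_logit = adjust_via_logit(0.8, rho1=0.5, pi1=0.04)
```

Both routes give the same value (about 0.143 for an unadjusted 0.8), and an unadjusted probability equal to the sample share 0.5 maps back to the population share 0.04.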
SAS Programs - Pre-model Adjustment
data develop; set develop;
  off=(offset calc); /* ln(rho1*pi0 / (rho0*pi1)) */
run;
proc logistic data=develop ...;
  model ins=... / offset=off;
run;
proc score ...;
p=1/(1+exp(-ins));
proc print; var p ...; run;
SAS Program - Post-model Adjustment
proc logistic data=develop ...; run;
proc score ... out=scored ...; run;
data scored; set scored;
  off=(offset calc);
  p=1/(1+exp(-(ins-off)));
run;
proc print data=scored ...;
  var p ...;
run;
Sampling Weights
• Adjusts data to reflect the true population
• Weights: $\pi_1/\rho_1$ and $\pi_0/\rho_0$ (population proportion divided by sample proportion of each class)
• Or set weight of bad = 1 and weight of good = p(good)/p(bad) for the population.
• For example, p(bad)=4%, 2000 goods, 2000 bads. The weighted sample will show 2000 bads and 48,000 goods.
• Normalization causes less distortion in p values and standard errors.
• Use the FREQ variable in EM, or calculate a sample weight and use it in the WEIGHT statement of the LOGISTIC procedure.
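The two weighting schemes on this slide can be checked numerically. The Python sketch below uses the slide's example (4% bads in the population, a sample of 2000 goods and 2000 bads); the function names are hypothetical.

```python
def normalized_weights(pi_bad, rho_bad):
    """Weights pi/rho per class; they keep the weighted sample size unchanged."""
    pi_good, rho_good = 1.0 - pi_bad, 1.0 - rho_bad
    return {"good": pi_good / rho_good, "bad": pi_bad / rho_bad}

def relative_weights(pi_bad):
    """Weight of bad fixed at 1, weight of good = p(good) / p(bad)."""
    return {"good": (1.0 - pi_bad) / pi_bad, "bad": 1.0}

n_good, n_bad = 2000, 2000          # oversampled development sample
pi_bad = 0.04                       # population bad rate
rho_bad = n_bad / (n_good + n_bad)  # sample bad rate (0.5)

norm = normalized_weights(pi_bad, rho_bad)
rel = relative_weights(pi_bad)

weighted_goods_rel = rel["good"] * n_good                     # about 48,000
weighted_total_norm = norm["good"] * n_good + norm["bad"] * n_bad  # about 4,000
```

The relative weights reproduce the 2000 bads and 48,000 goods quoted above, while the normalized weights give the same 24:1 ratio but keep the weighted total at the actual sample size of 4000, which is why they distort p values and standard errors less.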
SAS Program
• When using the WEIGHT statement, some output is not correct.
data develop; set develop;
  /* pi0, pi1 = population proportions; rho0, rho1 = sample proportions */
  sampwt=(pi0/rho0)*(ins=0) + (pi1/rho1)*(ins=1);
run;
proc logistic data=develop ...;
  weight sampwt;
  model ins=...;
run;
What Is the Difference?
• The parameter estimates will be different.
• When linear-logistic model is correctly specified, offset is better.
• When logistic model is an approximation of some non-linear model, weights are better.
• For scorecards, weighting is better since it corrects the parameter estimates used to derive scores (prior probabilities only affect the predicted probabilities).
Development
Stage 4: Scorecard Development
Objective
• Understand a methodology for developing and assessing risk scorecards.
• Grouped attributes
• Logistic regression
• Reject inference
• Scaled points.
Process Flow - Application Scorecard
Explore Data → Data Cleansing → Initial Characteristic Analysis (Known Good Bad) → Preliminary Scorecard (KGB) → Reject Inference → Initial Characteristic Analysis (All Good Bad) → Final Scorecard (AGB) → Validate
Final Scorecard (AGB):
• Scaling
• Assessment
Process Flow - Behavior Scorecard
Explore Data → Data Cleansing → Initial Characteristic Analysis (Known Good Bad) → Final Scorecard → Validate
Final Scorecard:
• Scaling
• Assessment
Before you start...
Explore the data, visualize (Insight in SAS EM)
• Distributions
• mean, max/min, range, missing
• Compare with overall portfolio distributions
• Data integrity (any garbage, outliers)
• Ensure data meets the data specifications done earlier.
• Check that 0s mean zero, not missing values.
• Population stability check:
• Month-by-month table of the distribution of each predictor (e.g. 200701 men 55%, women 45%; 200702 men 57%, women 43%)
Missing Values and Outliers
• Missing (ALL financial data has missing and garbage values)
• Complete case analysis - exclude everything with missing data. In CS, you'll end up with nothing.
• Exclude characteristics or records with significant missing values
• Group 'missing' as a distinct attribute - the weight of missing will tell you what missing contains. If it is close to neutral, good, since it is random. Recommended: recognize that missing data has information value and may not be randomly missing. Find the value and use it. In addition, including points for 'missing' in the scorecard takes care of people who leave the field blank.
• Impute missing values - don't use the mean/most likely value; a model-based imputation (e.g. a decision tree) may be better.
• Outliers (and mis-keys)
• Exclude/replace records.
Missing Values
• Missing data is not usually random
• Missing data can be related to the target
• Applicants new at a job may leave years at employment blank
• Low income or commercial customers leave income blank
• Do bad customers leave certain fields blank?
• Including and grouping missing data can answer this question.
Initial Characteristic Analysis
• Analyze individual characteristics
• Identify strong characteristics
• Best differentiators between 'good' and 'bad'
• Screening
• Select characteristics for regression (variable selection).
Initial Characteristic Analysis
• Start by performing initial grouping for each characteristic and rank order by Information Value (PROC DMSPLIT or SPLIT, or EM node)
• Alternative: rank order characteristics by Chi Square or another method
• Fine tune the grouping for stronger characteristics
• You may want to perform other analysis prior to this (for example, use principal components to identify collinear characteristics)
• Some people use principal components (PROC VARCLUS) to identify which characteristics they need from each cluster, and then concentrate on the best one out of each.
Criteria for Variable Selection
• Predictive power of attribute: Weight of Evidence
• Range and trend of WOE across attributes
• Predictive power of characteristic: Information Value, Gini index (coefficient)
• Operational/business considerations.
Weight of Evidence
Age	Count	Distr Count	Goods	Distr Good	Bads	Distr Bad	Bad rate	Weight
Missing	50	3.00%	43	2.40%	8	4.10%	16%	-55.497
18-22	200	10.00%	152	8.40%	48	24.90%	24%	-108.405
23-26	300	15.00%	246	13.60%	54	28.00%	18%	-72.039
27-29	450	23.00%	405	22.40%	45	23.30%	10%	-3.951
30-35	500	25.00%	475	26.30%	25	13.00%	5%	70.771
35-44	350	18.00%	349	19.30%	11	5.70%	3%	122.044
44 +	150	8.00%	147	8.10%	3	1.60%	2%	165.509
Total	2,000		1,807		193		9.65%	
Information Value = 0.668
Weight = ln(Distr Good / Distr Bad) × 100
Weight of Evidence
• Measures the strength of each (grouped) attribute in separating goods and bads
• (Distr Good / Distr Bad) = odds of being good
• Negative weight: the attribute holds a larger share of the bads than of the goods
• Logical trend
• For age 23-26:
WOE = ln(0.136 / 0.28) = -0.722 (× 100 = -72.2)
Information Value (Strength)
For the age table above:
Information Value = Σ (Distr Good - Distr Bad) × Weight
Kullback, S., Information Theory and Statistics (1959)
Information Value
• $IV = \sum \left[ (\text{Distr Good} - \text{Distr Bad}) \times \ln(\text{Distr Good} / \text{Distr Bad}) \right]$
• with the figures used in decimal format (for example, 0.136).
• Rule of thumb:
• < 0.02: unpredictive
• 0.02 - 0.1: weak
• 0.1 - 0.3: medium
• 0.3 +: strong
• Too strong? (IV > 0.5) - use it in a controlled way (add such characteristics at the end of the regression to see if they add any incremental value)
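The WOE and Information Value formulas can be reproduced from the good and bad counts of the age table. The Python sketch below assumes the counts as printed on the slide and the totals from its Total row (1,807 goods, 193 bads); the names `AGE_TABLE` and `woe_and_iv` are illustrative.

```python
import math

# (attribute, goods, bads) as printed in the age table.
AGE_TABLE = [
    ("Missing", 43, 8),
    ("18-22", 152, 48),
    ("23-26", 246, 54),
    ("27-29", 405, 45),
    ("30-35", 475, 25),
    ("35-44", 349, 11),
    ("44 +", 147, 3),
]
TOTAL_GOODS, TOTAL_BADS = 1807, 193  # Total row of the table

def woe_and_iv(rows, total_goods, total_bads):
    """Return ({attribute: WOE x 100}, information value)."""
    woe = {}
    iv = 0.0
    for label, goods, bads in rows:
        distr_good = goods / total_goods
        distr_bad = bads / total_bads
        w = math.log(distr_good / distr_bad)
        woe[label] = 100.0 * w
        iv += (distr_good - distr_bad) * w   # decimal format, no x100 scaling
    return woe, iv

woe, iv = woe_and_iv(AGE_TABLE, TOTAL_GOODS, TOTAL_BADS)
```

With these inputs the weight for age 23-26 comes out at about -72.04, matching the table, and the information value is roughly 0.66, i.e. a strong characteristic by the rule of thumb.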
Grouping
Groups with similar WOE are put together
• For continuous variables, groups are created so as to maximize difference from one group to next - and maintain logical trend for WOE
• Why Group?
• Easier way to deal with outliers with interval variables, and for rare classes
• Format of the scorecard
• Easy to understand relationships
• Model non-linear dependencies with linear models
• Control the process
528
Grouping
[Figure: Grouping of the demographic scorecard variable "age". The left panel shows the dependence of the bad rate (smoothed using a normal probability density function) on the variable; the right panel shows the cumulative distribution function. Vertical lines mark the borders between categories, horizontal red lines in the left panel show the mean bad rate in each category, and horizontal blue lines in the right panel show the relative distribution of observations across the categories.]
Logical Trend
Final weightings make sense. Enables buy-in from risk managers.
Confirms business experience
• young people are higher risk
• higher debt service means higher risk
Reduces overfitting if done right - model the overall trend, not quirks. Remember how long the scorecard has to last: it is not going to be used for the next campaign and then discarded.
A linear relationship is not always true, but a trend is needed to confirm it, backed up with business experience. E.g. revolving open burden shows a 'banana curve' everywhere and is now accepted as such. People don't try to make it straight.
Logical Trend
Obviously not a logical trend!
Logical Trend
[Figure: Predictive strength (weight) by age for two characteristics, drawn as a blue and a red line.]
Which line shows a logical trend?
Both are logical. What's the difference?
The blue line shows good differentiation.
The red line is flat; this characteristic is likely very weak, which will be reflected in the IV.
Stability check
Check the stability of the grouping throughout the whole development time window:
[Figure: distribution of observations across the age groups by quarter.]
Business Factors
• Nominal values
• group based on similar weight (for example, postal code, occupation)
• investigate splits on urban/rural, regional
• Breaks concurrent with policy rules
• Sanity check.
Variable Selection
List of information values of variables (predictors)
No	Characteristic	IV Rank	Information Value
1	Max delinq L9M	1	0.176
2	Months since delinquent	2	0.176
3	Active contract (Y/N)	3	0.045
4	Average Delinquency L9M	4	0.087
5	Months since >10 dpd	5	0.144
6	Max delinq L3M	6	0.117
7	Average Delinquency L3M	7	0.108
8	Age of oldest contract	8	0.013
9	Number of months on collections as % total time on book	9	0.132
10	Months since >20 dpd	10	0.091
11	Months since >30 dpd	11	0.054
12	Num rejected applications L9M	12	0.033
13	Times 30+ dpd L9M	13	0.042
14	Total Payment L3M	14	0.018
15	Months since >40 dpd	15	0.030
16	Current balance as % of highest ever balance	16	0.048
17	Times 30+ dpd L3M	17	0.024
18	Payment Method	18	0.001
Exercise - profile
/* 2b. Profiles */
%let input=income; %let groups=yes; %let n_groups=4;
/* grouping 1 - quantiles */
proc rank data=indata.accepts (keep=&input) groups=&n_groups out=bins;
  var &input; ranks bin;
run;
proc summary data=bins nway missing;
  class bin;
  output out=bins (drop=_type_) min(&input)=start max(&input)=end;
run;
/* turn the bin limits into a numeric format _bin. */
data bins;
  set bins;
  label=compress(put(start,best.))||' - '||compress(put(end,best.));
  fmtname='_bin';
  type='N';
run;
proc format cntlin=bins; run;
%macro profile(input,groups);
/* Profile of &input according to BAD60 */
proc summary data=indata.accepts;
  class &input;
  output out=_bins (drop=_type_ rename=(_freq_=_n))
         sum(bad60)=_n1;
  %if %upcase(&groups)=YES %then %do;
  format &input _bin.;
  %end;
run;
data _bins;
  set _bins end=_finish;
  /* the first row is the overall total */
  if _n_=1 then do;
    _all_n=_n;
    _all_n1=_n1;
    _all_n0=_n-_n1;
    retain _all_n:;
  end;
  else do;
    _p=_n/_all_n;
    _n0=_n-_n1;
    _p1=_n1/_all_n1;
    _p0=_n0/_all_n0;
    _r1=_n1/_n;
    _r0=_n0/_n;
    _woe=log((_p0)/(_p1))*100;
    _all_iv+(_p0-_p1)*_woe/100;
    output;
  end;
  if _finish then do;
    call symput('groups',compress(put(_n_-1,best.)));
    call symput('iv',compress(put(_all_iv,8.4)));
    call symput('br',compress(put(_all_n1/_all_n,best.)));
  end;
  attrib
    _n label='N'
    _p label='%' format=percent8.1
    _n1 label="N of Bad" _n0 label="N of Good"
    _p1 label="% of Bad" format=percent8.1
    _p0 label="% of Good" format=percent8.1
    _r1 label="Bad rate" format=percent8.1
    _r0 label="Good rate" format=percent8.1
    _woe label='WOE' format=8.2
    &input label="Group of &input";
  drop _all:;
run;
/* one row per group and target class for the charts */
data _chart (keep=&input _sub _n _p _r);
  set _bins (keep=&input _n0 _p0 _r0 _n1 _p1 _r1);
  length _sub $4;
  _sub="Good";
  _n=_n0; _p=_p0; _r=_r0; output;
  _sub="Bad";
  _n=_n1; _p=_p1; _r=_r1; output;
  attrib
    _n label='N' format=8.0
    _p label='%' format=percent8.1
    _r label='Rate' format=percent8.1
    _sub label='Target';
run;
proc datasets nolist;
  delete gseg / memtype=catalog;
quit;
ods listing close;
goptions reset=all ftext='arial' htext=1.5 ftitle='arial' htitle=2;
proc gchart data=_chart;
  axis1 style=0;
  axis2 minor=none order=(0 to 1 by .25) label=none;
  axis3 minor=none label=none;
  axis4 minor=(n=4) label=none;
  where _sub="Bad";
  hbar &input / discrete sumvar=_r noframe nostats
       maxis=axis1 raxis=axis3 autoref cref=graya0 clipref name="_1";
  title "Bad rates";
run;
  where;
  hbar &input / discrete subgroup=_sub sumvar=_n noframe nostats
       maxis=axis1 raxis=axis3 autoref cref=graya0 clipref name="_2";
  title "Bad / Good frequencies";
run;
quit;
proc gchart data=_bins;
  hbar &input / discrete sumvar=_woe noframe nostats
       maxis=axis1 raxis=axis4 autoref cref=graya0 clipref name="_3";
  title "Weight of evidence";
run;
  hbar &input / discrete sumvar=_p1 noframe nostats
       maxis=axis1 raxis=axis4 autoref cref=graya0 clipref name="_4";
  title "Bad distribution";
run;
quit;
ods html path="&appl_root" file="5.profile.html" style=statdoc;
proc report data=_bins nofs style(summary)=[htmlclass="Header"];
  columns ("Attributes of &input" &input) ('Total' _n _p)
          ("Good" _n0 _p0) ("Bad" _n1 _p1) ('Measures' _r1 _woe);
  define &input / group;
  compute after;
    _r1.sum=&br;
    _woe.sum=.;
  endcomp;
  rbreak after / summarize;
  title "Bad / Good by &input";
  footnote "IV=&iv (<0.02 unpredictive, <0.1 weak, <0.3 medium, <0.5 strong, >0.5 over)";
run;
goptions device=gif;
/* replay the four charts into one 2 x 2 panel */
proc greplay nofs;
  footnote;
  igout gseg;
  tc sashelp.templt;
  template l2r2;
  treplay 1:_1 2:_2 3:_3 4:_4 name="5_profil";
run; quit;
title;
footnote;
ods html close;
ods listing;
%mend profile;
%profile(&input,&groups)
Bad / Good by income
Group of income	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
15000.067206 - 17541.45177	16147	25.0%	15631	25.0%	516	26.1%	3.2%	-4.55
17541.61053 - 19631.429437	16147	25.0%	15688	25.1%	459	23.2%	2.8%	7.52
19631.471069 - 21723.106242	16148	25.0%	15683	25.0%	465	23.5%	2.9%	6.19
21723.273059 - 35790.940583	16147	25.0%	15612	24.9%	535	27.1%	3.3%	-8.29
						100.0%	3.1%	
IV=0.0046 (<0.02 unpredictive, <0.1 weak, <0.3 medium, <0.5 strong, >0.5 over)
Exercise
/* profile multiple characteristics at once */
%model_profilevar
(
data=data.accepts,
interval=age income idratio,
binary=sex phone client,
ordinal=age_grp income_grp region,
groups=5,
target=bad30,
rep_out=&appl_root
)
Bad / Good by Sex
Group of sex	Sex	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
M	M	45138	69.9%	42061	69.6%	3077	74.7%	6.8%	-7.16
Z	Z	19451	30.1%	18410	30.4%	1041	25.3%	5.4%	18.59
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Unpredictive (IV = 0.0133, 2 groups)
Bad / Good by Phone member?
Group of phone	Phone member?	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
0	0	8081	12.5%	7431	12.3%	650	15.8%	8.0%	-25.04
1	1	56508	87.5%	53040	87.7%	3468	84.2%	6.1%	4.07
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Unpredictive (IV = 0.0102, 2 groups)
Bad / Good by Existing client?
Group of client	Existing client?	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
0	0	60188	93.2%	56251	93.0%	3937	95.6%	6.5%	-2.74
1	1	4401	6.8%	4220	7.0%	181	4.4%	4.1%	46.23
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Unpredictive (IV = 0.0126, 2 groups)
Exercise
Bad / Good by Age groups
Group of age_grp	Age groups	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
up to 30	up to 30	2957	4.6%	2662	4.4%	295	7.2%	10.0%	-48.69
30 - 60	30 - 60	58713	90.9%	55057	91.0%	3656	88.8%	6.2%	2.52
over 60	over 60	2919	4.5%	2752	4.6%	167	4.1%	5.7%	11.53
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Unpredictive (IV = 0.0146, 3 groups)
Bad / Good by Income groups
Group of income_grp	Income groups	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
up to 17	up to 17	12070	18.7%	11213	18.5%	857	20.8%	7.1%	-11.54
17-22	17-22	37859	58.6%	35567	58.8%	2292	55.7%	6.1%	5.52
22-27	22-27	13680	21.2%	12820	21.2%	860	20.9%	6.3%	1.50
over 27	over 27	980	1.5%	871	1.4%	109	2.6%	11.1%	-60.85
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Unpredictive (IV = 0.0118, 4 groups)
Exercise
Bad / Good by Region
Group of region	Region	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
1	1	12537	19.4%	11404	18.9%	1133	27.5%	9.0%	-37.77
2	2	16335	25.3%	15498	25.6%	837	20.3%	5.1%	23.18
3	3	10679	16.5%	10034	16.6%	645	15.7%	6.0%	5.77
4	4	10797	16.7%	10170	16.8%	627	15.2%	5.8%	9.95
5	5	7199	11.1%	6783	11.2%	416	10.1%	5.8%	10.47
6	6	7042	10.9%	6582	10.9%	460	11.2%	6.5%	-2.59
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Weak predictivity (IV = 0.0483, 6 groups)
Exercise
Bad / Good by Age
Group of age	Age	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
0	18 - 35	13688	21.2%	12420	20.5%	1268	30.8%	9.3%	-40.49
1	36 - 40	11385	17.6%	10485	17.3%	900	21.9%	7.9%	-23.15
2	41 - 45	14645	22.7%	13918	23.0%	727	17.7%	5.0%	26.52
3	46 - 51	12383	19.2%	11806	19.5%	577	14.0%	4.7%	33.17
4	52 - 74	12488	19.3%	11842	19.6%	646	15.7%	5.2%	22.18
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Weak predictivity (IV = 0.0931, 5 groups)
Exercise
Bad / Good by Income
Group of income	Income	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
0	15.000 - 17.105	12917	20.0%	12011	19.9%	906	22.0%	7.0%	-10.23
1	17.105 - 18.822	12918	20.0%	12204	20.2%	714	17.3%	5.5%	15.18
2	18.822 - 20.398	12918	20.0%	12122	20.0%	796	19.3%	6.2%	3.64
3	20.398 - 22.339	12918	20.0%	12102	20.0%	816	19.8%	6.3%	0.99
4	22.340 - 35.791	12918	20.0%	12032	19.9%	886	21.5%	6.9%	-7.82
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Unpredictive (IV = 0.0080, 5 groups)
Exercise
Bad / Good by Income/Debt ratio
Group of idratio	Income/Debt ratio	N	%	N of Good	% of Good	N of Bad	% of Bad	Bad rate	WOE
0	0.0175 - 0.0225	12917	20.0%	11994	19.8%	923	22.4%	7.1%	-12.23
1	0.0225 - 0.0293	12918	20.0%	11998	19.8%	920	22.3%	7.1%	-11.87
2	0.0293 - 0.0421	12918	20.0%	12061	19.9%	857	20.8%	6.6%	-4.25
3	0.0421 - 0.0714	12918	20.0%	12216	20.2%	702	17.0%	5.4%	16.98
4	0.0714 - 0.2995	12918	20.0%	12202	20.2%	716	17.4%	5.5%	14.89
		64589	100.0%	60471	100.0%	4118	100.0%	6.4%	
Unpredictive (IV = 0.0160, 5 groups)
9. Model Evaluation - LC (ROC), Gini, KS, Lift
Introduction
□ Predictive models cannot be used effectively without knowing their quality (discriminatory power).
□ Usually a whole range of models is available and only one has to be chosen: the best one.
Measuring model quality
• We consider two basic groups of quality indexes. The first is based on the distribution function. The most widely used indexes are
> Kolmogorov-Smirnov statistic (KS)
> Gini index
> C-statistic
> Lift.
• The second group of indexes is based on the probability density. The best-known indexes are
> Mean difference (Mahalanobis distance)
> Information statistic/value ($I_{val}$).
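Indexes based on the distribution function - KS
• Indicator of a good client: $D_K = 1$ if the client is good, $D_K = 0$ otherwise.
• Number of good clients: $n$. Number of bad clients: $m$.
• Proportions of good/bad clients: $p_G = \frac{n}{n+m}$, $p_B = \frac{m}{n+m}$.
• Empirical distribution functions of the score $s_i$:
$F_{n.GOOD}(a) = \frac{1}{n} \sum_{i=1}^{n} I(s_i \le a \wedge D_K = 1)$
$F_{m.BAD}(a) = \frac{1}{m} \sum_{i=1}^{m} I(s_i \le a \wedge D_K = 0)$
$F_{N.ALL}(a) = \frac{1}{N} \sum_{i=1}^{N} I(s_i \le a)$, $a \in [L, H]$, $N = n + m$
where $I(A) = 1$ if $A$ holds and $I(A) = 0$ otherwise.
• Kolmogorov-Smirnov statistic (KS):
$KS = \max_{a \in [L, H]} \left| F_{m.BAD}(a) - F_{n.GOOD}(a) \right|$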
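As an illustration of the KS definition, the following Python sketch evaluates both empirical distribution functions at every observed score and takes the largest absolute difference. The function names and the eight toy scores are assumptions made for the example.

```python
def ecdf(scores, a):
    """Empirical distribution function: share of scores less than or equal to a."""
    return sum(s <= a for s in scores) / len(scores)

def ks_statistic(good_scores, bad_scores):
    """Largest absolute gap between the bad and the good empirical distribution functions."""
    thresholds = sorted(set(good_scores) | set(bad_scores))
    return max(abs(ecdf(bad_scores, a) - ecdf(good_scores, a)) for a in thresholds)

# Toy scores: higher score means lower risk.
good = [3, 4, 5, 6]
bad = [1, 2, 3, 4]
ks = ks_statistic(good, bad)
```

Evaluating only at observed scores is enough because both step functions change value only there. For the toy data KS is 0.5, and it reaches 1 when all bads score below all goods.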
Lorenz curve
> Lorenz curve (LC):
$x = F_{m.BAD}(a)$
$y = F_{n.GOOD}(a)$, $a \in [L, H]$
Gini index
[Figure: Lorenz curve with the areas A and B marked.]
$Gini = \frac{A}{A+B} = 2A$
$Gini = 1 - \sum_{k=2}^{n+m} \left( F_{m.BAD_k} - F_{m.BAD_{k-1}} \right) \left( F_{n.GOOD_k} + F_{n.GOOD_{k-1}} \right)$
where $F_{m.BAD_k}$ ($F_{n.GOOD_k}$) is the k-th value of the vector of the empirical distribution function of bad (good) clients.
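The summation formula for the Gini index is a trapezoid rule along the Lorenz curve. The Python sketch below is one way to evaluate it; as an assumption of this sketch, the distribution functions are evaluated at the distinct observed scores and the starting point (0, 0) is prepended, which avoids trouble when a good and a bad client share the lowest score.

```python
def gini_index(good_scores, bad_scores):
    """Gini index from the empirical distribution functions of bad and good clients."""
    n, m = len(good_scores), len(bad_scores)
    points = [(0.0, 0.0)]  # (F_bad, F_good) before the lowest score
    for a in sorted(set(good_scores) | set(bad_scores)):
        f_bad = sum(s <= a for s in bad_scores) / m
        f_good = sum(s <= a for s in good_scores) / n
        points.append((f_bad, f_good))
    total = 0.0
    for (fb_prev, fg_prev), (fb, fg) in zip(points, points[1:]):
        total += (fb - fb_prev) * (fg + fg_prev)
    return 1.0 - total

good = [3, 4, 5, 6]
bad = [1, 2, 3, 4]
gini = gini_index(good, bad)
```

For the toy scores the result is 0.75; perfectly separated scores give 1 and identical score distributions give 0.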
Somers' D, Kendall's $\tau_a$
> The Gini index is a special case of Somers' D (Somers (1962)), an ordinal association measure defined as
$D_{YX} = \frac{\tau_{XY}}{\tau_{XX}}$
where $\tau_{XY}$ is Kendall's $\tau_a$, defined as $\tau_{XY} = E[\operatorname{sign}(X_1 - X_2)\operatorname{sign}(Y_1 - Y_2)]$,
where $(X_1, Y_1)$, $(X_2, Y_2)$ are bivariate, stochastically independent random vectors over the same data population, and $E[\cdot]$ denotes the expected value. In our case Y=1 if the client is good and Y=0 if the client is bad. The variable X represents the score.
Thomas (2009) states that Somers' D assessing the performance of a given credit scoring model can be calculated as
$D_S = \frac{\sum_i g_i \sum_{j<i} b_j - \sum_i g_i \sum_{j>i} b_j}{n \cdot m}$
where $g_i$ ($b_j$) is the number of good (bad) clients in the i-th (j-th) score interval.
Somers' D, Mann-Whitney U
> Furthermore, D_S can be expressed using the Mann-Whitney U statistic. Sort the data sample in ascending order of score and sum the ranks of the good clients in the resulting sequence. Denote this sum by R_G. Then
$$D_S = \frac{2U}{n \cdot m} - 1, \qquad U = R_G - \frac{n(n+1)}{2}$$
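The rank-sum recipe can be sketched in Python as follows. The function name `somers_d_from_ranks` and the toy data are illustrative assumptions; tied scores receive the average of their ranks (midranks), which is a common convention and not something the slides specify.

```python
def somers_d_from_ranks(scores, is_good):
    """Somers' D via the Mann-Whitney U statistic: D = 2U / (n * m) - 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1        # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n = sum(is_good)                     # number of good clients
    m = len(is_good) - n                 # number of bad clients
    r_g = sum(r for r, g in zip(ranks, is_good) if g == 1)
    u = r_g - n * (n + 1) / 2
    return 2 * u / (n * m) - 1

# Hypothetical toy sample: 5 good and 3 bad clients.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
is_good = [0, 0, 1, 0, 1, 1, 1, 1]
d_s = somers_d_from_ranks(scores, is_good)
print(round(d_s, 4))
```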
Concordant and discordant pairs
> Concordant pair (X_1, Y_1), (X_2, Y_2):
$$\operatorname{sgn}(X_2 - X_1) = \operatorname{sgn}(Y_2 - Y_1)$$
> Discordant pair:
$$\operatorname{sgn}(X_2 - X_1) = -\operatorname{sgn}(Y_2 - Y_1)$$
> In our case X is the score and Y the good-client indicator (D_K). Since a good client has Y = 1 and a bad client Y = 0, in a concordant pair the good client clearly has a higher score than the bad client.
Somers' D, Goodman-Kruskal gamma
Consider two randomly selected clients, one good (Y_1 = 1) and one bad (Y_2 = 0), and denote the score of the first by s_1 and the score of the second by s_2. Then
Concordant pair: s_1 > s_2
Discordant pair: s_1 < s_2
Tied pair: s_1 = s_2
> Somers' D:
$$D_S = \frac{\#Concordant - \#Discordant}{\#Concordant + \#Discordant + \#Tied}$$
> Goodman-Kruskal gamma:
$$\Gamma = \frac{\#Concordant - \#Discordant}{\#Concordant + \#Discordant}$$
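Counting pairs directly makes the difference between the two measures visible. The sketch below is illustrative only: the function names and the toy scores are assumptions, and the quadratic loop is meant for small samples.

```python
def pair_counts(good_scores, bad_scores):
    """Count concordant, discordant and tied (good, bad) pairs."""
    concordant = discordant = tied = 0
    for s1 in good_scores:          # score of the good client
        for s2 in bad_scores:       # score of the bad client
            if s1 > s2:
                concordant += 1
            elif s1 < s2:
                discordant += 1
            else:
                tied += 1
    return concordant, discordant, tied

def somers_d(concordant, discordant, tied):
    return (concordant - discordant) / (concordant + discordant + tied)

def goodman_kruskal_gamma(concordant, discordant):
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical toy scores with two ties.
counts = pair_counts([3, 5, 5, 8], [2, 5, 6])
d = somers_d(*counts)
gamma = goodman_kruskal_gamma(counts[0], counts[1])
print(counts, round(d, 4), round(gamma, 4))
```

Because tied pairs enter only the denominator of Somers' D, gamma is never smaller in absolute value than D.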
C-statistic
This statistic equals the probability that a randomly selected good client has a higher score than a randomly selected bad client, i.e.
$$c\text{-}stat = P\left(s_1 > s_2 \mid D_{K_1} = 1 \wedge D_{K_2} = 0\right)$$
Lift
□ Another possible measure of the quality of a scoring model is Lift, which states how many times better the model is than a random model at a given rejection level. More precisely, it is the ratio of the proportion of bad clients with a score less than or equal to a given score value a, a ∈ [L, H], to the proportion of bad clients in the whole population.
Formally it can be written as:
$$Lift(a) = \frac{CumBadRate(a)}{BadRate} = \frac{\dfrac{\sum_{i=1}^{n+m} I(s_i \le a \wedge Y = 0)}{\sum_{i=1}^{n+m} I(s_i \le a)}}{\dfrac{\sum_{i=1}^{n+m} I(Y = 0)}{\sum_{i=1}^{n+m} I(Y = 0 \vee Y = 1)}}$$
$$absLift(a) = \frac{BadRate(a)}{BadRate}$$
[Figure: cumulative Lift and absolute Lift curves]
Lift
□ Usually it is computed from a table with the numbers of all and bad clients in score bands (deciles).
decile	# clients	absolutely			cumulatively		
		# bad clients	Bad rate	abs. Lift	# bad clients	Bad rate	cum. Lift
1	100	35	35.0%	3.50	35	35.0%	3.50
2	100	16	16.0%	1.60	51	25.5%	2.55
3	100	8	8.0%	0.80	59	19.7%	1.97
4	100	8	8.0%	0.80	67	16.8%	1.68
5	100	7	7.0%	0.70	74	14.8%	1.48
6	100	6	6.0%	0.60	80	13.3%	1.33
7	100	6	6.0%	0.60	86	12.3%	1.23
8	100	5	5.0%	0.50	91	11.4%	1.14
9	100	5	5.0%	0.50	96	10.7%	1.07
10	100	4	4.0%	0.40	100	10.0%	1.00
All	1000	100	10.0%				
[Figure: abs. Lift and cum. Lift by decile]
□ It takes positive values. The cumulative form ends at the value 1.
□ The upper limit of Lift depends on p_B.
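The decile table can be reproduced with a few lines of Python. This is a sketch, with `lift_table` as a hypothetical helper name; the bad-client counts are those of the table (the counts for deciles 8 and 9 follow from the cumulative column).

```python
def lift_table(clients, bads):
    """Return (abs. Lift, cum. Lift) per score band, worst band first."""
    total_bad_rate = sum(bads) / sum(clients)
    rows = []
    cum_clients = cum_bads = 0
    for n_clients, n_bads in zip(clients, bads):
        cum_clients += n_clients
        cum_bads += n_bads
        abs_lift = (n_bads / n_clients) / total_bad_rate
        cum_lift = (cum_bads / cum_clients) / total_bad_rate
        rows.append((abs_lift, cum_lift))
    return rows

clients = [100] * 10
bads = [35, 16, 8, 8, 7, 6, 6, 5, 5, 4]
table = lift_table(clients, bads)
for decile, (abs_lift, cum_lift) in enumerate(table, start=1):
    print(decile, f"{abs_lift:.2f}", f"{cum_lift:.2f}")
```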
Lift
If the bad rate is not monotone: the LC looks OK and Gini decreases only slightly; Lift, however, looks odd.
[Figure: Lorenz curve with base line; abs. Lift and cum. Lift by decile]
decile	# clients	absolutely			cumulatively		
		# bad clients	Bad rate	abs. Lift	# bad clients	Bad rate	cum. Lift
1	100	8	8.0%	1.60	8	8.0%	1.60
2	100	12	12.0%	2.40	20	10.0%	2.00
3	100	16	16.0%	3.20	36	12.0%	2.40
4	100	5	5.0%	1.00	41	10.3%	2.05
5	100	3	3.0%	0.60	44	8.8%	1.76
6	100	2	2.0%	0.40	46	7.7%	1.53
7	100	1	1.0%	0.20	47	6.7%	1.34
8	100	1	1.0%	0.20	48	6.0%	1.20
9	100	1	1.0%	0.20	49	5.4%	1.09
10	100	1	1.0%	0.20	50	5.0%	1.00
All	1000	50	5.0%				
Lift
□ If the score has exactly the opposite meaning, we obtain "reversed" pictures.
[Figure: Lorenz curve and Lift for a score with reversed meaning]
Lift vs. Gini and KS
□ It is evident that Gini alone is not enough!
SC 1:
decile	# clients	# bad clients	Bad rate
1	100	35	35.0%
2	100	16	16.0%
3	100	8	8.0%
4	100	8	8.0%
5	100	7	7.0%
6	100	6	6.0%
7	100	6	6.0%
8	100	5	5.0%
9	100	5	5.0%
10	100	4	4.0%
All	1000	100	10.0%
SC 2:
decile	# clients	# bad clients	Bad rate
1	100	20	20.0%
2	100	18	18.0%
3	100	17	17.0%
4	100	15	15.0%
5	100	12	12.0%
6	100	6	6.0%
7	100	4	4.0%
8	100	3	3.0%
9	100	3	3.0%
10	100	2	2.0%
All	1000	100	10.0%
[Figure: Lorenz curves with base line for SC 1 (Gini = 0.42) and SC 2 (Gini = 0.42)]
Lift vs. Gini and KS
SC 1:
[Figure: abs. Lift and cum. Lift by decile]
Lift_20% = 2.55 (higher than for SC 2), Lift_50% = 1.48 (lower than for SC 2)
SC 2:
[Figure: abs. Lift and cum. Lift by decile]
Lift_20% = 1.90, Lift_50% = 1.64
SC 2 is better if the expected reject rate is about 50%. SC 1 is significantly better if the expected reject rate is about 20%.
Lift, QLift
□ Lift can be expressed and computed by the formula:
$$Lift(a) = \frac{F_{m.BAD}(a)}{F_{N.ALL}(a)}, \qquad a \in [L, H]$$
□ In practice, Lift is computed for the 10%, 20%, ..., 100% of clients with the worst score. Hence we define:
$$QLift(q) = \frac{F_{m.BAD}\left(F_{N.ALL}^{-1}(q)\right)}{F_{N.ALL}\left(F_{N.ALL}^{-1}(q)\right)} = \frac{1}{q}\, F_{m.BAD}\left(F_{N.ALL}^{-1}(q)\right), \qquad q \in (0, 1]$$
$$F_{N.ALL}^{-1}(q) = \min\{a \in [L, H] : F_{N.ALL}(a) \ge q\}$$
□ A typical value of q is 0.1. Then we have
$$QLift_{10\%} = QLift(0.1) = 10 \cdot F_{m.BAD}\left(F_{N.ALL}^{-1}(0.1)\right)$$
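A direct transcription of the QLift definition into Python might look as follows. This is a minimal sketch: the name `qlift` and the toy data are assumptions, low scores are taken to be the worst ones, and a small tolerance guards the comparison F_N.ALL(a) >= q against floating-point rounding.

```python
def qlift(scores, is_good, q):
    """QLift(q) = F_bad(F_all^{-1}(q)) / q, with F_all^{-1}(q) = min{a : F_all(a) >= q}."""
    n_all = len(scores)
    m = sum(1 for g in is_good if g == 0)          # number of bad clients
    for a in sorted(set(scores)):
        f_all = sum(1 for s in scores if s <= a) / n_all
        if f_all >= q - 1e-12:                     # a is the q-quantile of all scores
            f_bad = sum(1 for s, g in zip(scores, is_good)
                        if s <= a and g == 0) / m
            return f_bad / q
    raise ValueError("q must lie in (0, 1]")

# Hypothetical toy sample: 10 clients, the 3 bad ones have scores 1, 2 and 4.
scores = list(range(1, 11))
is_good = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
qlift_20 = qlift(scores, is_good, 0.2)
qlift_100 = qlift(scores, is_good, 1.0)
print(round(qlift_20, 3), round(qlift_100, 3))
```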
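Lift and QLift for the ideal model
□ It is natural to ask what Lift and QLift look like for the ideal model. Hence we derived the following formulas.
> Lift for the ideal model:
$$Lift_{ideal}(a) = \begin{cases} \dfrac{1}{p_B}, & a \le c \\[2mm] \dfrac{1}{F_{N.ALL}(a)}, & a > c \end{cases}$$
where c is the score value that separates bad from good clients in the ideal model, i.e. F_{N.ALL}(c) = p_B.
> QLift for the ideal model:
$$QLift_{ideal}(q) = \begin{cases} \dfrac{1}{p_B}, & q \in (0, p_B] \\[2mm] \dfrac{1}{q}, & q \in (p_B, 1] \end{cases}$$
[Figure: QLift of the ideal model for q from 0 to 1]
We can see that the upper limit of Lift and QLift is equal to 1/p_B.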
Lift Ratio (LR)
□ Once we know the form of QLift for the ideal model, we can define the Lift Ratio as an analogy to the Gini index.
$$LR = \frac{A}{A + B} = \frac{\int_0^1 QLift(q)\,dq - 1}{\int_0^1 QLift_{ideal}(q)\,dq - 1}$$
[Figure: QLift of the actual and the ideal model with the areas A and B]
□ LR is a global measure of a model's quality and takes values from 0 to 1. The value 0 corresponds to a random model, the value 1 to the ideal model. The meaning of this index is simple: the higher, the better. An important feature is that the Lift Ratio allows a fair comparison of two models developed on different data samples, which is not possible with Lift.
RLift, IRL
□ While the Lift Ratio compares the areas under the Lift function for the actual and the ideal model, the next concept compares the Lift functions themselves. We define the Relative Lift function by
$$RLift(q) = \frac{QLift(q)}{QLift_{ideal}(q)}, \qquad q \in (0, 1]$$
□ In connection with RLift we define the Integrated Relative Lift (IRL):
$$IRL = \int_0^1 RLift(q)\,dq$$
□ It takes values from $0.5 + \frac{p_B^2}{2}$, for a random model, to 1, for the ideal model. The following simulation study shows an interesting connection to the c-statistic.
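The bounds quoted for LR and IRL can be verified by a short calculation, sketched here as a worked derivation. It assumes QLift_ideal(q) = 1/p_B on (0, p_B] and 1/q on (p_B, 1], and QLift(q) = 1 for the random model.

```latex
% Denominator of LR: area under QLift of the ideal model, minus 1
\int_0^1 QLift_{ideal}(q)\,dq - 1
  = \int_0^{p_B} \frac{1}{p_B}\,dq + \int_{p_B}^{1} \frac{1}{q}\,dq - 1
  = 1 - \ln p_B - 1
  = -\ln p_B

% Random model: QLift(q) = 1, hence RLift(q) = 1 / QLift_{ideal}(q)
IRL_{random}
  = \int_0^{p_B} p_B\,dq + \int_{p_B}^{1} q\,dq
  = p_B^2 + \frac{1 - p_B^2}{2}
  = \frac{1}{2} + \frac{p_B^2}{2}
```

For the random model the numerator of LR is zero, so LR = 0, while for the ideal model numerator and denominator coincide and LR = 1.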
Example
□ We consider two scoring models with the score distributions given in the table below.
□ We assume the standard meaning of scores, i.e. a higher score band means better clients (clients with the lowest scores, i.e. those in score band 1, have the highest probability of default).
□ The Gini indexes of both models are equal.
□ The Lorenz curves show that the first model is stronger for higher score bands and the second one is better for lower score bands.
□ The same can be read from the values of QLift.
Example
□ Since QLift is not defined for q = 0, we extrapolated the value by QLift(0) = 3 · QLift(0.1) - 3 · QLift(0.2) + QLift(0.3).
[Figure: QLift by score band for model 1, model 2, the ideal model and the random model]
[Figure: RLift by score band for model 1, model 2, the ideal model and the random model]
According to both the QLift and RLift curves we can state that:
> If the expected reject rate is up to 40%, then model 2 is better.
> If the expected reject rate is more than 40%, then model 1 is better.
Example
□ Now we consider the indexes LR and IRL:
[Figure: QLift by score band for model 2, the ideal model and the random model]
	scoring model 1	scoring model 2
GINI	0.420	0.420
QLift(0.1)	2.000	3.500
LR	0.242	0.372
IRL	0.699	0.713
[Figure: RLift of model 2 by score band, with the area corresponding to IRL]
Using LR and IRL we can state that model 2 is better than model 1 although their Gini coefficients are equal.
Mean difference
• Mean difference (Mahalanobis distance):
$$D = \frac{M_g - M_b}{S}$$
where S is the pooled standard deviation:
$$S = \left(\frac{n S_g^2 + m S_b^2}{n + m}\right)^{1/2}$$
M_g, M_b are the mean scores of the good (bad) clients and S_g, S_b the corresponding standard deviations.
[Figure: probability densities of the score of good and bad clients]
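A small Python sketch of the mean difference follows. The function name `mean_difference` and the toy scores are assumptions; the population variance (`statistics.pvariance`) is used for S_g² and S_b², whereas the slides do not say whether the population or the sample variance is meant.

```python
from math import sqrt
from statistics import mean, pvariance

def mean_difference(good_scores, bad_scores):
    """D = (M_g - M_b) / S with the pooled standard deviation S."""
    n, m = len(good_scores), len(bad_scores)
    pooled = sqrt((n * pvariance(good_scores) + m * pvariance(bad_scores)) / (n + m))
    return (mean(good_scores) - mean(bad_scores)) / pooled

# Hypothetical toy scores: both groups have variance 1, the means differ by 3.
d = mean_difference([4, 6], [1, 3])
print(d)
```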
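Normally distributed score
Assume that the scores of good and bad clients are normally distributed, i.e. their probability densities have the form
$$f_{GOOD}(x) = \frac{1}{\sigma_g \sqrt{2\pi}}\, e^{-\frac{(x - \mu_g)^2}{2\sigma_g^2}}, \qquad f_{BAD}(x) = \frac{1}{\sigma_b \sqrt{2\pi}}\, e^{-\frac{(x - \mu_b)^2}{2\sigma_b^2}}$$
Estimates of the parameters μ_g, σ_g, μ_b, σ_b:
M_g, M_b are the arithmetic means of the scores of good (bad) clients
S_g, S_b are the standard deviations of the scores of good (bad) clients
> Pooled standard deviation:
$$S = \left(\frac{n S_g^2 + m S_b^2}{n + m}\right)^{1/2}$$
> Estimates of the mean and the standard deviation of the scores of all clients, μ_ALL and σ_ALL:
$$M = M_{ALL} = \frac{n M_g + m M_b}{n + m}$$
$$S_{ALL} = \left(\frac{n S_g^2 + m S_b^2 + n (M_g - M)^2 + m (M_b - M)^2}{n + m}\right)^{1/2}$$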
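Normally distributed score
Assume that the standard deviations of both scores are equal to a common value σ. Then:
$$D = \frac{\mu_g - \mu_b}{\sigma}$$
$$KS = \Phi\left(\frac{D}{2}\right) - \Phi\left(-\frac{D}{2}\right) = 2\,\Phi\left(\frac{D}{2}\right) - 1$$
$$Gini = 2\,\Phi\left(\frac{D}{\sqrt{2}}\right) - 1$$
$$Lift_q = \frac{1}{q}\,\Phi\left(\frac{\sigma_{ALL}}{\sigma}\,\Phi^{-1}(q) + p_G D\right)$$
With the estimates
$$D = \frac{M_g - M_b}{S}, \qquad Lift_q = \frac{1}{q}\,\Phi\left(\frac{S_{ALL}}{S}\,\Phi^{-1}(q) + p_G D\right)$$
where Φ(·) is the distribution function of the standard normal distribution, Φ_{μ,σ²}(·) is the distribution function of the normal distribution with parameters μ, σ², and Φ^{-1}(·) is the standard normal quantile function.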
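Under the equal-variance normal assumption the three indexes depend only on D, p_G and q, which makes them easy to tabulate. The sketch below uses `statistics.NormalDist` from the Python standard library; the function names are illustrative assumptions, and the ratio σ_ALL/σ is computed as sqrt(1 + p_G·p_B·D²), which follows from σ_ALL² = σ² + p_G·p_B·(μ_g − μ_b)².

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def ks_normal(d):
    return 2 * PHI.cdf(d / 2) - 1

def gini_normal(d):
    return 2 * PHI.cdf(d / sqrt(2)) - 1

def lift_normal(q, d, p_good):
    """Lift_q = (1/q) * Phi((sigma_ALL / sigma) * Phi^{-1}(q) + p_G * D), 0 < q < 1."""
    p_bad = 1 - p_good
    sigma_ratio = sqrt(1 + p_good * p_bad * d ** 2)   # sigma_ALL / sigma
    return PHI.cdf(sigma_ratio * PHI.inv_cdf(q) + p_good * d) / q

for d in (0.5, 1.0, 1.5):
    print(d, round(ks_normal(d), 3), round(gini_normal(d), 3),
          round(lift_normal(0.1, d, 0.9), 3))
```

For D = 0 the model is random, so KS and Gini are 0 and Lift equals 1 for every q.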
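Normally distributed score
• In general, i.e. without assuming equal standard deviations of the scores:
$$D^* = \frac{\mu_g - \mu_b}{\sqrt{\sigma_g^2 + \sigma_b^2}}$$
$$KS = \Phi\left(\frac{a}{b}\,\sigma_b D^* - \frac{1}{b}\,\sigma_g \sqrt{a^2 D^{*2} + 2bc}\right) - \Phi\left(\frac{a}{b}\,\sigma_g D^* - \frac{1}{b}\,\sigma_b \sqrt{a^2 D^{*2} + 2bc}\right)$$
where $a = \sqrt{\sigma_g^2 + \sigma_b^2}$, $b = \sigma_b^2 - \sigma_g^2$, $c = \ln\frac{\sigma_b}{\sigma_g}$.
The estimates are obtained by replacing μ_g, μ_b, σ_g, σ_b with M_g, M_b, S_g, S_b.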
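Normally distributed score
• In general, i.e. without assuming equal standard deviations of the scores:
$$Gini = 2\,\Phi(D^*) - 1$$
$$Lift_q = \frac{1}{q}\,\Phi_{\mu_b, \sigma_b^2}\left(\mu_{ALL} + \sigma_{ALL}\,\Phi^{-1}(q)\right) = \frac{1}{q}\,\Phi\left(\frac{\sigma_{ALL}\,\Phi^{-1}(q) + \mu_{ALL} - \mu_b}{\sigma_b}\right)$$
With the estimates:
$$Lift_q = \frac{1}{q}\,\Phi\left(\frac{S_{ALL}\,\Phi^{-1}(q) + M - M_b}{S_b}\right)$$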
Normally distributed score
[Figure: Lift_10% for μ_b = 0, σ_b² = 1]
□ For the index Lift_10% there is an evident strong dependence on μ_g and a significantly stronger dependence on σ_g² than in the case of KS and Gini.
ROC (Receiver operating characteristic)
TN (true negative) - number of correctly classified negative cases
TP (true positive) - number of correctly classified positive cases
FP (false positive) - number of incorrectly classified negative cases
FN (false negative) - number of incorrectly classified positive cases
Rows: actual class. Columns: prediction.
	G0	G1	Total
G0	TN	FP	N
G1	FN	TP	P
Total	P_neg	P_pos	
ROC - TPR, FPR
TPR = TP / P = TP / (TP + FN)
FPR = FP / N = FP / (FP + TN)
tpr(c) = P(X > c | G1) = 1 - F_1(c)
tnr(c) = P(X ≤ c | G0) = F_0(c)
fpr(c) = P(X > c | G0) = 1 - F_0(c)
fnr(c) = P(X ≤ c | G1) = F_1(c)
ROC - ACC
Accuracy:
ACC = (TP + TN) / (P + N)
	A			B			C			C'	
TP=63	FP=28	91	TP=77	FP=77	154	TP=24	FP=88	112	TP=88	FP=24	112
FN=37	TN=72	109	FN=23	TN=23	46	FN=76	TN=12	88	FN=12	TN=76	88
100	100	200	100	100	200	100	100	200	100	100	200
TPR = 0.63			TPR = 0.77			TPR = 0.24			TPR = 0.88		
FPR = 0.28			FPR = 0.77			FPR = 0.88			FPR = 0.24		
ACC = 0.68			ACC = 0.50			ACC = 0.18			ACC = 0.82		
[Figure: ROC space, TPR against FPR (1 - specificity), with the points A, B, C, C' and the perfect classification point; points above the diagonal are better than random, points below it are worse]
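The three rates follow directly from the four cells of the confusion matrix. A minimal Python sketch, with `roc_point` as a hypothetical helper name, reproduces the values of classifier A:

```python
def roc_point(tp, fp, fn, tn):
    """Return TPR, FPR and ACC for one confusion matrix."""
    p = tp + fn          # actual positives
    n = fp + tn          # actual negatives
    return {
        "TPR": tp / p,
        "FPR": fp / n,
        "ACC": (tp + tn) / (p + n),
    }

a = roc_point(tp=63, fp=28, fn=37, tn=72)
print(a)
```

The accuracy of A is 135/200 = 0.675, which the table rounds to 0.68.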
ROC - AUC, Gini
AUC (area under curve, i.e. the area under the ROC curve) equals the probability that the model assigns a randomly selected good client a higher score than a randomly selected bad client. It can be shown that the area under the ROC curve can be expressed through the Mann-Whitney U statistic, which tests the difference in medians between two groups of continuous scores. AUC can also be expressed through the Gini coefficient by the formula Gini + 1 = 2 × AUC.
Other evaluation charts
[Figure: boxplot and histogram of the score for BAD and GOOD clients]
Other evaluation charts
CAP (Lift chart):
[Figure: CAP curve, F_ALL on the x-axis and F_BAD on the y-axis]
AR (Accuracy Ratio):
$$AR = \frac{\text{area between the CAP and the diagonal}}{\text{area between the CAP of the ideal model and the diagonal}} = \frac{\text{area between the CAP and the diagonal}}{0.5\,(1 - p_B)} = Gini$$
In this case the x-axis shows the proportion of all clients (F_ALL) and the y-axis the proportion of bad clients (F_BAD). The ideal model is now represented by the polyline from the point [0, 0] through [p_B, 1] to the point [1, 1]. The advantage of this chart is that one can read off the proportion of rejected bad clients against the total proportion of rejected clients. For example, we see that to reject 70% of bad clients we have to reject approximately 40% of all applicants.
Evaluation procedures
> evaluation on training data
Evaluation on the training data used in the learning process is not suitable for assessing model quality and has little informative value, because the model may often be overfitted. An estimate of the predictive quality of a model on the training data is called a resubstitution or internal estimate. Quality estimates computed on training data are overestimated, so test data set aside for this purpose during data preparation are used instead.
Evaluation procedures
> evaluation on test data
Evaluation on test data does have adequate informative value, since these data were not used to build the model. Certain requirements are placed on the test data: the test set should contain a sufficient amount of data and should reflect the characteristics of the training data. The empirically recommended split is 75% training and 25% test cases. Adequate representativeness is ensured by stratified random sampling.
Evaluation procedures
> cross-validation
When the number of observations is too small to split the data set into training and test data, cross-validation is a suitable method. Its advantage over a simple split is that every case is used to build the model and every case is used at least once for testing. The procedure is as follows:
• The data set is randomly divided into n disjoint subsets so that each subset contains approximately the same number of records. The samples are stratified by class so that the class proportions in the subsets are roughly the same as in the whole data set.
• Of these n disjoint subsets, n-1 are used to build the model (construction subset) and the remaining subset (validation subset) is used to evaluate it. The model is thus evaluated on data from which it was not built, and its predictive quality is estimated on these data.
• The whole procedure is repeated n times and the partial estimates of the quality indicators are averaged. The size of the validation subset can be approximated as the ratio of the number of cases to the number of validation subsets.
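The three steps of cross-validation can be sketched in plain Python. The names `stratified_folds`, `cross_validate` and `fit_and_score` are hypothetical; `fit_and_score` stands for any user-supplied function that builds a model on the construction indices and returns a quality indicator computed on the validation indices. In the toy usage it simply returns the bad rate of the validation subset.

```python
import random

def stratified_folds(labels, n_folds, seed=0):
    """Split indices into n_folds disjoint subsets with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)   # deal the class round-robin
    return folds

def cross_validate(labels, n_folds, fit_and_score, seed=0):
    """Average the quality indicator over the n_folds validation subsets."""
    folds = stratified_folds(labels, n_folds, seed)
    results = []
    for k in range(n_folds):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        results.append(fit_and_score(train, valid))
    return sum(results) / n_folds

# Hypothetical toy data: 80 good (1) and 20 bad (0) clients.
labels = [1] * 80 + [0] * 20

def bad_rate_on_validation(train_idx, valid_idx):
    return sum(1 for i in valid_idx if labels[i] == 0) / len(valid_idx)

folds = stratified_folds(labels, 5)
avg_bad_rate = cross_validate(labels, 5, bad_rate_on_validation)
print(avg_bad_rate)
```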
Evaluation procedures
> bootstrap method
The bootstrap method examines the characteristics of resampled samples drawn from the empirical sample. If the original sample contains m elements, each of them has a chance of appearing in a resampled sample. Complete resampling with sample size n considers all possible samples, so there are m^n possible samples. Complete resampling is theoretically feasible but would take a great deal of time. The alternative is a Monte Carlo simulation, which approximates complete resampling by drawing B random samples (usually 500 to 10000) with replacement, i.e. each drawn element is returned to the urn. Given the data X = {X_1, ..., X_n} and a parameter θ to be estimated, B samples are drawn from the original data and an estimate of θ is computed for each sample. The bootstrap estimate of the parameter is the average of these partial estimates. When evaluating models, the parameter θ is the chosen indicator of predictive quality.
> jackknife
This method is based on sequentially removing elements from, and returning them to, a sample of size n. For a data set with n elements the procedure generates n samples with n-1 elements each. For each reduced sample of size n-1 the value of the parameter is estimated. The partial estimates are then averaged, as in the bootstrap method.
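Both resampling schemes fit in a few lines of Python. The sketch below is illustrative: the function names are assumptions, `statistic` can be any function of a sample (for model evaluation it would be the chosen quality indicator), and the toy usage estimates a plain mean.

```python
import random
from statistics import mean

def bootstrap_estimate(data, statistic, b=1000, seed=0):
    """Average of the statistic over b samples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [statistic([data[rng.randrange(n)] for _ in range(n)])
                 for _ in range(b)]
    return mean(estimates)

def jackknife_estimate(data, statistic):
    """Average of the statistic over the n leave-one-out samples of size n-1."""
    return mean(statistic(data[:i] + data[i + 1:]) for i in range(len(data)))

# Hypothetical toy data.
data = [1, 2, 3, 4, 5]
boot = bootstrap_estimate(data, mean)
jack = jackknife_estimate(data, mean)
print(boot, jack)
```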
10. Cutoff, RAROA, Monitoring
Possible rejection scales - cutoff
> The cutoff value sets the threshold at which a loan application is approved/rejected.
> The following rejection scales can be used:
> PD - Probability of Default
> KRN - Credit Risk Expenses (CRE; Czech: Kreditní Rizikové Náklady)
> Margin
> RAROA
> ...
Cutoff on the PD scale
cutoff = 0.1 (i.e. all applicants with a probability of default greater than 10% are rejected)
[Figure: cutoff applied to two scorecards; x-axis: Score (relative), the higher the better]
• For SC 1 the reject rate is 22%.
• For SC 2 the reject rate is 33%.
Strategy curve
Bad acceptance rate = p_B (1 - F(s|B))
Acceptance rate = 1 - F(s)
[Figure: strategy curve, acceptance rate 1 - F(s) against bad acceptance rate p_B (1 - F(s|B)), with the points O, A, B, C and the lines labelled "actual bad rate" and "perfect information"]
When a new scoring function is introduced, the current setting of the approval process (the cutoff setting) is typically represented by a point O that lies above the new strategy curve. The question is then which direction to take when setting the new cutoff. Moving to point A keeps the proportion of approved bad clients but increases the overall proportion of approved clients. Moving to point B approves the same proportion of clients but reduces the proportion of approved bad clients and hence the bad rate. Moving to point C keeps the bad rate while increasing the proportion of approved clients.
Setting the cutoff that maximizes profit
Profit is a random variable defined as:
$$R = \begin{cases} 0, & \text{if the loan is rejected} \\ L, & \text{if the loan is approved and turns out good} \\ -D, & \text{if the loan is approved and turns out bad} \end{cases}$$
Let p_G and p_B denote the proportions of good and bad clients in the population. q(G|s) (q(B|s)) denotes the conditional probability that a client with score s will be good (bad), with q(G|s) + q(B|s) = 1. Let p(s) be the proportion of the population with score s.
Expected profit when approving clients with score s:
$$E[R \mid s] = L\,q(G|s) - D\,(1 - q(G|s)) = (L + D)\,q(G|s) - D$$
Hence, to maximize profit, approve those clients whose score satisfies:
$$q(G|s) > \frac{D}{L + D}$$
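The approval rule follows from requiring E[R | s] > 0. A minimal Python sketch with hypothetical values of L and D (the function names are assumptions as well):

```python
def approval_threshold(gain_l, loss_d):
    """Smallest probability of being good for which approval pays off: D / (L + D)."""
    return loss_d / (gain_l + loss_d)

def expected_profit(q_good, gain_l, loss_d):
    """E[R | s] = (L + D) * q(G|s) - D."""
    return (gain_l + loss_d) * q_good - loss_d

# Hypothetical values: a good loan earns 100, a bad loan loses 900.
threshold = approval_threshold(100, 900)
profit_at_95 = expected_profit(0.95, 100, 900)
profit_at_85 = expected_profit(0.85, 100, 900)
print(threshold, profit_at_95, profit_at_85)
```

With these values a client has to be good with a probability above 0.9 before approval has a positive expected profit.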
Setting the cutoff that maximizes profit
Let A denote the set of scores for which the previous condition holds. Then the expected profit per client is given by:
$$E[R] = \sum_{s \in A} \left((L + D)\,q(G|s) - D\right) p(s)$$
If L and D additionally depend on the score s, the situation is somewhat more complicated. For details see Thomas et al. (2002).
Setting the cutoff that maximizes profit
[Figure: curve of possible cutoff settings with the axis E[profits] and the points O, A, B, C, D]
Points on the lower part of the curve correspond to higher cutoff values, and hence to a smaller number of accepted bad clients, while points on the upper part of the curve correspond to lower cutoff values, i.e. to a higher number of accepted bad clients. The efficient frontier is therefore the lower part of the curve from point C to point D.
If the current setting of the approval process corresponds to point O, we again have the option of moving onto the curve of the new scoring function. The first option is to keep the proportion of approved bad clients, i.e. to move to point A. The second option is to keep the overall proportion of approved clients, i.e. to move to point B. Moving to point A is clearly not a good choice, because this point does not lie on the efficient frontier and the same expected profit can easily be achieved with a lower expected loss.
Definition of KRN (CRE)
[Figure: loss by number of the defaulted instalment, with the probability of default (PD) of each instalment: 1 (.06), 2 (.02), 3 (.02), 4 (.02), 5 (.02), 6 (.02), 7 (.02), 8 (.02), 9 (.02), 10 (.03)]
The probability of default depends strongly on the scoring function.
CRE = ((1 - Recovery) * SUM(PD * Loss)) / (Expected Average Volume)
Profit = (Interest rate - CRE) * Expected Average Volume
Here Interest rate is the interest rate of the loan and Expected Average Volume is the expected average loan volume.
Recovery (= Late collection (LC))
Number of the defaulted instalment	score			
	band 1	band 2	band 3	band 4
1.	20%	25%	30%	35%
2.-4.	50%	55%	60%	65%
5.+	75%	80%	85%	90%
[Figure: estimated recovery by defaulted-instalment group (1, 2-4, 5+) and score band (sp1 to sp4)]
Cutoff on the KRN scale
[Figure: cutoff on the KRN scale]
(Expected) Margin
(Expected) Margin = Interest rate (incl. fees) - KRN - OPEX
□ Interest rate
The effective rate of the ideal cash flow (-loan amount - fees; annuity; annuity; ...; annuity).
□ KRN
See above.
□ OPEX
Cost of funds.
Overhead costs, variable costs, support of the sales network.
Cost of administrators, i.e. own employees who process the loan.
Margin
> Optimal cutoff: margin = 0
RAROA
(Risk Adjusted Return On Assets)
[Figure: probability of default by number of payment (based on scoring) and recoveries, giving EXPECTED INCOME - EXPECTED LOSS in total]
RAROA = (EXPECTED INCOME - EXPECTED LOSS) / BORROWED VOLUME
* t ... index of the loan instalment; 0 is the moment the loan is granted
* T ... number of instalments
* x(t) ... outstanding part of the loan according to the repayment schedule at time t, t = 0, ..., T; x(0) is the loan amount, x(T) = 0
* u(t) ... interest part of annuity t, t = 1, ..., T
* j(t) ... part of annuity t corresponding to the repayment of principal, t = 1, ..., T
* k(t) ... commission from the client at time t, t = 0, ..., T
* A ... annuity amount (absolute); A = u(t) + j(t), t = 1, ..., T
* p(t) ... probability of a 90-day default of the loan on instalment t, t = 1, ..., T
* EZ ... expected loss from the loan
* EP ... expected interest income from the loan
* RC ... absolute part of the amount owed by a client 90 days past due that the client repays in the future, discounted via NPV to the moment of the client's 90-day default
* r(t, f) ... recovery rate (in percent) of the amount owed by a client who is 90 days past due for the first time on instalment t and has the value f of the fraud score (non-payment of the first instalment). The percentage reflects the NPV of all the client's future payments after the moment of default.
RAROA
• GM ... gross expected profit from the client
• s ... interest rate of the loan p.a.
• i ... cost of funds expressed in percent p.a.
• c ... sales commission paid to the business partner, expressed as a percentage of the principal
• NM_I ... net expected profit of type I from the client, after deducting the cost of funds
• NM_II ... net expected profit of type II from the client, after deducting the cost of funds and the sales commission
• ROA ... Return on Assets computed from the gross profit
• ROA_I ... Return on Assets of type I, computed from the net profit of type I
• ROA_II ... Return on Assets of type II, computed from the net profit of type II
• KRN ... interest rate p.a. expressing the riskiness of the loan
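RAROA
$$EZ = \sum_{t=1}^{T} p(t)\,x(t-1)$$
$$EP = k(0) + \sum_{t=1}^{T} \left(1 - \sum_{s=1}^{\ldots} p(s)\right)\left(u(t) + k(t)\right)$$
$$GM = EP - EZ + RC$$
$$NM_I = GM - \frac{i}{12} \sum_{t=1}^{T} \left(1 - \sum_{s=1}^{\ldots} p(s)\right) x(t-1)$$
$$NM_{II} = NM_I - c\,x(0)$$
$$ROA = \frac{GM}{x(0)}, \qquad ROA_I = \frac{NM_I}{x(0)}, \qquad ROA_{II} = \frac{NM_{II}}{x(0)}$$
KRN is defined by
$$\frac{KRN}{12} \sum_{t=1}^{T} \left(1 - \sum_{s=1}^{\ldots} p(s)\right) x(t-1) = EZ - RC$$
Since
$$\sum_{t=1}^{T} \left(1 - \sum_{s=1}^{\ldots} p(s)\right) x(t-1) = \frac{EP}{s/12},$$
we obtain
$$KRN = \frac{EZ - RC}{EP} \cdot s$$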
Advantages of RAROA
	Case A		Case B	
	Ideal flow	Expected flow	Ideal flow	Expected flow
	-1000	-1000	-1000	-1000
1	400	200	150	110
2	400	180	150	100
3	400	170	150	90
4	400	160	150	80
5			150	70
6			150	60
7			150	50
8			150	40
9			150	30
10			150	16
11			150	10
12			150	0
A - short-term loan with a high risk of fraud
B - long-term loan with a high risk of default
Interest rate (A) = 22%, Interest rate (B) = 10%
KRN (A) = 44%, KRN (B) = 20%: a cutoff on the KRN scale prefers B
Margin (A) = -22%, Margin (B) = -10%: a cutoff on the margin scale prefers B
RAROA (A) = -0.29, RAROA (B) = -0.36: a cutoff on the RAROA scale prefers A
Loan A is better because it yields a higher return (710 > 656), which is moreover achieved much sooner.
Cutoff segmentation
□ Possible segmentation by:
> Sales network (group of points of sale)
> Product profitability
> Quality of the point of sale
> Type of goods (for consumer loans)
> Loan amount
> ...
Cutoff scenarios
	All credits		Approved credits		
	Reject rate		Avg. margin		Avg. KRN
	Credits	Volume	Credits	Volume	
	0.0%	0.00%	-7.83%	-21.34%	30.70%
	22.4%	24.74%	-3.97%	-15.72%	26.33%
	36.8%	46.79%	2.32%	-4.58%	19.11%
	29.5%	37.07%	3.18%	-3.36%	19.78%
	30.3%	38.55%	4.11%	-1.29%	19.26%
	31.5%	39.99%	4.63%	-0.77%	18.96%
	35.8%	46.01%	7.32%	3.22%	16.07%
	38.9%	48.97%	8.17%	3.41%	15.19%
	59.3%	70.03%	19.39%	17.14%	13.47%
	50.9%	63.64%	19.24%	17.03%	14.23%
	All credits		Approved credits		
	Reject rate		Avg. margin		Avg. KRN
	Credits	Volume	Credits	Volume	
	0.0%	0.00%	-4.19%	-16.40%	26.19%
	9.0%	10.25%	-2.39%	-13.71%	24.50%
	24.5%	34.89%	3.26%	-3.42%	17.99%
	16.5%	23.87%	4.07%	-2.34%	18.60%
	17.3%	25.42%	4.85%	-0.59%	18.19%
	18.6%	27.16%	5.36%	-0.03%	17.87%
	23.2%	33.80%	7.74%	3.52%	15.34%
	26.7%	37.35%	8.61%	3.88%	14.45%
	50.7%	62.90%	19.53%	17.26%	13.05%
	47.5%	60.46%	19.48%	17.23%	13.26%
Cutoff impact evaluation
Evaluation of Reject rate, Profitability, Default and Loss rates before and after cutoff change according to Distribution channel or Segment of scorecard.
Cutoff impact evaluation table
	Before Christmas (approved credits)				After Christmas (approved credits)			
	Reject rate	RAROA	Loss rate	Profit (per year)	Reject rate	RAROA	Loss rate	Profit (per year)
Segment 1	24.7%	3.65%	11.33%	414 363 110	24.3%	3.75%	11.19%	428 757 430
Segment 2	12.1%	4.01%	8.22%	160 364 072	12.9%	3.95%	8.29%	159 917 943
Segment 3	45.1%	9.64%	9.69%	747 636 468	45.1%	9.8%	9.5%	758 966 512
Segment 4	22.2%	5.80%	4.89%	52 213 720	20.1%	5.62%	5.05%	51 715 263
Segment 5	20.9%	6.77%	5.41%	54 312 614	19.7%	6.61%	5.48%	53 975 903
Segment 6	33.4%	7.04%	7.22%	212 090 365	32.6%	7.04%	7.16%	211 684 371
Segment 7	49.3%	9.30%	8.93%	36 840 287	49.2%	9.4%	8.8%	37 140 165
Segment 8	19.3%	4.68%	2.96%	15 668 962	14.9%	4.54%	3.16%	15 636 910
Segment 9	32.0%	8.41%	5.06%	3 679 430	27.2%	7.97%	5.26%	3 535 809
Segment 10	33.4%	7.14%	6.69%	1 823 050 341	33.4%	7.2%	6.6%	1 832 986 599
Segment 11	28.5%	6.34%	7.36%	2 633 609 071	28.6%	6.47%	7.24%	2 651 352 740
ALL	32.6%	6.64%	8.37%	6 153 828 440	32.6%	6.96%	8.17%	6 205 669 645
Cutoff sensitivity analysis
Profitability, default and loss rates as functions of the reject rate, shown in one graph.
[Figure: Characteristics of approved credits according to reject rate; x-axis: Reject rate; series: Profit (per year), RAROA, Loss rate]
Decision
Reasoning why the final cutoffs were chosen.
Monitoring
	dev. sample [1]	week 1 [2]	[3]=[2]-[1]	[4]=[2]/[1]	[5]=ln[4]	[6]=[3]*[5]
score_1	10.00%	5.63%	-0.044	0.563	-0.574	0.025
score_2	10.00%	11.21%	0.012	1.121	0.114	0.001
score_3	10.00%	11.00%	0.010	1.100	0.095	0.001
score_4	10.00%	10.97%	0.010	1.097	0.092	0.001
score_5	10.00%	10.31%	0.003	1.031	0.031	0.000
score_6	10.00%	10.12%	0.001	1.012	0.012	0.000
score_7	10.01%	9.62%	-0.004	0.961	-0.039	0.000
score_8	10.00%	9.89%	-0.001	0.989	-0.011	0.000
score_9	10.00%	10.31%	0.003	1.031	0.030	0.000
score_10	10.00%	10.94%	0.009	1.095	0.091	0.001
					PSI	0.030
[Figure: Stability of the scoring function (SF) by week: development sample and weeks 2006-13 to 2006-22]
Monitoring of scoring models
□ Not surprisingly, predictive models behave best, in the statistical sense, on the development sample. The outputs of these models, e.g. a client's score or rating, are computed by formulas whose coefficients for the independent variables (predictors) are derived from the development sample. A shift in the distribution of the model output is then caused by a change of the model inputs, i.e. the predictors, over time. Practically immediately (at least in most cases) after a predictive model is put into production, its predictive power drops somewhat because the input values have changed. In practice it is essential to set up processes that reveal that this is happening, why it is happening, and how serious the consequences are.
Monitoring of scoring models
□ Several factors cause a shift in the distribution of the predictors, and consequently a shift in the distribution of the output of the predictive model:
> Natural drift in the data / change in the demographic structure of the data
> Database errors
> Change of the data source
> Change of the definition/format of the input data
> Change of the data universe
> Other
Monitoring of scoring models
□ A typical example of the first reason is client income (the general trend is that the income of the population grows). A change of the definition/format of the input data means, for example, that the code list of values an input variable can take has been extended. A change of the data universe means that the developed predictive model is used, for example, for a different/new portfolio segment or a different/new product.
Monitoring of scoring models
□ K-S, Gini:
[Figure: Stability of the scoring function (SF) by week: development sample and weeks 2006-13 to 2006-22]
Monitoring of scoring models
[Figure: Dependence of default on score, for the development sample and the weeks 2006-13 to 2006-22]
> The steeper the curve, the better.
> Over time the curve flattens; the question is how much.
Monitoring of scoring models
□ c-statistic:
[Figure: Rating (probability of default at 90 days past due within 12 months): shares of the rating grades hB07 (6%), hB08 (9%), hB09 (11%), hB10 (13.5%), hB11 (16.25%), hB12 (18.75%), hC01 (21.25%), hC02 (25%), hC03 (30%), hC04 (34%) and the estimated average probability of default, for the development sample and the months X-04 to I-05]
[Figure: KRN: shares of the bands KRN1 (1%), KRN2 (4%), KRN3 (9%), KRN4 (13.5%), KRN5 (16.5%), KRN6 (20%), KRN7 (24%) and the c-statistic, for the development sample and the months X-04 to II-05]
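Monitoring of scoring models
□ We want to assess whether the score distribution on the development sample differs from the score distribution in a given time interval:
$$PSI = \sum_{i=1}^{I} (O_i - E_i) \ln\frac{O_i}{E_i}$$
where E_i is the share of clients in the i-th score band on the development sample, O_i is the share observed in the given time interval and I is the number of score bands.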
Monitoring of scoring models
	dev. sample [1]	week 1 [2]	[3]=[2]-[1]	[4]=[2]/[1]	[5]=ln[4]	[6]=[3]*[5]
score_1	10.00%	5.63%	-0.044	0.563	-0.574	0.025
score_2	10.00%	11.21%	0.012	1.121	0.114	0.001
score_3	10.00%	11.00%	0.010	1.100	0.095	0.001
score_4	10.00%	10.97%	0.010	1.097	0.092	0.001
score_5	10.00%	10.31%	0.003	1.031	0.031	0.000
score_6	10.00%	10.12%	0.001	1.012	0.012	0.000
score_7	10.01%	9.62%	-0.004	0.961	-0.039	0.000
score_8	10.00%	9.89%	-0.001	0.989	-0.011	0.000
score_9	10.00%	10.31%	0.003	1.031	0.030	0.000
score_10	10.00%	10.94%	0.009	1.095	0.091	0.001
					PSI	0.030
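The PSI column arithmetic of the table can be reproduced in Python. This is a minimal sketch in which the function name `psi` is an assumption; the shares are those of the development sample and of week 1 from the table.

```python
from math import log

def psi(expected, observed):
    """Population stability index: sum of (O_i - E_i) * ln(O_i / E_i)."""
    return sum((o - e) * log(o / e) for e, o in zip(expected, observed))

# Shares per score band: development sample [1] and week 1 [2].
expected = [0.1000] * 6 + [0.1001] + [0.1000] * 3
observed = [0.0563, 0.1121, 0.1100, 0.1097, 0.1031,
            0.1012, 0.0962, 0.0989, 0.1031, 0.1094]
value = psi(expected, observed)
print(round(value, 3))
```

Almost the whole index comes from the first score band, whose share dropped from 10% to 5.63%.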
Monitoring scoringových modelů
PSI < 0.1 indicates no difference, or only a very small one, between the two score distributions.
0.1 < PSI < 0.25 means that the distribution has shifted somewhat, but not significantly.
PSI > 0.25 signals a significant shift in the score distribution, i.e. we reject the hypothesis that the two distributions are the same.
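In a monitoring report these thresholds are typically applied mechanically. A minimal sketch, assuming a hypothetical helper `interpret_psi`; the thresholds are stated with strict inequalities, so assigning the boundary values 0.1 and 0.25 to the middle band is an assumption of this example.

```python
def interpret_psi(psi_value):
    """Map a PSI value to the three bands used for monitoring.

    Assumption: the boundary values 0.1 and 0.25 fall into the middle band.
    """
    if psi_value < 0.1:
        return "no or very small shift"
    if psi_value <= 0.25:
        return "some shift, not significant"
    return "significant shift"


# 0.030 is the PSI of the worked example; 0.18 and 0.31 are made-up values.
labels = [interpret_psi(v) for v in (0.030, 0.18, 0.31)]
```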
$$\mathrm{PSI}_{DR} = \sum_{i=1}^{n} (DR_{2i} - DR_{1i})\,\ln\frac{DR_{2i}}{DR_{1i}}$$
	def_rate	Gini	PSI_DR	PSI	chi-square
dev. sample	7.69%	0.643			
200613	9.38%	0.564	0.120	0.030	0.024
200614	9.35%	0.542	0.131	0.034	0.027
200615	8.70%	0.537	0.093	0.032	0.025
200616	8.57%	0.523	0.089	0.033	0.026
200617	8.59%	0.540	0.071	0.030	0.025
200618	9.19%	0.544	0.111	0.030	0.024
200619	8.03%	0.558	0.063	0.034	0.026
200620	8.52%	0.552	0.055	0.023	0.019
200621	8.05%	0.555	0.043	0.027	0.022
200622	7.76%	0.539	0.039	0.045	0.034
Champion-challenger
□ The champion-challenger strategy came into wide use in the 1990s. The principle is very simple. Suppose there is an established way of doing something (for example, the scoring model currently used to approve or reject loan applications). We call this way the champion. There are, however, one or more alternative ways of reaching the same (or a very similar) goal. We call these the challengers. We test the challengers on a random sample and compare them with the champion. This lets us compare the effectiveness of the challengers and the champion, and it also lets us identify the existence and extent of side effects. The outcome may be that one of the challengers proves better than the champion, in which case that challenger becomes the new champion.
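The two mechanical steps of this strategy, random routing of cases and a per-strategy comparison of outcomes, can be sketched as follows. Everything here is an illustrative assumption: the function names `assign_strategy` and `default_rate_by_strategy`, the 10% challenger share, and the use of the default rate as the comparison metric.

```python
import random


def assign_strategy(rng, challenger_share):
    """Route one application at random to the champion or the challenger."""
    return "challenger" if rng.random() < challenger_share else "champion"


def default_rate_by_strategy(outcomes):
    """Compute the default rate per strategy.

    outcomes -- iterable of (strategy, defaulted) pairs, defaulted in {0, 1}
    """
    totals, bads = {}, {}
    for strategy, defaulted in outcomes:
        totals[strategy] = totals.get(strategy, 0) + 1
        bads[strategy] = bads.get(strategy, 0) + defaulted
    return {s: bads[s] / totals[s] for s in totals}


# Route a hypothetical stream of 10,000 applications, 10% to the challenger.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
assignments = [assign_strategy(rng, 0.1) for _ in range(10_000)]
challenger_share = assignments.count("challenger") / len(assignments)

# Made-up outcomes, only to show the shape of the comparison.
rates = default_rate_by_strategy(
    [("champion", 1), ("champion", 0), ("challenger", 0), ("challenger", 0)]
)
```

Random assignment is what makes the comparison fair: both strategies see applicants drawn from the same population, so a difference in outcomes can be attributed to the strategy rather than to the mix of clients.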
11. References
Literature - books
> Anderson, R. (2007). The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation, Oxford: Oxford University Press.
> Giudici, P. (2003). Applied Data Mining: Statistical Methods for Business and Industry, Chichester: Wiley.
> Han, J., Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd ed., San Francisco: Morgan Kaufmann.
> Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.
> Hosmer, D.W., Lemeshow, S. (2000). Applied Logistic Regression, Textbook and Solutions Manual, 2nd ed., New York: John Wiley and Sons.
> Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, New Jersey: Wiley.
> Thomas, L.C. (2009). Consumer Credit Models: Pricing, Profit, and Portfolio, Oxford: Oxford University Press.
> Thomas, L.C., Edelman, D.B., Crook, J.N. (2002). Credit Scoring and Its Applications, Philadelphia: SIAM Monographs on Mathematical Modeling and Computation.
> Wilkie, A.D. (2004). Measures for comparing scoring systems, In: Thomas, L.C., Edelman, D.B., Crook, J.N. (Eds.), Readings in Credit Scoring. Oxford: Oxford University Press, pp. 51-62.
> Witten, I.H., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, San Francisco: Morgan Kaufmann.
Literature - journals
> Crook, J.N., Edelman, D.B., Thomas, L.C. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research, 183(3), 1447-1465.
> Hand, D.J., Henley, W.E. (1997). Statistical Classification Methods in Consumer Credit Scoring: a review. Journal of the Royal Statistical Society, Series A, 160(3), 523-541.
> Harrell, F.E., Lee, K.L., Mark, D.B. (1996). Multivariate prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15, 361-387.
> Lilliefors, H.W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399-402.
> Nelsen, R.B. (1998). Concordance and Gini's measure of association. Journal of Nonparametric Statistics, 9(3), 227-238.
> Newson, R. (2006). Confidence intervals for rank statistics: Somers' D and extensions. The Stata Journal, 6(3), 309-334.
> Somers, R.H. (1962). A new asymmetric measure of association for ordinal variables. American Sociological Review, 27, 799-811.
> Thomas, L.C. (2000). A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting, 16(2), 149-172.
Literature - web
> Coppock, D.S. (2002). Why Lift?, DM Review Online, www.dmreview.com/news/53291.html
> Xu, K. (2003). How has the literature on Gini's index evolved in past 80 years?, www.economics.dal.ca/RePEc/dal/wparch/howgini.pdf
> Xin Ming Tu, Wan Tang (2006). Categorical Data Analysis. http://www.urmc.rochester.edu/smd/biostat/people/faculty/TuSite/bst466/handouts.htm
> Jiawei Han and Micheline Kamber (2006). Data Mining: Concepts and Techniques. http://www.cs.illinois.edu/~hanj/bk2/
> Jens Peter Dittrich (2007). Data warehousing. http://www.dbis.ethz.ch/education/ss2007/07_dbs_datawh/Data_Mining.pdf
> Joe Carthy (2006). Data Warehousing. http://www.csi.ucd.ie/staff/jcarthy/home/DataMining/DM-Lecture02-01.ppt
> Jan Spousta (?). Prednasky k data miningu. [cited 19.03.2009] http://samba.fsv.cuni.cz/~
Other interesting sources of information
> http://www.cs.uiuc.edu/homes/hanj/
> http://www-users.cs.umn.edu/~kumar/
> http://www.kdnuggets.com/
> http://www.kdnuggets.com/datasets/competitions.html
> http://www.crc.man.ed.ac.uk/conference/
> http://www.crc.man.ed.ac.uk/conference/archive/
> http://www.kmining.com/info_conferences.html
> http://en.wikipedia.org/wiki/Data_mining
> http://cs.wikipedia.org/wiki/Data_mining
> http://en.wikipedia.org/wiki/Credit_scorecards
Useful data sources
> http://archive.ics.uci.edu/ml/ (Machine Learning Repository)
> http://kdd.ics.uci.edu/
> http://sede.neurotech.com.br:443/PAKDD2009/
> http://www.dataminingbook.com/
> http://www.stat.uni-muenchen.de/service/datenarchiv/welcome_e.html
> http://www.kaggle.com/