Obsah: 1. Credit scoring (CS) - historie, základní pojmy. 2. Úvod do SAS EG 3. Metodologie vývoje scoringových funkcí. 4. Příprava dat II. 5. Úvod do shlukové analýzy. Hiearchické shlukování. 6. Vývoj CS modelu. 7. Úvod do analýzy přežití. 8. Coxova regrese. 9. Evaluace modelu II. 10. Stanovení cut-off. RAROA, CRE. Monitoring. 11. Reference. 3 45 92 164 212 275 331 382 398 443 479 1. Credit scoring- historie, základní pojmy 3 4 Úvod  Credit Scoring je soubor prediktivních modelů a jejich základních technik, které slouží jako podpora finančním institucím při poskytování úvěrů.  Tyto techniky rozhodují, kdo dostane úvěr, jaká má být výše úvěru a jaké další strategie zvýší ziskovost dlužníků vůči věřitelům.  Credit Scoringové techniky kvantifikují a posuzují rizika při poskytování úvěrů konkrétnímu spotřebiteli. 5 Úvod  Nerozeznají a nestanovují "dobré" nebo "špatné" (očekává se negativní chování, tj. např. default) žádosti o úvěr na individuální bázi, nýbrž poskytují statistické šance, nebo pravděpodobnosti, že žadatel s daným skóre se stane "dobrým" nebo "špatným".  Tyto pravděpodobnosti nebo skóre, spolu s dalšími obchodními úvahami jako jsou předpokládaná míra schvalování, zisk nebo ztráty, jsou pak použity jako základ pro rozhodování o poskytnutí/neposkytnutí úvěru. Why do we need score?  “HISTORICAL EVOLUTION”: Money lender • lend only to people which he knows Operators • they make decision based on client's information and their experience Automatic scoring • make decision on statistical base PAST EXPERIENCE -> ESTIMATION FOR FUTURE 6 Why score? • Automatization of approval proces • Cost – effective • Less fraud possibilities ADVANTAGES: • Statistical based, not take in account client like individual DISADVANTAGES 7 Úvod  Zatímco historie úvěru sahá 4000 let nazpět (první zaznamenaná zmínka o úvěru pochází ze starověkého Babylonu - 2000 let před n.l.), historie credit scoringu je pouze 50-70 let stará.  První přístup k řešení problému identifikace skupin v populaci představil ve statistice Fisher (1936). V roce 1941, Durand jako první rozpoznal, že tyto techniky mohou být použity k rozlišování mezi dobrými a špatnými úvěry. 8 Úvod  Významným milníkem při posuzování úvěrů byla druhá světová válka.  Do té doby bylo standardem individuální posuzování žadatele o úvěr. Dále bylo standardem, že ve finanční sféře byli zaměstnáni (téměř) výhradně muži.  Odchod značné části mužské populace do služeb armády měl za následek potřebu předat zkušenosti dosavadních posuzovatelů žádostí o úvěr novým pracovníkům.  Díky tomu vznikla jakási rozhodovací pravidla a došlo k „automatizaci“ posuzování žádostí o úvěr. 9 Úvod  Příchod kreditních karet ke konci šedesátých let minulého století a růst výpočetního výkonu způsobil obrovský rozvoj a využití credit scoringových technik. Událost, která zajistila plnou akceptaci credit scoringu, bylo přijětí zákonů „Equal Credit Opportunity Acts” (o rovné příležitosti přístupu k úvěrům) a jeho pozdějších znění přijatých v USA v roce 1975 a 1976. Tyto stanovily za nezákonné diskriminace v poskytování úvěru, vyjma situace, pokud tato diskriminace „byla empiricky odvozená a statisticky validní”. 10 Úvod  V osmdesátých letech minulého století začala být využívána logistická regrese, dodnes v mnoha oblastech považovaná za průmyslový standard, a lineární programování. O něco později se objevily na scéně metody umělé inteligence, např. neuronové sítě. 
Mezi další používané techniky lze zařadit metody nejbližšího souseda, splajny, waveletové vyhlazování, jádrové vyhlazování, Bayesovské metody, regresní a klasifikační stromy, support vector machines, asociační pravidla, klastrová analýza a genetické algoritmy. 11 Historie -detail Zdroj: Anderson pawnshop = zastavárna deemed acceptable = považován za přijatelný Advertisement for credit = reklama na úvěr Mercantile agency = obchodní agentura 12 Historie -detail Zdroj: Anderson affordable = dostupný iris species = druhy kosatců Charge card = kreditní karta Propensity scorecard = scoringová karta pro modelování náchylnosti (k nákupu) FI = splolečnost Fair, Isaac…dnes FICO Mortgage = hypotéka 13 Historie -detail Zdroj: Anderson 14 Historie -detail Zdroj: Anderson 15 Historie -detail Zdroj: Anderson 16 Historie –další zajímavé čtení http://www.fundinguniverse.com/company-histories/Fair-Isaac-and- Company-Company-History.html http://www.fico.com/en/Company/News/Pages/03-10-2009.aspx http://www.directlendingsolutions.com/history_credit_scoring.htm http://www.pbs.org/wgbh/pages/frontline/shows/credit/more/scores.html http://en.wikipedia.org/wiki/Credit_score 17 Risk Management – Acquisition Data Acquisition Internal Info Strategy • Policy Rules • Scorecards • Fraud • Delinquency • Bankruptcy • Claims • Credit Bureau • Other External Data Pass Fail 18 Risk Management – Customer  Credit Line Management  Usage Monitoring  Transaction Fraud  Transaction Approval  Renewal/Reissue  Collections  Claims  Scorecards  Policy Rules  Strategies .. Lots of analysis 19 Risk Management Commercial/Consumer Delinquency, Fraud, Claim, Collections Market, Interest, VaR (Risk Dimensions) Enterprise Financial Operational Risk Management 20 Klienti nesplácí poskytnuté půjčky Změny úrokových sazeb, cen akcií, kurzů 21 22 Risk Management Delinquency, Fraud, Claim, Collections Commercial/Consumer Applicant Transaction Claims Internet (app+trans) Fraud P&C, Life, Health Mortgage insurance Export financing insurance Claim Payment Projection (recovery) Outsourcing to agency Collections Late payments Bankruptcy Write-off Delinquency P&C: Property & Casualty Insurance (majetkové a úrazové pojištění) Why Manage Risk?  Reduce exposure to high-risk accounts.  Decrease bad debt and claims payouts.  Ensure better pricing to reflect risk.  Detect fraud early-on.  Increase approval rates (the “right kind” – potentially increasing revenue).  Handle most approvals/declines quickly (customer service).  Analysts/investigators only focus on difficult accounts.  Ensure consistent, equal and objective treatment of each applicant across the organization.  Offer more efficient marketing initiatives. $ $ $ £ £ £ ¥ ¥ ¥ € € € 23 Users of Risk Management  Banks  Citibank, Royal Bank, CIBC, BankOne  Finance Companies  GE Capital, HFC, GMAC  Insurance  Life, Property and Casualty, Health  Government  Ministries/Departments of Health (Medicare), Ministries of Finance (IRS), Workers Compensation. 24 Users of Risk Management  Utilities  Hydro/Power/Energy, Water  Communications  Bell, Sprint, AT&T (land lines and cellular)  Retail  JC Penneys, Sears, Hudsons Bay Company, Target  Manufacturers/Industrials  Those who give credit to small businesses. 
25 Risk Management “Toolbox”  Risk Data Mart/Data Warehouse  Risk prediction models (scorecards)  Reporting  Analysis tools  Operational/strategy implementation software (for example, FICO™ Blaze Advisor®, FICO® TRIAD® Customer Manager, Experian Probe SM, Experian NBSM, Cardpac, VisionPlus, Pro-Logic Ovation). 26 FICO™ Blaze Advisor® Zdroj: http://www.fico.com/account/resourcelookup.aspx?theID=430 27 Scorecards  Predict the probability of a negative event.  Custom – based on clients own data  Generic – based on pooled industry or bureau data (Beacon, Empirica)  Application – new applicants  Behavioral – current customers 28 Scorecard Types Mktg/CRM Response Churn Revenue Cross sell Risk 30/60/90 Delinquency Bankruptcy Write-off Claim Fraud Collections Combination Resp/approve/delq Response/profit Risk/churn/profit Profit 29 Scoring in approval process Client (new) Hard checks Scoring on fraud and default cutoffs on RAROA Verifications (dependant on riskgroup) + + - - rejection rejection rejection Policy declines – low age, unsufficient length of employment, “terorrist” etc. What is the probability that client will pay? Will the contract be profitable? Is the number of client„s phone valid? Etc. 30 Fraud Risk  Fraud risk is one of the fastest growing areas in risk management.  Examples include bank/retail card fraud, insurance fraud, health care fraud, welfare fraud, franchise fraud, internet fraud, mortgage fraud, investment fraud, tax fraud, merchant fraud.  E-commerce presents opportunities.  The F.B.I. estimates that between 10–15% of loan applications contain material misrepresentations. 31 Reporting and Analysis  Scorecard and portfolio performance  Approval rates, applicant profile, loss rates, high risk segments  Behavior tracking to develop better strategies  Capturing fraud, approval/decline, pricing, credit line management, collections, cross sells qualification, claims. 32 Risk Applications  Retail/banking (consumer and commercial)  Application and behavior scorecards for all credit products.  Strategy design for credit limit setting, authorizations and collections/reissue/suspension.  Fraud application and transaction detection  Pricing/down payment  ATM limits, check holds  Pre-qualifying direct marketing lists.  Automotive/finance  Loans and leasing  Application, behavioral, fraud, collection scorecards  Pricing/down payment. 33 Risk Applications  Government  Fraud detection (for example, Welfare, health insurance)  Entitlement/claims assessment (for example, Workers compensation)  Communications  Security deposit  International call access  Contract/”pay as you go”  Telephone fraud  “Shadow limit” setting  Suspension of service  Collections. 34 Risk Applications  Insurance  Rate setting  Fraud detection  Claims management  Risk control for CRM initiatives.  Utilities  Security deposit  Collections. 35 Risk Applications  Manufacturers/pharmaceuticals/industrials  Assessing credit risk of business clients  Credit risk assessment of franchisees (for example, gas stations)  Payment terms  Collections  Merchant fraud. 36 Risk Applications  Optimizing work flow in adjudication departments  Evaluating/pricing portfolios  Securitization  Setting economic/regulatory capital allocation  Reducing turnaround time (automated scoring)  Comparing quality of business from different channels/regions/suppliers. 
37 Resources  www.ftc.gov/bcp/conline/pubs/credit/scoring.htm  www.creditscoring.com  www.my-credit-score.com  www.fairisaac.com, www.myfico.com  www.experian.com  www.creditinfocenter.com  www.consumersunion.org/finance/scorewc200.htm  www.phil.frb.org/files/br/brso97lm.pdf  www.nacm.org  www.rmahq.org  www.riskmail.org  www.occ.treas.gov 38  Credit Scoring & Its Applications by Lyn Thomas, Jonathan Crook, David Edelman  Credit Risk Modeling: Design and Application by Elizabeth Mays (Editor)  Internal Credit Risk Models: Capital Allocation and Performance Measurement by Michael K Ong  Handbook of Credit Scoring by Elizabeth Mays  Applications of Performance Scoring to Accounts Receivables Management in Consumer Credit by John Y. Coffman  Introduction to Credit Scoring, by E.M. Lewis Resources 39 Scorecard Development roles- objectives  Understand the critical resources needed to successfully complete a scorecard development and implementation project.  Understand some of the operational considerations that go into scorecard design. 40 Major Roles  Scorecard Developer  Data miner, data issues  Credit Scoring Manager/Risk Manager  Strategic view, corporate policies, implementation  Product Manager  Client base, target market, marketing direction. 41 Major Roles  Operational Managers  Customer Service, Adjudication, Collections  Strategy execution, impact on customers  IT/IS Managers  external/internal data, implementation platforms. 42 Minor Roles  Project Manager  Coordination, time lines  Corporate Risk staff  Corporate policies, capital allocation  Legal. 43 Why All of These Roles?  Can I use this variable?  Legal, technical (derived variables, implementation platform), future application form design  Segmentation  Marketing, application form design, systems  What is the impact on this segment?  Operational, marketing, risk manager, corporate risk. 44 2. Úvod do SAS EG 45 Introduction to SAS Enterprise Guide SAS Enterprise Guide provides a point-and-click interface for managing data and generating reports. 46 SAS Enterprise Guide Interface SAS Enterprise Guide also includes a full programming interface that can be used to write, edit, and submit SAS code. 47 SAS Enterprise Guide Interface: The Project A project serves as a collection of  data sources  SAS programs and logs  tasks and queries  results  informational notes for documentation. You can control the contents, sequencing, and updating of a project. 48 data work.clubmembers work.nonclub; set orion.customer; if Customer_Type_ID = 3010 then output work.nonclub; else output work.clubmembers; run; proc print data=work.nonclub; title "Non Club Members"; var Country Gender Customer_Name; run; DATA Step PROC Step SAS Programs ep02d01.sas 49 PROC PRINT Output 50 Saving SAS Programs The SAS program in the project is a shortcut to the physical storage location of the .sas file. Select the program icon and then select File  Save program name to save the program as the same name, or Save program name As… to choose a different name or storage location. 51 Embedding Programs in a Project A SAS program can also be embedded in a project so that the code is stored as part of the project .epg file. Right-click on the Code icon in a project and select Properties  Embed. 52 How Do You Include Data in a Project? Selecting File  Open  Data adds a shortcut to a SAS data source in the project. 
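Behind these point-and-click steps, SAS Enterprise Guide generates ordinary SAS code. A minimal sketch of what the equivalent hand-written code could look like, assuming a hypothetical folder path for the library (orion.customer is the table already used in the example program above):

libname orion 'C:\course\data';               /* hypothetical path to the folder holding the SAS data sets */

proc print data=orion.customer (obs=10);      /* list a few rows to verify the data shortcut works */
   title 'First 10 customers';
run;

libname orion clear;                          /* release the library assignment when it is no longer needed */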
53 Assigning a Libref You can use the Assign Project Library task to define a SAS library for an individual project. 54 Browsing a SAS Library During an interactive SAS Enterprise Guide session, the Server List window enables you to manage your files in the windowing environment. In the Server List window, you can do the following:  view a list of all the servers and libraries available during your current SAS Enterprise Guide session  drill down to see all tables in a specific library  display the properties of a table  delete tables  move tables between libraries 55 Applying Formats Display formats can be applied in a SAS Enterprise Guide task or query by modifying the properties of a variable. 56 Query Builder Join When you use the Query Builder to join tables in SAS Enterprise Guide, SQL code is generated.  SQL does not require sorted data.  SQL can easily join multiple tables on different key variables.  SQL provides straightforward code to join tables based on a non-equal comparison of common columns (greater than, less than, between). 57 Sort Data Task The Sort Data task enables you to create a new data set sorted by one or more variables from the original data. 58 Business Scenario Orion Star wants to send information about a specific promotion to female customers in Germany. The report can be created by querying the orion.customer data set to include only the desired customers, and then by producing a report with the List Data task. 59 Business Scenario The same report can be generated more efficiently by subsetting the data directly within the List Data task. This requires modification of the code generated by SAS Enterprise Guide. 60 Understanding Generated Task Code There are many situations where task results created by SAS Enterprise Guide can be further enhanced or customized by modifying the code. However, before you can effectively modify the code, you must first understand the code that SAS Enterprise Guide generates. 61 List Data Task The Preview code button enables you to view and modify the code generated by the task. 62 List Data Task – Code Preview 63 Using the List Data Task to Generate Code This demonstration illustrates building a List Data task and examining the code generated by SAS Enterprise Guide. 64 List Data Task – Generated Code The initial comment block shows information about the task. 65 List Data Task – Generated Code The first line uses a macro to delete temporary tables or views if they already exist. If the Group by role is used in the task, the data must be ordered by the grouping variable. PROC SORT is used by default. Only variables assigned to roles are kept in the new data set. 66 List Data Task – Generated Code If the Group by role is not used, SQL creates a temporary view of the required data. Again, only variables assigned to roles in the task are included in the view. This comment incorrectly states that sorting occurs. 67 List Data Task – Generated Code The main part of the code includes the titles, footnotes, and procedure code to generate the report. PROC PRINT is the procedure used with the List Data task.  TITLE and FOOTNOTE are examples of global statements and can be included anywhere in a SAS program. 68 List Data Task – Generated Code At the end, the final lines of code delete any temporary tables created to build the task, and delete any assigned titles and footnotes. 69 Techniques to Modify Code Three methods can be used to modify code generated by SAS Enterprise Guide: 1. 
Edit the last submitted task code in a separate Code window. 2.Automatically submit custom code before or after every task and query. 3.Insert custom code in a task. 70 Edit Last Submitted Code After a task runs, the code can be viewed from either the Project Tree or Process Flow. 71 Edit Last Submitted Code The task code is read-only and cannot be edited directly. To create a copy of the code from the Last Submitted Code window, select any key while in the SAS program window. SAS Enterprise Guide offers to make a copy. After the code is copied, there is no link between the task and the new code. Any changes in the task are not reflected in the copied code, and modifications to the code do not affect the task. 72 Summary of Editing Last Submitted Code Custom code linked to task? No Can be used to modify query code? Yes Extent of modification allowed? Anything in the program can be changed. Custom code included when exported? Yes. You must export the edited program and select the option in the Export wizard. 73 Automatically Submit Custom Code Before or After Every Task and Query There are times when you might need to run a SAS statement or program before or after any task or query is executed. The Custom Code option enables you to insert custom code before or after all tasks and queries. 74 Automatically Submit Custom Code Before or After Every Task and Query To run code before tasks and queries, select the first check box and select Edit… to type the code. 75 Automatically Submit Custom Code Before or After Every Task and Query Global statements or complete program steps can be entered. Example: Set the LOCALE= option to Great Britain. 76 Insert Code Before or After SAS Programs Similar options exist to automatically submit code before or after SAS programs written and submitted in Code windows in SAS Enterprise Guide. 77 Summary of Submitting Custom Code Before or After Every Task and Query Custom code linked to task? Yes Can be used to modify query code? Yes Extent of modification allowed? Statements can only be submitted before or after the task code. Custom code included when exported? Yes, select the option in the Export wizard. 78 Insert Custom Code in a Task In most task dialog boxes, you have the ability to insert custom code within the generated SAS program. This technique has the significant benefit that the task interface can still be used to modify the report. 79 Insert Custom Code in a Task In the Code Preview window, select Insert Code… to add custom code in predefined locations in the SAS program. 80 Insert Custom Code in a Task In any of these predefined locations, you can double-click on a line to insert custom code. 81 Insert Custom Code in a Task Some insert points enable custom options to be added to existing statements. Insert options in the PRINT statement. Insert options in the VAR statement. 82 Insert Custom Code in a Task Other insert points enable entire statements to be added inside a step in the program. Statements inside the PRINT step 83 Insert Custom Code in a Task Additional locations enable global statements or additional steps to be inserted before or after the main code. Locations for global statements or additional steps 84 Default SAS Enterprise Guide Footnote The default footnote includes macro references to the SAS server name, operating system, and date and time that the task runs. 
Generated by the SAS System version &SYSVER(&_SASSERVERNAME, &SYSSCPL) on %TRIM(%QSYSFUNC(DATE(), NLDATE20.)) at %TRIM(%SYSFUNC(TIME(), NLTIMAP20.)) 85 ODS and SAS Enterprise Guide Default result formats can be set under Tools  Options. 86 ODS and SAS Enterprise Guide Additional settings can be made for each result format. 87 ODS and SAS Enterprise Guide  Task properties can be used to override the default for an individual task.  Generated output can be switched off completely and handled by inserting code. Right-click on a task icon and select Properties. 88 SAS Enterprise Guide Help (Review) If Help files were installed along with SAS Enterprise Guide, you can select Help to access the Help facility regarding both the point-and-click functionality of SAS Enterprise Guide as well as SAS syntax. 89 Task and Procedure Help To find information regarding the syntax of the code behind the scenes of a particular task, type the name of the task in the Index tab. The task help indicates the procedure name to search in the SAS syntax help. 90 Procedure Syntax Help 91 3. Metodologie vývoje scoringových funkcí 92 Objectives  Understand how scorecards to predict credit risk are developed.  Understand the analyses and issues for implementation of scorecards. 93 Main Stages – Development  Stage 1: Preliminaries and Planning  Create Business Plan  Identify organizational objectives  Internal versus External development, and scorecard type  Create Project Plan  Identify project risks  Identify project team. 94 Main Stages – Development  Stage 2: Data Review and Project Parameters  Data availability and quality  Data gathering for definition of project parameters  Definition of project parameters  Performance window and sample window  Performance categories definition (target)  Exclusions  Segmentation  Methodology  Review of implementation plan. 95 Main Stages – Development  Stage 3: Development Database Creation  Development sample specification  Sampling  Development data collection and construction  Adjusting for prior probabilities. 96 Main Stages – Development  Stage 4: Scorecard Development  Missing values and outliers  Initial characteristic analysis  Preliminary scorecard  Reject inference  Final scorecard production  Scaling  Points allocation  Misclassification  Scorecard strength  Validation. 97 Main Stages – Development  Stage 5: Scorecard Management Reports  Gains tables and charts  Characteristic reports. 98 Main Stages – Implementation  Stage 1: Pre-Implementation Validation  Stage 2: Strategy Development  Scoring strategy  Setting cutoffs  Strategy considerations  Policy rules  Overrides. 99 Main Stages – Post Implementation  Post-Implementation  Scorecard and Portfolio Monitoring Reports  Review. 100 Development Stage 1: Preliminaries and Planning 101 Objectives  Create a business plan to ensure a viable and smooth project.  “All Models are wrong. Some are useful” George Box 102 Create Business Plan  Identify organizational objectives.  Reasons for model development  Profit, revenue, loss, automation, operational efficiency  Role of scorecards in decision making  sole arbiter or decision support tool? 103 Create Business Plan  Internal/External Development and Scorecard Type  Capability and resources  Staff, tools, expertise, data  Market segment  Custom, generic, judgmental  segment, data, time. 
104 Create Project Plan  Scope and timelines  Deliverables (scorecard format and documentation,…)  Implementation strategy  Testing, coding  Strategy development  FYI list.  Seamless process from planning to development and implementation. 105 Create Project Plan  Identify Project Risks  Data risks  Availability, quality, quantity  Weak data  Operational risks  Organizational priority  Implementation delays  System interpretation of data. 106 Create Project Plan  Identify Project Team  Roles clearly defined  Signoff, executor, advisor, FYI  Critical path. 107 Development Stage 2: Data Review and Project Parameters 108 Objectives  Identify data requirements.  Perform pre-modeling analysis.  Understand the business  Exclusions  What is a “bad”? – target definition  Sample Window/ Performance Window. 109 Data Availability and Quality  Number of “goods”, “bads” and “rejects”  Initial idea at this stage, estimated from performance reports  Internal data  Reliable, accessible  External data  Accessible, format  Retro pull. 110 Data Gathering  To determine “bad” definition and exclusions:  All applications over the last 2–5 years (or a large sample)  account/ID number  Date opened/applied  Accept/reject indicator  Arrears/payment history  Product/channel and other identifiers  Account status  Other items to understand the business. 111 Exclusions  “Include those whom you would score during normal day to day operations”  VIP  Staff  Fraud  Pre-approved  Underage  Cancelled (sometimes). 112 Performance New Account Good/Bad? ? ? “Sample Window” “Performance Window” 113 Parameters  Performance Window  How far back do I go to get my sample?  Sample Window  Time frame from which sample will be taken.  Definition of “bad”  Bad and approval rates (when oversampling). 114 Parameters  Seasonality  Plot approval rate/applications across time  Establish any ‘abnormal’ zones (for example, talk to marketing).  Sample used in development must be from a normal business period, to get as accurate a picture as possible of the target population. 115 Parameters – “Bad”  Plot “bad” rate by “month opened” (cohort)  For different definitions of bad  30/60/90 days past due  Charge off/write-off  Bankrupt  Claim  Profit based  Less than x% owed collected  “Ever” versus “Current” bad  Ever bad should be used where possible  Considered “bad” if you reach status anytime during performance window. 116 Cohort Analysis – Example Bad = 90 days Open Date 1 Qtr 2 Qtr 3 Qtr 4 Qtr 5 Qtr Jan-99 0.00% 0.44% 0.87% 1.40% 2.40% Feb-99 0.00% 0.37% 0.88% 1.70% 2.30% Mar-99 0.00% 0.42% 0.92% 1.86% 2.80% Apr-99 0.00% 0.65% 1.20% 1.90% May-99 0.00% 0.10% 0.80% 1.20% Jun-99 0.00% 0.14% 0.79% 1.50% Jul-99 0.00% 0.23% 0.88% Aug-99 0.00% 0.16% 0.73% Sep-99 0.00% 0.13% 0.64% Oct-99 0.20% 0.54% Nov-99 0.00% 0.46% Dec-99 0.00% 0.38% Jan-00 0.30% Feb-00 0.00% Mar-00 0.00% 117 Current versus Ever – Example  Current bad definition: No Delinquency  Ever bad definition: 3 months delinquent. Month 1 2 3 4 5 6 7 8 9 10 11 12 Delq 0 0 1 1 0 0 0 1 2 3 0 0 Month 13 14 15 16 17 18 19 20 21 22 23 24 Delq 0 0 1 2 0 0 0 1 0 1 0 0 118 Determining Parameters Bad Rate Development 0% 1% 2% 3% 4% 5% 6% 7% Mar Jan-02 Nov Sep Jul May Mar Jan-01 Nov Sep Jul May Mar Jan-00 Month Opened 119 - mth opened from earliest to latest, and “bad rate” as of this month. For simplicity, this is straight delinquency .. No profit. 
- notice at one point the bad rate levels off - this means everyone who was going to go bad has gone bad I.e. they have been given enough time. This is telling us that for this bad defn, accts from jan-march are mature enough. -lesson 1: need sample that is mature enough, so that you wont be defining a “bad” as a good just because you haven‟t given them enough time. -if you take accts from the middle (enter), some of the accts haven‟t matured yet so your bad rate is understated. -Example: response scoring .. How long do you wait for the responses to come in. the period of measurement is „perf window‟. Determining Parameters Bad Rate Development 0% 1% 2% 3% 4% 5% 6% 7% Mar Jan-02 Nov Sep Jul May Mar Jan-01 Nov Sep Jul May Mar Jan-00 Month Opened Sample Window Performance Window 120 So for each definition of “bad” you‟ll get a sample window of mature accounts, and a performance window indicating the time taken for the bad rate to mature. Also the approval rate for this sample window. Couple of notes on this “maturing” process. - 30 day definition will mature quicker than 90 day. Cause it takes ppl less time to go 30 day than 90 day. Chargeoff even more. - for the same bad defn, credit card quicker than mortgage (18-24 mths vs. 3-5 yrs) . - Why are we doing all this for the different definition? - because each one will produce different counts and based on reasons on the next slide, we‟ll determine the best set of parameters. Determining Parameters – Bad  Organizational objectives/purpose  Tighter definition – more precise, low counts  Looser definition – differentiation sub-optimal  Interpretable and trackable  Consistency  Reality – the best definition under the circumstances (lack of data, history). 121 Lets look at the considerations. - objectives: this may seem obvious, but it is not to a lot of ppl. If you‟re building a scorecard to predict profit, then use profit. Some orgs want a delinquency based defn, but also include profit. E.g. if acct is chronically 2 mths late, but still profitable.. You can‟t set 2 mths as a “bad” - whereas in a pure delq scorecard this may be possible. - tighter/looser: tighter means 90 day, 120 day, writeoff .. Better differentiation, but low count. Remember 2000 bads. - looser means more count, but sub-opt diff. - interpretable e.g. bad is 2 times 60 days, 3 times 30 days or 1 times 90 days. Sounds good, but hell to track and interpret. Keep it simple. - consistency across other cards, products. Also if accounting writes off acct at 7 mths, then keep it consistent with that. - typically most delq cards are 90 days. - Reality: you take what you got. Lack of history allows only a 30 day definition .. Take it. Can‟t measure real bad rate .. Use proxy. (example LOC like an account) Sample Definitions – Bad  Ever 90 days delinquent  Bankrupt  Claim over $1000  3 x 30 days, or 2 x 60 days, or 1 x 90 days  Negative NPV  Not profitable  50% recovered within 3 months  Fraud over $500  Closed within 6 months. 122 Confirming “Bad” Definition  Analytical  “Roll rate” analysis  Current versus worst delinquency comparison  Profitability analysis  Consensus. 
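Before a candidate definition is confirmed with roll-rate or cohort analysis, the "ever bad" flag itself has to be derived from the monthly arrears history. A minimal sketch of one way to do this in SAS, assuming a hypothetical table PERF with one row per account and month and columns ACCOUNT_ID, PERF_MONTH, OPEN_DATE and DPD (these names are illustrative, not the course data):

proc sql;
   create table ever_bad as
   select account_id,
          max(dpd) as worst_dpd,                                   /* worst delinquency reached in the window */
          (calculated worst_dpd >= 90) as bad_ever                 /* 1 = ever 90+ days past due, else 0 */
   from perf
   where intck('month', open_date, perf_month) between 0 and 17    /* 18-month performance window */
   group by account_id;
quit;

Changing the DPD threshold (30/60/90 days) or the window length reproduces the alternative "bad" definitions compared above.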
123 Roll Rate Analysis  Compare Worst delinquency  for example, Previous 12 months versus Next 12 Months Month 1 2 3 4 5 6 7 8 9 10 11 12 Arrears 0 0 1 2 0 0 0 1 2 3 0 0 Month 13 14 15 16 17 18 19 20 21 22 23 24 Arrears 1 2 3 3 3 4 3 0 0 0 0 0 124 Roll Rate Analysis 0% 20% 40% 60% 80% 100% Worst - Next 12 Mths Curr/x day 30 day 60 day 90+ Worst-Prev12Mths Roll Rate Curr/x day 30 day 60 day 90+ 125 You find out which „bad defn‟ is truly bad‟ - also known as POINT OF NO RETURN. Lets look at 30 day: out of everyone who had worst 30 day, majority became current, only a few became worse - this is not a good bad defn. - out of those 60 days, some went over .. Most went back I.e. became better -but those who were 90 day .. Majority did not become better. This confirms our definition. -In general .. Once you hit 90 days, you‟re not coming back. That‟s a true bad. Rem: this is based on „bad‟ objective. If other, perhaps there is a different point in time.. Roll Rate Analysis  Look for ‘point of no return’.  Consider objectives.  Consider sample counts.  Typically for delinquency, after 90 days most accounts do not cure. 126 Current versus Worst Comparison Worst Delinquency Current 30 days 60 days 90 days 120 days writeoff Current Current 100% 68% 34% 15% 4% Delinquency 30 days 16% 22% 8% 5% 60 days 8% 19% 17% 8% 90 days 4% 14% 32% 11% 120 days 2% 8% 18% 54% writeoff 2% 3% 10% 18% 100% 127 32% 44% 60% 72% 56% 40% 18% Parameters – Goods/Indeterminates  Good  Never delinquent  Ever x- days delinquent  No claims  Profitable, positive NPV  No fraud  No bankruptcy  Recovery > 75%, $ value  Must be good throughout performance window  Indeterminate  Mild delinquency, roll rate not conclusive either way  Inactive  Offer declined  Voluntary cancellations*  High balance < $50 128 Default – definice cílové prom. (good/bad)  Obvykle je tato definice založena na klientově počtu dnů po splatnosti (Days Past Due, DPD) a částce po splatnosti. S částkou po splatnosti je spojena potřeba stanovení jisté míry tolerance, tedy stanovení co je považováno za významný dluh a co nikoli. Např. nemusí dávat smysl považovat za dluh částky menší než 100 Kč.  Dále je třeba stanovit časový horizont (performance window), na kterém jsou dva zmíněné parametry sledovány.  Za dobrého klienta lze např. označit klienta, který:  je po splatnosti méně než 60 dnů(s tolerancí 100 Kč) v prvních 6-ti měsících od první splátky,  je po splatnosti méně než 90 dnů (s tolerancí 30 Kč) v průběhu celé své platební historie (ever). 129 Default – definice cílové prom.  Volba těchto parametrů závisí do značné míry na typu finančního produktu (jistě se bude lišit volba parametrů pro spotřebitelské úvěry pro malé částky se splatností kolem jednoho roku a pro hypotéky, které jsou obvykle spojeny s velmi vysokou finanční částkou a se splatností až několik desítek let) a na další využití této definice (řízení rizik, marketing, ...). 130 Default – definice cílové prom.  Další praktickým problémem definice dobrého klienta je souběh několika smluv jednoho klienta. Například je možné, že zákazník je po lhůtě splatnosti na více smlouvách, ale s rozdílnými dny po splatnosti a s různými částkami. V tomto případě jsou většinou částky klienta dlužné v jednom konkrétním časovém okamžiku sečteny, a ze dnů po splatnosti na jednotlivých smlouvách je brána maximální hodnota. Tento přístup lze uplatnit pouze v některých případech, a to zejména v situaci, kdy jsou k dispozici kompletní účetní data. 
Situace je podstatně složitější v případě agregovaných údajů, např. na měsíční bázi. 131  Obecně uvažujeme následující typy klientů: Default – definice cílové prom.  dobrý (good),  špatný (bad),  nedefinovaný (indeterminate),  s nedostatečnou úvěrovou historií (insufficient),  vyřazený (excluded),  zamítnutý (rejected). 132  První dva typy byly diskutovány. Třetí typ, tj. indeterminate, je na hranici mezi dobrým a špatným klientem a při jeho použití přímo ovlivňuje definici dobrých/špatných klientů. Uvažujeme-li pouze DPD, klienti s vysokými DPD (např. 90 +) jsou typicky označeni za špatné, nedelikventní klienti (jejich DPD je rovno nule) jsou označeni za dobré. Za indeterminate jsou pak označeni delikventní klienti, kteří nepřekročí danou hranici DPD.  Čtvrtý typ klientů jsou typicky klienti s velmi krátkou platební historií, u kterých je nemožná korektní definice cílové proměnné.  Vyřazení klienti jsou klienti, jejichž data jsou natolik špatná, že by vedla ke zkreslení modelu(např. fraudy). Další skupinu tvoří klienti, kteří nejsou standardně hodnoceni daným modelem (VIP klienti)  Poslední typ klientů jsou ti klienti, jejichž žádost o úvěr byla zamítnuta. Default – definice cílové prom. 133 Customer Default (60 or 90 DPD) Not default Fraud (first delayed payment, 90 DPD) Early default (2-4 delayed payment, 60 DPD) Late default (5+ delayed payment, 60 DPD) Definice dobrého/špatného klienta Rejected Accepted Insufficient GOOD BAD INDETERMINATE 134 Performance Definitions  “Goods” and “bads” (and rejects) are used for model development.  Indeterminates included for Gains chart and forecasting. 135 Segmentation  Can one scorecard work efficiently for all the different populations within your portfolio?  Or would more than one scorecard be better?  Segmentation maximizes predictiveness for unique segments within your population. 136 Segmentation  Experience (Heuristic)  Knowledge/experience, operational/industry based, common sense.  Statistical  Let the data speak.  “Distinct applicant/account sub-populations”  “Better predictive power than single model”. 137 Experience Based Segmentation  Product  Card type, loan type (auto, home, unsecured), lease, used versus new, brand  Demographics  Geographical (region, urban/rural, state/province, internal definition, neighborhood), age, time at bureau  Source of business  Channel (net, branch, store-front, ‘take one’, brokers)  Applicant type  new/existing, first time home buyer, groups (retired, students, engineers), thin/thick file, clean/dirty file  Product Owned  Credit Card for existing mortgage/loan holders. 138 Experience Based Segmentation  Consider future plans, not just historic operations  How do we detect new segments?  Marketing/risk analysis:  Bad rates  Approval rate  Profit, and so on.  Look for significant performance difference. 139 Experience Based Segmentation  Need to confirm experience using analytics.  Definition of segments  What is a thin file?  What is ‘young’ versus ‘old’?  What is the best demographic split?  What break is best for ‘tenure at bank’? 
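A cross-tabulation of bad rates such as the one on the next slide can be produced directly from the development sample. A sketch, assuming a hypothetical table SAMPLE with a binary target BAD and the characteristics AGE and RES_STATUS (placeholder names, not the course data):

proc format;
   value agegrp low-<30 = 'Age < 30'
                30-high = 'Age > 30';
run;

proc tabulate data=sample format=percent8.1;
   class res_status age;
   var bad;
   format age agegrp.;
   /* rows: residential status; columns: bad rate within each age segment
      plus the unsegmented bad rate */
   table res_status, (age all='Unseg')*bad*mean;
run;

If the ranking of attributes by bad rate differs between the segments, the characteristic predicts differently across them, which supports building separate scorecards.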
140 Confirming Experience  Rule of thumb:  “When the same information predicts differently across unique segments” Bad Rate Age > 30 Age < 30 Unseg Res Status Rent 2.1% 4.8% 2.9% Own 1.3% 1.8% 1.4% Parents 3.8% 2.0% 3.2% Trades 0 5.0% 2.0% 4.0% 1-3 2.0% 3.4% 2.5% 4+ 1.4% 5.8% 2.3% 141 Confirming Experience Attributes Bad Rates Age Over 40 yrs 1.80% 30-40 yrs 2.50% Under 30 6.90% Source of business Internet 20% Branch 3% Broker 8% Phone 14% Applicant Type First Time buyer 5% Renewal Mortgage 1% 142 That Is the Easy Way  You can also build full segmented models, and compare “lift”, sensitivity, and so on, with a base model.  It is best to perform this analysis for both experience and statistically based segmentation. 143 Comparing Improvement  Use different methods to measure improvement (lift, KS, c-stat, precision, and so on.) Segment Total c-stat Seg c-stat Improvement Age < 30 0.65 0.69 6.15% Age > 30 0.68 0.71 4.41% Tenure < 2 0.67 0.72 7.46% Tenure > 2 0.66 0.75 13.64% Gold card 0.68 0.69 1.47% Platinum card 0.67 0.68 1.49% 144 Comparing Improvement  Portfolio stats will put improvements into measurable portfolio terms. After Segmentation Before Segmentation Segment Size Approve Bad Approve Bad Total 100% % % % % Age < 30 65% % % % % Age > 30 35% % % % % Tenure < 2 12% % % % % Tenure > 2 88% % % % % Gold card 23% % % % % Platinum card 77% % % % % 145 Choosing Segmentation  Cost of scorecards (internal/external)  Implementation  Processing  Data storage  Monitoring/strategy development  Segment size  Do I have to? 146 Statistically Based Segmentation  Less preconceived notions  Clustering  Decision Trees. 147 Clustering Clustering 0 2 4 6 8 10 12 0 2 4 6 8 10 148 Showing 3 distinct groups and one outlier. Clustering 0 0.2 0.4 0.6 0.8 1 1.2 1.4 A ge C laim s R egion A R egion B M arried:1 A uto:S ports Overall Mean Mean for Cluster 149 Here is an insurance example of one cluster. - What do we see here? - lower than avg age - more claims - live in region A only - likely to be single and drive a sports car. - this is obviously a high risk segment. (confirm this group with claims analysis) - Similar groups according to characteristics, not performance – so confirm performance for the clusters and combine those with similar risk behavior. We‟re not building a marketing profile, but a RISK PROFILE. Clustering  Defining characteristics for each group  From previous example,  Young males region A  Young females region A, and so on.  Performance analysis to confirm segmentation. 150 Decision Trees  Isolates segments based on performance (target)  Easily interpretable and differentiates between goods and bads. Customer > 2 yrs Bad rate = 0.3% Customer < 2 yrs Bad rate = 2.2% Existing Customer Bad rate = 1.2% Age < 30 Bad rate = 11.7% Age > 30 Bad rate = 4.2% New Applicant Bad rate = 6.3% All Good/Bads Bad rate = 3.8% 151 So Now We Know ...  the business  sample and performance windows  “bad”, “good”, “indeterminate”  exclusions  bad rate, approval rate  number of scorecards needed, and their segments. 152 Methodology/Format  Implementation platform and format  Interpretability, implementation  Legal compliance  Data quality, sample size, target type  Tracking and diagnosis  Specify parameters for scorecard (range of scores, “points to double the odds”). 153 Why ‘Scorecard’ Format? 
 Easiest to interpret, justify, implement  Reasons for decline/low scores can be explained to auditors, Mgmt, regulators, adjudicators  No black box  Diagnosis, tracking, monitoring  Development process fairly simple to understand. 154 Review Implementation Plan  Number of scorecards  Data requirements  Manage expectations  Continuity. 155 Everyone is aware of what‟s going on. This is a business process, not a mystery novel. You‟d be surprised how many people in companies like to spring surprises on other departments. 156 Jsou k dispozici následující data: Accepts.sas7bdat (64589 řádků) Rejects.sas7bdat (35411 ř.) Applicants.sas7bdat (100.000 ř.) …24 sloupců ID of applicant, Date of application/opening, Accept / Reject, 30-days deliquency, 30-days deliquency date, 60-days deliquency, 60-days deliquency date, 90-days deliquency, 90-days deliquency date, Worst previous deliquency, Current deliquency, Age, Age groups, Sex, Existing client?, Phone member?, Region, Income, Income groups, Debt, Income/Debt ratio, Income/Debt ratio groups, Probability of 60-days deliquency (old), Score (old). title 'Accepts'; proc means data=indata.accepts n nmiss min median mean max; var age income debt idratio; run; title 'Accepts'; proc freq data=indata.accepts; table sex client phone region; table (sex client phone region)*bad60; table bad30*(bad60 bad90) bad60*bad90; run; title 'All applicants'; goptions ftext='arial'; proc catalog c=gseg kill; quit; proc gchart data=indata.applicants; vbar age / midpoints=18 to 75 name='_1data_a'; vbar income / name='_1data_b'; vbar debt / name='_1data_c'; vbar idratio / name='_1data_d'; vbar type / name='_1data_e'; vbar scoreold / levels=10 name='_1data_f'; vbar pbad60old / levels=30 name='_1data_f'; run; quit; proc univariate data=indata.applicants normal; var age income debt idratio; histogram age income debt idratio; run; Cvičení Základní popis dat: 157 Cvičení Vybrané výstupy uvedeného kódu: 158 /* 2a. Bad rate development, roll rate analysis */ %let performancewindow='31dec2002'd>=datappl; %let deliq=worstdeliq; proc freq data=indata.accepts /*noprint*/; table datappl*&deliq / out=&deliq (keep=datappl &deliq pct_row where=(&deliq ne '0')) outpct missing; format datappl yyqs7.; where &performancewindow; run; ods html path="&appl_root" file="2.&deliq..html"; goptions reset=all ftext='arial'; symbol1 i=j v=dot; axis1 label=('Bad rate'); proc catalog c=gseg kill; quit; title 'Bad rate development - current deliquency'; proc gplot data=&deliq; plot pct_row*datappl=&deliq / name='_2curdel' grid hreverse vaxis=axis1 hminor=0; run; quit; ods html close; Cvičení 159 /* analyza kohort */ %let target=bad30; %let date=dat30; data cohorts; set indata.accepts (keep=datappl bad: dat:); if &target then qtr=int(yrdif(datappl,&date,'act/act')*4)+1; datappl=intnx('month',datappl,0); format datappl mmyys7.; run; proc freq data=cohorts noprint; table datappl / out=cohorts1 (drop=percent rename=(count=counttotal)); table datappl*qtr / out=cohorts (drop=percent); run; data cohorts; merge cohorts cohorts1; by datappl; if first.datappl then cumpct=.; if qtr ne . 
then do; cumpct+(count/counttotal); output; end; run; ods html path="&appl_root" file='2.cohorts.html'; title "Cohort analysis for &target"; proc tabulate data=cohorts missing format=percent8.4; class datappl qtr; var cumpct; table datappl,qtr*cumpct=''*sum=''; run; ods html close; Cvičení 160 /* performance window */ %let performancewindow='31dec2002'd>=datappl; proc tabulate data=indata.accepts out=brdev (drop=_type_ _table_ _page_); class datappl; var bad90 bad60 bad30; table datappl,(bad90 bad60 bad30)*mean*format=percent8.2; format datappl yyqs7.; where &performancewindow; label datappl='Month opened'; run; ods html path="&appl_root" file='2.perf.html'; goptions reset=all ftext='arial'; symbol1 i=j v=dot; axis1 label=('Bad rate'); proc catalog c=gseg kill; quit; title 'Bad rate development'; proc gplot data=brdev; plot (bad:)*datappl / name='_2perf' grid overlay legend hreverse vaxis=axis1 hminor=0; run; quit; ods html close; Cvičení 161 /* bad rate development */ %let samplewindow='30jun2001'd>=datappl>='01apr2001'd; %let samplewindow='31dec2001'd>=datappl; proc freq data=indata.accepts noprint; table dat60 / out=development missing; format dat60 mmyys7.; where &samplewindow; run; data development; set development; if _n_>1 then do; dat60=intnx('month',dat60,0); cum_pct+percent; output; end; label datappl='Month of opening'; run; ods html path="&appl_root" file='2.badratedev.html'; goptions reset=all ftext='arial'; symbol1 i=j v=dot; axis1 label=('Bad rate'); proc catalog c=gseg kill; quit; title 'Bad rate development'; proc gplot data=development; plot cum_pct*dat60 / name='_2brd' grid; run; quit; ods html close; Cvičení 162 /* BRDEV macro */ %macro brdev(data,out,datevar,targetvar,samplewindow); proc freq data=&data noprint; table &datevar / out=&out missing; format &datevar mmyys7.; where &samplewindow; run; data &out (keep=date cum_pct); set &out; if _n_>1 then do; date=intnx('month',&datevar,0); cum_pct+percent; output; end; format date mmyys7.; run; %mend brdev; %let samplewindow='30jun2001'd>=datappl>='01apr2001'd; %brdev(indata.accepts,development,dat60,bad60,&samplewindow) /* several bad rate development */ %let samplewindow='30jun2001'd>=datappl>='01apr2001'd; %brdev(indata.accepts,development30,dat30,bad30,&samplewindow) %brdev(indata.accepts,development60,dat60,bad60,&samplewindow) %brdev(indata.accepts,development90,dat90,bad90,&samplewindow) data developmentsev; set development30 (in=__30) development60 (in=__60) development90; if __30 then type='30'; else if __60 then type='60'; else type='90'; Run; Cvičení data anno; function='label';x=20;y=2;text='Sample window';output; size=2;function='move';x=10;y=2.5;output; function='draw';x=30;y=2.5;output; function='move';x=20;y=3.5;output; function='draw';x=140;y=3.5;output; run; ods html path="&appl_root" file='2.badratedev_several.html'; goptions reset=all ftext='arial'; symbol1 i=j v=dot; axis1 label=('Bad rate'); proc catalog c=gseg kill; quit; title 'Several bad rates development'; proc gplot data=developmentsev annotate=anno; plot cum_pct*date=type / grid vminor=0 name='_2brds' vaxis=axis1; format date mmyys5.; label date='Performance window'; run; quit; ods html close; 163 Cvičení /* Roll rate analysis */ ods html path="&appl_root" file='2.roll_rate.html'; proc format; value $deliq (notsorted) '0'=' no deliquency' '3'='30 days' '6'='60 days' '9'='90+ days'; run; proc tabulate data=indata.accepts out=rollrate missing; class curdeliq worstdeliq; tables worstdeliq,curdeliq*rowpctn; format curdeliq $deliq. 
worstdeliq $deliq.; title 'Roll rate analysis'; run; proc gchart data=rollrate; hbar3d worstdeliq / sumvar=pctn_01 subgroup=curdeliq nostats clipref autoref raxis=axis1; axis1 label=none minor=none; run; quit; ods html close; 4. Příprava dat II 164 Development Stage 3: Development Database Creation 165 Development Sample Specification  Development sample spec. means specifying what we need in the database we will use for development. We are not going to take a dump of everything from the CDW or datamart.  Make the development process manageable and efficient:  list of characteristics (or “variables” to be considered for devp. You don’t want to have the entire DW.)  sample sizes (for each segment and category. No point regressing on 100k when 3k will suffice.)  parameters from previous section.  Do all this bearing in mind the number of scorecards you want developed and for which segments. 166 Characteristic Selection  Expected predictive power  Reliability: (is this manipulated? or prone to be manipulated?, e.g. salary. Check historical data - cannot be confirmed or too expensive to confirm. Can it be interpreted e.g. occupation/industry type is the worst cases. Do poeple usually leave this blank.)  manipulation (non-confirmable)  interpretation (present and future)  missing  Legal issues (Cant ask/get some info?.. Might get into trouble with some?) 167 How do you select characteristics? Reinforce: there is a need for some thought to be put into process in selecting characteristics .. You get together with risk, mktg, product. And get operations areas such as collections aboard (WHO knows your bad guys better than anyone else?) Characteristic Selection  Ease in collection  Do you want to spend time chasing missing info for a credit card?… may be OK for a mortgage. How easy it is to get this piece of info?  Policy rules  Don’t include anything that is unchangeable PR, e.g. bankruptcy. If you are going to decline all bankrupcy, no need to use it in scorecard.  Derived variables – ratios  Can do a lot of ratios .. But put some business thought into it.  Future direction.  Will this info be collected in the future (e.g. app form redesign)?  Industry direction - not relevant today but will change. can include in card or collect for future e.g. higher credit lines. Talk to credit bureaus industry trend and how they affect the scorecard. 168 What are you doing: you’re looking at objectives, company operations, business knowledge, ground realities etc. This is not just a stats exercise!!! Sampling  Development, validation  70:30, 80:20  If sample is small, do 100%, but validate with several 50–80%.  Good, bad, reject  2000 of each (or higher)  Oversampling (oversampling is common when modeling rare events … it leads to better predictions)  Proportional sample – not recommended for low bad rates.  Take what you got for bads and sample the goods.  Ensure that each group has sufficient numbers for meaningful analysis. 169 Data Collection and Database Construction  Random and representative  for each segment applicants (and accounts)  One for unsegmented (to measure lift from segmentation)  Data quirks, changes (preferably documented)  e.g. code for renters changed from R to E .. Stopped collecting some data item, new data fields, started collecting data recently etc. etc.  Objective: Data collected, as specified. 
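Section 5 continues with hierarchical clustering. A minimal sketch of a Ward-linkage clustering and a simple cluster profile in SAS follows; the table name CUSTOMERS and the identifier CUST_ID are hypothetical, while AGE, INCOME and DEBT mirror the variables used in the course exercises:

/* standardize the inputs so that no single variable dominates the distances */
proc stdize data=customers method=std out=customers_std;
   var age income debt;
run;

/* hierarchical clustering with Ward's method */
proc cluster data=customers_std method=ward outtree=tree noprint;
   var age income debt;
   id cust_id;
   copy age income debt;
run;

/* cut the dendrogram into, say, 4 clusters */
proc tree data=tree nclusters=4 out=clusters noprint;
   copy age income debt;
run;

/* profile: within-cluster means of the (standardized) inputs;
   values far from 0 mark the features that describe each cluster */
proc means data=clusters mean std maxdec=2;
   class cluster;
   var age income debt;
run;

Comparing each cluster mean with the overall mean (0 after standardization) identifies the combination of features that uniquely describes the cluster, as discussed above.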
170 Adjusting for Prior Probabilities  When oversampling  Adjust to actual:  Approval rate  Bad rate  Analysis and reports reflect reality  Do not need if you only want to know relationships or rank ordering. Rejects 2,950 Bads 874 Goods 6,176 Accepts 7,050 Through-the-door 10,000 171 Adjusting for Oversampling  Separate sampling is standard practice (helps when you just did ‘bad’ definition)  Prior probabilities must be known  Can adjust before fitting the model or after.  Two ways:  Offset  Sampling weights (frequency variable). 172 Offset Method  Logit (pi)=β0+ β1x1+ ….+ βkxk  When oversampling, logits shifted by the offset:  Logit (p*i)= ln (ρ1π0 / ρ0π1) + β0+ β1x1+ ….+ βkxk  Where  ρ1 and ρ0= proportion of target classes in the sample  π1 and π0= proportion of target classes in the population. 173 Offset Method  Adjustment post-model (after model development):  p^i = (p^*iρ0π1) / [(1 - p^*i) ρ1π0 + p^*iρ0π1)]  Where p^*i is the unadjusted estimate of posterior probability. 174 SAS Programs – Pre-model Adjustment data develop; set develop; off=(offset calc); run; proc logistic data=develop ...; model ins=……./ offset=off; run; proc score ….; p=1 / (1+exp(-ins)); proc print; var p ….; run; 175 ln (ρ1π0 / ρ0π1) SAS Program – Post-model Adjustment proc logistic data=develop...; run; proc score ... out=scored...; run; data scored; set scored; off = (offset calc); p=1 / (1+exp(-(ins-off))); run; proc print data=scored ..; var p ...; run; 176 Sampling Weights  Adjusts data to reflect true population  Weights: π1/ρ1 and π0/ρ0  Or set weight of bad=1 and weight of good = p(good)/p(bad) for population.  For example, p(bad)=4%, 2000 goods, 2000 bads. Sample will show 2000 bads and 48,000 goods.  Normalization causes less distortion in p values and standard errors.  Use FREQ variable in EM or calculate sample weight and use weight=sampwt in the LOGISTIC procedure. 177 SAS Program  When using the WEIGHT statement, some output is not correct. data develop; set develop; sampwt=( π0/ ρ0)* (ins=0) + ( π1/ ρ1)* (ins=1); run; proc logistic data=develop …; weight=sampwt; model ins=…….; run; 178 What Is the Difference?  The parameter estimates will be different.  When linear-logistic model is correctly specified, offset is better.  When logistic model is an approximation of some non-linear model, weights are better.  For scorecards, weighting is better since it corrects the parameter estimates used to derive scores (prior probabilities only affect the predicted probabilities). 179 Development Stage 4: Scorecard Development 180 Objective  Understand a methodology for developing and assessing risk scorecards.  Grouped attributes  Logistic regression  Reject inference  Scaled points. 181 Process Flow – Application Scorecard Explore Data Data Cleansing Initial Characteristic Analysis (Known Good Bad) Preliminary Scorecard (KGB) Reject Inference Initial Characteristic Analysis (All Good Bad) Final Scorecard (AGB) • Scaling • Assessment Validate 182 Process Flow – Behavior Scorecard Explore Data Data Cleansing Initial Characteristic Analysis (Known Good Bad) Final Scorecard • Scaling • Assessment Validate 183 Before you start …  Explore the data, visualize (Insight in SAS EM)  Distributions  mean, max/min, range, missing  Compare with overall portfolio distributions  Data integrity (any garbage, outliers)  Ensure data meets the data specifications done earlier.  Check that ‘0’s mean zero, not missing values. 
 Population stability check:  Month by month table of distribution for each predictor (e.g. 200701 men 55%, women 45%, 200702 men 57%, women 43%) 184 Missing Values and Outliers  Missing (ALL financial data has missing and garbage values)  Complete Case Analysis - Exclude everything with missing data .. In CS, you’ll end up with nothing .  Exclude characteristics or records with significant missing values  Group ‘missing’ as a distinct attribute -the weight of missing will tell you what missing contains. If it is close to neutral, good since it is random. Recommended – recognize that missing data has information value and may not be randomly missing. Find the value and use it. Plus, including missing ‘points’ in scorecard will take care of ppl who leave it blank.  Impute missing values – don’t use mean/most likely, model based on decision tree may be better.  Outliers (and mis-keys)  Exclude/replace records. 185 Missing Values  Missing data is not usually random  Missing data can be related to the target  New at job may leave yrs at empl blank  Low income or commercial customers leave income blank  Do bad customers leave certain fields blank?  Including and grouping missing data can answer this question. 186 Initial Characteristic Analysis  Analyze individual characteristics  Identify strong characteristics  Best differentiators between ‘good’ and ‘bad’  Screening  Select characteristics for regression (variable selection). 187 Initial Characteristic Analysis  Start by performing initial grouping for each characteristic and rank order Information Value (PROC DMSPLIT or SPLIT, or EM node)  Alternate: rank order characteristics by Chi Square or other method  Fine tune grouping for stronger characteristics  May want to perform other analysis prior to this (for example, use PC to identify collinear characteristics)  Some people use principal components (PROC VARCLUS) to identify which characteristics they need from each cluster. And then concentrate on the best out of each. 188 Criteria for Variable Selection  Predictive power of attribute: Weight of Evidence  Range and trend of WOE across attributes  Predictive power of characteristic: Information Value, Gini index(coefficient)  Operational/business considerations. 
189 Weight of Evidence Distr Distr Distr Age Count Count Goods Good Bads Bad Bad rate Weight Missing 50 3.00% 43 2.40% 8 4.10% 16% -55.497 18-22 200 10.00% 152 8.40% 48 24.90% 24% -108.405 23-26 300 15.00% 246 13.60% 54 28.00% 18% -72.039 27-29 450 23.00% 405 22.40% 45 23.30% 10% -3.951 30-35 500 25.00% 475 26.30% 25 13.00% 5% 70.771 35-44 350 18.00% 349 19.30% 11 5.70% 3% 122.044 44 + 150 8.00% 147 8.10% 3 1.60% 2% 165.509 Total 2,000 1,807 193 9.65% Information Value = 0.066 Distr Good Distr Bad/Ln x 100 190 Weight of Evidence  Measures strength of each (grouped) attribute in separating goods and bads  (Distr Good / Distr Bad) = odds of being good  Negative weight: more bads than goods  Logical trend  For age 23-26: WOE = ln (0.136 / 0.28) = -0.722 (x 100 = -72.2) 191 Information Value (Strength) Distr Distr Distr Age Count Count Goods Good Bads Bad Bad rate Weight Missing 50 3.00% 43 2.40% 8 4.10% 16% -55.497 18-22 200 10.00% 152 8.40% 48 24.90% 24% -108.405 23-26 300 15.00% 246 13.60% 54 28.00% 18% -72.039 27-29 450 23.00% 405 22.40% 45 23.30% 10% -3.951 30-35 500 25.00% 475 26.30% 25 13.00% 5% 70.771 35-44 350 18.00% 349 19.30% 11 5.70% 3% 122.044 44 + 150 8.00% 147 8.10% 3 1.60% 2% 165.509 Total 2,000 1,807 193 9.65% Information Value = 0.066 Distr Good - Distr Bad  x Weight Kullback, S., Information Theory and Statistics (1959) 192 Information Value  [(Distr Good - Distr Bad) x {ln (Distr Good / Distr Bad)}]  When figures used in decimals format (for example, 0.136).  Rule of thumb:  < 0.02: unpredictive  0.02 – 0.1: weak  0.1 – 0.3: medium  0.3 +: strong  Too strong? (IV>0.5) – use it in a controlled way (add them in the end of regression to see if they add any incremental value) 193 Grouping  Groups with similar WOE are put together  For continuous variables, groups are created so as to maximize difference from one group to next – and maintain logical trend for WOE  Why Group?  Easier way to deal with outliers with interval variables, and for rare classes  Format of the scorecard  Easy to understand relationships  Model non-linear dependencies with linear models  Control the process 194 Grouping of the demographic scorecard variable “age”. On the left pictures, the dependence of bad rate (smoothed using normal probability density function) on the variables is presented. On the right, the cumulative distribution function is presented. Vertical lines represent the borders between categories, horizontal red lines in the left picture represent the mean bad rate in categories, horizontal blue lines in the right picture represent the relative distribution of observations in the categories. 195 Grouping Logical Trend Predictive Strength -150 -100 -50 0 50 100 150 200 Missing 18-22 23-26 27-29 30-35 35-44 44 + Age Weight 196 Logical Trend  Final weightings make sense.  Enables buy-in from risk managers.  Confirms business experience  young people are higher risk  higher debt service means higher risk  Reduces overfitting if done right – model overall trend, not quirks. Remember how long the scorecard has to last. This is not going to be used for the next campaign and then discarded.  Linear relationship not always true, but need trend to confirm, and back up with business experience. E.g. revolving open burden shows a ‘banana curve’ everywhere and is now accepted as that. People don’t try to make it straight. 
197 Logical Trend Predictive Strength -80 -60 -40 -20 0 20 40 60 80 100 Missing 18-22 23-26 27-29 30-35 35-44 44 + Age Weight 198 Obviously not a logical trend!!! Logical Trend Predictive Strength -150 -100 -50 0 50 100 150 200 Missing 18-22 23-26 27-29 30-35 35-44 44 + Age Weight 199 Which line shows logical trend? Both are logical. What’s the difference? Blue line shows good differentiation. Red line is flat, and this characteristic is likely very week and will be reflected in the IV. 200 Stability check Check the stability of grouping throughout the whole developmnet time window: Business Factors  Nominal values  group based on similar weight (for example, postal code, occupation)  investigate splits on urban/rural, regional  Breaks concurrent with policy rules  Sanity check. 201 List of information values of variables (predictors) No Character IV Rank Information Value 1 Max delinq L9M 1 0.176 2 Months since delinquent 2 0.176 3 Active contract (Y/N) 3 0.045 4 Average Delinquency L9M 4 0.087 5 Months since >10 dpd 5 0.144 6 Max delinq L3M 6 0.117 7 Average Delinquency L3M 7 0.108 8 Age of oldest contract 8 0.013 9 Number of months on collections as % total time on book 9 0.132 10 Months since >20 dpd 10 0.091 11 Months since >30 dpd 11 0.054 12 Num rejected applications L9M 12 0.033 13 Times 30+ dpd L9M 13 0.042 14 Total Payment L3M 14 0.018 15 Months since >40 dpd 15 0.030 16 Current balance as % of highest ever balance 16 0.048 17 Times 30+ dpd L3M 17 0.024 18 Payment Method 18 0.001 202 Variable Selection 203 Cvičení –profile /* 2b. Profiles */ %let input=income; %let groups=yes; %let n_groups=4; /* grouping 1 - kvantily */ proc rank data=indata.accepts (keep=&input) groups=&n_groups out=bins; var &input; ranks bin; run; proc summary data=bins nway missing; class bin; output out=bins (drop=_type_) min(&input)=start max(&input)=end; run; data bins; set bins; label=compress(put(start,best.))||' - '||compress(put(end,best.)); fmtname='__bin'; type='N'; run; proc format cntlin=bins; run; %macro profile(input,groups); /* Profile of &input according to BAD60 */ proc summary data=indata.accepts; class &input; output out=__bins (drop=_type_ rename=(_freq_=__n)) sum(bad60)=__n1; %if %upcase(&groups)=YES %then %do; format &input __bin.; %end; run; data __bins; set __bins end=__finish; if _n_=1 then do; __all_n=__n; __all_n1=__n1; __all_n0=__n-__n1; retain __all_n:; end; else do; __p=__n/__all_n; __n0=__n-__n1; __p1=__n1/__all_n1; __p0=__n0/__all_n0; __r1=__n1/__n; __r0=__n0/__n; __woe=log((__p0)/(__p1))*100; __all_iv+(__p0-__p1)*__woe/100; output; end; if __finish then do; call symput('groups',compress(put(_n_-1,best.))); call symput('iv',compress(put(__all_iv,8.4))); call symput('br',compress(put(__all_n1/__all_n,best.))); end; attrib __n label='N' __p label='%' format=percent8.1 __n1 label="N of Bad" __n0 label="N of Good" __p1 label="% of Bad" format=percent8.1 __p0 label="% of Good" format=percent8.1 __r1 label="Bad rate" format=percent8.1 __r0 label="Good rate" format=percent8.1 __woe label='WOE' format=8.2 &input label="Group of &input" ; drop __all:; Run; . . . 
204 data __chart (keep=&input __sub __n __p __r); set __bins (keep=&input __n0 __p0 __r0 __n1 __p1 __r1); length __sub $4; __sub="Good"; __n=__n0; __p=__p0; __r=__r0; output; __sub="Bad"; __n=__n1; __p=__p1; __r=__r1; output; attrib __n label='N' format=8.0 __p label='%' format=percent8.1 __r label='Rate' format=percent8.1 __sub label='Target' ; run; proc datasets nolist; delete gseg / memtype=catalog; quit; ods listing close; goptions reset=all ftext='arial' htext=1.5 ftitle='arial' htitle=2; proc gchart data=__chart; axis1 style=0; axis2 minor=none order=(0 to 1 by .25) label=none; axis3 minor=none label=none; axis4 minor=(n=4) label=none; where __sub="Bad"; hbar &input / discrete sumvar=__r noframe nostats maxis=axis1 raxis=axis3 autoref cref=graya0 clipref name="__1"; title "Bad rates"; run; where; hbar &input / discrete subgroup=__sub sumvar=__n noframe nostats maxis=axis1 raxis=axis3 autoref cref=graya0 clipref name="__2"; title "Bad / Good frequencies"; run; Quit; proc gchart data=__bins; hbar &input / discrete sumvar=__woe noframe nostats maxis=axis1 raxis=axis4 autoref cref=graya0 clipref name="__3"; title "Weight of evidence"; run; hbar &input / discrete sumvar=__p1 noframe nostats maxis=axis1 raxis=axis4 autoref cref=graya0 clipref name="__4"; title "Bad distribution"; run; quit; ods html path="&appl_root" file="5.profile.html" style=statdoc; proc report data=__bins nofs style(summary)=[htmlclass="Header"]; columns ("Attributes of &input" &input) ('Total' __n __p) ("Good" __n0 __p0) ("Bad" __n1 __p1) ('Measures' __r1 __woe); define &input / group; compute after; __r1.sum=&br; __woe.sum=.; endcomp; rbreak after / summarize; title "Bad / Good by &input"; footnote "IV=&iv (<0.02 unpredictive, <0.1 week, <0.3 medium, <0.5 strong, >0.5 over)"; run; goptions device=gif; proc greplay nofs; footnote; igout gseg; tc sashelp.templt; template l2r2; treplay 1:__1 2:__2 3:__3 4:__4 name="5_profil"; run; quit; title; footnote; ods html close; ods listing; %mend profile; %profile(&input,&groups) 205 Cvičení /*profile multiple characteristics at once*/ %model_profilevar ( data=data.accepts, interval=age income idratio , binary=sex phone client, ordinal=age_grp income_grp region, groups=5, target=bad30, rep_out=&appl_root ) 206 Cvičení 207 Cvičení 208 Cvičení 209 Cvičení 210 Cvičení 211 Cvičení 5. Úvod do shlukové analýzy (SA). Hierarchická SA 212 213 Úvod Shluková (klastrová, z angl. Cluster) analýza je metoda, která na základě informací obsažených ve vícerozměrných pozorováních roztřídí základní množinu objektů do několika relativně stejnorodých shluků. Uvažujeme datovou matici typu n x p, kde n je počet objektů a p je počet proměnných. Uvažujeme různé rozklady množiny n objektů do g shluků a hledáme takový rozklad, který je z určitého hlediska nejvýhodnější. Cílem je dosáhnout stavu, kdy objekty uvnitř shluku jsou si podobné co nejvíce a objekty z různých shluků si jsou podobné co nejméně. Unsupervised Learning  Metody shlukové analýzy patří mezi tzv. „unsupervised learning“ metody.  “Learning without a priori knowledge about the classification of samples; learning without a teacher.” Kohonen (1995), “Self-Organizing Maps” 214 Cluster Profiling  Cluster profiling can be defined as the derivation of a class label from a proposed cluster solution.  The objective is to identify the features, or combination of features, that uniquely describe each cluster. 
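A minimal sketch of cluster profiling, assuming each observation already carries a cluster assignment (for example the CLUSTER variable written by PROC TREE with NCLUSTERS=, or by PROC FASTCLUS); comparing the distribution of the inputs across clusters points to the features that distinguish them. The data set and input names are illustrative.

proc means data=work.clustered mean std maxdec=2;
   class cluster;            /* cluster assignment */
   var x1 x2 x3 x4;          /* inputs used for clustering */
run;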
215  Rozlišujeme tři základní shlukovací metody:  Hiearchické shlukování (hierarchical clustering),  Shlukování s předem neznámým počtem shluků - s možným překryvem shluků (overlapping clusters),  Shlukování do předem daného počtu shluků (partitive/partitioning methods).  Fuzzy shlukování – fuzzy shluky jsou definovány stupněm příslušnosti objektů do daných shluků. Types of clustering 216  Hiearchické shlukovací algoritmy:  Aglomerativní  Divisive  Partitioning algoritms:  K-means  K-medoids  Probabilistic  Density based  Grid-based algoritms  Constraint-Based Clustering  Evolutionary Algoritms  Scalable Clustering Algoritms  Klasifikace shlukovacích algoritmů 217 Hierarchical Clustering Agglomerative Divisive 218 Problems with Hierarchical Clustering error error error 219 Partitive Clustering reference vectors (seeds) X X X X Initial State observations Final State X X X X X X X X • The goal of partitive clustering is to minimize or maximize some specified criterion. 220 Problems with Partitive Clustering  Many partitive clustering methods a. make you guess the number of clusters present, b. make assumptions about the shape of the clusters, usually that they are (hyper)spherical, and c. are influenced by seed location, by outliers, even by the order the observations are read in.  It is impossible to determine the optimal grouping, due to the combinatorial explosion of potential solutions. 221 Problems with Partitive Clustering 222  The number of possible partitions of n objects into g groups is given by:  For example, the number of partitions of 50 observations into 4 clusters, N(50,4), is equal to 5.3 x 1028. N(100, 4) generates 6.7 x 1058 partitions. Complete enumeration of every possible partition, therefore, is generally impossible. Heuristic Search 1. Generate an initial partitioning (based on the seeds) of the observations into clusters. 2. Calculate the change in error produced by moving each observation from its own cluster to another. 3. Make the move that produces the greatest reduction. 4. Repeat steps 2 and 3 until no move reduces error. 223 224 Hierarchická shluková analýza  Je třeba zvolit:  jak měřit vzdálenost/(ne)podobnost mezi objekty (euklidovská,…)  do úvahy je třeba vzít typ dat (intervalová, nominální,…)  značnou roli také hraje souměřitelnost dat -> standardizace (z-skóre, …)  jak měřit vzdálenost/(ne)podobnost mezi shluky (wardova,…)  jak určit finální rozklad objektů do shluků Příklad aglomerativního hiearchického shlukování 225 obj X1 X2 X3 X4 A 100 80 70 60 B 80 60 50 40 C 80 70 40 50 D 40 20 20 10 E 50 10 20 10  Uvažujeme 5 objektů A,B,C,D a E popsaných čtyřmi proměnnými X1-X4.  Neprovádíme žádnou standardizaci.  Vzdálenost mezi objekty měříme pomocí euklidovské vzdálenosti.  Vzdálenosti mezi shluky měříme pomocí metody průměrné vzdálenosti (average linkage). Data: Matice vzdáleností: A B C D E A 0 0 0 0 0 B 40,00 0 0 0 0 C 38,73 17,32 0 0 0 D 110,45 70,71 78,10 0 0 E 111,36 72,11 80,62 14,14 0 Příklad aglomerativního hiearchického shlukování 226 1. krok:  V matici vzdáleností hledáme nejmenší hodnotu. V našem případě je to 14,1 (vzd. mezi D a E).  Sloučíme objekty D a E do shluku D’ , zmenšíme a přepočteme matici vzdáleností. 
Používáme metodu průměrné vzdálenosti, takže:     9,1104,1114,110 21 1 , 1 A D'' '      i j ji DA AD xxd nn D   4,711,727,70 21 1 '   BDD   35,796,801,78 21 1 '   CDD A B C D E A 0 0 0 0 0 B 40,00 0 0 0 0 C 38,73 17,32 0 0 0 D 110,45 70,71 78,10 0 0 E 111,36 72,11 80,62 14,14 0 A B C D' A 0 B 40,00 0 C 38,73 17,3 0 D' 110,90 71,41 79,36 0 Příklad aglomerativního hiearchického shlukování 227 2. krok:  V redukované matici vzdáleností hledáme nejmenší hodnotu. V našem případě je to 17,3 (vzd. mezi B a C).  Sloučíme objekty B a C do shluku B’ , zmenšíme a přepočteme matici vzdáleností.     35,397,3840 21 1 , 1 A B'' '     i j ji BA AB xxd nn D       375,753,794,71 2 1 6,801,721,787,70 22 1 , 1 ' B''' ''     Di j ji BD BD xxd nn D A B C D' A 0 B 40,00 0 C 38,73 17,3 0 D' 110,90 71,41 79,36 0 A B' D' A 0 B' 39,36 0 D' 110,90 75,39 0 A B C D E A 0 0 0 0 0 B 40,00 0 0 0 0 C 38,73 17,32 0 0 0 D 110,45 70,71 78,10 0 0 E 111,36 72,11 80,62 14,14 0 Příklad aglomerativního hiearchického shlukování 228 3. krok:  V redukované matici vzdáleností hledáme opět nejmenší hodnotu. V našem případě je to 39,3 (vzd. mezi A a B’).  Sloučíme objekty A a B’ do shluku A’ , zmenšíme a přepočteme matici vzdáleností.       23,8739,75290,110 3 1 62,8010,7811,7271,7036,11145,110 23 1 , 1 'A D''' ''      i j ji DA DA xxd nn D A B' D' A 0 B' 39,36 0 D' 110,90 75,39 0 A' D' A' 0 D' 87,23 0 A B C D E A 0 0 0 0 0 B 40,00 0 0 0 0 C 38,73 17,32 0 0 0 D 110,45 70,71 78,10 0 0 E 111,36 72,11 80,62 14,14 0 Pozor!!! Slučují se dva nestejně velké objekty a nelze tedy počítat obyčejný průměr průměrů! Příklad aglomerativního hiearchického shlukování 229 proc distance data=aaa method=euclid out=dist; var interval(X1 X2 X3 X4); id obj; run; proc cluster data=dist method=ave outtree=tree nonorm; id obj; run; proc tree data=tree horizontal; id obj; run; Příklad aglomerativního hiearchického shlukování 230 14,14DED 32,17BCD 'B 'D 'A 35,39' ABD 23,87'' DAD Příklad aglomerativního hiearchického shlukování 231 8,474,392,87  1,223,174,39  A B C D E A 0 B 40 0 C 38,73 17,321 0 D 110,45 70,71 78,10 0 E 111,36 72,11 80,62 14,14 0 obj X1 X2 X3 X4 A 100 80 70 60 B 80 60 50 40 C 80 70 40 50 D 40 20 20 10 E 50 10 20 10  Určili jsme tedy dva shluky A’ ={A, B, C} a D’ = {D, E}. What Is Similarity?  To illustrate the difficulties involved in judging similarity, consider your answer to the following question: Which is more similar to a duck, a crow or a penguin?  The answer to this question largely depends on how you choose to define similarity.  Volba míry (ne)podobnosti závisí na typu proměnných (nominální, ordinální, poměrové, intervalové, binární). 232 Principles of a Good Similarity Metric  The following principles have been suggested as the foundation of a good similarity metric: 1. symmetry: d(x,y)=d(y,x). 2. non-identical distinguishability: if d(x,y)0 then xy. 3. identical non-distinguishability: if d(x,y)=0 then x=y.  Most good metrics are also consistent with the triangle inequality: d(x,y)  d(x,z) + d(y,z). 233 The DISTANCE Procedure General form of the DISTANCE procedure:  A distance method must be specified (no default), and all input variables are identified by level. 
PROC DISTANCE DATA=SAS-data-set METHOD=similarity-metric ; VAR level (variables < / option-list >); RUN; 234 Více na: http://support.sas.com/documentation/cdl/en/statugdistance/61780/PDF/default/statugdistance.pdf The DISTANCE Procedure 235  Metody měření vzdálenosti v SASu: Method Range Type Accepting variables Method Range Type Accepting variables GOWER 0 to 1 sim all HAMMING 0 to n dis Nominal DGOWER 0 to 1 dis all MATCH 0 to 1 sim Nominal EUCLID 0 dis Ratio, interval, ordinal DMATCH 0 to 1 dis Nominal SQEUCLID  dis Ratio, interval, ordinal DSQMATCH 0 to 1 dis Nominal SIZE  dis Ratio, interval, ordinal HAMANN –1 to 1 sim Nominal SHAPE  dis Ratio, interval, ordinal RT 0 to 1 sim Nominal COV  sim Ratio, interval, ordinal SS1 0 to 1 sim Nominal CORR –1 to 1 sim Ratio, interval, ordinal SS3 0 to 1 sim Nominal DCORR 0 to 2 dis Ratio, interval, ordinal DICE 0 to 1 sim Asymmetric nominal SQCORR 0 to 1 sim Ratio, interval, ordinal RR 0 to 1 sim Asymmetric nominal DSQCORR 0 to 1 dis Ratio, interval, ordinal BLWNM 0 to 1 dis Asymmetric nominal L(p)  dis Ratio, interval, ordinal K1  sim Asymmetric nominal CITYBLOCK  dis Ratio, interval, ordinal JACCARD 0 to 1 sim Asymmetric nominal, ratio CHEBYCHEV  dis Ratio, interval, ordinal DJACCARD 0 to 1 dis Asymmetric nominal, ratio POWER(p,r)  dis Ratio, interval, ordinal SIMRATIO 0 to 1 sim Ratio DISRATIO 0 to 1 dis Ratio NONMETRIC 0 to 1 dis Ratio CANBERRA 0 to 1 dis Ratio COSINE 0 to 1 sim Ratio DOT  sim Ratio OVERLAP  sim Ratio DOVERLAP  dis Ratio CHISQ  dis Ratio CHI  dis Ratio PHISQ  dis Ratio PHI  dis Ratio Euclidean Distance  Euclidean distance gives the linear distance between any two points in n-dimensional space.  It is a generalization of the Pythagorean theorem.    k i iiE yxD 1 2 wx x1 x2 (x1, x2) (0, 0)   2 1 2 i ixh 236 City Block (Manhattan) Distance  The distance between two points is measured along the sides of a right-angled the triangle.  It is the distance that you would travel if you had to walk along the streets of a right-angled city.   d i iiM yxD 1 1 (x1,x2) (y1,y2) 237 Hamming Distance 1 2 3 4 5 … 17 Gene A 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1 Gene B 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1 DH = 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 = 5 Gene expression levels under 17 conditions (low=0, high=1)   d i iiH yxD 1 238 Power Distance r d i q iiP yxD   1 239  Minkowského metrika (r=q)  Hemmingova vzdálenost (r=q=1)  Euklidovská vzdálenost (r=q=2)  Čebyšovova vzdálenost (r=q->) Correlation Similar (+1) . . . . . . . . . . . . . Dissimilar (-1) . . . . . . . . . . . . . . . . .. . .. . . . . . No Similarity (0) 240 Density-Based Similarity  Density-based methods define similarity as the distance between derived density “bubbles” (hyper-spheres). similarity density estimate 1 (cluster 1) density estimate 2 (cluster 2) 241 Gower’s Metric  Gower’s is the only similarity metric that accepts any measurement level.      v j jj v j jjj Gower w dw D 1 1 ),( ),(),( yx yxyx   242 Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 623–637. 
 For nominal, ordinal, interval, or ratio variable:  For asymmetric nominal variable:  For nominal or asymmetric nominal variable:  For ordinal, interval, or ratio variable: 1),( yxj absentareandbothif,0),( presentisoreitherif,1),( jjj jjj yx yx   yx yx   jw …váha pro j-tou proměnnou jjj jjj yxd yxd if,0),( if,1),(   yx yx jjj yxd 1),( yx Další míry podobnosti 243 • Jaccardův koeficient • Diceův koeficient • Czekanowského koeficient       v j jj v j j v j j v j jj J yxyx yx D 11 2 1 2 1       v j j v j j v j jj D yx yx D 1 2 1 2 1 2       v j jj v j jj C yx yx D 1 1 )( ),min(2 Míry podobnosti pro binární data 244 • Koeficient souhlasu • Jaccardův koeficient • Diceův (Czekanowského) koef. • Yuleův koeficient Kat. objektu x 1 0 1 a b 0 c d Kategorie objektu w dcba da   cba a  cba a 2 2 bcad bcad   Míry (ne)podobnosti pro binární data 245 • Goodman-Kruskalovo lambda • Binární Lanceova-Williamsova míra nepodobnosti • Euklidovská vzdálenost • Bin. čtvercová euklid. vzdálenost (=Hammingova vzd.) ),max(),max()(2 ),max(),max(),max(),max(),max(),max( dcbadbcadcba dcbadbcadbcadcba   cba cb   2 cb cb Standardizace/normalizace  Před vlastním výpočtem vzdáleností je nanejvýš vhodné standardizovat (normalizovat) proměnné.  Důvodem je snaha o unifikaci měřítka a tím vyvážení vlivu jednotlivých proměnných.  Typicky:  Obecně: 246 scale location standard.   x x scale locationoriginal multiplyaddresult   result = final output value add = constant to add (ADD= option) multiply = constant to multiply by (MULT= option) original = original input value location = location measure scale = scale measure The STDIZE Procedure General form of the STDIZE procedure: PROC STDIZE DATA=SAS-data-set METHOD=method ; VAR variables; RUN; 247 Standardization Methods METHOD LOCATION SCALE MEAN mean 1 MEDIAN median 1 SUM 0 sum EUCLEN 0 Euclidean Length USTD 0 standard deviation about origin STD mean standard deviation RANGE minimum range MIDRANGE midrange range/2 MAXABS 0 maximum absolute value IQR median interquartile range MAD median median absolute deviation from median ABW(c) biweight 1-step M-estimate biweight A-estimate AHUBER(c) Huber 1-step M-estimate Huber A-estimate AWAVE(c) Wave 1-step M-estimate Wave A-estimate AGK(p) mean AGK estimate (ACECLUS) SPACING(p) mid minimum-spacing minimum spacing L(p) L(p) L(p) (Minkowski distances) IN(ds) read from data set read settings from data set "ds" 248„Z-skóre“ The Problem with Z-Score Standardization Standardization using the reciprocal of the variance can actually dilute the differences between groups! Source: Everitt et al. (2001) Before After 249 Cluster Preprocessing Before ACECLUS After ACECLUS 250  Řešením tohoto problému může být procedura ACECLUS (approximate covariance estimation for clustering) Více na: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_aceclus_sect002.htm The ACECLUS Procedure General form of the ACECLUS procedure: PROC ACECLUS DATA=SAS-data-set ; VAR variables; RUN; 251 Vzdálenost mezi shluky  Mimo určení jak měřit vzdálenosti mezi objekty uvnitř shluků je třeba definovat jak měřit vzdálenosti shluků mezi sebou.  Mezi základní metody patří metoda:  metoda nejbližšího souseda (single linkage),  metoda nejvzdálenějšího souseda (complete linkage),  metoda průměrné vazby (average linkage),  centroidní metoda,  Wardova metoda. 
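Relating to the standardization discussed above, a minimal sketch with illustrative data set and variable names: PROC STDIZE handles the standardization step before the distances are computed (METHOD=RANGE here; METHOD=STD gives the classical z-score and METHOD=MAD a robust alternative), followed by the distance and average-linkage steps used in the worked example.

proc stdize data=work.raw out=work.std method=range;
   var x1 x2 x3 x4;
run;
proc distance data=work.std method=euclid out=work.dist;
   var interval(x1 x2 x3 x4);
   id obj;
run;
proc cluster data=work.dist method=average outtree=work.tree noprint;
   id obj;
run;

The same effect can also be obtained inside PROC DISTANCE with the /STD= option, as in the exercise code later in this chapter.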
252 The CLUSTER Procedure The general form of the CLUSTER procedure:  The required METHOD= option specifies the hierarchical technique to be used to cluster the observations. PROC CLUSTER DATA=SAS-data-set METHOD=method ; VAR variables; FREQ variable; RMSSTD variable; RUN; 253 The CLUSTER Procedure, method=…  The METHOD= specification determines the clustering method used by the procedure. Any one of the following 11 methods can be specified for name:  AVERAGE | AVE requests average linkage (group average, unweighted pair-group method using arithmetic averages, UPGMA). Distance data are squared unless you specify the NOSQUARE option.  CENTROID | CEN requests the centroid method (unweighted pair-group method using centroids, UPGMC, centroid sorting, weighted-group method). Distance data are squared unless you specify the NOSQUARE option.  COMPLETE | COM requests complete linkage (furthest neighbor, maximum method, diameter method, rank order typal analysis). To reduce distortion of clusters by outliers, the TRIM= option is recommended.  DENSITY | DEN requests density linkage, which is a class of clustering methods using nonparametric probability density estimation. You must also specify either the K=, R=, or HYBRID option to indicate the type of density estimation to be used. See also the MODE= and DIM= options in this section.  EML requests maximum-likelihood hierarchical clustering for mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing proportions. Use METHOD=EML only with coordinate data. See the PENALTY= option for details. The NONORM option does not affect the reported likelihood values but does affect other unrelated criteria. The EML method is much slower than the other methods in the CLUSTER procedure.  FLEXIBLE | FLE requests the Lance-Williams flexible-beta method. See the BETA= option in this section.  MCQUITTY | MCQ requests McQuitty’s similarity analysis (weighted average linkage, weighted pair-group method using arithmetic averages, WPGMA).  MEDIAN | MED requests Gower’s median method (weighted pair-group method using centroids, WPGMC). Distance data are squared unless you specify the NOSQUARE option.  SINGLE | SIN requests single linkage (nearest neighbor, minimum method, connectedness method, elementary linkage analysis, or dendritic method). To reduce chaining, you can use the TRIM= option with METHOD=SINGLE.  TWOSTAGE | TWO requests two-stage density linkage. You must also specify the K=, R=, or HYBRID option to indicate the type of density estimation to be used. See also the MODE= and DIM= options in this section.  WARD | WAR requests Ward’s minimum-variance method (error sum of squares, trace W). Distance data are squared unless you specify the NOSQUARE option. To reduce distortion by outliers, the TRIM= option is recommended. See the NONORM option. 254 Supported Data Types Hierarchical Method Coordinate Data Distance Data Average Linkage Yes Yes Centroid Linkage Yes Yes Complete Linkage No Yes Density Linkage No Some Options EML Yes No Flexible-Beta Method No Yes McQuitty’s Similarity No Yes Median Linkage No Yes Single Linkage No Yes Two-Stage Linkage No Some Options Ward’s Method Yes Yes 255 Average Linkage  The distance between clusters is the average distance between pairs of observations. 
     K LCi Cj ji LK KL xxd nn D , 1 CK d(xi, xj) CL 256 Centroid Linkage  The distance between clusters is the squared Euclidean distance between cluster centroids and .Kx Lx 2 LKKLD xx  X X DKL CK CL 257 Complete Linkage  The distance between clusters is the maximum distance between two observations, one in each cluster. ),(maxK jiLKL xxdCjCiD  DKL CK CL 258 Density Linkage 1. Calculate a new distance metric, d*, using k-nearest neighbor, uniform kernel, or Wong’s hybrid method. 2. Perform single linkage clustering with d*.            )( 1 )( 1 2 1 ,* ji ji xfxf xxdd*(xi,xj) CK CL 259 Equal Variance Maximum Likelihood  The distance between clusters CK and CL is given by a penalized maximum-likelihood variant.       LLKKMM G LKM KL nnnnnnp P www nD lnlnln1ln                  DKL CK CL CM 260 Flexible-Beta  The distance between clusters CK and CL is a BETA scaled measure of the component distances.    bD b DDD KLJLJKJM    2 1 DJK DJLDKL CJ CL CK CM 261 McQuitty’s  The average distance between an external cluster, J, and each of the component clusters (CK and CL) DJK DJL CK CL CM CJ 2 JLJK JM DD D   262 Median Linkage  The average distance between an external cluster and each of the component clusters, minus the distance between the component clusters. DJK DJL CK CL CM CJ DKL 42 KLJLJK JM DDD D    263 Single Linkage  The distance between clusters is the distance between the two nearest observations, one in each cluster. ),(minK jiLKL xxdCjCiD DKL CK CL 264 Two-Stage Density Linkage  The same as density linkage except that a cluster must have at least “n” members before it can be fused. 2. Apply single linkage modal cluster K modal cluster L modal cluster K modal cluster L DKL 1. Form „modal‟ clusters 265 Ward’s  Ward’s method uses ANOVA at each fusion point to determine if the proposed fusion is warranted.          LK LK KL nn xx D 11 2 ANOVA ANOVA 266 The TREE Procedure General form of the TREE procedure:  The TREE procedure either  displays the dendrogram (LEVEL= option) or  assigns the observations to a specified number of clusters (NCLUSTERS= option). PROC TREE DATA= ; RUN; 267 Interpreting Dendrograms change in fusion level 268 Určení finálních shluků 269  Finální shluky získáme vhodným „řezem“ dendrogramu.  Neexistuje univerzální postup, vždy záleží na konkrétních datech a interpretovatelnosti výsledku.  Lze ale použít např. tento postup:  Označíme μi vzdálenosti shluků, které vznikli v průběhu shlukovacího algoritmu v okamžicích spojování objektů/shluků.  Spočteme ri= μi+1 – μi .  Spočteme max(ri ) a určíme tím místo, kde „říznout“. 26,16,3  Zdroj obrázků : L.Žák, Shluková analýza (II), http://www.volny.cz/elzet/Libor/Aut_cl_2.pdf 270 Cvičení Generování dat: c_data.sas • COMPACT : Three well-separated, compact clusters. Source : SAS/STAT User's Guide (Introduction to Clustering Procedures). • DERMATOLOGY : Differential diagnosis of erythemato-squamous disease. Source : Nilselliter, N. and Altay Guvenir, H. (1998) • ELONGATED : Two parallel elongated clusters in which the variation in one dimension is 6 times the variation of the other dimension. There are 150 members in each of the clusters, for a total of 300 observations. Source : SAS/STAT User's Guide (Introduction to Clustering Procedures). • FISH : Seven species of fish caught off the coast of Finland. 
Source : Data Archive of the Journal of Statistics Education • INVESTORS : ...(training data) • OUTLIERS: Create two clusters with severe outliers. • PIZZA : Nutrient levels of various brands of frozen pizza. Source: D.E. Johnson (1998), Applied Multivariate Methods for Data Analysis, Duxbury Press, Cole Publishing Company, Pacific Grove, CA. (Example 9.2) • RING : A normal cluster surrounded by a ring cluster. Source : SAS/STAT User's Guide (The MODECLUS Procedure - Examples). • STOCK : Dividend yields for 15 utility stocks in the U.S. for 1986-1990. Source : SAS/STAT User's Guide (The DISTANCE Procedure - Examples). • TINVESTORS : Investors data set (test data) • UNEQUAL : Generate three unequal variance and unqual size clusters. Source : SAS/STAT User's Guide. 271 Cvičení /* clus01d01: Generating distances. The sasuser.stock data set contains the dividend yields for 15 utility stocks in the U.S. The observations are names of the companies, and the variables correspond to the annual dividend yields over the period 1986-1990. */ options nodate nonumber; goptions reset=all; %let inputs = div_1986 div_1987 div_1988 div_1989 div_1990; /* display the input data set */ title 'Stock Dividends'; title2 'The STOCK Data Set'; proc print data=sasuser.stock; var company &inputs; run; /* calculate the range standardized Euclidean distance */ proc distance data=sasuser.stock method=euclid out=dist; var interval(&inputs/std=range); id company; run; /* display the distance matrix generated */ title2 'Euclidean Distance Matrix'; proc print data=dist; id company; run; 272 Cvičení/* generate hierarchical clustering solution (Ward's method)*/ proc cluster data=dist method=ward outtree=tree noprint; id company; run; /* display the EUCLID dendrogram horizontally */ title2 "Cluster Solution"; proc tree data=tree horizontal; id company; run; /* calculate the range standardized city block distance */ proc distance data=sasuser.stock method=cityblock out=dist; var interval(&inputs/std=range); id company; run; /* display the distance matrix generated */ title2 'City Block Distance Matrix'; proc print data=dist; id company; run; /* generate hierarchical clustering solution (Ward's method)*/ proc cluster data=dist method=ward outtree=tree noprint; id company; run; /* display the CITYBLOCK dendrogram horizontally */ title2 "Cluster Solution"; proc tree data=tree horizontal; id company; run; 273 Cvičení/* clus02d4: Impact of input standardization on clustering. This demonstration evaluates the impact on cluster performance of changing the method of input standardization. Several methods are ranked according to their Cramer's V value and their misclassification rate. PROC FASTCLUS is used to cluster the observations. The input data set is the pizza data set. The input variables are the three inputs recommended using by the PROC VARCLUS 1-R**2 criterion. 
*/ options nodate nonumber; %let group = brand; %let inputs = carb mois sodium; data results; length method$ 12; length misclassified 8; length chisq 8; length pchisq 8; length cramersv 8; stop; run; %macro standardize(dsn=, nc=, method=); … %mend standardize; %standardize(dsn=sasuser.pizza,nc=10,method=ABW(11)); %standardize(dsn=sasuser.pizza,nc=10,method=AGK(1)); 274 Cvičení%standardize(dsn=sasuser.pizza,nc=10,method=AHUBER(.1)); %standardize(dsn=sasuser.pizza,nc=10,method=AWAVE(.2)); %standardize(dsn=sasuser.pizza,nc=10,method=EUCLEN); %standardize(dsn=sasuser.pizza,nc=10,method=IQR); %standardize(dsn=sasuser.pizza,nc=10,method=L(1)); %standardize(dsn=sasuser.pizza,nc=10,method=L(1.5)); %standardize(dsn=sasuser.pizza,nc=10,method=L(2)); %standardize(dsn=sasuser.pizza,nc=10,method=MAD); %standardize(dsn=sasuser.pizza,nc=10,method=MAXABS); %standardize(dsn=sasuser.pizza,nc=10,method=MEAN); %standardize(dsn=sasuser.pizza,nc=10,method=MEDIAN); %standardize(dsn=sasuser.pizza,nc=10,method=MIDRANGE); %standardize(dsn=sasuser.pizza,nc=10,method=NONE); %standardize(dsn=sasuser.pizza,nc=10,method=RANGE); %standardize(dsn=sasuser.pizza,nc=10,method=SPACING(.9)); %standardize(dsn=sasuser.pizza,nc=10,method=STD); %standardize(dsn=sasuser.pizza,nc=10,method=SUM); %standardize(dsn=sasuser.pizza,nc=10,method=USTD); /* sort by number of misclassifications within Cramer's V */ proc sort data=results; by descending cramersv misclassified; run; /* display Cramer's V and misclassifications for each method */ title1 'Results'; proc print data=results; var method cramersv misclassified ; run; quit; 6. Vývoj CS modelu 275 Process Flow Explore Data Data Cleansing Initial Characteristic Analysis (KGB) Preliminary Scorecard (KGB) Reject Inference Initial Characteristic Analysis (AGB) Final Scorecard (AGB) • Scaling • Assessment Validate 276 Preliminary Scorecard (Known Good/Bad)  Group of characteristics, that together, offer the most predictive power  Logistic Regression (forward, backward, stepwise)  8–20 characteristics  stability. 277 Logistic Regression Logit (pi)=β0+ β1x1+ ….+ βkxk p – posterior probability of ‘event’ given inputs x – input variables β – parameters  Logit transformation is log of the odds, and is used to linearize posterior probability and limit outcome to between 0 and 1.  Maximum Likelihood used to estimate parameters.  Parameters estimates measure rate of change of logit for one unit change in input variable (adjusted for other inputs)  Depends on the unit of the input, therefore need to standardise (e.g. WOE) 278 Logistic Regression  Binary target (good/bad)  Variables  Raw data  Grouped data (for example, mid value of each group)  Weight of evidence 279 Logistic Regression  Forward Stepwise  Select best variable, add it to the model, and then add/subtract variables until no improvement in indicator.  Efficient, but weak when too many variables or high correlation  Backward Elimination  Start with all variables in the model, then eliminate least important variables.  Correlation is better taken care of  Better than stepwise, but can be computationally intensive. 280 Preliminary Scorecard  Choose the best – and build the most comprehensive risk profile  With as many independent data items as possible  independent data items representing different data types e.g. demog, financials, inquiries, trades info  10 characteristics with ‘100’ each preferred to 4 with ‘250’ each.  Correlation, co linearity etc. 
considered  Scorecard coherent with decision support structure  Sole arbiter or decision support tool: model needs to be coherent with overall decision support structure  Interpretability, implementability, and other business considerations. 281 Example of a Good Scorecard  Age  Residential status  Time at address  Inquiries 12 months 1)  Inquiries 3 months  Trades 90 days+ as % of total  Revolving balance/Total  Utilization  Number of products at bank  Delinquency at bank  Total Debt Service Ratio 282 1) Počet žádostí o úvěr za posledních 12 měsíců  contains some demographics, some inquiries, some trade, some utilization, internal bank perforamnce and capacity to pay. How Do We Get There?  Try statistically optimal approach (let the data speak)  “Design” a scorecard using stepwise/backward  Force characteristics in, or fix at each loop and adjust the hurdle rate  Consider:  “must have”  Weaker/stronger  Similar 283 Weaker, Similar  Weaker – consider first  Can 2 characteristics worth 40 points each model behavior better than one worth 70?  Same strength, broader base  Similar – put together  Time related, inquiries, trades, debt capacity, demographic  Takes care of correlation 284 Putting It Together  Try different combinations of characteristics in regression  Instead of putting all characts in, separate into categories, and try combinations.  Leave very strong characteristics out, or use at the end (for example, bureau scores)  Example “levels”  Weaker application info  Stronger application info  Weaker bureau  Stronger bureau  Mix and adjust with experience. 285 Putting It Together  Age, time at address, time at employment, time at bank  Region, postal code, province  TDSR, GDSR, capacity, Loan To Value  Time at bureau, current customer (Y/N)  Inq 3 months, inq 6 months, inq 12 months, inq 3/12 months  Trades delq, trades 3 mth/total, current trades  Utilization, public records  Bureau score, bankruptcy. 286 GDSR( Gross Debt Service Ratio) = (Annual Mortgage Payments + Property Taxes + Other Shelter Costs)/(Gross Family Income) TDSR (Total Debt Service Ratio) = (Annual Mortgage Payments + Property Taxes + Other Shelter Costs + Other Debt Payments)/(Gross Family Income) Logistic Regression  Use stepwise or backward  stepwise means dominating variable will stay in.  Backward: set of weak variables may end up staying.. That together add value (sometimes better than stepwise).. Also backward takes care of correlation better than others.  Modify to consider only selected characteristics at each “level”  series of regression runs, each as one “level”, force selected characteristics from previous “levels” in.  EM nodes in series. 287 • It is strongly recommended that all coefficients are logical. If some of them are not, include comments with explanation why it is good to keep them in scorecard. Include column, where it is easy to see the contribution of each category (either scaled scorepoints for linearization, either simply bi*xi*1000). Order categories in each predictor according to badrate (WOE) so that the worst are the first. 288 Logistic Regression 289 LR – scorecard example Own / Rent <5 12 Prof 50 <.5 2 None 0 Check 5 <15 22 0 3 <.5 0 0 5 0-15% 15 Years at address Occupation Dept St / Major CC Bank reference Debt ratio No. 
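A minimal sketch of the regression step described above: stepwise logistic regression on WOE-coded characteristics with a binary good/bad target. All data set and variable names are illustrative (the *_fr_w naming mirrors the example output that follows); the INCLUDE= option can be used to force the first n listed characteristics into the model, matching the "force characteristics in" approach described earlier.

proc logistic data=work.kgb descending;
   /* bad60 = 1 marks a bad; WOE-coded characteristics as inputs */
   model bad60 = age_fr_w income_fr_w time_on_job_fr_w ident_card_age_fr_w
         / selection=stepwise slentry=0.05 slstay=0.05;
run;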
of recent inquiries Years in file # Rev trades outstanding % Credit line utilization Worst reference Rent 15 .5-2.49 10 SemiPrf 44 .5-1.49 8 Dept-St 11 Sav 10 15-25 15 1 11 1-2 5 1-2 12 16-30% 5 Other 10 2.5-6.49 15 Mgr 31 1.5-2.49 19 Maj-CC 16 Ck&Sav 20 26-35 12 2 3 3-4 15 3-5 8 31-40% -3 NI 17 6.5-10.49 19 >10.49 23 NI 14 Offc. 28 2.5-5.49 25 Both 27 Other 11 36-49 5 3 -7 5-7 30 6+ -4 41-50% -10 Bl.Col 25 5.5-12.49 30 No answr 10 NI 9 50+ 0 4 -7 8+ 40 >50% -18 5-9 -20 No Rcrd 0 NI 13 NI 12 Retired 31 Other 22 NI 27 12.5 39 Retired 43 NI 20 Years on job Obs colnamew colvalue Bi_x_Xi_x_1000 bad_rate freq_rate Bi Xi 1 Intercept 3129 . . 3.129451 . 3 age_fr_w 20 -359 10.3 3.5 0.377612 -0.950967 4 age_fr_w 29 -154 6.3 28 0.377612 -0.408556 5 age_fr_w 32 -47 4.8 8.4 0.377612 -0.123244 6 age_fr_w 36 20 4.1 10 0.377612 0.05253 7 age_fr_w 41 48 3.8 11 0.377612 0.12723 8 age_fr_w 51 173 2.7 23 0.377612 0.458154 9 age_fr_w 60 327 1.8 16 0.377612 0.865979 10 car_owner_fr_w 0 -60 4.5 76 1.179055 -0.051044 11 car_owner_fr_w 1 211 3.6 24 1.179055 0.179078 12 child_num_fr_w 99 -59 4.9 1.9 0.424104 -0.138759 13 child_num_fr_w 0 -33 4.6 60 0.424104 -0.078474 14 child_num_fr_w 1 34 4 27 0.424104 0.080184 15 child_num_fr_w 2 134 3.2 11 0.424104 0.315117 16 education_fr_w 5 -174 6.1 4.3 0.453663 -0.384073 17 education_fr_w 4 -79 5 3.3 0.453663 -0.174838 18 education_fr_w 2 -10 4.4 73 0.453663 -0.021929 19 education_fr_w 36 112 3.4 19 0.453663 0.247396 290 LR – scorecard example  For all predictors in scorecard estimated coefficient, wald chi-square, pvalue (e.g. in SAS output Analysis of maximum likelihood estimate), summary of predictor selections (order of predictors entering the model, e.g. in SAS output summary of stepwise selection) Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Wald Pr > ChiSq Error Chi-Square Intercept 1 3.1295 0.0106 86436.032 <.0001 age_fr_w 1 0.3776 0.0254 221.2018 <.0001 car_owner_fr_w 1 1.1791 0.1093 116.2664 <.0001 education_fr_w 1 0.4537 0.066 47.2419 <.0001 fam_state_fr_w 1 0.4167 0.0297 196.864 <.0001 goods_group_fr_w 1 0.2716 0.0311 76.3139 <.0001 child_num_fr_w 1 0.4241 0.0867 23.9081 <.0001 ident_card_age_fr_w 1 0.5677 0.0269 446.5471 <.0001 291 Logistic Regression Summary of Stepwise Selection Step Effect DF Number Score Wald Pr > ChiSq Entered Removed In Chi- Square Chi- Square 1 price1_fr_w 1 1 3858.056 <.0001 2 age_fr_w 1 2 3007.7932 <.0001 3 init_pay_by_price1_f 1 3 1434.1863 <.0001 4 ident_card_age_fr_w 1 4 868.4661 <.0001 5 type_suite_fr_w 1 5 800.2294 <.0001 6 time_on_job_fr_w 1 6 554.571 <.0001 7 mobile_phone_fr_w 1 7 342.7936 <.0001 8 sex_fr_w 1 8 357.5026 <.0001 9 fam_state_fr_w 1 9 331.6358 <.0001 10 ident_type2_fr_w 1 10 358.4905 <.0001 11 weekend_fr_w 1 11 323.0631 <.0001 12 region_fr_w 1 12 299.3854 <.0001 292 Summary of predictor selection Good Scorecard?  Eye Ball Test  Point allocation logical, no flips (after scaling)  “Flips” occur for several reasons: low count, correlation.  Scorecard characteristics make sense  what went in, what did’t., does it cover all the major categories of information?  Misclassification  Strength  Validation. 293 Scorecard Development  Build more complex models and compare predictiveness – if difference not significant, then scorecard is OK  Examine findings – is there a valid business reason?  Build several ‘different’ scorecards 294 What Have I Just Done? 
 “Designed” a scorecard  Used regression, with business considerations  Stable, represents strong major/independent information categories  Measurable strength and impact  Something a risk manager can buy and use.  Used only known goods and bads (that is, approves)  But need to apply scorecard on all applicants. 295 Declined Reject Inference Bad Good Total Applicants 296  Everything to this point has been for known performance - e.g. approval rate is 60%, building a model for 100% of the population based on 60% sample is not accurate. Reject Inference  Inferring the behavior of declined applicants Bad Good 297  This is where you need to get to: so need to create a sample representative of the “through the door” or entire applicant pop performance - 100% approval rate. The Known Good Bad Picture Rejects 2,950 Bads 874 Goods 6,176 Accepts 7,050 Through-the-door 10,000 ? 298 Reject Inference  Make the scorecard relevant  ignoring rejects distorts model  Influence of past decision making  For decision making  Get population odds  Expected performance  Swap set. Old scorecard Approve Decline New Approve A B Scorecard Decline C D 299 A – is approved goods B – is rejected goods C – is approved bads D – is rejected bads Where?  Medium/low approval rates  a 95% approval rate is close to “through the door”  Manual adjudication environment  Incorporates experience/intuition based overriding  “cherry picking” distorts performance. 300 Reject Inference Techniques  “True” Performance  “Nearly True” Performance  Statistical Inference  Or ignore the problem  Assume accepts = total population  not recommended unless previous credit granting was random or scorecard was perfect ( assume all rejects = bad). 301 “True” Performance  Approve every applicant  Or random sample  Expensive … but the only true way to determine performance of below cutoff applicants. cutoff ApproveAll 13 Sample 2 ApproveAll 302 “Nearly True” Performance 1  Bureau data  performance of declined apps on similar products with other companies  legal issues  difficult to implement in practice – timings, definitions, programming • Need consent to get bureau at any time • data - if u rejected them, they probably were rejected elsewhere • timings - performance window, sample window must be consistent • bad definition must be closely replicated • product must be similar - credit cards, unsecured line of credit with similar limit and conditions as you would have given • Experience - Programming effort is tremendous, depending on how detailed credit bureau reports are Declined - got credit elsewhere Jan 99 Analyze performance Dec 00 303 “Nearly True” Performance 2  In-house data  performance of declined apps on similar products, for example, credit cards/line of credit  timings, definitions my cause problems. • data - if u rejected them for a lower level product, they probably were rejected for higher one .. HOWEVER, in multiple product environments, scorecards are not always aligned and there is “ARBITRAGE”. 
• timings - performance window, sample window must be consistent • bad definition must be closely replicated • product must be similar - credit cards, unsecured line of credit with similar limit and conditions as you would have given Analyze performance Dec 00 Declined - got similar products Jan 99 304 Bureau Score Migration  Analyze bureau score migration of existing accounts with below cutoff scores  Identify accounts whose scores migrate to ‘above cutoff’ within specified time frame 305 Reclassification  Build an accept/reject model  Score all rejects and designate worst as accepted ‘bad’  Can use score or “serious derogatory” information to select accounts  Analyze Accept/Reject vs. Good/Bad cross tabs  Add to accepts and Re-model 306 Simple Augmentation  Simple Augmentation  Build good/bad model  Score rejects – establish a p(bad) to assign class  Add to Accepts and re-model  Simple  Arbitrary cutoff to assign goods and bads  Good/Bad model needs to be very good  No adjustment for p(approve). 307 Augmentation 2  Augmentation 2 (Coffman, Chandler 1977)  Build accept/reject model, obtain p(accept)  Build good/bad model  Adjust case weights of good/bad model to reflect probability of acceptance  Recognizes need to adjust for p(approve). 308 Parceling  Parceling (also called re-weighting)  score rejects with G/B model  split (randomly) rejects into proportional G and B groups. Score # Bad # Good % Bad % Good 0-99 24 10 70.3% 29.7% 100-199 54 196 21.6% 78.4% 200-299 43 331 11.5% 88.5% 300-399 32 510 5.9% 94.1% 400+ 29 1,232 2.3% 97.7% Reject 342 654 345 471 778 Rej - Bad 240 141 40 28 18 Rej - Good 102 513 305 443 760 continued... 309 Parceling  But ..  Reject bad proportion cannot be the same as approved?  Allocate higher proportion of bads from reject  Rule of thumb: bad rate for rejects should be 2–4 times that of approves.  Quick and simple  Good/Bad model better be good  May understate rejected bad rate. 310 Iterative Reclassification  Iterative Reclassification (McLachlan, 1975)  Build good/bad model using accepts  Score rejects and assign class based on chosen p(bad) cutoff  Rebuild model with combined dataset  Score rejects and re-assign class  Repeat until parameter estimates (and p(bad)) converge.  Can be modified for p(good) and p(bad) target assignment. 311 Iterative Reclassification  can be done as a plot of ln (odds) versus score. lnOdds Score KGB Iteration 1 Iteration 2 312 Fuzzy Augmentation  Step 1: Classification  Build good/bad model  Score rejects with G/B model  Do not assign a reject to a class  Create 2 weighted cases for each reject, using p(good) and p(bad). 313 Fuzzy Augmentation  Step 2: Augmentation  Combine rejects with accepts, adjusting for approval rate  For this, weigh rejects again: weight determines how much more frequent an actual case is compared to an inferred case in the augmented dataset  Freq of a ‘Good’ from rejects = p(good) x weight  Step 3: Remodel. 314 EM users: This is in the EM RI node. Freq= p(good) x (reject rate/approval rate) x (#accepts/#rejects) # rejects/accepts are proportional to actual population I.e. weighted, not raw counts Fuzzy Augmentation  No need for arbitrary classification cut-off  Augmentation step: better approach for choosing the importance of rejects. 
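A minimal sketch of the fuzzy-augmentation weighting just described; every name here is assumed (accepts with observed bad60, rejects scored with p_bad by the known-good/bad model, and adj standing for the approval-rate adjustment of the reject weights).

%let adj = 1;   /* replace with the reject-rate/approval-rate adjustment */
data work.agb;
   set work.accepts (in=a) work.rejects;
   if a then do;
      w = 1;                                        /* known performance */
      output;
   end;
   else do;
      w = (1 - p_bad) * &adj; bad60 = 0; output;    /* inferred good */
      w = p_bad * &adj;       bad60 = 1; output;    /* inferred bad  */
   end;
run;

The augmented data set is then remodeled with these weights, for example through a WEIGHT statement in PROC LOGISTIC.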
315 Nearest Neighbor (Clustering)  Clustering  Create 2 sets of clusters: goods and bads  Run rejects through both clusters  Compare Euclidean distances to assign most likely performance  Combine accepts and rejects and re-model  Measures are relative  Adjustment for p(approve) can be added at augmentation step.  Can also use Memory-based Reasoning. 316 Other Techniques  Heckman’s Correction  http://ewe3.sas.com/techsup/download/stat/heckman.html  Heckman, James. "Sample Selection Bias as a Specification Error", Econometrica, Vol 47, No 1., January 1979, pp. 153-161.  Greene, William. "Sample Selection Bias as a Specification Error: Comment", {\sl Econometrica}, Vol. 49, No. 3, May 1981, pp. 795-798.  Mixture Decomposition  B.S Everitt and D.J. Hand, Finite Mixture Distributions (London:Chapman & Hall, 1981) 317 Verification  Compare bad rates/odds for known versus inferred, and use rule of thumb.  Review bad rates/weight of evidence of pre- versus post inference groupings.  Create “fake rejects” and test.  assign some accepted accounts as rejects with an artificial cutoff and test methods. 318 Factoring – Post Inference Bads 914 Goods 2,036 Rejects 2,950 Bads 874 Goods 6,176 Accepts 7,050 Through-the-door 10,000 Bad rate = 30.98% Bad rate = 12.4% 319  After rejects have been inferred, we build the post-inference data sets for the final scorecard production.  So the sample bias is solved and you can apply the scorecard on the entire population. Process Flow Explore Data Data Cleansing Initial Characteristic Analysis (KGB) Preliminary Scorecard (KGB) Reject Inference Initial Characteristic Analysis (AGB) Final Scorecard (AGB)Validate 320 Final Scorecard  Repeat Exploration, Initial Characteristics Analysis and Regression for “All Good Bad” data set  Scaling  Assessment  Misclassification  Strength. 321 Scorecard Scaling (conversion into points)  Why scale?  Implementation software – batch versus on-line  Marketing uses (off line selection, build retention model, score and isolate account numbers) vs. online decision support and app processing software  Ease of understanding and interpretation  End user can deal with points easier than weights  Continuity  previous scorecards were grouped/scaled . .and you want to have the same format and scaling.  Legal requirements  legal requirements to identify characteristics and reasons for decline  Components  Odds at a score  Points to double the odds  Example: Odds of 20:1 at 200, and odds double every 20 points. 322 Scorecard Scaling  This is the transformation from parameter estimates to scores.  Result: get a score card with discrete points, related to each other and the final score related to odds.  odds doubling every 20 points Score Odds 200 20 201 23 202 25 203 26 . . 220 40 . 
240 80

323 Scorecard Scaling: example of a scaled scorecard
Age: 18-24 = 10, 25-29 = 15, 30-37 = 25, 38-45 = 28, 46+ = 35
Time at Res: 0-6 = 12, 7-18 = 25, 19-36 = 28, 37+ = 40
Region: Major Urban = 20, Minor Urban = 25, Rural = 15
Inq 6 mth: 0 = 40, 1-3 = 30, 4-5 = 15, 6+ = 10

Scorecard Scaling
In general:
Score = A + B log(odds)
Score + PDO = A + B log(2 * odds)
 Offset A and factor B are to be calculated
• Odds = the odds at which a score is fixed
• Score = the score at that point
• PDO = points to double the odds

324 Scorecard Scaling
 Solving for PDO: PDO = B log(2), therefore B = PDO/log(2) and A = Score - B log(Odds), with log the natural logarithm.
Example: odds of 50:1 at score 600 and PDO = 20
B = 20/log(2) = 28.8539
A = 600 - 28.8539 log(50) = 487.123
Score = 487.123 + 28.8539 log(odds), or equivalently log(odds) = -16.88239 + 0.03465 * Score

325 Scorecard Scaling
points_i = -\left(woe_i \cdot \beta_i + \frac{a}{n}\right) \cdot factor + \frac{offset}{n}
 The points for each attribute are calculated by multiplying the weight of evidence of the attribute by the regression coefficient of its characteristic, adding a fraction of the regression intercept, multiplying this by -1 and by the factor, and finally adding a fraction of the offset.
326  The negative sign is there because we switch from bad/good in modeling (regression) to good/bad in scaling (high scores being better than low scores). The factor and offset are B and A from the previous slides.

Scorecard Scaling
score = \log(odds) \cdot factor + offset = \left(\sum_{i=1}^{n} woe_i\,\beta_i + a\right) \cdot factor + offset = \sum_{i=1}^{n}\left(\left(woe_i\,\beta_i + \frac{a}{n}\right) \cdot factor + \frac{offset}{n}\right)
 β_i = regression coefficient of the characteristic
 woe_i = weight of evidence of the attribute
 n = number of characteristics
 a = intercept

327 Check Points Allocation
Age      Weight    Scorecard 1   Scorecard 2
Missing  -55.50    16            16
18-22    -108.41   12            12
23-26    -72.04    18            18
27-29    -3.95     26            14
30-35    70.77     35            38
35-44    122.04    43            44
44+      165.51    51            52
328  Scorecard 1 looks OK: a logical distribution, as age increases the points increase according to the weights. But Scorecard 2 does not. Why? Correlation? A quirk in the data? The grouping? Perhaps the weights were too close together and did not differentiate enough: repeat the grouping with more distinct groups and repeat the regression.

FICO is a unified score that can be obtained from your own score by a linear transformation. The aim is to compute these transformation coefficients for every scorecard, because then the quality of portfolios can be compared. If your development data are old enough that ever 90 DPD @ 12 MOB can be observed, take a random sample (30,000 observations) from them; if not, take older data and score them with your new scorecard. Make a table following the example below, compute the FICO score for each category as the linear transformation ln(G/B) -> FICO, defined as FICO = (x + 7.58)/0.0157, and fit a linear regression of FICO on the median score of the categories.
Median of the category Lower bound of the score Upper bound of the score Numb @12 Mob Ever 90@12 MOB Good/Bad ln(G/B) Fico 0.716 0 0.752399981 1497 943 0.5874867 -0.5319 449 0.778 0.7524 0.7968 1504 733 1.0518418 0.050543 486 0.8132 0.7968 0.8268 1496 630 1.3746032 0.318165 503 0.8371 0.8268 0.8457 1508 564 1.6737589 0.515072 516 0.8532 0.8457 0.8596 1495 510 1.9313726 0.658231 525 0.8654 0.8596 0.8703 1500 474 2.164557 0.772216 532 0.875 0.8703 0.8792 1513 447 2.3847875 0.86911 538 0.8833 0.8792 0.8869 1489 414 2.5966184 0.95421 544 0.8901 0.8869 0.8934 1496 393 2.8066158 1.031979 549 0.8968 0.8934 0.8996 1521 378 3.0238095 1.106517 553 0.9024 0.8996 0.9051 1491 351 3.2478633 1.177997 558 329 FICO score (1497-943)/943 = 0.5874 (-0.5319 + 7.58)/0.0157 = 449 FICO transformation graph y = 678.99x - 49.314 R2 = 0.9613 0 100 200 300 400 500 600 700 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 sc1 Fico_score Řada1 Lineární (Řada1) 330 FICO score 7. Introduction to Survival Analysis 331 What Is Survival Analysis?  Survival analysis is a class of statistical methods for which the outcome variable of interest is time until an event occurs.  Time is measured from the beginning of follow-up until the event occurs or a reason occurs for the observation of time to end. 332 Examples of Survival Analysis  Follow-up of patients undergoing surgery to measure how long they survived after the surgery  Follow-up of leukemia patients in remission to measure how long they remain in remission  Follow-up of clients to measure how long they stay non-defaulted 333 What Is Survival Analysis? Time Subjects A B C D E F G Event End of Study Withdrew Event Lost to follow-up Event Event 1 2 3 4 5 6 334 Data Structure Subject Survival Time Status A 4.0 1 (event) B 6.0 0 (censored) C 3.0 0 D 5.0 1 E 3.0 0 F 3.0 1 G 2.0 1 335 Problems with Conventional Methods Logistic regression  ignores information on the timing of events  cannot handle time-dependent covariates. Linear regression  cannot handle censored observations  cannot handle time-dependent covariates  is not appropriate because time to event can have unusual distribution. 336 Right-Censoring An observation is right-censored if the observation is terminated before the event occurs. Time Subjects End of Study Withdrew Lost to follow-up 337 Left-Censoring Start of Study End of Study A B Time before Study Event Event An observation is left-censored when the observation experiences the event before the start of the follow-up period. 338 Interval-Censoring A B Event Event Time a b? An observation is interval-censored if the only information you know about the survival time is that it is between the values a and b. 339 Types of Right-Censoring  Type I subjects survived until end of the study. Censoring time is fixed.  Type II subjects survived until end of the study. Censoring time occurs when a pre-specified number of events have occurred.  Random observations are terminated for reasons that are not under the control of the investigator. 340 Uninformative Censoring Censoring is uninformative if it  occurs when the reasons for termination are unrelated to the risk of the event  assumes that subjects who are censored at time X should be representative of all those subjects with the same values of the predictor variables who survive to time X  does not bias the parameter estimates and statistical inference. 
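A minimal SAS sketch that reproduces the small data-structure example above (status 1 = event, 0 = censored) and requests the Kaplan-Meier estimate worked through below; PROC LIFETEST itself is introduced formally later in this chapter.

data work.surv;
   input subject $ time status;
   datalines;
A 4 1
B 6 0
C 3 0
D 5 1
E 3 0
F 3 1
G 2 1
;
run;
proc lifetest data=work.surv;
   time time*status(0);   /* status = 0 means censored */
run;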
341 Informative Censoring
Censoring is informative if it
 occurs when the reasons for termination of the observation are related to the risk of the event
 results in biased parameter estimates and inaccurate statistical inference about the survival experience.
342 Recommendations Regarding Informative Censoring
 When designing and conducting studies, reduce the amount of random censoring.
 Always analyze the pattern of censoring to see whether it is related to a subset of subjects.
 Include in your study any explanatory variables that may affect the rate of censoring.
343 Time Origin Recommendations
 Choose a time origin that marks the onset of continuous exposure to the risk of the event.
 Choose the time of randomization to treatment as the time origin in experimental studies.
 If there are several time origins available, consider controlling for the other time origins by including them as covariates.
344 Survival Analysis
The goals of survival analysis might be to
 estimate and interpret survival and hazard functions from survival data
 compare survival and hazard functions among different groups
 assess the relationship of time-independent and time-dependent explanatory variables to survival time
 predict the remaining time until the event.
345 Survival Function
• T … random variable for the survival time (time until the event of interest or until censoring)
• δ … event indicator (δ = 1 if the event occurred, δ = 0 if the observation is censored)
• S(t) … the survival function; it gives the probability that an individual is still event-free at time t
• It holds that
S(t) = P(T > t), \quad S(0) = 1, \quad \lim_{t \to \infty} S(t) = 0
346 Kaplan-Meier estimation
Time   Number Events   Number Censored   Number At Risk   Cumulative Survival
0      0               0                 7                1.00
1      0               0                 7                1.00
2      1               0                 7                (7-1)/7 = .86
3      1               2                 6                .86*5/6 = .71
4      1               0                 3                .71*2/3 = .48
5      1               1                 2                .48*1/2 = .24
6      0               0                 0                -------
\hat{S}(t_j) = \hat{S}(t_{j-1})\left(1 - \frac{d_j}{n_j}\right) = \prod_{i=1}^{j} \frac{n_i - d_i}{n_i}
n_j … number of subjects still at risk just before time t_j
d_j … number of subjects with the event at time t_j
347 Kaplan-Meier Curve
348 Other Estimation Methods
349 Life Table Method
The life table method
 is useful when there are a large number of observations
 groups the event times into intervals
 can produce estimates and plots of the hazard function.
350 Life Table Method
351 Differences between KM and Life Table Methods
In the Kaplan-Meier method,
 time interval boundaries are determined by the event times themselves
 censored observations are assumed to be at risk for the whole event time period.
In the life table method,
 time interval boundaries are determined by the user
 censored observations are censored at the midpoint of the time interval.
352 Standard Error of the KM Estimate
• The corresponding estimate of the standard error is computed using Greenwood's formula (Kalbfleisch and Prentice, 1980) as
\hat{\sigma}\left(\hat{S}(t_j)\right) = \hat{S}(t_j)\sqrt{\sum_{i=1}^{j} \frac{d_i}{n_i (n_i - d_i)}}
353 Pointwise Confidence Limits 354 Pointwise Confidence Limits 355 Pointwise Confidence Limits
356 Simultaneous Confidence Intervals
 Confidence bands show with a given confidence level that the survival function falls within the interval for all time points.
 There are two approaches in SAS for constructing simultaneous confidence intervals.
 Equal precision (CONFBAND=EP) confidence intervals are proportional to the pointwise confidence intervals.
 Hall-Wellner (CONFBAND=HW) confidence intervals are not proportional to the pointwise confidence intervals.
 Transformations that are used to improve the pointwise confidence bands can be used to improve the simultaneous confidence bands.
357 Simultaneous Confidence Intervals 358 Simultaneous Confidence Intervals 359 Simultaneous Confidence Intervals 360 Comparing Survival Functions 361 Likelihood-Ratio Test The likelihood-ratio test  is a parametric test that assumes that the distribution of event times follows an exponential distribution  this assumption can be verified by checking whether the plot of the negative log of the survival function against time follows a linear trend through the origin. 362 Nonparametric Tests 363 Log-Rank Test The log-rank test  tests whether the survival functions are statistically equivalent  is a large-sample chi-square test that uses the observed and expected cell counts across the event times  has maximum power when the ratio of hazards is constant over time  loses power in the presence of interactions. 364 Log-Rank Test for Two Groups χ² = ( Σ_{j=1}^{r} (d_1j - e_1j) )² / var( Σ_{j=1}^{r} (d_1j - e_1j) ), where d_1j is the number of events that occur in group 1 at time j, and e_1j is the expected number of events in group 1 at time j. 365 Wilcoxon Test The Wilcoxon test  is also known as the Gehan test or the Breslow test  can be biased if the pattern of censoring is different between the groups  loses power in the presence of interactions. 366 Wilcoxon Test for Two Groups χ² = ( Σ_{j=1}^{r} n_j·(d_1j - e_1j) )² / var( Σ_{j=1}^{r} n_j·(d_1j - e_1j) ), where n_j is the total number at risk at each time point. 367 Log-Rank versus Wilcoxon Test Log-rank test  is more sensitive than the Wilcoxon test to differences between groups in later points in time. Wilcoxon test  is more sensitive than the log-rank test to differences between groups that occur in early points in time. 368 New Tests in SAS®9  Tarone-Ware test uses a weight equal to the square root of the number at risk. This gives more weight to differences between the observed and expected number of events at time points where there is the most data.  Peto-Peto and Modified Peto-Peto tests use weights that depend on the observed survival experience of the combined sample. The principal advantage of these tests is that they do not depend on the censoring experience of the groups.  Harrington-Fleming test incorporates features of both the log-rank and Peto-Peto tests. 369 Stratified Tests  Stratified tests are used when you want to compare survival functions across k populations while controlling for other covariates.  They are different from the k-sample tests, which only compare survival functions across k populations.  Stratified tests are available in SAS®9 with the use of the GROUP= option in the STRATA statement. 370 Syntax for Stratified Tests STRATA variable1 / GROUP=variable2 TEST=(list); The distinct values of variable1 represent the m strata; the distinct values of variable2 represent the k populations. 371 Multiple Comparison Methods  Bonferroni correction to the raw p-values  Dunnett's two-tailed comparisons of the control group with all other groups  Scheffe's multiple-comparison adjustment  Sidák correction to the raw p-values  Paired comparisons based on the studentized maximum modulus test  Tukey's studentized range test  Adjusted p-values from the simulated distribution 372 Specification of Comparisons  DIFF=ALL requests all paired comparisons.  DIFF=CONTROL <('string' <...'string'>)> requests comparisons of the control curve with all other curves.
 To specify the control curve, you specify the quoted strings of formatted values that represent the curve in parentheses. 373 374 375 376 (figure/output slides omitted) 377 LIFETEST Procedure General form of the LIFETEST procedure: PROC LIFETEST DATA=SAS-data-set ; TIME variable <*censor(list)>; STRATA variable <(list)> <...variable <(list)>> ; TEST variables; RUN; • The simplest use of PROC LIFETEST is to request the nonparametric estimates of the survivor function for a sample of survival times. In such a case, only the PROC LIFETEST statement and the TIME statement are required. You can use the STRATA statement to divide the data into various strata. A separate survivor function is then estimated for each stratum, and tests of the homogeneity of strata are performed. 378 Hazard Function (riziková funkce) The hazard function  is the instantaneous risk or potential that an event will occur at time t, given that the individual has survived up to time t  takes the form number of events per interval of time  is a rate, not a probability, that ranges from zero to infinity. 379 Hazard Function h(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt, where the numerator is a conditional probability, the denominator an interval of time, and the limit gives the instantaneous risk or potential (okamžité riziko/potenciál). Platí: h(t) = f(t)/S(t) = -∂/∂t ln(S(t)) a S(t) = exp(-H(t)), kde H(t) = ∫_0^t h(x) dx je tzv. kumulativní riziková funkce a f(t) je hustota náhodné veličiny T. 380 Hazard Function 381 8. Cox model 382 Survival Models Models in survival analysis  are written in terms of the hazard function  assess the relationship of predictor variables to survival time  can be parametric or nonparametric models. 383 Parametric versus Nonparametric Models Parametric models require that  the distribution of survival time is known  the hazard function is completely specified except for the values of the unknown parameters. Examples include the Weibull model, the exponential model, and the log-normal model. 384 Parametric versus Nonparametric Models Properties of nonparametric models are  the distribution of survival time is unknown  the hazard function is unspecified. An example is the Cox proportional hazards model. 385 Cox Proportional Hazards Model h_i(t) = h_0(t)·exp(β_1·X_i1 + … + β_k·X_ik), where h_0(t) is the baseline hazard function (involves time but not the predictor variables) and the exponent is a linear function of a set of predictor variables (does not involve time). 386 Popularity of the Cox Model The Cox proportional hazards model  provides the primary information desired from a survival analysis, hazard ratios and adjusted survival curves, with a minimum number of assumptions  is a robust model where the regression coefficients closely approximate the results from the correct parametric model. 387 Measure of Effect Hazard ratio = hazard in group A / hazard in group B = exp(β̂_i·(X_iA - X_iB)). 388 Properties of the Hazard Ratio The hazard ratio ranges from 0 to infinity: a value of 1 means no association, a value greater than 1 means group A has the higher hazard, a value smaller than 1 means group B has the higher hazard. 389 Proportional Hazards Assumption (Plot: log h(t) versus time for females and males; under proportional hazards the two curves are parallel.) 390 Nonproportional Hazards 391 Cox model in credit scoring Credit-scoring systems were built to answer the question, "How likely is a credit applicant to default by a given time in the future?" The methodology is to take a sample of previous customers and classify them into good or bad depending on their repayment performance over a given fixed period. Poor performance just before the end of this fixed period means that the customer is classified as bad; poor performance just after the end of the period does not matter and the customer is classified as good.
This arbitrary division can lead to less-than-robust scoring systems. Also, if one wants to move from credit scoring to profit scoring, then it matters when a customer defaults. One asks not if an applicant will default but when they will default. This is a more difficult question to answer because there are lots of answers, not just the yes or no of the "if" question, but it is the question that survival analysis tools address when modeling the lifetime of equipment, constructions, and humans. Zdroj: Thomas, Edelman, Crook – Credit scoring and its application. 392 Cox model in credit scoring Using survival analysis to answer the "when" question has several advantages over standard credit scoring. For example, • it deals easily with censored data, where customers cease to be borrowers (either by paying back the loan, death, changing lender) before they default; • it avoids the instability caused by having to choose a fixed period to measure satisfactory performance; • estimating when there is a default is a major step toward calculating the profitability of an applicant; • these estimates will give a forecast of the default levels as a function of time, which is useful in debt provisioning; • this approach may make it easier to incorporate estimates of changes in the economic climate into the scoring system. Zdroj: Thomas, Edelman, Crook – Credit scoring and its application. 393 Cox model in credit scoring Let T be the time until a loan defaults. Then there are three standard ways to describe the randomness of T in survival analysis (Collett 1994): S(t), f(t) and h(t). Zdroj: Thomas, Edelman, Crook – Credit scoring and its application. 394 Cox model in credit scoring In standard credit scoring, one assumes that the application characteristics affect the probability of default. Similarly, in this survival analysis approach, we want models that allow these characteristics to affect the probability of when a customer defaults. Two models have found favor in connecting explanatory variables to failure times in survival analysis: • proportional hazard models • accelerated life models. If x = (x1,..., xp) are the application (explanatory) characteristics, then an accelerated life model assumes that S(t | x) = S_0(exp(w·x)·t), i.e. h(t | x) = exp(w·x)·h_0(exp(w·x)·t), where h_0 and S_0 are baseline functions, so the x can speed up or slow down the aging of the account. The proportional hazard model assumes that h(t | x) = exp(w·x)·h_0(t), so the application variables x have a multiplier effect on the baseline hazard. Zdroj: Thomas, Edelman, Crook – Credit scoring and its application. 395 Cox model in credit scoring Cox (1972) pointed out that in proportional hazards one can estimate the weights w without knowing h0(t) using the ordering of the failure times and the censored times. If ti and xi are the failure (or censored) times and the application variables for each of the items under test, then the conditional probability that customer i defaults at time ti given that R(i) are the customers still operating just before ti is given by exp(w·x_i) / Σ_{j∈R(i)} exp(w·x_j), which is independent of h0. Zdroj: Thomas, Edelman, Crook – Credit scoring and its application. 396 PHREG Procedure PROC PHREG DATA=SAS-data-set ; CLASS variable <(options)><...variable <(options)>>; MODEL response<*censor(list)>=variables ; STRATA variable<(list)><...variable<(list)>> ; CONTRAST <'label'> effect values<,..., effect values> ; ASSESS keyword ; HAZARDRATIO <'label'> variable ; TEST equation1 <,..., equationk> </options>; WEIGHT variable; OUTPUT ; programming statements; RUN;
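As a minimal illustration of the PHREG syntax above, here is a hypothetical call fitting a Cox proportional hazards model on loan-level data; the dataset name, time and censoring variables and predictors are invented for the sketch.

/* Hypothetical loan-level data: month_to_default = time on book in months,
   default = 1 for a default, 0 for a censored account (paid off, still open).
   All names are illustrative only. */
proc phreg data=loans;
   class product (ref='cash') / param=ref;
   model month_to_default*default(0) = age income product;
   hazardratio 'Age in 10-year steps' age / units=10;
run;

The estimated coefficients play the role of the weights w above: exp(beta) is the hazard ratio for a one-unit (here ten-year) change in the covariate, and the baseline hazard h0(t) never has to be specified.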
397 9. Měření kvality (síly) modelu, validace modelu. 398 How Good Is the Scorecard?  And which one is the best?  Combination of statistical measures and business objectives  Misclassification (Confusion) matrix  Scorecard strength measures 399 Misclassification  Confusion matrix
Actual \ Predicted | Good | Bad
Good | True Positive | False Negative
Bad | False Positive | True Negative
 Accuracy  (TP+TN)/total  Error rate  (FP+FN)/total  Sensitivity ; Specificity  (TP)/Actual Positives ; (TN)/Actual Negatives  Positive ; Negative predicted value  TP/predicted positives ; TN/predicted negatives "Good"/"Bad" is above/below the chosen cutoff. We want to maximize accuracy and minimize the error rate. 400 Misclassification  Confusion matrix  Acceptance of bads (FP)  Acceptance of goods (TP)  Decline goods (FN)  Decline bads (TN) We want to minimize the rejection of goods and maximize the rejection of bads. 401 Misclassification  Approval rate: bad rate relationship  Objective:  Minimize the rejection of goods or acceptance of bads  Best option for desired bad rate, approval rate  Compare scorecards and cutoff choices. "I'd rather approve some bads than reject good customers" vs "the cost of approving bads is too high, we can deal with PR". Generate these stats for different cutoff choices and compare with the base, i.e. current approval and bad rates. If several models are being compared, generate these for the same bad rate or approval rate, i.e. choose different cutoffs to get the same bad rate. 402 Misclassification: Oversampling  Need to adjust for oversampling if it has not been done before this step  Sensitivity/specificity are unaffected by oversampling  Multiply cell counts by sample weights (π0 and π1)
Actual \ Predicted | Good | Bad
Good | n·Sens·π1 | n·(1 - Sens)·π1
Bad | n·(1 - Spec)·π0 | n·Spec·π0
403 Scorecard Strength  Akaike's Information Criterion (AIC)  Schwarz Bayesian Criterion (SBC)  -2·(log likelihood) + penalty term • Penalty term = (k + 1)·ln(n) for SBC • k = number of variables • n = sample size. Penalise for adding parameters to the model ... Smaller values are better. 404 KS Statistic  Max difference between the cumulative distributions of goods and bads across score ranges. (Kolmogorov-Smirnov chart: cumulative good and bad distributions over the score range for scorecards A and B, with the KS distance marked for each.) 405 Scorecard Strength  C-Statistic  Area under the ROC curve, Wilcoxon-Mann-Whitney test. (ROC curves for scorecards A and B against the random diagonal: sensitivity versus 1 - specificity.) You may be wondering where the name "Receiver Operating Characteristic" came from. ROC analysis is part of a field called "Signal Detection Theory" developed during World War II for the analysis of radar images. Radar operators had to decide whether a blip on the screen represented an enemy target, a friendly ship, or just noise. Signal detection theory measures the ability of radar receiver operators to make these important distinctions. Their ability to do so was called the Receiver Operating Characteristics. It was not until the 1970's that signal detection theory was recognized as useful for interpreting medical test results.
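Both the C-statistic and the KS statistic can be read off standard SAS output. A minimal sketch, assuming a hypothetical scored dataset with one row per client, the model score and the observed bad flag (all names are illustrative):

/* c = area under the ROC curve is reported in the "Association of Predicted
   Probabilities and Observed Responses" table; Gini = 2*c - 1, and Somers' D
   in the same table equals that value when the score is the only regressor. */
proc logistic data=scored;
   model bad(event='1') = score;
run;

/* Two-sample Kolmogorov-Smirnov statistic D = maximum distance between the
   empirical score distributions of goods and bads. */
proc npar1way data=scored edf;
   class bad;
   var score;
run;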
406 Scorecard Strength • Gains Chart: Cumulative Positive Predicted Value versus Distribution of Predicted Positives (depth) • Lift/concentration curve: Sensitivity versus Depth • Lift = Positive predicted value / % positives in the sample • Misclassification costs (losses assigned to false positives and false negatives) • Bayes Rule (minimizes expected cost) • Cost ratio (at what cutoff do I break even given the prior bad rate, i.e. if the odds are 9:1 you need a cutoff where 1 bad is balanced by 9 goods) • Somers' D, Gamma, Tau-a 407 Information value  A special (symmetrised) case of the Kullback-Leibler divergence, given by I_val = ∫ (f_GOOD(x) - f_BAD(x))·ln( f_GOOD(x)/f_BAD(x) ) dx, where f_GOOD and f_BAD are the densities of scores of good and bad clients. 408  Informační hodnota (Ival) – spojitý případ (Divergence): I_val = ∫ (f_GOOD(x) - f_BAD(x))·ln( f_GOOD(x)/f_BAD(x) ) dx, kde f_diff(x) = f_GOOD(x) - f_BAD(x) a f_LR(x) = ln( f_GOOD(x)/f_BAD(x) ). 409 Information value  Informační hodnota (Ival) – diskretizovaný spojitý případ: • Nahradíme hustotu jejím jádrovým odhadem a spočteme integrál numericky (např. pomocí složeného lichoběžníkového pravidla). • S použitím Epanečnikova jádra K(x) = (3/4)·(1 - x²) na intervalu [-1, 1] a optimální šířky vyhlazovacího okna h_OS dostaneme f̃_IV(x) = ( f̃_GOOD(x, h_OS,GOOD) - f̃_BAD(x, h_OS,BAD) )·ln( f̃_GOOD(x, h_OS,GOOD) / f̃_BAD(x, h_OS,BAD) ). • Pro daných M+1 bodů x_0, …, x_M dostáváme I_val ≈ (x_M - x_0)/(2M) · [ f̃_IV(x_0) + 2·Σ_{i=1}^{M-1} f̃_IV(x_i) + f̃_IV(x_M) ]. 410 Information value  Informační statistika/hodnota (Ival) – diskrétní případ: • Vytvoříme intervaly skóre – typicky decily. Počet dobrých (špatných) klientů v i-tém intervalu označíme g_i (b_i); n a m jsou celkové počty dobrých a špatných klientů. • Musí platit g_i > 0 a b_i > 0 pro všechna i. • Potom dostáváme I_val = Σ_i (g_i/n - b_i/m)·ln( (g_i·m)/(b_i·n) ). 411 Information value  Informační hodnota pro 2 příklady scoringových modelů:
SC 1:
decile | # clients | # bad | # good | % bad [1] | % good [2] | [3]=[2]-[1] | [4]=[2]/[1] | [5]=ln[4] | [6]=[3]*[5] | cum. [6]
1 | 100 | 35 | 65 | 35,0% | 7,2% | -0,28 | 0,21 | -1,58 | 0,44 | 0,44
2 | 100 | 16 | 84 | 16,0% | 9,3% | -0,07 | 0,58 | -0,54 | 0,04 | 0,47
3 | 100 | 8 | 92 | 8,0% | 10,2% | 0,02 | 1,28 | 0,25 | 0,01 | 0,48
4 | 100 | 8 | 92 | 8,0% | 10,2% | 0,02 | 1,28 | 0,25 | 0,01 | 0,49
5 | 100 | 7 | 93 | 7,0% | 10,3% | 0,03 | 1,48 | 0,39 | 0,01 | 0,50
6 | 100 | 6 | 94 | 6,0% | 10,4% | 0,04 | 1,74 | 0,55 | 0,02 | 0,52
7 | 100 | 6 | 94 | 6,0% | 10,4% | 0,04 | 1,74 | 0,55 | 0,02 | 0,55
8 | 100 | 5 | 95 | 5,0% | 10,6% | 0,06 | 2,11 | 0,75 | 0,04 | 0,59
9 | 100 | 5 | 95 | 5,0% | 10,6% | 0,06 | 2,11 | 0,75 | 0,04 | 0,63
10 | 100 | 4 | 96 | 4,0% | 10,7% | 0,07 | 2,67 | 0,98 | 0,07 | 0,70
All | 1000 | 100 | 900 | Info. Value 0,70
SC 2:
decile | # clients | # bad | # good | % bad [1] | % good [2] | [3]=[2]-[1] | [4]=[2]/[1] | [5]=ln[4] | [6]=[3]*[5] | cum. [6]
1 | 100 | 20 | 80 | 20,0% | 8,9% | -0,11 | 0,44 | -0,81 | 0,09 | 0,09
2 | 100 | 18 | 82 | 18,0% | 9,1% | -0,09 | 0,51 | -0,68 | 0,06 | 0,15
3 | 100 | 17 | 83 | 17,0% | 9,2% | -0,08 | 0,54 | -0,61 | 0,05 | 0,20
4 | 100 | 15 | 85 | 15,0% | 9,4% | -0,06 | 0,63 | -0,46 | 0,03 | 0,22
5 | 100 | 12 | 88 | 12,0% | 9,8% | -0,02 | 0,81 | -0,20 | 0,00 | 0,23
6 | 100 | 6 | 94 | 6,0% | 10,4% | 0,04 | 1,74 | 0,55 | 0,02 | 0,25
7 | 100 | 4 | 96 | 4,0% | 10,7% | 0,07 | 2,67 | 0,98 | 0,07 | 0,32
8 | 100 | 3 | 97 | 3,0% | 10,8% | 0,08 | 3,59 | 1,28 | 0,10 | 0,42
9 | 100 | 3 | 97 | 3,0% | 10,8% | 0,08 | 3,59 | 1,28 | 0,10 | 0,52
10 | 100 | 2 | 98 | 2,0% | 10,9% | 0,09 | 5,44 | 1,69 | 0,15 | 0,67
All | 1000 | 100 | 900 | Info. Value 0,67
412 Information value  Označíme-li I_diff,i = g_i/n - b_i/m a I_LR,i = ln( (g_i·m)/(b_i·n) ), dostáváme příspěvky jednotlivých decilů k informační hodnotě. (Grafy I_diff,i, I_LR,i a kumulovaného součinu I_diff,i·I_LR,i po decilech pro SC 1 a SC 2 vynechány.)
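The discrete formula above can be spelled out in a short data step. A minimal sketch using the SC 1 decile counts (the running total reproduces Ival = 0.70; dataset and variable names are illustrative):

/* Goods and bads per score decile, copied from the SC 1 table above;
   n_good and n_bad are the column totals. */
data iv_sc1;
   retain n_good 900 n_bad 100;
   input good bad;
   contrib = (good/n_good - bad/n_bad)
             * log( (good/n_good) / (bad/n_bad) );
   iv + contrib;                       /* running total = Information value */
   datalines;
65 35
84 16
92 8
92 8
93 7
94 6
94 6
95 5
95 5
96 4
;
run;

proc print data=iv_sc1;
   var good bad contrib iv;
run;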
Souhrnné statistiky: SC 1: K-S = 0.34, Gini = 0.42, Lift20% = 2.55, Lift50% = 1.48, Ival = 0.70, Ival20% = 0.47, Ival50% = 0.50; SC 2: K-S = 0.36, Gini = 0.42, Lift20% = 1.90, Lift50% = 1.64, Ival = 0.67, Ival20% = 0.15, Ival50% = 0.23. 413 Information value – Ival for normally distributed scores  Assume that the scores of good and bad clients are normally distributed, i.e. we can write their densities as f_1(x) = 1/(σ_g·√(2π))·exp( -(x - μ_g)²/(2σ_g²) ) and f_0(x) = 1/(σ_b·√(2π))·exp( -(x - μ_b)²/(2σ_b²) ).  Assume that standard deviations are equal to a common value σ: then I_val = D², where D = (μ_g - μ_b)/σ.  Generally (i.e. without the assumption of equality of standard deviations): I_val = (σ_g² + (μ_g - μ_b)²)/(2σ_b²) + (σ_b² + (μ_g - μ_b)²)/(2σ_g²) - 1. 414  We can see a quadratic dependence on the difference of means.  Ival takes quite high values when both variances are approximately equal and smaller than or equal to 1, and it grows to infinity if the ratio of the variances tends to infinity or to zero. 415 Ival for normally distributed scores  Ival (graf pro μ_b = 0, σ_b² = 1 vynechán): velmi silná závislost na σ_g². Navíc hodnota Ival míří velmi rychle k nekonečnu, pokud se σ_g² blíží nule. 416 Ival for normally distributed scores Empirical estimate of Ival 417  However, in practice computational problems can occur. The Information value index becomes infinite in cases when some of n_0j or n_1j are equal to 0. When this arises, there are numerous practical procedures for preserving finite results. For example, one can replace the zero entry of numbers of goods or bads by a minimum constant of, say, 0.0001. The choice of the number of bins is also very important. In the literature, and also in many applications in credit scoring, the value r = 10 is preferred. 418 Empirical estimate of Ival – the discrete estimate is computed on intervals given by the empirical quantile function F̂_0^{-1} appropriate to the empirical cumulative distribution function of scores of bad clients. Empirical estimate with supervised interval selection (ESIS)  We want to avoid zero values of n_0j or n_1j.  We propose to require at least k observations of scores of both good and bad clients in each interval, where k is a positive integer.  Set the interval boundaries as q_j = F̂_0^{-1}(j·k/n_0), j = 1, 2, …. 419  Usage of the quantile function of scores of bad clients is motivated by the assumption that the number of bad clients is less than the number of good clients.  If n_0 is not divisible by k, it is necessary to adjust the intervals, because the number of scores of bad clients in the last interval is less than k. In this case we have to merge the last two intervals.  Furthermore, we need to ensure that the number of scores of good clients is as required in each interval.  To do so, we compute n_1j for all current intervals. If we obtain n_1j < k for the jth interval, we merge this interval with its neighbor on the right side.  This can be done for all intervals except the last one. If we have n_1j < k for the last interval, then we have to merge it with its neighbor on the left side, i.e. we merge the last two intervals. 420 Empirical estimate with supervised interval selection  The choice of k is very important. If we choose too small a value, we get an overestimated value of the Information value, and vice versa.
As a reasonable compromise, the adjusted square root of the number of bad clients, k = ⌈√n_0⌉, seems appropriate.  The estimate of the Information value is then given by Î_val,ESIS = Σ_j (n_1j/n_1 - n_0j/n_0)·ln( (n_1j·n_0)/(n_0j·n_1) ), where n_0j and n_1j are the observed counts of bad and good clients in the intervals created according to the described procedure. 421 Empirical estimate with supervised interval selection – Simulation results  Consider n clients, 100·pB % of bad clients with f_0 : N(μ_0, σ_0) and 100·(1 - pB) % of good clients with f_1 : N(μ_1, σ_1).  Because of normality (and σ_0 = σ_1 = 1) we know I_val = (μ_1 - μ_0)².  Consider the following values of the parameters:  n = 100 000, n = 1000  μ_0 = 0  σ_0 = σ_1 = 1  μ_1 = 0.5, 1, 1.5  pB = 0.02, 0.05, 0.1, 0.2 422 1) Scores of bad and good clients were generated according to the given parameters. 2) The estimates Î_val,DEC, Î_val,KERN, Î_val,ESIS were computed. 3) Square errors were computed. 4) Steps 1)-3) were repeated one thousand times. 5) MSE was computed. 423 Simulation results MSE (columns pB = 0.02, 0.05, 0.1, 0.2):
n = 100 000, μ_1 - μ_0 = 0.5: IV_decil 0,000546 0,000310 0,000224 0,000168 | IV_kern 0,000487 0,000232 0,000131 0,000076 | IV_esis 0,000910 0,000384 0,000218 0,000127
n = 100 000, μ_1 - μ_0 = 1.0: IV_decil 0,006286 0,004909 0,004096 0,002832 | IV_kern 0,003396 0,001697 0,001064 0,000646 | IV_esis 0,002146 0,000973 0,000477 0,000568
n = 100 000, μ_1 - μ_0 = 1.5: IV_decil 0,056577 0,048415 0,034814 0,020166 | IV_kern 0,019561 0,010789 0,006796 0,004862 | IV_esis 0,013045 0,008134 0,007565 0,027943
n = 1000, μ_1 - μ_0 = 0.5: IV_decil 0,025574 0,040061 0,026536 0,009074 | IV_kern 0,038634 0,017547 0,009281 0,004737 | IV_esis 0,038331 0,021980 0,016280 0,008028
n = 1000, μ_1 - μ_0 = 1.0: IV_decil 0,186663 0,084572 0,043097 0,029788 | IV_kern 0,117382 0,072381 0,045344 0,032131 | IV_esis 0,150881 0,071088 0,036503 0,023609
n = 1000, μ_1 - μ_0 = 1.5: IV_decil 1,663859 1,037778 0,535180 0,200792 | IV_kern 0,529367 0,349783 0,266912 0,196856 | IV_esis 0,609193 0,352151 0,172931 0,194676
(Colour coding of worst / average / best performance omitted.) 424 Simulation results Adjusted empirical estimate with supervised interval selection (AESIS)  Je zřejmé, že volba parametru k je zcela zásadní. Otázkou tedy je:  Je navržená volba k optimální (vzhledem k MSE)?  Jaký vliv na optimální k má n_0?  A jaký vliv, pokud vůbec, má rozdíl středních hodnot μ_1 - μ_0? 425  Consider 10 000 clients, 100·pB % of bad clients with f_0 : N(μ_0, 1) and 100·(1 - pB) % of good clients with f_1 : N(μ_1, 1). Set μ_0 = 0 and consider μ_1 = 0.5, 1 and 1.5; pB = 0.02, 0.05, 0.1, 0.2; MSE = E((Î_val - I_val)²), and k_MSE denotes the value of k minimizing the MSE. 426 Simulation results  Dependence of MSE on k, μ_1 - μ_0 = 1 (plots omitted).  The highlighted circles correspond to values of k where the minimal value of the MSE is obtained; the diamonds correspond to values of k given by the proposed rule (Î_val,AESIS). Values of k by pB (columns 0.02, 0.05, 0.1, 0.2) and μ_1 - μ_0 (rows), two panels:
panel 1: 0.5: 29 42 62 84 | 1: 12 18 23 32 | 1.5: 6 9 8 9
panel 2: 0.5: 31 45 61 84 | 1: 12 17 24 32 | 1.5: 7 10 14 19
427 Simulation results ESIS.1  Algorithm for the modified ESIS:
1) q = [ ] (empty sequence of boundaries)
2) q_j1 = F̂_1^{-1}(k/n_1)
3) q_j0 = F̂_0^{-1}(k/n_0)
4) s_max = max(q_j0, q_j1); add s_max to the sequence, i.e. q = [q, s_max]
5) Erase all scores ≤ s_max (and update n_0, n_1)
6) While n_0 and n_1 are greater than 2·k, repeat steps 2) - 5)
7) q = [min(score) - 1, q, max(score)]
Î_val,ESIS.1 is then the discrete estimate computed on the intervals given by q. 428 ESIS.2  U původního ESIS často dochází ke slučování vypočtených intervalů ve druhé fázi algoritmu.  Pro výpočet se používá jen F̂_0^{-1} (kvantilová funkce skóre špatných klientů).
 Aby byla splněna podmínka n_11 > k, je zřejmě nutné, aby hranice prvního intervalu byla větší než F̂_1^{-1}(k/n_1).  To vede k myšlence použít ke konstrukci intervalů nejprve F̂_1^{-1} a následně, od nějaké hodnoty skóre s_0, kvantilovou funkci F̂_0^{-1}.  Jako vhodná hodnota skóre pro tento účel se jeví hodnota s_0, ve které se protínají hustoty skóre, rozdíl distribučních funkcí skóre nabývá své maximální hodnoty a také platí, že funkce f_IV nabývá nulové hodnoty: point of intersection of densities == point of maximal difference of CDFs == point of zero value of f_IV. 429 ESIS.2  Algorithm for the modified ESIS:
1) s_0 = argmax_s ( F̂_0(s) - F̂_1(s) )
2) q_1 = (q_1j), q_1j = F̂_1^{-1}(j·k/n_1), for the boundaries below s_0
3) q_0 = (q_0j), q_0j = F̂_0^{-1}(j·k/n_0), for the boundaries above s_0
4) q = [min(score) - 1, q_1, q_0, max(score) + 1]
5) Merge intervals given by q_1 where the number of bads is less than k.
6) Merge intervals given by q_0 where the number of goods is less than k.
Î_val,ESIS.2 is then the discrete estimate computed on the resulting intervals. 430 ESIS.2 / AESIS.2 – Simulation results  Consider 1000, 10 000 and 100 000 clients, 100·pB % of bad clients with f_0 : N(μ_0, 1) and 100·(1 - pB) % of good clients with f_1 : N(μ_1, 1). Set μ_0 = 0 and consider μ_1 = 0.5, 1 and 1.5; pB = 0.02, 0.05, 0.1, 0.2; MSE = E((Î_val - I_val)²). k_MSE by pB (columns 0.02, 0.05, 0.1, 0.2) and μ_1 - μ_0 (rows), together with the k suggested by the adjusted square-root rule (last row):
n = 1 000: 0.5: 15 19 22 45 | 1: 3 8 11 16 | 1.5: 2 3 6 7 | suggested k: 5 8 10 15
n = 10 000: 0.5: 29 51 77 112 | 1: 15 24 28 45 | 1.5: 6 11 11 14 | suggested k: 15 23 32 45
n = 100 000: 0.5: 118 198 298 371 | 1: 50 61 106 141 | 1.5: 17 28 32 48
431 Simulation results  Dependence of MSE on k (MSE-versus-k plots for several combinations of n, pB and μ_1 - μ_0 omitted). For n = 10 000 (Î_val,AESIS.2), k_MSE versus the k given by the proposed rule:
k_MSE: 0.5: 29 51 77 112 | 1: 15 24 28 45 | 1.5: 6 11 11 14
proposed k: 0.5: 38 60 85 120 | 1: 15 23 32 45 | 1.5: 8 13 18 26
432 Scorecard Strength 433 Scorecard Strength 434 Process Flow: Explore Data → Data Cleansing → Initial Characteristic Analysis (KGB) → Preliminary Scorecard (KGB) → Reject Inference → Initial Characteristic Analysis (AGB) → Final Scorecard (AGB) → Validate 435 Validation  Why?  Confirm that the model is robust and applicable on the subject population  Holdout sample  70/30, 80/20 or random samples of 50-80%  2 Methods  Compare statistics for development versus validation  Compare distributions of goods and bads for development versus validation. 436 Validation – Comparing Statistics
Fit Statistic | Label | Training | Validation | Test
_AIC_ | Akaike's Information Criterion | 6214.0279153 | . | .
_ASE_ | Average Squared Error | 0.0301553132 | 0.0309774947 | .
_AVERR_ | Average Error Function | 0.1312675287 | 0.1355474611 | .
_DFE_ | Degrees of Freedom for Error | 23609 | . | .
_DFM_ | Model Degrees of Freedom | 7 | . | .
_DFT_ | Total Degrees of Freedom | 23616 | . | .
_DIV_ | Divisor for ASE | 47232 | 45768 | .
_ERR_ | Error Function | 6200.0279153 | 6203.7361993 | .
_FPE_ | Final Prediction Error | 0.0301731951 | . | .
_MAX_ | Maximum Absolute Error | 0.9962871546 | 0.9959395534 | .
_MSE_ | Mean Square Error | 0.0301642541 | 0.0309774947 | .
_NOBS_ | Sum of Frequencies | 23616 | 22884 | .
_NW_ | Number of Estimate Weights | 7 | . | .
_RASE_ | Root Average Sum of Squares | 0.1736528525 | 0.1760042464 | .
_RFPE_ | Root Final Prediction Error | 0.1737043324 | . | .
_RMSE_ | Root Mean Squared Error | 0.1736785944 | 0.1760042464 | .
_SBC_ | Schwarz's Bayesian Criterion | 6270.5156734 | . | .
_SSE_ | Sum of Squared Errors | 1424.295752 | 1417.777979 | .
_SUMW_ | Sum of Case Weights Times Freq | 47232 | 45768 | .
_MISC_ | Misclassification Rate | 0.0320121951 | 0.0325117986 | .
_PROF_ | Total Profit for GB | 3430000 | 2730000 | .
_APROF_ | Average Profit for GB | 145.24051491 | 119.29732564 | .
If the statistics are similar, then the scorecard is validated. 437 Validation – Compare Distributions (Validation chart: cumulative distributions of goods and bads over the score range, development versus validation sample.) Valid if no significant difference. 438 Validation  Common reasons for not validating  characteristics with large score ranges,  concentration of a certain type of attribute in one sample (for example, not random sampling),  small sample sizes. 439 Comparison with the old scorecard: month-by-month comparison of the performance of the old and the new scorecard, both for the development and the hold-out sample, on a given segment. (Graph "Mobiles: Model performance": Gini coefficient (higher is better) by month of first due date, 11/2005 to 07/2007, for the actual and the new scorecard, fraud part and defaulter part.) Validation 440 Power on fresh data: use fresh data and compute "softer" good/bad definitions (e.g. 1_30, 1_60 instead of 1_90). Measure the power of the scorecard on the development sample according to these definitions and compare it with the performance on the fresh data. Comparison with real default: month-by-month comparison of the average PD predicted by the new scorecard and the real default, for both development and hold-out samples. Diagonal test: score on the x-axis and real default on the y-axis; the graph of average default should ideally be monotonous (the higher the score, the lower the default). (Graph of the diagonal test: average score and average default per score band, with the number of contracts per band on a logarithmic scale.) Validation 441 Comparison with real default: graph of the predicted PD versus the real default. (Graph, segment 1 – audio-video: number of contracts, default rate and score by month, 200606 to 200706.) Validation 442 10. Cutoff, RAROA, Monitoring 443 Možné zamítací škály – cutoff  cutoff hodnota určuje mez, při které je žádost o úvěr schválena/zamítnuta  Je možné použít tyto zamítací škály:  PD – Pravděpodobnost Defaultu (Probability of Default)  KRN – Kreditní Rizikové Náklady (CRE – Credit Risk Expenses)  Marže (Margin)  RAROA  … 444 Cutoff na škále PD cutoff = 0.1 (tj. zamítám všechny s pravděpodobností defaultu větší než 10 %). (Graf: rozdělení PD pro SC1 a SC2 s vyznačeným cutoff.) • Pro SC1 je reject rate 22 %. • Pro SC2 je reject rate 33 %. 445 Strategická křivka (Strategy curve) Acceptance rate = 1 - F(s), Bad acceptance rate = p_B·(1 - F_B(s)), Actual bad rate = p_B·(1 - F_B(s)) / (1 - F(s)). Při zavádění nové scoringové funkce typicky dochází k tomu, že stávající nastavení schvalovacího procesu (nastavení cutoff) je reprezentováno bodem O, který leží nad novou strategickou křivkou. Otázkou pak je směr, kterým se chceme vydat při stanovení nového cutoff. Pokud se posuneme do bodu A, potom zachováme poměr schválených špatných klientů, ale současně zvýšíme celkový poměr schválených klientů. Při posunu do bodu B schválíme stejný poměr klientů, ale snížíme poměr schválených špatných klientů a tedy i poměr špatných klientů (bad rate).
Posunem do bodu C zachováme bad rate při současném zvýšení poměru schválených klientů. (Graf strategické křivky s body O, A, B, C vynechán.) 446 Nastavení cutoff maximalizující zisk (profit) Profit – náhodná veličina definovaná jako: 0, je-li úvěr zamítnut; L, je-li úvěr schválen a klient se stane dobrým; -D, je-li úvěr schválen a klient se stane špatným. Označme pG a pB proporci dobrých a špatných klientů v populaci. q(G|s) (q(B|s)) označuje podmíněnou pravděpodobnost, že klient mající skóre s bude dobrý (špatný), přičemž q(G|s) + q(B|s) = 1. Nechť p(s) je proporce populace se skóre s. Střední hodnota profitu při schválení klientů se skóre s: p(s)·( L·q(G|s) - D·q(B|s) ). Tedy k maximalizaci profitu je třeba schválit ty klienty, jejichž skóre splňuje podmínku: L·q(G|s) > D·q(B|s), tj. q(G|s)/q(B|s) > D/L. 447 Nastavení cutoff maximalizující profit Nechť A označuje množinu skóre, kde je splněna předchozí podmínka. Pak je střední hodnota zisku (profitu) na jednoho klienta dána vztahem: E(Profit) = Σ_{s∈A} p(s)·( L·q(G|s) - D·q(B|s) ). Pokud L a D navíc závisí na skóre s, je situace ještě o něco složitější. Více viz Thomas et al. (2002). 448 Nastavení cutoff maximalizující profit Body na spodní části křivky odpovídají vyšším cutoff hodnotám, a tedy i menšímu počtu přijatých špatných klientů, zatímco body na horní části křivky odpovídají menším hodnotám cutoff, tj. vyššímu počtu přijatých špatných klientů. Efektivní hranicí je tedy spodní část křivky od bodu C do bodu D. Jestliže aktuální nastavení schvalovacího procesu odpovídá bodu O, opět máme možnost posunu na křivku odpovídající nové scoringové funkci. První možností je zachování poměru schválených špatných klientů, tj. posun do bodu A. Druhou možností je zachování celkového poměru schválených klientů, tj. posun do bodu B. Je zřejmé, že posun do bodu A není vhodná volba, protože tento bod neleží na efektivní hranici a lze snadno dosáhnout stejného očekávaného zisku při nižší očekávané ztrátě. 449 Definice KRN (CRE) CRE = ((1 - Recovery) * SUM(PD * Loss)) / (Expected Average Volume), Profit = (Interest rate - CRE) * Expected Average Volume. (Diagram: půjčený objem v čase klesá; ztráta (Loss) odpovídá zůstatku při defaultu v dané splátce. Číslo defaultní splátky a její pravděpodobnost PD: 1 (0.06), 2 (0.02), 3 (0.02), 4 (0.02), 5 (0.02), 6 (0.02), 7 (0.02), 8 (0.02), 9 (0.02), 10 (0.03).) Pravděpodobnost defaultu silně závisí na scoringové funkci. Dále vstupuje úroková míra a očekávaný průměrný objem úvěru. 450 Recovery (= Late collection (LC))
Číslo defaultní splátky | band1 | band2 | band3 | band4
1. | 20% | 25% | 30% | 35%
2.-4. | 50% | 55% | 60% | 65%
5.+ | 75% | 80% | 85% | 90%
(Graf: průběh recovery (LC) podle měsíce, pro kombinace čísla defaultní splátky a score bandu sp1-sp4, včetně odhadu.)
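Malá ilustrace vzorců pro KRN (CRE) a profit výše: následující skica v SASu používá pravděpodobnosti defaultních splátek z diagramu, ale všechny ostatní hodnoty (ztráty, recovery, úroková míra, očekávaný průměrný objem) jsou smyšlené pouze pro účely příkladu.

/* Instalment default probabilities copied from the diagram above; the loss
   amounts, recovery, interest rate and expected average volume are invented
   for this sketch only. */
data cre_example;
   array pd{10}   _temporary_ (0.06 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.03);
   array loss{10} _temporary_ (9000 8000 7000 6000 5000 4000 3000 2000 1000 500);
   recovery   = 0.60;
   interest   = 0.12;
   avg_volume = 5500;
   exp_loss = 0;
   do i = 1 to 10;
      exp_loss + pd{i}*loss{i};                 /* SUM(PD * Loss) */
   end;
   cre    = (1 - recovery) * exp_loss / avg_volume;   /* KRN / CRE */
   profit = (interest - cre) * avg_volume;            /* expected profit */
   put cre= percent8.2 profit= 8.1;
run;

S těmito smyšlenými čísly vychází CRE zhruba 9 % a očekávaný profit kolem 150 na jeden úvěr; smysl skici je pouze ukázat, jak do sebe oba vzorce zapadají.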
451 Cutoff na škále KRN (Graf: závislost podílu zamítnutých smluv a kumulativního KRN produkce na zvolené hodnotě KRN cutoff; vyznačena původní úroveň zamítání, počet smluv, počet zamítnutých a původní KRN.) 452 Cutoff na škále KRN (Detail grafu: nastavení cutoff při zachování úrovně zamítání odpovídá 18 % zamítnutých smluv; nastavení cutoff při zachování úrovně KRN odpovídá 15,1 % zamítnutých smluv.) 453 (Očekávaná) Marže = Úroková míra (vč. poplatků) - KRN - OPEX  Úroková míra  efektivní míra ideálního finančního toku (-výše úvěru - poplatky; anuita; anuita; ... ; anuita).  KRN  viz výše.  OPEX  cena peněz,  režijní náklady, variabilní náklady, podpora prodejní sítě,  náklady na administrátory – vlastní zaměstnance zajišťující zpracování úvěru. 454 Marže (Margin)  Optimální cutoff: marže = 0 455 RAROA (Risk Adjusted Return On Assets) 456 RAROA 457 RAROA 458 RAROA 459 Výhody RAROA
Měsíc | Case A Ideal flow | Case A Expected flow | Case B Ideal flow | Case B Expected flow
0 | -1000 | -1000 | -1000 | -1000
1 | 400 | 200 | 150 | 110
2 | 400 | 180 | 150 | 100
3 | 400 | 170 | 150 | 90
4 | 400 | 160 | 150 | 80
5 | | | 150 | 70
6 | | | 150 | 60
7 | | | 150 | 50
8 | | | 150 | 40
9 | | | 150 | 30
10 | | | 150 | 16
11 | | | 150 | 10
12 | | | 150 | 0
• A – krátkodobý úvěr s vysokým rizikem fraudu • B – dlouhodobý úvěr s vysokým rizikem defaultu. Úroková míra (A) = 22 %, úroková míra (B) = 10 %. Úvěr A je lepší, protože z něj plyne vyšší zisk (710 > 656), navíc je ho dosaženo mnohem dříve. KRN(A) = 44 %, KRN(B) = 20 % – cutoff na škále KRN preferuje B. Marže (A) = -22 %, marže (B) = -10 % – cutoff na škále marže preferuje B. RAROA (A) = -0.29, RAROA (B) = -0.36 – cutoff na škále RAROA preferuje A. 460 Cutoff segmentace  Možná segmentace podle:  prodejní síť (skupina obchodních míst)  profitabilita produktu  kvalita prodejního místa  typ zboží (pro spotřebitelské úvěry)  výše úvěru  … 461 Cutoff scénáře 462 Evaluation of Reject rate, Profitability, Default and Loss rates before and after a cutoff change, according to Distribution channel or Segment of scorecard.
Cutoff impact evaluation table (approved credits, Before Christmas vs. After Christmas):
Segment | Before: Reject rate | RAROA | Loss rate | Profit (per year) | After: Reject rate | RAROA | Loss rate | Profit (per year)
Segment 1 | 24.7% | 3.65% | 11.33% | 414 363 110 | 24.3% | 3.75% | 11.19% | 428 757 430
Segment 2 | 12.1% | 4.01% | 8.22% | 160 364 072 | 12.9% | 3.95% | 8.29% | 159 917 943
Segment 3 | 45.1% | 9.64% | 9.69% | 747 636 468 | 45.1% | 9.8% | 9.5% | 758 966 512
Segment 4 | 22.2% | 5.80% | 4.89% | 52 213 720 | 20.1% | 5.62% | 5.05% | 51 715 263
Segment 5 | 20.9% | 6.77% | 5.41% | 54 312 614 | 19.7% | 6.61% | 5.48% | 53 975 903
Segment 6 | 33.4% | 7.04% | 7.22% | 212 090 365 | 32.6% | 7.04% | 7.16% | 211 684 371
Segment 7 | 49.3% | 9.30% | 8.93% | 36 840 287 | 49.2% | 9.4% | 8.8% | 37 140 165
Segment 8 | 19.3% | 4.68% | 2.96% | 15 668 962 | 14.9% | 4.54% | 3.16% | 15 636 910
Segment 9 | 32.0% | 8.41% | 5.06% | 3 679 430 | 27.2% | 7.97% | 5.26% | 3 535 809
Segment 10 | 33.4% | 7.14% | 6.69% | 1 823 050 341 | 33.4% | 7.2% | 6.6% | 1 832 986 599
Segment 11 | 28.5% | 6.34% | 7.36% | 2 633 609 071 | 28.6% | 6.47% | 7.24% | 2 651 352 740
ALL | 32.6% | 6.64% | 8.37% | 6 153 828 440 | 32.6% | 6.96% | 8.17% | 6 205 669 645
Cutoff impact evaluation 463 Profitability, Default and Loss rates according to reject rate in one graph. (Graph "Characteristics of approved credits according to reject rate": profit per year, RAROA and loss rate plotted against the reject rate.) Decision: reasoning why the final cutoffs were chosen. Cutoff sensitivity analysis 464 Monitoring (Graf "Stabilita SF - týdny": vývoj Gini a K-S pro vzorek a týdny 2006-13 až 2006-22.)
 | výv. vzorek [1] | týden1 [2] | [3]=[2]-[1] | [4]=[2]/[1] | [5]=ln[4] | [6]=[3]*[5]
skóre_1 | 10,00% | 5,63% | -0,044 | 0,563 | -0,574 | 0,025
skóre_2 | 10,00% | 11,21% | 0,012 | 1,121 | 0,114 | 0,001
skóre_3 | 10,00% | 11,00% | 0,010 | 1,100 | 0,095 | 0,001
skóre_4 | 10,00% | 10,97% | 0,010 | 1,097 | 0,092 | 0,001
skóre_5 | 10,00% | 10,31% | 0,003 | 1,031 | 0,031 | 0,000
skóre_6 | 10,00% | 10,12% | 0,001 | 1,012 | 0,012 | 0,000
skóre_7 | 10,01% | 9,62% | -0,004 | 0,961 | -0,039 | 0,000
skóre_8 | 10,00% | 9,89% | -0,001 | 0,989 | -0,011 | 0,000
skóre_9 | 10,00% | 10,31% | 0,003 | 1,031 | 0,030 | 0,000
skóre_10 | 10,00% | 10,94% | 0,009 | 1,095 | 0,091 | 0,001
PSI | 0,030
465 Monitoring scoringových modelů  Není překvapivé, že prediktivní modely se ve statistickém slova smyslu chovají nejlépe na vývojovém vzorku dat. Výstupy těchto modelů, např. skóre nebo rating klienta, jsou počítány pomocí jistých vzorců, jejichž koeficienty příslušející nezávislým proměnným (prediktorům) jsou odvozeny na datech vývojového vzorku. Posun distribuce výstupu daného modelu je pak zapříčiněn právě změnou vstupních hodnot modelu, tj. prediktorů, v průběhu času. V podstatě ihned (alespoň většinou) po nasazení prediktivního modelu do praxe dochází k jistému poklesu jeho prediktivní síly, který je způsoben určitou změnou vstupních hodnot modelu. Zásadní je v praxi nastavení takových procesů, které odhalí, že se tak děje, proč se tak děje a jak vážný problém to ve svých důsledcích znamená.
466 Monitoring scoringových modelů  Faktorů způsobujících posun v distribuci prediktorů, a následně posun v distribuci výstupu prediktivního modelu, je několik:  přirozený posun v datech / změna demografické struktury dat  databázové chyby  změna datového zdroje  změna definice/formátu vstupních dat  změna datového univerza  ostatní 467 Monitoring scoringových modelů  Typickým příkladem prvního uvedeného důvodu je příjem klienta (všeobecným trendem je růst příjmu populace). Změnou definice/formátu vstupních dat je myšlena například situace, kdy je rozšířen číselník hodnot, kterých může vstupní proměnná nabývat. Změnou datového univerza je myšlen případ, kdy je vyvinutý prediktivní model použit např. pro odlišný/nový segment portfolia nebo odlišný/nový produkt. 468 Monitoring scoringových modelů  K-S, Gini: (Graf "Stabilita SF - týdny": vývoj Gini a K-S pro vzorek a týdny 2006-13 až 2006-22.) 469 Monitoring scoringových modelů  Čím strmější křivka, tím lépe.  V průběhu času se zplošťuje – jde o to, jak moc. (Graf: závislost defaultu na skóre po decilech, pro vývojový vzorek a týdny 2006-13 až 2006-22.) 470 Monitoring scoringových modelů  c-statistika: 471 Monitoring scoringových modelů  Chceme posoudit, zda se distribuce skóre na vývojovém vzorku liší od distribuce skóre v daném časovém intervalu: χ² = Σ_{i=1}^{r} (O_i - E_i)²/E_i, PSI = Σ_{i=1}^{r} (O_i - E_i)·ln(O_i/E_i). 472 Monitoring scoringových modelů
 | výv. vzorek [1] | týden1 [2] | [3]=[2]-[1] | [4]=[2]/[1] | [5]=ln[4] | [6]=[3]*[5]
skóre_1 | 10,00% | 5,63% | -0,044 | 0,563 | -0,574 | 0,025
skóre_2 | 10,00% | 11,21% | 0,012 | 1,121 | 0,114 | 0,001
skóre_3 | 10,00% | 11,00% | 0,010 | 1,100 | 0,095 | 0,001
skóre_4 | 10,00% | 10,97% | 0,010 | 1,097 | 0,092 | 0,001
skóre_5 | 10,00% | 10,31% | 0,003 | 1,031 | 0,031 | 0,000
skóre_6 | 10,00% | 10,12% | 0,001 | 1,012 | 0,012 | 0,000
skóre_7 | 10,01% | 9,62% | -0,004 | 0,961 | -0,039 | 0,000
skóre_8 | 10,00% | 9,89% | -0,001 | 0,989 | -0,011 | 0,000
skóre_9 | 10,00% | 10,31% | 0,003 | 1,031 | 0,030 | 0,000
skóre_10 | 10,00% | 10,94% | 0,009 | 1,095 | 0,091 | 0,001
PSI | 0,030
473 Monitoring scoringových modelů PSI < 0,1 značí žádný nebo jen velmi malý rozdíl daných distribucí skóre. 0,1 ≤ PSI < 0,25 znamená, že došlo k nějakému posunu distribuce, nicméně nikterak významnému. PSI ≥ 0,25 signalizuje významný posun v distribuci skóre, tj. zamítáme hypotézu o shodě daných distribucí. 474 Monitoring scoringových modelů (Graf: vývoj PSI a chi-kvadrát statistiky po týdnech 2006-13 až 2006-22.) 475 Monitoring scoringových modelů PSI_DR = Σ_{i=1}^{r} (DR_2i - DR_1i)·ln(DR_2i/DR_1i), kde DR_1i a DR_2i jsou default rate v i-tém pásmu skóre na vývojovém vzorku, resp. v daném období.
 | def_rate | Gini | PSI_DR | PSI | chi-kvadrát
vzorek | 7,69% | 0,643 | | |
200613 | 9,38% | 0,564 | 0,120 | 0,030 | 0,024
200614 | 9,35% | 0,542 | 0,131 | 0,034 | 0,027
200615 | 8,70% | 0,537 | 0,093 | 0,032 | 0,025
200616 | 8,57% | 0,523 | 0,089 | 0,033 | 0,026
200617 | 8,59% | 0,540 | 0,071 | 0,030 | 0,025
200618 | 9,19% | 0,544 | 0,111 | 0,030 | 0,024
200619 | 8,03% | 0,558 | 0,063 | 0,034 | 0,026
200620 | 8,52% | 0,552 | 0,055 | 0,023 | 0,019
200621 | 8,05% | 0,555 | 0,043 | 0,027 | 0,022
200622 | 7,76% | 0,539 | 0,039 | 0,045 | 0,034
Monitoring scoringových modelů 476 (Graf: vývoj def. rate, Gini, PSI_DR, PSI a chi-kvadrát statistiky po týdnech 2006-13 až 2006-22.)
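Výpočet PSI podle vzorce výše lze rozepsat v krátkém data stepu; následuje minimální skica nad decilovými podíly z monitorovací tabulky (vývojový vzorek vs. týden 1), názvy datasetu a proměnných jsou ilustrativní.

/* Expected (development) and actual (week 1) score-decile proportions,
   copied from the monitoring table above. */
data psi_calc;
   input expected actual;
   contrib = (actual - expected) * log(actual / expected);
   psi + contrib;                      /* running total = PSI */
   datalines;
0.1000 0.0563
0.1000 0.1121
0.1000 0.1100
0.1000 0.1097
0.1000 0.1031
0.1000 0.1012
0.1001 0.0962
0.1000 0.0989
0.1000 0.1031
0.1000 0.1094
;
run;

Výsledná hodnota psi je přibližně 0,030, tedy pod hranicí 0,1, což odpovídá žádnému nebo jen velmi malému posunu distribuce skóre v daném týdnu.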
477 Champion-challenger (mistr – vyzyvatel)  K rozšíření využití strategie champion-challenger došlo v devadesátých letech minulého století. Princip je velmi jednoduchý. Předpokládejme, že existuje nějaký způsob dělání něčeho (např. aktuálně používaný scoringový model pro schvalování/zamítání žádostí o úvěr). Tento způsob nazveme mistrem (champion). Nicméně existují další, jeden nebo více, alternativní způsoby, jak dosáhnout téhož (nebo velmi podobného) cíle. Tyto nazveme vyzyvateli (challengers). Na náhodném vzorku otestujeme vyzyvatele a porovnáme s mistrem. To nám umožní nejen porovnat efektivnost vyzyvatelů a mistra, ale získáme možnost identifikovat existenci a rozsah vedlejších efektů. Výsledkem pak může být zjištění, že některý z vyzyvatelů je lepší než mistr a tento vyzyvatel se stane novým mistrem. 478 479 11. Reference 480 Literatura - knihy  Anderson, R. (2007). The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation, Oxford: Oxford University Press.  Giudici, P. (2003). Applied Data Mining: statistical methods for business and industry, Chichester: Wiley.  Han, J., Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd ed., San Francisco: Morgan Kaufmann.  Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.  Hosmer, D. W., Lemeshow, S. (2000). Applied Logistic Regression, Textbook and Solutions Manual, 2nd ed., New York: John Wiley and Sons. 481 Literatura - knihy  Siddiqi, N. (2006). Credit Risk Scorecards: developing and implementing intelligent credit scoring, New Jersey: Wiley.  Thomas, L.C. (2009). Consumer Credit Models: Pricing, Profit, and Portfolio, Oxford: Oxford University Press.  Thomas, L.C., Edelman, D.B., Crook, J.N. (2002). Credit Scoring and Its Applications, Philadelphia: SIAM Monographs on Mathematical Modeling and Computation.  Wilkie, A.D. (2004). Measures for comparing scoring systems, In: Thomas, L.C., Edelman, D.B., Crook, J.N. (Eds.), Readings in Credit Scoring. Oxford: Oxford University Press, pp. 51-62.  Witten, I.H., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, San Francisco: Morgan Kaufmann. 482 Literatura - časopisy  Crook, J.N., Edelman, D.B., Thomas, L.C. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research, 183 (3), 1447-1465.  Hand, D.J. and Henley, W.E. (1997). Statistical Classification Methods in Consumer Credit Scoring: a review. Journal of the Royal Statistical Society, Series A, 160, No. 3, 523-541.  Harrell, F.E., Lee, K.L. and Mark, D.B. (1996). Multivariate prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15, 361-387.  Lilliefors, H.W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399-402.  Nelsen, R. B. (1998). Concordance and Gini's measure of association. Journal of Nonparametric Statistics, 9, Issue 3, 227-238.  Newson, R. (2006). Confidence intervals for rank statistics: Somers' D and extensions. The Stata Journal, 6(3), 309-334.  Somers, R. H. (1962). A new asymmetric measure of association for ordinal variables. American Sociological Review, 27, 799-811.  Thomas, L.C. (2000).
A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting, 16(2), 149-172 . Literatura - časopisy 483 Literatura - web  Coppock, D.S. (2002). Why Lift?, DM Review Online, www.dmreview.com/news/53291.html  Xu, K. (2003). How has the literature on Gini‟s index evolved in past 80 years?, www.economics.dal.ca/RePEc/dal/wparch/howgini.pdf  Xin Ming Tu, Wan Tang (2006). Categorical Data Analysis. http://www.urmc.rochester.edu/smd/biostat/people/faculty/TuSite/bst466/handouts.htm  Jiawei Han and Micheline Kamber (2006). Data Mining: Concepts and Techniques. http://www.cs.illinois.edu/~hanj/bk2/  Jens Peter Dittrich (2007). Data warehousing. http://www.dbis.ethz.ch/education/ss2007/07_dbs_datawh/Data_Mining.pdf  Joe Carthy (2006). Data Warehousing. http://www.csi.ucd.ie/staff/jcarthy/home/DataMining/DM-Lecture02-01.ppt  Jan Spousta (?). Přednášky k data miningu. [cit. 19.03.2009] http://samba.fsv.cuni.cz/~soukup 484 Další zajímavé zdroje informací http://www.cs.uiuc.edu/homes/hanj/ http://www-users.cs.umn.edu/~kumar/  http://www.kdnuggets.com/  http://www.kdnuggets.com/datasets/competitions.html  http://www.crc.man.ed.ac.uk/conference/  http://www.crc.man.ed.ac.uk/conference/archive/  http://www.kmining.com/info_conferences.html  http://en.wikipedia.org/wiki/Data_mining  http://cs.wikipedia.org/wiki/Data_mining  http://en.wikipedia.org/wiki/Credit_scorecards 485 Užitečné zdroje dat http://archive.ics.uci.edu/ml/ http://kdd.ics.uci.edu/ http://sede.neurotech.com.br:443/PAKDD2009/ http://www.dataminingbook.com/ http://www.stat.uni-muenchen.de/service/datenarchiv/welcome_e.html www.kaggle.com 486