DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 1 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -1Accountable Algorithms Joshua A. Kroll, Joanna Huey, Solon Barocas, Edward W. Felten, Joel R. Reidenberg, David G. Robinson, and Harlan Yu† Abstract Many important decisions historically made by people are now made by computers. Algorithms count votes, approve loan and credit card applications, target citizens or neighborhoods for police scrutiny, select taxpayers for IRS audit, grant or deny immigration visas, and more. The accountability mechanisms and legal standards that govern such decision processes have not kept pace with technology. The tools currently available to policymakers, legislators, and courts were developed to oversee human decision makers and often fail when applied to computers instead. For example, how do you judge the intent of a piece of software? Because automated decision systems can return potentially incorrect, unjustified or unfair results, additional approaches are needed to make such systems accountable and governable. This Article reveals a new technological toolkit to verify that automated decisions comply with key standards of legal fairness. We challenge the dominant position in the legal literature that transparency will solve these problems. Disclosure of source code is often neither necessary (because of alternative techniques from computer science) nor sufficient (because of the issues analyzing code) to demonstrate the fairness of a process. Furthermore, transparency may be undesirable, such as when it permits tax cheats or terrorists to game the systems determining audits or security screening or discloses private information. The central issue is how to assure the interests of citizens, and society as a whole, in making these processes more accountable. This Article argues that technology is creating new opportunities--subtler and more flexible than total transparency--to design decisionmaking algorithms so that they better align with legal and policy objectives. Doing so will improve not only the current governance of automated decisions, but also--in certain cases--the governance of decisionmaking in general. The implicit (or explicit) biases of human decisionmakers can be difficult to find and root out, but we can peer into the “brain” of an algorithm: computational processes and purpose specifications can be declared prior to use and verified afterwards. The technological tools introduced in this Article apply widely. They can be used in designing decisionmaking processes from both the private and public sectors, and they can be tailored to verify different characteristics as desired by decisionmakers, regulators, or the public. By forcing a more careful consideration of the effects of decision rules, they also engender policy discussions and closer looks at legal standards. As such, these tools have far-reaching implications throughout law and society. † Respectively, System Engineer, CloudFlare and Affiliate, Center for Information Technology Policy, Princeton; Associate Director, Center for Information Technology Policy, Princeton; Post Doctoral Research Associate, Princeton; Robert E. Kahn Professor of Computer Science and Public Affairs, Princeton; Stanley D. and Nikki Waxberg Chair in Law, Fordham Law School; Principal, Upturn; Principal, Upturn. 
For helpful comments, the authors are very grateful to participants at the Berkeley Privacy Law Scholars Conference and at the NYU School of Law conference on “Accountability and Algorithms.” DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 2 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -2Part I of this Article provides an accessible and concise introduction to foundational computer science concepts that can be used to verify and demonstrate compliance with key standards of legal fairness for automated decisions without revealing key attributes of the decisions or the processes by which the decisions were reached. Part II then describes how these techniques can assure that decisions are made with the key governance attribute of procedural regularity, meaning that decisions are made under an announced set of rules consistently applied in each case. We demonstrate how this approach could be used to redesign and resolve issues with the State Department’s diversity visa lottery. In Part III, we go further and explore how other computational techniques can assure that automated decisions preserve fidelity to substantive legal and policy choices. We show how these tools may be used to assure that certain kinds of unjust discrimination are avoided and that automated decision processes behave in ways that comport with the social or legal standards that govern the decision. We also show how automated decision making may even complicate existing doctrines of disparate treatment and disparate impact, and we discuss some recent computer science work on detecting and removing discrimination in algorithms, especially in the context of big data and machine learning. And lastly, in Part IV, we propose an agenda to further synergistic collaboration between computer science, law, and policy to advance the design of automated decision processes for accountability. Table of Contents I. How Computer Scientists Build and Evaluate Software.................................................... 10 A. Assessing Computer Systems ....................................................................................... 10 1. Static Analysis: Review from Source Code Alone.................................................... 13 2. Dynamic Testing: Examining a Program’s Actual Behavior .................................... 16 3. The Fundamental Limit of Testing: Noncomputability............................................. 18 B. The Importance of Randomness.................................................................................... 19 II. Designing Computer Systems for Procedural Regularity ................................................. 22 A. Transparency and Its Limits.......................................................................................... 23 C. Technical Tools for Procedural Regularity................................................................... 27 1. Software Verification................................................................................................. 27 2. Cryptographic Commitments..................................................................................... 30 3. Zero-Knowledge Proofs............................................................................................. 32 4. Fair Random Choices................................................................................................. 33 D. Applying Technical Tools Generally............................................................................ 36 1. 
Current DVL Procedure............................................................................................. 38 2. Transparency Is Not Enough ..................................................................................... 39 3. Designing the DVL for Accountability...................................................................... 39 III. Designing Algorithms to Assure Fidelity to Substantive Policy Choices ....................... 41 DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 3 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -3A. Machine Learning, Policy Choices, and Discriminatory Effects.................................. 42 B. Technical Tools for NonDiscrimination ....................................................................... 45 1. Learning from Experience ......................................................................................... 46 2. Fair Machine Learning............................................................................................... 47 3. Discrimination, Data Use, and Privacy...................................................................... 51 C. AntiDiscrimination Law and Algorithmic Decisionmaking......................................... 53 1. Ricci v. DeStefano: The Tensions Between Equal Protection, Disparate Treatment, and Disparate Impact ............................................................................................................ 54 2. Ricci Impels Designing for NonDiscrimination ........................................................ 56 IV. Fostering Collaboration Across Computer Science, Law, and Policy............................. 56 A. Recommendations for Computer Scientists: Design for After-the-Fact Oversight...... 57 B. Recommendations for Lawmakers and Policymakers .................................................. 60 1. Reduced Benefits of Ambiguity................................................................................. 60 2. Accountability to the Public....................................................................................... 63 3. Secrets and Accountability ........................................................................................ 65 Introduction Many important decisions that were historically made by people are now made by computer systems:1 votes are counted; voter rolls are purged; loan and credit card applications are approved;2 welfare and financial aid decisions are made;3 taxpayers are chosen for audits; citizens or neighborhoods are targeted for 1 In this work, we use the term “computer system” where others have used the term “algorithm.” See, e.g., Frank Pasquale, The Black Box Society: The Secret Algorithms That Control Money and Information (2015). This allows us to separate the concept of a computerized decision from the actual machine that effects it. See infra note 14 for a more detailed explanation. 2 See, e.g., Calyx - More Than Just an LOS, Calyx Software (Mar. 2013), http://www.calyxsoftware.com/company/newsletters/13-03.html [https://perma.cc/E93L-8UGD] (noting that Calyx offers clients an automated underwriting system to vet loan applications for approval against predetermined guidelines). 3 See Virginia Eubanks, Caseworkers v. Computers, PopTech (Dec. 11, 2013, 3:10 PM), http://virginiaeubanks.wordpress.com/2013/12/11/caseworkers-vs-computers [https://perma.cc/37VG-GQC6] (describing and critiquing several states’ efforts to automate welfare eligibility determinations). 
DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 4 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -4police scrutiny;4 air travelers are selected for search;5 and visas are granted or denied. The efficiency and accuracy of automated decisionmaking ensures that its domain will continue to expand. Even mundane activities now involve complex computerized decisions: everything from cars to home appliances now regularly executes computer code as part of its normal operation. However, the accountability mechanisms and legal standards that govern decision processes have not kept pace with technology. The tools currently available to policymakers, legislators, and courts were developed primarily to oversee human decisionmakers. Many observers have argued that our current frameworks are not well adapted for situations in which a potentially incorrect,6 unjustified,7 or unfair8 outcome emerges from a computer. Citizens, and society as a whole, have an interest in making these processes more accountable. If these new inventions are to be made governable, this gap must be bridged. In this Article, we describe how authorities can demonstrate--and how the public at large and oversight bodies can verify--that automated decisions comply with key standards of legal fairness. We consider two approaches: ex ante approaches aiming to establish that the decision process works as expected (which are commonly studied by technologists and computer scientists), and ex post approaches once decisions have been made, such as review and oversight (which are common in existing governance structures). Our proposals aim to use the tools of the first approach to guarantee that the second approach can function effectively. Specifically, we describe how technical tools for verifying the correctness of computer systems can be used to ensure that appropriate evidence exists for later oversight. We begin with an accessible and concise introduction to the computer science concepts on which our argument relies, drawn from the fields of software verification, testing, and cryptography. Our argument builds on the fact that technologists can and do verify, for themselves, that software systems work in accordance with known designs. No computer system is built and deployed in the 4 See David Robinson, Harlan Yu & Aaron Rieke, Civil Rights, Big Data, and Our Algorithmic Future 18-19 (2014), http://bigdata.fairness.io/wp-content/uploads/2014/11/ Civil_Rights_Big_Data_and_Our_Algorithmic-Future_v1.1.pdf [https://perma.cc/UL3G-3MQ7] (describing the Chicago Police Department’s “‘Custom Notification Program,’ which sends police (or sometimes mails letters) to peoples’ homes to offer social services and a tailored warning”). 5 See Notice of Modified Privacy Act System of Records, 78 Fed. Reg. 55,270, 55,271 (Sept. 10, 2013) (“[T]he passenger prescreening computer system will conduct risk-based analysis of passenger data . . . . TSA will then review this information using intelligence-driven, risk-based analysis to determine whether individual passengers will receive expedited, standard, or enhanced screening . . . .”). 6 See Danielle Keats Citron, Technological Due Process, 85 Wash. U. L. Rev. 1249, 1256 (2008) (describing systemic errors in the automated eligibility determinations for federal benefits programs). 7 See id. at 1256-57 (noting the “crudeness” of algorithms designed to identify potential terrorists that yield a high rate of false positives). 8 See Solon Barocas & Andrew D. 
Selbst, Big Data’s Disparate Impact, 104 Calif. L. Rev. 671, 677 (2016) (“[D]ata mining holds the potential to unduly discount members of legally protected classes and to place them at systematic relative disadvantage.”). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 5 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -5world shrouded in total mystery.9 While we do not advocate any specific liability regime for the creators of computer systems, we outline the range of tools that computer scientists and other technologists already use, and show how those tools can ensure that a system meets specific policy goals. In particular, while some of these tools provide assurances only to the system’s designer or operator, other established methods could be leveraged to convince a broader audience, including regulators or even the general public. The tools available during the design and construction of a computer system are far more powerful and expressive than those that can be bolted on to an existing system after one has been built. We argue that in many instances, designing a system for accountability can enable stakeholders to reach accountability goals that could not be achieved by imposing new transparency requirements on existing system designs. We show that computer systems can be designed to prove to oversight authorities and the public that decisions were made under an announced set of rules consistently applied in each case, a condition we call procedural regularity. The techniques we describe to ensure procedural regularity can be extended to demonstrate adherence to certain kinds of substantive policy choices, such as blindness to a particular attribute (e.g., race in credit underwriting). Procedural regularity ensures that a decision was made using consistently applied standards and practices. It does not, however, guarantee that such practices are themselves good policy. Ensuring that a decision procedure is well justified or relies on sound reasoning is a separate challenge from achieving procedural regularity. While procedural regularity is a well understood and generally desirable property for automated and nonautomated governance systems alike, it is merely one principle around which we can investigate a system’s fairness. It is common, for example, to ask whether a computer system avoids certain kinds of unjust discrimination, even when such systems are blind to certain attributes (e.g., gender in automated hiring decisions). We later expand our discussion and show how emerging computational techniques can assure that automated decisions satisfy other notions of fairness that are not merely procedural, but actively consider a system’s effects. We describe in particular detail techniques for avoiding discrimination, even in machine learning systems that derive their decision rules from data rather than from code written by a programmer. Finally, we propose next steps to further the emerging and critically important collaboration between computer scientists and policymakers. 9 Although some machine-learning systems produce results that are difficult to predict in advance and well beyond traditional interpretation, the choice to field such a system instead of one which can be interpreted and governed is itself a decision about the system’s design. While we do not advocate that any approach should be forbidden for any specific problem, we aim to show that advanced tools exist that provide the desired functionality while also permitting oversight and review. 
DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 6 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -6Legal scholars have argued for twenty years that automated processing requires more transparency,10 but it is far from obvious what form such transparency should take. Perhaps the most obvious approach is to disclose a system’s source code, but this is at best a partial solution to the problem of accountability for automated decisions. The source code of computer systems is illegible to nonexperts. In fact, even experts often struggle to understand what software code will do: inspecting source code is a very limited way of predicting how a computer program will behave.11 Machine learning, one increasingly popular approach to automated decisionmaking, is particularly ill-suited to source code analysis because it involves situations where the decisional rule itself emerges automatically from the specific data under analysis, sometimes in ways that no human can explain.12 In this case, source code alone teaches a reviewer very little, since the code only exposes the machine learning method used and not the datadriven decision rule. Moreover, in many of the instances that people care about, full transparency will not be possible. The process for deciding which tax returns to audit, or whom to pull aside for secondary security screening at the airport, may need to be partly opaque to prevent tax cheats or terrorists from gaming the system. When the decision being regulated is a commercial one, such as an offer of credit, transparency may be undesirable because it defeats the legitimate protection of consumer data, commercial proprietary information, or trade secrets. Finally, when an explanation of how a rule operates requires disclosing the data under analysis and those data are private or sensitive (e.g., in adjudicating a commercial offer of credit, a lender reviews detailed financial information about the applicant), disclosure of the data may be undesirable or even legally barred. Furthermore, making the rule transparent--whether through source code disclosure or otherwise--may still fail to resolve the concerns of many participants. No matter how much transparency surrounds a rule, people can still wonder whether the disclosed rule was actually used to reach a decision in their own cases. Particularly where an element of randomness is involved in the process, a person audited or patted down may wonder: Was I really chosen by the rule, or has some bureaucrat singled me out on a whim? But full disclosure of how particular decisions were reached is often unattractive because the decisions themselves often incorporate sensitive health, financial, or other private information either as input, 10 See, e.g., Citron, supra note 6, at 1253 (describing automated decisionmaking as “adjudicat[ion] in secret”); Paul Schwartz, Data Processing and Government Administration: The Failure of the American Legal Response to the Computer, 43 Hastings L.J. 1321, 1323-25 (1992) (“So long as government bureaucracy relies on the technical treatment of personal information, the law must pay attention to the structure of data processing . . . . There are three essential elements to this response: structuring transparent data processing systems; granting limited procedural and substantive rights . . . and creating independent governmental monitoring of data processing systems.” (emphasis omitted)). 11 See infra subsection I.A.1 (discussing static analysis). 
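To make the point concrete, consider the following short Python sketch, offered purely for illustration: the function names, scores, and example datasets are invented and drawn from no actual lender’s system. The complete source code of this toy credit-screening learner is shown below, yet the decision rule it ultimately applies, a numeric score cutoff, appears nowhere in that code; it is derived from whatever historical data the program is given.

# Illustrative sketch only: a toy "learner" whose decision rule is not written
# in the code but computed from historical data supplied at run time.
def learn_cutoff(labeled_examples):
    """Pick the score cutoff that best separates past good outcomes from bad ones."""
    candidates = sorted(score for score, _ in labeled_examples)
    def errors(cutoff):
        return sum((score >= cutoff) != was_good
                   for score, was_good in labeled_examples)
    return min(candidates, key=errors)

def decide(applicant_score, cutoff):
    # The rule actually applied to applicants exists only after training.
    return "approve" if applicant_score >= cutoff else "deny"

# Two different (hypothetical) histories yield two different rules from identical code.
history_a = [(620, False), (640, False), (700, True), (720, True)]
history_b = [(500, False), (530, True), (560, True), (720, True)]
print(decide(650, learn_cutoff(history_a)))   # prints "deny"
print(decide(650, learn_cutoff(history_b)))   # prints "approve"

Reading the code in isolation reveals only how a cutoff will be chosen. The cutoff actually applied to applicants, and therefore the rule that matters for oversight, comes into existence only once the training data are fixed, which is why disclosure of learning code alone reveals so little about a machine learning system’s behavior.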
12 See Stanford University, Machine Learning, Coursera, https://www.coursera.org/learn/machinelearning/home/info [https://perma.cc/L7KF-CDY4] (“Machine learning is the science of getting computers to act without being explicitly programmed.”). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 7 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -7output, or both (for example, an individual’s tax audit status may be sensitive or protected on its own, but it may also imply details about that individual’s financial data). Even full disclosure of a decision’s provenance to that decision’s subject can be problematic. Most individuals are ill-equipped to review how computerized decisions were made, even if those decisions are reached transparently. Further, the purpose of computer-mediated decisionmaking is to bring decisions an element of scale, where the same rules are ostensibly applied to a large number of individual cases or are applied extremely quickly. Individuals auditing their own decisions (or experts assisting them) would be both inundated with the need to review the rules applied to them and often able to generalize their conclusions to the results of others, raising the same disclosure concerns described above. That is, while transparency of a rule makes reviewing the basis of decisions more possible, it is not a substitute for individualized review of particular decisions.13 Fortunately, technology is creating new opportunities--more subtle and flexible than total transparency--to make automated decisionmaking more accountable to legal and policy objectives. Although the current governance of automated decisionmaking is underdeveloped, computerized processes can be designed for governance and accountability. Doing so will improve not only the current governance of computer systems, but also--in certain cases--the governance of decisionmaking in general. This Article argues that in order for a computer system to function in an accountable way--either while operating an important civic process or merely engaging in routine commerce--accountability must be part of the system’s design from the start. Designers of such systems--and the nontechnical stakeholders who often oversee or control system design--must begin with oversight and accountability in mind. We offer examples of currently available tools that could aid in that design, as well as suggestions for dealing with the apparent mismatch between policy ambiguity and technical precision. In Part I of this Article, we provide an accessible introduction to how computer scientists build and evaluate computer systems and the software and algorithms14 that comprise them. In particular, we describe how computer scientists 13 Even when experts can pool investigative effort across many decisions, there is no guarantee that the basis for decisions will be interpretable or that problems of fairness or even overt special treatment for certain people will be discovered. Further, a regime based on individuals auditing their own decisions cannot adequately address departures from an established rule, which favor the individual auditing her own outcome, or properties of the rule, which can only be examined across individuals (such as nondiscrimination). 14 In this Article, we limit our use of the word “algorithm” to its usage in computer science, where it refers to a well-defined set of steps for accomplishing a certain goal. 
In other contexts, where other authors have used the term “algorithm,” we describe “automated decision processes” reflecting “decision policies” implemented by pieces of “software,” all comprising “computer systems.” Our adoption of the phrase “computer systems” was suggested by (and originally due to) Helen Nissenbaum and we are grateful for the precision it provides. See generally Batya Friedman & Helen Nissenbaum, Bias in Computer Systems, 14 ACM Transactions on Info. Sys. 330 (1996). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 8 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -8evaluate a program to verify that it has desired properties, and discuss the value of randomness in the construction of many computer systems. We characterize what sorts of properties of a computer system can be tested and describe one of the fundamental truths of computer science--that there are some properties of computer systems which cannot be tested completely. We observe that computer systems fielded in the real world are (or at least should be) tested regularly during creation, deployment, and operation, merely to establish that they are actually functional. Part II examines how to design computer systems for procedural regularity, a key governance principle enshrined in law and public policy in many societies. We consider how participants, decision subjects, and observers can be assured that each individual decision was made according to the same procedure-- for example, how observers can be assured that the decisionmaker is not choosing outcomes on a whim while merely claiming to follow an announced rule. We describe why mere disclosure of a piece of source code can be impractical or insufficient for these ends. Indeed, without full transparency--including source code, input data, and the full operating environment of the software--even the disclosure of audit logs showing what a program did while it was running provides no guarantee that the disclosed The term “algorithm” is assigned disparate technical meanings in the literatures of computer science and other fields. The computer scientist Donald Knuth famously defined algorithms as separate from mathematical formulae in that (1) they must “always terminate after a finite number of steps;” (2) “[e]ach step of an algorithm must be precisely defined; the actions to be carried out must be rigorously and unambiguously specified for each case;” (3) input to the algorithm is “quantities that are given to it initially before the algorithm begins;” (4) an algorithm’s output is “quantities that have a specified relation to the inputs;” and (5) the operations to be performed in the algorithm “must all be sufficiently basic that they can in principle be done exactly and in a finite length of time by someone using pencil and paper.” 1 Donald E. Knuth, The Art of Computer Programming: Fundamental Algorithms 4-6 (1968). Similarly and more simply, a widely used computer science textbook defines an algorithm as “any well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output.” Thomas H. Cormen et al., Introduction to Algorithms 10 (2d ed. 2001). 
By contrast, communications scholar Christian Sandvig says that “‘algorithm’ refers to the overall process” by which some human actor uses a computer to do something, including decisions made by humans as to what the computer should do, choices made during implementation, and even choices about how algorithms are represented and marketed to the public. Christian Sandvig, Seeing the Sort: The Aesthetic and Industrial Defense of “The Algorithm,” Media-N, http://median.newmediacaucus.org/art-infrastructures-information/seeing-the-sort-the-aestheticand-industrial-defense-of-the-algorithm [https://perma.cc/29E4-S44S]. Sandvig argues that even algorithms as simple as sorting “have their own public relations” and are inherently human in their decisions. Id. Another communications scholar, Nicholas Diakopoulos, defines algorithms in the narrow sense as “a series of steps undertaken in order to solve a particular problem or accomplish a defined outcome,” but also considers them in the broad sense, saying that “algorithms can arguably make mistakes and operate with biases,” which does not make sense for the narrower technical definition. Nicholas Diakopoulos, Algorithmic Accountability: Journalistic Investigation of Computational Power Structures, 3 Digital Journalism 398, 398, 400 (2015). This confusion is common to much of the literature on algorithms and accountability, which we describe throughout this paper. To avoid confusion, this paper adopts the precise definition of the word “algorithm” from computer science and, following Friedman and Nissenbaum, refers to the broader concept of an automated system deployed in a social or human context as a “computer system.” DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 9 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -9information actually reflects a computer system’s behavior.15 In order to move beyond the need for full transparency, we focus on tools that can communicate partial information about secret processes, so that accountability and oversight continue to function even when policy interests, personal privacy, trade secrets, or other concerns protect a computer system, a piece of software, its inputs, its outputs, or its environment from disclosure. Putting it all together, we provide an illustrative example of how to redesign an existing, legally mandated automated decisionmaking system--the State Department’s Diversity Visa Lottery--so that it is provably accountable. Part III considers the broader question of how to assess a computer system’s compliance with policy principles that go beyond procedural regularity. These broader properties include determining whether automated decision systems treat people (including protected groups) in ways that comport with the social or legal standards that govern the decision being made.16 This broadening raises the issue of translating a policy principle into a property of the system. Certain substantive policy choices translate easily: for example, prespecified rules such as blindness to a sensitive attribute.17 However, defining other policy objectives, such as a general notion of nondiscrimination, is a more complicated and fraught affair, particularly when systems rely on machine learning rather than decision rules explicitly predetermined by humans. We explore in particular the discriminatory effect that automated decisionmaking can have, noting real-world examples of newfound risks and describing some system properties that may align with policy goals. 
Finally, we observe how automated decisionmaking may complicate the existing doctrine of disparate treatment and disparate impact. Part IV concludes by calling for increased collaboration between computer scientists and policymakers to develop and apply technical tools for the governance of computer systems. Given the ever-widening reach of automated decisions, computer scientists need to understand the policy challenges of oversight, and policymakers need to understand where new and emerging software tools can help address those challenges. We offer recommendations for bridging the gap between technologists’ desire for specificity and the policy process’s need for ambiguity. As a first step, we urge policymakers to recognize that accountability is feasible even when the details of a computer system are not fully known or must be kept secret. 15 The environment of a computer system includes anything it might interact with. For example, an outside observer will need to know what other software was running on a particular computer to ensure that nothing modified the behavior of the disclosed program. Some programs also observe (and change their behavior based on) the state of the computer they are running on (such as which files were or were not present or what other programs were running), the time they were run, or even the configuration of hardware on the system on which they were run. 16 This type of evaluation depends upon having already verified procedural regularity: if it cannot be determined that a particular algorithm was used to make a decision, it is fruitless to try to verify properties of that algorithm. 17 A concrete example would be the requirement that a decision only account for certain information for certain purposes, as in a system for screening job applicants that is allowed to take the gender of applicants as input, but only for the purpose of keeping informational statistics and not for making screening decisions. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 10 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -10We also argue that the ambiguities, contradictions, and uncertainties of the policy process need not discourage computer scientists from engaging constructively in it. I. How Computer Scientists Build and Evaluate Software Fundamentally, computers are general purpose machines that can be programmed to do any computational task, though they lack the desirable specificity and limitations of physical devices.18 Engineers often seek strong digital evidence that a computer system is working as intended. Such evidence may be persuasive for the system’s creator or operator only, for a predesignated group of receivers such as an oversight authority, or for the public at large. In many cases, systems are carefully evaluated and tested before they make it to the real world. Evidence that is convincing to the public and sufficiently nonsensitive to be disclosed widely is the most effective and desirable for ensuring accountability. In this Part, we examine how computer scientists think about software assurance, how software is built and tested in the software industry, and what tools are available to get assurances about an individual piece of software or a large computer system. Thus, this section provides a brief and accessible map of key concepts and offers some insight into how computer scientists think about and approach these challenges. A. 
Assessing Computer Systems In general, a computer program is something that takes a set of inputs and produces a set of outputs. All too often, programs fail to work as their authors intended, because the programs have bugs or make assumptions about the input data which are not always true. Programmers often structure or design programs with an eye towards evaluation and testing in order to avoid or minimize these pitfalls.19 Many respected and popular approaches to software engineering are based on the idea that code should be written in ways that make it easier to analyze.20 For example, the programmer can:  Organize the code into modules that can be evaluated separately and then combined.21  Test these modules for proper functionality both individually and in groups, possibly even testing the entire computer system end-to-end. Such testing generally involves writing test cases, or expected scenarios in which each module 18 For example, hydraulically operated control surfaces in a vehicle will telegraph resistance to the operator when they are close to a dangerous configuration, but the same controls operated by computer can omit feedback, allowing the computer to request configurations of actuators that are beyond their tolerances. This is a problem especially in the design of robotic arms and fly-by-wire systems for aircraft. 19 See Andrew Hunt & David Thomas, The Pragmatic Programmer: From Journeyman to Master (2000). 20 In particular, Test Driven Development (TDD) is a software engineering methodology practiced by many major software companies. For a general description of how TDD integrates automated testing into software design, see Kent Beck, Test-Driven Development: By Example (2003). 21 See Hunt & Thomas, supra note 19, at . DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 11 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -11will run, and may involve running test cases each time the software is changed to avoid introducing new bugs or taking away functionality unintentionally.22  Annotate the code with “assertions,” simple statements about the code that describe error conditions under which the program should crash immediately. Assertions are intended to be true if the program is running as expected. They become false when something is amiss and cause the program to crash (with an error message), rather than continuing in an errant state. In this way, assertions are a special kind of program error--a point at which a piece of software considers its internal state and its environment, and stops if these do not match what had been assumed by the program’s author.23  Provide a detailed description specifying the program’s behavior along with a machine-checkable proof that the code satisfies this specification. This differs from using assertions in that the proof guarantees ahead of time that the program will work as intended in all cases (or, equivalently, that an assertion of the facts covered by the proof will never fail to be true). When feasible, this approach is the most helpful thing a programmer can do to facilitate testing because it can provide real proof (rather than just circumstantial evidence or evidence linked to a particular point in a program’s execution, as with assertions) that the whole program operates as expected.24 These techniques are illustrative examples from a larger toolbox. Testing and verification of software and the development of tools to facilitate it comprise a rich and active subfield of computer science research. 
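To illustrate the first three techniques in the list above, the following short Python sketch (our own, with invented function and test names rather than code from any real system) shows a small module containing one function, assertions that halt the program if its assumptions are violated, and a test case exercising the module’s announced behavior:

import unittest

def allocate_audit_slots(applicants, num_slots):
    """Select the first num_slots applicants, in announced (sorted) order."""
    # Assertion: crash immediately, with an error, if a caller violates an assumption.
    assert num_slots >= 0, "number of audit slots must not be negative"
    selected = sorted(applicants)[:num_slots]
    # Assertion stating a fact the module promises: never select more people than slots.
    assert len(selected) <= num_slots
    return selected

class TestAllocateAuditSlots(unittest.TestCase):
    # Test cases: expected scenarios, re-run whenever the software changes.
    def test_selects_announced_number(self):
        self.assertEqual(allocate_audit_slots(["b", "a", "c"], 2), ["a", "b"])

    def test_handles_fewer_applicants_than_slots(self):
        self.assertEqual(allocate_audit_slots(["a"], 5), ["a"])

if __name__ == "__main__":
    unittest.main()

Running the file executes the test cases; a failing test or assertion points the programmer to the precise expectation that was violated, which is what makes code structured this way easier to analyze.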
A thriving industry builds tools to assist in the development of software, and there is a constant debate about the practice of software engineering in the technology industry more broadly. Below, we give a brief taxonomy of this area, as well as some examples of tools, expectable results, and limitations of each category. We conclude this Section by 22 See Steve McConnell, Code Complete: A Practical Handbook of Software Construction (2d ed. 2004). 23 This technique is originally due to 1 Herman H. Goldstine & John von Neumann, Planning and Coding of Problems for an Electronic Computing Instrument (1947), but it is now a widely used technique. For a historical perspective, see Lori A. Clarke & DavidS. Rosenblum, A Historical Perspective on Runtime Assertion Checking in Software Development, Software Engineering Notes (ACM Special Interest Grp. on Software Eng’g, New York, N.Y.), May 2006, at 25. 24 A simple example is a technique called model checking, usually applied to computer hardware designs, in which the property desired and the hardware or program are represented as logical formulae, and an automated tool performs an exhaustive search (i.e., tries all possible inputs) to check whether those formulae are not consistent. See, e.g., Edmund M. Clarke, Jr. et al., Model Checking (1999). An even simpler example comes from the concept of types in programming languages, which associate the data values on which the program operates into descriptive classes and provide rules for how those classes should interact. For example, it should not be possible to add mathematically a number like “42” to a string of text like “Hello, World!” Because both kinds of data are represented inside the computer as bits and bytes, without a type system, the computer would be free to try executing this nonsensical behavior, which might lead to bugs. Type systems can help programmers avoid mistakes and express extremely complex relationships among the data processed by the program. For a more thorough explanation of type systems and model checking, see Benjamin C. Pierce, Types and Programming Languages (2002). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 12 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -12describing why testing code after it has been written, however extensively, cannot provide true assurance of how the system works, because any analysis of an existing computer program is inherently and fundamentally incomplete. This incompleteness implies that observers can never be certain that a computer system has a desired property unless that system has been designed to guarantee that property. When technologists evaluate computer systems, they attempt to establish invariants, or facts about a program’s behavior which are always true regardless of a program’s internal state or what input data the program receives.25 Invariants can cover details as small as the behavior of a single line of code but can also express complex properties of entire programs, such as which users have access to which data or under what conditions the program can crash. 
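As a concrete, invented illustration of such an invariant, consider the following Python sketch, in which every read of stored records flows through a single check. Because that check cannot be bypassed by any input, the property that no user ever receives a record outside that user’s authorization holds regardless of the program’s internal state or the data it is given (the role and record names here are hypothetical):

# Illustrative sketch only: an invariant ("no user ever receives a record outside
# the user's authorization") enforced at the single point through which all
# data access flows, so it holds for every input the program receives.
AUTHORIZED = {
    "caseworker": {"benefits_record"},
    "auditor": {"benefits_record", "payment_record"},
}

def fetch(user_role, record_type, store):
    # The one check every read passes through; this is what makes the invariant hold.
    if record_type not in AUTHORIZED.get(user_role, set()):
        raise PermissionError(user_role + " may not read " + record_type)
    return store[record_type]

store = {"benefits_record": "(record contents)", "payment_record": "(record contents)"}
print(fetch("auditor", "payment_record", store))        # permitted by the invariant
try:
    fetch("caseworker", "payment_record", store)        # would violate the invariant
except PermissionError as err:
    print("blocked:", err)

The same idea scales up: by routing all security-relevant behavior through small, checkable chokepoints, a programmer makes it feasible to establish the invariant for the system as a whole.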
By structuring code modules and programs in a way that makes it easy to establish simple invariants, it is possible to build an entire computer system for which important desirable invariants can be proved.26 Together, the set of invariants that a program should have make up its specification.27 Software code is, ultimately, a rigid and exact description of itself: the code both describes and causes the computer’s behavior when it runs.28 In contrast, public policies and laws are characteristically imprecise, often deliberately so.29 Thus, even when a well-designed piece of software does assure certain properties, there will always remain some room to debate whether those assurances match the 25 See C.A.R. Hoare, An Axiomatic Basis for Computer Programming, Comm. ACM, Oct. 1969, at 576, . 26 See id. at . 27 Specifications can be formal and written in a specification language, in which case they are rather like computer programs unto themselves. For example, the early models of core internet technology were written in a language called LOTOS, built for that purpose. See Tommaso Bolognesi & Ed Brinksma, Introduction to the ISO Specification Language LOTOS, 14 Computer Networks & ISDN Sys. 25, (1987). Other common specification languages in practical use include Z and UML. It is even possible to build an executable program by compiling such a specification into a programming language or directly into machine code, an area of computer science research known as program synthesis. See Zohar Manna & Richard Waldinger, A Deductive Approach to Program Synthesis, 2 ACM Transactions on Programming Languages & Sys. 90, (1980). Research has shown that when the language in which a program or specification is written more closely matches a human-readable description of the program’s design goals, programs are written with fewer bugs. See Michael C. McFarland et al., The High-Level Synthesis of Digital Systems, 78 Proc. IEEE 301 (1990) (summarizing early “high-level language” oriented program synthesis); see also McConnell, supra note 22, at (arguing that the number of software bugs is constant when measured by lines of written code). Specifications can also be informal and take the form of anything from a mental model of a system in the mind of a programmer to a detailed written document describing all goals and use cases for a system. The world of industrial software development is full of paradigms and best practices for producing specifications and building code that meets them. 28 See David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface (5th ed. 2014). 29 See, e.g., Joseph A. Grundfest & A.C. Pritchard, Statutes with Multiple Personality Disorders: The Value of Ambiguity in Statutory Design and Interpretation, 54 Stan. L. Rev. 627, 628 (2002) (explaining how legislators often obscure the meaning of a statute to allow for multiple interpretations). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 13 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -13requirements of public policy. The methods described in this Article are designed to forward, rather than to foreclose, debates about what laws mean and how they ought to work. Our approach aims to empower the policy process, by empowering the policymaker’s tools for dealing with ambiguity and lack of precision, namely review and oversight. 
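It may help to show what a formal specification looks like in the axiomatic style cited in note 25; the particular program fragment and conditions below are our own illustrative choices, not an example drawn from any system discussed in this Article. A specification is written as a set of triples of the form

\{\, P \,\}\; C \;\{\, Q \,\}, \qquad \text{for example} \qquad \{\, n \ge 0 \,\} \quad s := 1 + 2 + \cdots + n \quad \{\, s = n(n+1)/2 \,\},

read as follows: if the precondition P (here, that the input n is nonnegative) holds before the program fragment C runs, then the postcondition Q (here, that the computed sum s equals n(n+1)/2) is guaranteed to hold after it finishes. A machine-checkable proof establishes such triples for every possible input, rather than only for the inputs a tester happens to try.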
We wish to show that software does work as described, allowing a reviewer to determine precisely which properties of the software created a particular rule enforced for a particular decision. Further, if a precise specification of a policy does exist, we wish to show that software which claims to implement that policy in fact does so. The specification of a system is a critical question for assessment, and system implementers should be prepared to describe the invariants that their system provides. Verification allows the claims of a system’s implementer to constitute evidence that the software in question in fact satisfies those claims.30 Without strong evidence of a computer system’s correctness, even the author of that system cannot reliably claim that it will behave according to a desired policy, and no policymaker or overseer should believe such a claim. For example, a medical radiation device with a software control module was approved for use on patients based on the manufacturer’s claims, but a subtle bug in the software allowed it to administer unsafe levels of radiation, which resulted in six accidents and three deaths.31 Computer scientists evaluate programs using two testing methodologies: (1) static methods, which look at the code without running the program; and (2) dynamic methods, which run the program and assess the outputs for particular inputs or the state of the program as it is running.32 Dynamic methods can be divided into (1) observational methods in which an analyst can see how the program runs in the field with its natural inputs; and (2) testing methods, which are more powerful, where an analyst chooses inputs and submits them to the program.33 1. Static Analysis: Review from Source Code Alone Reading source code does allow an analyst to learn a great deal about how a program works, but it has some major limitations. Code can be complicated or obfuscated, and even expert analysis often misses eventual problems with the behavior of the program.34 For example, the Heartbleed security flaw was a potentially catastrophic vulnerability for most internet users that was caused by a 30 See Carlo Ghezzi et al., Fundamentals of Software Engineering (2d ed. 2002). 31 The commission reviewing the accidents determined that overconfidence on the part of engineers and operators led to both a failure to prevent the problem in the first place and a failure to recognize it as a problem even after multiple accidents had occurred. For an overview, see Nancy Leveson, Medical Devices: The Therac-25, in Safeware: System Safety and Computers app. (1995), an update of the earlier article Nancy G. Leveson & Clark S. Turner, An Investigation of the Therac-25 Accidents, Computer, July 1993, at 18. 32 See Flemming Nielson et al., Principles of Program Analysis (1999). 33 See id. at . 34 As an example, the Heartbleed bug was in code that was subjected to significant expert review and careful static analysis with industry-leading tools, but was still missed for years. See Edward W. Felten & Joshua A. Kroll, Heartbleed Shows Government Must Lead on Internet Security, 311 Sci. Am. no. 1, 2014, at 14. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 14 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol.
165: forthcoming] -14common programming error--but that error made it through an open source vetting process and then sat unnoticed for two years, even though anyone was free to read and analyze the code during that time.35 While there exist automated tools for discovering bugs in source code, even best-of-breed commercial solutions designed to discover exactly this class of error did not find the Heartbleed bug because its structure was subtly different from what automated tools had been designed to recognize.36 This experience underscores how difficult it can be to find small and simple mistakes. More complex errors evade scrutiny even more easily. Further, static methods on their own say nothing about how a program interacts with its environment.37 A program that examines any sort of external data, even the time of day, may have different behavior when run in different contexts. For example, it has long been programming practice to use the time of day as the starting value for a chaotic function used to produce random numbers in programs that do statistical sampling.38 Such programs naturally choose a different sample of data based on the time of day when they were started, meaning that their output cannot be reproduced unless the time is explicitly represented as an input to the program. Depending on the technology used to implement a program, static analysis might lead to incomplete or incorrect conclusions simply because it fails to consider the dependencies--that is, the other software that a given program needs in order to operate correctly.39 For some technologies, the same line of code can have radically different meanings based on the version of even a single dependency.40 Because of this, it is necessary for static analysis to cover a large portion of any system, and to include at least some dynamic information about how a program will be run. Within limits, static methods can be very useful in establishing facts about a program, such as the nature of the data it takes in, the kind of output it can 35 Edward W. Felten & Joshua A. Kroll, Heartbleed Shows Government Must Lead on Internet Security, Sci. Am. (Apr. 16, 2014), http://www.scientificamerican.com/article/heartbleed-showsgovernment-must-lead-on-internet-security [https://perma.cc/QLN4-TUQM]. 36 Id. 37 See Nielson et al., supra note 32, at . 38 This practice dates back at least as far as the 1989 standard for the C programming language. See ANSI X3.159-1989 “Programming Language C.” 39 See, e.g., Managing Software Dependencies, Gov.UK Service Manual, https://www.gov.uk/service-manual/technology/managing-software-dependencies [https://perma.cc/3BA8-CXED] (“Most digital services will rely on some third party code from other software to work properly. This is called a dependency.”). 40 For example, it is common to re-use “library” code, which provides generic functionality and can be shared across many programs. Ghezzi et al., supra note 30. Library functions can be very different from version to version, meaning that running a program with a different version of the same library can radically change its behavior. It can even change programs that fail to run at all into running, working programs. Because Microsoft Windows refers to its system libraries as “Dynamic Linked Libraries,” developers often call this “DLL Hell.” Rick Anderson The End of DLL Hell, Microsoft.com (Jan. 11, 2000). Further, in some programming languages, such as PHP, the meaning of certain statements is configurable. 
See The Configuration File, PHP, http://php.net/manual/en/configuration.file.php) [https://perma.cc/9LFC-62D9] (describing configuration of PHP). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 15 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -15produce, the general shape of the program, and the technologies involved in the program’s implementation.41 In particular, static analysis can reveal the kinds of inputs that might cause the program to behave in particular ways.42 Analysts can use this insight to test the program on different types of inputs. Advanced analysis can, in some cases, determine aspects of a program’s behavior and establish program invariants, or facts about the program’s behavior which are true regardless of what input data the program receives.43 Programs which are specially designed to take advantage of more advanced analysis techniques can enable an analyst to use static methods to prove formally complex invariants about the program’s behavior.44 On the simplest level, some programming languages are designed to prevent certain classes of mistakes. For example, some are designed in such a way that it is impossible to make the mistake that caused the Heartbleed bug.45 These techniques have also been deployed in the aviation industry, for example, to ensure that the software that provides guidance functionality on rockets, airplanes, satellites, and scientific probes does not ever crash, as software failures have caused the losses of several vehicles in the past.46 More advanced versions of these 41 See Nielson et al., supra note 32, at . 42 In programming languages, the most basic structure for expressing behavior that depends on a value is a conditional statement, often written as if X then Y else Z. A conditional statement will execute certain code (Y) if the condition (X) is met and different code (Z) if the condition is not met. Static analysis can reveal where a program has conditional logic, even if it may not always be able to determine which branch of the conditional logic will actually be executed. For example, static analysis of conditional logic might show an analyst that a program behaves one way for inputs less than a threshold and another way otherwise, or that it behaves differently on some particular special case. Generalizing this analysis can allow analysts to break the inputs of a program into classes and evaluate how the program behaves on each class. For an overview of logical constructs in computer programs, see Harold Abelson et al., Structure and Interpretation of Computer Programs (2d ed. 1996). 43 See Hoare, supra note 25, at . 44 See Nielson et al., supra note 32, at . 45 Since Heartbleed was caused by improper access to the program’s main memory, see Felten & Kroll, supra note 35, computer scientists refer to the property that a program has no such improper access as “memory safety.” For a discussion of the formal meaning of software safety, see Pierce, supra note 24, at . For an approachable description of possible memory safety issues in software, see Erik Poll, Lecture Notes on Language-Based Security ch. 3, https://www.cs.ru.nl/E.Poll/papers/language_based_security.pdf [https://perma.cc/YJ23-6PG6]. Several modern programming languages are memory safe, including some, such as Java, that are widely used in industrial software development. However, while any system could be written in a memory safe language, developers often choose memory unsafe languages for performance and other reasons. 
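To make concrete the reproducibility point made above about programs that seed their random number generators with the time of day, consider the following short Python sketch (the population and sample sizes are invented for illustration):

import random
import time

population = list(range(1000))  # stand-in for, e.g., a pool of filed tax returns

def sample_with_clock_seed(k):
    rng = random.Random(time.time())   # seed depends on the moment the program runs
    return rng.sample(population, k)

def sample_with_recorded_seed(k, seed):
    rng = random.Random(seed)          # seed is an explicit, recordable input
    return rng.sample(population, k)

print(sample_with_clock_seed(5))               # differs from one run to the next
print(sample_with_recorded_seed(5, seed=42))   # identical every time it is run
print(sample_with_recorded_seed(5, seed=42))   # ...and therefore reproducible

Because the last two calls use the same recorded seed, anyone who knows that seed can re-run the program and obtain the identical sample; the first call’s output cannot be reproduced unless the time at which it ran is captured and treated as an input to the program.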
46 Both the Ariane 5 and Mars Polar Lander crashed due to software failures. See J.L. Lions, Chairman, Ariane 501 Inquiry Bd., Ariane 5: Flight 501 Failure (1996); James Gleick, Little Bug, Big Bang, N.Y. Times Mag. (Dec. 1, 1996), http://www.nytimes.com/1996/12/01/magazine/littlebug-big-bang.html [https://perma.cc/D4JE-V7K2]; see also Arden Albee et al., JPL Special Review Bd., Report on the Loss of the Mars Polar Lander and Deep Space 2 Missions (2000), http://spaceflight.nasa.gov/spacenews/releases/2000/mpl/mpl_report_1.pdf [https://perma.cc/RE9Z-PX6L]. Similarly, a software configuration error caused the crash of an Airbus A400M military transport. Sean Gallagher, Airbus Confirms Software Configuration Error Caused Plane Crash, Ars Technica (June 1, 2015), http://arstechnica.com/information- DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 16 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -16techniques may eventually lead to strong invariants being much more commonly and less expensively used in a wider range of applications. Transparency advocates often claim that by reviewing a program’s disclosed source code, an analyst will be able to determine how a program behaves.47 Indeed, the very idea that transparency allows outsiders to understand how a system functions is predicated on the usefulness of static analysis. But this claim is belied by the extraordinary difficulty of identifying even genuinely malicious code (“malware”), a task which has spawned a multibillion-dollar industry based largely on the careful review of code samples collected across the internet.48 Of course, under some circumstances, transparency can also use dynamic methods such as emulating disclosed code on disclosed input data. We discuss transparency further in Part II. 2. Dynamic Testing: Examining a Program’s Actual Behavior By running a program, dynamic testing can provide insights not available through static source code review. But again, there are limits. While static methods may fail to reveal what a program will do, dynamic methods are limited by the finite number of inputs that can be tested or outputs that can be observed. This is important because decision policies tend to have many more possible inputs than a dynamic analysis can observe or test.49 Dynamic methods, including structured auditing of possible inputs, can explore only a small subset of those potential inputs.50 Therefore, no amount of dynamic testing can make an observer certain that he or she knows what the computer would do in some other situation that has yet to be tested.51 technology/2015/06/airbus-confirms-software-configuration-error-caused-plane-crash [https://perma.cc/X9A4-X9CH]. 47 See Frank Pasquale, The Black Box Society: The Secret Algorithms That Control Money and Information 40 (2015). 48 Malware analysis can also be dynamic. A common approach is to run the code under examination inside an emulator and then examine whether or not it attempts to modify security-restricted portions of the system’s configuration. For an overview, see Manuel Egele et al., A Survey on Automated Dynamic Malware-Analysis Techniques and Tools, 44 ACM Computing Surveys (CSUR) 6 (2012). 49 Computer scientists call this problem “Combinatorial Explosion.” It is a fundamental problem in computing affecting all but the very simplest programs. Edward Tsang, Combinational Explosion, U. Essex (May 12, 2005), http://cswww.essex.ac.uk/CSP/ComputationalFinanceTeaching/CombinatorialExplosion.html [https://perma.cc/R7KE-4BJD]. 
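A short and deliberately artificial Python sketch illustrates why even extensive black-box testing says little about inputs that were never tried. The function below follows its announced rule on essentially every input but contains one hidden special case (the constant, like the rest of the example, is invented); with 2^64 possible account numbers, a million random tests are overwhelmingly unlikely ever to encounter it:

import random

def risk_score(account_number):
    if account_number == 0x5EC12E7BAC4D0012:   # hidden special case known only to the programmer
        return 0.0                             # this one account always passes screening
    return (account_number % 97) / 96          # the announced, ordinary rule

trials = [random.randrange(2**64) for _ in range(1_000_000)]
deviations = [a for a in trials if risk_score(a) != (a % 97) / 96]
print(len(deviations))   # almost certainly prints 0: a million tests reveal nothing amiss

An analyst who can observe only inputs and outputs would conclude that the announced rule is always followed; the divergent case is, as a practical matter, discoverable only by examining how the program was built or by being told where to look.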
50 Even auditing techniques that involve significant automation may not be able to cover the full range of possible input data if that range cannot be limited in advance to a small enough size to be searched effectively. For programmers testing their own software, achieving complete coverage of a program’s behavior by testing alone is considered impossible. Indeed, if testing for the correct behavior of a program were possible at a modest cost, then there would be no bugs in modern software. For a formal version of this argument, see H.G. Rice, Classes of Recursively Enumerable Sets and Their Decision Problems, 74 Transactions Am. Mathematical Soc’y, 358 (1953). 51 Computer security experts often worry about so-called “back doors,” which are unnoticed modifications to software that cause it to behave in unexpected, malicious ways when presented with certain special inputs known only to an attacker. There are even annual contests in which the organizers “propose a challenge to coders to solve a simple data processing problem, but with covert malicious behavior. Examples include miscounting votes, shaving money from financial DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 17 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -17Dynamic testing can be divided into “black-box testing,” which considers only the inputs and outputs of a system or component, and “white-box testing,” in which the structure of the system’s internals is used to design test cases. Intuitively, white-box evaluation is more powerful, since any test that can be performed in a black-box setting can also be performed in a white-box setting, but white-box evaluation can suggest more robust test cases by showing an analyst when multiple tests will yield the same behavior or what inputs are likely to trigger differences in the output.52 White-box analysis also helps the developers and operators of a system determine how to monitor its operation so that deviations from expected behavior (e.g., unforeseen bugs, security compromise, abuse, and other unexpected behavior) can be detected and remedied.53 Even structured “audits” of software systems, in which systems are provided with related inputs and analyzed for differential behavior, cannot provide complete coverage of a program’s behavior for the same reason: this methodology explains little about what happens to inputs which have not been tested, even those that differ very slightly.54 Additionally, auditing that treats a system as a black box tells an analyst very little about why differential behavior was observed. A computer program could treat two inputs very differently because it has been explicitly designed to use special case logic for one or both, because those inputs naturally fall into different decision categories, or because the decision rule in use is very sensitive to small changes in its input. One extremely straightforward and very commonly used form of dynamic program review comes from the practice of logging, or recording certain program actions in a file either immediately before or immediately after they have taken place.55 Analysis of log messages is among the easiest and is perhaps the most common type of functional review performed on most software programs. However, analyzing program logs requires that programs be written to log when they perform events which might be interesting for analysis (and that they log enough information about those events to actually perform the analysis in transactions, or leaking information to an eavesdropper. 
The main goal, however, is to write source code that easily passes visual inspection by other programmers.” The Underhanded C Contest, http://www.underhanded-c.org/_page_id_2.html [https://perma.cc/82N4-FBDP]. Back doors have been discovered sitting undetected for many years in commercial, security-focused infrastructure products subject to significant expert review, including the Juniper NetScreen line of devices. See Matthew Green, On the Juniper Backdoor, Few Thoughts on Cryptographic Engineering (Dec. 22, 2015). http://blog.cryptographyengineering.com/2015/12/22/on-juniper-backdoor [https://perma.cc/M7S8-SCM4] (describing the unauthorized code that created a security vulnerability in the Juniper devices). 52 See Glenford J. Myers et al., The Art of Software Testing (3d ed. 2011). 53 See Slawek Ligus, Effective Monitoring and Alerting 1-2 (2012) (describing how to perform monitoring effectively, as opposed to verifying a system’s behavior through testing alone). 54 See infra note 59 and accompanying text. 55 Logging is now sufficiently common to be a basic feature of most programming languages. For an overview of early uses, see Ronald E. Rice & Christine L. Borgman, The Use of ComputerMonitored Data in Information Science and Communication Research, 34 J. Am. Soc’y Info. Sci. 247 (1983). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 18 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -18- question).56 And because logs are just like other files on a computer, they can easily be modified and rewritten to contain a sequence of events that bears no relation to what a system’s software actually did. Because of this, audit logs meant to record sensitive actions requiring reliable review are generally access controlled or sent to special restricted remote systems dedicated to receiving logging data.57 3. The Fundamental Limit of Testing: Noncomputability Testing of any kind is, however, a fundamentally limited approach to determining whether any fact about a computer system is true or untrue. There are some surprising limitations to the ability to evaluate code statically or dynamically. Counterintuitively, the power of computers is generally limited by a concept that computer scientists call noncomputability.58 In short, certain types of problems cannot be solved by any computer program in any finite amount of time. There are many examples of noncomputable problems, but the most famous is Alan Turing’s “Halting Problem,” whichasks whether a given program will finish running (“halt”) and return an answer on a given input or will run forever on that input. No algorithm can solve this problem for every program and every input.59 As a corollary, no testing regime can establish any property for all possible programs, since no regime can even determine whether all programs will actually terminate.60 A related theorem, proposed by Rice, strongly limits the theoretical effectiveness of testing, saying that for any nontrivial property of a program’s behavior, no algorithm can always establish whether a program under analysis has that property.61 Any such algorithm must get some cases wrong even if the algorithm can do both static and dynamic analyses of the program.62 However, testing can be very useful in establishing certain specific invariants on restricted classes of programs, and can be made much more useful when programs are designed to facilitate the use of testing to establish those invariants. That is, while testing is not guaranteed to work 56 See, e.g., Bernard J. 
Jansen, Search Log Analysis: What It Is, What’s Been Done, How to Do It, 28 Libr. & Info. Sci. Res. 407 (2006). 57 A common feature of security breaches of computer systems is that attackers will rewrite logs to prevent investigation into how the attack was carried out or who did it. See Dr. Eric Cole et al., Network Security Bible 198 (2005) (noting that the “early stages of an attack often deal with deleting and disabling logging”). Modifying logs in this way can even allow attackers to avoid losing access to a compromised system once the compromise has been detected, since it obscures what steps must be taken to remediate the intrusion. See generally id. (describing how security breaches happen, how they are investigated, and how attackers try to cover the traces of their activity). 58 See Michael Sipser, 2 Introduction to the Theory of Computation (2006). 59 See A.M. Turing, On Computable Numbers, with an Application to the Entscheidungsproblem, 42 Proc. London Mathematical Soc’y 230, 259-63 (1937) (proving that the Hilbert Entscheidungsproblem has no solution); see also A.M. Turing, On Computable Numbers, with an Application to the Entscheidungsproblem. A Correction, 43 Proc. London Mathematical Soc’y 544, 544-46 (1937) (providing a correction to the original proof). 60 To see why this is so, imagine writing a new program which halts if it decides that the program it is testing has a certain property, and which runs forever otherwise. For a more detailed version of this argument, see Sipser, supra note 58, at . 61 Rice, supra note 50. 62 Id. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 19 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -19in general, it can often be useful in specific cases, especially when those cases have been designed to facilitate testing. While both static and dynamic methods are after-the-fact assessments--they take the computer system and its design as a given--using both approaches together is often helpful. If an analyst can establish through static methods that a program behaves identically over some class of inputs,63 the analyst can test a single input from that class and infer the program behavior for the rest of the class. However, not every computer program will be able to be fully analyzed, even with such a combination of methods. B. The Importance of Randomness Randomness is essential to the design of many computer systems, so any approach to accountability must grapple with it.64 However, when randomness is used, it is easy to lose accountability, since by definition any outcome which a randomized process could have produced is at least facially consistent with the design of that process.65 Accountability for randomized processes must determine why randomness was needed and determine that the source of that randomness and its incorporation into the process under scrutiny meets those goals. The most intuitive benefit of randomness in a decision policy is that it helps prevent strategic behavior--i.e., “gaming” of a system. When a tax examiner, for example, uses software to choose who is audited, randomization makes it 63 This can be done, for example, by noting where in a program’s source code it considers input values and changes its behavior. See infra note 68, at 161 and accompanying text. 64 In fact, there is suggestive theoretical evidence that the power of randomness may be fundamental: there are problems for which the best known randomized algorithm performs much better than the best known deterministic algorithm. 
For example, the well-studied “multi-armed bandit” problem in statistics has seen wide application in the field of machine learning, where randomized decisionmaking strategies are provably more efficient than nonrandomized ones. See, e.g., J.C. Gittins, Bandit Processes and Dynamic Allocation Indices, 41 J. Royal Stat. Soc’y 148, 148 (1979) (providing a formal mathematical definition of the multi-armed bandit problem); see also Richard S. Sutton & Andrew G. Barto, Reinforcement Learning (1998) (providing a general overview of the usefulness of the multi-armed bandit problem in machine learning applications). Even outside of machine learning, there are strong indications in computer science theory that certain problems can be solved efficiently only via randomized techniques. Although it is obvious that every efficient algorithm also has an efficient randomized version (which is just rewritten to take some random bits as input and ignore them), it is conjectured but not known that the converse is not true, namely that every efficient randomized algorithm also has a deterministic version that solves the same problem with comparable efficiency. For a summary of work in this area, see Leonid A. Levin, Randomness and Nondeterminism, 1994 Proc. Int’l Congress Mathematicians 1418. Many important problems, from finding prime numbers (which is necessary for much modern cryptography), to estimating the volume of an object (which is useful in computer graphics and vision algorithms), to most machine learning, had well-understood randomized algorithms that solved them long before they had efficient deterministic solutions (many still do not have any known efficient and deterministic algorithms). For an example, see Morton, infra note 107. 65 For example, a winning lottery ticket with the numbers “1 2 3 4 5” is just as likely to be correct as any other ticket, and yet it seems strikingly unlikely. In a similar way, it will always be necessary when randomness is involved in a process to ensure that even outcomes that are “correct” in the sense that the system could have produced them are also correct in the sense that they fulfill the goals which necessitated randomness in the first place (e.g., in a lottery, that the winning ticket numbers not be known in advance of their selection and not be influenced by the lottery operators). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 20 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -20impossible for a taxpayer to be sure whether or not he or she will be audited. Those who are evading taxes, in particular, face an unknown risk of detection—--which can be minimized, but not eliminated--and do not know whether, or when, they should prepare to be audited. Similarly, if additional security screening is applied at random to those crossing a checkpoint, or if the procedures at the checkpoint are changed at random on a day-to-day basis, a smuggler or attacker cannot be as prepared as if the procedures were fully known in advance.66 Additionally, studies of the performance of human guards have shown that randomization in procedures reduces boredom, thereby improving vigilance.67 The card game of poker illustrates a second benefit: randomness can obscure secret information. A good poker player has secret information--how good her cards are--that affects how she will bet. By occasionally bluffing, she randomizes her behavior and makes it more difficult for opponents to infer the quality of her hand. 
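A brief Python sketch shows how randomization keeps even a well-informed filer uncertain. The scoring rule and the probabilities below are invented; the point is that a filer who learns the rule learns only a probability of audit, never the outcome, so the risk of detection can be reduced but not eliminated.

import random

def audit_probability(risk_score):
    # Hypothetical mapping from a return's risk score to a chance of audit;
    # even the lowest-risk return retains a small chance of selection.
    return min(0.02 + 0.08 * risk_score, 0.90)

def select_for_audit(risk_score):
    # The random draw means that two identical returns may be treated
    # differently and that no filer can predict selection in advance.
    return random.random() < audit_probability(risk_score)

The same structure describes randomized checkpoint screening: publishing the selection probabilities would not tell a traveler whether she will in fact be screened on a given day.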
In situations where a scarce or limited resource must be apportioned to equally deserving recipients such that not all qualified applicants can receive a share, randomness can help by fairly apportioning resources to participants in a way that cannot be controlled or predicted by those in control of the resource. For example, the Diversity Visa Lottery, considered in Part II, is a case where a random lottery allocates a scarce resource--a limited number of visas to live and work in the United States. Randomness as a source of fairness requires two attributes: first, the outcome must not be controlled by the system’s operator, or else the randomness serves little purpose when compared to a model where the system’s operator just chooses the winners; and second, the outcome must not be predictable, or else the operator of such a system could put its favored winners into certain slots or slip them “winning tickets” prior to the system’s operation. Further, it is important that the random choices made when the system is run be binding upon the system’s operator, so that the system cannot be run many times to control the eventual output by shopping for a favorable result among many actual runs of the system. We explore precisely how to address these issues below. Many machine learning systems use randomization as part of their normal operation. It turns out that guessing randomly and adjusting the probability of each class of output often leads to much better performance than trying to determine the absolute best decision at any point.68 66 See, e.g., James Pita et al., Deployed ARMOR Protection: The Application of a Game Theoretic Model for Security at the Los Angeles International Airport, 2008 Proc. 7th Int’l Conf. on Autonomous Agents & Multiagent Sys.: Industry & Applications Track 125 (describing a software system that uses a game-theoretic randomized model to improve the efficiency of police and federal air marshal patrols at the Los Angeles International Airport). 67 See, e.g., Richard I. Thackray et al., FAA Civ. Aeromedical Institute, Physiological, Subjective, and Performance Correlates of Reported Boredom and Monotony While Performing a Simulated Radar Control Task (1975) (discussing the improvement of performance through increased unpredictability in procedures). 68 The use of randomization is found throughout artificial intelligence and machine learning. For a survey of the field, see Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach (1995). Some models naturally depend on randomness. See, e.g., Kevin B. Korb & Ann E. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 21 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -21Finally, randomization can give computers more flexibility to perform well in unexpected environments. Consider how the Roomba robot is programmed to vacuum rooms.69 If rules of motion were hard-coded in the software controlling the robot, an unusual furniture configuration might lead to the Roomba getting stuck in a corner or under a table or repeatedly following the same path without cleaning the rest of the room. Adding in randomized motion allows it to escape these patterns and work more effectively without the need to code in all possible room configurations. By allowing for unknowns, randomized strategies can avoid worstcase outcomes with high probability, no matter how unfriendly the environment turns out to be.70 However, poorly designed randomization can lead to unaccountable automated decisions. 
If a decision depends on a randomly selected value, then any outcome consistent with some possible value of the random choice, no matter how unlikely, must be considered valid. Concretely, if a decision is based on the outcome of a coin flip, even if the coin is biased to land heads up 99 times out of 100, a result based on a tails up flip cannot be shown to be improper, since one out of every 100 results will be derived from the value of tails. A corrupt decisionmaker could influence this supposedly random choice, picking the value of the coin consistent with its preferred outcome, or could flip many coins and then assign the value of each flip to the set of decisions he has to make (perhaps by changing the order in which he considers decisions) in such a way as to pick the outcomes he desires. Random choices generated by a computer system can also be remade by re-running a program until the outputs match a preferred outcome. Without designing the computer system to demonstrate that this is not happening, it is very hard for a decisionmaker to prove that he has not done this. The speed of automated decisionmaking only increases this risk; while physical randomization of balls in a tumbler can only produce a small number of values per hour of effort, a computer can try thousands or millions of outcomes in a matter of minutes. Nicholson, Bayesian Artificial Intelligence § 1.1 (2004) (describing a model of probabilistic reasoning that depends on reckoning with randomness). Other methods simply use randomness as an efficient way to explore a large space of possible strategies, in which case it is generally necessary to try to build a model many times, evaluate the differences, and report back an estimate of confidence in the system’s correct construction. See Volodya Vovk et al., Machine-learning Applications of Algorithmic Randomness, 1999 Proc. Sixteenth Int’l Conf. on Machine Learning 444 (describing one approach to integrating randomness to improve a machine learning model). 69 For a description of the Roomba’s movement algorithm, see Ja-Young Sung et al., “My Roomba is Rambo”: Intimate Home Appliances, in 2007 Proc. 9th Int’l Conf. on Ubiquitous Computing 145. 70 More concretely, one study showed that computer-generated teaching plans customized to particular students can be less effective than lesson plans without customization if the software model used to tailor lessons to individual performance is trained on large groups that do not capture individual-specific patterns. This failure of “big data” methods trained on large groups of students to properly capture the quirks of a “small data” situation (such as a classroom-sized group of students) can be avoided by adding random deviations from the model’s prediction and tracking the results of these deviations. See, e.g., Yun-En Liu et al., Trading Off Scientific Knowledge and User Learning with Multi-Armed Bandits, 2014 Proc. 7th Int’l Conf. on Educ. Data Mining 161 (observing the introduction of small changes on result prediction). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 22 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -22Additionally, a randomized process is not easily reproduced. 
For example, if it depends on interaction with its environment (e.g., the operating system on which it is running, its human user, or a database with rapidly changing records), its behavior may be altered in a nondeterministic way since that environment can change between runs.71 One unwieldy solution to this problem is to capture all of the environment in which a program runs, so that this environment can be replayed by an analyst. However, this solution does not address how to verify all of the reasons that randomness might be needed in a process. II. Designing Computer Systems for Procedural Regularity The first goal in any plan to govern automated decisionmaking should be to enable the people overseeing the process--whether they are government officials, corporate executives, or members of the public--to know how a computer system makes decisions (or, at the very least, that it makes decisions based on some rule, even if that rule is not fully disclosed). A baseline requirement in most contexts is procedural regularity: each participant will know that the same procedure was applied to her and that the procedure was not designed in a way that disadvantages her specifically.72 This baseline requirement draws on the Fourteenth Amendment principle of procedural due process. Ever since a seminal nineteenth century case, the Supreme Court has articulated that procedural due process requires rules to be generally applicable and not designed for individual cases.73 Similarly, federal statutes articulate the requirement for procedural regularity in administrative agency actions.74 These principles are also enshrined in the Federal Rules of Civil Procedure for civil litigation.75 In this Part, we will demonstrate that the tools of computer science can guarantee important elements of procedural regularity when they are incorporated in the initial design of computer systems. Specifically, these tools can assure that:  The same policy or rule was used to render each decision.  The decision policy was fully specified (and this choice of policy was recorded reliably) before the particular decision subjects were known, reducing the ability to design the process to disadvantage a particular individual. 71 One specific example is a program that chooses a random value based on the time that it has been running but takes different amounts of time to run based on what other programs are running on the same physical computer system. 72 For example, a tax auditing risk assessment should not single out individuals either by name or by identifying characteristics. If a process added extra weight to filers of a particular postal code, gender, and birth month, this could be enough to single out individuals in many cases. See, e.g., Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, 57 UCLA L. Rev. 1701, 1716-27 (2010) (showing that an individual’s identity may be reverseengineered from a small number of data points). 73 See Marchant v. Pa. R.R., 153 U.S. 380, 386 (1894) (holding that the plaintiff had due process because “her rights were measured, not by laws made to affect her individually, but by general provisions of law applicable to all those in like condition”). 74 See Administrative Procedure Act, 5 U.S.C. §§ 551-59 (2012) (prescribing exhaustive procedural requirements for most levels of federal administrative agency action). 75 See Fed. R. Civ. P. 1 (noting that the rules apply “in all civil actions and proceedings . . . to secure the just . . . 
determination of every action and proceeding” (emphasis added)). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 23 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -23 Each decision is reproducible from the specified decision policy and the inputs for that decision.  If a decision requires any randomly chosen inputs, those inputs are beyond the control of any interested party. After describing these properties and showing how they can be implemented, we will apply them to a case study--the Diversity Visa Lottery at the State Department--where application of these tools could greatly improve the legitimacy and fairness of an automated decision procedure. A. Transparency and Its Limits A naïve solution to the problem of verifying procedural regularity is to demand transparency of the source code as well as inputs and outputs for the relevant decisions; if all of these elements are public, it seems easy to determine whether procedural regularity was satisfied. Indeed, full or partial transparency can be a helpful tool for governance in many cases,76 and transparency has often been suggested as a remedy to accountability issues for computerized systems.77 However, transparency alone is not sufficient to provide accountability in all cases. First and foremost, it is often necessary to keep secret the elements of a decision policy, the computer systems that implement it, key inputs, or the outcome. Keeping aspects of a decision policy secret can help prevent strategic “gaming” of a system. For example, the IRS may look for signs in tax returns that are highly correlated with tax evasion based on returns previously audited.78 But if the public knows exactly which things on a tax return are treated as telltale signs of fraud, tax cheats may adjust their behavior and the signs may lose their predictive value for the agency. Moreover, when the decision being regulated is a commercial one, such as a decision to offer credit, a business’s legitimate interest in protecting proprietary information or guarding trade secrets like the underwriting formula may be incompatible with full transparency. And in many contexts, an automated decision may use as inputs, or will create as an output, sensitive or private data that should not be broadly shared to protect business interests, privacy, or the integrity of law enforcement or investigative methods. In some cases, especially with financial or medical data, disclosure may be barred or limited by statutes or regulations.79 Finally, in many situations--such as scoring consumers for credit or insurance risk-the purpose of the automated decision process is to determine something not 76 See Danielle Keats Citron & Frank Pasquale, The Scored Society: Due Process for Automated Predictions, 89 Wash. L. Rev. 1, 8, 18-20 (2014) (arguing that “transparency of scoring systems is essential”). 77 See 14 C.F.R. § 255.4 (2015) (requiring transparency for airline reservation system display information); Frank Pasquale, Beyond Innovation and Competition: The Need for Qualified Transparency in Internet Intermediaries, 104 Nw. U. L. Rev. 105, 160-61 (2010) (suggesting transparency in broadband networks to hold carriers accountable for potential favoritism and discrimination). 78 See Jeff Reeves, IRS Red Flags: How to Avoid a Tax Audit, USA Today (Mar. 15, 2015 12:08 PM), http://www.usatoday.com/story/money/personalfinance/2014/03/15/irs-tax-audit/5864023 [https://perma.cc/BFW5-DG34] (identifying characteristics of tax returns that trigger IRS audit). 
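The reproducibility property listed above can be stated concretely in a few lines of Python. The policy and the record format are placeholders, but the sketch captures the idea: if the announced policy and the inputs to a decision are recorded, an overseer can replay the decision and confirm that the recorded outcome is the one the policy actually produces. (Where a policy uses random values, those values must themselves be recorded as inputs for the replay to succeed, a point taken up below.)

def decision_policy(inputs):
    # The announced rule (illustrative): approve when the score clears a bar.
    return "approve" if inputs["score"] >= 70 else "deny"

def replay_matches(record):
    # The record contains the inputs used and the outcome that was reported.
    return decision_policy(record["inputs"]) == record["outcome"]

audit_record = {"inputs": {"score": 82}, "outcome": "approve"}
assert replay_matches(audit_record)   # the recorded outcome is reproducible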
79 See, e.g., 45 C.F.R. § 164.502 (2015) (restricting the disclosure of personally identifiable information collected by health care providers). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 24 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -24directly measurable, such as the risk of defaulting on credit or claiming a loss on an insurance policy. Because these values cannot be measured directly, they are computed from proxy variables such as a consumer’s credit history, income, or personal attributes. Consumers who understand these actuarial processes would be rational in attempting to control the input proxy variables, which in turn could render the scoring process less useful in elucidating unmeasurable risk.80 Secrecy discourages strategic behavior by participants in the system and prevents violations of legal restrictions on disclosure of data. Second, while transparency allows for the testing strategies described earlier (i.e., static and dynamic tests including auditing), those methods are often insufficient to verify properties of software systems if these systems have not been designed with future evaluation and accountability in mind.81 Third, for decision processes that involve some element of randomness, even full transparency--of the system’s source code, its inputs, its operating environment, and its results--does not foreclose the possibility that an outcome could be improperly fixed in an undetectable way, as described in Section I.C.82 A classic lottery provides an excellent example. A perfectly transparent algorithm that uses a random number generator to assign a number to each participant and has the participants with the lowest numbers win will yield results that cannot be reproduced or verified because the random number generator will produce different random numbers each time. Reviewing the code alone, or even together with the data fed into it and the environment in which it was operated, does not tell us that it was actually used to produce a particular result. By design, the process produces unpredictable results that are not reproducible. Fourth and finally, systems that change over time cannot be fully understood through transparency alone. System designers regularly change complicated automated decision processes--such as search engine ranking 80 In particular, consumers are rational to modify proxy variables that control their perceived risk when those variables are cheaper or easier to manipulate than the gain obtained via better treatment by the decision system. Intuitively, if proxy variables are weak and easy to alter or sometimes poorly correlated with the feature being measured (e.g., standardized test scores as a measure of student learning), they are more likely to be gamed than features which are highly proximate to the value being estimated, or which are difficult or expensive to alter (e.g., annual income as a measure of creditworthiness in a particular transaction). See generally Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (2016). In economic policymaking, this is sometimes known as Goodhart’s Law, popularly formulated as “[w]hen a measure becomes a target, it ceases to be a good measure;” Goodhart formulated it more formally as “[a]ny observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” C.A.E. Goodhart, Problems of Monetary Management: The U.K. Experience, in 1 Papers in Monetary Economics (1976). 
Hardt and his co-authors have developed adversarial methods for designing automated decision and classification systems that remain robust even in the face of gaming. See Moritz Hardt et al., Strategic Classification, 2016 Proc. 2016 ACM Conf. on Innovations Theoretical Computer Sci. 111 (discussing methods to strengthen classification models). 81 See supra Part I. 82 There are ways to incorporate randomness that can be replicated. See infra subsection II.B.3. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 25 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -25- methodology,83 spam filter rules,84 intrusion detection system methods,85 or the algorithms that select website ads--in response to strategic behavior by participants in the system.86 Computer systems that choose social media posts to display to users might respond to user behavior. “Online” machine learning systems can update their model for predictions after each decision, incorporating each new observation as part of their training data. Even knowing the source code and data for such systems is not enough to replicate or predict their behavior--we also must know precisely how and when they interacted or will interact with their environment. Whether updates to the system are effected by human engineers and operators (e.g., a search engine engineer modifies the software used to rank pages) or by a machine learning system (e.g. the search engine’s software discovers that users more often click the second link for a certain query instead of the first, so it reverses their order), transparency alone does little to explain either why any particular decision was made or how fairly the system operates across bases of users or classes of queries. With such systems, there is the added risk that the rule disclosed is obsolete by the time it can be analyzed. Online machine learning systems update their decision rules after every query, meaning that any disclosure will be obsolete as soon as it is made. B. Auditing and Its Limits Beyond transparency, auditing is another strategy for verifying how a computer system works. An audit treats the decision process as a black box, whose inputs and outputs are visible but whose inner workings are unseen.87 The approach has a long history in offline contexts, such as testing for discrimination in retail car negotiations.88 For retail car negotiations, the transparency of the bargaining 83 For a list of updates to one search engine, see Google: Algorithm Updates, Search Engine Land, http://searchengineland.com/library/google/google-algorithm-updates [https://perma.cc/XV4Z- AFF9]. 84 Many spam filters work by keeping a list of bad terms, email addresses, and computers to block messages from. The most widely used “blacklist” is produced by the organization Spamhaus. See SBL Advisory, Spamhaus, https://www.spamhaus.org/sbl [https://perma.cc/V9LN-EPK9] (describing the Spamhaus Block List Advisory, “a database of IP addresses from which Spamhaus does not recommend the acceptance of electronic mail”). 85 Intrusion detection systems work in a similar way, using a set of “signatures” to identify bad network traffic from attackers. See Karen Kent Frederick, Network Intrusion Detection Signatures, Part One, Symantec (Dec. 19, 2001), https://www.symantec.com/connect/articles/networkintrusion-detection-signatures-part-one [https://perma.cc/Q9XY-TQCA] (discussing signatures for network intrusion detection systems). 86 See Jure Leskovec et al., Mining of Massive Data Sets ch. 8 (2d ed. 2014). 
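The "online" updating described above can be illustrated with a small Python sketch. The ranker below is hypothetical: it keeps a running estimate of how often users click each result and reorders the results after every observation, so the rule in force at any moment depends on the whole history of interactions, and a disclosed snapshot becomes stale almost immediately.

# Illustrative online update: a toy ranker adjusts its click-rate estimates
# after every interaction, so its decision rule changes continuously.
click_rate = {"result_a": 0.5, "result_b": 0.5}   # assumed starting estimates
LEARNING_RATE = 0.1

def record_interaction(result, clicked):
    observed = 1.0 if clicked else 0.0
    # Nudge the estimate toward what was just observed.
    click_rate[result] += LEARNING_RATE * (observed - click_rate[result])

def current_ranking():
    # The ordering in force right now; it may differ after the next click.
    return sorted(click_rate, key=click_rate.get, reverse=True)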
87 See generally Christian Sandvig et al., Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms (May 22, 2014), , http://www- personal.umich.edu/~csandvig/research/Auditing%20Algorithms%20--%20Sandvig%20-- %20ICA%202014%20Data%20and%20Discrimination%20Preconference.pdf [https://perma.cc/DS5D-3JYS] (describing algorithm audits and reviewing possible audit study designs). 88 See Ian Ayres, Fair Driving: Gender and Race Discrimination in Retail Car Negotiations, 104 Harv. L. Rev. 817, 818 (1991) (using auditing to determine “[w]hether the process of negotiating for a new car disadvantages women and minorities”). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 26 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -26process for the purchase of a car is insufficient to determine if different prices are offered based on race or gender.89 In computer science, auditing refers to the review of digital records describing what a computer did in response to the inputs it received.90 Auditing is intended to reveal whether the appropriate procedures were followed and to uncover any tampering with a computer system’s operation. For example, there is a substantial body of literature in computer science that addresses audits of electronic voting systems,91 and security experts generally agree that proper auditing is necessary but insufficient to assure secure computer-aided voting systems. Computer scientists, however, have shown that black-box evaluation of systems is the least powerful of a set of available methods for understanding and verifying system behavior.92 Even for measuring demonstrable properties of software systems, such as testing whether a system functions as desired without bugs, it is much more powerful to be able to understand the design of that system and test it in smaller, simpler pieces.93 Approaches that attempt to review system failures simply by looking at how the output responds to changes in input are limited by either an inability to attribute a cause to those changes or an inability to interpret whether or why a change is significant.94 Instead, software developers regularly use other, more powerful evaluation techniques.95 These include whitebox testing (in which a the person doing a test can see the system’s code, and uses that knowledge to more effectively search for bugs) and using programming languages that automatically preclude certain types of mistakes.96 89 See id. at 827-28 (observing that women and minorities received worse prices than white males even when using the same negotiation strategy). 90 See Douglas W. Jones, Auditing Elections, 47 Comms. ACM 46, (2004). 91 See Joseph Lorenzo Hall, Election Auditing Bibliography (Feb. 12, 2010), https://josephhall.org/eamath/bib.pdf [https://perma.cc/L397-AATD] (collecting scholarly literature discussing audits in elections). 92 Specifically, white-box testing, in which an analyst has access to the source code under test, is generally considered to be superior; even in cases where the basic testing approach does not make use of the structure of the software (e.g., so-called “fuzz testing” where a program is subjected to randomly generated input), testing benefits from some access to the structure of programs. See supra note 52 and accompanying text. Also, consider the difficulties encountered in one such audit study. 
The authors show a causal relationship between changing sensitive, protected attributes (e.g., gender) and the advertisements presented to a user (e.g., advertisements for high-paying jobs). See Amit Datta et al., Automated Experiments on Ad Privacy Settings: A Tale of Opacity, Choice, and Discrimination, 2015 Proc. Privacy Enhancing Techs. 92, 105-06. However, the methodology is unable to identify the mechanism of this causation or even whether the results discovered will generalize beyond the data seen in the study. Id. at 105. 93 See supra notes 20-23 on approaches to structuring software. 94 For example, if the output of a system is an error or other failure such as a crash, it is not obvious to an analyst how to modify the output to learn much at all. 95 See supra Section I.A. 96 See supra note 24. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 27 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -27C. Technical Tools for Procedural Regularity As we demonstrated above, transparency and auditing often do not suffice for accountability. In this Section, we introduce computational methods that can provide accountability for procedural regularity even when some information is kept secret. These methods can be used alongside transparency and auditing when appropriate and apply to all computer systems.97 Our approach harnesses the power of computational methods and does not take the design of the computer system as a given. Instead, we explicitly advocate for systems to be designed to use methods such as the ones described here. Nor do we give up on governance when all or part of a computer system must remain secret. We rely on several advanced techniques from computer science to enable the governance of secret decision systems: software verification, cryptographic commitments, zero-knowledge proofs, and fair random choices. Counterintuitively, even when a piece of software or the data input to it is secret, these methods can guarantee that the software and inputs satisfy the requirements for procedural regularity. They can verify that the same decision policy was used for each decision, that the policy was determined and recorded before inputs were known, and that the outcomes are reproducible. Just because a given computer system or piece of software is secret does not mean that nothing about that system can be known. 1. Software Verification Software verification is a set of techniques for proving mathematically that software has certain properties, either by analyzing existing code or by building software using specialized tools for extracting proved correct invariants. It has been a promising field for many decades, and while many benefits are only realized in research prototypes today, these methods are finding increasing industrial adoption, especially in sectors where software is safety or security-critical and in domains with strong liability regimes.98 While the complete verification of any program is 97 While the methods we propose are general, they can be inefficient for certain applications. The cost of providing a certain level of accountability must be considered as part of the design of any policy requirement. For more detail, see Kroll, infra note 119. 
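The difference between testing and verification can be made concrete even in a few lines of Python. In the hypothetical function below, the commented precondition and postcondition play the role of a specification. A test checks that specification only for the particular inputs a tester happens to choose; verification tools aim to prove, once and for all, that the postcondition holds for every input satisfying the precondition.

# A toy specification in the style of a Hoare triple (all names illustrative).
# Precondition:  0 <= score <= 100
# Postcondition: the result is "eligible" exactly when score >= 60,
#                and "ineligible" otherwise.

def classify(score):
    assert 0 <= score <= 100           # runtime check of the precondition
    return "eligible" if score >= 60 else "ineligible"

# Testing checks the postcondition for chosen inputs only:
assert classify(60) == "eligible"
assert classify(59) == "ineligible"
# Verification instead seeks a mathematical proof that the postcondition holds
# for every admissible input, so no further testing of it is needed.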
98 While software verification has been embraced by the aviation and industrial control sectors and for some financial applications (for example, the hedge fund Jane Street regularly touts its use of formal software analysis in recruiting materials sent to computer science students), it has yet to see much adoption in the critical fields of healthcare and automotive control. See Jean Souyris et al., Formal Verification of Avionics Software Products, 2009 Proc. of the 2nd World Congress on Formal Methods 532 (describing the use of software verification at Airbus); Norbert Volker & Bernd J. Kramer, Automated Verification of Function Block-Based Industrial Control Systems, 42 Sci. Computer Programming 101 (2002). Indeed, researchers have effected compromises of embedded healthcare devices such as pacemakers. See, e.g., Daniel Halperin et al., Pacemakers and Implantable Cardiac Defibrillators: Software Radio Attacks and Zero-Power Defenses, 2008 IEEE Symp. on Security and Privacy 129, 141 (finding that implantable cardioverter defibrillators are “potentially susceptible to malicious attacks that violate the privacy of patient information” and “may experience malicious alteration to the integrity of information or state”). News reports also indicate that former Vice President Dick Cheney had the remote software update functionality on his pacemaker disabled so that updating the software would require surgery, ostensibly in order to prevent remote compromise of his life-critical implant. Andrea Peterson, Yes, Terrorists Could Have Hacked Dick Cheney’s Heart, Wash. Post (Oct. 21, 2013), DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 28 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -28an expensive undertaking largely reserved to technologists versed in this particular area, it is important as a matter of policy to understand the options that are available so that costs and benefits can be weighed and acted upon. Unlike static analysis, which aims to examine already-written code for bugs or deviations from its specification, or software testing, which aims to verify that software meets a specific set of functional requirements by explicitly executing the software in a particular configuration, software verification aims to prove invariants about a program mathematically, using logic to reason about a program’s behavior under all conditions.99 Verified programs come with a mathematically checkable proof demonstrating that they have certain invariants, rendering testing for those invariants unnecessary, as the proof implies that such tests will always pass.100 There are many ways to verify software. For instance, a program can be carefully annotated using formal logic to determine its behavior in a precise manner, though this can be expensive and will not always yield the desired invariants;101 a program can be certified by another program which translates it to https://www.washingtonpost.com/news/the-switch/wp/wp/2013/10/21/yes-terrorists-could-have- hacked-dick-cheneys-heart[https://perma.cc/VY5S-6XR6]. Additionally, researchers have also demonstrated spectacular compromises of automobile control systems, including disabling brakes, controlling steering and acceleration, and completely cutting engine power. See Karl Koscher et al., Experimental Security Analysis of a Modern Automobile, 2010 IEEE Symp. 
on Security and Privacy 447 (performing early analyses of the security of automobile computers); see also Stephen Checkoway et al., Comprehensive Experimental Analyses of Automotive Attack Surfaces 2011 Proc. 20th USENIX Conf. on Security 77 (same). Subsequently, researchers have demonstrated problems in other models, including luxury models with significant telematics capabilities and remote software upgrade capability, showing that active maintenance of these software systems does not completely defend against attacks. See, e.g., Jonathan M. Gitlin, Man Hacks Tesla Firmware, Finds New Model, Has Car Remotely Downgraded, Ars Technica (Mar. 8, 2016 11:36 AM), http://arstechnica.com/cars/2016/03/manhacks-tesla-firmware-finds-new-model-has-car-remotely-downgraded [https://perma.cc/R9C5- 9RTY] (describing an incident where a Tesla car model was hacked despite frequent software updates). Problems with spontaneous acceleration in many Toyota vehicles were later traced to software issues. See Phil Koopman, A Case Study of Toyota Unintended Acceleration and Software Safety (Sept. 18, 2014), https://users.ece.cmu.edu/~koopman/pubs/koopman14_toyota_ua_slides.pdf [https://perma.cc/VP9T-VYMF] (presenting a detailed analysis of the issue). And of course, Volkswagen designed its engine control software to defeat an emissions testing regime. For a complete timeline of the Environmental Protection Agency’s actions on this matter, see Volkswagen Light Duty Diesel Vehicle Violations for Model Years 2009-2016, EPA.gov, https://www.epa.gov/vw [https://perma.cc/C83U-UZLG] (last updated Nov. 7, 2016). 99 See supra Section I.A. 100 See supra notes XX-XX. 101 For one of the earliest approaches to representing programs as statements in formal logic, see C.A.R. Hoare, An Axiomatic Basis for Computer Programming, 12 Comm. ACM 576, 576-80 (1969). While Hoare’s techniques form the basis of many modern methods, some methods attempt to build software that is correct by virtue of its construction, rather than analyzing software that has already been written. For an overview of different approaches and their tradeoffs, see B. Bérard et al., Systems and Software Verification: Model-Checking Techniques and Tools (2001). For a classic reference on how to include these techniques in the software engineering process, see Carlo Ghezzi et al., Fundamentals of Software Engineering (2d ed. 2003). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 29 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 
165: forthcoming] -29a form which is guaranteed to have the desired property;102 a program can be exhaustively tested for all possible inputs to ensure that an invariant is never violated;103 or a program can be built using tools that allow for the careful specification of invariants (and proofs of those invariants).104 Researchers have even verified entire operating systems using these techniques.105 Verification can be communicated to clients in a number of ways: so called proof-carrying code comes with a machine-checkable proof of its verified invariants which can be checked by a user just prior to running the program;106 a user can recompute the analysis used to generate the proof; or the truth of the proof can be confirmed by an entity the user trusts, with cryptography used to guarantee that the version a user is running matches the version that was verified.107 102 These tools are known as “certifying compilers.” The advantage of a certifying compiler is that one need only expend effort verifying the certifying compiler itself, not the software being compiled, in order to prove that the desired invariant holds for the compiled software. For a description of the original concept and a first implementation, see George C. Necula & Peter Lee, The Design and Implementation of a Certifying Compiler, 33 ACM SIGPLAN Notices 333 (1998). There are many examples of certified compiler systems. See, e.g., Joshua A. Kroll et al., Portable Software Fault Isolation, 2014 Proc. IEEE 27th Computer Security Found. Symp. 18 (describing a certifying software fault isolation compiler built out of CompCert’s certified back end). 103 This technique, known as “model checking,” could also be described as a form of static analysis. Model checking aims to verify an invariant by finding a counterexample (an input to the program which makes the invariant untrue and hence not an invariant). If a counterexample can be found, the program has a demonstrable bug. If no counterexample can be found, that invariant has been verified. See supra note XX and infra note XX. 104 Several such programming languages exist, though one of the more successful toolkits in active research is the proof assistant Coq, which allows for writing complex programs and theorems and invariants about those programs, in such a way that the proved-correct programs can be “extracted” into executable code. For an introduction to Coq, see Adam Chlipala, Certified Programming with Dependent Types: A Pragmatic Introduction to the Coq Proof Assistant (2013) and Yves Bertot & Pierre Castéran, Interactive Theorem Proving and Program Development: Coq’Art: The Calculus of Inductive Constructions (2004) (describing advanced programming techniques). Several large and complex programs have been written in Coq, which demonstrates that it is a robust tool capable of supporting nontrivial development tasks and proofs of correctness about those tasks. Perhaps the most famous of these was the proof of the “four-color theorem,” which states that any map can be drawn using only four colors such that no border on the map uses the same color for the regions on both sides of the border. Georges Gonthier, Formal Proof--The Four-Color Theorem, 55 Notices AMS 1382 (2008). Similar tools include a theorem prover for programs written in ANSI Common Lisp 2 and the interactive theorem prover Isabelle. See Lawrence C. Paulson, The Foundation of a Generic Theorem Prover, 5 J. Automated Reasoning 363 (describing the design and implementation of Isabelle). 
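The exhaustive checking described in note 103 can be demonstrated directly whenever the input space is small enough to enumerate. The Python sketch below is a model check in miniature: the rule and its invariants are invented (the constant 7919 is an arbitrary prime), and the check simply searches every possible input for a counterexample, finding none.

# Miniature model check: exhaustively verify invariants of a toy rule whose
# input space (entry numbers 0 through 9,999) can be enumerated completely.
def assigned_rank(entry_number):
    return (entry_number * 7919) % 10000 + 1    # hypothetical ranking rule

ranks = [assigned_rank(n) for n in range(10000)]
assert all(1 <= r <= 10000 for r in ranks)      # invariant 1: ranks stay in range
assert len(set(ranks)) == 10000                 # invariant 2: no two entries collide

For realistic programs the input space is far too large for this brute-force approach, which is why practical model checkers rely on symbolic techniques that reason about whole classes of inputs at once.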
105 See Gerwin Klein et al., seL4: Formal Verification of an OS Kernel, 2009 Proc. ACM SIGOPS 22nd Symp. on Operating Sys. Principles 207 (documenting the verification of the seL4 microkernel). 106 See, e.g., George C. Necula & Peter Lee, Safe Kernel Extensions Without Run-Time Checking, 1996 Proc. Second UNENIX Symp. on Operating Sys. Design & Implementation 229 (describing the concept of proof-carrying code and a first implementation with applications to operating system security); see also George C. Necula, Proof-Carrying Code Design and Implementation., in Proof and System Reliability 261 (H. Schwichtenberg & R. Steibruggen eds., 2002) (giving a detailed overview of the concepts of proof carrying code and their development over time). 107 This approach would consist of the certifying authority making a cryptographically signed statement that it had verified the proof for a binary with a certain cryptographic hash value and the DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 30 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -30However, just because a program has been verified or proven correct does not mean that it has been vetted at all for correctness or compliance with a policy. Verification typically constitutes a proof that the software object in use matches its specification, but this analysis says nothing about whether the specification is sufficiently detailed, correct, lawful, or socially acceptable, or constitutes good policy. Software verification is a rapidly developing field, and the costs of building fully verified software will likely drop precipitously in the coming decades, leading to wide adoption in the software industry due to the benefits of reduced security exposure and the elimination of many types of software bugs. 2. Cryptographic Commitments A cryptographic commitment is the digital equivalent of a sealed document held by a third party or in a safe place. It is possible to compute a commitment for any digital object (e.g., a file, a document, the contents of a search engine’s index at a particular time, or any string of bytes). Commitments are a kind of promise that binds the committer to a specific value for the object being committed to (i.e., the object inside the envelope) such that the object can later be revealed and anyone can verify that the commitment corresponds to that digital object.108 In this way, as in the envelope analogy, an observer can be certain that the object was not changed since the commitment was issued and that the committer did indeed know the value of the object at the time the commitment was made (e.g., the source code to a program or the contents of a document or computer file). Importantly, secure cryptographic commitments are also hiding, meaning that knowledge of the commitment (or possession of the envelope in the analogy) does not confer information about the contents. This gives rise to the sealed document analogy: once an object is “inside” the sealed envelope, an observer cannot see it nor can anyone change it. However, unlike physical envelopes, commitments can be published, transmitted, copied, and shared at very low cost and do not need to be guarded to prevent tampering. 
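The envelope analogy can be made concrete with a minimal Python sketch of one common construction, a hash-based commitment. Real deployments require additional care (parameter and encoding choices, and formal security analysis); the sketch assumes SHA-256 as the hash function and a fresh random 32-byte nonce as the opening key.

import hashlib, secrets

def commit(document_bytes):
    # The commitment is a hash of a random nonce together with the document.
    # Publishing it reveals nothing useful about the document (hiding), and the
    # committer cannot later open it to a different document (binding).
    opening_key = secrets.token_bytes(32)
    commitment = hashlib.sha256(opening_key + document_bytes).hexdigest()
    return commitment, opening_key

def verify(commitment, opening_key, document_bytes):
    return hashlib.sha256(opening_key + document_bytes).hexdigest() == commitment

policy = b"if score >= 60: eligible"            # stand-in for secret policy code
c, key = commit(policy)                         # c can be published immediately
assert verify(c, key, policy)                   # later, the envelope opens correctly
assert not verify(c, key, b"different policy")  # and cannot be opened to anything else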
In practice, cryptographic commitments are much smaller than the digital objects they represent.109 Because of this, commitments can be used to lock in knowledge of a secret (say, an undisclosed decision policy) at a certain time (say, by publishing it or sending it to an oversight body) without revealing the contents of the secret, while still allowing the secret to be disclosed later (e.g., in a court case under a discovery order) and guaranteeing that the secret distribution of a signed copy of that piece of software. For an overview of code signing systems, see Code Signing, Certificate Authority Security Council, https://casecurity.org/wpcontent/uploads/2013/10/CASC-Code-Signing.pdf [https://perma.cc/DZU8-TA36]. 108 See Ariel Hamlin et al., Cryptography for Big Data Security, in Big Data: Storage, Sharing, and Security 1, 29 (Fei Hu ed., 2016) (describing cryptographic commitments as a method of verification). 109 See id. at (noting that commitments can be smaller than the statements to which they relate). A typical commitment will be 128 or 256 bytes, regardless of the size of the committed object. See Info. Tech. Lab., Nat’l Inst. of Standards & Tech., FIPS PUB 180-4, Secure Hash Standard(2015) (describing the hash algorithms accepted for government computer applications, which provide widely-used standards in industry). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 31 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -31was not changed in the interim (for example, that the decision policy was not modified from one that was explicitly discriminatory to one that was neutral).110 When a commitment is computed from a digital object, the commitment also yields an opening key, which can be used to verify the commitment.111 Importantly, a commitment can only be verified using the precise digital object and opening key related to its computation; it is computationally implausible for anyone to discover either another digital object or another opening key which will allow the commitment to verify properly. In the envelope metaphor, this is tantamount to proof that neither the envelope nor the document inside the envelope was replaced clandestinely with a different envelope or document. Any digital object (e.g., a file, document, or any string of bytes) can have a commitment and an opening key such that it is: 1) impossible to deduce the original object from the commitment alone; 2) possible to verify, given the opening key, that the original object corresponds to the commitment, and 3) impossible to generate a fake object and fake opening key such that using the (real) commitment and the fake opening key will reveal the fake object. Cryptographic commitments have useful implications for procedural regularity in automated decisions. They can be used to ensure that the same decision policy was used for each of many decisions. They can ensure that rules implemented in software were fully determined at a specific moment in time. This means a government agency or other organization can commit to the assertions that (1) the particular decision policy was used, and (2) the particular data were used as input to the decision policy (or that a particular outcome from the policy was computed from the input data). The agency can prove the assertions by taking its secret source code, the private input data, and the private computed decision outcome and computing a commitment and opening key (or a separate commitment and opening key for each policy version, input, or decision). 
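Using the commit and verify helpers sketched above, this workflow, which the remainder of this Section walks through, can be compressed into a few schematic lines. Everything here is a placeholder; the point is that the agency binds itself to each object at the time of the decision, and an oversight body can later check the opened values against the commitments that were published.

# Schematic use of commitments for procedural regularity (all values invented).
policy_source = b"def decide(applicant): ..."    # the secret decision policy
applicant_record = b"applicant 1 input data"     # private input for one case
decision_outcome = b"denied"                     # the computed result for that case

c_policy, k_policy = commit(policy_source)       # committed before any case is decided
c_input, k_input = commit(applicant_record)      # committed when the case is decided
c_outcome, k_outcome = commit(decision_outcome)  # committed when the case is decided
published = (c_policy, c_input, c_outcome)       # released at a reliably dated venue

# Later, under oversight, the agency reveals the objects and opening keys, and
# the overseer confirms that they match what was published at the time.
assert verify(c_policy, k_policy, policy_source)
assert verify(c_input, k_input, applicant_record)
assert verify(c_outcome, k_outcome, decision_outcome)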
The company or agency making an automated decision would then publish the commitment or commitments publicly and in a way that establishes a reliable publication date, perhaps in a venue such as a newspaper or the Federal Register. Later, the agency could prove that it had the source code, input data, or computed results at the time of commitment by revealing the source code and the opening key to an oversight body such as a court. This technique assures that the software implementing the decision policy was determined and recorded prior to the publication of the commitment, which can be useful in demonstrating that neither the software nor the decision policy was influenced by later information or events.

By themselves, however, cryptographic commitments do not prevent the committer from lying and generating a fake commitment that it cannot open at all or from destroying (or refusing to disclose) the information that allows a valid commitment to be opened. In either case, when the time comes to reveal the contents of the commitment, it will be demonstrable that the committer has misbehaved. However, an observer does not know the nature of the misbehavior. The committer may not have a correct opening key (analogous to having sealed an unintelligible or irrelevant document in a physical envelope) or may want to lie about what was in the original file (analogous to discovering that the contents of the envelope may be embarrassing under scrutiny of oversight). In either case, an oversight authority might punish the committer for lying and assume the worst about the contents of the missing file.112 However, it would be preferable to be able to avoid this scenario altogether, which we can do with another tool, zero-knowledge proofs, described below.

110 As a curiosity, we remark that the popular board game Diplomacy is essentially based on physical-world commitments: each player negotiates a set of moves for the next round of the game, but then these moves are written on paper and passed secretly to a game master who stores them in an envelope. Once all players have entered their moves, the moves are revealed and taken simultaneously. This commitment mechanism allows players to simulate simultaneous moves without any risk that a player will fall behind or change their moves in a particular round in response to their perception of what another player is doing in that round. However, the commitment mechanism alone does not prevent players from entering incorrect or impossible moves, writing nonsense on their paper instead of moves, or simply refusing to enter a move at all (the game master, however, enforces that all moves placed into the envelope are correct, and all players must trust her to do this to ensure that the game is not spoiled). Below, in the section on zero-knowledge proofs, we describe how techniques from computer science can address the role of the game master purely through computation without the need for an entity trusted by all players of the game. 111 See Oded Goldreich, Foundations of Cryptography – A Primer (2005).

3.
Zero-Knowledge Proofs A zero-knowledge proof is a cryptographic tool that allows a decisionmaker, as part of a cryptographic commitment, to prove that the decision policy that was actually used (or the particular decision reached in a certain case) has a certain property, but without having to reveal either how that property is known or what the decision policy actually is.113 For example, consider how money flows in an escrow transaction. Traditionally, an escrow agent holds payment until certain conditions are met. Once they are, the agent attests to this fact and disburses the money according to a predetermined schedule. Zero-knowledge proofs can allow escrow without a trusted agent. Suppose that an independent sales contractor wishes to certify that she has remitted appropriate taxes from her sales in order to be paid by a counterparty, but without revealing precisely how much she was able to sell an item for. Using a zero-knowledge proof, she can demonstrate that sufficient taxes were paid without disclosing her sales prices or earnings to a third party. Another classic example used in teaching cryptography posits that two millionaires are out to lunch and they agree that the richer of them should pay the bill. However, neither is willing to disclose the amount of her wealth to the other. A zero-knowledge proof allows them both to learn who is wealthier (and thus who should pay the restaurant tab) without revealing how much either is worth. 112 A parallel to this assumption is a spoliation inference, which sanctions a party who withholds, tampers, or destroys evidence by assuming that the missing or changed evidence was unfavorable to the spoliator. See Fed. R. Civ. P. 37(e)(2)(A) (providing that if electronically stored information is lost because a party, intending to deprive the other party of the information, failed to take reasonable steps to preserve it, the court may “presume that the lost information was unfavorable to the party”). 113 See Goldreich, supra note 111, at . DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 33 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -33A zero-knowledge proof works with cryptographic commitments to verify procedural regularity in the following manner. If a decisionmaker makes a trio of commitments, A, B, and C, where A is a commitment to the decision policy, B a commitment to the inputs that were used in a particular case, and C a commitment to the decision actually reached in that case, then zero-knowledge proofs let the public verify that A, B, and C really do correspond to each other. In other words, the decisionmaker can prove that, when the committed policy A is applied to the committed input data B, the result is the committed outcome C. This allows decisionmakers to build audit logs, which can be verified by the public to confirm that the decisionmaker applied the appropriate policy to the correct input in order to reach the stated outcome, all without revealing the decision policy itself and without revealing private data that might be included in the input or outcome. Later, if the outcome is challenged, a court or other oversight body can compel the decisionmaker to reveal the actual policy and input used and can verify that it matches the published commitment, effectively providing digital evidence that the decisionmaker was honest about its announced decision. 
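The machinery of a zero-knowledge proof itself is beyond what can be shown here, but the statement such a proof establishes is easy to write down. The sketch below, which reuses the hash-based commitment idea from the previous subsection, spells out that statement in the clear: commitments A, B, and C open to a policy, an input, and an outcome such that running the policy on the input yields the outcome. The helper names and the toy policy format are ours; in a real deployment this check would be compiled into a zero-knowledge proof system so that a verifier learns only that the statement is true, never the secret policy, input, or outcome.

```python
import hashlib
import json
import os

def opens(commitment: str, obj: bytes, opening_key: bytes) -> bool:
    """Hash-based commitment check, as in the commitment sketch above."""
    return hashlib.sha256(opening_key + obj).hexdigest() == commitment

def run_policy(policy_src: bytes, inputs: bytes) -> bytes:
    """Apply the committed policy source to the committed input.

    For this toy, the policy is Python source defining decide(application);
    a real system would fix an audited runtime for executing committed code.
    """
    namespace = {}
    exec(policy_src.decode(), namespace)
    return json.dumps(namespace["decide"](json.loads(inputs))).encode()

def decision_relation_holds(A, B, C, policy_src, inputs, outcome, ka, kb, kc) -> bool:
    """The statement a zero-knowledge proof would establish, written in the clear.

    A, B, and C are the published commitments to the policy, the input, and the
    outcome; the remaining arguments are the secret witness known only to the
    decisionmaker. In deployment this check would be proved in zero knowledge.
    """
    return (opens(A, policy_src, ka) and
            opens(B, inputs, kb) and
            opens(C, outcome, kc) and
            run_policy(policy_src, inputs) == outcome)

# Illustration with an invented one-line policy:
policy = b"def decide(application): return application['score'] >= 70"
entry = json.dumps({"score": 82}).encode()
result = run_policy(policy, entry)
ka, kb, kc = (os.urandom(32) for _ in range(3))
A, B, C = (hashlib.sha256(k + o).hexdigest()
           for k, o in zip((ka, kb, kc), (policy, entry, result)))
assert decision_relation_holds(A, B, C, policy, entry, result, ka, kb, kc)
```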
By using a commitment to the same policy in decisions for multiple decision subjects, a decisionmaker can demonstrate that it is applying a consistent policy across those subjects. Such zero-knowledge proofs can be enhanced to test parts of the decision policy, either by exhibiting properties of the input-output relation (e.g., that a credit score would have been the same if the subject’s gender were reversed) or properties of the policy itself (e.g., that the policy only uses certain inputs for certain purposes). 4. Fair Random Choices Where random choices are part of a decisionmaking process, the fairness of the randomness used in those computer systems should be verifiable. Poorly designed randomization can lead to unaccountable automated decisions. The decisionmaker could influence the supposedly random choices or could generate many sets of random values and then pick the set that gives its preferred outcome. Additionally, a randomized process is not easily reproduced. For example, if it depends on interaction with its environment (e.g., the operating system on which it is running or its human user), its behavior may be altered in a nondeterministic way since that environment can change between runs.114 Automated decision processes must therefore be designed from the beginning to allow for oversight of the decisionmaker and to avoid problems with unpredictable behavior. To solve this problem, a decisionmaker could demonstrate that any unpredictable behavior or random choices in the software does not affect the eventual output; for example, a program designed to find the top of a hill (i.e., 114 One example is a program that chooses a random value based on the time that it has been running but takes different amounts of time to run based on other programs that are running on the same physical computer system. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 34 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -34optimize some objective) can start at any randomly chosen point and take any arbitrary path upwards and will still ultimately return the same maximum value.115 More often, the random choices made by an automated decision process will affect the results. In these cases, the software implementing the decision can always be redesigned to replace the set of random choices made by the software with a small recorded, random input (a seed value) from which any necessary random values can be computed in a deterministic, pseudorandom way. In this way, the decisionmaking process can be replayed so long as the seed is known and the randomness of the input is completely captured by the randomness of the seed. Using this technique, a decisionmaker would not have to generate a new random choice each time a random value is needed by a piece of software (such choices can be made by a cryptographic algorithm that uses the seed to yield reproducible values), nor know in advance how many random choices must be made. This technique allows software that makes random choices, such as a lottery, to be made fully reproducible and reviewable. Unlike capturing the entire environment, as was discussed above, this technique reduces the relevant portion of the environment to a very small and manageable value (the seed) while preserving the benefits of using randomness in the system. If this technique is used, we also must prevent the decisionmaker from tampering with the seed value, as it fully determines all random data accessed by the program implementing the decision policy. 
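The seed technique might look like the following sketch, in which every "random" choice is derived deterministically from one small recorded seed rather than from the program's environment; the lottery framing and all names are ours and purely illustrative.

```python
import hashlib

def pseudorandom_value(seed: bytes, label: str) -> int:
    """Derive a reproducible pseudorandom number from the seed and a label
    naming which random choice is being made."""
    return int.from_bytes(hashlib.sha256(seed + label.encode()).digest(), "big")

def run_lottery(seed: bytes, entrants: list, winners: int) -> list:
    """Rank entrants by seed-derived pseudorandom values and take the top k.

    Because every 'random' choice is a deterministic function of the recorded
    seed, an overseer who later learns the seed can replay the run exactly.
    """
    ranked = sorted(entrants, key=lambda e: pseudorandom_value(seed, e))
    return ranked[:winners]

# The seed would be chosen and recorded up front (see below on choosing it);
# here it is simply a fixed illustrative value.
seed = hashlib.sha256(b"recorded seed material").digest()
print(run_lottery(seed, ["alice", "bob", "carol", "dave"], winners=2))
# Re-running this program with the same seed always prints the same winners.
```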
Several methods can aid in ensuring the fair choice of seed values. A public procedure can be used to select a random value: for example, rolling dice or picking ping pong balls from the sort of device used by state lotteries.116 Alternatively, the seed value could be provided by a trusted third party, such as the random "beacon" operated by the U.S. National Institute of Standards and Technology (NIST).117 In addition, it is possible for a set of mutually distrustful parties (possibly including decision subjects themselves) to engage in an interaction that produces a value that is unpredictable so long as at least one participant provided random input.118

115 In general, this approach will only find the top of some crest, which may or may not be the highest point on a hill (for instance, if a mountain has two peaks, one much higher than the other). Randomness helps fix this problem, however, since an algorithm can start climbing the hill at many different randomly chosen points and verify that they all reach the same highest point. Additionally, for many important problems, one can prove that only a certain limited number of optimal (i.e., highest or lowest) values exist. That is, if an analyst knows that the hill only has one peak, then which path a program takes to the top is irrelevant. For a description of the gradient descent approach to optimization and other approaches, see Richard O. Duda et al., Pattern Classification (2d ed. 2001). 116 Currently known strategies for generating public random values ("randomness beacons") all have advantages as well as disadvantages--dice could be weighted; ping pong balls could be put in the freezer and the cold ones picked out of the machine. The National Institute of Standards and Technology runs a randomness beacon that has come under scrutiny because of distrust of the National Security Agency. To minimize these types of issues, the algorithm designer should pick the source of randomness most likely to be trusted by participants, which may vary. The algorithm designer could choose to collect many sources of random choices and mix them together to maximize the number of participants who will trust the randomness of the chosen seed. However, even physical sources of randomness that have not been tampered with have failed to be accountable for their goals in unexpected ways; for instance, the 1969 lottery for selecting draftees by birthday was later shown to be biased, with a disproportionate number of selectees coming from months late in the year. For a detailed overview of the problem and its causes, see Joan R. Rosenblatt & James J. Filliben, Randomization and the Draft Lottery, Science, Jan. 22, 1971, at 306.

Perhaps the best option is to mix together randomness (sometimes called entropy) from many different sources. The simplest form of this practice would involve a decision subject entering a short random number as part of the input for their decision (e.g., on an application form). Then, the decisionmaker would generate a seed value for each decision by combining this known-to-the-subject, personal random value with (1) a pre-chosen random value that the decisionmaker committed to far in advance of seeing the personal random value, and (2) a unique identifier for the particular decision or decision subject that is difficult to change (e.g., the social security number of the participant).
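The mixing step just described might be implemented along the following lines, with a per-decision seed derived by hashing together a public beacon value, the decisionmaker's previously committed value, a fixed identifier for the decision, and the subject's own entry; every field name and value here is an invented placeholder.

```python
import hashlib

def derive_decision_seed(beacon_value: bytes,
                         agency_committed_value: bytes,
                         subject_id: bytes,
                         subject_entry: bytes) -> bytes:
    """Mix independent sources of randomness into a single per-decision seed.

    If any one contribution is unpredictable, and the others were fixed and
    verifiable before the decision's inputs were known, the decisionmaker
    cannot steer the derived seed. (A real implementation would length-prefix
    the fields rather than join them with a delimiter.)
    """
    material = b"|".join([beacon_value, agency_committed_value,
                          subject_id, subject_entry])
    return hashlib.sha256(material).digest()

# All values below are invented placeholders:
seed = derive_decision_seed(
    beacon_value=b"beacon output published at a known time",
    agency_committed_value=b"value the agency committed to months earlier",
    subject_id=b"application-0042",   # identifier the agency cannot reassign
    subject_entry=b"7301",            # short random number chosen by the subject
)
```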
In order to foster maximum confidence that random choices are not improperly influenced, the decisionmaker should derive them using a combination of (1) a random value from a trusted third party; (2) a random value chosen by the decisionmaker and possibly kept secret, but committed to in advance; (3) a participant- or decision-specific identifier that cannot be changed or controlled by the decisionmaker, such as a social security number, identification number, or other immutable piece of the subject's data; and (4) a value chosen by the decision subject. Since these values are either outside of the decisionmaker's control or are known, fixed, and subsequently verifiable before the inputs to a decision are known, using these methods gives assurance that the decisionmaker is not skewing the results by controlling the selection of random values.119

Where random choices are part of a decisionmaking process, the fairness of the randomness (i.e., the consistency with the goal for which randomness was deployed in a particular system) used in those decisions should be verifiable. This can be achieved by relying on a small random seed and verifying its source.

117 Computer science refers to a trusted third-party source of randomness as a "beacon." The best known beacon is operated by NIST, which publishes new random data every few minutes, ostensibly based on the measurement of quantum mechanical randomness via a device maintained in a NIST lab. NIST Randomness Beacon, Nat'l Inst. Standards & Tech., http://www.nist.gov/itl/csd/ct/nist_beacon.cfm [https://perma.cc/UNT3-6N6P] (last updated Sept. 21, 2016). Recent revelations about NIST's role in allowing the U.S. National Security Agency to undermine the security of random number generation techniques standardized by NIST have led to some distrust of the NIST beacon, although it may be trustworthy in some applications. NIST standardized the Dual EC Deterministic Random Bit Generator (DUAL-EC) in SP 800-90A in 2007. At that time, cryptographers already knew the standard could accommodate a "backdoor," or secret vulnerability. See Dan Shumow & Niels Ferguson, On the Possibility of a Back Door in the NIST SP800-90 Dual Ec Prng, in 7 Proc. Crypto (2007). Later, it was discovered that the NSA had very likely made use of this mechanism to create a backdoor in the standard itself. See Daniel J. Bernstein et al., Dual EC: A Standardized Back Door, in The New Codebreakers 256 (2016). Other beacon implementations have been proposed, including beacons based on "cryptocurrencies" such as Bitcoin. See, e.g., Joseph Bonneau et al., On Bitcoin as a Public Randomness Source, https://eprint.iacr.org/2015/1015.pdf [https://perma.cc/XQ38-FJ3H] (outlining a specific alternate proposal involving the use of Bitcoin as a source of publicly verifiable randomness). 118 Computer science has methods to simulate a trusted third party making a random choice. These methods require the cooperation of many mutually distrustful parties, such that as long as any one party chooses randomly, the overall choice is random. By selecting many participants in this process, one can maximize the number of people who will believe that the chosen value is in fact beyond undue influence. For an easy-to-follow introduction to these methods, see Manuel Blum, Coin Flipping by Telephone: A Protocol for Solving Impossible Problems, in 1981 Crypto, at 11.
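The multiparty approach mentioned in note 118 can also be sketched with the commitment tools already introduced: each party commits to a random contribution, and only after every commitment is published are the contributions revealed, checked, and combined. The orchestration below is, again, only an illustration.

```python
import os
import hashlib
from functools import reduce

def commit(value: bytes):
    """Hash-based commitment, as in the earlier sketch."""
    opening_key = os.urandom(32)
    return hashlib.sha256(opening_key + value).hexdigest(), opening_key

def opens(commitment: str, value: bytes, opening_key: bytes) -> bool:
    return hashlib.sha256(opening_key + value).hexdigest() == commitment

# Phase 1: each party picks a secret random contribution and publishes only a
# commitment to it, so no one can tailor a choice to the others' values.
contributions = [os.urandom(32) for _ in range(3)]
committed = [commit(v) for v in contributions]  # list of (commitment, opening_key)

# Phase 2: once every commitment is public, the values and opening keys are
# revealed, checked against the commitments, and XORed into a shared seed.
assert all(opens(c, v, k) for v, (c, k) in zip(contributions, committed))
shared_seed = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), contributions)
```

The combined value is unpredictable so long as any single contribution was; the last party to reveal sees the others' values first, but because it is already bound by its commitment, the worst it can do is refuse to reveal rather than bias the result.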
Once a random seed has been chosen in a satisfactory manner, it is still necessary to verify that the seed was in fact used in later decisions.120 This can be accomplished by the techniques we describe. D. Applying Technical Tools Generally Our general strategy in designing systems accountable for their procedural regularity is to require systems to create cryptographic commitments as digital evidence of their actions. Systems can be designed to publish commitments describing what they will do (i.e., a commitment to the decision policy enforced by the system, as represented by source code) before they are fielded and commitments describing what they actually did (i.e., a commitment to the inputs121 and outputs of any particular decision) after they are fielded. Zero-knowledge proofs can be used to ensure that these commitments actually correspond to the actions taken by a system.122 Indeed, it is possible to use zero-knowledge proofs to verify, for each decision, that the committed-to decision policy applied to the committed-to inputs yields the committed-to outputs.123 These zero-knowledge proofs could either be made public or provided to the system’s decision subjects along with their results. 119 When the fairness of random choices is key to the accountability of a decision process, great care must be taken in determining the source of random seed values, as many very subtle accountability problems are possible. For example, by changing the order in which decisions are taken, the decisionmaker can effectively “shop” for desirable random values by computing future deterministic pseudorandom values and picking the order of decisions based on its preference for which decisions receive which random choices. To prevent this, it may also be necessary to require that the decisionmaker take decisions in a particular order or that the decisionmaker commit to the order in which it will take decisions in advance of the seed being chosen. For a detailed description of the problems with randomness “shopping” and post-selection by a decision authority, see Joshua Alexander Kroll, Accountable Algorithms (Sept. 2015) (unpublished Ph.D dissertation, Princeton University) (on file with author). 120 For example, several state lotteries have been defrauded by insiders who were able to control what random values the lottery system used to decide winners. Specifically, an employee of the Multi-State Lottery Association (MUSL) was convicted of installing software on the system that controlled the random drawing and using the information gleaned by the software to purchase winning tickets for the association’s “Hot Lotto” game. See Grant Rodgers, Hot Lotto Rigger Sentenced to 10 Years, Des Moines Register (Sept. 9, 2015, 7:12 PM), http://www.desmoinesregister.com/story/news/crime-and-courts/2015/09/09/convicted-hot-lottorigger-sentenced-10-years/71924916 [https://perma.cc/U26A-8VMD] (describing the Iowa lottery fraud sentencing). 121 Note that, for these commitments to function, systems must also be designed to be fully reproducible, capturing all interactions with their environments as explicit inputs that can then be contained in published commitments. The use of seed values for randomization, discussed above in subsection II.C.4, offers one example of ensuring reproducibility. 122 The approach here was introduced in Kroll, supra note 119. 123 Id. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 37 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 
165: forthcoming] -37By disclosing commitments instead of source code or inputs and outputs, system operators can fully explain what their systems do without actually disclosing how those systems work up front. Later, if it becomes necessary to review the actions or decision policy of a system during a court case or regulatory action, system operators can disclose the contents of their commitments (that is, source code, inputs, and outputs), possibly under a protective regime. If it is possible to disclose these values publicly, then system operators may also choose (or be required) to do so. However, whether these data are disclosed or not, the published commitments and zero-knowledge proofs allow overseers and the public at large to verify that the decisions of some authority actually correspond to a specific predetermined policy rather than the arbitrary whim of a decisionmaker. Further, by observing that all decisions arise from the same policy, anyone reviewing these commitments can be certain that the policy was used for all decisions simply by checking that the commitments to the decision policy are consistent across decisions. By requiring commitments to be published far in advance of any decision, it is possible to ensure that the particulars of a decision policy were chosen independently of the factors in the decisions it would render. For example, a decision policy that selects which individuals will receive a tax audit should be based on the risk of tax evasion, which in turn can be inferred by properties of the tax return itself. However, a corrupt tax authority could pick out individuals for audit and guess the particulars of their tax return data, then tailor the audit decision policy accordingly. Further, if a policy must be approved in advance by some oversight or certification body, the policy would need to be decided on and implemented in software far enough in advance to admit certification or review. Finally, if such certification does take place, subjects of the policy’s decisions (or overseers of those decisions) can be certain after the fact that the policy which was certified is the policy which was actually used in practice.124 To the extent that the invariants of interest in a computer system are simple enough to compute, their truth can be verified by the same zero-knowledge proofs that attest to the relationship between the code, the inputs, and the outputs. Because powerful, modern zero-knowledge techniques can be applied to any code, they can also be applied to code that performs the analysis of these invariants, and the execution of that code can be considered as part of the operation of the system.125 124 Electronic voting systems have suffered from such problems in practice. In many jurisdictions, voting system software must be certified before it can be used in polling places. Systems are tested by the Election Assistance Commission (EAC), an independent commission created by the 2002 Help America Vote Act. See Testing and Certification Program, U.S. Election Assistance Commission, http://www.eac.gov/testing_and_certification [https://perma.cc/8DFX-LTYD] (detailing the EAC’s testing and certification regime). However, in many cases, updated, uncertified software has been used in place of certified versions because of pressure to include updated functionality or bug fixes. See, e.g., Fitrakis v. Husted, No. 2:12-cv-1015, 2012 WL 5411381 (S.D. Ohio Nov. 
6, 2012) (involving a suit arising out of updates to voting systems immediately prior to the 2012 presidential election in Ohio). 125 For example, suppose that we wish to demonstrate that a decision would be the same if the subject's gender were reversed. The software implementing the decision could simply compute the decision with a different gender for each subject and confirm that the same result is reached in each case.

Simply publishing commitments to the inputs and outputs of a system rather than making them transparent will not solve all of the issues with transparency brought up in Part I. However, it can address the need for legitimate secrecy of the system, its inputs, or its outputs. Because it is possible using these methods to verify that a particular input and a particular decision policy correspond to a particular output, it is not strictly necessary to see these values in order to investigate the system's procedural regularity. We describe how certification of procedural regularity can be done for randomized software, such as software implementing a lottery, in greater detail in Section II.E below. Later, in Part III, we explain how these tools can extend to certify other, more complicated invariant properties of interest, enabling proof that a system comports with substantive goals or principles beyond simple procedural regularity.

E. Applying Technical Tools to Reform the Diversity Visa Lottery

Armed with these tools, we can turn to the question of how to ensure the procedural regularity of automated decisionmaking. To illustrate how designing a computer system can make it more accountable, we will apply the methods described above to a case study: the Diversity Visa Lottery (DVL) operated by the U.S. Department of State.

1. Current DVL Procedure

The DVL is run annually by the State Department to grant U.S. permanent resident visas ("green cards") to 50,000 immigrants from around the world. The process, prescribed by 8 U.S.C. § 1153(c), is intended to increase the national and regional diversity of immigrants to the U.S. by granting visas to a sample of people from countries otherwise underrepresented in the immigrant population. The annual DVL process operates as follows.126 Would-be immigrants apply to be entered in the lottery, and applicants are grouped according to their country of birth. Within each country group, applicants are put into a rank-ordered list in a random order (the lottery step). The Attorney General then calculates the number of applicants to accept from each country using a formula based on the number of immigrants to the U.S. in recent years from each country and region. The calculated number of applicants is selected from the top of each country's rank-ordered list. These "winners" are screened for eligibility to enter the U.S., and they receive visas if they are eligible. In some years, additional winners are selected so that all statutorily available visas can eventually be awarded, even if some applicants fail the screening process, drop out, or fail to proceed with their visa application.127

126 See generally Immigration and Nationality Act § 203(c), 8 U.S.C. § 1153(c) (2012); U.S. Dep't of State, Foreign Affairs Manual ch. 9, § 502.6. 127 Visa Bulletin for June 2015, U.S. Dep't St.
Bureau Consulate Aff., https://travel.state.gov/content/visas/en/law-and-policy/bulletin/2015/visa-bulletin-for-june- 2015.html [https://perma.cc/7H7L-SJKX]. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 39 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -39Questions have been raised about the correctness and accountability of this process. Would-be immigrants sometimes question whether the process is truly random or, as some suspect, is manipulated in favor of individuals or groups favored by the U.S. government. This suspicion, in turn, may subject DVL winners to reprisals, on the theory that winning the DVL is evidence of having collaborated secretly with U.S. agencies or interests. There have also been undeniable failures in carrying out the DVL process. For example, the 2012 DVL initially reported incorrect results due to programming errors coupled with lax management.128 An accountable implementation of the DVL could address both issues by demonstrating that there is no favoritism in the process and by making it easy for outsiders to check that the process was executed correctly. 2. Transparency Is Not Enough The DVL is an automated decision system for which transparency alone cannot solve its problems. First, the software implementing the decisions appears to be written in an irreproducible way.129 The system relies on the computer’s operating system to provide random numbers; thus, attempts to replicate the program’s execution at another time or on another computer will yield different random numbers and therefore a different DVL result. Notably, no amount of reading, analyzing, or testing of the software can remedy the nonreplicability of this software. Second, the privacy interests of participants bar full transparency. People who apply to the DVL do not want their information, or even the fact that they applied, to be published. However, such publication is needed for the process to be verified through transparency and auditing. The Department could try to work around this problem by assigning an opaque record ID to each applicant and then having the lottery choose record IDs rather than applicants, but lottery operators could manipulate the outcome by retroactively assigning winning record IDs to people they wanted to favor. Further, it would be difficult to verify that no extra record IDs corresponding to actual participants had been added. 3. Designing the DVL for Accountability Instead of this inherently unverifiable approach, we propose a technical solution for building an accountable version of the DVL.130 Using the techniques we have described, the State Department can demonstrate that it is running a fair lottery among a hidden set of participants.131 128 Memo from Howard W. Geisel, Deputy Inspector General, U.S. Dep’t of State, Review of the FY 2012 Diversity Visa Program Selection Process, ISP-I-12-01 (Oct. 13, 2011), https://oig.state.gov/system/files/176330.pdf [https://perma.cc/4FWM-URYJ]. 129 Id. 130 A full technical analysis is beyond the scope of this paper. 131 Note that it is less straightforward to prove that the set of participants actually considered in the lottery matches the set of individuals who applied to be included. For example, the operator of the lottery might insert “shills,” or lottery entries that do not correspond to any real applicant, and if one of these applications were to be chosen, that place could be given to anyone of the Department’s choosing. 
It is technically nontrivial to prove that no extra applications were considered; studies of end-to-end secure voting protocols provide methods to do so. See, e.g., Daniel Sandler et al., VoteBox: A Tamper-Evident, Verifiable Electronic Voting System, 2008 Proc. 17th Conf. on Security Symposium 349 (enunciating the measures necessary to make electronic voting secure).

We can solve the nonreplicability problem by choosing a random seed, as described previously. The third party generating the random value used to create the seed could be one or more trusted NGOs, or applicants could provide a "PIN" on their applications. Recall that the decision policy for the DVL is fixed in statute and hence publicly known. To provide additional oversight, the State Department could publish in the Federal Register a commitment to its software source code (far in advance of any decisions being made) and a commitment to all the inputs (i.e., to each data element in an application for the U.S. visa) used to create the rank-ordered list and calculate the cut-off points. The State Department also could provide a zero-knowledge proof showing that applying the committed-to software to the committed-to inputs produces the announced lottery results. The proof could also demonstrate that the commitment to the software published in advance of all decisions is a commitment to the same software as the one used in each individual decision. These actions would bind the State Department to its choices of software, source code, and applicant data; ensure that the commitment to the software was not a fake; and prove that the same procedure was used to render each decision. Subsequent auditing by an oversight body could establish that the source code in the commitment faithfully implements the policy specified by statute (the code should be designed to enable this).

Finally, the State Department could determine an adequate method for generating a random seed to be used in the lottery step. This method should guarantee to the public that it is not possible for the State Department to choose winners by rearranging applications.132 This could be accomplished by combining random data chosen at a public ceremony (as is done for state lotteries); alternatively, the State Department could cooperate with interested NGOs to produce a verifiable random seed that combines a random value selected exclusively by the State Department (and published prior to the ceremony and any lottery applications) with something that identifies a particular lottery entry uniquely (e.g., the applicant's full application data, reduced by cryptographic hash to a small numeric value). Depending on the implementation and application, the State Department could also include randomness selected by DVL applicants on their application, which could be harvested passively by tracking mouse movements during the application process.133

Once these steps are taken, each applicant can be assured that the State Department's decision on his application is fully explainable. If the applicant has questions regarding the process or a governmental overseer wants to audit it, the decisions will be replicable, and, if necessary, the secret source code and secret input data (including the random choices made in the lottery step) can be revealed
132 Random choices in the DVL must be demonstrably random even to nonparticipants so that winners can plausibly claim that they were chosen by lottery and not because of sympathy for U.S. interests. 133 See supra note XX DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 41 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -41and verified--by a court or auditing agency--to be the proper code and data used to render the decision.134 These solutions depend on both redesigning the software code (a technical solution) and adopting procedures relating to how the software program is used (a legal or policy solution). They must be deployed during the design of the decision process and cannot salvage a poorly designed system after the fact. In hindsight, it should not be surprising that the path to accountability for computational processes requires some redesign of the processes themselves. The same is true for noncomputational administrative processes, where the most accountable processes are those that are designed with accountability in mind. III. Designing Algorithms to Assure Fidelity to Substantive Policy Choices In Part I, we described methods that permit certification of properties of systems, and in Part II, we demonstrated how those methods can ensure that automated decisions are reached in accordance with agreed upon rules, a goal we called procedural regularity. In this Part, we examine how those methods could be used to certify other system properties that policymakers desire. Accountability demands not only that we certify that a policy was applied evenly across all subjects, but also that those subjects can be certain that the policy furthers other substantive goals or principles. A subject may want to know: Is the rule correctly implemented? Is it moral, legal, and ethical? Does it operate in the aggregate with fidelity to substantive policy choices? We focus here on the goal of nondiscrimination135 in part because specific, additional technical tools have developed to assist with it and in part because the use of automated decisionmaking already has raised concerns about discrimination and the ability of current legal frameworks to deal with technological change.136 The well-established potential for unfairness in systems that use machine learning, in which the decision rule itself is not programmed by a human but rather inferred from data, has heightened these discrimination concerns. However, what makes a rule unacceptably discriminatory against some person or group is a fundamental and contested question. We do not address that question here, much less claim to 134 In fact, just as the applicant can be convinced that his decision is explainable without seeing the secret algorithm or secret inputs, an oversight body can be convinced that particular decisions were made correctly without seeing the applicant’s inputs, which might contain sensitive data, like health records or tax returns. Thus, subsequent auditing is rendered more useful and more acceptable to participants, as it can determine the basis for every decision without revealing sensitive information. 135 The word “discrimination” carries a very different meaning in engineering conversations than it does in public policy. Among computer scientists, the word is a value-neutral synonym for differentiation or classification: a computer scientist might ask, for example, how well a facial recognition algorithm successfully discriminates between human faces and inanimate objects. 
But, for policymakers, “discrimination” is most often a term of art for invidious, unacceptable distinctions among people–-distinctions that either are, or reasonably might be, morally or legally prohibited. We use the latter meaning here. 136 See Pasquale, supra note 47, at 8-9 (describing the problem of discrimination through the use of automated decisionmaking). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 42 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -42resolve it with computational precision. Instead, we describe how an emerging body of computer science techniques may be used to avoid outcomes that could be considered discriminatory. Fidelity to policy choices like nondiscrimination is a more complicated goal than procedural regularity, and the solutions that currently exist to address it are less robust. Technical tools can ameliorate these problems, but they generally require a well-defined notion of what sort of fairness they are supposed to be enforcing. In this Part, we outline a few proposed well-defined notions. We present these techniques as examples of system properties that could be certified using the techniques described in Part I, but we do not necessarily advocate for any of them; ultimately, policymakers must decide whether these properties or others square with nondiscrimination goals. In addition, the precision of computer code often brings into sharp focus the tensions surrounding antidiscrimination within current legal frameworks. Computers favor hard and fast rules over the types of standards and balancing tests often found in our common law system and civil rights law. While this tension suggests that doctrinal reform would make it easier to apply computerized decisionmaking in an area, we are not advocating a policy regime entirely made of bright line rules or predetermined fairness criteria. In fact, we believe that investigations of fairness should always be in the purview of ex post review processes. Instead, we offer an overview of the problem of algorithmic discrimination, the current state of the related technical tools, and the relationship of these tools to existing legal frameworks. Our aim is to both elucidate the current state of the art and suggest directions for further research and action. A. Machine Learning, Policy Choices, and Discriminatory Effects We focus here on decisions developed through machine learning--on situations where a machine has been “trained” through exposure to a large quantity of data and infers a rule from the patterns it observes. Computers are especially well-suited to discover patterns in these input-output pairs that can then guide future decisionmaking. In contrast to human-made rules, these rules for decisionmaking are induced from historical examples—-they are, quite literally, rules learned by example. Humans orchestrate a computerized rule-creation process, rather than imparting the rules directly. These kinds of decisions raise problems for the methods described in Part I because they do not have obvious decision rules fixed by the system’s designer, which can be verified in the manner we have described. Instead, for the tools to apply, policymakers must determine that certain system properties would reflect substantive policy goals, and if such properties exist, they can then be certified and permit the type of accountability we have proposed above. 
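As a deliberately trivial illustration of a rule "learned by example," the sketch below induces a hiring rule from a handful of historical decisions rather than from any criteria a person wrote down; the features, data, and library choice are ours.

```python
from sklearn.tree import DecisionTreeClassifier

# Historical examples: [years_experience, certifications] and the past human decision.
past_applicants = [[0, 0], [1, 0], [2, 1], [5, 2], [7, 1], [10, 3]]
past_decisions = [0, 0, 0, 1, 1, 1]  # 0 = rejected, 1 = interviewed

# No person writes the rule; it is induced from the examples above.
model = DecisionTreeClassifier().fit(past_applicants, past_decisions)

# The induced rule is then applied to new applicants.
print(model.predict([[4, 1], [1, 0]]))
# If the historical decisions embedded bias, the learned rule inherits it.
```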
Machine learning is an increasingly common approach to solving problems that once seemed computationally intractable due to their complexity (e.g., object recognition in a photograph). The recent movement of software systems into a growing number of domains owes primarily to successful applications of machine learning, which is thus the primary focus of our analysis. A significant concern about automated decisionmaking is that it may simultaneously systematize and conceal discrimination. Because it can be difficult DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 43 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -43to predict the effects of a rule in advance (especially for large, complicated rules or rules that are machine-derived from data), regulators and observers may be unable to tell that a rule has discriminatory effects. In addition, decisions made by computers may enjoy an undeserved assumption of fairness or objectivity.137 However, the design and implementation of automated decision systems can be vulnerable to a variety of problems that can result in systematically faulty and biased determinations.138 These decision rules are machine-made and follow mathematically from input data, but the lessons they embody may be biased or unfair nevertheless. Below, we describe a few illustrative ways that models, or decision rules derived from data, generated through machine learning, may turn out to be discriminatory. We adapt a taxonomy laid out in previous work by Solon Barocas and Andrew D. Selbst139 and make use of the “catalog of discriminatory evils” of machine learning systems laid out by Hardt140 and Dwork et al.141 First, algorithms that include some type of machine learning can lead to discriminatory results if the algorithms are trained on historical examples that reflect past prejudice or implicit bias, or on data that offer a statistically distorted picture of groups comprising the overall population. Tainted training data would be a problem, for example, if a program to select among job applicants is trained on the previous hiring decisions made by humans, and those previous decisions were themselves biased.142 Statistical distortion, even if free of malice, can produce similarly troubling effects: consider, for example, an algorithm that instructs police to stop and frisk pedestrians. If this algorithm has been trained on a dataset that overrepresents the incidence of crime among certain groups (because these groups have historically been the target of disproportionate enforcement), the algorithm may direct police to detain members of these groups at a disproportionately high rate (and nonmembers at a disproportionately low rate). Such was the case with the 137 See Paul Schwartz, Data Processing and Government Administration: The Failure of the American Legal Response to the Computer, 43 Hastings L.J. 1321, 1342 (1992) (describing the deference that individuals give to computer results as the “seductive precision of output”). 138 See id. at 1342-43 (noting that the computer creates “new ways to conceal ignorance and subjectivity” because people overestimate its “accuracy and applicability”). 139 See Barocas & Selbst, supra note 8 (describing a taxonomy that isolates specific technical issues to create a decisionmaking model that may disparately impact protected classes). 140 Moritz A.W. Hardt, A Study of Privacy and Fairness in Sensitive Data Analysis (Nov. 2011) (Unpublished Ph.D dissertation, Princeton University) (on file with author). 
141 Cynthia Dwork et al., Fairness Through Awareness, 2012 Proc. 3rd Innovations in Theoretical Computer Sci. Conf. 214. 142 See Barocas & Selbst, supra note 8, at 682 (citing Stella Lowry & Gordon Macpherson, A Blot on the Profession, 296 Brit. Med. J. 657, 657 (1988)) (describing how a hospital developed a computer program to sort medical school students based on previous decisions that had disfavored racial minorities and women). Another example is a Google algorithm that showed ads for arrest records much more frequently when black-identifying names were searched than when white-identifying names were searched--likely because users clicked more often on arrest record ads for black-identifying names and the algorithm learns from this behavior with the purpose of maximizing click-throughs. Id. at 682-83 (citing Latanya Sweeney, Discrimination in Online Ad Delivery, Comm. ACM, May 2013, at 44, 47 (2013)).

New York City Police Department's stop-and-frisk program, for which data from 2004 to 2012 showed that 83% of the stops were of black or Hispanic persons and 10% were of white persons in a resident population that was 52% black or Hispanic and 33% white.143 Note that the overrepresentation of black and Hispanic people in this sample may lead an algorithm to associate typically black or Hispanic traits with stops that lead to crime prevention, simply because those characteristics are overrepresented in the population that was stopped.144

Second, machine learning models can build in discrimination through choices in how models are constructed. Of particular concern are choices about which data the model should consider, a problem computer scientists call feature selection. Three types of choices about inputs could be of concern: using membership in a protected class directly as an input (e.g., decisions that take gender into account explicitly); considering an insufficiently rich set of factors to assess members of a protected class with the same degree of accuracy as nonmembers (e.g., in a hiring application, if fewer women have been hired previously, data about female employees might be less reliable than data about male employees); and relying on factors that happen to serve as proxies for class membership (e.g., women who leave a job to have children lower the average job tenure for all women, causing this metric to be a known proxy for gender in hiring applications). Eliminating proxies can be difficult, because proxy variables often contain other useful information that an analyst wishes the model to consider (for example, zip codes may indicate both race and differentials in local policy that are of legitimate interest to a lender). The case against using a proxy is clearer when alternative inputs could yield equally effective results with fewer disadvantages to protected class members. A problem of insufficiently rich data might be remedied in some cases by gathering more data or more features, but if discrimination is already systemic, new data will retain the discriminatory impact. While it is tempting to say that technical tools could allow perfect enforcement of a rule barring the use of protected attributes, this may in fact be an undesirable policy regime. As previously noted, there may be cases where allowing an algorithm to consider protected class status can actually make outcomes fairer.
This may require a doctrinal shift, as, in many cases, consideration of protected status in a decision is presumptively a legal harm. Third and finally, there is the problem of “masking”: intentional discrimination disguised as one of the above-mentioned forms of unintentional discrimination. A prejudiced decisionmaker could skew the training data or pick proxies for protected classes with the intent of generating discriminatory results.145 143 David Rudovsky & Lawrence Rosenthal, Debate: The Constitutionality of Stop-and-Frisk in New York City, 162 U. Pa. L. Rev. Online 117, 120-21 (2013). 144 The underrepresentation of white people would likely cause the opposite effect, though it could be counter-balanced if, say, the police stopped a subset of white people who were significantly more likely to be engaged in criminal behavior. 145 See Barocas & Selbst, supra note 8, at 692-93 (describing ways to intentionally bias data collection in order to generate a preferred result). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 45 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -45More pernicious masking could occur at the level of designing a machine learning model, which is a very human-driven, exploratory process.146 B. Technical Tools for NonDiscrimination As mentioned in the previous Part, transparency and after-the-fact auditing can only go so far in preventing undesired results. Ideally, those types of ex post analyses should be used in tandem with powerful ex ante techniques during the design of the algorithm. The general strategy we proposed in Section II.D-- publishing commitments and using zero-knowledge proofs to ensure that commitments correspond to the system’s decisionmaking actions--can certify any property of the decision algorithm that can be checked by a second examination algorithm.147 Such properties can be proven by making the examination algorithm public and giving a zero-knowledge proof that, if the examination algorithm were run on the secret decision algorithm, it would report that the decision algorithm has the desired property. The question then is which, if any, properties policymakers would want to build into particular decision systems. A simple example of such a property would be the exclusion of a certain input from the decisionmaking process. A decisionmaker could show that a particular algorithm does not directly use sensitive or prohibited classes of information, such as gender, race, religion, or medical status. The use of machine learning adds another wrinkle because decision rules evolve on the fly. However, the absence of static, predetermined decision rules does not necessarily preclude the use of our certification strategy. Computer scientists, including Hardt,148 Dwork et al.,149 and others, have developed techniques that formalize fairness in such a way that they can constrain the machine learning process so that learned decision rules have specific well-defined fairness properties. These properties also can be incorporated in the design of systems such that their inclusion in the decisionmaking process can be certified and proven.150 We describe three such properties below. First, decisions can incorporate randomness to maximize the gain of learning from experience. Second, computer science offers many emerging approaches to maximize fairness, defined in a variety of ways, in machine learning systems. 
At a high level, all of these definitions reduce to the proposition that similarly situated people should be treated similarly, without regard to sensitive attributes. As we shall see, simple blindness to these attributes is not sufficient to guarantee even this simplified notion of fairness. Finally, related ideas from differential privacy can also be used to guarantee that protected status could not have been a substantial factor in certain decisions. 146 In other words, the machine learning model would be intentionally coded to develop bias. 147 Such an algorithm might be a tool for verifying properties of software or simply a software test. See supra Part I (discussing software testing and software verification in greater detail). 148 Hardt, supra note 140. 149 Dwork et al., supra note 141. 150 We concentrate on certification and proof of a system property to an overseer, observer, or participant. However, these tools are also valuable for compliance (since proofs can certify to the implementer of a system that the system is working as intended) and for demonstration that a decisionmaker will be able to show how and why they used certain data after the fact in case of an audit or dispute. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 46 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -46Those who use algorithmic decisionmaking today regularly make assertions about properties of these algorithms without proving them. This likely occurs because they are required by law to disclose certain facts about their decision process to regulators and consumers,151 they simply want to generate good will,152 or they demonstrate better behavior than a competitor.153 But without proof, these assertions are just words on paper, subject to challenge by skeptical regulators and disbelief by skeptical consumers. This skepticism is not entirely unfounded: these assertions have proved false in the past.154 Digital evidence such as zero-knowledge proofs gives a direct connection between the fact being asserted and the technical mechanism of decisionmaking. This proof provides the consumer with a high assurance that the assertion proffered relates meaningfully to the facts on the ground. 1. Learning from Experience As mentioned in Section I.B, incorporating randomness into an algorithm can give it flexibility to operate outside of the environment for which it was designed. Similarly, randomness can prevent hidden biases in the design or deployment of an algorithm from leading to consistent discriminatory outcomes. There is a large and rich literature on how to maximally learn from previous data and how to use random choices to ensure that a model is as faithful as possible to the real world.155 Consider a machine learning algorithm for hiring that is trained using a biased set of initial data indicating that women are weak candidates, even though gender does not predict job performance among the full population. If the resulting model would hire mostly men, the algorithm for hiring can create a self-fulfilling prophecy in which it finds that characteristics of successful hires correlate strongly with proxies for gender. But, if the algorithm is designed to incorporate an element of randomness such that some candidates who are not predicted to do well get hired (and have their performance tracked), the validity of the initial assumptions can be tested and the accuracy and fairness of the entire system will benefit over time. 
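A minimal sketch of this exploration idea, with an invented threshold and exploration rate: with small probability the system interviews a candidate its model would otherwise reject and records how that candidate actually performs, so that later training rounds can correct a biased starting model.

```python
import random

EXPLORATION_RATE = 0.05  # small fraction of decisions used to test the model

def decide_with_exploration(model_score: float, threshold: float = 0.5):
    """Follow the model's prediction most of the time, but occasionally give a
    below-threshold candidate a chance so the training data can correct itself.

    Returns (decision, explored); exploratory outcomes are tracked and fed back
    into later training rounds.
    """
    if model_score >= threshold:
        return "interview", False
    if random.random() < EXPLORATION_RATE:
        return "interview", True
    return "reject", False

decision, explored = decide_with_exploration(model_score=0.31)
```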
By occasionally guessing about candidates for which the model cannot make confident predictions, the model can gather additional data and evolve to become more faithful to the real world. Similarly, randomness is often necessary when training machine learning models. Models may become too specialized or specific to the data used for training, a problem called “overfitting.” Randomness can prevent this problem. Likewise, models may find a decision rule is well-suited for some portion of the input, but not the best rule overall. Randomness can also help avoid this bias. Consider, for example, a credit-scoring model trained initially on a biased set of 151 See, e.g., 12 C.F.R §§ 203.4-5 (2015) (providing requirements for the compilation, disclosure, and reporting of loan data). 152 [CITE example TK] 153 [CITE example TK] 154 [CITE example TK] 155 This literature is divided between the machine learning research community in Computer Science and the study of optimal decisionmaking in Statistics. See supra note 17. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 47 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -47data that underrates the creditworthiness of some minority group. Even if the model is the best possible decision rule for a population matching the biased input data, the model may unfairly deny access to credit to members of that minority group. In addition to the discrimination, the use of this model would deny creditors business opportunities with the unfairly rejected individuals. Here again, allowing the model to occasionally guess randomly, while tracking expected versus actual performance, can improve the model’s faithfulness to the population on which it is actually used rather than the biased population on which it was trained. The information from this injection of randomness can be fed back to the model to improve the accuracy and fairness of the system overall. 2. Fair Machine Learning One commonly understood way to demonstrate that a decision process is independent of sensitive attributes is to preclude the use of those sensitive attributes from consideration. For example, race, gender, and income may be excluded from a decisionmaking process to assert that the process is “race-blind,” “gender-blind,” or “income-blind.”156 From a technical perspective, however, this approach is naive. Blindness to a sensitive attribute has long been recognized as an insufficient approach to making a process fair. The excluded or “protected” attributes can often be implicit in other nonexcluded attributes. For example, when race is excluded as a valid criterion for a credit decision, redlining may occur when a zip code is used as proxy that closely aligns with race.157 This type of input “blindness” is insufficient to assure fairness and compliance with substantive policy choices. Although there are many conceptions of what fairness means, we consider here a definition of fairness in which similarly situated people are given similar treatment--that is, a fair process will give similar participants a similar probability of receiving each possible outcome. This is the core principle of a developing literature on fair classification in machine learning, an area first formalized by Dwork, Hardt, Pitassi, Reingold, and Zemel.158 This work stems from a longer line of research on mechanisms for data privacy.159 We further describe the relationship between fairness in the use of data and privacy below. 156 See, e.g., 12 C.F.R. 
§ 1002.5(b) (2015) (“A creditor shall not inquire about the race, color, religion, national origin, or sex of an applicant or any other person in connection with a credit transaction.”); 12 C.F.R. § 1002.6(b)(9) (2015) (“[A] creditor shall not consider race, color, religion, national origin, or sex (or an applicant’s or other person’s decision not to provide the information) in any aspect of a credit transaction.”) 157 See Jessica Silver-Greenberg, New York Accuses Evans Bank of Redlining, N.Y. Times Dealbook (Sept. 2, 2014), http://dealbook.nytimes.com/2014/09/02/new-york-set-to-accuse-evansbank-of-redlining [https://perma.cc/3YFA-6N4J] (detailing a redlining accusation in great detail). 158 Dwork et al., supra note 141. 159 Specifically, the work of Dwork et al. is a generalization of ideas originally presented in Cynthia Dwork, Differential Privacy, 2006 Proc. 33rd Int’l Colloquium on Automata, Languages & Programming 1. As discussed below, fairness can be viewed as the property that sensitive or protected status attributes cannot be inferred from decision outcomes, which is very much a privacy property. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 48 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -48The principle that similar people should be treated similarly is often called individual fairness and it is distinct from group fairness in the sense that a process can be fair for individuals without being fair for groups.160 Although it is almost certainly more policy-salient, group fairness is more difficult to define and achieve. The most commonly studied notion of group fairness is statistical parity, the idea that an equal fraction of each group should receive each possible outcome. While statistical parity seems like a desirable policy because it eliminates redundant or proxy encodings of sensitive attributes, it is an imperfect notion of fairness. For example, statistical parity says nothing about whether a process addresses the “right” subset of a group. Imagine an advertisement for an expensive resort: we would not expect that showing the advertisement to the same number of people in each income bracket would lead to the same number of people clicking on the ad or buying the associated product. For example, a malicious advertiser wishing to exclude a minority group from a resort could design its advertising program to maximize the likelihood of conversion for the desired group while minimizing the likelihood that the ad will result in a sale to the disfavored group. In the same vein, if a company aimed to improve the diversity of its staff by offering the same proportion of interviews to candidates with minority backgrounds as are minority candidate applications, that is no guarantee that the number of people hired will reflect the population of applicants or the population in general. And the company could hide discriminatory practices by inviting only unqualified members of the minority group to apply, effectively creating a self-fulfilling prophecy for decision rules established by machine learning. The work of Dwork et al. identifies an additional interesting problem with the “fairness through blindness” approach: by remaining blind to sensitive attributes, a classification rule can select exactly the opposite of what is intended.161 Consider, for example, a system that classifies profiles in a social network as representing either real or fake people based on the uniqueness of their names. 
In European cultures, from which a majority of the profiles come, names are built by making choices from a relatively small set of possible first and last names, so a name which is unique across this population might be suspected to be fake. However, other cultures (especially Native American cultures) value unique names, so it is common for people in these cultures to have names that are not shared with anyone else. Since a majority of accounts will come from the majority of the population, for which unique names are rare, any classification based on the uniqueness of names will inherently classify real minority profiles as fake at a higher rate than majority profiles,162 and may also misidentify fake profiles using names drawn from the minority population as real. This unfairness could be remedied if the system were “aware” of the minority status of a name under 160 Sometimes, a more restrictive notion of individual fairness implies group fairness. Id. Intuitively, this is because if people who are sufficiently similar are treated sufficiently similarly, there is no way to construct a minority of people who are treated in a systematically different way. 161 See Dwork et al., supra note 141. 162 That is, the minority group will have a higher false positive rate. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 49 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -49consideration, since then the algorithm could know whether the implication of a unique name is that a profile is very likely to be fake or very likely to be real.163 This insight explains why the approach taken by Dwork et al. is to enforce similar probabilities of each possible outcome on similar people, requiring that the aggregate difference in probability of any individual receiving any particular outcome be limited.164 Specifically, Dwork et al. require that this difference in chance of outcome be less than the difference between individuals subject to classification.165 This requires a mathematically precise notion of how “different” people are, which might be a score of some kind or might naturally arise from the data in question.166 This notion of similarity must also capture all relevant features, including possibly sensitive or protective attributes such as minority status, gender, or medical history. Because this approach requires the collection and explicit use of sensitive attributes, the work describes its definition of fairness as fairness through awareness.167 While the work of Dwork et al. provides only a theoretical framework for building fair classifiers, others have used it to build practical systems that perform almost as well as classifiers that are not modified for fairness. The work of Dwork et al. also provides the theoretical basis for a notion of fair affirmative action, the idea that imposing an external constraint on the number of people from particular subgroups who are given particular classifications should have a minimal impact on the principle that similar people are treated similarly. This provides a technique for forcing a fairness requirement such as statistical parity even when it will not arise naturally from some classifier. A more direct approach to making a machine learning process fair is to modify or select the input data in such a way that the output satisfies some fairness property. 
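Before turning to those data-level approaches, the condition at the heart of fairness through awareness can be made concrete. The Python sketch below assumes that a task-specific similarity metric is available (how to construct one is itself a policy question) and that the classifier reports a probability distribution over outcomes for each individual; it simply flags any pair of individuals whose outcome distributions differ by more than the individuals themselves do.

    import numpy as np
    from itertools import combinations

    def total_variation(p, q):
        """Total variation distance between two outcome distributions."""
        return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

    def fairness_violations(individuals, outcome_dist, similarity):
        """Flag pairs that violate the Dwork et al. condition.

        outcome_dist(x) -> probability vector over outcomes for individual x
        similarity(x, y) -> assumed task-specific distance between individuals
        A pair is flagged when its outcome distributions differ by more than
        the distance between the individuals themselves.
        """
        return [
            (x, y)
            for x, y in combinations(individuals, 2)
            if total_variation(outcome_dist(x), outcome_dist(y)) > similarity(x, y)
        ]

An audit of this kind operates on the classifier's outputs; the data-level approaches work differently, changing what the classifier learns from rather than checking its outcome distributions pair by pair.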
For example, in order to make sure that a classifier does not over-reflect the minority status of some group, we could select extra training samples from that group or duplicate samples we already have. In either case, care must be taken to avoid biasing the training process in some other way or overfitting the model to the nonrepresentative data. Other work focuses on fair representations of data sets. For example, we can take data points and assign them to clusters, or groups of close-together points, treating each cluster as a prototypical example of some portion of the original data 163 In this case, differential treatment based on a protected status attribute improves the performance of the automated decision system in a way that requires that the system know and make use of the value of that attribute. 164 See Dwork et al., supra note 141, at 215 (explaining that fairness can be captured under the principle that “two individuals who are similar with respect to a particular task should be classified similarly”). 165 This is formalized as the proposition that the difference in probability distributions between outcomes for each subgroup of the population being classified is less than the difference between those groups, for a suitable measurement of the difference between groups. For technical reasons, this particular formulation is mathematically convenient, although different bounds might also be useful. For the formal mathematical definition, see Dwork et al., supra note 141, at 216. 166 For example, if the physical location of subjects is a factor in classification, we might naturally use the distance between subjects as one measure of their similarity. 167 Id. at 215. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 50 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -50set. This is the approach taken by Zemel, Wu, Swersky, Pitassi, and Dwork.168 Specifically, Zemel et al. show how to generate such prototypical representations automatically and in a way that guarantees statistical parity for any subgroup in the original data. In particular, the probability that any person in the protected group is mapped to any particular prototype is equal to the probability that any person not from the protected group is mapped to the same prototype.169 Therefore, classification procedures which have access only to the prototypes must necessarily not discriminate, since they cannot tell whether the prototype primarily represents protected or unprotected individuals. Zemel et al. test their model on many realistic data sets, including the Heritage Health Prize data set, and determine that it performs nearly as well as best-of-breed competing methods while ensuring substantial levels of fairness.170 This technique allows for a kind of “fair data disclosure,” in which disclosing only the prototypes allows any sort of analysis, fair or unfair, to be run on the data set to generate fair results. A related approach is to use a technique from machine learning called regularization, which involves introducing new information to make trained models more generalizable in the form of a penalty assigned to undesirable model attributes or behaviors. This approach has also led to many useful modifications to standard tools in the machine learning repertoire, yielding effective and efficient fair classifiers.171 The work of Zemel et al. suggests a related approach, which is also used in practice: the approach of generating fair synthetic data. 
Given any set of data, we can generate new data such that no classifier can tell whether a randomly chosen input was drawn from the real data or the fake data. Furthermore, we can use approaches like that of Zemel et al. to ensure that the new data are at once representative of the original data and also fair for individuals or subgroups. Because synthetic data are randomly generated, they are useful in situations where training a classifier on real data would create privacy concerns. Also, synthetic data can be made public for others to use, although care must be taken to avoid allowing others to infer facts about the underlying real data. Such model inversion attacks172 have been demonstrated in practice, along with other inference or deanonymization attacks that allow sophisticated conclusions without direct access to the actual data that give rise to the conclusions.173 168 Richard Zemel et al., Learning Fair Representations, 28 Proc. 30th Int’l Conf. on Machine Learning 325 (2013). 169 Id. 170 Id. 171 See, e.g., Toshihiro Kamishima et al., Fairness-Aware Learning Through Regularization Approach, 2011 Proc. 3rd IEEE Int’l Workshop on Privacy Aspects of Data Mining 643 (describing a model in which two types of regularizers were adopted to enforce fair classification). 172 See Matthew Fredrikson et al., Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing, 2014 Proc. 23rd USENIX Security Symp. 17 (describing privacy risks in which attackers can predict a patient’s genetic markers if provided with the model and some demographic information). 173 For an overview of these techniques, see Arvind Narayanan & Edward W. Felten, No Silver Bullet: De-identification Still Doesn’t Work (July 9, 2014), DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 51 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -51All of these approaches demonstrate that it is possible to build a wide variety of definitions of fairness into a wide variety of data analysis and classification systems, at least to the extent that a definition of fairness is known or can be approximated in advance. There are no bright-line rules that allow the designer or operator of a machine learning system to guarantee that the system’s behavior is compliant with antidiscrimination laws. Nor do we believe that such rules can or even should exist. It is not for technologists to define an ex ante notion of fairness that applies in all cases or even just for a specific system. Rather, fairness must be determined contextually and often must be reviewed ex post. Regardless, it is certainly not impossible to build fairness into automated decision systems, which shows that unconstrained use of data analysis is not always necessary. Uses of data that do not employ methods to investigate or ensure fairness must account for their decision policies in some other way. Many of these approaches rely on the insufficient notion of group fairness by statistical parity. To the extent that more technical research can help to address the problem of unfairness in big data analysis, it is by expanding the repertoire of definitions of group fairness that can be usefully applied in practice and by providing better exploratory and explanatory tools for comparing different notions of fairness. From a public policy perspective, it would be extremely useful to system designers to have a set of rules, standards, or best practices that explain what notions of fairness should be used in specific real-world applications. 
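Whichever notion is chosen, group-fairness measures of this kind can at least be audited directly from a system's decision records. The short Python sketch below, in which the column names and records are hypothetical, computes the gap in favorable-outcome rates between a protected group and everyone else; a gap near zero indicates statistical parity, though, as discussed above, parity alone says nothing about whether the “right” members of each group were selected.

    import pandas as pd

    def statistical_parity_gap(decisions, outcome_col="favorable", group_col="protected"):
        """Difference in favorable-outcome rates between the protected group and the rest.

        `decisions` is assumed to be a DataFrame with a boolean outcome column and a
        boolean group-membership column (hypothetical names). A result of 0.0 means
        exact statistical parity; the sign shows which group is favored.
        """
        in_group = decisions[decisions[group_col]]
        out_group = decisions[~decisions[group_col]]
        return in_group[outcome_col].mean() - out_group[outcome_col].mean()

    # Made-up records for illustration only.
    records = pd.DataFrame({
        "favorable": [True, False, True, True, False, True],
        "protected": [True, True, True, False, False, False],
    })
    print(statistical_parity_gap(records))  # 0.0 here: 2 of 3 in each group favored

A regulator, a company's compliance staff, or (if decision records were published in a suitably privacy-protective form) an outside researcher could run this kind of check without any access to the model itself.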
A complementary notion to machine learning systems that can guarantee prespecified, formal fairness properties is the work of Rudin on machine learning systems that are interpretable.174 Such systems generate models that can be used to classify individuals, but also explanations for why those classifications were made. These explanations can be reviewed later to understand why the model behaves a certain way, and in some cases how changes in the input data would affect the model’s decision. These explanations can be extremely valuable to experts and oversight authorities, who wish to avoid treating models as black boxes. 3. Discrimination, Data Use, and Privacy A different way to define whether a classification is fair is to say that we cannot tell from the outcome whether the subject was a member of a protected group or not. That is, if an individual’s outcome does not allow us to predict that individual’s attributes any better than we could by guessing them with no information, we can say that outcome was assigned fairly. To see why this is so, observe the contrary: if the fact that an individual was denied a loan from a particular bank tells you that this individual is more likely to live in a certain neighborhood, this implies that you hold a strong belief that the bank denies credit to residents of this neighborhood and hence a strong belief that the bank makes http://randomwalker.info/publications/no-silver-bullet-de-identification.pdf [https://perma.cc/VT2G-7ACG], and Arvind Narayanan et al., A Precautionary Approach to Big Data Privacy (Mar. 19, 2015), http://randomwalker.info/publications/precautionary.pdf [https://perma.cc/FQR3-2MM2]. 174 Cynthia Rudin, Algorithms for Interpretable Machine Learning, 2014 20th ACM SIGKDD Conf. on Knowledge Discovery & Data Mining 1519. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 52 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -52decisions based on factors other than the objective credit risk presented by applicants. Thus, fairness can be seen as a form of an information hiding requirement similar to privacy. If we accept that a fair decision does not allow us to infer the attributes of a decision subject, we are forced to conclude that fairness is protecting the privacy of those attributes. Indeed, it is often the case that people are more concerned that their information is used to make some decision or classify them in some way than they are that the information is known or shared. This concern relates to the famous conception of privacy as the “right to be left alone,” in that generally people are concerned with the idea that disclosure interrupts their enjoyment of an “inviolate personality.”175 Data use concerns also surface in the seminal work of Solove, who refers to concerns about “exclusion” in “information processing,” or the lack of disclosure to and control by the subject of data processing and “distortion” of a subject’s reputation by way of “information dissemination.”176 Solove argues that these problems can be countered by giving subjects knowledge of and control over their own data.177 In this framework, the predictive models of automated systems, which might use seemingly innocuous or natural behaviors as inputs, create anxiety on the part of data subjects. 
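This inferential view of fairness can itself be tested empirically. In the Python sketch below, which uses hypothetical decision records, an auxiliary “attacker” model is trained to predict protected-group membership from decision outcomes alone; if it does no better on held-out records than always guessing the most common group, the outcomes reveal essentially nothing about protected status in the sense just described.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def outcome_leakage(outcomes, protected, seed=0):
        """Compare an attacker that predicts protected status from outcomes
        against the baseline of always guessing the most common group.

        outcomes:  array of decision results (e.g., 1 = loan denied), hypothetical data
        protected: array of 0/1 protected-group labels, hypothetical data
        Returns (attacker_accuracy, baseline_accuracy); values close together
        suggest the outcomes do not allow protected status to be inferred.
        """
        X = np.asarray(outcomes).reshape(-1, 1)
        y = np.asarray(protected)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        attacker = LogisticRegression().fit(X_tr, y_tr)
        baseline = max(y_te.mean(), 1 - y_te.mean())  # accuracy of guessing the majority
        return attacker.score(X_te, y_te), baseline

Checks of this kind examine outcomes after the fact; the anxieties just described, by contrast, concern what a system might do with data in the first place.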
We propose a complementary approach: if a system’s designer can prove to an oversight entity or to each data subject that the sorts of behaviors that cause these anxieties are simply not possible behaviors of the system, the use of these data will be more acceptable. We can draw an analogy between data analysis and classification problems and the more familiar data aggregation and querying problems that are much discussed in the privacy literature. Decisions about an individual represent (potentially private) information about that individual (i.e., one might infer the input data from the decision), and this raises concerns for privacy. In essence, privacy may be at risk from an automated decision that reveals sensitive information just like fairness may be at risk from an automated decision. In this analogy, a vendor or agency using a model to draw automated decisions wants those decisions to be as accurate as possible, corresponding to the idea in privacy that it is the goal of a data analyst to build as complete and accurate a picture of the data subject as is feasible. A naive approach to making a data set private is to delete “personally identifying information” from the data set. This is analogous to the current practice of making data analysis fair by removing protected attributes from the input data. However, both approaches fail to provide their promised protections.178 The failure 175 Samuel D. Warren & Louis D. Brandeis, The Right to Privacy, 4 Harv. L. Rev. 193, 205 (1890). 176 Daniel J. Solove, A Taxonomy of Privacy, 154 U. Pa. L. Rev. 477, 521, 546 (2006). 177 Id. at 546 (detailing privacy statutes that allow individuals to access and correct information that is maintained by government agencies). 178 Reidentification of individuals based on inferences from disparate data sets is a growing and important concern that has spawned a large literature in both Computer Science and Law. See Ohm, supra note 72, 1704 (arguing that developments in computer science demonstrate that “[d]ata can DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 53 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -53in fairness is perhaps less surprising than it is in privacy--discrimination law has known for decades about the problem of proxy encodings of protected attributes and their use for making inferences about protected status that may lead to adverse, discriminatory effects.179 The work of Hardt180 relates the work on fairness by Dwork et al.181 to the work on differential privacy by Dwork.182 As differential privacy is a well-founded notion of protection against inferences and the recovery of an individual identity from “anonymous” data, so are formal fairness properties for automated decision systems sound notions of fairness for individuals and a theoretical framework on which to ground more complicated notions of fairness for protected groups. The many techniques of building fair data analysis and classification systems described above mostly require decisionmakers to have access to protected status information, at least during the design phase of the algorithm. However, in many cases, concerns about misuse, reuse, or abuse of this information have led to a policy regime where decisionmakers are explicitly barred from using such information. The deployment of these technical tools would require a policy change.183 The techniques described above could be used to make such a change less prone to engendering the very real concerns of data abuses that have led to the current regime. C. 
AntiDiscrimination Law and Algorithmic Decisionmaking The goal of Part II-—procedural regularity-—is relatively simple from a legal standpoint. Procedural regularity is a core idea behind due process: the state cannot single out an individual for a different procedure.184 An argument that governance measures ensuring algorithmic procedural regularity are required by due process is more tenuous,185 but an agency that implements such measures will not risk violating a legal requirement. be either useful or perfectly anonymous but never both,” and that such developments should “trigger a sea change” in legal scholarship). 179 For example, the law explicitly forbids the (sole) use of certain attributes that are likely to be highly correlated with protected status categories, as in protections against redlining. See, e.g., 12 C.F.R. § 1002.5(b) (2015) (“A creditor shall not inquire about the race, color, religion, national origin, or sex of an applicant or any other person in connection with a credit transaction”); 12 C.F.R. § 1002.6(b)(9) (2015) (“[A] creditor shall not consider race, color, religion, national origin, or sex (or an applicant’s or other person’s decision not to provide the information) in any aspect of a credit transaction.”). 180 Hardt, supra note 140. 181 Dwork et al., supra note 141. 182 Dwork, supra note 159. 183 One example is the privacy regime created by the Health Insurance Portability and Accountability Act, see supra note 79, which forbids the disclosure of certain types of covered information beyond those for which the data subject was previously given notice and which limits disclosure to covered entities subject to the same restrictions. 184 See, e.g., Arthur S. Miller, An Affirmative Thrust to Due Process of Law?, 30 Geo. Wash. L. Rev. 399, 403 (1962) (“Procedural due process (‘adherence to procedural regularity’), as we have often been told by Supreme Court justices, is the very cornerstone of individual liberties.”). 185 See Citron, supra note 6, at 1278-1300 (arguing that current procedural protections are inadequate for automated decisionmaking). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 54 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -54In contrast, governance of algorithms to promote nondiscrimination runs into the complicated field of antidiscrimination law. Here, the movement toward a colorblind interpretation of equal protection has created friction with the precedents involving disparate impact. We argue that, given the current state of antidiscrimination law, designing for nondiscrimination is important because users of algorithms may be legally barred from revising processes to correct for discrimination after the fact, and technical tools offer solutions to help. 1. Ricci v. DeStefano: The Tensions Between Equal Protection, Disparate Treatment, and Disparate Impact Antidiscrimination law is based upon both the constitutional guarantee of equal protection186 and supplemental statutory protections. 
Modern interpretations of the Equal Protection Clause generally have been divided into two camps: those who believe in a color-blind Constitution-—protecting individualized assessments and eschewing any evaluations based on group status--and those who support antisubordination attempts to remedy inequalities between groups.187 The general trend has been toward colorblindness.188 For statutory measures, we will focus on Title VII of the Civil Rights Act of 1964.189 Under Title VII, remedies are available for disparate treatment-- discriminatory intent or the formal application of different rules to people of different groups--and disparate impact--results that differ for different groups.190 186 See U.S. Const. amend. XIV, § 1 (“No State shall . . . deny to any person within its jurisdiction the equal protection of the laws.”). The Equal Protection Clause has also been interpreted to apply to the federal government through the Due Process Clause of the Fifth Amendment. See, e.g., Kenji Yoshino, The New Equal Protection, 124 Harv. L. Rev. 747, 748 n.10 (2010). 187 See, e.g., Reva B. Siegel, From Colorblindness to Antibalkanization: An Emerging Ground of Decision in Race Equality Cases, 120 Yale L.J. 1278, 1281 (2011) (describing this binary as the common interpretation of equal protection jurisprudence). 188 See The Supreme Court, 2008 Term--Leading Cases, 123 Harv. L. Rev. 153, 289 (2009) (“The Court’s conception of equal protection turns largely on its swing voter, Justice Kennedy, who appears to support a moderate version of the colorblind Constitution.”). But see Reva B. Siegel, The Supreme Court, 2012 Term--Foreword: Equality Divided, 127 Harv. L. Rev. 1, 6 (2013) (agreeing that “[s]hifts in equal protection oversight . . . are continuing to grow” but arguing that these changes are “neither colorblind nor evenhanded” because “the Court has encouraged majority claimants to make discriminatory purpose arguments about civil rights law based on inferences the Roberts Court would flatly deny if minority claimants were bringing discriminatory purpose challenges to the criminal law”). 189 42 U.S.C. §§ 2000e to 2000e-17 (2012). Title VII applies to employment discrimination on the basis of race, national origin, gender, and religion. The disparate impact framework is also used for housing discrimination, and employment, public entity, public accommodation, and telecommunications discrimination for people with disabilities. See Tex. Dep’t of Hous. & Cmty. Affairs v. Inclusive Cmtys. Project, Inc., 135 S. Ct. 2507 (2015) (holding that disparate impact claims are cognizable under the Fair Housing Act); Lopez v. Pac. Mar. Ass’n, 657 F.3d 762 (9th Cir. 2011) (deciding a disparate impact claim brought under the Americans with Disabilities Act). 190 See Richard Primus, The Future of Disparate Impact, 108 Mich. L. Rev. 1341, 1350-51 & n.56 (2010) (describing the evolution of the “disparate impact” and “disparate treatment” terminology, and the types of discrimination they are associated with). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 55 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -55Algorithmic decisionmaking blurs the definitions of disparate treatment and disparate impact and poses a number of open questions.191 Is it disparate treatment when the inputs used are a proxy for membership in a protected class? 
Different rules are effectively applied to different groups in this case, but that difference may have no effect on the outcomes.192 If the people responsible for a decision know that an algorithm behaves in a way that has disparate impact, does that mean that they intend a discriminatory result?193 If an algorithm generates poor outcomes for a group of people, how accurate does the algorithm need to be (and how carefully does the decisionmaker need to test alternative algorithms) before the decisionmaker can escape disparate impact liability because the factors used are job-related?194 If, as noted in subsection III.B.2, knowledge of class membership can be used to improve the fairness of outcomes for members of all classes, should doing so be considered disparate treatment? These doctrines were recently considered in Ricci v. DeStefano, in which the Supreme Court held that “before an employer can engage in intentional discrimination for the asserted purpose of avoiding or remedying an unintentional disparate impact, the employer must have a strong basis in evidence to believe it will be subject to disparate-impact liability if it fails to take the race-conscious, discriminatory action.”195 At issue was the City of New Haven’s test for firefighter promotions; though the tests had been constructed in an attempt to ensure there was no discrimination by race,196 the pass rates for minorities were about half of the pass rate for whites.197 The New Haven Civil Service Board did not certify the results of the test (and validate the promotions) due to concerns about fairness and disparate impact liability for the City.198 Ricci demonstrates the tension between disparate treatment and disparate impact. Facially neutral policies can produce unequal results for protected classes, but remedying that disparate impact would require the state to treat people differently based on class membership, which Ricci forbids. Ricci also hints at the difficulties in squaring the Court’s move towards a colorblind interpretation of the Equal Protection Clause and the doctrine of disparate impact. The holding does not directly address the constitutional issue, but Justice Scalia’s concurrence does note that the “war between disparate impact and equal protection will be waged sooner or later.”199 Both of these doctrinal tensions are of concern to lawmakers and policymakers. 191 See Barocas & Selbst, supra note 8, at 694-714 (noting the ways in which algorithmic data mining techniques can lead to unintentional discrimination against historically prejudiced groups). 192 Id. at 695. 193 Id. at 700. 194 Id. at 707. 195 557 U.S. 55, 585 (2009). 196 Id. at 565. 197 Id. at 586-87. 198 Id. at 579. 199 Id. at 595-96 (Scalia, J., concurring). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 56 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -56- 2. Ricci Impels Designing for NonDiscrimination Although Ricci has generated wide-ranging conversation about equal protection, disparate treatment, and disparate impact, we wish to emphasize its implications for the governance of decision algorithms for processes where nondiscrimination is a goal. The holding in Ricci suggests that we cannot solely rely on auditing for legal reasons in addition to the reasons discussed in Section II.A. If an agency runs an algorithm that has a disparate impact, correcting those results after the fact will trigger the same kind of analysis as New Haven’s rejection of its firefighter test results. 
It is even possible that the Court will “subject some range of disparate impact compliance efforts to strict scrutiny,”200 a high bar that will be difficult to satisfy in most cases. The legal difficulties with correcting discriminatory algorithms ex post make measures to design algorithms for nondiscrimination even more important. The Court in Ricci took no issue with New Haven’s process of designing the tests with an eye towards nondiscrimination, reasoning that “Title VII does not prohibit an employer from considering, before administering a test or practice, how to design that test or practice in order to provide a fair opportunity for all individuals, regardless of their race.”201 However, “once that process has been established and employers have made clear their selection criteria, they may not then invalidate the test results, thus upsetting an employee’s legitimate expectation not to be judged on the basis of race.”202 The uneasy fit of algorithmic decisionmaking into the disparate treatment/disparate impact framework does mean that someone could allege disparate treatment because the design of the algorithm includes inputs that are a proxy for class membership, resulting in a formal application of different rules to different groups of people. However, such a claim would be valid against virtually any system with a significant number of inputs. It seems more likely that courts would reject the formal-rule subset of disparate treatment for algorithmic decisions than that they would hold the majority of algorithmic decisionmaking to be disparate treatment. In the end, incorporating nondiscrimination in the initial design of algorithms is the safest path that decisionmakers can take, and we should encourage the development and deployment of technical tools to aid in that design. IV. Fostering Collaboration Across Computer Science, Law, and Policy In this Part, we consider how the types of technological assurance described in previous Parts relate to mechanisms of oversight in law and public policy. In technical approaches, it is traditional to have a detailed, well-defined specification of the behavior of a system for all types of situations. In lawmaking and the application of public policy, it is normal, and even encouraged, for rules to be left open to interpretation, with details filled by human judgment emerging from disputes in specific cases that are resolved after-the-fact. We offer 200 The Supreme Court, 2008 Term--Leading Cases, supra note 188, at 290. 201 Ricci, 557 U.S. at 585. 202 Id. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 57 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -57recommendations for dealing with this apparent mismatch, arguing for greater collaboration between experts in two different fields. We emphasize that computer scientists cannot assume that the policy process will give them a meaningful, universal, and self-consistent theory of fairness to use as a specification for algorithms. There are structural, political, and jurisprudential reasons why no such theory exists today. Likewise, the policy process would likely not accept such a theory if it were generated by computer scientists. At the same time, lawmakers and policymakers will need to adapt in light of these new technologies. We highlight changes that stem from automated decisionmaking. First, choices made when designing computer systems embed specific policy decisions and values in those systems whether or not they provide for accountability. 
Algorithms can, nevertheless, permit direct accountability to the public or to other third parties, despite the fact that full transparency is neither sufficient nor always necessary for accountability. For both groups, we note that the interplay between these areas will raise new questions and may generate new insights into what the goals of these decisionmaking processes should be. A. Recommendations for Computer Scientists: Design for After-the-Fact Oversight Computer scientists may tend to think of accountability in terms of compliance with a detailed specification set forth before the creation of an algorithm. For example, it is typical for programmers to define bugs based on the specification for a program--anything that differs from the specification is a bug; anything that follows it is a feature.203 This Section is intended to inform computer scientists that no one will remove all the ambiguities and offer them a clear, complete specification. Although lawmakers and policymakers can offer clarifications or other changes to guide the work done by developers,204 drafters may be unable to remove certain ambiguities for political reasons or be unwilling to resolve details to meet flexibility objectives. As such, computer scientists must account for the lack of precision--and the corresponding need for after-the-fact oversight by courts or other reviewers--when designing decisionmaking algorithms. A computer scientist’s mindset can conflict deeply with many sources of authority to which developers may be responsible. Public opinion and social norms are inherently not precisely specified. The corporate requirements to satisfy one’s supervisor (or one’s supervisor’s supervisor) may not be clear. Perhaps most importantly and least intuitively for computer scientists, the operations of U.S. law 203 See, e.g., Michael Dubakov, Visual Specifications, Medium (Oct. 26, 2013), https://medium.com/@mdubakov/visual-specifications-1d57822a485f [https://perma.cc/SE46- 6B2C] (“No specs? No bugs.”); SF, What Is the Difference Between Bug and New Feature in Terms of Segregation of Responsibilities?, StackExchange (July 12, 2011, 6:51), http://programmers.stackexchange.com/questions/92081/what-is-the-difference-between-bug-andnew-feature-in-terms-of-segregation-of-re [https://perma.cc/PPM6-HFAA] (“You could put an artificial barrier: if it’s against specs, it’s a bug. If it requires changing specs . . . it’s a feature.”). 204 See infra Section IV.B. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 58 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -58and public policy also work against clear specifications. These processes often deliberately create ambiguous laws and guidance, leaving details--or sometimes even major concepts--open to interpretation.205 One cause of this ambiguity is the political reality of legislation. 
Legislators may be unable to gather majority support to agree on the details of a proposed law, but may be able to get a majority of votes to pass relatively vague language that leaves various terms and conditions unspecified.206 For example, different legislators may support conflicting specific proposals that can be encompassed by a more general bill.207 Even legislators who do not know precisely what they want may still object to a particular proposed detail; each detail that caused sufficient objections would need to be stripped out of a bill before it could become law.208 Another explanation of ambiguity is that legislators may have uncertainty about the situations to which a law or policy will apply. Drafters may worry that they have not fully considered all of the possibilities. This creates an incentive to build in enough flexibility to cover unexpected circumstances that currently exist or may exist in the future.209 The U.S. Constitution is often held up as a model in this regard: generalized provisions for governance and individual rights continue to be applicable even as the landscape of society changes dramatically.210 Finally, ambiguity may stem from shared uncertainty about how best to solve even a known problem. Here, drafters may feel that they know what situations will arise but still not know how they want to deal with them. They may, in effect, choose to delegate authority to other parties by underspecifying particular aspects of a law or policy. Vagueness supports experimentation to help determine what methods are most effective or desirable.211 205 See, e.g., Marbury v. Madison, 5 U.S. (1 Cranch) 137 (1803) (establishing the practice of judicial review, on which the Constitution was silent); 47 U.S.C. § 222(c)(1) (2012) (requiring a telecommunications carrier to get the “approval of the customer” to use or disclose customer proprietary network information, but neglecting to define “approval”). 206 See Victoria F. Nourse & Jane S. Schacter, The Politics of Legislative Drafting: A Congressional Case Study, 77 N.Y.U. L. Rev. 575, 593 (2002) (“Several staffers thought that pressures of time, and the political imperative to get a bill ‘done,’ bred ambiguity. Indeed, one staffer emphasized that while it was well and good to draft a bill clearly, there was no guarantee that the clear language would be passed by the House or make it through conference.”). 207 Richard L. Hasen, Vote Buying, 88 Calif. L. Rev. 1323, 1339 (2000) (describing the practice of “legislative logrolling”). 208 Id. at . 209 See, e.g., 17 U.S.C. § 1201 (2012) (granting the Copyright Office the power to create exemptions from the statute’s prohibition on anti-circumvention). 210 See David A. Strauss, The Living Constitution (2010). Laws governing law enforcement access to personal electronic records are often cited as a counterexample, with over-specific provisions in the Electronic Communications Privacy Act (18 U.S.C. §§ 2510-2704 (2012)) that fail to account for a shift in technology to a regime where most records reside with third party service providers, not users’ own computers. For a more detailed explanation, see Orin S. Kerr, Applying the Fourth Amendment to the Internet: A General Approach, 62 Stan. L. Rev. 1005 (2010). 211 A similar logic--policy experimentation among the states--is one of the principles underlying federalism. See New State Ice Co. v. Liebmann, 285 U.S. 262, 311 (1932) (Brandeis, J., dissenting) (praising the ability of a state to “serve as a laboratory” for democracy).
DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 59 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -59The United States has a long history of dealing with these ambiguities through after-the-fact and retroactive oversight by the courts.212 In our common law system, ambiguities and uncertainties are left unaddressed until there is a dispute and their resolution becomes necessary. Disagreements about the application of a law or regulation to a specific set of facts are resolved through cases, and the areas of ambiguity are clarified over time by the accretion of many rulings on specific situations.213 Even when statutes and regulations may have specific and detailed language, they are interpreted through cases--with extensive deference often given to the expertise of administrative agencies.214 Those cases form binding precedents, which, in the U.S. common law system, are an additional source of legal authority alongside the statutes themselves.215 The gradual development and extension of law and regulations through cases with specific fact patterns allows for careful consideration of meaning and effects at a level of granularity that is usually impossible to reach during the drafting process.216 In practice, these characteristics imply that computer scientists should focus on creating algorithms that are reviewable, not just compliant with the specifications that are generated in the drafting process.217 For example, this means it would have been good for the Diversity Visa Lottery described in Section II.C to use an algorithm that made fair, random choices and it would be desirable for the State Department to be able to demonstrate that property to a court or a skeptical lottery participant.218 The technical approaches described in this Article219 provide several ways for algorithm designers to ensure that the actual basis for a decision can be verified later. With these tools, reviewers can check whether an algorithm actually was used to make a particular decision,220 whether random inputs were chosen fairly,221 and whether the algorithm comports with certain principles specified at the time of the design.222 Essentially, these technical tools allow continued after-the-fact evaluations of algorithms by allowing for and assisting the judicial system’s 212 See generally E. Allan Farnsworth, An Introduction to the Legal System of the United States (Steve Sheppard ed., 4th ed. 2010). 213 See generally id. 214 See Chevron U.S.A. Inc. v. Nat. Res. Def. Council, Inc., 476 U.S. 837 (1984). 215 See Farnsworth, supra note 213. 216 Id. 217 Another possible conclusion is that certain algorithms should also be developed to be flexible, permitting adaptation as new cases, laws, or regulations add to the initial specifications. The need to adapt algorithms is discussed further in subsection IV.B.1. This also reflects the current insufficiency of building a system in accord with a particular specification, though oversight or enforcement bodies evaluating the decision at a later point in time will still need to be able to certify compliance with any actual specifications. 218 Algorithms offer a new opportunity for decisionmaking processes to be reviewed by nontraditional overseers: decision recipients, members of the public, or even concerned nongovernmental organizations. We discuss this possibility further in subsection IV.B.2. 219 See supra Sections II.B & III.B. 220 See supra Section II.C. 221 See supra notes XX-YY and accompanying text. 222 See supra notes XX-YY and accompanying text. 
DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 60 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -60traditional role in ultimately determining the legality of particular decisionmaking.223 Implementing the approaches described in this Article would improve the accountability of decisionmaking algorithms dramatically, but we see that implementation as only a first step. We encourage research into extensions of these technical tools, as well as new techniques designed to facilitate oversight. B. Recommendations for Lawmakers and Policymakers The other side of the coin is that lawmakers and policymakers need to recognize and adapt to the changes wrought by algorithmic decisionmaking. Characteristics of algorithms offer both new opportunities and new challenges for the development of legal regimes governing decisionmaking: algorithmic decisionmaking can reduce the benefits of ambiguity, increase accountability to the public, and permit greater accountability than was previously possible in cases where aspects of the decision process remain secret. 1. Reduced Benefits of Ambiguity Although computer scientists can build algorithms to permit after-the-fact assessment and accountability, they cannot alter the fact that any algorithm design will encode specific values and involve specific rules. Furthermore, the design of a computer system may limit opportunities for after-the-fact accountability. In other words, if a system is not designed to permit certification of a particular characteristic, an oversight body cannot be certain that it will be able to certify that characteristic. Both of these traits imply that automated decisionmaking can exacerbate certain disadvantages of legal ambiguities. In the framework set forth above,224 we identify key drivers for ambiguity: political stalemate, uncertainty about future circumstances, and desire for policy experimentation. Here, with respect to each of these drivers, we will discuss how the shift to algorithmic decisionmaking diminishes the appeal of ambiguity, and we will suggest ways of retaining the functional benefits that ambiguity provides in the U.S. lawmaking system and ways that are more amenable to automation. Ambiguity stemming from political stalemate essentially passes the buck for determining details from legislators to someone later on in the process. These later actors tend to be more sheltered from political pressures and thus able to make specific decisions without risking their jobs at the next election. Judges and administrative agencies frequently fill this role. Courts are expected to offer impartial decisions resistant to public pressure225 and administrative agencies are 223 Computer scientists model this after-the-fact input as an “oracle” that can be consulted only rarely on the acceptability of the algorithm. See Kroll, supra note 119. 224 Supra Section IV.A. 225 See, e.g., The Federalist No. 78 (Alexander Hamilton) (laying out the philosophy that the judiciary’s role is to secure an “impartial administration of the laws”). However, the rise of elected judges raises questions about this traditional role of the court system. See Stephen J. Choi et al., Professionals or Politicians: The Uncertain Empirical Case for an Elected Rather Than Appointed Judiciary (Univ. of Chi. Law Sch., John M. Olin Law & Economics Working Paper No. 357, 2007) (finding that elected judges behave more like politicians than appointed independent judges).
DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 61 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -61expected to retain staff who offer subject matter expertise beyond what is expected of legislators, despite changes in political administrations.226 However, this transfer of responsibility often works in less than ideal ways when it comes to software systems.227 Fully automated decisionmaking may exacerbate these problems by adding another actor to whom the responsibility can devolve: the developer who programs the decisionmaking software. Citron offers examples of failures in automated systems that determine benefits eligibility, notably the airport “No Fly” lists, terrorist identifications, and punishment for “dead-beat” parents.228 Lawmakers should consider this possibility and avoid giving the responsibility for filling in the details of the law to program developers because (1) the algorithms will apply broadly, affecting all participants; (2) the program developer is unlikely to be held accountable by the current political process; and (3) the program developer is unlikely to have substantive expertise about the political decision being made.229 One potential method for restricting the discretion of developers without requiring specifications in the legislation itself would be for administrative agencies to publish guidance for software development. Difficulties in translating between code choices and policy effects still would exist, but they could be partly eased using the technical methods we have described.230 For example, administrative agencies could work together with developers to identify the properties they want a piece of software to possess, and the program could then be designed to satisfy those properties and permit proof. Ambiguity generated by uncertainty about the situational circumstances or ambiguity motivated by a desire for policy experimentation presents a more difficult concern. Here, the problem raised by automated decisionmaking is that a piece of software locks in a particular interpretation of law for the duration of its use, and, especially in government contexts, provisions to update the software code may not be made. Worries about changing or unexpected circumstances could be assuaged by adding sunset provisions to software systems,231 requiring periodic 226 This is the rationale of the Chevron doctrine of judicial deference to administrative agency actions Chevron U.S.A. Inc. v. Nat. Res. Def. Council, Inc., 476 U.S. 837 (1984). 227 For example, Citron argues that “[d]istortions in policy have been attributed to the fact that programmers lack ‘policy knowledge,’” and that this leads to software that does not reflect policy goals. Citron, supra note 6, at 1261. Ohm also reports on a comment of Felten that “[i]n technology policy debates, lawyers put too much faith in technical solutions, while technologists put too much faith in legal solutions.” Paul Ohm, Breaking Felten’s Third Law: How Not to Fix the Internet, 87 Denv. L. Rev. Online (2010), http://www.denverlawreview.org/how-to- regulate/2010/2/22/breaking-feltens-third-law-how-not-to-fix-the-internet.html [https://perma.cc/6RGQ-KUMW] (internal quotation marks omitted). 228 Citron, supra note 6, at 1256-57. 229 Id. at . A distinction should be drawn here between the responsibilities given to individual developers of particular algorithms and the responsibilities given to computer scientists in general. 
Great gains can be made by improved dialogue between computer scientists and lawmakers and policymakers about how to design algorithms to reach social goals. 230 See supra Sections II.B & III.B. 231 The effectiveness of sunset provisions in leading to actual reconsideration and change is debatable. The inertia of the pre-existing choices can be hard to overcome. See, e.g., Mark A. DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 62 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -62review and reconsideration of the software. Additionally, software should be designed with eventual revisions and updates in mind. As for preserving the benefits of policy experimentation, the traditional solution might be having multiple programs that take multiple approaches deployed simultaneously. A more sophisticated version of this solution is the incorporation of machine learning into decisionmaking systems. Again, machine learning can have its own fairness pitfalls,232 and care should be taken to consider fair machine learning methods233 and to build in precautions like persistent testing of the hypotheses built into the machine learning model.234 More generally, the benefits of ambiguity decrease in the case of algorithmic decisionmaking. Here, an uninformed programming actor may determine the details and then apply them broadly. In addition, the choice of algorithm cements the particular policy choices encoded in that software for as long as it is used. Drafters should instead consider whether they should increase the specificity offered by law and policy governing these algorithms to prevent coders from filling the ambiguity. To a certain extent, this question mirrors the rules versus standards debate about the relative merits of laws that specify actions and their repercussions (for example, a speed limit) and those that espouse a principle open to interpretation (for example, “drive at a speed reasonable for the conditions”).235 Rules give clarity and forewarning, while standards offer greater flexibility for interpretation.236 Here, the question is whether drafters should include additional and clearer specifications for developers. In practice, drafters may wish to incorporate a set of narrow rules within a broad, overarching standard. For example, drafters could include specifications of each of the properties that they want a piece of software to possess and requirements that the developer design that program in a way that renders those properties provable upon review. Additionally, drafters might consider requiring a general statement of purpose for the algorithm. Doing so would give the developer some flexibility in writing the code while also ensuring that particular properties can be checked later. Lemley & David McGowan, Legal Implications of Network Economic Effects, 86 Calif. L. Rev. 479, 481-82 (1998) (noting that stare decisis, confusion regarding the role of theory, differing normative values, and other factors impede the progress of the law). 232 See supra note XX (noting that machine learning programs give predictions but not confidence levels). 233 See supra Sections III.A-B. 234 In other words, even after a machine learning algorithm determines that a particular rule should be used to produce particular results, it always should continue to test inputs that do not follow that rule. See, e.g., Russell & Norvig, supra note 68. 235 See, e.g., Louis Kaplow, Rules versus Standards: An Economic Analysis, 42 Duke L.J. 
557, 562- 66 (1992) (arguing that rules are more costly to promulgate while standards are most costly on individuals). 236 See Kathleen M. Sullivan, The Supreme Court, 1991 Term--Forward: The Justices of Rules and Standards, 106 Harv. L. Rev. 22, 26 (1992) (explaining the rule versus standard choice in terms of force of precedent, constitutional reading, and formulating operative tests). DRAFT PLEASE CITE TO FINAL VERSION 11-19-2016 63 UNIVERSITY OF PENNSYLVANIA LAW REVIEW [Vol. 165: forthcoming] -63- 2. Accountability to the Public Oversight is traditionally performed by courts, enforcement agencies, or other designated entities such as government prosecutors.237 Typically, the public and third parties have an indirect oversight role through the ability to provide political feedback and the ability to bring lawsuits if their specific circumstances allow.238 The use of software can alter how effectively the legal system and the public can oversee the decisionmaking process. In one sense, decisionmaking computer systems can enhance accountability to the public and interested third parties by permitting greater involvement in oversight. The technical tools we describe allow for a more direct form of oversight by these parties. Unlike traditional legal oversight mechanisms that generally require discovery or the gathering of internal evidence, the technical tools may enable verifications by the public and by third parties that are completely independent from the organizations using the software. For example, technologically proficient members of the public or third parties could verify that a particular algorithm was used in a program or that the program has particular properties. In addition, a system could be built to enable participants to check these properties for their own outcomes so that nontechnical users could verify these facts while the system as a whole would be overseen by others--potentially both inside and outside of government--who have the necessary technological expertise. As another example, third parties could be involved in generating fair randomness.239 In contrast to the possibility for enhanced public accountability, the use of software without the reliance on technical tools for oversight, as we have described, can reduce accountability to the public because courts and other policy actors are generally ill-equipped to evaluate software, thereby hampering our traditional scrutiny of decisionmaking. The U.S. court system is designed to protect against wrongful government actions through the power of judicial review.240 Judicial review gives judges the power and responsibility to determine if government actions comply with legal obligations. Similarly, for private actions, the legal system vests judges and regulatory agencies with the authority to determine whether those actions are consistent with legal standards. The use of software systems to make decisions shifts these burdens to external experts or to the organizations creating and deploying the software. Courts and enforcement agencies are no longer able to make a determination as to whether the rules have been properly applied or whether fairness obligations have been met. That determination shifts to the experts evaluating the automated decisionmaking process. One way to address this unintended shift in responsibility is to appoint 237 See Farnsworth, supra note 213. 238 The public can vote political leaders out of office and aggrieved parties can bring lawsuits to seek vindication. 
In contrast to this potential for enhanced public accountability, the use of software without the technical tools for oversight that we have described can reduce accountability to the public, because courts and other policy actors are generally ill-equipped to evaluate software, hampering our traditional scrutiny of decisionmaking. The U.S. court system is designed to protect against wrongful government actions through the power of judicial review.240 Judicial review gives judges the power and responsibility to determine whether government actions comply with legal obligations. Similarly, for private actions, the legal system vests judges and regulatory agencies with the authority to determine whether those actions are consistent with legal standards. The use of software systems to make decisions shifts these burdens to external experts or to the organizations creating and deploying the software. Courts and enforcement agencies are no longer able to determine whether the rules have been properly applied or whether fairness obligations have been met; that determination shifts to the experts evaluating the automated decisionmaking process. One way to address this unintended shift in responsibility is to appoint technical experts as special masters. Courts typically appoint special masters to perform functions on behalf of the court that require special skill or knowledge.241

Another issue that challenges public accountability is the validation of the technical tools we have described. For courts, technical tools cannot be accepted until their integrity and reliability are proven. Courts have long confronted the problem of the admissibility of scientific evidence. There is a rich literature about the standards courts should use to admit expert scientific evidence, and, even now, federal and state standards vary.242 Following many years of debate, the Federal Rules of Evidence now spell out the federal standard for the acceptability of new scientific methods in adversarial proceedings.243 In 1993, the Supreme Court adjusted those standards to take account of factors that include testing, peer review, and publication.244 It took courts years during the 1980s and 1990s to establish and accept the scientific validity of DNA evidence and the methods used to isolate and test DNA,245 and even now there are concerns that some scientific methods (for example, matching DNA based on mixtures of DNA) may be receiving undeserved deference from courts and thus resulting in faulty findings of fact.246 These evidentiary standards address the validation of technical tools used to examine automated decisionmaking, but they still leave open the assurance of those tools’ reliability. Ordinarily, the U.S. legal system relies on the adversarial process to assure the accuracy of findings. This attribute may be preserved by allowing multiple experts to test software-driven processes.

240 See Marbury v. Madison, 5 U.S. (1 Cranch) 137, 177 (1803) (“It is emphatically the province and duty of the judicial department to say what the law is.”).
241 See, e.g., United States v. Microsoft Corp., 147 F.3d 935, 959 n.4 (D.C. Cir. 1998) (noting Larry Lessig’s role as a special master for technical issues in the antitrust case brought against Microsoft).
242 See, e.g., Paul C. Giannelli, The Admissibility of Novel Scientific Evidence: Frye v. United States, a Half-Century Later, 80 Colum. L. Rev. 1197 (1980) (tracing the development of the standards used to admit scientific evidence); Heather G. Hamilton, Note, The Movement from Frye to Daubert: Where Do the States Stand?, 38 Jurimetrics 201 (1998) (emphasizing the lack of uniformity among state approaches).
243 See Fed. R. Evid. 702 (“A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if . . . the expert’s scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue”).
244 See Daubert v. Merrell Dow Pharm., Inc., 509 U.S. 579, 592-95 (1993) (explaining that a judge faced with a proffer of expert scientific testimony must assess whether the testimony’s underlying reasoning is valid and, in doing so, consider whether the technique or theory in question can be tested and whether it has been subjected to peer review and publication).
245 See, e.g., Nat’l Research Council, The Evaluation of Forensic DNA Evidence 166-211 (1996) (discussing the legal implications of the use of forensic DNA testing as well as the procedural and evidentiary rules that affect such use).
246 Logan Koepke, Should Secret Code Help Convict?, Medium (Mar. 24), https://medium.com/equal-future/should-secret-code-help-convict-7c864baffe15#.j9k1cwho0 [https://perma.cc/6LNW-WN6W].
3. Secrets and Accountability

Implementing automated decisionmaking in a socially and politically acceptable way requires progress in our ability to communicate and understand fine-grained partial information about how decisions are reached. Full transparency (disclosing everything) is technically trivial but politically and practically infeasible, and it may not even be useful, as described in Section II.A. Disclosing nothing about the basis for a decision, however, is socially unacceptable, and holding a decision accountable while disclosing little or nothing poses a genuine technical challenge. Lawmakers and policymakers should remember that it is possible to make an algorithm accountable without the evaluator having full access to the algorithm.249

U.S. law and policy often focus on transparency and sometimes even equate oversight with transparency for the overseer.250 As such, accountability without full transparency may seem counterintuitive. However, oversight based on partial information occurs regularly within the legal system. Courts prevent consideration of many types of information for various policy reasons: disclosures of classified information may be prevented or limited to preserve national security;251 juvenile records may be sealed on the theory that mistakes made in one’s youth should not follow a person forever;252 and other evidence is deemed inadmissible for a multitude of reasons, including being unscientific,253 hearsay,254 inflammatory,255 or illegally obtained.256 Thus, the rules of evidence as a whole could be construed as precedent for the idea that optimal oversight does not require full information.

There are strong policy justifications for holding back information in the case of automated decisionmaking. Revealing software source code and input data can expose trade secrets, violate privacy, hamper law enforcement, or lead to gaming of the decisionmaking process.257 The advantage of computer systems is that concealing code and data does not preclude analyzing that code and data. The technical tools we describe give lawmakers and policymakers the ability to keep software programs and their inputs secret while still rendering them accountable. Lawmakers and policymakers can implement these tools in government-run algorithms, such as the DVL, and can incentivize nongovernmental actors to use them, perhaps by mandating their use or by requiring transparency--at least to courts--of code and inputs from those who do not employ such technical tools.
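One way to see how secrecy and accountability can coexist in a system like the DVL is the commit-then-reveal pattern for generating randomness with outside participation, a deliberately simplified version of the fair random choices technique described in Part II. The sketch below is ours and illustrative only: each contributor--the agency and outside observers--publishes a hash of a secret random value before anyone reveals, so that no single party (including the agency) can steer the final seed, and anyone holding the published commitments can verify the draw. All names and parameters are assumptions made for the example.

    # Deliberately simplified commit-then-reveal sketch (Python) for jointly
    # generated lottery randomness: commitments are published first, values are
    # revealed afterwards, and the combined seed is verifiable by anyone.
    import hashlib
    import secrets

    def make_contribution():
        """Each party draws a secret value and publishes only its hash."""
        value = secrets.token_bytes(32)
        return value, hashlib.sha256(value).hexdigest()

    def combine(revealed_values, published_digests):
        """After all commitments are public, verify each reveal and XOR them
        together; the result is unpredictable unless every party colludes."""
        for value, digest in zip(revealed_values, published_digests):
            if hashlib.sha256(value).hexdigest() != digest:
                raise ValueError("a revealed value does not match its commitment")
        seed = bytes(32)
        for value in revealed_values:
            seed = bytes(a ^ b for a, b in zip(seed, value))
        return seed

    # Example with an agency and two outside observers.
    values, digests = zip(*(make_contribution() for _ in range(3)))
    lottery_seed = combine(list(values), list(digests))  # drives the random draw

The committed-and-revealed seed can then be fed into a selection procedure the agency has separately committed to, so that both the rule and the randomness it consumes are fixed before any outcome is known.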
249 See supra note XXX.
250 See, e.g., 5 U.S.C. § 552 (2012) (requiring agencies to make certain information available to the public); 15 U.S.C. § 6803 (2012) (requiring financial institutions to provide annual privacy notices to customers as a transparency measure).
251 See 18 U.S.C. § 798(a) (2012) (providing that the disclosure of classified government information may result in criminal liability).
252 See, e.g., N.Y. Crim. Proc. Law § 720.15 (requiring filing under seal in juvenile proceedings).
253 See Fed. R. Evid. 702 (establishing the court’s discretion to admit scientific evidence).
254 See Fed. R. Evid. 802 (stating that hearsay evidence is inadmissible unless a federal statute, the rules of evidence, or the Supreme Court provides otherwise).
255 See Fed. R. Evid. 403 (providing for the exclusion of relevant evidence for prejudice).
256 See, e.g., 18 U.S.C. § 2515 (2012) (setting an exclusionary rule for evidence obtained through wiretap or interception).
257 See supra Section II.A.