AI-driven Software Development Source Code Quality
Petr Kantek

Introduction
- First segment
  - Bit of theory behind code generation models
  - Natural language processing (NLP) tasks
  - Transformers
  - Large Language Models (LLMs)
- Second segment
  - Problem statement
  - Source code quality
  - Tools for code generation
  - Sample of experiments

First Segment

Naturalness of Software 1/2
- We are able to create statistical language models of natural language
- In the 2012 paper "On the naturalness of software", the authors show that software contains patterns similar to those of natural language
  - Although programming languages are potentially complex, only a small fraction of that complexity is used
  - A lot of repetition
  - Clear patterns that can be statistically modelled
  - The cross-entropy of a source code corpus is lower than that of an English corpus
- Thus, code generation can be handled as a standard Natural Language Processing (NLP) task

Naturalness of Software 2/2
- Using an N-gram language model, the authors implemented an Eclipse plugin for code completion
- N-gram: given the local context (the previous tokens), predict the most likely next token
- Worked relatively well for shorter token sequences
- Laid the foundation for using standard NLP tools for code generation instead of language-based rules (for instance, type context)

NLP Tasks for Source Code 1/5

NLP Tasks for Source Code 2/5
- Code Generation
  - Code generation is an automated process of transforming natural language specifications or descriptions into executable source code
  - Natural language specifications can take the form of code-level comments, prompts, documentation, and others
- Code Completion
  - A feature that suggests and automates the insertion of code elements as
    developers type
  - A common feature of integrated development environments (IDEs)
- Code Suggestion
  - Subtask of code generation providing developers with intelligent recommendations of code snippets for code enhancements, optimizations, or alternative implementations

NLP Tasks for Source Code 3/5
- Code Translation
  - Also called transpilation
  - Code translation is the conversion of code from one programming language to equivalent code in another programming language
  - Enables interoperability and adaptation across diverse programming languages, technological environments, and domains
  - Performed by a transpiler
- Code Refinement
  - Improving or optimizing existing source code
- Code Summarization
  - Creating concise and informative summaries of code snippets or entire codebases to facilitate comprehension, documentation, and knowledge transfer
  - Useful for legacy codebases

NLP Tasks for Source Code 4/5
- Defect Detection
  - The identification and analysis of bugs, errors, or imperfections in software code to improve its correctness, reliability, and functionality
  - Can be implemented as a binary classification task, where the input code snippet is categorized as either defective or correct
- Code Repair
  - Automatic or semi-automatic techniques for identifying and fixing issues or errors in source code
  - Additional functionality on top of defect detection
  - Self-healing applications rewrite themselves by prompting AI
- Clone Detection
  - Identifying redundant or similar sections of code within a software project
  - DRY principle

NLP Tasks for Source Code 5/5
- Documentation Translation
  - The translation of software documentation from one language to another
  - Close to common NLP tasks such as machine translation
- NL Code Search
  - Search for relevant code snippets using natural language "queries"
  - Contextual descriptions rather than full-text search
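
As a rough illustration of NL code search, the sketch below ranks a tiny, hypothetical corpus of snippet descriptions against a natural language query. It uses bag-of-words cosine similarity only as a simple lexical stand-in; practical systems would typically compare learned embeddings of the query and the code instead. The corpus `SNIPPETS` and the helpers `bag_of_words`, `cosine`, and `search` are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus: snippet name -> natural language description.
SNIPPETS = {
    "read_json": "load and parse a json file from disk",
    "http_get": "send an http get request and return the response body",
    "sort_by_key": "sort a list of dictionaries by a given key",
}


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, snippets: dict) -> list:
    """Rank snippet names by similarity of their description to the query."""
    q = bag_of_words(query)
    scored = [(name, cosine(q, bag_of_words(desc))) for name, desc in snippets.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in search("parse a json file", SNIPPETS):
        print(f"{name}: {score:.2f}")
```

Swapping `bag_of_words` for an embedding model would keep the same ranking structure while matching on meaning rather than on shared words.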

Transformer Neural Architecture 1/2
- Encoder
  - Generates hidden-state embeddings of semantic/syntactic information
- Decoder
  - Generates output
- Encoder & Decoder
  - Machine translation
  - Natural language description to code
- Not just for text
  - Vision Transformers

Transformer Neural Architecture 2/2
- Tokenization of input
- Input embedding
- Positional encoding
- Multi-head attention
- Feed-forward network
- Layer normalization
- Softmax for output

Transformers Training

Transformers Inference

Attention
- Function
  - Resembles retrieval of information
  - Query: what I am searching for
  - Keys: description of the information available
  - Values: the actual information
- Self-attention
- Scaled dot-product attention
- Multi-head attention

Multi-head Attention

Attention Rollout
- https://alphacode.deepmind.com

Large Language Models (LLMs)
- Generative models
  - Based on the Transformer architecture
  - Mostly using only the decoder part
- Prompting
  - Chatbots
  - Giving instructions to LLMs
- Either a general LLM or one fine-tuned to downstream NLP tasks
  - Source code NLP tasks
- Self-supervised pre-training
  - On vast amounts of text
  - Generate the next word in the sentence

LLMs Fine-tuning
- Instruction-based tuning
  - The model is provided with a user's message and generates a prediction/answer
  - It then tries to minimize the difference between its predictions and the correct answers
- Reinforcement Learning from Human Feedback
  - To maximize helpfulness
  - Minimize harm
  - Avoid
    dangerous topics
  - Based on reinforcement learning (environment, rewards)
  - The model generates multiple predictions and a human ranks them from best to worst
  - Aligns predictions with human preferences

Large Language Models (LLMs)
- OpenAI: GPT-1, GPT-2, GPT-3, GPT-3.5, GPT-4, GPT-5, Codex
  - GPT-1/2 open-source
  - The rest closed-source
- Meta: Llama 1, Llama 2, Code Llama
  - Open-source
- Hugging Face Hub: Falcon (by TII), CodeParrot
  - Open-source
- DeepMind: Chinchilla, AlphaCode
  - Closed-source

Risks of LLMs
- LLMs are known to carry risks and security weaknesses
- Mostly because LLMs are trained on diverse datasets harvested from the public internet
- Bias
  - Propensity to favor or disfavor particular groups or concepts based on the patterns observed in the training data
- Hallucinations
  - Instances where the model generates content that is factually incorrect, fictional, or entirely fabricated
  - Code snippets

Risks of LLMs
- Trustworthiness
  - Reliability and dependability of the information generated
- Racism
  - Generation of content that discriminates against or perpetuates stereotypes about individuals or groups based on their race or ethnicity
- Security
  - Potential vulnerabilities that could be exploited to manipulate or compromise the model or its output
  - Indirect vulnerabilities introduced by freely using whatever code LLMs generate
- Toxicity & Hate Speech
  - Generation of content that is harmful, offensive, or abusive

Second Segment

Problem Statement
- Code generation models can output low-quality code
  - Can contain vulnerabilities
  - Type errors
  - Non-existent libraries or syntax
  - Might break best-practice principles
  - Or might not work
    at all
- Experiment with various LLM-based code generation tools

Software Vulnerabilities
- Software vulnerabilities are weaknesses or flaws in software code that can be exploited by attackers to compromise the security or functionality of a system
  - SQL Injection
  - Cross-site Scripting
  - Authorization Attacks

Common Weakness Enumeration
- A list of common software weaknesses/vulnerabilities
- A top 25 list of the most dangerous weaknesses is published every year
- Hierarchical structure

Source Code Quality
- Static analysis
- Linters
  - Check source code against a set of rules
  - Adherence to style guides
  - Example: Mypy, a static type checker for Python type hints (adherence to the PEP 8 style guide is checked by linters such as pycodestyle)
- Source code metrics
  - Lines of code
  - Average number of methods
  - Cyclomatic complexity
  - …

CodeQL
- Static analyzer from GitHub
- Specific query language
- Queries for CWE detection
- CLI or CI/CD

AI Tools for Code Generation
- GitHub Copilot
  - Built on the OpenAI Codex model
  - VS Code extension
  - Paid license
- TabNine
  - Combination of GPT models
  - No additional information about the architecture
  - VS Code extension
  - Free & paid license
- ChatGPT
  - Web interface
  - Third-party VS Code plugins need an OpenAI API key

Experiments - RQ1
- Which tool from GitHub Copilot, TabNine, and ChatGPT suggests the least vulnerable Python code according to a defined list of 5 CWEs (SQL injection, SSH missing host key, server-side cross-site scripting, …)
- 25 code snippets x 3 lengths (short, medium, longer) per tool

Tool        Snippets  With vulnerabilities  Fraction with vulnerabilities
GH Copilot  75        12                    0.16
TabNine     75        17                    0.22
ChatGPT     75        21                    0.28

Experiments - RQ2
- Which tool from GitHub
  Copilot, TabNine, and ChatGPT suggests code with the fewest Python linting errors
- Checked with Mypy (a static type checker) in strict mode
- 25 code snippets x 3 lengths (short, medium, longer) per tool

Tool        Snippets  With errors  Fraction with errors  Total errors
GH Copilot  75        43           0.57                  155
TabNine     75        42           0.56                  172
ChatGPT     75        40           0.53                  168

Experiments - RQ3
- Which tool from GitHub Copilot, TabNine, and ChatGPT writes the Python docstring that best adheres to docstring conventions (PEP 257), given function signatures
- Checked with the Python linter Pydocstyle
- 25 code snippets x 3 lengths (short, medium, longer) per tool

Tool        Snippets  With errors  Fraction with errors  Total errors
GH Copilot  75        3            0.04                  8
TabNine     75        5            0.06                  6
ChatGPT     75        12           0.16                  27

Sources
- A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," 2012 34th International Conference on Software Engineering (ICSE), Zurich, Switzerland, 2012, pp. 837-847, doi: 10.1109/ICSE.2012.6227135.
- A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems 30, 2017.
- https://cwe.mitre.org
- https://github.com, https://www.tabnine.com, https://chat.openai.com

Thank You
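
To make the checks behind RQ2 and RQ3 more concrete, here is a minimal sketch of a hypothetical function (`mean`, not one of the evaluated snippets) written with full type annotations and a conventionally formatted docstring. Mypy's strict mode reports, among other things, functions defined without type annotations, and Pydocstyle reports missing or malformed docstrings. The commented-out variant shows the kind of code both tools would be expected to flag.

```python
def mean(values: list[float]) -> float:
    """Return the arithmetic mean of the given values.

    Raise ValueError if the list is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


# The same logic without annotations or a docstring; a strict type check
# would be expected to report the untyped definition, and a docstring
# linter the missing docstring:
#
# def mean(values):
#     return sum(values) / len(values)


if __name__ == "__main__":
    print(mean([1.0, 2.0, 3.0]))
```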