AI-driven Software Development
Source Code Quality
Petr Kantek
Introduction
̶ First segment
̶ Bit of theory behind code generation models
̶ Natural language processing (NLP) tasks
̶ Transformers
̶ Large Language Models (LLMs)
̶ Second segment
̶ Problem statement
̶ Source code quality
̶ Tools for code generation
̶ Sample of experiments
First Segment
Naturalness of Software 1/2
̶ We are able to create statistical language models of natural
language
̶ In the 2012 paper "On the naturalness of software", the authors
show that software contains patterns similar to those of natural language
̶ Although programming languages are potentially complex, only a small fraction of that complexity is used in practice
̶ A lot of repetitions
̶ Clear patterns that can be statistically modelled
̶ Cross-entropy of a source code corpus is lower than that of an English corpus
̶ And thus, code generation could be handled as a standard Natural
Language Processing (NLP) task
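The cross-entropy comparison can be illustrated with a minimal sketch. It assumes an add-one smoothed bigram model and two invented toy token sequences (one repetitive, one with no repetition); it is not the experimental setup of the cited paper, only a demonstration that repetitive token streams are easier to predict.

```python
import math
from collections import Counter

def bigram_cross_entropy(train, test):
    """Average bits per token of `test` under an add-one smoothed
    bigram model estimated from `train`."""
    vocab = set(train) | set(test)
    context_counts = Counter(train[:-1])
    bigram_counts = Counter(zip(train, train[1:]))
    pairs = list(zip(test, test[1:]))
    total_bits = 0.0
    for prev, cur in pairs:
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
        total_bits -= math.log2(p)
    return total_bits / len(pairs)

# Hypothetical toy data: a repeated code-like pattern vs. 32 distinct tokens.
repetitive = "for i in range ( n ) :".split() * 4
varied = [f"w{i}" for i in range(32)]

half = len(repetitive) // 2
ce_repetitive = bigram_cross_entropy(repetitive[:half], repetitive[half:])
ce_varied = bigram_cross_entropy(varied[:half], varied[half:])
print(ce_repetitive < ce_varied)
```

The lower value for the repetitive sequence mirrors the observation above: the more predictable the next token, the lower the cross-entropy.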
Naturalness of Software 2/2
̶ Using an N-gram language model, the authors implemented an Eclipse
plugin for code completion
̶ N-gram = given the local context (the previous N-1 tokens), predict the most
likely next token
̶ Worked relatively well for shorter token sequences
̶ Laid the foundation for using standard NLP tools for code generation,
instead of language-based rules (for instance, type context)
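The N-gram idea can be sketched in a few lines. This is a simplified illustration, not the plugin from the paper: the function names `train_ngram` and `suggest` and the toy corpus are assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=3):
    """Count which token follows each context of n-1 tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def suggest(model, context):
    """Return the most frequent next token for the context, or None if unseen."""
    counts = model.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical training corpus of already tokenized code.
corpus = (
    "for i in range ( 10 ) : print ( i ) "
    "for j in range ( 5 ) : print ( j )"
).split()
model = train_ngram(corpus, n=3)

print(suggest(model, ["in", "range"]))   # context seen during training
print(suggest(model, ["foo", "bar"]))    # unseen context, no suggestion
```

Because the model only looks at the previous two tokens here, it cannot use longer-range information such as variable types, which is why such models work best for short sequences.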
NLP Tasks for Source Code 1/5
NLP Tasks for Source Code 2/5
̶ Code Generation
̶ Code generation is an automated process of transforming natural
language specifications or descriptions into executable source code
̶ Natural language specifications can take the form of code-level comments, prompts,
documentation, and other sources
̶ Code Completion
̶ A feature that suggests and automates the insertion of code elements as developers type
̶ A common feature of integrated development environments (IDEs)
̶ Code Suggestion
̶ Subtask of code generation providing developers with intelligent
recommendations of code snippets for code enhancements, optimizations, or alternative
implementations
NLP Tasks for Source Code 3/5
̶ Code Translation
̶ Also called transpilation
̶ Code translation is the conversion of code from one programming language to equivalent
code in another programming language
̶ Enables interoperability and adaptation across diverse programming languages,
technological environments, and domains
̶ Transpiler
̶ Code Refinement
̶ Improving or optimizing existing source code
̶ Code Summarization
̶ Creating concise and informative summaries of code snippets or entire codebases to
facilitate comprehension, documentation, and knowledge transfer
̶ Useful for legacy codebases
NLP Tasks for Source Code 4/5
̶ Defect Detection
̶ The identification and analysis of bugs, errors, or imperfections in software code to improve
its correctness, reliability, and functionality
̶ Can be implemented as a binary classification task, where the input code snippet is
categorized either as defective or correct
̶ Code Repair
̶ Automatic or semi-automatic techniques for identifying and fixing issues or errors in source
code
̶ Additional functionality on top of defect detection
̶ Self-healing applications rewrite themselves by prompting AI
̶ Clone Detection
̶ Identifying redundant or similar sections of code within a software project
̶ DRY principle
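Clone detection can be approximated without any learned model. The sketch below assumes a simple token-based approach: identifiers and literals are replaced by placeholders so that renamed copies compare as equal. The helper names and the sample snippets are invented for illustration; learned approaches compare code embeddings instead.

```python
import difflib
import io
import keyword
import tokenize

def normalized_tokens(source):
    """Tokenize Python source, replacing identifiers and literals with placeholders."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")
        elif tok.type in (tokenize.OP, tokenize.NAME):
            result.append(tok.string)  # operators and keywords are kept as-is
    return result

def similarity(a, b):
    """Similarity in [0, 1] between the normalized token sequences."""
    return difflib.SequenceMatcher(None, normalized_tokens(a), normalized_tokens(b)).ratio()

# Hypothetical snippets: the first two differ only in naming.
add_v1 = "def add(a, b):\n    return a + b\n"
add_v2 = "def plus(x, y):\n    return x + y\n"
greet = "def greet(name):\n    print('hi', name)\n"

print(similarity(add_v1, add_v2))
print(similarity(add_v1, greet))
```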
NLP Tasks for Source Code 5/5
̶ Documentation Translation
̶ The translation of software documentation from one language to another
̶ Close to common NLP tasks such as machine translation
̶ NL Code Search
̶ Search for relevant code snippets using natural language "queries"
̶ Contextual descriptions rather than full-text search
Transformer Neural Architecture 1/2
̶ Encoder
̶ Generates a hidden state (embeddings of semantic/syntactic information)
̶ Decoder
̶ Generates output
̶ Encoder & Decoder
̶ Machine translation
̶ Natural language description to code
̶ Not just for text
̶ Vision Transformers
Transformer Neural Architecture 2/2
̶ Tokenization of input
̶ Input embedding
̶ Positional encoding
̶ Multi-Head attention
̶ Feed forward network
̶ Layer normalization
̶ Softmax for output
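Two of the building blocks listed above can be written out directly. The sketch assumes the sinusoidal positional encoding from the cited "Attention is all you need" paper and a layer normalization without learned scale and shift; the embedding values are made up, and normalization is applied right after the encoding only to show the operation.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position (d_model is assumed to be even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

embedding = [0.5, -1.0, 2.0, 0.0]  # hypothetical input embedding of one token
encoded = [e + p for e, p in zip(embedding, positional_encoding(3, len(embedding)))]
normalized = layer_norm(encoded)
print(normalized)
```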
Transformers Training
Transformers Inference
Attention
̶ Function resembling information retrieval
̶ Query: what I am searching for
̶ Keys: descriptions of the information available
̶ Values: the actual information
̶ Self-attention
̶ Scaled dot product attention
̶ Multi-head attention
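The query/key/value analogy maps onto scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as defined in the cited Vaswani et al. paper. The following is a minimal single-head sketch in plain Python lists; the toy vectors are invented, and real implementations use batched matrix operations.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors."""
    d_k = len(keys[0])
    output = []
    for q in queries:
        # How well does each key match what the query is searching for?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the actual information (the values).
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output

# Hypothetical toy input: both keys match the query equally well,
# so the result is the plain average of the two values.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(result)
```

In self-attention, queries, keys, and values are all derived from the same input sequence; multi-head attention runs several such functions in parallel on different learned projections.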
Multi-head Attention
Attention Rollout
̶ https://alphacode.deepmind.com
Large Language Models (LLMs)
̶ Generative models
̶ Based on Transformer architecture
̶ Mostly using only decoder part
̶ Prompting
̶ Chatbots
̶ Giving instructions to LLMs
̶ Either general LLM or fine-tuned to downstream NLP tasks
̶ Source code NLP tasks
̶ Self-supervised pre-training
̶ On vast amounts of text
̶ Generate the next word in the sentence
LLMs Fine-tuning
̶ Instruction-based tuning
̶ The model is given the user's message and generates a prediction/answer
̶ Training then minimizes the difference between predictions and correct answers
̶ Reinforcement Learning from Human Feedback
̶ To maximize helpfulness
̶ Minimize harm
̶ Avoid dangerous topics
̶ Based on Reinforcement learning – environment, rewards
̶ Model generates multiple predictions and human ranks them from best to worst
̶ Aligns predictions with human preferences
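The ranking step can be turned into a training signal in several ways. The sketch below assumes one common formulation that is not stated on the slide: a human ranking is split into (preferred, rejected) pairs, and a reward model is penalized with -log(sigmoid(r_preferred - r_rejected)). The answer labels and reward values are hypothetical.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(difference)); small when the preferred answer scores higher."""
    return math.log1p(math.exp(-(reward_preferred - reward_rejected)))

# Hypothetical example: three model answers ranked by a human, best first.
pairs = ranking_to_pairs(["answer_a", "answer_b", "answer_c"])
print(pairs)

# Hypothetical reward-model scores for one pair.
print(pairwise_loss(2.0, 0.0))  # reward model agrees with the human
print(pairwise_loss(0.0, 2.0))  # reward model disagrees, larger loss
```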
Large Language Models (LLMs)
̶ OpenAI: GPT-1, GPT-2, GPT-3, GPT-3.5, GPT-4, GPT-5, Codex
̶ GPT-1/2 open-source
̶ The rest closed-source
̶ Meta: Llama 1, Llama 2, Code Llama
̶ Open-source
̶ Hugging Face: Falcon (developed by TII, distributed via Hugging Face), CodeParrot
̶ Open-source
̶ DeepMind: Chinchilla, AlphaCode
̶ Closed-source
Risks of LLMs
̶ LLMs carry known risk and security problems
̶ Mostly because they are trained on diverse datasets
harvested from the public internet
̶ Bias
̶ Propensity to favor or disfavor particular groups or concepts based on the patterns
observed in the training data
̶ Hallucinations
̶ Instances where the model generates content that is factually incorrect, fictional,
or entirely fabricated
̶ Code snippets
Risks of LLMs
̶ Trustworthiness
̶ Reliability and dependability of the information generated
̶ Racism
̶ Generation of content that discriminates against or perpetuates stereotypes about
individuals or groups based on their race or ethnicity
̶ Security
̶ Potential vulnerabilities that could be exploited to manipulate or compromise the model or
its output
̶ Indirect vulnerabilities by freely using whatever code LLMs generate
̶ Toxicity & Hate Speech
̶ Generation of content that is harmful, offensive, or abusive
Second Segment
Problem Statement
̶ Code generation models can output low-quality code
̶ Can contain vulnerabilities
̶ Type errors
̶ Non-existent libraries or syntax
̶ Might violate best practices
̶ Or might not work at all
̶ Experiment with various LLM-based code generation tools
Software Vulnerabilities
̶ Software vulnerabilities are weaknesses or flaws in software code
that can be exploited by attackers to compromise the security or
functionality of a system
̶ SQL Injection
̶ Cross-site Scripting
̶ Authorization Attacks
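SQL injection is the easiest of these to demonstrate. The sketch below uses an in-memory SQLite database; the table, the data, and the function names are invented for this example. It contrasts string concatenation with a parameterized query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

def find_user_unsafe(name):
    # Vulnerable: user input is concatenated into the SQL string.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    # Safer: the value is bound as a parameter and never parsed as SQL.
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "nobody' OR '1'='1"
print(find_user_unsafe(payload))  # the injected condition matches every row
print(find_user_safe(payload))    # no user has this literal name
```

Generated code that resembles `find_user_unsafe` is the kind of pattern a static analyzer can flag.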
Common Weakness Enumeration
̶ A list of the most common software weaknesses/vulnerabilities
̶ A top 25 of the most dangerous weaknesses is published every year
̶ Hierarchical structure
Source Code Quality
̶ Static analysis
̶ Linters
̶ Check source code against a set of rules
̶ Adherence to style guides
̶ In Python: style linters check against PEP 8; Mypy checks static types
̶ Source code metrics
̶ Lines of code
̶ Average number of methods
̶ Cyclomatic complexity
̶ …
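Two of the metrics above can be computed with the standard library alone. The sketch approximates cyclomatic complexity as one plus the number of decision points found in the syntax tree; this is a simplification (for example, a boolean expression counts once regardless of its number of operands), and the sample function is invented.

```python
import ast

# Node types treated as decision points in this simplified count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.BoolOp)

def cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

def lines_of_code(source):
    """Count non-blank lines."""
    return sum(1 for line in source.splitlines() if line.strip())

# Hypothetical sample with two `if` statements and one `for` loop.
sample = """
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0:
            return "composite"
    return "other"
"""

print(cyclomatic_complexity(sample))
print(lines_of_code(sample))
```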
CodeQL
̶ Static analyzer from GitHub
̶ Specific query language
̶ Queries for CWE detection
̶ CLI or CI/CD
AI Tools for Code Generation
̶ GitHub Copilot
̶ Built on OpenAI Codex model
̶ VS Code extension
̶ Paid license
̶ TabNine
̶ Combination of GPT models
̶ No additional information about architecture
̶ VS Code extension
̶ Free & Paid license
̶ ChatGPT
̶ Web interface
̶ Third-party VS Code plugins need an OpenAI API key
Experiments – RQ1
̶ Which of GitHub Copilot, TabNine, and ChatGPT suggests the least
vulnerable Python code with respect to a defined list of 5 CWEs
(SQL injection, missing SSH host key validation, server-side
cross-site scripting, …)?
̶ 25 code snippets x 3 lengths (short, medium, longer) per tool
Tool         Number of snippets   # Containing vulnerabilities   Fraction containing vulnerabilities
GH Copilot   75                   12                             0.16
TabNine      75                   17                             0.22
ChatGPT      75                   21                             0.28
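The last column follows from the counts: each tool produced 25 x 3 = 75 snippets, and the share is the number of vulnerable snippets divided by 75. The sketch below recomputes it as an integer percentage; that the reported values are truncated rather than rounded is an assumption inferred from the TabNine row (17/75 is about 22.7%).

```python
SNIPPETS_PER_TOOL = 25 * 3  # 25 code snippets x 3 lengths

# Counts of snippets containing vulnerabilities, taken from the table above.
vulnerable = {"GH Copilot": 12, "TabNine": 17, "ChatGPT": 21}

def share_in_percent(count, total=SNIPPETS_PER_TOOL):
    """Integer percentage, truncated rather than rounded."""
    return count * 100 // total

shares = {tool: share_in_percent(n) for tool, n in vulnerable.items()}
print(shares)
```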
Experiments – RQ2
̶ Which of GitHub Copilot, TabNine, and ChatGPT suggests code
with the fewest Python linting (type-checking) errors?
̶ Python static type checker Mypy in strict mode
̶ 25 code snippets x 3 lengths (short, medium, longer) per tool
Tool         Number of snippets   # Containing errors   Fraction containing errors   # Total errors
GH Copilot   75                   43                    0.57                         155
TabNine      75                   42                    0.56                         172
ChatGPT      75                   40                    0.53                         168
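To make the kind of finding concrete: among other checks, Mypy's strict mode rejects function definitions without type annotations. The two functions below are invented for illustration and are not taken from the experiment; the first is expected to be reported by `mypy --strict`, the second is fully annotated.

```python
def mean_untyped(values):  # expected to be flagged: missing type annotations
    return sum(values) / len(values)

def mean_typed(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

print(mean_typed([1.0, 2.0, 3.0]))
```

Both functions behave identically at runtime; the difference is only visible to the static checker, which is why such errors can go unnoticed in generated code.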
Experiments – RQ3
̶ Which of GitHub Copilot, TabNine, and ChatGPT writes Python docstrings
that adhere most closely to the PEP 257 docstring conventions (referenced
by PEP 8), given function signatures?
̶ Python linter Pydocstyle
̶ 25 code snippets x 3 lengths (short, medium, longer) per tool
Tool         Number of snippets   # Containing errors   Fraction containing errors   # Total errors
GH Copilot   75                   3                     0.04                         8
TabNine      75                   5                     0.06                         6
ChatGPT      75                   12                    0.16                         27
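For orientation, a docstring in the style that docstring linters look for has a one-line summary in the imperative mood ending with a period, followed by a blank line before any further description. The function below is a made-up example of that shape, not one of the snippets from the experiment.

```python
def area(width: float, height: float) -> float:
    """Return the area of a rectangle.

    Both arguments are expected in the same unit of length.
    """
    return width * height

summary_line = area.__doc__.splitlines()[0]
print(summary_line)
```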
Sources
̶ A. Hindle, E. T. Barr, Z. Su, M. Gabel and P. Devanbu, "On the naturalness of software," 2012 34th International Conference on
Software Engineering (ICSE), Zurich, Switzerland, 2012, pp. 837-847, doi: 10.1109/ICSE.2012.6227135.
̶ A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems 30, 2017.
̶ https://cwe.mitre.org
̶ https://github.com, https://www.tabnine.com, https://chat.openai.com
Thank You