Self-admitted technical debt identification from source code comments and commits using NLP and machine learning

Dissertant

Sabbah, Ahmad

Thesis advisor

Hanani, Abu al-Suud

University

Birzeit University

Faculty

Faculty of Engineering and Technology

Department

Department of Computer Systems Engineering

University Country

Palestine (West Bank)

Degree

Master

Degree Date

2021

English Abstract

Technical Debt (TD) is a metaphor describing immature software that contains issues introduced intentionally by its developers.

This usually happens when developers postpone the optimal code implementation in order to achieve short-term benefits.

TD can also be unintentional, arising from a lack of developer knowledge.

Recently, the term self-admitted technical debt (SATD) was introduced to describe TD that is incurred intentionally.

Developers admit this TD themselves by writing comments in the source code files.

Previous studies have shown that SATD can be detected automatically from source code comments.

Most of these studies have focused on syntactic or pattern analysis of the comments, and few of them have considered sentiment analysis of the comment sentences.

This thesis investigates the effectiveness of Natural Language Processing (NLP), machine learning, and deep learning techniques for automatically identifying SATD from source code comments and commits.

The NLP stage pre-processes the source code comments identified as SATD, then extracts features from them using count-based and word embedding representations.

For the count-based approach, the TF-IDF method was used.
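As an illustration of the count-based representation, the following is a minimal sketch of TF-IDF weighting in plain Python. The abstract does not state which TF-IDF variant or tokenizer the thesis used, so the whitespace tokenizer, the unsmoothed formula tf × log(N / df), and the example comments are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a comment and split it on whitespace (deliberately simple)."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical comments, invented for illustration only.
comments = [
    "todo fix this hack later",
    "todo add unit tests",
    "todo document this method",
]
weights = tf_idf(comments)
```

A term that occurs in every comment ("todo" here) receives a weight of zero, while a term unique to one comment ("hack") is weighted more heavily than one shared by two comments ("this"). Library implementations often apply smoothing, so their values would differ from this sketch.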

For the word embedding approach, features were extracted using pre-trained models trained on large general-purpose vocabularies, such as Word2Vec, GloVe, BERT, and fastText, as well as models specific to software engineering.

In the last step, classical machine learning (ML) algorithms are used, such as Naive Bayes (NB), Random Forest (RF), and Support Vector Machines (SVM).
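To make the classification step concrete, here is a minimal sketch of one of the named classical algorithms, a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is not the thesis implementation: the class name `NaiveBayesText`, the whitespace tokenizer, and the ten training comments are hypothetical and exist only to show the mechanics.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over word counts with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training comments, invented for illustration only.
train = [
    ("todo implement missing feature", "requirement"),
    ("feature not supported yet", "requirement"),
    ("ugly hack refactor this class", "design"),
    ("refactor this workaround", "design"),
    ("fixme bug causes crash", "defect"),
    ("bug when input is null", "defect"),
    ("todo add unit tests", "test"),
    ("tests are missing here", "test"),
    ("todo document this method", "documentation"),
    ("javadoc is outdated", "documentation"),
]
texts, labels = zip(*train)
model = NaiveBayesText().fit(texts, labels)
prediction = model.predict("refactor this ugly workaround")
```

With this toy data, `prediction` comes out as the design class because "refactor", "this", "ugly", and "workaround" all occur in the design examples. A real experiment would train on the annotated datasets and evaluate on held-out data.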

Additionally, state-of-the-art neural network techniques known as deep learning (DL), such as the Convolutional Neural Network (CNN), are used to classify the comments and commits into five classes representing the types of SATD.
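The abstract does not describe the CNN architecture, so the following is only a sketch of the basic building block that text CNNs commonly use: a single filter slid over a sequence of token embeddings, followed by ReLU and max-over-time pooling. The function name `conv1d_max_pool`, the two-dimensional embeddings, and the filter weights are hypothetical values chosen so the arithmetic can be checked by hand.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Slide one filter over a sequence of token vectors, apply ReLU, and
    keep the maximum activation (max-over-time pooling).

    embeddings: list of token vectors, each of length d
    kernel:     list of k weight vectors, each of length d (filter width k)
    """
    k = len(kernel)
    activations = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        # Dot product between the filter and the current window of k tokens.
        total = bias + sum(
            w * x
            for weights, vector in zip(kernel, window)
            for w, x in zip(weights, vector)
        )
        activations.append(max(0.0, total))  # ReLU
    # Sequences shorter than the filter produce no windows.
    return max(activations) if activations else 0.0


# Hypothetical 2-dimensional embeddings for a three-token comment.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical filter spanning two tokens.
kernel = [[1.0, -1.0], [0.5, 0.5]]
feature = conv1d_max_pool(embeddings, kernel)
```

In a full model, many such filters would each contribute one pooled feature, and a final layer would map the resulting feature vector to the five SATD classes. In practice the embeddings would come from a pre-trained model rather than being written by hand.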

The types of SATD include requirement debt, design debt, defect debt, test debt, and documentation debt.

To achieve this, several datasets used in previous studies were combined.

A total of 5082 comments and commits classified as SATD were collected, but they were not labeled according to the five SATD categories considered here, so 1147 comments and 366 commits were manually annotated with one of these five categories.

The resulting dataset, referred to as Adataset, includes a total of 1513 comments and commits classified into the five categories of SATD.

Additionally, another dataset (Mdataset) was used, consisting of 4,071 comments that were manually labeled and made publicly available by the author of [59].

The proposed system can classify the SATD comments and commits into the five categories using classical ML and DL with Adataset, Mdataset, and the combined dataset.

The RF classifier achieved an accuracy of 0.822 with Adataset, 0.820 with Mdataset, and 0.826 with the combined dataset.

With Adataset, the CNN achieved an accuracy of 0.838 with BERT.

With Mdataset, the CNN achieved an accuracy of 0.809 and 0.812 with BERT and Word2Vec, respectively.

Finally, the CNN reached the best accuracy of 0.849 with BERT when using the combined dataset.

As a result, the proposed systems outperform similar studies, which classify SATD into two types (requirement and design), by at least 0.11 when using the Mdataset.

Main Subjects

Information Technology and Computer Science

No. of Pages

139

Table of Contents

Table of contents.

Abstract.

Chapter One : Introduction.

Chapter Two : Background.

Chapter Three : Literature review.

Chapter Four : Research methodology.

Chapter Five : Experimental setup.

Chapter Six : Experiments and results.

Chapter Seven : Conclusion and future work.

References.

American Psychological Association (APA)

Sabbah, Ahmad. (2021). Self-admitted technical debt identification from source code comments and commits using NLP and machine learning. (Master's thesis). Birzeit University, Palestine (West Bank)
https://search.emarefa.net/detail/BIM-1412985

Modern Language Association (MLA)

Sabbah, Ahmad. Self-admitted technical debt identification from source code comments and commits using NLP and machine learning. (Master's thesis). Birzeit University. (2021).
https://search.emarefa.net/detail/BIM-1412985

American Medical Association (AMA)

Sabbah, Ahmad. (2021). Self-admitted technical debt identification from source code comments and commits using NLP and machine learning. (Master's thesis). Birzeit University, Palestine (West Bank)
https://search.emarefa.net/detail/BIM-1412985

Language

English

Data Type

Arab Theses

Record ID

BIM-1412985