Flowchart-Based Cross-Language Source Code Similarity Detection

Joint Authors

Song, Qian
Zhang, Feng
Li, Guofan
Liu, Cong

Source

Scientific Programming

Issue

Vol. 2020, Issue 2020 (31 Dec. 2020), pp.1-15, 15 p.

Publisher

Hindawi Publishing Corporation

Publication Date

2020-12-18

Country of Publication

Egypt

No. of Pages

15

Main Subjects

Mathematics

Abstract EN

Source code similarity detection has various applications in code plagiarism detection and software intellectual property protection.

In programming courses, students may convert source code written in one programming language into another language and submit it as their own assignment.

Existing similarity measures for source code written in the same language are not applicable to cross-language code similarity detection because of the syntactic differences among programming languages.

Meanwhile, existing cross-language source code similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing control structures with equivalent ones and adding redundant statements.

To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts.

In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFCs), and their similarity is obtained by comparing the corresponding SCFCs.

More specifically, we first introduce the standardized code flowchart (SCFC) model as the uniform flowchart representation of source code written in different languages.

SCFC is language-independent, and therefore, it can be used as the intermediate structure for source code similarity detection.

Meanwhile, transformation techniques are given to transform source code written in a specific programming language into an SCFC.
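
The idea of a language-independent intermediate flowchart can be illustrated with a minimal sketch. The abstract does not specify the SCFC node types or the transformation rules, so everything below is an assumption made for illustration: the `Flowchart` class, the node kinds ("start", "process", "decision", "end"), and the two hypothetical front ends `from_c_for` and `from_python_while`. The sketch only shows how a C-style `for` loop and a Python `while` loop could map to the same graph shape once statement text is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Flowchart:
    """Toy flowchart: node kinds by index, plus directed edges (hypothetical model)."""
    kinds: list = field(default_factory=list)   # node index -> kind
    edges: set = field(default_factory=set)     # (source index, target index)

    def add(self, kind):
        self.kinds.append(kind)
        return len(self.kinds) - 1


def loop_flowchart(n_init, n_body):
    """Build start -> init statements -> decision -> body statements -> back to decision -> end."""
    fc = Flowchart()
    prev = fc.add("start")
    for _ in range(n_init):
        node = fc.add("process")
        fc.edges.add((prev, node))
        prev = node
    decision = fc.add("decision")
    fc.edges.add((prev, decision))
    prev = decision
    for _ in range(n_body):
        node = fc.add("process")
        fc.edges.add((prev, node))
        prev = node
    fc.edges.add((prev, decision))       # back edge closing the loop
    end = fc.add("end")
    fc.edges.add((decision, end))        # exit branch of the decision
    return fc


def from_c_for(init, cond, step, body):
    """Hypothetical front end for `for (init; cond; step) { body }`; the step runs after the body."""
    return loop_flowchart(1, len(body) + 1)


def from_python_while(init_stmts, cond, body):
    """Hypothetical front end for init statements followed by `while cond: body`."""
    return loop_flowchart(len(init_stmts), len(body))


c_chart = from_c_for("int i = 0", "i < n", "i++", ["sum += i"])
py_chart = from_python_while(["i = 0"], "i < n", ["total += i", "i += 1"])

# Only the structure is compared; statement text and syntax are discarded.
same_shape = c_chart.kinds == py_chart.kinds and c_chart.edges == py_chart.edges
print(same_shape)
```

A real transformation would need a parser per source language and rules for every control structure; this sketch hard-codes a single loop pattern.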

Second, we propose the SCFC-SPGK algorithm based on the shortest path graph kernel to measure the similarity between two SCFCs.

Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between SCFCs.
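
As a rough illustration of the shortest path graph kernel idea, the sketch below compares two small labeled directed graphs. It is not the SCFC-SPGK algorithm from the article, whose details are not given in the abstract. It is a generic shortest path kernel that assumes unit edge weights, exact-match comparison of node labels and path lengths, and cosine-style normalization. The function names and the example graphs are hypothetical.

```python
from collections import Counter
from itertools import product
from math import inf, sqrt


def shortest_paths(n, edges):
    """All-pairs shortest path lengths (Floyd-Warshall) with unit edge weights."""
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        if u != v:
            dist[u][v] = 1
    # product() varies the first index slowest, so k is the outer loop.
    for k, i, j in product(range(n), repeat=3):
        if dist[i][k] + dist[k][j] < dist[i][j]:
            dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def path_features(labels, edges):
    """Count (source label, target label, shortest path length) over reachable node pairs."""
    dist = shortest_paths(len(labels), edges)
    feats = Counter()
    for i, j in product(range(len(labels)), repeat=2):
        if i != j and dist[i][j] < inf:
            feats[(labels[i], labels[j], dist[i][j])] += 1
    return feats


def kernel(fa, fb):
    """Number of matching shortest-path pairs between the two graphs."""
    return sum(count * fb[key] for key, count in fa.items())


def similarity(ga, gb):
    """Normalized kernel value in [0, 1]; 1 means identical feature counts."""
    fa, fb = path_features(*ga), path_features(*gb)
    denom = sqrt(kernel(fa, fa) * kernel(fb, fb))
    return kernel(fa, fb) / denom if denom else 0.0


# Hypothetical flowcharts: a counting loop and a straight-line program.
loop_chart = (
    ["start", "process", "decision", "process", "process", "end"],
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 2), (2, 5)],
)
linear_chart = (
    ["start", "process", "process", "end"],
    [(0, 1), (1, 2), (2, 3)],
)

print(similarity(loop_chart, loop_chart))
print(similarity(loop_chart, linear_chart))
```

Because the features depend only on node labels and path lengths, renaming variables or reordering node indices does not change the score in this sketch. How the published approach handles redundant statements or replaced control structures is not something this example attempts to reproduce.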

Experimental results show that compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection.

Furthermore, CLCSD can not only handle the common source code obfuscation techniques used by students in programming courses but also achieve nearly 90% accuracy in dealing with some complex obfuscation techniques.

American Psychological Association (APA)

Zhang, Feng & Li, Guofan & Liu, Cong & Song, Qian. 2020. Flowchart-Based Cross-Language Source Code Similarity Detection. Scientific Programming, Vol. 2020, no. 2020, pp.1-15.
https://search.emarefa.net/detail/BIM-1209170

Modern Language Association (MLA)

Zhang, Feng…[et al.]. Flowchart-Based Cross-Language Source Code Similarity Detection. Scientific Programming No. 2020 (2020), pp.1-15.
https://search.emarefa.net/detail/BIM-1209170

American Medical Association (AMA)

Zhang, Feng & Li, Guofan & Liu, Cong & Song, Qian. Flowchart-Based Cross-Language Source Code Similarity Detection. Scientific Programming. 2020. Vol. 2020, no. 2020, pp.1-15.
https://search.emarefa.net/detail/BIM-1209170

Data Type

Journal Articles

Language

English

Notes

Includes bibliographical references

Record ID

BIM-1209170