Analysis and semi automated detection of design level similarity patterns in software


HAMID ABDUL BASIT

(B.S. Engg., GIK Institute of Engineering Sciences & Technology, Pakistan)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF COMPUTER SCIENCE

NATIONAL UNIVERSITY OF SINGAPORE

2006

Acknowledgements

First of all, I am thankful to Allah, the most Magnificent, for His countless blessings that He continues to shower upon me.

I am deeply indebted to my PhD supervisor, Dr Stanislaw Jarzabek, for all the help that he rendered during the course of this thesis: the guidance, the insight, the continuous encouragement, and the trust.

This work would not have been possible without expert guidance from Prof Bill Smyth and timely help from Simon Puglisi, for which I am truly grateful.

Many thanks are due to the thesis advisory committee members, Dr Jin Song Dong and Dr Irene Woon, for their useful feedback during the course of this project.

I also owe special thanks to my colleague Damith Chatura Rajapakse for his wonderful company, help and feedback on my work.

I am also very grateful to the HYP and UROP students whom I supervised, Melvin Low Jen Ku, Goh Kwan Kee, Chan Jun Liang, and Zhang Yali, for their hard work and invaluable contribution to this project.

Finally, I am thankful to all my family members and especially my wife, Sidra, for being with me through thick and thin.

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
SUMMARY
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
1.1 OPEN CHALLENGES
1.2 THE GOALS, SCOPE AND CONTRIBUTIONS OF THIS THESIS
1.3 OUTLINE OF THE THESIS

CHAPTER 2 CLONING – OVERVIEW AND RELATED WORK
2.1 TYPES OF SIMPLE CLONES
2.2 REASONS FOR CLONES
2.3 NEGATIVE IMPACT OF CLONES
2.4 CLONE DETECTION
2.4.1 Program Representation
2.4.2 Generality
2.4.3 Granularity of Detected Clones
2.5 CLONE MANAGEMENT
2.5.1 Preventive Clone Management
2.5.2 Corrective Clone Management
2.5.3 Compensatory Clone Management
2.6 SIMPLE CLONE TAXONOMIES
2.7 HIGHER LEVEL CLONES AND DESIGN RECOVERY
2.8 CONCLUSIONS

CHAPTER 3 STRUCTURAL CLONES – HIGHER LEVEL SIMILARITIES IN PROGRAMS
3.1 INTRODUCTION AND MOTIVATION
3.2 FROM SIMPLE CLONES TO STRUCTURAL CLONES
3.2.1 Clones
3.2.2 Program Structures
3.2.3 Structure Hierarchies
3.2.4 Structural Clones
3.3 EXAMPLES OF STRUCTURAL CLONES
3.3.1 Acknowledgement
3.3.2 A File-Level Structural Clone
3.3.3 A Module-Level Structural Clone
3.3.4 Multiple Structural Clones in the Same File
3.3.5 Crosscutting Structural Clones
3.3.6 Heterogeneous Entity Structural Clones
3.3.7 Structural Clones Based on Inheritance Hierarchy
3.3.8 Structural Clone Spanning Multiple Layers
3.4 TOWARDS CLASSIFICATION OF STRUCTURAL CLONES
3.5 CONCLUSIONS

CHAPTER 4 EFFICIENT TOKEN-BASED DETECTION OF SIMPLE CLONES
4.1 ACKNOWLEDGEMENTS
4.2 INTRODUCTION
4.3 FLEXIBLE TOKENIZATION
4.3.1 Tokenization Example
4.4 E C D
4.4.1 Basic Repeat Finding Algorithm
4.5 CURBING FALSE POSITIVES
4.6 CONCLUSION

CHAPTER 5 DETECTING STRUCTURAL CLONES WITH DATA MINING
5.1 SCOPE OF THE TECHNIQUE
5.2 RE-ORGANIZING THE DATA
5.3 FINDING RECURRING PATTERNS OF SIMPLE CLONE CLASSES
5.4 CLUSTERING HIGHLY CLONED FILES
5.5 RAISING THE ABSTRACTION – ANALYZING DIRECTORIES
5.6 METHOD LEVEL ANALYSIS
5.7 CONCLUSION

CHAPTER 6 TOOL IMPLEMENTATION
6.1 TOOL IMPLEMENTATION
6.2 OUTPUT FORMAT
6.3 PERFORMANCE OF SIMPLE CLONE DETECTION
6.4 PERFORMANCE OF STRUCTURAL CLONE DETECTION
6.5 CONCLUSION

CHAPTER 7 STRUCTURAL CLONE ANALYSIS TECHNIQUES
7.1 NEED FOR CLONE ANALYSIS
7.2 CLONE ANALYSIS TECHNIQUES
7.2.1 Ease of Clone Analysis
7.2.2 Overview of Cloning Intensity
7.2.3 Clones Manipulation Features
7.2.4 Refocusing the Detection
7.2.5 Tool Implementation
7.3 CONCLUSIONS

CHAPTER 8 APPLICATIONS
8.1 PROGRAM UNDERSTANDING
8.2 IMPROVING MAINTAINABILITY OF CODE
8.2.1 Refactoring
8.2.2 Creating Generic Representation
8.2.3 Change Impact Analysis
8.3 REENGINEERING FOR REUSE
8.4 CONCLUSION

CHAPTER 9 EXPERIMENTATION
9.1 CORRECTNESS VALIDATION
9.2 USEFULNESS VALIDATION
9.3 QUALITATIVE ANALYSIS
9.3.1 Eclipse Graphical Editing Framework
9.3.2 Eclipse Visual Editor
9.3.3 OpenJGraph 0.9.2
9.3.4 J2ME Wireless Toolkit 2.2
9.3.5 Java Pet Store 1.3.2
9.4 COVERAGE ANALYSIS
9.5 CONCLUSIONS

CHAPTER 10 CONCLUSIONS AND FUTURE WORK

BIBLIOGRAPHY

APPENDIX A SURVEY OF CLONE DETECTION TECHNIQUES
A.1 DUPLOC
A.2 FINGERPRINTING TECHNIQUE
A.3 WEB CLONE DETECTOR
A.4 CCFINDER
A.5 DUP
A.6 DOTPLOT
A.7 AST BASED TECHNIQUE
A.8 METRICS BASED TECHNIQUE BY MAYRAND ET AL
A.9 M B T K
A.10 DYNAMIC PROGRAMMING TECHNIQUE BY KONTOGIANNIS ET AL
A.11 DYNAMIC PROGRAMMING TECHNIQUE BY BALAZINSKA ET AL
A.12 PDG BASED TECHNIQUE BY KOMONDOOR ET AL
A.13 PDG BASED TECHNIQUE BY KRINKE
A.14 NEURAL NETWORK BASED TECHNIQUE

APPENDIX B CASE STUDIES IN TYPE PARAMETERIZATION MECHANISMS
B.1 STUDYING JAVA GENERICS WITH BUFFER LIBRARY
Acknowledgement
Study Overview
Buffer Library
Can We Have a Generic Buffer Library?
B.2 STUDY OF CLONES IN THE STL
Acknowledgements
Introduction and Motivation
Structure of the STL
Study Methodology
Analysis of Clones in the STL
Effects of Clones in the STL
XVCL solution
Discussion of Results
B.3 CONCLUSIONS

Summary

Code clones are similar program structures of any type and granularity recurring in variant forms in a program. Cloning in software systems is known to create problems during software maintenance. Several techniques have been proposed to detect the same or similar code fragments in software, henceforth called simple clones, with some gains in helping to reduce update anomalies and the software size. Further gains, however, can be obtained by elevating the level of clone analysis. We observed that recurring patterns of simple clones may indicate the presence of interesting higher-level similar program structures that often map to design or application domain concepts. We call these high-level similarities structural clones. Detection of these structural clones leads to a better understanding of the design of the system, which helps in day-to-day software maintenance, long-term evolution and re-engineering. Unification of structural clones with generic program structures offers interesting opportunities for program simplification and reuse.

In this thesis, we first present an efficient token-based technique for simple clone detection, based on current advances in string pattern matching algorithms and data structures. Next, we define a class of useful structural clones and propose a technique to systematically detect them. We consider structural clones formed by groups of highly similar methods, classes or source files and their recurring patterns in various parts of the system. Here, the novelty of our approach is in formulating the concept of structural clone, in applying data mining techniques to detect them, and in applying visualization and analysis techniques to further improve the effectiveness of structural clone detection with the involvement of human experts.

We implemented the proposed method for structural clone detection in a tool called Clone Miner. Finally, we validated the usefulness of the proposed method via experimentation, showing that Clone Miner finds many useful structural clones and scales up to big programs.

This thesis advances the state of the art in clone detection and design recovery research as follows:

First, our technique for simple clone detection is more efficient than other tools described in the literature, due to our choice of suffix arrays as the data structure and a novel maximal repeats finding algorithm. Clone Miner is also more flexible than other tools in customizing the clone detection process.

Second, we introduce the concept of structural clone that extends research on cloning from similar code fragments to similar program structures of any kind and granularity, potentially more meaningful than just similar code fragments. Clone Miner provides practical means to detect structural clones in a semi-automated process that involves data mining techniques at the initial stage, followed up with user-assisted visualization, abstraction and filtering techniques.

Third, with the concept of structural clone, we revisit research on reverse engineering and design recovery, which has received much attention in the last decades. Despite much work, not many practical and scalable techniques have been transferred from labs to programming practice. It appears that structural clones often represent important concepts from the application domain or design. Clone Miner offers a pragmatic and scalable method to recover these concepts, providing developers with information that is vital in program understanding, evolution and re-engineering.

Finally, structural clones offer opportunities for unconventional reuse that reaches beyond reuse rates achievable with architecture-centric, component-based approaches. Unification of structural clones with generic structures also reduces cognitive program complexity.

List of Tables

TABLE 1: A SAMPLE REPRESENTATION OF TOKEN CLASSES WITH TOKEN SYMBOLS
TABLE 2: LANGUAGE TOKENS
TABLE 3: CASE STUDY SYSTEMS
TABLE 4: PERFORMANCE OF SIMPLE CLONE DETECTION
TABLE 5: PERFORMANCE OF STRUCTURAL CLONE DETECTION
TABLE 6: XVCL COMMANDS
TABLE 7: CLONE CLUSTER ANALYSIS OF J2SE 1.5
TABLE 8: CLONING ACROSS AND WITHIN MODULES OF CAP-WP
TABLE 9: CLONE DETECTION RESULTS ON INDIVIDUAL MODULES OF CAP-WP [GOH06]
TABLE 10: CASE STUDY SYSTEMS
TABLE 11: SIMPLE CLONE CLASSES (SCC)
TABLE 12: SIMPLE CLONE STRUCTURES (SCS)
TABLE 13: FILE CLONE CLASSES (FCC)
TABLE 14: FILE CLONE STRUCTURES (FCS)
TABLE 15: METHOD CLONE CLASSES (MCC)
TABLE 16: METHOD CLONE STRUCTURES
TABLE 17: CLONING STATISTICS IN CASE STUDY SYSTEMS
TABLE 18: SUMMARY OF CLONING IN THE STL
TABLE 19: FEATURE COMBINATIONS OF ASSOCIATIVE CONTAINERS

List of Figures

FIGURE 1: STRUCTURAL CLONES IN A MULTI-TIER SYSTEM
FIGURE 2: A PARAMETERIZED CLONE ADAPTED FROM X WINDOW SOURCE CODE [BAK92]
FIGURE 3: A GAPPED CLONE FROM THE UNIX UTILITY BISON [KH01B]
FIGURE 4: A REORDERED CLONE SEGMENT FROM UNIX UTILITY BISON [KH01A]
FIGURE 5: AN INTERTWINED CLONE FROM UNIX UTILITY SORT [KH01A]
FIGURE 6: FOUR QUADRANTS OF CLONES
FIGURE 7: A STRUCTURE
FIGURE 8: A STRUCTURE OF STRUCTURES
FIGURE 9: FIVE EXAMPLE STRUCTURES
FIGURE 10: FILE-LEVEL STRUCTURAL CLONES
FIGURE 11: MODULE-LEVEL STRUCTURAL CLONES
FIGURE 12: MULTIPLE STRUCTURAL CLONES IN ONE FILE
FIGURE 13: TWO CROSSCUTTING STRUCTURAL CLONES
FIGURE 14: STRUCTURAL CLONE WITH HETEROGENEOUS ENTITIES
FIGURE 15: STRUCTURAL CLONE BASED ON INHERITANCE
FIGURE 16: STRUCTURAL CLONE SPANNING MULTIPLE LAYERS
FIGURE 17: CODE CLONES DIFFERING IN OPERATORS FOUND IN C++ STL
FIGURE 18: SAMPLE CODE [KKI02]
FIGURE 19: SUFFIX TREE FOR STRING 'MISSISSIPPI$'
FIGURE 20: NERF IMPLEMENTATION
FIGURE 22: AN EXAMPLE OF SIMPLE CLONE CLASSES LISTED PER FILE
FIGURE 23: AN EXAMPLE OF TRANSFORMED SIMPLE CLONE CLASS IDS PER FILE

FIGURE 26: FREQUENT CLONE PATTERN WITH FILE COVERAGE
FIGURE 27: ALGORITHM FOR CLUSTER PRUNING
FIGURE 28: CLUSTERS OF SIMILAR FILES
FIGURE 29: DIRECTORY DETAILS
FIGURE 30: PATTERNS OF FILE CLONE CLASSES
FIGURE 31: PATTERNS OF SIMILAR FILES IN DIRECTORIES
FIGURE 32: PHASES IN EXTRACTING VALUE FROM CLONES
FIGURE 33: LEADING AND TRAILING CLONE TOKENS
FIGURE 34: COMPARING CLONES IN CLONE ANALYZER
FIGURE 35: MATCHING SIMPLE CLONE INSTANCES
FIGURE 36: FILE SIMILARITY CHART
FIGURE 37: CONFIGURING A SIMPLE CLONE STRUCTURE
FIGURE 38: CLONE ANALYZER ARCHITECTURE
FIGURE 39: SIMILAR PROCESS FLOWS
FIGURE 40: SIMILARITIES IN CAP-WP
FIGURE 41: VIEW HANDLERS SIMILARITY PATTERN [KUA05]
FIGURE 42: A STRUCTURAL CLONE FOUND IN ECLIPSE GEF
FIGURE 43: REORDERED STRUCTURAL CLONE FOUND IN ECLIPSE GEF
FIGURE 44: STRUCTURAL CLONE BASED ON METHOD CALLING FOUND IN ECLIPSE GEF
FIGURE 45: SIMILAR STRUCTURAL CLONES
FIGURE 46: MODULE-LEVEL CLONING IN J2ME WTK
FIGURE 47: MODULE-LEVEL CLONING IN PET STORE
FIGURE 48: FILE-LEVEL CLONING IN PET STORE
FIGURE 49: FEATURE DIAGRAM FOR BUFFER LIBRARY
FIGURE 50: LOC COMPARISON OF ORIGINAL, XVCL AND GENERIC BUFFER LIBRARY
FIGURE 51: METHOD ARRAY() OF FLOATBUFFER
FIGURE 52: METHOD ARRAY() OF LONGBUFFER
FIGURE 53: GENERIC ARRAY() METHOD AFTER REPLACING PRIMITIVE TYPES WITH REFERENCE TYPES
FIGURE 54: ILLEGAL USAGE OF A TYPE PARAMETER

FIGURE 57: KEYWORDS VARIATION
FIGURE 58: KEYWORDS VARIATION
FIGURE 59: METHOD ORDER() OF DIRECTDOUBLEBUFFERS
FIGURE 60: METHOD ORDER() OF DIRECTDOUBLEBUFFERU
FIGURE 61: DECLARATION OF CLASS CHARBUFFER
FIGURE 62: DECLARATION OF CLASS DOUBLEBUFFER
FIGURE 63: METHOD TOSTRING() OF LONGBUFFER
FIGURE 64: METHOD TOSTRING() OF CHARBUFFER
FIGURE 65: METHOD GET(INT) OF DIRECTINTBUFFERS
FIGURE 66: METHOD GET(INT) OF DIRECTFLOATBUFFERS
FIGURE 67: A CLONE THAT VARIES BY OPERATORS
FIGURE 68: TWO CLONE INSTANCES THAT DIFFER BY OPERATOR
FIGURE 69: KEYWORD VARIATION EXAMPLE
FIGURE 70: FEATURE DIAGRAM FOR ASSOCIATIVE CONTAINERS
FIGURE 71: VARIANTS OF SET ALGORITHMS
FIGURE 72: EXACT CLONES FOUND AMONG ITERATORS
FIGURE 73: ACCESS LEVEL VARIATION EXAMPLE
FIGURE 74: A CLONE FOUND AMONG TYPE TRAITS
FIGURE 75: CLONES DUE TO SWAPPING
FIGURE 76: CLONED COPYRIGHT NOTICE
FIGURE 77: GENERIC SOLUTIONS IN XVCL
FIGURE 78: A SAMPLE META-COMPONENT
FIGURE 79: SAMPLE SPC

Chapter 1 Introduction

With the ever increasing role of computing in all aspects of life, more and more software is being created. A large part of software costs goes to the maintenance of a system rather than its initial development [Som96]. A significant amount of legacy code developed many years ago is still operational and plays a critical role in businesses and industries. Even when new software is developed, the pressure to meet schedules is intense. Non-ideal environments result in non-ideal code. “Bad smells” start to emanate from the code, adversely affecting its maintainability. This calls for remedial actions [Fow99]. Duplicated code, also called clones, is at the top of the list of bad smells [Fow99] and is a known problem in software maintenance.

Code clones are similar program structures of considerable size and significant similarity. Cloning is a common phenomenon that exists in almost all kinds of software systems because of the presence of certain inherent similarities. Similar design solutions are repeatedly applied to solve similar problems. Programmers often find themselves solving similar design problems by copying existing code or writing similar code all over again. Architecture-centric and pattern-driven development further encourages standardization of program solutions.

Structural clones, or high-level similarities that represent these repeated higher-level structures, are often induced by the application domain (analysis patterns), design technique (design patterns), programming language limitations for defining flexible generics at higher levels, and mental templates repeatedly used by programmers.

Cloning is believed to have a negative impact on the maintenance of large legacy software systems. By cloning code sections, files and designs, programmers end up maintaining software that is overly complex, error-prone and difficult to change. Cloning complicates software by increasing program size. It also increases the risk of update anomalies, as the location of cloned structures may not be known.

However, there are positive aspects of clones too. Sometimes cloning is done intentionally to improve design modularity, as found in the Java Buffer Library case study by Jarzabek et al. [JL03]. Cloning may also be used to enhance performance by avoiding function calls and in-lining the functions. Cloning can also help in better program understanding, by keeping the software architecture simple and avoiding complicated abstractions [KG06].

Clone detection and analysis is currently an active area of research, with a multitude of detection techniques being proposed recently [LLMZ06][KKI02][DRD99][BYM+98]. While evidence suggests that cloning exists at multiple levels, most of the work done so far is restricted to finding and treating similar fragments of contiguous code, henceforth called simple clones in this thesis.

Figure 1 shows an example of a structural clone found in a real system developed in industry. Each clone is made up of 4 modules and spans multiple layers, from GUI to database. The boxes represent modules, possibly composed of several classes, whereas the arrows indicate runtime interaction. Same shading in the boxes represents high similarity between the modules. Similar patterns of interaction between modules across different layers give rise to the structural clone.

[Figure 1: Structural clones in a multi-tier system. Labels recoverable from the diagram: modules BusinessLogic and DBEntity; interaction arrows "visualizes" and "executes".]

Refactoring [Opd92][Fow99] is the state of the art in removing clones from code. Refactoring is the technique of improving the internal design of a software system without changing its external behavior or functionality. For example, we could replace simple clones with macros [BYM+98] or function calls. Sometimes, duplicated code can be moved to a parent class in the inheritance hierarchy, or a suitable design pattern can be applied to avoid duplication [BMD+99a][BMD+00]. However, refactoring is not always a viable option. Intentional clones cannot be refactored as they are meant to be there, while refactoring of other clones may be either impossible given the programming technique used or just too complicated, defeating the very purpose of clone elimination.
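As a small illustration of replacing simple clones with a function call, the sketch below shows two cloned functions and one possible unified version. The function names, the record fields and the data are hypothetical and chosen only for this example; they do not come from any system studied in the thesis.

```python
# Before refactoring: two simple clones that differ only in the field they read.
def total_price(orders):
    total = 0
    for order in orders:
        if order["active"]:
            total += order["price"]
    return total


def total_weight(orders):
    total = 0
    for order in orders:
        if order["active"]:
            total += order["weight"]
    return total


# After refactoring: the varying part becomes a parameter of one shared function,
# so a future change to the "active" rule has to be made in one place only.
def total_of(orders, field):
    return sum(order[field] for order in orders if order["active"])


# Hypothetical sample data used to check that behavior is preserved.
orders = [
    {"active": True, "price": 10, "weight": 2},
    {"active": False, "price": 99, "weight": 50},
    {"active": True, "price": 5, "weight": 1},
]
```

The external behavior is unchanged: `total_of(orders, "price")` returns the same value as `total_price(orders)`. The example also hints at the limits noted above, since clones whose differences cannot be expressed as a parameter do not unify this easily.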

1.1 Open Challenges

addresses these questions. Some of the challenges that we face in dealing with structural clones are:

Defining structural clones: A precise definition of structural clones, along with the related terminology, is required to better understand this phenomenon and to support communication and future research.

Classification: Given the variety of forms in which structural clones can appear, a proper classification is required. Classification can serve various purposes, such as studying the more frequently occurring structural clones, prioritizing different types of structural clones, and devising re-engineering strategies for different types of structural clones. Different classifications can be made based on the objective of the user who is dealing with them.

Detecting structural clones: Much work has been done on the detection of simple clones, but little has been done on detecting these higher-level similarities. There are many challenges related to the clone detection problem:

• Techniques to detect simple clones need to be enhanced to detect structural clones. In addition, different classes of structural clones may require different detection techniques.

• Scalability of the technique to analyze big software systems.

• Easy adaptability to analyze code written in different programming languages.

• Flexibility, to allow the user to customize the detection in different ways.

• Usefulness of the detection results for the user.

• Correctness and completeness of the detection results.

Managing structural clones: What are the possible ways in which the knowledge about structural clones can be put to use? What mechanisms are available to remove them or to remove their harmful effects? Which kinds of project activities can benefit from the knowledge of structural clones?


1.2 The Goals, Scope and Contributions of this Thesis

When analyzing similarities within different software systems, we observed that recurring configurations of simple clones, related in different ways, are the core of the high-level similarities. This observation formed the basis of our work in defining and detecting structural clones.
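To make the idea of recurring configurations of simple clones concrete, the following is a minimal sketch under stated assumptions. It assumes that a simple clone detector has already assigned numeric IDs to simple clone classes and recorded which IDs occur in which file. The file names and IDs are hypothetical, and the naive pairwise intersection below is only a stand-in for a data mining step, not the algorithm used by Clone Miner.

```python
from itertools import combinations


def recurring_clone_patterns(clone_classes_per_file, min_size=2, min_support=2):
    """Map each recurring set of simple clone class IDs to the files containing it.

    A pattern is the set of IDs shared by some pair of files; it is kept when it
    has at least `min_size` IDs and occurs in at least `min_support` files.
    """
    patterns = {}
    names = sorted(clone_classes_per_file)
    for first, second in combinations(names, 2):
        common = frozenset(clone_classes_per_file[first] & clone_classes_per_file[second])
        if len(common) < min_size or common in patterns:
            continue
        # Support: every file whose clone classes include the whole pattern.
        files = [name for name in names if common <= clone_classes_per_file[name]]
        if len(files) >= min_support:
            patterns[common] = files
    return patterns


# Hypothetical input: simple clone class IDs found in each file.
clone_classes_per_file = {
    "OrderForm.java": {1, 2, 3},
    "CustomerForm.java": {1, 2, 3, 7},
    "ProductForm.java": {1, 2, 9},
    "Util.java": {5},
}

patterns = recurring_clone_patterns(clone_classes_per_file)
```

With this input, the set {1, 2, 3} recurs in two files and the set {1, 2} in three, which marks those files as candidates for a file-level structural clone that a human expert would then inspect.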

In this thesis, we aim to improve current simple clone detection techniques to make them more suitable for structural clone detection; to formalize the notion of a structural clone and provide effective means for its detection; and to validate the proposed techniques in experimental studies that demonstrate the usefulness of the information recovered from programs.

The theoretical, implementation and experimental work conducted under this Thesis covered the above areas and led to the following contributions that advance the state-of-the-art in the clone detection and design recovery research:

1. In the area of simple clones, our clone detection technique improved on current techniques for simple clone detection as follows:

• we find simple clones that differ in more diverse ways than the existing simple clone detection tools can find,

• we improve performance of clone detection,

• we provide flexible ways to customize the detection process so that it targets specific classes of simple clones that are of interest to a programmer in a given situation.

2. In the area of structural clones:

• we formalized the notion of structural clones in general, and described in detail a class of structural clones that show as configurations of simple clones,

• we proposed a data mining technique to detect structural clones, and applied follow-up visualization and analysis techniques to further improve the effectiveness of structural clone detection with the involvement of human experts,

• we implemented the above techniques in a tool called Clone Miner,

• we conducted experimental work to validate the proposed techniques, demonstrating the usefulness of the structural clones detected with Clone Miner.

3. In the area of design recovery: we showed that structural clones represent important design information that is useful in program understanding, evolution, re-engineering and reuse. Therefore, our notion of structural clone and Clone Miner contribute a novel, scalable and effective approach to design recovery, which has been a subject of intensive research over the last two decades. The thesis bridges two so far unrelated research branches: clone detection and design recovery.

In addition to the above, this thesis also presents a comprehensive overview of the relevant background material published on clone detection, including clone detection techniques and clone treatment mechanisms.

The following papers, based on the work covered by this thesis, were presented at international conferences:

[BJ05] Basit, H. A., and Jarzabek, S. Detecting higher-level similarity patterns in programs. In Proceedings of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE), pages 156-165, Lisbon, Portugal, September 2005. ACM Press.

[BRJ05a] Basit, H. A., Rajapakse, D. C., and Jarzabek, S. Beyond templates: a study of clones in the STL and some general implications. In Proceedings of the 28th International Conference on Software Engineering (ICSE), pages 451-459, May 2005.

[BRJ05b] Basit, H. A., Rajapakse, D. C., and Jarzabek, S. An empirical study on limits of clone unification using generics. In Proceedings of the 17th International Conference on Software Engineering and Knowledge Engineering (SEKE), pages 109-114, July 2005.

1.3 Outline of the Thesis

This thesis is divided into ten chapters and two appendices. In Chapter 2 we give an overview of clones and the cloning problem. We also present the related work in clone management, higher-level similarities and design recovery.

In Appendix A we present a survey of clone detection techniques, with a brief description of each technique based on the factors described in Chapter 2.

In Appendix B we present two case studies that assess how well generics, a language-level parameterization technique, can unify simple and structural clones, and compare it with the language-independent meta-level technique of XVCL.¹

In Chapter 3 we elaborate the concept of structural clones. We provide a definition of structural clones and some examples to show structural clones in real systems. We show how structural clones can be extracted from simple clones iteratively. A basic taxonomy of structural clones is also proposed.

In Chapter 4 we present a token-based technique for the detection of simple clones. We explain the flexibility built into the tokenization and the choice of efficient algorithms and data structures.

In Chapter 5 we propose a mechanism for detecting structural clones at different levels based on their physical co-location. It covers detecting recurring patterns of simple clones in different methods and files, patterns of similar methods within and across different files, and patterns of similar files within and across different directories.

In Chapter 6 we present the implementation details of the clone detection tool called Clone Miner, and its performance statistics for the detection of simple and structural clones.

¹ http://xvcl.comp.nus.edu.sg/

In Chapter 7 we discuss the analysis techniques for structural clones based on filtering, abstraction, ranking, incorporating user inputs, and visualizations.

In Chapter 8 we describe the different scenarios where information about structural clones can prove beneficial. Meta-level unification of structural clones using XVCL, for better maintenance and future reuse, is dealt with in more detail.

In Chapter 9 we experimentally validate our techniques for the detection of meaningful and interesting structural clones.

Chapter 10 concludes the thesis and gives pointers to future work.
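The summary names suffix arrays as the data structure behind the token-based detection of simple clones presented in Chapter 4. The following is a minimal sketch of that general idea only, assuming the source has already been turned into a list of token strings. It builds the suffix array naively by sorting and reports runs of tokens shared by adjacent suffixes. It does not reproduce the maximal repeat finding algorithm of the thesis, and all names in it are hypothetical.

```python
def suffix_array(tokens):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda start: tokens[start:])


def common_prefix_length(tokens, i, j):
    """Number of leading tokens shared by the suffixes starting at i and j."""
    length = 0
    while (i + length < len(tokens) and j + length < len(tokens)
           and tokens[i + length] == tokens[j + length]):
        length += 1
    return length


def repeated_runs(tokens, min_length):
    """Return (length, first_pos, second_pos) for repeats of at least min_length tokens.

    Only suffixes adjacent in the suffix array are compared, so some reported
    runs are contained in longer ones; a real detector would keep maximal repeats.
    """
    order = suffix_array(tokens)
    found = []
    for a, b in zip(order, order[1:]):
        length = common_prefix_length(tokens, a, b)
        if length >= min_length:
            found.append((length, min(a, b), max(a, b)))
    return sorted(found, reverse=True)


# Hypothetical token stream in which the statement "a = b + c ;" occurs twice.
tokens = "a = b + c ; x = 1 ; a = b + c ; y = 2 ;".split()
runs = repeated_runs(tokens, min_length=4)
```

Here the longest run, `runs[0]`, is `(6, 0, 10)`: six tokens repeated at positions 0 and 10. Sorting whole suffixes is quadratic or worse in the number of tokens, so the sketch shows the data structure rather than the efficiency that the thesis claims for its own implementation.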

Chapter 2 Cloning – Overview and Related Work

Reuse in software systems is made possible through different mechanisms such as components, architectures, inheritance, shared libraries, object composition, and so on. Still, programmers often need to reuse components which haven’t been designed for reuse. This may happen when software systems go through the expansion phase and new requirements have to be satisfied. In this situation, the programmers may follow the low-cost copy-paste technique instead of the costly redesigning-the-system approach, hence causing the clones. This type of code cloning is the most basic and widely used mode of software reuse. However, it does not lead to systematic reuse such as in the product line approach [CN02]. Several studies suggest that as much as 20-50% of large software systems consist of cloned code [Bak95a][MLM96][DRD99]. The study conducted by Jarzabek & Li [JL03] reports almost 68% of cloned code in the Java Buffer Library.

Cloning has a negative impact on the maintainability of systems. In this chapter we look into this phenomenon, and show the types of clones (Section 2.1), why they exist (Section 2.2), and why they are harmful (Section 2.3). Features of clone detection techniques are presented in Section 2.4, while Section 2.5 discusses clone management techniques and their tradeoffs. Work on classifying simple clones is mentioned in Section 2.6. We also present related work done in the identification of high-level cloning and design recovery in Section 2.7.

2.1 Types of Simple Clones

Several definitions of simple clones are mentioned in the literature that convey almost the same idea: code clones are fragments of contiguous code having considerable size and significant similarity. The length and the similarity of a clone are left to human judgment. The simplest type of clone is the exact or identical clone, where two or more code portions match exactly, except for line breaks and white space. Parameterized clones are defined as “code sections that match except for a one-to-one correspondence between candidates for parameters such as variables, constants, macro names, and structure member names” [Bak92]. Some other authors [KKI02][LLMZ06] do not require a strict one-to-one relationship between the parameters of cloned portions. Apart from these two basic types, some researchers [KH01a][KH01b] have also mentioned reordered clones and intertwined clones. In a reordered clone, the exact or parameterized matching lines of the code are re-ordered, whereas in an intertwined clone, these lines are intertwined with each other. Another type of non-exact clone is a gapped clone, where the differences between the matching code sections cannot be parameterized and form gaps of non-similarity. These definitions are illustrated in the examples given in Figure 2 to Figure 5.
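The difference between exact and parameterized clones can be sketched in a few lines. The code below is an illustrative approximation, not the matching procedure of [Bak92] or of this thesis. It assumes a deliberately tiny tokenizer and treats only identifiers as parameter candidates, so constants must match exactly. The two sample statements are hypothetical; they borrow the names pfi/pfh and lbearing/left from the discussion of Figure 2.

```python
import re

# Hypothetical, deliberately tiny tokenizer: identifiers, integers, single symbols.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "while", "for", "return"}


def tokenize(code):
    return TOKEN.findall(code)


def is_identifier(token):
    return (token[0].isalpha() or token[0] == "_") and token not in KEYWORDS


def is_exact_clone(a, b):
    """Exact clone: same token sequence, ignoring line breaks and white space."""
    return tokenize(a) == tokenize(b)


def is_parameterized_clone(a, b):
    """Token sequences match up to a one-to-one renaming of identifiers."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if len(tokens_a) != len(tokens_b):
        return False
    forward, backward = {}, {}
    for x, y in zip(tokens_a, tokens_b):
        if is_identifier(x) and is_identifier(y):
            # Both directions must stay consistent for a one-to-one mapping.
            if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
                return False
        elif x != y:
            return False
    return True


first = "copy_number(&pmin, &pmax, pfi->min_bounds.lbearing, pfi->max_bounds.lbearing);"
second = "copy_number(&pmin, &pmax, pfh->min_bounds.left, pfh->max_bounds.left);"
```

The two statements are not exact clones, but they are parameterized clones because pfi maps consistently to pfh and lbearing to left. A pair such as `x = y + y;` and `p = q + r;` is rejected, since y would have to map to both q and r.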

Figure 2: A parameterized clone adapted from X Window source code [Bak92]

Note the correspondence between the variable names pfi/pfh and the pairs of structure member names lbearing/left and rbearing/right in Figure 2.

    *pmin++ = *pmax++ = ',';
    copy_number(&pmin, &pmax,
                pfi->min_bounds.rbearing, pfi->max_bounds.rbearing);
    *pmin++ = *pmax++ = ',';

Figure 3: A gapped clone from the UNIX utility bison [KH01b]

The bold lines in Figure 3 form the gaps in the cloned fragments.

Figure 4: A reordered clone segment from UNIX utility bison [KH01a]

Figure 5: An intertwined clone from UNIX utility sort [KH01a]

The bold lines in Figure 5 are a clone of the normal lines.

2.2 Reasons for Clones

The reasons for the emergence of clones in software systems are frequently discussed in the literature, and some of them are quite obvious. A detailed analysis is presented in [Rie05].

Cloning is cheap: Cloning is faster and cheaper than writing code from scratch. Time constraints due to deadlines push programmers to copy and modify existing code that implements similar logic instead of writing a generalized routine from scratch, which may take longer.

Code fragment from Figure 5 (intertwined clone from the UNIX utility sort [KH01a]):

    tmpa = UCHAR(*a);
    tmpb = UCHAR(*b);
    while (blanks[tmpa]) tmpa = UCHAR(*++a);
    while (blanks[tmpb]) tmpb = UCHAR(*++b);
    if (tmpa == '-')
    …else if (tmpb == '-')

Cloning is less complex: Cloning with modification is simpler than writing a more general, parameterized function or module.

Cloning is safer: Cloning already tested code is considered safe because it has little unplanned effect on the original code.

Unclear requirements: When a new requirement is not fully understood and a similar piece of code is present, that code is copied, pasted and modified to fulfill the requirement.

Better performance: Efficiency considerations may make the cost of a procedure call seem too high. Systems with tight time constraints are hand-optimized by replicating frequent computations, especially when the compiler does not offer in-lining of arbitrary expressions or computations.

Language limitations: Limitations of the source language's abstraction mechanisms cause cloning of code. When standard reuse mechanisms (e.g., inheritance, shared libraries, object composition) cannot be applied, the only way out is to clone the code.

Performance evaluation: Evaluating programmer performance by LOC encourages programmers to copy existing code and make the required changes instead of putting more time and thought into designing a generalized solution.

Cloning comes naturally: When adding functionality similar to existing logic in the system, the natural instinct of a programmer is to copy, paste and modify the existing code to meet the new requirement.

Coding style: Following a coding style leads to the appearance of clones in the code.

Mental macros: Mental macros are standard solutions to frequently occurring problems, which the programmer applies without a second thought. Repetitive application of these mental macros generates cloned code [BYM+98].

Porting software to a new platform: When an application must be ported to a new hardware platform, or a new device driver has to be written for it, the existing code is copied and pasted with the required modifications, consequently creating clones. Managing multiple configurations of a system is a similar issue.


Code generating tools: Code generation tools, such as IDE facilities for creating GUI components, result in similar code fragments being generated repeatedly.

Accidental clones: Some clones appear accidentally. Sometimes similarities occur not because code was copied and pasted, but because of the similar nature of the problems being addressed. For example, clones can occur unintentionally across different systems because they use a common GUI toolkit or library [AKHG05].

In addition to the general reasons mentioned above, specific reasons for the emergence of structural clones include repeated application of the same analysis pattern [Fow97] or design pattern [GHJV97]; similar patterns of components at the architecture level; and design solutions repeatedly applied by programmers to solve similar problems (so-called “mental templates” [BYM+98]). Sometimes programmers do not explicitly copy these patterns; rather, they implement them all over again. It is even possible that two similar patterns are created by two programmers unaware of each other’s solutions. Another likely cause of this high-level similarity is the ‘feature combinatorics problem’ [BSST93].

File-level and module-level structural clones are a common phenomenon in software. Duplicating part or all of a file (e.g., interface, logic, method call structure) or a set of files (e.g., a module) is a common practice. To illustrate, consider a situation where a child class needs to be added to an abstract parent class. An existing child class may be cloned to create a sibling, in order to duplicate the code that overrides the abstract methods of the parent. Another likely cause of file-level structural clones is cloning in piecemeal fashion. That is, only a part of the file is cloned initially, but as the developer recognizes more and more similarity between the reuse context and the reused context, more and more code fragments are cloned from the original file to the new file. These new fragments may be inserted at locations in the new file that differ from their locations in the original file (e.g., a method copied from the original class may be pasted at a different position in the destination class), resulting in a different ordering of cloned fragments across files (as is the case in Figure 10). In the same fashion, similar modules of an application may be created by copying a whole module and modifying its contents.

The architectural design of a software system can also lead to cloning. For example, the process flows and interfaces of the components within the system may be similar or fixed, resulting in file-level or method-level structural clones. Furthermore, the architecture-centric and pattern-driven development encouraged by modern component platforms, such as .NET and J2EE, leads to highly uniform and similar design solutions [YJ05].

2.3 Negative Impact of Clones

With excessive cloning, evolution and further development (maintenance) become prohibitively expensive. Hence, most of the harmful effects of clones are related to maintenance [MLH96][MLM96][Joh94a]. Some of the obvious harmful effects are listed below. A more thorough analysis is given in [Rie05].

Error replication: Unknown errors get replicated by ad-hoc copy and paste.

Difficulty in understanding software: Cloning makes the system larger and more complex, and therefore harder to understand.

Difficulty in bug fixing: If a bug fix is required in a code fragment that is cloned in several places, all the other copies must be analyzed to avoid update anomalies. A study of multiple releases of a large software system showed that programmers often missed modifying duplicated copies of code [LPM+97].

Difficulty in changing software: A change required in one clone may be required in the other copies as well. If clones are not formally managed, a lot of rework results. If the copies are handled by different people, different solutions to the same problem will arise, which can have undesirable side effects in the system. Hence, cloning contributes to the “software aging” that makes software difficult to change [Par94].

Generate dead code: When code is cloned without full understanding, “dead code” is generated, i.e., code that is never accessed [Joh94a].

Generate implicit links: Clones form implicit links between components that share functionality.

Generate hidden bugs: Errors in systematic renaming can lead to unintended aliasing, resulting in hidden bugs that show up later.

Increase in code size: Cloning considerably increases the size of the source code. This, in turn, increases compile time and the size of the executable.

2.4 Clone Detection

The first step towards analyzing a software system for cloning is to represent the code in terms of the code features that are to be compared for similarity. Several representations have been tried for clone detection.

Raw Text

The simplest approach is to use the raw text of the source code for detecting similarities. String comparison is used for similar reasons in several fields, such as molecular biology, speech recognition, and coding theory. An advantage of the raw text representation is its high adaptability to various programming languages, since no lexical analysis or parsing is needed. Because character-by-character matching is very expensive, numeric values that represent lines or small code fragments are used for the similarity computation.


Some simple transformations can be applied to the raw text, such as removing comments and blank lines. To reduce the large amount of computation involved in matching each line of code against all others, hashing techniques can be used.
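The line-based scheme described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of any tool cited in this chapter; the function names, the crude treatment of `//` comments, and the choice of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def normalize(line):
    """Strip a trailing // comment and collapse whitespace (crude, assumed rules)."""
    code = line.split("//", 1)[0]
    return " ".join(code.split())


def duplicate_line_groups(source):
    """Group the 1-based numbers of lines whose normalized text is identical."""
    buckets = defaultdict(list)
    for number, line in enumerate(source.splitlines(), start=1):
        text = normalize(line)
        if not text:  # blank and comment-only lines are ignored
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        buckets[digest].append(number)  # one bucket lookup instead of n comparisons
    return sorted(group for group in buckets.values() if len(group) > 1)
```

Each line is compared only with the lines in its own hash bucket, which avoids matching every line against all others. A real detector would then join adjacent duplicated lines into longer fragments and apply a minimum-length threshold.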

Lexical Tokens

A source language lexer is run on the system to tokenize all of its code. The tokens are then compared for similarity. Since every line of code is composed of many tokens, token-based computation is more expensive than line-by-line comparison. However, a token-based representation can easily employ various transformations that eliminate differences in coding style in order to detect clones. Parameterized clones can also be detected easily by token-based techniques.
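As a rough illustration of how a token-based representation exposes parameterized clones, the sketch below replaces every identifier and number with a placeholder before comparison. The regular expression, the small keyword set and the function name are assumptions for this example and do not reproduce the lexer of any cited tool. Note also that mapping all identifiers to one placeholder is looser than a true parameterized match, which would require the renaming to be consistent.

```python
import re

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "while", "for", "return", "int"}  # illustrative subset


def normalized_tokens(code):
    """Tokenize C-like code, abstracting identifiers to ID and numbers to NUM."""
    result = []
    for token in TOKEN.findall(code):
        if token in KEYWORDS:
            result.append(token)
        elif token[0].isalpha() or token[0] == "_":
            result.append("ID")
        elif token.isdigit():
            result.append("NUM")
        else:
            result.append(token)
    return result
```

With this normalization, the two `while` statements from the sort fragment of Figure 5, which differ only in the names `tmpa`/`tmpb` and `a`/`b`, yield identical token sequences.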

Parse Tree

A parse tree or an abstract syntax tree can be constructed from the lexical tokens to represent the source code in tree form, in which similar code fragments can be found by matching subtrees. However, complete parsing of the program makes the clone detection technique difficult to adapt to new languages, as it requires a parser for the exact language dialect of interest.

Program Dependence Graphs

In the Program Dependence Graph (PDG) representation, nodes represent program statements and predicates, and edges represent data and control dependences. In addition to being heavily language dependent, PDGs are complex to work with, and the data structure does not scale well.

Metrics

Metrics can be computed for different units of code, such as methods, classes and files, and compared for similarity. Similar metric values indicate similar code units. Metrics can

Visual Representation

The code is represented visually, using different views and colors, so that duplicated code can be detected by looking at graphs and plots. Visualization provides quick insight into the duplication situation and is helpful for initial analysis. The scatter plot is a technique widely used in DNA analysis and is quite easy to draw. The code is put on both the x and y axes, and a match between two elements is a dot in the matrix. Similar code fragments appear as lines parallel to the diagonal, with broken lines indicating gapped clones, and lines concentrated in a region indicating high cloning activity in that region [CH93][DRD99].
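The scatter-plot idea can be made concrete with a small sketch that builds the dot matrix for a list of (already normalized) lines and reports the runs of dots parallel to the main diagonal. The function names and the `min_len` parameter are assumptions for this illustration; the cited tools render the matrix graphically rather than as text.

```python
def dot_matrix(lines):
    """matrix[i][j] is True when line i equals line j."""
    n = len(lines)
    return [[lines[i] == lines[j] for j in range(n)] for i in range(n)]


def render(matrix):
    """Draw the matrix as text, one '*' per matching pair."""
    return ["".join("*" if cell else "." for cell in row) for row in matrix]


def diagonal_runs(matrix, min_len=2):
    """Return (row, column, length) for runs of dots above the main diagonal."""
    n = len(matrix)
    runs = []
    for offset in range(1, n):
        run = 0
        for i in range(n - offset + 1):  # the last step only flushes the open run
            hit = i + offset < n and matrix[i][i + offset]
            if hit:
                run += 1
            else:
                if run >= min_len:
                    runs.append((i - run, i - run + offset, run))
                run = 0
    return runs
```

For the lines `a, b, c, x, a, b, c`, the only run reported is `(0, 4, 3)`: three consecutive lines starting at line 0 recur starting at line 4. A gapped clone would show up as two shorter runs on the same or a neighboring diagonal.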

The more preprocessing is performed, the more language dependent the clone detection technique becomes. Hence, techniques based on raw text are the most general, whereas those based on constructing a parse tree, or on calculating metrics from parsing information, are heavily tied to the source language. Between these two extremes of the spectrum are the techniques that use the token-based representation.

Clone detection techniques can also be categorized by the granularity of the clones they find.

Arbitrary granularity

Many clone detection techniques consider the line as the basic unit of a clone; nevertheless, single-line clones are rarely interesting. Instead, a threshold is set on the minimum number of lines of code that will be considered a clone.

However, in many languages, such as C, C++, and Java, line breaks in source code have no semantic meaning; their placement usually depends on the programmer’s preference. This hampers the detection of clones when code fragments differ only in the placement of line breaks, and in some cases it causes the detected clone to be shorter than the actual clone.

2.5 Clone Management

Clone management covers activities targeted at removing or minimizing the negative effects of clones on maintenance [Rie05]. Clone management has been divided into three categories: preventive, corrective and compensatory.

Preventive clone management is about preventing the creation of new clones. It mostly concerns new code being developed, but can also be applied to existing code that is undergoing changes [MLH96]. It aims at removing clones from the code before it is released or submitted to the configuration management system. An experience with preventive clone management is described in [LPM+97].

Corrective clone management is about reengineering a system to remove clones. Most of the techniques and tools proposed for clone management belong to this category. Completely rebuilding the software system to get rid of clones is a costly and risky decision and does not guarantee clone-free code. The detection and subsequent resolution of clones by refactoring, subroutine calls, in-line functions, macros, templates, etc., however, promises a decrease in maintenance costs and code size.

Refactoring & Restructuring

Restructuring in this context means making changes to the system to improve its internal structure without affecting its external behavior. For object-oriented systems, the term refactoring is used [Opd92][Fow99]. Refactorings that are generally applied to remove duplicated code are extract method (the duplicated code is extracted into a separate method), remove method (the duplicated methods are merged), pull up method (the duplicated methods are moved up the class hierarchy and inherited by all subclasses), or any combination of the above. An advantage of refactoring is that we stay in the same paradigm as the source code itself, but it is not always feasible to resolve all the clones by this technique.

Refactoring based on design techniques (design patterns, inheritance with dynamic binding) is a clone unification option that is closely tied to the design of the program. To eliminate redundant code in a Java software system, Balazinska et al. [BMD+99a][BMD+00] applied refactoring based on the ‘strategy’ and ‘template’ design patterns, factoring out the commonalities of methods and parameterizing the differences according to the design patterns. However, this technique is applicable only to specific types of clones. Automated refactoring tools are described in [HUK+02][FR99][RBJ97].
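To illustrate the pattern-based unification mentioned above, the following sketch shows the general shape of a ‘template’ refactoring: the control flow that two cloned methods share is pulled up into a base class, and only the differing steps remain in the subclasses. The classes and method names are hypothetical and written in Python for brevity; they are not taken from the systems studied by Balazinska et al.

```python
class Report:
    """Holds the formerly duplicated skeleton exactly once (template method)."""

    def render(self, rows):
        lines = [self.header()]
        for row in rows:
            lines.append(self.format_row(row))
        return "\n".join(lines)

    def header(self):
        raise NotImplementedError

    def format_row(self, row):
        raise NotImplementedError


class CsvReport(Report):
    """Supplies only the parts that differed between the clones."""

    def header(self):
        return "name,qty"

    def format_row(self, row):
        return f"{row[0]},{row[1]}"


class TextReport(Report):
    def header(self):
        return "NAME QTY"

    def format_row(self, row):
        return f"{row[0]} {row[1]}"
```

Before the refactoring, each report class would have carried its own copy of the loop in `render`. The example also hints at the limitation noted above: the differences must fit into overridable steps for this approach to apply.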

Macros

Baxter et al. [BYM+98] propose to automatically replace clones with macros. Most macro systems are merely implementation-level mechanisms for handling variant features (or changes, in general). Although macros can unify all code clones, their disregard for the semantics of the code has certain drawbacks. When lexical changes are made to a macro, manual verification is necessary to ensure that the intended semantic change propagates correctly to all the contexts in which the macro is used. Failing to address change at the analysis and design levels, macros never evolved into full-fledged “design for change” methods [Bas97][KRT97]. Programs instrumented with macros tend to be difficult to understand and test. Moreover, this technique is restricted to languages that support macros.

Generics

This is another language-dependent technique with the potential to deal with parameterized clones. Unlike C++ and Ada, many programming languages have not so far developed such extensive frameworks for generic programming. Generics became a part of Java in version 1.5.

We conducted an initial test on applying generics to the Java Buffer Library, but found them to be inadequate for practical purposes. Even the powerful C++ templates fail in certain places to unify very similar code structures, as we found in our analysis of clones in the STL (SGI STL homepage: http://www.sgi.com/tech/stl/). These studies are presented in Appendix B.

A comparison of generics in six programming languages is presented in [GJL+03]. A considerable part of the Boost Graph Library was implemented in all six languages using their respective generic capabilities. The authors identified several language features that are useful for enhancing generic capabilities beyond implementing simple type-safe polymorphic containers. These features are essential for implementing reusable libraries of software components, which is fast emerging as a promising area where generics can be effectively utilized. However, the presence of all these features does not solve the clone unification problem; rather, it only helps in avoiding “awkward designs, poor maintainability, unnecessary run-time checks, and painfully verbose code” [GJL+03].

Higher Order Functions

Higher order functions from the functional programming paradigm offer an attractive reuse option [Tho97]. Skeleton objects are introduced in [BSKC02] as an object-oriented alternative to higher order functions, and mechanisms for building adaptable components using these skeleton objects are discussed. However, the approach may be difficult to implement in languages that do not support function pointers. Even in C++, passing long lists of function pointers as arguments to class constructors severely degrades the readability of the code.
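The reuse offered by higher order functions can be illustrated with a small sketch in which the loop that several near-identical routines would otherwise duplicate is written once and parameterized by a function. The names `make_aggregator`, `total` and `largest` are hypothetical and serve only this example.

```python
def make_aggregator(combine, initial):
    """Build a routine that folds `combine` over a sequence, starting at `initial`."""

    def aggregate(values):
        result = initial
        for value in values:
            result = combine(result, value)
        return result

    return aggregate


# Two routines that would otherwise be cloned loops differing in one operation.
total = make_aggregator(lambda a, b: a + b, 0)
largest = make_aggregator(max, float("-inf"))
```

Here `total([1, 2, 3])` and `largest([3, 9, 2])` share one loop body. In a language without first-class functions, the varying operation would have to be passed through function pointers or wrapper objects, which is the readability cost noted above.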

Domain Specific Technologies

There are some domain-specific reuse mechanisms, such as the portlet technology offered by J2EE, which defines generic functionality that can be reused to create variants. However, these approaches are also limited in building generic representations by the restrictions of the programming language [YJ05].

Compensatory clone management aims at minimizing the negative effects of clones without removing them from the actual code. One straightforward way to do this is to document the clones [Rie05]. Another possibility is to use a meta-programming approach to unify clones at the meta-level. This is elaborated below.

Meta-level Unification of Clones

The techniques discussed under corrective clone management aim at totally removing clones from the source code. However, this objective is not always feasible. For example, when clones are created deliberately for some positive effect, such as better performance, it is not advisable to remove them altogether. Similarly, clone resolution may at times be possible through refactoring, but the resulting design of the system may conflict with other important design goals.

Here we distinguish between useful and useless clones, and between clones that can be cleanly resolved by refactoring or similar techniques and those that cannot. Usefulness here is in terms of performance or clarity of design [JL03], or in terms of avoiding the risk of breaking the code by clone unification [Cor03], but not in terms of maintainability. The four quadrants are shown in Figure 6.

Figure 6: Four quadrants of clones

This consideration leads to four categories of clones:

Category 1 clones can be resolved by some conventional technique, but are useful.

Category 2 clones are useless and, at the same time, can be neatly resolved by conventional techniques.

Category 3 clones are useful and also not unifiable.

Category 4 clones are useless, but also not unifiable by conventional techniques.

Based on the above categorization, the conventional techniques applied in corrective and preventive clone management can only target clones in category 2. Here we focus on clones in categories 1, 3 and 4. While we cannot eliminate these clones from the runtime code, we can effectively deal with their negative impact on software maintenance by resolving all the clones at the meta-program level. This technique targets better software maintenance by providing source code that is free of clones, while the executable code retains the useful clones that are required at runtime. The technique is based on XVCL, a method and tool for managing changes during software evolution and reuse. XVCL works on the principle of cleanly separating program construction-time concerns (such as changeability and adaptability) from other properties of program structures. Such separation, achieved by a simple text-based “composition with adaptation” mechanism, provides a virtual window for dealing with changes without compromising other important design goals or runtime properties of programs. XVCL can be applied on top of programs written in any language. It complements and enhances the design methods supported by the OO and component-based paradigms.

                 Useless       Useful
Non-Unifiable    Category 4    Category 3
Unifiable        Category 2    Category 1

The cloned parts of the code are separated out as XVCL meta-components, called x-frames. The variations between the cloned components are handled by suitable XVCL commands. Lower-level x-frames are combined to form higher-level x-frames, which eventually represent the structure of the entire system. During this configuration of x-frames, variability is injected into the framework to handle the differences between otherwise similar code fragments. The XVCL processor reads the x-frame configuration, then reconstructs the program and outputs it in its native language, such as Java or C++.
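The general idea of keeping one generic meta-level source and generating the cloned program text from it can be conveyed by a toy sketch. It is emphatically not XVCL: it uses Python's `string.Template` with invented placeholder names (`Type`, `elem`) and an invented class layout, and it has none of the x-frame composition, scoping or adaptation commands described above.

```python
from string import Template

# A single generic "frame"; the placeholders mark where the clones differ.
FRAME = Template(
    "class ${Type}Buffer {\n"
    "    private ${elem}[] data;\n"
    "    public ${elem} get(int i) { return data[i]; }\n"
    "}\n"
)


def generate(variants):
    """Expand the frame once per variant, yielding file name -> source text."""
    return {f"{v['Type']}Buffer.java": FRAME.substitute(v) for v in variants}


sources = generate([
    {"Type": "Int", "elem": "int"},
    {"Type": "Char", "elem": "char"},
])
```

The generated `IntBuffer.java` and `CharBuffer.java` are clones of each other, yet a change to the shared structure is made only once, in the frame. This is the sense in which clones remain in the program that is compiled while the maintained source is free of them.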

The transformation from the meta-component architecture to the executable code is entirely invisible to the programmer. The generated code contains all the desired clones that are kept there for specific reasons.

Unlike macros, XVCL is a full-fledged method for generic design, in which variant features are directly addressed at both the program design and implementation levels. Over time, an XVCL meta-component structure emerges as a well-organized architecture that explicates the impact of variant features on components (or classes) and automates the production of custom components. XVCL has unique features to support reuse and evolution, such as propagation of meta-variables across meta-components, meta-variable scoping rules that allow generic meta-components to be adapted at inclusion points, meta-expressions for formulating generic names, code selection or insertion at designated breakpoints, and a while loop construct for implementing code generators.

Applications of the XVCL technique to overcome the limitations of language-level generics are presented in the Buffer Library and STL case studies in Appendix B.

Clone taxonomies can help in prioritizing clones for reengineering and in specifying reengineering guidelines. They can help in studying which types of clones are most common, so that specialized clone detection techniques can be developed for them. Clone taxonomies can also provide a summary of the cloning present in a system and can help in forming criteria for evaluating the effectiveness of different clone detection techniques.

Kapser and Godfrey [KG03] group clones according to their relative locations. The reasons for each category of clones and guidelines for reengineering are also given. Mayrand et al. [MLH96] define eight levels of function clones depending on the extent of similarity, from exact clones to clones having totally different control flow. Balazinska et al. [BMD+99b] define a categorization based on 18 possible syntax changes between clones. Bellon [Bel02] defines a useful categorization of clones into exact, parameterized and gapped clones for the purpose of comparing different clone detection tools.

The overwhelming number of simple clones reported by clone detection tools when analyzing large software systems has prompted different solutions related to the idea of detecting structural clones. As structural clones represent a very broad concept, the problem of detecting them is multi-dimensional and complex.

Detecting structural clones is directly related to domain analysis. Different techniques may be required to detect different types of structural clones. Given the complex nature of the problem, totally automated detection may not be possible [MM01], and human intervention is required at some stage to extract the details of the repeating structure. However, automated tools can help the human user to a great extent beyond the detection of simple clones.

Detecting structural clones requires syntactic as well as semantic analysis. To find the relationships between different repeating entities, the semantic links between them need to be traced. In addition to code, other related artifacts such as design diagrams and code comments can also be quite useful for locating and analyzing structural clones. However, not all similar design structures lead to clones at the code level. Consequently, top-down (from design to code) and bottom-up (from code to design) analyses are complementary to each other. However, for some legacy software, such design diagrams may not be available. Even if they were developed initially, software that is maintained over a long period of time undergoes several changes, and the initial design diagrams may no longer be synchronized with the current state of the system.

A closer look at other clone-related research shows that some works do cross the boundary from simple clone detection to detecting higher-level clones, even though this is not explicitly stated. Usually such works target specific types of structural clones, notably file-level structural clones. For example, in [DST04], a whole web page is considered a ‘clone’ of another page if the two are similar beyond a given threshold. However, rather than using the simple clones inside the pages to calculate the similarity, they measure it by the Levenshtein distance between the whole pages. In contrast, Gemini [UKKI02a] shows the similarity between files, calculated from the simple clones (detected by CCFinder [KKI02]) inside the files. However, Gemini does not go as far as to explicitly identify a file as a structural clone. Another limitation of these tools in terms of identifying similar files is that they detect only pairs of similar files rather than groups of similar files.
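For reference, the Levenshtein distance used as a whole-page similarity measure can be computed with the standard dynamic-programming recurrence, sketched below. The normalization into a score between 0 and 1 is an assumption made for this illustration; how [DST04] encode pages and choose the threshold is not described here.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical sequences, 0.0 for disjoint ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

The functions accept any sequences, so a page could be compared as a string of characters or as a list of tags. Two pages would then be reported as clones when `similarity` exceeds the chosen threshold, which yields pairs of similar pages rather than groups, as noted above.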

The work of De Lucia et al. [DFST04] involves detecting a web-specific type of structural clone, in which a clone consists of several web pages linked by hyperlinks. They use a graph-based pattern matching algorithm to identify this type of clone.

The work presented in [MM01] approaches the detection of structural clones from a different perspective. It defines ‘high-level concept clones’ as manifestations of higher-level abstractions in the problem or solution domain, giving the example of a list ADT that has been duplicated in one form or another throughout a system. The clone detection method examines source code text (comments and identifiers) to identify similar high-level concepts. An information retrieval approach is used for static analysis of the system and to determine the semantic similarities in the source code. It is proposed to use these similarity measures to guide the simple clone detection process. The authors sum up their work as an attempt to show that domain concepts can be used to identify clones (in contrast to the common approach of trying to identify domain concepts using clone analysis). While the goals of their work and ours are similar (both approaches try to move beyond simple clones), the promises made and the methods used are very different, and complementary. A structural clone is an attempt to move towards high-level similarity patterns, yet is firmly rooted in patterns of concrete similarities at the implementation level. A structural clone may indicate a cloned concept (in the requirements or design space). A ‘high-level concept clone’ stems from a similarity in concepts. There is no emphasis on the structure of the clone found, although it may be a structural clone as well. There may be some overlap between the similarities found by the two methods, but there may also be many “concepts” that are not captured in a “structure” (e.g., two List containers implemented in totally different ways) and many “structures” that are not accounted for by a single concept (e.g., a structure made up of several interlinked concepts). When two program parts are structurally as well as conceptually similar, but developers have renamed the identifiers in one part, a structural clone detector should detect them easily, whereas the detection method in [MM01] will not. However, this seems to be a weakness of that particular detection method rather than a flaw in the concept of high-level concept clones.

There is also a strong connection between clone detection and earlier work on design recovery and program understanding of large legacy systems for ease of maintenance and reuse [Big89][BMG+94][KNE92]. Clones, especially structural clones of large granularity, provide useful insights into the program structure for better understanding of the program. We expect that some structural clones may hint at important concepts behind a program. Clichés, as discussed in the Programmer’s Apprentice project [RW90a], and programming plans, mentioned by Hartman [Har92] and Rich et al. [RW90b], represent commonly used program structures, which may appear as file-level structural clones within or across software systems (product line members). In these early works, software was searched for these plans (or clichés) for program understanding, but there were scalability limitations. The contribution of clone detection towards program understanding is also discussed in

2.8 Conclusions

In this chapter we have given an introduction to the cloning problem, its causes and its harmful effects. Different aspects of clone detection techniques were presented. We also discussed different techniques available for clone unification, at both the language level and the meta-level. Related work on detecting higher-level similarities and on design recovery was also covered, to set the stage for the forthcoming chapters, where we present our own work that attempts to bridge the two fields.
