SOLID Veteran or Copy/Paste Master? Finding duplicate code

Of course, life is not only black and white. Quite often, code duplication is introduced unintentionally, and this becomes even more likely in bigger teams. The reasons don’t really matter. What matters more is having tools at hand that reliably detect code duplication in growing and evolving codebases.

In today’s episode, we’ll cover the dupFinder command-line tool, which can find duplicates in C# and VB.NET code! It is available from the ReSharper Command-Line Tools NuGet package or as a TeamCity build step. Since it is free, it’s also perfect to use in any other continuous integration (CI) or DevOps-related environments.

Note: dupFinder is currently only supported on Windows. There is a YouTrack issue you can follow if you’d like to stay informed about support for other platforms.

What is code duplication?

Obviously, copying and pasting is code duplication in its simplest form. A more subtle phenomenon is that developers independently write code that is similarly structured, but has differences in identifier naming, code formatting, and code style.

public static string Indent(this string text, int count)
{
    return new string(c: ' ', count) + text;
}

public static string AddIndentation(this string str, int c)
{
    return new string(c: ' ', c) + str;
}
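To make the idea of "same structure, different names" concrete, here is a minimal sketch that replaces every identifier with a placeholder and then compares the results. This is only an illustration and is not how dupFinder works internally; the class `NaiveNormalizer`, its keyword list, and the `ID` placeholder are all hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NaiveNormalizer
{
    // Deliberately incomplete keyword list: just enough for the two methods above.
    private static readonly HashSet<string> Keywords = new HashSet<string>
    {
        "public", "static", "string", "this", "int", "return", "new"
    };

    // Splits code into identifier-like words and single punctuation characters
    // (whitespace is dropped), then replaces every non-keyword word with "ID".
    // All identifiers map to the same placeholder, so this is a coarse comparison.
    public static string Normalize(string code) =>
        string.Join(" ",
            Regex.Matches(code, @"[A-Za-z_]\w*|\S")
                .Cast<Match>()
                .Select(m => m.Value)
                .Select(t => Regex.IsMatch(t, @"^[A-Za-z_]") && !Keywords.Contains(t) ? "ID" : t));
}
```

With this sketch, `Normalize("return new string(' ', count) + text;")` and `Normalize("return new string(' ', c) + str;")` both yield `return new string ( ' ' , ID ) + ID ;`, so the two bodies compare as equal even though no identifier matches.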

In dupFinder, the size of a duplicated code fragment is expressed as a cost. The cost is a relative unit, much like cyclomatic complexity: the higher the cost, the larger the duplicated fragment.

Gathering duplication metrics

DupFinder is a CLI tool and can be invoked on solution files, folders or individual files. A common requirement would also be to exclude generated code:

dupFinder FunkyApp.sln --output=report.xml --exclude="**/*.Generated.cs"

Right at the top of the generated XML report, there will be accumulated statistical data, which could be used for historical analysis:

<Statistics>
  <CodebaseCost>141438</CodebaseCost>
  <TotalDuplicatesCost>913</TotalDuplicatesCost>
  <TotalFragmentsCost>1826</TotalFragmentsCost>
</Statistics>
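For historical analysis, one option is to reduce these numbers to a single value per build. The sketch below reads the `Statistics` element with LINQ to XML and divides `TotalDuplicatesCost` by `CodebaseCost`. Treating that ratio as a "duplication percentage" is our own assumption rather than a metric defined by dupFinder, and the names `DuplicationStats` and `DuplicationPercentage` are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class DuplicationStats
{
    // Returns TotalDuplicatesCost / CodebaseCost as a percentage.
    // The interpretation of this ratio is an assumption made for illustration.
    public static double DuplicationPercentage(XDocument report)
    {
        // Look the element up by name so we do not depend on the report's root element.
        XElement stats = report.Descendants("Statistics").First();

        long codebaseCost = (long)stats.Element("CodebaseCost");
        long duplicatesCost = (long)stats.Element("TotalDuplicatesCost");

        return codebaseCost == 0 ? 0.0 : 100.0 * duplicatesCost / codebaseCost;
    }
}
```

Calling `DuplicationStats.DuplicationPercentage(XDocument.Load("report.xml"))` after each build and storing the result would give a simple trend line over time.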

More interestingly, it also contains individual data about discovered duplicates. This includes the actual costs of the duplication as well as file name, line offsets and column offsets for the related fragments:

<Duplicate Cost="95">
  <Fragment>
    <FileName>..\src\FunkyApp.Core\Tooling\Tool.cs</FileName>
    <OffsetRange Start="309" End="615"></OffsetRange>
    <LineRange Start="12" End="18"></LineRange>
  </Fragment>
  <Fragment>
    <FileName>..\src\FunkyApp.Core\Tooling\ToolExecutor.cs</FileName>
    <OffsetRange Start="502" End="832"></OffsetRange>
    <LineRange Start="21" End="27"></LineRange>
  </Fragment>
</Duplicate>
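If we want to process these entries ourselves, for example to print the most expensive duplicates first, the structure shown above maps naturally onto a couple of small types. The following is a sketch that relies only on the elements and attributes visible in the excerpt; the `Fragment` and `Duplicate` records and the `DuplicateReportReader` class are hypothetical names, and records require C# 9 or later.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public sealed record Fragment(string FileName, int StartLine, int EndLine);
public sealed record Duplicate(int Cost, IReadOnlyList<Fragment> Fragments);

public static class DuplicateReportReader
{
    // Reads every <Duplicate> element and returns them sorted by cost, highest first.
    public static IReadOnlyList<Duplicate> Read(XDocument report) =>
        report.Descendants("Duplicate")
            .Select(d => new Duplicate(
                (int)d.Attribute("Cost"),
                d.Elements("Fragment")
                    .Select(f => new Fragment(
                        (string)f.Element("FileName"),
                        (int)f.Element("LineRange").Attribute("Start"),
                        (int)f.Element("LineRange").Attribute("End")))
                    .ToList()))
            .OrderByDescending(d => d.Cost)
            .ToList();
}
```

A caller could then loop over `DuplicateReportReader.Read(XDocument.Load("report.xml"))` and print each cost together with the file names and line ranges of its fragments.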

Depending on our solution, we might need to play with the --discard-cost parameter. It acts as a cost threshold: fragments whose cost falls below it are discarded, which keeps the result set relevant. We may also decide to add the --show-text parameter to include the actual code in the report. Plenty of additional command-line options are explained in our help documentation.

Human-readable reports

Certainly, digging through XML is not the most efficient way of analyzing a report. By applying a custom XSL transformation, we can make it much more readable. Out of the box, TeamCity shows a new Duplicates report tab, which allows us to navigate through the results by scope and view the fragments side by side:
DupFinder Duplicates Report in TeamCity
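Outside of TeamCity, a custom XSL transformation can be applied with .NET's built-in `XslCompiledTransform`. The sketch below embeds a deliberately tiny stylesheet that renders one table row per fragment. The stylesheet and the `ReportTransformer` class are hypothetical, and the XPath expressions assume only the elements shown in the report excerpt above.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class ReportTransformer
{
    // Hypothetical minimal stylesheet: cost, file name and line range per fragment.
    private const string Stylesheet = @"<?xml version=""1.0""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""html"" indent=""yes""/>
  <xsl:template match=""/"">
    <html><body>
      <h1>Duplicates</h1>
      <table>
        <tr><th>Cost</th><th>File</th><th>Lines</th></tr>
        <xsl:for-each select=""//Duplicate/Fragment"">
          <tr>
            <td><xsl:value-of select=""../@Cost""/></td>
            <td><xsl:value-of select=""FileName""/></td>
            <td><xsl:value-of select=""LineRange/@Start""/>-<xsl:value-of select=""LineRange/@End""/></td>
          </tr>
        </xsl:for-each>
      </table>
    </body></html>
  </xsl:template>
</xsl:stylesheet>";

    // Transforms the XML text of a dupFinder report into an HTML string.
    public static string ToHtml(string reportXml)
    {
        var transform = new XslCompiledTransform();
        using (var xsl = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xsl);
        }

        using var input = XmlReader.Create(new StringReader(reportXml));
        using var output = new StringWriter();
        transform.Transform(input, null, output);
        return output.ToString();
    }
}
```

Writing the result to disk is then a one-liner, for example `File.WriteAllText("report.html", ReportTransformer.ToHtml(File.ReadAllText("report.xml")))`. In a real setup the stylesheet would more likely live in its own .xslt file so it can be edited without recompiling.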

Avoid duplication as you type

Going beyond the capabilities of the dupFinder CLI, ReSharper (and Rider) can help to avoid frequently recurring duplicates at the moment they are written. Suppose you’ve identified a duplicated fragment and encapsulated it in its own method so that it can be called more concisely from anywhere in the codebase. Still, your fellow developers keep writing the old fragment over and over again. This is a good time to create a custom Structural Search and Replace (SSR) pattern, which shows a code inspection as soon as the duplicate is typed.

Download ReSharper or give Rider a try. We’d love to hear your feedback!
