Data Flow Analysis for Go
GoLand 2023.3 comes with support for data flow analysis (DFA). In this post, we’ll introduce the feature, explain how it works, and show some real-world examples of how DFA can detect bugs on the fly!
Thanks to the CLion team for helping us by porting their powerful DFA engine. For now, the GoLand engine only implements a limited number of DFA features, but more will be added in subsequent releases. The CLion team has also covered a variety of other topics, including a deeper dive into implementation specifics, on their blog.
What is data flow analysis?
DFA is a type of static code analysis that analyzes how data flows through a program. In basic terms, it calculates the possible values of variables at different points in the program’s execution. With this information, you can find various potential bugs, such as nil
dereferences, endless loops, constant conditions, and other incorrect or atypical program behavior.
Let’s look at a straightforward example of DFA in action in the form of a simple function:
func dummy(initializeResource bool) { 1 var resource *Resource = nil 2 init := false 3 4 if initializeResource { 5 resource = new(Resource) 6 init = true 7 } 8 9 r := resource 10 _ = r.Name }
A common way to perform DFA starts with building a control flow graph (CFG) of the analyzed function. The CFG of this dummy function is provided below. For clarity, all statements (lines of code) are numbered according to their corresponding code snippet.
You can think about CFG as a simple graph that reflects the function’s execution. The graph nodes correspond to code blocks, and the edges reflect conditional and unconditional jumps between them. You don’t need to know the exact formal definition of a CFG or how to build them for this article, but if you’d like to learn about CFGs, you can visit this link.
Once the CFG has been built, the main stage of the analysis can begin. During this stage, the DFA computes all of the possible values of variables that can follow each function statement. Roughly speaking, this can be done by propagating values through statements (such as assignments) and the edges of the CFG, taking into account the reachability of the CFG’s nodes.
For example, the possible value set of the variable r after statement 9 is {nil, new(Resource)}
. The nil
value is obtained from statement 1
by propagating it through the path 1 -> 2 -> 4 -> 9. The new(Resource)
value is obtained by propagating it through the path 1 -> 2 -> 4 -> 5 -> 6 -> 9, taking into account assignments on lines 5 and 9.
We can use the information obtained about variable values to find potential bugs in the corresponding programs. For example, since we know that, after statement 9, the variable r
can be nil
, that means there can be a nil
dereference on line 10.
Challenges of data flow analysis
As an introduction to the potential pitfalls and difficulties associated with implementing data flow analysis, let’s take a look at a slightly modified version of our previous example function:
func dummy(initializeResource bool) { 1 var resource *Resource = nil 2 init := false 3 4 if initializeResource { 5 resource = new(Resource) 6 init = true 7 } 8 9 r := resource 10 if init { 11 _ = r.Name 12 } 13 }
The corresponding CFG looks like this:
DFA, like other static analysis tools, over-approximates the behavior of programs. In our case, this means that DFA computes the upper bound of possible variable values. In this example, data flow analysis infers that the possible value set of the variable init in statement 10 is {false, true}
. Hence, both the branches of the condition on line 10 are reachable, which means an execution can reach statement 11. In statement 11, the value set of variable r
is {nil, new(Resource)}
. Thus, we can infer that there is a potential nil
dereference on line 11.
But that’s not really true. In fact, the variable init
can take both true
and false
values at statement 10. However, the reachability of the condition init == true
also depends on the condition initializeResource == true
. If the latter is met, then init
can only take the value true, and if it isn’t met, then init
can only be false. To identify such cases, we must use contexts. Let’s assume that we’re analyzing a function in two different contexts. The first context corresponds to a case where initializeResource
is true
, the second one corresponds to a case where initializeResource
is false
. These contexts are most easily described as a clone of the CFG:
As you can see, there are two different contexts (surrounded by a dashed border). In each context, we can analyze statements 9, 10, and 11 in different ways. For example, in statement 10 of the context initializeResource == true
, the variable init can only take a true
value, and variable r
can only take a new(Resource)
value. Therefore, in this context, statement 11 is reachable, but a nil
dereference isn’t possible. In statement 10 of the context initializeResource == false
, variable r
can take a nil
value. Since the value of init
can only be false
, statement 11 is not reachable in this context and therefore a nil
dereference isn’t possible. As such, a nil
dereference cannot occur in either context.
Using contexts in static analysis allows us to improve the quality of the analysis and weed out false-positives. To support this, exit statements of the if statement are split into two different contexts, duplicating the subsequent nodes of the control flow graph and analyzing them independently to identify all the possible data paths.
The capabilities of data flow analysis in GoLand
Constant conditions detection. Constant conditions represent a crucial type of data flow inspection. The constant condition inspection uses the DFA execution data to determine if certain conditions are constant. Here are two examples:
Example 1: In this example, DFA has deduced that the condition err != nil
is always false
. To show that this is indeed the case, let’s consider what values the err
variable can take in the condition on line 191. There are two main cases – when allF
is true and when allF
is false after the for
loop is executed. If allF
is true
after the for
loop is executed, then err
will be nil
on line 191, otherwise there will be a return
from the function on line 186. The remaining eventuality, when allF
is false
after the for
loop is executed, can only happen if line 177 is reachable. After line 177 is executed, we can be sure of two things: firstly, that err
is nil
(otherwise the execution of the loop would have continued on line 175), and secondly, that the execution of the loop has been interrupted, and therefore the variable err
won’t be assigned any other value. Hence, in cases where allF
is false
, the variable err
can only ever be nil
. Thus, the condition err != nil
is always false
.
Example 2: Here is a simpler example in which DFA deduces that the condition r0k != nil
on line 193 is always true
. This happens because the implicit dereference r0k.License
is present on line 191, after which the variable r0k
cannot be nil
. Although the derived constant condition inspection does not accurately show a real problem in the code, it reveals strange behavior in the program. In fact, issues could become apparent at runtime after the potential nil dereference of the variable r0k
on line 191, as the author of the code implies that r0k
can be nil
.
These examples show how the constant condition inspection can allow you to identify peculiar points or strange behavior in the program’s code.
Potential nil
dereference. DFA can detect a nil
dereference for a variable even in code that seems absolutely normal to the naked eye. Let’s see how this is possible:
Example 3: In this example, DFA infers that there can be a nil
dereference of the variable conf on line 273. This may seem strange because the code takes into account cases where the variable conf
is nil
. In such a case, there should be a return from the function on line 269, and so dereference of the conf
variable shouldn’t be reachable. However, there are actually two different conf
variables. The first one is declared on line 249 and the second on line 261. Thus, the last declaration shadows the previous one.
This can lead to a nil dereference. Let’s assume that the first conf
variable (line 249) is nil
but the second conf variable (line 261) is not nil
, and let’s also assume that the corresponding error
variable is nil
. In this case, there is no return
from the function on lines 262 and 269, which means we reach the conf
dereference on line 273. This dereference corresponds to the first conf
variable and causes issues at runtime. For this reason, we should be careful with nested short variable declarations!
Example 4: DFA deduces a potential nil
dereference of the variable pod
on line 260. This function takes into account the nilablility of the variable pod
(there is a condition pod != nil
on line 254), so the author of the code implies that the variable pod
can be nil
within the if
statement. But in this case, there will be a nil
dereference on line 260.
Error may be not nil
. This inspection reports cases in which variables might have nil
or an unexpected value because of the associated error that is not checked for being non-nil
.
The analysis is currently intra-procedural and does not consider user-imposed contracts on the function. Therefore, in specific situations, there may be false positives. For cases such as these, you can use a quick-fix to ask the DFA not to analyze or report these errors.
func _() { file, err := os.Open("file.txt") // Error check is omitted here name := file.Name() print(name, err) }
In the example provided, the variable file could either have the value nil or an unexpected value if err is not nil.
Try DFA for yourself!
You can try out these improvements now in the GoLand 2024.1 Release Candidate or wait for the 2024.1 version. It’s also available for 2023.3 users in early access, but is disabled by default. To enable it, go to Settings | Editor | Inspections | Go | Data Flow Analysis (experimental).