Proof
- Semantic vs Pedantic Diffs
- How Proof Works

Proof

Proof is a tool for generating and viewing a semantic diff from two sources.

Semantic vs Pedantic Diffs

Software Engineers spend a large amount of time looking at diffs. We do this when we are…

Reviewing code
Getting an update on what has changed in code
Remembering what we did in code

The first point is the most important. Human error is inevitable but costly, and code reviews reduce the chance of human errors having costly consequences. This is especially true in the medical software world, where a bug can mean the difference between life and death.

There are two ways to look at how a piece of source code has changed. The first way is to compare the text of the previous source with the text of the new source. I call this a pedantic diff. This is the way practically all diff tools work in the modern day. The reason it is so widespread is:

It is easy. Implementing a pedantic diff is pretty much trivial. There are some ways to improve the presentation quality of the diff using substrings, but this has been explored and described in depth.
It is language independent. Any piece of text can be diffed.

The biggest problem with this method is that source code is not text. Yes, it is written and displayed as text, but it is governed by a tree structure that represents how the program will function. This tree, known as an Abstract Syntax Tree, is the truest representation of the source code. It represents the semantic meaning of the code, which is why I call a diff between two ASTs a semantic diff. Semantic Diffs have the following advantages:

Unimportant changes to the source code are ignored, which allows reviewers to focus on the meaning of the code.
More meaningful information can be represented. Statements can be broken down to reveal the specific parts that changed. Moved code can be tracked through files.
Insights can be more effectively gathered. New variables can be easily tracked, for instance.

How Proof Works

The above flowchart shows the generalized flow from source code to semantic diff. Notice that the only two parts of the code that are language dependent are the Parser and the “Unparser”. The real meat and potatoes of the system hold for any language.

The Parser and the AST

This is same AST that you probably know and love, so I will skip explaining what a parser and AST are.

The Change Distiller and Linked AST Pair

A Linked AST Pair is a pair of ASTs, with matching nodes linked. For example, if we have this AST, … and this AST, The linked AST pair looks like this Of course, ASTs are not just circles with letters on them. Each node in the AST has two properties - the “Label”, which is the type of element, and the “Value”, the information in the element. We define the following rules for the Linked AST Pair:

Only nodes with the same label may be linked
Any node may be linked to at most one other node

Outside of that, matching can be done in any way that makes sense. I implemented the [Change Distillation Algorithim](https://www.researchgate.net/publication/3189787_Change_DistillingTree_Differencing_for_Fine-Grained_Source_Code_Change_Extraction) described by Fluri et. al., and have so far had stellar results.

The DiffTree Populator and the DiffTree

A DiffTree is a tree where every node has one of a few labels:

CREATE: This is a new node
DELETE: This node will be deleted
MOVE_FROM: This node was deleted here and moved elsewhere
MOVE_TO: This node was moved from somewhere else.
MODIFY: The value of the node has changed.
NONE: This node stays the same

The DiffTree Populator takes a Linked AST Pair and constructs a DiffTree based on how nodes are matched. For example, here's a DiffTree constructed from the previous Linked AST Pair.

Although the algorithim is fairly complex (it took over a month to iron out all the kinks and determine the correct behavior for all cases), the relationship between the Linked AST Pair and the DiffTree can be summarized like this

If an unmatched node is in the after tree, the node is created.
If an unmatched node is in the before tree, the node is deleted.
If a matched pair of nodes has different parents or exists at a different position under the same parent, it needs to be moved.
If a matched pair of nodes has different values, it needs to be modified.
If none of the above apply, no operation needs to be taken.

Using this simple set of rules, we start from the top of the tree and proceed in level order. As you might expect, the ability for nodes to move complicates the algorithm significantly, both in terms of implementation and expected output 😁. (The test suite used for testing the algorithm contains 53 tests, and is over 2600 lines in length.)

Unparser and Semantic Diff

The Unparser's job is to take a DiffTree, and return something resembling code, albeit with a bit more structure than code. Since this is the opposite job of the Parser, I thought it was fitting to call it an “unparser”.

Table of Contents