Tokenizing LLVM

Motivation

LLVM IR needs to be tokenized so that the tokens can be used to build a trie. The trie is used to determine whether sections of the decompiled code match any of the snippets we create, so that we know which sections can be swapped.

Possible Methods

There are two methods to tokenize LLVM: the first is to treat the LLVM textual assembly as strings, and the second is to build custom tokens out of the LLVM IR using the LLVM API.

Advantages and Disadvantages of using String Manipulation

Advantages

Tokenizing via string manipulation is much easier than tokenizing via the LLVM API. Typical delimitation rules apply, i.e. tokens are separated by delimiters such as whitespace, commas, braces, etc.
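As a rough illustration of the delimiter rules above, here is a minimal sketch in Python. The names TOKEN_RE and tokenize_line are hypothetical, and the pattern is an assumption about which characters count as delimiters; it ignores quoted identifiers and string constants, which a real tokenizer would have to handle.

```python
import re

# Hypothetical token pattern (an assumption, not the full LLVM lexer):
# identifiers, single-character punctuation, and any other run of
# non-delimiter characters. Whitespace is skipped implicitly.
TOKEN_RE = re.compile(r"""
    [%@][-\w.$]+          # local (%) or global (@) identifiers
  | [(){}\[\]<>,=*]       # punctuation kept as separate tokens
  | [^\s(){}\[\]<>,=*]+   # opcodes, types, literals, keywords
""", re.VERBOSE)


def tokenize_line(line):
    """Split one line of LLVM textual assembly into string tokens."""
    # Simplification: everything after ';' is treated as a comment,
    # which would be wrong for a ';' inside a string constant.
    code = line.split(";", 1)[0]
    return TOKEN_RE.findall(code)


if __name__ == "__main__":
    print(tokenize_line("%3 = add nsw i32 %1, %2 ; sum"))
```

Keeping punctuation such as commas and braces as tokens of their own, rather than discarding it, is a design choice made here so that no structure is lost before the trie is built.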

Creating the trie with strings is much more straightforward than using LLVM classes.
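To show why string tokens make the trie simple, here is a small sketch of a trie keyed on token strings. The class names TrieNode and TokenTrie, and the idea of storing snippet names at the end of each path, are assumptions made for illustration only.

```python
class TrieNode:
    """One node per token; children are keyed by the token string."""

    def __init__(self):
        self.children = {}
        self.snippets = []  # names of snippets that end at this node


class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, snippet_name):
        """Add a snippet, given as a list of string tokens."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.snippets.append(snippet_name)

    def longest_match(self, tokens, start=0):
        """Return (length, snippet names) of the longest snippet that
        matches tokens[start:], or (0, []) if nothing matches."""
        node = self.root
        best = (0, [])
        for i in range(start, len(tokens)):
            node = node.children.get(tokens[i])
            if node is None:
                break
            if node.snippets:
                best = (i - start + 1, list(node.snippets))
        return best


if __name__ == "__main__":
    trie = TokenTrie()
    trie.insert(["%v0", "=", "add", "i32", "%v1", ",", "%v2"], "add_pair")
    code = ["%v0", "=", "add", "i32", "%v1", ",", "%v2", "ret", "i32", "%v0"]
    print(trie.longest_match(code))
```

Because tokens are plain strings, they can be used directly as dictionary keys; doing the same with LLVM classes would require defining equality and hashing for each kind of value.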

Disadvantages

I can’t think of any significant disadvantages. Some information is lost when using string manipulation, but that is only a problem if the information turns out to be needed.

Advantages and Disadvantages of using LLVM API

Advantages

As much information as desired about every aspect of every instruction can be preserved. For example, if a BinaryOperator is tokenized, its ValueName, both Operands, and the relations of each Operand to previous instructions can all be kept. It may turn out that this information is needed, but I can’t think of a use case for it.

Disadvantages

Working with the LLVM API is very tedious, and it may be the case that every child of the Value class will need its own code to tokenize, as there is no built-in tokenizing code within LLVM.

Current Plan of Action

Currently, tokenizing will take place by normalizing all the variables from the start point onwards, then doing a token-by-token walk through the decompiled code, comparing it with the trie to find matching snippets. We will use string manipulation because that is all that is needed to tokenize effectively.
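The plan above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the %v0, %v1, ... renaming scheme (variables are numbered in order of first appearance from each start point), and the nested-dictionary trie.

```python
END = object()  # sentinel key marking the end of a snippet in the trie


def normalize(tokens):
    """Rename local variables to %v0, %v1, ... in order of first use."""
    names = {}
    out = []
    for tok in tokens:
        if tok.startswith("%"):
            tok = names.setdefault(tok, "%v{}".format(len(names)))
        out.append(tok)
    return out


def build_trie(snippets):
    """Build a nested-dict trie from {snippet name: token list}."""
    root = {}
    for name, tokens in snippets.items():
        node = root
        for tok in normalize(tokens):
            node = node.setdefault(tok, {})
        node[END] = name
    return root


def find_matches(tokens, trie):
    """Try every start point; return (start, length, snippet name)."""
    matches = []
    for start in range(len(tokens)):
        names = {}  # normalization restarts at each start point
        node = trie
        for length, tok in enumerate(tokens[start:], 1):
            if tok.startswith("%"):
                tok = names.setdefault(tok, "%v{}".format(len(names)))
            node = node.get(tok)
            if node is None:
                break
            if END in node:
                matches.append((start, length, node[END]))
    return matches


if __name__ == "__main__":
    trie = build_trie({"add_pair": ["%a", "=", "add", "i32", "%b", ",", "%c"]})
    code = ["%5", "=", "add", "i32", "%3", ",", "%4", "ret", "i32", "%5"]
    print(find_matches(code, trie))
```

Normalizing per start point means a snippet matches regardless of the variable names the decompiler happened to pick, while still distinguishing a repeated variable from two different ones. This sketch rescans from every token, which is quadratic in the worst case; that seemed acceptable for an illustration but may not be for large inputs.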