cprover
goto-programs

Folder goto-programs

Author
Kareem Khazem, Martin Brain

Overview

Goto programs are the intermediate representation of the CPROVER tool chain. They are language independent and similar to many of the compiler intermediate languages. Each function is a list of instructions, each of which has a type (one of 19 kinds of instruction), a code expression, a guard expression and potentially some targets for the next instruction. They are not natively in static single-assign (SSA) form. Transitions are nondeterministic (although in practise the guards on the transitions normally cover form a disjoint cover of all possibilities). Local variables have non-deterministic values if they are not initialised. Goto programs can be serialised in a binary (wrapped in ELF headers) format or in XML (see the various _serialization files).

The cbmc option --show-goto-programs is often a good starting point as it outputs goto-programs in a human readable form. However there are a few things to be aware of. Functions have an internal name (for example c::f00) and a ‘pretty name’ (for example f00) and which is used depends on whether it is internal or being presented to the user. NONDET(some_type) is use to represent a nondeterministic value. IF guard GOTO x represents a GOTO instruction with a guard, not a distinct IF instruction. The comment lines are generated from the source_location field of the instructiont structure.

GOTO programs are commonly produced using the initialize_goto_model function, which combines the complete process from command-line arguments specifying source code files, through parsing and generating a symbol table, to finally producing GOTO functions.

Alternatively lazy_goto_modelt can be used, which facilitates producing GOTO functions on demand. See Lazy GOTO Conversion for more details on this method.

Data Structures

A goto_functionst object contains a set of GOTO programs. Note the counter-intuitive naming: goto_functionst instances are the top level structure representing the program and contain goto_programt instances which represent the individual functions.

An instance of goto_programt is effectively a list of instructions (a nested class called goto_programt::instructiont). It is important to use the copy and insertion functions that are provided as iterators are used to link instructions to their predecessors and targets and careless manipulation of the list could break these.

Individual instructions are instances of type goto_programt::instructiont. They represent one step in the function. Each has a type, an instance of goto_program_instruction_typet which denotes what kind of instruction it is. They can be computational (such as ASSIGN or FUNCTION_CALL), logical (such as ASSUME and ASSERT) or informational (such as LOCATION and DEAD). At the time of writing there are 19 possible values for goto_program_instruction_typet / kinds of instruction. Instructions also have a guard field (the condition under which it is executed) and a code field (what the instruction does). These may be empty depending on the kind of instruction. These are of type exprt and codet respectively. The next instructions (remembering that transitions may be non-deterministic) are given by the list targets (with the corresponding list of labels labels) and the corresponding set of previous instructions is get by incoming_edges. Finally instructiont has informational function and source_location fields that indicate where they are in the source code.

Goto Conversion

In the goto-programs directory.

Key classes:

At this stage, CBMC constructs a goto-program from a symbol table. Each symbol in the symbol table with function type (that is, the symbol's type field contains a code_typet) will be converted to a corresponding GOTO program. The parse tree and source file are not used at all for this step.

The conversion happens in two phases:

  1. Function goto_convertt::convert turns each codet in the symbol table into corresponding GOTO instructions.
  2. goto_convertt::finish_gotos and others populate the GOTO and CATCH instructions' targets members, pointing to their possible successors. DEAD instructions are also added when GOTO instructions branch out of one or more lexical blocks.

Here is an example of a simple C program being converted into GOTO code (note the targets member of each instruction isn't populated yet):

The codet representation of method bodies that is stored in the symbol table resembles C code quite strongly: it uses lexical variable scoping and has complex control-flow constructs such as code_whilet and code_ifthenelset. In converting to a GOTO program this structure is changed in several ways:

Subsequent Transformation

It is normal for a program that calls goto_convert to immediately pass the GOTO functions through one or more transformations. For example, the ubiquitous remove_returns transformation turns the method return sequence generated by goto_convert (something like RETURN x; GOTO end;) into function_name::return_value = x; GOTO end;, and a callsite y = function_name(); into function_name(); y = function_name::return_value;.

Some subsequent parts of the CBMC pipeline assume that one or more of these transforms has been applied, making them effectively mandatory in that case.

Common transformations include:

By convention these post-convert transformations are applied by a driver program in a function named process_goto_program or process_goto_functions; examples of these can be found in cbmc_parse_optionst, goto_analyzer_parse_optionst and goto_instrument_parse_optionst among others.

Lazy GOTO Conversion

The conventional source-code-to-GOTO-program pipeline is horizontally structured: all source-file information is converted into a symbol table, then each symbol-table method is converted into a GOTO program, then each transform is run over the whole GOTO-model in turn. This is fine when the GOTO model consumer will use most or all of the results of conversion, but if not then lazy conversion may be beneficial.

Lazy conversion is coordinated by lazy_goto_modelt. Early source code parsing proceeds as usual, but while the same symbol table symbols are created as using initialize_goto_model, method bodies in the symbol table are not populated. lazy_goto_modelt::get_goto_function then allows each function to be converted on demand, passing individually through symbol-table codet production, GOTO conversion and driver-program transformation.

goto_symext is able to use such a lazy GOTO model in place of the usual goto_modelt, enabling it to create GOTO programs on demand, and thus avoid spending time and memory generating GOTO programs that symex determines cannot in fact be reached. This is particularly useful when the source program contains many indirect calls (virtual function calls, or calls via function pointer), as the remove_virtual_functions and remove_function_pointers passes treat these very conservatively, producing large dispatch tables of possible callees which symex often finds to be largely unreachable.

JBMC exhibits typical usage of lazy_goto_modelt when it is called with --symex-driven-lazy-loading. It defines a function jbmc_parse_optionst::process_goto_function which performs driver-program- specific post-GOTO-convert transformation. The lazy GOTO model is then passed to goto_symext, which will trigger symbol-table population, GOTO conversion and post-processing as needed.

Binary Representation

An instance of goto_modelt can be serialised to a binary stream (which is typically a file on the disk), and later deserialised from that stream back to an equivalent goto_modelt instance.

Serialisation

The serialisation is implemented in C++ modules:

To serialise a goto_modelt instance gm to a stream ostr call the function write_goto_binary, e.g. write_goto_binary(ostr, gm).

The content of the written stream will have this structure:

An important propery of the serialisation is that each serialised irept instance occurs in the stream exactly once. Namely, in the position of its first serialisation query. All other such queries save only a hash code (i.e. reference) of the irept instance.

A similar strategy is used for serialisation of string constants shared amongst irept instances. Such a string is fully saved only in the first serialisation query of an irept instance it appears in and all other queries only save its integer hash code.

Details about serialisation of irept instances, strings, and words in 7-bit encoding can be found here.

Deserialisation

The deserialisation is implemented in C++ modules:

The first two modules are responsible for location of the stream with the serialised data within a passed file. And the remaining two modules perform the actual deserialisation of a goto_modelt instance from the located stream.

To deserialise a goto_modelt instance gm from a file /some/path/name.gbf call the function read_goto_binary, e.g. read_goto_binary("/some/path/name.gbf", gm, message_handler), where message_handler must be an instance of message_handlert and serves for reporting issues during the process.

The passed binary file is assumed to have the same structure as described in the previous subsection. The process of the deserialisation does not involve any seeking in the file. The content is read linearly from the beginning to the end. irept instances and their string constants are deserialised into the memory only once at their first occurrences in the stream. All subsequent deserialisation queries are resolved as in-memory references to already deserialised the irept instances and/or strings, based on hash code matching.

NOTE: The first deserialisation is detected so that the loaded hash code is new. That implies that the full definition follows right after the hash.

Details about serialisation of irept instances, strings, and words in 7-bit encoding can be found here.

Deserialisation from ELF image

One can decide to store the serialised stream as a separate section, named goto-cc, into an ELF image. Then the deserialisation has a support of automatic detection of that section in an ELF file and the deserialisation will be automatically started from that section.

For reading the ELF images there is used an instance of elf_readert implemented in the C++ module:

Deserialisation from Mach-O fat image

One can decide to store the serialised stream into Mach-O fat image as a separate non-empty section with flags CPU_TYPE_HPPA and CPU_SUBTYPE_HPPA_7100LC. Then the deserialisation has a support of automatic detection of that section in a Mach-O fat image, extraction of the section from the emage into a temporary file (this is done by calling lipo utility with -thin hppa7100LC option), and the deserialisation will be automatically started from that temporary file.

For reading the Mach-O fat images there is used an instance of osx_fat_readert implemented in the C++ module:

NOTE: This functionality is available only when the modules are built on a MacOS machine.

Checking file type

You can use function is_goto_binary to check whether a passed file contains a deserialised goto_modelt instance or not. This is done by checking the magic number of the stream (see subsection Serialisation). However, in the case when the deserialised data were stored into ELF or Mach-O fat image, then only the check for presence of the concrete section in the image is performed.

Linking Goto Models

Similar to linking of object files together by C/C++ linker, the module provides linking of a dereserialised goto_modelt instance into a given (i.e. previously deserialised or otherwise created) goto_modelt instance.

This is implemented in function read_object_and_link. The function first deserialises the passed file into a temporary goto_modelt instance, and then it performs 'linking' of the temporary into a passed destination goto_modelt instance.

Details about linking of goto_modelt instances can be found here.

Linking Goto Models

C++ modules:

Dependencies: