Description

Various tools for data-flow analysis.

Comparison with source code data-flow

Binary data-flow analysis differs from source code data-flow analysis in that source code has well-known, named variables and the analysis is often simplified by considering only those variables (e.g., whole arrays, not array element granularity), and by ignoring aliasing issues. But because binaries have no named variables but rather one large array of memory with computed addresses, a binary data-flow analysis must be at address granularity and account for aliasing. In fact, because of the rich set of addressing modes, it is very common that within a single instruction some of the inputs and/or outputs depend on other inputs.

Another difference is in the number of basic operations that need to be supported. Source-level languages typically have at most a few dozen basic operations for which data-flow must be explicitly encoded in the analysis, while binary analysis needs basic data-flow information about hundreds of CPU instructions across the supported architectures. In fact, there are so many unique instructions that the implementation here takes a different approach: it leverages the fact that ROSE has instruction semantics and is able to discern data-flow within an instruction (or even across an entire basic block) by executing the instruction (or block) in a special data-flow semantic domain. The domain is an aggregate domain consisting of a main user-specified domain (often symbolic), a domain to track data-flow, and any number of additional user-specified domains. This approach works to discover multiple data-flow within the executed item (instruction or basic block); most CPU instructions have multiple data flows (e.g., "PUSH" has at least three independent data flows: incrementing of the instruction pointer, moving data between stack and register, incrementing the stack pointer).

Two additional minor things to be aware of in binary data-flow are (1) the difference in size between registers and memory locations, and (2) the use of idioms to perform common tasks. For instance, a read from memory to a 32-bit register is actually four separate 8-bit reads at the data-flow level since memory holds 8-bit values. An example of an idiom is the "XOR EAX, EAX" instruction to set the EAX register to zero (this is important for taint analysis, since it effectively removes any taint from EAX).

The major components

The core component of the binary data-flow is an Engine class template. Its purpose is to collect all the types and objects necessary for a data-flow analysis and to iteratively solve a data-flow equation.

Additional components provide functionality that's useful during the core solving process, but not necessary. For instance, one can use abstract locations (AbstractLocation) to represent variables. An abstract location is either a register or a memory address. Registers are represented by a RegisterDescriptor while addresses are represented by a semantic value in some user-specified domain (often symbolic, see Rose::BinaryAnalysis::InstructionSemantics::BaseSemantics::SValue).

Prerequisites for the core data flow-engine

See Engine for how data-flow is solved. In short, data-flow consists of values (taken from a lattice) that are stored in variables (abstract locations, memory, registers). A set of variables and their values (a.k.a., state) is stored at each vertex in a control flow graph. The solver iterates over the control flow graph. It invokes a user-defined transfer function at each vertex. The transfer function takes an incoming state and produces an outgoing state. The incoming state for a vertex is calculated by merging the outgoing states of its predecessor vertices. When the iteration reaches a fixed point the analysis is complete.

Definition at line 71 of file DataFlow.h.

#include <Rose/BinaryAnalysis/DataFlow.h>

Collaboration diagram for Rose::BinaryAnalysis::DataFlow:

[legend]

Classes
class	DefaultVertexUnpacker
	Functor to return instructions for a cfg vertex. More...

class	Engine
	Data-flow engine. More...

class	Exception
	Data-flow exception base class. More...

class	NotConverging
	Exceptions when a fixed point is not reached. More...

class	PathAlwaysFeasible
	Trivial path feasibility predicate. More...

class	SemanticsMerge
	Basic merge operation for instruction semantics. More...

Public Types
typedef InstructionSemantics::DataFlowSemantics::DataFlowGraph	Graph
	Data-flow graph.

typedef AbstractLocation	Variable
	Variable participating in data flow.

typedef std::list< Variable >	VariableList
	List of variables.

typedef Sawyer::Container::Map< size_t, Graph >	VertexFlowGraphs
	Map from CFG vertex to data-flow graph.

Public Member Functions
	DataFlow (const InstructionSemantics::BaseSemantics::DispatcherPtr &userDispatcher)
	Constructor.

VariableList	getUniqueVariables (const VertexFlowGraphs &)
	Get list of unique variables.


Graph	buildGraph (const std::vector< SgAsmInstruction * > &)
	Compute data-flow.


template<class CFG , class VertexUnpacker >
VertexFlowGraphs	buildGraphPerVertex (const CFG &cfg, size_t startVertex, VertexUnpacker vertexUnpacker)
	Compute data-flow per CFG vertex.

template<class CFG >
VertexFlowGraphs	buildGraphPerVertex (const CFG &cfg, size_t startVertex)
	Compute data-flow per CFG vertex.

Static Public Member Functions
static void	initDiagnostics ()
	Initialize diagnostics.

Static Public Attributes
static Sawyer::Message::Facility	mlog

Member Typedef Documentation

◆ Graph

typedef InstructionSemantics::DataFlowSemantics::DataFlowGraph Rose::BinaryAnalysis::DataFlow::Graph

Data-flow graph.

Vertices are abstract locations (register descriptors and/or memory addresses) and edges indicate the flow of data from abstract location to another. The edge IDs are in the order they were added; each edge also has a sequence number that is initially equal to the edge ID, although edge IDs can change if edges are erased from the graph).

Definition at line 78 of file DataFlow.h.

◆ Variable

typedef AbstractLocation Rose::BinaryAnalysis::DataFlow::Variable

Variable participating in data flow.

A variable in binary data-flow analysis is an abstract location referencing either a register or memory cell. The address for memory locations is an arbitrary semantic expression (Rose::BinaryAnalysis::InstructionSemantics::BaseSemantics::SValue).

Definition at line 85 of file DataFlow.h.

◆ VariableList

typedef std::list<Variable> Rose::BinaryAnalysis::DataFlow::VariableList

List of variables.

Definition at line 88 of file DataFlow.h.

◆ VertexFlowGraphs

typedef Sawyer::Container::Map<size_t, Graph> Rose::BinaryAnalysis::DataFlow::VertexFlowGraphs

Map from CFG vertex to data-flow graph.

This data structure holds a data-flow graph for each control flow graph vertex. The data-flow graph represents data-flow from one abstract location to another. The union of all abstract locations across all control flow vertices is the set of variables present in the specimen being analyzed. The map is keyed by CFG vertex IDs.

Definition at line 95 of file DataFlow.h.

Constructor & Destructor Documentation

◆ DataFlow()

Rose::BinaryAnalysis::DataFlow::DataFlow ( const InstructionSemantics::BaseSemantics::DispatcherPtr & userDispatcher )

Constructor.

Constructs a new data-flow analysis framework using the specified user-level instruction semantics. The semantics should consist of an instruction dispatcher appropriate for the architecture being analyzed, and an associated semantic domain that will be used to calculate memory addresses. The SymbolicSemantics domain is typically used.

Member Function Documentation

◆ initDiagnostics()

static void Rose::BinaryAnalysis::DataFlow::initDiagnostics ( )

static

Initialize diagnostics.

This is called by Rose::Diagnostics::initialize.

◆ buildGraph()

Graph Rose::BinaryAnalysis::DataFlow::buildGraph ( const std::vector< SgAsmInstruction * > & )

Compute data-flow.

Computes and returns a graph describing how data-flow occurs for the specified instructions (which must have linear control flow, e.g., from a single basic block). The vertices of the returned graph contain the abstract locations that are referenced (read or written) and the edges indicate the data-flow. The edge values are integers imparting an ordering to the data-flow.

Referenced by buildGraphPerVertex().

◆ buildGraphPerVertex() [1/2]

template<class CFG , class VertexUnpacker >

VertexFlowGraphs Rose::BinaryAnalysis::DataFlow::buildGraphPerVertex	(	const CFG &	cfg,
		size_t	startVertex,
		VertexUnpacker	vertexUnpacker
	)

inline

Compute data-flow per CFG vertex.

This method traverses the control flow graph to discover the abstract locations that are referenced by each vertex and the flow edges between those abstract locations.

The algorithm currently implemented here is to use an aggregate semantic domain consisting of a user-supplied semantic domain in conjunction with a data-flow discovery domain, and to process each CFG vertex one time in some arbitrary depth-first order. This isn't a rigorous analysis, but it is usually able to accurately identify local and global variables, although typically not through pointers or array indexing.

Definition at line 173 of file DataFlow.h.

References buildGraph(), Sawyer::Container::Map< K, T, Cmp, Alloc >::insert(), Rose::StringUtility::plural(), and Sawyer::Container::Map< K, T, Cmp, Alloc >::size().

Referenced by buildGraphPerVertex(), and Rose::BinaryAnalysis::TaintedFlow::computeFlowGraphs().

◆ buildGraphPerVertex() [2/2]

template<class CFG >

VertexFlowGraphs Rose::BinaryAnalysis::DataFlow::buildGraphPerVertex	(	const CFG &	cfg,
		size_t	startVertex
	)

inline

Compute data-flow per CFG vertex.

This method traverses the control flow graph to discover the abstract locations that are referenced by each vertex and the flow edges between those abstract locations.

The algorithm currently implemented here is to use an aggregate semantic domain consisting of a user-supplied semantic domain in conjunction with a data-flow discovery domain, and to process each CFG vertex one time in some arbitrary depth-first order. This isn't a rigorous analysis, but it is usually able to accurately identify local and global variables, although typically not through pointers or array indexing.

Definition at line 203 of file DataFlow.h.

References buildGraphPerVertex().

◆ getUniqueVariables()

VariableList Rose::BinaryAnalysis::DataFlow::getUniqueVariables ( const VertexFlowGraphs & )

Get list of unique variables.

This returns a list of unique variables by computing the union of all variables across the index. Uniqueness is determined by calling AbstractLocation::mustAlias. The variables are returned in no particular order.

Referenced by Rose::BinaryAnalysis::TaintedFlow::computeFlowGraphs(), and Rose::BinaryAnalysis::TaintedFlow::vertexFlowGraphs().

Member Data Documentation

◆ mlog

Sawyer::Message::Facility Rose::BinaryAnalysis::DataFlow::mlog