Analysis/ Findings
The related work reviewed in this paper is limited to static classification approaches. Dynamic approaches examining run-time behavior are not closely related to this paper and are therefore not examined. Automatic extraction of string signatures has been proposed and employed by commercial Antivirus software [3-4]. In these systems, all possible signatures of fixed size are extracted and then culled to eliminate strings that appear in benign samples. However, polymorphic malware make string signatures prone to failure when the byte level content changes due to mutation, recompilation, and source code modification of the malware. Code normalization has been investigated to canonize malware before string scanning [8-9] and improve detection rates of malware variants. In [8], static analysis eliminated superfluous control flow. Additionally, instruction sequences that had no effect were also removed using a decision procedure to perform the analysis. Malware normalization improves detection but does not always effectively canonize a program to a unique form which affects the effectiveness and efficiency of the system. Fingerprinting malware based on opcode distributions has shown to be partly effective [10]. An improved approach was to examine byte sequences and opcode sequences in an n-gram analysis. N-grams and n-perms were proposed to identify similarity between malicious programs to build evolutionary trees [11]. By extracting relevant n-grams as particular features, feature vectors could be compared in order to construct similarity. Machine learning and classification extended these systems to detect unknown malware in [12-13]. These systems improve the effectiveness of static string signatures, but instruction level classification has similar problems when the instruction stream changes to any significant degree. Malware classification using the basic blocks of a program has been investigated in [14]. A basic block represents a sequence of instructions without an intervening control transfer instruction. The edit distance between basic blocks identifies similarity. Existence of a basic block in a malicious sample can be determined using an inverted index or bloom filters. Previous research demonstrated that this is effective at detecting some malware variants, but is not effective when byte and instruction level changes occur.
In the systems that implemented basic block classification, efficiency was not shown to be sufficient for desktop adoption. The ordering of system API calls of a program can be extracted and used for malware classification. [15] Proposed using association mining of API calls to detect unknown malicious programs. The problem of this approach is that code packing can obscure the API calls [16]. An approach to malware classification that is more resistant to byte and instruction level changes is by using control flow as a feature. Combining data flow analysis and control flow analysis was proposed in [17-18]. Annotations were made to the control flow graphs to incorporate abstractions of the instructions and data flow. These annotated flow graphs were compared to signatures, or automata, that describe the malware. If the malware signature is present in the query program, a malware instance can be detected. In [19], value set analysis was used as a specific data flow analysis to construct signatures. Our control flow graph based system is more effective at detecting modified variants because it allows for errors in the signature generation. Interprocedural control flow using the call graphs of a program have been compared to show similarity to existing malware [6, 20-22]. Procedures or nodes in the call graph are fingerprinted and common nodes between two program call graphs are identified. By identifying matching nodes, an approximation is made to the graph edit distance. The edit distance allows similarity to be measured. An approach to transform the interprocedural control flow information into a context free grammar, allowing for homomorphism testing using string equality was also proposed in [23], but this technique does not allow for approximate matches. An alternative approach to using the call graph is using the control flow graphs. Ignoring the edges in the call graph and identifying exact or isomorphic control flow graphs was proposed for a real-time system and shown to be effective [24]. Intraprocedural control flow using the control flow graphs have been compared by decompiling the graph to source code and using the string edit distance [25], but was not as efficient as exact matching. Identifying ismorphisms and the maximum common subgraph of a whole program control flow graph using tree automata was proposed in [26]. A worm detection system fingerprinted control flow using subgraphs of a small and fixed size as a feature in [7]. Our research employs a novel approach to finding the distance between malware based on individual distances or similarity between control flow graphs. We allow for efficiently approximate matching of sets of graphs whereas previous research only provided efficient exact matching. Our research additionally extends the concept of use subgraphs as a feature by incorporating it into larger classification system based representing a set of graphs as feature vectors and consequently using similarity searching. We also demonstrate a more effective feature using the q grams of decompiled control flow graphs.