KEMBAR78
ACER An AST-based Call Graph Generator Framework | PDF | Parsing | Class (Computer Programming)
0% found this document useful (0 votes)
14 views6 pages

ACER An AST-based Call Graph Generator Framework

The document introduces ACER, an AST-based call graph generator framework designed to facilitate the development of call graph generators across various programming languages. It emphasizes the advantages of using abstract syntax trees (ASTs) over intermediate representations (IRs) for speed and simplicity, while also detailing the framework's architecture and components. The paper also discusses the taxonomy of call graph generation algorithms and the challenges associated with implementing them.

Uploaded by

joashjohnson2004
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views6 pages

ACER An AST-based Call Graph Generator Framework

The document introduces ACER, an AST-based call graph generator framework designed to facilitate the development of call graph generators across various programming languages. It emphasizes the advantages of using abstract syntax trees (ASTs) over intermediate representations (IRs) for speed and simplicity, while also detailing the framework's architecture and components. The paper also discusses the taxonomy of call graph generation algorithms and the challenges associated with implementing them.

Uploaded by

joashjohnson2004
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM)

2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM) | 979-8-3503-0506-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/SCAM59687.2023.00035

ACER: An AST-based Call Graph Generator


Framework
Andrew Chen Yanfu Yan Denys Poshyvanyk
Computer Science Department Computer Science Department Computer Science Department
William & Mary William & Mary William & Mary
Williamsburg, USA Williamsburg, USA Williamsburg, USA
achen08@wm.edu yyan09@wm.edu dposhyvanyk@wm.edu

Abstract—We introduce ACER, an AST-based call graph enclosing classes, name, and argument types are all taken into
generator framework. ACER leverages tree-sitter to interface account.
with any language. We opted to focus on generators that operate
A call site’s left hand side (if it exists) is referred to as the
on abstract syntax trees (ASTs) due to their speed and simplicitly
in certain scenarios; however, a fully quantified intermediate receiver. In b.bar(), the receiver is b. Call containers are
representation usually provides far better information at the cost nodes that lexically contain call sites. There are two call con-
of requiring compilation. To evaluate our framework, we created tainers in the figure: class Foo and void method1(Bar
two context-insensitive Java generators and compared them to a).
existing open-source Java generators.
Code: https://github.com/WM-SEMERU/ACER Call graph generators vary mainly by the following three
Index Terms—Call Graph, Static Analysis, Software Frame- parameters: algorithm, source format, and scope of analysis.
works We list examples in Table I.

I. I NTRODUCTION Paramters Examples


Long-standing research endeavors in static call graph gen- Algorithm NR, CHA, SCHA, RTA, k-CFA
eration have made significant contributions to the field of Scope of Analysis Application-only, Library-only, In-
program analysis over the years. Call graphs can be used between, Full software
in a wide variety of areas: from critical tasks like compiler Source Format Raw source, AST, IR
optimization [1] and detecting security vulnerabilities [2], to TABLE I: Generator parameters and examples.
useful applications such as code profiling [3], refactoring [4],
and navigation [5]. Formally, a call graph is defined as a
Generators first differ drastically by their algorithm. Grover
directed graph G = (V, E) wherein each vertex v ∈ V
et al. [6] established a taxonomy for call graph algorithms over
represents a method, and an edge e = (v1 , v2 ) ∈ E signifies
twenty years ago, which Tip et al. [7] broadly categorized into
that the method v1 invokes the method v2 within its body.
three levels:
1) Simple and few-passed, such as Name-based Resolu-
class Bar { tion (NR) and Class Hierarchy Analysis (CHA). NR
void bar() {} disregards receivers and resolves call sites using solely
} the method name. CHA, on the other hand, considers
public class Foo { receivers and even their subtypes to support polymor-
Bar b = new Bar(); phism.
void method1(Bar b) { 2) Context-insensitive but incorporates some points-to anal-
b.bar(); ysis. Examples include Variable Type Analysis (VTA)
} and Andersen-style Pointer Analysis (APA) [8].
} 3) Context-sensitive, exemplified by k-CFA (k-Control
Flow Analysis), where k means to utilize the last k call
Fig. 1: For jargon demonstration purposes. container contexts.

In call graph terminology, an invocation is known as a Call graph generators also vary in their scope of analysis, the
call site. There exist two call sites in Figure 1: new Bar() part of the full software analyzed. By full software, we mean
and b.bar(). The main operation of call graph generators the application source plus its library dependencies. There are
lies in call site resolution — the process of identifying the four levels of scope of analysis (from narrow to broad):
fully quantified method(s) a call site corresponds to. In Java, 1) Application-only: Only includes edges between applica-
a method is considered fully quantified when its package, tion methods.

979-8-3503-0506-7/23/$31.00 ©2023 IEEE


254
DOI 10.1109/SCAM59687.2023.00035
Authorized licensed use limited to: Fr C Rodrigues Institute of Technology Vashi. Downloaded on August 25,2025 at 09:05:04 UTC from IEEE Xplore. Restrictions apply.
2) Library-only: While libraries by themselves could just erators do not impose. Further, IR compilation may com-
be regarded as applications, Reif et al. introduced al- pute information irrelevant to generators. Consequently,
gorithms dedicated to libraries [9]. The key distinction AST-based generators theoretically perform faster.
is that libraries could be analyzed from the user’s 2) Generalizable: Most languages can be parsed to an
perspective and assume certain private methods to be AST while only some statically-typed languages have
closed off. This constructs a smaller set of entry points compilers that output rich, fully-resolved IRs.
and thus a more precise graph. 3) Synergizes with the application-only scope: If the goal
3) In-between: This contains application-only edges and is to generate application-only edges, there’s no need
library methods invoked in application code [10]. to consider the dependencies. AST-based generators
4) Full-software: This contains in-between edges and in- complement this by operating directly on ASTs from ap-
cludes library methods invoked in library code. plication source. Conversely, IR-based generators must
Lastly, generators differ by their source format. Source- take into account the dependencies, as most IRs can only
based generators use the raw source files while AST-based be produced when the entire software is compilable.
generators have access to the abstract syntax trees. The choice We have outlined the key generator parameters and provided
of algorithm is tied with the source format — source-based reasons for using AST-based generators. We now turn to the
generators are likely to implement the simplest NR algorithm focus of the paper — how to construct AST-based generators.
because extracting information from unstructured raw code The first piece of our approach is the powerful tree-sitter
is hard and unscalable. Examples of source-based generators [22] library. Tree-sitter is both a parser generator tool and
include the internal utility within Doxygen [11] and this Perl- an incremental parsing library. It provides parsers for all
based multilingual generator [12]. Examples of AST-based popular languages and a common interface to interact with
generators are the popular Python generators [13] [14], which the parsers and their concrete syntax trees outputs. Though
operate on the outputs of the ast and symtable modules. the common interface is written in C, tree-sitter provides
Additionally, there exist IR-based generators, which operate additional language-bindings — we use the Python binding.
on some internal representation (IR) created during compi- The second piece of our approach consists of language-
lation. Typically, only statically-typed languages have IRs. agnostic components and utilities. These abstract the shared
IRs deliver information that neither raw sources nor ASTs logic common to generators across various algorithms, scopes,
directly provide — the most important data being fully- and languages. These components are detailed in section III.
qualified functions. Since a call graph’s vertices are just fully- Together, these two pieces form the backbone of our frame-
resolved functions, the construction of a sound call graph from work, ACER, designed to simplify the development of call
IR is simple (but not trivial due to polymorphism) because IR graph generators.
has already resolved the functions. AST-based generators, on Here are our main contributions:
the other hand, need to implement resolution logic. 1) Introduced ACER, an AST-based Callgraph GEnerator
Examples of IR include JVM bytecode [15] and the fully- FRamework.
resolved syntax tree outputted by Clang-based compilers dur- 2) Provided a taxonomy for generator design choices and
ing semantic analysis [16]. We consider Clang’s rich tree to document implementational challenges.
be IR and no longer just an AST. Tools that generate IR 3) Developed and evaluated a single NR and a single SCHA
usually require the source to be compilable — a program’s (Simple CHA) AST-based Java generator using ACER.
dependencies thus must be pulled in and built. Ideal and
error-resilient tools like Roslyn [17] make exceptions — II. BACKGROUND AND R ELATED WORK
they can generate the IR without building the dependencies In the last section, we introduced the main call graph
because they resolve all resolvable entities and simply flag parameters: algorithms, source formats, and scope of analysis.
the unresolvable ones. On the other hand, most tools like In addition to these main traits, there exist subtle details like
Clang AST and GCC [18] either throw or remove unresolvable reachability. In this section, we expand on all parameters and
entities. discuss the challenges in implementing them.
As an aside, languages that can compile to JVM bytecode
can get call graph generators “for free” through bytecode A. Algorithm
analysis frameworks WALA [19] and Soot [20]. However, An advanced call graph algorithm builds upon a simpler one
the implementation details matter greatly — JVM-hosted im- by either pre-caching supplementary structures or monitoring
plementations of Groovy, Clojure, Python, and Ruby produce additional runtime data. High recall and soundness is easier
very unsound call graphs due to dynamic translation schemes to achieve while high precision demands significantly more
[21]. computational power [7].
Although IRs contain more useful data than ASTs, we still As previously mentioned, the source format is tied inti-
build our call graph framework based on AST for the following mately with the algorithm. The Java bytecode IR, if interpreted
reasons: directly, encourages an algorithm that is more precise than NR
1) Compile-free: Most IRs can only be generated if the but less sound than CHA. Surprisingly, four out of the six most
source is compilable, a restriction that AST-based gen- popular Java generators use this direct interpretation and thus

255

Authorized licensed use limited to: Fr C Rodrigues Institute of Technology Vashi. Downloaded on August 25,2025 at 09:05:04 UTC from IEEE Xplore. Restrictions apply.
do not fully support CHA. We illustrate the difficulty of fully D. Ambiguity Problems
supporting CHA in Figure 2. Application-only, AST-based generators face ambiguity
problems. Because IRs come from compilation, they have
class A { void method() {}; } fully-resolved all entities. But, application-only AST-based
class B extends A {} generators can never fully resolve because they do not have
class C extends B {} access to the library sources. We illustrate this ambiguity in
Figure 3. For a reasonable implementation, methods in the
class Bar {
void foo(A a) { // Could be A, B, C import java.util.*;
a.method();
} class Bar {
} int add(int a, int b) {...}
int add(float a, float b) {...}
Fig. 2: Liskov’s substitution principle [23] in play. void foo(List<?> l1, List<?> l2) {
int sum = add(l1.size(), l2.size());
The call site a.method(), in JVM bytecode, is }
represented as 1: invokevirtual #2 // Method }
A.method:()V. A simple, direct generator deduces
the edge from this information alone, thereby including Fig. 3: Without library analysis, we do not know the type of
only the edge from Bar.foo → A.method. However, l1.size() and l2.size(). So, we can at best return all
polymorphism also allows objects of type B and C to be methods named add with two arguments.
passed into foo. Thus, sound analysis must include the
additional edges Bar.foo → B.method and Bar.foo → figure should only aim to be identified by their package, class,
C.method. method names, and the number of arguments, which creates
To support CHA, the generator must pre-cache a class unnecessary edges.
hierarchy. After the type of a receiver is calculated, the In-between scoped and AST-based generators like PyCG do
generator considers the receiver to inhabit any subtype of the not face this problem. PyCG uses importlib to attempt to
type calculated. To support VTA and k-CFA, the generator also resolve library definitions.
need to consider intraprocedural assignments and data flow.
III. T OOL D ESCRIPTION AND I NTERNALS
B. Handling Language Features
Practical call graph generators abiding by the same al-
gorithm and scope of analysis can still differ due to the
attention to details [24]. Some Java generators might not regard
static initializer blocks as call containers, or disregard default
interface methods. Conversely, some Python generators may
not handle higher-order functions and iterators. In the last
subsection, we also touched on the importance of handling Fig. 4: Call graph Generation Flow. Preprocessor and Gener-
polymorphism, which is addressed by building and using a ator are user-implemented.
class hierarchy.
In this section, we describe the two essential classes of
C. Scope of Analysis ACER: Preprocessor and Generator. We also show how
Earlier, we mentioned that AST-based generators synergize ACER can be used to implement a hypothetical, application-
with application-only scope. However, for broader scopes only, AST-based, and fully-typed CHA Java generator. Gen-
starting from in-between, IRs are preferred because most IRs erators built using ACER are by default confined to ASTs
compile the libraries by default. for their source format and initiate their scope of analysis at
As a side note, what Karim et al. [10] regards as the application-only level. The algorithm, reachability, and the
“application-only” is what we consider as “in-between”. Their treatment of language features can be freely explored.
framework, CGC, parses just the structural information of the Users create new generators by extending the Preprocessor
libraries to efficiently include edges from application to library and the Generator and implementing a handful of abstract
and vice versa. methods. Internally, we make use of strict type hints and
An additional parameter in the realm of scope is reachabil- generics to guide implementation.
ity. The most sound analysis considers all methods as potential
entry points to handle potential multi-threading [9]. Precise Preprocessor
generators may consider a smaller set of entry points (the The sole purpose of the Preprocessor is to cache lookup
singleton set of just the main method). structures that the Generator utilizes. In this section, we first

256

Authorized licensed use limited to: Fr C Rodrigues Institute of Technology Vashi. Downloaded on August 25,2025 at 09:05:04 UTC from IEEE Xplore. Restrictions apply.
describe the inputs and outputs of the Preprocessor. Then, can narrow down these entry points; for example, they can
we give a high level description of our CHA generator’s filter out methods whose names are not “main”.
Preprocessor implementation. We now turn to explaining the generation flow, summarized
Preprocessor first loads the tree-sitter libraries: The C run- by the pseudocode in algorithm 1:
time and the input, language-specific parser. Preprocessor then
parses the input source files into tree-sitter trees. These trees Algorithm 1: Generation Flow
are composed of tree-sitter Nodes, which contain information Input: Preprocess Results P , Entry Points EP
like node type, text, and pointers to parent and children. Data: AnalysisDeque
Provided with the roots of these trees, the Preprocessor then Output: Call graph G(V, E)
produces its outputs. These outputs are the desired lookup 1 initializeDeque(EP )
structures. The Preprocessor class has two abstract methods, 2 visited = empty set()
which mandates the constructions of the two following struc- 3 while AnalysisDeque.not empty() do
tures: 4 if call site.id ∈ visited then
1) method_dict: A mapping between unique 5 continue
method keys to their method bodies (tree- 6 end
sitter Node objects). Our Java CHA generator 7 visited.add(call site.id)
creates the method_dict by first locating all 8 next contexts = resolve(context, call site)
nodes of types method_declaration and 9 foreach context ∈ next contexts do
constructor_declaration. From these nodes, 10 call container = context.call container
the generator navigates upwards to locate their enclosing 11 call sites =
class and package for full resolution. seek call sites(call containers)
2) unique_dict: A cache between non-unique keys to 12 foreach call site ∈ call sites do
unique keys. The unique_dict exists due to the 13 AnalysisDeque.add((context, call site))
application-only ambiguity problem described in sub- 14 end
section II-D: Call sites can not always resolve their 15 end
argument types and should instead be identified by their 16 end
argument count, but this introduces plurality which we 17 return G = (V, E)
keep track of through unique_dict.
While every Preprocessor must cache method_dict and We clarify a few aspects. First, similar to Preproces-
unique_dict, advanced algorithms require more structures sor, Generator is also an abstract class. It requires two
to be cached. Our Java CHA Preprocessor builds two addi- abstract methods to be implemented: seek_call_sites
tional structures: and resolve. seek_call_sites finds new call sites
1) package_importables: Contains the mapping to be added to the analysis from call containers. resolve
from package to its exports (e.g., fully quantified classes, determines the fully quantified function(s) that a call site
interfaces, and enums). This aids full type resolution. corresponds to.
2) class_cache: Contains the mapping from each class The generation algorithm keeps the following two runtime
to its fields, subclasses, and method signatures. The structures:
cached fields and method signatures help resolve com- 1) AnalysisDeque: An explicit deque is used to keep
plex receivers. The cached subclasses mappings effec- track of the call sites to resolve. By default, the algo-
tively form a class hierarchy, which the generator uses rithm automatically adds call sites contained by meth-
to handle polymorphism. ods. This explicitness, however, allows users to add call
Both of these structures are created in a similar manner to sites contained by non-method call containers.
how method_dict was created — a parallelizable operation 2) context: Caches data-flow and points-to analysis re-
that runs on each file. All of these structures are then passed sults required by VTA and k-CFA.
to the Generator. Our CHA generator must implement seek_call_sites
and resolve. Our generator seeks call sites by
Generator returning all instances of method_invocation and
The Generator creates the call graph from the specified object_creation_expression nodes within call
source files. Similar to the previous section, we first describe containers.
the I/O behavior of the Generator. Then, we discuss the Implementing resolve requires more sophistication. We
internal flow of the Generator before concluding with how show just the logic to resolve a method_invocation
we can implement the hypothetical CHA Generator. node, which looks like receiver.method(...). The
The inputs to the Generator are the source files, entry receiver may be of any expression. In JDK8, we found that
points, and the Preprocessor results. By default, the entry implicit/explicit this and identifiers make up 88% of the re-
points are all methods (the keys of method_dict). Users ceivers while field_access and method_invocation

257

Authorized licensed use limited to: Fr C Rodrigues Institute of Technology Vashi. Downloaded on August 25,2025 at 09:05:04 UTC from IEEE Xplore. Restrictions apply.
Soot OSA SPOON JCG WALA JDT NR NR-ALL SCHA SCHA-ALL
Soot 34574 44.6% 41.3% 42.4% 16.4% 44.0% 59.8% 69.5% 1.7% 25.5%
OSA 45.8% 33685 45.5% 51.1% 12.8% 56.5% 56.4% 65.0% 1.4% 25.5%
SPOON 87.7% 94.3% 16263 82.7% 27.0% 92.2% 64.6% 76.6% 2.5% 40.1%
JCG 64.3% 75.4% 58.9% 22834 17.8% 74.4% 53.8% 62.1% 2.1% 35.7%
WALA 100.0% 75.8% 77.4% 71.5% 5668 71.3% 66.2% 73.3% 6.1% 41.1%
JDT 78.1% 97.8% 77.1% 87.3% 20.8% 19463 70.5% 81.9% 2.1% 33.6%
NR 5.4% 4.9% 2.7% 3.2% 1.0% 3.6% 384315 100.0% 0.2% 3.5%
NR-ALL 5.5% 5.0% 2.9% 3.2% 1.0% 3.7% 88.0% 436726 0.2% 3.4%
SCHA 18.0% 14.1% 12.5% 14.6% 10.7% 12.5% 23.6% 24.4% 3242 100.0%
SCHA-ALL 14.5% 14.2% 10.8% 13.4% 3.8% 10.8% 21.9% 24.7% 5.3% 60677

TABLE II: Common calls of the ArgoUML project between 6 existing generators and 4 of our generators.

nodes only made up 8%. A call site whose receiver is of type measure our speed performances against existing tools, we
field_access will look like a.b.c, where the receiver is cannot directly compare the call graph generation times. Speed
a.b.; a call site whose receiver is a method invocation node depends on multiple factors, the most important being the
looks like a.b().c, where the receiver is a.b(). preciseness of the algorithm. Given that the algorithms of
In all of these scenarios, we must first resolve the full type the evaluated generators are quite different, we first present
of the identifier a. We implement this by walking upwards edge comparisons across all tools. This establishes a baseline
from the identifier to find where it was introduced. We scan level of precision that serves to validate the more important
the declaration statements, method arguments, and class fields. measurements.
However, we can only expect to retrieve the shorthand type We borrow the edge results of six open-source Java gener-
from these declarations because Java supports type aliasing. ators from Jász et al. [24]. They evaluated existing tools on
To resolve the full type, we build a list of possible alias to ArgoUML, a sufficiently large, UML diagramming application
full type mappings by analyzing the import statements and written in Java. They thoughtfully performed full-software, in-
utilizing package_importables. Then, we determine the between, and application-only analysis. Because our genera-
full type by querying the aliased type against the built list. tors are application-only, we care only for the application-only
If the query yields no results, the aliased type is probably findings, which are shown in Table II.
defined by the libraries, which the application-only scope The cells along the diagonal represents the number of edges
ignores. Once the receiver’s full type is resolved, we address the corresponding generator created. The other cells represents
polymorphism by identifying the type’s subclasses through a the ratio of the edges discovered by the row generator that were
query to class_cache. also discovered by the column generator. For example, 44.6%
With the full type of the identifier a determined, call sites of Soot’s edges were also discovered by OSA. The last four
like a.b() can be easily resolved. To resolve scenarios like rows and columns show the edge performances of our tool.
a.b.c() and a.b().c, we just have to utilize the lookup The “ALL” suffix signifies that a generator considers the entry
structures generated by the Preprocessor a bit more. points to be the set of all methods.
We have covered the high level overview of building a As expected, the NR generators have high recall but low
Java CHA generator with ACER. Evidently, much of the user precision — they contain, on average, 52.1% of the edges
implemented logic deals with type resolution, which comes that other tools generated. But other tools, on average, only
with freely with IR. contain 3% of the edges NR generated. Conversely, the SCHA
IV. E XAMPLES AND E VALUATION generators have higher precision but lower recall.
Generating more edges seems to correlate with a faster
Under ACER, we implemented a NR generator and a SCHA
and sounder algorithm while generating less edges seem to
generator. We elaborate on our SCHA algorithm.
imply a slower and more precise algorithm. This observation
SCHA (Simple Class Hierarchy Analysis) is indeed true for our sound NR tool, which had the most
Our SCHA generator is simple in comparison to the hy- edges, and WALA, which ran an approximation of the APA
pothetical CHA generator of section III. First, our generator algorithm. In reality, this is not always the case. The difference
uses aliases, so collisions will occur if two methods belong in choosing which language features to handle drastically
to classes of the same alias and have the same name and change the edges. In the case the SCHA tool, it generated the
number of arguments. Second, only call sites with a simple least edges due to its simplicity (e.g., only resolving call sites
identifier receiver like a.b() and those with no callers are with identifier receivers) and not its algorithm rigor (which
considered. Call sites with complex receivers are ignored (e.g., falls behind WALA’s APA and Soot’s CHA).
(builder.make_class()).method(). We admit that further detailed analysis on why the edges
differ should be conducted. But, with this anchor set, we now
Evaluation turn to discuss speed performances.
Our evaluation goals revolve around measuring framework- We evaluated the time it took for JCG, WALA (in CHA
centric metrics, e.g., speed and lines-of-code. However, to mode), NR-all, and SCHA-all to operate on JDK8’s jar, on

258

Authorized licensed use limited to: Fr C Rodrigues Institute of Technology Vashi. Downloaded on August 25,2025 at 09:05:04 UTC from IEEE Xplore. Restrictions apply.
Generator Preprocessing Time Generation Time 2022 for contributing to the initial version of the project. This
NR 00:32 01:23 research has been supported in part by the NSF CCF-2311469,
SCHA 00:59 02:29 CNS-2132281, CCF-2007246, and CCF-1955853. We also
JCG 20:00 00:07 acknowledge support from Cisco Systems. Any opinions,
WALA 20:00 01:50 findings, and conclusions expressed herein are the authors’
and do not necessarily reflect those of the sponsors.
TABLE III: Generator speeds in mm:ss
R EFERENCES
[1] S. Muchnick, Advanced Compiler Design and Implementation. San
a machine with a Intel® Core™ i7-11800H processor. We Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998.
[2] V. Prokhorenko, K.-K. R. Choo, and H. Ashman, “Web application
zoomed in on JCG because it most closely resembles NR-ALL protection techniques,” vol. 60, no. C, p. 95–112, jan 2016. [Online].
in terms of straightforwardness and WALA (in CHA mode Available: https://doi.org/10.1016/j.jnca.2015.11.017
instead of APA) because it most closely resembles SCHA. At [3] J. M. Spivey, “Fast, accurate call graph profiling,” Softw: Pract. Exper.,
vol. 34, pp. 249–264, 2004.
last, we consider all methods as entry points because JDK8 is [4] A. Feldthaus, T. Millstein, A. Møller, M. Schäfer, and F. Tip,
a library and does not have main methods. “Tool-supported refactoring for javascript,” SIGPLAN Not., vol. 46,
Per the table, other tools outperform our tools in terms of no. 10, p. 119–138, oct 2011. [Online]. Available: https://doi.org/10.
1145/2076021.2048078
generation speed. But, this can be reasonably attributed to the [5] A. Feldthaus, M. Schäfer, M. Sridharan, J. Dolby, and F. Tip, “Efficient
fact that IR has already fully quantified all entities. For a fair construction of approximate call graphs for javascript ide services,” 2013
comparison, we must consider the preprocessing time it took 35th International Conference on Software Engineering (ICSE), 2013.
[6] D. Grove and C. Chambers, “A framework for call graph construction
to build the heavy JDK8 jar, which takes 20 minutes. For our algorithms,” ACM Trans. Program. Lang. Syst., vol. 23, no. 6, p.
tools, we attribute preprocessing time to the Preprocessor and 685–746, nov 2001. [Online]. Available: https://doi.org/10.1145/506315.
generation time to the Generator. With the preprocessing time 506316
[7] F. Tip and J. Palsberg, “scalable propagation-based call graph construc-
considered, our tools perform much faster. tion algorithms,” SIGPLAN Not., vol. 35, pp. 281–293, 2000.
At last, we note that the framework itself, including utilities, [8] L. O. Andersen and P. Lee, “Program analysis and specialization
contains 800 lines of code in Python. The NR generator was for the c programming language,” 2005. [Online]. Available: https:
//api.semanticscholar.org/CorpusID:20876553
written in only 100 lines and the SCHA took 300 lines. [9] M. Reif, M. Eichberg, B. Hermann, J. Lerch, and M. Mezini, “Call
The additional lines in SCHA went towards generating class graph construction for java libraries,” Proceedings of the 2016 24th
hierarchy and resolving the full type of identifiers. We actually ACM SIGSOFT International Symposium on Foundations of Software
Engineering, 2016.
wrote two incomplete CHA generators for Java and Python [10] K. Ali and O. Lhoták, “application-only call graph construction,” pp.
that attempted to resolve all expression types. They came 688–712, 2012.
out to around 3K lines — most lines going towards type [11] “Doxygen,” accessed: 2023-07-05. [Online]. Available: http://www.
doxygen.nl/
resolution. Thus, we plan to research further in designing [12] A. F. N. Koknat, “callgraph,” 2023, gitHub repository. [Online].
language-agnostic abstractions for assisting entity resolution. Available: https://github.com/koknat/callGraph
As mentioned before, the ideal solution is a source format that [13] D. Fraser, “Pyan,” 2021, gitHub repository. [Online]. Available:
https://github.com/davidfraser/pyan
is fully quantified wherever possible, resilient to errors, and [14] V. Salis, T. Sotiropoulos, P. Louridas, D. Spinellis, and D. Mitropoulos,
does not require full compilation and dependency building. “Pycg: practical call graph generation in python,” 2021 IEEE/ACM 43rd
International Conference on Software Engineering (ICSE), 2021.
V. C ONCLUSION AND F UTURE W ORK [15] T. Lindholm, F. Yellin, G. Bracha, and A. Buckley, The Java
Virtual Machine Specification, Java SE 8 Edition. Pearson
In this paper, we presented ACER, an AST-based call graph Education, 2014. [Online]. Available: https://books.google.com/books?
generator framework. First, we introduced the parameters of id=0o-AAwAAQBAJ
call graph generators and highlighted implementation chal- [16] L. Project. (2023) Clang: a c language family frontend for llvm.
[Online]. Available: https://clang.llvm.org/
lenges. Then, we discussed the process in building a hypo- [17] M. Corporation. (2023) Roslyn (.net compiler platform). [Online].
thetical CHA generator under ACER. At last, we evaluated Available: https://github.com/dotnet/roslyn
our Java NR and SCHA generators against existing, IR-based [18] F. S. Foundation. (2023) Gnu compiler collection. [Online]. Available:
https://gcc.gnu.org/
Java generators. [19] IBM Corporation, “WALA - T.J. Watson libraries for analysis,” https:
Here is a list of items we plan to address in our future work: //github.com/wala/WALA, 2019.
1) Conduct an in-depth analysis on edge outputs of differ- [20] Sable Research Group, “Soot: A java bytecode optimization framework,”
https://soot-oss.github.io/soot/, 2023.
ent generators on JDK8. Build a table similar to Table II [21] K. Ali, X. Lai, Z. Luo, O. Lhotak, J. Dolby, and F. Tip, “A study of
and attribute edges variations to specific design choices. call graph construction for jvm-hosted languages,” Ieee Transactions on
2) Build abstractions to assist name resolution and full Software Engineering, vol. 47, pp. 2644–2666, 2021.
[22] T. sitter Contributors, “Tree-sitter: An incremental parsing system
quantification. for programming tools,” 2021. [Online]. Available: https://github.com/
3) Build generators for other languages to demonstrate the tree-sitter/tree-sitter
multilinguality of the framework. [23] B. H. Liskov and J. M. Wing, “A behavioral notion of subtyping,”
ACM Trans. Program. Lang. Syst., vol. 16, no. 6, p. 1811–1841, nov
1994. [Online]. Available: https://doi.org/10.1145/197320.197383
VI. ACKNOWLEDGEMENTS [24] J. Jász, I. Siket, E. Pengő, Z. Ságodi, and R. Ferenc, “systematic
We would like to thank the CSci 435 students Vinny Allegra, comparison of six open-source java call graph construction tools,” 2019.
Daniel Lee, Aamir Mohammed, and Chas Rinne from Fall

259

Authorized licensed use limited to: Fr C Rodrigues Institute of Technology Vashi. Downloaded on August 25,2025 at 09:05:04 UTC from IEEE Xplore. Restrictions apply.

You might also like