Node and edge diagrams used to visualize software often become difficult to interpret when they contain more than a few dozen nodes. When a large software system contains hundreds of thousands of entities linked in complex ways, those difficulties are greatly compounded.
This article describes an approach to visualizing large software dependency relationships using scalable and intuitive techniques developed originally for geospatial visualization. This approach allows us to show graphs representing twenty million lines of source code while maintaining a smooth user experience.
The appeal – and limitations – of simple node and edge diagrams
Node and edge diagrams are a common sight on whiteboards when software engineers meet. On a small scale, they succinctly capture dependency relationships (such as the calls relationship) between functions and files. With appropriate layout, such diagrams can quickly give a sense of larger dependency patterns, separating utility code from high level business logic, or highlighting fragile cyclic dependencies.
However, the utility of simple node and edge diagrams diminishes at larger scales; graph-layout algorithms struggle with large graphs, and the dependency edges of a typical project overwhelm a user. Large software projects can have multiple millions of lines of code, containing hundreds of thousands of functions. Unfortunately, such systems are those that need visualization the most, since they are the hardest to understand and monitor using other techniques, and the most costly when problems are missed.
The problem of creating a concise visual representation of dense information is not new; cartographers have been dealing with it in a slightly different form for millennia. Mapmakers select a level of detail based on the scale of the map, and leave out information that will not fit. This approach has been continued in online map and geospatial applications like Google Earth , with the added feature that the user can change the scale interactively (by zooming in or out). The result is an easy-to-use visualization of vast amounts of data. We adapted this approach to visualizing software.
Using mapping methods to visualize large software projects
Our tool gives engineers and managers a canvas for exploring and monitoring software projects. This canvas gives users a sense of the overall structure of the software and facilitates the display of additional data. For example, problematic modules or modules experiencing rapid change can be highlighted and explored.
The tool also supports more detailed investigation when warranted and was explicitly designed to target large systems with millions or tens of millions of lines of code. Currently the tool supports C/C++ programs, though the approach can be applied to other languages.
Learning from Google Earth. We adapted techniques employed by modern geospatial applications like Google Earth and NASA World Wind  to show the dense information that makes up the surface of the Earth. The key techniques are:
- Divide information into levels of detail, where large scale maps (or maps where the virtual camera is at a higher altitude) show less detail than small scale maps.
- Pre-compute the levels of detailso information can be retrieved at various levels of detail quickly when requested later.
- Retrieve information on-demand based on what is on screen and the scale of the current image(or the height of the current camera position).
- Use modern graphics hardwareto allow smooth zooming and panning of complex scenes.
- Include a text-based search capabilityso the user can quickly locate information whose position may not be known.
Step #1: The Map View
The Map view is the tool’s primary mode for viewing large projects. In this view, the software is divided into components that form a hierarchy. The leaf elements of the hierarchy are procedures. Procedures are grouped into components, which can themselves be grouped into supercomponents, and so on.
By default, the tool will infer this hierarchy based on file structure. Procedures are grouped by their containing file, and files are grouped into directories; users can optionally specify an alternate component hierarchy. The complete call graph of a software project is extracted using static analysis.
It is not feasible to display the entire call graph for large graphs, so the tool computes bundled edges that summarize dependencies between files and components. For example, if a procedure read_char() in parser.c calls procedure log_msg() in the logging.c , we add an edge from parser.c to logging.c (Figure 1 ). We also add extension edges linking procedures to the bundled edges that contain their calls. Bundled edges give us scalability at the cost of some ambiguity; for example, in Figure 1, it is not clear if main() or read_char() calls the insert() procedure.
Click on image to enlarge.
The call graph is displayed using a hierarchical node and edge diagram, similar to that of Figure 1. Components are drawn as boxes. The size of a component is proportional to the cumulative number of functions contained within that component.
Edges between sibling components are drawn as arrows or wedges. Components are positioned according to one of several graph layout algorithms, each of which uses edges and node sizes to determine a compact layout. The contents of components are drawn in a similar manner—a component’s subcomponents and procedures are drawn as boxes within the outer component box, and positioned using the same layout algorithm, and edges are drawn as arrows or wedges.
If a procedure calls another procedure that is not in the same subcomponent, we draw an extender edge linking it to the bundled edge between the components that contain the caller and callee. Every node is labeled with a name, usually the name of the procedure, file or directory. In most real projects, the procedure nodes and many interior components are too small to be seen unless the user zooms in to a subsection of the graph. (We exploit this by not rendering boxes if they fall below a certain width on the screen.)
The Map view is intended to be used with controls for easy zooming and panning; the user can ‘dive in’ for details, or ‘back out’ to get an overall picture. When a node is selected but is too small to see, its enclosing component is colored to indicate that it contains a selected node; this allows users to easily locate selected nodes deep in the graph by zooming into the components that are colored. Similarly, colorings indicating other data about small nodes ‘bubble up’ to the smallest visible enclosing component so the user can locate the nodes when zoomed out.
Additional interactive features give access to other information: when a node is selected, information about it is shown in a sidebar, including a source code excerpt; when a bundled edge is selected, the sidebar shows the source and destination and the procedure-to-procedure edges it summarizes. A menu allows the user to select groups of nodes that are connected by a single step or transitively.
Step #2: Zooming in for more detail
The tool offers several other views that allow a user to choose a subgraph and investigate it in more detail. The user can hide nodes that are not of interest from the full map and use the same zooming and panning controls. The user can also switch to views where call graph edges are not bundled (that is, all edges between procedures are shown, even if the procedures are in different components). The tool also supports tree views.
In all the supporting views, the user can show nodes by clicking on the nodes that are already visible. The user can also show nodes by clicking on search engine results. In addition, the tool incorporates multiple tabs, each of which can show a different view of the dependency graph. For example, the user can view the whole Map on one tab, a partial Map on a second tab, and a call tree on the third tab.
The architectural building blocks
The present implementationof our geospatial visualization has three major components: an analysisprocess, a special-purpose HTTP server, and a client. The analysisprocess extracts the call-graph from the source code of the targetproject and sends it to the server as XML. The server component parsesthe XML and populates an SQL database with information about thecall-graph, and serves this data to the client via HTTP. The clientretrieves data from the server and renders it using a Java applet. Thesecomponents are described in more detail below.
Call-graphextraction.We extract the call-graph and other information usingCodeSonar, a commercial static-analysis tool. The visualizationtechnology has been integrated with CodeSonar to make this processseamless.
HTTP server.The server component has two phases: anoffline import phase, where it parses the XML-encoded call-graph andcomputes the bundled and extension edges of the Map view; and a servephase, where it answers queries sent by the client. Whenever possible,information is pre-computed during the import phase and stored tominimize the time required to respond to user actions during the servephase. The call-graph is stored in a SQLite database. In the servephase, a session begins with the server sending an HTML page with anembedded Java applet. Once the applet starts, the server is responsiblefor responding to queries. Query responses are encoded using JSON.
Click on image to enlarge.
Renderingand manipulating large graphs can be time- and memory-intensive. Weavoid many scalability issues by retrieving and rendering graph dataonly when it is required by the position and depth of the virtualcamera.
When the user is zoomed out, the client only needs toretrieve and lay out the top one or two layers of the componenthierarchy. As the user zooms into a subcomponent, the client willretrieve deeper layers of the hierarchy, but only those parts that willappear on screen. We use background threads to handle time-intensivetasks like asynchronous requests to the server and graph layout; thisallows the user to continue working with the tool while new data isretrieved.
Figures 3a-d show the Map view of some components present in an analysis of curl-7.19.7. Figure 3a is a composite image that shows views at three different levels ofdetail; these levels of detail are shown individually in higherresolution in Figures 3b-c. Figure 3b shows the lib and src components at a broad level of detail. Both lib and src contain a hierarchy of files and procedures. Figure 3c shows the file ftp.c, a child of lib, in finer detail. Figure 3d shows the ftp.c file in greater detail still, showing its procedures and their associated calls. (View a video demonstration of the tool at work)
Click on image to enlarge.
Click on image to enlarge.
Click on image to enlarge.
Click on image to enlarge.
We use the JOGL library to access OpenGL from a Java applet. UsingOpenGL allows us to render graphs (with smooth zooming and panning) thatare much larger than we can with the Java 2D drawing packages.
Someof the graph layout duties are handled by the commercial yFileslayoutengine, and some are handled by layout algorithms we developed in house.The client leverages some of the user interface features that come withCodeSonar. We link to and embed CodeSonar’s sophisticated sourcelistings so users can see both textual and graphical representations oftheir program.
Putting geospatial visualization to work
Thetool we’ve developed has been tested on a proprietary project withtwenty million lines of code (MLOC), as well as on several large opensource projects including Wireshark with its X11 dependencies (3.5 MLOC)and Firefox (1.8MLOC).
Extracting the call graph from a largeproject requires negligible extra time during a CodeSonar analysis. On amodern Linux workstation, the import phase for a 20MLOC project takesapproximately 2 hours. Once the import is finished, a user can connectto the server and browse the project in real time with latencies similarto a typical web browsing session.
The primary feature of theMap view is that it shows the entire project in a single picture. Ofcourse, the user does not see all the details at one time, but since thetool supports smooth zooming and panning, it feels like one is seeingeverything, much like Google Earth and NASA Worldwind give one the senseof seeing the whole planet in detail.
This ‘whole world’property makes the Map view a useful canvas for projecting additionallayers of data. For example, we can color zones of the view according tometrics like code complexity or rate of change. Or we can color thosezones of the Map view that contain matches to a search engine query.
TheMap view does have limitations. It requires a project to be organizedinto a component hierarchy, though this can be implicit in the filesystem layout. Also, if the hierarchy is too flat— for example, all theprocedures are in one file—the Map view offers few advantages over othernode and edge visualizations.
In addition, if the hierarchy doesnot group procedures into logical components with limitedinter-component dependencies, the Map view may display a tangled mess ofnodes and edges, though arguably this indicates that the componentstructure should be changed.
Finally, the Map view is not suitedfor studying the relationship between components that are distant fromeach other in the layout. To address this weakness, the tool supportscustom maps where components can be selectively hidden and shown, asdiscussed earlier in this article.
Other ways of visualizingcode. There are several other tools under development elsewhere thattake an explicitly cartographic approach to software visualization.Codemap  visualizes a software project as a sequence of mountainousislands. CodeCity  renders software as a set of columns that resemblebuildings in a city. Steinbrückner and Lewerentz  show softwareevolution as a growing city. CodeCanvas  supports deep zooming in andout of software in a manner similar to modern mapping applications,though they do not use cartographic metaphors for representing thesoftware.
By reusing techniques originally employed to displaylarge geographic data sets, we’ve been able to display large softwareprojects (tens of millions of lines of code) in a single view withsmooth zooming and panning.
Where we’re going next
Werecently extended the visualization tool to overlay information aboutcode metrics and defect density. We are exploring displaying additionalinformation about software projects. When adding new information, thereare two primary (and related) challenges: 1) the technical challenge ofmanaging and collating large datasets so they can be retrieved quickly,and 2) the usability challenge of presenting those datasets to the userso they can be interpreted while dealing with hundreds of thousands ofprocedures.
 Google Earth
 World Wind
A. Kuhn, D. Erni, and O. Nierstrasz, “Embedding spatial softwarevisualization in the IDE: an exploratory study”, in Proceedings ofSOFTVIS 2010, pp 113–122.
 R. Wettel and M. Lanza, “Visualexploration of large-scale system evolution”, in Proceedings of the 15thWorking Conference on Reverse Engineering, 2008, pp. 219 – 228.
F. Steinbrückner, C. Lewerentz. “Representing development history insoftware cities”, In Proceedings of SOFTVIS 2010, pp.193–202.
 R. DeLine, G. Venolia, and K. Rowan, “Software development with code maps”, in Communications of the ACM, vol. 53,.
Paul Anderson is VP of Engineering at GrammaTech. He received his B.Sc. from KingsCollege, University of London, and his Ph.D. in computer science fromCity University, London. Paul manages GrammaTech’s engineering team andis the architect of the company’s static analysis tools. Paul has workedin the software industry for 20 years, with most of his experiencefocused on developing static analysis, automated testing, and programtransformation tools. He can be contacted at firstname.lastname@example.org.
David Cok is Associate Vice President of Technology at GrammaTech. He graduatedwith honors from Calvin College with A.B. degrees in Physics andMathematics. He also earned an A.M. and Ph.D. in Physics from HarvardUniversity. Dr. Cok's career includes research in static analysis withan emphasis on practical application to industrial software development.Before joining GrammaTech, Dr. Cok was a Senior Research Associate andleader of research and product development groups at Eastman KodakCompany.
Michael McDougall is a GrammaTech SeniorScientist. He received a B.Sc. in Mathematics and Computer Science fromMcGill University in 1997. He also graduated with a Ph.D in ComputerScience from the University of Pennsylvania in May 2005. At GrammaTech,he has led and contributed to a variety of research projects in theareas of software engineering and security. He works on tools forfinding and mitigating security flaws and other flaws in software and isthe lead developer on the software visualization project for largesystems, and on improving tools used by NASA for software quality.
John Von Seggern joined GrammaTech as a software engineer in 2007. He received his B.A.in Computer Science at the University of Chicago in 2007. In his time atGrammaTech, John has played an integral role in implementing many ofthe enterprise features of CodeSonar, a tool for detecting bugs andsecurity vulnerabilities in software. These features include CodeSonar'ssearch language, charting wizard, and architecture visualizationplatform.
Ben Fleis joined GrammaTech in 2006 as asoftware engineer and was affiliated with the company until 2010. Hereceived his B.S. in Computer Science from Michigan State University in1998, followed by his M.S. in Computer Science from the University ofMichigan in 2000. In addition to a decade of experience as a computerscientist, Fleis, who is currently a consultant in computer science, isalso a successful designer and maker of elegant furniture.