Friday, January 3, 2014

Tips for Forensic Data Visualization

Notes from Garfinkel's talk at OSDF 2013

Dr. Simson Garfinkel kicked off the 2013 Open Source Digital Forensics Conference with a talk on the use of data-driven visualization in digital forensics. Specifically, Simson focused on three core concepts:
  1. Never Use Pie Charts
  2. Histograms with Cumulative Distribution Function
  3. Build Consistent Data Driven Visualizations

[Figure: Garfinkel, 2013]

Never Use Pie Charts

Garfinkel borrows a line from Stephen Few: "Save the Pies for Dessert!" Pie charts can mislead, whether by accident or by design, and they make numeric comparison difficult because the reader has to judge angles and areas rather than lengths along a common scale.


Histograms with Cumulative Distribution Function

As you may have seen in the PCAP pre-processing tool tcpflow, plotting a cumulative distribution function in line with a bar chart of the major categories quickly helps an analyst understand the context of the data under review. Whether your histogram tracks data transfer over time, by protocol or port, or by IP address, this approach helps you spot "which one of these is not like the other." To illustrate this, below is a histogram presented by Garfinkel:
[Figure: traffic histogram (Garfinkel, 2013)]
In the histogram above we can see lulls in traffic as well as a rapid rise in data transfer about midway through the time period. The majority of the traffic is clearly HTTP. Port 5222 appears in the key, but no bar of that color is visible, so the traffic on that port and its associated protocol is minuscule by comparison.

This focus on clear, concise, and easily digestible context is a win for a field as prone to data overload as digital forensics.
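To make the histogram-plus-CDF idea concrete, here is a minimal Python sketch using only the standard library. It is not how tcpflow builds its report; the function names `histogram_with_cdf` and `render_text`, the bin size, and the sample records are all assumptions made for illustration. It bins byte counts by time and attaches the running cumulative share to each bin, so both lulls and sudden rises are visible in one view.

```python
from collections import Counter
from itertools import accumulate


def histogram_with_cdf(records, bin_seconds):
    """Bin (timestamp, byte_count) records and attach a running cumulative share.

    Returns a list of (bin_start, bytes_in_bin, cumulative_fraction) tuples,
    one per bin from the first to the last observed timestamp, so quiet
    periods show up as explicit zero-byte bins.
    """
    if not records:
        return []
    totals = Counter()
    for timestamp, byte_count in records:
        totals[int(timestamp // bin_seconds) * bin_seconds] += byte_count

    first, last = min(totals), max(totals)
    bin_starts = range(first, last + bin_seconds, bin_seconds)
    counts = [totals.get(start, 0) for start in bin_starts]
    grand_total = sum(counts)
    running = accumulate(counts)
    return [
        (start, count, cum / grand_total if grand_total else 0.0)
        for start, count, cum in zip(bin_starts, counts, running)
    ]


def render_text(rows, width=40):
    """Draw each bin as a bar of '#' characters with the cumulative percentage alongside."""
    peak = max((count for _, count, _ in rows), default=0) or 1
    lines = []
    for start, count, cdf in rows:
        bar = "#" * round(width * count / peak)
        lines.append(f"{start:>6} {bar:<{width}} {cdf:6.1%}")
    return "\n".join(lines)


# Hypothetical sample: (seconds since capture start, bytes transferred)
sample = [(1, 500), (4, 1500), (12, 0), (31, 6000), (33, 2000)]
rows = histogram_with_cdf(sample, bin_seconds=10)
print(render_text(rows))
```

In the text output, a flat stretch in the cumulative column marks a lull, and a large jump marks the bins that account for most of the transfer. The same rows could be handed to a plotting library to draw the bars and the cumulative line on a shared time axis.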

Consistent Data Driven Visualizations

In his presentation, Garfinkel pays homage both to the stalwarts of data visualization (matplotlib and GraphViz) and to the more recent newcomer D3JS, a JavaScript-based group of libraries for creating data-driven documents. While matplotlib and GraphViz are well suited to the automated generation of visualizations, D3JS poses some problems if workflows are not web-browser based.

One of the fundamental concepts in digital forensics is the "repeatability" of a workflow, something that can be difficult to achieve with data-driven visualizations. Graphing libraries such as GraphViz and D3JS use pseudo-random numbers to assist in the initial layout of a graph: nodes are "randomly" placed on the canvas. This random placement can pose substantial problems for the repeatability requirement. One solution is to store the pseudo-random seed with each visualization produced, so that the same layout can be regenerated later.
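A minimal Python sketch of that idea, using only the standard library, is shown below. It does not call GraphViz or D3JS; the `random_layout` function, the canvas size, and the host labels are hypothetical stand-ins for whatever layout step a real tool performs. The point is only the pattern: draw a seed, use a private generator built from it, and save the seed next to the output.

```python
import json
import random
import secrets


def random_layout(nodes, width, height, seed=None):
    """Place nodes at pseudo-random canvas positions and report the seed used.

    If no seed is supplied, a fresh one is drawn and returned so the caller
    can store it next to the rendered graphic.
    """
    if seed is None:
        seed = secrets.randbits(32)
    rng = random.Random(seed)  # private generator; global random state is untouched
    positions = {
        node: (round(rng.uniform(0, width), 2), round(rng.uniform(0, height), 2))
        for node in nodes
    }
    return {"seed": seed, "width": width, "height": height, "positions": positions}


# Hypothetical node labels for illustration
hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

first_run = random_layout(hosts, 800, 600)
record = json.dumps(first_run)  # store this alongside the graphic

# Later: read the seed back and regenerate the identical starting layout.
stored_seed = json.loads(record)["seed"]
second_run = random_layout(hosts, 800, 600, seed=stored_seed)
print(first_run["positions"] == second_run["positions"])
```

For a case file, it is probably worth recording the tool and library versions together with the seed, since a different generator implementation could produce a different layout from the same seed.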

Other Notes

Other information and considerations that may be helpful as you begin building Data Driven Visualizations include:

SVG or other vector output - including PDF.  This allows the graphic to be used in a wide variety of tools and also allows "infinite" zoom for larger and more complex visualizations.
[Figure: Garfinkel, 2013]

A Common Vocabulary - Each professional field has a continuously evolving vocabulary that is used to communicate very specific ideas.  Data-driven visualization is in its infancy, especially in digital forensics.  This is the time to get in early.
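Returning to the note on vector output: SVG is plain XML, so a simple chart can be generated without any plotting library at all. The sketch below is one hedged illustration in Python; the `bar_chart_svg` function, its default dimensions, and the byte counts are hypothetical, and a real workflow would more likely rely on the SVG or PDF export of an established tool.

```python
from xml.sax.saxutils import escape


def bar_chart_svg(data, bar_height=20, gap=6, label_width=80, chart_width=300):
    """Return a horizontal bar chart as an SVG document string.

    `data` is a list of (label, value) pairs; bars are scaled to the largest value.
    """
    peak = max((value for _, value in data), default=0) or 1
    height = len(data) * (bar_height + gap) + gap
    width = label_width + chart_width + gap
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">'
    ]
    for index, (label, value) in enumerate(data):
        y = gap + index * (bar_height + gap)
        bar = chart_width * value / peak
        # Labels are XML-escaped so characters such as '&' or '<' stay well-formed.
        parts.append(
            f'<text x="{label_width - gap}" y="{y + bar_height * 0.75:.1f}" '
            f'text-anchor="end" font-size="12">{escape(str(label))}</text>'
        )
        parts.append(
            f'<rect x="{label_width}" y="{y}" width="{bar:.1f}" '
            f'height="{bar_height}" fill="steelblue"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical byte counts per category
svg = bar_chart_svg([("HTTP", 9200), ("DNS", 310), ("port 5222", 12)])
print(svg)
```

Saving the printed text to a file with an .svg extension should give an image that opens in a browser or vector editor and stays sharp at any zoom level, because the bars are stored as shapes rather than pixels.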

The original presentation slides are linked here:
simson.net/ref/2013/2013-11-05_VizSec.pdf