Hypothesis Generation Via Node-Link Diagrams
Analyzing large quantities of textual data can be a time-consuming and daunting task. For this project, we pulled in a subset of the 2006 VAST Challenge dataset, which consisted of almost 250 fictitious news stories from the fictitious town of Alderwood. As analysts, our goal was to determine if anything fishy was going on in this town. To that end, we built a tool using D3 that uses a force layout, with entities as nodes, to enable high-level navigation of the document space. The force layout leverages the Gestalt principle of proximity to suggest groups of entities that are likely to have a causal connection or other significant relationship. In this way, the tool also supports hypothesis generation by suggesting clusters of entities that merit further investigation. Access to the raw document text and the timeline then enables further interrogation, so that the information within one’s hypotheses can be accepted or rejected. The visual nature of the force layout also supports random discovery of nodes that might seem interesting to humans but that a machine would not identify as being of interest. For example, what the heck is the FDA investigating in Alderwood?
Client CS8801 Visual Design Analysis
Date Spring 2018
Skills HTML5, CSS, JS
- Graph visualization (node-link diagram)
- Named entities are nodes (Related documents can be explored as a “cloud” around the named entities)
- Node size indicates frequency of appearance in corpus (Number of docs it appears in or number of times it appears)
- Proximity between entities indicates co-occurrence
- We chose to give entities the most visible encoding, instead of documents.
- Entities have more immediate meaning to the user, and there are fewer of them, simplifying the graph.
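A minimal sketch of how this encoding could be derived from the corpus. The document shape (`id`, `entities`), the function names, the sample data, and the square-root radius scale are all assumptions for illustration, not the tool's actual code. Frequency here is taken to be the number of documents an entity appears in.

```javascript
// Hypothetical input: each document carries the named entities extracted from it.
const sampleDocs = [
  { id: "doc-1", entities: ["FDA", "Alderwood"] },
  { id: "doc-2", entities: ["FDA", "John Torch", "Alderwood"] },
  { id: "doc-3", entities: ["Alderwood"] }
];

// Map each entity to the set of document ids that mention it.
// Using a Set means an entity repeated within one document is counted once.
function buildEntityIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const entity of doc.entities) {
      if (!index.has(entity)) index.set(entity, new Set());
      index.get(entity).add(doc.id);
    }
  }
  return index;
}

// Turn the index into graph nodes whose radius grows with document frequency.
// The sqrt scale and both constants are illustrative choices.
function buildNodes(index, minRadius = 4, scale = 3) {
  return Array.from(index, ([entity, docIds]) => ({
    id: entity,
    docCount: docIds.size,
    radius: minRadius + scale * Math.sqrt(docIds.size)
  }));
}

const entityIndex = buildEntityIndex(sampleDocs);
const entityNodes = buildNodes(entityIndex);
```

The same index also answers "which documents mention this entity?", which is the lookup a document "cloud" around a node would need.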
The distance between two entities is calculated from the Jaccard similarity of the sets of documents in which each entity appears. As its name suggests, the Jaccard similarity measures how alike two sets are by comparing their members, in this case documents: it is the number of documents that mention both entities divided by the number of documents that mention at least one of them. Entities that always appear together score 1, and entities that never co-occur score 0.
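The calculation can be sketched as follows. `jaccardSimilarity` is the standard definition; `similarityToDistance` and its pixel bounds are hypothetical, since the exact mapping from similarity to on-screen distance is not stated here.

```javascript
// Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
// Two empty sets are treated as having similarity 0 to avoid dividing by zero.
function jaccardSimilarity(setA, setB) {
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  for (const item of setA) {
    if (setB.has(item)) shared++;
  }
  const unionSize = setA.size + setB.size - shared;
  return shared / unionSize;
}

// Hypothetical mapping: more similar entities get a shorter target distance.
// A linear ramp between two assumed pixel bounds, chosen only for illustration.
function similarityToDistance(similarity, minDistance = 20, maxDistance = 200) {
  return maxDistance - similarity * (maxDistance - minDistance);
}

// Example: document sets for two entities that share two of four documents.
const docsMentioningA = new Set(["doc-1", "doc-2", "doc-3"]);
const docsMentioningB = new Set(["doc-2", "doc-3", "doc-4"]);
const similarityAB = jaccardSimilarity(docsMentioningA, docsMentioningB); // 2 / 4
```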
- The Entity Link Visibility slider controls the visibility of the Jaccard relationships between entities, which are the blue connections between the entities. Links crop up as the slider moves to the right.
- The Label Link Visibility slider controls the visibility of links that show how similar the names of entities are. These are shown in pink. They help in figuring out whether two entities are the same, such as John and John Kerry, although John is also connected to John Torch and the other Johns throughout the corpus. We used these links to get a better idea of how one might go about cleaning up some of the data in the dataset.
- Node Visibility controls how many nodes are visible at a time. Moving the slider to the right brings in entities that are mentioned less frequently; moving it to the left leaves only the most common entities mentioned across the corpus.
- For the Simulation control: each node has a force associated with it (hence the name), which can repel (or attract) other nodes. Links between nodes act like springs that draw them back together. These pushing and pulling forces work on the network over a number of iterations, and eventually the system settles into an equilibrium. The Simulation control lets the user trade off how quickly against how accurately the force diagram gets drawn.
- Node Packing adjusts the force between the nodes.
- Lastly, Outlier Node Distance shows how far the outliers are from the main entities and lets you reel them in a little.
- The tool also includes a timeline of documents, which is linked to the main entity graph. When an entity is double-clicked, the timeline filters to the set of documents in which that entity appears, showing a neat “history”.
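The push-and-pull behavior described for the simulation can be illustrated with a toy relaxation loop. This is a sketch under stated assumptions, not D3's implementation and not the tool's code: `relaxLayout`, its option names, and all default values are hypothetical, and links refer to nodes by `id`.

```javascript
// Toy force relaxation: every pair of nodes repels, and every link pulls
// its two endpoints toward a rest length, like a spring.
function relaxLayout(nodes, links, options = {}) {
  const {
    iterations = 300,      // more iterations: slower, but closer to equilibrium
    repulsion = 500,       // strength of the pairwise push
    springStrength = 0.05, // stiffness of each link
    restLength = 40        // distance at which a link stops pulling
  } = options;

  // Work on copies so the caller's nodes are left untouched.
  const placed = nodes.map(n => ({ id: n.id, x: n.x, y: n.y }));
  const byId = new Map(placed.map(n => [n.id, n]));

  for (let i = 0; i < iterations; i++) {
    const push = new Map(placed.map(n => [n.id, { x: 0, y: 0 }]));

    // Repulsion between every pair, falling off with the square of the distance.
    for (let a = 0; a < placed.length; a++) {
      for (let b = a + 1; b < placed.length; b++) {
        const p = placed[a];
        const q = placed[b];
        const dx = p.x - q.x;
        const dy = p.y - q.y;
        const dist = Math.max(Math.hypot(dx, dy), 1); // clamp to avoid huge forces
        const f = repulsion / (dist * dist);
        push.get(p.id).x += (f * dx) / dist;
        push.get(p.id).y += (f * dy) / dist;
        push.get(q.id).x -= (f * dx) / dist;
        push.get(q.id).y -= (f * dy) / dist;
      }
    }

    // Spring along each link: pulls when stretched, pushes when compressed.
    for (const { source, target } of links) {
      const p = byId.get(source);
      const q = byId.get(target);
      const dx = q.x - p.x;
      const dy = q.y - p.y;
      const dist = Math.max(Math.hypot(dx, dy), 1);
      const f = springStrength * (dist - restLength);
      push.get(source).x += (f * dx) / dist;
      push.get(source).y += (f * dy) / dist;
      push.get(target).x -= (f * dx) / dist;
      push.get(target).y -= (f * dy) / dist;
    }

    // Move every node a small step along its net force.
    for (const n of placed) {
      const f = push.get(n.id);
      n.x += f.x;
      n.y += f.y;
    }
  }
  return placed;
}
```

In this sketch, lowering `iterations` returns a rougher layout sooner and raising it lets the nodes settle further, which is the same speed-versus-accuracy trade-off the simulation control exposes. Linked nodes end up near the rest length, while unlinked nodes are only pushed apart.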