I’ve been lucky enough to have worked on many interesting and challenging projects throughout my career, especially in the space of detecting “bad” people.
As one can imagine, much focus is placed on detecting persons of interest in the government sector. After all, you can’t ensure national security, for instance, without being able to detect behaviours and profiles of those that mean to create harm.
What you soon realise in this domain is the fact that most of these “baddies” work as part of a network, and detecting those networks is anything but trivial. Picking out one bad person can often be relatively easy, but surfacing the entire network they’re connected to can be a very challenging problem.
To enable detection of such networks, we often must turn to the field of graph analytics, also known as network analysis. It allows us to efficiently solve specific classes of problems that have data arranged in a certain way.
It has far reaching applications, such as in fraud detection, social network analysis, and in physical environments, such as power grids.
Graph analytics is based on a fundamentally important area of mathematics known as Graph Theory.
The Theory of Graphs
In this context, graphs are defined as mathematical structures used to study pairwise relationships between objects and entities.
A graph, \(G\), is defined as follows:
$$G = (V, E)$$
where \(V\) represent vertices, or nodes, and \(E\) represent edges, or lines.
In a standard graph, \(E\) is made up of pairs of elements from \(V\), and these pairs are unordered ie \((v_{1}, v_{2}) = (v_{2}, v_{1})\).
Nodes typically represent people, organisations or cities, and the edges represent relationships between nodes, such as transfer of funds, or person \(A\) is related to person \(B\).
Here’s a simple graph showing the transfer of money between some friends, for instance:
Below is a list of the vertices and edges of the graph:
- \( V = \{John, Sarah, Phil, Mary\}\) ie a list of all the people in the graph
- \( E = \{(John, Sarah), (Sarah, Phil), (Phil, Mary), (Mary, Sarah)\}\) ie a list of all the transaction relationships between all the people
In order to fully appreciate graph theory, it’s worth stepping back in time and. see where it all started…
The Seven Bridges of Königsberg
The city of Königsberg in Prussia, which is now known as Kaliningrad (Russia), was the topic of a famous mathematical puzzle in the early 18th century.
image courtesy of Google Maps
At the time, there were seven bridges connecting two large islands surrounded by the Pregolya river, along with two portions of mainland divided by the river.
Below is a diagrammatic view of the seven bridges:
The problem posed was how could one traverse the city of Königsberg by crossing all seven bridges only once?
Remember, this was during the time before GPS, before the internet even, and even before graph theory existed!
What you’ll soon realise is that it’s impossible to do. It was Leonhard Euler who managed to prove it, and with that proof, graph theory was born.
In order to solve the problem, Euler converted the bridges and graphs into what we now know as a graph:
This representation should now look familiar, with vertices and edges.
To understand how Euler solved the problem, we need to first calculate the degree of each vertex, ie the number of edges connected to each vertex. To do so, we simply count the number of bridges that are connected to each of the portions of land:
He proved that for such a path around the city to exist, two vertices have to be of odd degree, or all the vertices have to be of even degree.
In the Königsberg case, no such path exists as we do not have exactly two vertices of odd degree (we have four) and nor are all the vertices of even degree, as all four vertices are in fact of odd degree.
Further details on Euler graphs can be found here.
The Structure of Graphs
There’s an important distinction between graphs; they can be either directed or undirected.
The distinction lies in whether or not the edges have a direction.
If we consider our first example, of a group of four friends transferring money to one another, then we can see that the graph is in fact directed, as John sending money to Sarah, for instance, doesn’t mean that Sarah is sending any money to John:
Graphs can also be either connected or disconnected:
- Connected: There exists a path between each pair of vertices ie there is no unreachable vertex (The Königsberg graph is connected)
- Disconnected: There does not exist a path between all pairs of vertices ie there is at least one unreachable vertex
A weighted graph is one where each edge is assigned a weight. For instance, weights can represent the value or amount of transactions between people, where high values/numbers can potentially indicate money laundering in the use case of financial crime.
Below are some common definitions:
- If two edges have the same end vertex, then they are parallel
- An edge of the form \( (v, v) \) is a loop
- A simple graph is one that has no parallel edges or loops
- A graph with only one vertex is a trivial graph
- A graph with no edges is said to be empty
- A graph with no vertices is known as a null graph
- Edges are adjacent if they have a common vertex, and vertices are adjacent if they have a common edge
- The degree of a vertex, \(d(v)\), is the number of edges with \(v\) as an end vertex
- A graph is said to be complete if its set of edges contains every possible edge between all vertices
Graph Analytics
Some examples of the types of analysis that can be performed on graphs includes:
- Anomaly detection: patterns or relationships that are ‘out-of-the-norm’
- Clustering: used in k-means, a popular Machine Learning clustering algorithm. It aims to identify groups that share a common theme
- Connectedness: Identifying connections between items, such as two or more people. In logistics, the shortest distance between locations is used to optimise routes. One of the most widely used algorithms is Dijkstra’s Algorithm
- Centrality analysis: finding the most influential person in a particular social network, or more generally, identifying the root cause of events. Variations include:
- Betweenness: The number of times a vertex is present in the shortest path between two other vertices
- Degree: The number of edges connected to a vertex
- Closeness: The average length of the shortest path from a vertex to all other vertices
- Community detection: finding cohorts of people in a social network
But Why?
Graphs provide an efficient way to solve various classes of problems, especially ones that involve interactions and relationships. They have the great advantage of offering an intuitive and visual way to represent and tackle such problems.
Graph databases, such as Neo4J, are important alternatives to traditional relational databases, for storing such information, such as relationships between people.
They also offer computational efficiency for some algorithms when data can be arranged in a graph structure as opposed to tabular format.
One very interesting applications of graph analytics is to the area of Entity Resolution. This involves resolving different representations of the same entity (ie individual, organisation etc) across disparate data holdings. The use of graph analytics enables relational similarity between entities to be used, beyond just string similarity of attributes, that is traditionally used in entity resolution.