Visualization with Sankey Diagram – Video Analytics – News Couple

This article was published as part of the Data Science Blogathon.

Introduction to Sankey diagram for data visualization

Oftentimes, we need to visualize how data flows between entities. For example, take the case of residents migrating from one country to another within the United Kingdom. Here, it would be interesting to see how many residents migrated from England to, say, Northern Ireland, Scotland and Wales.

From this Sankey diagram, it is clear that more residents migrated from England to Wales than to Scotland or Northern Ireland.

What is a Sankey diagram?

Sankey diagrams usually depict the flow of data from one entity (or node) to another.

The entity from or to which the data flows is referred to as a node. The node where the flow originates is the source node (like England on the left side) and the node where the flow ends is the target node (like Wales on the right side). The source and target nodes are often represented as rectangles with a label.

The flow itself is represented by a straight or curved path called a link. The width of the link is proportional to the quantity of the flow. In the above example, the flow (i.e. population migration) from England to Wales is wider than the flow from England to Scotland or Northern Ireland, indicating that more people migrated to Wales than to the other countries.

Sankey diagrams can be used to represent the flow of energy, money, costs, and anything that has a concept of flow.

Minard’s classic diagram of Napoleon’s invasion of Russia is perhaps the most famous example of a Sankey diagram. It shows very effectively how the French army progressed (or dwindled?) on its way to Russia and back.

Now, let’s see how we can use Python’s Plotly library to draw a Sankey diagram.

How to draw a Sankey diagram?

To plot a Sankey diagram, let’s use the 2021 Olympics dataset. This dataset contains details about medal counts: country, total medals, and the breakdown across gold, silver and bronze medals. Let’s draw a Sankey diagram to understand how many gold, silver and bronze medals a country has won.

```
import pandas as pd

# Load the medal table and inspect its columns
df_medals = pd.read_excel("data/Medals.xlsx")
print(df_medals.info())
# Give the columns descriptive names, then drop the ones we do not need
df_medals.rename(columns={'Team/NOC': 'Country', 'Total': 'Total Medals', 'Gold': 'Gold Medals', 'Silver': 'Silver Medals', 'Bronze': 'Bronze Medals'}, inplace=True)
df_medals.drop(columns=['Unnamed: 7', 'Unnamed: 8', 'Rank by Total'], inplace=True)

df_medals
```

Basic plot

We will use Plotly’s go interface Sankey, which takes two parameters, node and link (the nodes and the links).

Note that all nodes, both source and target, must have unique identifiers.

In this case,

• The source will be the country. Let’s take the first three countries (the USA, China and Japan) as source nodes and mark them with the following (unique) identifiers, labels and colors:
    • 0: USA: green
    • 1: People’s Republic of China: blue
    • 2: Japan: orange
• The target will be the medal type: gold, silver or bronze. Let’s mark these target nodes with the following (unique) identifiers, labels and colors:
    • 3: Gold: gold
    • 4: Silver: silver
    • 5: Bronze: brown
• The link (between source and target nodes) will be the number of medals of each type. Three links originate from each source, one ending at each target (Gold, Silver and Bronze), so we will have a total of 9 links. The width of each link should be the corresponding number of gold, silver or bronze medals. Let’s characterize these links with the following source-to-target pairs and values:
    • 0 (USA) to 3, 4, 5: 39, 41, 33
    • 1 (China) to 3, 4, 5: 38, 32, 18
    • 2 (Japan) to 3, 4, 5: 27, 14, 17
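
The numbering scheme above can also be generated instead of typed by hand. Here is a minimal sketch in plain Python, assuming the medal counts are held in a small dictionary; the names MEDALS, MEDAL_TYPES and build_link_lists are illustrative and are not part of Plotly or of the dataset.

```python
# Medal counts (gold, silver, bronze) for the three countries used in this article
MEDALS = {
    "United States of America": (39, 41, 33),
    "People's Republic of China": (38, 32, 18),
    "Japan": (27, 14, 17),
}
MEDAL_TYPES = ["Gold", "Silver", "Bronze"]


def build_link_lists(medals, medal_types):
    """Return (labels, source, target, value) using the numbering scheme above."""
    countries = list(medals)
    # Countries get the first ids (0, 1, 2); medal types get the next ones (3, 4, 5)
    labels = countries + list(medal_types)
    source, target, value = [], [], []
    for country_id, country in enumerate(countries):
        for offset, count in enumerate(medals[country]):
            source.append(country_id)                 # id of the country node
            target.append(len(countries) + offset)    # id of the medal node
            value.append(count)                       # link width = number of medals
    return labels, source, target, value


labels, source, target, value = build_link_lists(MEDALS, MEDAL_TYPES)
print(source)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(target)  # [3, 4, 5, 3, 4, 5, 3, 4, 5]
print(value)   # [39, 41, 33, 38, 32, 18, 27, 14, 17]
```

Generating the lists this way keeps the node ids and the link order in step when a country is added or removed.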

We will need to create two Python dict objects to represent

• Nodes (source and target): with labels and colors as individual lists, and
• Links: with the source node, the target node, the value (width) and the color of the links as individual lists

and pass these to Plotly’s go interface Sankey.

Each index of the lists (label, source, target, value and color) corresponds to one node or link, respectively.
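
Because the lists are matched purely by position, a quick consistency check before plotting can catch a missing or misplaced entry. The sketch below is one possible check in plain Python; find_problems is a hypothetical helper written for this article, not a Plotly function.

```python
def find_problems(nodes, links):
    """Return a list of problems found; an empty list means the inputs look consistent."""
    problems = []
    n_nodes = len(nodes["label"])
    n_links = len(links["value"])
    # Every per-node list needs one entry per node (non-list settings are skipped)
    for key, values in nodes.items():
        if isinstance(values, list) and len(values) != n_nodes:
            problems.append(f"node '{key}' has {len(values)} entries, expected {n_nodes}")
    # Every per-link list needs one entry per link
    for key, values in links.items():
        if isinstance(values, list) and len(values) != n_links:
            problems.append(f"link '{key}' has {len(values)} entries, expected {n_links}")
    # Source and target ids must point at existing nodes
    for key in ("source", "target"):
        unknown = [i for i in links[key] if not 0 <= i < n_nodes]
        if unknown:
            problems.append(f"link '{key}' refers to unknown node ids {unknown}")
    return problems


nodes = dict(
    label=["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color=["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
)
links = dict(
    source=[0, 0, 0, 1, 1, 1, 2, 2, 2],
    target=[3, 4, 5, 3, 4, 5, 3, 4, 5],
    value=[39, 41, 33, 38, 32, 18, 27, 14, 17],
)
print(find_problems(nodes, links))  # []
```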

```
import plotly.graph_objects as go

NODES = dict(
    # Node ids 0, 1, 2 are the countries; 3, 4, 5 are the medal types
    label = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
)
LINKS = dict(
    source = [ 0,  0,  0,  1,  1,  1,  2,  2,  2],  # The origin or the source nodes of the link
    target = [ 3,  4,  5,  3,  4,  5,  3,  4,  5],  # The destination or the target nodes of the link
    value  = [39, 41, 33, 38, 32, 18, 27, 14, 17],  # The width (quantity) of the links
    # Color of the links
    # Target node:  3-Gold          4-Silver        5-Bronze
    color = ["lightgreen",   "lightgreen",   "lightgreen",    # Source node: 0 - United States of America
             "lightskyblue", "lightskyblue", "lightskyblue",  # Source node: 1 - People's Republic of China
             "bisque",       "bisque",       "bisque"],       # Source node: 2 - Japan
)
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.show()
```

Sankey diagram – basic plot

Here we have a very basic plot. But have you noticed that the graph is very wide and silver appears before gold? Let’s adjust the nodes’ position and width.

Adjust the position of the nodes and the width of the diagram

Let’s add the x and y positions of the nodes to explicitly specify the nodes’ positions. The values must be between 0 and 1.

```
NODES = dict(
    # Node ids 0, 1, 2 are the countries; 3, 4, 5 are the medal types
    label = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
    x     = [0,   0,   0, 0.5, 0.5, 0.5],  # horizontal position of each node
    y     = [0, 0.5,   1, 0.1, 0.5,   1],  # vertical position of each node
)
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country & Medals", font_size=16)
fig.show()
```

With this we get a compressed diagram:

Sankey diagram – node position adjustment

See below how the various parameters passed in the code map to the nodes and links in the diagram.

Sankey diagram – how the code maps to the diagram
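
The x and y lists above were written by hand, which gets tedious as the number of nodes grows. As a rough sketch, positions can be computed per column instead. The helpers spread and node_positions are illustrative names, and the 0.1 to 0.9 range is an arbitrary choice made here to keep nodes away from the edges of the plot.

```python
def spread(n, lo=0.1, hi=0.9):
    """Return n evenly spaced positions between lo and hi (inclusive)."""
    if n == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 3) for i in range(n)]


def node_positions(columns):
    """columns: list of (x, number_of_nodes) pairs, given in node-id order."""
    xs, ys = [], []
    for x, count in columns:
        xs += [x] * count      # all nodes in a column share the same x
        ys += spread(count)    # and are spaced evenly along y
    return xs, ys


# Three country nodes at x = 0 and three medal nodes at x = 0.5, as in the article
x, y = node_positions([(0, 3), (0.5, 3)])
print(x)  # [0, 0, 0, 0.5, 0.5, 0.5]
print(y)  # [0.1, 0.5, 0.9, 0.1, 0.5, 0.9]
```

The resulting lists can be passed as the x and y entries of the node dictionary.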

Add meaningful hover labels

The plot is interactive. You can hover over the nodes and links for more information.

Sankey chart – with default hover labels

Currently, the information displayed in the hover labels is the default text. When hovering over

• Nodes: the node name, number of incoming flows, number of outgoing flows and total value are displayed. For example,
    • The USA node has a total of 113 medals (= 39 gold + 41 silver + 33 bronze)
    • The Gold node has a total of 104 medals (= 39 from USA, 38 from China, 27 from Japan)
• Links: the name of the source node, the name of the target node and the value of the link are displayed. For example, the link from the USA source node to the Silver target node has 41 medals.
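
The node totals shown on hover are simply the sums of the link values entering or leaving each node. A small sketch in plain Python reproduces the two examples above; the variable names are illustrative.

```python
from collections import defaultdict

labels = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"]
source = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target = [3, 4, 5, 3, 4, 5, 3, 4, 5]
value = [39, 41, 33, 38, 32, 18, 27, 14, 17]

outgoing = defaultdict(int)  # total flow leaving each node
incoming = defaultdict(int)  # total flow entering each node
for s, t, v in zip(source, target, value):
    outgoing[labels[s]] += v
    incoming[labels[t]] += v

print(outgoing["United States of America"])  # 113
print(incoming["Gold"])                       # 104
```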

Don’t you think the labels are too long? All of these can be improved.

Let’s improve the format of the hover labels using the hovertemplate parameter.

• For nodes, since the hover labels don’t add any information beyond what is already shown, let’s remove them by passing a blank hovertemplate (" ").
• For links, we can make the label concise by combining the country and the medal type.
• For both nodes and links, let’s show the values with the suffix “Medals”, for example 113 Medals instead of 113. This can be achieved using the update_traces function with the appropriate valueformat and valuesuffix.

```
NODES = dict(
    # Node ids 0, 1, 2 are the countries; 3, 4, 5 are the medal types
    label = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
    x     = [0,   0,   0, 0.5, 0.5, 0.5],
    y     = [0, 0.5,   1, 0.1, 0.5,   1],
    hovertemplate = " ",  # blank template in place of the default node hover text
)
```

```
# One short label per link, in the same order as the LINKS lists below.
# The loop body was missing in the source; the "Country-Medal" format is an assumption.
LINK_LABELS = []
for country in ["USA", "China", "Japan"]:
    for medal in ["Gold", "Silver", "Bronze"]:
        LINK_LABELS.append(f"{country}-{medal}")

LINKS = dict(
    source = [ 0,  0,  0,  1,  1,  1,  2,  2,  2],  # The origin or the source nodes of the link
    target = [ 3,  4,  5,  3,  4,  5,  3,  4,  5],  # The destination or the target nodes of the link
    value  = [39, 41, 33, 38, 32, 18, 27, 14, 17],  # The width (quantity) of the links
    # Color of the links
    # Target node:  3-Gold          4-Silver        5-Bronze
    color = ["lightgreen",   "lightgreen",   "lightgreen",    # Source node: 0 - United States of America
             "lightskyblue", "lightskyblue", "lightskyblue",  # Source node: 1 - People's Republic of China
             "bisque",       "bisque",       "bisque"],       # Source node: 2 - Japan
    label = LINK_LABELS,
    hovertemplate = "%{label}",  # show only the short link label on hover
)
```

```
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country & Medals", font_size=16)
# Format the values and append the " Medals" suffix
fig.update_traces(valueformat="3d", valuesuffix=" Medals", selector=dict(type="sankey"))
fig.update_layout(hoverlabel=dict(bgcolor="lightgray", font_size=16, font_family="Rockwell"))
fig.show()
```

Sankey chart – with improved hover labels
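
The link labels can also be derived from the source and target lists, so that they stay in the same order as the links. The sketch below assumes a list of short display names indexed by node id (SHORT_NAMES is an illustrative name) and a "-" separator. It also shows, as an alternative, a link hovertemplate built from the source.label, target.label and value variables that Plotly exposes for Sankey links, as far as I know from its documentation.

```python
# Short display names indexed by node id (an illustrative choice, not from the dataset)
SHORT_NAMES = ["USA", "China", "Japan", "Gold", "Silver", "Bronze"]

source = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target = [3, 4, 5, 3, 4, 5, 3, 4, 5]

# One label per link, in link order; the "-" separator is an assumption
LINK_LABELS = [f"{SHORT_NAMES[s]}-{SHORT_NAMES[t]}" for s, t in zip(source, target)]
print(LINK_LABELS[:3])  # ['USA-Gold', 'USA-Silver', 'USA-Bronze']

# Alternative: let Plotly fill in the node labels and the link value itself.
# <extra></extra> hides the secondary box that otherwise shows the trace name.
LINK_HOVERTEMPLATE = "%{source.label} to %{target.label}: %{value}<extra></extra>"
```

Either LINK_LABELS (as the label entry) or LINK_HOVERTEMPLATE (as the hovertemplate entry) can then be placed in the link dictionary.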

Generalization to multiple nodes and levels

The nodes are referred to as source and target in relation to a link. A node that is the target of one link can be the source of another link.

• The code can be generalized to handle all countries in the dataset.
• We can also extend the diagram to another level to visualize the total number of medals across countries.
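
Both ideas can be sketched together. The function below is one possible layout, not the only one: a single root node feeds each country with its total medal count, and each country then feeds the three medal types. It assumes rows shaped like the renamed columns of the dataset (Country, Gold Medals, Silver Medals, Bronze Medals); build_multilevel and the root label "Total Medals" are names made up for this example.

```python
def build_multilevel(rows, medal_types=("Gold", "Silver", "Bronze"), root="Total Medals"):
    """Build node and link dicts for the levels root -> country -> medal type.

    rows: list of dicts with a 'Country' key and one '<type> Medals' key per medal type.
    """
    countries = [row["Country"] for row in rows]
    labels = [root] + countries + list(medal_types)
    index = {label: i for i, label in enumerate(labels)}  # label -> node id
    source, target, value = [], [], []

    def add_link(src, dst, amount):
        source.append(index[src])
        target.append(index[dst])
        value.append(amount)

    for row in rows:
        counts = [row[f"{medal} Medals"] for medal in medal_types]
        # Level 1: root -> country, width = the country's total medal count
        add_link(root, row["Country"], sum(counts))
        # Level 2: country -> medal type; the country is now a source node
        for medal, count in zip(medal_types, counts):
            add_link(row["Country"], medal, count)
    return dict(label=labels), dict(source=source, target=target, value=value)


rows = [
    {"Country": "United States of America", "Gold Medals": 39, "Silver Medals": 41, "Bronze Medals": 33},
    {"Country": "People's Republic of China", "Gold Medals": 38, "Silver Medals": 32, "Bronze Medals": 18},
    {"Country": "Japan", "Gold Medals": 27, "Silver Medals": 14, "Bronze Medals": 17},
]
nodes, links = build_multilevel(rows)
print(links["value"][:4])  # [113, 39, 41, 33]
```

With the full dataset, the rows could come from df_medals.to_dict("records"), and the two dictionaries can be passed to go.Sankey as node and link.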

End notes

We have seen how Sankey diagrams can be used to represent flows effectively and how the Plotly Python library can be used to create Sankey diagrams for a sample dataset.

Sridevi jatu

A technical engineer who also loves to break down complex concepts into easy-to-digest capsules! Currently, I’m finding my way around the wonderful world of data visualization and storytelling!

The media shown in this article is not owned by Analytics Vidhya and is used at the author’s discretion.