
Visualize data using a parallel coordinate diagram


This article was published as part of the Data Science Blogathon.

Parallel coordinate diagram overview

When using visualizations, a single combined visualization that shows the relationship between multiple variables has the upper hand over multiple visualizations, one for each variable. When you are trying to visualize high-dimensional numeric data, a parallel coordinate plot can be more useful than multiple bar or line charts (one for each numerical variable).

A parallel coordinate plot is used to analyze multivariate numerical data. It allows comparison of samples or observations across multiple numerical variables.

  • Each feature/variable is represented by a separate axis. All axes are evenly spaced and parallel to each other. Each axis can have a different scale and unit of measure.
  • Each sample/observation is plotted horizontally, as a line that crosses every axis.
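
As a rough illustration of the two points above, the following sketch computes the polyline vertices for each sample. It assumes evenly spaced axes at x = 0, 1, 2, ... and simple min-max scaling per axis; the function name `polyline_vertices` and the sample values are invented for illustration.

```python
def polyline_vertices(samples):
    """Return one polyline (a list of (x, y) vertices) per sample.

    Each column gets its own axis at x = 0, 1, 2, ... and is scaled
    independently to [0, 1], so the axes may use different units.
    """
    n_axes = len(samples[0])
    lows = [min(row[i] for row in samples) for i in range(n_axes)]
    highs = [max(row[i] for row in samples) for i in range(n_axes)]

    polylines = []
    for row in samples:
        vertices = []
        for i, value in enumerate(row):
            span = highs[i] - lows[i]
            # A constant column has no spread; place it mid-axis.
            y = 0.5 if span == 0 else (value - lows[i]) / span
            vertices.append((i, y))
        polylines.append(vertices)
    return polylines


# Hypothetical samples: (length in cm, weight in kg, count)
samples = [(10.0, 2.0, 100), (20.0, 4.0, 300), (15.0, 3.0, 200)]
lines = polyline_vertices(samples)
```

Each entry of `lines` is what a plotting library would draw as one line across the parallel axes.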

See below the parallel coordinate plot for the iris dataset, where each sample is plotted starting from the leftmost axis and ending at the rightmost axis. We can see how a sample behaves for each of the different features.

Parallel coordinates of the iris data set

Image source: https://upload.wikimedia.org/wikipedia/en/4/4a/ParCorFisherIris.png

From the above example, we can see that there are 4 variables/features: sepal width, sepal length, petal width, and petal length. Each variable is represented by an axis. Each axis has different min-max values.

In the plot, you can see a clear pattern emerge:

  • Flowers belonging to the setosa species have a large sepal width but low sepal length, petal width, and petal length.
  • Flowers belonging to the versicolor species have a low sepal width and medium sepal length, petal width, and petal length.
  • Flowers belonging to the virginica species have low to medium sepal width, medium to large sepal length, and large petal width and length.

The following can be adjusted:

  • Scale: all axes can be normalized to a common scale.
  • Ordering: features can be ordered so that there are not too many intersecting lines, which would make the graph unreadable.
  • Highlighting: by highlighting one or more lines, you can focus on the part of the plot you are interested in.
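
To make the ordering point concrete, here is one possible heuristic, not something the plotting libraries do for you: count how many pairs of lines cross between adjacent axes and pick the axis order with the fewest crossings. The function names and the toy columns are invented, and the brute-force search is only practical for a handful of axes.

```python
from itertools import combinations, permutations


def crossings_between(col_a, col_b):
    """Count pairs of samples whose lines cross between two adjacent axes."""
    count = 0
    for i, j in combinations(range(len(col_a)), 2):
        # Two lines cross when the samples swap order from one axis to the next.
        if (col_a[i] - col_a[j]) * (col_b[i] - col_b[j]) < 0:
            count += 1
    return count


def total_crossings(columns, order):
    """Sum the crossings over every pair of neighbouring axes in this order."""
    return sum(
        crossings_between(columns[a], columns[b])
        for a, b in zip(order, order[1:])
    )


def best_order(columns):
    """Try every axis order and keep one with the fewest crossings."""
    return min(permutations(columns), key=lambda order: total_crossings(columns, order))


# Invented columns: "a" and "c" rise together, "b" runs the other way.
columns = {
    "a": [1, 2, 3, 4],
    "b": [4, 3, 2, 1],
    "c": [1, 2, 3, 5],
}
order = best_order(columns)
```

Because the test only looks at the sign of the differences, it gives the same answer whether or not the axes have been normalized.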

How to make a parallel coordinate diagram using Python?

Let’s use the 2021 Olympics dataset to illustrate the use of a parallel coordinate plot. This dataset contains details about:

  • Teams that participated: country and disciplines
  • Athletes who participated: country and athlete names
  • Final medal tally: country, rank, total medals, and the breakdown across gold, silver, and bronze medals

Let’s build a summary by country (the number of athletes, the disciplines they took part in, the rank, and the medal counts) and try to find answers to some questions:

  • How many athletes from a country participated? How many disciplines did the country participate in? How many medals did the country win?
  • Did countries with more athletes win more medals?
  • Did countries that participated in more disciplines win more medals?

Read and prepare data

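import pandas as pd

# Reading .xlsx files with pandas needs an Excel engine such as openpyxl.
df_teams = pd.read_excel("data/Teams.xlsx")
df_atheletes = pd.read_excel("data/Athletes.xlsx")
df_medals = pd.read_excel("data/Medals.xlsx")
# info() lists the columns, dtypes, and non-null counts of each frame.
print(df_teams.info())
print(df_atheletes.info())
print(df_medals.info())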

There is no missing data, so no specific missing data processing is required.
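
One way to check this directly, rather than reading it off the `info()` output, is to count the missing values per column. A minimal sketch on an invented frame (the column names mirror the ones used in this article; the rows are made up and one name is deliberately missing):

```python
import pandas as pd

# Invented rows; one name is deliberately left empty.
toy = pd.DataFrame({
    "Name": ["A. Runner", None, "C. Swimmer"],
    "NOC": ["Japan", "Japan", "Italy"],
})

# isna() marks the empty cells, sum() counts them per column.
missing_per_column = toy.isna().sum()
has_missing = bool(missing_per_column.any())
print(missing_per_column)
```

On the real frames, a result of all zeros from `df_teams.isna().sum()` (and likewise for the other two) would confirm the statement above.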

Description of the data set

Let’s find the number of disciplines each country participated in and the number of athletes from each country who participated and combine this data into a single data frame.

# Give the medal columns descriptive names.
df_medals.rename(columns={'Team/NOC': 'NOC', 'Total': 'Total Medals', 'Gold': 'Gold Medals', 'Silver': 'Silver Medals', 'Bronze': 'Bronze Medals'}, inplace=True)
# Count distinct disciplines and distinct athletes per country (NOC).
df_disciplines_per_country = df_teams.groupby(by='NOC').agg({'Discipline': 'nunique'}).reset_index()
df_atheletes_per_country = df_atheletes.groupby(by='NOC').agg({'Name': 'nunique'}).rename(columns={'Name': 'Athletes'}).reset_index()
# Combine everything into a single data frame keyed on the country.
df = pd.merge(left=df_disciplines_per_country, right=df_medals, how='inner', on='NOC')
df = pd.merge(left=df, right=df_atheletes_per_country, how='inner', on='NOC')
df.rename(columns={'NOC': 'Country'}, inplace=True)
df = df[['Country', 'Rank', 'Total Medals', 'Gold Medals', 'Silver Medals', 'Bronze Medals', 'Athletes', 'Discipline']]
# Order by rank and renumber the rows from 0.
df.sort_values(by='Rank', inplace=True)
df.reset_index(drop=True, inplace=True)
df.head(10)

The final data set after merging all the different data sets
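
The grouping and merging steps above can be tried on a few invented rows to see what the combined frame looks like. This sketch is a variation rather than the exact code above: it uses named aggregation and `as_index=False` so that `NOC` stays a regular column throughout. All values are made up.

```python
import pandas as pd

# Invented rows that only mimic the column layout used above.
teams = pd.DataFrame({
    "NOC": ["Japan", "Japan", "Japan", "Italy"],
    "Discipline": ["Judo", "Judo", "Swimming", "Fencing"],
})
athletes = pd.DataFrame({
    "NOC": ["Japan", "Japan", "Italy"],
    "Name": ["A", "B", "C"],
})
medals = pd.DataFrame({"NOC": ["Japan", "Italy"], "Total Medals": [5, 2]})

# Distinct disciplines and distinct athlete names per country.
disciplines = teams.groupby("NOC", as_index=False).agg(
    Discipline=("Discipline", "nunique")
)
athlete_counts = athletes.groupby("NOC", as_index=False).agg(
    Athletes=("Name", "nunique")
)

# Inner merges keep only the countries present in every frame.
summary = medals.merge(disciplines, on="NOC").merge(athlete_counts, on="NOC")
print(summary)
```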

Draw with bar charts

First, let’s use bar charts to plot the athletes, disciplines, ranks, and medals for each country. For better readability, we use only the top-ranked entries (the first 40 rows in the code below).

import matplotlib.pyplot as plt

# First row of charts: athletes and disciplines per country.
plt.figure(figsize=(20, 5))
ax = plt.subplot(1,2,1)
df[['Country','Athletes']][:40].plot.bar(x='Country', xlabel="", ax=ax)
ax = plt.subplot(1,2,2)
df[['Country','Discipline']][:40].plot.bar(x='Country', xlabel="", ax=ax)
# Second row of charts: rank and stacked medal counts per country.
plt.figure(figsize=(20, 5))
ax = plt.subplot(1,2,1)
df[['Country','Rank']][:40].plot.bar(x='Country', xlabel="", ax=ax)
ax = plt.subplot(1,2,2)
df[['Country','Gold Medals', 'Silver Medals','Bronze Medals']][:40].plot.bar(stacked=True, x='Country', xlabel="", ax=ax)

Bar charts of athletes, disciplines, rank, and medals per country

After looking at these four separate graphs, some insights we can draw are:

  • The top five countries have more than 300 athletes who participated in more than 10 disciplines and won more than 50 medals, including more than 20 gold medals.
  • Most countries, by contrast, have fewer than 200 athletes who participated in fewer than 7 disciplines and won fewer than 20 medals, of which fewer than 5 were gold.
  • Although Japan fielded over 570 athletes in 20 disciplines (the most of any country), it is in third place with a total of 60 medals and 27 gold medals.
  • China fielded 400 athletes in 15 disciplines and ranked second with 88 medals and 37 golds.

This is good. But what if we could derive similar insights using a single, more compact visualization?

Parallel coordinate plot using pandas

Let’s plot the top 20 countries using the pandas interface.

df_20 = df.head(20).copy()
df_20 = df_20[['Country', 'Athletes', 'Discipline', 'Rank', 'Total Medals', 'Gold Medals', 'Silver Medals', 'Bronze Medals']]
plt.figure(figsize=(16,8))
pd.plotting.parallel_coordinates(df_20, 'Country', color=('#556270', '#4ECDC4', '#C7F464'))

Parallel coordinate plot using pandas

With the pandas interface, we have two problems:

1. We cannot control the scale of the individual axes.

2. We cannot label the individual lines (polylines) inline.
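
A possible workaround for the first problem, assuming min-max scaling is acceptable for the comparison, is to rescale every numeric column to [0, 1] before handing the frame to pandas. The helper name `min_max_scale` and the rows below are invented for illustration.

```python
import pandas as pd


def min_max_scale(frame, class_column):
    """Return a copy with every column except class_column scaled to [0, 1]."""
    scaled = frame.copy()
    numeric = scaled.columns.drop(class_column)
    lows = scaled[numeric].min()
    # Replace a zero spread with 1 so constant columns do not divide by zero.
    spans = (scaled[numeric].max() - lows).replace(0, 1)
    scaled[numeric] = (scaled[numeric] - lows) / spans
    return scaled


# Invented rows with the same kind of columns as the merged frame.
toy = pd.DataFrame({
    "Country": ["X", "Y", "Z"],
    "Athletes": [100, 300, 200],
    "Total Medals": [10, 50, 30],
})
scaled = min_max_scale(toy, "Country")
# scaled can then be passed to pd.plotting.parallel_coordinates(scaled, "Country")
```

The trade-off is that the shared axis then shows values between 0 and 1 instead of the original units, so the raw numbers can no longer be read off the plot.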

We can use Plotly to better control different parameters.

Plot parallel coordinates using Plotly

Before we dive in too deeply, a little bit about Plotly. Plotly is a Python graphing library that makes publication-quality interactive graphs. It provides 2 interfaces:

  • Plotly Express: a simple interface that produces easy-to-style figures. It uses graph objects internally.
  • Plotly Graph Objects: a low-level interface that can be used for finer control. What can be created by calling a single function with Plotly Express needs more code here.

To create and render graphical figures (such as charts, plots, maps, and diagrams) in Plotly, one has to:

  • Create figures, which can be represented either as dicts or as instances of the plotly.graph_objects classes
  • Manipulate them if required
  • Render them; rendering uses the Plotly.js JavaScript library under the hood
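
To illustrate the first step, here is a minimal sketch of a figure written as a plain dictionary: a list of traces plus a layout. The values are invented, and the helper `to_figure` is only a thin, hypothetical wrapper that needs plotly installed when it is called.

```python
# A figure as a plain dictionary: a list of traces plus a layout.
# The values are invented; "parcoords" is the trace type for parallel coordinates.
fig_spec = {
    "data": [
        {
            "type": "parcoords",
            "dimensions": [
                {"label": "Athletes", "values": [100, 300, 200]},
                {"label": "Total Medals", "values": [10, 50, 30]},
            ],
        }
    ],
    "layout": {"title": {"text": "Toy parallel coordinates"}},
}


def to_figure(spec):
    """Wrap the dictionary in a graph-objects Figure (needs plotly installed)."""
    import plotly.graph_objects as go

    return go.Figure(spec)
```

Calling `to_figure(fig_spec).show()` would then render the figure in the same way as the graph-objects examples in this article.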

Parallel coordinate plots in Plotly are richly interactive by default. One can drag along the axes to filter regions and drag the axis names across the plot to rearrange the variables.

Using the Plotly Express interface

In a parallel coordinate plot with px.parallel_coordinates, each row (or sample) of a DataFrame is represented by a polyline that crosses a set of parallel axes, one for each of the dimensions.

import plotly.express as px
df_ = df.copy()
# color     : Values from this column are used to assign color to the poly lines.
# dimensions: Values from these columns form the axes in the plot.
fig = px.parallel_coordinates(df_, color="Rank", dimensions=['Rank', 'Athletes', 'Discipline','Total Medals'],
                              color_continuous_scale=px.colors.diverging.Tealrose,
                              color_continuous_midpoint=2)
fig.show()

Parallel coordinate plot using Plotly express

Using Plotly’s graph_objects interface

  1. First, define the list of variables/axes to plot. For each dimension, specify:
    • range: start and end values, specified as a list or tuple
    • tickvals: values at which ticks should be displayed on this axis
    • ticktext: text to display at the ticks
    • label: name of the axis
    • values: the values to be plotted on this axis
  2. Then, create a Parcoords trace, which takes the dimensions along with line attributes such as color.
  3. Next, create a Figure with the Parcoords trace specified above.
  4. Finally, render the figure using show.
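
The per-axis dictionaries from step 1 are repetitive, so one option is to build them with a small helper. The sketch below uses plain lists so it runs without plotly; `make_dimension` is a hypothetical name, not part of the library, and the values are invented.

```python
def make_dimension(label, values, ticktext=None):
    """Build one axis description for a Parcoords trace as a plain dict."""
    values = list(values)
    dimension = {
        "range": [min(values), max(values)],
        "label": label,
        "values": values,
    }
    if ticktext is not None:
        # Show one text label at each plotted value instead of numbers.
        dimension["tickvals"] = values
        dimension["ticktext"] = list(ticktext)
    return dimension


# Invented values standing in for DataFrame columns.
dimensions = [
    make_dimension("Country", [1, 2, 3], ticktext=["X", "Y", "Z"]),
    make_dimension("Athletes", [300, 100, 200]),
]
```

A list built this way can be passed as the `dimensions` argument of `go.Parcoords`, and DataFrame columns can be passed in place of the lists.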
import plotly.graph_objects as go
df_ = df.copy()
dimensions = list([ dict(range=(df_['Rank'].min(), df_['Rank'].max()),tickvals = df_['Rank'], ticktext = df_['Country'],label="Country", values=df_['Rank']),
                    dict(range=(df_['Athletes'].min(),df_['Athletes'].max()),label="Athletes", values=df_['Athletes']),
                    dict(range=(df_['Discipline'].min(),df_['Discipline'].max()),label="Discipline", values=df_['Discipline']),
                    dict(range=(df_['Total Medals'].min(), df_['Total Medals'].max()),label="Total Medals", values=df_['Total Medals']),
                    dict(range=(df_['Gold Medals'].min(), df_['Gold Medals'].max()),label="Gold Medals", values=df_['Gold Medals']),
                    dict(range=(df_['Silver Medals'].min(), df_['Silver Medals'].max()),label="Silver Medals", values=df_['Silver Medals']),
                    dict(range=(df_['Bronze Medals'].min(), df_['Bronze Medals'].max()),label="Bronze Medals", values=df_['Bronze Medals']),
                  ])
fig = go.Figure(data= go.Parcoords(line = dict(color = df_['Rank'], colorscale="agsunset"), dimensions = dimensions))
fig.show()

Parallel coordinates: draw with graph objects

This is definitely a better plot than what pandas gave us. But the figure size is bad: the labels are cut off. Let’s set the size using update_layout.

# Adjust the size to fit all the labels
fig.update_layout(width=1200, height=800,margin=dict(l=150, r=60, t=60, b=40))
fig.show()
Parallel coordinates: draw with graph objects

There is still one issue. The United States won the most medals but is shown at the bottom. Because of this, there are unnecessary criss-crossing lines, which is not very intuitive. We would like to see the countries in descending order, with the top-ranked country at the top.

# Let's reverse the min and max values for the Rank, so that the country with top rank comes on the top. 
dimensions = list([ dict(range=(df_['Rank'].max(), df_['Rank'].min()), tickvals = df_['Rank'], ticktext = df_['Country'],label="Country", values=df_['Rank']),
                    dict(range=(df_['Athletes'].min(),df_['Athletes'].max()),label="Athletes", values=df_['Athletes']),
                    dict(range=(df_['Discipline'].min(),df_['Discipline'].max()),label="Discipline", values=df_['Discipline']),
                    dict(range=(df_['Total Medals'].min(), df_['Total Medals'].max()),label="Total Medals", values=df_['Total Medals']),
                    dict(range=(df_['Gold Medals'].min(), df_['Gold Medals'].max()), label="Gold Medals", values=df_['Gold Medals']),
                    dict(range=(df_['Silver Medals'].min(), df_['Silver Medals'].max()),label="Silver Medals", values=df_['Silver Medals']),
                    dict(range=(df_['Bronze Medals'].min(), df_['Bronze Medals'].max()),label="Bronze Medals", values=df_['Bronze Medals']),
                  ])
fig = go.Figure(data= go.Parcoords(line = dict(color = df_['Rank'], colorscale="agsunset"), dimensions = dimensions))
fig.update_layout(width=1200, height=800,margin=dict(l=150, r=60, t=60, b=40))
fig.show()

Parallel coordinates: draw with graph objects

Now the plot looks much better. Don’t you agree? If you follow the line corresponding to the USA, ranked first at the top of the Country axis, you can see that it fielded 614 athletes in 18 disciplines and won a total of 113 medals, of which 39 were gold. China, which fielded 400 athletes in 15 disciplines, ranked second with 88 medals and 37 golds.

From this graph, we can derive the following insights, which are basically the same as before. Only in this case there is a single summary view.

What are the insights?

  • The top five countries have more than 400 athletes who participated in more than 15 disciplines and won more than 50 medals, including more than 20 gold medals.
  • Most countries, by contrast, have fewer than 200 athletes who participated in fewer than 7 disciplines and won fewer than 20 medals, of which fewer than 5 were gold.
  • Although Japan fielded over 570 athletes in 20 disciplines (the most of any country), it is in third place with a total of 60 medals and 27 gold medals.
  • China fielded 400 athletes in 15 disciplines and ranked second with 88 medals and 37 golds.
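
The questions about athletes, disciplines, and medals can also be checked numerically. Below is a minimal sketch of a Pearson correlation in plain Python; the numbers are invented placeholders, not the Olympic figures. On the merged frame, `df['Athletes'].corr(df['Total Medals'])` would be the pandas equivalent.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented placeholder values, chosen to be perfectly proportional.
athletes = [50, 100, 150, 200]
medals = [5, 10, 15, 20]
r = pearson(athletes, medals)
```

A value near 1 would suggest that countries with more athletes tend to win more medals, while a value near 0 would suggest no linear relationship.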

Interaction

For large datasets, parallel coordinate plots tend to get cluttered. In such cases, interaction comes to our rescue. Using interaction, it is possible to filter or highlight certain sections of the data. The order of the axes can also be changed so that patterns or correlations across variables become visible.

Plotly’s parallel coordinate plots support interaction. One can:

  • Drag the lines along the axes to filter areas
  • Drag the axis names across the graph to rearrange the variables.
Parallel coordinates: draw with the graph – drag the lines along the axes
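
The same kind of filter can also be preset in code. Parcoords dimensions accept a `constraintrange` attribute that marks a selected interval on an axis. The sketch below builds such a dimension as a plain dict and, separately, shows in plain Python which samples fall inside that interval. The helper `rows_in_range` and the values are invented for illustration.

```python
def rows_in_range(values, low, high):
    """Indices of samples that stay highlighted for a brush [low, high] on one axis."""
    return [i for i, v in enumerate(values) if low <= v <= high]


athletes = [300, 100, 200, 450]  # invented values

# Axis description with a preset brush, to be placed in the dimensions list.
athletes_dimension = {
    "label": "Athletes",
    "values": athletes,
    "range": [min(athletes), max(athletes)],
    "constraintrange": [150, 350],
}

selected = rows_in_range(athletes, *athletes_dimension["constraintrange"])
```

Here the samples at positions 0 and 2 fall inside the brush; in the rendered plot, lines outside the interval are greyed out.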

Summary

We have seen how parallel coordinate plots, a compact visualization of high-dimensional multivariate numerical data, can be used to produce meaningful insights. To generate parallel coordinate plots, we used the Plotly Python library, which provides a lot of convenient functionality.

About the author

A technical engineer who also loves to break down complex concepts into easy-to-digest capsules! Currently, I’m finding my way around the wonderful world of data visualization and data storytelling!


The media described in this article is not owned by Analytics Vidhya and is used at the author’s discretion


