Creating a Sankey Diagram in R

Gustavo Santos
4 min read · Jun 5, 2024

Use R to quickly create an interactive Sankey Chart.

The Sankey Diagram. Image by the author.

What is it?

Have you come across the Sankey diagram?

Sankey diagrams are a powerful visualization tool for showing flow, movement, and change from one state to another.

The components help us easily understand the graphic:

  • Each node is a category
  • Colors help us differentiate them
  • The node’s size and the bands’ widths are proportional to the flow rate of the category.

In other words, it is a great visualization tool that helps determine which portion of a whole went to each category.
Let’s see an example.

Code

Let’s say you want to show the components of a company’s Total Sales in a given year.
Plotted as a Sankey Diagram, the breakdown is easy to see at a glance.

First, load the libraries needed to plot the graphic.

# Libraries
library(networkD3)
library(dplyr)

Next, let’s create a data frame. And here is the catch: when creating it, you should always think about how the nodes connect to each other.

# A connection data frame is a list of flows with intensity for each flow
links <- data.frame(
  source = c("TOTAL SALES", "TOTAL SALES",
             "Products", "Products",
             "Services", "Services"),
  target = c("Products", "Services",
             "Product A", "Product B",
             "Maintenance", "Upgrade"),
  value = c(22, 5, 10, 12, 2, 3)
)

Here, we are creating the flow from TOTAL SALES. So, TOTAL SALES is our source, and Products and Services are our targets. That already gives us two rows in our data frame: TOTAL SALES | Products and TOTAL SALES | Services.

Next, Products breaks down into Product A and Product B. So now Products is our source, and Product A and Product B are the targets. Thus, two more rows.

The same applies to Services. Then we add the amount of each flow in the value column. This is how the data frame should look so far.

Data with sources and targets. Image by the author.
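One quick way to sanity-check a links table like this is to confirm that, for every intermediate node, the amount flowing in equals the amount flowing out. The sketch below rebuilds the same data frame and compares the two totals with base R only. The names inflow, outflow, middle, and balance are made up for this illustration and are not part of networkD3.

```r
# Rebuild the links data frame from the tutorial
links <- data.frame(
  source = c("TOTAL SALES", "TOTAL SALES", "Products", "Products", "Services", "Services"),
  target = c("Products", "Services", "Product A", "Product B", "Maintenance", "Upgrade"),
  value = c(22, 5, 10, 12, 2, 3),
  stringsAsFactors = FALSE
)

# Total arriving at each target and total leaving each source
inflow <- tapply(links$value, links$target, sum)
outflow <- tapply(links$value, links$source, sum)

# Intermediate nodes appear both as a target and as a source
middle <- intersect(names(inflow), names(outflow))

# Side-by-side comparison: the two columns should match row by row
balance <- data.frame(
  node = middle,
  flow_in = as.numeric(inflow[middle]),
  flow_out = as.numeric(outflow[middle])
)
balance
```

Here Products receives 22 and passes on 10 + 12, and Services receives 5 and passes on 2 + 3, so both rows balance. If they did not, the node would be drawn with bands that do not add up to its size.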

Now, to finalize the visual, the networkD3 library needs the sources and targets encoded as zero-based numeric IDs instead of names. So, the next lines create those IDs.

# From these flows we need to create a node data frame: it lists every entity involved in the flow
nodes <- data.frame(
  name = c(as.character(links$source),
           as.character(links$target)) %>% unique()
)

# With networkD3, connections must be provided using zero-based IDs, not the real names used in the links data frame. So we need to reformat it.
links$IDsource <- match(links$source, nodes$name) - 1
links$IDtarget <- match(links$target, nodes$name) - 1

Encoded data with sources and targets. Image by the author.
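To see what the encoding step produces, the following sketch repeats it on the same data using base R only and then checks that each ID maps back to the original name. The round-trip check is an addition for illustration, not something networkD3 requires.

```r
# Same links data frame as in the tutorial
links <- data.frame(
  source = c("TOTAL SALES", "TOTAL SALES", "Products", "Products", "Services", "Services"),
  target = c("Products", "Services", "Product A", "Product B", "Maintenance", "Upgrade"),
  value = c(22, 5, 10, 12, 2, 3),
  stringsAsFactors = FALSE
)

# One row per distinct node, in order of first appearance
nodes <- data.frame(
  name = unique(c(links$source, links$target)),
  stringsAsFactors = FALSE
)

# match() returns 1-based positions; subtracting 1 gives zero-based IDs
links$IDsource <- match(links$source, nodes$name) - 1
links$IDtarget <- match(links$target, nodes$name) - 1

links$IDsource  # 0 0 1 1 2 2
links$IDtarget  # 1 2 3 4 5 6

# Round trip: adding 1 back to an ID should recover the original name
all(nodes$name[links$IDsource + 1] == links$source)  # TRUE
all(nodes$name[links$IDtarget + 1] == links$target)  # TRUE
```

TOTAL SALES gets ID 0 because it is the first name to appear, and the seven distinct nodes end up with IDs 0 through 6.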

And finally, let’s build the visual.

# Make the Network
p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "value", NodeID = "name",
                   sinksRight = FALSE,
                   fontSize = 12, fontFamily = "Arial Black")

# View
p

Sankey Diagram. Image by the author.
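Since colors help tell the nodes apart, one possible variation is to color them by branch instead of individually. The sketch below assumes a made-up group column on the nodes data frame (the labels total, products, and services are hypothetical) and passes it through the NodeGroup argument of sankeyNetwork. The default color scale is left untouched.

```r
library(networkD3)

# Same links data frame as in the tutorial
links <- data.frame(
  source = c("TOTAL SALES", "TOTAL SALES", "Products", "Products", "Services", "Services"),
  target = c("Products", "Services", "Product A", "Product B", "Maintenance", "Upgrade"),
  value = c(22, 5, 10, 12, 2, 3),
  stringsAsFactors = FALSE
)

# Node table, plus a hypothetical 'group' label per node (one per branch)
nodes <- data.frame(
  name = unique(c(links$source, links$target)),
  stringsAsFactors = FALSE
)
nodes$group <- c("total",                    # TOTAL SALES
                 "products", "services",     # Products, Services
                 "products", "products",     # Product A, Product B
                 "services", "services")     # Maintenance, Upgrade

# Zero-based IDs, as before
links$IDsource <- match(links$source, nodes$name) - 1
links$IDtarget <- match(links$target, nodes$name) - 1

# NodeGroup tells sankeyNetwork which column drives the node colors
p_grouped <- sankeyNetwork(Links = links, Nodes = nodes,
                           Source = "IDsource", Target = "IDtarget",
                           Value = "value", NodeID = "name",
                           NodeGroup = "group",
                           sinksRight = FALSE,
                           fontSize = 12, fontFamily = "Arial Black")
p_grouped
```

With this grouping, Products and its two children share one color and Services and its two children share another, which can make the two branches easier to follow.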

Before You Go

Now you have another visual tool in your hands. Make good use of it.

Sankey Charts are a great resource for displaying parts of a whole or showing how a resource was distributed across categories or groups. Observe how intuitive it is to see how much of the Total Sales went to Products and Services.
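For readers who want the numbers behind the picture, here is a small sketch that computes each first-level share of Total Sales from the same example values. The names first_level and share_pct are invented for this illustration.

```r
# Same links data frame as in the tutorial
links <- data.frame(
  source = c("TOTAL SALES", "TOTAL SALES", "Products", "Products", "Services", "Services"),
  target = c("Products", "Services", "Product A", "Product B", "Maintenance", "Upgrade"),
  value = c(22, 5, 10, 12, 2, 3),
  stringsAsFactors = FALSE
)

# Keep only the flows that leave TOTAL SALES
first_level <- links[links$source == "TOTAL SALES", c("target", "value")]

# Share of the total, in percent, rounded to one decimal place
first_level$share_pct <- round(100 * first_level$value / sum(first_level$value), 1)
first_level
```

With these values, Products accounts for 81.5% of Total Sales and Services for 18.5%, which matches the relative band widths in the diagram.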

Of course, like any other chart, it won’t be the best option every time. When there are too many variables and categories, it can get messy pretty quickly. So keep that in mind.

If you liked this quick tutorial, follow me for more content.

Find me on LinkedIn as well.

Gustavo Santos

Data Scientist. I extract insights from data to help people and companies to make better and data driven decisions. | In: https://www.linkedin.com/in/gurezende/