Linked Brushing#
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews.util.transform import dim
from holoviews.selection import link_selections
from holoviews.operation import gridmatrix
from holoviews.operation.element import histogram
from holoviews import opts
hv.extension('bokeh', 'plotly', width=100)
JavaScript-based linked brushing#
Datasets very often have more dimensions than can be shown in a single plot, which is why HoloViews offers so many ways to show the data from each of these dimensions at once (via layouts, overlays, grids, holomaps, etc.). However, even once the data has been displayed, it can be difficult to relate data points between the various plots that are laid out together. For instance, “is the outlier I can see in this x,y plot the same datapoint that stands out in this w,z plot”? “Are the datapoints with high x values in this plot also the ones with high w values in this other plot?” Since points are not usually visibly connected between plots, answering such questions can be difficult and tedious, making it difficult to understand multidimensional datasets. Linked brushing (also called “brushing and linking”) offers an easy way to understand how data points and groups of them relate across different plots. Here “brushing” refers to selecting data points or ranges in one plot, with “linking” then highlighting those same points or ranges in other plots derived from the same data.
As an example, consider the standard “autompg” dataset:
from bokeh.sampledata.autompg import autompg
autompg
 | mpg | cyl | displ | hp | weight | accel | yr | origin | name |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
388 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
389 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
390 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |
391 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 | chevy s-10 |
392 rows × 9 columns
This dataset contains specifications for 392 different types of car models from 1970 to 1982. Each car model represents a particular point in a nine-dimensional space, with a certain mpg, cyl, displ, hp, weight, accel, yr, origin, and name. We can use a gridmatrix to see how each numeric dimension relates to the others:
autompg_ds = hv.Dataset(autompg, ['yr', 'name', 'origin'])
mopts = opts.Points(size=2, tools=['box_select','lasso_select'], active_tools=['box_select'])
gridmatrix(autompg_ds, chart_type=hv.Points).opts(mopts)
These plots show all sorts of interesting relationships already, such as that weight and horsepower are highly positively correlated (locate weight along one axis and hp along the other, and you can see that car models with high weight almost always have high horsepower and vice versa).
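As a rough numerical check of the weight and horsepower relationship, the sketch below computes a Pearson correlation by hand on only the ten rows printed in the table above. That is a small, non-random sample (the first and last five rows), so the value is merely indicative; the helper name `pearson` is introduced here for illustration.

```python
from math import sqrt

# (weight, hp) pairs copied from the ten rows shown in the table above.
rows = [(3504, 130), (3693, 165), (3436, 150), (3433, 150), (3449, 140),
        (2790, 86), (2130, 52), (2295, 84), (2625, 79), (2720, 82)]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson(rows)
print(f"weight vs hp correlation on the ten-row sample: {r:.2f}")
```

A value close to 1 on this sample is consistent with the strong positive trend visible in the weight/hp panels of the grid.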
What if we want to focus specifically on the subset of cars that have 4 cylinders (cyl)? You can do that by pre-filtering the dataframe in Python, but questions like that can be answered immediately using linked brushing, which is automatically supported by gridmatrix plots like this one. First, make sure the “box select” or “lasso select” tool is selected in the toolbar.
Then pick one of the points plots labeled cyl and use the selection tool to select all the values where cyl is 4. All of those points in each plot should remain blue, while the points where cyl is not 4 become more transparent.
You should be able to see that 4-cylinder models have low displacement (displ), low horsepower (hp), and low weight, but tend to have higher fuel efficiency (mpg). Repeatedly selecting subsets of the data in this way can help you understand properties of a multidimensional dataset that may not be visible in the individual plots, without requiring coding and examining additional plots.
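The idea behind brushing and linking can be sketched without any plotting library: a selection made in one view is just a set of row identifiers, and every other view of the same rows highlights those identifiers. The sketch below uses the ten rows printed in the table above; the names `cars`, `view`, and `mean` are illustrative and not part of HoloViews or Bokeh.

```python
# Ten rows copied from the autompg table shown earlier: row index -> record.
cars = {
    0: {"mpg": 18.0, "cyl": 8, "weight": 3504},
    1: {"mpg": 15.0, "cyl": 8, "weight": 3693},
    2: {"mpg": 18.0, "cyl": 8, "weight": 3436},
    3: {"mpg": 16.0, "cyl": 8, "weight": 3433},
    4: {"mpg": 17.0, "cyl": 8, "weight": 3449},
    387: {"mpg": 27.0, "cyl": 4, "weight": 2790},
    388: {"mpg": 44.0, "cyl": 4, "weight": 2130},
    389: {"mpg": 32.0, "cyl": 4, "weight": 2295},
    390: {"mpg": 28.0, "cyl": 4, "weight": 2625},
    391: {"mpg": 31.0, "cyl": 4, "weight": 2720},
}

# "Brushing": select the rows where cyl == 4 in one view.
selected = {i for i, car in cars.items() if car["cyl"] == 4}


def view(dimension):
    """'Linking': (value, highlighted) pairs for one dimension of every row."""
    return [(car[dimension], i in selected) for i, car in cars.items()]


def mean(values):
    return sum(values) / len(values)


mpg_selected = mean([v for v, hit in view("mpg") if hit])
mpg_rest = mean([v for v, hit in view("mpg") if not hit])
weight_selected = mean([v for v, hit in view("weight") if hit])
weight_rest = mean([v for v, hit in view("weight") if not hit])
print(mpg_selected, mpg_rest, weight_selected, weight_rest)
```

On this small sample the selected 4-cylinder rows have higher mpg and lower weight than the rest, which matches what the highlighted points show in the grid.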
Python-based linked brushing#
The above example illustrates Bokeh’s very useful automatic JavaScript-based linked brushing, which can be enabled for Bokeh plots sharing a common data source (as in the gridmatrix call) by simply adding a selection tool. However, this approach offers only a single type of selection, and is not available for Python-based data-processing pipelines such as those using Datashader.
To get more power and flexibility (at the cost of requiring a Python server for deployment if you were not already using one), HoloViews provides a Python-based implementation of linked brushing. HoloViews linked brushing lets you fully customize what elements are used and how linking behaves. Here, let’s make a custom Layout displaying some Points plots for just a few of the available dimensions:
colors = hv.Cycle('Category10').values
dims = ["cyl", "displ", "hp", "mpg", "weight", "yr"]
# One Points element per dimension, each plotted against 'accel'
layout = hv.Layout([
    hv.Points(autompg_ds, [d, 'accel']).opts(color=c)
    for c, d in zip(colors, dims)
])
print(layout)
:Layout
.Points.I :Points [cyl,accel] (mpg,cyl,displ,hp,weight,accel)
.Points.II :Points [displ,accel] (mpg,cyl,displ,hp,weight,accel)
.Points.III :Points [hp,accel] (mpg,cyl,displ,hp,weight,accel)
.Points.IV :Points [mpg,accel] (mpg,cyl,displ,hp,weight,accel)
.Points.V :Points [weight,accel] (mpg,cyl,displ,hp,weight,accel)
.Points.VI :Points [yr,accel] (mpg,cyl,displ,hp,weight,accel)
Now that we have a layout we can simply apply the link_selections operation to support linked brushing, automatically linking the selections across an arbitrary collection of plots that are derived from the same dataset:
link_selections(layout).opts(opts.Points(width=200, height=200)).cols(6)
The same box_select and lasso_select tools should now work as for the gridmatrix plot, but this time by calling back to Python. There are now many more options and capabilities available, as described below, but by default you can now also select additional regions in different elements, and the selected points will be those that match all of the selections, so that you can precisely specify the data points of interest with constraints on all dimensions at once. A bounding box will be shown for each selection, but only the overall selected points (across all selection dimensions) will be highlighted in each plot. You can use the reset tool to clear all the selections and start over.
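The default behaviour described above, where a point is highlighted only if it matches the selections made in every plot, can be sketched in plain Python. The rows are copied from the autompg table; the two boxes and the names `selections`, `in_box`, and `is_selected` are hypothetical and only illustrate the logic, not how HoloViews implements it.

```python
# Rows copied from the autompg table shown earlier.
cars = [
    {"name": "chevrolet chevelle malibu", "weight": 3504, "accel": 12.0, "mpg": 18.0},
    {"name": "buick skylark 320", "weight": 3693, "accel": 11.5, "mpg": 15.0},
    {"name": "plymouth satellite", "weight": 3436, "accel": 11.0, "mpg": 18.0},
    {"name": "amc rebel sst", "weight": 3433, "accel": 12.0, "mpg": 16.0},
    {"name": "ford torino", "weight": 3449, "accel": 10.5, "mpg": 17.0},
    {"name": "ford mustang gl", "weight": 2790, "accel": 15.6, "mpg": 27.0},
    {"name": "vw pickup", "weight": 2130, "accel": 24.6, "mpg": 44.0},
    {"name": "dodge rampage", "weight": 2295, "accel": 11.6, "mpg": 32.0},
    {"name": "ford ranger", "weight": 2625, "accel": 18.6, "mpg": 28.0},
    {"name": "chevy s-10", "weight": 2720, "accel": 19.4, "mpg": 31.0},
]

# One hypothetical box per plot: {dimension: (low, high)}.
selections = {
    "weight_vs_accel": {"weight": (2500, 3500), "accel": (11.0, 19.0)},
    "mpg_vs_accel": {"mpg": (25.0, 35.0)},
}


def in_box(car, box):
    """True if the car lies inside the box on every dimension the box constrains."""
    return all(low <= car[dim] <= high for dim, (low, high) in box.items())


def is_selected(car):
    """A car is highlighted only if it falls inside the box drawn in every plot."""
    return all(in_box(car, box) for box in selections.values())


highlighted = [car["name"] for car in cars if is_selected(car)]
print(highlighted)
```

Each box alone matches four of the ten rows here; only the rows that satisfy both sets of constraints stay highlighted.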
Box-select vs Lasso-select#
Since HoloViews version 1.13.3, linked brushing supports both the box_select and lasso_select tools. Lasso selection provides finer-grained control over the exact region to include in the selection, but it is a much more expensive operation and will not scale as well to very large columnar datasets. Additionally, lasso select has a number of dependencies:
Lasso-select on tabular data requires either spatialpandas or shapely
Lasso-select on gridded data requires datashader
Lasso-select on geometry data requires shapely
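To see why a lasso is more expensive than a box, compare the per-point work involved. The sketch below is not how HoloViews, spatialpandas, or shapely implement the test; it is a textbook ray-casting point-in-polygon check, shown only to illustrate that a box needs four comparisons per point while a lasso needs one edge test per polygon vertex. The function names are illustrative.

```python
def in_box(x, y, x0, y0, x1, y1):
    """Box select: four comparisons per point."""
    return x0 <= x <= x1 and y0 <= y <= y1


def in_lasso(x, y, polygon):
    """Lasso select via ray casting: one edge test per polygon vertex."""
    inside = False
    n = len(polygon)
    for i in range(n):
        (xa, ya), (xb, yb) = polygon[i], polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross the edge a-b?
        if (ya > y) != (yb > y):
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside


# A triangular "lasso" and the box that bounds it.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]

print(in_box(3.0, 3.0, 0.0, 0.0, 4.0, 4.0))   # inside the bounding box
print(in_lasso(3.0, 3.0, triangle))           # but outside the triangle
print(in_lasso(1.0, 1.0, triangle))           # inside the triangle
```

A hand-drawn lasso typically has many vertices, so the cost per point grows with the complexity of the drawn shape, whereas the box test stays constant.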
Filter and selection modes#
Two parameters of link_selections control how the selections apply within a single element (the selection_mode) and across elements (the cross_filter_mode):
selection_mode: Determines how to combine successive selections on the same element, either 'overwrite' (the default, allowing one selection per element), 'intersect' (taking the intersection of all selections for that element), 'union' (the combination of all selections for that element), or 'inverse' (select all but the selection region).
cross_filter_mode: Determines how to combine selections across different elements, either 'overwrite' (allows selecting on only a single element at a time) or 'intersect' (the default, combining selections across all elements).
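The two modes can be modelled with sets of row ids. This is a simplified sketch of the semantics listed above, not the HoloViews implementation: the function names, the row ids, and in particular the treatment of 'inverse' (taken here as everything outside the new region, ignoring earlier selections) are assumptions made for illustration.

```python
ALL_ROWS = frozenset(range(10))  # hypothetical row ids of a ten-row dataset


def combine_within(previous, new, mode, all_rows=ALL_ROWS):
    """selection_mode: combine a new selection with the previous one on the same element."""
    if mode == "overwrite":
        return new
    if mode == "intersect":
        return new if previous is None else previous & new
    if mode == "union":
        return new if previous is None else previous | new
    if mode == "inverse":
        # Assumption: select everything except the new region.
        return all_rows - new
    raise ValueError(f"unknown selection_mode: {mode!r}")


def combine_across(per_element, mode, all_rows=ALL_ROWS):
    """cross_filter_mode: combine selections made on different elements.

    per_element maps element name -> selected row ids, in the order selected.
    """
    if not per_element:
        return all_rows
    if mode == "overwrite":
        # Only the most recently selected element counts.
        return list(per_element.values())[-1]
    if mode == "intersect":
        result = all_rows
        for rows in per_element.values():
            result = result & rows
        return result
    raise ValueError(f"unknown cross_filter_mode: {mode!r}")


first, second = frozenset({1, 2, 3}), frozenset({3, 4})
for mode in ("overwrite", "intersect", "union", "inverse"):
    print(mode, sorted(combine_within(first, second, mode)))

elements = {"scatter": frozenset({1, 2, 3}), "histogram": frozenset({2, 3, 4})}
for mode in ("overwrite", "intersect"):
    print(mode, sorted(combine_across(elements, mode)))
```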
To see how these work, we will create a number of views of the autompg dataset:
w_accel_scatter = hv.Scatter(autompg_ds, 'weight', 'accel')
mpg_hist = histogram(autompg_ds, dimension='mpg', normed=False).opts(color="green")
violin = hv.Violin(autompg_ds, [], 'hp')
We will also capture an “instance” of the link_selections operation, which will allow us to access and set parameters on it even after we call it:
mpg_ls = link_selections.instance()
mpg_ls(w_accel_scatter + mpg_hist + violin)
Here you can select on both the Scatter plot and the Histogram. With these default settings, selecting on different elements computes the intersection of the two selections, allowing you to e.g. select only the points with high weight but mpg between 20 and 30. In the Scatter plot, the selected region will be shown as a rectangular bounding box, with the unselected points inside being transparent. On the histogram, data points selected on the histogram but not in other selections will be drawn in gray, data points not selected on either element will be transparent, and only those points that are selected in both plots will be shown in the default blue color. The Violin plot does not itself allow selections, but it will update to show the distribution of the selected points, with the original distribution being lighter (more transparent) behind it for comparison. Here, selecting high weights and intermediate mpg gives points with a lower range of horsepower in the Violin plot.
The way this all works is for each selection to be collected into a shared “selection expression” that is then applied by every linked plot after any change to a selection:
mpg_ls.selection_expr
For example, a box selection on the weight,accel scatter element might produce an expression like this:
(((dim('weight') >= (3125.237)) & (dim('weight') <= (3724.860))) & (dim('accel') >= (13.383))) & (dim('accel') <= (19.678))
Additional selections in other plots add to this list of filters if enabled, while additional selections within the same plot are combined with an operator that depends on the selection_mode.
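The way such an expression is built up and then applied can be imitated with a toy class. The `Expr` class below is a stand-in written for this sketch, not HoloViews' `dim`; it only shows how comparisons can be composed with `&`, `|`, and `~` into one shared predicate that each linked view evaluates against its own rows. The bounds are those of the example expression above; two of the three rows are hypothetical, as marked.

```python
class Expr:
    """Toy selection expression: a row -> bool test plus a readable label."""

    def __init__(self, test, label):
        self.test = test
        self.label = label

    def __and__(self, other):
        return Expr(lambda row: self.test(row) and other.test(row),
                    f"({self.label}) & ({other.label})")

    def __or__(self, other):
        return Expr(lambda row: self.test(row) or other.test(row),
                    f"({self.label}) | ({other.label})")

    def __invert__(self):
        return Expr(lambda row: not self.test(row), f"~({self.label})")

    def __call__(self, row):
        return self.test(row)


def at_least(name, bound):
    return Expr(lambda row: row[name] >= bound, f"{name} >= {bound}")


def at_most(name, bound):
    return Expr(lambda row: row[name] <= bound, f"{name} <= {bound}")


# Bounds copied from the example box selection above.
box = (at_least("weight", 3125.237) & at_most("weight", 3724.860)
       & at_least("accel", 13.383) & at_most("accel", 19.678))

rows = [
    {"weight": 3504, "accel": 12.0},  # first row of the autompg table
    {"weight": 3400, "accel": 15.0},  # hypothetical row inside the box
    {"weight": 2500, "accel": 15.0},  # hypothetical row, too light
]

# Every linked view would evaluate the same shared expression on its own rows.
mask = [box(row) for row in rows]
print(box.label)
print(mask)
```

Combining with `|` or `~` instead of `&` corresponds to the 'union' and 'inverse' style of combination within a single plot.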
To better understand how to configure linked brushing, let’s create a Panel that makes widgets for the parameters of the link_selections operation and lets us explore their effect interactively. Play around with different cross_filter_mode and selection_mode settings and observe their effects (hitting reset when needed to get back to an unselected state):
import panel as pn
mpg_lsp = link_selections.instance()
params = pn.Param(mpg_lsp, parameters=[
    'cross_filter_mode', 'selection_mode', 'show_regions',
    'selected_color', 'unselected_alpha', 'unselected_color'])
pn.Row(params, mpg_lsp(w_accel_scatter + mpg_hist + violin))
Note that in recent versions of Bokeh (>=2.1.0) and HoloViews (>=1.13.4) it is also possible to toggle the selection mode directly in the Bokeh toolbar, using the menu on the box-select and lasso-select tools.
Index-based selections#
So far we have worked entirely using range-based selections, which result in selection expressions based only on the axis ranges selected, not the actual data points. Range-based selection requires that all selectable dimensions are present on the datasets behind every plot, so that the selection expression can be evaluated to filter every plot down to the correct set of data points. Range-based selections also only support the box_select tool, as they filter the data based on a rectangular region of the visible space in that plot. (Of course, you can still combine multiple such boxes to build up to selections of other shapes, with selection_mode='union'.)
You can also choose to use index-based selections, which generate expressions based not on axis ranges but on values of one or more index columns (selecting individual, specific data points, as for the Bokeh JavaScript-based linked brushing). For index-based selections, plots can be linked as long as the datasets underlying each plot all have those index columns, so that expressions generated from a selection on one plot can be applied to all of the plots. Ordinarily the index columns should be unique in combination (e.g. Firstname,Lastname), each specifying one particular data point out of your data so that it can be correlated across all plots.
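The difference from range-based selection can be sketched in plain Python: the selection is recorded as a set of index values rather than axis bounds, so it can be applied to any view that carries the same index column, even one that shares no plotted dimension with the view where the points were picked. The values below come from the autompg table; the variable and function names are illustrative only.

```python
# Two views of the same cars, keyed by the shared index column 'name'.
# Values are copied from the autompg table shown earlier.
mpg_by_name = {"vw pickup": 44.0, "dodge rampage": 32.0,
               "ford ranger": 28.0, "chevy s-10": 31.0}
hp_by_name = {"vw pickup": 52, "dodge rampage": 84,
              "ford ranger": 79, "chevy s-10": 82}

# Index-based selection: points picked in the mpg view are recorded by name.
selected_names = {"vw pickup", "dodge rampage"}


def apply_selection(view, selected):
    """Keep only the entries whose index value is in the selection."""
    return {name: value for name, value in view.items() if name in selected}


highlighted_hp = apply_selection(hp_by_name, selected_names)
print(highlighted_hp)

# A non-unique index column matches more rows than were picked: using
# 'origin' as the index, picking one car selects every car sharing its origin.
origin_by_name = {"vw pickup": 2, "dodge rampage": 1,
                  "ford ranger": 1, "chevy s-10": 1}
picked_origins = {origin_by_name["dodge rampage"]}
matched = {name for name, origin in origin_by_name.items()
           if origin in picked_origins}
print(sorted(matched))
```

The second half shows why the index columns should normally identify each data point uniquely.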
To use index-based selections, specify the index_cols that are present across your elements. In the example below we will load the shapes and names of counties from the US state of Texas and their corresponding unemployment rates. We then generate a choropleth plot and a histogram plot both displaying the unemployment rate.
from bokeh.sampledata.us_counties import data as counties
from bokeh.sampledata.unemployment import data as unemployment
counties = [dict(county, Unemployment=unemployment[cid])
            for cid, county in counties.items()
            if county["state"] == "tx"]
detailed_name = 'detailed_name' if counties[0].get('detailed_name') else 'detailed name' # detailed name was changed in Bokeh 3.0
choropleth = hv.Polygons(counties, ['lons', 'lats'], [(detailed_name, 'County'), 'Unemployment'])
hist = choropleth.hist('Unemployment', adjoin=False, normed=False)
To link the two we will specify the county name column (whose name is stored in detailed_name, since it differs between Bokeh versions) as the index_cols.
linked_choropleth = link_selections(choropleth + hist, index_cols=[detailed_name])
Now that the two plots are linked we can display them and select individual polygons by tapping or apply a box selection on the histogram:
linked_choropleth.opts(
    hv.opts.Polygons(tools=['hover', 'tap', 'box_select'], xaxis=None, yaxis=None,
                     show_grid=False, show_frame=False, width=500, height=500,
                     color='Unemployment', colorbar=True, line_color='white'),
    hv.opts.Histogram(width=500, height=500)
)