Interactive Hover for Big Data#
Interactive Hover for Big Data#
When visualizing large datasets with Datashader, you can easily identify macro level patterns. However, the aggregation process that converts the data into an image can make it difficult to examine individual data points, especially if they occupy the same pixel. This sets up a challenge: how do you explore individual data points without sacrificing the benefits of aggregation?
To solve this problem, HoloViews offers the selector
keyword, which makes it possible for the hover tooltip to include information about the underlying data points when using a Datashader operation (rasterize
or datashade
).
The selector
mechanism performantly retrieves the specific details on the server side without having to search through the entire dataset or having to send all the data to the browser. This allows users working with large datasets to detect big picture patterns while also accessing information about individual points.
This notebook demonstrates how to use selector
, which creates a dynamic hover tool that keeps the interactive experience fast and smooth with very large datasets and makes it easier to explore and understand complex visualizations.
Note
This notebook uses dynamic updates, which require running a live Jupyter or Bokeh server. When viewed statically, the plots will not update, you can zoom and pan, and hover information will not be available.
Note
This functionality requires Bokeh version 3.7 or greater.
Let’s start by creating a Points element with a DataFrame consisting of five datasets combined. Each of the datasets has a random x, y-coordinate based on a normal distribution centered at a specific (x, y) location, with varying standard deviations. The datasets—labeled d1
through d5
—represent different clusters:
d1
is tightly clustered around (2, 2) with a small spread of 0.03,d2
is around (2, -2) with a wider spread of 0.10,d3
is around (-2, -2) with even more dispersion at 0.50,d4
is broadly spread around (-2, 2) with a standard deviation of 1.00,and
d5
has the widest spread of 3.00 centered at the origin (0, 0).
Each point also carries a val
and cat
column to identify its dataset and category. The total dataset contains 50,000 points, evenly split across the five distributions.
import datashader as ds
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews.operation.datashader import datashade, dynspread, rasterize
hv.extension("bokeh")
# Set default hover tools on various plot types
hv.opts.defaults(hv.opts.RGB(tools=["hover"]), hv.opts.Image(tools=["hover"]))
def create_synthetic_dataset(x, y, s, val, cat):
seed = np.random.default_rng(1)
num = 10_000
return pd.DataFrame(
{"x": seed.normal(x, s, num), "y": seed.normal(y, s, num), "s": s, "val": val, "cat": cat}
)
df = pd.concat(
{
cat: create_synthetic_dataset(x, y, s, val, cat)
for x, y, s, val, cat in [
(2, 2, 0.03, 0, "d1"),
(2, -2, 0.10, 1, "d2"),
(-2, -2, 0.50, 2, "d3"),
(-2, 2, 1.00, 3, "d4"),
(0, 0, 3.00, 4, "d5"),
]
},
ignore_index=True,
)
points = hv.Points(df)
# Show a sample from each dataset
df.iloc[[0, 10_000, 20_000, 30_000, 40_000]]
Datashader Operations#
Datashader is used to convert the points into a rasterized image. Two common operations are:
rasterize
: Converts points into an image grid where each pixel aggregates data. The default is to count the number of points per pixel.datashade
: Applies a color map to the rasterized data, outputting RGBA values
The default aggregator counts the points per pixel, but you can specify a different aggregator, for example, ds.mean("s")
to calculate the mean of the s
column. For more information, see the Large Data user guide.
rasterized = rasterize(points)
shaded = datashade(points)
rasterized + shaded
Selectors are a subtype of Aggregators#
Both aggregator
and selector
relate to performing an operation on data points in a pixel, but it’s important to understand the difference.
When multiple data points fall into the same pixel, Datashader needs to get a single value from this collection to form an image. This is done with an aggregator
that can specify if the points should be combined (such as the mean
of a column) or that a single value should just be selected (such as the min
of a column).
Let’s see a couple of different aggregators in action:
# Combine data points for the aggregation:
rasterized_mean = rasterize(points, aggregator=ds.mean("s")).opts(title="Aggregate is Mean of s col")
# Select a data point for the aggregation:
rasterized_max = rasterize(points, aggregator=ds.max("s")).opts(title="Aggregate is Max of s column")
rasterized_mean + rasterized_max