Interactive Hover for Big Data

When visualizing large datasets with Datashader, you can easily identify macro-level patterns. However, the aggregation process that converts the data into an image can make it difficult to examine individual data points, especially when several of them occupy the same pixel. This raises a challenge: how do you explore individual data points without sacrificing the benefits of aggregation?

To solve this problem, HoloViews offers the selector keyword, which makes it possible for the hover tooltip to include information about the underlying data points when using a Datashader operation (rasterize or datashade).

The selector mechanism retrieves the specific details efficiently on the server side, without searching through the entire dataset or sending all the data to the browser. This allows users working with large datasets to detect big-picture patterns while also accessing information about individual points.

This notebook demonstrates how to use selector, which creates a dynamic hover tool that keeps the interactive experience fast and smooth with very large datasets and makes it easier to explore and understand complex visualizations.

Note

This notebook uses dynamic updates, which require a live Jupyter or Bokeh server. When viewed statically, you can still zoom and pan, but the plots will not update and hover information will not be available.

Note

This functionality requires Bokeh version 3.7 or greater.
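
A quick way to check this requirement before running the notebook is sketched below. It uses only the standard library; the helper name `meets_minimum` is hypothetical and not part of Bokeh or HoloViews, and it compares only the major and minor version numbers.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(version_string, minimum=(3, 7)):
    """Return True if the leading major.minor of version_string is >= minimum.

    Hypothetical helper for illustration; suffixes such as "rc1" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


try:
    installed = version("bokeh")  # reads the installed package metadata
    status = "meets" if meets_minimum(installed) else "does not meet"
    print(f"Bokeh {installed} {status} the 3.7 requirement")
except PackageNotFoundError:
    print("Bokeh is not installed in this environment")
```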

Let’s start by creating a Points element from a DataFrame that combines five datasets. Each dataset draws its x and y coordinates from a normal distribution centered at a specific (x, y) location, and each uses a different standard deviation. The datasets, labeled d1 through d5, represent different clusters:

  • d1 is tightly clustered around (2, 2) with a small spread of 0.03,

  • d2 is around (2, -2) with a wider spread of 0.10,

  • d3 is around (-2, -2) with even more dispersion at 0.50,

  • d4 is broadly spread around (-2, 2) with a standard deviation of 1.00,

  • and d5 has the widest spread of 3.00 centered at the origin (0, 0).

Each point also carries three more columns: s (the standard deviation of its distribution), val (a numeric index for its dataset), and cat (the dataset label). The total dataset contains 50,000 points, evenly split across the five distributions.

import datashader as ds
import numpy as np
import pandas as pd

import holoviews as hv
from holoviews.operation.datashader import datashade, dynspread, rasterize

hv.extension("bokeh")

# Set default hover tools on various plot types
hv.opts.defaults(hv.opts.RGB(tools=["hover"]), hv.opts.Image(tools=["hover"]))


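def create_synthetic_dataset(x, y, s, val, cat):
    # Re-create the generator with a fixed seed on every call so the output is reproducible
    rng = np.random.default_rng(1)
    num = 10_000  # points per dataset
    return pd.DataFrame(
        {"x": rng.normal(x, s, num), "y": rng.normal(y, s, num), "s": s, "val": val, "cat": cat}
    )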


df = pd.concat(
    {
        cat: create_synthetic_dataset(x, y, s, val, cat)
        for x, y, s, val, cat in [
            (2, 2, 0.03, 0, "d1"),
            (2, -2, 0.10, 1, "d2"),
            (-2, -2, 0.50, 2, "d3"),
            (-2, 2, 1.00, 3, "d4"),
            (0, 0, 3.00, 4, "d5"),
        ]
    },
    ignore_index=True,
)


points = hv.Points(df)

# Show a sample from each dataset
df.iloc[[0, 10_000, 20_000, 30_000, 40_000]]
                x          y     s  val  cat
0       2.010368   1.982550  0.03    0   d1
10000   2.034558  -2.058168  0.10    1   d2
20000  -1.827208  -2.290838  0.50    2   d3
30000  -1.654416   1.418324  1.00    3   d4
40000   1.036753  -1.745027  3.00    4   d5

Datashader Operations

Datashader is used to convert the points into a rasterized image. Two common operations are:

  • rasterize: Converts points into an image grid where each pixel aggregates data. The default is to count the number of points per pixel.

  • datashade: Applies a color map to the rasterized data, outputting RGBA values.

The default aggregator counts the points per pixel, but you can specify a different aggregator, for example, ds.mean("s") to calculate the mean of the s column. For more information, see the Large Data user guide.

rasterized = rasterize(points)
shaded = datashade(points)
rasterized + shaded
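
To make the default count aggregation concrete, here is a minimal pure-Python sketch of binning points into a pixel grid and counting the points in each cell. It is a conceptual illustration only, not how Datashader implements it; the function name `count_per_pixel` and the tiny 4x4 grid are assumptions chosen for readability.

```python
import random


def count_per_pixel(xs, ys, x_range, y_range, width, height):
    """Bin points into a height x width grid and count the points in each cell."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # points outside the viewport do not contribute
        # Map the coordinate to a cell index; min() guards against float rounding
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


# A tight cluster around (2, 2), similar in spirit to the d1 dataset
rng = random.Random(1)
xs = [rng.gauss(2, 0.03) for _ in range(1_000)]
ys = [rng.gauss(2, 0.03) for _ in range(1_000)]

grid = count_per_pixel(xs, ys, (1.5, 2.5), (1.5, 2.5), width=4, height=4)
for row in grid:
    print(row)
```

Once the points are reduced to counts, the grid no longer records which rows landed in which cell. That is why a plain aggregate cannot, on its own, tell you about individual points.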

Selectors are a subtype of Aggregators

Both aggregator and selector relate to performing an operation on data points in a pixel, but it’s important to understand the difference.

When multiple data points fall into the same pixel, Datashader needs to derive a single value from this collection to form an image. This is done with an aggregator, which either combines the points (such as taking the mean of a column) or selects a single value from them (such as the min of a column).

[Figure: aggregator]

Let’s see a couple of different aggregators in action:

# Combine data points for the aggregation:
rasterized_mean = rasterize(points, aggregator=ds.mean("s")).opts(title="Aggregate is Mean of s column")

# Select a data point for the aggregation:
rasterized_max = rasterize(points, aggregator=ds.max("s")).opts(title="Aggregate is Max of s column")

rasterized_mean + rasterized_max
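
The difference between combining and selecting can be sketched in plain Python. The rows and pixel assignments below are hypothetical and the code is a conceptual illustration, not Datashader's implementation. It shows that a mean generally matches no single row, while a max can be traced back to one row whose other columns can then be looked up.

```python
from collections import defaultdict

# Hypothetical rows: (pixel, s, cat). Three points share pixel (0, 0).
rows = [
    ((0, 0), 0.03, "d1"),
    ((0, 0), 3.00, "d5"),
    ((0, 0), 0.50, "d3"),
    ((1, 0), 1.00, "d4"),
]

# Group row indices by the pixel they fall into
by_pixel = defaultdict(list)
for index, (pixel, _s, _cat) in enumerate(rows):
    by_pixel[pixel].append(index)

mean_s = {}   # combining: one value computed from all rows in the pixel
max_row = {}  # selecting: the index of the single row with the largest s
for pixel, indices in by_pixel.items():
    values = [rows[i][1] for i in indices]
    mean_s[pixel] = sum(values) / len(values)
    max_row[pixel] = max(indices, key=lambda i: rows[i][1])

for pixel in by_pixel:
    selected = rows[max_row[pixel]]
    print(pixel, "mean s:", round(mean_s[pixel], 3), "max s row:", selected[1:], "index", max_row[pixel])
```

Because the selecting case keeps a row index per pixel, every other column of that row (here, cat) remains available for that pixel.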