Working with large data using datashader
import datashader as ds
import numpy as np
import holoviews as hv
import pandas as pd

from holoviews import opts
from holoviews.operation.datashader import datashade, rasterize, shade, dynspread, spread
from holoviews.operation.resample import ResampleOperation2D
from holoviews.operation import decimate

hv.extension('bokeh','matplotlib', width=100)

# Default values suitable for this notebook
decimate.max_samples=1000
dynspread.max_px=20
dynspread.threshold=0.5
ResampleOperation2D.width=500
ResampleOperation2D.height=500

def random_walk(n, f=5000):
    """Random walk in a 2D space, smoothed with a filter of length f"""
    xs = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()
    ys = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()
    xs += 0.1*np.sin(0.1*np.array(range(n-1+f))) # add wobble on x axis
    xs += np.random.normal(0, 0.005, size=n-1+f) # add measurement noise
    ys += np.random.normal(0, 0.005, size=n-1+f)
    return np.column_stack([xs, ys])

def random_cov():
    """Random covariance for use in generating 2D Gaussian distributions"""
    A = np.random.randn(2,2)
    return np.dot(A, A.T)

def time_series(T = 1, N = 100, mu = 0.1, sigma = 0.1, S0 = 20):
    """Parameterized noisy time series"""
    dt = float(T)/N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size = N)
    W = np.cumsum(W)*np.sqrt(dt) # standard brownian motion
    X = (mu-0.5*sigma**2)*t + sigma*W
    S = S0*np.exp(X) # geometric brownian motion
    return S
When this page is viewed statically, without a live Python process behind it, the plots will not update fully when you zoom and pan.
Principles of datashading
Because HoloViews elements are fundamentally data containers, not visualizations, you can very quickly declare elements such as Points or Path containing datasets that may be as large as the full memory available on your machine (or even larger if using Dask dataframes). So even for very large datasets, you can easily specify a data structure that you can work with for making selections, sampling, aggregations, and so on. However, as soon as you try to visualize it directly with the Matplotlib, Plotly, or Bokeh plotting extensions, the rendering process may be prohibitively expensive.
Let’s start with a simple example that’s easy to visualize in any plotting library:
np.random.seed(1)
points = hv.Points(np.random.multivariate_normal((0,0), [[0.1, 0.1], [0.1, 1.0]], (1000,)),label="Points")
paths = hv.Path([random_walk(2000,30)], kdims=["u","v"], label="Paths")

points + paths
These browser-based plots are fully interactive, as you can see if you select the Wheel Zoom or Box Zoom tools and use your scroll wheel or click and drag.
Because all of the data in these plots gets transferred directly into the web browser, the interactive functionality will be available even on a static export of this figure as a web page. Note that although the visualization above is not computationally expensive, the scatterplot already suffers from overplotting with just 1000 points: later points obscure previously plotted ones.
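One rough way to quantify the overplotting described above is to bin the same 1000 samples onto a coarse grid and count how many points share a cell with another point. The sketch below uses plain NumPy. The 50x50 grid is an arbitrary assumption standing in for marker-sized regions of the plot, not something HoloViews computes.

```python
import numpy as np

np.random.seed(1)
# Same distribution as the 1000-point scatterplot above
samples = np.random.multivariate_normal((0, 0), [[0.1, 0.1], [0.1, 1.0]], (1000,))

# Assumption: a 50x50 grid roughly approximates marker-sized regions of the plot
counts, _, _ = np.histogram2d(samples[:, 0], samples[:, 1], bins=50)

occupied = int((counts > 0).sum())     # cells containing at least one point
hidden = int(counts.sum()) - occupied  # points beyond the first in each occupied cell
print(f"{occupied} occupied cells; {hidden} of {len(samples)} points share a cell")
```

Any point counted in hidden lands in a cell that already holds another point, so it would be at least partly drawn over in a plain scatterplot.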
With much larger datasets, these issues will quickly make it impossible to see the true structure of the data. We can easily declare 50X or 1000X larger versions of the same plots above, but if we tried to visualize them directly they would be unusably slow even if the browser did not crash:
np.random.seed(1)
points = hv.Points(np.random.multivariate_normal((0,0), [[0.1, 0.1], [0.1, 1.0]], (1000000,)),label="Points")
paths = hv.Path([0.15*random_walk(100000) for i in range(10)], kdims=["u","v"], label="Paths")

#points + paths  ## Danger! Browsers can't handle 1 million points!
Luckily, HoloViews Elements are just containers for data and associated metadata, not plots, so HoloViews can generate entirely different types of visualizations from the same data structure when appropriate. For instance, in the plot on the left below you can see the result of applying a decimate() operation to the points object, which will automatically downsample this million-point dataset to at most 1000 points at any time as you zoom in or out:
decimate( points).relabel("Decimated Points") + \
rasterize(points).relabel("Rasterized Points").opts(colorbar=True, width=350) + \
rasterize(paths ).relabel("Rasterized Paths")
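To make the decimation idea concrete, here is a minimal NumPy sketch of viewport-limited random subsampling. It illustrates the general technique only and is not the actual implementation of decimate(); the function name decimate_sketch, its parameters, and the fixed seed are hypothetical choices made for this example.

```python
import numpy as np

def decimate_sketch(xy, x_range, y_range, max_samples=1000, seed=42):
    """Hypothetical helper: return at most max_samples points inside the viewport."""
    (x0, x1), (y0, y1) = x_range, y_range
    # Keep only the points that fall inside the current viewport
    inside = (
        (xy[:, 0] >= x0) & (xy[:, 0] <= x1) &
        (xy[:, 1] >= y0) & (xy[:, 1] <= y1)
    )
    visible = xy[inside]
    if len(visible) <= max_samples:
        return visible
    # Randomly pick max_samples of the visible points, without replacement
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(visible), size=max_samples, replace=False)
    return visible[idx]

np.random.seed(1)
data = np.random.multivariate_normal((0, 0), [[0.1, 0.1], [0.1, 1.0]], (1000000,))

zoomed_out = decimate_sketch(data, (-2, 2), (-5, 5))
zoomed_in = decimate_sketch(data, (-0.01, 0.01), (-0.01, 0.01))
print(len(zoomed_out), len(zoomed_in))
```

When the viewport is wide, the cap of 1000 applies and most samples are dropped. After zooming in far enough, fewer than 1000 points remain visible, so all of them are returned. This is why a decimated plot can reveal more of the underlying data as you zoom.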