Tabular Datasets#

As we have already discovered, Elements are simple wrappers around your data that provide a semantically meaningful visual representation. HoloViews can work with a wide variety of data types, but many of them can be categorized as either:

  • Tabular: Tables of flat columns, or

  • Gridded: Array-like data on 2-dimensional or N-dimensional grids

These two general data types are explained in detail in the Tabular Data and Gridded Data user guides, which cover the many supported formats (including Python dictionaries of NumPy arrays, pandas DataFrames, dask DataFrames, and xarray DataArrays and Datasets).

In this Getting-Started guide we provide a quick overview and introduction to two of the most flexible and powerful formats: columnar pandas DataFrames (in this section), and gridded xarray Datasets (in the next section).

Tabular#

Tabular data (also called columnar data) is one of the most common, general, and versatile data formats, corresponding to how data is laid out in a spreadsheet. There are many different ways to put data into a tabular format, but for interactive analysis, tidy data provides flexibility and simplicity. In tidy data, the columns of the table represent variables or dimensions and the rows represent observations. The best way to understand this format is to look at such a dataset:

import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension('bokeh', 'matplotlib')
diseases = pd.read_csv('../assets/diseases.csv.gz')
diseases.head()

This particular dataset was the subject of an excellent piece of visual journalism in the Wall Street Journal. The WSJ data details the incidence of various diseases over time, and was downloaded from the University of Pittsburgh’s Project Tycho. We can see we have 5 data columns, which each correspond either to independent variables that specify a particular measurement (‘Year’, ‘Week’, ‘State’), or observed/dependent variables reporting what was then actually measured (the ‘measles’ or ‘pertussis’ incidence).
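
The tidy layout described above, with each independent variable in its own column and one measured value per row, can be sketched with a small reshaping example. The `wide` table and all of its values are invented for illustration and are not taken from the diseases dataset; `melt` is simply one pandas way to reach a tidy form.

```python
import pandas as pd

# Hypothetical wide table: one column per state (values are made up).
wide = pd.DataFrame({
    'Year': [1930, 1931],
    'Texas': [1.5, 2.0],
    'Ohio': [3.0, 4.5],
})

# Tidy form: one row per observation, one column per variable.
# 'Year' and 'State' play the role of independent variables,
# 'measles' is the observed value.
tidy = wide.melt(id_vars='Year', var_name='State', value_name='measles')
print(tidy)
```

In the tidy version, adding another state adds rows rather than columns, so the set of variables stays fixed.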

Knowing the distinction between those two types of variables is crucial for doing visualizations, but unfortunately the tabular format does not declare this information. Plotting ‘Week’ against ‘State’ would not be meaningful, whereas ‘measles’ for each ‘State’ (averaging or summing across the other dimensions) would be fine, and there’s no way to deduce those constraints from the tabular format. Accordingly, we will first make a HoloViews object called a Dataset that declares the independent variables (called key dimensions or kdims in HoloViews) and dependent variables (called value dimensions or vdims) that you want to work with:

vdims = [('measles', 'Measles Incidence'), ('pertussis', 'Pertussis Incidence')]
ds = hv.Dataset(diseases, ['Year', 'State'], vdims)

Here we’ve used an optional tuple-based syntax (name, label) to specify a more meaningful description for the vdims, while using the original short names for the two kdims. We haven’t yet specified what to do with the Week dimension, but we are only interested in yearly averages, so let’s just tell HoloViews to average over all remaining dimensions:

ds = ds.aggregate(function=np.mean)
ds

(We’ll cover aggregations like np.mean in detail later, but here the important bit is simply that the Week dimension can now be ignored.)
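
As a rough illustration of what this averaging computes, the same reduction can be written directly in pandas. This is only a conceptual equivalent under made-up data, not a description of how HoloViews implements aggregate; the `weekly` table is hypothetical.

```python
import pandas as pd

# Hypothetical weekly records (values are made up).
weekly = pd.DataFrame({
    'Year':  [1930, 1930, 1930, 1930],
    'Week':  [1, 2, 1, 2],
    'State': ['Texas', 'Texas', 'Ohio', 'Ohio'],
    'measles':   [2.0, 4.0, 1.0, 3.0],
    'pertussis': [0.5, 1.5, 2.0, 2.0],
})

# Keep 'Year' and 'State' as keys and average the value columns,
# which collapses the 'Week' column.
yearly = (weekly
          .groupby(['Year', 'State'], as_index=False)[['measles', 'pertussis']]
          .mean())
print(yearly)
```

The result has one row per (Year, State) pair and no 'Week' column, which mirrors the dimensions of the aggregated Dataset.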

The repr shows us both the kdims (in square brackets) and the vdims (in parentheses) of the Dataset. Because it can hold arbitrary combinations of dimensions, a Dataset is not immediately visualizable. There’s no single clear mapping from these four dimensions onto a two-dimensional page, hence the textual representation shown above.

To make this data visualizable, we’ll need to provide a bit more metadata, by selecting one of the large library of Elements that can help answer the questions we want to ask about the data. Perhaps the most obvious representation of this dataset is as a Curve displaying the incidence for each year, for each state. We could pull out individual columns one by one from the original dataset, but now that we have declared information about the dimensions, the cleanest approach is to map the dimensions of our Dataset onto the dimensions of an Element using .to:

layout = (ds.to(hv.Curve, 'Year', 'measles') + ds.to(hv.Curve, 'Year', 'pertussis')).cols(1)
layout.opts(
    opts.Curve(width=600, height=250, framewise=True))

Here we specified two Curve elements showing measles and pertussis incidence respectively (the vdims), per year (the kdim), and laid them out in a vertical column. You’ll notice that even though we specified only the short name for the value dimensions, the plot shows the longer names (“Measles Incidence”, “Pertussis Incidence”) that we declared on the Dataset.

You’ll also notice that we automatically received a dropdown menu to select which State to view. Each Curve ignores unused value dimensions, because additional measurements don’t affect each other, but HoloViews has to do something with every key dimension for every such plot. If the State (or any other key dimension) isn’t somehow plotted or aggregated over, then HoloViews has to leave choosing a value for it to the user, hence the selection widget. Other options for what to do with extra dimensions or just extra data ranges are illustrated below.

Selecting#

One of the most common things we might want to do is to select only a subset of the data. The select method makes this extremely easy, letting you select a single value, several values supplied as a list, or a range of values supplied as a tuple. Here we will use select to display the measles incidence in four states over one decade. After applying the selection, we use the .to method as shown earlier, now displaying the data as Bars indexed by the ‘Year’ and ‘State’ key dimensions and displaying the ‘Measles Incidence’ value dimension:

states = ['New York', 'New Jersey', 'California', 'Texas']
bars = ds.select(State=states, Year=(1980, 1990)).to(hv.Bars, ['Year', 'State'], 'measles').sort()
bars.opts(
    opts.Bars(width=800, height=400, tools=['hover'], xrotation=90, show_legend=False))
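
To show what a list selection combined with a range selection amounts to, here is a minimal pandas sketch using boolean masks. The data are made up, and treating the tuple as a half-open range (lower bound included, upper bound excluded) is an assumption of this sketch; check the select documentation for the exact boundary behavior.

```python
import pandas as pd

# Hypothetical yearly records (values are made up).
yearly = pd.DataFrame({
    'Year':  [1979, 1980, 1989, 1990, 1985],
    'State': ['Texas', 'Texas', 'Ohio', 'Texas', 'Utah'],
    'measles': [5.0, 4.0, 1.0, 0.5, 2.0],
})

states = ['Texas', 'Ohio']
lo, hi = 1980, 1990

# Assumption: the range is half-open, lo <= Year < hi.
mask = (yearly['State'].isin(states)
        & (yearly['Year'] >= lo)
        & (yearly['Year'] < hi))
subset = yearly[mask]
print(subset)
```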

Faceting#

Above we already saw what happens to key dimensions that we didn’t explicitly assign to the Element using the .to method: they are grouped over, popping up a set of widgets so the user can select the values to show at any one time. However, using widgets is not always the most effective way to view the data, and a Dataset lets you specify other alternatives using the .overlay, .grid and .layout methods. For instance, we can lay out each state separately using .grid:

grouped = ds.select(State=states, Year=(1930, 2005)).to(hv.Curve, 'Year', 'measles')
gridspace = grouped.grid('State')
gridspace.opts(
    opts.Curve(width=200, color='indianred'))

Or we can take the same grouped object and .overlay the individual curves instead of laying them out in a grid:

ndoverlay = grouped.overlay('State')
ndoverlay.opts(
    opts.Curve(width=600, color=hv.Cycle(values=['indianred', 'slateblue', 'lightseagreen', 'coral'])))

These faceting methods even compose together, meaning that if we had more key dimensions we could .overlay one dimension, .grid another and have a widget for any other remaining key dimensions.
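
Conceptually, each facet corresponds to one group of rows sharing a value of the faceted key dimension. The sketch below uses a plain pandas groupby on made-up data to show that partition; it illustrates the idea only and does not reflect how .grid or .overlay are implemented.

```python
import pandas as pd

# Hypothetical yearly records (values are made up).
yearly = pd.DataFrame({
    'Year':  [1930, 1931, 1930, 1931],
    'State': ['Texas', 'Texas', 'Ohio', 'Ohio'],
    'measles': [3.0, 2.5, 2.0, 1.0],
})

# One sub-table per state: each would correspond to one grid cell
# or one overlaid curve.
facets = {
    state: group[['Year', 'measles']].reset_index(drop=True)
    for state, group in yearly.groupby('State')
}

for state, table in facets.items():
    print(state)
    print(table)
```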

Aggregating#

Instead of selecting a subset of the data, another common operation supported by HoloViews is computing aggregates. When we first loaded this dataset, we aggregated over the ‘Week’ column to compute the mean incidence for every year, thereby reducing our data significantly. The aggregate method is therefore very useful to compute statistics from our data.

A simple example using our dataset is to compute the mean and standard deviation of the Measles Incidence by ‘Year’. We can express this simply by passing the key dimensions to group by (in this case just ‘Year’), so that all other key dimensions are aggregated over, along with a function and an optional spreadfn to compute the statistics we want. The spreadfn appends the name of the function to the dimension name (here giving ‘measles_std’) so we can reference the computed value separately. Once we have computed the aggregate, we can simply cast it to a Curve and ErrorBars:

agg = ds.aggregate('Year', function=np.mean, spreadfn=np.std)
errorbars = hv.ErrorBars(agg, vdims=['measles', 'measles_std']).iloc[::2]
overlay = (hv.Curve(agg) * errorbars).redim.range(measles=(0, None))

overlay.opts(width=600)

In this way we can summarize a multi-dimensional dataset as something that can be visualized directly, while allowing us to compute arbitrary statistics along a dimension.
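
For readers who want to see the statistics themselves, here is a hedged pandas sketch of a per-year mean and spread on made-up data. It uses the population standard deviation (ddof=0), which is NumPy's default for np.std; whether that convention matches what HoloViews applies for a given data backend is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly records for two states (values are made up).
yearly = pd.DataFrame({
    'Year':  [1930, 1930, 1931, 1931],
    'State': ['Texas', 'Ohio', 'Texas', 'Ohio'],
    'measles': [3.0, 1.0, 4.0, 2.0],
})

grouped = yearly.groupby('Year')['measles']
agg = pd.DataFrame({
    'measles': grouped.mean(),
    # Assumption: population standard deviation, as in np.std's default.
    'measles_std': grouped.std(ddof=0),
}).reset_index()
print(agg)
```

The resulting columns follow the same naming pattern as the aggregate above: the value column plus a second column with the spread function's name appended.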

Other data#

If you want to know more about working with tabular data, particularly when using datatypes other than pandas, have a look at the user guide. The different interfaces allow you to work with everything from simple NumPy arrays to out-of-core dataframes using dask. When datashader is used with HoloViews to aggregate the data as needed, dask dataframes scale to visualizations of billions of rows.
