Tabular Datasets#
As we have already discovered, Elements are simple wrappers around your data that provide a semantically meaningful visual representation. HoloViews can work with a wide variety of data types, but many of them can be categorized as either:
Tabular: Tables of flat columns, or
Gridded: Array-like data on 2-dimensional or N-dimensional grids
These two general data types are explained in detail in the Tabular Data and Gridded Data user guides, including all the many supported formats (including Python dictionaries of NumPy arrays, pandas DataFrames
, dask DataFrames
, and xarray DataArrays
and Datasets
).
In this Getting-Started guide we provide a quick overview and introduction to two of the most flexible and powerful formats: columnar pandas DataFrames (in this section), and gridded xarray Datasets (in the next section).
Tabular#
Tabular data (also called columnar data) is one of the most common, general, and versatile data formats, corresponding to how data is laid out in a spreadsheet. There are many different ways to put data into a tabular format, but for interactive analysis having tidy data provides flexibility and simplicity. For tidy data, the columns of the table represent variables or dimensions and the rows represent observations. The best way to understand this format is to look at such a dataset:
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension('bokeh', 'matplotlib')
diseases = pd.read_csv('../assets/diseases.csv.gz')
diseases.head()
Year | Week | State | measles | pertussis | |
---|---|---|---|---|---|
0 | 1928 | 1 | Alabama | 3.67 | NaN |
1 | 1928 | 2 | Alabama | 6.25 | NaN |
2 | 1928 | 3 | Alabama | 7.95 | NaN |
3 | 1928 | 4 | Alabama | 12.58 | NaN |
4 | 1928 | 5 | Alabama | 8.03 | NaN |
This particular dataset was the subject of an excellent piece of visual journalism in the Wall Street Journal. The WSJ data details the incidence of various diseases over time, and was downloaded from the University of Pittsburgh’s Project Tycho. We can see we have 5 data columns, which each correspond either to independent variables that specify a particular measurement (‘Year’, ‘Week’, ‘State’), or observed/dependent variables reporting what was then actually measured (the ‘measles’ or ‘pertussis’ incidence).
Knowing the distinction between those two types of variables is crucial for doing visualizations, but unfortunately the tabular format does not declare this information. Plotting ‘Week’ against ‘State’ would not be meaningful, whereas ‘measles’ for each ‘State’ (averaging or summing across the other dimensions) would be fine, and there’s no way to deduce those constraints from the tabular format. Accordingly, we will first make a HoloViews object called a Dataset
that declares the independent variables (called key dimensions or kdims in HoloViews) and dependent variables (called value dimensions or vdims) that you want to work with:
vdims = [('measles', 'Measles Incidence'), ('pertussis', 'Pertussis Incidence')]
ds = hv.Dataset(diseases, ['Year', 'State'], vdims)
Here we’ve used an optional tuple-based syntax (name,label)
to specify a more meaningful description for the vdims
, while using the original short descriptions for the two kdims
. We haven’t yet specified what to do with the Week
dimension, but we are only interested in yearly averages, so let’s just tell HoloViews to average over all remaining dimensions:
ds = ds.aggregate(function=np.mean)
ds
:Dataset [Year,State] (Measles Incidence,Pertussis Incidence)
(We’ll cover aggregations like np.mean
in detail later, but here the important bit is simply that the Week
dimension can now be ignored.)
The repr
shows us both the kdims
(in square brackets) and the vdims
(in parentheses) of the Dataset
. Because it can hold arbitrary combinations of dimensions, a Dataset
is not immediately visualizable. There’s no single clear mapping from these four dimensions onto a two-dimensional page, hence the textual representation shown above.
To make this data visualizable, we’ll need to provide a bit more metadata, by selecting one of the large library of Elements that can help answer the questions we want to ask about the data. Perhaps the most obvious representation of this dataset is as a Curve
displaying the incidence for each year, for each state. We could pull out individual columns one by one from the original dataset, but now that we have declared information about the dimensions, the cleanest approach is to map the dimensions of our Dataset
onto the dimensions of an Element using .to
:
layout = (ds.to(hv.Curve, 'Year', 'measles') + ds.to(hv.Curve, 'Year', 'pertussis')).cols(1)
layout.opts(
opts.Curve(width=600, height=250, framewise=True))
Here we specified two Curve
elements showing measles and pertussis incidence respectively (the vdims), per year (the kdim), and laid them out in a vertical column. You’ll notice that even though we specified only the short name for the value dimensions, the plot shows the longer names (“Measles Incidence”, “Pertussis Incidence”) that we declared on the Dataset
.
You’ll also notice that we automatically received a dropdown menu to select which State
to view. Each Curve
ignores unused value dimensions, because additional measurements don’t affect each other, but HoloViews has to do something with every key dimension for every such plot. If the State
(or any other key dimension) isn’t somehow plotted or aggregated over, then HoloViews has to leave choosing a value for it to the user, hence the selection widget. Other options for what to do with extra dimensions or just extra data ranges are illustrated below.
Selecting#
One of the most common things we might want to do is to select only a subset of the data. The select
method makes this extremely easy, letting you select a single value, a list of values supplied as a list, or a range of values supplied as a tuple. Here we will use select
to display the measles incidence in four states over one decade. After applying the selection, we use the .to
method as shown earlier, now displaying the data as Bars
indexed by ‘Year’ and ‘State’ key dimensions and displaying the ‘Measles Incidence’ value dimension:
states = ['New York', 'New Jersey', 'California', 'Texas']
bars = ds.select(State=states, Year=(1980, 1990)).to(hv.Bars, ['Year', 'State'], 'measles').sort()
bars.opts(
opts.Bars(width=800, height=400, tools=['hover'], xrotation=90, show_legend=False))
Faceting#
Above we already saw what happens to key dimensions that we didn’t explicitly assign to the Element using the .to
method: they are grouped over, popping up a set of widgets so the user can select the values to show at any one time. However, using widgets is not always the most effective way to view the data, and a Dataset
lets you specify other alternatives using the .overlay
, .grid
and .layout
methods. For instance, we can lay out each state separately using .grid
:
grouped = ds.select(State=states, Year=(1930, 2005)).to(hv.Curve, 'Year', 'measles')
gridspace = grouped.grid('State')
gridspace.opts(
opts.Curve(width=200, color='indianred'))
Or we can take the same grouped object and .overlay
the individual curves instead of laying them out in a grid:
ndoverlay = grouped.overlay('State')
ndoverlay.opts(
opts.Curve(width=600, color=hv.Cycle(values=['indianred', 'slateblue', 'lightseagreen', 'coral'])))
These faceting methods even compose together, meaning that if we had more key dimensions we could .overlay
one dimension, .grid
another and have a widget for any other remaining key dimensions.
Aggregating#
Instead of selecting a subset of the data, another common operation supported by HoloViews is computing aggregates. When we first loaded this dataset, we aggregated over the ‘Week’ column to compute the mean incidence for every year, thereby reducing our data significantly. The aggregate
method is therefore very useful to compute statistics from our data.
A simple example using our dataset is to compute the mean and standard deviation of the Measles Incidence by 'Year'
. We can express this simply by passing the key dimensions to aggregate over (in this case just the ‘Year’) along with a function and optional spreadfn
to compute the statistics we want. The spreadfn
will append the name of the function to the dimension name so we can reference the computed value separately. Once we have computed the aggregate, we can simply cast it to a Curve
and ErrorBars
:
agg = ds.aggregate('Year', function=np.mean, spreadfn=np.std)
errorbars = hv.ErrorBars(agg,vdims=['measles', 'measles_std']).iloc[::2]
overlay = (hv.Curve(agg) * errorbars).redim.range(measles=(0, None))
overlay.opts(width=600)
In this way we can summarize a multi-dimensional dataset as something that can be visualized directly, while allowing us to compute arbitrary statistics along a dimension.
Other data#
If you want to know more about working with tabular data, particularly when using datatypes other than pandas, have a look at the user guide. The different interfaces allow you to work with everything from simple NumPy arrays to out-of-core dataframes using dask. Dask dataframes scale to visualizations of billions of rows, when using datashader with HoloViews to aggregate the data as needed.