How to quickly find the best bins for your histogram

Wednesday, February 6, 2019

Data exploration is a critical step in every data science project and it usually starts with looking at the distribution of single variables. This is where histograms shine.

Histograms are great for visualising the distribution of columns, which helps to understand important aspects of the data. By simply looking at a histogram, we can for example immediately identify outliers or even errors in our data (e.g. negative values in a column containing the age of patients).

When working with histograms, we almost always end up adjusting the bin width, which is a critical parameter as it determines how much and what kind of information we can extract from the plot.

In this article, I will show you how you can quickly find your optimal bin width by creating an interactive histogram that you can rebin on the fly using plotly and ipywidgets in Jupyter Notebook or JupyterLab.

Even though I show interactive rebinning with plotly, you can apply the logic I’m illustrating to any plotting library, such as seaborn and matplotlib.
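To make that concrete, here is a minimal, library-free sketch of the underlying logic: turning a bin width into bin edges and counting the values per bin. The helper names bin_edges_for_width and count_per_bin and the sample air times are made up for illustration; they are not part of plotly or of the notebook for this article.

```python
import math


def bin_edges_for_width(values, bin_width):
    """Return evenly spaced bin edges of the given width that cover all values."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # Align the first edge to a multiple of the bin width
    start = math.floor(min(values) / bin_width) * bin_width
    # Choose the number of bins so that the last edge lies strictly above the maximum
    n_bins = math.floor((max(values) - start) / bin_width) + 1
    return [start + i * bin_width for i in range(n_bins + 1)]


def count_per_bin(values, edges):
    """Count values in half-open bins [edge_i, edge_i+1)."""
    width = edges[1] - edges[0]
    counts = [0] * (len(edges) - 1)
    for value in values:
        counts[int((value - edges[0]) // width)] += 1
    return counts


air_times = [32, 45, 58, 61, 75, 119, 120, 149]  # made-up air times in minutes
edges = bin_edges_for_width(air_times, 30)
counts = count_per_bin(air_times, edges)
print(edges)   # [30, 60, 90, 120, 150]
print(counts)  # [3, 2, 1, 2]
```

Edges like these can then be handed to the plotting library of your choice, for example through the bins argument of matplotlib's hist, and recomputed whenever the bin width changes.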

For the visualization, I will display the air time in minutes of more than 300,000 flights that departed NYC in 2013 (NYCflights13 data). You can find the full code for this article as a Jupyter Notebook on GitHub.

Histogram with interactive binning

In this graphic you can see the end result. If we change the bin width through a slider, the plotly graph adjusts automatically.

To implement this behavior, we combine plotly.graph_objs (which creates the plotly graph) with an ipywidgets.FloatSlider.

This is the code for creating the rebinnable histogram.


import plotly.graph_objs as go
import ipywidgets as widgets


def rebinnable_interactive_histogram(series, initial_bin_width=10):
    figure_widget = go.FigureWidget(
        data=[go.Histogram(x=series, xbins={"size": initial_bin_width})]
    )

    bin_slider = widgets.FloatSlider(
        value=initial_bin_width,
        min=1,
        max=30,
        step=1,
        description="Bin width:",
        readout_format=".0f",  # display as integer
    )

    histogram_object = figure_widget.data[0]

    def set_bin_size(change):
        histogram_object.xbins = {"size": change["new"]}

    bin_slider.observe(set_bin_size, names="value")

    output_widget = widgets.VBox([figure_widget, bin_slider])
    return output_widget


# df is assumed to be the NYCflights13 flights table loaded as a pandas DataFrame
rebinnable_interactive_histogram(df["air_time"])

Let’s go through it line by line.

Explaining the code line by line

0. Function signature

def rebinnable_interactive_histogram(series, initial_bin_width=10):

Note that our function takes two arguments: series, a pandas.Series, and initial_bin_width, which specifies the bin width we want as the default in our plot. In our case, it’s a 10-minute air time window.

1. Creating the figure

    figure_widget = go.FigureWidget(
        data=[go.Histogram(x=series, xbins={"size": initial_bin_width})]
    )

We generate a new FigureWidget instance. The FigureWidget object is the new “magic object” of plotly. You can display it within Jupyter Notebook or JupyterLab like any normal plotly figure. On top of that, it offers some advantages:

  • FigureWidgets can be combined with ipywidgets in order to create more powerful constructs (in fact, that’s what FigureWidgets are designed for)
  • you can manipulate the FigureWidget in various ways from Python
  • you can also listen for some events and
  • when an event is triggered, you can execute more Python code

The FigureWidget receives the argument data, a list of all the traces (read: visualizations) that we want to show. In our case, we only want to show a single histogram. The x values for the histogram come from series. We set the bin width by passing a dictionary to xbins. If we set size=None in the dictionary, plotly will choose a bin width for us.

2. Creating the slider

    bin_slider = widgets.FloatSlider(
        value=initial_bin_width,
        min=1,
        max=30,
        step=1,
        description="Bin width:",
        readout_format=".0f",  # display as integer
    )

We generate a FloatSlider using the ipywidgets library. Via this slider, we will later be able to manipulate our histogram.

3. Saving a reference to the histogram

    histogram_object = figure_widget.data[0]

We get the reference to the histogram because we want to manipulate it in the last step. In particular, we will change the xbins attribute of our object, which we can access via histogram_object.xbins.

4. Writing and using the callback

    def set_bin_size(change):
        histogram_object.xbins = {"size": change["new"]}

    bin_slider.observe(set_bin_size, names="value")

The FloatSlider we have implemented comes with some magic. Every time its value changes (i.e. we move the slider), it triggers an event. We can use that event to update the bin width in our histogram. Technically, you do that by calling the observe method on the bin slider, passing it the function you want to call (set_bin_size in our case) and telling it when to call the function (names="value" means that we call the function whenever the value of the slider changes).

Now, whenever the slider’s value changes, it will call set_bin_size. set_bin_size has access to the slider’s value through the magic argument change, a dictionary containing data about the event triggered by bin_slider. For example, change["new"] contains the new value of the slider, but you can also access its previous value with change["old"]. Note that you don’t have to use the argument name change. You can give it any name you want.
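To see what the callback receives, here is a minimal sketch that records the old and new value of each change. The names record_change and observed_changes are made up for this illustration, and the value is set from Python rather than by dragging the slider, which notifies the observer in the same way.

```python
import ipywidgets as widgets

bin_slider = widgets.FloatSlider(value=10, min=1, max=30, step=1)
observed_changes = []


def record_change(change):
    # change behaves like a dictionary describing the event
    observed_changes.append((change["old"], change["new"]))


bin_slider.observe(record_change, names="value")

# Setting the value from Python notifies the observer, just like moving the slider
bin_slider.value = 15

print(observed_changes)
```

Without names="value", the callback would be notified about changes to every attribute of the slider, not only its value.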

Inside the callback function set_bin_size, we simply use the reference histogram_object to update the FigureWidget’s bin settings: overwriting xbins changes the bin width.

When we put all the pieces from above together, we have our first prototype for a nice interactive histogram.

Conclusion

Histograms are a great way to get started exploring single columns of a data set. With plotly, we can create powerful interactive visualizations which can further be enhanced with ipywidgets.

In this article, I have shown you how you can interactively and quickly find the (subjectively) optimal bin width for a histogram when working in Jupyter Notebook or JupyterLab using plotly and ipywidgets.

At 8080 Labs, we use the rebinning feature in our Python tool bamboolib. Together with many other interactive features, it helps our users get insights faster.

If you have any feedback or constructive criticism about this article or want to discuss ways to add more functionality to the histogram, please feel free to reach out to me via LinkedIn.


About Tobias

I am the co-creator of bamboolib. I am always looking for ways to make Python data scientists more productive, so that they have more time for the things they love - be it data insights, building fancy ML models, or simply spending more time with friends & family. I love food, music, hiking and spirited discussions.
