Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Histogramming and Binning Data with Python
1. Histogramming
When a measurement is made numerous times, it is often useful to bin (or group) the data and make a histogram. For example, if the time that it takes a sphere to roll down a ramp was measured one hundred times, then a histogram of the times would show how they are distributed. The hist function from the pylab library is useful for making histograms. The example below makes a histogram from a list of 24 numbers. You can add labels to the histogram as with other graphs.
The first line imports the pylab library, which makes the hist function available.
As with other plotting commands, the figure and show functions are also needed.
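A minimal sketch of such an example follows. The 24 values in the list t are made up for illustration (they are not the original data), and the axis labels and title are likewise assumed.

```python
from pylab import figure, hist, xlabel, ylabel, title, show

# Hypothetical data: 24 made-up roll times in seconds
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

figure()
hist(t)                       # default: 10 bins, edges chosen by hist
xlabel("Time (s)")            # labels work as for other plots
ylabel("Number of events")
title("Roll times")
show()
```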
By default, the histogram will have 10 bins. If no additional arguments are sent, the hist function decides where to put the boundaries of the bins.
The color argument can be used to set the color of the bars in the histogram. Alternatively, the edgecolor and facecolor arguments separately set the colors of the edges and middle of the bars in the histogram, respectively. Some color options are:
r = red
g = green
b = blue
k = black
c = cyan
m = magenta
y = yellow
w = white
The default is for the edgecolor to be the same as the facecolor. The bins stand out better if the edgecolor is black.
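As a rough illustration of the color arguments, the sketch below draws cyan bars with black edges. The data list t is hypothetical, and the particular colors are an arbitrary choice.

```python
from pylab import figure, hist, show

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

figure()
# facecolor sets the middle of the bars, edgecolor sets their outlines
hist(t, facecolor="c", edgecolor="k")
show()
```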
The facecolor argument can also be set to the string "none" so that the bars only have outlines. Alternatively, you can set fill to False. This is useful if you want to plot data on top of the histograms as shown further below.
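The two ways of drawing outline-only bars might look like the following sketch. The data list t is hypothetical, and each approach is drawn in its own figure for comparison.

```python
from pylab import figure, hist, show

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

figure()
hist(t, facecolor="none", edgecolor="k")   # transparent face, black outline

figure()
hist(t, fill=False)                        # same idea using the fill argument
show()
```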
The hist function returns the number of events in each bin, the edges of the bins, and things called patches (which will not be discussed further). These values can be captured by providing three variable names for them as follows.
The array events contains the numbers of occurrences in each of the 10 bins. The array edges contains 11 elements. (The first 10 elements are the lower edges of the bins and the final element is the upper edge of the final bin.) The bins are the same width, but the edges may end up in unusual places. A number is included in a bin if it is greater than or equal to its lower edge and less than its upper edge. The one exception is the final bin, which also includes values equal to its upper edge.
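A short sketch of capturing the returned values is given here. The variable names events, edges, and patches follow the text, while the data list t is hypothetical.

```python
from pylab import figure, hist, show

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

figure()
events, edges, patches = hist(t, edgecolor="k")
show()

print(events)   # number of events in each of the 10 bins
print(edges)    # 11 edges: 10 lower edges plus the final upper edge
```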
If you set the density argument to True, the function will make an area-normalized histogram. For each bin, the height of the bar is the probability density, which is the number of events in the bin divided by both the total number of events and the width of the bin. The area of each bar is the probability of an event being in that bin, so the total area is one. With this option, the probability density is returned instead of the number of events. Compare the example below with the previous example.
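The normalization can be checked with a sketch like the one below, which uses a hypothetical data list t. The name density for the first returned array is an assumption made here for clarity.

```python
from numpy import diff
from pylab import figure, hist, show

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

figure()
density, edges, patches = hist(t, density=True, edgecolor="k")
show()

# Area of each bar = height * width; the areas should add up to one
total_area = (density * diff(edges)).sum()
print(total_area)
```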
You can control the number of bins by setting the bins argument to an integer, but this doesn’t control the locations of the edges. Choosing an appropriate number of bins is important. If there are too few or too many bins, the histogram won’t show how the events are distributed very well. For example, the same example data is histogrammed below with 3 and 30 bins. Neither one of these is very helpful.
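A sketch of the two extremes, again with a hypothetical data list t, could look like this. Each histogram is drawn in its own figure.

```python
from pylab import figure, hist, show

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

figure()
events3, edges3, patches3 = hist(t, bins=3, edgecolor="k")      # too coarse

figure()
events30, edges30, patches30 = hist(t, bins=30, edgecolor="k")  # too fine
show()
```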
If you want to have control over the number and location of the bins, you can make the bins argument an array. If you want N bins, the array will have (N + 1) elements. The first N elements are the lower edges of the bins and the final element is the upper edge of the final bin. Usually the bins have equal widths, but they can be made unequal. The array can be made with the linspace function, which comes from the numpy library (pylab also makes it available, and older versions of scipy provided it as a now-deprecated alias). You must specify the first element of the array (the lower edge of the first bin), the last element of the array (the upper edge of the final bin), and the number of elements in the array (one more than the number of bins). The example below would produce 10 bins (not 11) starting at 0 and ending at 10. For the example data, some of the bins are empty and aren't displayed.
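A sketch of this approach follows, importing linspace from numpy. The data list t and the name bin_edges are hypothetical; with this made-up data only the first bin (0 to 1) is empty.

```python
from numpy import linspace
from pylab import figure, hist, show

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

# 11 edges from 0 to 10 define 10 bins of width 1
bin_edges = linspace(0, 10, 11)

figure()
events, edges, patches = hist(t, bins=bin_edges, edgecolor="k")
show()
```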
It is also possible to set the upper and lower limits of the bins using the range argument. Values outside of the specified range are ignored. The following example does the same as the previous example because the default number of bins is 10.
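The range argument could be used as in the sketch below. The data list t is hypothetical, and the limits 0 and 10 are chosen to give 10 bins of width 1.

```python
from pylab import figure, hist, show

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

figure()
# The default 10 bins are spread evenly between the limits 0 and 10
events, edges, patches = hist(t, range=(0, 10), edgecolor="k")
show()
```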
Note that in all of the examples above, each bar is centered midway between the edges of its bin, which define what values are counted in that bin. If the values being histogrammed are all integers, it makes more sense to shift the bars to the left so that they are centered over integers. Setting align to "left" centers each bar over the left edge of its bin, so when the bin edges are integers the bars are centered over integers.
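The effect of align can be sketched with integer data. The list n of counts below is made up, and the bin edges are set to the integers 0 through 7 with arange so that each bin holds exactly one integer value.

```python
from numpy import arange
from pylab import figure, hist, show

# Hypothetical integer data (for example, counts per trial)
n = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6]

figure()
# Edges at 0, 1, ..., 7; align="left" centers each bar on its lower edge
events, edges, patches = hist(n, bins=arange(0, 8), align="left", edgecolor="k")
show()
```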
If the bins aren't filled, you can graph points (using scatter) or curves (using plot) on the same figure. If the bins are filled, they can hide the points or curves.
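One way to overlay points and a curve on an unfilled histogram is sketched below. The data list t is hypothetical, and plotting the counts at the bin centers is just one illustrative choice of what to draw on top.

```python
from numpy import linspace, diff
from pylab import figure, hist, scatter, plot, show

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

figure()
# Unfilled bars so that anything drawn on top stays visible
events, edges, patches = hist(t, bins=linspace(0, 10, 11), fill=False)

# Centers of the bins: lower edge plus half the bin width
tmid = edges[:-1] + 0.5 * diff(edges)

scatter(tmid, events, color="r")   # points at the bin centers
plot(tmid, events, color="b")      # line joining the same points
show()
```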
2. Binning Data
Sometimes data is binned before it is analyzed. For example, a set of decay times could be binned before fitting the data to an exponential function. The histogram function from the numpy library can be used to bin data without making a plot. The histogram function is similar to the hist function described in the previous section. The range and bins arguments can be used, but it doesn’t return patches. Associating the locations of the bins and the numbers of events in them is a little tricky because the edges array is one element longer than the events array.
If you're counting the occurrences of integers, the lower edges are the appropriate thing to use. In the example below, the resize function makes an array called lower which has a length one less than the length of the edges array, so it just contains the lower edges.
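A sketch of binning integer data without plotting is given here. The list n is made up, and the choice of 7 bins over the range 0 to 7 is an assumption that puts one integer in each bin.

```python
from numpy import histogram, resize

# Hypothetical integer data (for example, counts per trial)
n = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6]

# histogram returns only the counts and the edges (no patches)
events, edges = histogram(n, bins=7, range=(0, 7))

# Drop the final element of edges to keep just the lower edges
lower = resize(edges, len(edges) - 1)

for value, count in zip(lower, events):
    print(value, count)
```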
For non-integer data, it makes more sense to associate the number of events with the center of the bin. For example, the number of events with values of t between 0 and 1 should be associated with 0.5. The example below will make an array called tmid which is the same length as events and contains the values of t in the middle of the bins. Again, the resize function makes an array called lower which contains the locations of the lower edges of the bins because the final element is dropped. An array containing the difference between consecutive elements of the edges array is returned by the function diff. Adding half of the difference between the edges to the lower edge gives the value in the middle of a bin. Note that "diff(edges)" is the same length as the array lower.
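The calculation of the bin centers can be sketched as follows. The names lower and tmid follow the text, while the data list t and the choice of 10 bins from 0 to 10 are hypothetical.

```python
from numpy import histogram, resize, diff

# Hypothetical data: 24 made-up values
t = [1.2, 2.5, 3.1, 3.3, 3.8, 4.0, 4.2, 4.4, 4.6, 4.7, 4.9, 5.0,
     5.1, 5.3, 5.4, 5.6, 5.8, 6.0, 6.3, 6.5, 6.9, 7.4, 8.1, 9.0]

events, edges = histogram(t, bins=10, range=(0, 10))

lower = resize(edges, len(edges) - 1)   # lower edges (final element dropped)
tmid = lower + 0.5 * diff(edges)        # lower edge plus half the bin width

for center, count in zip(tmid, events):
    print(center, count)
```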
Additional Documentation
Further information is available at:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html