Streaming Histograph

Streaming Histograph is a term I came up with for a component I tried to recreate from controul.com. It is essentially a histogram that works with stream data. Stream data is a series of data where there is no known end or where the population size is potentially infinite. The term is frequently used with time series data and is usually bound by some time period. I really liked the concept of viewing the data generated by a function, over time, which sparked me to re-create this tool. I have in the past done something similar with dice where I made use of minimal comps graphing components. But this was just way cooler than graphing data.

Park Miller RNG applied with a Marsaglia Transform

The development process

I have to add that I learnt a LOT during this process. My first approach, as is often the case, was very naive. I created the histogram to keep track of the total population and compute the descriptive statistics on a refresh. The methods used were:

  • mean

public function mean(array:Array):Number
{
	return sum(array) / array.length;
}
  • variance

public function variance(array:Array):Number 
{
	var m:Number = mean(array);
	var avg:Number = 0;
	for each (var num:Number in array) {
		avg += Math.pow(m - num, 2);
	}
	return avg / array.length;
}
  • stddev

public function stddev(array:Array):Number
{
	return Math.sqrt(variance(array));
}

This is the correct way to compute the statistics and it is very accurate. I would have been very happy with my naive implementation if I did not see the kind of sample counts that Hristo Dachev was getting in his histographer (we are talking orders of magnitude more than what I was getting). I thought about it a bit and concluded that it MUST be the sort function on the array. So I implemented…

The sorted array

I am not 100% sure how or when I found this, I think it was just coincidence. Over at jacksondunstan.com I ran across his sorted array class and made the connection to the requirement of the histogram data needing to be sorted the whole time. After a quick refactor I had the sorted array implemented and tested. This improved the performance considerably, but sadly, it was not the only cause for my poor performance. After some testing, I realised that the sorted array itself became expensive to insert into after 250,000+ elements, but it was degrading long before that. How was Hristo doing it? It was around here that I realised that I would have to keep a smaller sample of the data instead of the population. This led me to the topic of approximate descriptive statistics.

Rolling mean

The first thing I tackled was how to keep track of the mean without keeping all the samples. I learnt that this type of statistic tracking is called a rolling / streaming statistic. I came across some math that did this for the mean and implemented it. I also tested that it was accurate to within some epsilon. Happy with the implementation, I tackled the problem of the standard deviation eating cpu cycles. Which redirected my efforts to finding a way to stream the variation and thus the standard deviation.

Rolling standard deviation

After quite a bit of searching I found some c++ code that computed the mean, variance, standard deviation, skewness and kurtosis from streaming stats. The author called the class RunningStats and I implemented it as a straight port. I am tempted to refactor it, but that I will leave that for another day as I have little idea of what is mathematically happening there. The implementation of this section significantly improved the performance of the histographer, a name that was starting to stick with me around here. However, I was still getting nowhere near the kind of performance that I was seeing on controul.com.

package net.avdw.stats
{

	/**
	 * http://www.johndcook.com/skewness_kurtosis.html
	 * ...
	 * @author Andrew van der Westhuizen
	 */
	public class RollingStats
	{
		private var n:uint;
		private var M1:Number, M2:Number, M3:Number, M4:Number;

		public function RollingStats()
		{
			clear();
		}

		public function clear():void
		{
			n = 0;
			M1 = M2 = M3 = M4 = 0;
		}

		public function push(x:Number):void
		{
			var delta:Number, deltaByN:Number, deltaByNSqr:Number, term1:Number;
			var oldSampleSize:Number = n;

			n++;
			delta = x - M1;
			deltaByN = delta / n;
			deltaByNSqr = deltaByN * deltaByN;
			term1 = delta * deltaByN * oldSampleSize;
			M1 += deltaByN;
			M4 += term1 * deltaByNSqr * (n * n - 3 * n + 3) + 6 * deltaByNSqr * M2 - 4 * deltaByN * M3;
			M3 += term1 * deltaByN * (n - 2) - 3 * deltaByN * M2;
			M2 += term1;
		}

		public function get mean():Number
		{
			return M1;
		}

		public function get variance():Number
		{
			return n == 1 ? 0 : M2 / (n - 1);
		}

		public function get stddev():Number
		{
			return Math.sqrt(variance);
		}

		public function get skewness():Number
		{
			return Math.sqrt(n) * M3 / Math.pow(M2, 1.5);
		}

		public function get kurtosis():Number
		{
			return n * M4 / (M2 * M2) - 3.0;
		}

		public function plus(stats:RollingStats):RollingStats
		{
			var combined:RollingStats = new RollingStats;
			combined.n = this.n + stats.n;

			var delta:Number = stats.M1 - M1;
			var delta2:Number = delta * delta;
			var delta3:Number = delta * delta2;
			var delta4:Number = delta2 * delta2;

			combined.M1 = (this.n * this.M1 + stats.n * stats.M1) / combined.n;

			combined.M2 = this.M2 + stats.M2 + delta2 * this.n * stats.n / combined.n;

			combined.M3 = this.M3 + stats.M3 + delta3 * this.n * stats.n * (this.n - stats.n) / (combined.n * combined.n);
			combined.M3 += 3.0 * delta * (this.n * stats.M2 - stats.n * this.M2) / combined.n;

			combined.M4 = this.M4 + stats.M4 + delta4 * this.n * stats.n * (this.n * this.n - this.n * stats.n + stats.n * stats.n) / (combined.n * combined.n * combined.n);
			combined.M4 += 6.0 * delta2 * (this.n * this.n * stats.M2 + stats.n * stats.n * this.M2) / (combined.n * combined.n) + 4.0 * delta * (this.n * stats.M3 - stats.n * this.M3) / combined.n;

			return combined;
		}
	}

}

Rolling quartiles

Note that by now I was still using the sorted array to compute quartiles. It was clear to me now that the SortedArray would not cut it and that I had to find a way to stream quartiles. It was surprisingly difficult to find any material on this subject that I could actually understand. I did find a paper detailing the P2 algorithm for dynamic calculation of quantiles and histograms without storing observations, from 1985, which was confusing at best. I also found an implementation of this algorithm in c++ / c, which I still have to build up the courage to port.

Some more searching led me to the concept of parallel descriptive statistic gathering on databases. An algorithm used there is the QDigest algorithm. It allows for merging multiple QDigests into a single QDigest. I even found an implementation of the algorithm in Java. However, I read somewhere that it only works on integers. From here I ended up looking into the TDigest algorithm. Which still seemed a bit much for what I wanted, but if I recall correctly, it handles real numbers.

I was determined to find something simpler than the above. I was losing faith quickly. But then I found an implementation of Greenwald and Khanna’s “Space-efficient online computation of quantile summaries” in SIGMOD 2001. Additionally it contained a simple implementation of Cormode, Korn, Muthukrishnan, and Srivastava’s “Effective Computation of Biased Quantiles over Data Streams” in ICDE 2005. I ported both of them to as3 and worked them into the now poorly titled “Streaming Histograph” class.

Histogram data structure

I came up with a simple data structure to maintain the histogram data. The data structure only keeps track of counts in bins. It requires an expected range to observe and the number of buckets it should maintain. These requirement can be removed when looking into streaming histograms. However, I left these parameters in to simplify the implementation and I needed to have control over the amount of buckets and range for display purposes.

package net.avdw.ds
{
	import net.avdw.math.Range;

	public class HistogramDs
	{
		public var expectedRange:Range;
		public var actualRange:Range;
		public var bins:Vector.;
		public var maxBinCount:uint = 1;
		public var binSize:Number;

		public function HistogramDs(expectedRange:Range, numberBins:uint)
		{
			this.expectedRange = expectedRange;
			actualRange = Range.closed(.5, .5);

			binSize = expectedRange.size / numberBins;
			bins = new Vector.;
			for (var i:int = 0; i < numberBins; i++)
				bins.push(0);
		}

		public function add(number:Number):void
		{
			if (number < actualRange.lower) 				actualRange.lower = number; 			 			if (number > actualRange.upper)
				actualRange.upper = number;

			var binIdx:int = Math.floor((number - expectedRange.lower) / binSize) ;
			if (binIdx < bins.length && binIdx > 0 && ++bins[binIdx] > maxBinCount)
				maxBinCount = bins[binIdx];
		}
	}
}

The result

While it is still nowhere near as efficient as controul.com’s histograph, this implementation can compute a stream’s descriptive statistics at a little more than 1,000+ samples per frame to maintain a stable 31 frames per second. Which is around 31k samples per second. Additionally it does not degrade with the amount of observations and can maintain its performance at well over 10,000,000+ observations. It just gets there slower than the implementation over at controul.com. I have requested to see Hristo’s implementation, but have yet to get a response from him. His blog does seem to be dead though, the last post was in 2011 which was a year after the post before it. Hopefully I will get to see his code some time, but until then… I have more than enough material to go through with regards to looking for optimisations.

Here is a demo of the component.

Park Miller RNG applied with a Marsaglia Transform

The code

This code is still way too messy to release, but hey. It might never get cleaned up… so why not post it anyway.

package net.avdw.gui
{
	import flash.display.Sprite;
	import flash.events.Event;
	import flash.globalization.LocaleID;
	import flash.globalization.NumberFormatter;
	import flash.text.TextField;
	import flash.text.TextFieldAutoSize;
	import flash.text.TextFormat;
	import flash.text.TextFormatAlign;
	import flash.utils.Dictionary;
	import net.avdw.align.alignHorizontalRightTo;
	import net.avdw.demo.ADemo;
	import net.avdw.display.addChildrenTo;
	import net.avdw.ds.HistogramDs;
	import net.avdw.ds.SortedArrayDs;
	import net.avdw.font.ConsolasFont;
	import net.avdw.stats.QuantileEstimationGK;
	import net.avdw.stats.rollingMean;
	import net.avdw.math.Range;
	import net.avdw.math.sum;
	import net.avdw.stats.RollingStats;
	import net.avdw.text.embeddedFont;
	import net.avdw.tracepoint;

	/**
	 * ...
	 * @author Andrew van der Westhuizen
	 */
	public class StreamingHistograph extends Sprite
	{
		private const nf:NumberFormatter = new NumberFormatter(LocaleID.DEFAULT);
		private const approxSampleStatisticsText:TextField = new TextField();
		private const approxSampleStatisticsStr:String = "samples: {samples}"
		+ "\nmean: {mean}"
		+ "\nstddev: {stddev}";

		private const fiveNumSummaryText:TextField = new TextField();
		private const fiveNumSummaryStr:String = "min: {min}" 
		+ "\nlower fence: {lfence}" 
		+ "\nlower quartile: {lquart}" 
		+ "\nmedian: {median}" 
		+ "\nupper quartile: {uquart}" 
		+ "\nupper fence: {ufence}"
		+ "\nmax: {max}";

		private const rollingStats:RollingStats = new RollingStats();

		private var graphWidth:int;
		private var graphHeight:int;
		private var histogram:HistogramDs;
		private var quantiles:QuantileEstimationGK = new QuantileEstimationGK(0.01, 7);
		private var sampleCount:uint = 0;
		private var heightDiv4:int;
		private var heightDiv2:int;
		private var heightDiv3Over4:int;

		public function StreamingHistograph(name:String, expectedRange:Range, width:int = 600, data:Array = null)
		{
			this.name = name;
			this.graphWidth = width;
			histogram = new HistogramDs(expectedRange, width);

			const nameText:TextField = embeddedFont(name, ConsolasFont.NAME, 10);
			nameText.x = (width - nameText.width) / 2;
			graphHeight = 6 * nameText.textHeight + nameText.height;
			heightDiv4 = graphHeight * .25;
			heightDiv2 = graphHeight * .5;
			heightDiv3Over4 = graphHeight * .75;

			approxSampleStatisticsText.y = 2 * nameText.textHeight;
			approxSampleStatisticsText.autoSize = TextFieldAutoSize.LEFT;
			approxSampleStatisticsText.embedFonts = true;
			approxSampleStatisticsText.selectable = false;
			approxSampleStatisticsText.defaultTextFormat = nameText.defaultTextFormat;

			fiveNumSummaryText.autoSize = TextFieldAutoSize.LEFT;
			fiveNumSummaryText.embedFonts = true;
			fiveNumSummaryText.selectable = false;
			fiveNumSummaryText.defaultTextFormat = new TextFormat(ConsolasFont.NAME, 10, null, null, null, null, null, null, TextFormatAlign.RIGHT);

			addChildrenTo(this, nameText, approxSampleStatisticsText, fiveNumSummaryText);

			refresh();
		}

		public function add(element:Number):void {
			histogram.add(element);
			rollingStats.push(element);
			quantiles.insert(element);
			sampleCount++;
		}

		public function refresh():void
		{	
			// border
			graphics.clear();
			graphics.lineStyle(1, 0xAAAAAA);
			graphics.beginFill(0xEEEEEE);
			graphics.drawRect(0, 0, graphWidth, graphHeight);
			graphics.endFill();

			if (sampleCount == 0)
				return;

			var lquart:Number = quantiles.query(.25);
			var median:Number = quantiles.query(.5);
			var uquart:Number = quantiles.query(.75);
			var interQr:Number = uquart - lquart;
			var lfence:Number = lquart - 1.58 * interQr;
			var ufence:Number = uquart + 1.58 * interQr;

			// data
			graphics.lineStyle(1, 0xCCCCCC);
			for (var xCount:int = 0; xCount < graphWidth; xCount++) {
				graphics.moveTo(xCount, graphHeight);
				graphics.lineTo(xCount, graphHeight - histogram.bins[xCount] / histogram.maxBinCount * graphHeight);
			}

			var xCountMean:Number = Math.floor((rollingStats.mean - histogram.expectedRange.lower) / histogram.binSize);
			var xCountLQuart:Number = Math.floor((lquart - histogram.expectedRange.lower) / histogram.binSize);
			var xCountUQuart:Number = Math.ceil((uquart - histogram.expectedRange.lower) / histogram.binSize);
			var xCountLFence:Number = Math.floor((Math.max(lfence, histogram.actualRange.lower) - histogram.expectedRange.lower)  / histogram.binSize);
			var xCountUFence:Number = Math.ceil((Math.min(ufence, histogram.actualRange.upper) - histogram.expectedRange.lower)  / histogram.binSize);

			// stddev
			graphics.lineStyle(1, 0xAAAAAA);
			var xCountStdDev:Number = Math.floor(rollingStats.stddev / histogram.binSize);
			for (var i:int = 1; i < 4; i++) {
				graphics.moveTo(Math.max(0, Math.floor(xCountMean - i * xCountStdDev)), graphHeight);
				graphics.lineTo(Math.max(0, Math.floor(xCountMean - i * xCountStdDev)), 0);
				graphics.moveTo(Math.min(graphWidth, Math.ceil(xCountMean + i * xCountStdDev)), graphHeight);
				graphics.lineTo(Math.min(graphWidth, Math.ceil(xCountMean + i * xCountStdDev)), 0);
			}

			// boxplot
			graphics.lineStyle(1, 0x888888);
			graphics.moveTo(xCountMean, graphHeight);
			graphics.lineTo(xCountMean, 0);
			graphics.drawRect(xCountLQuart, heightDiv4, xCountUQuart - xCountLQuart, heightDiv2);
			graphics.moveTo(xCountLFence, heightDiv4);
			graphics.lineTo(xCountLFence, heightDiv3Over4);
			graphics.moveTo(xCountUFence, heightDiv4);
			graphics.lineTo(xCountUFence, heightDiv3Over4);
			graphics.moveTo(xCountLFence, heightDiv2);
			graphics.lineTo(xCountLQuart, heightDiv2);
			graphics.moveTo(xCountUQuart, heightDiv2);
			graphics.lineTo(xCountUFence, heightDiv2);

			approxSampleStatisticsText.text = approxSampleStatisticsStr
			.replace("{samples}", sampleCount)
			.replace("{mean}", nf.formatNumber(rollingStats.mean))
			.replace("{stddev}", nf.formatNumber(rollingStats.stddev));

			fiveNumSummaryText.text = fiveNumSummaryStr
			.replace("{min}", nf.formatNumber(histogram.actualRange.lower))
			.replace("{lfence}", nf.formatNumber(lfence))
			.replace("{lquart}", nf.formatNumber(lquart))
			.replace("{median}", nf.formatNumber(median))
			.replace("{uquart}", nf.formatNumber(uquart))
			.replace("{ufence}", nf.formatNumber(ufence))
			.replace("{max}", nf.formatNumber(histogram.actualRange.upper));

			fiveNumSummaryText.x = graphWidth - fiveNumSummaryText.width;
		}
	}
}