Real-time Statistics with stdlib

26
Sep

Real-time Statistics with stdlib

Statistics and other data mining approaches help us make sense of this treasure trove of information. As data requirements continue to evolve, we should expect that our analytical tools keep pace. In this article, we will explore how to use stdlib for calculating real-time statistics.

Today, more data is accumulated than ever before. Not only is data collected at an unprecedented scale (hence the term “Big Data”), but the data is also qualitatively different from what people are used to. Besides the changing nature of what constitutes data (comprising everything from numbers to text files to images to video streams), data is frequently collected in real-time in both high volume and velocity. Statistics and other data mining approaches help us make sense of this treasure trove of information. As data requirements continue to evolve, we should expect that our analytical tools keep pace.

Recent initiatives such as tensorflow.js demonstrate that the time is ripe for bringing statistics and machine learning to JavaScript and thus to the web platform. What is missing are the building blocks to build advanced functionality for statistics and data science more generally. This is one major reason why we are building stdlib, an open-source standard library for JavaScript and Node.js, with an emphasis on numerical and scientific computing applications (see the GitHub repository). As of August 2018, the library provides more than 2,000 functions for mathematics, statistics, data processing, streams, and other areas. It contains many of the utilities you would expect from a standard library.

Whole ecosystems of statistical tools and machine learning frameworks have emerged for various programming languages, such as Python, R, and, more recently, Julia. So why, we might ask, would access to comparable functionality in JavaScript be useful when we already have access to these other environments?

FIRST, server-side JavaScript (via Node.js) has expanded the use cases for JavaScript to become more data-focused (e.g., its use in database management and data processing). Especially for smaller teams without the resources to support a polyglot stack, the ability to run analyses and create visualizations with JavaScript when all other data processing is already performed in JavaScript means faster and more efficient development.

SECOND, having a strong JavaScript scientific computing stack enables novel types of applications that are generally subsumed under the umbrella term edge computing. The idea here is to bring computations back from the cloud to the edges of the network, i.e., end-users. For example, we might want to outsource server-side computations to a user’s web browser. Not only would such a change yield benefits in the form of shorter latency and lower server costs, but it also allows us to think of applications that would otherwise be infeasible. For example, we could build a tool which, given a user’s behavior on a website, predicts which webpage resources to load next. A user’s scroll behavior, search queries, and other UI interactions can all inform what assets to request next from a host server.

THIRD, JavaScript is everywhere: phones, tablets, desktops, embedded devices, and more. Especially in the case of web applications, as long as a device has a web browser, you can run JavaScript. No complicated installation and setup is necessary in order to run a web application which trains models, performs statistical analysis, or runs machine learning algorithms. A user only needs to open a URL, and downloading and executing a machine learning enabled web application happens automatically. This stands in stark contrast to traditional scientific computing environments which often have long and complicated installation and setup procedures. In short, JavaScript is an attractive environment for scientific computing and machine learning due to its ubiquity, approachability, vast developer community, and the amazing new types of applications that it affords by virtue of its being the lingua franca of the web.

Want more exclusive knowledge? Sign up now!

In this article, we will explore how to use stdlib for calculating real-time statistics. A statistic is a quantity computed from a larger sample of observations (e.g., the sum of a collection of numbers). Historically, one or more statistics were calculated from static data sets. However, in today’s world, data sets are increasingly dynamic, coming in bursts or streamed in real-time. When data arrives in batches or in real-time, repeatedly recomputing statistics over an entire data set is not computationally efficient. Instead, one would prefer to compute statistics incrementally (i.e., value-by-value in sequential order). In the case of computing the sum, the solution is straightforward: we keep a running sum of all data points observed up to the current time, adding new values to the accumulated sum as they arrive. For higher order statistics, the calculations might become more complex, but the principle remains the same–employ algorithms for maintaining running accumulators–no matter what statistics should be calculated or what statistical model should be updated. To this end, stdlib contains a wide variety of functions for incrementally computing statistics.

If you are invested in the Web platform, you have probably looked at usage metrics for the websites you have worked on. To make things concrete, let’s walk through the steps of calculating statistics for the number of page views a website is receiving each second. Instead of using real data, we will use what is commonly referred to as Monte Carlo simulation (named after the Monte Carlo Casino in Monaco); i.e., we will generate hypothetical page view data using a probability model. One main advantage of using simulations is that we can see whether (and how fast) we recover the true parameters of our model.

The Poisson distribution is often used to model counts, whether sales of a company per month, page views on a website per hour, or number of births in a hospital per day. The distribution may be appropriate whenever the counts are calculated for a fixed interval and the events occur independently of each other with a fixed rate λ. To facilitate working with the Poisson distribution, stdlib contains functions for calculating various distribution properties and a constructor for creating Poisson distribution objects. Let’s use the latter to create a Poisson distribution with a rate λ = 30 (i.e., 30 page views per second).

import Poisson from '@stdlib/stats/base/dists/poisson/ctor';

const pois = new Poisson( 30 );
The rate λ is equal to the average count (i.e., page views) in an interval, as we can quickly confirm:
let mean = pois.mean
// returns 30
The cumulative distribution function (CDF) of a distribution tells us the probability that we would observe a count less than or equal to a supplied value:
let p = pois.cdf( 30 );
// ~0.548 

p = pois.cdf( 80 );
// returns ~0.999

Now that we can simulate page view counts, let’s compute some statistics! To incrementally calculate a running maximum and mean, stdlib provides maximum and mean accumulators, respectively, which can be used in the following manner:

import incrmax from '@stdlib/stats/incr/max';
import incrmean from '@stdlib/stats/incr/mean';

// Initialize accumulators:
const stats = {
    'max': incrmax(),
    'mean': incrmean()
};

// Provide the accumulators values:
for ( let i = 0; i < 100; i++ ) {
    stats.max( i );
    stats.mean( i );
}

// Print the current accumulator state:
console.log( 'max: %d, mean: %d', stats.max(), stats.mean() );

And now that we’ve familiarized ourselves with how to use stdlib‘s incremental accumulators, let’s combine the previous snippet with our simulation of page view counts

import poisson from '@stdlib/random/base/poisson';
import incrmax from '@stdlib/stats/incr/max';
import incrmean from '@stdlib/stats/incr/mean';

const stats = {
    'max': incrmax(),
    'mean': incrmean()
};

// Create a "seeded" PRNG:
const rpois = poisson.factory({
    'seed': 1303
});

async function main() {
    for ( let i = 0; i < 60; i++ ) {
        await sleep( 1000 );
        let count = rpois( 30 );
        let max = stats.max( count );  
        let mean = stats.mean( count );
        console.log( 'count: %d; max: %d, mean: %d', count, max, mean );
    }
}

main();

You may notice that we modified our simulation by creating a “seeded” PRNG. Using a seeded PRNG allows us to generate a reproducible sequence of random numbers, which is not only important in scientific contexts, but also when testing and debugging your functions. The ability to create seeded PRNGs is one reason why you might prefer to use stdlib‘s PRNGs (randu, mt19937, and others) over the built-in JavaScript Math.random function.

Running the above code snippet should produce the following output:

count: 26; max: 26, mean: 26
count: 28; max: 28, mean: 27
count: 31; max: 31, mean: 28.333333333333332
count: 30; max: 31, mean: 28.75
...
count: 24; max: 41, mean: 28.94642857142857
count: 25; max: 41, mean: 28.877192982456137
count: 30; max: 41, mean: 28.89655172413793
count: 25; max: 41, mean: 28.83050847457627

In reality, one is unlikely to encounter data that can be perfectly modeled as Poisson data. A common check for assessing whether count data can be modeled by a Poisson process is to compare the mean to the variance, where the variance is a measure of dispersion, calculated as the average sum of squared differences from the mean. The rationale for this check is that a Poisson process’ mean should equal to its variance.

To assess whether observed data can be modeled as a Poisson process, we’ll use the variance-to-mean ratio (VMR), which, as the name suggests, is defined as the variance divided by the mean. When the VMR is less than one, we say that the observed data is “under-dispersed”. When the VMR is greater than one, we say that the observed data is “over-dispersed”. If the VMR is equal to one, the variance and mean are equal, suggesting a Poisson process. In both the under- and over-dispersed cases, a distribution other than the Poisson distribution may be a better fit (e.g., the negative binomial distribution is commonly used to model over-dispersed data).

Let’s modify our previous code snippet to calculate the variance-to-mean ratio.

import poisson from '@stdlib/random/base/poisson';
import incrvmr from '@stdlib/stats/incr/vmr';

const stats = {
    'vmr': incrvmr()
};

// Create a "seeded" PRNG:
const rpois = poisson.factory({
    'seed': 1303
});

async function main() {
    for ( let i = 0; i < 60; i++ ) {
        await sleep( 1000 );
        let count = rpois( 30 );
        let vmr = stats.vmr( count );
        console.log( 'count: %d, vmr: %d', count, vmr );
    }
}

main();

Running this snippet should yield the following output:

count: 26, vmr: 0
count: 28, vmr: ~0.074
count: 30, vmr: ~0.224
count: 31, vmr: ~0.171
...
count: 24, vmr: ~1.024
count: 25, vmr: ~1.018
count: 30, vmr: ~1.0
count: 25, vmr: ~0.994

Given that our data is generated from a Poisson distribution, we should expect that the VMR gets closer to one the more data we accumulate, as is the case as shown above.
So what can we do with our probability model? One possibility is to create an anomaly detection system, such that we are alerted whenever we observe a spike in the number of page views. Such an alert might prompt us to provision more servers to handle the increased traffic or to investigate the anomalous user behavior to ensure we are not victims of a denial of service attack.
As a first pass page view anomaly detection system, we can use our model to construct a prediction interval, which is defined as an interval for which we expect X% of observed counts to belong. For example, in our case, we’d like to know the upper bound for observed counts in which we can be 99% certain that an observed count is expected and not anomalous.
In practice, this task is a bit more involved since the data has been used to estimate the rate parameter λ. But in our toy example, we know the true rate (λ = 30), and, thus, we can readily calculate an interval that will contain the next count with a specified probability. For a given probability p between 0 and 1, the quantile function of a probability distribution returns the value that a random variable will only exceed in 100(1-p)% of cases. For example, if we wanted to know the count upper bound which is exceeded less than 1% of the time, we’d evaluate the quantile function for p = 0.99.

let v = pois.quantile( 0.99 );
// returns 43

For λ = 30, we can thus expect that we will observe page view counts which are greater than 43 only 1% of the time. If we believed a 99% threshold appropriate, we could then configure our anomaly detection system to notify us whenever the number of page views exceeds 43.

import poisson from '@stdlib/random/base/poisson';

// Create a "seeded" PRNG:
const rpois = poisson.factory({
    'seed': 1303
});

// Compute our anomaly detection threshold:
const threshold = pois.quantile( 0.99 );

async function main() {
    for ( let i = 0; i < 60; i++ ) { await sleep( 1000 ); let count = rpois( 30 ); if ( count > threshold ) {
            console.log( 'ALERT! Possible anomaly detected. Count: %d.', count );
        }
    }
}

main();

While this is a simple example, probabilistic models form the core of many modern predictive models. A few of these are already included in stdlib (including for the analysis of large datasets). If you are keen on exploring how to use stdlib to create a more robust anomaly detection system, several other aspects might deserve attention:

  • In reality, the underlying forces governing page views are likely to change over time. For example, page views may vary according to season or, on a finer-grained scale, depending on the time of day. One approach to deal with such “non-stationarity” is to calculate relevant statistics over a specified time interval, thus discarding older observations and only considering the most recent observations.
  • In the code snippets above, we used accumulators which calculate statistics over all provided values, but we could have just as easily used windowed variants which are also available as part of stdlib. These moving versions can be identified by an m in front of a statistic’s name (e.g., mmax, mmean, mvmr). Accordingly, one possible extension to our anomaly detection system is to use the windowed variants to account for a fluctuating λ.
  • To assess whether or not we could model observed data as a Poisson process, we calculated the variance-to-mean ratio. A more statistically rigorous approach would be to use a chi-square goodness-of-fit test. The chi-square goodness-of-fit test is appropriate for discrete data and distributions. For continuous data, we can use the Kolmogorov-Smirnov goodness-of-fit test
  • An interesting extension to our anomaly detection system would be to use a chi-square goodness-of-fit test to monitor incoming data for changes in the generative model. If a chi-square goodness-of-fit test consistently fails, we may want to re-evaluate the appropriateness of the thresholding technique we discussed above and perhaps opt for a different model.

To recap: we explored how we can use stdlib to perform Monte Carlo simulations for generating synthetic data, to incrementally compute statistics over real-time data, to evaluate model fit, and to calculate various properties of probability distributions. From our accumulated knowledge, we showed how we could monitor page views on a website and create a basic anomaly detection algorithm. Hopefully, you learned a thing or two about how you can use JavaScript and stdlib for analyzing data in real-time and are as excited as we are for the future of JavaScript for data analysis and machine learning.

If you are interested in running the above code snippets yourself, see the following GitHub repository: https://github.com/Planeshifter/stdlib-realtime-statistics-code.

 

Sessions about Node.js at the iJS 2018 in Munich:


→ Session: Analyzing Large Datasets with JavaScript and Node.js

→Session: Node.js: Performance Analytics and Best Practices

More about the Node.js-Track

STAY TUNED!

 

BEHIND THE TRACKS OF iJS

Angular

Best-Practises with Angular

Vue.js

One of the most famous frameworks of modern days

Web Development & Architecture

DevOps, Testing, Performance, Toolchain & SEO

Node.js

All about Node.js

React

From Basic concepts to unidirectional data flows