/* This file contains libraries of statistical functions, including
those for:
* Error function
* Inverse error function
* Normal/Gaussian distribution
* Inverse functions for Normal distribution
* Log-Normal distribution
* Binomial distribution
* Poisson distribution.
* Weibull distribution
* Exponential distribution
TODO: These functions don't yet work for arbitrary precision (e.g.
phi[9] gives 0.9999999999999999999 and larger than that just gives 1).
Some functions claim to be able to calculate to a large number of digits,
but that sometimes doesn't happen. (This should include and use
ArbitraryPrecision.frink for those cases.)
However, many of these functions now dispatch correctly to correct limiting
functions, so if you set precision properly, you can get a good answer.
(See the dispatching functions erf, erfc, inverseErf, and inverseErfc
for examples of choosing the right function for various ranges.)
for example, if you call:
setPrecision[40]
a = phi[9.4]
You will get:
0.9999999999999999999972718464286538660795
And then calling:
inversePhi[a]
You will get:
9.4000000000000007230366609333197
Which is a reasonable answer to lots of decimal places. You may need to
request even more working precision. For example, phi[12], calculated with
40 digits of precision is:
0.999999999999999999999999999999998223518
While phi[13] gets dangerously close to losing all information with
40 digits of precision:
phi[13]
0.999999999999999999999999999999999999994
While any statistical function with this much confidence is totally
suspect, it's sometimes useful to be able to calculate how silly that
some statistical claims really are with lots of digits.
NOTE: As of the 2024-05-09 release of Frink, the function to calculate the
mean and standard deviation of a data series, meanAndSD[list, sample] has
been made part of Frink's built-in functions and has been removed from this
file. See the "Other Functions" section of the documentation for details.
*/
/** Error function erf, complementary error function erfc, and their
inverses.
Implementation of the error function erf[x] and the complementary error
function erfc[x] where erfc[x] = 1 - erf[x].
Error function erf[x]
This is a "smart" dispatcher function that attempts to call the right
function to calculate for a given range. This is the function that users
should call.
*/
erf[x] := (x<0 ? -erf[-x] : ((abs[x] > 4.0) ? 1-erfcAsymptotic[x] : erfTaylor[x]))
/** Complementary error function erfc[x] where erfc[x] = 1 - erf[x]
This dispatches to a "smart" dispatcher function that attempts to call the
right function to calculate for a given range. This is the function that
users should call. Note that this is 1-erf[x], but fixed slightly so that
we don't underflow and lose precision if erf[x] is tiny.
*/
erfc[x] := (x<0 ? 2 - erfc[-x] : ((abs[x] > 4.0) ? erfcAsymptotic[x] : 1-erfTaylor[x]))
// Inverse error function. This is a dispatcher function that calls the
// best algorithm depending on the size of the number. This is the function
// that users should call.
inverseErf[z, digits=getPrecision[]] := z < 0 ? -inverseErf[-z, digits] : (z > .96 ? inverseErfcFettis[1-z] : inverseErfForward[z,digits])
// Complementary inverse error function.
inverseErfc[z, digits=getPrecision[]] := z < 0 ? -inverseErfc[-z, digits] : (z < .04 ? inverseErfcFettis[z] : inverseErfForward[1-z,digits])
/* =====================================================
Functions for the normal/Gaussian distribution.
*/
// The probability distribution function or probability mass function
// for the normal distribution. You can, say, use this to plot the typical
// bell curve. Note, however, that it's the *integral* of this function that
// you generally want to work with, which is represented by the phi function
// below.
normalDensity[x, mean, sd] := 1/(sd * sqrt[2 pi]) exp[-(x-mean)^2/(2 sd^2)]
// This is the cumulative distribution function for the normal/Gaussian
// distribution.
//
// Calculate the probability (from 0 to 1) of a score *less than or equal to* x
// in a normally-distributed distribution with the given mean and standard
// deviation.
// For example, to find the percentile score of an IQ of 130, given the
// assumption that IQ has mean of 100 and sd of 15, use:
// phi[130, 100, 15]
// which gives 0.977, or the 97.7th percentile.
phi[x, mean, sd] := phi[(x-mean)/sd]
// An alias for phi[x, mean, sd]
normalProbability[x, mean, sd] := phi[x, mean, sd]
// Calculate the probability that a random value z has a value less than or
// equal to x, assuming x is normally distributed, with a mean of 0 and a
// standard deviation of 1.
// This is fundamental to lots of statistics. It is essentially the integral
// of the normal curve from negative infinity to z.
phi[z] := 1/2 (1 + erf[z/sqrt[2]])
// Inverse functions for the normal distribution.
// Inverse normal probability. This gives a z-score (number of standard
// deviations above or below the mean). So, for example, inversePhi[0.5]
// gives a z-score of 0, as the 50th percentile is right at the mean.
inversePhi[phi, digits=getPrecision[]] := sqrt[2] inverseErf[2 phi - 1, digits]
// An alias for inversePhi[phi]
inverseNormalProbability[phi, digits=getPrecision[]] := inversePhi[phi, digits]
// Inverse phi function for a distribution with the specified mean and
// standard deviation. For example, the IQ of someone in the 98th percentile
// can be calculated with the following, assuming the mean IQ is 100 and the
// IQ scale has a standard deviation of 15:
// inversePhi[0.98, 100, 15]
inversePhi[phi, mean, sd] := mean + inversePhi[phi] * sd
// An alias for inversePhi
inverseNormalProbability[phi, mean, sd] := inversePhi[phi, mean, sd]
/* =====================================================
Functions for the log-normal distribution.
A log-normal distribution is a probability distribution of a random variable
whose logarithm is normally distributed. If X is a random variable with a
normal distribution, then Y = exp(X) has a log-normal distribution; likewise,
if Y is log-normally distributed, then X = log(Y) is normally distributed.
*/
/*
The probability density function or probability mass function
for the log-normal distribution.
*/
logNormalDensity[x, mean, sd] :=
{
1/(x sqrt[2 pi sd^2]) e^-((ln[x] - mean)^2 / (2 sd^2))
}
/** The cumulative distribution function for the log-normal distribution. */
logNormalProbability[x, mean, sd] := 1/2 (1 + erf[(ln[x] - mean)/sqrt[2 sd^2]])
/* ===================================================
Functions for the Binomial distribution
This gives the probability density function or probability mass function
of the binomial distribution. The function below gives the probability of
getting *exactly* k successes in n trials, where the probability of success
of a single trial is p.
(If the number of trials n is large and the probability low, the Binomial
distribution approaches the Poisson distribution (see below.). To be more
precise, binomialDensity[k, n, p] approaches poissonDensity[k, n * p])
For example, the probability of getting exactly 8 heads out of 10 coin
flips is given by:
binomialDensity[8, 10, 1/2]
*/
binomialDensity[k, n, p] := binomial[n, k] p^k (1-p)^(n-k)
/* This gives the cumulative probability of getting x *or fewer* successes in
n trials, where the probability of success of a single trial is p.
For example, the probability of getting 8 or fewer heads out of 10 coin
flips is given by:
binomialProbability[8, 10, 1/2]
(If the number of trials n is large and the probability low, the Binomial
distribution approaches the Poisson distribution (see below.). To be more
precise, binomialProbability[x, n, p] approaches
poissonProbability[x, n * p])
*/
binomialProbability[x, n, p] :=
{
sum = 0
for i = 0 to floor[x]
sum = sum + binomialDensity[i, n, p]
return sum
}
/* binomialProbabilityNew[x, n, p] :=
{
/** This is a slightly optimized version of the naive:
for i = 0 to floor[x]
sum = sum + binomialDensity[i, n, p]
but for large number of iterations it isn't significantly faster. Most
of the time is spent because integers and rational numbers get HUGE.
*/
sum = 0
ratio = p/(1-p)
term = (1-p)^n
for i = 0 to floor[x]
{
sum = sum + binomial[n, i] term
term = term * ratio
}
return sum
}
*/
/* ===================================================================
Functions for the Poisson distribution.
The Poisson distribution is a discrete probability distribution that expresses
the probability of a number of events occurring in a fixed period of time if
these events occur with a known average rate and independently of the time
since the last event.
See the exponential distribution functions below for predictions of when
events are likely to happen.
*/
/* This gives the probability mass function or probability density function of
the Poisson distribution. This gives the probability of getting *exactly* k
occurrences of an event, given the expected number of occurrences (expected)
in that interval. */
poissonDensity[k, expected] := expected^k e^(-expected) / k!
/*
This gives the cumulative probability or cumulative distribution function of
the Poisson distribution. This gives the probability of getting k *or fewer*
occurrences of the event, given the expected number of occurrences (expected)
in that interval.
*/
poissonProbability[k, expected] :=
{
if k <0
return 0
// Since we're going to be multiplying by a transcendental (e^-expected)
// we can use floating-point through all calculations.
sum = 1.
num = 1.
denom = 1.
// This loop is a faster way of doing sum(i=1 to k, expected^i / i!)
for i=1 to k
{
num = num * expected
denom = denom * i
sum = sum + num/denom
}
return sum * e^(-expected)
}
/* ===================================================================
Functions for the Exponential distribution.
The exponential distribution is a discrete probability distribution that
expresses the time between events in a Poisson process, i.e. a process in
which events occur continuously and independently at a constant average rate.
See the Poisson distribution functions above for functions related to the
probabilities of *multiple* events with an exponential distribution occurring.
*/
/* This gives the probability mass function or probability density function of
the exponential distribution. This gives the probability of an event
occurring *exactly* at time x, given the expected number of occurrences
(expected) in a given time. You will usually not want to use this function
but the cumulative distribution function exponentialProbability below. */
exponentialDensity[x, expected] := expected e^(-expected x)
/* This gives the cumulative distribution function of the exponential
distribution. This gives the probability of an event occurring at time x *or
before*, given the expected number of occurrences (expected) in a given
time. */
exponentialProbability[x, expected] := 1 - e^(-expected x)
/* ===================================================================
Functions for the Weibull distribution.
The Weibull distribution is a continuous probability distribution. It is a
two-parameter distribution often used for time-to-failure analysis.
It is identical to the exponential distribution (see above) when k=1,
and the Rayleigh distribution when k=2.
lambda is sometimes called the "scale" parameter, and is analogous/identical
to the "expected" value in the exponential and Poisson distributions.
k is sometimes called the "shape" parameter.
If the quantity x is a "time-to-failure", the Weibull distribution gives a
distribution for which the failure rate is proportional to a power of
time. The shape parameter, k, is that power plus one, and so this parameter
can be interpreted directly as follows:
* A value of k<1 indicates that the failure rate decreases over time. This
happens if there is significant "infant mortality", or defective items
failing early and the failure rate decreasing over time as the defective
items are weeded out of the population.
* A value of k=1 indicates that the failure rate is constant over
time. This might suggest random external events are causing mortality, or
failure.
* A value of k>1 indicates that the failure rate increases with time. This
happens if there is an "aging" process, or parts that are more likely to
fail as time goes on.
See: http://en.wikipedia.org/wiki/Weibull_distribution
*/
/* This gives the probability mass function or probability density function of
the Weibull distribution. */
WeibullDensity[x, lambda, k] :=
{
if x < 0
return 0
else
return k/lambda (x/lambda)^(k-1) e^-(x/lambda)^k
}
/* This gives the cumulative distribution function of the Weibull
distribution. */
WeibullProbability[x, lambda, k] := 1 - e^-(x/lambda)^k
/* ===================================================================
Functions for the Student's t distribution.
Student's-t distribution is a continuous distribution. It is primarily used
when sampling from a normally-distributed population whose standard
deviation is unknown. It is used in the Studentâ€™s t-test and for calculating
confidence intervals for the difference between two population means,
*/
/* This calculates the probability mass function or probability density
function of the StudentT distribution for a normalized t parameter and
specified degrees of freedom (the degrees of freedom is usually 1 less
than the number of items sampled.) */
studentTDensity[t, degreesFreedom] :=
{
return studentTCoeff[degreesFreedom] (1 + t^2/degreesFreedom)^(-(degreesFreedom+1)/2)
}
/** This gives the cumulative distribution function of the Student-t
distribution with the specified normalized t parameter and degrees of
freedom. */
studentTProbability[t, degreesFreedom] :=
{
// AUUUUGHHHHH. What a pain in the ass this is to calculate.
}
// ===================================================================
// Functions below this point should probably not be called directly,
// as they may not behave well for all numerical values. Use the functions
// above which automatically select the proper functions to call for different
// numeric ranges.
// Asymptotic series expansion for erfc.
// See :
// http://en.wikipedia.org/wiki/Error_function#Asymptotic_expansion
// Unfortunately, the algorithm below diverges for all values of x.
// It is able to produce more digits for large values of x before it
// begins to diverge, and is practically useless for smaller x.
//
// Using this function to calculate erf is generally better than the Taylor
// series for erf[4.0] and above.
erfcAsymptotic[x, debug = false, stats = false, digits=getPrecision[]] :=
{
c = e^(-x^2)/(x sqrt[pi])
if (debug)
println["c is $c"]
sum = 1
precision = 10^(-digits)
if (debug)
println["Attempting to get precision less than $precision"]
var term = 10000000
var n = 0
var lastterm
do
{
lastterm = term
n = n + 1
term = (-1)^n * (2n)!/(n! (2x)^(2n))
if (debug)
println[term]
sum = sum + term
} while (abs[term/sum] > precision) and (abs[term] < abs[lastterm])
// Did we terminate because the series started to diverge?
if (abs[term] > abs[lastterm])
{
digits = -log[abs[term]] // Adjust the number of useful digits
if (debug)
println["Warning: started to diverge. Precision will be no better than " + format[digits, 1, 1] + " digits."]
sum = sum - term // Back out last diverging term.
}
if (debug)
println["Took $n iterations."]
if (stats)
return [sum * c, n, format[digits,1,3]]
else
return sum * c
}
// Taylor series expansion for erf[x]. See:
//
// http://en.wikipedia.org/wiki/Error_function
//
// This begins to fall apart at x=4.0 or so, and gets really bad above
// 5.0.
//
// This function should not be called directly, but call the erf[] dispatcher
// function above.
erfTaylor[x, debug = false, stats = false, digits=getPrecision[]] :=
{
// This special case is necessary because the Taylor series just
// sits at 0 for all iterations.
if x == 0
return 0
c = 2 / sqrt[pi]
sum = 0
precision = 10^(-digits)
if (debug)
println["Attempting to get precision less than $precision"]
var term = 10000000
n = -1
var lastterm
do
{
lastterm = term
n = n + 1
term = (-1)^n * x^(2n+1)/((2n+1) n!)
if (debug)
println[term]
sum = sum + term
} while (abs[term/sum] > precision)
if (debug)
println["Took $n iterations."]
if (stats)
return [sum * c, n, format[digits,1,3]]
else
return sum * c
}
// Inverse of the Erf function
// See
// http://functions.wolfram.com/GammaBetaErf/InverseErf/06/01/0001/
// for an explanation of the power series.
// This algorithm works but is bloody slow for z > 0.9
// It is currently rejected in favor of the algorithms due to Fettis below.
inverseErfForward[z, digits=getPrecision[], debug = false, stats = false] :=
{
if (z==0) // Special case to avoid divide-by-zero
return 0
var sum = 0.
var precision = 10^(-digits)
var term = 10000000
var c = sqrt[pi]/2
var k = 0
do
{
term = InverseErfHelper.getCk[k] (c z)^(2k + 1) / (2k+1)
if (debug)
println[term]
sum = sum + term
k = k + 1
} while (abs[term/sum] > precision)
if (stats)
return [sum, k, format[digits,1,3]]
else
return sum
}
// This algorithm works but is bloody slow for n<0.1
inverseErfc[z, digits=getPrecision[], debug=false] :=
{
var term
var k = 0
var c = sqrt[pi]/2
var sum = 0.
var precision = 10^(-digits)
do
{
term = InverseErfHelper.getCk[k] (c * (z-1))^(2k+1) / (2k+1)
if (debug)
println[term]
sum = sum + term
k = k + 1
} while (abs[term/sum] > precision)
return -sum
}
// Asymptotic expansion of inverseErf which works for values as z->1.
// This doesn't seem to work well and probably shouldn't be trusted.
inverseErfAsymptotic[z] := (1/sqrt[2]) sqrt[log[2/(pi (z - 1)^2)] - log[log[2/(pi (z - 1)^2)]]]
// Algorithm given by "A Stable Algorithm for Computing the Inverse Error
// Function in the 'Tail-End' Region", Henry E. Fettis, Mathematics of
// Computation, Volume 28, Number 126, April 1974, Pages 585-587
// http://www.ams.org/journals/mcom/1974-28-126/S0025-5718-1974-0341812-5/S0025-5718-1974-0341812-5.pdf
inverseErfcFettis[x] :=
{
y = (-ln[pi^(1/2) x (-ln[x])^(1/2)])^(1/2)
// println["Initial y is $y"]
sqrtpi = sqrt[pi]
do
{
oldy = y
y = (ln[FettisG[y] / (sqrtpi x)])^(1/2)
// println["Next y is $y"]
} while abs[y - oldy]/y > 1e-15
return y
}
// Build continued fraction from Fettis paper.
FettisG[y] :=
{
t = 1/y
return t / (1 + FettisG[t^2 / 2, 1])
}
// Recursively calculate parts of continued fraction.
// TODO: Make limits dynamic. This function actually requires *fewer*
// iterations as the numbers get larger.
// Since we're currently working with a fixed number of iterations,
// the continued fraction could be built bottom-up and non-recursively.
FettisG[tt,n] :=
{
if n > 87 // Good enough for z=0.96; TODO: Make dynamic
return 1.
else
return n tt / (1. + FettisG[tt, n+1])
}
//for n = 1 to 80
// println[FettisG[(1-0.91), n]]
// This class has methods to cache the values of c_k which are used repeatedly
// in calculating the inverse Erf function.
class InverseErfHelper
{
class var ck = new array
class var summn = new dict[]
class getCk[k] :=
{
len = length[ck]
if (len == 0)
{
ck@0 = 1
ck@1 = 1
len = len + 2
}
// If cached, return it
if len > k
return ck@k
var sum
var m
for n = len to k
{
sum = 0.
// Calculate for all needed values
// Some recalculation is done here. This could be made more
// efficient.
for m = 0 to n-1
sum = sum + (ck@m ck@(n-1-m))/((m+1)(2m + 1))
ck@n = sum
// if n > 0
// println["ck@$n = $sum, ratio = " + (ck@n / ck@(n-1))]
}
return ck@k
}
}
/** Calculate the coefficients for Student's T distribution. This is a
shortcut which eliminates the need to calculate the gamma function.
Shortcuts taken from:
http://en.wikipedia.org/wiki/Student%27s_t-distribution#Probability_density_function
*/
studentTCoeff[nu] :=
{
coeff = 1
if (nu mod 2) == 0 and nu>0 // nu even
{
for c = 2 to nu-2 step 2
coeff = coeff * (c+1)/(c)
coeff = coeff / (2 sqrt[nu])
return coeff
}
if (nu mod 2) == 1 and nu >= 1 // nu odd
{
for c = 3 to nu-2 step 2
coeff = coeff * (c-1)/c
coeff = coeff / (pi sqrt[nu])
return coeff
}
println["studentTCoeff not implemented for nu=$nu"]
return undef
}
/** Calculate the Student's T value of a single sampled group.
*/
studentT[sampleMean, populationMean, sampleSD, sampleSize] :=
{
return (sampleMean - populationMean) / (sampleSD / sqrt[sampleSize])
}
// Samples to examine behavior of functions
//for n=0.01 to .89 step .1
//{
// [val, iterations, digits] = inverseErf[n, false, true, 15]
// println["$n\t$val\t$iterations\t$digits"]
//}
//for n=0.2 to 20 step .2
//{
// [val, iterations, digits] = erfTaylor[n, false, true, 20]
// println["$n\t$iterations\t$digits"]
//}
"statistics.frink included OK"