Statistics function incorrectly computing median
Miguel Garcia-Blanco
miguel.01 at ihug.com.au
Sat Jan 19 04:13:33 CST 2008
> I looked at the routines in R and Maxima and have concluded that they
> assume the probability of each sample is a constant; i.e. 1/(# of
> samples).
>
> The routine "discrete_inv (x, v, p)" permits the user the specify the
> probability of each sample.
>
> Thus, it took some time to determine a method that was consistent with
> constant probability and variable probability.
>
> Before, submitting a formal patch, I thought it wise to attach a
> script for testing.
>
> The equivalent commands in Octave, R, and Maxima are
>
> Octave: discrete_inv ([1/4 2/4 3/4], <v>, ones (size (<v>)) / numel
> (<v>))
>
> GNU R: summary ( c( <v> ) )
>
> Maxima: load( discriptive )$
> quantile ( <v>, 1/4), numer;
> quantile ( <v>, 2/4), numer;
> quantile ( <v>, 3/4), numer;
>
> Where <v> is to be replaced by a sequence like; [1, 2, 3, 4].
>
> I've compared the result for a few examples. It appears to now be
> working properly. However, I've been wrong before, so ....
>
> Ben
(Disclaimer: I don't know much about quantiles, so what I'm about to say may
not be correct.)
I don't know what the summary function actually does (I haven't seen the
code), but I suspect it makes a call to the quantile function. R has 9
different methods for calculating sample quantiles. By default, the extreme
values and quartiles are returned, calculated using method 7. Here's an
example:
> # Example to compare different methods
> x <- c( 1, 2, 3, 4 )
> res <- matrix( as.numeric(NA), 9, 5 )
> for( iType in 1:9 ) res[iType, ] <- y <- quantile( x, type = iType )
> dimnames( res ) <- list( 1:9, names( y ) )
> round( res, 2 )
0% 25% 50% 75% 100%
1 1 1.00 2.0 3.00 4
2 1 1.50 2.5 3.50 4
3 1 1.00 2.0 3.00 4
4 1 1.00 2.0 3.00 4
5 1 1.50 2.5 3.50 4
6 1 1.25 2.5 3.75 4
7 1 1.75 2.5 3.25 4
8 1 1.42 2.5 3.58 4
9 1 1.44 2.5 3.56 4
The appropriateness of a particular sample quantile probably depends on
purpose and whether the parent distribution is continuous or discrete.
(Discrete: methods 1, 2, and 3. Continuous: methods 4 through 9.)
Since discrete_inv.m is for a discrete rv, I think it should give results
that match with one of R's discrete methods. (My preference is for 2, simply
because it gives the median such that Pr(X <= m) = Pr(X >= m) = 0.5).
In light of this, perhaps statistics.m needs a continuous/discrete flag?
I've attached the help file for R's quantile function for people who don't
have R.
-Miguel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://www.cae.wisc.edu/pipermail/bug-octave/attachments/20080119/321576a7/attachment-0002.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://www.cae.wisc.edu/pipermail/bug-octave/attachments/20080119/321576a7/attachment-0003.html
More information about the Bug-octave
mailing list