Statistics function incorrectly computing median
Ben Abbott
bpabbott at mac.com
Sat Jan 19 11:49:29 CST 2008
On Jan 19, 2008, at 5:13 AM, Miguel Garcia-Blanco wrote:
> > I looked at the routines in R and Maxima and have concluded that
> they
> > assume the probability of each sample is a constant; i.e. 1/(# of
> > samples).
> >
> > The routine "discrete_inv (x, v, p)" permits the user the specify
> the
> > probability of each sample.
> >
> > Thus, it took some time to determine a method that was consistent
> with
> > constant probability and variable probability.
> >
> > Before, submitting a formal patch, I thought it wise to attach a
> > script for testing.
> >
> > The equivalent commands in Octave, R, and Maxima are
> >
> > Octave: discrete_inv ([1/4 2/4 3/4], <v>, ones (size (<v>)) / numel
> > (<v>))
> >
> > GNU R: summary ( c( <v> ) )
> >
> > Maxima: load( discriptive )$
> > quantile ( <v>, 1/4), numer;
> > quantile ( <v>, 2/4), numer;
> > quantile ( <v>, 3/4), numer;
> >
> > Where <v> is to be replaced by a sequence like; [1, 2, 3, 4].
> >
> > I've compared the result for a few examples. It appears to now be
> > working properly. However, I've been wrong before, so ....
> >
> > Ben
>
> (Disclaimer: I don't know much about quantiles, so what I'm about to
> say may
> not be correct.)
>
> I don't know what the summary function actually does (I haven't seen
> the
> code), but I suspect it makes a call to the quantile function. R has 9
> different methods for calculating sample quantiles. By default, the
> extreme
> values and quartiles are returned, calculated using method 7. Here's
> an
> example:
>
> > # Example to compare different methods
> > x <- c( 1, 2, 3, 4 )
> > res <- matrix( as.numeric(NA), 9, 5 )
> > for( iType in 1:9 ) res[iType, ] <- y <- quantile( x, type = iType )
> > dimnames( res ) <- list( 1:9, names( y ) )
> > round( res, 2 )
> 0% 25% 50% 75% 100%
> 1 1 1.00 2.0 3.00 4
> 2 1 1.50 2.5 3.50 4
> 3 1 1.00 2.0 3.00 4
> 4 1 1.00 2.0 3.00 4
> 5 1 1.50 2.5 3.50 4
> 6 1 1.25 2.5 3.75 4
> 7 1 1.75 2.5 3.25 4
> 8 1 1.42 2.5 3.58 4
> 9 1 1.44 2.5 3.56 4
>
> The appropriateness of a particular sample quantile probably depends
> on
> purpose and whether the parent distribution is continuous or discrete.
> (Discrete: methods 1, 2, and 3. Continuous: methods 4 through 9.)
>
> Since discrete_inv.m is for a discrete rv, I think it should give
> results
> that match with one of R's discrete methods. (My preference is for
> 2, simply
> because it gives the median such that Pr(X <= m) = Pr(X >= m) = 0.5).
>
> In light of this, perhaps statistics.m needs a continuous/discrete
> flag?
It is clear, that a change is merited. I agree it makes sense to
consider "discrete", and "continuous" distributions.
There is no "statistics()" function in Matlab, but there is a
"quantile()". It uses a continuous representation, I'd like to confirm
our implementation is consistent with theirs, but otherwise there is
no compatibility issue.
The current continuous method is consistent with both R and Maxima, so
we're good there.
What we need to decide is;
(1) What algorithm should be used in the discrete case?
I'd prefer the current implementation, which mirror of R's 1st
method ... less work for me;-)
However, changing to the second method should be simple. Please post
results for some other examples; x = [1:5], x = [1, 2, 5, 9], and x =
[1, 2, 5, 9, 11]; ... I'd do it myself, but am not so familiar with R.
(2) Are "empirical" samples to be handled as "continuous" or "discrete"?
... I assume "continuous" is correct?
Ben
More information about the Bug-octave
mailing list