Statistics function incorrectly computing median

Ben Abbott bpabbott at mac.com
Sat Jan 19 11:49:29 CST 2008


On Jan 19, 2008, at 5:13 AM, Miguel Garcia-Blanco wrote:

> > I looked at the routines in R and Maxima and have concluded that  
> they
> > assume the probability of each sample is a constant; i.e. 1/(# of
> > samples).
> >
> > The routine "discrete_inv (x, v, p)" permits the user the specify  
> the
> > probability of each sample.
> >
> > Thus, it took some time to determine a method that was consistent  
> with
> > constant probability and variable probability.
> >
> > Before, submitting a formal patch, I thought it wise to attach a
> > script for testing.
> >
> > The equivalent commands in Octave, R, and Maxima are
> >
> > Octave: discrete_inv ([1/4 2/4 3/4], <v>, ones (size (<v>)) / numel
> > (<v>))
> >
> > GNU R: summary ( c( <v> ) )
> >
> > Maxima: load( discriptive )$
> >         quantile ( <v>, 1/4), numer;
> >         quantile ( <v>, 2/4), numer;
> >         quantile ( <v>, 3/4), numer;
> >
> > Where <v> is to be replaced by a sequence like; [1, 2, 3, 4].
> >
> > I've compared the result for a few examples. It appears to now be
> > working properly. However, I've been wrong before, so ....
> >
> > Ben
>
> (Disclaimer: I don't know much about quantiles, so what I'm about to  
> say may
> not be correct.)
>
> I don't know what the summary function actually does (I haven't seen  
> the
> code), but I suspect it makes a call to the quantile function. R has 9
> different methods for calculating sample quantiles. By default, the  
> extreme
> values and quartiles are returned, calculated using method 7. Here's  
> an
> example:
>
> > # Example to compare different methods
> > x <- c( 1, 2, 3, 4 )
> > res <- matrix( as.numeric(NA), 9, 5 )
> > for( iType in 1:9 ) res[iType, ] <- y <- quantile( x, type = iType )
> > dimnames( res ) <- list( 1:9, names( y ) )
> > round( res, 2 )
>   0%  25% 50%  75% 100%
> 1  1 1.00 2.0 3.00    4
> 2  1 1.50 2.5 3.50    4
> 3  1 1.00 2.0 3.00    4
> 4  1 1.00 2.0 3.00    4
> 5  1 1.50 2.5 3.50    4
> 6  1 1.25 2.5 3.75    4
> 7  1 1.75 2.5 3.25    4
> 8  1 1.42 2.5 3.58    4
> 9  1 1.44 2.5 3.56    4
>
> The appropriateness of a particular sample quantile probably depends  
> on
> purpose and whether the parent distribution is continuous or discrete.
> (Discrete: methods 1, 2, and 3. Continuous: methods 4 through 9.)
>
> Since discrete_inv.m is for a discrete rv, I think it should give  
> results
> that match with one of R's discrete methods. (My preference is for  
> 2, simply
> because it gives the median such that Pr(X <= m) = Pr(X >= m) = 0.5).
>
> In light of this, perhaps statistics.m needs a continuous/discrete  
> flag?

It is clear, that a change is merited. I agree it makes sense to  
consider "discrete",  and "continuous" distributions.

There is no "statistics()" function in Matlab, but there is a  
"quantile()". It uses a continuous representation, I'd like to confirm  
our implementation is consistent with theirs, but otherwise there is  
no compatibility issue.

The current continuous method is consistent with both R and Maxima, so  
we're good there.

What we need to decide is;

(1) What algorithm should be used in the discrete case?

  I'd prefer the current implementation, which mirror of R's 1st   
method ... less work for me;-)

However, changing to the second method should be simple. Please post  
results for some other examples; x = [1:5], x = [1, 2, 5, 9], and x =  
[1, 2, 5, 9, 11]; ... I'd do it myself, but am not so familiar with R.

(2) Are "empirical" samples to be handled as "continuous" or "discrete"?

  ... I assume "continuous" is correct?

Ben



More information about the Bug-octave mailing list