Speed improvements and slowups 3.0.x to latest 3.1.x

David Bateman dbateman at dbateman.org
Sat Feb 7 02:11:40 CST 2009


Jaroslav Hajek wrote:
> Now, theoretically memcpy is faster than std::copy because it has the
> stronger assumptions (non-overlapping ranges), but I always thought
> the differences were minuscule. You can easily choose the fastest
> algorithm if you can safely compare any two pointers within your
> address space, which you can't in C/C++, but maybe you can at the
> system level? I see g++ 4.3. defers std::copy of POD types to an
> internal function (__builtin_memmove), so it's a question how good is
> that. Were you compiling using an older version / different compiler?
> Basically, we could mimick the POD dispatching done in g++'s standard
> library files and then put memcpy back into the game, but I really
> don't like this idea much, I think this is the sort of stuff compilers
> should handle.
>   
Although the compiler was the same, in one case I'm using the prebuilt 
debian version of octave and building it myself for 3.1.51+.. It semms 
you built both yourself and didn't see the effect I was seeing so I take 
it back..

> regarding sort, there are a number of further optimization possibilities.
>
> 1. the algorithm, borrowed from Python, is really targeted to sorting
> arrays containing already sorted segments, i.e. re-sorting. I did a
> few experiments here ind it seems that std::sort from gcc 4.3,
> implementing introsort, wins for random arrays but gets beaten for
> arrays that contain a lot of sorted segments. Maybe we could create a
> hybrid, falling back to std::sort if no long runs are detected.
>   
The python algorithm is perhaps 70% slower for random data but many 
times faster for partially sorted data that is likely to be a more 
common case inside a loop. So I'd say the penalty for random data is 
acceptable given the gain for partially sorted lists


> 2. the comparison is currently done by passing in a function pointer.
> For simple POD types, this is a significant overhead. In fact I
> experimented with allowing it to be inlined by giving octave_sort a
> second template parameter, and for random arrays it was a win by up to
> 20% (IIRC). For partially sorted arrays, the win was smaller
> (apparently because the number of comparisons was less compared to the
> bookkeeping).
> The result was, however, more complicated because you no longer have
> one external octave_sort instance per type, a the instantiations got a
> bit messy - it took me a while before I sorted out all undefined
> references at linking. And I didn't like the idea making the algorithm
> completely inline to further slow down compilation.
> Now, the mess could be avoided if we resign on optimizing descending
> sorts and just optimize the ascending ones by optimizing the case when
> octave_sort::compare is null. Currently this is tested in each
> comparison (see IFLT in octave_sort.cc) which is probably sub-optimal
> even though branch prediction in modern CPUs can cope amazingly well
> with stuff like that.
> I think it is possible to move the dispatch up a little, using
> templates or just code duplication, which could again win us some
> percents.
>   
Maybe this one would be reasonable ..
> I guess I could try to realize some of the two ideas. Do you know of a
> serious Octave application where sorting is a bottleneck?
>   
Not really, just that it was a regression I saw against the sciview 
benchmark

D.

-- 
David Bateman                                dbateman at dbateman.org
35 rue Gambetta                              +33 1 46 04 02 18 (Home)
92100 Boulogne-Billancourt FRANCE            +33 6 72 01 06 33 (Mob)



More information about the Octave-maintainers mailing list