Speed improvements and slowups 3.0.x to latest 3.1.x
David Bateman
dbateman at dbateman.org
Sat Feb 7 02:11:40 CST 2009
Jaroslav Hajek wrote:
> Now, theoretically memcpy is faster than std::copy because it has the
> stronger assumptions (non-overlapping ranges), but I always thought
> the differences were minuscule. You can easily choose the fastest
> algorithm if you can safely compare any two pointers within your
> address space, which you can't in C/C++, but maybe you can at the
> system level? I see g++ 4.3. defers std::copy of POD types to an
> internal function (__builtin_memmove), so it's a question how good is
> that. Were you compiling using an older version / different compiler?
> Basically, we could mimick the POD dispatching done in g++'s standard
> library files and then put memcpy back into the game, but I really
> don't like this idea much, I think this is the sort of stuff compilers
> should handle.
>
Although the compiler was the same, in one case I'm using the prebuilt
debian version of octave and building it myself for 3.1.51+.. It semms
you built both yourself and didn't see the effect I was seeing so I take
it back..
> regarding sort, there are a number of further optimization possibilities.
>
> 1. the algorithm, borrowed from Python, is really targeted to sorting
> arrays containing already sorted segments, i.e. re-sorting. I did a
> few experiments here ind it seems that std::sort from gcc 4.3,
> implementing introsort, wins for random arrays but gets beaten for
> arrays that contain a lot of sorted segments. Maybe we could create a
> hybrid, falling back to std::sort if no long runs are detected.
>
The python algorithm is perhaps 70% slower for random data but many
times faster for partially sorted data that is likely to be a more
common case inside a loop. So I'd say the penalty for random data is
acceptable given the gain for partially sorted lists
> 2. the comparison is currently done by passing in a function pointer.
> For simple POD types, this is a significant overhead. In fact I
> experimented with allowing it to be inlined by giving octave_sort a
> second template parameter, and for random arrays it was a win by up to
> 20% (IIRC). For partially sorted arrays, the win was smaller
> (apparently because the number of comparisons was less compared to the
> bookkeeping).
> The result was, however, more complicated because you no longer have
> one external octave_sort instance per type, a the instantiations got a
> bit messy - it took me a while before I sorted out all undefined
> references at linking. And I didn't like the idea making the algorithm
> completely inline to further slow down compilation.
> Now, the mess could be avoided if we resign on optimizing descending
> sorts and just optimize the ascending ones by optimizing the case when
> octave_sort::compare is null. Currently this is tested in each
> comparison (see IFLT in octave_sort.cc) which is probably sub-optimal
> even though branch prediction in modern CPUs can cope amazingly well
> with stuff like that.
> I think it is possible to move the dispatch up a little, using
> templates or just code duplication, which could again win us some
> percents.
>
Maybe this one would be reasonable ..
> I guess I could try to realize some of the two ideas. Do you know of a
> serious Octave application where sorting is a bottleneck?
>
Not really, just that it was a regression I saw against the sciview
benchmark
D.
--
David Bateman dbateman at dbateman.org
35 rue Gambetta +33 1 46 04 02 18 (Home)
92100 Boulogne-Billancourt FRANCE +33 6 72 01 06 33 (Mob)
More information about the Octave-maintainers
mailing list