csvread 3.0.1 exponentially slower than v 3.0.0
Bill Denney
bill at denney.ws
Thu Jul 10 17:43:25 CDT 2008
Dmitri A. Sergatskov wrote:
> On Thu, Jul 10, 2008 at 11:28 AM, John W. Eaton <jwe at bevo.che.wisc.edu> wrote:
>
>> I'd recommend starting with something like 100x100 and then only
>> resizing if those bounds are exceeded, and resizing by something like
>> another 100 rows or 100 columns (whichever is needed). Then in the
>> end, resizing once more to the actual max number of rows and columns
>> found in the file. That should improve performance somewhat, but
>> perhaps someone can come up with a better solution.
>
> Double pass (first to figure out the size of the matrix, second to do
> the actual parsing and filling in the array) on the file is out of
> consideration?
My solution in similar situations has been to stat the file to get its
size, read the first two rows (to allow for a possible header), and use
the number of bytes in the second row as an estimate of the number of
bytes per row. Dividing the file size by that gives an estimated number
of rows. After that, you add a fudge factor to the estimate (say, 20%)
and let it run. If you need to resize again, you now have a much better
idea of the amount of space you need, so you resize to that new value
(using your current file position and a smaller fudge factor) and trim
the result to the actual size at the end.
If you want to get fancy, you can always make the final estimator more
dynamic, but that worked very well for me.
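As a rough sketch of the arithmetic described above (the function names
estimate_rows and reestimate_rows are made up for illustration and are
not part of Octave or csvread), the two estimates could be written as:

```octave
1;  # a leading command keeps Octave from treating this as a function file

## Hypothetical helper: initial guess at the row count from the file
## size and the byte length of one representative row.
function n = estimate_rows (filesize, bytes_per_row, fudge)
  n = ceil (fudge * filesize / bytes_per_row);
endfunction

## Hypothetical helper: revised guess once ROWS_READ rows and
## BYTES_READ bytes have been consumed.  The average row length seen
## so far is used to estimate how many rows remain.
function n = reestimate_rows (filesize, bytes_read, rows_read, fudge)
  avg = bytes_read / rows_read;
  n = rows_read + ceil (fudge * (filesize - bytes_read) / avg);
endfunction
```

For a 1 MB file whose second row is 50 bytes long, the first estimate
with a 20% fudge factor comes out at roughly 24000 rows. The second
estimate only matters if the rows turn out to be shorter than the
sampled one.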
A somewhat more concrete example that won't work but gets the idea
across is:
fid = fopen (filename);
## stat returns a struct; the file size in bytes is in its "size" field
filesize = stat (filename).size;
fileleft = filesize;
output = [];
row = 0;
maxrow = 2;
while (fileleft > 0)
  row++;
  ## In reality you would want to add something to verify that the number
  ## of columns has not changed, but changing the number of columns is
  ## relatively rare, so it doesn't need the fancier algorithm
  [output(row,:), bytes] = readcsvline (fid);
  fileleft = fileleft - bytes;
  if (row == 2)
    ## First estimate: file size / bytes in row 2, plus 20%
    output = resize (output, ceil (filesize/bytes*1.2), size (output, 2));
    maxrow = rows (output);
  elseif (row == maxrow)
    ## Out of room: the average bytes per row so far gives a better
    ## estimate of the rows remaining; add 10%, grow by at least 100 rows
    bytesperrow = (filesize - fileleft) / row;
    maxrow = row + max (100, ceil (1.1*fileleft/bytesperrow));
    output = resize (output, maxrow, size (output, 2));
  endif
endwhile
## Drop the preallocated rows that were never filled
output(row+1:end,:) = [];
fclose (fid);
function [x, bytes] = readcsvline (fid)
  ## Read the next line of the file, parse it and return the vector (x) for
  ## the row and the number of bytes in the line
  ## Assume that there's some code here that does that
endfunction
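One possible way to fill in readcsvline is sketched below. It is only an
illustration under some loud assumptions: every field is numeric, the
separator is a comma, there are no quoted fields, and one character is
one byte. It uses fgets so that the line terminator is included in the
byte count, and it relies on strsplit, which is not available in older
Octave releases.

```octave
1;  # a leading command keeps Octave from treating this as a function file

## Hypothetical implementation: numeric, comma-separated fields only.
function [x, bytes] = readcsvline (fid)
  txt = fgets (fid);           # keeps the newline; returns -1 at end of file
  if (! ischar (txt))
    x = [];
    bytes = 0;
    return;
  endif
  bytes = numel (txt);         # includes the line terminator
  ## Non-numeric fields come back as NaN from str2double
  x = str2double (strsplit (strtrim (txt), ","));
endfunction

## Usage: write a small temporary file and read its first line back
fname = tempname ();
fid = fopen (fname, "w");
fputs (fid, "1,2,3\n4,5,6\n");
fclose (fid);

fid = fopen (fname, "r");
[x, bytes] = readcsvline (fid);
fclose (fid);
unlink (fname);
```

Returning the byte count together with the parsed row is what lets the
caller track how much of the file is left without calling ftell on
every line.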
Have a good day,
Bill