File::SortedSeek - A Perl module providing fast access to large files
    use File::SortedSeek ':all';
    open( BIG, $file ) or die $!;

    # find a number or the first number greater in a file (ascending order)
    $tell = numeric( *BIG, $number );
    # read a line in from where we matched in the file
    $line = <BIG>;
    print "Found exact match as $line" if File::SortedSeek::was_exact();

    # find a string or the first string greater in a file (alphabetical order)
    $tell = alphabetic( *BIG, $string );
    $line = <BIG>;

    # find a date in a logfile supplying a scalar localtime type string
    $tell = find_time( *BIG, "Thu Aug 23 22:59:16 2001" );
    # or supplying GMT epoch time
    $tell = find_time( *BIG, 998571554 );
    # get all the lines after our date
    @lines = <BIG>;

    # get the lines between two logfile dates
    $begin = find_time( *LOG, $start );
    $end   = find_time( *LOG, $finish );
    # get lines as an array
    @lines = get_between( *LOG, $begin, $end );
    # get lines as an array reference
    $lines = get_between( *LOG, $begin, $end );

    # use your own sub to munge the file line data before comparison
    $tell = numeric( *BIG, $number, \&epoch );
    $tell = alphabetic( *BIG, $string, \&munge_line );

    # use methods on files in reverse alphabetic or descending numerical order
    File::SortedSeek::set_descending();

    # for inexact matches set FH so first value read is before and second after
    File::SortedSeek::set_cuddle();

    # get last $n lines of any file as an array
    @lines = get_last( *BIG, $n );
    # or an array reference
    $lines = get_last( *BIG, $n );
    # change the input record separator from the OS default
    @lines = get_last( *BIG, $n, $rec_sep );
File::SortedSeek provides fast access to data from large files. Three
methods, numeric(), alphabetic() and find_time(), depend on the file data
being sorted in some way. Logfiles are a typical example of big files that
are sorted (by date stamp). The get_between() method can be used to get
a chunk of lines efficiently from anywhere in the file. The required
position(s) for the get_between() method are supplied by the previous
methods. The get_last() method will efficiently get the last N lines of any
file, sorted or not.
With sorted data a linear search is not required. Here is a typical linear search:
    while (<FILE>) {
        next unless /$some_cond/;
        # found cond, do stuff
    }
Remember that old game where you try to guess a number between, let's say, 0 and 128? Let's choose 101 and now try to guess it.
Using a linear search is the same as going 1 higher, 2 higher, 3 higher ... 100 higher, 101 correct! Consider the geometric approach: 64 higher, 96 higher, 112 lower, 104 lower, 100 higher, 102 lower - ta da, must be 101! This is the halve the difference search method and it can be applied to any data set where we can logically say higher or lower. In other words any sorted data set can be searched like this. It is a far more efficient method - see the SPEED section for a quick analysis.
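The guessing game can be played out in a few lines of Perl. This is only a sketch of the halving idea on a range of integers, not code from the module; the sub name halving_guesses() is made up for the illustration.

```perl
use strict;
use warnings;

# Guess $secret in the range $low..$high by repeatedly halving the range.
# Returns the list of guesses made, ending with the correct one.
sub halving_guesses {
    my ( $secret, $low, $high ) = @_;
    my @guesses;
    while ( $low <= $high ) {
        my $guess = int( ( $low + $high ) / 2 );
        push @guesses, $guess;
        return @guesses if $guess == $secret;
        if ( $guess < $secret ) { $low  = $guess + 1 }    # "higher"
        else                    { $high = $guess - 1 }    # "lower"
    }
    return;    # the secret was outside the range
}

my @guesses = halving_guesses( 101, 0, 128 );
print "Guesses: @guesses\n";    # prints: Guesses: 64 96 112 104 100 102 101
```

The six wrong guesses are the same ones listed above; the seventh is the answer itself.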
numeric() and alphabetic()
There are two basic methods: numeric() to do numeric searches and
alphabetic() to do alphabetic searches.
You call the functions like this:
    $tell = numeric( *BIG, $find );
    $tell = alphabetic( *BIG, $find );
These methods take two required arguments. *BIG is a FILEHANDLE to read from. $find is the item you wish to find. $find must be appropriate to the function, as the numeric method makes numeric comparisons ( == < > ) and the alphabetic method makes string comparisons ( eq lt gt ). You will get strange results if you use the wrong method, just as you do if you use == when you actually meant eq.
The return value from the numeric() and alphabetic() methods depends on the
result of the search. If the search fails the return value is undefined.
A search can succeed in two ways. If an exact match is found then the
current file position pointer is set to the beginning of the matching line.
The return value is the corresponding response from tell(). This means that
the next read from <FILEHANDLE> will return the matching line.
Subsequent reads return the following lines as expected.
Alternatively a search will succeed if a point in the file can be found such that $find is cuddled between two adjacent lines. For example consider searching for the number 42 in a file like this:
    ..
    36
    40 <- Before
    44 <- After
    48
    ..
The number 42 is not actually there but the search will still succeed as it is between 40 and 44. By default the file position pointer is set to the beginning of the line '44' so the next read from <FILEHANDLE> will return this line. If the File::SortedSeek::set_cuddle() function is called then the file position pointer will be set to the beginning of line '40' so that the first two reads from <FILEHANDLE> will cuddle the in-between value in $find.
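The default positioning described above (file pointer left at the first line that is not less than $find) can be sketched with nothing more than seek, tell and a halving loop over byte offsets. This is an illustration under stated assumptions, not the module's own code: the subs line_start_at_or_after() and seek_first_ge() are hypothetical names, the file holds one ascending number per line, and a search past the last line returns the end-of-file offset here rather than undef.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Position $fh at the first line that starts at byte $pos or later
# and return that byte offset.
sub line_start_at_or_after {
    my ( $fh, $pos ) = @_;
    if ( $pos <= 0 ) {
        seek( $fh, 0, 0 ) or die "seek failed: $!";
        return 0;
    }
    seek( $fh, $pos - 1, 0 ) or die "seek failed: $!";
    my $skipped = <$fh>;    # finish the line that contains byte $pos - 1
    return tell $fh;
}

# Halve the byte range until the first line with a value >= $target is found.
# Leaves $fh positioned at that line and returns its offset.
sub seek_first_ge {
    my ( $fh, $target ) = @_;
    my ( $lo, $hi ) = ( 0, -s $fh );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        line_start_at_or_after( $fh, $mid );
        my $line = <$fh>;
        chomp $line if defined $line;
        if ( !defined $line or $line >= $target ) { $hi = $mid }
        else                                       { $lo = $mid + 1 }
    }
    return line_start_at_or_after( $fh, $lo );
}

# Build the small example file from the text: 36, 40, 44, 48
my ( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "$_\n" for 36, 40, 44, 48;
close $out or die "close failed: $!";

open my $in, '<', $path or die "Cannot open $path: $!";
binmode $in;
my $tell = seek_first_ge( $in, 42 );
my $line = <$in>;
print "Positioned at byte $tell, next line is $line";
# prints: Positioned at byte 6, next line is 44
```

Each pass costs one seek and at most two line reads, which is why the number of tests grows with the logarithm of the file size rather than with the file size itself.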
Both the numeric and alphabetic subs take an optional third argument. This optional argument is a reference to a subroutine to munge the file lines so that suitable values are extracted for comparison to $find.
    $tell = numeric( *BIG, $find, \&munge_line );
    $tell = alphabetic( *BIG, $find, \&munge_line );
A good example of this is the find_time() function. This is just an
implementation of the basic numeric algorithm similar to this.
    $tell = numeric( *BIG, $epoch_seconds, \&get_epoch_seconds );

    sub get_epoch_seconds {
        use Time::Local;
        my $line = shift;
        return undef unless defined $line;
        my %months = (
            Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
            Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
        );
        # grab a scalar localtime looking like string from the line
        my ( $wday, $mon, $mday, $hours, $min, $sec, $year ) = $line =~
            m/(\w\w\w)\s+(\w\w\w)\s+(\d{1,2})\s+(\d\d):(\d\d):(\d\d)\s+(\d{4})/;
        unless ($year) {
            $error_msg = "Unable to find time like string in line:\n$line";
            warn $error_msg unless $silent;
            return undef;
        }
        $mon = $months{$mon};    # convert to numerical months 0 - 11
        return timegm( $sec, $min, $hours, $mday, $mon, $year );
    }
As the search is made the test lines are passed to the munging sub. This sub
needs to return a string or number that we can perform comparison on. In this
case the sub looks for something that looks like a scalar localtime() string,
and assuming this is a date passes it to timegm() for conversion to
epoch seconds and returns this number.
You can see further examples of this in the test suite test.pl.
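A munging sub does not have to deal with dates. As a minimal sketch, assume a hypothetical file whose lines look like "name,id,text" and which is sorted on the numeric id in the second field; the sub name munge_id() and the line format are invented for this example.

```perl
use strict;
use warnings;

# Pull the numeric id out of a line of the form "<name>,<id>,<free text>".
# Returns undef if the line is missing or does not match.
sub munge_id {
    my $line = shift;
    return undef unless defined $line;
    my ($id) = $line =~ m/^[^,]*,(\d+),/;
    return $id;
}

print munge_id("alice,1042,logged in\n"), "\n";    # prints: 1042

# With the module loaded the sub is passed as the optional third argument:
#   $tell = numeric( *BIG, 1042, \&munge_id );
```

Returning undef for a line that cannot be parsed follows the pattern of get_epoch_seconds() above.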
find_time()
The find_time() function is an implementation of the basic numeric method as
discussed briefly above. You call it like:
    $tell = find_time( *LOG, 'Thu Jan 1 00:42:00 1970' );
    $tell = find_time( *LOG, $epoch_seconds );
You may use either a scalar localtime() like string or epoch seconds. If you
use epoch seconds it assumes gmtime. If in doubt use the string as, although
it works internally with gmtime, the offsets cancel out and the correct result
is returned.
get_between()
Say you have a logfile and you want to get the log between one date and
another. You can simply use two calls to find_time() to get the beginning
and end positions and then use get_between() to get the lines.
    # get the lines between two logfile dates
    $begin = find_time( *LOG, $start );
    $end   = find_time( *LOG, $finish );
    # get lines as an array
    @lines = get_between( *LOG, $begin, $end );
    # get lines as an array reference
    $lines = get_between( *LOG, $begin, $end );
The get_between() method returns an array in list context as above and a
reference to an array in scalar context.
This function needs to apply binmode so it splits the lines based on a system specific default record separator. This is derived as below:
    my $default_rec_sep = ( $^O =~ m/win32|vms/i ) ? "\015\012"
                        : ( $^O =~ /mac/i )        ? "\015"
                        :                            "\012";
You can override this on a per file basis by passing the record separator
to the get_between() function.
    @lines = get_between( *LOG, $begin, $end, $rec_sep );
Modifying $/ has no effect. Note that *the record separator is not returned* in the array. As a result the returned array has effectively had every element chomped.
Warning - this method will apply binmode to the FH so line endings
will possibly not be converted properly if you try to continue to read from
it. As there is no unbinmode(), close the FH afterwards and reopen it if you
want to read from it. You can seek FH, $end, 0 if, say, you want to read more
lines after $end.
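To see why the returned lines arrive already chomped, here is a small sketch of splitting a raw chunk of bytes on a literal record separator. It assumes nothing about how the module does this internally; split_records() is a hypothetical helper and only the default separator expression is taken from the text above.

```perl
use strict;
use warnings;

# Default separator, derived as described above.
my $default_rec_sep = ( $^O =~ m/win32|vms/i ) ? "\015\012"
                    : ( $^O =~ m/mac/i )       ? "\015"
                    :                            "\012";

# Split a chunk read in binmode into records. The separator is treated as a
# literal string (\Q...\E) and is consumed by split, so no record keeps it.
sub split_records {
    my ( $chunk, $rec_sep ) = @_;
    $rec_sep = $default_rec_sep unless defined $rec_sep;
    return split /\Q$rec_sep\E/, $chunk;
}

my @lines = split_records( "first\015\012second\015\012", "\015\012" );
print scalar(@lines), " lines: @lines\n";    # prints: 2 lines: first second
```

Because the separator is matched literally, changing $/ would make no difference to a split of this kind, which matches the behaviour noted above.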
get_between()
Using the get_between() method you can efficiently get the lines at the
beginning of a file. Although you can just read in lines sequentially with
a while loop this requires that you test each line. If you can find the
end point using the find_time(), numeric() or alphabetic() methods you
can then just get what you need. For large files many thousands of
unnecessary tests are avoided, saving time. Using the example above
you simply set $begin to 0:
    $begin = 0;
    $end   = find_time( *LOG, $finish );
    @lines = get_between( *LOG, $begin, $end );
get_between()
You can similarly use get_between() to get all the lines from a specific point up to the end of the file. The end is just the size of the file so:
    $begin = find_time( *LOG, $start );
    $end   = -s LOG;
    @lines = get_between( *LOG, $begin, $end );
get_last()
This method does not depend on the file being sorted to work.
When you use the get_last() method the module estimates how many bytes at
the end of the file to read in. To make the estimate the module multiplies
the default line length (80 chars) by the number of lines required and then
doubles it.
If it does not get sufficient lines on its first attempt it re-estimates
the line length from the actual data read in, re-calculates
the read, doubles it and then tries again. This algorithm is unlikely to
take more than 2 reads, but if you have unusually long or short lines you may
get a small speed benefit by using the set_line_length() method to set the
average line length. The default is 80 chars per line. Setting the line length
close to the actual value will also avoid reading an excessive quantity of
data into memory.
    # get last $n lines of any file as an array
    @lines = get_last( *BIG, $n );
    # or an array reference
    $lines = get_last( *BIG, $n );
    # change the input record separator from the default
    @lines = get_last( *BIG, $n, $rec_sep );
This function needs to apply binmode so it splits the lines based on a system specific default record separator. This is derived as below:
    my $default_rec_sep = ( $^O =~ m/win32|vms/i ) ? "\015\012"
                        : ( $^O =~ /mac/i )        ? "\015"
                        :                            "\012";
You can override this on a per file basis by passing the record separator
$rec_sep to the get_last() function as shown. Modifying $/ has no effect.
Note that *the record separator is not returned* in the array. As a
result the returned array has effectively had every element chomped.
Warning - this method will apply binmode to the FH so line endings
will possibly not be converted properly if you try to continue to read from
it. As there is no unbinmode(), close the FH afterwards and reopen it if you
want to read from it. You can seek FH, $end, 0 if, say, you want to read more
lines after $end.
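The estimate, read, re-estimate cycle described for get_last() can be sketched roughly as follows. This is a stand-alone illustration of the strategy, not the module's code: last_lines() is a hypothetical sub that takes a file path rather than a filehandle, and the handling of the partial first line is an assumption of the sketch.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the last $n lines of the file at $path with separators removed.
# $line_len is the initial guess at the average line length.
sub last_lines {
    my ( $path, $n, $line_len, $rec_sep ) = @_;
    $line_len ||= 80;
    $rec_sep = "\012" unless defined $rec_sep;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $size = -s $fh;
    my $want = 2 * $line_len * $n;    # first estimate, doubled
    while (1) {
        $want = $size if $want > $size;
        seek( $fh, $size - $want, 0 ) or die "seek failed: $!";
        my $got = read( $fh, my $chunk, $want );
        die "read failed: $!" unless defined $got;
        my @lines = split /\Q$rec_sep\E/, $chunk, -1;
        pop @lines if @lines and $lines[-1] eq '';    # text after final separator
        shift @lines if $want < $size;    # first record may be a partial line
        if ( @lines >= $n or $want == $size ) {
            splice @lines, 0, @lines - $n if @lines > $n;
            close $fh;
            return @lines;
        }
        # Not enough lines: re-estimate the line length from the data read,
        # recalculate the read size, double it and go round again.
        my $avg = length($chunk) / ( @lines || 1 );
        $want = int( 2 * $avg * $n ) + 1;
    }
}

# Demonstration on a temporary file of 1000 short lines
my ( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "line $_\012" for 1 .. 1000;
close $out or die "close failed: $!";

print "$_\n" for last_lines( $path, 3 );
```

With the default guess of 80 chars the first read is far larger than three short lines need, which is the memory cost that an accurate line length avoids.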
Nothing is exported by default. The following 5 methods are available for import:
alphabetic() numeric() find_time() get_between() get_last()
You can import just the method you want with:
use File::SortedSeek 'numeric';
or all 5 methods using the ':all' tag.
use File::SortedSeek ':all';
There are some options available via non exported function calls. You will need to fully specify the name if you want to use these.
If a function returns undefined there has been an error. error() will
contain the text of the last error message or a null string if there
was no error.
was_exact() will return true if an exact match was found. It will be
false if the match was in between or failed.
set_cuddle() changes the default line returned for in between matches as
discussed above and set_no_cuddle() restores default behaviour.
By default ascending numerical order and alphabetical order are assumed.
This assumption can be reversed by calling set_descending() and reset
by calling set_ascending(). We need to know the order to seek within the
file in the correct direction.
This sets the maximum number of times that the module will try the halve the difference search before it decides there is a problem and bails out. The default value is 42 which allows files with up to 2**42 or a bit more than 10**12 lines to be processed. A seek in a million line file will take a mere 20 tries to find the required value.
When you use the get_last() method the module uses its default
line length to estimate how many bytes at the end of the file to read in.
You can improve speed slightly and decrease memory usage by setting an
accurate line length. The default is 80 chars per line. The function will
work fine regardless of what the line length is; this is just an efficiency
tweak.
You can silence or activate error messages by calling these two subs. The default is verbose.
Sets debug on or off. Default is of course off.
Here is a table that demonstrates the advantage of using the halve the difference algorithm.
    Num items   Lin avg  Geom avg  Lin:Geom
            2         1         1         1
            4         2         2         1
            8         4         3         1
           16         8         4         2
           32        16         5         3
           64        32         6         5
          128        64         7         9
          256       128         8        16
          512       256         9        28
         1024       512        10        51
         2048      1024        11        93
         4096      2048        12       170
         8192      4096        13       315
        16384      8192        14       585
        32768     16384        15      1092
        65536     32768        16      2048
       131072     65536        17      3855
       262144    131072        18      7281
       524288    262144        19     13797
      1048576    524288        20     26214
Even though there is an overhead involved with this search, it is minor as the number of tests required is so much smaller. Speed increases of 100 to 1000 times are typical.
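The figures in the table appear to follow a simple rule: a linear search tests half the items on average, the halving search needs about log2 of the item count, and the last column is the whole-number part of the ratio. That reading is inferred from the numbers themselves; the sub name speed_row() is made up for this sketch, which regenerates the rows.

```perl
use strict;
use warnings;

# Returns [ items, linear average, halving tests, ratio ] for 2**$exp items.
sub speed_row {
    my $exp   = shift;
    my $items = 2**$exp;
    my $lin   = $items / 2;    # linear scan: half the items on average
    my $geom  = $exp;          # halving: log2(items) tests
    return [ $items, $lin, $geom, int( $lin / $geom ) ];
}

printf "%9s %9s %9s %9s\n", 'Num items', 'Lin avg', 'Geom avg', 'Lin:Geom';
printf "%9d %9d %9d %9d\n", @{ speed_row($_) } for 1 .. 20;
```

The ratio in the last column is a count of comparisons only, so it says nothing about the extra cost of each seek.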
An OO interface slows things down by > 50% so is not used.
Bound to be some. The binmoding of the FH by get_between() and get_last()
cannot be easily avoided.
(c) Dr James Freeman 2000-01 <jfreeman@tassie.net.au> All rights reserved.
This package is free software and is provided "as is" without express or implied warranty. It may be used, redistributed and/or modified under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html)
For details about the mystical significance of the number 42 and how it can be applied to Life the Universe and everything see The Hitch Hiker's Guide to the Galaxy 'trilogy' by the recently departed Douglas Adams.