This blog post announces pts-line-bisect, a C program and an equivalent Python script and library for doing binary search in line-sorted text files, and it also explains some of the design details of the Python implementation.
Let's suppose you have a sorted text file, in which each line is lexicographically larger than the previous one, and you want to find a specific range of lines, or all lines with a specific prefix.
I've written the program pts-line-bisect for that recently. See the beginning of the README for many examples of how to use the C program. The Python program has fewer features; here is how to use it:
- Closed ended interval search: print all lines at least bar and at most foo:
$ python pts_line_bisect.py -t file.sorted bar foo
- Open ended interval search: print all lines at least bar and smaller than foo:
$ python pts_line_bisect.py -e file.sorted bar foo
- Prepend position: print the smallest offset where foo can be inserted (please note that this does not report whether foo is present; the exit code is always 0).
$ python pts_line_bisect.py -oe file.sorted foo
- Append position: print the largest offset where foo can be inserted (please note that this does not report whether foo is present; the exit code is always 0).
$ python pts_line_bisect.py -oae file.sorted foo
- Exact range: print the range (start byte offset and after-the-last-byte end offset) of lines containing only foo:
$ python pts_line_bisect.py -ot file.sorted foo
Please note that these tools support duplicate lines correctly (i.e. when the same line appears multiple times, forming a contiguous block).
Please note that these tools and libraries assume that the key is at the beginning of the line. If you have a sorted file whose lines contain the sort key somewhere in the middle, you need to make changes to the code. It's possible to do so in a way that the code will still support duplicate keys (i.e. records with the same key).
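To give an idea of the needed change, the comparison can be run on an extracted key instead of the whole-line prefix. The record format and the extract_key helper below are made up for illustration (they are not part of pts-line-bisect); a file-based version would extract the key the same way before each comparison. Since records sharing a key still form one block, duplicate keys keep working:

```python
def extract_key(line):
  # Hypothetical record format "payload<TAB>key"; adjust this helper to
  # your file's actual layout -- it is made up for illustration.
  return line.rstrip('\n').split('\t')[1]

def bisect_left_by_key(a, x, key=extract_key):
  # Same bisection as for whole-line keys, but comparing extracted keys.
  # Returns the index of the first record whose key is >= x, so records
  # with duplicate keys are handled correctly (the block start is found).
  lo, hi = 0, len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= key(a[mid]):
      hi = mid
    else:
      lo = mid + 1
  return lo

# The records must be sorted by the extracted key for this to work.
records = ['r3\tapple\n', 'r1\tbanana\n', 'r2\tbanana\n', 'r9\tcherry\n']
print(bisect_left_by_key(records, 'banana'))  # → 1 (first 'banana' record)
```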
Analysis
I've written the article titled Evolution of a binary search implementation for line-sorted text files about the topic, containing the problem statement, the possible pitfalls, an analysis of some incorrect solutions available on the web as code examples, a detailed specification and explanation of my solution (including a proof), disk seek and speed analysis, a set of speedup ideas and their implementation, and further notes about the speedups in the C implementation.
As a teaser, here is an incorrect solution in Python:
def bisect_left_incorrect(f, x):
  """... Warning: Incorrect implementation with corner case bugs!"""
  x = x.rstrip('\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:
    mid = (lo + hi) >> 1
    f.seek(mid)
    f.readline()  # Ignore previous line, find our line.
    if x <= f.readline().rstrip('\n'):
      hi = mid
    else:
      lo = mid + 1
  return lo
Can you spot all 3 bugs?
Read the article for all the details and the solution.
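As a rough illustration only (this is my own sketch, not the article's exact solution), one way to avoid such corner cases is to always rewind to the start of the line containing the probed offset, so the first line is never skipped and the returned offset always falls on a line boundary. It works on a binary-mode file and compares bytes keys:

```python
import io

def line_start(f, pos, blocksize=1024):
  """Return the offset of the start of the line containing offset pos."""
  while pos > 0:
    start = max(0, pos - blocksize)
    f.seek(start)
    nl = f.read(pos - start).rfind(b'\n')
    if nl >= 0:
      return start + nl + 1
    pos = start
  return 0

def bisect_way_left(f, x):
  """Return the smallest line-boundary offset at which x could be inserted.

  f is a binary file whose lines are sorted; x is a bytes key. With
  duplicate lines, the offset of the first one in the block is returned.
  """
  x = x.rstrip(b'\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:  # Invariant: lo and hi are line-boundary offsets.
    mid = (lo + hi) >> 1
    start = line_start(f, mid)  # Never skip the line containing mid.
    f.seek(start)
    if x <= f.readline().rstrip(b'\n'):
      hi = start  # start <= mid < hi, so the range shrinks.
    else:
      lo = f.tell()  # End of the compared line: a boundary > mid.
  return lo

# Demo on an in-memory "file"; offsets: bar@0, baz@4, foo@8, foo@12, qux@16.
f = io.BytesIO(b'bar\nbaz\nfoo\nfoo\nqux\n')
print(bisect_way_left(f, b'foo'))  # → 8 (start of the first 'foo' line)
```

The price of this simplicity is an extra backward read per probe to find the line start; the article's solution is more refined (e.g. it adds caching to save seeks).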
About sorting: For the above implementation to work, files must be sorted lexicographically, using the unsigned value (0..255) of each byte in the line. If you don't sort the file, or you sort it using some language collation or some locale, then the linked implementation won't find the results you are looking for. On Linux, use LC_ALL=C sort <in.txt >out.txt to sort lexicographically.
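To see why the locale matters: Python compares byte strings by unsigned byte value, which is the same order LC_ALL=C sort uses, so all uppercase letters (e.g. 'F', byte 70) sort before all lowercase ones (e.g. 'b', byte 98), while a locale-aware sort would typically interleave them. A quick illustration:

```python
# Byte-wise (C locale) order: uppercase sorts before lowercase.
lines = [b'foo\n', b'Foo\n', b'bar\n']
print(sorted(lines))  # → [b'Foo\n', b'bar\n', b'foo\n']
```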
As a reference, here is the correct implementation of the same algorithm for finding the start of the interval in a sorted list or other sequences (based on the bisect module):
def bisect_left(a, x, lo=0, hi=None):
  """Return the index where to insert item x in list a, assuming a is sorted.

  The return value i is such that all e in a[:i] have e < x, and all e in
  a[i:] have e >= x. So if x already appears in the list, a.insert(x) will
  insert just before the leftmost x already there.

  Optional args lo (default 0) and hi (default len(a)) bound the
  slice of a to be searched.
  """
  if lo < 0:
    raise ValueError('lo must be non-negative')
  if hi is None:
    hi = len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= a[mid]:  # Change `<=' to `<', and you get bisect_right.
      hi = mid
    else:
      lo = mid + 1
  return lo
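The standard library's bisect module provides equivalent functions; with duplicates, bisect_left and bisect_right together bracket the block of equal items:

```python
from bisect import bisect_left, bisect_right

a = ['bar', 'baz', 'foo', 'foo', 'qux']  # sorted, with a duplicate block
print(bisect_left(a, 'foo'))   # → 2 (index of the first 'foo')
print(bisect_right(a, 'foo'))  # → 4 (just past the last 'foo')
print(a[bisect_left(a, 'foo'):bisect_right(a, 'foo')])  # → ['foo', 'foo']
```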
A typical real-world use case for such a binary search tool is retrieving lines corresponding to a particular time range in log files (or other time-based measurement records). These are text files with variable-length lines, with the log timestamp at the beginning of each line, and they are generated in increasing timestamp order. Unfortunately the lines are not lexicographically sorted, so the timestamp has to be decoded first for the comparison. The bsearch tool does that; it also supports parsing arbitrary, user-specifiable datetime formats, and it can binary search in gzip(1)ped files as well (by building an index). It also has high performance and low overhead, partially because it is written in C++. So bsearch is a practical tool with lots of useful features. If you need anything more complicated than a lexicographic binary search, use it instead.
Before getting too excited about binary search, please note that there are much faster alternatives for data on disk. In an external (file-based) binary search the slowest operation is the disk seek needed for each bisection step: most of the software and hardware components are just waiting for the hard disk to move the read head. (Of course, seek times are negligible when non-spinning storage hardware such as an SSD or a memory card is used.)
An out-of-the-box solution would be adding the data to a more disk-efficient key-value store. There are several programs providing such stores. Most of them are based on a B-tree, B*-tree or B+-tree data structure if sorted iteration and range searches have to be supported, or on disk-based hashtables otherwise. Some of the high-performance single-machine key-value stores: cdb (read-only), Tokyo Cabinet, Kyoto Cabinet, LevelDB; see more in the NoSQL software list.
The fundamental speed difference between a B-tree search and a binary search in a sorted list stems from the fact that B-trees have a branching factor larger than 2 (possibly 100s or 1000s), thus each seeking step in a B-tree search reduces the possible input size by a factor larger than 2, while in a binary search each step reduces the input size by a factor of 2 only (i.e. we keep either the bottom half or the top half). So both kinds of searches are logarithmic, but the base of the logarithm is different, and this causes a constant factor difference in the disk seek count. By careful tuning of the constants in B-trees it's usual to have only 2 or 3 disk seeks per search even for 100 GB of data, while a binary search in such a large file with 50 bytes per record would need 31 disk seeks. Taking 10 ms as the seek time (see more info about typical hard disk seek times), a typical B-tree search takes 0.03 second, and a typical binary search takes 0.31 second.
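The seek counts above follow from simple logarithm arithmetic. The numbers below are illustrative only (100 GB of data, 50-byte records, a B-tree branching factor of 1000, 10 ms per seek):

```python
import math

records = 100 * 10**9 // 50  # 2e9 records in a 100 GB file of 50-byte lines.
binary_seeks = math.ceil(math.log(records, 2))     # One seek per halving.
btree_seeks = math.ceil(math.log(records, 1000))   # Branching factor 1000.
print(binary_seeks)  # → 31
print(btree_seeks)   # → 4 (the top 1 or 2 levels are usually cached in
                     #   memory, leaving the 2-3 real seeks mentioned above)
print(binary_seeks * 0.010)  # seconds per binary search at 10 ms per seek
```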
Have fun binary searching, but don't forget to sort your files first!