This blog post announces pts-line-bisect, a C program and an equivalent Python script and library for doing binary search in line-sorted text files, and it also explains some of the design details of the Python implementation.
Let's suppose you have a sorted text file, in which each line is lexicographically larger than the previous one, and you want to find a specific range of lines, or all lines with a specific prefix.
I've written the program pts-line-bisect for that recently. See the beginning of the README for many examples of how to use the C program. The Python program has fewer features; here is how to use it:
- Closed ended interval search: print all lines at least bar and at most foo:
$ python pts_line_bisect.py -t file.sorted bar foo
- Open ended interval search: print all lines at least bar and smaller than foo:
$ python pts_line_bisect.py -e file.sorted bar foo
- Prepend position: print the smallest offset where foo can be inserted (please note that this does not report whether foo is present; the exit code is always 0).
$ python pts_line_bisect.py -oe file.sorted foo
- Append position: print the largest offset where foo can be inserted (please note that this does not report whether foo is present; the exit code is always 0).
$ python pts_line_bisect.py -oae file.sorted foo
- Exact range: print the range (start byte offset and after-the-last-byte end offset) of lines containing only foo:
$ python pts_line_bisect.py -ot file.sorted foo
Please note that these tools support duplicate lines correctly (i.e. when the same line appears multiple times, forming a contiguous block).
Please note that these tools and libraries assume that the key is at the beginning of the line. If you have a sorted file whose lines contain the sort key somewhere in the middle, you need to make changes to the code. It's possible to do so in a way that the code will still support duplicate keys (i.e. records with the same key).
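To give an idea of the needed change, the comparison can be run on an extracted key instead of the whole-line prefix. The record format and the extract_key helper below are made up for illustration (they are not part of pts-line-bisect); a file-based version would extract the key the same way before each comparison. Since records sharing a key still form one block, duplicate keys keep working:

```python
def extract_key(line):
  # Hypothetical record format "payload<TAB>key"; adjust this helper to
  # your file's actual layout -- it is made up for illustration.
  return line.rstrip('\n').split('\t')[1]

def bisect_left_by_key(a, x, key=extract_key):
  # Same bisection as for whole-line keys, but comparing extracted keys.
  # Returns the index of the first record whose key is >= x, so records
  # with duplicate keys are handled correctly (the block start is found).
  lo, hi = 0, len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= key(a[mid]):
      hi = mid
    else:
      lo = mid + 1
  return lo

# The records must be sorted by the extracted key for this to work.
records = ['r3\tapple\n', 'r1\tbanana\n', 'r2\tbanana\n', 'r9\tcherry\n']
print(bisect_left_by_key(records, 'banana'))  # → 1 (first 'banana' record)
```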
Analysis
I've written the article titled Evolution of a binary search implementation for line-sorted text files about the topic, containing the problem statement, the possible pitfalls, an analysis of some incorrect solutions available on the web as code examples, a detailed specification and explanation of my solution (including a proof), disk seek and speed analysis, a set of speedup ideas and their implementation, and further notes about the speedups in the C implementation.
As a teaser, here is an incorrect solution in Python:
def bisect_left_incorrect(f, x):
  """... Warning: Incorrect implementation with corner case bugs!"""
  x = x.rstrip('\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:
    mid = (lo + hi) >> 1
    f.seek(mid)
    f.readline()  # Ignore previous line, find our line.
    if x <= f.readline().rstrip('\n'):
      hi = mid
    else:
      lo = mid + 1
  return lo
Can you spot all 3 bugs?
Read the article for all the details and the solution.
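As a rough illustration only (this is my own sketch, not the article's exact solution), one way to avoid such corner cases is to always rewind to the start of the line containing the probed offset, so the first line is never skipped and the returned offset always falls on a line boundary. It works on a binary-mode file and compares bytes keys:

```python
import io

def line_start(f, pos, blocksize=1024):
  """Return the offset of the start of the line containing offset pos."""
  while pos > 0:
    start = max(0, pos - blocksize)
    f.seek(start)
    nl = f.read(pos - start).rfind(b'\n')
    if nl >= 0:
      return start + nl + 1
    pos = start
  return 0

def bisect_way_left(f, x):
  """Return the smallest line-boundary offset at which x could be inserted.

  f is a binary file whose lines are sorted; x is a bytes key. With
  duplicate lines, the offset of the first one in the block is returned.
  """
  x = x.rstrip(b'\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:  # Invariant: lo and hi are line-boundary offsets.
    mid = (lo + hi) >> 1
    start = line_start(f, mid)  # Never skip the line containing mid.
    f.seek(start)
    if x <= f.readline().rstrip(b'\n'):
      hi = start  # start <= mid < hi, so the range shrinks.
    else:
      lo = f.tell()  # End of the compared line: a boundary > mid.
  return lo

# Demo on an in-memory "file"; offsets: bar@0, baz@4, foo@8, foo@12, qux@16.
f = io.BytesIO(b'bar\nbaz\nfoo\nfoo\nqux\n')
print(bisect_way_left(f, b'foo'))  # → 8 (start of the first 'foo' line)
```

The price of this simplicity is an extra backward read per probe to find the line start; the article's solution is more refined (e.g. it adds caching to save seeks).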
About sorting: For the above implementation to work, files must be sorted lexicographically, using the unsigned value (0..255) of each byte in the line. If you don't sort the file, or you sort it using some language collation or some locale, then the linked implementation won't find the results you are looking for. On Linux, use LC_ALL=C sort <in.txt >out.txt to sort lexicographically.
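To see why the locale matters: Python compares byte strings by unsigned byte value, which is the same order LC_ALL=C sort uses, so all uppercase letters (e.g. 'F', byte 70) sort before all lowercase ones (e.g. 'b', byte 98), while a locale-aware sort would typically interleave them. A quick illustration:

```python
# Byte-wise (C locale) order: uppercase sorts before lowercase.
lines = [b'foo\n', b'Foo\n', b'bar\n']
print(sorted(lines))  # → [b'Foo\n', b'bar\n', b'foo\n']
```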
As a reference, here is the correct implementation of the same algorithm for finding the start of the interval in a sorted list or other sequences (based on the bisect module):
def bisect_left(a, x, lo=0, hi=None):
  """Return the index where to insert item x in list a, assuming a is sorted.

  The return value i is such that all e in a[:i] have e < x, and all e in
  a[i:] have e >= x. So if x already appears in the list, a.insert(x) will
  insert just before the leftmost x already there.

  Optional args lo (default 0) and hi (default len(a)) bound the
  slice of a to be searched.
  """
  if lo < 0:
    raise ValueError('lo must be non-negative')
  if hi is None:
    hi = len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= a[mid]:  # Change `<=' to `<', and you get bisect_right.
      hi = mid
    else:
      lo = mid + 1
  return lo
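The standard library's bisect module provides equivalent functions; with duplicates, bisect_left and bisect_right together bracket the block of equal items:

```python
from bisect import bisect_left, bisect_right

a = ['bar', 'baz', 'foo', 'foo', 'qux']  # sorted, with a duplicate block
print(bisect_left(a, 'foo'))   # → 2 (index of the first 'foo')
print(bisect_right(a, 'foo'))  # → 4 (just past the last 'foo')
print(a[bisect_left(a, 'foo'):bisect_right(a, 'foo')])  # → ['foo', 'foo']
```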
A typical real-world use case for such a binary search tool is retrieving lines corresponding to a particular time range in log files (or other time-based measurement records). These are text files with variable-length lines, with the log timestamp at the beginning of each line, and they are generated in increasing timestamp order. Unfortunately the lines are not lexicographically sorted, so the timestamp has to be decoded first for the comparison. The bsearch tool does that; it also supports parsing arbitrary, user-specifiable datetime formats, and it can binary search in gzip(1)ped files as well (by building an index). It also has high performance and low overhead, partially because it is written in C++. So bsearch is a practical tool with lots of useful features. If you need anything more complicated than a lexicographic binary search, use it instead.
Before getting too excited about binary search, please note that there are much faster alternatives for data on disk. In an external (file-based) binary search the slowest operation is the disk seek needed for each bisection step: most of the software and hardware components are just waiting for the hard disk to move the read head. (Of course, seek times are negligible when non-spinning storage hardware such as an SSD or a memory card is used.)
An out-of-the-box solution would be adding the data to a more disk-efficient key-value store. There are several programs providing such stores. Most of them are based on a B-tree, B*-tree or B+-tree data structure if sorted iteration and range searches have to be supported, or on disk-based hashtables otherwise. Some of the high-performance single-machine key-value stores: cdb (read-only), Tokyo Cabinet, Kyoto Cabinet, LevelDB; see more in the NoSQL software list.
The fundamental speed difference between a B-tree search and a binary search in a sorted list stems from the fact that B-trees have a branching factor larger than 2 (possibly 100s or 1000s), thus each seeking step in a B-tree search reduces the possible input size by a factor larger than 2, while in a binary search each step reduces the input size by a factor of 2 only (i.e. we keep either the bottom half or the top half). So both kinds of searches are logarithmic, but the base of the logarithm is different, and this causes a constant factor difference in the disk seek count. By careful tuning of the constants in B-trees it's usual to have only 2 or 3 disk seeks per search even for 100 GB of data, while a binary search in such a large file with 50 bytes per record would need 31 disk seeks. Taking 10 ms as the seek time (see more info about typical hard disk seek times), a typical B-tree search takes 0.03 second, and a typical binary search takes 0.31 second.
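The seek counts above follow from simple logarithm arithmetic. The numbers below are illustrative only (100 GB of data, 50-byte records, a B-tree branching factor of 1000, 10 ms per seek):

```python
import math

records = 100 * 10**9 // 50  # 2e9 records in a 100 GB file of 50-byte lines.
binary_seeks = math.ceil(math.log(records, 2))     # One seek per halving.
btree_seeks = math.ceil(math.log(records, 1000))   # Branching factor 1000.
print(binary_seeks)  # → 31
print(btree_seeks)   # → 4 (the top 1 or 2 levels are usually cached in
                     #   memory, leaving the 2-3 real seeks mentioned above)
print(binary_seeks * 0.010)  # seconds per binary search at 10 ms per seek
```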
Have fun binary searching, but don't forget to sort your files first!