Vasudev Ram: Unix split command in Python

By Vasudev Ram

Recently, there was an HN thread about the implementation (not just use) of text editors. Someone mentioned that some editors, including vim, have problems opening large files. Various people gave workarounds or solutions, including using vim and other ways.

I commented that you can use the Unix command bfs (for big file scanner), if you have it on your system, to open the file read-only and then move around in it, like you can in an editor.

I also said that the Unix commands split and csplit can be used to split a large file into smaller chunks, edit the chunks as needed, and then combine the chunks back into a single file using the cat commmand.

This made me think of writing, just for fun, a simple version [1] of the split command in Python. So I did that, and then tested it some [2]. Seems to be working okay so far.

[1] I have not implemented the full functionality of the POSIX split command, only a subset, for now. May enhance it with a few command-line options, or more functionality, later, e.g. with the ability to split binary files. I've also not implemented the default size of 1000 lines, or the ability to take input from standard input if no filename is specfied. (Both are easy.)

However, I am not sure whether the binary file splitting feature should be a part of split, or should be a separate command, considering the Unix philosophy of doing one thing and doing it well. Binary file splitting seems like it should be a separate task from text file splitting. Maybe it is a matter of opinion.

[2] I tested split.py with various valid and invalid values for the lines_per_file argument (such as -3, -2, -1, 0, 1, 2, 3, 10, 50, 100) on each of these input files:

in_file_0_lines.txt
in_file_1_line.txt
in_file_2_lines.txt
in_file_3_lines.txt
in_file_10_lines.txt
in_file_100_lines.txt

where the meaning of the filenames should be self-explanatory.

Of course, I also checked after each test run, that the output file(s) contained the right data.

(There may still be some bugs, of course. If you find any, I'd appreciate hearing about it.)

Here is the code for split.py:

import sys
import os

OUTFIL_PREFIX = "out_"

def make_out_filename(prefix, idx):
    '''Make a filename with a serial number suffix.'''
    return prefix + str(idx).zfill(4)

def split(in_filename, lines_per_file):
    '''Split the input file in_filename into output files of 
    lines_per_file lines each. Last file may have less lines.'''
    in_fil = open(in_filename, "r")
    outfil_idx = 1
    out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
    out_fil = open(out_filename, "w")
    # Using chain assignment feature of Python.
    line_count = tot_line_count = file_count = 0
    # Loop over the input and split it into multiple files.
    # A text file is an iterable sequence, from Python 2.2,
    # so the for line below works.
    for lin in in_fil:
        # Bump vars; change to next output file.
        if line_count >= lines_per_file:
            tot_line_count += line_count
            line_count = 0
            file_count += 1
            out_fil.close()
            outfil_idx += 1
            out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
            out_fil = open(out_filename, "w")
        line_count += 1
        out_fil.write(lin)
    in_fil.close()
    out_fil.close()
    sys.stderr.write("Output is in file(s) with prefix {}\n".format(OUTFIL_PREFIX))

def usage():
    sys.stderr.write(
"Usage: {} in_filename lines_per_file\n".format(sys.argv[0]))

def main():

    if len(sys.argv) != 3:
        usage()
        sys.exit(1)

    try:
        # Get and validate in_filename.
        in_filename = sys.argv[1]
        # If input file does not exist, exit.
        if not os.path.exists(in_filename):
            sys.stderr.write("Error: Input file '{}' not found.\n".format(in_filename))
            sys.exit(1)
        # If input is empty, exit.
        if os.path.getsize(in_filename) == 0:
            sys.stderr.write("Error: Input file '{}' has no data.\n".format(in_filename))
            sys.exit(1)
        # Get and validate lines_per_file.
        lines_per_file = int(sys.argv[2])
        if lines_per_file = 0:
            sys.stderr.write("Error: lines_per_file cannot be less than or equal to 0.\n")
            sys.exit(1)
        # If all checks pass, split the file.
        split(in_filename, lines_per_file) 
    except ValueError as ve:
        sys.stderr.write("Caught ValueError: {}\n".format(repr(ve)))
    except IOError as ioe:
        sys.stderr.write("Caught IOError: {}\n".format(repr(ioe)))
    except Exception as e:
        sys.stderr.write("Caught Exception: {}\n".format(repr(e)))
        raise

if __name__ == '__main__':
    main()

You can run split.py like this:

$ python split.py
Usage: split.py in_filename lines_per_file

which will give you the usage help. And like this to actually split text files, in this case, a 100-line text file into 10 files of 10 lines each:

$ python split.py in_file_100_lines.txt 10
Output is in file(s) with prefix out_

Here are a couple of runs with invalid values for either the input file or the lines_per_file argument:

$ python split.py in_file_100_lines.txt 0
Error: lines_per_file cannot be less than or equal to 0.

$ python split.py not-there.txt 0
Error: Input file 'not-there.txt' not found.

As an aside, thinking about whether to use 0 or 1 as initial value for some of the _count variables in the program, made me remember this topic:

Why programmers count from 0

See the first few hits for some good answers.

And finally, speaking of zero, check out this earlier post by me:

Bhaskaracharya and the man who found zero

- Enjoy.

- Vasudev Ram - Online Python training and programming

Signup to hear about new products and services I create.

Posts about Python Posts about xtopdf

My ActiveState recipes

Share|