Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 22462

Vasudev Ram: Processing DSV data (Delimiter-Separated Values) with Python

$
0
0
By Vasudev Ram



I needed to process some DSV files recently. Here is a Python program I wrote for it, with a few changes over the original. E.g. I do not show the processing of the data here; I only read and print it. Also, I support two different command-line options (and ways) to specify the delimiter character.

DSV (Delimiter-separated values) is a common tabular text data format, with one record per line, and some number of fields per record, where the fields are separated or delimited by some specific character. Some common delimiter characters used in DSV files are tab (which makes them TSV files, Tab-Separated Values, common on Unix), comma (CSV files, Comma-Separated Values, a common spreadsheet and database import-export format), the pipe character (|), the colon (:) and others.

They - DSV files - are described in this section, Data File Metaformats, under Chapter 5: Textuality, of Eric Raymond (ESR)'s book, The Art of Unix Programming, which is a recommended read for anyone interested in Unix (one of the longest-lived operating systems [1]) and in software design and development.

[1] And, speaking a bit loosely, nowaday Unix is also the most widely used OS in the world, by a fair margin, due to its use (as a variant) in Android and iOS based mobile devices, both of which are Unix-based, not to mention Apple MacOS and Linux computers, which also are. Android devices alone number in the billions.

The program, read_dsv.py, is a command-line utility, written in Python, that allows you to specify the delimiter character in one of two ways:

- with a "-c delim_char" option, in which delim_char is an ASCII character,
- with a "-n delim_code" option, in which delim_code is an ASCII code.

It then reads either the files specified as command-line arguments after the -n or -c option, or if no files are given, it reads its standard input.

Here is the code for read_dsv.py:
from __future__ import print_function

"""
read_dsv.py
Author: Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Product store: https://gumroad.com/vasudevram
Purpose: Shows how to read DSV data, i.e.

https://en.wikipedia.org/wiki/Delimiter-separated_values

from either files or standard input, split the fields of each
line on the delimiter, and process the fields in some way.
The delimiter character is configurable by the user and can
be specified as either a character or its ASCII code.

Reference:
TAOUP (The Art Of Unix Programming): Data File Metaformats:
http://www.catb.org/esr/writings/taoup/html/ch05s02.html
ASCII table: http://www.asciitable.com/
"""

import sys
import string

def err_write(message):
sys.stderr.write(message)

def error_exit(message):
err_write(message)
sys.exit(1)

def usage(argv, verbose=False):
usage1 = \
"{}: read and process DSV (Delimiter-Separated-Values) data.\n".format(argv[0])
usage2 = "Usage: python" + \
" {} [ -c delim_char | -n delim_code ] [ dsv_file ] ...\n".format(argv[0])
usage3 = [
"where one of either the -c or -n option must be given,\n",
"delim_char is a single ASCII delimiter character, and\n",
"delim_code is a delimiter character's ASCII code.\n",
"Text lines will be read from specified DSV file(s) or\n",
"from standard input, split on the specified delimiter\n",
"specified by either the -c or -n option, processed, and\n",
"written to standard output.\n",
]
err_write(usage1)
err_write(usage2)
if verbose:
for line in usage3:
err_write(line)

def str_to_int(s):
try:
return int(s)
except ValueError as ve:
error_exit(repr(ve))

def valid_delimiter(delim_code):
return not invalid_delimiter(delim_code)

def invalid_delimiter(delim_code):
# Non-ASCII codes not allowed, i.e. codes outside
# the range 0 to 255.
if delim_code < 0 or delim_code > 255:
return True
# Also, don't allow some specific ASCII codes;
# add more, if it turns out they are needed.
if delim_code in (10, 13):
return True
return False

def read_dsv(dsv_fil, delim_char):
for idx, lin in enumerate(dsv_fil):
fields = lin.split(delim_char)
assert len(fields) > 0
# Knock off the newline at the end of the last field,
# since it is the line terminator, not part of the field.
if fields[-1][-1] == '\n':
fields[-1] = fields[-1][:-1]
# Treat a blank line as a line with one field,
# an empty string (that is what split returns).
print("Line", idx, "fields:")
for idx2, field in enumerate(fields):
print(str(idx2) + ":", "|" + field + "|")

def main():
# Get and check validity of arguments.
sa = sys.argv
lsa = len(sa)
if lsa == 1:
usage(sa)
sys.exit(0)
if lsa == 2:
# Allow the help option with any letter case.
if sa[1].lower() in ("-h", "--help"):
usage(sa, verbose=True)
sys.exit(0)
else:
usage(sa)
sys.exit(0)

# If we reach here, lsa is >= 3.
# Check for valid mandatory options (sic).
if not sa[1] in ("-c", "-n"):
usage(sa, verbose=True)
sys.exit(0)

# If -c option given ...
if sa[1] == "-c":
# If next token is not a single character ...
if len(sa[2]) != 1:
error_exit(
"{}: Error: -c option needs a single character after it.".format(sa[0]))
if not sa[2] in string.printable:
error_exit(
"{}: Error: -c option needs a printable ASCII character after it.".format(\
sa[0]))
delim_char = sa[2]
# else if -n option given ...
elif sa[1] == "-n":
delim_code = str_to_int(sa[2])
if invalid_delimiter(delim_code):
error_exit(
"{}: Error: invalid delimiter code {} given for -n option.".format(\
sa[0], delim_code))
delim_char = chr(delim_code)
else:
# Checking for what should not happen ... a bit of defensive programming here.
error_exit("{}: Program error: neither -c nor -n option given.".format(sa[0]))

try:
# If no filenames given, read sys.stdin ...
if lsa == 3:
print("processing sys.stdin")
dsv_fil = sys.stdin
read_dsv(dsv_fil, delim_char)
dsv_fil.close()
# else (filenames given), read them ...
else:
for dsv_filename in sa[3:]:
print("processing file:", dsv_filename)
dsv_fil = open(dsv_filename, 'r')
read_dsv(dsv_fil, delim_char)
dsv_fil.close()
except IOError as ioe:
error_exit("{}: Error: {}".format(sa[0], repr(ioe)))

if __name__ == '__main__':
main()

Here are test runs of the program (both valid and invalid), and the results of each one:

Run it without any arguments. Gives a brief usage message.
$ python read_dsv.py
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -h option (for help). Gives the verbose usage message.
$ python read_dsv.py -h
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
where one of either the -c or -n option must be given,
delim_char is a single ASCII delimiter character, and
delim_code is a delimiter character's ASCII code.
Text lines will be read from specified DSV file(s) or
from standard input, split on the specified delimiter
specified by either the -c or -n option, processed, and
written to standard output.
Run it with a -v option (invalid run). Gives the brief usage message.
$ python read_dsv.py -v
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -c option but no ASCII character argument (invalid run). Gives the brief usage message.
$ python read_dsv.py -c
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -c option followed by the pipe character (invalid run). The OS (here, Windows) gives an error message because the pipe character cannot be used to end a pipeline.
$ python read_dsv.py -c |
The syntax of the command is incorrect.
Run it with the -c option and the pipe character as the delimiter character, but protected (by double quotes) from interpretation by the OS shell (CMD).
$ python read_dsv.py -c "|" file1.dsv
processing file: file1.dsv
Line 0 fields:
0: |1|
1: |2|
2: |3|
3: |4|
4: |5|
5: |6|
6: |7|
Line 1 fields:
0: |field1|
1: |fld2|
2: | fld3 with spaces around it |
3: | fld4 with leading spaces|
4: |fld5 with trailing spaces |
5: |next field is empty|
6: ||
7: |last field|
Line 2 fields:
0: ||
1: |1|
2: |22|
3: |333|
4: |4444|
5: |55555|
6: |666666|
7: |7777777|
8: |88888888|
Line 3 fields:
0: ||
1: ||
2: ||
3: ||
4: ||
5: | |
6: ||
7: ||
8: ||
9: ||
10: ||
Run it with the -n option followed by 124, the ASCII code for the pipe character as the delimiter.
$ python read_dsv.py -n 124 file1.dsv
[Gives exact same output as the run above, as it should, because both use the same delimiter and read the same input file.]
Copy file1.dsv to file3.dsv. Change all the pipe characters (delimiters) to colons:
Run it with the -n option followed by 58, the ASCII code for the colon character.
$ python read_dsv.py -n 58 file3.dsv
[Gives exact same output as the run above, as it should, because other than the delimiters (pipe versus colon), the input is the same.]

I added support for the -n option to the program because it makes it more flexible, since you can specify any ASCII character as the delimiter (that makes sense), by giving its ASCII code.

And of course, to find out the values of the ASCII codes for these delimiter characters, I used the char_to_ascii_code.py program from my recent post:

Trapping KeyboardInterrupt and EOFError for program cleanup

You may have noticed that I mentioned delimiter characters and DSV files in that post too. The char_to_ascii_code.py utility shown in that post was created to find the ASCII code for any character (without having to look it up on the web each time).

- Enjoy.

- Vasudev Ram - Online Python training and consulting

- Black Flyday at Flywheel Wordpress Managed Hosting - get 3 months free on the annual plan.

Get updates on my software products / ebooks / courses.

Jump to posts: Python  DLang  xtopdf

Subscribe to my blog by email

My ActiveState recipes




Viewing all articles
Browse latest Browse all 22462

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>