I have always wondered how pandas infers data types and why reading large CSV files sometimes takes so much memory. Well, it is time to understand how it works.
This article describes the default C-based CSV parsing engine in pandas.
First off, there is a low_memory parameter in the read_csv function that is set to True by default. Instead of processing the whole file in a single pass, pandas splits the CSV into chunks whose size is limited by a number of lines. A simple heuristic determines that number: 2**20 / number_of_columns, rounded down to the nearest power of two.
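The heuristic above can be sketched in a few lines of Python. Note that chunk_rows is a hypothetical helper name for illustration, not part of the pandas API:

```python
def chunk_rows(n_columns, target_cells=2**20):
    """Rows per chunk: 2**20 / n_columns, rounded down to a power of two.

    A sketch of the heuristic described in the text, not pandas' actual code.
    """
    rows = target_cells // n_columns
    # round down to the nearest power of two
    power = 1
    while power * 2 <= rows:
        power *= 2
    return power

chunk_rows(10)  # 2**20 / 10 = 104857 → rounded down to 65536
```

So for a 10-column file, each chunk would hold 65536 lines, and for a single-column file the full 2**20 lines.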
Parsing process
The parsing process starts with a tokenizer, which splits each line into fields (tokens). The tokenizing engine does not make any assumptions about the data and stores each column as an array of strings.
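The idea of the tokenizing step can be illustrated with a toy sketch using the standard-library csv module (pandas' real tokenizer is implemented in C, so this is only a conceptual analogy):

```python
import csv
import io

raw = "a,b\n1,x\n2,y\n"

# Split each line into fields (tokens), making no assumptions about the data.
rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]

# Each column starts out as an array of strings, regardless of its content.
columns = {name: [row[i] for row in data] for i, name in enumerate(header)}
# columns == {'a': ['1', '2'], 'b': ['x', 'y']}
```

Even the numeric column a is held as strings at this stage; conversion happens later.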
The next step is data conversion. If no type is specified, pandas tries to infer one automatically.
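You can observe the result of this inference directly: with no dtype argument, read_csv assigns each column a type based on its contents.

```python
import io

import pandas as pd

csv_data = io.StringIO("a,b,c\n1,1.5,x\n2,2.5,y\n")
df = pd.read_csv(csv_data)

# With no dtype specified, pandas infers one type per column:
# integers become int64, decimals float64, and strings object.
print(df.dtypes)
```

Column a is inferred as int64, b as float64, and c as object.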
The inference module is written in C and Cython; here is pseudocode of the main logic behind type inference:
def try_int64(column):
    result = np.empty(