polars.read_csv_batched
- polars.read_csv_batched(
- source: str | Path,
- *,
- has_header: bool = True,
- columns: Sequence[int] | Sequence[str] | None = None,
- new_columns: Sequence[str] | None = None,
- separator: str = ',',
- comment_prefix: str | None = None,
- quote_char: str | None = '"',
- skip_rows: int = 0,
- schema_overrides: Mapping[str, PolarsDataType] | Sequence[PolarsDataType] | None = None,
- null_values: str | Sequence[str] | dict[str, str] | None = None,
- missing_utf8_is_empty_string: bool = False,
- ignore_errors: bool = False,
- try_parse_dates: bool = False,
- n_threads: int | None = None,
- infer_schema_length: int | None = 100,
- batch_size: int = 50000,
- n_rows: int | None = None,
- encoding: CsvEncoding | str = 'utf8',
- low_memory: bool = False,
- rechunk: bool = False,
- skip_rows_after_header: int = 0,
- row_index_name: str | None = None,
- row_index_offset: int = 0,
- sample_size: int = 1024,
- eol_char: str = '\n',
- raise_if_empty: bool = True,
- truncate_ragged_lines: bool = False,
- decimal_comma: bool = False,
- ) -> BatchedCsvReader
Read a CSV file in batches.
Upon creation of the BatchedCsvReader, Polars will gather statistics and determine the file chunks. After that, work will only be done if next_batches is called, which will return a list of n frames of the given batch size.
- Parameters:
- source
Path to a file or a file-like object (by “file-like object” we refer to objects that have a read() method, such as a file handler like the builtin open function, or a BytesIO instance). If fsspec is installed, it will be used to open remote files.
- has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.
- columns
Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
- new_columns
Rename columns right after parsing the CSV file. If the given list is shorter than the width of the DataFrame the remaining columns will have their original name.
- separator
Single byte character to use as separator in the file.
- comment_prefix
A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.
- quote_char
Single byte character used for CSV quoting (default: "). Set to None to turn off special handling and escaping of quotes.
- skip_rows
Start reading after skip_rows lines.
- schema_overrides
Overwrite dtypes during inference.
- null_values
Values to interpret as null values. You can provide a:
str: All values equal to this string will be null.
List[str]: All values equal to any string in this list will be null.
Dict[str, str]: A dictionary that maps column name to a null value string.
- missing_utf8_is_empty_string
By default a missing value is considered to be null. If you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.
- ignore_errors
Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.String to check which values might cause an issue.
- try_parse_dates
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.String.
- n_threads
Number of threads to use in CSV parsing. Defaults to the number of physical CPUs of your system.
- infer_schema_length
The maximum number of rows to scan for schema inference. If set to 0, all columns will be read as pl.String. If set to None, the full data may be scanned (this is slow).
- batch_size
Number of lines to read into the buffer at once.
Modify this to change performance.
- n_rows
Stop reading from the CSV file after reading n_rows rows. During multi-threaded parsing, an upper bound of n_rows rows cannot be guaranteed.
- encoding {‘utf8’, ‘utf8-lossy’, …}
Lossy means that invalid utf8 values are replaced with � characters. When using encodings other than utf8 or utf8-lossy, the input is first decoded in memory with Python. Defaults to utf8.
- low_memory
Reduce memory pressure at the expense of performance.
- rechunk
Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.
- skip_rows_after_header
Skip this number of rows after the header is parsed.
- row_index_name
Insert a row index column with the given name into the DataFrame as the first column. If set to None (default), no row index column is created.
- row_index_offset
Start the row index at this offset. Cannot be negative. Only used if row_index_name is set.
- sample_size
Set the sample size. This is used to sample statistics to estimate the allocation needed.
- eol_char
Single byte end of line character (default: \n). When encountering a file with Windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.
- raise_if_empty
When there is no data in the source, `NoDataError` is raised. If this parameter is set to False, None will be returned from next_batches(n) instead.
- truncate_ragged_lines
Truncate lines that are longer than the schema.
- decimal_comma
Parse floats using a comma as the decimal separator instead of a period.
- Returns:
- BatchedCsvReader
See also
scan_csv
Lazily read from a CSV file or multiple files via glob patterns.
Examples
>>> reader = pl.read_csv_batched(
...     "./tpch/tables_scale_100/lineitem.tbl",
...     separator="|",
...     try_parse_dates=True,
... )
>>> batches = reader.next_batches(5)
>>> for df in batches:
...     print(df)
Read big CSV file in batches and write a CSV file for each “group” of interest.
>>> seen_groups = set()
>>> reader = pl.read_csv_batched("big_file.csv")
>>> batches = reader.next_batches(100)

>>> while batches:
...     df_current_batches = pl.concat(batches)
...     partition_dfs = df_current_batches.partition_by("group", as_dict=True)
...
...     for group, df in partition_dfs.items():
...         if group in seen_groups:
...             with open(f"./data/{group}.csv", "a") as fh:
...                 fh.write(df.write_csv(file=None, include_header=False))
...         else:
...             df.write_csv(file=f"./data/{group}.csv", include_header=True)
...         seen_groups.add(group)
...
...     batches = reader.next_batches(100)