polars.scan_csv#

polars.scan_csv(
source: str | Path,
*,
has_header: bool = True,
separator: str = ',',
comment_char: str | None = None,
quote_char: str | None = '"',
skip_rows: int = 0,
dtypes: SchemaDict | Sequence[PolarsDataType] | None = None,
null_values: str | Sequence[str] | dict[str, str] | None = None,
missing_utf8_is_empty_string: bool = False,
ignore_errors: bool = False,
cache: bool = True,
with_column_names: Callable[[list[str]], list[str]] | None = None,
infer_schema_length: int | None = 100,
n_rows: int | None = None,
encoding: CsvEncoding = 'utf8',
low_memory: bool = False,
rechunk: bool = True,
skip_rows_after_header: int = 0,
row_count_name: str | None = None,
row_count_offset: int = 0,
try_parse_dates: bool = False,
eol_char: str = '\n',
new_columns: Sequence[str] | None = None,
raise_if_empty: bool = True,
) LazyFrame[source]#

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.

Parameters:
source

Path to a file, or a glob pattern matching multiple files.

has_header

Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.

separator

Single byte character to use as delimiter in the file.

comment_char

Single byte character that indicates the start of a comment line, for instance #.

quote_char

Single byte character used for CSV quoting; defaults to ". Set to None to turn off special handling and escaping of quotes.

skip_rows

Start reading after skip_rows lines. The header will be parsed at this offset.

dtypes

Overwrite dtypes during inference; should be a {colname:dtype,} dict or, if providing a list of strings to new_columns, a list of dtypes of the same length.

null_values

Values to interpret as null values. You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.

missing_utf8_is_empty_string

By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.

ignore_errors

Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.Utf8 to check which values might cause an issue.

cache

Cache the result after reading.

with_column_names

Apply a function over the column names just in time (when they are determined); this function will receive (and should return) a list of column names.

infer_schema_length

Maximum number of lines to read to infer schema. If set to 0, all columns will be read as pl.Utf8. If set to None, a full table scan will be done (slow).

n_rows

Stop reading from CSV file after reading n_rows.

encoding : {'utf8', 'utf8-lossy'}

Lossy means that invalid utf8 values are replaced with the replacement character (�). Defaults to "utf8".

low_memory

Reduce memory usage at the expense of performance.

rechunk

Reallocate to contiguous memory when all chunks/files are parsed.

skip_rows_after_header

Skip this number of rows when the header is parsed.

row_count_name

If not None, this will insert a row count column with the given name into the DataFrame.

row_count_offset

Offset to start the row_count column (only used if the name is set).

try_parse_dates

Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.Utf8.

eol_char

Single byte end-of-line character.

new_columns

Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). Note that unlike read_csv it is considered an error to provide fewer column names than there are columns in the file.

raise_if_empty

When there is no data in the source, NoDataError is raised. If this parameter is set to False, an empty LazyFrame (with no columns) is returned instead.

Returns:
LazyFrame

See also

read_csv

Read a CSV file into a DataFrame.

Examples

>>> import pathlib
>>>
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .fetch(100)  # pushed a limit of 100 rows to the scan level
... )  

We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "hard", "to", "read"]}
... )
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).collect()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
│ 2       ┆ hard     │
│ 3       ┆ to       │
│ 4       ┆ read     │
└─────────┴──────────┘

You can also simply replace column names (or provide them if the file has none) by passing a list of new column names to the new_columns parameter:

>>> df.write_csv(path)
>>> pl.scan_csv(
...     path,
...     new_columns=["idx", "txt"],
...     dtypes=[pl.UInt16, pl.Utf8],
... ).collect()
shape: (4, 2)
┌─────┬──────┐
│ idx ┆ txt  │
│ --- ┆ ---  │
│ u16 ┆ str  │
╞═════╪══════╡
│ 1   ┆ is   │
│ 2   ┆ hard │
│ 3   ┆ to   │
│ 4   ┆ read │
└─────┴──────┘