source: str | Path | list[str] | list[Path],
has_header: bool = True,
separator: str = ',',
comment_prefix: str | None = None,
quote_char: str | None = '"',
skip_rows: int = 0,
schema: SchemaDict | None = None,
schema_overrides: SchemaDict | Sequence[PolarsDataType] | None = None,
null_values: str | Sequence[str] | dict[str, str] | None = None,
missing_utf8_is_empty_string: bool = False,
ignore_errors: bool = False,
cache: bool = True,
with_column_names: Callable[[list[str]], list[str]] | None = None,
infer_schema_length: int | None = 100,
n_rows: int | None = None,
encoding: CsvEncoding = 'utf8',
low_memory: bool = False,
rechunk: bool = False,
skip_rows_after_header: int = 0,
row_index_name: str | None = None,
row_index_offset: int = 0,
try_parse_dates: bool = False,
eol_char: str = '\n',
new_columns: Sequence[str] | None = None,
raise_if_empty: bool = True,
truncate_ragged_lines: bool = False,
decimal_comma: bool = False,
glob: bool = True,
) LazyFrame[source]#

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.


Path to a file.


Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.


Single byte character to use as separator in the file.


A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.


Single byte character used for csv quoting, default = ". Set to None to turn off special handling and escaping of quotes.


Start reading after skip_rows lines. The header will be parsed at this offset.


Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema.


Overwrite dtypes during inference; should be a {colname:dtype,} dict or, if providing a list of strings to new_columns, a list of dtypes of the same length.


Values to interpret as null values. You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.


By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string you can set this param True.


Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.String to check which values might cause an issue.


Cache the result after reading.


Apply a function over the column names just in time (when they are determined); this function will receive (and should return) a list of column names.


The maximum number of rows to scan for schema inference. If set to 0, all columns will be read as pl.String. If set to None, the full data may be scanned (this is slow).


Stop reading from CSV file after reading n_rows.

encoding{‘utf8’, ‘utf8-lossy’}

Lossy means that invalid utf8 values are replaced with characters. Defaults to “utf8”.


Reduce memory pressure at the expense of performance.


Reallocate to contiguous memory when all chunks/ files are parsed.


Skip this number of rows when the header is parsed.


If not None, this will insert a row index column with the given name into the DataFrame.


Offset to start the row index column (only used if the name is set).


Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.String.


Single byte end of line character (default: n). When encountering a file with windows line endings (rn), one can go with the default n. The extra r will be removed when processed.


Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). If the given list is shorter than the width of the DataFrame the remaining columns will have their original name.


When there is no data in the source,`NoDataError` is raised. If this parameter is set to False, an empty LazyFrame (with no columns) is returned instead.


Truncate lines that are longer than the schema.


Parse floats using a comma as the decimal separator instead of a period.


Expand path given via globbing rules.


See also


Read a CSV file into a DataFrame.


>>> import pathlib
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .fetch(100)  # pushed a limit of 100 rows to the scan level
... )

We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "hard", "to", "read"]}
... )
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).collect()
shape: (4, 2)
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
│ 1       ┆ is       │
│ 2       ┆ hard     │
│ 3       ┆ to       │
│ 4       ┆ read     │

You can also simply replace column names (or provide them if the file has none) by passing a list of new column names to the new_columns parameter:

>>> df.write_csv(path)
>>> pl.scan_csv(
...     path,
...     new_columns=["idx", "txt"],
...     schema_overrides=[pl.UInt16, pl.String],
... ).collect()
shape: (4, 2)
│ idx ┆ txt  │
│ --- ┆ ---  │
│ u16 ┆ str  │
│ 1   ┆ is   │
│ 2   ┆ hard │
│ 3   ┆ to   │
│ 4   ┆ read │