polars.scan_csv

polars.scan_csv(
source: str | Path | list[str] | list[Path],
*,
has_header: bool = True,
separator: str = ',',
comment_prefix: str | None = None,
quote_char: str | None = '"',
skip_rows: int = 0,
schema: SchemaDict | None = None,
schema_overrides: SchemaDict | Sequence[PolarsDataType] | None = None,
null_values: str | Sequence[str] | dict[str, str] | None = None,
missing_utf8_is_empty_string: bool = False,
ignore_errors: bool = False,
cache: bool = True,
with_column_names: Callable[[list[str]], list[str]] | None = None,
infer_schema: bool = True,
infer_schema_length: int | None = 100,
n_rows: int | None = None,
encoding: CsvEncoding = 'utf8',
low_memory: bool = False,
rechunk: bool = False,
skip_rows_after_header: int = 0,
row_index_name: str | None = None,
row_index_offset: int = 0,
try_parse_dates: bool = False,
eol_char: str = '\n',
new_columns: Sequence[str] | None = None,
raise_if_empty: bool = True,
truncate_ragged_lines: bool = False,
decimal_comma: bool = False,
glob: bool = True,
storage_options: dict[str, Any] | None = None,
retries: int = 2,
file_cache_ttl: int | None = None,
include_file_paths: str | None = None,
) → LazyFrame

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.

Parameters:
source

Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter.

has_header

Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.

separator

Single byte character to use as separator in the file.

comment_prefix

A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.

quote_char

Single byte character used for csv quoting, default = ". Set to None to turn off special handling and escaping of quotes.

skip_rows

Start reading after skip_rows lines. The header will be parsed at this offset.

schema

Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema.

schema_overrides

Overwrite dtypes during inference; should be a {colname:dtype,} dict or, if providing a list of strings to new_columns, a list of dtypes of the same length.

null_values

Values to interpret as null values. You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.

missing_utf8_is_empty_string

By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.

ignore_errors

Try to keep reading lines if some lines yield errors. First try infer_schema=False to read all columns as pl.String to check which values might cause an issue.

cache

Cache the result after reading.

with_column_names

Apply a function over the column names just in time (when they are determined); this function will receive (and should return) a list of column names.

infer_schema

When True, the schema is inferred from the data using the first infer_schema_length rows. When False, the schema is not inferred and will be pl.String if not specified in schema or schema_overrides.

infer_schema_length

The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this is slow). Set infer_schema=False to read all columns as pl.String.

n_rows

Stop reading from CSV file after reading n_rows.

encoding{‘utf8’, ‘utf8-lossy’}

Lossy means that invalid utf8 values are replaced with characters. Defaults to “utf8”.

low_memory

Reduce memory pressure at the expense of performance.

rechunk

Reallocate to contiguous memory when all chunks/files are parsed.

skip_rows_after_header

Skip this number of rows after the header is parsed.

row_index_name

If not None, this will insert a row index column with the given name into the DataFrame.

row_index_offset

Offset to start the row index column (only used if the name is set).

try_parse_dates

Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.String.

eol_char

Single byte end of line character (default: \n). When encountering a file with Windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.

new_columns

Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). If the given list is shorter than the width of the DataFrame the remaining columns will have their original name.

raise_if_empty

When there is no data in the source, NoDataError is raised. If this parameter is set to False, an empty LazyFrame (with no columns) is returned instead.

truncate_ragged_lines

Truncate lines that are longer than the schema.

decimal_comma

Parse floats using a comma as the decimal separator instead of a period.

glob

Expand path given via globbing rules.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

  • aws

  • gcp

  • azure

  • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

retries

Number of retries if accessing a cloud instance fails.

file_cache_ttl

Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given.

include_file_paths

Include the path of the source file(s) as a column with this name.

Returns:
LazyFrame

See also

read_csv

Read a CSV file into a DataFrame.

Examples

>>> import pathlib
>>> import polars as pl
>>>
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .head(100)  # constrain number of returned results to 100
... )  

We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "hard", "to", "read"]}
... )
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).collect()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
│ 2       ┆ hard     │
│ 3       ┆ to       │
│ 4       ┆ read     │
└─────────┴──────────┘

You can also simply replace column names (or provide them if the file has none) by passing a list of new column names to the new_columns parameter:

>>> df.write_csv(path)
>>> pl.scan_csv(
...     path,
...     new_columns=["idx", "txt"],
...     schema_overrides=[pl.UInt16, pl.String],
... ).collect()
shape: (4, 2)
┌─────┬──────┐
│ idx ┆ txt  │
│ --- ┆ ---  │
│ u16 ┆ str  │
╞═════╪══════╡
│ 1   ┆ is   │
│ 2   ┆ hard │
│ 3   ┆ to   │
│ 4   ┆ read │
└─────┴──────┘