polars.scan_csv#

polars.scan_csv(
source: str | Path | IO[str] | IO[bytes] | bytes | list[str] | list[Path] | list[IO[str]] | list[IO[bytes]] | list[bytes],
*,
has_header: bool = True,
separator: str = ',',
comment_prefix: str | None = None,
quote_char: str | None = '"',
skip_rows: int = 0,
skip_lines: int = 0,
schema: SchemaDict | None = None,
schema_overrides: SchemaDict | Sequence[PolarsDataType] | None = None,
null_values: str | Sequence[str] | dict[str, str] | None = None,
missing_utf8_is_empty_string: bool = False,
ignore_errors: bool = False,
cache: bool = True,
with_column_names: Callable[[list[str]], list[str]] | None = None,
infer_schema: bool = True,
infer_schema_length: int | None = 100,
n_rows: int | None = None,
encoding: CsvEncoding = 'utf8',
low_memory: bool = False,
rechunk: bool = False,
skip_rows_after_header: int = 0,
row_index_name: str | None = None,
row_index_offset: int = 0,
try_parse_dates: bool = False,
eol_char: str = '\n',
new_columns: Sequence[str] | None = None,
raise_if_empty: bool = True,
truncate_ragged_lines: bool = False,
decimal_comma: bool = False,
glob: bool = True,
storage_options: dict[str, Any] | None = None,
credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto',
retries: int = 2,
file_cache_ttl: int | None = None,
include_file_paths: str | None = None,
) LazyFrame[source]#

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
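
The effect of these optimizations can be inspected with LazyFrame.explain. In the sketch below (assuming a hypothetical data.csv with columns a and b), the optimized plan reports the projection and the predicate pushed into the CSV scan:

>>> lf = pl.scan_csv("data.csv").select("a", "b").filter(pl.col("a") > 1)
>>> lf.explain()  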

Parameters:
source

Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter.

has_header

Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.
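
For example, scanning a headerless source (here a small in-memory bytes literal) yields autogenerated column names:

>>> pl.scan_csv(b"1,foo\n2,bar\n", has_header=False).collect()
shape: (2, 2)
┌──────────┬──────────┐
│ column_1 ┆ column_2 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ foo      │
│ 2        ┆ bar      │
└──────────┴──────────┘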

separator

Single byte character to use as separator in the file.

comment_prefix

A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.
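
For example, with comment_prefix="#" any line starting with # is skipped (a minimal sketch using an in-memory source):

>>> pl.scan_csv(b"a,b\n1,2\n# a comment\n3,4\n", comment_prefix="#").collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
│ 3   ┆ 4   │
└─────┴─────┘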

quote_char

Single byte character used for CSV quoting; the default is ". Set to None to turn off special handling and escaping of quotes.

skip_rows

Start reading after skip_rows rows. The header will be parsed at this offset. Note that we respect CSV escaping/comments when skipping rows. If you want to skip by newline char only, use skip_lines.

skip_lines

Start reading after skip_lines lines. The header will be parsed at this offset. Note that CSV escaping will not be respected when skipping lines. If you want to skip valid CSV rows, use skip_rows.

schema

Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema.

schema_overrides

Overwrite dtypes during inference; should be a {colname:dtype,} dict or, if providing a list of strings to new_columns, a list of dtypes of the same length.
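
For example, schema skips inference entirely, while schema_overrides adjusts only selected columns (a minimal sketch; the column names are illustrative):

>>> pl.scan_csv(
...     b"id,name\n1,alice\n",
...     schema={"id": pl.UInt32, "name": pl.String},
... ).collect()
shape: (1, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ u32 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ alice │
└─────┴───────┘
>>> lf = pl.scan_csv(b"id,name\n1,alice\n", schema_overrides={"id": pl.UInt32})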

null_values

Values to interpret as null values (see the example after this list). You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.
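
For instance, a per-column mapping can treat different markers as null in different columns (a minimal sketch; the column names are illustrative):

>>> pl.scan_csv(
...     b"a,b\nNA,x\ny,missing\n",
...     null_values={"a": "NA", "b": "missing"},
... ).collect()
shape: (2, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ null ┆ x    │
│ y    ┆ null │
└──────┴──────┘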

missing_utf8_is_empty_string

By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.

ignore_errors

Try to keep reading lines if some lines yield errors. First try infer_schema=False to read all columns as pl.String to check which values might cause an issue.

cache

Cache the result after reading.

with_column_names

Apply a function over the column names just in time (when they are determined); this function will receive (and should return) a list of column names.

infer_schema

When True, the schema is inferred from the data using the first infer_schema_length rows. When False, the schema is not inferred and will be pl.String if not specified in schema or schema_overrides.

infer_schema_length

The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this is slow). Set infer_schema=False to read all columns as pl.String.
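
For example, with infer_schema=False every column is read as pl.String (a minimal sketch):

>>> pl.scan_csv(b"a,b\n1,x\n", infer_schema=False).collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ 1   ┆ x   │
└─────┴─────┘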

n_rows

Stop reading from the CSV file after reading n_rows.

encoding : {‘utf8’, ‘utf8-lossy’}

Lossy means that invalid utf8 values are replaced with the replacement character (�). Defaults to “utf8”.

low_memory

Reduce memory pressure at the expense of performance.

rechunk

Reallocate to contiguous memory when all chunks/files are parsed.

skip_rows_after_header

Skip this number of rows when the header is parsed.

row_index_name

If not None, this will insert a row index column with the given name into the DataFrame.

row_index_offset

Offset to start the row index column (only used if the name is set).
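
For example, a row index starting at 1 can be added at scan time (a minimal sketch):

>>> pl.scan_csv(
...     b"a\nx\ny\n", row_index_name="row_nr", row_index_offset=1
... ).collect()
shape: (2, 2)
┌────────┬─────┐
│ row_nr ┆ a   │
│ ---    ┆ --- │
│ u32    ┆ str │
╞════════╪═════╡
│ 1      ┆ x   │
│ 2      ┆ y   │
└────────┴─────┘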

try_parse_dates

Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.String.
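
For example, an ISO8601 date column is parsed to pl.Date during the scan (a minimal sketch):

>>> pl.scan_csv(b"day\n2024-01-01\n2024-01-02\n", try_parse_dates=True).collect()
shape: (2, 1)
┌────────────┐
│ day        │
│ ---        │
│ date       │
╞════════════╡
│ 2024-01-01 │
│ 2024-01-02 │
└────────────┘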

eol_char

Single byte end of line character (default: \n). When encountering a file with Windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.

new_columns

Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). If the given list is shorter than the width of the DataFrame, the remaining columns will keep their original name.

raise_if_empty

When there is no data in the source, NoDataError is raised. If this parameter is set to False, an empty LazyFrame (with no columns) is returned instead.

truncate_ragged_lines

Truncate lines that are longer than the schema.

decimal_comma

Parse floats using a comma as the decimal separator instead of a period.
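
For example, a semicolon-separated source using comma decimals parses to floats (a minimal sketch):

>>> pl.scan_csv(b"a;b\n1,5;2,25\n", separator=";", decimal_comma=True).collect()
shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ f64 ┆ f64  │
╞═════╪══════╡
│ 1.5 ┆ 2.25 │
└─────┴──────┘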

glob

Expand path given via globbing rules.
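
For example (a sketch; the paths are hypothetical), a pattern can match several files, or globbing can be disabled so that special characters in a path are taken literally:

>>> lf = pl.scan_csv("data/*.csv")  # scan every CSV matching the pattern
>>> lf = pl.scan_csv("data/file[1].csv", glob=False)  # treat "[1]" literally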

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

  • aws

  • gcp

  • azure

  • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.
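
For example, S3 credentials may be passed explicitly (a sketch; the bucket and the placeholder values are hypothetical, and the key names follow the AWS options accepted by the underlying object store):

>>> lf = pl.scan_csv(
...     "s3://my-bucket/data/*.csv",
...     storage_options={
...         "aws_access_key_id": "...",
...         "aws_secret_access_key": "...",
...         "aws_region": "us-east-1",
...     },
... )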

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
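
For example (a sketch; the bucket and the credential key names are hypothetical), a provider function returns the credentials dictionary together with an optional expiry, here None for credentials that do not expire:

>>> def my_credentials():
...     # Return (credentials, expiry); None means the credentials do not expire.
...     return {"aws_access_key_id": "...", "aws_secret_access_key": "..."}, None
>>> lf = pl.scan_csv("s3://my-bucket/data.csv", credential_provider=my_credentials)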

retries

Number of retries if accessing a cloud instance fails.

file_cache_ttl

Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given.

include_file_paths

Include the path of the source file(s) as a column with this name.
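
For example (a sketch; the paths are hypothetical), the added column can then be used in the query, e.g. to filter rows by their originating file:

>>> lf = pl.scan_csv("data/*.csv", include_file_paths="source_file")
>>> lf = lf.filter(pl.col("source_file").str.contains("2024"))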

Returns:
LazyFrame

See also

read_csv

Read a CSV file into a DataFrame.

Examples

>>> import pathlib
>>>
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .head(100)  # constrain number of returned results to 100
... )  

We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "hard", "to", "read"]}
... )
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).collect()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
│ 2       ┆ hard     │
│ 3       ┆ to       │
│ 4       ┆ read     │
└─────────┴──────────┘

You can also simply replace column names (or provide them if the file has none) by passing a list of new column names to the new_columns parameter:

>>> df.write_csv(path)
>>> pl.scan_csv(
...     path,
...     new_columns=["idx", "txt"],
...     schema_overrides=[pl.UInt16, pl.String],
... ).collect()
shape: (4, 2)
┌─────┬──────┐
│ idx ┆ txt  │
│ --- ┆ ---  │
│ u16 ┆ str  │
╞═════╪══════╡
│ 1   ┆ is   │
│ 2   ┆ hard │
│ 3   ┆ to   │
│ 4   ┆ read │
└─────┴──────┘