polars.scan_csv#

polars.scan_csv(
source: str | Path | IO[str] | IO[bytes] | bytes | list[str] | list[Path] | list[IO[str]] | list[IO[bytes]] | list[bytes],
*,
has_header: bool = True,
separator: str = ',',
comment_prefix: str | None = None,
quote_char: str | None = '"',
skip_rows: int = 0,
skip_lines: int = 0,
schema: SchemaDict | None = None,
schema_overrides: SchemaDict | Sequence[PolarsDataType] | None = None,
null_values: str | Sequence[str] | dict[str, str] | None = None,
missing_utf8_is_empty_string: bool = False,
ignore_errors: bool = False,
cache: bool = True,
with_column_names: Callable[[list[str]], list[str]] | None = None,
infer_schema: bool = True,
infer_schema_length: int | None = 100,
n_rows: int | None = None,
encoding: CsvEncoding = 'utf8',
low_memory: bool = False,
rechunk: bool = False,
skip_rows_after_header: int = 0,
row_index_name: str | None = None,
row_index_offset: int = 0,
try_parse_dates: bool = False,
eol_char: str = '\n',
new_columns: Sequence[str] | None = None,
raise_if_empty: bool = True,
truncate_ragged_lines: bool = False,
decimal_comma: bool = False,
glob: bool = True,
storage_options: dict[str, Any] | None = None,
credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto',
retries: int = 2,
file_cache_ttl: int | None = None,
include_file_paths: str | None = None,
) LazyFrame[source]#

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
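
The effect of these optimizations can be inspected with LazyFrame.explain. In the sketch below (assuming a hypothetical data.csv with columns a and b), the optimized plan reports the projection and the predicate pushed into the CSV scan:

>>> lf = pl.scan_csv("data.csv").select("a", "b").filter(pl.col("a") > 1)
>>> lf.explain()  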

Parameters:
source

Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter.

has_header

Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.
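
For example, scanning a headerless source (here a small in-memory bytes literal) yields autogenerated column names:

>>> pl.scan_csv(b"1,foo\n2,bar\n", has_header=False).collect()
shape: (2, 2)
┌──────────┬──────────┐
│ column_1 ┆ column_2 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ foo      │
│ 2        ┆ bar      │
└──────────┴──────────┘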

separator

Single byte character to use as separator in the file.

comment_prefix

A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.
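
For example, with comment_prefix="#" any line starting with # is skipped (a minimal sketch using an in-memory source):

>>> pl.scan_csv(b"a,b\n1,2\n# a comment\n3,4\n", comment_prefix="#").collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
│ 3   ┆ 4   │
└─────┴─────┘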

quote_char

Single byte character used for CSV quoting; the default is ". Set to None to turn off special handling and escaping of quotes.

skip_rows

Start reading after skip_rows rows. The header will be parsed at this offset. Note that we respect CSV escaping/comments when skipping rows. If you want to skip by newline char only, use skip_lines.

skip_lines

Start reading after skip_lines lines. The header will be parsed at this offset. Note that CSV escaping will not be respected when skipping lines. If you want to skip valid CSV rows, use skip_rows.

schema

Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema.

schema_overrides

Overwrite dtypes during inference; should be a {colname:dtype,} dict or, if providing a list of strings to new_columns, a list of dtypes of the same length.
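
For example, schema skips inference entirely, while schema_overrides adjusts only selected columns (a minimal sketch; the column names are illustrative):

>>> pl.scan_csv(
...     b"id,name\n1,alice\n",
...     schema={"id": pl.UInt32, "name": pl.String},
... ).collect()
shape: (1, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ u32 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ alice │
└─────┴───────┘
>>> lf = pl.scan_csv(b"id,name\n1,alice\n", schema_overrides={"id": pl.UInt32})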

null_values

Values to interpret as null values (see the example after this list). You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.
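
For instance, a per-column mapping can treat different markers as null in different columns (a minimal sketch; the column names are illustrative):

>>> pl.scan_csv(
...     b"a,b\nNA,x\ny,missing\n",
...     null_values={"a": "NA", "b": "missing"},
... ).collect()
shape: (2, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ null ┆ x    │
│ y    ┆ null │
└──────┴──────┘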

missing_utf8_is_empty_string

By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.

ignore_errors

Try to keep reading lines if some lines yield errors. First try infer_schema=False to read all columns as pl.String to check which values might cause an issue.

cache

Cache the result after reading.

with_column_names

Apply a function over the column names just in time (when they are determined); this function will receive (and should return) a list of column names.

infer_schema

When True, the schema is inferred from the data using the first infer_schema_length rows. When False, the schema is not inferred and will be pl.String if not specified in schema or schema_overrides.

infer_schema_length

The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this is slow). Set infer_schema=False to read all columns as pl.String.
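
For example, with infer_schema=False every column is read as pl.String (a minimal sketch):

>>> pl.scan_csv(b"a,b\n1,x\n", infer_schema=False).collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ 1   ┆ x   │
└─────┴─────┘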

n_rows

Stop reading from the CSV file after reading n_rows.

encoding : {‘utf8’, ‘utf8-lossy’}

Lossy means that invalid utf8 values are replaced with the replacement character (�). Defaults to “utf8”.

low_memory

Reduce memory pressure at the expense of performance.

rechunk

Reallocate to contiguous memory when all chunks/files are parsed.

skip_rows_after_header

Skip this number of rows when the header is parsed.

row_index_name

If not None, this will insert a row index column with the given name into the DataFrame.

row_index_offset

Offset to start the row index column (only used if the name is set).
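
For example, a row index starting at 1 can be added at scan time (a minimal sketch):

>>> pl.scan_csv(
...     b"a\nx\ny\n", row_index_name="row_nr", row_index_offset=1
... ).collect()
shape: (2, 2)
┌────────┬─────┐
│ row_nr ┆ a   │
│ ---    ┆ --- │
│ u32    ┆ str │
╞════════╪═════╡
│ 1      ┆ x   │
│ 2      ┆ y   │
└────────┴─────┘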

try_parse_dates

Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.String.
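
For example, an ISO8601 date column is parsed to pl.Date during the scan (a minimal sketch):

>>> pl.scan_csv(b"day\n2024-01-01\n2024-01-02\n", try_parse_dates=True).collect()
shape: (2, 1)
┌────────────┐
│ day        │
│ ---        │
│ date       │
╞════════════╡
│ 2024-01-01 │
│ 2024-01-02 │
└────────────┘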

eol_char

Single byte end of line character (default: \n). When encountering a file with Windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.

new_columns

Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). If the given list is shorter than the width of the DataFrame, the remaining columns will keep their original name.

raise_if_empty

When there is no data in the source, NoDataError is raised. If this parameter is set to False, an empty LazyFrame (with no columns) is returned instead.

truncate_ragged_lines

Truncate lines that are longer than the schema.

decimal_comma

Parse floats using a comma as the decimal separator instead of a period.
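
For example, a semicolon-separated source using comma decimals parses to floats (a minimal sketch):

>>> pl.scan_csv(b"a;b\n1,5;2,25\n", separator=";", decimal_comma=True).collect()
shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ f64 ┆ f64  │
╞═════╪══════╡
│ 1.5 ┆ 2.25 │
└─────┴──────┘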

glob

Expand path given via globbing rules.
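
For example (a sketch; the paths are hypothetical), a pattern can match several files, or globbing can be disabled so that special characters in a path are taken literally:

>>> lf = pl.scan_csv("data/*.csv")  # scan every CSV matching the pattern
>>> lf = pl.scan_csv("data/file[1].csv", glob=False)  # treat "[1]" literally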

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

  • aws

  • gcp

  • azure

  • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.
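
For example, S3 credentials may be passed explicitly (a sketch; the bucket and the placeholder values are hypothetical, and the key names follow the AWS options accepted by the underlying object store):

>>> lf = pl.scan_csv(
...     "s3://my-bucket/data/*.csv",
...     storage_options={
...         "aws_access_key_id": "...",
...         "aws_secret_access_key": "...",
...         "aws_region": "us-east-1",
...     },
... )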

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
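
For example (a sketch; the bucket and the credential key names are hypothetical), a provider function returns the credentials dictionary together with an optional expiry, here None for credentials that do not expire:

>>> def my_credentials():
...     # Return (credentials, expiry); None means the credentials do not expire.
...     return {"aws_access_key_id": "...", "aws_secret_access_key": "..."}, None
>>> lf = pl.scan_csv("s3://my-bucket/data.csv", credential_provider=my_credentials)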

retries

Number of retries if accessing a cloud instance fails.

file_cache_ttl

Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given.

include_file_paths

Include the path of the source file(s) as a column with this name.
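
For example (a sketch; the paths are hypothetical), the added column can then be used in the query, e.g. to filter rows by their originating file:

>>> lf = pl.scan_csv("data/*.csv", include_file_paths="source_file")
>>> lf = lf.filter(pl.col("source_file").str.contains("2024"))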

Returns:
LazyFrame

See also

read_csv

Read a CSV file into a DataFrame.

Examples

>>> import pathlib
>>>
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .head(100)  # constrain number of returned results to 100
... )  

We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "hard", "to", "read"]}
... )
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).collect()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
│ 2       ┆ hard     │
│ 3       ┆ to       │
│ 4       ┆ read     │
└─────────┴──────────┘

You can also simply replace column names (or provide them if the file has none) by passing a list of new column names to the new_columns parameter:

>>> df.write_csv(path)
>>> pl.scan_csv(
...     path,
...     new_columns=["idx", "txt"],
...     schema_overrides=[pl.UInt16, pl.String],
... ).collect()
shape: (4, 2)
┌─────┬──────┐
│ idx ┆ txt  │
│ --- ┆ ---  │
│ u16 ┆ str  │
╞═════╪══════╡
│ 1   ┆ is   │
│ 2   ┆ hard │
│ 3   ┆ to   │
│ 4   ┆ read │
└─────┴──────┘