polars.scan_csv
- polars.scan_csv(
- source: str | Path | IO[str] | IO[bytes] | bytes | list[str] | list[Path] | list[IO[str]] | list[IO[bytes]] | list[bytes],
- *,
- has_header: bool = True,
- separator: str = ',',
- comment_prefix: str | None = None,
- quote_char: str | None = '"',
- skip_rows: int = 0,
- schema: SchemaDict | None = None,
- schema_overrides: SchemaDict | Sequence[PolarsDataType] | None = None,
- null_values: str | Sequence[str] | dict[str, str] | None = None,
- missing_utf8_is_empty_string: bool = False,
- ignore_errors: bool = False,
- cache: bool = True,
- with_column_names: Callable[[list[str]], list[str]] | None = None,
- infer_schema: bool = True,
- infer_schema_length: int | None = 100,
- n_rows: int | None = None,
- encoding: CsvEncoding = 'utf8',
- low_memory: bool = False,
- rechunk: bool = False,
- skip_rows_after_header: int = 0,
- row_index_name: str | None = None,
- row_index_offset: int = 0,
- try_parse_dates: bool = False,
- eol_char: str = '\n',
- new_columns: Sequence[str] | None = None,
- raise_if_empty: bool = True,
- truncate_ragged_lines: bool = False,
- decimal_comma: bool = False,
- glob: bool = True,
- storage_options: dict[str, Any] | None = None,
- credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto',
- retries: int = 2,
- file_cache_ttl: int | None = None,
- include_file_paths: str | None = None,
- ) → LazyFrame
Lazily read from a CSV file or multiple files via glob patterns.
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
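As a quick illustration of this optimization, you can inspect the optimized plan with LazyFrame.explain(); a minimal sketch, assuming a local file data.csv with columns a, b, and c:
>>> lf = (
...     pl.scan_csv("data.csv")  # hypothetical file; nothing is read yet
...     .select(["a", "b"])  # projection pushdown: only these columns are read
...     .filter(pl.col("a") > 0)  # predicate pushdown: filtering happens at the scan
... )
>>> plan = lf.explain()  # the optimized plan shows the pushed-down scan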
- Parameters:
- source
Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter.
- has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.
- separator
Single byte character to use as separator in the file.
- comment_prefix
A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.
- quote_char
Single byte character used for csv quoting, default = ". Set to None to turn off special handling and escaping of quotes.
- skip_rows
Start reading after skip_rows lines. The header will be parsed at this offset.
- schema
Provide the schema. This means that polars doesn't do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema.
- schema_overrides
Overwrite dtypes during inference; should be a {colname: dtype} dict or, if providing a list of strings to new_columns, a list of dtypes of the same length.
- null_values
Values to interpret as null values. You can provide a:
  - str: All values equal to this string will be null.
  - List[str]: All values equal to any string in this list will be null.
  - Dict[str, str]: A dictionary that maps column name to a null value string.
A short sketch of these forms is shown after this parameter list.
- missing_utf8_is_empty_string
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string you can set this parameter to True.
- ignore_errors
Try to keep reading lines if some lines yield errors. First try infer_schema=False to read all columns as pl.String to check which values might cause an issue.
- cache
Cache the result after reading.
- with_column_names
Apply a function over the column names just in time (when they are determined); this function will receive (and should return) a list of column names.
- infer_schema
When True, the schema is inferred from the data using the first infer_schema_length rows. When False, the schema is not inferred and will be pl.String if not specified in schema or schema_overrides.
- infer_schema_length
The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this is slow). Set infer_schema=False to read all columns as pl.String.
- n_rows
Stop reading from the CSV file after reading n_rows.
- encoding : {'utf8', 'utf8-lossy'}
Lossy means that invalid utf8 values are replaced with � characters. Defaults to "utf8".
- low_memory
Reduce memory pressure at the expense of performance.
- rechunk
Reallocate to contiguous memory when all chunks/files are parsed.
- skip_rows_after_header
Skip this number of rows when the header is parsed.
- row_index_name
If not None, this will insert a row index column with the given name into the DataFrame.
- row_index_offset
Offset to start the row index column (only used if the name is set).
- try_parse_dates
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.String.
- eol_char
Single byte end of line character (default: \n). When encountering a file with Windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.
- new_columns
Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). If the given list is shorter than the width of the DataFrame the remaining columns will have their original name.
- raise_if_empty
When there is no data in the source, NoDataError is raised. If this parameter is set to False, an empty LazyFrame (with no columns) is returned instead.
- truncate_ragged_lines
Truncate lines that are longer than the schema.
- decimal_comma
Parse floats using a comma as the decimal separator instead of a period.
- glob
Expand path given via globbing rules.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  - aws
  - gcp
  - azure
  - Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- file_cache_ttl
Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given.
- include_file_paths
Include the path of the source file(s) as a column with this name.
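A minimal sketch of the null_values forms described above (the file name, column name, and marker strings are hypothetical):
>>> lf = pl.scan_csv("data.csv", null_values="NA")  # one marker string
>>> lf = pl.scan_csv("data.csv", null_values=["NA", "N/A"])  # any of several markers
>>> lf = pl.scan_csv("data.csv", null_values={"score": "-1"})  # per-column marker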
- Returns:
- LazyFrame
See also
read_csv
Read a CSV file into a DataFrame.
Examples
>>> import pathlib
>>>
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .head(100)  # constrain number of returned results to 100
... )
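A row index column and a row limit can also be requested at scan time; a small sketch reusing the hypothetical file above:
>>> lf = pl.scan_csv(
...     "my_long_file.csv",
...     row_index_name="row_nr",  # insert a row index column named "row_nr"
...     row_index_offset=1,  # start the index at 1 instead of 0
...     n_rows=1_000,  # stop after reading 1000 rows
... )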
We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "hard", "to", "read"]}
... )
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).collect()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
│ 2       ┆ hard     │
│ 3       ┆ to       │
│ 4       ┆ read     │
└─────────┴──────────┘
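Schema inference can be steered the same way at scan time; a hedged sketch reusing path from the example above:
>>> lf = pl.scan_csv(
...     path,
...     schema_overrides={"BrEeZaH": pl.Float64},  # override one inferred dtype
...     try_parse_dates=True,  # attempt to parse date-like columns automatically
... )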
You can also simply replace column names (or provide them if the file has none) by passing a list of new column names to the new_columns parameter:

>>> df.write_csv(path)
>>> pl.scan_csv(
...     path,
...     new_columns=["idx", "txt"],
...     schema_overrides=[pl.UInt16, pl.String],
... ).collect()
shape: (4, 2)
┌─────┬──────┐
│ idx ┆ txt  │
│ --- ┆ ---  │
│ u16 ┆ str  │
╞═════╪══════╡
│ 1   ┆ is   │
│ 2   ┆ hard │
│ 3   ┆ to   │
│ 4   ┆ read │
└─────┴──────┘
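For cloud sources, credentials and related options go through storage_options; a sketch with a placeholder bucket and option values (requires valid AWS access to actually run):
>>> lf = pl.scan_csv(
...     "s3://my-bucket/data/*.csv",  # hypothetical bucket; glob=True expands the pattern
...     storage_options={"aws_region": "us-east-1"},  # placeholder option
...     include_file_paths="source_file",  # add a column recording each row's source file
... )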