polars.scan_csv
polars.scan_csv(
    source: str | Path | IO[str] | IO[bytes] | bytes | list[str] | list[Path] | list[IO[str]] | list[IO[bytes]] | list[bytes],
    *,
    has_header: bool = True,
    separator: str = ',',
    comment_prefix: str | None = None,
    quote_char: str | None = '"',
    skip_rows: int = 0,
    skip_lines: int = 0,
    schema: SchemaDict | None = None,
    schema_overrides: SchemaDict | Sequence[PolarsDataType] | None = None,
    null_values: str | Sequence[str] | dict[str, str] | None = None,
    missing_utf8_is_empty_string: bool = False,
    ignore_errors: bool = False,
    cache: bool = True,
    with_column_names: Callable[[list[str]], list[str]] | None = None,
    infer_schema: bool = True,
    infer_schema_length: int | None = 100,
    n_rows: int | None = None,
    encoding: CsvEncoding = 'utf8',
    low_memory: bool = False,
    rechunk: bool = False,
    skip_rows_after_header: int = 0,
    row_index_name: str | None = None,
    row_index_offset: int = 0,
    try_parse_dates: bool = False,
    eol_char: str = '\n',
    new_columns: Sequence[str] | None = None,
    raise_if_empty: bool = True,
    truncate_ragged_lines: bool = False,
    decimal_comma: bool = False,
    glob: bool = True,
    storage_options: dict[str, Any] | None = None,
    credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto',
    retries: int = 2,
    file_cache_ttl: int | None = None,
    include_file_paths: str | None = None,
) -> LazyFrame
Lazily read from a CSV file or multiple files via glob patterns.
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
Changed in version 0.20.31: The dtypes parameter was renamed schema_overrides.
Changed in version 0.20.4:
* The row_count_name parameter was renamed row_index_name.
* The row_count_offset parameter was renamed row_index_offset.
- Parameters:
- source
Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter.
- has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.
- separator
Single byte character to use as separator in the file.
- comment_prefix
A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.
- quote_char
Single byte character used for csv quoting, default = ". Set to None to turn off special handling and escaping of quotes.
- skip_rows
Start reading after skip_rows rows. The header will be parsed at this offset. Note that we respect CSV escaping/comments when skipping rows. If you want to skip by newline char only, use skip_lines.
- skip_lines
Start reading after skip_lines lines. The header will be parsed at this offset. Note that CSV escaping will not be respected when skipping lines. If you want to skip valid CSV rows, use skip_rows.
- schema
Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema. Note that the order of the columns in the provided schema must match the order of the columns in the CSV being read.
- schema_overrides
Overwrite dtypes during inference; should be a {colname: dtype} dict or, if providing a list of strings to new_columns, a list of dtypes of the same length.
- null_values
Values to interpret as null values. You can provide a:
str: All values equal to this string will be null.
List[str]: All values equal to any string in this list will be null.
Dict[str, str]: A dictionary that maps column name to a null value string.
- missing_utf8_is_empty_string
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string you can set this param True.
- ignore_errors
Try to keep reading lines if some lines yield errors. First try infer_schema=False to read all columns as pl.String to check which values might cause an issue.
- cache
Cache the result after reading.
- with_column_names
Apply a function over the column names just in time (when they are determined); this function will receive (and should return) a list of column names.
- infer_schema
When True, the schema is inferred from the data using the first infer_schema_length rows. When False, the schema is not inferred and will be pl.String if not specified in schema or schema_overrides.
- infer_schema_length
The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this is slow). Set infer_schema=False to read all columns as pl.String.
- n_rows
Stop reading from the CSV file after reading n_rows.
- encoding {‘utf8’, ‘utf8-lossy’}
Lossy means that invalid utf8 values are replaced with � characters. Defaults to “utf8”.
- low_memory
Reduce memory pressure at the expense of performance.
- rechunk
Reallocate to contiguous memory when all chunks/files are parsed.
- skip_rows_after_header
Skip this number of rows when the header is parsed.
- row_index_name
If not None, this will insert a row index column with the given name into the DataFrame.
- row_index_offset
Offset to start the row index column (only used if the name is set).
- try_parse_dates
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.String.
- eol_char
Single byte end of line character (default: \n). When encountering a file with windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.
- new_columns
Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). If the given list is shorter than the width of the DataFrame the remaining columns will have their original name.
- raise_if_empty
When there is no data in the source, NoDataError is raised. If this parameter is set to False, an empty LazyFrame (with no columns) is returned instead.
- truncate_ragged_lines
Truncate lines that are longer than the schema.
- decimal_comma
Parse floats using a comma as the decimal separator instead of a period.
- glob
Expand path given via globbing rules.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- file_cache_ttl
Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given.
- include_file_paths
Include the path of the source file(s) as a column with this name.
- Returns:
- LazyFrame
See also
read_csv
Read a CSV file into a DataFrame.
Examples
>>> import pathlib
>>>
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .head(100)  # constrain number of returned results to 100
... )
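The query above only builds a plan; nothing is read until it is collected. As an illustrative sketch (not part of the original example, and assuming my_long_file.csv exists with columns a and c), the optimized plan can be inspected and the result materialized:
>>> lf = (
...     pl.scan_csv("my_long_file.csv")
...     .select(["a", "c"])
...     .filter(pl.col("a") > 10)
...     .head(100)
... )
>>> print(lf.explain())  # optimized plan; the projection and predicate appear at the scan level
>>> out = lf.collect()  # executes the scan and returns a DataFrame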
We can use with_column_names to modify the header before scanning:
>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "hard", "to", "read"]}
... )
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).collect()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
│ 2       ┆ hard     │
│ 3       ┆ to       │
│ 4       ┆ read     │
└─────────┴──────────┘
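Several parsing options can also be combined in a single scan. The following is an illustrative sketch rather than part of the original examples; it reuses the path from above, and the dtypes noted in the comment are the expected inference result, not guaranteed output:
>>> pl.DataFrame(
...     {"day": ["2024-01-01", "2024-01-02", "n/a"], "qty": ["1", "n/a", "3"]}
... ).write_csv(path)
>>> out = pl.scan_csv(
...     path, null_values="n/a", try_parse_dates=True
... ).collect()
>>> out.schema  # expected: day parsed as Date, qty inferred as Int64, "n/a" read as null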
You can also simply replace column names (or provide them if the file has none) by passing a list of new column names to the new_columns parameter:
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path,
...     new_columns=["idx", "txt"],
...     schema_overrides=[pl.UInt16, pl.String],
... ).collect()
shape: (4, 2)
┌─────┬──────┐
│ idx ┆ txt  │
│ --- ┆ ---  │
│ u16 ┆ str  │
╞═════╪══════╡
│ 1   ┆ is   │
│ 2   ┆ hard │
│ 3   ┆ to   │
│ 4   ┆ read │
└─────┴──────┘
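Finally, scan_csv can fan out over many files at once. The sketches below are illustrative only: the directory, bucket name, and region are hypothetical, and the exact storage_options keys depend on your cloud provider:
>>> # Scan every CSV under a (hypothetical) local directory; glob=True expands the pattern
>>> lf = pl.scan_csv(
...     "data/*.csv",
...     include_file_paths="source_file",  # adds a column recording each row's file of origin
... )
>>>
>>> # Scan from cloud storage (hypothetical bucket); credentials may come from the
>>> # environment, storage_options, or a credential_provider callable
>>> lf = pl.scan_csv(
...     "s3://my-bucket/data/*.csv",
...     storage_options={"aws_region": "us-east-1"},
... )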