polars.scan_csv
- polars.scan_csv(
- source: str | Path | list[str] | list[Path],
- *,
- has_header: bool = True,
- separator: str = ',',
- comment_prefix: str | None = None,
- quote_char: str | None = '"',
- skip_rows: int = 0,
- dtypes: SchemaDict | Sequence[PolarsDataType] | None = None,
- schema: SchemaDict | None = None,
- null_values: str | Sequence[str] | dict[str, str] | None = None,
- missing_utf8_is_empty_string: bool = False,
- ignore_errors: bool = False,
- cache: bool = True,
- with_column_names: Callable[[list[str]], list[str]] | None = None,
- infer_schema_length: int | None = 100,
- n_rows: int | None = None,
- encoding: CsvEncoding = 'utf8',
- low_memory: bool = False,
- rechunk: bool = True,
- skip_rows_after_header: int = 0,
- row_count_name: str | None = None,
- row_count_offset: int = 0,
- try_parse_dates: bool = False,
- eol_char: str = '\n',
- new_columns: Sequence[str] | None = None,
- raise_if_empty: bool = True,
- truncate_ragged_lines: bool = False,
- ) -> LazyFrame
Lazily read from a CSV file or multiple files via glob patterns.
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
- Parameters:
- source
Path to a file, a list of paths, or a glob pattern matching multiple files.
- has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset starting at 1.
- separator
Single byte character to use as separator in the file.
- comment_prefix
A string, which can be up to 5 symbols in length, used to indicate the start of a comment line. For instance, it can be set to # or //.
- quote_char
Single byte character used for CSV quoting, default = ". Set to None to turn off special handling and escaping of quotes.
- skip_rows
Start reading after skip_rows lines. The header will be parsed at this offset.
- dtypes
Overwrite dtypes during inference; should be a {colname:dtype,} dict or, if providing a list of strings to new_columns, a list of dtypes of the same length.
- schema
Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas dtypes can be used to partially overwrite a schema.
- null_values
Values to interpret as null values. You can provide a:
str: All values equal to this string will be null.
List[str]: All values equal to any string in this list will be null.
Dict[str, str]: A dictionary that maps column name to a null value string.
- missing_utf8_is_empty_string
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.
- ignore_errors
Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.Utf8 to check which values might cause an issue.
- cache
Cache the result after reading.
- with_column_names
Apply a function over the column names just in time (when they are determined); this function will receive (and should return) a list of column names.
- infer_schema_length
Maximum number of lines to read to infer schema. If set to 0, all columns will be read as pl.Utf8. If set to None, a full table scan will be done (slow).
- n_rows
Stop reading from CSV file after reading n_rows.
- encoding {‘utf8’, ‘utf8-lossy’}
Lossy means that invalid utf8 values are replaced with � characters. Defaults to “utf8”.
- low_memory
Reduce memory pressure at the expense of performance.
- rechunk
Reallocate to contiguous memory when all chunks/files are parsed.
- skip_rows_after_header
Skip this number of rows when the header is parsed.
- row_count_name
If not None, this will insert a row count column with the given name into the DataFrame.
- row_count_offset
Offset to start the row_count column (only used if the name is set).
- try_parse_dates
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.Utf8.
- eol_char
Single byte end of line character (default: \n). When encountering a file with Windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.
- new_columns
Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). If the given list is shorter than the width of the DataFrame, the remaining columns will keep their original names.
- raise_if_empty
When there is no data in the source, `NoDataError` is raised. If this parameter is set to False, an empty LazyFrame (with no columns) is returned instead.
- truncate_ragged_lines
Truncate lines that are longer than the schema.
- Returns:
- LazyFrame
See also
read_csv
Read a CSV file into a DataFrame.
Examples
>>> import pathlib
>>> import polars as pl
>>>
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .fetch(100)  # pushed a limit of 100 rows to the scan level
... )
We can use with_column_names to modify the header before scanning:
>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "hard", "to", "read"]}
... )
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).collect()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
│ 2       ┆ hard     │
│ 3       ┆ to       │
│ 4       ┆ read     │
└─────────┴──────────┘
You can also simply replace column names (or provide them if the file has none) by passing a list of new column names to the new_columns parameter:
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path,
...     new_columns=["idx", "txt"],
...     dtypes=[pl.UInt16, pl.Utf8],
... ).collect()
shape: (4, 2)
┌─────┬──────┐
│ idx ┆ txt  │
│ --- ┆ ---  │
│ u16 ┆ str  │
╞═════╪══════╡
│ 1   ┆ is   │
│ 2   ┆ hard │
│ 3   ┆ to   │
│ 4   ┆ read │
└─────┴──────┘