polars.read_csv
- polars.read_csv(
- source: str | Path | IO[str] | IO[bytes] | bytes,
- *,
- has_header: bool = True,
- columns: Sequence[int] | Sequence[str] | None = None,
- new_columns: Sequence[str] | None = None,
- separator: str = ',',
- comment_prefix: str | None = None,
- quote_char: str | None = '"',
- skip_rows: int = 0,
- schema: SchemaDict | None = None,
- schema_overrides: Mapping[str, PolarsDataType] | Sequence[PolarsDataType] | None = None,
- null_values: str | Sequence[str] | dict[str, str] | None = None,
- missing_utf8_is_empty_string: bool = False,
- ignore_errors: bool = False,
- try_parse_dates: bool = False,
- n_threads: int | None = None,
- infer_schema: bool = True,
- infer_schema_length: int | None = 100,
- batch_size: int = 8192,
- n_rows: int | None = None,
- encoding: CsvEncoding | str = 'utf8',
- low_memory: bool = False,
- rechunk: bool = False,
- use_pyarrow: bool = False,
- storage_options: dict[str, Any] | None = None,
- skip_rows_after_header: int = 0,
- row_index_name: str | None = None,
- row_index_offset: int = 0,
- sample_size: int = 1024,
- eol_char: str = '\n',
- raise_if_empty: bool = True,
- truncate_ragged_lines: bool = False,
- decimal_comma: bool = False,
- glob: bool = True,
- ) -> DataFrame
Read a CSV file into a DataFrame.
- Parameters:
- source
Path to a file or a file-like object (by “file-like object” we refer to objects that have a read() method, such as a file handler like the builtin open function, or a BytesIO instance). If fsspec is installed, it will be used to open remote files. For file-like objects, the stream position may not be updated accordingly after reading.
- has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.
- columns
Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
- new_columns
Rename columns right after parsing the CSV file. If the given list is shorter than the width of the DataFrame the remaining columns will have their original name.
- separator
Single byte character to use as separator in the file.
- comment_prefix
A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.
- quote_char
Single byte character used for csv quoting, default = ". Set to None to turn off special handling and escaping of quotes.
- skip_rows
Start reading after skip_rows lines.
- schema
Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema.
- schema_overrides
Overwrite dtypes for specific or all columns during schema inference.
- null_values
Values to interpret as null values. You can provide a:
str: All values equal to this string will be null.
List[str]: All values equal to any string in this list will be null.
Dict[str, str]: A dictionary that maps column name to a null value string.
- missing_utf8_is_empty_string
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.
- ignore_errors
Try to keep reading lines if some lines yield errors. Before using this option, try to increase the number of lines used for schema inference with e.g. infer_schema_length=10000, or override automatic dtype inference for specific columns with the schema_overrides option, or use infer_schema=False to read all columns as pl.String to check which values might cause an issue.
- try_parse_dates
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.String. If use_pyarrow=True, dates will always be parsed.
- n_threads
Number of threads to use in csv parsing. Defaults to the number of physical CPUs of your system.
- infer_schema
When True, the schema is inferred from the data using the first infer_schema_length rows. When False, the schema is not inferred and will be pl.String if not specified in schema or schema_overrides.
- infer_schema_length
The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this is slow). Set infer_schema=False to read all columns as pl.String.
- batch_size
Number of lines to read into the buffer at once. Modify this to change performance.
- n_rows
Stop reading from CSV file after reading n_rows. During multi-threaded parsing, an upper bound of n_rows rows cannot be guaranteed.
- encoding {‘utf8’, ‘utf8-lossy’, …}
Lossy means that invalid utf8 values are replaced with � characters. When using other encodings than utf8 or utf8-lossy, the input is first decoded in memory with python. Defaults to utf8.
- low_memory
Reduce memory pressure at the expense of performance.
- rechunk
Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.
- use_pyarrow
Try to use pyarrow’s native CSV parser. This is not always possible: the set of arguments given to this function determines whether pyarrow’s native parser can be used. When it is used, dates are always parsed, even if try_parse_dates=False. Note that pyarrow and polars may have a different strategy regarding type inference.
- storage_options
Extra options that make sense for fsspec.open() or a particular storage connection, e.g. host, port, username, password, etc.
- skip_rows_after_header
Skip this number of rows after the header is parsed.
- row_index_name
Insert a row index column with the given name into the DataFrame as the first column. If set to None (default), no row index column is created.
- row_index_offset
Start the row index at this offset. Cannot be negative. Only used if row_index_name is set.
- sample_size
Set the sample size. This is used to sample statistics to estimate the allocation needed.
- eol_char
Single byte end of line character (default: \n). When encountering a file with windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.
- raise_if_empty
When there is no data in the source, NoDataError is raised. If this parameter is set to False, an empty DataFrame (with no columns) is returned instead.
- truncate_ragged_lines
Truncate lines that are longer than the schema.
- decimal_comma
Parse floats using a comma as the decimal separator instead of a period.
- glob
Expand path given via globbing rules.
- Returns:
- DataFrame
See also
scan_csv
Lazily read from a CSV file or multiple files via glob patterns.
Notes
If the schema is inferred incorrectly (e.g. as pl.Int64 instead of pl.Float64), try to increase the number of lines used to infer the schema with infer_schema_length or override the inferred dtype for those columns with schema_overrides.
If rechunk=True is passed, a rechunk operation is performed at the end, meaning that all data will be stored contiguously in memory. A rechunk is an expensive operation, so keep rechunk=False (the default in the signature above) if you are benchmarking the csv-reader.
Examples
>>> pl.read_csv("data.csv", separator="|")
Demonstrate use against a BytesIO object, parsing string dates.
>>> from io import BytesIO
>>> data = BytesIO(
...     b"ID,Name,Birthday\n"
...     b"1,Alice,1995-07-12\n"
...     b"2,Bob,1990-09-20\n"
...     b"3,Charlie,2002-03-08\n"
... )
>>> pl.read_csv(data, try_parse_dates=True)
shape: (3, 3)
┌─────┬─────────┬────────────┐
│ ID  ┆ Name    ┆ Birthday   │
│ --- ┆ ---     ┆ ---        │
│ i64 ┆ str     ┆ date       │
╞═════╪═════════╪════════════╡
│ 1   ┆ Alice   ┆ 1995-07-12 │
│ 2   ┆ Bob     ┆ 1990-09-20 │
│ 3   ┆ Charlie ┆ 2002-03-08 │
└─────┴─────────┴────────────┘