polars.scan_ndjson
- polars.scan_ndjson(
- source: str | Path | list[str] | list[Path],
- *,
- schema: SchemaDefinition | None = None,
- infer_schema_length: int | None = 100,
- batch_size: int | None = 1024,
- n_rows: int | None = None,
- low_memory: bool = False,
- rechunk: bool = False,
- row_index_name: str | None = None,
- row_index_offset: int = 0,
- ignore_errors: bool = False,
Lazily read from a newline-delimited JSON file, or from multiple files via glob patterns.
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
- Parameters:
- source
Path to a file, or a list of paths. Glob patterns may be used to match multiple files.
- schema: Sequence of str, (str, DataType) pairs, or a {str: DataType} dict
The DataFrame schema may be declared in several ways:
As a dict of {name:type} pairs; if type is None, it will be auto-inferred.
As a list of column names; in this case types are automatically inferred.
As a list of (name,type) pairs; this is equivalent to the dictionary form.
If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.
- infer_schema_length
The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this is slow).
- batch_size
Number of rows to read in each batch.
- n_rows
Stop reading from the JSON file after reading n_rows rows.
- low_memory
Reduce memory pressure at the expense of performance.
- rechunk
Reallocate to contiguous memory when all chunks/files are parsed.
- row_index_name
If not None, insert a row index column with the given name into the DataFrame.
- row_index_offset
Offset at which to start the row index column (only used if the name is set).
- ignore_errors
Return Null if parsing fails because of schema mismatches.