polars.scan_ndjson

polars.scan_ndjson(
source: str | Path | IO[str] | IO[bytes] | bytes | list[str] | list[Path] | list[IO[str]] | list[IO[bytes]],
*,
schema: SchemaDefinition | None = None,
schema_overrides: SchemaDefinition | None = None,
infer_schema_length: int | None = 100,
batch_size: int | None = 1024,
n_rows: int | None = None,
low_memory: bool = False,
rechunk: bool = False,
row_index_name: str | None = None,
row_index_offset: int = 0,
ignore_errors: bool = False,
storage_options: dict[str, Any] | None = None,
credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto',
retries: int = 2,
file_cache_ttl: int | None = None,
include_file_paths: str | None = None,
) -> LazyFrame

Lazily read from a newline-delimited JSON file, or from multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.

Parameters:
source

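Path to a file, or a glob pattern matching multiple files. A list of paths, a file-like object, or raw bytes may also be given.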

schema : Sequence of str, (str, DataType) pairs, or a {str: DataType} dict

The DataFrame schema may be declared in several ways:

  • As a dict of {name:type} pairs; if type is None, it will be auto-inferred.

  • As a list of column names; in this case types are automatically inferred.

  • As a list of (name,type) pairs; this is equivalent to the dictionary form.

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

schema_overrides : dict, default None

Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.

infer_schema_length

The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this is slow).

batch_size

Number of rows to read in each batch.

n_rows

Stop reading from the JSON file after n_rows rows have been read.

low_memory

Reduce memory pressure at the expense of performance.

rechunk

Reallocate to contiguous memory when all chunks/files are parsed.

row_index_name

If not None, insert a row index column with the given name into the DataFrame.

row_index_offset

Offset at which the row index column starts (only used if row_index_name is set).

ignore_errors

Return Null if parsing fails because of schema mismatches.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

  • aws

  • gcp

  • azure

  • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

retries

Number of retries if accessing a cloud instance fails.

file_cache_ttl

Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given.

include_file_paths

Include the path of the source file(s) as a column with this name.