pyspark.sql.DataFrameReader.json¶

DataFrameReader.json(path, schema=None, primitivesAsString=None, prefersDecimal=None, allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None, allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None, mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None, multiLine=None, allowUnquotedControlChars=None, lineSep=None, samplingRatio=None, dropFieldIfAllNull=None, encoding=None, locale=None, pathGlobFilter=None, recursiveFileLookup=None, allowNonNumericNumbers=None, modifiedBefore=None, modifiedAfter=None)[source]¶

Loads JSON files and returns the results as a DataFrame.

JSON Lines (newline-delimited JSON) is supported by default. For JSON (one record per file), set the multiLine parameter to true.

If the schema parameter is not specified, this function goes through the input once to determine the input schema.

New in version 1.4.0.
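For instance, a minimal sketch of both input styles (the file paths below are hypothetical, not part of the Spark test suite):

>>> # JSON Lines: one JSON object per line (the default)
>>> df = spark.read.json('data/people.jsonl')
>>> # Multi-line ("pretty-printed") JSON: one record per file
>>> df_pretty = spark.read.json('data/people_pretty.json', multiLine=True)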
Parameters:

- path : str, list or RDD
  string represents path to the JSON dataset, or a list of paths, or an RDD of Strings storing JSON objects.
- schema : pyspark.sql.types.StructType or str, optional
  an optional pyspark.sql.types.StructType for the input schema or a DDL-formatted string (for example, col0 INT, col1 DOUBLE).
- primitivesAsString : str or bool, optional
  infers all primitive values as a string type. If None is set, it uses the default value, false.
- prefersDecimal : str or bool, optional
  infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. If None is set, it uses the default value, false.
- allowComments : str or bool, optional
  ignores Java/C++ style comments in JSON records. If None is set, it uses the default value, false.
- allowUnquotedFieldNames : str or bool, optional
  allows unquoted JSON field names. If None is set, it uses the default value, false.
- allowSingleQuotes : str or bool, optional
  allows single quotes in addition to double quotes. If None is set, it uses the default value, true.
- allowNumericLeadingZero : str or bool, optional
  allows leading zeros in numbers (e.g. 00012). If None is set, it uses the default value, false.
- allowBackslashEscapingAnyCharacter : str or bool, optional
  allows accepting quoting of all characters using the backslash quoting mechanism. If None is set, it uses the default value, false.
- mode : str, optional
  allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, PERMISSIVE. See the sketch after this parameter list.
  - PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema.
  - DROPMALFORMED: ignores corrupted records entirely.
  - FAILFAST: throws an exception when it meets corrupted records.
- columnNameOfCorruptRecord : str, optional
  allows renaming the field that holds the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. If None is set, it uses the value specified in spark.sql.columnNameOfCorruptRecord.
- dateFormat : str, optional
  sets the string that indicates a date format. Custom date formats follow the formats at datetime pattern. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.
- timestampFormat : str, optional
  sets the string that indicates a timestamp format. Custom date formats follow the formats at datetime pattern. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX].
- multiLine : str or bool, optional
  parse one record, which may span multiple lines, per file. If None is set, it uses the default value, false.
- allowUnquotedControlChars : str or bool, optional
  allows JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not.
- encoding : str or bool, optional
  allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. If None is set, the encoding of input JSON will be detected automatically when the multiLine option is set to true.
- lineSep : str, optional
  defines the line separator that should be used for parsing. If None is set, it covers all \r, \r\n and \n.
- samplingRatio : str or float, optional
  defines the fraction of input JSON objects used for schema inferring. If None is set, it uses the default value, 1.0.
- dropFieldIfAllNull : str or bool, optional
  whether to ignore columns of all null values or empty arrays/structs during schema inference. If None is set, it uses the default value, false.
- locale : str, optional
  sets a locale as a language tag in IETF BCP 47 format. If None is set, it uses the default value, en-US. For instance, locale is used while parsing dates and timestamps.
- pathGlobFilter : str or bool, optional
  an optional glob pattern to only include files with paths matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.
- recursiveFileLookup : str or bool, optional
  recursively scan a directory for files. Using this option disables partition discovery.
- allowNonNumericNumbers : str or bool
  allows the JSON parser to recognize a set of "Not-a-Number" (NaN) tokens as legal floating point number values. If None is set, it uses the default value, true.
  - +INF: positive infinity, as well as the aliases +Infinity and Infinity.
  - -INF: negative infinity, alias -Infinity.
  - NaN: other not-a-numbers, like the result of division by zero.
- modifiedBefore : an optional timestamp to only include files with modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).
- modifiedAfter : an optional timestamp to only include files with modification times occurring after the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).
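As referenced under the mode parameter above, a minimal sketch of PERMISSIVE parsing that keeps corrupt records (the file path and schema are hypothetical; _corrupt_record is the usual default of spark.sql.columnNameOfCorruptRecord):

>>> schema = 'age BIGINT, name STRING, _corrupt_record STRING'
>>> df = spark.read.json('data/mixed.jsonl', schema=schema,
...                      mode='PERMISSIVE',
...                      columnNameOfCorruptRecord='_corrupt_record')
>>> # Well-formed rows parse normally; malformed lines come back with
>>> # age and name set to null and the raw text in _corrupt_record.

Under DROPMALFORMED the malformed lines would be dropped instead, and under FAILFAST the query would fail on the first corrupted record.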
Examples
>>> df1 = spark.read.json('python/test_support/sql/people.json')
>>> df1.dtypes
[('age', 'bigint'), ('name', 'string')]
>>> rdd = sc.textFile('python/test_support/sql/people.json')
>>> df2 = spark.read.json(rdd)
>>> df2.dtypes
[('age', 'bigint'), ('name', 'string')]
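Since schema also accepts a DDL-formatted string, here is a sketch that skips the schema-inference pass over the same test file (the expected dtypes assume the age/name layout shown above):

>>> df3 = spark.read.json('python/test_support/sql/people.json',
...                       schema='age BIGINT, name STRING')
>>> df3.dtypes
[('age', 'bigint'), ('name', 'string')]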