JsonFileFormat — Built-In Support for Files in JSON Format

JsonFileFormat is a TextBasedFileFormat for the json format, i.e. it registers itself to handle files in JSON format and converts them to Spark SQL rows.

spark.read.format("json").load("json-datasets")

// or the same as above using a shortcut
spark.read.json("json-datasets")

JsonFileFormat comes with options to further customize JSON parsing.

Note
JsonFileFormat uses Jackson 2.6.7 as the JSON parser library and some options map directly to Jackson’s internal options (as JsonParser.Feature).
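
As a minimal sketch of how such options are passed, the snippet below reads a record that uses a comment and an unquoted field name, both of which are rejected by default. It assumes Spark 2.2 or later (for reading JSON from a Dataset[String]) and a local session; the application name and the sample record are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch (in spark-shell, getOrCreate returns the existing session)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-options-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// One JSON record per element: it contains a comment and an unquoted field name
val relaxedJson = Seq("""{"id": 1, /* inline comment */ name: "spark"}""").toDS

val relaxed = spark.read
  .option("allowComments", "true")           // maps to ALLOW_COMMENTS
  .option("allowUnquotedFieldNames", "true") // maps to ALLOW_UNQUOTED_FIELD_NAMES
  .json(relaxedJson)

relaxed.printSchema()
relaxed.show()
```

Options are given as strings through DataFrameReader.option, so the same calls work with spark.read.format("json").load(...) for files.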

Table 1. JsonFileFormat’s Options
Option | Default Value | Description

allowBackslashEscapingAnyCharacter

false

Note
Internally, allowBackslashEscapingAnyCharacter becomes JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER.

allowComments

false

Note
Internally, allowComments becomes JsonParser.Feature.ALLOW_COMMENTS.

allowNonNumericNumbers

true

Note
Internally, allowNonNumericNumbers becomes JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS.

allowNumericLeadingZeros

false

Note
Internally, allowNumericLeadingZeros becomes JsonParser.Feature.ALLOW_NUMERIC_LEADING_ZEROS.

allowSingleQuotes

true

Note
Internally, allowSingleQuotes becomes JsonParser.Feature.ALLOW_SINGLE_QUOTES.

allowUnquotedControlChars

false

Note
Internally, allowUnquotedControlChars becomes JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS.

allowUnquotedFieldNames

false

Note
Internally, allowUnquotedFieldNames becomes JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES.

columnNameOfCorruptRecord

compression

Compression codec that can be either one of the known aliases or a fully-qualified class name.

dateFormat

yyyy-MM-dd

Pattern used to parse and format date values

Note
Internally, dateFormat is converted to Apache Commons Lang’s FastDateFormat.

multiLine

false

Controls whether…​FIXME

mode

PERMISSIVE

Case-insensitive name of the parse mode, one of:

  • PERMISSIVE

  • DROPMALFORMED

  • FAILFAST

prefersDecimal

false

primitivesAsString

false

samplingRatio

1.0

timestampFormat

yyyy-MM-dd'T'HH:mm:ss.SSSXXX

Pattern used to parse and format timestamp values

Note
Internally, timestampFormat is converted to Apache Commons Lang’s FastDateFormat.

timeZone

Time zone ID that is resolved to Java’s TimeZone
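
The parse modes and columnNameOfCorruptRecord from the table above can be compared side by side. The following is a sketch under assumptions: Spark 2.2 or later, a local session, and made-up sample records with a hypothetical corrupt-record column name, `_bad`.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-modes-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Two well-formed records and one malformed record (unterminated object)
val records = Seq("""{"id": 1}""", """{"id": 2}""", """{"id": """).toDS

// PERMISSIVE keeps the malformed record and stores its raw text
// in the column named by columnNameOfCorruptRecord (mode names are case-insensitive)
val permissive = spark.read
  .option("mode", "permissive")
  .option("columnNameOfCorruptRecord", "_bad") // hypothetical column name
  .json(records)

// DROPMALFORMED drops the malformed record
val dropped = spark.read
  .option("mode", "DROPMALFORMED")
  .json(records)

// FAILFAST fails as soon as the malformed record is met
val failFast = Try {
  spark.read.option("mode", "FAILFAST").json(records).collect()
}

permissive.show(truncate = false)
dropped.show()
println(s"FAILFAST succeeded? ${failFast.isSuccess}")
```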

isSplitable Method

isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean
Note
isSplitable is part of the FileFormat Contract.

isSplitable…​FIXME

inferSchema Method

inferSchema(
  sparkSession: SparkSession,
  options: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]
Note
inferSchema is part of the FileFormat Contract.

inferSchema…​FIXME
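
From the caller's side, schema inference is what happens when JSON is read without a user-specified schema. The sketch below does not describe the internals of inferSchema; it only shows, through the public DataFrameReader API, how the primitivesAsString and prefersDecimal options change the inferred types. It assumes Spark 2.2 or later and a local session; the sample record is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-infer-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

val sample = Seq("""{"id": 1, "price": 9.99}""").toDS

// Default inference: integral numbers become long, floating-point numbers become double
val inferred = spark.read.json(sample)

// primitivesAsString: all primitive values are inferred as strings
val asStrings = spark.read.option("primitivesAsString", "true").json(sample)

// prefersDecimal: floating-point values are inferred as decimals when they fit
val asDecimal = spark.read.option("prefersDecimal", "true").json(sample)

inferred.printSchema()
asStrings.printSchema()
asDecimal.printSchema()
```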

Building Partitioned Data Reader — buildReader Method

buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow]
Note
buildReader is part of the FileFormat Contract to build a PartitionedFile reader.

buildReader…​FIXME

Preparing Write Job — prepareWrite Method

prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory
Note
prepareWrite is part of the FileFormat Contract to prepare a write job.

prepareWrite…​FIXME
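
On the user-facing side, a JSON write goes through DataFrameWriter, where the compression option from the options table applies. The sketch below writes a small DataFrame as gzip-compressed JSON and reads it back; it does not describe what prepareWrite does internally. It assumes a local session and a writable temporary directory; the directory name and the data are made up.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-write-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Target directory must not exist yet, so resolve a fresh child of a temp directory
val out = Files.createTempDirectory("json-write-sketch").resolve("out").toString

Seq((1, "a"), (2, "b")).toDF("id", "name")
  .write
  .option("compression", "gzip") // one of the known codec aliases
  .json(out)

// Part files carry the codec's extension
val partFiles = new File(out).listFiles.map(_.getName).filter(_.startsWith("part-"))
partFiles.foreach(println)

// Compressed JSON files are read back without extra options
val roundTrip = spark.read.json(out)
roundTrip.show()
```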
