json2er

This module is used to generate Statistics from Source Data (JSON), to create a Data Flow from the Source Data, and to execute a Task that converts the Source Data into a Target Format.
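
For example, a statistics-only run, a run against an existing Schema Origin, and the execution of a previously defined Job might look like the following (the input paths, IDs and output locations are placeholders; see the full parameter list below):

$ json2er -e s /data/input/json/
$ json2er -x 1 -f orc -o /data/output/ /data/input/json/
$ json2er -j 100 -o /data/output/ /data/input/json/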

Parameters

To list all the supported json2er parameters, run the following command:

$ json2er --help

All the available parameters are listed below:

Usage: json2er [Options] [INPUT...]


  INPUT                           Files or directories input location path

  Options:

  -h, --help

  --version

  # schema

  -x, --schema-origin ID          Schema Origin ID

  --schema-name                   Schema name

  -j, --job ID                    Job ID


  # spark-submit

  --master MASTER_URL             spark://host:port, mesos://host:port, yarn, or local.

  --name NAME                     The name of your application.

  --jars JARS                     Comma-separated list of local jars to include on the driver

  --conf PROP=VALUE               Arbitrary Spark configuration property.

  -b, --blocksize BLOCKSIZE       Block Size, eg. 1024kb, 64mb, 128mb...

  -D, --warehouse-dir DIRECTORY   Spark SQL warehouse directory

  -k, --cache CACHE               Spark's storage level. Full or shortcut names
                                     - N, NONE
                                     - M, MEMORY_ONLY
                                     - M2, MEMORY_ONLY_2
                                     - MD, MEMORY_AND_DISK
                                     - MS, MEMORY_ONLY_SER
                                     - MD2, MEMORY_AND_DISK_2
                                     - MS2, MEMORY_ONLY_SER_2
                                     - MDS, MEMORY_AND_DISK_SER
                                     - MDS2, MEMORY_AND_DISK_SER_2
                                     - D, DISK_ONLY
                                     - D2, DISK_ONLY_2
                                     - O, OFF_HEAP
                                     - other...
                                  default: M

  -K, --checkpoint-dir DIRECTORY  Spark Streaming checkpoint directory

  --metrics DURATION              Enable metrics printing at the defined interval: 1h, 1m, 1s...


  # input data

  -Z, --stream DURATION           Stream mode duration. 1h, 1m, 1s, 1ms..

  -U, --in-user USER              Input user

  -P, --in-password PASSWORD      Input password

  --in-token TOKEN                Session token of input S3 bucket

  -F, --in-format FORMAT          Input format.
                                  - jdbc, parquet, orc, json, csv, tsv, psv, avro...

  -T, --table TABLE|QUERY         Name of the table (or specified SQL) to return tabular data with JSON
                                  content in one of the columns

  -C, --column NAME|EXPR [AS ALIAS]
                                  Name of the column containing JSON data to process

  -I, --id-column <NAME|EXPR>[:DATATYPE] [EXTRA] [AS ALIAS]
                                  Name of identifier column to contain document's ID values for tabular
                                  files, jdbc or hive tables

  -<MIN>, --id-min MIN            ID value specifying lower bound data range to be processed

  +<MAX>, --id-max MAX            ID value specifying upper bound data range to be processed

  --where WHERE                   Where clause for filtering tabular files, jdbc or hive tables

  --limit LIMIT                   Limit clause for filtering the number of records/documents

  -X, --in-part PARTITIONS        Number of partitions for reading data

  -n, --new-since DURATION        Records are considered new since X ago

  -a, --use-stats ID,...          Use the statistics to generate the new schema

  --archive-read                  Read from archive files. zip, tar, jar, cpio, 7z...
                                  default: true

  --byte-stream                   Enforce reading as stream of bytes (2.1GB limit per file)

  --filter-content                Filter content, wrong bytes and charset detection
                                  default: true

  --detect-type                   Ignores non-JSON files based on detected media type
                                  default: true

  --in-opt PROP=VALUE,...         Extra options for Spark dataFrame reader, ex: prop1=value1,prop2=value2


  # metadata

  -m, --meta URL                  JDBC location of metadata store
                                  default: jdbc:postgresql:x2er

  -M, --meta-user USER            Metadata user
                                  default: flex2er

  -w, --meta-password PASSWORD    Metadata password

  -g, --map MAPPING               Mapping generation and optimization levels:
                                    - "": Disabled
                                    - 0: No optimized mapping
                                    - 1: Elevate optimization (1=1)
                                  default: disabled

  --name-max-len SIZE             Maximum column name length for mapping generation
                                  default: 30

  # output data

  -o, --out OUTPUT                Output location path

  -u, --out-user USER             Output user

  -p, --out-password PASSWORD     Output password

  --out-token TOKEN               Session token of output S3 bucket

  -B, --batchsize BATCHSIZE       Batch size to write into databases
                                  default: 1000

  -f, --out-format FORMAT         Output format:
                                  - jdbc, parquet, orc, json, csv, tsv, psv, avro...
                                  default: orc

  -z, --compression COMPRESSION   Compression mode for filesystem output
                                  - none, snappy, gzip, bzip2, deflate, lzo, lz4, zlib, xz...
                                  default: snappy

  -S, --savemode SAVEMODE         Save Mode when target table, directory or file exists
                                  ex: [e]rror, [a]ppend, [o]verwrite, [i]gnore, [p]rintschema, [w]riteschema
                                  default: append

  -Y, --out-part PARTITIONS       Number of partitions for writing data

  -V, --hive-create               Enable creating hive tables

  -E, --out-schema SCHEMA         Create hive or jdbc tables in the given schema

  -r, --out-prefix PREFIX         Prepend a prefix to each output table name

  -y, --out-suffix SUFFIX         Append a suffix to each output table name

  -N, --unified-fks               Unified FKs in composite columns (applies to "reuse" optimization)

  --reset-pks                     Reset primary keys sequentially, starting from 1

  --sequence-type                 Sequential data type defined in ANSI SQL: decimal(38,0), bigint, varchar(36),...
  
  --default-varchar-len LENGTH    Length of VARCHAR/CLOB datatype for mapping generation where not explicitly
                                  defined by the schema

  --default-num-type native|max|decimal[(PRECISION,SCALE)]|PRECISION,SCALE
                                  Definition of the default numeric type, options:
                                  - native: Keep the numeric types optimized to native types like TINYINT, REAL,
                                    DOUBLE PRECISION
                                  - max: Enforce the maximum precision accepted by Spark, DECIMAL(38,18)
                                  - decimal: Enforce using the decimal type instead of native ones, with regard to
                                    their precision and scale
                                  - decimal(PRECISION[,SCALE]): Enforce using a particular decimal type with minimum
                                    precision and scale
                                  - PRECISION[,SCALE]: same as above
                                  - decimal(+PRECISION[,+SCALE]): Increase the precision and/or scale by the given
                                    precision and scale margins
                                  - +PRECISION[,+SCALE]: same as above

  --default-float-type native|max|decimal[(PRECISION,SCALE)]|PRECISION,SCALE
                                  The same as --default-num-type, limited only for float numbers

  --default-int-type native|max|decimal[(PRECISION,SCALE)]|PRECISION,SCALE
                                  The same as --default-num-type, limited only for integer numbers

  -R, --extra-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|ID|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
                                  Extra column(s) to be included in all output tables

  -i, --partition-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|ID|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
                                  Partition column(s) to be included in all output tables

  --rename [IDX=|NAME=]NEW_NAME,* Rename a table or multiple tables by index (1,2,3...) or by its original name

  --namemode MODE                 Change the table/column original names to:
                                    - "": Keep original names
                                    - [l]ower: to lower case.
                                    - [u]pper: to upper case.
                                    - [c]amelcase: Remove separators and keep words separated by case.

  --out-opt PROP=VALUE,...        Extra options for Spark dataFrame writer, ex: prop1=value1,prop2=value2

  --tables-at-once <N>            Number of tables to be written into target store simultaneously
  
  --remap-tables                  Re-map the column order based on what is found in all tables
                              
  --remap-table TABLE,...         Re-map the column order based on what is found in the specified table(s)

  # actions

  -e, --parsemode MODE            Mode of parsing.
                                    - with doc stats:    [a]ll, [d]ata, [s]tats
                                    - without doc stats: [A]ll, [D]ata, [S]tats
                                  default: all

  -c, --commands                  Show SQL commands

  -s, --skip                      Skip writing results
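
As an illustration only, a fuller run could be assembled from the options above: submitting to YARN, pointing at a non-default metadata store, and writing snappy-compressed ORC with overwrite semantics. All connection details and paths below are placeholders:

$ json2er \
    --master yarn \
    -m jdbc:postgresql://metahost:5432/x2er -M flex2er -w secret \
    -x 1 \
    -f orc -z snappy -S overwrite -Y 8 \
    -o hdfs:///data/out/orders \
    /data/in/orders/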