xml2er

This module is used to generate Statistics from Source Data (XML), to create a Data Flow from it, and to execute a Task that converts it.
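
For instance, these three uses typically map to invocations like the following (a minimal sketch; the input path, mapping level, Schema Origin ID, and output path are illustrative placeholders only):

# collect Statistics only, skipping the write of results
$ xml2er -s /landing/xml

# collect Statistics and generate a mapping (Data Flow), still skipping the write
$ xml2er -s -g 1 /landing/xml

# execute the conversion Task against a previously created Schema Origin
$ xml2er -x 1 /landing/xml -o /data/out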

Parameters

To get all the supported xml2er parameters, the following command can be used:

$ xml2er --help

All the compatible parameters that can be used are listed below:

Usage: xml2er [Options] [INPUT...]

INPUT                           Files or directories input location path
                                

Options:

-h, --help

--version

# schema

-x, --schema-origin ID          Schema Origin ID
                                
--schema-name NAME              Schema name
                                
-j, --job ID                    Job ID
                                

# spark-submit

--master MASTER_URL             spark://host:port, mesos://host:port, yarn, or local.
                                
--name NAME                     The name of your application.
                                
--jars JARS                     Comma-separated list of local jars to include on the driver

--files FILES                   Comma-separated list of local files to include on the driver

--conf PROP=VALUE               Arbitrary Spark configuration property.

-b, --blocksize BLOCKSIZE       Block Size, eg. 1024kb, 64mb, 128mb...
                                
-D, --warehouse-dir DIRECTORY   Spark SQL warehouse directory
                                
-k, --cache CACHE               Spark's storage level. Full or shortcut names
                                   - N, NONE
                                   - M, MEMORY_ONLY
                                   - M2, MEMORY_ONLY_2
                                   - MD, MEMORY_AND_DISK
                                   - MS, MEMORY_ONLY_SER
                                   - MD2, MEMORY_AND_DISK_2
                                   - MS2, MEMORY_ONLY_SER_2
                                   - MDS, MEMORY_AND_DISK_SER
                                   - MDS2, MEMORY_AND_DISK_SER_2
                                   - D, DISK_ONLY
                                   - D2, DISK_ONLY_2
                                   - O, OFF_HEAP
                                   - other...
                                default: M

-K, --checkpoint-dir DIRECTORY  Spark Streaming checkpoint directory
                                
--metrics DURATION              Enable metrics printing at the defined interval: 1h, 1m, 1s...
                                

# input data

-Z, --stream DURATION           Stream mode duration. 1h, 1m, 1s, 1ms..
                                
-U, --in-user USER              Input user
                                
-P, --in-password PASSWORD      Input password

--in-token TOKEN                Session token of input S3 bucket
                                
-F, --in-format FORMAT          Input format.
                                - jdbc, parquet, orc, json, csv, tsv, psv, avro...
                                
-T, --table TABLE|QUERY         Name of the table (or specified SQL) to return tabular data with XML
                                content in one of the columns
                                
-C, --column NAME|EXPR [AS ALIAS]
                                Name of the column containing XML data to process
                                
-I, --id-column <NAME|EXPR>[:DATATYPE] [EXTRA] [AS ALIAS]
                                Name of identifier column to contain document's ID values for tabular
                                files, jdbc or hive tables
                                
-<MIN>, --id-min MIN            ID value specifying lower bound data range to be processed
                                
+<MAX>, --id-max MAX            ID value specifying upper bound data range to be processed
                                
--where WHERE                   Where clause for filtering tabular files, jdbc or hive tables
                                
--limit LIMIT                   Limit clause for filtering the number of records/documents
                                
-X, --in-part PARTITIONS        Number of partitions for reading data
                                
-n, --new-since DURATION        Records are considered new if they arrived within the given duration
                                
-a, --use-stats ID,...          Use the statistics to generate the new schema
                                
--parse-lib LIB                 The alias or class specification for the SAX2 parser factory class
                                  - "": default
                                  - default: Java JDK provided SAX Parser
                                  - piccolo: Piccolo SAX Parser
                                  - oracle: Oracle SAX Parser
                                  - xerces: Apache Xerces-J SAX Parser
                                  - gnu: GNU Ælfred2 SAX Parser
                                  - package.to.ClassName: Other SAX2 Parser in the Classpath
                                
--archive-read                  Read from archive files. zip, tar, jar, cpio, 7z...
                                default: true

--byte-stream                   Enforce reading as stream of bytes (2.1GB limit per file)
                                
--filter-content                Filter content, wrong bytes and charset detection
                                default: true

--detect-type                   Ignores non-XML files based on detected media type
                                default: true

--ignore-mixed-content          Ignores tags whose content is a mix of text and tags, as in HTML or
                                formatted text.
                                
--in-opt PROP=VALUE,...         Extra options for Spark dataFrame reader, ex: prop1=value1,prop2=value2
                                

# metadata

-m, --meta URL                  JDBC location of metadata store
                                default: jdbc:postgresql:x2er

-M, --meta-user USER            Metadata user
                                default: flex2er

-w, --meta-password PASSWORD    Metadata password

-g, --map MAPPING               Mapping generation and optimization levels:
                                  - "": Disabled
                                  - 0: No optimized mapping
                                  - 1: Elevate optimization (1=1)
                                  - 2: Reference optimization (type="", ref="")
                                  - 3: Elevate + Reference optimization

--name-max-len SIZE             Maximum column size for mapping generation
                                default: 30


# output data

-o, --out OUTPUT                Output location path
                                
-u, --out-user USER             Output user
                                
-p, --out-password PASSWORD     Output password

--out-token TOKEN               Session token of output S3 bucket
                                

-B, --batchsize BATCHSIZE       Batch size to write into databases
                                default: 1000

-f, --out-format FORMAT         Output format:
                                - jdbc, parquet, orc, json, csv, tsv, psv, avro...
                                default: orc

-z, --compression COMPRESSION   Compression mode for filesystem output
                                - none, snappy, gzip, bzip2, deflate, lzo, lz4, zlib, xz...
                                default: snappy

-S, --savemode SAVEMODE         Save Mode when target table, directory or file exists
                                ex: [e]rror, [a]ppend, [o]verwrite, [i]gnore, [p]rintschema, [w]riteschema
                                default: append

-Y, --out-part PARTITIONS       Number of partitions for writing data
                                
-V, --hive-create               Enable creating hive tables
                                
-E, --out-schema SCHEMA         Create hive or jdbc tables in the given schema
                                
-r, --out-prefix PREFIX         Prepend a prefix to each output table name
                                
-y, --out-suffix SUFFIX         Append a suffix to each output table name
                                
-N, --unified-fks               Unified FKs in composite columns (applies to "reuse" optimization)
                                
--reset-pks                     Reset primary keys sequentially, starting from 1
                                
--sequence-type                 Sequential data type defined in ANSI SQL: decimal(38,0), bigint, varchar(36),...
                                
--fk-create                     Enable foreign key creation
                                
--default-varchar-len LENGTH    Length of VARCHAR/CLOB datatype for mapping generation where not explicitly
                                defined by XSD schema
                                
--default-num-type native|max|decimal[(PRECISION,SCALE)]|PRECISION,SCALE
                                Definition of the default numeric type, options:
                                  - native: Keep the numeric types optimized to native types like TINYINT, REAL,
                                    DOUBLE PRECISION
                                  - max: Enforce the maximum precision length accepted by Spark, DECIMAL(38,18)
                                  - decimal: Enforce using the decimal type instead of native ones, keeping their
                                    precision and scale
                                  - decimal(PRECISION[,SCALE]): Enforce using a particular decimal type with minimum
                                    precision and scale
                                  - PRECISION[,SCALE]: same as above
                                  - decimal(+PRECISION[,+SCALE]): Increase the precision and/or scale by the
                                    given margins
                                  - +PRECISION[,+SCALE]: same as above

--default-float-type native|max|decimal[(PRECISION,SCALE)]|PRECISION,SCALE
                               The same as --default-num-type, limited only for float numbers
                               
--default-int-type native|max|decimal[(PRECISION,SCALE)]|PRECISION,SCALE
                               The same as --default-num-type, limited only for integer numbers
                               
-R, --extra-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|ID|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
                                Extra column(s) to be included in all output tables
                                
-i, --partition-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|ID|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
                                Partition column(s) to be included in all output tables
                                
--rename [IDX=|NAME=]NEW_NAME,* Rename one or more tables by index (1,2,3...) or by original name
                                
--namemode MODE                 Change the table/column original names to:
                                  - "": Keep original names
                                  - [l]ower: to lower case.
                                  - [u]pper: to upper case.
                                  - [c]amelcase: Remove separators and keep words separated by case.
                                
--out-opt PROP=VALUE,...        Extra options for Spark dataFrame writer, ex: prop1=value1,prop2=value2
                                
--tables-at-once <N>            Number of tables to be written into target store simultaneously
                                
--remap-tables                  Re-mapping column order based on scanning metadata for every table
                                
--remap-table TABLE,...         Re-mapping column order based on scanning metadata for provided table(s)
                                

# actions

-e, --parsemode MODE            Mode of parsing.
                                  - with doc stats:    [a]ll, [d]ata, [s]tats
                                  - without doc stats: [A]ll, [D]ata, [S]tats
                                default: all

-c, --commands                  Show SQL commands
                                
-s, --skip                      Skip writing results
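
As an illustration of how the input, output and save-mode options can be combined (a hedged sketch: the Schema Origin ID, paths, JDBC URLs, credentials, table and column names below are placeholders, and passing the JDBC connection string as the INPUT location is an assumption based on the INPUT and -F descriptions above):

# convert a directory of XML files and write snappy-compressed parquet, overwriting existing output
$ xml2er -x 1 /landing/xml \
    -f parquet -z snappy -S overwrite \
    -o /data/out -Y 8

# convert XML held in a database column and write the results into a JDBC target
$ xml2er -x 1 jdbc:postgresql://dbhost/source \
    -F jdbc -U src_user -P src_password \
    -T invoices -C xml_payload -I invoice_id \
    -f jdbc -o jdbc:postgresql://dbhost/target \
    -u tgt_user -p tgt_password -B 5000 -S append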