xml2er
This module is used to generate Statistics from Source Data (XML), to create a Data Flow from it, and to execute a Task that converts it.
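For illustration only, a run that collects Statistics and generates a mapping without writing results, followed by a run that converts the XML using an existing schema, could look like this (the file path, Schema Origin ID and output location are placeholders; the flags used are described under Parameters below):
$ xml2er -s -g 3 input.xml
$ xml2er -x 1 -o /output/path -f parquet input.xml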
Parameters
To list all the supported xml2er parameters, run the following command:
$ xml2er --help
All supported parameters are listed below:
Usage: xml2er [Options] [INPUT...]
INPUT Input location path
Options:
-h, --help
--version
# schema
-x, --schema-origin ID Schema Origin ID
--schema-name Schema name
-j, --job ID Job ID
# spark-submit
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--name NAME The name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
--conf PROP=VALUE Arbitrary Spark configuration property.
-b, --blocksize BLOCKSIZE Block size, e.g. 1024kb, 64mb, 128mb...
-D, --warehouse-dir DIRECTORY Spark SQL warehouse directory
-k, --cache CACHE Spark's storage level. Full or shortcut names
- N, NONE
- M, MEMORY_ONLY
- M2, MEMORY_ONLY_2
- MD, MEMORY_AND_DISK
- MS, MEMORY_ONLY_SER
- MD2, MEMORY_AND_DISK_2
- MS2, MEMORY_ONLY_SER_2
- MDS, MEMORY_AND_DISK_SER
- MDS2, MEMORY_AND_DISK_SER_2
- D, DISK_ONLY
- D2, DISK_ONLY_2
- O, OFF_HEAP
- other...
default: M
-K, --checkpoint-dir DIRECTORY Spark Streaming checkpoint directory
--metrics DURATION Enable metrics printing at a defined interval: 1h, 1m, 1s...
# input data
-Z, --stream DURATION Stream mode duration. 1h, 1m, 1s, 1ms..
-U, --in-user USER Input user
-P, --in-password PASSWORD Input password
--in-token TOKEN Session token of input S3 bucket
-F, --in-format FORMAT Input format.
- jdbc, parquet, orc, json, csv, tsv, psv, avro...
-T, --table TABLE|QUERY Name of the table (or specified SQL) to return tabular data with XML
content in one of the columns
-C, --column NAME|EXPR [AS ALIAS]
Name of the column containing XML data to process
-I, --id-column <NAME|EXPR>[:DATATYPE] [EXTRA] [AS ALIAS]
Name of the identifier column containing the document's ID values for tabular
files, jdbc or hive tables
-<MIN>, --id-min MIN ID value specifying lower bound data range to be processed
+<MAX>, --id-max MAX ID value specifying upper bound data range to be processed
--where WHERE Where clause for filtering tabular files, jdbc or hive tables
--limit LIMIT Limit clause for filtering the number of records/documents
-X, --in-part PARTITIONS Number of partitions for reading data
-n, --new-since DURATION Records are considered new since X ago
-a, --use-stats ID,... Use the statistics to generate the new schema
--archive-read Read from archive files. zip, tar, jar, cpio, 7z...
default: true
--byte-stream Enforce reading as stream of bytes (2.1GB limit per file)
--filter-content Filter content, wrong bytes and charset detection
default: true
--detect-type Ignore non-XML files based on detected media type
default: true
--in-opt PROP=VALUE,... Extra options for Spark dataFrame reader, ex: prop1=value1,prop2=value2
# metadata
-m, --meta URL JDBC location of metadata store
default: jdbc:postgresql:x2er
-M, --meta-user USER Metadata user
default: flex2er
-w, --meta-password PASSWORD Metadata password
-g, --map MAPPING Mapping generation and optimization levels:
- "": Disabled
- 0: No optimized mapping
- 1: Elevate optimization (1=1)
- 2: Reference optimization (type="", ref="")
- 3: Elevate + Reference optimization
default: disabled
--name-max-len SIZE Maximum column name length for mapping generation
default: 30
--default-varchar-len LENGTH Length of VARCHAR/CLOB datatype for mapping generation where not explicitly
defined by XSD schema
# output data
-o, --out OUTPUT Output location path
-u, --out-user USER Output user
-p, --out-password PASSWORD Output password
--out-token TOKEN Session token of output S3 bucket
-B, --batchsize BATCHSIZE Batch size to write into databases
default: 1000
-f, --out-format FORMAT Output format:
- jdbc, parquet, orc, json, csv, tsv, psv, avro...
default: orc
-z, --compression COMPRESSION Compression mode for filesystem output
- none, snappy, gzip, bzip2, deflate, lzo, lz4, zlib, xz...
default: snappy
-S, --savemode SAVEMODE Save Mode when target table, directory or file exists
ex: [e]rror, [a]ppend, [o]verwrite, [i]gnore, [p]rintschema, [w]riteschema
default: append
-Y, --out-part PARTITIONS Number of partitions for writing data
-V, --hive-create Enable creating hive tables
-E, --out-schema SCHEMA Creating hive or jdbc tables into schema
-r, --out-prefix PREFIX Append a prefix in each output table name
-y, --out-suffix SUFFIX Append a suffix in each output table name
-N, --unified-fks Unified FKs in composite columns (applies to "reuse" optimization)
--reset-pks Reset primary keys sequentially, starting from 1
--sequence-type Sequential data type defined in ANSI SQL: decimal(38,0), bigint, varchar(36),...
-R, --extra-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|ID|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
Extra column(s) to be included in all output tables
-i, --partition-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|ID|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
Partition column(s) to be included in all output tables
--rename [IDX=|NAME=]NEW_NAME,* Rename a table or multiple tables by index (1,2,3...) or by its original name
--namemode MODE Change the table/column original names to:
- "": Keep original names
- [l]ower: to lower case.
- [u]pper: to upper case.
- [c]amelcase: Remove separators and keep words separated by case.
--out-opt PROP=VALUE,... Extra options for Spark dataFrame writer, ex: prop1=value1,prop2=value2
--tables-at-once <N> Number of tables to be written into target store simultaneously
# actions
-e, --parsemode MODE Mode of parsing.
- with doc stats: [a]ll, [d]ata, [s]tats
- without doc stats: [A]ll, [D]ata, [S]tats
default: all
-c, --commands Show SQL commands
-s, --skip Skip writing results
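As a further illustration of combining the options above (a sketch only; the paths, credentials and IDs are placeholders, not defaults):
$ xml2er -x 1 -o /data/out -f csv -z gzip -S overwrite -Y 8 input.xml
$ xml2er -x 1 -f jdbc -o jdbc:postgresql://dbhost:5432/target -u dbuser -p dbpassword -B 5000 -E public input.xml
The first command writes gzip-compressed CSV output in 8 partitions, overwriting any existing target; the second writes over JDBC into the "public" schema of a PostgreSQL database with a batch size of 5000.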