merge2er
This module merges smaller files of the same structure into larger files, addressing the small files problem in Hadoop and Spark.
Parameters
To list all supported merge2er parameters, run the following command:
$ merge2er --help
All supported parameters are listed below:
Usage: merge2er [Options] [INPUT]
INPUT Files or directories input location path
Options:
-h, --help
--version
# schema
-j, --job ID Job ID
# spark-submit
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--name NAME The name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
--files FILES Comma-separated list of local files to include on the driver
--conf PROP=VALUE Arbitrary Spark configuration property.
-b, --blocksize BLOCKSIZE Block size, e.g. 1024kb, 64mb, 128mb...
-D, --warehouse-dir DIRECTORY Spark SQL warehouse directory
--metrics DURATION Enable metrics printing at a defined interval: 1h, 1m, 1s...
# input data
-U, --in-user USER Input user
-P, --in-password PASSWORD Input password
--in-token TOKEN Session token of input S3 bucket
-F, --in-format FORMAT Input format:
- jdbc, parquet, orc, json, csv, tsv, psv, avro...
--in-schema SCHEMA Input schema related to hive or jdbc source
--schema-pattern PATTERN Input schema pattern related to hive or jdbc source, ex: "*word*"
--table-pattern PATTERN Input table pattern related to hive or jdbc source, ex: "*word*"
-T, --table TABLE|QUERY Name of the table (or specified SQL)
-I, --id-column <NAME|EXPR>[:DATATYPE] [EXTRA] [AS ALIAS]
Name of identifier column to contain document's ID values for tabular
files, jdbc or hive tables
-<MIN>, --id-min MIN ID value specifying the lower bound of the data range to be processed
+<MAX>, --id-max MAX ID value specifying the upper bound of the data range to be processed
--where WHERE Where clause for filtering tabular files, jdbc or hive tables
--limit LIMIT Limit clause for filtering the number of records/documents
--has-header Whether the input csv/psv/tsv file has a header
default: true
--in-opt PROP=VALUE,... Extra options for the Spark DataFrame reader, ex: prop1=value1,prop2=value2
# metadata
-m, --meta URL JDBC location of metadata store
default: jdbc:postgresql:x2er
-M, --meta-user USER Metadata user
default: flex2er
-w, --meta-password PASSWORD Metadata password
# output data
-o, --out OUTPUT Output location path
-u, --out-user USER Output user
-p, --out-password PASSWORD Output password
--out-token TOKEN Session token of output S3 bucket
-B, --batchsize BATCHSIZE Batch size to write into databases
default: 1000
-f, --out-format FORMAT Output format:
- jdbc, parquet, orc, json, csv, tsv, psv, avro...
-z, --compression COMPRESSION Compression mode for filesystem output
- none, snappy, gzip, bzip2, deflate, lzo, lz4, zlib, xz...
default: snappy
-S, --savemode SAVEMODE Save Mode when target table, directory or file exists
ex: [e]rror, [a]ppend, [o]verwrite, [i]gnore, [p]rintschema, [w]riteschema
default: append
-Y, --out-part PARTITIONS Number of partitions for writing data
-V, --hive-create Enable creating hive tables
-E, --out-schema SCHEMA Schema into which hive or jdbc tables are created
-r, --out-prefix PREFIX Prepend a prefix to each output table name
-y, --out-suffix SUFFIX Append a suffix to each output table name
--constraints Enable copying the primary and foreign key constraints
-R, --extra-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|ID|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
Extra column(s) to be included in all output tables
-i, --partition-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|ID|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
Partition column(s) to be included in all output tables
--rename [IDX=|NAME=]NEW_NAME,* Rename one or more tables by index (1,2,3...) or by their original names
--namemode MODE Change the original table/column names to:
- "": Keep original names
- [l]ower: to lower case.
- [u]pper: to upper case.
- [c]amelcase: Remove separators and separate words by case.
--name-max-len SIZE Maximum column name length
default: 30
--out-opt PROP=VALUE,... Extra options for the Spark DataFrame writer, ex: prop1=value1,prop2=value2
# actions
-c, --commands Show SQL commands
-s, --skip Skip writing results
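Example
The command below is an illustrative sketch of how these options combine; the Spark master, job ID, application name, and HDFS paths are placeholders rather than values from a real deployment. It merges small CSV files from an input directory into snappy-compressed parquet files, overwriting any existing output and writing 8 partitions:
$ merge2er --master yarn --name merge-events -j 42 \
    -F csv \
    -f parquet -z snappy -S overwrite -Y 8 \
    -o hdfs:///data/merged/events \
    hdfs:///data/raw/events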