merge2er

This module merges smaller files of the same structure into larger files, addressing the small-files problem in Hadoop and Spark.
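
For example, a hypothetical invocation might read a directory of small parquet files and rewrite them into larger files of roughly one HDFS block each. The master URL, job ID, and paths below are placeholders, and the individual options are described under Parameters below:

$ merge2er --master yarn --job 42 \
    --in-format parquet --out-format parquet \
    --blocksize 128mb --savemode overwrite \
    --out hdfs:///warehouse/merged/events hdfs:///raw/events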

Parameters

To list all supported merge2er parameters, run:

$ merge2er --help

All available parameters are listed below:

Usage: merge2er [Options] [INPUT]


  INPUT                           Input location path

  Options:

  -h, --help

  --version

  # schema

  -j, --job ID                    Job ID


  # spark-submit

  --master MASTER_URL             spark://host:port, mesos://host:port, yarn, or local.

  --name NAME                     The name of your application.

  --jars JARS                     Comma-separated list of local jars to include on the driver

  --conf PROP=VALUE               Arbitrary Spark configuration property.

  -b, --blocksize BLOCKSIZE       Block size, e.g. 1024kb, 64mb, 128mb...

  -D, --warehouse-dir DIRECTORY   Spark SQL warehouse directory

  --metrics DURATION              Enable printing metrics at the given interval: 1h, 1m, 1s...


  # input data

  -U, --in-user USER              Input user

  -P, --in-password PASSWORD      Input password

  --in-token TOKEN                Session token of input S3 bucket

  -F, --in-format FORMAT          Input format:
                                  - jdbc, parquet, orc, json, csv, tsv, psv, avro...

  --in-schema SCHEMA              Input schema for hive or jdbc sources

  --schema-pattern PATTERN        Input schema pattern for hive or jdbc sources, e.g. "*word*"

  --table-pattern PATTERN         Input table pattern for hive or jdbc sources, e.g. "*word*"

  -T, --table TABLE|QUERY         Name of the table (or a SQL query)

  -I, --id-column <NAME|EXPR>[:DATATYPE] [EXTRA] [AS ALIAS]
                                  Name of the identifier column containing the document ID values for
                                  tabular files, jdbc or hive tables

  -<MIN>, --id-min MIN            ID value specifying the lower bound of the data range to be processed

  +<MAX>, --id-max MAX            ID value specifying the upper bound of the data range to be processed

  --where WHERE                   Where clause for filtering tabular files, jdbc or hive tables

  --limit LIMIT                   Limit clause to restrict the number of records

  --has-header                    Whether the input csv/psv/tsv file has a header row
                                  default: true

  --in-opt PROP=VALUE,...         Extra options for the Spark DataFrame reader, e.g. prop1=value1,prop2=value2


  # metadata

  -m, --meta URL                  JDBC location of metadata store
                                  default: jdbc:postgresql:x2er

  -M, --meta-user USER            Metadata user
                                  default: flex2er

  -w, --meta-password PASSWORD    Metadata password


  # output data

  -o, --out OUTPUT                Output location path

  -u, --out-user USER             Output user

  -p, --out-password PASSWORD     Output password

  --out-token TOKEN               Session token of output S3 bucket

  -B, --batchsize BATCHSIZE       Batch size to write into databases
                                  default: 1000

  -f, --out-format FORMAT         Output format:
                                  - jdbc, parquet, orc, json, csv, tsv, psv, avro...

  -z, --compression COMPRESSION   Compression mode for filesystem output
                                  - none, snappy, gzip, bzip2, deflate, lzo, lz4, zlib, xz...
                                  default: snappy

  -S, --savemode SAVEMODE         Save Mode when target table, directory or file exists
                                  ex: [e]rror, [a]ppend, [o]verwrite, [i]gnore, [p]rintschema, [w]riteschema
                                  default: append

  -Y, --out-part PARTITIONS       Number of partitions for writing data

  -V, --hive-create               Enable creating hive tables

  -E, --out-schema SCHEMA         Schema in which to create hive or jdbc tables

  -r, --out-prefix PREFIX         Prepend a prefix to each output table name

  -y, --out-suffix SUFFIX         Append a suffix to each output table name

  --constraints                   Enable copying the primary and foreign key constraints


  -R, --extra-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
                                  Extra column(s) to be included in all output tables

  -i, --partition-column <NAME|EXPR>[:DATATYPE] [AUTO|RANDOM|NOW|FILENAME|FILEPATH|=VALUE] [AS ALIAS]
                                  Partition column(s) to be included in all output tables

  --rename [IDX=|NAME=]NEW_NAME,* Rename one or more tables by index (1,2,3...) or by original name

  --namemode MODE                 Change the original table/column names to:
                                  - "": keep original names
                                  - [l]ower: lower case
                                  - [u]pper: upper case
                                  - [c]amelcase: remove separators and keep words separated by case

  --name-max-len SIZE             Maximum column name length
                                  default: 30

  --out-opt PROP=VALUE,...        Extra options for the Spark DataFrame writer, e.g. prop1=value1,prop2=value2


  # actions

  -c, --commands                  Show SQL commands

  -s, --skip                      Skip writing results
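
Example

The sketch below combines several of the options above: it merges headered csv files into snappy-compressed orc output, repartitions the result, and registers it as a hive table. It is illustrative only; the application name, paths, schema name, and partition count are placeholders:

$ merge2er --master yarn --name merge-events \
    --in-format csv --has-header \
    --out-format orc --compression snappy --out-part 16 \
    --hive-create --out-schema analytics \
    --out hdfs:///warehouse/analytics/events hdfs:///landing/events

Adding --commands shows the SQL commands that would be run, and --skip skips writing the results, which can help verify a configuration before producing output.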