Target Connections
Target Connections are based on the URI standard. It means Flexter can access different locations and protocols defined by the path.
protocol://user:password@hostname:port/path?param=value#param=value
file://
ftp:// ftps://
hdfs:// maprfs://
s3a://
jdbc:xyz://
snowflake://
bigquery://
mongodb://
redshift://
Support of jdbc protocols:
jdbc:bigquery:
jdbc:derby:
jdbc:sqlserver:
jdbc:mariadb
jdbc:mysql
jdbc:oracle:
jdbc:postgresql:
jdbc:redshift:
jdbc:snowflake:
jdbc:teradata:
Some of these protocols require you to provide a username and password. In all cases it’s possible to use the parameters or configuration files to define them. For some, you can specify the username and password directly in the URI path, e.g. for FTP.
-u, --out-user USER Output user
-p, --out-password PASSWORD Output password
The Target Formats can be split into three types:
1. File based
2. Hive
3. JDBC
All three types have the following parameters in common:
-S, --savemode SAVEMODE Save Mode when table, directory or file
exists or when getting the output tables's DDL
[e]rror, [a]ppend, [o]verwrite, [i]gnore, [w]riteschema, [p]rintschema
default: append
-Y, --out-part PARTITIONS Number of partitions for writing data
-N, --unified-fks Unified FKs in composite columns. Only
applicable when Reuse optimisation is applied
default: false
--reset-pks Reset primary keys restarting them from 1.
Not recommended for production use.
default: false
--out-opt PROP=VALUE,... Extra options for Spark dataFrame writer,
ex:prop1=value1,prop2=value2
-r, --out-prefix PREFIX Append a prefix in each output table name
-y, --out-suffix SUFFIX Append a suffix in each output table name
--rename [IDX=|NAME=]NEW_NAME,...
Rename a table or multiple tables by index
(1,2,3...) or by its original name
--namemode MODE Change the table/column original names to:
- "": Keep original names
- [l]ower: to lower case.
- [u]pper: to upper case.
- [c]camelcase: Remove separators
and keep words separated by case.
Unified foreign keys is an alternative way for the ER normalization. It merges all foreign keys into two other columns: one to store the table name and another table to store the foreign key value.
Reset the primary keys is an option for generating primary keys restarting from 1. It can conflict with other generated batches but it could be useful for the first run or overwrite save mode.
Flexter can generate a partitioned table as output. The –partition-by parameter generates partitioned output for each table. It’s capable to use static, random (date, time, timestamp, uuid) and input column value for partitioning.
The partition column name can be auto generated considering the –partition-by parameter or can be set by –partition-by-name parameter.
File Based Target Connections
For file based Target Connections we need to define the Target Format and the location for the output directory.
$ xml2er|json2er -f <File Format> -o <Output Directory> …
Useful parameters
-f, --out-format FORMAT Output format. (parquet, orc, json, csv, tsv)
default: parquet
-z, --compression COMPRESSION File formats compression mode
- none, snappy, gzip, bzip2, deflate, lzo,
lz4, zlib, xz...
default: snappy
Similar to a Source Connection, the format parameter also accepts the full class name if the extra jars are included in the spark-submit library path. Here is an example for working with Avro.
# example writing to avro files
$ xml2er|json2er -f avro -o <Output Directory> …
# or if avro package isn't available
$ xml2er|json2er --pkg com.databricks:spark-avro_2.11:4.0.0 -f avro -o <Output Directory> …
Hive as a Target Connection
Hive as a Target Connection behaves similar to Target Connections that output files. This mode can be activated by parameter -V or –hive-create and the output location is optional.
$ xml2er|json2er -V …
With a defined output location, the tables will be created as external tables.
$ xml2er|json2er -V -o <Output Directory> …
Below the useful options:
-V, --hive-create Enable creating hive tables
-E, --out-schema SCHEMA Creating hive or jdbc tables into schema
JDBC as a Target Connection
When specifying JDBC as a Target Connection you need to include an output path with the jdbc:// protocol.
$ xml2er|json2er -o jdbc:<database url> -u <dbuser> -p <dbpassword> …
-B, --batchsize BATCHSIZE Batch size to write into databases
default: 1000
The batch size can be changed to make the inserts stored in bigger or smaller packages before being sent to the database.
Because the characteristics of a parallel Apache Spark architecture, it can’t guarantee that all inserts will be made in the same transaction.
Each table will require a new transaction, and it could be performed with 1 or more executors. Each executor will use its own transaction across the cluster.
Printing out the DDL to the console or writing to the target Jdbc
Whether the target is a Jdbc Connection or a Hive schema, you can print out the output tables’ DDL to the console using the [p]rintschema parsing mode.
...
17:38:30.827 INFO Initialized in 2889 milliseconds
17:38:30.829 INFO getting schema definition
17:38:30.831 INFO Loading metadata
17:38:31.407 INFO printing schema definition
CREATE TABLE items ("PK_items" NUMERIC(38) NOT NULL, "FILENAME" VARCHAR(4000) NOT NULL, "FILEPATH" TEXT NOT NULL, PRIMARY KEY("PK_items"))
CREATE TABLE item ("PK_item" NUMERIC(38) NOT NULL, "FK_items" NUMERIC(38) NOT NULL, "id" SMALLINT , "name" VARCHAR(10) , "ppu" FLOAT , "type_" VARCHAR(6) , PRIMARY KEY("PK_item"))
CREATE TABLE topping ("FK_item" NUMERIC(38) NOT NULL, "id" SMALLINT , "topping" VARCHAR(24) )
CREATE TABLE batter ("PK_batter" NUMERIC(38) NOT NULL, "FK_item" NUMERIC(38) NOT NULL, "id" SMALLINT , "name" VARCHAR(12) , PRIMARY KEY("PK_batter"))
CREATE TABLE size_ ("FK_batter" NUMERIC(38) NOT NULL, "size_" VARCHAR(6) NOT NULL)
CREATE TABLE filling ("FK_item" NUMERIC(38) NOT NULL, "addcost" FLOAT , "id" SMALLINT , "name" VARCHAR(16) )
17:38:32.966 INFO checking foreign keys
ALTER TABLE "item" ADD FOREIGN KEY("FK_items") REFERENCES "items"("PK_items");
ALTER TABLE "topping" ADD FOREIGN KEY("FK_item") REFERENCES "item"("PK_item");
ALTER TABLE "batter" ADD FOREIGN KEY("FK_item") REFERENCES "item"("PK_item");
ALTER TABLE "size_" ADD FOREIGN KEY("FK_batter") REFERENCES "batter"("PK_batter");
ALTER TABLE "filling" ADD FOREIGN KEY("FK_item") REFERENCES "item"("PK_item");
Alternatively, you can create the empty target tables using the [w]riteschema mode.
These options could be extremely for set up and preparing the database before writing the data.
$ psql -U flex2er -d x2er
Password for user flex2er:
psql
Type "help" for help.
x2er=# \dt
List of relations
Schema | Name | Type | Owner
--------+------------------+-------+---------
public | batter | table | flex2er
public | filling | table | flex2er
public | item | table | flex2er
public | items | table | flex2er
public | size_ | table | flex2er
public | topping | table | flex2er