Creating a Data Flow
As outlined in section Overview - Flexter Conversion process, there are three ways to generate a Data Flow.
1. Sample of XML/JSON
2. XSD only
3. Sample of XML + XSD
Irrespective of the type of Data Flow generation we select, we need to define the level of Optimisation. This is done with the -g parameter:
-g, --map MAPPING Mapping generation and optimization levels:
- 0: No optimized mapping
- 1: Elevate optimization (1=1)
- 2: Reuse optimization
- 3: Elevate + Reuse optimization
-a, --use-stats ID... Use Statistics to generate new data flow
--name-max-len SIZE Maximum column size for mapping generation (default: 30)
--default-varchar-len LENGTH User defined length of VARCHAR/CLOB datatype for mapping generation
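The optimization level can be combined with the column naming switches in a single run. A minimal sketch, assuming the donut.xsd sample used later in this guide and purely illustrative values for the naming options:
# Illustrative sketch: Elevate + Reuse optimization (-g3), generated column names
# capped at 40 characters, default VARCHAR length of 4000 (values are placeholders)
$ xsd2er -g3 --name-max-len 40 --default-varchar-len 4000 donut.xsd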
One important parameter to understand when creating Data Flows is the --parsemode switch. It is applicable when running a Conversion and when collecting Statistics incrementally.
When running a Conversion, you can collect Statistics at the same time. This is useful if you would like to detect changes to your Source Data during the Conversion process. It is also useful when upgrading a Target Schema from one version to the next. By default, Flexter collects Statistics during the Conversion. You can disable this behaviour by setting --parsemode to [d]ata.
The --parsemode [s]tats mode is only used for collecting Statistics incrementally, which is described in section Incrementally create Data Flow from Source Data (XML/JSON) of this guide.
Further options are available from version 1.8: using upper-case letters disables the collection of per-document statistics.
-e, --parsemode MODE Mode of parsing.
- with doc stats:
[a]ll, [d]ata, [s]tats
- without doc stats:
[A]ll, [D]ata, [S]tats
default: all
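As a minimal sketch of the upper-case modes, the following run (using the donut.xml sample from later in this guide) parses the data and collects Statistics but skips the per-document statistics; it assumes the single-letter form of the mode is accepted, as the bracketed help output above suggests:
# Illustrative sketch: full parse ([A]ll) without per-document statistics
# (upper-case mode, available from version 1.8)
$ xml2er -e A -g1 donut.xml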
Create Data Flow with Source Schema (XSD)
Before creating the Data Flow with a Source Schema (XSD), it is recommended to locate the root (top outermost) element of the schema. This analysis can be simulated with the --skip or -s parameter, which prevents any metadata from being written.
Once the analysis is complete, the skip parameter must be removed so the processed metadata can be saved into the Flexter Metadata Database.
Basic command:
# Template
$ xsd2er -s -g<level> INPUTPATH
# Reading a file and extracting the xsd schema
$ xsd2er -s -g1 donut.xsd
Running the command with the -s switch will show all of the possible root elements in the XSD.
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.zip!/donut.xsd
{}
· addcost
· batter
· batters
· filling
· fillings
· item
· items
· name
· ppu
· size
· sizeslimitedto
· topping
Given no additional switches, all of these root elements will be saved into the Metadata DB and used to generate Data Flows and the Target Schema. This is the most flexible option, but it may also produce excessive volumes of metadata.
If you know the root element, you can specify it explicitly with the --root or -r parameter:
-r, --root ROOT... Process only requested root elements
An example:
$ xsd2er -s -r items donut.zip
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.zip!/donut.xsd
{}
· items
Flexter can also automatically identify the root based on a simple rule: an element that is not referenced by any other element, i.e. an element without any parents, should be the root.
This works in most cases, except where elements are in a recursive relationship.
-R, --unref-root Unreferenced elements will be considered roots
Using the -R switch in the example below, Flexter automatically detects that items is an unreferenced element and uses it as the root element.
$ xsd2er -R -s donut.zip
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.zip!/donut.xsd
{}
· items
The same rules apply to Source Schemas (XSD) that consist of multiple files.
Flexter offers some useful switches for working with XSDs that span multiple files and multiple namespaces.
-f, --file-root FILE ... Process only requested root schema files
-F, --unref-file-root ... Unreferenced files will be considered root files
-N, --with-target ... Only files with the specified target namespace will be considered
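As an illustrative sketch (the archive name is a placeholder, not part of the donut example), unreferenced schema files and unreferenced elements can be promoted to roots in one simulated run:
# Illustrative sketch: treat unreferenced schema files as root files (-F) and
# unreferenced elements as root elements (-R) in a simulated (-s) run
$ xsd2er -s -F -R multi_file_schemas.zip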
It’s also possible to increase the level of diagnostics, i.e. to print out more details on the XPaths.
-X, --xpath ... Print processed XPaths
-z, --xpath-full ... Print processed XPaths including attributes
--levels LEVEL ... Recursive <TAG> levels treatment:
- 0: Disabled
- 1: Stop on the recursive TAG
- 2: Include the recursive TAG and stop
- 3: Include the recursive TAG, its attributes and stop
- 4: Include the recursive TAG, its attributes, child one-to-one TAGs and stop
- 5...: Repeat the same pattern, accepting 1 more level of recursion
--levels-type LEVEL ... Recursive <TAG> types levels treatment.
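A minimal sketch of a diagnostic run on the donut sample: simulate the analysis, print the processed XPaths including attributes, and stop on recursive tags:
# Illustrative sketch: simulate (-s), print XPaths with attributes (-z),
# stop on the recursive TAG (--levels 1)
$ xsd2er -s -z --levels 1 donut.zip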
Now, by removing the --skip (-s) parameter, the following command will write the metadata:
$ xsd2er -R -g1 donut.zip
…
» SCHEMA {NAMESPACE} · ELEMENT
-----------------------------------------
» donut.zip!/donut.xsd
{}
· items
20:00:35.986 INFO Building metadata
20:00:36.064 INFO Writing metadata
20:00:36.120 INFO Generating the mapping: elevate,reuse
20:00:36.210 INFO Registering success of job 3
20:00:36.225 INFO Finished successfully in 1058 milliseconds
# schema
origin: 3
logical: 1
job: 3
# statistics
load: 447 ms
parse: 365 ms
build: 83 ms
write: 56 ms
map: 92 ms
xpaths: 16
At the end of the successful XSD analysis process, Flexter prints out the ID of the Data Flow (logical). In the example above, the ID of the Data Flow is 1.
Note this ID, as you will need it later for your Conversion task.
Create Data Flow with Source Data (XML or JSON)
You can process XML and JSON documents without a Source Schema (XSD files).
However, this requires choosing a representative sample from your XML/JSON documents. With this representative sample we collect Statistics before creating the Data Flow.
The Source Data can be located on different types of Source Connections, e.g. a file system such as HDFS or an FTP server. For a full description of Source Connections please refer to section Source Connections in this guide.
We can simulate the process with the --skip or -s parameter and remove it when we are ready to run the command for real.
Basic command:
# Template
$ xml2er|json2er -s -g<Optimization Level> INPUTPATH
# Reading a file and extracting the statistics
$ xml2er -s -g1 donut.xml
Let’s run the command without the skip parameter:
$ xml2er -g1 donut.xml
…
# schema
origin: 4
logical: 2
job: 4
# statistics
startup: 4107 ms
parse: 6855 ms
xpath stats: 436 ms
doc stats: 4744 ms
xpaths: 19 | map:0.0%/0 new:100.0%/19
documents: 1 | suc:100.0%/1 part:0.0%/0 fail:0.0%/0
This command collects Statistics (origin ID 4). At the same time it also creates a Data Flow (logical ID 2).
We can also create the Data Flow in two separate steps. We first collect Statistics and then generate the Data Flow. This is useful if we want to collect Statistics incrementally.
Incrementally create Data Flow from Source Data (XML/JSON)
If the sample of your Source Data is too big to be processed in one iteration, you can split it into smaller batches and generate Statistics incrementally.
In a first step we process the first batch of Source Data. This gives us the ID for the Source Data Statistics (origin ID 5).
$ xml2er samples/batch1
…
# schema
origin: 5
job: 5
# statistics
startup: 4107 ms
parse: 6855 ms
xpath stats: 436 ms
doc stats: 4744 ms
xpaths: 325 | map:0.0%/0 new:100.0%/325
documents: 5009 | suc:100.0%/5009 part:0.0%/0 fail:0.0%/0
We can now incrementally update the generated Source ID (origin) and its Statistics with another batch of Source Data. This is done by feeding the Source Statistics ID (origin) to the --schema-origin or -x parameter. We also need to supply the --parsemode or -e parameter to enforce Statistics-only processing. The following commands update the Source Statistics for ID 5 (origin 5).
$ xml2er -x5 -e stats samples/batch2
$ xml2er -x5 -e stats samples/batch3
$ xml2er -x5 -e stats samples/batch4
…
After all the Statistics have been collected, the Data Flow can be generated using the --map or -g parameter. Together with the -g parameter we provide the level of optimization. This generates the Data Flow ID (logical 3).
$ xml2er -a5 -g1
…
# schema
origin: 5
logical: 3
job: 10
# statistics
startup: 4911 ms
map: 92 ms
We can now use the Data Flow ID (logical) to convert our Source Data. In our example the Data Flow ID is 3.
Create Data Flow with Source Schema (XSD) and Source Data (XML)
In a first step we need to collect Statistics from the Source Data.
Basic command:
# Template
$ xml2er -s INPUTPATH
# Test (-s skip) reading a file and collecting Statistics
$ xml2er -s donut.zip
# Running the command without skip
$ xml2er donut.zip
…
# schema
origin: 4
job: 4
# statistics
startup: 4107 ms
parse: 6855 ms
xpath stats: 436 ms
doc stats: 4744 ms
xpaths: 19 | map:0.0%/0 new:100.0%/19
documents: 1 | suc:100.0%/1 part:0.0%/0 fail:0.0%/0
We have generated a new set of Statistics (origin 4). We are now ready to generate the new Data Flow with the xsd2er module. We need to use the --use-stats parameter together with the --map parameter to generate the new Data Flow from both the XSD and the Statistics for the Source ID (origin 4).
-a, --use-stats ID... Use Statistics to generate new data flow
# Template
$ xsd2er -s -a<Statistics ID (origin)> -g<Optimization Level> INPUTPATH
# Test (-s skip): reading a file and generating the Data Flow with Elevate optimization (-g1)
$ xsd2er -s -a4 -g1 donut.xsd
# Running the command without skip
$ xsd2er -a4 -g1 donut.zip
…
# schema
origin: 7
logical: 3
job: 6
# statistics
load: 1608 ms
stats: 40 ms
parse: 368 ms
build: 121 ms
write: 47 ms
map: 128 ms
xpaths: 16
Now with a generated Data Flow ID (logical 3) the data conversion process can begin.
Incrementally create Data Flow from Source Schema (XSD) and Source Data (XML)
We can also use the incremental method of generating Statistics in combination with a Source Schema (XSD).
In a first step we process the first batch of Source Data:
$ xml2er samples/batch1
…
# schema
origin: 5
job: 5
# statistics
startup: 4107 ms
parse: 6855 ms
xpath stats: 436 ms
doc stats: 4744 ms
xpaths: 325 | map:0.0%/0 new:100.0%/325
documents: 5009 | suc:100.0%/5009 part:0.0%/0 fail:0.0%/0
$ xml2er -x5 -e stats samples/batch2
$ xml2er -x5 -e stats samples/batch3
$ xml2er -x5 -e stats samples/batch4
…
Once we have collected all the Statistics, we generate the Data Flow with the --map or -g parameter, providing the type of optimization. In this example, -g1 applies the Elevate optimisation. We also provide the path to our XSD.
$ xsd2er -a5 -g1 samples/xsds
…
# schema
origin: 5
logical: 3
job: 10
# statistics
startup: 4911 ms
map: 92 ms
xpaths: 127
With the generated Data Flow ID (logical) we can now start converting Source Data.