Creating a Data Flow

This section covers the processes by which Flexter can create a Data Flow.

As outlined in section Overview - Flexter Conversion process, there are three ways to generate a Data Flow.

1. Sample of XML/JSON
2. XSD only
3. Sample of XML + XSD

Irrespective of the type of Data Flow generation we select, we need to define the type of optimisation. This is done with the -g parameter (see the example after the parameter listing below).

-g, --map MAPPING		Mapping generation and optimization levels:
                                - 0: No optimized mapping
                                - 1: Elevate optimization (1=1)
                                - 2: Reuse optimization  
                                - 3: Elevate + Reuse optimization

-a, --use-stats ID...		Use Statistics to generate new data flow

--name-max-len SIZE		Maximum column name size for mapping generation (default: 30)

--default-varchar-len LENGTH    User-defined length of VARCHAR/CLOB datatype for mapping generation
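
As an illustration, the optimisation level can be combined with the naming and datatype options above in a single call. This is a minimal sketch: the values (level 3, a 25-character name limit, a 4000-character VARCHAR default) are arbitrary placeholders, not recommendations.

# Hypothetical example: test run (-s) with Elevate + Reuse optimisation and custom name/VARCHAR limits
$ xsd2er -s -g3 --name-max-len 25 --default-varchar-len 4000 donut.xsd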

One important parameter to understand when creating Data Flows is the --parsemode switch. It is applicable when running a Conversion and when collecting Statistics incrementally.

When running a Conversion, you can collect Statistics at the same time. This is useful if you would like to detect changes to your Source Data during the Conversion process. It is also useful when upgrading a Target Schema from one version to the next. By default, Flexter collects Statistics during the Conversion. You can disable this behaviour by setting --parsemode to [d]ata.

The --parsemode [s]tats option is only used for collecting Statistics incrementally, which is described in section Incrementally create Data Flow from Source Data (XML/JSON) of this guide.

Further options are available from version 1.8: using upper-case letters disables collecting statistics about documents (see the example after the parameter listing below).

 -e, --parsemode MODE            Mode of parsing.
                                - with doc stats:    
                                    [a]ll, [d]ata, [s]tats
                                - without doc stats: 
                                    [A]ll, [D]ata, [S]tats
                                  default: all
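
As a minimal sketch of the difference, the commands below reuse the incremental-Statistics example from later in this guide. The Statistics ID 5 and the sample path are placeholders, and the -x (--schema-origin) parameter is explained in section Incrementally create Data Flow from Source Data (XML/JSON).

# Collect Statistics including document statistics (lower case)
$ xml2er -x5 -e stats samples/batch2

# Collect Statistics without document statistics (upper case, version 1.8+)
$ xml2er -x5 -e S samples/batch2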

Create Data Flow with Source Schema (XSD)

Before creating the Data Flow with a Source Schema (XSD), it is recommended to locate the root (the top, outermost) element. Flexter greatly simplifies this process by analysing the Source Schema for possible root elements. To leverage this information, it is advised to use the --skip or -s parameter on first use.

Once the analysis is complete, the --skip parameter must be removed so that the processed metadata can be saved into the Flexter Metadata Database.

Basic command:

# Template

$ xsd2er -s -g<level> INPUTPATH

# Reading a file and extracting the xsd schema

$ xsd2er -s -g1 donut.xsd

Running the command with the -s switch will show all of the possible root elements in the XSD.

» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.zip!/donut.xsd
   {}
 · addcost
 · batter
 · batters
 · filling
 · fillings
 · item
 · items
 · name
 · ppu
 · size
 · sizeslimitedto
 · topping

If no additional switches are given, all of these root elements will be saved into the Metadata DB and used to generate Data Flows and the Target Schema. This is the most flexible option, but it may also produce excessive volumes of metadata.

If you know the root element of your XML documents, you can choose one or more roots using the -r switch:

-r, --root  ROOT...   	Process only requested root elements

An example:

$ xsd2er -r items donut.zip
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.zip!/donut.xsd
   {}
 · items

Flexter can also automatically identify the root based on a simple rule.

An element that is not referenced by anything else should be the root, in other words an element without any parents.

This works in most cases, except where there is a recursive relationship.

-R, --unref-root           Unreferenced elements will be considered roots

Using the -R switch in the example below, Flexter automatically detects that items is an unreferenced element and uses it as the root element.

$ xsd2er -R -s donut.zip
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.zip!/donut.xsd
   {}
 · items

The same rules are applied to Source Schemas (XSD) with multiple files.

Flexter offers some useful switches to work with XSDs with multiple files and multiple namespaces.

-f, --file-root FILE...          Process only requested root schema files

-F, --unref-file-root            Unreferenced files will be considered root files

-N, --with-target                Only files with the specified target namespace will be considered
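
A minimal sketch of how these switches might be used; main.xsd, schemas.zip and the namespace URI are hypothetical placeholders rather than part of the donut example:

# Analyse a multi-file Source Schema, processing only main.xsd as the root schema file
$ xsd2er -s -f main.xsd schemas.zip

# Consider only schema files with the given target namespace
$ xsd2er -s -N http://example.com/donuts schemas.zip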

It’s also possible to increase the level of diagnostics, i.e. to print out more details on the XPaths.

-X, --xpath                      Print processed XPaths

-z, --xpath-full                 Print processed XPaths including attributes
                          	
--levels LEVEL                   Recursive <TAG> levels treatment:
                            - 0: Disabled
                            - 1: Stop on the recursive TAG
                            - 2: Include the recursive TAG and stop
                            - 3: Include the recursive TAG, its attributes and stop
                            - 4: Include the recursive TAG, its attributes, child one-to-one TAGs
                                 and stop
                            - 5... Repeat the same pattern, accepting one more level of recursion

--levels-type LEVEL              Recursive <TAG> types levels treatment.
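
A minimal sketch of the diagnostic switches applied to the donut schema; the --levels value of 1 is an arbitrary choice for illustration:

# Print the processed XPaths including attributes, stopping on recursive tags
$ xsd2er -s -z --levels 1 donut.zip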

Now, by removing the --skip (-s) parameter, the following command will write the metadata:

$ xsd2er -R -g1 donut.zip
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.zip!/donut.xsd
   {}
 · items

20:00:35.986  INFO  Building metadata
20:00:36.064  INFO  Writing metadata
20:00:36.120  INFO  Generating the mapping: elevate,reuse
20:00:36.210  INFO  Registering success of job 3
20:00:36.225  INFO  Finished successfully in 1058 milliseconds

# schema
     origin:  3
     logical:  1
     job:  3

# statistics
     load:  447 ms
     parse:  365 ms
     build:  83 ms
     write:  56 ms
     map:  92 ms
     xpaths:  16

At the end of the successful XSD analysis process, Flexter prints out the ID of the Data Flow (logical). In the example above, the ID of the Data Flow is 1.

This should be noted and used later in your Conversion task.

Create Data Flow with Source Data (XML or JSON)

You can process XML and JSON documents without a Source Schema (XSD files).

However, this requires choosing a representative sample from your XML/JSON documents. With this representative sample we collect Statistics before creating the Data Flow.

The Source Data can be located on different types of Source Connections, e.g. a file system such as HDFS or an FTP server. For a full description of Source Connections please refer to section Source Connections in this guide.

We can simulate the process with the --skip or -s parameter and remove it when we are ready to run the command for real.

Basic command:

# Template

$ xml2er|json2er -s -g<Optimization Level> INPUTPATH

# Reading a file and extracting the statistics

$ xml2er -s -g1 donut.xml

Let’s run the command without the skip parameter:

$ xml2er -g1 donut.xml
…

# schema
origin:  4
logical:  2
job:  4

# statistics
startup:  4107 ms
parse:  6855 ms
xpath stats:  436 ms
doc stats:  4744 ms

xpaths:  19           | map:0.0%/0  new:100.0%/19
documents:  1         | suc:100.0%/1  part:0.0%/0  fail:0.0%/0

This command collects Statistics (origin ID 4). At the same time it also creates a Data Flow (logical ID 2).

We can also create the Data Flow in two separate steps. We first collect Statistics and then generate the Data Flow. This is useful if we want to collect Statistics incrementally.

Incrementally create Data Flow from Source Data (XML/JSON)

If the sample of your Source Data is too big to be processed in one iteration, you can split it into smaller batches and generate Statistics incrementally.

In the first step we process the first batch of Source Data. This gives us the ID for the Source Data Statistics (origin ID 5).

$ xml2er samples/batch1
…

# schema
origin:  5
job:  5

# statistics
startup:  4107 ms
parse:  6855 ms
xpath stats:  436 ms
doc stats:  4744 ms

xpaths:  325             | map:0.0%/0  new:100.0%/325
documents:  5009         | suc:100.0%/5009  part:0.0%/0  fail:0.0%/0

We can now incrementally update the generated Source ID (origin) and its Statistics with another batch of Source Data. This is done by feeding the Source Statistics ID (origin) to the --schema-origin or -x parameter. We also need to supply the --parsemode or -e parameter to enforce Statistics-only processing. The following commands update the Source Statistics for ID 5 (origin 5).

$ xml2er -x5 -e stats samples/batch2

$ xml2er -x5 -e stats samples/batch3

$ xml2er -x5 -e stats samples/batch4

…

After all the Statistics have been collected, the Data Flow can be generated by passing the Statistics ID to the --use-stats (-a) parameter and the level of optimization to the --map (-g) parameter. This generates the Data Flow ID (logical 3).

$ xml2er -a5 -g1
…

# schema
origin:  5
logical:  3
job:  10

# statistics
startup:  4911 ms
map:  92 ms

We can now use the Data Flow ID (logical) to convert our Source Data. In our example the Data Flow ID is 3.

Create Data Flow with Source Schema (XSD) and Source Data (XML)

In the first step we need to collect Statistics from the Source Data.

Basic command:

# Template

$ xml2er -s <INPUTPATH to Source Data>

# Test (-s skip) reading a file and collecting Statistics

$ xml2er -s donut.zip

# running the command without skip

$ xml2er donut.zip
…

# schema
origin:  4
job:  4

# statistics
startup:  4107 ms
parse:  6855 ms
xpath stats:  436 ms
doc stats:  4744 ms

xpaths:  19           | map:0.0%/0  new:100.0%/19
documents:  1         | suc:100.0%/1  part:0.0%/0  fail:0.0%/0

We have generated a new set of Statistics (origin 4). We are now ready to generate the new Data Flow with the xsd2er module. We need to use the --use-stats (-a) parameter together with the --map (-g) parameter to generate the new Data Flow from both the XSD and the Statistics for Source ID (origin 4).

-a, --use-stats ID...       Use Statistics to generate new data flow

# Template

$ xsd2er -s -a<Statistics ID (origin)> -g<Optimization Level> INPUTPATH

# Test (-s skip) reading a file and generating the Data Flow with Elevate optimization (-g1)

$ xsd2er -s -a4 -g1 donut.xsd

# running the command without skip

$ xsd2er -a4 -g1 donut.zip
…
# schema
origin:  7
logical:  3
job:  6

# statistics
load:  1608 ms
stats:  40 ms
parse:  368 ms
build:  121 ms
write:  47 ms
map:  128 ms
xpaths:  16

Now, with a generated Data Flow ID (logical 3), the data conversion process can begin.

Incrementally create Data Flow from Source Schema (XSD) and Source Data (XML)

We can also use the incremental method of generating Statistics in combination with a Source Schema (XSD).

In the first step we process the first batch of Source Data:

$ xml2er samples/batch1
…

# schema
origin:  5
job:  5

# statistics
startup:  4107 ms
parse:  6855 ms
xpath stats:  436 ms
doc stats:  4744 ms

xpaths:  325             | map:0.0%/0  new:100.0%/325
documents:  5009         | suc:100.0%/5009  part:0.0%/0  fail:0.0%/0

In the next step we use the Statistics Source ID (origin) to incrementally collect additional Statistics from another batch of Source Data. We feed this ID to the --schema-origin or -x parameter. We also need to specify the --parsemode or -e parameter to enforce Statistics-only mode.

$ xml2er -x5 -e stats samples/batch2

$ xml2er -x5 -e stats samples/batch3

$ xml2er -x5 -e stats samples/batch4

…

This process can be repeated as many times as needed until all relevant Statistics have been collected.

Once we have collected all Statistics, we generate the Data Flow with the --map or -g parameter, providing the level of optimisation, and pass the Statistics ID to the --use-stats (-a) parameter. In this example -g1 applies the Elevate optimisation. We also provide the path to our XSD.

$ xsd2er -a5 -g1 samples/xsds
…

# schema
origin:  5
logical:  3
job:  10

# statistics
startup:  4911 ms
map:  92 ms
xpaths:  127

With the generated Data Flow ID (logical) we can now start converting Source Data.