Creating a Data Flow

This section covers the processes by which Flexter can create a Data Flow.

As outlined in section Overview - Flexter Conversion process, there are three ways to generate a Data Flow.

1. Sample of XML/JSON
2. XSD only
3. Sample of XML + XSD

Irrespective of the type of Data Flow generation we select, we need to define the type of Optimisation. This is done with the -g parameter:

-g, --map MAPPING		Mapping generation and optimization levels:
                                - 0: No optimized mapping
                                - 1: Elevate optimization (1=1)
                                - 2: Reuse optimization  
                                - 3: Elevate + Reuse optimization

-a, --use-stats ID...		Use Statistics to generate new data flow

--name-max-len SIZE		Maximum column name length for mapping generation
                                  default: 30

--default-varchar-len LENGTH    User-defined length of VARCHAR/CLOB datatype for mapping generation
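
As an illustration, the sketch below combines the optimisation level with these mapping options; the specific values and the donut.xsd input are placeholders rather than recommendations:

# Sketch: Elevate + Reuse optimization with custom name and VARCHAR lengths
$ xsd2er -g3 --name-max-len 25 --default-varchar-len 255 donut.xsd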

One important parameter to understand when creating Data Flows is the --parsemode switch. It is applicable when running a Conversion and when collecting Statistics incrementally.

When running a Conversion, you can collect Statistics at the same time. This is useful if you would like to detect changes to your Source Data during the Conversion process. It is also useful when upgrading a Target Schema from one version to the next. By default, Flexter collects Statistics during the Conversion. You can disable this behaviour by setting the --parsemode to [d]ata.

The --parsemode [s]tats is only used for collecting Statistics incrementally, which is described in section Incrementally create Data Flow from Source Data (XML/JSON) of this guide.

Further options are available from version 1.8: using the upper-case letters disables collecting statistics about documents.

 -e, --parsemode MODE            Mode of parsing.
                                - with doc stats:    
                                    [a]ll, [d]ata, [s]tats
                                - without doc stats: 
                                    [A]ll, [D]ata, [S]tats
                                  default: all
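
For instance, to run a Conversion without collecting Statistics, a sketch of the command might look as follows (the full-word mode value mirrors the -e stats examples later in this section; donut.xml is a placeholder):

# Sketch: convert only, skipping Statistics collection
$ xml2er -e data donut.xml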

Create Data Flow with Source Schema (XSD)

Before creating the Data Flow with a Source Schema (XSD), it is recommended to locate the root (the outermost element). Flexter greatly simplifies this process by analysing the Source Schema for possible root elements. To leverage this information, it is advised to use the --skip or -s parameter on first use.

Once the analysis is complete, the skip parameter must be removed so the processed metadata can be saved into the Flexter Metadata Database.

Basic command template:

# Diagnostics 
$ xsd2er -s [options*] INPUTPATH

# Generating the data flow
$ xsd2er -g <0|1|2|3> [options*] INPUTPATH

Basic command example, reading a file and extracting the XSD schema:

# Diagnostics 
$ xsd2er -s donut.xsd

# Generating the data flow
$ xsd2er -g1 donut.xsd

Running the command with the -s switch will show all of the possible root elements in the XSD.

» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.xsd
   {}
 · addcost
 · batter
 · batters
 · filling
 · fillings
 · item
 · items
 · name
 · ppu
 · size
 · sizeslimitedto
 · topping

Given no additional switches, all of these root elements will be saved into the Metadata DB and used to generate Data Flows and the Target Schema. This is the most flexible option, but it may also produce excessive volumes of metadata.

If you know the root element of the XML documents, you can choose one or more with the -r switch:

-r, --root  ROOT...   	Process only requested root elements

An example:

$ xsd2er -s -r items donut.xsd
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.xsd
   {}
 · items

Flexter can also automatically identify the root based on a simple rule: an element that is not referenced by anything else should be the root. In other words, an element without any parents.

This works in most cases, apart from recursive relationships.

-R, --unref-root           Unreferenced elements will be considered roots

Using the -R switch in the example, Flexter has automatically detected that items is an unreferenced element and will use it as the root element.

$ xsd2er -s -R donut.xsd
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.xsd
   {}
 · items

The same rules apply to Source Schemas (XSD) that span multiple files.

Flexter offers some useful switches to work with XSDs with multiple files and multiple namespaces.

-f, --file-root FILE...          Process only requested root schema files

-F, --unref-file-root            Unreferenced files will be considered root files

-N, --with-target NAMESPACE...   Only files with the specified target namespace will be considered
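
As a sketch, these switches might be combined as follows; the schemas/ directory, main.xsd file, and namespace are hypothetical:

# Hypothetical multi-file example: use main.xsd as the root schema file,
# considering only files with the given target namespace
$ xsd2er -s -f main.xsd -N http://example.com/donuts schemas/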

It’s also possible to increase the level of diagnostics, i.e. to print out more details on the XPaths.

-X, --xpath                      Print processed XPaths

-z, --xpath-full                 Print processed XPaths including attributes

$ xsd2er -s -R -X donut.xsd
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.xsd
   {}
 · items

17:42:17.704 INFO  Building metadata
/items
/items/item
/items/item/name
/items/item/ppu
/items/item/batters
/items/item/batters/batter
/items/item/batters/batter/name
/items/item/batters/batter/sizeslimitedto
/items/item/batters/batter/sizeslimitedto/size
/items/item/topping
/items/item/fillings
/items/item/fillings/filling
/items/item/fillings/filling/name
/items/item/fillings/filling/addcost

Flexter also offers flexible ways to handle XSD recursion: --stop-policy defines how Flexter should stop once the number of recursion levels set by --levels reaches its maximum.

--levels LEVEL                  Recursive <TAG> levels treatment:
                               - 0: Disabled, don't stop
                               - <n>: Number of accepted recursions
                              default: 1

--levels-type LEVEL             Recursive <TAG> types levels treatment:
                               - 0: Disabled, don't stop
                               - <n>: Number of accepted recursions
                              default: 2

--stop-policy POLICY            The stopping policy applied in case of a recursion:
                               - [u]nlimited: Unlimited, keep forward until the end or the recursive levels limit
                               - [s]top or 0: Stop immediately
                               - +: Keep forward only with one-to-one parent/child relationships
                               - +<n>: Keep forward only with one-to-one parent/child relationships, up to N child levels
                               - <n>: Skip N child levels before stopping
                               - <n>+: Skip N child levels before keeping forward with one-to-one only
                               - <n>+<n>: Skip N child levels before keeping forward with one-to-one only, up to N child levels
                              default: +
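
As an illustrative sketch, the command below accepts up to three recursions and, once the limit is reached, skips one child level before continuing with one-to-one relationships only:

# Sketch: allow 3 recursions, then skip 1 child level
# before keeping forward with one-to-one relationships only
$ xsd2er -s --levels 3 --stop-policy 1+ donut.xsd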

With the default setup (--levels 1, i.e. one accepted recursion), the second occurrence of a recursion triggers the --stop-policy +.

In this case, Flexter ignores the tag's children that have a one-to-many relationship and keeps going for the one-to-one ones.

If another recursion appears, even with one-to-one relationships, it stops immediately.

Now, by removing the --skip or -s parameter, the following command will write the metadata:

$ xsd2er -R -g1 donut.xsd
…
» SCHEMA {NAMESPACE} · ELEMENT
------------------------------
» donut.xsd
   {}
 · items

20:00:35.986  INFO  Building metadata
20:00:36.064  INFO  Writing metadata
20:00:36.120  INFO  Generating the mapping: elevate,reuse
20:00:36.210  INFO  Registering success of job 3
20:00:36.225  INFO  Finished successfully in 1058 milliseconds

# schema
    origin:  3
   logical:  1
       job:  3

# statistics
       load:  447 ms
      parse:  365 ms
      build:  83 ms
      write:  56 ms
        map:  92 ms
     xpaths:  16

At the end of the successful XSD analysis process, Flexter prints out the ID of the Data Flow (logical). In the example above, the ID of the Data Flow is 1.

This should be noted and used later in your Conversion task.
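
As a purely hypothetical sketch, a later Conversion might reference this ID via a switch such as -l (an assumption on our part; consult the Conversion documentation for the actual parameter):

# Hypothetical: convert Source Data using Data Flow ID 1
$ xml2er -l1 donut.xml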

Create Data Flow with Source Data (XML or JSON)

You can process XML and JSON documents without a Source Schema (XSD files).

However, this requires choosing a representative sample from your XML/JSON documents. With this representative sample we collect Statistics before creating the Data Flow.

The Source Data can be located on different types of Source Connections, e.g. a file system such as HDFS or an FTP server. For a full description of Source Connections please refer to section Source Connections in this guide.

We can simulate the process with the --skip or -s parameter and remove it when we are ready to run the command for real.

Basic command template:

# Diagnostics 
$ xml2er|json2er -s [options*] INPUTPATH

# Generating the data flow
$ xml2er|json2er -g <0|1> [options*] INPUTPATH

Basic command example, reading a file and extracting the statistics:

# Diagnostics 
$ xml2er -s donut.xml

# Generating the data flow with statistics
$ xml2er -g1 donut.xml

Let’s run the command without the skip parameter:

$ xml2er -g1 donut.xml
…

# schema
     origin:  4
    logical:  2
        job:  4

# statistics
    startup:  4107 ms
      parse:  6855 ms
xpath stats:  436 ms
  doc stats:  4744 ms

     xpaths:  19           | map:0.0%/0  new:100.0%/19
  documents:  1            | suc:100.0%/1  part:0.0%/0  fail:0.0%/0

This command collects Statistics (origin ID 4). At the same time it also creates a Data Flow (logical ID 2).

We can also create the Data Flow in two separate steps. We first collect Statistics and then generate the Data Flow. This is useful if we want to collect Statistics incrementally.

Incrementally create Data Flow from Source Data (XML/JSON)

If the sample of your Source Data is too big to be processed in one iteration, you can split it into smaller batches and generate Statistics incrementally.

In a first step we process the first batch of Source Data. This gives us the ID for the Source Data Statistics (origin ID 5).

$ xml2er batch1/
…

# schema
     origin:  5
        job:  5

# statistics
    startup:  4107 ms
      parse:  6855 ms
xpath stats:  436 ms
  doc stats:  4744 ms

     xpaths:  325          | map:0.0%/0  new:100.0%/325
  documents:  5009         | suc:100.0%/5009  part:0.0%/0  fail:0.0%/0

We can now incrementally update the generated Source ID (origin) and its statistics with another batch of Source Data. This can be done by feeding the Source Statistics ID (origin) to the --schema-origin or -x parameter. We also need to supply the --parsemode or -e parameter to enforce Statistics-only processing. This command updates the Source Statistics for ID 5 (origin 5).

$ xml2er -x5 -e stats batch2/

$ xml2er -x5 -e stats batch3/

$ xml2er -x5 -e stats batch4/

…

After all the Statistics have been collected, the Data Flow can be generated using the --map or -g parameter. Together with the -g parameter we provide the level of optimization. This generates the Data Flow ID (logical 3).

$ xml2er -a5 -g1
…

# schema
     origin: 5
    logical: 3
        job: 10

# statistics
     startup:  4911 ms
         map:  92 ms

We can now use the Data Flow ID (logical) to convert our Source Data. In our example the Data Flow ID is 3.

Create Data Flow with Source Schema (XSD) and Source Data (XML)

In a first step we need to collect Statistics from the Source Data.

Basic command template:

# Diagnostics statistics
$ xml2er -s [options*] INPUTPATH

# Diagnostics schema
$ xsd2er -s -a <origin> [options*] INPUTPATH

# Generating the statistics
$ xml2er [options*] INPUTPATH

# Generating the data flow
$ xsd2er -g1 -a <origin> [options*] INPUTPATH

Basic command example, extracting the statistics and reading the schema file:

# Generating the statistics
$ xml2er donut.xml

# schema
     origin:  4
        job:  4

# statistics
    startup:  4107 ms
      parse:  6855 ms
xpath stats:  436 ms
  doc stats:  4744 ms

     xpaths:  19           | map:0.0%/0  new:100.0%/19
  documents:  1            | suc:100.0%/1  part:0.0%/0  fail:0.0%/0

We have generated a new set of Statistics (origin 4). We are now ready to generate the new Data Flow with the xsd2er module. We need to use the use-stats parameter and also the mapping parameter to generate the new Data Flow using both an XSD and Statistics for a Source ID (origin 4).

-a, --use-stats ID...       Use Statistics to generate new data flow

# Diagnostics schema
$ xsd2er -s -a4 donut.xsd

# Generating the data flow with statistics and schema
$ xsd2er -a4 -g1 donut.xsd
…
# schema
     origin:  7
    logical:  3
        job:  6

# statistics
       load:  1608 ms
      stats:  40 ms
      parse:  368 ms
      build:  121 ms
      write:  47 ms
        map:  128 ms
     xpaths:  16

Now, with a generated Data Flow ID (logical 3), the data conversion process can begin.

Incrementally create Data Flow from Source Schema (XSD) and Source Data (XML)

We can also use the incremental method of generating Statistics in combination with a Source Schema (XSD).

In a first step we process the first batch of Source Data:

$ xml2er samples/batch1
…

# schema
     origin:  5
        job:  5

# statistics
    startup:  4107 ms
      parse:  6855 ms
xpath stats:  436 ms
  doc stats:  4744 ms

     xpaths:  325             | map:0.0%/0  new:100.0%/325
  documents:  5009            | suc:100.0%/5009  part:0.0%/0  fail:0.0%/0

In a next step, we use the Statistics Source ID (origin) to incrementally collect additional Statistics from another batch of Source Data. We use the --schema-origin or, for short, the -x parameter to collect additional Statistics. We also need to specify the --parsemode or -e parameter to enforce the Statistics-only mode.

$ xml2er -x5 -e stats batch2/

$ xml2er -x5 -e stats batch3/

$ xml2er -x5 -e stats batch4/

…

This process can be repeated as many times as needed until all relevant Statistics have been collected.

Once we have collected all Statistics, we generate the Data Flow with the --map or -g parameter, providing the type of optimization. In this example -g1 applies the Elevate optimisation. We also provide the path to our XSD.

$ xsd2er -a5 -g1 samples/xsds
…

# schema
     origin:  5
    logical:  3
        job:  10

# statistics
    startup:  4911 ms
        map:  92 ms
     xpaths:  127

With the generated Data Flow ID (logical) we can now start converting Source Data.