Conversion Process
The Conversion process with Flexter is simple and happens in two phases. You first create a Data Flow and then reuse this Data Flow in any number of Conversion tasks.
If your Source changes, e.g. a new version of your Source Schema is released or you add new XML elements to your Source Data, you may want to create a new Data Flow and Target Format to reflect these changes.
When you create a Data Flow, Flexter collects Statistics (optional) and creates the Target Schema and the mapping between Source and Target Schema.
Once the Data Flow has been created, you can use it any number of times to convert Source Data. Each conversion runs as a Conversion task.
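The "create once, convert many times" idea can be pictured with a small Python toy model. This is only a sketch and not how Flexter works internally: the `DataFlow` class, its `mapping` field, the `convert` function, the origin ID value, and the sample documents are all hypothetical names and data chosen for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DataFlow:
    """Hypothetical stand-in for a Data Flow: an ID plus a source-to-target mapping."""
    origin_id: int
    mapping: Dict[str, str]  # element path relative to the root -> target column


def convert(flow: DataFlow, xml_text: str) -> Dict[str, Optional[str]]:
    """Apply the stored mapping to one source document and return one target row."""
    root = ET.fromstring(xml_text)
    # findtext returns None when the path is absent from the document.
    return {column: root.findtext(path) for path, column in flow.mapping.items()}


# Phase 1: create the Data Flow once (illustrative values only).
flow = DataFlow(
    origin_id=1,
    mapping={"customer/name": "customer_name", "total": "order_total"},
)

# Phase 2: reuse the same Data Flow for as many documents as needed.
documents = [
    "<order><customer><name>Ada</name></customer><total>10.50</total></order>",
    "<order><customer><name>Grace</name></customer><total>7.00</total></order>",
]
rows = [convert(flow, doc) for doc in documents]
```

The point of the sketch is that the mapping is built a single time and each conversion only applies it, which is why a Data Flow can be reused until the Source changes.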
Depending on the type of Source available (XML, JSON, XSD), the steps to create a Data Flow vary. A Data Flow can be created from:
- Source Schema (XSD)
- Source Data (XML or JSON)
- Source Schema (XSD) and Source Data (XML)
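The three options and the steps each one implies can be summarised as a small decision helper. This is a hedged sketch of the rules described on this page, not part of Flexter: the function name `data_flow_steps`, its parameters, and the step labels are assumptions made for illustration.

```python
from typing import List, Optional


def data_flow_steps(has_xsd: bool, data_format: Optional[str]) -> List[str]:
    """Return the steps needed to create a Data Flow for the available Sources.

    data_format is None (no Source Data), "xml" or "json".
    """
    if data_format not in (None, "xml", "json"):
        raise ValueError(f"unsupported Source Data format: {data_format!r}")
    if not has_xsd and data_format is None:
        raise ValueError("at least one Source (XSD, XML or JSON) is required")
    if has_xsd and data_format == "json":
        # Combining a Source Schema with Source Data is only described for XML.
        raise ValueError("an XSD cannot be combined with JSON Source Data")

    if has_xsd and data_format is None:
        # Source Schema only: the Data Flow is created in a single step.
        return ["create Data Flow from Source Schema"]
    if not has_xsd:
        # Source Data only: Statistics first, then the Data Flow.
        return ["collect Statistics", "create Data Flow from Statistics"]
    # Source Schema and XML Source Data.
    return ["collect Statistics", "create Data Flow from Statistics and Source Schema"]
```

For example, `data_flow_steps(False, "json")` returns the two-step path, because JSON documents can only be processed without a Source Schema.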
Source Schema (XSD) available
Using a Source Schema to create a Data Flow has some advantages and disadvantages:
Pros:
- The Target Schema you generate from an XSD is comprehensive. Any XML document that conforms to the XSD can be processed. All possible data points from the input are mapped to the output, with their cardinality and datatypes known and constant. Note that this holds only as long as the Source Data fully conforms to the XSD definitions used.
- The Target Schema can be built without having to rely on a representative sample of XML documents or even the complete set of XML documents. This makes it much faster to generate the Target Schema.
- Flexter can optimise the output by analysing shared types and consolidating each shared type into a single table in the Target Schema (Reuse Optimisation). This simplifies the Target Schema.
Cons:
- XSDs that are used in industry data standards typically model a huge number of business processes. In reality a company typically only implements a subset of the business processes modeled in the XSD. As a result the Target Schema contains many tables and columns that are not used or populated. They are redundant. This makes the Target Schema unnecessarily complex and difficult to work with.
- It is not ideal for deeply recursive XPaths. Such cases require more memory to build and store as metadata.
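The effect of Reuse Optimisation listed under the Pros above can be pictured with a toy example. The sketch below is not Flexter's algorithm. It assumes a simplified view in which each element path is tagged with the name of its XSD type and one table is created per shared type; the paths, the type names, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical element paths and the XSD types they are declared with.
element_types = {
    "/Order/BillingAddress": "AddressType",
    "/Order/ShippingAddress": "AddressType",
    "/Order/Item": "ItemType",
}


def tables_without_reuse(types: Dict[str, str]) -> List[str]:
    """One target table per element path."""
    return sorted(types)


def tables_with_reuse(types: Dict[str, str]) -> Dict[str, List[str]]:
    """One target table per type, shared by every path declared with that type."""
    tables: Dict[str, List[str]] = defaultdict(list)
    for path, type_name in types.items():
        tables[type_name].append(path)
    return {type_name: sorted(paths) for type_name, paths in tables.items()}
```

In this toy case the two address elements share a single `AddressType` table, so the schema shrinks from three tables to two. The saving grows with the number of elements that reference the same type.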
You create a Data Flow from a Source Schema (XSD) in a single step: one command with mandatory and optional parameters. Flexter automatically creates the Target Schema and the mapping for you. The output is an Origin ID (Data Flow) that is used as input to a Conversion task.
A Conversion task is a single command with optional and mandatory parameters. A successful Conversion task loads the Source Data into the Target Format, e.g. from an XML archive on an FTP server to a Redshift database.
No Source Schema available
It is possible to convert XML documents without a Source Schema (XSD). For JSON documents this is the only option.
You need to provide a representative sample of XML/JSON documents for this to work best. Alternatively, you can provide all of the available XML or JSON documents. Flexter collects information from the data sample and stores it in the Metadata DB.
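To give a feel for the kind of information that can be gathered from a data sample, the sketch below counts how often each element and attribute path occurs across a few XML documents, using Python's standard `xml.etree.ElementTree`. It is an illustrative assumption about what such statistics might look like, not a description of what Flexter actually stores in the Metadata DB. The function name `collect_path_counts` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Optional


def collect_path_counts(xml_text: str, counts: Optional[Counter] = None) -> Counter:
    """Add the element and attribute paths of one document to a running count."""
    counts = Counter() if counts is None else counts
    root = ET.fromstring(xml_text)

    def walk(elem: ET.Element, path: str) -> None:
        counts[path] += 1
        for attr in elem.attrib:
            counts[f"{path}/@{attr}"] += 1
        for child in elem:
            walk(child, f"{path}/{child.tag}")

    walk(root, f"/{root.tag}")
    return counts


# Passing the same Counter back in accumulates counts across documents.
samples = [
    "<order id='1'><item/><item/></order>",
    "<order><item/><note/></order>",
]
stats = Counter()
for sample in samples:
    collect_path_counts(sample, stats)
```

A path that never appears in the sample, such as an optional element nobody uses, simply has no entry. This is why the sample needs to be representative: anything missing from it is also missing from the resulting Target Schema.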
Here is a list of advantages and disadvantages.
Pros:
- It makes XML processing possible when you don’t have an XSD.
- It makes it possible to build a Target Schema for JSON files.
- It generates a set of lightweight metadata: only what is really present in the Source Data is reflected.
- It may offer an alternative solution for scenarios where the XSD and XML samples don’t match, or have different versions.
- It’s ideal for recursive XPaths.
Cons:
- Reuse Optimisation is not applicable.
- It requires collecting statistics over XML/JSON data before it can be translated into a Target Schema. If you can’t easily derive a representative sample of XML files you will require the full set of Source Data to collect Statistics. This may be time consuming.
With Source Data as your only Source you create a Data Flow in two steps. You first collect Statistics from the Source Data. This step can be repeated any number of times to collect Statistics incrementally. Using the collected Statistics you create the Data Flow in a second step. Next you run a Conversion task that uses the Data Flow.
Source Schema and Source Data available
Generating a Target Schema from both a Source Schema and Source Data lets you combine the information Flexter infers from each Source. This approach tends to give you the best results. This option is not available for JSON documents.
Pros:
- It generates the most optimised logical schema, based on information from both Sources.
- The Target Schema is simplified. You can augment the information from the XSD with the intelligence collected from the Source Data documents, e.g. only generating the Data Points in the Target Schema that are in use.
- It lets you process recursive XPaths found in XSD definitions with Statistics.
- Reuse Optimisation is applicable.
Cons:
- It does not work with JSON documents.
- It requires collecting Statistics over XML data before it can be translated into a Target Schema. If you can’t easily derive a representative sample of XML files you will require the full set of Source Data. This may be time consuming.
With Source Data and a Source Schema you create a Data Flow in two steps. You first collect Statistics from the Source Data. This step can be repeated any number of times. Using the collected Statistics and the Source Schema you create the Data Flow in a second step. Next you run a Conversion task that uses the Data Flow.
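The combination of both Sources can be sketched as follows: Statistics from several collection runs are merged, and the paths declared in the Source Schema are then reduced to those actually observed in the Source Data. This is a simplified illustration under stated assumptions, not Flexter's implementation. The function names `merge_statistics` and `paths_in_use`, the declared paths, and the counts are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def merge_statistics(batches: Iterable[Counter]) -> Counter:
    """Combine per-run path counts into one running total (incremental collection)."""
    total: Counter = Counter()
    for batch in batches:
        total.update(batch)  # Counter.update adds counts rather than replacing them
    return total


def paths_in_use(declared_paths: Set[str], statistics: Counter) -> List[str]:
    """Keep only the schema-declared paths that were observed at least once."""
    return sorted(path for path in declared_paths if statistics[path] > 0)


# Paths the (hypothetical) XSD allows.
declared = {"/order", "/order/item", "/order/note", "/order/customsDeclaration"}

# Counts from two separate (hypothetical) Statistics collection runs.
run_1 = Counter({"/order": 10, "/order/item": 25})
run_2 = Counter({"/order": 5, "/order/item": 9, "/order/note": 2})

statistics = merge_statistics([run_1, run_2])
target_paths = paths_in_use(declared, statistics)
```

Here `/order/customsDeclaration` is allowed by the schema but never appears in the data, so it is left out of the target. The schema still supplies the datatypes and shared types for the paths that remain.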