wiki:TheonCoupler

Generic Sync Framework

The new 3G Data Model supports a generic feed and sync framework. This allows any table (and any subset of its columns) to be synchronized with an external source using standard functions. Adding new sync tables and columns is as simple as configuring the table structure and setting up some parameters to drive the process; all the functions to load and process the feed are then built automatically. Adding completely new feeds into the framework is equally easy: simply create a local cache table and then hook into the generic framework using parameterisation. The actual loading of the remote source data remains feed dependent, although a lot of that could be generalised as well. This page describes how to create and maintain the process.

Making a table ready for Sync

If you want to sync one or more columns in a table from an external feed using the generic sync framework you need to add some extra columns to the table definition as below.

_in_syn INTEGER
_in_src VARCHAR(8)
_in_new DATE
_in_old DATE

The _in_syn column is an integer value indicating whether the record is to be sync'ed or not. By default new entries (created locally) will not be sync'ed (a value of empty or zero) but entries created by feed will have this value set to one in order to preserve the sync'ed state. At any point the flag can be manually changed to empty or zero to take a record out of the sync process or changed to one to add a local record into the sync process. Every sync update to a record increments the value (this is a mechanism used to separate a sync update from other updates).

The _in_src column is an eight character tag string holding the unique identification name for the source feed being used to sync the record. Any new records created by a sync will all be given this same tag name. Only records with a matching tag name (or an empty tag name in the case of manually created previously unsynced records or once synced records) will be updated, reclaimed or deleted by a sync. Any records marked deleted will have the tag name set to empty (this allows another source feed to take over sync'ing the data).

The _in_new column should only be set by the sync process. It will be set to the date the record was first created by the sync when it appeared in the source feed and will never be subsequently altered.

The _in_old column should only be set by the sync process. It will be set to the date the record disappeared from the source feed, thus it represents a deleted record although the record is never actually locally deleted (this is left to a manual local archiving process). Note that a deleted record can be re-instated if it re-appears in the feed which will reset this column to NULL (it will not change the created date). A deleted record can also be re-instated by another feed.
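The lifecycle of these two columns can be sketched as below. This is a minimal illustrative model, not framework code; the function name and record shape are invented for the example.

```python
from datetime import date

def apply_feed_presence(record, in_feed, today=None):
    """Update _in_new/_in_old on a record according to whether its
    sync key currently appears in the source feed."""
    today = today or date.today()
    if in_feed:
        if record["_in_new"] is None:
            record["_in_new"] = today  # first seen: set creation date, never altered again
        record["_in_old"] = None       # present in feed: clear any deletion mark (re-instate)
    else:
        if record["_in_old"] is None:
            record["_in_old"] = today  # disappeared from feed: mark deleted (record is kept)
    return record

r = {"_in_new": None, "_in_old": None}
apply_feed_presence(r, in_feed=True, today=date(2014, 1, 1))   # created by sync
apply_feed_presence(r, in_feed=False, today=date(2014, 2, 1))  # marked deleted
apply_feed_presence(r, in_feed=True, today=date(2014, 3, 1))   # re-instated
print(r)  # _in_new keeps the original creation date; _in_old is back to None
```

Note that re-instatement clears _in_old but never touches _in_new, matching the rule that the creation date is never subsequently altered.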

Then each column in the table that is going to be sync'ed from the source feed should be re-named with a preceding _in_ tag (this is a convention, not a requirement). For example, the name column would become _in_name and the _total column would become _in__total. Not all columns need to be sync'ed; those columns left alone can still be manually maintained, and new local records can still be manually created.

In addition two extra columns need to be added for every column that is sync'ed (i.e. re-named with _in_), except those columns that are the primary sync key or form a foreign sync key (see Keys below).

_in_mod_NAME DATE
_in_src_NAME VARCHAR(8)

Where _NAME is the base name of the column, so in the re-naming example above we would also add

_in_mod_name DATE
_in_src_name VARCHAR(8)
_in_mod__total DATE
_in_src__total VARCHAR(8)

The _in_mod_NAME column is a date used to indicate the last point at which the column value was modified (by either a local user or the feed sync). It is used to construct reports on how long a column has remained locally modified against the feed.

The _in_src_NAME column is a string with the UUN of the local user that last modified the column value, or NULL if last modified from a sync.

The sync process will only alter the value of a column if the corresponding _in_src_NAME is NULL. If _in_src_NAME is not NULL but the value in the sync feed matches the current column value, the sync process will reset _in_src_NAME to NULL. This provides a per-column local override within a record: once the value of a column is locally modified the sync will not alter it until the upstream value is the same, at which point the sync will "reclaim" the column and start sync'ing changes again.
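The override/reclaim rule for a single column can be sketched as a pure function. This is an illustrative model of the behaviour described above, not the framework's actual implementation; the function name and the example UUN are invented.

```python
def sync_column(current, feed_value, src_uun):
    """Decide the new (value, _in_src_NAME) pair for one sync'ed column
    after a sync pass. src_uun is None when the column was last set by sync."""
    if src_uun is None:
        return feed_value, None   # not locally modified: take the feed value
    if feed_value == current:
        return current, None      # upstream has caught up: reclaim the column
    return current, src_uun       # locally overridden: leave value untouched

print(sync_column("x", "y", None))        # ('y', None): normal sync update
print(sync_column("x", "y", "s1234567"))  # ('x', 's1234567'): local override wins
print(sync_column("x", "x", "s1234567"))  # ('x', None): values match, reclaimed
```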

More than one feed can be sync'ed into a table. Each would generally operate within its own key namespace within the table although this is not a requirement. Normally different feeds would create different records and/or manage different columns but they can overlap (although this may result in values flip flopping). Only one feed can "own" a sync'ed record at one time.

  • Correlated Sync: Sometimes you may want columns in a record to be sync'ed from more than one source, however it is not immediately obvious how to achieve this in the framework due to single source ownership of a record. There are two solutions. One is to combine each source into a view and sync from the single view. This can also be used to union sources containing different sets of records that should be treated as a single set. The other is to simply use the same sync tag identification for both (but guaranteeing that both provide an identical key set to avoid deletion flip-flop, which can be achieved by making one the master and the other a slave by defining it as a view with the base feed table left joined to the master).
  • Foreign Key Sync: Sometimes the column value to be managed is a foreign key for which the data source does not have the true value. These can also be maintained (and locally overridden) by using the data source value as a lookup against the table providing the related primary key. They are configured separately (see below). They can be optional (NULL values allowed) or mandatory. If mandatory a failed lookup will result in the whole record not being synced.

Keys

Primary or foreign key columns in the sync data are never used directly as primary or foreign key columns in the sync'ed table. This is so that records can be created outside of the sync without worrying about key namespace clashing. It also means that changes in the local primary key can be made without breaking the association with a record in the feed. To support this all sync'ed tables must be able to auto generate their own unique primary key column values if none are specified, this is not part of the sync framework itself. Foreign key column values in the sync data will be correlated on sync to the correct locally generated primary key column values. When a foreign key is changed in the sync'ed table the sync framework adds a process to cause this to update the sync'ed key column values (as a local override).

A column must be added to the table for each column that comprises the primary key of the sync feed data, all with a preceding _in_ tag (by convention followed by the original column name in the sync feed, although this is not a requirement under the framework). These are used solely by the sync process to match records between the feed and the local table. They are not the actual primary key, as that is auto generated as described above, nor are they end user modifiable. The column must have a unique constraint.

Similarly a column must be added to the table for each column that comprises a foreign key in the sync feed data which is being managed by the sync process (that is, changes to columns comprising a foreign key in the table, as opposed to changes to normal columns). Again these must start with the _in_ tag (by convention followed by the original column name in the sync feed, although this is not a requirement under the framework). These are also used solely by the sync process to look up the real foreign key values associated with the sync values. They are not the actual foreign key, as that will be added by the lookup process, nor are they end user modifiable.

Example

A walkthrough of the above, making the table below ready for sync.

CREATE TABLE test (
  pk SERIAL,
  a INTEGER,
  b INTEGER,
  c VARCHAR(4),
  fk INTEGER NOT NULL
)

The above table has a primary key pk which is auto generated (this could of course be a normal column with triggers and a function to do more sophisticated key generation if necessary), three normal columns a, b and c, and a foreign key column fk which must have a value on record creation. We are going to sync columns a and b and, by necessity, fk (the sync data must have something which can be used to uniquely map to this value). The first step is to add the standard additional sync columns.

ALTER TABLE test ADD COLUMN _in_syn INTEGER
ALTER TABLE test ADD COLUMN _in_src VARCHAR(8)
ALTER TABLE test ADD COLUMN _in_new DATE
ALTER TABLE test ADD COLUMN _in_old DATE

Next we need to add the additional sync columns associated with each normal column being put under sync control.

ALTER TABLE test ADD COLUMN _in_mod_a DATE
ALTER TABLE test ADD COLUMN _in_src_a VARCHAR(8)
ALTER TABLE test ADD COLUMN _in_mod_b DATE
ALTER TABLE test ADD COLUMN _in_src_b VARCHAR(8)

Next we need to add a column for holding the sync source key value(s); in this example it's just one column but it could be any number. These hold the values which are used to tie a record between the sync source and the sync target table (remember that the sync process does not care about the actual primary key value in the target table):

  • if the key value is in the sync source but not in the target table then a record is created from the sync source
  • if the key value is in the target table but not in the sync source then the record is marked as deleted
  • finally, if there is a key value match between the sync source and the target table, each column in the target table is updated with changes from the corresponding column in the sync source

ALTER TABLE test ADD COLUMN _in_spk INTEGER UNIQUE

Here spk is the column in the sync source which is the unique key. It need not be distinct in the sync source if the content has been denormalized from multiple tables, although the combination of this plus the other columns involved in the sync to this particular target table must be distinct, otherwise there will be errors. This column does not need corresponding _in_mod and _in_src columns, despite being a sync'ed column, as it is only there to hold the syncing key value and is not intended to be user modifiable.
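The three matching rules above (create, mark deleted, update) can be sketched as a single pass over the key sets. This is a toy model for illustration only: it keys both sides on the sync key, ignores per-column local overrides, and invents the function name and record shape.

```python
from datetime import date

def sync_pass(source, target, today):
    """One create/delete/update matching pass between a feed (dict keyed
    on the sync key value) and the local table (same keying)."""
    for key, feed_row in source.items():
        if key not in target:
            # key in source but not target: create the record
            target[key] = dict(feed_row, _in_new=today, _in_old=None)
        else:
            # key in both: update sync'ed columns, re-instate if marked deleted
            target[key].update(feed_row)
            target[key]["_in_old"] = None
    for key, row in target.items():
        if key not in source and row["_in_old"] is None:
            row["_in_old"] = today  # key gone from source: mark deleted
    return target

tgt = sync_pass({1: {"a": 10}}, {}, date(2014, 1, 1))  # record created
sync_pass({}, tgt, date(2014, 2, 1))                   # record marked deleted
sync_pass({1: {"a": 11}}, tgt, date(2014, 3, 1))       # record re-instated, updated
print(tgt[1]["a"], tgt[1]["_in_old"])  # 11 None
```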

Finally we need to add a column for holding the value(s) in the sync source that are used to establish a value for the foreign key column, much in the same way as for the primary key. In the example it's only one column but it could be any number. These same columns (either by another sync from the same source, another source, or locally created to match from local data) must be present in the same form in the related primary table and are used to match and find the foreign key column value directly from that table.

ALTER TABLE test ADD COLUMN _in_sfk INTEGER

Key column mapping as above is a special case of general column mapping. This is necessary when the value in the sync source needs to be modified before syncing into a column in the target table. This can be achieved in two ways. If the modification can be achieved by a simple expression then this can be specified as part of the sync configuration for that column. If it is a more complex mapping function then the technique for keys above can be used - that is, add a new column to hold the real value sync'ed between the sync source and target table, and then write custom functions on insert and update to that column so that changes are reflected in the original target table column via an appropriate mapping. Similarly a function must be written on update to the original target table column to map the change back to the actual sync'ed column as a local override.
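For mappings too complex for a configured expression, the alternative noted earlier is to define a view around the feed table and sync from that instead. The sketch below illustrates the idea with a hypothetical view name (testfeed_mapped) that upper-cases gamma before the sync sees it; SQLite is used here purely so the example is self-contained, but the DDL is the same on PostgreSQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The feed table as defined in the example above.
db.execute("CREATE TABLE testfeed (pkey INTEGER, alpha INTEGER, beta INTEGER,"
           " gamma VARCHAR(4), delta VARCHAR(2), key2 INTEGER)")
# A view applying the mapping; the sync configuration would then name
# testfeed_mapped as its source table instead of testfeed.
db.execute("CREATE VIEW testfeed_mapped AS"
           " SELECT pkey, alpha, beta, UPPER(gamma) AS gamma, delta, key2"
           " FROM testfeed")
db.execute("INSERT INTO testfeed VALUES (1, 10, 20, 'abcd', 'xy', 7)")
print(db.execute("SELECT gamma FROM testfeed_mapped").fetchone()[0])  # ABCD
```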

The final table definition for sync'ing now looks like below.

CREATE TABLE test (
  pk SERIAL,
  a INTEGER,
  b INTEGER,
  c VARCHAR(4),
  fk INTEGER NOT NULL,
  _in_syn INTEGER,
  _in_src VARCHAR(8),
  _in_new DATE,
  _in_old DATE,
  _in_mod_a DATE,
  _in_src_a VARCHAR(8),
  _in_mod_b DATE,
  _in_src_b VARCHAR(8),
  _in_spk INTEGER UNIQUE,
  _in_sfk INTEGER
)

Making the Sync Source Table

This table provides an in-database copy of the sync feed data. It can hold data that is sync'ed into one or more target tables and can be de-normalized. In general it is a real table, the content of which is entirely deleted and replaced by new feed content prior to doing a sync operation. It need not be, however: it could be a view based on other tables, hence always reflecting the current data to sync. When the source is external it is likely to be a CSV file or other similar text delimited file, in which case the table is simply constructed with a column to match each field in the input file in the same order. Prior to a sync the content of the table is deleted using DELETE FROM and the content replaced from the most up to date source file using, for example, COPY FROM. Note that none of these steps are part of the generic sync framework.

For a standard incoming feed which is CSV or XLS (as an email attachment) the Incoming.py support script provides everything needed to process the feed and reload the base table as described above. The dedicated feed script simply needs to define the name of the base table, the names of the columns in the base table (that correlate with columns in the source file) and the names of the sync functions that have been auto-created to actually carry out all the dependent sync processes.

Following on from the example above here is an example feed table.

CREATE TABLE testfeed (
  pkey INTEGER,
  alpha INTEGER,
  beta INTEGER,
  gamma VARCHAR(4),
  delta VARCHAR(2),
  key2 INTEGER
)

In the example table above pkey is used to map to _in_spk in the test table and acts as the sync key value, alpha maps to a, beta maps to b, gamma/delta are unused and key2 is mapped to _in_sfk to lookup a value for fk.

Below is an incoming script example that populates the base feed table and triggers the sync.

#!/usr/bin/python
from Incoming import *
table = 'testfeed'
columns = ( 'pkey', 'alpha', 'beta', 'gamma', 'delta', 'key2' )
functions = ( 'test_sync', )
Incoming('TEST', table, columns, functions).processxlsfeedfromemail()

In the example above the source is an XLS file included as an attachment in an email message; the Incoming support module also handles CSV files. Essentially the import populates the columns of the table from the records in the source (fields must match the defined column order without gaps) and then runs one or more sync functions (that push changes to tables dependent on this feed data). The actual sync function (test_sync) is defined automatically as the result of some basic configuration (see next).

Configuration

The generic sync framework needs to know the sync source table to use, the sync target table, the column(s) to sync between each, the primary key column(s) and foreign key columns (and associated lookup tables, columns and primary key column(s)). Each configuration defines one sync target table; one sync source table can supply any number of target tables, in particular when it is bundled de-normalized data.

This sync configuration is currently defined as mark-up in a conduit. This is something of an interim arrangement, eventually it will be defined in the XML schema (this way table and column references can be properly validated and dependencies established automatically). The conduit is processed to create functions and triggers which are then added into the schema and auto-generated DDL in the normal way.

Below is the configuration for the test example above.

#!/usr/bin/pggurgle
%%define TITLE Theon Function for Test Sync
%%include "../includes/3gSyncFunction.grg"
%%equate _fname "test_sync"
%%equate _fsummary "Test Sync"
%%equate _fdescription "Synchronise data from the Test Feed into the Test table."
%%equate _ssourcetag "TEST"
%%equate _ssourcetab "testfeed"
%%equate _ssourcecols "alpha,beta"
%%equate _ssourcekeycols "pkey"
%%equate _stargettab "test"
%%equate _stargetcols "a,b"
%%equate _stargetkeycols "spk"
%%equate _ssourcerealkcols "key2"
%%equate _stargetrealkcols "sfk"
%%equate _stargetrealcols "fk"
%%equate _stargetrealtabs "test_foreign"
%%end

The 3gSyncFunction.grg header file sets up everything to auto-generate a sync function based on the configured information in the remainder of the file. A sync function is created to the same format as any other conduit defined function (and hence integrated into the DDL in the standard way beyond this point).

  • "_fname" is a single value defining the name of the sync function.
  • "_fsummary" and "_fdescription" provide some ancillary documentation about the configured sync process.
  • "_ssourcetab" and "_stargettab" are single values and define the name of the source (feed data) table and target (sync) table.
  • "_ssourcetag" is a single value and is the unique name for this sync process (as used in the _in_src column). Currently used tag names are GEX for the EUGEX source, SSM for the ESSMU (BOXI) source, ADM for the ADMIS (BOXI) source and PAC for the legacy database source. There is also GEX/SSM for combined EUGEX/ESSMU data. If the tag name is prefixed with $ then it will be interpreted as an SQL expression, this allows dynamic tags to be created - these might be time/interval based for example or can also refer to columns in the source data itself.
  • "_ssourcecols" and "_stargetcols" are multiple comma separated values that align; each is the name of a column in the source table or target table that is being sync'ed. In this case "alpha" -> "a" and "beta" -> "b". Note that these are "normal" sync'ed columns only (not columns representing the primary/sync key or any foreign key lookups). Some additional supported syntax is listed below.
    • In source columns "$" can be used to refer to an expression. For example, this can be a literal where the feed table does not provide the requisite data (e.g. $'Fixed Value') or a "simple" SQL expression to map the feed data value to a more appropriate sync table value (e.g. $UPPER(@alpha) would map the value to upper case prior to sync; when doing this it is important to prefix the actual sync column name with "@" so that correlation identifiers get applied correctly). It is not necessary to use "$" to do type conversion; the following is common, for example: alpha::_a_enum to map the raw feed value to an enumerated value. Anything you can do with a configured mapping can be done by defining a "view" around the feed data table instead, and this is necessary for anything much more complex.
    • In target columns by default the name is automatically mapped to _in_X, so in the configuration above "a,b" actually refers to "_in_a,_in_b". This can be escaped by prefixing the column name with "$@". Alternatively the "_in_X" prefix can be explicitly added by prefixing the column name with "$@!". Automatic prefixing can be turned off for the whole configuration by setting "_srawnames" to 1.
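The target column prefixing rules above can be sketched as a small function. This is an illustrative model only; the function name and the raw_names parameter (standing in for the "_srawnames" option) are invented for the example.

```python
def target_column(name, raw_names=False):
    """Map a configured target column name to the real table column,
    following the prefixing/escaping rules described above."""
    if name.startswith("$@!"):
        return "_in_" + name[3:]   # explicit request for the _in_ prefix
    if name.startswith("$@"):
        return name[2:]            # escaped: use the name exactly as given
    return name if raw_names else "_in_" + name  # default: auto prefix

print(target_column("a"))      # _in_a   (automatic prefixing)
print(target_column("$@pk"))   # pk      (escaped, no prefix)
print(target_column("$@!b"))   # _in_b   (explicit prefix)
```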

The remaining configuration options are associated with key handling. Effectively they define the columns (and tables) that:

  • make up the unique sync key for a sync'ed table with a primary key (which itself is auto-generated somehow outside of the sync framework)
  • make up the unique sync key for a sync'ed table based on a foreign key relationship, so the foreign key must have a matching value for the record to be created/updated
  • resolve any other foreign key references necessary to successfully create/update the sync'ed table record (cannot be empty)
  • resolve any foreign key references for managed columns

Note that there is no distinction between primary and foreign key sync; creation of the sync'ed table's real primary key is not part of the sync framework, but for the table to be sync'ed a mechanism must be defined that creates the primary key automatically (either by derivation from other column values or by a simple sequence generator).

In the simplest case where the sync'ed table has a primary key and no foreign keys then:

  • "_ssourcekeycols" and "_stargetkeycols" are a multiple comma separated value defining the columns that comprise the primary sync key columns in the feed data and the sync'ed table (although as noted already this is not the actual primary key of that table as that is independent).

In a more complex case where the sync'ed table has a primary key and also has one or more foreign keys that cannot be left empty on record creation/update then:

  • "_ssourcerealkcols" and "_stargetrealkcols" are a multiple comma separated value defining the columns that comprise each foreign key value, each key itself is separated by ^. For example, a,b^c^d,e defines three keys to be resolved, one made up of columns "a" and "b", the next made up of column "c" and the third made up of columns "d" and "e". Note that it is possible to skip key columns where the "real" value is a literal, for example a,b^^d,e is valid for "_ssourcerealkcols" when "c" does not appear in the feed directly, in which case the definition of "_stargetrealkcols" might be alpha,beta^$'Fixed Value'^delta,echo for example.
  • "_stargetrealcols" is a multiple comma separated value defining the columns in the foreign tables that are looked up based on the key values from the columns above to resolve each foreign key value, each key itself is separated by ^ (aligned with above).
  • "_stargetrealtabs" is a multiple ^ separated value defining the foreign table where each key value is looked up.
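The ^ and comma separated key group syntax above can be parsed as follows. This is a sketch for illustration, not the framework's actual parser.

```python
def parse_key_groups(spec):
    """Split a ^-separated list of comma-separated key column groups,
    as used by _ssourcerealkcols / _stargetrealkcols. An empty group
    (a skipped key column, written ^^) yields an empty list."""
    return [group.split(",") if group else [] for group in spec.split("^")]

print(parse_key_groups("a,b^c^d,e"))  # [['a', 'b'], ['c'], ['d', 'e']]
print(parse_key_groups("a,b^^d,e"))   # [['a', 'b'], [], ['d', 'e']]
```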

In the case where the sync'ed table is based on a foreign key relationship then the key(s) defining the lookup are specified in "_ssourcekeycols" and "_stargetkeycols" (to define the sync key) and the lookup resolution is defined above as for any other foreign key resolution (as well as any other foreign keys that are not part of the sync key but cannot be left empty).

Foreign Key Columns

These are used for columns that are being sync'ed which are (or constitute) foreign keys in the target table and the data source value cannot itself be used (it must be mapped via the table holding the related primary key). For this to be supported the data source values must correlate to one unique primary key. The relation can be optional or mandatory. If optional the target foreign key will need to allow NULL values. If mandatory the lookup must always succeed - if not the whole sync update process for the matching record will not occur. The configuration is as below.

  • "_ssourcelookupcols" is much like "_ssourcecols" but defines columns in the source table that are used to provide a unique lookup key. This is a multiple comma separated value defining the columns that comprise each lookup key, each key itself is separated by ^. See also "_ssourcerealkcols" above which is defined in a similar way. Additional syntax is described above as for "source columns".
  • "_stargetlookupcols" is much like "_stargetcols" but defines columns in the target table that make up the foreign key corresponding to each lookup key in "ssourcelookupcols". This is a multiple comma separated value defining the columns that comprise each foreign key, each key itself is separated by ^. See also "_stargetrealkcols" above which is defined in a similar way. Additional syntax is described above as for "target columns".
  • "_slookuptabs" is a multiple ^ separated value defining the foreign table where each key value is looked up. Lookups are optional by default (i.e. a lookup failure will result in a NULL value for the foreign key to enter in the target table). If the name of a table is prefixed with * then the lookup is made mandatory (a match must be found or the sync will not apply for that record).
  • "_slookupcols" is a multiple comma separated value defining the columns in the foreign tables that are looked up based on the key values from the columns in the source table (defined in "_ssourcelookupcols" above) to resolve each foreign key value, each key itself is separated by ^ (aligned with above).
  • "_slookuprets" is a multiple comma separated value defining the columns in the foreign tables that comprise the primary key to be used as the value for the target table columns (defined in "_stargetlookupcols" above), each key itself is separated by ^ (aligned with above).

An example: suppose the data source has two columns "a" and "b" which are used to look up a foreign key in table "aaa" as column "alpha" and in table "bbb" as column "beta". The primary key columns to return in each case are "ax,ay" (primary key) from "aaa" and "bx,by" (primary key) from "bbb". These are used as the actual foreign key values to maintain the foreign key columns in the target table, called "tax,tay" and "tbx,tby". The lookup on "alpha" is optional but on "beta" is mandatory. The configuration would be as below.

  • _ssourcelookupcols = a^b
  • _stargetlookupcols = tax,tay^tbx,tby
  • _slookuptabs = aaa^*bbb
  • _slookupcols = alpha^beta
  • _slookuprets = ax,ay^bx,by
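The optional versus mandatory lookup behaviour can be sketched with in-memory data. This is a toy model of the resolution described above (tables as lists of dicts, single-column lookup keys); the function name and data are invented, though the column names follow the example.

```python
def resolve_foreign_keys(feed_row, lookups):
    """Resolve each configured lookup for one feed record.
    lookups: list of (source_col, table_rows, lookup_col, return_cols, mandatory).
    Returns resolved foreign key values, or None meaning 'skip this record'
    because a mandatory lookup failed."""
    resolved = {}
    for source_col, rows, lookup_col, return_cols, mandatory in lookups:
        match = next((r for r in rows if r[lookup_col] == feed_row[source_col]), None)
        if match is None:
            if mandatory:
                return None  # mandatory lookup failed: whole record not synced
            resolved.update({c: None for c in return_cols})  # optional: NULL
        else:
            resolved.update({c: match[c] for c in return_cols})
    return resolved

aaa = [{"alpha": "A1", "ax": 1, "ay": 2}]
bbb = [{"beta": "B1", "bx": 3, "by": 4}]
lookups = [("a", aaa, "alpha", ("ax", "ay"), False),   # optional lookup
           ("b", bbb, "beta", ("bx", "by"), True)]     # mandatory lookup
print(resolve_foreign_keys({"a": "A1", "b": "B1"}, lookups))  # all keys resolved
print(resolve_foreign_keys({"a": "A1", "b": "XX"}, lookups))  # None: record skipped
```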

Options

A few general configuration options can be set.

  • TS_FAILSAFE (default 1): When set to 0, disable the check for no data in the sync source table (which otherwise causes immediate drop out). Use this when there can genuinely be no raw source data. Default is enabled.
  • TS_PURGE (default 0): When set to 1, automatically purge deleted records, so instead of being left marked as deleted, records are really deleted. Use when no local information is attached or we don't care about any locally attached information. Use the option TS_PURGE_INTERVAL described below to delay the purge operation until a period of time has elapsed from being marked as deleted. Default is to not purge deleted records at all.
  • TS_PURGE_INTERVAL (default 0): When purging deleted records, the period of time to elapse after the record has been marked as deleted before purging (actually deleting the record). The default is immediate. Use a standard SQL INTERVAL literal here, e.g. "1 MONTH", "6 MONTHS", "1 YEAR".
  • TS_CREATE_IN_SYNC (default 1): When set to 1, new user-created records are created in sync, meaning they will be automatically picked up and taken over by a sync if and when the sync key is entered by the user and matched in the sync data source. This is the default. Set this option to 0 so that new user-created records are created out of sync (unless the in-sync column is also set on creation by the user) and must be manually moved in sync to be automatically updated.
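The interaction of TS_PURGE and TS_PURGE_INTERVAL can be sketched as a simple predicate. This is an illustrative model only (interval as a Python timedelta rather than an SQL INTERVAL literal; the function name is invented).

```python
from datetime import date, timedelta

def should_purge(marked_deleted_on, purge_interval, today):
    """Purge (really delete) a record only once the configured interval
    has elapsed since it was marked deleted (_in_old)."""
    if marked_deleted_on is None:
        return False  # record not marked deleted: never purge
    return today >= marked_deleted_on + purge_interval

print(should_purge(date(2014, 1, 1), timedelta(days=30), date(2014, 1, 15)))  # False
print(should_purge(date(2014, 1, 1), timedelta(days=30), date(2014, 2, 15)))  # True
```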

Deploying a new Sync Process

In summary, you'll need to do the following (see also ChangeManagement?):

  • commit your incoming script to browser:/conduits/incoming
  • commit your function definition to browser:/conduits/legacy/conduits
    • this should be named in the form pg_fbt3g_<name>_sync
  • build a new set of legacy and incoming conduit RPMs
    • these should be added to live/infdb-conduits.h on DICE
  • update the derived schema:
    • update your local conduit rpms to the latest version
    • run schema and push the new generated files with browser:/dev/scripts/schemaget.sh
    • pushing updated schema files allows svn.theon to update the derived schema.
  • Now apply the new derived data to the server. You can do this incrementally using browser:/dev/scripts/diff-schema.sh (and of course ddl-split.awk).

Configured Sync Processes

A list of existing generic framework sync processes can be found on the GenericSyncFramework? page.

Last modified on Jul 28, 2014, 10:34:22 PM