Synchronization of RDF datasets across same-user machines
This is a brain dump of my ideas to let application-private Tracker databases transparently and easily synchronize their RDF data with compatible databases on other machines/hosts belonging to the same user. As fancy features need fancy names, I've codenamed this project "Emergence".
Discovery and pairing
All outside network activity would be delegated to a brand new daemon. This would run in the user session and make negotiation happen at a randomly picked port, made discoverable through Avahi. Pairing between hosts should be explicit and persistent, and should ideally be location aware (e.g. remembering on what network the connection was established) in order to avoid any conversation whatsoever with potentially spoofing hosts.
When paired, a unique certificate should be created for that host pair, to be used for all data exchange. If this sounds similar to https://gitlab.gnome.org/chergert/bonsai it is no coincidence, although Bonsai implements a client/server architecture, whereas I would like this service to be peer-to-peer and many-to-many. More below on how to reasonably achieve that at the data level.
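For the discovery side, here is a minimal sketch of how the daemon could announce itself; the "_emergence._tcp" service type and the fixed port number are made up for illustration, a real daemon would publish the randomly picked port:

    #include <avahi-client/client.h>
    #include <avahi-client/publish.h>
    #include <avahi-common/simple-watch.h>

    static void
    group_cb (AvahiEntryGroup *group, AvahiEntryGroupState state, void *userdata)
    {
      /* A real daemon would handle name collisions and failures here */
    }

    static void
    client_cb (AvahiClient *client, AvahiClientState state, void *userdata)
    {
      AvahiEntryGroup *group;

      if (state != AVAHI_CLIENT_S_RUNNING)
        return;

      /* Announce the negotiation port on the local network */
      group = avahi_entry_group_new (client, group_cb, NULL);
      avahi_entry_group_add_service (group,
                                     AVAHI_IF_UNSPEC, AVAHI_PROTO_UNSPEC, 0,
                                     "Emergence", "_emergence._tcp",
                                     NULL, NULL, 49152, NULL);
      avahi_entry_group_commit (group);
    }

    int
    main (void)
    {
      AvahiSimplePoll *poll = avahi_simple_poll_new ();
      int error;

      avahi_client_new (avahi_simple_poll_get (poll), 0, client_cb, NULL, &error);
      avahi_simple_poll_loop (poll);
      return 0;
    }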
Initiating the synchronization
In order to make it possible for applications to show feedback, and to avoid database writes and file locks from multiple processes, the synchronization process should be guided by the application. This includes:
- Discovering hosts to synchronize data with.
- Performing the local database updates.
These facilities should be made available through new libtracker-sparql API, optionally used by the app, and initiated at the app's discretion. The database update would block and queue up with other updates, like regular database updates do.
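Purely as a strawman, such API could look like the following; none of these symbols exist today, every name is made up for illustration:

    /* Hypothetical API sketch; no such symbols exist in
     * libtracker-sparql today. */

    /* Registers the canonical D-Bus name used to discover related
     * databases in other hosts. */
    void     tracker_sparql_connection_sync_register    (TrackerSparqlConnection *connection,
                                                         const gchar             *dbus_name);

    /* Starts synchronization with a discovered host; the update blocks
     * and queues up with other database updates. */
    void     tracker_sparql_connection_sync_host_async  (TrackerSparqlConnection *connection,
                                                         const gchar             *host,
                                                         GCancellable            *cancellable,
                                                         GAsyncReadyCallback      callback,
                                                         gpointer                 user_data);

    gboolean tracker_sparql_connection_sync_host_finish (TrackerSparqlConnection *connection,
                                                         GAsyncResult            *res,
                                                         GError                 **error);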
Beneath this machinery, the application's TrackerSparqlConnection will exchange details with the Emergence daemon about how to access the database, either by passing a FD to the database directory to let the daemon open it read-only (i.e. non-blocking for the application), or by opening a temporary D-Bus endpoint and giving the daemon an object path.
The daemon will create a process-local TrackerSparqlConnection to the application database, and proxy it through a TrackerEndpointHttp at a random port. This endpoint should use the certificate created during pairing, and reject any mismatch. The HTTP port of the endpoint will be made known to the other host, where the daemon will in response create a TrackerSparqlConnection for the HTTP port, proxied through a TrackerEndpointDBus to the application; on its end, this endpoint should reject requests from any D-Bus name other than the application's.
The same process would happen the other way around, to let each application/host have a TrackerSparqlConnection to the data in the other host.
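With existing libtracker-sparql API, both halves of this proxying could look roughly like the sketch below; the paths, port and certificate plumbing are elided or assumed:

    #include <libtracker-sparql/tracker-sparql.h>

    /* Export the application database over HTTP, read-only, so the
     * application's own writer is never blocked */
    static TrackerEndpointHttp *
    export_database (GFile *db_dir, GTlsCertificate *pair_cert, guint port, GError **error)
    {
      TrackerSparqlConnection *conn;

      conn = tracker_sparql_connection_new (TRACKER_SPARQL_CONNECTION_FLAGS_READONLY,
                                            db_dir, NULL, NULL, error);
      if (!conn)
        return NULL;

      return tracker_endpoint_http_new (conn, port, pair_cert, NULL, error);
    }

    /* Connect to the HTTP endpoint on the other host, and re-export it
     * to the application over D-Bus */
    static TrackerEndpointDBus *
    import_remote (const gchar *uri, GDBusConnection *dbus, const gchar *path, GError **error)
    {
      TrackerSparqlConnection *remote;

      remote = tracker_sparql_connection_remote_new (uri);
      return tracker_endpoint_dbus_new (remote, dbus, path, NULL, error);
    }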
The synchronization process
The mechanism relies strongly on nrl:modified as it is already present: a sequential transaction counter. This can be relied on as a good marker for additions and modifications of RDF resources and data at a per-database level.
Since databases in other hosts may evolve differently, one addition we need in the base ontology is a way to track the nrl:modified of other hosts, e.g.:
    nrl:RemoteSyncHost a rdfs:Class .

    nrl:remoteHostModified a rdf:Property ;
        rdfs:range xsd:integer ;
        rdfs:domain nrl:RemoteSyncHost ;
        nrl:maxCardinality 1 .
This way, a database can keep track by itself of its synchronization state with other hosts.
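For illustration, recording that state could be as simple as the following update; the host IRI is the one used in the examples below, the value is a placeholder, and step 4.7 of the process below keeps it up to date:

    #include <libtracker-sparql/tracker-sparql.h>

    /* Sketch: record the last known sequence for a paired host */
    static void
    record_sync_state (TrackerSparqlConnection *connection)
    {
      tracker_sparql_connection_update_async (connection,
          "INSERT OR REPLACE { <http://juanito.local> a nrl:RemoteSyncHost ; "
          "                    nrl:remoteHostModified 123 }",
          NULL, NULL, NULL);
    }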
Synchronization is a several-step process that should be performed within a single transaction, so it can simply be rolled back in the case of conflict or failure. This is another reason to let the application lead the process, through libtracker-sparql API.
1. The application registers a canonical D-Bus name in the Emergence daemon. The D-Bus name is used for discovery of related databases between the available hosts.
2. The application receives a signal about a new host being available for synchronization. At a time convenient for the application, synchronization with that host is started.
3. The library initiates the request and obtains a TrackerSparqlConnection piped to the other host. This connection would be temporarily mapped on the local connection through tracker_sparql_connection_map_connection().
4. The library initiates the synchronization steps (a code sketch of steps 4.4 and 4.5 follows this list):
   1. Query on the local database the last known nrl:modified for the remote host, e.g.:

      SELECT ?m { <http://juanito.local> nrl:remoteHostModified ?m }

   2. Query the other host about its latest nrl:modified, e.g.:

      SELECT ?m { SERVICE <private:123> { SELECT (MAX (?mod) AS ?m) { ?any nrl:modified ?mod } } }

   3. Query the other host about all its known nrl:modified for other synchronization hosts, e.g.:

      SELECT ?service ?mod { SERVICE <private:123> { ?service a nrl:RemoteSyncHost ; nrl:remoteHostModified ?mod } }

      This information will also be synchronized between hosts, in order to reduce bandwidth with any of those other hosts in future synchronizations.
   4. Initiate the dump of the RDF data itself, serializing the data from the remote TrackerSparqlConnection to the other via tracker_sparql_statement_serialize_async() with a describe query, e.g.:

      DESCRIBE ?u { ?u nrl:modified ?m . FILTER (?m > ~lastLocalKnownSequence && ?m <= ~lastRemoteSequence) }

      ~lastLocalKnownSequence and ~lastRemoteSequence are respectively obtained in steps 4.1 and 4.2.
   5. Load the RDF data into the local database through tracker_sparql_connection_deserialize_async().
   6. Delete the non-blank-node resources that existed locally previous to the last sync and do not exist in the remote host anymore, e.g.:

      DELETE { ?u a rdfs:Resource } WHERE { { ?u nrl:modified ?m . FILTER (?m <= ~lastRemoteModifiedForLocalHost) . FILTER (!isBlank(?u)) } MINUS { SERVICE <private:123> { ?u a rdfs:Resource . FILTER (!isBlank(?u)) } } }

      ~lastRemoteModifiedForLocalHost is obtained in step 4.3.
   7. Update the local connection with the MAX() of the local/remote modified sequence for each nrl:RemoteSyncHost.
5. The remote connection is closed and unmapped.
6. The application is free to synchronize with another host.
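As a rough illustration of steps 4.4 and 4.5, here is a minimal sketch using the existing serialization API; error handling and transaction rollback are elided, and the choice of Turtle as interchange format is an assumption:

    #include <libtracker-sparql/tracker-sparql.h>

    /* Step 4.5: load the serialized remote RDF data into the local database */
    static void
    on_serialized (GObject *source, GAsyncResult *res, gpointer user_data)
    {
      TrackerSparqlConnection *local = user_data;
      GInputStream *rdf;
      GError *error = NULL;

      rdf = tracker_sparql_statement_serialize_finish (TRACKER_SPARQL_STATEMENT (source),
                                                       res, &error);
      if (!rdf)
        return; /* A real implementation would roll back here */

      tracker_sparql_connection_deserialize_async (local,
                                                   TRACKER_DESERIALIZE_FLAGS_NONE,
                                                   TRACKER_RDF_FORMAT_TURTLE,
                                                   NULL /* default graph */,
                                                   rdf, NULL, NULL, NULL);
    }

    /* Step 4.4: describe everything that changed in the remote host
     * since the last synchronization */
    static void
    pull_changes (TrackerSparqlConnection *local,
                  TrackerSparqlConnection *remote,
                  gint64                   last_local_known,
                  gint64                   last_remote)
    {
      TrackerSparqlStatement *stmt;

      stmt = tracker_sparql_connection_query_statement (remote,
          "DESCRIBE ?u { ?u nrl:modified ?m . "
          "  FILTER (?m > ~lastLocalKnownSequence && ?m <= ~lastRemoteSequence) }",
          NULL, NULL);
      tracker_sparql_statement_bind_int (stmt, "lastLocalKnownSequence", last_local_known);
      tracker_sparql_statement_bind_int (stmt, "lastRemoteSequence", last_remote);
      tracker_sparql_statement_serialize_async (stmt,
                                                TRACKER_SERIALIZE_FLAGS_NONE,
                                                TRACKER_RDF_FORMAT_TURTLE,
                                                NULL, on_serialized, local);
    }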
The fine grain of synchronization
While new RDF resources should always be safe to insert, RDF resources being updated may require some decision on whether it makes sense to fully erase previously existing properties for the resource, whether values should only be replaced, or appended for multi-valued properties. A new TrackerDeserializeFlags set of values might be too coarse; another possibility is extending the ontology definitions so that this behavior is expressed in the individual properties.
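To make the coarseness concrete, a purely hypothetical set of such flags could look like this; only TRACKER_DESERIALIZE_FLAGS_NONE exists today, and flags like these would necessarily apply to the whole update rather than to individual properties:

    /* Hypothetical values; only TRACKER_DESERIALIZE_FLAGS_NONE exists today */
    typedef enum {
      TRACKER_DESERIALIZE_FLAGS_NONE              = 0,
      /* Erase all previously existing properties of updated resources */
      TRACKER_DESERIALIZE_FLAGS_REPLACE_RESOURCES = 1 << 0,
      /* Append incoming values to multi-valued properties */
      TRACKER_DESERIALIZE_FLAGS_APPEND_VALUES     = 1 << 1,
    } TrackerDeserializeFlags;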
Another case not observed here is the situation where an application was updated to a newer ontology on just part of the hosts. There might be cases where an application receives RDF data that it does not yet know about. These cases will at least result safely in failure and rollback.
Achieving many-to-many
While the synchronization steps as proposed here should allow synchronization in one direction to be independent from the other direction, or from the other host synchronizing with a third host, it is probably better in terms of bandwidth to have this synchronization happen in pairs that synchronize bidirectionally (e.g. if hostA->hostB is followed by hostB->hostA, host A would redundantly receive its own data back among the updates from hostB). This might involve a "sync up" point before both hosts proceed to write the data on their own ends.
Another bandwidth reduction technique is making the mapping of recent nrl:modified values to remote hosts part of the synchronized data: if host A synchronizes from host B, which synchronized in the recent past from host C, then when host A synchronizes with host C it will only have to catch up with the changes that happened since host B synchronized. The goal is that a minimal amount of data is in flight, and that synchronization time is proportional to the RDF diff.
Depots
Here's a weird property of RDF and ontologies, with our ontology being part of our RDF data: we could reconstruct a Tracker database from the RDF data without involvement from the application that created it. There could be a class of host that is simply a dummy endpoint creating local databases to collect data from different application databases; this might run on non-desktop machines (e.g. a NAS, or another high-availability home appliance) without the applications to back it up. From the outside, this would be just another synchronization host to connect and pair up with.
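In essence, a depot would be little more than a database plus an endpoint; a minimal sketch, where the storage path, port and the missing pairing certificate are all placeholders:

    #include <libtracker-sparql/tracker-sparql.h>

    int
    main (int argc, char *argv[])
    {
      TrackerSparqlConnection *conn;
      GMainLoop *loop;
      GError *error = NULL;
      GFile *store = g_file_new_for_path ("/var/lib/emergence/depot");
      GFile *ontology = g_file_new_for_path ("/var/lib/emergence/ontology");

      /* Create (or open) the depot database */
      conn = tracker_sparql_connection_new (TRACKER_SPARQL_CONNECTION_FLAGS_NONE,
                                            store, ontology, NULL, &error);
      if (!conn)
        g_error ("Could not create depot database: %s", error->message);

      /* Expose it to paired hosts; a real depot would pass the pairing
       * certificate instead of NULL */
      if (!tracker_endpoint_http_new (conn, 8080, NULL, NULL, &error))
        g_error ("Could not create endpoint: %s", error->message);

      loop = g_main_loop_new (NULL, FALSE);
      g_main_loop_run (loop);
      return 0;
    }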
Goals
The goal is hassle-free synchronization of data between hosts in trusted networks, for applications that use private Tracker databases, optimized for well-behaved RDF data (e.g. resources are mostly named by IRIs, and IRIs are universal to all databases).
This should be care-free: a device that synchronizes in one network (e.g. the home local one) ought to be safe to move to other, potentially unsafe networks (e.g. a conference) without concerns.
This should allow multiple users in the same machine, and multiple physical persons/devices in the network, to each independently and privately synchronize their respective data.
Non-goals
This is not an update mechanism that will work well with blank nodes. If a blank node representing the same thing exists in hosts A and B before synchronization, each will end up with two blank node instances. This is in the nature of blank nodes; likewise, no effort is made to delete blank nodes.
This is not meant for tracker-miners; it is in fact an example of where it will break: file:/// URIs break the assumption that IRIs are universal and refer to the same resource on all synchronized databases.