Managing duplicate records


Have I blogged about this before?

No matter – I’ll re-do it again, redundantly. We are re-visiting this topic because we are looking at applying this pattern to the new NSL database. The strength of the pattern is that it can be applied to an existing data set without too much trouble.

One of the issues in AFD some time ago was duplicate records for publications lying about in the database. Having duplicate records means that when an editor searches for a publication they get a list of identical records and don’t know which one to use, and it makes it difficult to produce such basic things as “what AFD names were published in this paper?”

The AFD people wanted a fix for this. They wanted to be able to resolve duplicates, and they also wanted to be able to undo. Preferably at any level, in any sequence.

For my part, I needed a solution which could be back-fitted to the existing data and not break the various things external to the AFD which relied on the data.

So, after a brief complaint that what they were asking was impossible and a walk through the gardens, the de-duplication algorithm was born. It’s a way of re-engineering a database to allow duplicate records to be merged, without disrupting too much else.


First, it’s obvious that in order to make the thing work without disrupting everything else, the publication_ids on the existing records must get updated. Alternatives involving joining to another table (containing the publication id you are supposed to be using) are not acceptable – too much work to fix all the things affected. Another possibility would be to rename the affected tables and create a view with the old name that would compute correct values for the fields. This struck me as, again, way too much work and load.

However, the business of wanting to make the de-duplication undoable at any level would seem to mean that we have to keep some sort of journal for each record using a publication id. But actually, this is not the case. All you need to do is keep track of the publication id prior to any de-duplication activity. The question of what gets put where is all implied in the “I got marked as a duplicate” field in the publication record itself.

This does mean that changes to the publication id not managed by the de-duplication framework are not journaled … but so what if they aren’t? If a user corrects a record then, well, it is what it is.

It also means that redundant data is stored. If publication A is merged into B, then everything that initially used A will now use B. Since we are storing A, and the B we need to use could be computed from A and the merge history over publications, the stored B is redundant. My decision about this was: so what, I don’t care.

Rather than keeping the original, manually assigned publication id in the table that it originally came from, a possibility would be to drop it into another table. I chose not to do this because writing a Hibernate mapping for another table would be a drag, and because one column of that table would have to be the name of the table and column that owns the id. We would have to store “that publication[1234].parent_publication_id was originally 5678”. This means we are modelling a database in a database, an antipattern, and it means that we can’t put foreign key constraints on that source_id column – 1234 might be a publication id or it might be a reference id in our particular case.

I elected not to be excessively clever with publications containing sub-items (chapters in books, papers in journals). If a book in our data has chapters 1 and 2, and a duplicate record for the book has chapters 2 and 3, then when the duplicates are merged, the book will have two chapter 2 records. But fixing this automatically is difficult – the algorithm will either be cautious to the point of seldom actually doing anything, or it will merge things that oughtn’t be merged. My decision was that if a user is working in that space (de-duplicating records), then they’d be reviewing the results and could see the duplicate chapter anyway.
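If you do want to catch these cases after the fact, a simple report does the job. A minimal sketch, assuming a publication table with parent_id and title columns (those names are illustrative, not the actual AFD schema):

-- parents that ended up with more than one child of the same title
select parent_id, title, count(*)
from publication
where parent_id is not null
group by parent_id, title
having count(*) > 1;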

Finally, we do have business rules needing to be respected. A publication of type ‘paper’ must appear as a child record of a publication of type ‘journal’. But the rule “you may only merge publications with the same type” won’t do, because occasionally the problem is that something that should have been put in as a paper was also put in as a monograph. So the rule was relaxed to “must have the same type, or have no child records”.
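A guard for that relaxed rule might look something like this – a sketch only, with 'A' and 'B' as placeholder ids, and the publication_type and parent_id column names are assumptions:

-- may B be merged into A? same type, or B has no child records
select (a.publication_type = b.publication_type)
       or not exists (select 1 from publication c where c.parent_id = b.id)
       as merge_allowed
from publication a, publication b
where a.id = 'A' and b.id = 'B';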


So: here’s what you actually do.

Database changes.

Pick an entity that might have duplicates needing to be managed. Let’s say Author.

Whip through the database and add a second field everywhere there’s a foreign key, which will contain “the id that was assigned prior to any de-duplication”. In AFD, I prepended the field name with “original”. Now, Name has an author and a base author, and Reference also has an author, so we are talking about adding original_author_id and original_base_author_id to Name, and original_author_id to Reference.

(on reflection, maybe I ought to have stuck the “original” in just in front of the _ID, to keep the fields together)

I will speak of these pairs as each having an ‘original’ value and a ‘current’ value.

Finally, add a ‘duplicate_of_id’ to the Author table. I will speak of author records as being “current” or “duplicate” according to whether or not this field is null.
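In PostgreSQL-flavoured DDL – a sketch only, and the bigint type and exact table names (name, reference, author) are assumptions about the schema – the changes look like this:

-- shadow "original" column for every existing author foreign key
alter table name      add column original_author_id      bigint references author(id);
alter table name      add column original_base_author_id bigint references author(id);
alter table reference add column original_author_id      bigint references author(id);

-- the field that marks an author record as a duplicate
alter table author    add column duplicate_of_id         bigint references author(id);

-- back-fill: before any de-duplication, original and current are identical
update name      set original_author_id      = author_id,
                     original_base_author_id = base_author_id;
update reference set original_author_id      = author_id;

Nullable columns and a one-off back-fill are what let this bolt onto an existing dataset.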

Changes to existing code.

The iron rule of this pattern is: a current id never points at a duplicate record.

Any operation that assigns an author to an author_id or base_author_id field must set (or clear) both the current and the original value fields. These operations must never assign the ids of duplicate records. In practice this means that the picklists that you display to users must not be populated with duplicate records, but that’s almost the entire point of the exercise anyway, so it’s all good. (The only time the user sees the old, duplicate records is when they are engaged in the task of managing duplicates.)
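Concretely, an edit screen assigning author ‘X’ to name ‘N’ would do something like this (a sketch; the ids are placeholders):

-- current and original always move together for un-managed edits
update name
set author_id = 'X',
    original_author_id = 'X'
where id = 'N';

-- and the author picklist only ever offers current records
select id
from author
where duplicate_of_id is null;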

If that doesn’t seem like much – well, that is the point of doing it this way. You don’t need to hack up too much existing code.

Deletion of Author records will be discussed further on.

The merge operation

The merge operation may be performed over any two current Authors (subject to other business rules, which may require a little thought). This ensures that the ‘duplicate_of_id’ graph is acyclic.

As to the question of which way to do it, well by definition it shouldn’t matter, so if we are not merging as directed by the user, then make the newer record the duplicate – this means that older identifiers are preserved and makes the set of identifiers more stable over time (I might have stolen that idea from a related field of information management).

To mark an Author record B as being a newer duplicate of some other record A,

– update the current ids on each of those id pairs to A wherever it is currently B
– set Author B’s duplicate_of_id to A.
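In SQL, with ‘A’ and ‘B’ as placeholder ids, the merge is nothing more than this, with the first part repeated for each id pair in the schema:

-- repoint every current id that is B to A; the originals are untouched
update name      set author_id      = 'A' where author_id      = 'B';
update name      set base_author_id = 'A' where base_author_id = 'B';
update reference set author_id      = 'A' where author_id      = 'B';

-- and mark B itself as a duplicate of A
update author set duplicate_of_id = 'A' where id = 'B';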

“But!” (you may ask) “What about the case where C got merged into B previously?” Well, the current id of all relevant records will be B. Not a problem. The original ids of those pairs will be C, but we are not touching that.

The undo operation

The undo operation requires a tree walk. However, it will probably not be a very deep or large tree walk, unless you have a system that (for instance) merges large datasets together daily and automatically de-duplicates identifiers, with those de-duplications occasionally needing to be corrected by hand. Even then – if my MacBook running Postgres can handle it, then whatever you are using should be able to handle it.

To undo the merge of (say) F into E
– find all authors that were merged into F (including F itself) with a treewalk
– then update the current id of every pair whose original id is in that set
– set the duplicate_of_id of F to null

The results of the tree walk can be put into a transaction-scoped temporary table, or you can simply stuff them into a subquery. Thus:

-- repoint every pair whose original author is F, or anything that was
-- merged into F, back at F (the current value will have been E)
update name
set author_id = 'F'
where original_author_id in (
  with recursive treewalk(id) as (
      select 'F' as id
    union all
      select author.id
      from treewalk
        join author on author.duplicate_of_id = treewalk.id
  )
  select id from treewalk
);

Once for each id pair, wherever it may occur in the schema.
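And the final step of the undo – the only row of Author that changes – is:

-- F stops being a duplicate and becomes a current record again
update author set duplicate_of_id = null where id = 'F';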

“But!”, you may protest, “this won’t work! It will miss stuff!”. I leave it as an exercise for the reader to satisfy themselves that it works perfectly fine, and is awesome.

Deleting a duplicate record

To delete an author that is a duplicate record – let’s say record R, which is a duplicate of earlier record Q:

– for every pair that has an original id of R, change its original id to Q
– for every author that is marked as a duplicate of R, change it to be marked as a duplicate of Q
– delete R
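As SQL, again with placeholder ids, and with the first part repeated for each id pair:

-- originals that point at R move back one step, to Q
update name      set original_author_id      = 'Q' where original_author_id      = 'R';
update name      set original_base_author_id = 'Q' where original_base_author_id = 'R';
update reference set original_author_id      = 'Q' where original_author_id      = 'R';

-- anything marked as a duplicate of R becomes a duplicate of Q
update author set duplicate_of_id = 'Q' where duplicate_of_id = 'R';

-- and R itself can go
delete from author where id = 'R';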

“But!”, you may protest, “What about those fields whose current id is R? Won’t the foreign key whinge when its referent disappears?”

No current id points to R: it is marked as a duplicate, and all current ids pointing at it were moved on when it was so marked.

Deleting a record that is not a duplicate

This means deleting not only the record (which we will call record L), but all records marked as duplicates of L. Other possibilities (reducible to other operations) are:

– undo the “mark as duplicate of L” for all records that are duplicates of L, then continue
– mark L as a duplicate of some other record, then delete it as a duplicate record

After the “are you sure?” dialog, the steps are these.

– set to null all id pairs whose current id is ‘L’
– delete L and all Authors in its duplicate-of tree
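A sketch of those two steps, reusing the same treewalk as the undo operation:

-- null out every id pair whose current id is L (once per pair)
update name      set author_id = null,      original_author_id = null
  where author_id = 'L';
update name      set base_author_id = null, original_base_author_id = null
  where base_author_id = 'L';
update reference set author_id = null,      original_author_id = null
  where author_id = 'L';

-- then delete L together with its whole duplicate-of tree
with recursive treewalk(id) as (
    select 'L' as id
  union all
    select author.id
    from treewalk
      join author on author.duplicate_of_id = treewalk.id
)
delete from author where id in (select id from treewalk);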

The interesting thing here is that if you treat duplicate_of_id as itself being one of those id pairs, then the ‘mark as a duplicate’ operation will keep the current duplicate_of_id up to date, meaning that you can find all author records in the duplicate-of tree of a current record straight away.

If this is done, then rather than using duplicate_of_id = null for records that are not duplicate records, set it to self. That way, your joins always join. Of course, queries looking to select only current records need to be adjusted.
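With that variant in place, finding everything in a current author’s duplicate-of tree no longer needs a recursive walk at all:

-- every record in X's duplicate-of tree, X itself included
select id
from author
where duplicate_of_id = 'X';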


Semantic Web Implications

Should you use owl:sameAs?

It depends.

sameAs means “These two ids are ids for the same thing.” There are two ways to go.

You can regard your URI as being the id of the author. If so, your data should not expose the duplicate records at all, except to say that their ids are sameAs. This means that you cannot expose the de-duplication history at all. It’s meaningless to say that F is replaced by E if F and E are both names for the same thing.

Alternatively, you can treat the id as being the id of the author record. In this case, the URIs are not same-as and it’s meaningful to expose those de-duplication events as predicates. However, they may not really be of interest to anyone. I suppose they become interesting when a person is querying about the provenance of facts, rather than what the facts say.

Perhaps a correct approach would be by way of reification. Let’s say author T is a duplicate of author S. It might be sensible to declare

:book1 
  :author :S;
  :duplicationInfo [
    a rdf:Statement;
    a :propertyAssignmentThatsAResultOfDeduplication;
    rdf:subject :book1;
    rdf:predicate :author;
    rdf:object :S;
    :originalValue :T;
  ];
  .

We do have dcterms:isReplacedBy, but I do not know offhand which other standard vocabularies particularly support this pattern.
