Bug #7124

Fix authority record collisions on import

Added by Dan Gillean almost 8 years ago. Updated almost 7 years ago.

Status:VerifiedStart date:08/19/2014
Priority:MediumDue date:
Assignee:Dan Gillean% Done:

100%

Category:Import/ExportEstimated time:6.00 hours
Target version:Release 2.2.0
Google Code Legacy ID: Tested version:2.0.1, 2.1, 2.1.1
Sponsored:Yes Requires documentation:

Description

Change authority record import routines to check for matches on both name and source repository before linking. Will reduce the number of authority record collisions (e.g. match and merges/overwrites in a multi-repository environment, that should not be happening, because of a shared name).

bacon-authorityrecord-collision-test.csv Magnifier (12.9 KB) Dan Gillean, 02/27/2015 05:36 PM

isad-ead-crosswalk.xml Magnifier (6.44 KB) Dan Gillean, 05/27/2015 07:21 PM

History

#1 Updated by José Raddaoui Marín over 7 years ago

  • Status changed from New to Code Review
  • Assignee changed from José Raddaoui Marín to Jesús García Crespo
  • % Done changed from 0 to 100

Created PR: 43

Now obtains the first actor with the given name and related to descriptions in the repository indicated in the import. If there isn't a repository indicated it checks relations with descriptions without repository. All event and relation types are included.

The change affects actors created/fetched in DIP upload, in CSV import of accessions, events and archival descriptions, and XML import of:

DC: creator, publisher, contributor
EAD: name, persname, corpname, famname (in did/origination and controlaccess)
MODS: name/namePart

#2 Updated by Jesús García Crespo over 7 years ago

  • Assignee changed from Jesús García Crespo to Mike Gale

I think that Mike Gale should review this.

#3 Updated by Mike Gale over 7 years ago

  • Assignee changed from Mike Gale to José Raddaoui Marín

I took a look at the code, Radda. It looks good, I just had two suggestions that I put on Github.

To summarize:
- Move the $this->save() out of createOrFetchActor(). Please see github for details
- I had a suggestion to reduce duplicate code for one of your SQL queries

Other than that it looks good code-wise, good job :)

#4 Updated by José Raddaoui Marín over 7 years ago

  • Status changed from Code Review to QA/Review
  • Assignee changed from José Raddaoui Marín to Dan Gillean
  • Target version changed from Release 2.2.0 to Release 2.1.0

#5 Updated by Dan Gillean over 7 years ago

  • Status changed from QA/Review to Verified

Now tested with EAD imports, and CSV. Looks good! As we do migration projects with larger datasets and more complicated use cases, we may shake out some more edge cases we need to consider... but for now, this is a vast improvement, and the testing I have done holds up. Thanks guys!

#7 Updated by José Raddaoui Marín over 7 years ago

  • Status changed from Verified to Feedback

Hi Dan,

While I was fixing #7327 I've found a little issue related to this ticket and EAD import.

In EAD import we're relating creators and bioghist by order, but if there are more bioghist nodes than creators, new 'Untitled' actors are created. Also, if the selected by order actor already has the 'History' field populated, the actor won't be updated and a new 'Untitled' actor will be created with that history.

That's what happens when it finds a match for the creator in the same repository and that actor has the history populated, even if the bioghist value is the same as the actor history, a new actor is created without title and with that history.

What do you think? Should we keep it that way, check if the bioghist value is the same as the history, replace the history value, add and attribute ('id="atom_"') in the export like Misty did for the events ...

Also, if the creator has no data, the EAD export adds an empty bioghist:

<bioghist id="md5-1ef3e5ac890e0ebc1c9393719d7677f3" encodinganalog="3.2.2"></bioghist>

This bioghist in the EAD import populates the actor history with ' '. This can be easily fixed in the import but I'm wondering if we should keep that empty bioghist in the export.

#8 Updated by Dan Gillean over 7 years ago

Hi Radda,

Oooof, this is hard.

In EAD import we're relating creators and bioghist by order, but if there are more bioghist nodes than creators, new 'Untitled' actors are created

I think this is the proper behavior. Bioghists and origination elements (creators) should always be 1:1 - if by accident they are not, we should not ignore the extra bioghist or merge it into another record - I think creating a blank actor is the best solution. The user will have to decide what to do with their bad data.

Also, if the selected by order actor already has the 'History' field populated, the actor won't be updated and a new 'Untitled' actor will be created with that history.

That's what happens when it finds a match for the creator in the same repository and that actor has the history populated, even if the bioghist value is the same as the actor history, a new actor is created without title and with that history.

I've talked about this with Sarah, and we can't think of a better behavior right now - I think we might just want to keep it as is, and add notes to the documentation.

If it's a matching actor name but a different parent repo, we should create a new actor.

If the user has a different bioghist for the same actor in the same repo, I don't think we should ignore it, or overwrite the old one, without a means to check with the user first - which we can't really do right now.

Ideally, if the actor and repo AND bioghist is the same, it will just link, and no duplicate will be created. If the actor and repo are the same, but the bioghist is different, I guess we should keep creating a blank record to store it in, so the user can make a decision about what to do (delete, or copy it and manually overwrite the old bioghist). The ideal would be if the user was given a warning somewhere in the application (on the import page? on the description?) that would let them know that a decision is required - but again, that might be more than we can do right now.

On export, we do not need an empty bioghist if there is no data to put in it.

Does that make sense?

#9 Updated by José Raddaoui Marín over 7 years ago

  • Status changed from Feedback to Code Review
  • Assignee changed from Dan Gillean to Mike Gale

Sure Dan, it makes a lot of sense :)

I've created PR: 83

- Empty bioghist elements won't be added in EAD export
- In the import, if the actor and repo AND bioghist are the same, it will just link, and no duplicate will be created

#10 Updated by José Raddaoui Marín over 7 years ago

  • Status changed from Code Review to Deploy
  • Assignee changed from Mike Gale to José Raddaoui Marín

#11 Updated by José Raddaoui Marín over 7 years ago

  • Status changed from Deploy to QA/Review
  • Assignee changed from José Raddaoui Marín to Dan Gillean

#12 Updated by Dan Gillean about 7 years ago

  • File bacon-authorityrecord-collision-test.csvMagnifier added
  • Status changed from QA/Review to Feedback
  • Assignee changed from Dan Gillean to José Raddaoui Marín
  • Target version changed from Release 2.1.0 to Release 2.2.0
  • Tested version 2.1.1 added

Not sure if this is something we can address, but I will put it in feedback for now for review. An interesting error, that could affect large portal migrations (e.g. AC).

Behaved as expected for top-level records. However, when the identical name was used at a lower level in the same fonds as an access point, a new record was created for the lower levels. I guess one of the checks might be LOD? Not sure it should be checked though.

Difficult to explain, easier to see - take a look at the attached CSV and try importing it. There are 3 fonds included - Bacons two, three, and four, as they are titled. Bacon number two and Bacon number four Share both repository and creator name; Bacon three matches the creator name but not the repository. The result should be 2 creator authority records for 3 fonds. This works as expected.

All of them, and their child descriptions, use Artefactual Systems Inc. as name access points, at both the fonds, and again at one of the lower levels. This should result in two authority records total (linked at multiple levels), because of the different repo names between 2/4 and 3.

Instead, I ended up with 3: 1 for two and four at the fonds level, 1 for three at the fonds level, and 1 linked to all the child-levels, regardless of repository.

Thoughts? Are there use cases we can think of that would go against trying to fix this?

#13 Updated by José Raddaoui Marín about 7 years ago

  • Status changed from Feedback to Code Review
  • Assignee changed from José Raddaoui Marín to Mike Gale

To make this a little more complex, an actor may be related with an information object wich repository is inherit from one of its ancestors.

It should be fixed with PR 125

#14 Updated by Mike Gale about 7 years ago

  • Status changed from Code Review to In progress
  • Assignee changed from Mike Gale to José Raddaoui Marín

Looks fine

#15 Updated by José Raddaoui Marín about 7 years ago

  • Status changed from In progress to QA/Review
  • Assignee changed from José Raddaoui Marín to Dan Gillean

#16 Updated by Dan Gillean almost 7 years ago

  • File isad-ead-crosswalk.xmlMagnifier added
  • Status changed from QA/Review to Feedback
  • Assignee changed from Dan Gillean to José Raddaoui Marín

It looks like it is possible that our recent EAD changes have perhaps altered the intended behavior? I'm ending up with duplicates, instead of links, when roundtripping a test description. Attaching a sample file for testing.

To reproduce
  • Import the attached description.
  • Delete the description, but not the repository, or actor records (name access points, creator).
  • Reimport

Resulting error
New actors are created. Repository links fine.

Expected result
From what I understood from the comments above, if the name and bioghist and repo are the same, they should be linked instead of duplicated.

Is this correct? Thanks.

#17 Updated by José Raddaoui Marín almost 7 years ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from José Raddaoui Marín to Dan Gillean

Hi Dan,

I guess the first description imported is the only link between the repository and those actors so, when it's deleted, that indirect relation is deleted too. If you don't delete the description and re-do the import (creating a duplicate) the actors should link.

The only way to fix something like this is creating a direct relation between actors and repositories.

#18 Updated by Dan Gillean almost 7 years ago

  • Status changed from QA/Review to Verified

Thanks for clarifying, Radda.

Yesterday I was testing with a larger hierarchical fonds, and did not have the same issues, so I think this is as fixed as it can be for now, until at some future point we have a way to relate authority records directly to repositories.

Also available in: Atom PDF