Bug #10828

Cannot do metadata only re-ingest if originals have been normalized twice

Added by Sarah Romkey over 5 years ago. Updated about 5 years ago.

Status: QA/Review
Start date: 01/27/2017
Priority: Medium
Due date:
Assignee: Sara Allain
% Done: 0%
Category: Reingest
Target version: -
Google Code Legacy ID:
Pull Request:
Sponsored: No
Requires documentation:

Description

To reproduce:

- Create an AIP and normalize for preservation
- Reingest the AIP; add some metadata (I tested with DC metadata), re-normalize for preservation, store the AIP
- Attempt to do a Metadata-only re-ingest.

You should see this error:

Command: parseMETStoDB_v1.0 94e262fd-8996-4aad-94df-6bc7c68c3660 /var/archivematica/sharedDirectory/currentlyProcessing/reingest_x_2-94e262fd-8996-4aad-94df-6bc7c68c3660/

STDOUT

METS Reader
filegrpuse deleted
amdid amdSec_1
file_uuid 1450a172-86fd-4e91-8517-9e5c4f180d1e
original_path %SIPDirectory%objects/abroadcranethoma00craniala_0014-1450a172-86fd-4e91-8517-9e5c4f180d1e.tif

STDERR

Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 457, in <module>
    sys.exit(main(args.sip_uuid, args.sip_path))
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 442, in main
    files = parse_files(root)
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 69, in parse_files
    current_path = fe.find('mets:FLocat', namespaces=ns.NSMAP).get(ns.xlinkBNS+'href')
AttributeError: 'NoneType' object has no attribute 'get'

History

#1 Updated by Joel Dunham about 5 years ago

  • Assignee set to Sarah Romkey

Hi Sarah,

I have been unable to recreate this issue. Could you confirm that it still exists and, if so, give me some more details on how to recreate it?

I have been able to determine that the transfer was created from /SampleTransfers/OCRImage/

Looking at your STDOUT, it seems related to the normalize.py client script setting a 'deleted' filegrpuse on a file. However, I have not been able to get normalization to do that.

This commit may be relevant: https://github.com/artefactual/archivematica/commit/94c2238f9b493f19c7181f3129f461fc28ce2b58

#2 Updated by Sarah Romkey about 5 years ago

  • Assignee changed from Sarah Romkey to Joel Dunham

I confirmed with Joel that this error is still happening. I wasn't totally clear in the original ticket:

1. Make the AIP, making sure to normalize for preservation
2. Do full reingest on the AIP, adding some DC metadata and normalizing again for preservation.
3. The third time around, try a metadata only re-ingest.

Looking more closely at the errors, this appears to have something to do with the deleted filegrp that we added to the filesec.


METS Reader
filegrpuse submissionDocumentation
amdid amdSec_3
file_uuid dc984018-5910-43f3-b342-412fc2b0f0ef
original_path %SIPDirectory%objects/submissionDocumentation/transfer-10828_again-1915677a-cdc7-403c-82b3-db2c1982db14/METS.xml
current_path %SIPDirectory%objects/submissionDocumentation/transfer-10828_again-1915677a-cdc7-403c-82b3-db2c1982db14/METS.xml
checksum 5222aef1781bb7184b71fbd460d80bf5486ca31c7a4b0532ce302b20e3e4b2ca
checksumtype sha256
size 12316
PUID fmt/101
format_version Text (Markup): XML: XML 
derivation event None
related_uuid None
relationship None

filegrpuse deleted
amdid amdSec_1
file_uuid 628ede9b-2df1-4205-a1a2-28ef59d8d495
original_path %SIPDirectory%objects/abroadcranethoma00craniala_0014-628ede9b-2df1-4205-a1a2-28ef59d8d495.tif

STDERR

Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 457, in <module>
    sys.exit(main(args.sip_uuid, args.sip_path))
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 442, in main
    files = parse_files(root)
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 69, in parse_files
    current_path = fe.find('mets:FLocat', namespaces=ns.NSMAP).get(ns.xlinkBNS+'href')
AttributeError: 'NoneType' object has no attribute 'get'

#3 Updated by Joel Dunham about 5 years ago

  • Assignee changed from Joel Dunham to Sarah Romkey

The branch at https://github.com/artefactual/archivematica/tree/dev/issue-10828-mo-reingest-double-normalization will prevent the error from happening.

The error was caused by parse_mets_to_db.py parsing the METS and attempting to get the mets:FLocat child of a mets:file element which is itself a child of mets:fileGrp[USE="deleted"]. Because a deleted file has no such FLocat child, find() returned None, and the subsequent call to None.get raised an AttributeError.
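
For illustration, the failure mode and one possible guard can be sketched as follows. This is not the code from the branch: it uses the standard library's xml.etree.ElementTree rather than the parser the client script uses, and the sample METS, the file IDs, and the current_paths helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace map for the lookups below (stands in for ns.NSMAP in the script).
NSMAP = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily trimmed METS: one kept file and one deleted file.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-1" ADMID="amdSec_2">
        <mets:FLocat xlink:href="objects/kept.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="deleted">
      <mets:file ID="file-2" ADMID="amdSec_1"/>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def current_paths(root):
    """Map each mets:file ID to its FLocat href, or None if it has no FLocat."""
    paths = {}
    for fe in root.findall(".//mets:fileGrp/mets:file", NSMAP):
        flocat = fe.find("mets:FLocat", NSMAP)
        # A file under fileGrp USE="deleted" has no FLocat child, so find()
        # returns None; calling .get() on it directly is what raised the
        # AttributeError in the traceback.
        paths[fe.get("ID")] = flocat.get(XLINK_HREF) if flocat is not None else None
    return paths


root = ET.fromstring(SAMPLE_METS)
result = current_paths(root)
print(result)
```

Whether a deleted file should be recorded with an empty current path or skipped entirely is a separate decision that this sketch does not settle.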

The addition of metadata during the first (full) re-ingest was a red herring: the bug arises even if no metadata are entered. The problem is the USE='deleted' files in the mets:fileSec.

This bug raises a bigger issue. When a new METS file is created during re-ingest, deleted files listed in the mets:fileSec will not be present in the new METS file. This is problematic because this information should not be lost. The reason for this behaviour is the use of mets-reader-writer by the client script archivematicaCreateMETSReingest.py (which is called by archivematicaCreateMETS2.py). The metsrw module parses the existing METS file and creates a new one based on that information. However, metsrw takes mets:structMap[TYPE="physical"] as authoritative and any <mets:file> elements that do not correspond to <mets:div> elements in the structMap will be lost. In the present context, this means deleted files in the fileSec will be lost. Fixing this is turning out to be a larger task than I anticipated and may require making changes to metsrw. Sarah, I'm wondering if I should continue working on this issue.
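
To make the structMap point concrete, here is a rough sketch that lists the fileSec entries a structMap-driven rewrite would drop. It does not use metsrw itself; it uses the standard library's xml.etree.ElementTree, and the sample METS, the file IDs, and the files_missing_from_structmap helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NSMAP = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical METS: file-2 is listed in the fileSec as deleted but has no
# corresponding div/fptr in the physical structMap.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-1"/>
    </mets:fileGrp>
    <mets:fileGrp USE="deleted">
      <mets:file ID="file-2"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="Directory" LABEL="objects">
      <mets:div TYPE="Item" LABEL="kept.tif">
        <mets:fptr FILEID="file-1"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def files_missing_from_structmap(root):
    """Return (USE, ID) pairs for mets:file elements with no fptr in the physical structMap."""
    referenced = {
        fptr.get("FILEID")
        for fptr in root.findall(".//mets:structMap[@TYPE='physical']//mets:fptr", NSMAP)
    }
    return [
        (grp.get("USE"), fe.get("ID"))
        for grp in root.findall(".//mets:fileSec/mets:fileGrp", NSMAP)
        for fe in grp.findall("mets:file", NSMAP)
        if fe.get("ID") not in referenced
    ]


missing = files_missing_from_structmap(ET.fromstring(SAMPLE_METS))
print(missing)
```

Any entry reported here exists only in the fileSec, so a writer that rebuilds the METS from the structMap alone would have nothing from which to regenerate it.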

#4 Updated by Joel Dunham about 5 years ago

I went ahead and updated the dev/issue-10828-mo-reingest-double-normalization branch so that a record of deleted derivatives is retained in the METS after multiple re-ingests.

Sarah, I'm wondering if you can deploy AM with the above branch, re-run the original tests and provide feedback on whether the resulting AIP METS is as it should be.

One potential issue I observed is that duplicate PREMIS:AGENTS digiprovMD elements appear within a single amdSec. I'm unsure whether that is due to this code change or was happening already.

#5 Updated by Sarah Romkey about 5 years ago

  • Status changed from New to QA/Review

#6 Updated by Sarah Romkey about 5 years ago

  • Assignee changed from Sarah Romkey to Sara Allain

Sara, I didn't have a chance to continue testing this so reassigning to you.
