Cannot do metadata only re-ingest if originals have been normalized twice
Assignee: Sara Allain
- Create an AIP and normalize for preservation
- Reingest the AIP; add some metadata (I tested with DC metadata), re-normalize for preservation, store the AIP
- Attempt to do a Metadata-only re-ingest.
You should see this error:
Command: parseMETStoDB_v1.0 94e262fd-8996-4aad-94df-6bc7c68c3660 /var/archivematica/sharedDirectory/currentlyProcessing/reingest_x_2-94e262fd-8996-4aad-94df-6bc7c68c3660/
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 457, in <module>
    sys.exit(main(args.sip_uuid, args.sip_path))
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 442, in main
    files = parse_files(root)
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 69, in parse_files
    current_path = fe.find('mets:FLocat', namespaces=ns.NSMAP).get(ns.xlinkBNS+'href')
AttributeError: 'NoneType' object has no attribute 'get'
#1 Updated by Joel Dunham about 5 years ago
- Assignee set to Sarah Romkey
I have been unable to recreate this issue. Could you confirm that it still exists and, if so, give me some more details on how to recreate it?
I have been able to determine that the transfer was created from /SampleTransfers/OCRImage/
Looking at your STDOUT, it seems related to the normalize.py client script setting a 'deleted' filegrpuse on a file. However, I have not been able to get normalization to do that.
This commit may be relevant: https://github.com/artefactual/archivematica/commit/94c2238f9b493f19c7181f3129f461fc28ce2b58
#2 Updated by Sarah Romkey about 5 years ago
- Assignee changed from Sarah Romkey to Joel Dunham
I confirmed with Joel that this error is still happening. I wasn't totally clear in the original ticket:
1. Make the AIP, making sure to normalize for preservation
2. Do full reingest on the AIP, adding some DC metadata and normalizing again for preservation.
3. The third time around, try a metadata only re-ingest.
Looking more closely at the errors, this appears to have something to do with the deleted filegrp that we added to the filesec.
METS Reader
filegrpuse submissionDocumentation
amdid amdSec_3
file_uuid dc984018-5910-43f3-b342-412fc2b0f0ef
original_path %SIPDirectory%objects/submissionDocumentation/transfer-10828_again-1915677a-cdc7-403c-82b3-db2c1982db14/METS.xml
current_path %SIPDirectory%objects/submissionDocumentation/transfer-10828_again-1915677a-cdc7-403c-82b3-db2c1982db14/METS.xml
checksum 5222aef1781bb7184b71fbd460d80bf5486ca31c7a4b0532ce302b20e3e4b2ca
checksumtype sha256
size 12316
PUID fmt/101
format_version Text (Markup): XML: XML
derivation event None
related_uuid None
relationship None
filegrpuse deleted
amdid amdSec_1
file_uuid 628ede9b-2df1-4205-a1a2-28ef59d8d495
original_path %SIPDirectory%objects/abroadcranethoma00craniala_0014-628ede9b-2df1-4205-a1a2-28ef59d8d495.tif
STDERR
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 457, in <module>
    sys.exit(main(args.sip_uuid, args.sip_path))
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 442, in main
    files = parse_files(root)
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_mets_to_db.py", line 69, in parse_files
    current_path = fe.find('mets:FLocat', namespaces=ns.NSMAP).get(ns.xlinkBNS+'href')
AttributeError: 'NoneType' object has no attribute 'get'
#3 Updated by Joel Dunham about 5 years ago
- Assignee changed from Joel Dunham to Sarah Romkey
The branch at https://github.com/artefactual/archivematica/tree/dev/issue-10828-mo-reingest-double-normalization will prevent the error from happening.
The error was caused by parse_mets_to_db.py parsing the METS and attempting to get the mets:FLocat child of a mets:file element that is itself a child of mets:fileGrp[USE="deleted"]. Because a deleted file has no FLocat child, fe.find() returned None, and the script then called None.get, which raises an AttributeError.
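To illustrate the failure mode, here is a minimal sketch of a guarded lookup. It is not the code on the branch: it uses the standard library's xml.etree.ElementTree rather than the script's own imports, and the sample METS fragment, the file IDs, and the current_paths helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

NSMAP = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily trimmed METS: one ordinary file and one deleted file.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-original">
        <mets:FLocat xlink:href="objects/image.tif" LOCTYPE="OTHER"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="deleted">
      <mets:file ID="file-deleted"/>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
""".strip()


def current_paths(root):
    """Return {file ID: href or None} for every mets:file in the fileSec."""
    paths = {}
    for fe in root.findall("mets:fileSec/mets:fileGrp/mets:file", namespaces=NSMAP):
        flocat = fe.find("mets:FLocat", namespaces=NSMAP)
        # Deleted files have no FLocat, so find() returns None here.
        # Checking before calling .get() avoids the AttributeError.
        paths[fe.get("ID")] = flocat.get(XLINK_HREF) if flocat is not None else None
    return paths


RESULT = current_paths(ET.fromstring(SAMPLE_METS))
print(RESULT)
```

Whether a deleted file should be recorded with an empty current path or skipped altogether is a separate decision; the sketch only shows where the None comes from.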
The addition of metadata during the first (full) re-ingest was a red herring: the bug arises even if no metadata are entered. The problem is the USE='deleted' files in the mets:fileSec.
This bug raises a bigger issue. When a new METS file is created during re-ingest, deleted files listed in the mets:fileSec will not be present in the new METS file. This is a problem because that information should not be lost.
The reason for this behaviour is the use of mets-reader-writer by the client script archivematicaCreateMETSReingest.py (which is called by archivematicaCreateMETS2.py). The metsrw module parses the existing METS file and creates a new one based on that information. However, metsrw takes mets:structMap[TYPE="physical"] as authoritative, and any <mets:file> elements that do not correspond to <mets:div> elements in the structMap will be lost. In the present context, this means deleted files in the fileSec will be lost.
Fixing this is turning out to be a larger task than I anticipated and may require making changes to metsrw. Sarah, I'm wondering if I should continue working on this issue.
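For reference, a rough way to see which fileSec entries are at risk is to list the mets:file IDs that no mets:fptr in the physical structMap points to. The sketch below does not use metsrw at all; it is plain xml.etree.ElementTree, and the sample document, the IDs, and the unreferenced_file_ids helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NSMAP = {"mets": METS_NS}

# Hypothetical, trimmed METS: the deleted file is in the fileSec
# but is not referenced from the physical structMap.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-original"/>
    </mets:fileGrp>
    <mets:fileGrp USE="deleted">
      <mets:file ID="file-deleted"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="Directory" LABEL="objects">
      <mets:div TYPE="Item" LABEL="image.tif">
        <mets:fptr FILEID="file-original"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
""".strip()


def unreferenced_file_ids(root):
    """List fileSec file IDs that the physical structMap never references."""
    referenced = set()
    for struct_map in root.findall("mets:structMap[@TYPE='physical']", NSMAP):
        for fptr in struct_map.iter("{%s}fptr" % METS_NS):
            referenced.add(fptr.get("FILEID"))
    all_ids = [
        fe.get("ID")
        for fe in root.findall("mets:fileSec/mets:fileGrp/mets:file", NSMAP)
    ]
    return [file_id for file_id in all_ids if file_id not in referenced]


ORPHANS = unreferenced_file_ids(ET.fromstring(SAMPLE_METS))
print(ORPHANS)
```

Running something like this against the METS before and after a re-ingest would show which entries a structMap-driven rewrite would drop.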
#4 Updated by Joel Dunham about 5 years ago
I went ahead and updated the dev/issue-10828-mo-reingest-double-normalization branch so that a record of deleted derivatives is retained in the METS after multiple re-ingests.
Sarah, I'm wondering if you can deploy AM with the above branch, re-run the original tests and provide feedback on whether the resulting AIP METS is as it should be.
One potential issue I observed: duplicate PREMIS:AGENTS digiprovMD elements appear within a single amdSec. I'm unsure whether that is due to this code change or was already happening.
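A quick check for the duplicate agents could look like the sketch below. It assumes the agents are wrapped in mets:mdWrap elements with MDTYPE="PREMIS:AGENT", which should be confirmed against the actual AIP METS; the sample document, the agent names, and the duplicate_agents helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NSMAP = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, trimmed amdSec with the same agent recorded twice.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:premis="info:lc/xmlns/premis-v2">
  <mets:amdSec ID="amdSec_1">
    <mets:digiprovMD ID="digiprovMD_1">
      <mets:mdWrap MDTYPE="PREMIS:AGENT"><mets:xmlData>
        <premis:agent><premis:agentName>Archivematica</premis:agentName></premis:agent>
      </mets:xmlData></mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="digiprovMD_2">
      <mets:mdWrap MDTYPE="PREMIS:AGENT"><mets:xmlData>
        <premis:agent><premis:agentName>Archivematica</premis:agentName></premis:agent>
      </mets:xmlData></mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="digiprovMD_3">
      <mets:mdWrap MDTYPE="PREMIS:AGENT"><mets:xmlData>
        <premis:agent><premis:agentName>Example user</premis:agentName></premis:agent>
      </mets:xmlData></mets:mdWrap>
    </mets:digiprovMD>
  </mets:amdSec>
</mets:mets>
""".strip()


def duplicate_agents(root):
    """Map amdSec ID -> {agent text: count} for agents seen more than once."""
    report = {}
    for amd_sec in root.findall("mets:amdSec", NSMAP):
        counts = Counter()
        # MDTYPE value is an assumption; adjust it to match the real METS.
        wraps = amd_sec.findall(
            "mets:digiprovMD/mets:mdWrap[@MDTYPE='PREMIS:AGENT']", NSMAP
        )
        for wrap in wraps:
            # Compare agents by their whitespace-normalized text content.
            text = " ".join(t.strip() for t in wrap.itertext() if t.strip())
            counts[text] += 1
        dupes = {text: n for text, n in counts.items() if n > 1}
        if dupes:
            report[amd_sec.get("ID")] = dupes
    return report


DUPLICATES = duplicate_agents(ET.fromstring(SAMPLE_METS))
print(DUPLICATES)
```

Comparing the output for an AIP built from the branch with one built from the unpatched code would help tell whether the duplicates predate this change.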