Bug #10045

AIP reingest failing because of different number of bytes

Added by Sarah Romkey almost 6 years ago. Updated over 5 years ago.

Status: QA/Review
Start date: 06/17/2016
Priority: Medium
Due date:
Assignee: Sarah Romkey
% Done: 0%
Category: Reingest
Target version: -
Google Code Legacy ID:
Pull Request: https://github.com/artefactual/archivematica-storage-service/pull/162
Sponsored: No
Requires documentation:

Description

The attached AIP fails on objects/metadata re-ingest with this error:

Error re-ingesting package: Oxum error. Found 20 files and 4185329 bytes on disk; expected 20 files and 4185857 bytes.

Initially it was thought that this failure was happening because the AIP contained zipped packages (which were deleted after extraction), but other AIPs with zipped packages are working. I also tried deleting the .DS_Store files from the AIP, and it fails with the same error.
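For reference, the Oxum in the error message is the BagIt Payload-Oxum: the total byte count and the file count of the bag's payload. The sketch below shows one way to recompute it for an extracted AIP so the two sides of the comparison can be inspected by hand. The function names `payload_oxum` and `format_oxum` are made up for this illustration and are not part of the Storage Service code.

```python
import os


def payload_oxum(data_dir):
    """Return (total_bytes, file_count) for all files under data_dir.

    Hypothetical helper: walks the directory tree and sums file sizes,
    which is the information a BagIt Payload-Oxum records.
    """
    total_bytes = 0
    file_count = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
            file_count += 1
    return total_bytes, file_count


def format_oxum(total_bytes, file_count):
    """Format the pair the way bag-info.txt stores it: bytes.files"""
    return "{}.{}".format(total_bytes, file_count)
```

For the figures in the error above, the file counts match and the byte totals differ by 4185857 - 4185329 = 528 bytes, which suggests that some files changed size during extraction rather than going missing.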

julie_zip-0c01ed1c-71c4-4842-91af-392290e32521.7z (1.14 MB) Sarah Romkey, 06/17/2016 12:31 PM

dataZip.zip-2016-06-16T18_32_58.912193_00_00.tar.gz (1.21 MB) Sarah Romkey, 07/22/2016 04:24 PM

History

#1 Updated by Nick Wilkinson almost 6 years ago

  • Assignee set to Joel Dunham

#2 Updated by Sarah Romkey almost 6 years ago

I'm still able to reproduce this error. Attached is a zip that you can use as transfer material; you should get the error described.

#3 Updated by Sarah Romkey almost 6 years ago

  • Target version changed from Release 1.5.1 to Release 1.6

#4 Updated by Sarah Romkey over 5 years ago

  • Target version deleted (Release 1.6)

Removing the target version until time allows to investigate further.

#5 Updated by Joel Dunham over 5 years ago

  • Assignee changed from Joel Dunham to Sarah Romkey

Hi Sarah,

Can you confirm that SS branch dev/issue-10045-bagit-bytes-bug fixes the issue?
For future reference, an explanation of the issue is given below.

Previously, the command used to extract a compressed AIP (in
storage_service/locations/models/package.py) prior to re-ingest was::

unar -force-overwrite -o extract_path AIP_path

The problem with this command is that unar treats .rsrc files in __MACOSX/
differently than 7z does. 7z x is used (exclusively?) in the
extractContents_v0.0 client script. 7z converts these .rsrc files to
._-prefixed files. Similar behaviour with unar can be achieved by passing
-k hidden. However, while a command like::

unar -force-overwrite -k hidden -o extract_path AIP_path

preserves the .rsrc MACOSX files as ._-prefixed files, it does so
differently than 7z does: the resulting ._-prefixed files have
different sizes than those created via 7z. This makes
bag.validate choke and causes the original error: "Found 20 files and
4185329 bytes on disk; expected 20 files and 4185857 bytes."

The solution is to use 7z to extract AIPs during re-ingest::

7z x -bd -y -oextract_path AIP_path
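As a rough illustration of how that command might be assembled from Python, here is a minimal sketch. The helper names `build_7z_extract_command` and `extract_aip` are hypothetical and do not reflect the actual code in package.py; only the 7z flags are taken from the command above.

```python
import subprocess


def build_7z_extract_command(aip_path, extract_path):
    """Build the argument list for: 7z x -bd -y -o<extract_path> <aip_path>

    x   extract with full paths
    -bd disable the progress indicator
    -y  assume "yes" on all prompts (e.g. overwrite)
    -o  output directory; 7z expects no space between -o and the path
    """
    return ["7z", "x", "-bd", "-y", "-o{}".format(extract_path), aip_path]


def extract_aip(aip_path, extract_path):
    """Run the extraction; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_7z_extract_command(aip_path, extract_path))
```

Passing the command as a list avoids shell quoting problems when the AIP path contains spaces.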

TODOs/Questions:

1. Should/can the extractContents client script be re-used here so
that extraction is uniform across AM/AM-SS?
2. Why was unar chosen here originally instead of 7z? Was there a
good reason?
3. Will the relative/subpath final argument to 7z work in the
same way that it does in unar?

#6 Updated by Sarah Romkey over 5 years ago

  • Assignee changed from Sarah Romkey to Joel Dunham

Hi Joel,

I was able to run the test successfully; I'm no longer getting the error. However, when testing I noticed that if you choose full re-ingest, the AIP is sent back to Ingest instead of to Transfer. I haven't noticed this on other QA branches, but I'll re-test on qa 1.6.x/0.x as well.

#7 Updated by Joel Dunham over 5 years ago

  • Status changed from New to Code Review
  • Assignee changed from Joel Dunham to Nick Wilkinson
  • Pull Request set to https://github.com/artefactual/archivematica-storage-service/pull/162

#8 Updated by Nick Wilkinson over 5 years ago

  • Assignee changed from Nick Wilkinson to Holly Becker

Hi Holly, assigning to you for CR.

#9 Updated by Holly Becker over 5 years ago

  • Status changed from Code Review to Feedback
  • Assignee changed from Holly Becker to Joel Dunham

2. Why was unar chosen here originally instead of 7z? Was there a good reason?

Unfortunately, we chose unar because it supports multiple package types (in this case both .7z and .tar.bz2), which the new 7z command doesn't.
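One possible way to cover both package types without unar is to pick the extraction command from the file extension. The sketch below is only an assumption about how that could look, not what the pull request does: the function name `choose_extract_command` is invented, and the use of tar for .tar.bz2 packages is a stand-in whose handling of the resource fork files has not been verified here.

```python
def choose_extract_command(aip_path, extract_path):
    """Return an argument list for extracting aip_path into extract_path.

    Hypothetical dispatch by extension:
      .7z      -> the 7z command discussed in this issue
      .tar.bz2 -> tar with bzip2 decompression (assumed alternative)
    """
    if aip_path.endswith(".tar.bz2"):
        # -x extract, -j bzip2, -f archive file, -C target directory
        return ["tar", "-xjf", aip_path, "-C", extract_path]
    if aip_path.endswith(".7z"):
        return ["7z", "x", "-bd", "-y", "-o{}".format(extract_path), aip_path]
    raise ValueError("Unsupported package type: {}".format(aip_path))
```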

#10 Updated by Joel Dunham over 5 years ago

  • Status changed from Feedback to Code Review
  • Assignee changed from Joel Dunham to Holly Becker

Hey Holly,

What think you now? Under the current state, AIPs containing Mac OS resource fork files will NOT be able to be re-ingested if they are compressed by AM using pbzip2. Assuming the code is sound, is this behaviour acceptable?

#11 Updated by Holly Becker over 5 years ago

  • Status changed from Code Review to Feedback
  • Assignee changed from Holly Becker to Joel Dunham

Given that it didn't work at all previously, that seems acceptable. Perhaps check with one of the analysts as well?

This is making your suggestion of re-using extractContents and the FPR look more appealing!

#12 Updated by Joel Dunham over 5 years ago

  • Status changed from Feedback to Code Review
  • Assignee changed from Joel Dunham to Holly Becker

Hey Holly,

Can you look at the revised SS dev/issue-10045-bagit-bytes-bug? It should now allow you to metadata-only re-ingest AIPs built from Mac-OS-resource-fork-containing .zip transfers, no matter what compression algorithm/tool AM used when storing the AIP.

#13 Updated by Holly Becker over 5 years ago

  • Status changed from Code Review to Feedback
  • Assignee changed from Holly Becker to Joel Dunham

Looks good, includes some nice cleanup too.

#14 Updated by Joel Dunham over 5 years ago

  • Status changed from Feedback to Deploy
  • Assignee changed from Joel Dunham to Nick Wilkinson

#15 Updated by Nick Wilkinson over 5 years ago

  • Status changed from Deploy to QA/Review
  • Assignee changed from Nick Wilkinson to Sarah Romkey

Hey Sarah, assigning to you for either QA and/or inclusion in the SS 0.10 release.
