Bug #7949

Spaces in filenames causing manual normalization failure

Added by Hector Akamine over 7 years ago. Updated over 4 years ago.

Status: Verified
Start date: 02/09/2015
Priority: Medium
Due date:
Assignee: Sara Allain
% Done: 0%
Category: Normalization
Target version: Release 1.7.0
Google Code Legacy ID:
Pull Request:
Sponsored: No
Requires documentation:

Description

From: Hutchinson, Tim

I hit another roadblock in terms of manual normalization, which I’ve now managed to work around, but it exposes a possible bug (albeit for an extreme edge case), so I wanted to pass this on as a bug report rather than a support request.

It failed at Micro-service: Normalize, Job: Relate manual normalized preservation files to the original files, with an error for two files only:

Could not find manualNormalization/preservation/2014-097/Biol_222.3_images/Biol_222._I.ppt.pptx in /var/archivematica/sharedDirectory/currentlyProcessing/Sawhney-cc5dd6a2-5e0d-4db2-9909-170c588c7870/objects/manualNormalization/normalization.csv

Could not find manualNormalization/preservation/2014-097/Biol_325.3_images/Biol_325._I.ppt.pptx in /var/archivematica/sharedDirectory/currentlyProcessing/Sawhney-cc5dd6a2-5e0d-4db2-9909-170c588c7870/objects/manualNormalization/normalization.csv

I can’t find any errors with normalization.csv, most of which was generated through a Windows script.

I originally thought I wouldn’t need a normalization.csv file, but when I processed the transfer without one, I got errors on the very same files:

Too many possible files for: %SIPDirectory%objects/manualNormalization/preservation/2014-097/Biol_222.3_images/Biol_222._I.ppt.pptx

Too many possible files for: %SIPDirectory%objects/manualNormalization/preservation/2014-097/Biol_325.3_images/Biol_325._I.ppt.pptx

I initially wondered whether this was the filename sanitization issue again (see #6870), but all of the files have spaces, and the matched files have been sanitized.

Since *_I.ppt is the only pattern I could see, I renamed those files to remove the space (from the originals, manually normalized versions, and CSV). With that change, it worked. A separate test transfer suggests that the CSV file isn’t needed either. (see test_transfers/sahwney for a transfer that should fail)
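The rename workaround described above can be sketched as a small planning step. This is only an illustration, not part of Archivematica: the function name `plan_space_free_renames` is hypothetical, and it assumes the caller applies the same mapping to the originals, the manually normalized copies, and any rows in normalization.csv. It refuses to produce a plan if removing spaces would make two names identical.

```python
def plan_space_free_renames(names):
    """Map each filename containing spaces to a space-free name.

    Hypothetical helper: it only builds the mapping and does not touch
    the filesystem. Raises ValueError if a new name would collide with
    an existing or already planned name.
    """
    taken = {name for name in names if " " not in name}
    plan = {}
    for name in names:
        if " " not in name:
            continue  # nothing to change for this file
        new_name = name.replace(" ", "")
        if new_name in taken:
            raise ValueError(
                "renaming %r to %r would collide with another file" % (name, new_name)
            )
        taken.add(new_name)
        plan[name] = new_name
    return plan
```

Building the plan first, and renaming only once it succeeds, avoids leaving a transfer half renamed when a collision turns up.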

So sort of an odd edge case – at this point I haven’t done more systematic testing.

History

#1 Updated by Tim Hutchinson over 7 years ago

I've done some more testing and narrowed down the more general case. It seems manual normalization fails when two files share the same string just before the extension, and one file has a space immediately before that string while the other does not. So:

file 1.ppt
file11.ppt
file5.ppt
fails

but:
file 1.ppt
file 11.ppt
file 5.ppt

is OK and so is:

file1.ppt
file11.ppt
file5.ppt

This pattern seems to extend to longer strings, e.g.

file 10.ppt
file110.ppt
file5.ppt
also fails

#2 Updated by Sarah Romkey almost 7 years ago

  • Subject changed from manual normalization problems to Spaces in filenames causing manual normalization failure
  • Category set to Normalization
  • Target version set to Release 1.6

This may be less of an edge case than we first thought, and it is not exclusive to using the normalization.csv method. It has now been reported by a second user.

Consider the following file set:


aug meeting minutes.doc
sept meeting minutes.doc
oct meeting minutes.doc

etc 

Depending on the filenaming conventions of the creator (which are out of the hands of the receiving archivist) you could end up with transfers of similarly named objects quite frequently.

As Tim described in his comment, the space before the last characters seems to be causing the failure. In the second user's case, they were receiving this error in Move access files to DIP:

No matching file for: Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/manualNormalizationMoveAccessFilesToDIP.py", line 97, in <module>
    print >>sys.stderr, "No matching file for: ", opts.filePath.replace(opts.SIPDirectory, "%SIPDirectory%", 1)
AttributeError: Values instance has no attribute 'SIPDirectory'
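The wording "Values instance has no attribute" suggests the script reads its options through `optparse`, whose `Values` object raises `AttributeError` for any option the parser never defined. The following minimal sketch reproduces that kind of secondary failure under that assumption. The option name `--filePath` and the sample path are hypothetical, chosen only to mirror the attributes seen in the traceback. Note that the `AttributeError` is raised while building the "No matching file" message, so it hides the original error rather than causing it.

```python
from optparse import OptionParser


def parse_options(argv):
    """Parse a hypothetical option set that defines filePath but not SIPDirectory."""
    parser = OptionParser()
    parser.add_option("--filePath", dest="filePath")
    opts, _args = parser.parse_args(argv)
    return opts


opts = parse_options(["--filePath", "/tmp/sip/objects/file_1.ppt"])

try:
    # SIPDirectory was never registered, so attribute access fails.
    opts.SIPDirectory
except AttributeError as err:
    print("AttributeError: %s" % err)

# A defensive lookup avoids masking the real error message.
sip_directory = getattr(opts, "SIPDirectory", None)
```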

I have re-titled this ticket and marked it for the 1.6 release since we are close to having a QA candidate for 1.5.

Developers: if you need more info about this error when it comes time to address it, let me know and I can try to reproduce locally.

#3 Updated by Tim Hutchinson almost 7 years ago

Hi Sarah - I was checking the status of a few of the manual normalization issues and got interested in this one again... I've been trying to narrow down the general case.

First, I wasn't able to reproduce the failure with any combination of the file names you mentioned, e.g.:

aug meeting minutes.doc
sept meeting minutes.doc
oct meeting minutes.doc

aug meeting_minutes.doc
sept meeting minutes.doc
oct meeting minutes.doc

aug meetingminutes.doc
sept meeting minutes.doc
oct meeting minutes.doc

(In my earlier testing, the pattern seemed to involve one file with a space and the other without.)

So if the other user had filenames like this, maybe something else is happening? (Although I have also seen the error at Move access files to DIP)

Then I noticed that the earlier files all started with "file" (and the real files that led to this report started with "Biol").

I tried:

files:
meeting minutes augx2014.doc
meeting minutes aug 2014.doc
oct meetingminutes.doc

Job: Relate manual normalized preservation files to the original files
Too many possible files for: %SIPDirectory%objects/manualNormalization/preservation/meeting_minutes_aug_2014.doc.docx
matched: {943bd390-709e-4244-abe8-24a8b116bf8c}%SIPDirectory%objects/meeting_minutes_augx2014.doc
matched: {85c9ac53-76aa-431b-8702-d9c33937d348}%SIPDirectory%objects/oct_meetingminutes.doc

Similar results for:
file  2.doc [2 spaces]
filexy2.doc
oct meetingminutes.doc

One difference is that when I retested the initial set, it failed at "move access files to DIP" rather than at the preservation step. I.e.:
file 1.ppt
file11.ppt
file5.ppt
So there may be other factors here. I used file extension to identify these files (to take the problem FIDO has with .pptx files out of play)

In any case, the pattern I've been able to verify, albeit with limited testing, seems to be:
{stringA}{stringB}{stringC}.ext
{stringA}{stringD}{stringC}.ext
where stringB is one or more spaces and stringD is the same length as stringB.
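For anyone who wants to screen a transfer for filenames that fit this pattern before processing, here is a rough sketch. It encodes only the pattern as described in this comment and is not taken from the Archivematica code. The function names `fits_reported_pattern` and `find_risky_pairs` are hypothetical.

```python
import os
from itertools import combinations


def fits_reported_pattern(name_a, name_b):
    """Return True if two names look like {A}{B}{C}.ext and {A}{D}{C}.ext,
    where B is one or more spaces and D has the same length as B."""
    stem_a, ext_a = os.path.splitext(name_a)
    stem_b, ext_b = os.path.splitext(name_b)
    if ext_a != ext_b or len(stem_a) != len(stem_b) or stem_a == stem_b:
        return False
    # Positions where the two stems differ; non-empty because the stems differ.
    diff = [i for i, (x, y) in enumerate(zip(stem_a, stem_b)) if x != y]
    middle_a = stem_a[diff[0]:diff[-1] + 1]
    middle_b = stem_b[diff[0]:diff[-1] + 1]
    # One of the two differing segments must consist only of spaces.
    return set(middle_a) == {" "} or set(middle_b) == {" "}


def find_risky_pairs(names):
    """List every pair of filenames that fits the reported pattern."""
    return [(a, b) for a, b in combinations(names, 2) if fits_reported_pattern(a, b)]
```

For the sets tested in this ticket, the check flags "file 1.ppt" with "file11.ppt" and "meeting minutes augx2014.doc" with "meeting minutes aug 2014.doc", and it does not flag the "aug/sept/oct meeting minutes.doc" set.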

This is all using 1.3.1, and I didn't use a CSV file.

Hope this helps. Incidentally Redmine e-mail notification doesn't seem to be working, so if you have any questions about my testing, please send me an e-mail.

#4 Updated by Sarah Romkey over 5 years ago

  • Target version changed from Release 1.6 to Release 1.7.0

#5 Updated by Sarah Romkey about 5 years ago

  • Status changed from New to QA/Review
  • Assignee set to Sara Allain

Going to test to verify that this is still an issue.

#6 Updated by Sara Allain over 4 years ago

Tested this on Archivematica qa/1.x with a few different transfers, using the naming conventions that Tim and Sarah provided above, and was unable to reproduce the error exactly - instead, I'm getting failures at Store DIP:

Standard output (stdout)
Checking if DIP baa7da7c-1691-4e8d-9d50-04a645783f9a parent AIP has been created...
Parent AIP exists so relationship can be created.
Standard error (stderr)
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/storeAIP.py", line 168, in <module>
    args.sip_uuid, args.sip_name, args.sip_type))
  File "/usr/lib/archivematica/MCPClient/clientScripts/storeAIP.py", line 114, in store_aip
    size = os.path.getsize(aip_path)
  File "/usr/share/python/archivematica-mcp-client/lib/python2.7/genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/var/archivematica/sharedDirectory/watchedDirectories/uploadedDIPs/7949-no-csv-6dfab62f-2cdc-4ff4-9d54-828a1e21dec5'

It's possible that this is an unrelated issue, since I'm on a QA branch. Will require more investigation.

#7 Updated by Tim Hutchinson over 4 years ago

Testing with 1.6.1 (vagrant box), I'm no longer able to reproduce this issue.

I initially got the error Sarah reported in comment #2 (which was different than I'd previously seen), but it turned out that was actually an error in one of the filenames, so that the normalized filenames didn't match in one case.

#8 Updated by Sara Allain over 4 years ago

  • Status changed from QA/Review to Verified

Thanks for the update, Tim! Going to verify this; we can always reopen later if needed.
