Bug #7735

FITS unable to parse a specific pdf file

Added by Justin Simpson over 7 years ago. Updated almost 6 years ago.

Status: In progress
Start date: 12/18/2014
Priority: Medium
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Google Code Legacy ID:
Pull Request:
Sponsored: No
Requires documentation:

Description

For "Weird_files" in the NCDCR dashboard, the Transfer tab shows a Characterization error I'm unfamiliar with. Rachel's description below explains why the PDF is 'weird'.

First, there was an error concerning one of the files in our Set 9 (“weird files” set), pubs_serial_reportsanitarysurveyareaA3200403.pdf (attached), and I was going to get you information about what is “weird” about the file. It turns out that its only distinction is that it has page transitions, applied through Acrobat.

pubs_serial_reportsanitarysurveyareaA3200403.pdf (5.09 MB) Justin Simpson, 03/04/2015 11:22 AM

History

#1 Updated by Justin Simpson over 7 years ago

  • Status changed from New to Feedback

The current version of FITS is not able to successfully parse this particular pdf file, due to an error thrown by one of the tools bundled within FITS, the NLNZ Metadata Extractor.

This is a known issue with FITS, reported previously here:
https://github.com/harvard-lts/fits/issues/20
When tools bundled within FITS print errors to standard output, the XML output of FITS is broken.
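
To illustrate the failure mode, here is a minimal sketch of how stray text on standard output breaks XML parsing, and one way a caller might try to salvage the document. It is not how FITS or Archivematica handle this. The root element name `fits`, the sample error text, and the helper names `parse_tool_output` and `salvage` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def parse_tool_output(raw):
    """Return the parsed root element, or None if raw is not well-formed XML."""
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        return None


def salvage(raw, root_tag="fits"):
    """Try to parse only the span between the first opening and last closing root tag.

    root_tag is an assumed root element name, used here for illustration only.
    """
    start = raw.find("<" + root_tag)
    end_marker = "</" + root_tag + ">"
    end = raw.rfind(end_marker)
    if start == -1 or end == -1:
        return None
    return parse_tool_output(raw[start:end + len(end_marker)])


# Hypothetical output: a clean document, and the same document with
# tool error messages printed around it on standard output.
clean = "<fits><identification/></fits>"
polluted = "ERROR: could not inflate stream\n" + clean + "\nstray trailing message"

print(parse_tool_output(clean) is not None)     # clean output parses
print(parse_tool_output(polluted) is not None)  # stray text makes parsing fail
print(salvage(polluted) is not None)            # slicing out the root element recovers it
```

This only helps when the stray text falls before or after the root element. Error text printed in the middle of the document could still leave it malformed.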

Archivematica handles this error, displays it, and moves on. The Transfer does not fail, but this one particular file does not have any characterization technical metadata from FITS.

It is possible to configure FITS so that NLNZ Metadata Extractor is not run. I tested this with this particular file, and FITS is able to run successfully when NLNZ Metadata Extractor is not used.

Making that configuration change requires logging into the VM and editing the config file at /usr/share/fits/xml/fits.xml to comment out the NLNZ line. The FITS nailgun daemon then needs to be restarted with:

sudo restart fits
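
The edit itself can be done by hand in a text editor. For anyone who wants to script it, the following is a rough sketch of commenting out a single tool entry in an XML config. It assumes each tool is declared on its own line as a self-closing `<tool ... />` element. The helper name `comment_out_tool` and the class names in the sample are hypothetical and are not copied from the real fits.xml.

```python
def comment_out_tool(config_text, needle):
    """Wrap every single-line <tool ... /> entry containing needle in an XML comment.

    Lines that are already commented out do not start with "<tool",
    so running this twice leaves the text unchanged.
    """
    out = []
    for line in config_text.splitlines(keepends=True):
        stripped = line.strip()
        is_tool = stripped.startswith("<tool") and stripped.endswith("/>")
        if is_tool and needle in stripped:
            indent = line[: len(line) - len(line.lstrip())]
            newline = line[len(line.rstrip("\r\n")):]
            out.append(indent + "<!-- " + stripped + " -->" + newline)
        else:
            out.append(line)
    return "".join(out)


# Hypothetical config fragment; the real file's layout may differ.
sample = (
    "<tools>\n"
    '  <tool class="example.tools.Jhove" />\n'
    '  <tool class="example.tools.nlnz.MetadataExtractor" />\n'
    "</tools>\n"
)

print(comment_out_tool(sample, "nlnz"))
```

Back up the config file before applying anything like this to a real installation, and check the result by eye, since the real file may not follow the one-tool-per-line layout assumed here.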

It is not clear right now why the NLNZ Metadata Extractor tool is unable to parse this particular PDF. It appears that the tool thinks the file is compressed and gets an error when it tries to decompress it. Perhaps Acrobat has written some binary data somewhere in the PDF. We could file a ticket on the NLNZ tool project page, but we would need permission to include this file, and it is unlikely we would get a timely response.

I am not sure whether the output of the NLNZ Metadata Extractor is required within the FITS technical metadata for this particular file, or even in general. We can turn it off for this one pilot site for one transfer, turn it off more permanently, or stick with the status quo, in which the file is successfully stored in an AIP without any FITS output.

#2 Updated by Justin Simpson about 7 years ago

Attaching sample file for testing.

#3 Updated by Justin Simpson almost 6 years ago

  • Project changed from Archivematica integration with DuraCloud to Archivematica
  • Assignee deleted (Misty De Meo)
