Feature #852

Add normalization scripts for .pdf files

Added by Evelyn McLellan about 11 years ago. Updated almost 9 years ago.

Status: Verified
Start date:
Priority: Critical
Due date:
Assignee: Evelyn McLellan
% Done: 0%
Category: -
Target version: Release 0.7
Google Code Legacy ID: archivematica-197
Pull Request:
Sponsored:
Requires documentation:

Description

The desired preservation format for ingested PDF files is PDF/A (PDF/Archival). Ghostscript can convert PDF to PDF/A, but it is a two-step process:

pdf2ps sample.pdf (converts the PDF to a PostScript file)
ps2pdf -dPDFA [-additional options] sample.ps (converts the PostScript file to PDF/A)

I'm still working out the best additional options. Note that we have to give the PDF/A file a new name to avoid overwriting the original PDF file.
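The renaming requirement above can be sketched as a small helper. This is only an illustration: the function name `pdfa_output_path` and the `_pdfa` marker are hypothetical and are not the naming scheme Archivematica uses.

```python
from pathlib import Path


def pdfa_output_path(source, marker="_pdfa"):
    """Return a sibling path for the PDF/A copy that differs from `source`.

    `marker` is a hypothetical naming convention chosen for this sketch.
    """
    src = Path(source)
    # Insert the marker before the extension: sample.pdf -> sample_pdfa.pdf
    candidate = src.with_name(f"{src.stem}{marker}{src.suffix}")
    counter = 1
    # If that name is already taken on disk, append a numeric counter.
    while candidate.exists():
        candidate = src.with_name(f"{src.stem}{marker}.{counter}{src.suffix}")
        counter += 1
    return candidate
```

Because the marker is always inserted into the file name, the returned path can never equal the original, so the source PDF is not overwritten.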

[g] Legacy categories: Preservation planning

History

#1 Updated by Evelyn McLellan almost 11 years ago

  • Priority changed from Medium to High

Here's a better way: gs -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=PDFA.pdf sample.pdf. This converts directly from PDF to PDF/A.

[g] Labels added: Priority-High
[g] Labels removed: Priority-Medium

#2 Updated by Evelyn McLellan almost 11 years ago

OK, here is the command to convert PDF to PDF/A: gs -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=sample.pdf sample.pdf.

Note that both preservation and access formats should be PDF/A.

#3 Updated by Joseph Perry almost 11 years ago

[g] New owner: epmclellan

#4 Updated by Evelyn McLellan almost 11 years ago

I'm not the owner of this issue. I provide the preservation plan and commands but Joseph or Austin needs to write the normalization script. Reassigning to Austin.

[g] New owner: Austin Trask

#5 Updated by Austin Trask almost 11 years ago

  • Priority changed from High to Critical

[g] Labels added: Priority-Critical
[g] Labels removed: Priority-High
[g] New owner: berwin22

#6 Updated by Joseph Perry almost 11 years ago

Committed 714

problem:
NORMALIZING: article.pdf {3c209af0-f829-43f3-82b8-02fc4753313f}
Already in access format. No need to normalize.
processing: cp /var/archivematica/sharedDirectory/.currentlyProcessing/2c7f02a4-2093-4660-86a8-57bb9b8098ee/ImagesSIP-copy-34333336666666666666666-0021665d-119c-4d91-a3cf-f71a4cfcac23/objects/article.pdf /var/archivematica/sharedDirectory/.currentlyProcessing/2c7f02a4-2093-4660-86a8-57bb9b8098ee/ImagesSIP-copy-34333336666666666666666-0021665d-119c-4d91-a3cf-f71a4cfcac23/DIP/objects/.
processing completed
processing: gs -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=/var/archivematica/sharedDirectory/.currentlyProcessing/2c7f02a4-2093-4660-86a8-57bb9b8098ee/ImagesSIP-copy-34333336666666666666666-0021665d-119c-4d91-a3cf-f71a4cfcac23/objects/38a4cde5-a40a-459e-a6c7-f198550cad20article.a.pdf /var/archivematica/sharedDirectory/.currentlyProcessing/2c7f02a4-2093-4660-86a8-57bb9b8098ee/ImagesSIP-copy-34333336666666666666666-0021665d-119c-4d91-a3cf-f71a4cfcac23/objects/article.pdf
GPL Ghostscript 8.71: Annotation set to non-printing,
not permitted in PDF/A, reverting to normal PDF output
processing completed

#7 Updated by Joseph Perry almost 11 years ago

See the problem in the previous comment.

[g] New owner: epmclellan

#8 Updated by Austin Trask almost 11 years ago

See this Ghostscript issue: http://bugs.ghostscript.com/show_bug.cgi?id=690803#c5

When this error occurs, it means the original file uses PDF features that are not supported by PDF/A.

This can be circumvented by adding '-dPDFACompatibilityPolicy=1', which omits the nonconforming features from the output.

Please change the command to:
gs -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=sample.pdf sample.pdf
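For a normalization script, the command above could be assembled as an argument list rather than a shell string. The following is a minimal sketch, not the script committed for this issue: `build_pdfa_command` is a hypothetical helper, and the check that input and output differ is an added safeguard reflecting the earlier note about not overwriting the original.

```python
from pathlib import Path


def build_pdfa_command(source, output, compatibility_policy=1):
    """Build the gs argument list for a PDF to PDF/A conversion.

    Hypothetical helper: the flags are the ones quoted in this issue.
    """
    # Refuse to write the PDF/A output on top of the original file.
    if Path(source).resolve() == Path(output).resolve():
        raise ValueError("output must not be the same file as the source")
    return [
        "gs",
        "-dPDFA",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        # 1 = drop features that PDF/A does not permit instead of
        # reverting to normal PDF output.
        f"-dPDFACompatibilityPolicy={compatibility_policy}",
        f"-sOutputFile={output}",
        str(source),
    ]
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with the long paths under sharedDirectory.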

#9 Updated by Evelyn McLellan almost 11 years ago

  • Status changed from New to Verified

#10 Updated by Joseph Perry almost 11 years ago

  • Status changed from Verified to In progress

gs -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile

Tested on my machine with the article.pdf in the sample SIP: it fills the buffer with warnings and crashes the MCPClient.

GPL Ghostscript 8.71: Annotation set to non-printing,\n not permitted in PDF/A, annotation will not be present in output file\nGPL Ghostscript 8.71: Annotation set to non-printing,\n not permitted in PDF/A, annotation will not be present in output file\n
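This comment does not say which buffer filled up or how the MCPClient launches gs. As a general precaution for a child process that emits this many warnings, both output streams can be read while the process runs, so neither pipe can fill and block the child. The sketch below is an assumption-laden illustration, not the MCPClient code; `run_and_capture` is a hypothetical name.

```python
import subprocess


def run_and_capture(argv, timeout=None):
    """Run `argv` and return (returncode, stdout, stderr).

    subprocess.run reads both pipes while the child is running, so a
    child that prints a large volume of warnings cannot stall on a
    full pipe buffer.
    """
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,  # decode output to str (Python 3.7+)
        timeout=timeout,
    )
    return completed.returncode, completed.stdout, completed.stderr
```

The captured stderr can then be logged or truncated by the caller instead of being left to accumulate in a pipe.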

#11 Updated by Joseph Perry almost 11 years ago

The command works fine on the command line, and it also works when run via sudo -u archivematica. It did not work when archivematica/MCP/transcoder/ was calling it.

It worked if the command was sudo'd in the MCP.
Fixed: added gs to the archivematica sudo commands.

#12 Updated by Evelyn McLellan almost 11 years ago

  • Status changed from In progress to Verified
