Feature #10062

Use pdfinfo to speed up counting pages in imported PDF document

Added by David Juhasz almost 6 years ago. Updated almost 6 years ago.

Status: Verified
Start date: 06/22/2016
Priority: Critical
Due date:
Assignee: Dan Gillean
% Done: 0%
Category: Digital object
Target version: Release 2.3.0
Google Code Legacy ID:
Tested version:
Sponsored: No
Requires documentation:

Description

Background:
For feature #8412 we added code to count the number of pages in a PDF document (via ImageMagick's identify command) to ensure we don't try to create derivatives from a non-existent page. Unfortunately, identify is quite slow, especially for large PDFs: it took 90s for one sample with 1,000 pages. Exacerbating the problem, the identify command is called four (4) times per PDF, twice each for the reference and thumbnail derivatives.

In testing we've found pdfinfo to be significantly faster, taking only 1s for the same 1,000 page PDF that took 90s with identify.


Related issues

Related to Access to Memory (AtoM) - Feature #8412: Add new setting to select page no. in PDF derivatives Verified 05/08/2015

History

#1 Updated by David Juhasz almost 6 years ago

  • Description updated (diff)

#3 Updated by David Juhasz almost 6 years ago

  • Related to Feature #8412: Add new setting to select page no. in PDF derivatives added

#4 Updated by Nick Wilkinson almost 6 years ago

  • Assignee changed from Steve Breker to Mike Cantelon

Hi Mike, assigning this to you for inclusion in the 2.3 release.

#5 Updated by David Juhasz almost 6 years ago

If the pdfinfo utility is not installed on the host server, I think we should default to the old behaviour of using the first page of the PDF, effectively ignoring the "PDF page number" setting. On the Settings page it would be good to add a check for pdfinfo and replace the "PDF page number" setting number box with a warning that "pdfinfo required for this setting" (wording is up for revision) if the utility is not installed.
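
A rough shell sketch of the fallback proposed in this comment, for illustration only. The names PDFINFO_BIN, pdfinfo_available and effective_pdf_page are hypothetical and do not come from the AtoM code base; the real check would live in the PHP settings code.

```shell
# Sketch only: PDFINFO_BIN, pdfinfo_available and effective_pdf_page are
# hypothetical names, not AtoM identifiers.
PDFINFO_BIN=${PDFINFO_BIN:-pdfinfo}

# Succeed when the pdfinfo command can be found.
pdfinfo_available() {
  command -v "$PDFINFO_BIN" >/dev/null 2>&1
}

# Print the page to use for derivatives, given the configured "PDF page number".
effective_pdf_page() {
  if pdfinfo_available; then
    echo "$1"
  else
    # Without pdfinfo, warn and ignore the setting: always use the first page.
    echo "pdfinfo required for this setting" >&2
    echo 1
  fi
}
```

For example, `effective_pdf_page 5` would print 5 when pdfinfo is found and 1 (plus the warning on stderr) when it is not.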

#6 Updated by David Juhasz almost 6 years ago

Detailed analysis by Steve Breker:

Summary of calls:

In lib/model/QubitDigitalObject.php in function createRepresentations:
- we call $this->createReferenceImage($connection); followed by
- $this->createThumbnail($connection);

Both of these functions call 'createImageDerivative' and ultimately 'resizeImage':

'resizeImage' in QubitDigitalObject does a few things:
- instantiates a new sfThumbnail object
- calls sfThumbnail's 'loadFile' method
- calls sfThumbnail's 'toString' method

These two calls ('loadFile' and 'toString') pass through sfThumbnail.class.php to sfImageMagickAdapter.class.php through functions of the same names. The performance hit is in sfImageMagickAdapter.class.php and is traceable to the getCount function in this file. 'getCount' calls ImageMagick's 'identify', which is run via PHP exec. It is this call that is slow, each time taking about 90 secs with a 1000-page demo PDF. 'getCount' is called twice per derivative (and we make two derivatives per PDF): once through 'loadFile', and once through 'toString' (in function save()). The call stack looks like:

loadFile() -> getExtract() -> getCount()
toString() -> save() -> getExtract() -> getCount()
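
To make the cost concrete, the figures reported in this comment can simply be multiplied out. This is only arithmetic on the numbers above (two derivatives, two getCount calls each, about 90 seconds per identify call); the variable names are illustrative.

```shell
# Figures taken from the analysis above; variable names are illustrative.
derivatives=2            # reference image and thumbnail
calls_per_derivative=2   # once via loadFile(), once via toString()/save()
identify_secs=90         # approximate time per identify call on the 1000-page PDF

total_calls=$((derivatives * calls_per_derivative))
total_secs=$((total_calls * identify_secs))

echo "identify calls per PDF: $total_calls"                               # 4
echo "time counting pages: ${total_secs}s ($((total_secs / 60)) min)"     # 360s (6 min)
```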

Two ways to cut down on the time this takes:

(A) will probably get the most bang for the buck even if (B) is omitted. In my tests, doing (A) on a system that has pdfinfo installed reduces the time to load this PDF from 6 minutes to about 9 seconds in total. (B) matters most on a system without pdfinfo (and grep and awk).

A) Modify getCount() in atom\plugins\sfThumbnailPlugin\lib\sfImageMagickAdapter.class.php to check whether the program 'pdfinfo' is available on the system. If so, use it in preference to ImageMagick's 'identify' (BUT ONLY IF THE FILE IS A PDF - pdfinfo won't work for anything else). If not, fall back to the old logic using 'identify', as it does today. On my Vagrant VM, counting pages with 'identify' on the command line takes 90 seconds; using pdfinfo takes about one second.

Current command being run ('identify'):
identify -format %n '/usr/share/nginx/atom/uploads/r/null/8/2/e/82e0941d31fbc416c46932da7e8c1a8cd708f894e82233eed8b9f337d9f9cd09/VM1-4-1_4eS_Index_0000_NomsRaisonsSociales_L_opp.pdf'

New command using pdfinfo:
pdfinfo $file | grep Pages: | awk '{print $2}'

...or using the above filename:
pdfinfo '/usr/share/nginx/atom/uploads/r/null/8/2/e/82e0941d31fbc416c46932da7e8c1a8cd708f894e82233eed8b9f337d9f9cd09/VM1-4-1_4eS_Index_0000_NomsRaisonsSociales_L_opp.pdf' | grep Pages: | awk '{print $2}'

Tasks
- add logic to check if this is a pdf or not. ~1 hour
- add logic to conditionally run pdfinfo if it can be found on system. ~2 hours
- test - ~2 hours

- about 5 hours (not padded)
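
The logic in (A) could be sketched in shell roughly as follows. This is a sketch under assumptions, not the code that was merged: is_pdf, count_pages and PDFINFO_BIN are made-up names, the PDF check is a simple file-extension test, and awk alone stands in for the grep/awk pipeline shown above.

```shell
# Sketch only: is_pdf, count_pages and PDFINFO_BIN are illustrative names,
# not taken from sfImageMagickAdapter.class.php.
PDFINFO_BIN=${PDFINFO_BIN:-pdfinfo}

# Succeed when the file name ends in .pdf (case-insensitive).
# Assumption: an extension check is good enough for this illustration.
is_pdf() {
  case "$1" in
    *.[Pp][Dd][Ff]) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the page count of a file, preferring pdfinfo for PDFs.
count_pages() {
  if is_pdf "$1" && command -v "$PDFINFO_BIN" >/dev/null 2>&1; then
    # pdfinfo prints a line such as "Pages:          1000".
    "$PDFINFO_BIN" "$1" | awk '/^Pages:/ {print $2}'
  else
    # Old behaviour: the identify command from the ticket (slow on large PDFs).
    identify -format %n "$1"
  fi
}
```

Keeping the identify branch as the fallback means non-PDF files, and hosts without pdfinfo, behave as they do today.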

B) Currently for each derivative generated, getCount is run twice - once via function loadFile() and once from function save(). This could be reduced to one call if we saved the count in sfImageMagickAdapter.class.php in $this->options['extract'] and checked to see if it was available in getExtract before running getCount.

Tasks
- add code to check this variable for a page count and, if found, do not run getCount from getExtract. ~ .5 hour
- test - 2 hours

- about 2.5 hours (not padded)
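
The caching idea in (B) is plain memoization. A minimal shell sketch of the same idea follows; the real change would be in PHP, storing the count on the adapter, and every name here (cached_file, cached_count, slow_calls, run_identify, get_page_count) is hypothetical.

```shell
# Sketch only: all names here are illustrative, not AtoM identifiers.
cached_file=""
cached_count=""
slow_calls=0   # how many times the expensive command actually ran

# The expensive page count, using the identify command from the ticket.
run_identify() {
  identify -format %n "$1"
}

# Set page_count for a file, reusing the cached value when the same file
# is asked for again.
get_page_count() {
  if [ "$cached_file" != "$1" ] || [ -z "$cached_count" ]; then
    cached_count=$(run_identify "$1")
    cached_file=$1
    slow_calls=$((slow_calls + 1))
  fi
  page_count=$cached_count
}
```

Calling get_page_count twice for the same file runs identify only once, which mirrors going from two getCount calls per derivative to one.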

C) Refactor the thumbnail plugin to only load the PDF once, and pass in a list of the derivatives you need. This could potentially reduce the number of calls to identify from 4 to 1, but would involve reworking a large amount of the sfThumbnail plugin. This would take a fair amount of time, so I am not seriously considering this as an option.

#7 Updated by Mike Cantelon almost 6 years ago

QubitDigitalObject::setPageCount also needs to be updated (it currently directly executes the identify command).

#8 Updated by Mike Cantelon almost 6 years ago

  • Status changed from New to Feedback
  • Assignee changed from Mike Cantelon to David Juhasz

OK, I've pretty much got it worked out for thumbnail page selection. Should I also implement pdfinfo page counting for PDFs when the "Upload multi-page files as multiple descriptions" option is turned on? If so, I could have it either:

a) not create multiple descriptions if pdfinfo isn't installed, or
b) default to using ImageMagick if pdfinfo isn't installed.

#9 Updated by David Juhasz almost 6 years ago

Hi Mike,

Are (a) and (b) the same amount of work? How much work would you estimate for each option?

#10 Updated by David Juhasz almost 6 years ago

  • Assignee changed from David Juhasz to Mike Cantelon

#11 Updated by Mike Cantelon almost 6 years ago

  • Assignee changed from Mike Cantelon to David Juhasz

Both are the same amount of work and should take little time given I've isolated what needs to happen code-wise.

#12 Updated by David Juhasz almost 6 years ago

  • Assignee changed from David Juhasz to Mike Cantelon

Okay, please go ahead with option (b) then. Thanks!

#13 Updated by Mike Cantelon almost 6 years ago

  • Status changed from Feedback to Code Review
  • Assignee changed from Mike Cantelon to Nick Wilkinson

#14 Updated by Nick Wilkinson almost 6 years ago

  • Assignee changed from Nick Wilkinson to Mike Gale

Hi Mike G, assigning to you for CR.

#15 Updated by David Juhasz almost 6 years ago

  • Priority changed from Medium to Critical

#16 Updated by Mike Gale almost 6 years ago

  • Assignee changed from Mike Gale to Mike Cantelon

Eek, not sure why this was bumped to critical.

It looks good to me, Mike C. I just had 2 minor style nitpicks and 1 question.

cheers

#17 Updated by Mike Cantelon almost 6 years ago

  • Status changed from Code Review to QA/Review
  • Assignee changed from Mike Cantelon to Nick Wilkinson

I've added Mike G's suggestions and merged the resulting code to qa/2.3.x.

#18 Updated by Nick Wilkinson almost 6 years ago

  • Assignee changed from Nick Wilkinson to Dan Gillean

#19 Updated by David Juhasz almost 6 years ago

Mike G. I bumped this to critical because I think it's a must have for the 2.3 release. The performance hit from this bug is big enough that I think it would be a serious problem for AtoM users that have a significant number of PDFs.

#20 Updated by Dan Gillean almost 6 years ago

  • Status changed from QA/Review to Feedback
  • Assignee changed from Dan Gillean to Mike Cantelon

Hi Mike,

I'm not yet sure if this is related to this development or not, but I'm getting a whole bunch of filesystem permission errors out of this.

I'm using the 2.3 vagrant box. Previously it's all worked well and I haven't encountered any kinds of permission errors before. I recently had it on the qa/2.4.x branch, but I switched back, did a purge, did a rebase, etc.

I had no problem using the PDF derivative page setting and seeing PDF files link/upload quickly, using the correct page as per the setting.

However, after that I tried to run the digitalobject:regen-derivatives task. Right away I couldn't get the task to run because of permissions errors. I tried running the global file systems permission setting (sudo chown -R www-data:www-data /usr/share/nginx/atom), and then I ran the task as sudo -u www-data, which worked. But when I tried to reload my descriptions in the UI to see if it worked, I get 500 errors.

Looking in the logs, these appear to be permissions errors, e.g.:

2016/07/05 17:06:13 [error] 3294#0: *28 FastCGI sent in stderr: "PHP message: Unable to open the log file "/usr/share/nginx/atom/log/qubit_prod.log" for writing" while reading response header from upstream, client: 10.10.10.1, server: _, request: "GET /photograph-of-george-gale HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.atom.sock:", host: "10.10.10.10", referrer: "http://10.10.10.10/photograph-of-george-gale/addDigitalObject" 

As I mentioned, I've never had filesystem permission issues prior to this, and since the qubit_prod.log file is in /usr/share/nginx/atom, I would have expected that re-running the chown command for the www-data user should have resolved any permissions issues, if changes etc. were being executed as the www-data user. Any thoughts?

#21 Updated by Mike Cantelon almost 6 years ago

  • Assignee changed from Mike Cantelon to Dan Gillean

What happens when you run it using the www-data user?

i.e.

sudo -u www-data ./symfony digitalobject:regen-derivatives

#22 Updated by Dan Gillean almost 6 years ago

  • Status changed from Feedback to QA/Review

Ehhhh, nevermind, I don't think the above is related to this.

Tested on a different VM and it all worked as expected. I essentially uploaded PDFs with the different page setting used, and tried the digital object regen task. Are there other ways I should test this (without having to uninstall pdfinfo unless that is totally necessary)?

looks good so far! Will wait for feedback before marking verified

#23 Updated by Mike Cantelon almost 6 years ago

Awesome... you can test the multiple-descriptions-per-digital-object functionality too as that's also optimized.

#24 Updated by Dan Gillean almost 6 years ago

  • Status changed from QA/Review to Verified

LGTM!
