Feature #3202

Add the full-text content of uploaded PDFs to the search index

Added by Peter Van Garderen over 10 years ago. Updated over 7 years ago.

Status: Verified
Start date:
Priority: Critical
Due date:
Assignee: David Juhasz
% Done: 0%
Category: -
Target version: Release 1.3
Google Code Legacy ID: atom-1252
Tested version:
Sponsored:
Requires documentation:

Description

  • maybe by creating a 'full-text' field in the search index document
  • should look at a way to read in full-text from born-digital PDFs and OCR
    text from scanned PDFs (if provided)
  • look at extending to other text objects? .txt, .doc, .rtf

[g] Legacy categories: Search / browse

History

#1 Updated by Anonymous about 10 years ago

  • Priority set to Low

[g] Labels added: Priority-Low

#2 Updated by Peter Van Garderen over 9 years ago

  • Priority changed from Low to High
  • Target version set to Release 1.2

I know several OJS-based journals that have successfully used pdftotext to index their PDF-based articles. It basically extracts the text from a PDF and makes it available to whatever backend you have; i.e., it can be used to locate a specific PDF, but it doesn't do in-document highlighting or anything like that.

MJ
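
As a rough illustration of the pdftotext approach described above, here is a minimal Python sketch (not the project's PHP code). It assumes the pdftotext command-line tool is on the PATH; the helper names `build_pdftotext_command` and `extract_pdf_text` are hypothetical. Passing "-" as the output file asks pdftotext to write the extracted text to standard output.

```python
import subprocess
from typing import Callable, List


def build_pdftotext_command(pdf_path: str) -> List[str]:
    """Build the argument list for pdftotext; "-" sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_pdf_text(pdf_path: str, run: Callable = subprocess.run) -> str:
    """Return the text of a PDF (hypothetical helper, assumes pdftotext is installed)."""
    result = run(
        build_pdftotext_command(pdf_path),
        capture_output=True,  # collect stdout instead of printing it
        check=True,           # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string could then be handed to whatever search backend is in use. A scanned PDF would only yield text this way if it already carries an OCR text layer.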

On 2011-03-28, at 12:58 PM, David Juhasz wrote:

Oh, also this:
http://en.wikipedia.org/wiki/Pdftotext

On 28-Mar-11, at 9:56 AM, David Juhasz wrote:

How about this?
http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/

Looks like it uses a "pdfinfo" application that I've never heard of, but it looks pretty straightforward. I wonder if we could do something simpler (i.e., just indexing the contents) using Ghostscript?

David

On 27-Mar-11, at 8:53 AM, Peter Van Garderen wrote:

We'll need some type of support for this in the 1.x branch. Ideally we can make this work with ZSL or, if not, some other PHP component that can plug in relatively painlessly to the existing Qubit architecture and minimum system requirements.

The ugliest hack discussed thus far is to use a full-text search box via Google site search.

[g] Labels added: Milestone-Release-1.2, Priority-High
[g] Labels removed: Milestone-Release-Post-1.2, Priority-Low
[g] New owner: MJ Suhonos

#3 Updated by David Juhasz about 9 years ago

  • Priority set to Medium

[g] Labels added: Priority-Medium

#4 Updated by David Juhasz over 8 years ago

  • Target version set to Release 1.3

Roll over to Release 1.3

[g] Labels added: Milestone-Release-1.3

#5 Updated by Jesús García Crespo about 8 years ago

[g] New owner: David Juhasz

#6 Updated by David Juhasz about 8 years ago

Reassign to David's new account.

[g] New owner: David Juhasz

#7 Updated by David Juhasz about 8 years ago

  • Status changed from New to QA/Review

Fixed in r11863.

#8 Updated by David Juhasz about 8 years ago

QA Note: Requires the pdftotext utility to be installed on the server.

#9 Updated by Jessica Bushey about 8 years ago

  • Status changed from QA/Review to Feedback

Testing produced the following results:
1) Only the first page of the PDF is searchable.
2) If the PDF is uploaded as a multi-page object with multiple descriptions, only the first page of the PDF is searchable.

#10 Updated by David Juhasz about 8 years ago

  • Status changed from Feedback to QA/Review

The PDF text is probably being truncated because the database field only allows 255 characters. In r12026 the size of the database field has been increased to allow up to 65535 characters to be stored.
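
The suspected truncation can be illustrated with a small Python sketch. This is a simplified model, not the actual database code: the sample text, the `store` and `is_searchable` helpers, and the substring match standing in for the search index are all assumptions made for illustration. It relies on pdftotext separating pages with a form feed character by default.

```python
# Hypothetical two-page document: the distinctive term on page two sits at its end.
page_one = "minutes " + "accession register " * 10
page_two = "correspondence file " * 10 + "ledger"
extracted_text = page_one + "\f" + page_two  # form feed marks the page break


def store(text: str, max_chars: int) -> str:
    """Mimic a size-limited field by keeping only the first max_chars characters."""
    return text[:max_chars]


def is_searchable(term: str, stored_text: str) -> bool:
    """Naive stand-in for the search index: a case-insensitive substring match."""
    return term.lower() in stored_text.lower()


short_field = store(extracted_text, 255)
long_field = store(extracted_text, 65535)
```

With the 255-character limit, `is_searchable("minutes", short_field)` is true but `is_searchable("ledger", short_field)` is false, which matches the "only first page is searchable" symptom. With the 65535-character limit both terms are found.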

To test r12026 you must upgrade the database to the latest version. To upgrade the database, run the following on the command-line:

svn update
php symfony tools:upgrade

#11 Updated by Jessica Bushey about 8 years ago

  • Status changed from QA/Review to Feedback

I ran the command-line instructions in my VM, and the response was that I was already at the latest upgrade.
But during testing, no luck: I'm not getting any hits in the main search box for terms that exist in the PDF.

#12 Updated by Peter Van Garderen about 8 years ago

  • Priority changed from Medium to Critical

[g] Labels added: Priority-Critical
[g] Labels removed: Priority-Medium

#13 Updated by Jessica Bushey about 8 years ago

In ADMIN > Settings > Global > Upload multi-page files as multiple descriptions > Select "No": PDF search works.

In ADMIN > Settings > Global > Upload multi-page files as multiple descriptions > Select "Yes": PDF search does not work.

#14 Updated by David Juhasz almost 8 years ago

  • Status changed from Feedback to QA/Review

Multi-page PDF text is now extracted as of r12201.

Please note that the extracted text is linked to the description for the PDF, not to the individual page images (i.e., if you search for text in the document, you will always be linked to the general PDF description, not to the individual page that the text is on).
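
The behaviour noted above can be sketched as a toy inverted index in Python. This is only an illustration under assumed names and data (`build_index`, `search`, and the description IDs 101 and 102 are hypothetical), not the Zend_Search_Lucene-based implementation: the text of every page is pooled under the parent description, so a hit identifies the PDF's description but not the page.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(page_texts_by_description: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    """Map each lowercased word to the IDs of the descriptions whose PDF contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for description_id, page_texts in page_texts_by_description.items():
        # All pages are pooled under the parent description, so page numbers are lost.
        for page_text in page_texts:
            for word in page_text.lower().split():
                index[word].add(description_id)
    return index


def search(index: Dict[str, Set[int]], term: str) -> Set[int]:
    """Return the description IDs that match a single search term."""
    return index.get(term.lower(), set())


# Hypothetical data: description 101 has a two-page PDF, description 102 a one-page PDF.
index = build_index({
    101: ["annual report", "treasurer ledger"],
    102: ["meeting minutes"],
})
```

Here `search(index, "ledger")` returns `{101}`: the term is on the second page, but the result points only at the parent description.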

#15 Updated by Jessica Bushey almost 8 years ago

  • Status changed from QA/Review to Verified

Works with both multiple descriptions and a single description.
Searching for a specific term results in a hit for the relevant PDF.
Note about behaviour: The term is NOT highlighted within the PDF, and the actual location / page where the term appears is NOT identified or included in the results. But every term that is in the PDF will produce a hit when using the main search box of ICA-AtoM.
