Bug #13095

500 error when linking a PDF digital object with a large amount of text content

Added by David Hume 4 months ago. Updated 3 months ago.

Status: Feedback
Start date: 06/20/2019
Priority: Medium
Due date:
Assignee: José Raddaoui Marín
% Done: 0%
Category: Digital object
Target version: Release 2.5.1
Google Code Legacy ID:
Tested version:
Sponsored: No
Requires documentation:

Description

When linking a PDF digital object that contains enough text, whether uploaded locally or linked via reference URL, a 500 error now occurs.

This likely happens during the PDF/OCR step that captures the text layer: the extracted text exceeds the size of the target field, and the strict enforcement that is now the default in MySQL 5.7 turns that overflow into an error.

20190530pm-Hansard-n267.pdf (1020 KB) David Hume, 06/20/2019 01:29 PM

Coquitlam_garbage-amp-recycling-collection-schedule.pdf (758 KB) David Hume, 06/20/2019 01:29 PM


Related issues

Related to Access to Memory (AtoM) - Bug #13174: Jobs can hang if output is greater than 64K Verified 09/12/2019

History

#1 Updated by David Hume 4 months ago

A PDF file with more graphics than text (such as the attached 760 KB municipal recycling pickup schedule) still works, whereas a client's smaller 270 KB paper containing more text triggered the error.

As a test example, attempting to link the attached Hansard transcript produces the following log entries:

access.log

[20/Jun/2019:19:36:34 +0000]  184.69.130.182 - 142.4.223.137 upgrtest.accesstomemory.org atom_upgrtest  "POST /humepdf-3/object/addDigitalObject HTTP/1.0" 500 866 
"https://upgrtest.accesstomemory.org/humepdf-3/object/addDigitalObject" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0" 2.565 2.516 .

error.log

2019/06/20 19:36:34 [error] 11573#11573: *729002 FastCGI sent in stderr: "PHP message: Unable to execute INSERT statement. [wrapped: SQLSTATE[22001]: String data, right truncated: 1406 Data too long
 for column 'value' at row 1]" while reading response header from upstream, client: 142.4.223.137, server: upgrtest.accesstomemory.org, request: "POST /humepdf-3/object/addDigitalObject HTTP/1.0",
 upstream: "fastcgi://unix:/var/run/php-fpm.atom.sock:", host: "upgrtest.accesstomemory.org", referrer: "https://upgrtest.accesstomemory.org/humepdf-3/object/addDigitalObject" 

#2 Updated by Dan Gillean 4 months ago

Some context for diagnosing this issue and coming up with a long-term solution:

Based on the following code, it appears that, per L3186, the extracted text is being stored in the value column of the property_i18n table:

The value column in the property_i18n table is currently set as a TEXT type field:

mysql> describe property_i18n;
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| value   | text        | YES  |     | NULL    |       |
| id      | int(11)     | NO   | PRI | NULL    |       |
| culture | varchar(16) | NO   | PRI | NULL    |       |
+---------+-------------+------+-----+---------+-------+
3 rows in set (0.01 sec)
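The 64 KB cap on that column can be confirmed from the data dictionary. A read-only check (assuming access to the AtoM database):

```sql
-- TEXT columns report a 65,535-byte capacity regardless of charset;
-- CHARACTER_MAXIMUM_LENGTH is in characters, CHARACTER_OCTET_LENGTH in bytes.
SELECT COLUMN_NAME, DATA_TYPE,
       CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
  FROM information_schema.COLUMNS
 WHERE TABLE_NAME = 'property_i18n'
   AND COLUMN_NAME = 'value';
```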

MySQL 5.7 makes knowing the actual length of a TEXT field... confusing. According to the docs:

"TEXT and BLOB columns are implemented differently in the NDB storage engine, wherein each row in a TEXT column is made up of two separate parts. One of these is of fixed size (256 bytes), and is actually stored in the original table. The other consists of any data in excess of 256 bytes, which is stored in a hidden table. The rows in this second table are always 2000 bytes long. This means that the size of a TEXT column is 256 if size <= 256 (where size represents the size of the row); otherwise, the size is 256 + size + (2000 × (size − 256) % 2000)."


In MySQL 5.6, TEXT fields could accept approximately 64 KB, or exactly 65,535 bytes. We also have to factor in that UTF-8 encoding uses multiple bytes for some characters, so this limit is not a 1:1 character count for text.
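The byte-versus-character distinction is easy to check in a MySQL session (illustrative; assumes a utf8mb4 connection):

```sql
-- LENGTH() counts bytes, CHAR_LENGTH() counts characters.
-- 'résumé' is 6 characters but 8 bytes in UTF-8 (each 'é' takes 2 bytes),
-- so a 65,535-byte TEXT column holds fewer than 65,535 such characters.
SELECT LENGTH('résumé') AS bytes, CHAR_LENGTH('résumé') AS chars;
```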

Previously in 5.6, when the TEXT field limit was reached during the import of a digital object with a text layer, the transcript would simply be clipped and the import would proceed without issue. Thanks (most likely) to the stricter default settings and behaviors in MySQL 5.7, this now raises an error that in turn aborts the import, surfacing to the user as a 500 error.
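A minimal sketch of that behaviour change (hypothetical session; REPEAT() stands in for an extracted PDF text layer, and the id/culture values are made up):

```sql
-- Under MySQL 5.7's default STRICT_TRANS_TABLES, an oversized value
-- aborts the INSERT with error 1406 (the error seen in error.log above):
INSERT INTO property_i18n (id, culture, value)
VALUES (9999, 'en', REPEAT('x', 70000));
-- ERROR 1406 (22001): Data too long for column 'value' at row 1

-- With strict mode off (the effective pre-5.7 behaviour), the same INSERT
-- succeeds, silently truncating the value and recording only a warning:
SET SESSION sql_mode = '';
INSERT INTO property_i18n (id, culture, value)
VALUES (9999, 'en', REPEAT('x', 70000));
SHOW WARNINGS;
```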

#3 Updated by David Juhasz 4 months ago

  • Status changed from New to Code Review

#4 Updated by José Raddaoui Marín 4 months ago

  • Status changed from Code Review to Feedback
  • Assignee set to David Juhasz

#5 Updated by David Juhasz 4 months ago

This error was also occurring when uploading a PDF finding aid with more than 64KB of text.

#6 Updated by Corinne Rogers 4 months ago

  • Status changed from Feedback to Verified

#7 Updated by José Raddaoui Marín 3 months ago

  • Status changed from Verified to Feedback
  • Assignee changed from David Juhasz to Dan Gillean

We're seeing the same issue in the jobs output column:

https://groups.google.com/forum/#!msg/ica-atom-users/kx1bg_BDQnQ
https://groups.google.com/forum/#!topic/ica-atom-users/Qk86BWttpwc

As suggested by Brandon in one of those user forum threads, I wonder if it would be better to modify the affected columns' type to MEDIUMTEXT, maybe not for the PDF contents, but at least for this one. We should also take a deeper look to see if this may happen anywhere else.
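If the MEDIUMTEXT route is taken for the jobs table, the migration would be a one-line ALTER along these lines (table and column names assumed from AtoM's job table; verify before running, and ship it as a proper schema migration):

```sql
-- MEDIUMTEXT stores up to 16,777,215 bytes (16 MB) vs TEXT's 65,535 (64 KB).
ALTER TABLE job MODIFY COLUMN output MEDIUMTEXT;
```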

Please, feel free to re-close this issue and create a new one if you think that's better.

#8 Updated by David Juhasz 3 months ago

  • Assignee changed from Dan Gillean to José Raddaoui Marín

Hi Radda,

Changing the size of the jobs output column has the potential to greatly increase the amount of data stored in the AtoM database, and consequently the size of the database on disk. I think increasing the column width is a major change to the application functionality that should not be made in a patch release like 2.5.2, but I'm open to discussing it for 2.6.

For AtoM 2.5.x I think we should assess each situation where the column width is exceeded individually and decide on the appropriate remedy based on the particular case. In my opinion any job that is overflowing a 64Kb column is probably too verbose, and the best solution is to reduce the output verbosity. Another possible option is truncating the output in the code, which would replicated the "non-strict" MySQL behaviour of silently truncating data that exceeds the column width. I would argue that silently truncating data is a bad long term strategy in most cases, which is why the strict MySQL behaviour of failing on overflow has been implemented.

José Raddaoui Marín wrote:

We're seeing the same issue in the jobs output column:

https://groups.google.com/forum/#!msg/ica-atom-users/kx1bg_BDQnQ
https://groups.google.com/forum/#!topic/ica-atom-users/Qk86BWttpwc

As suggested by Brandon in one of those user forum threads, I wonder if it would be better to modify the affected columns' type to MEDIUMTEXT, maybe not for the PDF contents, but at least for this one. We should also take a deeper look to see if this may happen anywhere else.

Please, feel free to re-close this issue and create a new one if you think that's better.

#9 Updated by José Raddaoui Marín about 1 month ago

  • Related to Bug #13174: Jobs can hang if output is greater than 64K added
