Bug #10226

MCP server error when normalization tool's stderr output contains non-unicode characters

Added by Hector Akamine almost 6 years ago. Updated over 5 years ago.

Status:QA/ReviewStart date:08/22/2016
Priority:MediumDue date:
Assignee:Sarah Romkey% Done:

0%

Category:-
Target version:-
Google Code Legacy ID: Pull Request:https://github.com/artefactual/archivematica/pull/497
Sponsored:No Requires documentation:

Description

Ref. #10212 notes 3 and 5
archivematica does not check/sanitize the stderr/stdout from the normalization tool before attempting to save to the Tasks table, breaking when it contains non UTF-8 characters

History

#1 Updated by Hector Akamine almost 6 years ago

  • Subject changed from MCP server error when normalization tool stderr output contains non-unicode characters to MCP server error when normalization tool's stderr output contains non-unicode characters

#5 Updated by Hector Akamine almost 6 years ago

  • Assignee set to Nick Wilkinson

Hi Nick, this is related to #10212 (MoMA, this problem potentially breaks their automation workflow). Could you please assign a developer to estimate how much effort would take to fix it?

#6 Updated by Nick Wilkinson almost 6 years ago

  • Assignee changed from Nick Wilkinson to Jesús García Crespo

#7 Updated by Hector Akamine almost 6 years ago

I uploaded the .wav file mentioned in https://projects.artefactual.com/issues/10212#note-5 to tamias:/home/artefactual/clients/moma/issue-10226/

#8 Updated by Jesús García Crespo almost 6 years ago

I was able to reproduce in stable/1.5.x:

ffprobe version 3.0-7:3.0-2~14.04 Copyright (c) 2007-2016 the FFmpeg developers
  built with gcc 4.8 (Ubuntu 4.8.4-2ubuntu1~14.04.1)
  configuration: --arch=amd64 --enable-pthreads --enable-runtime-cpudetect --extra-version='7:3.0-2~14.04' --libdir=/usr/lib/x86_64-linux-gnu --disable-stripping --prefix=/usr --enable-bzlib --enable-libdc1394 --enable-frei0r --enable-gnutls --enable-libgsm --enable-libmp3lame --enable-librtmp --enable-libopencv --enable-libopenjpeg --enable-libpulse --enable-libschroedinger --enable-libspeex --enable-libtheora --enable-vaapi --enable-vdpau --enable-libvorbis --enable-libvpx --enable-zlib --enable-gpl --enable-postproc --enable-libcdio --enable-x11grab --enable-libx264 --shlibdir=/usr/lib/x86_64-linux-gnu --enable-shared --disable-static
  avcodec     configuration: --arch=amd64 --enable-pthreads --enable-runtime-cpudetect --extra-version='7:3.0-2~14.04' --libdir=/usr/lib/x86_64-linux-gnu --disable-stripping --prefix=/usr --enable-bzlib --enable-libdc1394 --enable-frei0r --enable-gnutls --enable-libgsm --enable-libmp3lame --enable-librtmp --enable-libopencv --enable-libopenjpeg --enable-libpulse --enable-libschroedinger --enable-libspeex --enable-libtheora --enable-vaapi --enable-vdpau --enable-libvorbis --enable-libvpx --enable-zlib --enable-gpl --enable-postproc --enable-libcdio --enable-x11grab --enable-libx264 --shlibdir=/usr/lib/x86_64-linux-gnu --enable-shared --disable-static --enable-libopencore-amrnb --enable-version3 --enable-libopencore-amrwb --enable-version3 --enable-libvo-amrwbenc --enable-version3
  libavutil      55. 17.103 / 55. 17.103
  libavcodec     57. 24.102 / 57. 24.102
  libavformat    57. 25.100 / 57. 25.100
  libavdevice    57.  0.101 / 57.  0.101
  libavfilter     6. 31.100 /  6. 31.100
  libswscale      4.  0.100 /  4.  0.100
  libswresample   2.  0.101 /  2.  0.101
  libpostproc    54.  0.100 / 54.  0.100
Input #0, wav, from '/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/Prueba005-46c72ea9-3b9f-41ae-bca0-1b1a35ec52ca/objects/(PaulStretch_massappeal).wav':
  Metadata:
    title           : (PaulStretch_massappeal)
    NITR            : NTKB?
  Duration: 00:28:32.24, bitrate: 1411 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, 2 channels, s16, 1411 kb/s
[xml @ 0x2056e40] 3 invalid UTF-8 sequence(s) found in string 'NTKB?', replaced with '�'
ERROR     2016-08-24 21:34:22  archivematica.mcp.server:utils:wrapped:15:  Uncaught exception
Traceback (most recent call last):
  File "/vagrant/src/archivematica/src/MCPServer/lib/utils.py", line 13, in wrapped
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django_mysqlpool/__init__.py", line 10, in wrapper
    return f(*args, **kwargs)
  File "/vagrant/src/archivematica/src/MCPServer/lib/taskStandard.py", line 88, in performTask
    self.check_request_status(completed_job_request)
  File "/vagrant/src/archivematica/src/MCPServer/lib/taskStandard.py", line 97, in check_request_status
    self.linkTaskManager.taskCompletedCallBackFunction(self)
  File "/vagrant/src/archivematica/src/MCPServer/lib/linkTaskManagerFiles.py", line 144, in taskCompletedCallBackFunction
    databaseFunctions.logTaskCompletedSQL(task)
  File "/usr/lib/archivematica/archivematicaCommon/databaseFunctions.py", line 224, in logTaskCompletedSQL
    task.save()
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/base.py", line 589, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/base.py", line 617, in save_base
    updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/base.py", line 679, in _save_table
    forced_update)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/base.py", line 723, in _do_update
    return filtered._update(values) > 0
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 600, in _update
    return query.get_compiler(self.db).execute_sql(CURSOR)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 1005, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 787, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/utils.py", line 65, in execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/mysql/base.py", line 129, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 228, in execute
    if not self._defer_warnings: self._warning_check()
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 127, in _warning_check
    warn(w[-1], self.Warning, 3)
Warning: Invalid utf8 character string: 'F59D3F'

#9 Updated by Jesús García Crespo almost 6 years ago

  • Assignee changed from Jesús García Crespo to Holly Becker

This solution is working for me but I don't now if it's the right thing to do.

diff --git a/src/archivematicaCommon/lib/databaseFunctions.py b/src/archivematicaCommon/lib/databaseFunctions.py
index 61790a6..c7ef0aa 100755
--- a/src/archivematicaCommon/lib/databaseFunctions.py
+++ b/src/archivematicaCommon/lib/databaseFunctions.py
@@ -219,8 +219,8 @@ def logTaskCompletedSQL(task):
     task = Task.objects.get(taskuuid=taskUUID)
     task.endtime = getUTCDate()
     task.exitcode = exitCode
-    task.stdout = stdOut
-    task.stderror = stdError
+    task.stdout = deUnicode(stdOut, errors='ignore')
+    task.stderror = deUnicode(stdOut, errors='ignore')
     task.save()

 def logJobCreatedSQL(job):
@@ -314,7 +314,7 @@ def getAccessionNumberFromTransfer(UUID):
         raise ValueError("No Transfer found for UUID: {}".format(UUID))

-def deUnicode(str):
+def deUnicode(str, errors=None):
     """ 
     Convert a unicode string into an str by encoding it using UTF-8.

@@ -323,4 +323,7 @@ def deUnicode(str):
     """ 
     if str == None:
         return None
-    return unicode(str).encode('utf-8')
+    kwargs = {}
+    if errors is not None:
+        kwargs['errors'] = errors
+    return unicode(str, **kwargs).encode('utf-8')

Thoughts?

#10 Updated by Holly Becker over 5 years ago

  • Status changed from New to Feedback
  • Assignee changed from Holly Becker to Jesús García Crespo

Encoding the stdout/stderr looks like a good idea. However, I'd suggest using archivematicaFunctions.unicodetoStr instead of deUnicode. It looks like deUnicode is actually buggy (unicode(str_with_encoded_unicode) fails), so replacing it entirely with unicodeToStr updated to handle errors would be great.

#11 Updated by Jesús García Crespo over 5 years ago

  • Status changed from Feedback to In progress

I'll try that, thanks.

#12 Updated by Nick Wilkinson over 5 years ago

  • Assignee changed from Jesús García Crespo to Holly Becker

Hey Holly, reassigning to you but let me know if you think this should stay with Jesús based on progress to date.

#14 Updated by Holly Becker over 5 years ago

  • Status changed from In progress to QA/Review
  • Assignee changed from Holly Becker to Nick Wilkinson
  • Pull Request set to https://github.com/artefactual/archivematica/pull/497

This is the same bug as #9967, where the log output needs to handle (or ignore) problematic characters. This should be fixed with AMPR https://github.com/artefactual/archivematica/pull/497

Does it need to be backported to stable/1.5.x for hosted/support clients?

#15 Updated by Nick Wilkinson over 5 years ago

  • Assignee changed from Nick Wilkinson to Sara Allain

Hi Sara, do you have advice for Holly about her question in note # 14?

#16 Updated by Sara Allain over 5 years ago

  • Assignee changed from Sara Allain to Holly Becker

If this issue sometimes causes the MCP server to crash during transfer (as described in #9967) then it should probably be backported. Some of the Archivematica hosted clients haven't been upgraded yet, so they'd benefit from backporting, assuming it's not a huge amount of work.

I discussed quickly with Justin and he suggested backporting to stable/1.5.x for now - we can cut a 1.5.2 branch if the 1.6 release takes too long.

#17 Updated by Holly Becker over 5 years ago

  • Assignee changed from Holly Becker to Nick Wilkinson

AMPR 497 merged and backported to stable/1.5.x

#18 Updated by Nick Wilkinson over 5 years ago

  • Assignee changed from Nick Wilkinson to Sara Allain

#19 Updated by Sara Allain over 5 years ago

  • Assignee changed from Sara Allain to Sarah Romkey

Sarah, passing to you for testing when you get a chance!

Also available in: Atom PDF