Add the ability for AtoM to check and fix basic CSV issues on import (line endings and character encoding)
Currently, importing an improperly encoded CSV (such as one created in Microsoft Excel) can cause a range of problems. The worst of these is that AtoM misinterprets the line endings in the Excel-generated file and creates hundreds of blank descriptions in the system, as reported by many users in the forum.
AtoM already has a command-line task called csv:check-import that outputs basic information about the contents of a CSV for import, but it does not check character encoding, line endings, separator characters, or general conformance to the CSV specification.
This proposal, drawn from a User Forum post of 2018-12-13, would add the ability for AtoM to check, and ideally repair, these types of issues during import. From the forum:
Looking at this issue and the associated documentation makes me wonder if there should be some feature requests open in atom to actually handle these problems.
- atom to handle the line endings (the extra line feed could be stripped by unix2dos or a php regex)
- atom to handle the encoding (php appears to have http://php.net/manual/en/function.utf8-encode.php or http://php.net/manual/en/function.mb-convert-encoding.php which may be useful here?)
- csv:check-import to report on these issues (though in fairness it does this indirectly already - if the reported rows isn’t what was expected it implies a problem).
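The two repairs suggested in the forum post (line endings and encoding) can be sketched roughly as follows. This is an illustration only, written in Python for brevity; AtoM itself is a PHP application, so the function name `normalize_csv_bytes` and the `source_encoding` parameter are hypothetical. It assumes the target is UTF-8 with Unix (LF) line endings, and that a file which is not valid UTF-8 was saved in Windows-1252, which is a guess that will not hold for every file.

```python
def normalize_csv_bytes(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Return the CSV content as UTF-8 bytes with LF line endings.

    Hypothetical helper: the fallback encoding is an assumption, not
    something the file itself can tell us reliably.
    """
    try:
        # "utf-8-sig" also drops a leading byte order mark if one is present.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the assumed source encoding.
        text = raw.decode(source_encoding, errors="replace")

    # Convert CRLF first, then any remaining bare CR, to LF.
    # Note: this also rewrites line breaks embedded in quoted fields.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")
```

For example, `normalize_csv_bytes(b"title\r\ncaf\xe9\r")` would return the UTF-8 bytes for `title`, a line feed, `café`, and a line feed.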
Building on this idea, I would propose the following:
First, that reporting about character encoding, line endings, and separator characters used in the CSV be added to the output of the check-import task.
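As a rough idea of what the extra report fields might contain, here is a minimal sketch in Python (used for illustration only; the function name `describe_csv` and the shape of the returned dictionary are hypothetical, not part of the existing task). The separator guess is a simple heuristic that counts candidate characters in the header row, so it can be fooled by separators inside quoted header values.

```python
def describe_csv(raw: bytes) -> dict:
    """Summarise encoding, line endings and likely separator of a CSV."""
    try:
        text = raw.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails;
        # it is used here only so the header row can still be inspected.
        text = raw.decode("latin-1")
        is_utf8 = False

    # Count each line-ending style; CRLF pairs are excluded from the
    # bare CR and bare LF counts.
    crlf = raw.count(b"\r\n")
    line_endings = {
        "crlf": crlf,
        "cr": raw.count(b"\r") - crlf,
        "lf": raw.count(b"\n") - crlf,
    }

    # Heuristic: the candidate that occurs most often in the header row.
    lines = text.splitlines()
    header = lines[0] if lines else ""
    counts = {char: header.count(char) for char in ",;\t|"}
    best = max(counts, key=counts.get)
    separator = best if counts[best] > 0 else None

    return {"utf8": is_utf8, "line_endings": line_endings, "separator": separator}
```

A file saved with Windows line endings and semicolons would be reported with a non-zero `crlf` count and `";"` as the separator, which is the kind of mismatch a user would want to see before importing.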
Second, that the csv:check-import task include a --fix option that attempts to clean up unexpected line endings and character encodings, and possibly converts other separator characters to the one AtoM expects (i.e. commas). I would suggest that, when run, the task writes the converted CSV to the same directory as the original file, with "_fixed" appended to the filename. Users could then choose to re-run the check task against the fixed version to see updated output and confirm that the conversion was successful, or proceed directly with the import.
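The naming convention and the separator conversion described above might look something like this. It is a sketch in Python for illustration only; `fixed_path` and `convert_separator` are hypothetical names, and the example assumes that "_fixed" goes before the file extension, which the proposal does not spell out.

```python
import csv
import io
from pathlib import Path


def fixed_path(original: str) -> Path:
    """Return a sibling path with "_fixed" added before the extension."""
    path = Path(original)
    return path.with_name(f"{path.stem}_fixed{path.suffix}")


def convert_separator(text: str, separator: str) -> str:
    """Re-write CSV text that uses `separator` as comma-separated text.

    Parsing with the csv module (rather than a plain string replace)
    keeps separators inside quoted fields intact and adds quoting where
    a value contains a comma.
    """
    rows = csv.reader(io.StringIO(text, newline=""), delimiter=separator)
    output = io.StringIO(newline="")
    csv.writer(output, lineterminator="\n").writerows(rows)
    return output.getvalue()
```

With these helpers, `fixed_path("/tmp/import/descriptions.csv")` gives `/tmp/import/descriptions_fixed.csv`, and the original file is left untouched so the user can compare the two.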
Third, that checks for character encoding and line endings be incorporated into the CSV import task. If the CSV does not meet AtoM's current expectations for these, then by default the import is halted and an error message describing the issue (e.g. "CSV is not UTF-8 encoded") is shown.
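A pre-import check of this kind could be as small as the following sketch (Python, for illustration only). The exception class `CsvCheckError`, the function `assert_importable`, and the second error message are hypothetical; the sketch also assumes that the expected format is UTF-8 with Unix (LF) line endings, and it treats any carriage return as a problem, including one inside a quoted field.

```python
class CsvCheckError(Exception):
    """Raised when a CSV does not meet the assumed import expectations."""


def assert_importable(raw: bytes) -> None:
    """Raise CsvCheckError instead of letting a bad file reach the import."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise CsvCheckError("CSV is not UTF-8 encoded") from exc

    # Assumption: only LF line endings are acceptable.
    if b"\r" in raw:
        raise CsvCheckError("CSV contains CR or CRLF line endings")
```

The import task would call this before creating any records, so a failed check leaves the database unchanged.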
Fourth, that a --fix option be included in the csv:import task. When used, the user is first shown a warning and asked whether they have made a backup (y or n must be entered). We could possibly allow this prompt to be skipped if a --force option is also included, for scripting purposes, but only from the CLI. When yes is selected, AtoM will attempt to fix any encoding, line ending, or separator issues before proceeding with the import.
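The confirmation step might behave like the sketch below (Python, for illustration only). The function name `confirm_fix`, its parameters, and the prompt wording are hypothetical; the `ask` parameter exists only so the prompt can be replaced in scripts or tests.

```python
def confirm_fix(force: bool = False, ask=input) -> bool:
    """Return True if the fix may proceed.

    With force=True (the proposed --force option) the prompt is skipped.
    Otherwise the question is repeated until "y" or "n" is entered.
    """
    if force:
        return True
    while True:
        answer = ask("Have you made a backup first? (y/n) ").strip().lower()
        if answer in ("y", "n"):
            return answer == "y"
```

Answering "n" returns False, in which case the task would stop without modifying the CSV or the database.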
In the user interface, this could be a checkbox available to administrators, labelled "Fix CSV issues on import", that is selected when configuring the CSV import. When checked, it could immediately trigger a warning modal encouraging users to make sure they have a backup first. Note: the idea of allowing administrators to automatically generate a SQL database dump on import, store it temporarily, and possibly even load it automatically if the import fails has also been discussed, and would pair well with this feature.
This would likely solve a lot of support issues for AtoM users unfamiliar with the complexities of CSV encoding and line endings.