Improve Elasticsearch indexing process
Category: Performance / scalability
Target version: Release 2.6.0
Also, after testing the search populate task with a very large database, the process died complaining about memory usage (over 4 GB). We should try to avoid the ORM in places like:
Other possible enhancements:
- Avoid stdClass where possible.
- Use PDO::FETCH_ASSOC to fetch results.
- Pass only the resource id (and type) to update (moved to #13272).
- Clear classes cache periodically during the process (may not be needed after avoiding the ORM).
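The first two enhancements above (plain rows instead of objects, fetched as associative arrays) can be illustrated with a minimal sketch. It is not the project's code: Python's `sqlite3` stands in for PHP's PDO, the `dict_row` factory plays the role of `PDO::FETCH_ASSOC`, and the `information_object` table with its columns is a hypothetical schema used only for illustration.

```python
import sqlite3


def dict_row(cursor, row):
    # Map column names to values, similar in spirit to PDO::FETCH_ASSOC.
    return {col[0]: value for col, value in zip(cursor.description, row)}


def make_demo_db():
    # Hypothetical table and data, only here to make the sketch runnable.
    conn = sqlite3.connect(":memory:")
    conn.row_factory = dict_row
    conn.executescript(
        """
        CREATE TABLE information_object (id INTEGER PRIMARY KEY, slug TEXT, title TEXT);
        INSERT INTO information_object VALUES
            (1, 'fonds-a', 'Fonds A'),
            (2, 'series-b', 'Series B'),
            (3, 'file-c', 'File C');
        """
    )
    return conn


def iter_rows(conn, sql, params=(), batch_size=1000):
    # Stream plain dict rows in batches so the whole result set is never
    # held in memory at once and no ORM objects are instantiated.
    cur = conn.execute(sql, params)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield from batch


def build_documents(conn, batch_size=2):
    # Turn each row into the document that would be sent to the search index.
    sql = "SELECT id, slug, title FROM information_object ORDER BY id"
    return [
        {"_id": row["id"], "slug": row["slug"], "title": row["title"]}
        for row in iter_rows(conn, sql, batch_size=batch_size)
    ]
```

Calling `build_documents(make_demo_db())` yields one plain dictionary per row. In a real run the documents would be sent to the index in bulk rather than collected in a list.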
#10 Updated by José Raddaoui Marín 5 months ago
- Status changed from Code Review to QA/Review
- Assignee deleted (José Raddaoui Marín)
I have tested this with a considerably large dataset (~1M IOs) and the indexing time went from over 8 days to 6 hours, which is at least 32 times faster. The four most time-consuming queries (over 90% of the total query time) have been completely removed from the process, and the next one has been reduced to approximately a third. CPU utilization stayed the same with these changes, while the average disk I/O activity increased from ~500 kB/s to ~2.30 MB/s during the process, as did the average number of queries per second, from 275 to 3K.
There are many things to test to verify that these improvements didn't change the indexed data. #13291 added a new task to easily get the indexed document by a resource's slug, which may be useful for these tests. The full indexing process and the single resource index differ a little more after these changes, so I'd suggest checking the resulting document in both cases: when resources are indexed in a full search:populate run and when they are updated (either saved from the GUI or indexed with the slug option of the search:populate task). In both cases, the metadata most likely to be affected is:
All index types:
- places, subjects and genres (direct and not direct)
- places and subjects (direct and not direct)
- all fields
Please let me know if you need more information.
#12 Updated by José Raddaoui Marín 5 months ago
Good catch Dan ;)
The full index uses an in-memory term -> parent lookup to reduce the DB hits, while the single index uses a CTE (common table expression) to get those terms and their ancestors without loading all terms in memory. As you noted, the order doesn't really matter at the moment, so that was an extra optimization we could make.
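The two strategies can be compared side by side in a short sketch. This is an illustration under stated assumptions, not the actual implementation: SQLite stands in for the project's database, and the `term` table with `id` and `parent_id` columns, the sample data and the function names are all hypothetical.

```python
import sqlite3


def make_term_db():
    # Hypothetical term hierarchy: World > Europe > Spain, World > Asia.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE term (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
        INSERT INTO term VALUES
            (1, NULL, 'World'),
            (2, 1, 'Europe'),
            (3, 2, 'Spain'),
            (4, 1, 'Asia');
        """
    )
    return conn


def load_parent_lookup(conn):
    # Full index: one query loads every term -> parent pair into memory.
    return dict(conn.execute("SELECT id, parent_id FROM term"))


def ancestors_from_lookup(parents, term_id):
    # Walk up the hierarchy using only the in-memory map (no DB hits).
    result = []
    current = parents.get(term_id)
    while current is not None:
        result.append(current)
        current = parents.get(current)
    return result


ANCESTORS_CTE = """
WITH RECURSIVE ancestors(id, parent_id, depth) AS (
    SELECT id, parent_id, 0 FROM term WHERE id = ?
    UNION ALL
    SELECT t.id, t.parent_id, a.depth + 1
    FROM term t JOIN ancestors a ON t.id = a.parent_id
)
SELECT id FROM ancestors WHERE depth > 0 ORDER BY depth
"""


def ancestors_from_cte(conn, term_id):
    # Single index: one recursive query per term, nothing preloaded.
    return [row[0] for row in conn.execute(ANCESTORS_CTE, (term_id,))]
```

The lookup costs one query up front plus memory proportional to the number of terms, which pays off when every resource is indexed. The CTE costs one query per term but needs no preloaded state, which suits a single-resource update. Since the order of the ancestors does not matter for indexing, the two results can be compared as sets, and the `ORDER BY` in the CTE could be dropped.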