Improve memory usage of second-time index

From the main commit:

tracker-miner-fs: Change second-time index behavior

Traditionally, we've performed second-time indexing (i.e. restarting the
miner on an already populated database and having it catch up with changes)
in two distinct phases for each indexed root:
  - A query phase, where information on all indexed files is extracted
    from the database.
  - A crawling phase, where information on all indexable files on disk
    is extracted.

Only after both phases were the two sets of data confronted to decide
whether and how each file changed. While this works to detect filesystem
changes and is easy to follow, it puts significant memory pressure on very
large filesystems, since at some point in time we cache this information
for every file in the indexed root being checked.
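Schematically, the old approach amounted to building two complete in-memory
sets and diffing them only at the end. A minimal sketch with hypothetical
helpers (crawl_root_into(), diff_and_schedule()), not the actual
tracker-miner-fs code:

```c
#include <gio/gio.h>
#include <libtracker-sparql/tracker-sparql.h>

/* Hypothetical helpers, named only for illustration. */
static void crawl_root_into   (GFile *root, GHashTable *on_disk);
static void diff_and_schedule (GHashTable *indexed, GHashTable *on_disk);

static void
old_two_phase_check (TrackerSparqlCursor *cursor, GFile *root)
{
  /* Both tables hold an entry per file: this is what grew with the
   * size of the indexed root and caused the memory pressure. */
  g_autoptr (GHashTable) indexed =
    g_hash_table_new_full (g_str_hash, g_str_equal, g_free, g_free);
  g_autoptr (GHashTable) on_disk =
    g_hash_table_new_full (g_str_hash, g_str_equal, g_free, g_free);

  /* Phase 1: query info for every indexed file from the database. */
  while (tracker_sparql_cursor_next (cursor, NULL, NULL))
    {
      g_hash_table_insert (indexed,
                           g_strdup (tracker_sparql_cursor_get_string (cursor, 0, NULL)),
                           g_strdup (tracker_sparql_cursor_get_string (cursor, 1, NULL)));
    }

  /* Phase 2: crawl the whole root, collecting info for every file on disk. */
  crawl_root_into (root, on_disk);

  /* Only now are both sides confronted and changes scheduled. */
  diff_and_schedule (indexed, on_disk);
}
```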

Change the approach to up-to-date checks so that file info is queried
from disk individually, on the fly, as the database cursor is iterated.
This has two advantages:
  - Querying filesystem info from well-known file paths yields better
    performance than enumerating all directories for their content.
  - Since most files are usually unchanged on a second-index start
    (e.g. the user just logged in), we can discard all those files while
    we iterate the cursor. This results in completely flat memory usage
    in the ideal case that no files need reindexing.

Of course, we still have to handle possible filesystem changes at this
stage. During these checks it is easy to spot files and directories that
were updated or deleted, so those are handled in place.
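A rough sketch of what the per-row check looks like, assuming GIO for the
on-disk queries; the cursor layout (URI plus stored mtime per row) and the
schedule_delete()/schedule_update() helpers are illustrative, not the
actual implementation:

```c
#include <gio/gio.h>
#include <libtracker-sparql/tracker-sparql.h>

/* Hypothetical helpers standing in for the miner's real scheduling. */
static void schedule_delete (const gchar *uri);
static void schedule_update (GFile *file);

static void
check_indexed_files (TrackerSparqlCursor *cursor)
{
  while (tracker_sparql_cursor_next (cursor, NULL, NULL))
    {
      const gchar *uri = tracker_sparql_cursor_get_string (cursor, 0, NULL);
      gint64 stored_mtime = tracker_sparql_cursor_get_integer (cursor, 1);
      g_autoptr (GFile) file = g_file_new_for_uri (uri);
      g_autoptr (GFileInfo) info = NULL;

      /* Stat the file right away instead of caching it for later. */
      info = g_file_query_info (file,
                                G_FILE_ATTRIBUTE_TIME_MODIFIED,
                                G_FILE_QUERY_INFO_NOFOLLOW_SYMLINKS,
                                NULL, NULL);

      if (!info)
        schedule_delete (uri);   /* gone from disk, drop the indexed data */
      else if (g_file_info_get_attribute_uint64 (info, G_FILE_ATTRIBUTE_TIME_MODIFIED)
               != (guint64) stored_mtime)
        schedule_update (file);  /* mtime differs, re-extract this file */

      /* Unchanged files (the common case) are simply skipped, so nothing
       * accumulates while the cursor advances: memory stays flat. */
    }
}
```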

Detecting newly created files and folders relies on the filesystem layer
implicitly updating the parent folder's mtime whenever the directory
structure changes. We detect those changed folders, and schedule crawling
and deeper checks for the content of these changed folders exclusively,
as opposed to the whole indexed folder structure.
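New entries cannot show up in the database cursor itself, so this part
leans on the parent directory's mtime changing whenever entries are added
or removed. Roughly, and again with a hypothetical schedule_crawl() helper:

```c
#include <gio/gio.h>

/* Hypothetical helper queueing a single directory for crawling. */
static void schedule_crawl (GFile *dir);

/* A directory whose on-disk mtime no longer matches the stored one may
 * have gained or lost children, so only that directory is queued for
 * crawling and deeper checks, not the whole indexed root. */
static void
maybe_schedule_crawl (GFile *dir, gint64 stored_mtime)
{
  g_autoptr (GFileInfo) info = NULL;

  info = g_file_query_info (dir,
                            G_FILE_ATTRIBUTE_TIME_MODIFIED,
                            G_FILE_QUERY_INFO_NOFOLLOW_SYMLINKS,
                            NULL, NULL);

  if (info &&
      g_file_info_get_attribute_uint64 (info, G_FILE_ATTRIBUTE_TIME_MODIFIED)
      != (guint64) stored_mtime)
    schedule_crawl (dir);
}
```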

This MR aimed to reduce memory usage during second-time indexing on massive filesystems, but it also turned out to be faster, while keeping a low memory footprint in the most likely cases (finding no changes on restart, and finding mass changes, e.g. due to tracker:extractorHash changes). With half a million files currently indexed here, it reduced memory used at startup from a 400 MB peak to a rather flat ~30 MB, and the time spent on the initial checks from ~3 min to ~40 seconds.
