WIP: Faster dirdiff

Hugo Sena Ribeiro requested to merge hugosenari/meld:faster-dirdiff into master

Before anything else: this branch is not ready to merge yet. It is only here to share my experiments and some insights for future improvements, and I didn't know where else to leave them.

This branch's story:

Running dirdiff mmap, I noticed that meld runs slightly faster on larger dir trees, but nothing amazing.

[Profile screenshot: master branch]

Thinking that part of the problem could be the number of os.stat calls (6x2xN), I created list_dirs, a generator similar to os.walk that yields all the info we need about each file, reducing the os.stat calls to ~2xN.

Replacing search_recursively_iter with seedling (which does the same thing but uses list_dirs) makes things run faster, but again not by much: we are still stuck on the time spent in gtk operations.

To fix this, I first replaced all the set_value calls with a single call to append, and it solved... nothing. It still took too long.

Then I discovered that gtk was wrapping python objects in GObjects, and that took too much time.

The solution: use plain python objects and call insert_with_values instead of append.

Fortunately, most of our data is simple (str, int, double) or already a GObject, so gtk can deal with it without wrappers.

With this we can insert rows 7 times faster!!!

[Profile screenshot: using list_dirs and insert_with_values (no wrapping)]

Now the only problem left for larger lists of files was the time it takes to read the file contents.

My thinking here is that if someone compares two directories, they must be 'almost' equal in most cases. With this in mind, when listing a dir I only test whether the file exists on both sides and has the same size. This lets us give basic feedback about file differences.

The real diff then runs in a pool of processes. This way we can check files and keep the dir tree thread going at the same time, without caring too much about the GIL.

[Profile screenshot: with a Pool]

Here is the cProfile data, which you can visualize with snakeviz (or any other tool).

TODO:

  • Figure out how to break it into simpler, smaller merge requests;
  • Make file actions work;
  • Remove old/broken code and clean up;
  • Add back the 'ignore link' feature;
  • Add the 'ignore' missing dirs content feature #126.

I hope I'm a better python writer than English writer. Sorry.

Edited by Hugo Sena Ribeiro
