
dirdiff: Use mmap in _files_same for a speed improvement

When one file has content and the other is empty, the code tries to open both with mmap and fails on the empty file. We can't simply declare the files different in that case, because regex filters or blank-line handling can make them compare equal. Since an empty file can't be mmapped, I only use mmap for 'big' files. I'm not sure what the right size threshold is; it is currently set to CHUNK_SIZE (4096 bytes).
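The size threshold described above can be sketched roughly as follows. This is not the code from the merge request: `files_same_sketch` and its `threshold` parameter are hypothetical names, and the sketch compares raw bytes only, leaving out the regex and blank-line filters that the real `_files_same` has to apply.

```python
import mmap
import os

CHUNK_SIZE = 4096  # bytes; the threshold value mentioned in the description


def files_same_sketch(path1, path2, threshold=CHUNK_SIZE):
    """Raw byte comparison; mmap is used only when both files reach threshold."""
    sizes = (os.path.getsize(path1), os.path.getsize(path2))
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        if min(sizes) >= max(threshold, 1):
            # Neither file is empty here, so mmap.mmap() cannot fail with
            # "cannot mmap an empty file".
            with mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ) as m1, \
                    mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ) as m2:
                # Slices past the end of a map are empty, so a length
                # mismatch shows up as a differing chunk.
                for start in range(0, max(len(m1), len(m2)), CHUNK_SIZE):
                    end = start + CHUNK_SIZE
                    if m1[start:end] != m2[start:end]:
                        return False
                return True
        # Small or empty files: fall back to plain chunked reads.
        while True:
            chunk1, chunk2 = f1.read(CHUNK_SIZE), f2.read(CHUNK_SIZE)
            if chunk1 != chunk2:
                return False
            if not chunk1:
                return True
```

With this shape, an empty file never reaches the mmap branch, so the empty vs non-empty case goes through the ordinary read loop instead of raising.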

Rounds  Max size (KB)  master (s)  mmap (s)
10000   4              11.46       12.62
1000    4              1.54        1.67
1000    40             2.20        1.91
1000    400            9.81        5.21
1000    4000           91.05       47.50

Each 'round' runs the following comparisons:

  • 1 x empty vs empty file (fastest equal, won't read the files)
  • 1 x 1-byte vs 1-byte file (fast equal, reads both until the end)
  • 1 x max size vs max size (slow equal, reads both until the end)
  • 1 x empty vs 1-byte file (fast different, first chunk differs)
  • 1 x max size vs max size (fast different, first chunk differs)
  • 1 x max size vs max size (slow different, reads both until the end)
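The six cases in a round could be generated along these lines. This is a hypothetical sketch and not the attached benchmark script: `make_round_pairs` and `time_rounds` are illustrative names, `max_size` is taken in bytes and assumed to be at least 1, and the comparison function under test is passed in as `compare`.

```python
import os
import tempfile
import time


def make_round_pairs(directory, max_size):
    """Create the six file pairs of one round as (label, path_a, path_b)."""
    def write(name, data):
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(data)
        return path

    big = b"a" * max_size
    return [
        ("empty, equal", write("e1", b""), write("e2", b"")),
        ("1 byte, equal", write("o1", b"x"), write("o2", b"x")),
        ("max size, equal", write("b1", big), write("b2", big)),
        ("empty vs 1 byte", write("e3", b""), write("o3", b"x")),
        # Differs in the very first byte, so the first chunk already differs.
        ("max size, first chunk differs",
         write("f1", big), write("f2", b"b" + big[1:])),
        # Differs only in the last byte, so both files are read to the end.
        ("max size, last byte differs",
         write("l1", big), write("l2", big[:-1] + b"b")),
    ]


def time_rounds(compare, rounds, max_size):
    """Return the seconds taken to run compare() over all pairs, rounds times."""
    with tempfile.TemporaryDirectory() as directory:
        pairs = make_round_pairs(directory, max_size)
        start = time.perf_counter()
        for _ in range(rounds):
            for _label, path_a, path_b in pairs:
                compare(path_a, path_b)
        return time.perf_counter() - start
```

Calling `time_rounds` once per implementation with the same `rounds` and `max_size` gives a pair of timings comparable in spirit to the master and mmap columns, although the absolute numbers will depend on the machine and filesystem cache.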

mmap is faster when files are large, and slightly slower when files are around 4 KB.

files_same_vs_files_same.py

Part of old dirdiff-mmap merge request

Edited by Kai Willadsen
