Automatic scan cropping / rotating / deskewing
When scanning, we get an image. This image has to be cropped, rotated and deskewed.
To crop this image to the paper size, most software asks the user for the expected size of the paper (A4, Letter, etc.). This question is annoying for many reasons:
- When the paper is not A4/Letter, I usually have no idea what size it is and I don't care (example: French tax forms are slightly bigger than A4).
- On some scanners, the page is on the left side of the image (usually with flatbeds). On others, the page is centered (usually with feeders). On Windows, the image is rotated by Libinsane for technical reasons, so the page may appear on the right side.
- Paperwork uses Tesseract to guess the orientation of the page. To be fully consistent, users shouldn't have to worry about using the correct corner of the flatbed either. Actually, they shouldn't even have to bother putting the paper straight on the flatbed.
- On some scanners, you can't put the paper right up against the flatbed border, or it will be slightly cropped in the image (Epson XP-425 for instance). Users have to put the paper a few millimeters off the borders, so the paper cannot be placed precisely on the flatbed.
- Some scanners are really imprecise whatever you do, for example portable scanners like the Brother DS-620.
Currently in Paperwork, we work around this problem by making a calibration scan. This is annoying too:
- It still requires the user to be careful about how they place the paper on the flatbed or in the feeder.
- If they scan something that doesn't match the calibration, they have to crop the page again.
- With imprecise scanners, you have to make the calibration slightly larger and pretty much always crop after each scan.
- No deskewing is implemented at this time.
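Since no deskewing exists yet, here is a minimal sketch of what one step of it could look like. It is not Paperwork or Libpillowfight code. It assumes we already have a few (x, y) points sampled along the top edge of the page, which is the hard part, and it only fits a straight line through them to get a skew angle. The function name `estimate_skew_degrees` and the input format are hypothetical.

```python
import math


def estimate_skew_degrees(edge_points):
    """Estimate the skew of a page from points along its top edge.

    `edge_points` is a list of (x, y) pixel coordinates (hypothetical
    input format). A line y = a*x + b is fitted by least squares and
    the angle of that line relative to the x axis is returned, in
    degrees.
    """
    n = len(edge_points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in edge_points) / n
    mean_y = sum(y for _, y in edge_points) / n
    # Sums of squares and cross-products around the mean
    sxx = sum((x - mean_x) ** 2 for x, _ in edge_points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in edge_points)
    if sxx == 0:
        raise ValueError("points must not all share the same x coordinate")
    # The fitted slope is sxy / sxx; atan2 turns it into an angle
    return math.degrees(math.atan2(sxy, sxx))


if __name__ == "__main__":
    # Top edge drifting down by 5 pixels over 500 pixels
    print(estimate_skew_degrees([(0, 10), (250, 12.5), (500, 15)]))
```

Rotating the image by the opposite of this angle would straighten the page. Finding reliable edge points in the first place depends on the scanner.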
An attempt at automatic page border detection has been made in Libpillowfight. Unfortunately the success rate is so low (less than 50%) that I didn't even try to measure it precisely. It shows this problem is far from easy, and the main difficulty is scanner diversity:
- Some scanners return an image in which even the dust is visible (Ex: Epson XP-425)
- Some scanners clean the image so much that the page borders are close to invisible (Ex: Brother MFC-7360N)
- Some return a strangely long image with an annoying "2 steps" background (Ex: Fujitsu fi-5110Cdj)
- Some return a black background instead of a white one (Ex: Brother DS-620)
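To make the list above concrete, here is a deliberately naive border detector. It is not the Libpillowfight algorithm, only an illustrative sketch. It guesses the background level from the four image corners (so it copes with both white and black backgrounds) and returns the bounding box of everything that differs from it. The function name `detect_page_box`, the nested-list image format and the `threshold` default are all assumptions made for the example.

```python
import statistics


def detect_page_box(pixels, threshold=40):
    """Return (left, top, right, bottom) of what stands out from the background.

    `pixels` is a list of rows of grayscale values (0-255).
    The background level is guessed as the median of the four corner
    pixels. `right` and `bottom` are exclusive. Returns None when no
    pixel differs from the background by more than `threshold`.
    """
    height = len(pixels)
    width = len(pixels[0])
    corners = [
        pixels[0][0], pixels[0][width - 1],
        pixels[height - 1][0], pixels[height - 1][width - 1],
    ]
    background = statistics.median(corners)

    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, value in enumerate(row):
            if abs(value - background) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


if __name__ == "__main__":
    # Black background (0) with a light page (200) on rows 1-3, columns 2-5
    image = [[0] * 8 for _ in range(6)]
    for y in range(1, 4):
        for x in range(2, 6):
            image[y][x] = 200
    print(detect_page_box(image))
```

Even this toy version shows the failure modes listed above. A single speck of dust outside the page enlarges the box. A page that is almost the same shade as the background yields None. A page touching a corner breaks the background guess.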
When trying to clean the images, Unpaper's algorithms give poor results on some of them. They probably require fine-tuning the settings for those images specifically, but that's not something we can afford to ask of the user.
Since Paperwork 2.0, there is a reset feature. This feature lets the user revert pages to their original state (before any image processing) so they can fix anything that went wrong. Thanks to this feature, we can now consider trying more daring image processing techniques.