Allow better tailoring of text boundaries
Submitted by Behdad Esfahbod
Link to original bug (#530427)
Description
UAX#14 and UAX#29 allow for per-language tailoring. One easy way to cover the most common tailoring needs is to let the tailoring algorithm modify the boundary type classification of a character. If we add such an API to the lang engines, tailoring can be implemented much more easily than with the current script_break(). It's quite hard to write a correct script_break() function without breaking some rule. For example, there should be a grapheme boundary after every Control character, so a tailoring function that wants to disallow a boundary before a certain character has to check the previous character to make sure it's not a Control...
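To illustrate the Control problem, here is a minimal sketch of what a script_break()-style tailoring ends up doing today. Everything in it is hypothetical and independent of Pango: the boundary array, suppress_boundary_before(), and is_control_approx(), which only approximates the UAX#29 Control class by checking the C0/C1 control codes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Rough approximation of the UAX#29 Control class (assumption: only the
 * C0/C1 control codes are covered; the real class is larger). */
static bool
is_control_approx (uint32_t wc)
{
  return wc < 0x20 || (wc >= 0x7F && wc <= 0x9F);
}

/* boundary_before[i] == true means "grapheme boundary before text[i]".
 * The tailoring wants to forbid a boundary before `target`, but it must
 * leave the boundary alone when the previous character is a Control,
 * because a boundary is required after Controls. */
static void
suppress_boundary_before (const uint32_t *text,
                          size_t          len,
                          bool           *boundary_before,
                          uint32_t        target)
{
  for (size_t i = 1; i < len; i++)
    {
      if (text[i] == target && !is_control_approx (text[i - 1]))
        boundary_before[i] = false;
    }
}
```

Every tailoring written this way has to remember the extra previous-character check; forgetting it silently breaks the default rules.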
It's also hard to ensure consistency after tailoring in script_break(). That's why I also filed bug 529747.
All this will be easier if we add callbacks like:
PangoWordBreakType get_word_break_type (gunichar wc, PangoWordBreakType default_type);
The implementation can use pango_word_break_type_for_unichar() to get the default type for some other character if need be, or use any of the other UCD getters to categorize wc.
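A rough sketch of how such a callback could look from the engine side. None of this is existing Pango API: WordBreakType and unichar_t are stand-ins for the proposed PangoWordBreakType and for gunichar, default_word_break_type() is a toy ASCII-only substitute for the proposed pango_word_break_type_for_unichar(), and the hyphen tailoring is purely illustrative.

```c
#include <stdint.h>

/* Hypothetical stand-ins, not real Pango types. */
typedef uint32_t unichar_t;   /* plays the role of gunichar */

typedef enum {
  WORD_BREAK_OTHER,
  WORD_BREAK_ALETTER,
  WORD_BREAK_MIDLETTER,
  WORD_BREAK_NUMERIC
} WordBreakType;

/* Toy default classifier (assumption: ASCII letters, digits and ':' only;
 * a real one would be driven by the UCD data). */
static WordBreakType
default_word_break_type (unichar_t wc)
{
  if ((wc >= 'A' && wc <= 'Z') || (wc >= 'a' && wc <= 'z'))
    return WORD_BREAK_ALETTER;
  if (wc >= '0' && wc <= '9')
    return WORD_BREAK_NUMERIC;
  if (wc == ':')
    return WORD_BREAK_MIDLETTER;
  return WORD_BREAK_OTHER;
}

/* Example engine callback with the proposed shape: reclassify
 * HYPHEN-MINUS as MidLetter and keep the default for everything else. */
static WordBreakType
example_get_word_break_type (unichar_t wc, WordBreakType default_type)
{
  if (wc == 0x002D)
    return WORD_BREAK_MIDLETTER;
  return default_type;
}

/* What the core would do per character: compute the default type, then
 * let the engine override it before the generic rules run. */
static WordBreakType
tailored_word_break_type (unichar_t wc)
{
  return example_get_word_break_type (wc, default_word_break_type (wc));
}
```

The engine only changes a classification; the boundary rules themselves still run in one place, so the tailoring cannot accidentally violate them.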
All the new enums and their getters will be added engine-only, at least until some project shows interest in using them, as the types may change in the future.
This has some API implications: computing log_attrs correctly would now require full itemization results, so my previous API proposal in bug 462634 wouldn't be enough.