pango_get_log_attrs() inconsistently sets LogAttr.is_word_start and LogAttr.is_word_end in Japanese text
pango_get_log_attrs() sets LogAttr.is_word_end at the boundaries between the Hiragana, Katakana, and Kanji scripts, but it sets LogAttr.is_word_start without taking those boundaries into account.
As a result, the number of word starts does not match the number of word ends,
which causes bugs like gtk#5481 (closed).
LogAttr.is_word_start should be set by the same rule as LogAttr.is_word_end, because that is more natural: every word that has an end would then also have a start.
Steps to reproduce
- Compile japanese-word.c (listed below) as follows, then run ./japanese-word
gcc -std=c99 -Wall -Wextra -o japanese-word japanese-word.c \
    $(pkg-config --cflags --libs pango)
Current behavior
- LogAttr.is_word_end is set before whitespace and before a script change
- LogAttr.is_word_start is set after whitespace but not after a script change
1.50.12
start: /この固有ベクトル
end : この/固有/ベクトル/
start: /あア亜あア亜 /あア亜
end : あ/ア/亜/あ/ア/亜/ あ/ア/亜/
start: /Lorem /ipsum /dolor /sit /amet,
end : Lorem/ ipsum/ dolor/ sit/ amet/,
Expected behavior
- LogAttr.is_word_end is set before whitespace and before a script change
- LogAttr.is_word_start is set after whitespace and after a script change
1.50.12
start: /この/固有/ベクトル
end : この/固有/ベクトル/
start: /あ/ア/亜/あ/ア/亜 /あ/ア/亜
end : あ/ア/亜/あ/ア/亜/ あ/ア/亜/
start: /Lorem /ipsum /dolor /sit /amet,
end : Lorem/ ipsum/ dolor/ sit/ amet/,
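The rule described under "Expected behavior" can be modelled without Pango. The following is only a rough sketch of that rule, not Pango's implementation: the helpers decode_utf8, classify, and mark_word_starts are hypothetical names, script detection is approximated by three Unicode block ranges (Hiragana, Katakana, CJK Unified Ideographs), and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy script classes; a stand-in for real Unicode script lookup. */
enum script { SPACE, OTHER, HIRAGANA, KATAKANA, HAN };

/* Decode one code point from valid UTF-8; returns its length in bytes. */
static size_t decode_utf8(const char *s, unsigned long *cp) {
    const unsigned char *u = (const unsigned char *) s;
    if (u[0] < 0x80) { *cp = u[0]; return 1; }
    if ((u[0] & 0xE0) == 0xC0) {
        *cp = ((u[0] & 0x1FUL) << 6) | (u[1] & 0x3FUL);
        return 2;
    }
    if ((u[0] & 0xF0) == 0xE0) {
        *cp = ((u[0] & 0x0FUL) << 12) | ((u[1] & 0x3FUL) << 6) | (u[2] & 0x3FUL);
        return 3;
    }
    *cp = ((u[0] & 0x07UL) << 18) | ((u[1] & 0x3FUL) << 12)
        | ((u[2] & 0x3FUL) << 6) | (u[3] & 0x3FUL);
    return 4;
}

/* Approximate the script by Unicode block (an assumption of this sketch). */
static enum script classify(unsigned long cp) {
    if (cp == ' ' || cp == '\t' || cp == '\n' || cp == 0x3000) return SPACE;
    if (cp >= 0x3040 && cp <= 0x309F) return HIRAGANA;
    if (cp >= 0x30A0 && cp <= 0x30FF) return KATAKANA;
    if (cp >= 0x4E00 && cp <= 0x9FFF) return HAN;
    return OTHER;
}

/* Copy text into out, inserting "/" wherever a word would start under the
 * expected rule: after whitespace and after a script change.
 * Returns the number of word starts; stops early if out is too small. */
int mark_word_starts(const char *text, char *out, size_t out_size) {
    enum script prev = SPACE;  /* the start of text acts like "after whitespace" */
    size_t o = 0;
    int starts = 0;

    assert(out != NULL && out_size > 0);
    while (*text != '\0') {
        unsigned long cp;
        size_t n = decode_utf8(text, &cp);
        enum script cur = classify(cp);

        if (o + n + 2 > out_size) break;  /* room for "/", the bytes, and NUL */
        if (cur != SPACE && cur != prev) {
            out[o++] = '/';
            starts++;
        }
        memcpy(out + o, text, n);
        o += n;
        text += n;
        prev = cur;
    }
    out[o] = '\0';
    return starts;
}
```

With a sufficiently large buffer buf, calling mark_word_starts("この固有ベクトル", buf, sizeof buf) writes "/この/固有/ベクトル" and returns 3, which matches the three word ends reported for the same string.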
japanese-word.c
This program prints each string twice, inserting "/" at every position where LogAttr.is_word_start (first line) or LogAttr.is_word_end (second line) is true.
#include <stdio.h>
#include <string.h>
#include <pango/pango.h>

void print_word_boundaries(PangoLanguage *lang, const char *s);

int main(void) {
    const char *s1 = "この固有ベクトル";
    const char *s2 = "あア亜あア亜 あア亜";
    const char *s3 = "Lorem ipsum dolor sit amet,";
    PangoLanguage *ja = pango_language_from_string("ja-JP");
    PangoLanguage *en = pango_language_from_string("en-US");

    // print the Pango version the results were produced with
    printf("%s\n", pango_version_string());
    print_word_boundaries(ja, s1);
    print_word_boundaries(ja, s2);
    print_word_boundaries(en, s3);
    return 0;
}

// prints the first UTF-8 character in s (s must not be empty)
void print_first_char(const char *s) {
    const char *next = g_utf8_find_next_char(s, NULL);
    // copy the bytes of one character; this also covers the last character,
    // where next points at the terminating '\0'
    for (const char *p = s; p < next; p++) {
        putchar(*p);
    }
}

void print_word_boundaries(PangoLanguage *lang, const char *s) {
    int len = (int) strlen(s);
    // one PangoLogAttr per position: N characters have N + 1 positions
    int n_attrs = (int) g_utf8_strlen(s, -1) + 1;
    PangoLogAttr *attrs = g_new0(PangoLogAttr, n_attrs);

    pango_get_log_attrs(s, len, -1, lang, attrs, n_attrs);

    // print the beginnings of words
    printf("start: ");
    int i = 0;
    const char *p = s;
    while (1) {
        if (attrs[i].is_word_start) printf("/");
        if (*p == '\0') break;
        print_first_char(p);
        p = g_utf8_find_next_char(p, NULL);
        i++;
    }
    printf("\n");

    // print the endings of words
    printf("end : ");
    i = 0;
    p = s;
    while (1) {
        if (attrs[i].is_word_end) printf("/");
        if (*p == '\0') break;
        print_first_char(p);
        p = g_utf8_find_next_char(p, NULL);
        i++;
    }
    printf("\n");

    g_free(attrs);
}