From Cyrus

Charset Changes

Warning

This is mostly internal programming information. If you're not interested in the details, just read this bit:

Executive Summary

  • Upgrade the Unicode database from Unicode version 2 to Unicode version 5.2.
  • Provide an interface for programmers to convert between any two supported character sets (including UTF-8) with a single function call, rather than only producing search-optimised format.
  • Allow streaming searches of the output of a conversion so there's no need to buffer the conversion result - in any character set and with or without search format optimisation.
  • Add support for a handful of extra character sets.

Details

The changes include two new API functions in lib/charset.c:

  • charset_to_utf8 - takes a string, its charset and its encoding (NONE, BASE64 or QUOTED_PRINTABLE) and returns a UTF-8 string.
  • charset_search_mimeheader - takes a MIME-encoded header, converts it to UTF-8, and searches it for a pre-compiled search pattern.

mkchartable.c has been totally rewritten as mkchartable.pl, which requires no external modules but provides a much easier language for the sort of data manipulation required. The Makefile uses this preprocessor to build large state tables that convert each character set to Unicode codepoints (32 bit values), which are then processed down to individual UTF-8 characters by a separate conversion phase.
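To make the "character set to codepoints" phase concrete, here is a minimal sketch of a table-driven lookup for one single-byte character set, ISO-8859-15. It is not the table format that mkchartable.pl generates (the real tables are state tables and also cover multi-byte sets); the names latin9_overrides, latin9_to_codepoint and latin9_decode are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: the eight byte values where ISO-8859-15 differs
 * from ISO-8859-1. Not the layout mkchartable.pl actually emits. */
struct override { unsigned char byte; uint32_t codepoint; };

static const struct override latin9_overrides[] = {
    { 0xA4, 0x20AC }, /* EURO SIGN */
    { 0xA6, 0x0160 }, /* LATIN CAPITAL LETTER S WITH CARON */
    { 0xA8, 0x0161 }, /* LATIN SMALL LETTER S WITH CARON */
    { 0xB4, 0x017D }, /* LATIN CAPITAL LETTER Z WITH CARON */
    { 0xB8, 0x017E }, /* LATIN SMALL LETTER Z WITH CARON */
    { 0xBC, 0x0152 }, /* LATIN CAPITAL LIGATURE OE */
    { 0xBD, 0x0153 }, /* LATIN SMALL LIGATURE OE */
    { 0xBE, 0x0178 }, /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
};

/* Map one ISO-8859-15 byte to a 32 bit Unicode codepoint. */
static uint32_t latin9_to_codepoint(unsigned char byte)
{
    size_t i;
    size_t n = sizeof(latin9_overrides) / sizeof(latin9_overrides[0]);

    for (i = 0; i < n; i++) {
        if (latin9_overrides[i].byte == byte)
            return latin9_overrides[i].codepoint;
    }
    /* Every other position is identical to ISO-8859-1, whose byte
     * values equal the codepoints U+0000..U+00FF. */
    return byte;
}

/* Decode a NUL-terminated ISO-8859-15 string into at most max
 * codepoints; returns how many were written. */
static size_t latin9_decode(const char *in, uint32_t *out, size_t max)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in && n < max)
        out[n++] = latin9_to_codepoint((unsigned char)*in++);
    return n;
}
```

Because the output is plain codepoints, the same table can feed either a UTF-8 encoder or a search canonicaliser, which is the separation the old single-pass design lacked.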

This is a change from the old mkchartable.c, which specified a single-pass conversion from the original character set directly to the search-optimised form. That made it impossible to extract a pure UTF-8 conversion without completely rewriting it.

The new internal interface in charset.c provides a set of pluggable conversion routines which can be chained in any arbitrary order. Each converter takes:

  • a state struct
  • a callback to be called with any output character

and provides a function which is called with one input character at a time. The special value -1 can be passed to any of these functions to tell it that the input stream has been corrupted and that it should flush its internal state buffer.
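The shape of such a converter can be sketched as below. This is an illustration only: struct stage, emit, crlf_put, collect_put and feed are hypothetical names, not the structs or functions in lib/charset.c, and the filter (turning "\r\n" into "\n") was chosen just because it needs one piece of held-back state. The sketch also assumes that "flush" on -1 means discarding that state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage layout; the real structs in lib/charset.c
 * may look quite different. */
struct stage {
    void (*put)(struct stage *self, int c); /* called with one input character */
    struct stage *next;                     /* stage that receives our output */
    int held_cr;                            /* filter state: a '\r' is held back */
    char out[64];                           /* sink state: collected output */
    size_t len;
};

/* Hand one character (or the -1 marker) to the next stage. */
static void emit(struct stage *self, int c)
{
    assert(self->next != NULL);
    self->next->put(self->next, c);
}

/* Filter stage: turn "\r\n" into "\n", leaving a lone '\r' alone. */
static void crlf_put(struct stage *self, int c)
{
    if (c == -1) {              /* corrupt input: drop held state, pass it on */
        self->held_cr = 0;
        emit(self, -1);
        return;
    }
    if (self->held_cr) {
        self->held_cr = 0;
        if (c != '\n')
            emit(self, '\r');   /* it was a lone CR after all: release it */
    }
    if (c == '\r') {
        self->held_cr = 1;      /* wait for the next character to decide */
        return;
    }
    emit(self, c);
}

/* Sink stage: collect output in a small fixed buffer, NUL-terminated. */
static void collect_put(struct stage *self, int c)
{
    if (c == -1)
        return;                 /* a sink holds nothing back */
    if (self->len + 1 < sizeof(self->out))
        self->out[self->len++] = (char)c;
    self->out[self->len] = '\0';
}

/* Push a C string into the head of a chain, one character per call. */
static void feed(struct stage *head, const char *s)
{
    while (*s)
        head->put(head, (unsigned char)*s++);
}
```

A chain is then just a filter whose next pointer refers to a sink, for example `struct stage sink = { .put = collect_put };` followed by `struct stage filter = { .put = crlf_put, .next = &sink };`.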

There are also two "sink" functions:

  • an auto-resizing output buffer
  • a search function which is given a pre-compiled search pattern, and sets a flag on its state structure when it gets a match.
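As a rough sketch of what these two sinks do, the code below implements a growable buffer and a substring matcher that sets a flag. All names (growbuf_*, matcher_*, MATCH_MAX) are made up for the example, and the matcher compares a sliding window against a plain string rather than using the pre-compiled pattern format that the real search sink takes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sink 1 (hypothetical): an output buffer that grows on demand. */
struct growbuf { char *data; size_t len, cap; };

static void growbuf_reserve(struct growbuf *b, size_t need)
{
    size_t ncap = b->cap ? b->cap : 16;
    char *p;

    if (need <= b->cap)
        return;
    while (ncap < need)
        ncap *= 2;
    p = realloc(b->data, ncap);
    if (!p)
        abort();                  /* sketch only: no recovery on OOM */
    b->data = p;
    b->cap = ncap;
}

static void growbuf_put(struct growbuf *b, int c)
{
    if (c == -1)
        return;                   /* nothing is held back, nothing to flush */
    growbuf_reserve(b, b->len + 1);
    b->data[b->len++] = (char)c;
}

/* NUL-terminate and hand ownership to the caller, who must free(). */
static char *growbuf_cstring(struct growbuf *b)
{
    char *s;

    growbuf_reserve(b, b->len + 1);
    b->data[b->len] = '\0';
    s = b->data;
    b->data = NULL;
    b->len = b->cap = 0;
    return s;
}

/* Sink 2 (hypothetical): keep the last patlen characters in a window
 * and set a flag once the window equals the pattern. */
#define MATCH_MAX 32
struct matcher {
    const char *pat;
    size_t patlen;
    char window[MATCH_MAX];
    size_t fill;
    int matched;
};

static void matcher_init(struct matcher *m, const char *pat)
{
    memset(m, 0, sizeof(*m));
    m->pat = pat;
    m->patlen = strlen(pat);
    assert(m->patlen > 0 && m->patlen <= MATCH_MAX);
}

static void matcher_put(struct matcher *m, int c)
{
    if (c == -1) {                /* corrupt input: forget the partial window */
        m->fill = 0;
        return;
    }
    if (m->matched)
        return;                   /* flag is sticky once set */
    if (m->fill == m->patlen) {   /* window full: drop the oldest character */
        memmove(m->window, m->window + 1, m->patlen - 1);
        m->fill--;
    }
    m->window[m->fill++] = (char)c;
    if (m->fill == m->patlen && memcmp(m->window, m->pat, m->patlen) == 0)
        m->matched = 1;
}
```

Neither sink needs to know what produced its input, which is why one can be swapped for the other at the end of a chain.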

This interface is flexible enough that the entire implementation of charset_to_utf8 looks like this:

   tobuffer = buffer_init(0, 0);
   input = uni_init(tobuffer);
   input = table_init(charset, input);
   input = qp_init(0, input); /* or b64_init(input), or nothing, depending on encoding */
   convert_catn(input, msg_base, len);
   res = buffer_cstring(tobuffer);
   convert_free(input);
   return res;

These are chained in reverse order because each new filter is initialised with the previously created stage as its output, so the variable "input" always refers to the head of the chain built so far. Basically, convert_catn takes the initial string and sends it one character at a time to qp_init's handler, which updates its internal state. Once that handler has a character of output, it calls table_init's handler, which runs the charset table's processing to produce Unicode codepoints. Once THAT has a codepoint of output, it passes it to uni_init's handler, which converts it to one or more UTF-8 bytes. Each of these gets appended to the buffer. At the end, we convert the buffer to a C string (basically append '\0' and hand responsibility for freeing the buffer to the caller) and free the rest of the state.
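The last conversion step, from one Unicode codepoint to one or more UTF-8 bytes, follows the standard UTF-8 bit layout. The function below is a standalone sketch of that layout, not the code behind uni_init; the name utf8_encode and its return convention are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as UTF-8 into out[0..3].
 * Returns the number of bytes written, or 0 if cp is not encodable
 * (a UTF-16 surrogate or a value above U+10FFFF). */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                    /* 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+20AC (the euro sign) encodes to the three bytes E2 82 AC, each of which would be handed to the next stage separately.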

convert_free follows the chain, calling the "destructor" for each of the handles right down to the buffer.
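A chain-walking free of this kind might look roughly like the following. The struct and function names (node, chain_push, chain_free, count_cleanup) are hypothetical and only illustrate the idea of one destructor per stage; how convert_free is really written is not shown here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical chain node: each stage knows its downstream neighbour
 * and, optionally, how to release its own private state. */
struct node {
    struct node *next;
    void (*cleanup)(struct node *self);
};

static int cleanups_run;    /* demo counter, only used by count_cleanup */

/* Example destructor: a real stage would release its state here. */
static void count_cleanup(struct node *self)
{
    assert(self != NULL);
    cleanups_run++;
}

/* Put a new stage in front of next and return the new head,
 * mirroring the "input = xxx_init(..., input)" pattern. */
static struct node *chain_push(struct node *next,
                               void (*cleanup)(struct node *))
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        abort();            /* sketch only: no recovery on OOM */
    n->next = next;
    n->cleanup = cleanup;
    return n;
}

/* Walk from the head to the final sink, destroying every stage. */
static void chain_free(struct node *head)
{
    while (head) {
        struct node *next = head->next;   /* read before freeing head */

        if (head->cleanup)
            head->cleanup(head);
        free(head);
        head = next;
    }
}
```

The caller only needs to keep the head pointer, since every other stage is reachable from it.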

The implementation for a search is done by replacing buffer_init with search_init, like so:

   tosearch = search_init(substr, pat);
   input = uni_init(tosearch);
   if (searchform) input = canon_init(1, input);
   mimeheader_cat(input, s);
   res = search_havematch(tosearch);
   convert_free(input);

canon_init is the search canonicalisation pass. It converts all characters to lower case, compresses each run of whitespace to a single SPACE character (0x20) and strips all non-printing characters. canon_init is placed in the stream at the Unicode phase, so it takes Unicode codepoints and generates new Unicode codepoints. Note that the search is done on UTF-8 data, not Unicode codepoints, hence the uni_init above. mimeheader_cat does the conversion from an arbitrary MIME-encoded header into Unicode codepoints.
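The three canonicalisation rules can be illustrated with an ASCII-only sketch. The real pass works on Unicode codepoints using the Unicode database, so this is a deliberate simplification: canon_ascii is a hypothetical helper that relies on the C locale's isspace, isprint and tolower, and it drops every byte outside printable ASCII.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII-only illustration of search canonicalisation:
 *   - lower-case every character
 *   - collapse each run of whitespace to one SPACE (0x20)
 *   - drop non-printing characters
 * Writes a NUL-terminated result into out (truncating if needed)
 * and returns its length. */
static size_t canon_ascii(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    int in_space = 0;           /* state: previous kept char was whitespace */

    assert(in != NULL && out != NULL && outsz > 0);
    for (; *in; in++) {
        unsigned char c = (unsigned char)*in;
        int o;

        if (isspace(c)) {
            if (in_space)
                continue;       /* already emitted a SPACE for this run */
            in_space = 1;
            o = 0x20;
        } else if (!isprint(c)) {
            continue;           /* strip control and other non-printing bytes */
        } else {
            in_space = 0;
            o = tolower(c);
        }
        if (n + 1 < outsz)
            out[n++] = (char)o;
    }
    out[n] = '\0';
    return n;
}
```

Applying the same function to both the search pattern and the text being searched is what makes the comparison case- and whitespace-insensitive.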

Because MIME headers are generally short, it was considered sane to process the entire header regardless of where the match was found. For a larger input, it is more efficient to exit immediately upon finding the first match:

   /* feed the handler: characters go in at the head of the chain */
   while (len-- > 0) {
       convert_putc(input, (unsigned char)*s++);
       if (search_havematch(tosearch)) break; /* shortcut if there's a match */
   }