Future Ideas
Executive Summary
This is the full list of changes that I posted to the mailing list. Some are already running at FastMail, and some are still work in progress. They're all aimed towards reducing IO or Bandwidth, reducing race conditions, or simplifying code.
1. CHARSET changes - see: Charset Changes
2. Index record checksums (with one addition, see below) - see: Index Checksums
3. GUIDs - sha1 always, always check on reconstruct
4. audit tool for GUIDs and checksums - integrity check/cleaner
5. lazy cyrus.cache loading and auto-recreate - see: Lazy Cache
6. modseq always enabled (CONDSTORE optional)
7. Replication GUID safety
8. Replication binary equivalence - see: Low Bandwidth Replication
9. mboxlist sort order correctness
10. combine cyrus.index, cyrus.expunge (this is the biggie!) - see: Merge Index
11. owner/shared seen in cyrus.index (reduce IO) - see: Owner Seen
I've already extracted 1, 2, 3, 4 and 6 out for definite inclusion in 2.4. I think 9 would be reasonable to add as well.
I (RudyGevaert) would like to see the Sieve date extension (RFC 5260) supported.
Sounds good to me (BronGondwana) - who wants to implement it?
Original Email
I've copied and pasted from the email below. This can be cleaned up over time.
OK - now descriptions, in the same order as the list above:
- This is what we're already running at FastMail. Proper utf8 support throughout the code. Search that respects whitespace. Unicode 5.2 support, additional character set support.
- We're already running this at FastMail as well, with great success. The missing part is all the direct mmap references in index.c, such as UID(msgno) and friends. Also, I want to store the XOR of all record checksums in the index header. More on that later.
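As a rough illustration of the header XOR idea (a sketch only: CRC32 as the per-record checksum, the record layout and the function names are all assumptions, not what Cyrus does), note that XOR lets the header value be updated when one record is rewritten without rereading the others:

```python
import zlib


def record_checksum(record: bytes) -> int:
    """Checksum of one index record (CRC32 is an assumption for this sketch)."""
    return zlib.crc32(record) & 0xFFFFFFFF


def header_xor(records) -> int:
    """XOR of all per-record checksums, as it might be stored in the header."""
    acc = 0
    for rec in records:
        acc ^= record_checksum(rec)
    return acc


def update_header_xor(old_xor: int, old_record: bytes, new_record: bytes) -> int:
    """Update the header value when a single record is rewritten.

    XOR-ing the old checksum out and the new one in gives the same result
    as recomputing over every record.
    """
    return old_xor ^ record_checksum(old_record) ^ record_checksum(new_record)


# Hypothetical records; the real on-disk format is binary and fixed-width.
records = [b"uid=1 flags=seen", b"uid=2 flags=", b"uid=3 flags=flagged"]
stored = header_xor(records)

rewritten = b"uid=2 flags=seen"
stored = update_header_xor(stored, records[1], rewritten)
records[1] = rewritten

assert stored == header_xor(records)
```

One consequence worth keeping in mind: two identical records cancel each other out under XOR, so this is a cheap consistency summary rather than a strong integrity check.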
- We should just always sha1 the message and store it. No ifs, no buts. It's been working well. There might be an argument for something other than sha1, but the general idea is sound, and it's a lifesaver for...
- mailbox audit tool. We have one at FastMail already. It can run at various levels of sanity checking:
- just check cyrus.index vs mailboxes.db.
- check message file exists (requires a readdir)
- check message file size (requires a stat)
- check message file sha1 (requires a full message read + CPU)
I'd like to see this as basically a "reconstruct but don't change" mode that scans the mailbox for any issues.
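The per-message levels above could look something like the following. This is a minimal sketch, not the FastMail tool: the function name, level constants and problem strings are made up for illustration, the index-versus-mailboxes.db check is left out, and existence is checked per file rather than with a single readdir.

```python
import hashlib
import os
import tempfile

# Hypothetical level names; each level includes the checks below it.
LEVEL_EXISTS, LEVEL_SIZE, LEVEL_SHA1 = 1, 2, 3


def audit_message(path, expected_size, expected_sha1, level):
    """Report problems with one message file. Never changes anything."""
    if not os.path.exists(path):
        return ["missing file"]
    problems = []
    if level >= LEVEL_SIZE and os.stat(path).st_size != expected_size:
        problems.append("size mismatch")
    if level >= LEVEL_SHA1:
        # Most expensive level: full read of the message plus hashing.
        with open(path, "rb") as fh:
            digest = hashlib.sha1(fh.read()).hexdigest()
        if digest != expected_sha1:
            problems.append("sha1 mismatch")
    return problems


with tempfile.TemporaryDirectory() as spool:
    path = os.path.join(spool, "1.")
    body = b"Subject: hi\r\n\r\nhello\r\n"
    with open(path, "wb") as fh:
        fh.write(body)
    good = hashlib.sha1(body).hexdigest()

    # Index record agrees with the file on disk.
    print(audit_message(path, len(body), good, LEVEL_SHA1))
    # Wrong sha1 in the index is only noticed at the sha1 level.
    print(audit_message(path, len(body), "0" * 40, LEVEL_SIZE))
    print(audit_message(path, len(body), "0" * 40, LEVEL_SHA1))
```

Returning a list of problems instead of fixing them is what makes this "reconstruct but don't change": a caller can decide separately whether to repair.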
- speaking of IO. We currently always open the cyrus.cache file when we select a mailbox. This was a big issue for sites wanting to put just cyrus.index on a fast SSD or similar and have the cache on slow storage. The IO cost of just statting and mmapping the cache file took out all the benefits. Also - if we fail to read a cache record for whatever reason (corrupt/missing cache file, broken pointer in the index record) we abort. There's no reason not to parse the rfc822 on disk and just append a shiny new cache record to the file. It will get compressed next time the file gets rewritten.
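To make the two behaviours concrete (open the cache only when a record is first needed, and rebuild a record from the message instead of aborting), here is a small in-memory sketch. The `LazyCache` class, its callables and the dict-shaped cache records are all hypothetical stand-ins; the real cache is an mmapped file.

```python
from email import message_from_bytes


class LazyCache:
    """Cache that is not touched at select time, only on first lookup."""

    def __init__(self, open_cache, parse_message):
        self._open_cache = open_cache        # returns {uid: cache record}
        self._parse_message = parse_message  # rebuilds a record from the message
        self._records = None                 # nothing loaded yet
        self.opens = 0                       # how often the cache was opened

    def get(self, uid):
        if self._records is None:
            self.opens += 1
            try:
                self._records = self._open_cache()
            except OSError:
                # Missing or unreadable cache file: start empty, don't abort.
                self._records = {}
        record = self._records.get(uid)
        if record is None:
            # No usable cache record: parse the message and append a new one.
            record = self._parse_message(uid)
            self._records[uid] = record
        return record


# Stand-in for message files on disk.
MESSAGES = {1: b"Subject: hello\r\n\r\nbody\r\n"}


def parse_message(uid):
    msg = message_from_bytes(MESSAGES[uid])
    return {"subject": msg["Subject"]}


def missing_cache():
    raise FileNotFoundError("cyrus.cache")


cache = LazyCache(missing_cache, parse_message)
assert cache.opens == 0                       # selecting costs no cache IO
assert cache.get(1) == {"subject": "hello"}   # rebuilt from the message
assert cache.get(1) == {"subject": "hello"}   # now served from the cache
assert cache.opens == 1
```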
- There's no reason not to: we already rewrite the whole record and the header to update checksums. Why not increment the modseq while we're there? Having guaranteed modseq data will make lots of algorithms easier to implement, including efficient sync.
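A toy model of why guaranteed modseq data helps sync, assuming every record write bumps a mailbox-wide counter and stamps the record with it. The `Mailbox` class and its method names are illustrative only:

```python
class Mailbox:
    """Toy mailbox where every record write bumps the modseq."""

    def __init__(self):
        self.highestmodseq = 0
        self.records = {}  # uid -> (flags, modseq)

    def write_record(self, uid, flags):
        # The record and header are rewritten anyway, so stamp them here.
        self.highestmodseq += 1
        self.records[uid] = (frozenset(flags), self.highestmodseq)

    def changed_since(self, modseq):
        """UIDs whose records changed after the given modseq."""
        return sorted(uid for uid, (_, m) in self.records.items() if m > modseq)


master = Mailbox()
master.write_record(1, [])
master.write_record(2, [])
replica_seen = master.highestmodseq   # replica is in sync up to here

master.write_record(1, ["\\Seen"])    # a later flag change on the master

# Only the changed record needs to be sent, not the whole mailbox.
assert master.changed_since(replica_seen) == [1]
```

A replica that remembers the last modseq it saw can ask for everything newer, which is the kind of efficient sync the paragraph above has in mind.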
- If a machine dies while replication is not up to date and you fail over to the replica, stuff gets lost. We ran a patch for ages that aborted on GUID mismatch and refused to overwrite the replica, but it had gaps where the message was deleted on the master. Make sync_client able to fix the issue itself (this patch already exists and went into FastMail production today after a whole lot of testing).
- At the moment we have user flags, and in theory the flag names can be in a different order at each end. We also convert both user and system flags to a full text representation to send over the wire, and convert them back into binary values at the other end. This adds massive complexity and breaks my later plan of checksum comparisons, so it has to go. Change to shipping the new cyrus.header flag name list over the wire separately, and then just pushing the flag updates as numbers. It's misguided to take on all this complexity for theoretical compatibility between different versions of cyrus as master and replica, because the sync protocol has changed just as often as system flags anyway - and if you just ignore "unknown" system flags then it would only matter if the same offset got given different flag names at different ends. In that case we have bigger issues!
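A sketch of what "ship the flag name list, then push flags as numbers" might amount to. The helper names and the example flag names are assumptions; the point is only that once both ends hold the same ordered name list, a bitmask means the same thing on each side:

```python
def flags_to_bits(flag_names, set_flags):
    """Encode set flags as a bitmask using the position in flag_names."""
    bits = 0
    for name in set_flags:
        bits |= 1 << flag_names.index(name)
    return bits


def bits_to_flags(flag_names, bits):
    """Decode a bitmask back to flag names with the same ordered list."""
    return [name for i, name in enumerate(flag_names) if (bits >> i) & 1]


# Master side: ordered user flag names, as they might sit in cyrus.header.
master_names = ["$Label1", "NonJunk", "$Forwarded"]

# Step 1: the name list goes over the wire once; the replica adopts it as is.
replica_names = list(master_names)

# Step 2: per-message flag updates travel as plain numbers.
wire_value = flags_to_bits(master_names, ["$Forwarded", "$Label1"])

assert bits_to_flags(replica_names, wire_value) == ["$Label1", "$Forwarded"]
```

Because the replica stores the same numbers as the master, the records end up identical on both sides, which is what checksum comparison between master and replica would need.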
- Are you still reading? Good. This one is easy, just switch the default!
- I'm going to write a whole separate email about this one.
- At the moment, we store ALL seen information in per-user seen files, linked by the somewhat bogus "uniqueid" field on mailboxes. This is awful in two ways:
- 99%+ of all mailboxes in user.seen files are the users' own mailboxes.
- every single shared-seen mailbox on the entire system winds up in a single anyone.seen file, which is a good single point of failure for corruption, and could be a performance bottleneck if you have a lot of shared-seen folders (e.g. if you have businesses who want shared seen semantics)