CouchDB error causes stale file in cdev environment MX node for test user email
Since some days now I keep receiving, sometimes 3 times a day, a pair of messages, the first one notifying about a stale file in cdev environment MX node and the second one notifying about the recovery of that problem:
Date: Sun, 26 Feb 2017 09:12:24 +0000 (UTC)
From: nagios@donkey.unstable.bitmask.net
To: drebs@leap.se, sysdev@leap.se
Subject: PROBLEM: WARNING - lynx/Leap_MX_Queue
***** Nagios *****
Notification Type: PROBLEM
Host: lynx.cdev.bitmask.i (lynx)
Address: 37.218.242.146
Service: Leap_MX_Queue
State: WARNING
Duration: 0d 0h 12m 5s
Date/Time: Sun Feb 26 09:12:24 UTC 2017
Output: WARN - Warning: 1 stale files (=10 min) in /var/mail/leap-mx/Maildir/new/.
And the recovery message:
Date: Sun, 26 Feb 2017 09:16:24 +0000 (UTC)
From: nagios@donkey.unstable.bitmask.net
To: drebs@leap.se, sysdev@leap.se
Subject: RECOVERY: OK - lynx/Leap_MX_Queue
***** Nagios *****
Notification Type: RECOVERY
Host: lynx.cdev.bitmask.i (lynx)
Address: 37.218.242.146
Service: Leap_MX_Queue
State: OK
Duration: 0d 0h 0m 4s
Date/Time: Sun Feb 26 09:16:24 UTC 2017
Output: OK - OK: 0 stale files (=10 min) in /var/mail/leap-mx/Maildir/new/.
Looking up the logs in lynx:/var/log/leap/mx.log
, there is a CouchDBError((u'not_found', u'no_db_file'),)
that causes a message to be skipped resulting in the stale file:
Feb 26 08:48:18 lynx leap-mx: [LEAPPostfixTCPMapAliasServer,48204,0.0.0.0] virtual alias map: test_user_email@cdev.bitmask.net
Feb 26 08:48:18 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565503638>
Feb 26 08:48:18 lynx leap-mx: [-] Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565503638>
Feb 26 08:48:18 lynx leap-mx: [LEAPPostfixTCPMapAliasServer,48205,0.0.0.0] virtual alias map: test_user_email@cdev.bitmask.net
Feb 26 08:48:18 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565d97368>
Feb 26 08:48:18 lynx leap-mx: [-] Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565d97368>
Feb 26 08:48:18 lynx leap-mx: [LEAPPostfixTCPMapAliasServer,48205,0.0.0.0] virtual alias map: c7bb0d550cbd7ca8ba17837186a6e95a@
deliver.local
Feb 26 08:48:18 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565d972d8>
Feb 26 08:48:18 lynx leap-mx: [-] Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565d972d8>
Feb 26 08:48:18 lynx leap-mx: [LEAPPostfixTCPMapAliasServer,48206,0.0.0.0] virtual alias map: deliver.local
Feb 26 08:48:18 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565dab2d8>
Feb 26 08:48:18 lynx leap-mx: [-] Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565dab2d8>
Feb 26 08:48:18 lynx leap-mx: [LEAPPostfixTCPMapAliasServer,48204,0.0.0.0] virtual alias map: test_user_bad@cdev.bitmask.net
Feb 26 08:48:18 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f65654f5ef0>
Feb 26 08:48:18 lynx leap-mx: [-] Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f65654f5ef0>
Feb 26 08:48:18 lynx leap-mx: [-] Processing new mail at '/var/mail/leap-mx/Maildir/new/1488098898.Vca01I305006aM476611.lynx'
Feb 26 08:48:18 lynx leap-mx: [-] Mail owner: c7bb0d550cbd7ca8ba17837186a6e95a
Feb 26 08:48:18 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565da5908>
Feb 26 08:48:18 lynx leap-mx: [HTTP11ClientProtocol,client] Encrypting message to c7bb0d550cbd7ca8ba17837186a6e95a's pubkey
Feb 26 08:48:19 lynx leap-mx: [HTTP11ClientProtocol,client] Exporting message for c7bb0d550cbd7ca8ba17837186a6e95a
Feb 26 08:48:19 lynx leap-mx: [HTTP11ClientProtocol,client] New stalled email FilePath('/var/mail/leap-mx/Maildir/new/148809889
8.Vca01I305006aM476611.lynx'): CouchDBError((u'not_found', u'no_db_file'),)
Feb 26 08:48:19 lynx leap-mx: [-] Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565da5908>
Feb 26 08:48:19 lynx leap-mx: [LEAPPostfixTCPMapAliasServer,48207,0.0.0.0] virtual alias map: test_user_email@cdev.bitmask.net
Feb 26 08:48:19 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f65654ffe18>
Feb 26 08:48:19 lynx leap-mx: [-] Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f65654ffe18>
Feb 26 08:48:19 lynx leap-mx: [LEAPPostFixTCPMapAccessServer,12231,0.0.0.0] check recipient access: test_user_email@cdev.bitmask.net
Feb 26 08:48:19 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565da52d8>
Feb 26 08:48:19 lynx leap-mx: [-] Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565da52d8>
Then after some time the skipped file is re-processed, and delivered succesfully:
Feb 26 09:15:24 lynx leap-mx: [-] Starting processing skipped mail...
Feb 26 09:15:24 lynx leap-mx: [-] --------------------------------------------------
Feb 26 09:15:24 lynx leap-mx: [-] Processing new mail at '/var/mail/leap-mx/Maildir/new/1488098898.Vca01I305006aM476611.lynx'
Feb 26 09:15:24 lynx leap-mx: [-] Mail owner: c7bb0d550cbd7ca8ba17837186a6e95a
Feb 26 09:15:24 lynx leap-mx: [-] Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f6565dab3b0>
Feb 26 09:15:24 lynx leap-mx: [HTTP11ClientProtocol,client] Encrypting message to c7bb0d550cbd7ca8ba17837186a6e95a's pubkey
Feb 26 09:15:25 lynx leap-mx: [HTTP11ClientProtocol,client] Exporting message for c7bb0d550cbd7ca8ba17837186a6e95a
Feb 26 09:15:25 lynx leap-mx: [HTTP11ClientProtocol,client] Done exporting
Feb 26 09:15:25 lynx leap-mx: [HTTP11ClientProtocol,client] Removing '/var/mail/leap-mx/Maildir/new/1488098898.Vca01I305006aM476611.lynx'
Feb 26 09:15:25 lynx leap-mx: [HTTP11ClientProtocol,client] Done removing
Feb 26 09:15:25 lynx leap-mx: [HTTP11ClientProtocol,client] ++++++++++++++++++++++++++++++++++++++++++++++++++
Feb 26 09:15:25 lynx leap-mx: [HTTP11ClientProtocol,client] Done processing skipped mail
Some notes:
- Leap MX re-processes skipped messages every 30 minutes, so the error happened at
08:48
and the message was re-processed at09:15
. - We would probably not notice this problem if the interval between MX skipped mail re-processing attempts was shorter than the nagios stale file verifications.
- Seems that for some reason either the test user db wasn't yet created when MX tries to deliver the message or it was created but the couch server just can't find it.
- What is the life cycle of a test user db? Is there an insurance that its creation was successful before running the mail delivery test? What happens with that db in case of failure of the test? Is it expected that the db is still there almost half an hour after the test ran?