At the beginning of this week, Mandrill, a transactional email API for MailChimp users, experienced an outage during which users could send emails but could not receive them. Mandrill also tweeted that it was seeing ongoing errors with scheduled mail and webhooks and was working to resolve the issues quickly.
Update: Most outbound emails are sending, but inbound emails are still not coming through. We're also seeing ongoing errors with scheduled mail and webhooks. We're sorry for the disruption and are doing our best to resolve these issues quickly.
— Mandrill (@mandrillapp) February 5, 2019
Sebastian Lauwers, the VP of Engineering at Dixa, a customer service software company, tweeted that the issue was taking too long to resolve. He also asked why Mandrill had gone nearly 23 hours without fixing the problem or explaining its cause.
Could someone take a minute just to explain what on earth is taking so long to fix? Was it a bad deploy? Did your cloud get pwned? Are you being DDoS'd? How do you strand customers for nearly 23 hours and still not explain what is going on?
— Sebastian Lauwers (@teotwaki) February 5, 2019
Today, a user with the username GuyPostington posted on Hacker News an email received from Mandrill. The email explains the cause of the outage and how Mandrill is addressing it. Mandrill uses a sharded Postgres setup as one of its main datastores. According to the email, “On Sunday, February 3, at 10:30 pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes. The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.” Mandrill also shared the same explanation on Twitter.
The email further explains that because the database is large, the vacuum process takes a significant amount of time and resources, and there is no clear way to track its progress.
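For readers unfamiliar with transaction ID wraparound, the following is a minimal sketch of the arithmetic behind it, not Mandrill's tooling or Postgres's actual implementation. Postgres transaction IDs are 32-bit counters compared modulo 2^32, so only about 2^31 transactions can sit between the oldest unfrozen row and the current transaction. The function names and the `safety_margin` value are assumptions chosen for illustration, and the sketch ignores the few transaction IDs that Postgres reserves for special use.

```python
XID_SPACE = 2**32   # transaction IDs are 32-bit counters
WRAP_LIMIT = 2**31  # roughly 2 billion XIDs can be "in the past" at once


def xid_age(current_xid: int, oldest_unfrozen_xid: int) -> int:
    """Number of XIDs consumed since the oldest unfrozen one, modulo 2**32."""
    return (current_xid - oldest_unfrozen_xid) % XID_SPACE


def headroom(current_xid: int, oldest_unfrozen_xid: int,
             safety_margin: int = 1_000_000) -> int:
    """XIDs left before writes must stop.

    safety_margin is a hypothetical buffer for this sketch; the real
    cutoff depends on the Postgres version.
    """
    return WRAP_LIMIT - safety_margin - xid_age(current_xid, oldest_unfrozen_xid)


def accepts_writes(current_xid: int, oldest_unfrozen_xid: int,
                   safety_margin: int = 1_000_000) -> bool:
    """True while there is still headroom; False once a vacuum is required."""
    return headroom(current_xid, oldest_unfrozen_xid, safety_margin) > 0


if __name__ == "__main__":
    # A burst of writes burns through XIDs faster than vacuuming frees them.
    for current in (200_000_000, 2_000_000_000, 2_147_000_000):
        print(current, headroom(current, 0), accepts_writes(current, 0))
```

Vacuuming "freezes" old rows, which advances the oldest unfrozen XID and restores headroom. On a live Postgres server the corresponding age can be inspected with `SELECT datname, age(datfrozenxid) FROM pg_database;`.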
On the recovery timeline, the email states, “We don’t have an estimated time for when the vacuum process and cleanup work will be complete. While we have a parallel set of tasks going to try to get the database back in working order, these efforts are also slow and difficult with a database of this size. We’re trying everything we can to finish this process as quickly as possible, but this could take several days, or longer.”
The email also states that once the outage is resolved, Mandrill plans to offer refunds to all affected users.
For more details, see Mandrill’s tweet thread.