Correction: this issue has been around for longer than we initially realised. The first occurrences were on 2019-10-15 at around 5PM GMT.
Posted Oct 23, 2019 - 16:54 UTC
Since around 6PM GMT yesterday (2019-10-22) we experienced an increased number of run failures that manifested as timeouts. These failed runs weren't actually timing out; they had failed mid-execution due to a crashing server.
A fix for the issue was pushed today (2019-10-23) at around 12:45PM GMT. It handles errors thrown from Postgres-based connections in a more orderly way, and we are now monitoring it.