Government Digital Service (GDS)
How GOV.UK Notify reliably sends text messages to users
GOV.UK Notify lets central government, local authorities and the NHS send emails, text messages and letters to their users.
We usually send between 100,000 and 200,000 text messages a day. It’s important for services using Notify that they’re able to quickly and successfully send text messages to their users.
Those services rely on us to send important messages, for example a flood warning or a two-factor authentication (2FA) code so their users can sign in to another service. We design and build Notify with this in mind.
Using multiple text message providers
When a central government, local authority or NHS service wants to send a text message to a user, they ask Notify, either manually through our web interface or using our API, to send it. We then send an HTTP request to a text message provider to ask them to deliver the message. No provider will be working perfectly 100% of the time (nor should we expect them to be). Because of this we have 2 different providers, so if one encounters any issues we can use the other provider to send the message.
Our original load balancing design
Originally we sent all text messages through one provider, say provider A. If provider A started having trouble, Notify would automatically swap all traffic to provider B, a process known as a failover. We used 2 measures to decide whether a provider was having problems and whether to fail over. We looked for a:
- single 500-599 HTTP response code from the provider
- slowdown in successful delivery callbacks (a message back from the provider to say it had delivered the message to the recipient)
To determine if callbacks were slow, we’d measure the last 10 minutes of messages being sent. We’d consider callbacks slow if 30% of them took longer than 4 minutes to report back as delivered.
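The slow-callback check described above can be sketched in a few lines of Python. This is an illustration only, not Notify's actual code: the `SentMessage` type and the `callbacks_are_slow` function are hypothetical names, "30% of them" is read as "at least 30%", and a message that has not yet had a callback is assumed to count as slow once it has waited more than 4 minutes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

WINDOW = timedelta(minutes=10)     # how far back we look
SLOW_AFTER = timedelta(minutes=4)  # a callback slower than this counts as slow
SLOW_RATIO = 0.3                   # share of slow callbacks that marks a provider as slow


@dataclass
class SentMessage:
    sent_at: datetime
    delivered_at: Optional[datetime] = None  # None while no callback has arrived


def callbacks_are_slow(messages: List[SentMessage], now: datetime) -> bool:
    """Return True if at least 30% of messages sent in the last 10 minutes were slow."""
    recent = [m for m in messages if now - WINDOW <= m.sent_at <= now]
    if not recent:
        return False
    slow = 0
    for m in recent:
        # Assumption: a message still waiting for its callback is measured up to `now`.
        waited = (m.delivered_at or now) - m.sent_at
        if waited > SLOW_AFTER:
            slow += 1
    return slow / len(recent) >= SLOW_RATIO
```

With 10 recent messages, 4 of which took 5 minutes to report back as delivered, this check would report the provider as slow. With only 2 slow messages out of 10 it would not.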
We could also manually swap traffic from, say, provider A to provider B when we wanted. We did this often, maybe once a week, to try to reach a roughly 50/50 split of messages sent between our providers. If we ended up sending only a small number of messages through one provider over the long run, they might have little incentive to remain a provider in the future.
A problem with our original design
One day, towards the end of 2019, we had a large spike in requests to send text messages. We sent all these requests to one of our providers, but they couldn’t handle the load and started to fail. Our system swapped to the other provider, but suddenly sending them a large amount of traffic caused them to start returning errors too. It was likely that our providers needed time to scale up to handle the sudden load we were sending them.
How we improved our resiliency
We changed Notify to send traffic to both providers with a roughly 50/50 split. When a single text message is sent, Notify will pick a provider at random. This should reduce the chance of giving our providers a very large amount of unexpected traffic that they will not be able to handle.
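Picking a provider at random for each message, weighted by its current share of traffic, might look something like the sketch below. The provider names, the `shares` dictionary and the `choose_provider` function are assumptions made for illustration, not Notify's real identifiers.

```python
import random


def choose_provider(shares, rng=random):
    """Pick one provider at random, weighted by its percentage share of traffic.

    `shares` maps a provider name to its share, for example
    {"provider_a": 50, "provider_b": 50}. At least one share must be above zero.
    """
    names = list(shares)
    weights = [shares[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    shares = {"provider_a": 50, "provider_b": 50}
    print(choose_provider(shares))
```

Because each message is an independent random pick, the split is only roughly 50/50 over a small number of messages, but it evens out as volume grows.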
We also changed how we handle errors from our providers. If a provider gives us a 500-599 HTTP response code, we reduce their share of the load by 10 percentage points (and therefore increase the other provider’s share by 10 percentage points). We will not reduce the share if it’s already been reduced in the last minute.
We also decided that if a provider is slow to deliver messages, measured in the same way as before, we reduce their share of the load by 10 percentage points. Again, we will not reduce the share if it’s already been reduced in the last minute.
It’s important that we wait a minute before allowing another 500-599 HTTP response code to decrease that provider’s share of traffic again. This means that just a small blip, for example five 500-599 HTTP responses over a second, doesn’t switch all traffic to the other provider too quickly.
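The reduction rule and its one-minute wait can be sketched as follows. This is a simplified model under stated assumptions, not Notify's implementation: the `ProviderShares` class and `reduce_share` method are hypothetical, there are exactly 2 providers, and the one-minute wait is tracked separately for each provider.

```python
from datetime import datetime, timedelta

STEP = 10                        # percentage points moved per reduction
COOLDOWN = timedelta(minutes=1)  # minimum gap between reductions for one provider


class ProviderShares:
    def __init__(self, shares):
        self.shares = dict(shares)  # provider name -> percentage share
        self.last_reduced = {}      # provider name -> time of its last reduction

    def reduce_share(self, provider, now):
        """Move up to 10 points from `provider` to the other provider.

        Returns True if the shares changed, False if the reduction was skipped.
        """
        last = self.last_reduced.get(provider)
        if last is not None and now - last < COOLDOWN:
            return False  # already reduced in the last minute
        moved = min(STEP, self.shares[provider])
        if moved == 0:
            return False  # nothing left to take away
        other = next(name for name in self.shares if name != provider)
        self.shares[provider] -= moved
        self.shares[other] += moved
        self.last_reduced[provider] = now
        return True
```

Calling `reduce_share("provider_a", now)` five times within one second moves the split from 50/50 to 40/60 only once. A further error more than a minute later moves it to 30/70.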
Equally balancing our traffic
We still had the manual task of equally balancing our traffic if we no longer needed to push that traffic towards one of the providers. We decided that, if neither provider had changed its balance of traffic in the last hour, we’d move both providers 10 percentage points closer to their defined resting points.
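As a rough sketch of this rule, assuming resting points of 50/50 and hypothetical names (`RESTING`, `step_towards`, `rebalance`), the hourly move back towards the resting points could be written like this:

```python
from datetime import datetime, timedelta

STEP = 10                          # percentage points moved per rebalance
QUIET_PERIOD = timedelta(hours=1)  # how long shares must be unchanged first
RESTING = {"provider_a": 50, "provider_b": 50}  # assumed resting points


def step_towards(current, resting, step=STEP):
    """Move `current` up to `step` points closer to `resting` without overshooting."""
    if current < resting:
        return min(current + step, resting)
    return max(current - step, resting)


def rebalance(shares, resting, last_changed, now):
    """Return new shares, moved towards the resting points if nothing changed for an hour."""
    if now - last_changed < QUIET_PERIOD:
        return dict(shares)  # a recent change: leave the balance alone
    return {name: step_towards(shares[name], resting[name]) for name in shares}
```

For example, a 0/100 split that has been unchanged for 2 hours becomes 10/90, while a split that changed 30 minutes ago is left as it is.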
This means our system will automatically restore itself to the middle and removes the manual burden of our team trying to send roughly equal traffic to both providers. We can still manually decide what percentage of traffic goes to each provider if we want to, but this is something we anticipate doing rarely.
We did consider trying to overcorrect traffic to bring the overall balance back to 50/50 over, say, a month. For example, if provider A has an incident and receives no traffic for 24 hours, we could give it 70% of the traffic for the next few days to overcorrect the traffic it lost. We decided doing this would only bring a small benefit and would increase the complexity of our load balancing system. Keeping things as simple as possible won the argument in this case.
How the service is doing now
The following graphs show the number of text messages we sent to each of our providers per second.
On the morning of 26 January one of our providers ran into problems and we reduced their share of traffic to zero. Every hour for a while after this you can see us give them 10% of traffic to see if they had recovered enough, but they hadn’t, so their share was reduced back to 0% again.
Finally the next afternoon their system improved and we moved back towards a roughly equal split of traffic.
This fix works for us now. As we continue to grow we’ll make more improvements like this to make sure we’re providing the best performance, resilience and value for money to Notify’s users.
Visit GOV.UK Notify for more information and to create an account.