GeoPlace, plumbing & poka-yoke
Blog posted by: James Roger, Executive Director of IT Services at GeoPlace, 26 April 2022.
When you turn on your kitchen tap you expect water to come out.
Most people don’t think about where the water comes from, what treatments it’s had, or the miles of pipes and valves it’s travelled through to reach you.
No-one really cares – until it stops working. Which doesn’t happen that often if it’s fitted correctly in the first place and properly maintained.
It’s the same for other things. Lights, TVs, washing machines. Most people don’t give them a second thought until they go wrong. It should be that way for IT systems too.
The difference is the basic principles of large-scale water supply have been around for millennia. The Mesopotamians used clay pipes 6000 years ago. The ancient Egyptians used copper, the Han Dynasty used bamboo, and the Romans used lead. In fact, the word plumbing comes from plumbum, the Latin for lead.
Much like today, those ancient folk probably didn’t think much about the mechanics of their water supply either. Until it stopped working.
By comparison, IT systems are still in their infancy and are continually evolving through small daily changes, and sometimes massive transformations. So, making IT systems as reliable as plumbing or electrical appliances is tricky – not to mention time-consuming and expensive.
GeoPlace needs reliable IT systems and applications to do its work. What we do is unique so we often can’t find off-the-shelf software, which means we have to invent some of it ourselves.
We go to great lengths to make our systems as reliable as we can but it’s a moving target. We made over 2,000 changes (fixes and improvements) in the last year alone, any one of which has the potential to cause unexpected problems with our dozens of applications, 350+ servers, or millions of lines of code.
When it comes right down to it, all IT problems are caused by humans – don’t let anyone tell you otherwise!
Problems can stem from improper maintenance, inadequate monitoring, cutting corners (known as technical debt), a design flaw or manufacturing fault from years ago, pushing a system beyond what it was designed to do or beyond its planned lifetime, or simply from a typo made yesterday by an overworked and underappreciated engineer (except on the last Friday in July 🤨).
Somewhere along the way there was a human being who created a system, tool, or process that allowed the problem to occur. Someone didn’t put enough checks or guardrails in place to prevent it. Don’t be too hard on them though, it was probably unintentional and due to a lack of information at the time.
Humans are fallible. Mistakes and accidents happen. So, we need to find a way to protect against that.
Poka-yoke (pronounced PO-ka yo-KAY) is a Japanese term from lean manufacturing that roughly translates as mistake-proofing. There are examples of it everywhere. It’s why your sink has an overflow outlet, your washing machine won’t operate with the door open, and plugs only fit in the electric socket one way.
Sir Ranulph Fiennes is credited with saying “There is no such thing as bad weather, only inappropriate clothing”. In that spirit, let’s say that there is no such thing as computer error, only inappropriate mistake-proofing.
So how do poka-yoke principles apply to IT systems at GeoPlace?
- We limit access to systems by following the principle of least privilege, which means only people who need access can have it. It’s like locking a door and only giving keys to people who need them.
- We have measures in place to spot potential problems early. Any non-routine changes go through an approvals system where work is checked by at least one other pair of eyes, sometimes more. Changes then pass through multiple quality assurance and testing stages before they are deployed to our production systems. This release management process is crucial in reducing problems and making our systems more reliable.
- Despite all the advances in technology, servers often still rely on mechanical components – like spinning fans and disks. We use servers that have redundant components so if a power supply or hard disk fails there’s another one to automatically take its place.
- Whenever possible we build high-availability systems, where multiple servers are used to run one application. This means we can lose an entire server (sometimes more than one) without causing disruption. In some ways it’s like airliners having multiple engines in case one of them fails.
- Almost all our systems are hosted on virtual servers, which separates the hardware from the software. This provides the capability to automatically move systems away from malfunctioning hardware often without any human intervention at all.
- Our cloud providers have a long list of accreditations to ensure their procedures are following good practice and security is maintained, including ISO 27017, ISO 27018, ISO 27701, PCI-DSS, as well as SOC1-3.
- In this VUCA world we also must be vigilant against outside threats. We employ all the usual security measures you would expect, and we are certified to the UK Government’s Cyber Essentials standard which ensures we’re effectively managing things like passwords and anti-malware, and applying security updates – which is a thankless task but it has to be done. A notable example was seen in December 2021 with the scramble to fix the Log4j flaw, which has been called “the single biggest, most critical vulnerability of the last decade”.
- To ensure our information security processes are following good practice we are certified to the ISO 27001 international standard, which includes being assessed by external auditors every six months. This complements our quality management certification to the ISO 9001 standard which amongst other things helps formalise our approach to risk management.
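To make the approvals idea concrete, here is a minimal sketch of a mistake-proofing gate in Python. It is an illustration only, not our actual release tooling: the `ChangeRequest` record, the stage names in `REQUIRED_STAGES`, and the `can_deploy` function are all hypothetical. The point is that the check makes the wrong thing impossible rather than merely discouraged, much like a washing machine that won’t start with the door open.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A hypothetical record of a proposed change (illustrative only)."""
    author: str
    routine: bool = False
    approvers: set[str] = field(default_factory=set)
    stages_passed: set[str] = field(default_factory=set)


# Assumed stage names; a real pipeline will have its own.
REQUIRED_STAGES = {"qa", "testing"}


def can_deploy(change: ChangeRequest) -> bool:
    """Return True only when every guardrail is satisfied."""
    # Authors cannot count as a second pair of eyes on their own work.
    independent_approvers = change.approvers - {change.author}
    if not change.routine and not independent_approvers:
        return False
    # Every required stage must have passed before production.
    return REQUIRED_STAGES <= change.stages_passed
```

A non-routine change approved only by its own author is refused, however many test stages it has passed.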
Despite all this, preventative measures only get us so far. We’re human, so errors slip through from time to time and accidents still happen.
When they do, we need to be ready to deal with them and get things back to normal as soon as possible.
- We monitor thousands of different aspects of our systems so we can be alerted about problems, hopefully before they cause noticeable disruption. We can’t monitor everything though, not least because we have to be careful to avoid the Observer Effect, where excessive monitoring can itself actually create problems.
- When we have a problem it’s our highly skilled and experienced staff that save the day. Ultimately, it’s their expertise that resolves problems. We feed any lessons learned into our Post-Incident Review sessions so that we can avoid the problem happening again.
- We invest a lot of time and effort into automating our processes wherever we can, in particular when we’re building our IT systems. By using techniques like Infrastructure-as-code we write scripts which software tools can use to build and configure our servers. Although writing the scripts takes longer than manually building servers, it pays off in the long-term by reducing errors and ensuring future deployments are done quickly and in exactly the same way every time. In fact, it’s now very common that we won’t even try to fix a malfunctioning server because it’s quicker and easier to simply destroy it and build a new one. Back in the day we used to give servers cute names, like Nemo and Dory, and nurse them back to health when they were sick. Now we give them designations, like svswpapil03p, and treat them as if they are disposable.
- For the really big problems we turn to Business Continuity and Disaster Recovery procedures. This is reserved for catastrophic events like major fires. We have duplicate systems already in place that are ready to be used, and copies of our most important data, which are continually updated. If we had a disaster we could deploy these systems rapidly, but most of our servers don’t hold data so we would rebuild those from scratch using automated scripts.
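The “destroy it and build a new one” approach can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a real provisioning tool: `ServerSpec`, `build_server`, and `replace_if_unhealthy` are hypothetical names, and a server is represented as a plain dictionary. What it shows is the property that matters: the same description always produces the same server, so replacing a sick one is safe and predictable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """A hypothetical declarative description of a server."""
    name: str
    cpus: int
    packages: tuple[str, ...]


def build_server(spec: ServerSpec) -> dict:
    """Build a server from its spec; same spec in, same server out."""
    return {
        "name": spec.name,
        "cpus": spec.cpus,
        "packages": sorted(spec.packages),
        "healthy": True,
    }


def replace_if_unhealthy(server: dict, spec: ServerSpec) -> dict:
    """Keep a healthy server; otherwise discard it and rebuild from the spec."""
    if server.get("healthy"):
        return server
    # No attempt to repair: rebuilding is quicker and exactly repeatable.
    return build_server(spec)
```

Because the spec is the single source of truth, a rebuilt server is identical to a freshly built one, with no hand-applied fixes to forget.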
We do a lot already that aligns with poka-yoke principles, but we’re always looking to improve. We can look at why our changes fail and how to stop it happening again; we can find ways to avoid disruption by teaching systems how to deal with problems themselves (known as exception handling); and we can practise restoring systems so we can do it rapidly when problems do occur.
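As a small illustration of exception handling, the Python sketch below retries an operation that fails with a connection error and falls back to a safe default if it never succeeds. The function name `call_with_retry` and its parameters are hypothetical, chosen for this example; real systems would tune which errors to catch and how long to wait.

```python
import time


def call_with_retry(operation, attempts=3, delay_seconds=0.0, fallback=None):
    """Call operation(); retry on ConnectionError, then return the fallback."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            # Pause before the next try, but not after the final one.
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: degrade gracefully instead of crashing.
    return fallback
```

Only `ConnectionError` is caught here on purpose. Swallowing every kind of error would hide genuine bugs, which is the opposite of mistake-proofing.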
It's not easy or quick, but every improvement brings us a step closer to being as reliable as your kitchen tap.
Original article link: https://www.geoplace.co.uk/blog/2022/geoplace-plumbing-poka-yoke