Known security risks exacerbated by AI

23 May 2019 02:49 PM

As part of our AI auditing framework blog series, Reuben Binns, our Research Fellow in Artificial Intelligence (AI), Peter Brown, Technology Policy Group Manager, and Valeria Gallo, Technology Policy Adviser, look at how AI can exacerbate known security risks and make them more difficult to manage.

Personal data must always be processed in a manner that ensures appropriate levels of security against unauthorised processing, accidental loss, destruction or damage.

There is no “one-size-fits-all” approach to security. The appropriate security measures organisations should adopt depend on the level and type of risks that arise from specific processing activities. Using AI to process any personal data will have important implications for an organisation’s security risk profile, which need to be assessed and managed carefully.

Some implications may be triggered by the introduction of new types of risks, eg adversarial attacks on machine learning models, which we will examine in future blogs. In this post we will focus on the way AI can adversely affect security by making known risks worse and more challenging to control.

Information security is a key component of our AI Auditing Framework, but is also central to our work as the information rights regulator. The ICO is planning to expand its general security guidance to take into account the additional requirements set out in the new General Data Protection Regulation (GDPR). While this guidance will not be AI-specific, it will cover a range of topics that are relevant for organisations using AI, including software supply chain security and increasing use of open-source software.

We are therefore particularly keen to hear your views on this topic so we can integrate them into both the framework and the guidance. We encourage you to use the comments section below, or to email us, to share your thoughts on AI-related security challenges, best practices, and any additional guidance you would like the ICO to issue.

Managing security in AI vs. traditional technologies

Some of the unique characteristics of AI mean compliance with security requirements can be more challenging than with more established technologies, from both a technological and a human perspective.

From a technological perspective, AI systems introduce new kinds of complexity not found in the IT systems most organisations will have dealt with previously. They are also likely to rely heavily on third-party code or relationships, and will need to be integrated with several other new and existing IT components, which are themselves intricately connected. This complexity may make it more difficult to identify and manage some security risks, and may increase others, such as the risk of outages.

From a human perspective, the people involved in building and deploying AI systems are likely to have a wider range of backgrounds than usual, including software engineers, systems administrators, data scientists, statisticians and domain experts. Security practices and expectations may vary significantly, and some may have less understanding of broader security compliance requirements. Security of personal data may not always have been a key priority, especially if someone was previously building AI applications with non-personal data or in a research capacity.

Common practices about how to process data securely in data science and AI engineering are still developing, which causes further complications.

It is not possible to list all known security risks that might be exacerbated when AI is used to process personal data. The impact of AI on security will depend on the way the technology is built and deployed, the complexity of the organisation, and the strength and maturity of the existing risk management capabilities.

The following hypothetical scenario should raise awareness of some of the known security risks that AI can exacerbate, and of some of the challenges involved in managing them.

Our key message for organisations is: review risk management practices to ensure personal data is secure in an AI context.

Hypothetical scenario: AI in recruitment

A recruitment firm decides to use an AI system based on machine learning (ML) to match CVs to job descriptions automatically, rather than through a manual review. The AI system will select the best candidates to be forwarded to potential employers for consideration. To make a recommendation, the AI system will process the job descriptions, personal data provided by the candidates themselves, and data provided by the employers about previous hiring decisions for similar roles.

Risk example #1 - Losing track of training data

ML systems require large sets of training and testing data to be shared. In the example above, for the AI system to be effective, employers will need to share data about previous hiring decisions for similar roles (e.g. sales manager) with the recruitment firm.

While some sharing of personal data (e.g. candidates’ CVs) would have taken place while the CV scanning process was manual, it did not involve the transfer of large quantities of personal data between the employers and the recruitment firm.

Leaving aside questions about the legal basis for the processing, sharing this additional data could involve creating multiple copies in different formats, stored in different locations (see below), which raises important security and information governance considerations:

The employer may need to copy HR and recruitment data into a separate database system to interrogate and select the data relevant to the vacancies the recruitment firm is working on.
The selected data subsets will need to be saved and exported into files, and then transferred to the recruitment firm in compressed form.
Upon receipt the recruitment firm could upload the files to a remote location, eg the cloud.
Once in the cloud, the files may be loaded into a programming environment to be cleaned and used in building the AI system.
Once ready, the data is likely to be saved into a new file to be used at a later time.

For both the recruitment firm and employers, this will increase the risk of a data breach, including unauthorised processing, loss, destruction and damage.

What should organisations do?

All copies of training data will need to be shared, managed, and when necessary deleted in line with security policies. While many recruitment firms will already have information governance and security policies in place, these may no longer be fit-for-purpose once AI is adopted, and should be reviewed and, if necessary, updated.

Technical teams should record and document all movements and storing of personal data from one location to another. This will help organisations apply the appropriate security risk controls and monitor their effectiveness. Clear audit trails are also necessary to satisfy accountability and documentation requirements.
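As a minimal sketch of what such a record could look like, the following Python example appends one entry to an audit log each time a copy of the data is made or moved. The function name, the field names and the locations are illustrative assumptions, not a prescribed format; a real deployment would write to tamper-resistant storage rather than an in-memory list.

```python
import hashlib
import json
from datetime import datetime, timezone


def record_data_movement(log, source, destination, payload, purpose):
    """Append one audit entry describing a copy or transfer of personal data."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "destination": destination,
        "purpose": purpose,
        # A digest lets copies be matched up without duplicating the
        # file contents inside the log itself.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }
    log.append(entry)
    return entry


audit_log = []
export = b"candidate_id,role,outcome\n101,sales manager,hired\n"

# Hypothetical locations, loosely following the steps in the scenario.
record_data_movement(audit_log, "employer-hr-db", "export/sales_roles.csv",
                     export, "select training data")
record_data_movement(audit_log, "export/sales_roles.csv", "cloud-bucket/raw/",
                     export, "transfer to recruitment firm")

print(json.dumps(audit_log, indent=2))
```

Because both entries carry the same digest, a reviewer can see that the file in the cloud location is a copy of the exported file, and can check that each copy is later deleted.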

In addition, any intermediate files containing personal data, eg compressed versions of files created to transfer data between systems, should be deleted as soon as they are no longer required.
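One way to make this deletion hard to forget is to tie the lifetime of the intermediate file to the transfer step itself. The sketch below is an illustration only: `transfer_compressed` and `fake_send` are hypothetical names, and the actual transfer mechanism is assumed to be supplied by the caller.

```python
import gzip
import os
import tempfile


def transfer_compressed(data, send):
    """Compress data into a temporary file, pass it to send, then delete it."""
    fd, path = tempfile.mkstemp(suffix=".csv.gz")
    os.close(fd)
    try:
        with gzip.open(path, "wb") as handle:
            handle.write(data)
        send(path)
    finally:
        # Runs even if send() raises, so the intermediate copy does not linger.
        if os.path.exists(path):
            os.remove(path)
    return path


sent_sizes = []


def fake_send(path):
    """Hypothetical stand-in for the real transfer step."""
    sent_sizes.append(os.path.getsize(path))


leftover = transfer_compressed(b"candidate_id,role\n101,sales manager\n", fake_send)
print(os.path.exists(leftover))
```

Removing a file in this way does not overwrite the underlying storage. Whether further erasure steps are needed depends on the storage medium and the organisation's security policy.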

Depending on the likelihood and severity of the risk to data subjects, organisations may also need to apply de-identification techniques to training data before it is extracted from its source and shared internally or externally.

For example, the employers may need to remove certain features from their HR data, or apply privacy enhancing technologies (PETs) like differential privacy, before sharing it with the recruitment firm.
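The two ideas can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the field names are hypothetical, the privacy parameter `epsilon` is chosen arbitrarily, and the noise is added to a single count rather than to a full training set. Dropping direct identifiers does not by itself make data anonymous.

```python
import random

# Hypothetical names of fields that directly identify a candidate.
DIRECT_IDENTIFIERS = {"name", "email", "address"}


def drop_identifiers(record):
    """Return a copy of the record without directly identifying fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query, which has sensitivity 1."""
    scale = 1.0 / epsilon
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution centred on zero with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


records = [
    {"name": "A. Candidate", "email": "a@example.com",
     "role": "sales manager", "hired": True},
    {"name": "B. Candidate", "email": "b@example.com",
     "role": "sales manager", "hired": False},
]

shared = [drop_identifiers(r) for r in records]
released = noisy_count(sum(r["hired"] for r in shared), epsilon=0.5,
                       rng=random.Random(0))

print(shared)
print(released)
```

A smaller `epsilon` adds more noise and gives stronger protection, at the cost of less accurate statistics.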

For more on these techniques, see our Anonymisation Code of Practice and future blog posts on data minimisation. New guidance on Anonymisation will also be published soon.

Risk example #2 - Security risks introduced by externally maintained software used to build AI systems

Very few organisations build AI systems entirely in-house. In most cases, the design, building, and running of AI systems will be provided, at least in part, by third parties that the organisation may not always have a contractual relationship with.

Even if an organisation hires its own ML engineers, they may still rely significantly on third-party frameworks and code libraries. In fact, many of the most popular ML development frameworks are open source.

Using third-party and open source code is a valid option. Developing all software components of an AI system from scratch requires a large investment of time and resources that many organisations cannot afford. Software developed in this way would also miss out on the rich ecosystem of contributors and services that has built up around existing frameworks, especially open source ones.

However, one important drawback is that these standard ML frameworks often depend on other pieces of software being already installed on an IT system. To give a sense of the risks involved, a recent study found the most popular ML development frameworks include up to 887,000 lines of code and rely on 137 external dependencies. Therefore implementing AI will require changes to an organisation’s software stack (and possibly hardware) that may introduce additional security risks.

For example, let’s say the recruitment firm above hired an ML engineer to build the automated CV filtering system using a Python-based ML framework. The ML framework depends on a number of specialist open-source programming libraries, which need to be downloaded onto the firm’s IT system.

One of these libraries contains a software function to convert the raw training data into the format required to train the ML model. It is later discovered that the function has a security vulnerability. Due to an unsafe default configuration, an attacker was able to introduce and execute malicious code remotely on the system by disguising it as training data.

This is not a far-fetched example. In January 2019, such a vulnerability was discovered in ‘NumPy’, a popular library for the Python programming language used by many machine learning developers.
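To illustrate the general class of problem, the sketch below assumes the flaw is of the kind where loading a data file can deserialise arbitrary Python objects, as in the publicly reported issue with NumPy's `load` function and pickled data (CVE-2019-6446). It uses only the standard `pickle` module and follows the approach of overriding `find_class` described in the Python documentation. The names `DataOnlyUnpickler` and `load_data_only` are illustrative.

```python
import io
import pickle


class DataOnlyUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any class or function."""

    def find_class(self, module, name):
        # Plain containers, strings and numbers never reach this method;
        # only pickles that reference importable objects do.
        raise pickle.UnpicklingError(f"blocked reference to {module}.{name}")


def load_data_only(raw_bytes):
    """Load a pickle, rejecting anything other than plain data."""
    return DataOnlyUnpickler(io.BytesIO(raw_bytes)).load()


plain = pickle.dumps({"candidate_id": 101, "skills": ["sales", "negotiation"]})
print(load_data_only(plain))

# A pickle that references a callable stands in for a disguised payload.
suspicious = pickle.dumps(print)
try:
    load_data_only(suspicious)
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print(blocked)
```

This narrows what a file can do when it is loaded, but it is not a complete defence. The Python documentation advises against unpickling data from untrusted sources at all, so a data-only format is usually the safer choice for training data.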

What should organisations do?

Whether AI systems are built in house, externally, or a combination of both, they will need to be assessed for security risks. As well as ensuring the security of any code developed in-house, organisations need to assess the security of any externally maintained code and frameworks.

The ICO has already produced some guidance on managing security of internal and external code in the related context of online services. This includes external code security measures, such as subscribing to security advisories to be notified of vulnerabilities, and internal code security measures, such as coding standards and source code review. The same or similar measures will apply to AI applications. However, as we mentioned at the beginning, the ICO is developing further security guidance, which will include additional recommendations for the oversight and review of externally maintained source code, as well as its implications for security and data protection by design.
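A practical first step towards tracking advisories is knowing which externally maintained packages are actually installed. The following sketch uses Python's standard `importlib.metadata` module to build such an inventory. The helper names are illustrative, and comparing the inventory against a vulnerability feed is assumed to happen elsewhere.

```python
from importlib import metadata


def installed_packages():
    """Map each installed distribution name to its version."""
    inventory = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata lacks a name
            inventory[name.lower()] = dist.version
    return inventory


def direct_requirements(package):
    """List the declared requirements of one package, if it is installed."""
    try:
        return metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []


inventory = installed_packages()
for name in sorted(inventory):
    print(name, inventory[name], len(direct_requirements(name)))
```

The output gives a name, a version and the number of declared requirements for each package, which can then be checked against the security advisories the organisation subscribes to.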

In addition, organisations developing ML systems can further mitigate the security risks associated with third-party code by separating the ML development environment from the rest of their IT infrastructure where possible.

Two ways to achieve this are:

Use ‘virtual machines’ or ‘containers’ - emulations of a computer system that run inside, but are isolated from, the rest of the IT system. These can be pre-configured specifically for ML tasks. In our recruitment example, if the ML engineer had used a virtual machine, then the vulnerability could have been contained.
Convert the model into a different programming language before deployment. Many ML systems are developed using programming languages that are well suited to scientific and machine learning uses, like Python, but are not necessarily the most secure. However, it is possible to train an ML model using one programming language (eg Python) and then, before deployment, convert the model into another language (eg Java) that makes insecure coding less likely. To return to our recruitment example, the ML engineer could also have mitigated the risk of a malicious attack on the CV filtering model by converting it into a different programming language prior to deployment.
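One simple route towards the second option is to export only the learned parameters in a data-only format, so that the deployed system never has to load Python objects. The sketch below assumes a linear scoring model with hypothetical feature names. The `score` function is written in Python for illustration, but it stands in for the few lines that a runtime in another language, such as Java, would need to reimplement.

```python
import json


def export_linear_model(weights, bias, feature_names):
    """Serialise a linear model's parameters as plain JSON."""
    if len(weights) != len(feature_names):
        raise ValueError("one weight is required per feature")
    return json.dumps({
        "features": list(feature_names),
        "weights": [float(w) for w in weights],
        "bias": float(bias),
    })


def score(model_json, features):
    """Reference scoring logic that a deployment runtime would mirror."""
    model = json.loads(model_json)
    total = model["bias"]
    for name, weight in zip(model["features"], model["weights"]):
        total += weight * features[name]
    return total


exported = export_linear_model([0.5, 2.0], 1.0,
                               ["years_experience", "skills_matched"])
print(score(exported, {"years_experience": 4, "skills_matched": 3}))
```

Because the exported file contains only numbers and strings, loading it cannot execute code. More complex models would need a dedicated interchange format rather than hand-written JSON.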

Your feedback

We would like to hear your views on this topic and welcome any feedback on our current thinking. In particular, we would appreciate your insights on the following questions:

How and to what degree are organisations currently inspecting externally maintained software code for potential vulnerabilities?
Are there any other well-known security risks which AI is likely to exacerbate? If so, which ones and what effect will AI have?
What should any additional ICO security guidance cover?

Please share your views by emailing us at AIAuditingFramework@ico.org.uk