Private AI: machine learning on encrypted data


Encrypting data at rest and in transit (structured and unstructured) is already common practice within organizations to protect confidential, secret and proprietary information. Encrypting data in use, however, is far less common. Data in use is data held in a non-persistent digital state and/or actively being processed, for example during the life cycle of a machine learning (ML) model.

Leverage Machine Learning in Human Resources (HR)

One of the interesting use cases for any HR department is being able to reduce employee turnover, and its associated cost, by looking at employee satisfaction. Most of the time, employee satisfaction is not just a numerical value – so we have to infer it. To do this, it is best to design an ML model that learns from past experiences. For example, it can examine what conditions, or combination of conditions, caused former employees to leave the company. It can then use these conditions to predict employee churn rate and give companies insight into patterns and conditions that could be improved to increase employee retention.

Another use case would be to use natural language processing to match candidates with vacancies, based on their resumes. In either case, the machine learning model must process sensitive personal data that requires special handling during storage, transit, and use.

Consider a scenario where Company A, Ericsson for example, holds sensitive employee data. Company B, a third party, has a specialized ML model and the expertise in the field. To apply ML to Ericsson’s business challenges, the two companies would need to collaborate and share data across organizational boundaries.

Due to the sensitivity of the data, Ericsson wants Company B to apply its mathematical function to Ericsson’s data without “seeing” (decrypting) the data. But traditional encryption, such as the Advanced Encryption Standard (AES) or Rivest-Shamir-Adleman (RSA), works in such a way that data is not usable in its encrypted state. Company B would therefore need to decrypt Ericsson’s data before applying its mathematical model, exposing sensitive information. So what is the solution?

Emerging Technologies for Machine Learning on Encrypted Data

The Automation and AI team within Ericsson’s IT group is currently investigating the latest technologies for keeping data confidential while it is in use. Two of the most promising emerging techniques are Secure Multiparty Computation (SMPC) and Homomorphic Encryption (HE).

Secure Multiparty Computation (SMPC)

SMPC is the act of jointly computing a function while keeping the inputs private. This allows data scientists and analysts to compute on distributed data without ever exposing it.

From a functionality perspective, there are four main steps:

  • The data analyst or data scientist (external to Ericsson) defines and writes the function to be computed on the data (regression, average, etc.). The analyst also selects which of the data sources provided by the data owner to use. Note that the analyst never owns the data. The data owner, Ericsson in this case, launches a virtual machine or container in the privacy zone of the Ericsson Corporate Network to access the data. The analyst sees only the headers and metadata of the remote data source. The analyst then triggers the computation of the function on the selected data.
  • Next, the function is compiled into a binary, random numbers are generated, and both the binary and the randomness are distributed to the compute engines. In practice, plain-text datasets are parsed and converted into binary files that the compute engines can process, and the random numbers are used to mask the data shared between the distributed compute engines. Note that the ML algorithm is executed locally, where the data resides.
  • The compute engines then execute the binary and communicate with each other by exchanging randomized data.
  • Once the computation is complete, the results are encrypted and sent to the data analyst.
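
To make the role of the random numbers more concrete, here is a minimal sketch of additive secret sharing, one common building block of SMPC. It is an illustration only, not the protocol used in the setup described above: the modulus, the function names and the sample values are all assumptions. Each private input is split into random-looking shares, every compute engine adds up the shares it receives, and only the combination of all partial results reveals the final sum.

```python
import secrets

# Assumed modulus for the sketch: all share arithmetic is done modulo this prime.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares; any strict subset of them looks uniformly random."""
    return sum(shares) % PRIME


def secure_sum(private_inputs):
    """Sum the inputs without any single engine seeing an individual input."""
    n = len(private_inputs)
    # Each data owner splits its input into one share per compute engine.
    all_shares = [share(v, n) for v in private_inputs]
    # Engine j only receives the j-th share of every input and adds them locally.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the combination of all partial sums reveals the result.
    return reconstruct(partial_sums)


# Hypothetical private values held by three different parties.
private_values = [52000, 61000, 58000]
total = secure_sum(private_values)
average = total / len(private_values)
```

The sketch only covers addition. Functions such as regression need additional protocol steps (for example for multiplication), which is where compiled binaries and pre-distributed randomness come in.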

This technique can be applied in fraud management use cases, where communication service providers can learn the fraud model from each other while maintaining confidentiality and avoiding disclosure of their system architecture (including its strengths and weaknesses).

Homomorphic Encryption (HE)

HE is a method that allows analysts and data scientists to compute analytical functions on encrypted data (ciphertext) without needing to decrypt it. HE schemes are classified into different types according to the mathematical operations they allow and the number of times these operations can be performed. Modern HE schemes are typically based on the Ring Learning with Errors (RLWE) problem, a lattice problem that is believed to be computationally hard and that is, as an added benefit, considered quantum safe.
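
The idea of computing on ciphertext can be illustrated with a toy example. The sketch below uses the Paillier cryptosystem, a partially homomorphic scheme that supports addition only. It was chosen here because it fits in a few lines of standard-library Python; it is not RLWE-based and is not the scheme discussed in this article. The primes are deliberately tiny and all function names are assumptions, so this is not suitable for real use.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy key generation from two known (far too small) primes."""
    n = p * q                      # public key
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)           # modular inverse, Python 3.8+
    return n, (phi, mu)            # (public key, private key)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n; fresh randomness r on every call."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    # With generator g = n + 1, g^m mod n^2 simplifies to 1 + m*n.
    return ((1 + m * n) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    phi, mu = priv
    return ((pow(c, phi, n * n) - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def mul_plain(n, c, k):
    """Raising a ciphertext to k multiplies the plaintext by the constant k."""
    return pow(c, k, n * n)


n, priv = keygen()
enc_a = encrypt(n, 17)
enc_b = encrypt(n, 25)

# These two operations need only the public key n, never the private key.
enc_sum = add_encrypted(n, enc_a, enc_b)
enc_scaled = mul_plain(n, enc_a, 3)

plain_sum = decrypt(n, priv, enc_sum)        # 17 + 25
plain_scaled = decrypt(n, priv, enc_scaled)  # 17 * 3
```

Schemes that allow only one kind of operation, like this one, are called partially homomorphic. Schemes that allow both addition and multiplication, a limited or unlimited number of times, fall into the other types mentioned above.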

In homomorphic encryption, we define a trusted zone where the clear data is stored. Again, the data is located in the privacy zone of the Ericsson Corporate Network. In this trusted zone, data is encrypted using a homomorphic encryption scheme such as Cheon-Kim-Kim-Song (CKKS).

New technology in action

Let’s go back to the use case where a third party performs a CV-job match through a cloud microservice. As the matching computation is done in the public cloud, we don’t want sensitive data to be decrypted, even though we trust the third-party service provider. The public cloud would be our untrusted zone. Data moves between the trusted zone and the untrusted zone in homomorphically encrypted form, maintaining the same confidentiality and security advantages as traditional encryption such as AES.

In the untrusted zone, the data analyst defines and writes the function to be computed on the data (regression, mean, etc.). The function is executed directly on the encrypted data, without the decryption step that other encryption methods would require. The result of the function is itself encrypted (so it is not visible to the entity executing the function) and is returned to the trusted zone for decryption and interpretation. An illustration of this flow can be seen in Figure 1.

Figure 1. High-level design of the homomorphic encryption execution flow. The trusted zone is the Ericsson private network. The untrusted zone represents the public cloud.

In this proposed scenario, Company B owns both the function and the compute resources, so Company A must trust Company B to operate honestly. Such trust can be established through business and accountability agreements such as nondisclosure agreements, where Company B has a business incentive to apply the homomorphic schemes to its functions appropriately and meet the expectations of Company A.

To validate the potential of this technique to meet the business challenge of performing computations on private and sensitive data, a proof of concept (PoC) was implemented.

Proof of concept validation

The unencrypted version of this PoC is a simple Python script that trains a logistic regression model on synthetic HR data to predict employee churn. The trained model is then applied to a test dataset in the traditional, unencrypted manner, resulting in a churn prediction accuracy of about 73% (see Figure 2).
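
For readers who want to picture such a script, here is a minimal sketch of a plaintext logistic regression on synthetic data, written with the Python standard library only. It is not the PoC code: the two features (satisfaction and overtime), the rule that generates the churn labels, and the training settings are all assumptions made for illustration, so the accuracy it reaches has no relation to the 73% reported above.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_hr(n_rows, rng):
    """Hypothetical data: low satisfaction and overtime make churn more likely."""
    rows = []
    for _ in range(n_rows):
        satisfaction = rng.random()               # 0.0 (low) to 1.0 (high)
        overtime = 1.0 if rng.random() < 0.5 else 0.0
        score = -4.0 * satisfaction + 2.0 * overtime + 1.0 + rng.gauss(0.0, 0.5)
        rows.append(([satisfaction, overtime], 1 if score > 0 else 0))
    return rows


def train(rows, epochs=500, lr=0.5):
    """Full-batch gradient descent on the logistic loss."""
    w = [0.0] * len(rows[0][0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in rows:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def predict(w, b, x):
    """Return 1 (churn) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def accuracy(rows, w, b):
    return sum(predict(w, b, x) == y for x, y in rows) / len(rows)


rng = random.Random(0)                 # fixed seed so the run is repeatable
train_rows = make_synthetic_hr(300, rng)
test_rows = make_synthetic_hr(200, rng)
weights, bias = train(train_rows)
test_accuracy = accuracy(test_rows, weights, bias)
```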

Figure 2. The result of unencrypted logistic regression assessment on HR test data

This same trained model is then hosted on an encrypted evaluation microservice, which listens for evaluation requests from clients. A client provides an encrypted version of the dataset it wants to evaluate and specifies which model on the server to use. The server evaluates the data in a fully encrypted manner and returns an encrypted prediction, which the client can then decrypt and use to calculate, among other things, the accuracy. In this PoC, the accuracy of the encrypted evaluation was 73% (see Figure 3).
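
The client and server roles in this flow can be sketched as follows. This is a simplified illustration, not the PoC implementation: it uses a toy Paillier scheme (additive only, tiny primes) instead of an RLWE-based scheme such as CKKS, the model weights and the employee record are hypothetical, and the server returns only the encrypted linear score, leaving the sigmoid to the client after decryption. Real values are encoded as fixed-point integers with an assumed scale of 1000.

```python
import math
import secrets

SCALE = 1000  # assumed fixed-point scale for encoding real numbers as integers


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier keys from known, far too small primes."""
    n = p * q
    phi = (p - 1) * (q - 1)
    return n, (phi, pow(phi, -1, n))


def encrypt(n, m):
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    # Negative integers are represented modulo n.
    return ((1 + (m % n) * n) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    phi, mu = priv
    m = ((pow(c, phi, n * n) - 1) // n * mu) % n
    return m - n if m > n // 2 else m  # map back to a signed integer


def server_score(n, enc_features, weights, bias):
    """Untrusted zone: computes Enc(w.x + b) using the public key only."""
    n2 = n * n
    acc = encrypt(n, round(bias * SCALE * SCALE))
    for c, w in zip(enc_features, weights):
        k = round(w * SCALE) % n           # plaintext weight as a fixed-point integer
        acc = (acc * pow(c, k, n2)) % n2   # adds w * x to the encrypted sum
    return acc


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Trusted zone (client): generate keys and encrypt one employee record.
n, priv = keygen()
employee = [0.2, 1.0]                      # hypothetical: satisfaction, overtime
enc_features = [encrypt(n, round(x * SCALE)) for x in employee]

# Untrusted zone (server): hypothetical trained model, never sees plaintext data.
model_weights, model_bias = [-4.0, 2.0], 1.0
enc_score = server_score(n, enc_features, model_weights, model_bias)

# Trusted zone (client): decrypt, undo the fixed-point scaling, apply the sigmoid.
score = decrypt(n, priv, enc_score) / (SCALE * SCALE)
churn_probability = sigmoid(score)
predicted_churn = churn_probability >= 0.5
```

Because the encrypted arithmetic here is exact integer arithmetic, the decrypted score matches the plaintext computation up to the fixed-point rounding of the inputs. Approximate schemes such as CKKS introduce a small numerical error instead, which is one reason comparing encrypted and unencrypted accuracy is a useful check.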

Figure 3. The result of the encrypted evaluation of the logistic regression on the HR test data

It is important to point out that our server implementation succeeded in a) running an ML model on homomorphically encrypted data and b) achieving an accuracy of 73% for the encrypted evaluation, the same as in the unencrypted (traditional ML) case. In this PoC, there was no loss of accuracy when applying homomorphic encryption to an ML model evaluated on sensitive data.

Enabling a safer and smarter future

These two emerging technologies both meet the growing need for better privacy protection in businesses. An important distinction to make, however, is that SMPC is a broad architectural framework for designing computations that preserve the confidentiality of the parties involved. This contrasts with HE, which is a technology that operates on a comparatively smaller scale, being used as an encryption method within larger systems or use cases, including SMPC.

HE has the potential to complement SMPC by enhancing its privacy and security properties, so that it can better withstand scenarios where more than half of the parties involved in a multiparty computation are compromised or malicious, a situation that traditional SMPC solutions struggle with.

These technologies are both expected to play an important role in Ericsson’s digitization journey, helping to create a smart automation engine that delivers day-to-day business value.

Learn more

Learn more about machine learning use cases and how to design ML architectures for today’s telecommunications systems.

Read our blog on Machine Learning Lifecycle Management.

Find out how AI and ML algorithms are transforming the customer experience in telecommunications.

Learn more about AI in telecommunications networks.
