Synthetic data unlocks true cross-organizational data portability
Article by: Adam Cornille
There’s no doubt organizations are collecting more data than ever. In fact, 90 percent of all data has been created in the last two years. Yet, with more data, more problems. Organizations of all sizes are consistently struggling to access and understand all that data.
The only way to unlock the value of all that data is to make it portable across silos, boundaries and borders. But portability can't come at the risk of leaking private information. Organizations are only beginning to realize the privacy implications of piecing together massive amounts of personal data: the resulting patterns yield valuable insights into user behavior, but they also carry the risk of data exposure.
Organizations must balance the need for data agility and innovation with governance, compliance and security.
Over the last few years, AI-trained synthetic data has emerged as a high-quality drop-in replacement for real data. Synthetic data retains the statistically equivalent behavioral patterns of the raw data it's trained on, without the re-identification risk that dogs traditional data masking and anonymization methods. It delivers the necessary value of the real data (usability, compliance and room to innovate) without exposing a single real record. And synthetic data achieves all this with a measurable level of privacy, which reassures enterprise information security teams.
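Hazy's generators themselves are proprietary, but the underlying principle can be sketched in a few lines. The example below is my own toy illustration, with invented column names and purely numeric data: it fits a simple generative model to a "real" table, samples entirely new rows, and checks that the means and the income-balance correlation survive.

```python
# Toy sketch of the principle behind tabular synthetic data (not Hazy's
# method): fit a generative model, sample brand-new rows, compare statistics.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical "real" customer table: balance correlated with income.
income = rng.normal(40_000, 12_000, size=5_000)
balance = income * 0.3 + rng.normal(0, 3_000, size=5_000)
real = pd.DataFrame({"income": income, "balance": balance})

# Fit a Gaussian mixture as a stand-in generative model, then sample
# entirely new rows: no original record appears in the output.
model = GaussianMixture(n_components=5, random_state=0).fit(real.values)
synthetic = pd.DataFrame(model.sample(len(real))[0], columns=real.columns)

# The synthetic table should preserve the statistical signal.
print(real.mean().round(0), synthetic.mean().round(0), sep="\n\n")
print("real corr:     ", round(real.corr().loc["income", "balance"], 3))
print("synthetic corr:", round(synthetic.corr().loc["income", "balance"], 3))
```

Production-grade generators handle categorical and sequential columns and add formal privacy guarantees on top; the point here is only that the synthetic rows carry the signal without carrying the records.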
Synthetic data enables organizations to share data-driven insights across departments, divisions and third parties without risk of data leakage. It makes data portable and breaks down geographical data silos for truly cross-organizational analytics. It lets organizations evaluate potential vendors, services and algorithms in innovation sandpits. And it speeds up decision making, without exposing real data or waiting on access to it.
I recently cohosted a webinar on Smart Synthetic Data with Harry Keen of synthetic data company Hazy and Tom Davis of Microsoft, where we dove into the topic. We encourage you to listen to that dynamic conversation around privacy and the potential of data portability. In this piece, we introduce the potential of synthetic data to drive that portability.
The burden of cross-enterprise data
We no longer live in a world where organizations build everything from scratch. Enterprises simply don't have the time to design, build and test features at a rate that can compete with disruptors such as fintech challengers. It has become best practice to partner with third-party vendors to leverage their tools and services. These collaborations can be anything from access to the cloud to SaaS tools like customer relationship managers (CRMs) to full strategic partnerships.
However, heavily regulated multinational institutions like banks are struggling not only to compete with these up-and-coming challengers, but also to navigate cross-border and cross-organizational laws and privacy regulations.
Nationwide Building Society faced this struggle when looking to evaluate potential innovation partners. When they applied data masking techniques, they discovered that, once paired with unrelated sources such as open data, the masked data could be de-anonymized. That was not an acceptable risk.
Data masking and data anonymization are widespread practices that are notorious for exactly this risk of re-identification. They also fail to meet most quality standards, because anonymization does not preserve the key statistical relationships and patterns in the original data.
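Nationwide's finding is an instance of a classic linkage attack: fields that look harmless on their own, such as a postcode district or a birth year, can uniquely identify a person once they are joined against an open dataset. The sketch below is hypothetical (all column names and values are invented), but it shows how little code such a join takes.

```python
# Hypothetical linkage attack: "masked" data with direct identifiers removed
# is joined to an open dataset on shared quasi-identifiers.
import pandas as pd

# Masked internal data: names stripped, quasi-identifiers retained.
masked = pd.DataFrame({
    "postcode_district": ["SN38", "SN38", "SW1A"],
    "birth_year":        [1981,   1990,   1975],
    "account_balance":   [12_400, 530,    98_000],
})

# Open data (electoral rolls, public profiles) with names attached.
open_data = pd.DataFrame({
    "name":              ["A. Smith", "B. Jones", "C. Patel"],
    "postcode_district": ["SN38",     "SN38",     "SW1A"],
    "birth_year":        [1981,       1990,       1975],
})

# A simple join on the quasi-identifiers re-attaches names to balances.
reidentified = masked.merge(open_data, on=["postcode_district", "birth_year"])
print(reidentified)
```

Because fully synthetic rows describe no real individual, there is no record for a join like this to land on.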
In the end, Nationwide settled on synthetic data generated by Hazy as this was shown to preserve the key behavioral signatures and relationships between data points, while exposing none of the original data.
At Nationwide, Hazy synthetic data had to stand up against four criteria:
• Evaluation of potential third-party partners — Nationwide partners with third-party integrators to create improved services for their customers. To achieve this, Nationwide needs to share realistic data to truly evaluate the technical offering of these external apps and tools.
• Risk mitigation for the data engineer — Until recently, when a developer created a batch feed or intake, it was built and tested against live data. Developer environments aren't as secure as production environments and risk leaking data. Using synthetic data mitigates this risk.
• Reusability for data analytics teams — Data anonymization often needs to be repeated manually for every business application and is therefore resource-intensive. Nationwide was looking for a synthetic data tool that could generate data quickly and that could be used flexibly across multiple business applications.
• Safe sequential data for behavior analytics — Nationwide needed a way to analyze behavior patterns over time without exposing personal information (a rough sketch of the idea follows this list).
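Sequential data is the trickiest of the four requirements, because the value lies in the ordering of events rather than in the columns alone. As a loose, hypothetical illustration of the idea (this is not Hazy's sequential modelling, which is far more sophisticated), a first-order Markov chain can learn transition probabilities from real event sequences and emit entirely new ones:

```python
# Hypothetical illustration of sequential synthesis: learn event-to-event
# transition counts from real behaviour sequences, then sample new sequences.
# Event names and journeys below are invented for illustration only.
import random
from collections import Counter, defaultdict

real_sequences = [
    ["login", "view_balance", "logout"],
    ["login", "view_balance", "transfer", "logout"],
    ["login", "transfer", "logout"],
]

# Count observed transitions, with start and end markers.
transitions = defaultdict(Counter)
for seq in real_sequences:
    for prev, nxt in zip(["<start>"] + seq, seq + ["<end>"]):
        transitions[prev][nxt] += 1

def sample_sequence(rng: random.Random) -> list:
    """Walk the learned transition table until the end marker is reached."""
    event, journey = "<start>", []
    while True:
        nxt_counts = transitions[event]
        event = rng.choices(list(nxt_counts), weights=nxt_counts.values())[0]
        if event == "<end>":
            return journey
        journey.append(event)

rng = random.Random(0)
print([sample_sequence(rng) for _ in range(3)])
```

In this toy the possible journeys are so few that samples can coincide with inputs; real sequence generators model far richer spaces (timing, amounts, covariates) and add explicit privacy controls on top.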
Most importantly, this had to be done in an auditable process, ensuring the statistical quality of the data was measured and maintained.
In the end, Hazy synthetic data generation was chosen because it was proven to limit Nationwide's risk of data exposure and regulatory fines, while still retaining measurable quality through Hazy's advanced quality assessment metrics.
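Hazy's own quality metrics go well beyond this, but the idea of auditable, measurable quality can be illustrated with simple statistics: compare each column's marginal distribution and the overall correlation structure between the real and synthetic tables. A minimal sketch, assuming two pandas DataFrames with matching numeric columns:

```python
# Illustrative quality check (not Hazy's actual metrics): score how closely a
# synthetic table tracks the real one, per column and across relationships.
import pandas as pd
from scipy.stats import ks_2samp

def quality_report(real: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Compare numeric columns of two tables with simple, auditable metrics."""
    # Per-column marginal similarity: 1 minus the Kolmogorov-Smirnov distance,
    # so 1.0 means the two distributions are indistinguishable by this test.
    marginal_scores = {
        col: 1 - ks_2samp(real[col], synthetic[col]).statistic
        for col in real.columns
    }
    # Relationship preservation: mean absolute gap between the two
    # correlation matrices (0.0 is a perfect match).
    correlation_gap = float((real.corr() - synthetic.corr()).abs().values.mean())
    return {"marginal_scores": marginal_scores, "correlation_gap": correlation_gap}
```

Run against the toy tables from the earlier sketch, we would expect marginal scores near 1.0 and a small correlation gap; a drop in either number is exactly the kind of auditable signal a review process can track release over release.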
It used to take the Nationwide strategic partnership team six months to evaluate new data technology, services and vendors. With Hazy, now it just takes them three days.
To learn more about the world-changing potential of time-series synthetic data to drive data portability, read the full article here: https://www.logic2020.com/insight/synthetic-data-portability?utm_source=social&utm_medium=Medium&utm_campaign=Business_Insights