Regulatory uncertainty overshadows gen AI despite pace of adoption
Data governance
In traditional application development, enterprises have to make sure end users can’t access data they don’t have permission to see. For example, in an HR application, an employee might be allowed to see their own salary and benefits information, but not that of other employees. If such a tool is augmented or replaced by an HR chatbot powered by gen AI, the chatbot will need access to the employee database so it can answer user questions. But how can a company be sure the AI doesn’t tell everything it knows to anyone who asks?
This is particularly important for customer-facing chatbots that might have to answer questions about customers’ financial transactions or medical records. Protecting access to sensitive data is just one part of the data governance picture.
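One common pattern is to enforce those permissions in the retrieval layer rather than trusting the model itself: the chatbot is only ever handed rows the requesting employee is already entitled to see. Here’s a minimal sketch of the idea, with hypothetical record and function names rather than any particular vendor’s implementation:

```python
# Hypothetical sketch: enforce access control before any data reaches the
# LLM prompt. The record schema and function names are illustrative.
from dataclasses import dataclass

@dataclass
class SalaryRecord:
    employee_id: str
    salary: int

def permitted_records(requesting_employee_id: str, records: list[SalaryRecord]) -> list[SalaryRecord]:
    """Return only the rows the requesting employee may see (here: their own)."""
    return [r for r in records if r.employee_id == requesting_employee_id]

def build_prompt(question: str, requesting_employee_id: str, records: list[SalaryRecord]) -> str:
    # The prompt is assembled only from permitted rows, so the model
    # cannot leak data it was never given.
    allowed = permitted_records(requesting_employee_id, records)
    context = "\n".join(f"{r.employee_id}: {r.salary}" for r in allowed)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The point is that the model never has to be trusted to withhold anything; the filtering happens before the prompt is built.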
“You need to know where the data’s coming from, how it’s transformed, and what the outputs are,” says Nick Amabile, CEO at DAS42, a data consulting firm. “Companies in general are still having problems with data governance.”
And with large language models (LLMs), data governance is in its infancy.
“We’re still in the pilot phases of evaluating LLMs,” he says. “Some vendors have started to talk about how they’re going to add governance features to their platforms. Retraining, deployment, operations, testing—a lot of these features just aren’t available yet.”
As companies mature in their understanding and use of gen AI, they’ll have to put safeguards in place, says Juan Orlandini, CTO, North America at Insight, a Tempe-based solution integrator. That can include learning how to verify that correct controls are in place, models are isolated, and they’re appropriately used, he says.
“When we created our own gen AI policy, we stood up our own instance of ChatGPT and deployed it to all 14,000 teammates globally,” he says. Insight used the Azure OpenAI Service to do this.
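For teams taking a similar route, a private deployment is typically reached through the Azure OpenAI API rather than the public ChatGPT endpoint. A rough sketch using the openai Python package (v1 or later) looks like the following; the endpoint, deployment name, and API version here are placeholders, not Insight’s actual configuration:

```python
# Minimal sketch of calling a company-hosted Azure OpenAI deployment instead
# of the public ChatGPT service. All names below are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",  # your private Azure resource
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="your-gpt-deployment",  # the deployment name created in Azure
    messages=[{"role": "user", "content": "Summarize our travel policy."}],
)
print(response.choices[0].message.content)
```

Because the deployment lives in the company’s own Azure subscription, prompts and completions are handled there rather than in a public consumer service.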
The company is also training its employees on how to use AI safely, especially tools that haven’t yet been vetted and approved for secure use. For example, employees should treat such tools as they would any social media platform, where anyone could potentially see what they post.
“Would you put your client’s sales forecast into Facebook? Probably not,” Orlandini says.
Layers of control
There’s no guarantee a gen AI model won’t produce biased or dangerous results. These models are designed to create new material, and the same request can produce a different result every time. This is very different from traditional software, where a particular set of inputs produces a predictable set of outputs.
“Testing will only show the presence of errors, not the absence,” says Martin Fix, technology director at Star, a technology consulting company. “AI is a black box. All you have are statistical methods to observe the output and measure it, and it’s not possible to test the whole area of capability of AI.”
That’s because users can enter any prompt they can imagine into an LLM, and researchers have spent months finding new ways to trick AIs into performing objectionable actions, a practice known as “jailbreaking” the AIs.
Some companies are also looking at using other AIs to check results for risky outputs, or at using data loss prevention and other security tools to keep users from putting sensitive data into prompts in the first place.
“You can reduce the risks by combining different technologies, creating layers of safety and security,” says Fix.
This is going to be especially important if an AI is running inside a company and has access to large swathes of corporate data.
“If an AI has access to all of it, it can disclose all of it,” he says. “So you have to be much more thorough in the security of the system and put in as many layers as necessary.”
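A minimal illustration of that layering combines redaction on the way in with screening on the way out. The patterns and blocklist below are simplified stand-ins for a real data loss prevention policy:

```python
# Illustrative layering sketch, not a production DLP policy: redact obvious
# sensitive patterns from the prompt, then screen the output before returning it.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact_prompt(prompt: str) -> str:
    """First layer: strip sensitive identifiers before the prompt leaves the perimeter."""
    prompt = SSN_PATTERN.sub("[REDACTED-SSN]", prompt)
    return CARD_PATTERN.sub("[REDACTED-CARD]", prompt)

def screen_output(output: str, blocked_terms: list[str]) -> str:
    """Second layer: withhold responses that surface restricted terms."""
    lowered = output.lower()
    if any(term.lower() in lowered for term in blocked_terms):
        return "Response withheld: it may contain restricted information."
    return output
```

Each layer is imperfect on its own, which is exactly why Fix argues for stacking them.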
The open source approach
Commercial AI systems, like OpenAI’s ChatGPT, are like the black boxes Fix describes: enterprises have little insight into the training data that goes into them, how they’re fine-tuned, what information feeds their ongoing training, how the AI actually makes its decisions, and exactly how all the data involved is secured. In highly regulated industries in particular, some enterprises may be reluctant to take a risk on these opaque systems. One option, however, is to use open source software. There are a number of models, under various licenses, currently available to the public. In July, that list expanded significantly when Meta released Llama 2, an enterprise-grade LLM available in three sizes, licensed for commercial use, and completely free to enterprises, at least for applications with fewer than 700 million monthly active users.
Enterprises can download, install, fine-tune and run Llama 2 themselves, in either its original form or one of its many variations, or use third-party AI systems based on Llama 2.
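For teams that self-host, a Llama 2 chat variant is commonly loaded with the Hugging Face transformers library. The sketch below assumes you have accepted Meta’s license on Hugging Face and have the accelerate package plus enough GPU memory for the 7B chat model; the prompt and generation settings are only illustrative:

```python
# Sketch of running a Llama 2 chat model locally with Hugging Face transformers.
# Assumes the Meta license has been accepted and accelerate is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Summarize our incident-response runbook.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the weights and the prompts never leave the company’s own infrastructure, this setup sidesteps many of the data-exposure questions raised by public chatbots.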
For example, patient health company Aiberry uses customized open-source models, including Flan-T5, Llama 2, and Vicuna, says Michael Mullarkey, the company’s senior clinical data scientist.
The models run within Aiberry’s secure data infrastructure, he says, and are fine-tuned to perform in a way that meets the company’s needs. “This seems to be working well,” he says.
Aiberry has a data set it uses for training, testing, and validating these models, which try to anticipate what clinicians need and provide information up front based on assessments of patient screening information.
“For other parts of our workflows that don’t involve sensitive data, we use ChatGPT, Claude, and other commercial models,” he adds.
Running open source software on-prem or in private clouds can help reduce risks, such as data loss, and can help companies comply with data sovereignty and privacy regulations. But open source software carries its own risks, especially as AI projects multiply on open source repositories. That includes cybersecurity risk: in regulated industries in particular, companies have to be careful about the open source code they run in their systems, since flaws in it can lead to data breaches, privacy violations, or the biased or discriminatory decisions that create regulatory liability.
According to the Synopsys open source security report released in February, 84% of the codebases analyzed contain at least one known open source vulnerability.
“Open source code or apps have been exploited to cause a lot of damage,” says Alla Valente, an analyst at Forrester Research.
For example, the Log4Shell vulnerability, patched in late 2021, was still seeing half a million attack requests per day at the end of 2022.
In addition to vulnerabilities, open source code can also contain malicious code and backdoors, and open source AI models could potentially be trained or fine-tuned on poisoned data sets.
“If you’re an enterprise, you know better than just taking something you found in open source and plugging it into your systems without any kind of guardrails,” says Valente.
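One basic guardrail is verifying that a downloaded model artifact matches a checksum published by a source you trust before loading it. This is only a sketch of the idea; the file path and expected digest are placeholders:

```python
# Illustrative guardrail, not a complete supply-chain control: check a downloaded
# model file against a checksum published by a trusted source before loading it.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, streaming it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "0123abcd..."  # placeholder for the digest published with the model release
actual = sha256_of("models/llama-2-7b-chat.safetensors")  # placeholder path
if actual != expected:
    raise RuntimeError("Model artifact does not match the published checksum; refusing to load.")
```

A checksum only proves the file is the one its publisher released; it says nothing about how that publisher trained or fine-tuned the model, which is why the sourcing questions below still matter.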
Enterprises will need to set up controls for AI models similar to those they already have for other software projects, and information security and compliance teams need to be aware of what data science teams are doing.
In addition to the security risks, companies also have to be careful about the sourcing of the training data for the models, Valente adds. “How was this data obtained? Was it legal and ethical?” One place companies can look to for guidance is the letter the FTC sent to OpenAI this summer.
According to a report in the Washington Post, the letter asks OpenAI to explain how they source the training data for their LLMs, vet the data, and test whether the models generate false, misleading, or disparaging statements, or generate accurate, personally identifiable information about individuals.
In the absence of any federally mandated frameworks, this letter gives companies a place to start, Valente says. “And it definitely foreshadows what’s to come if there’s federal regulation.”
Data residency
If an AI tool is used to draft a letter about a customer’s financial records or medical history, the prompt containing that sensitive information is sent to an AI for processing. With a public chatbot like ChatGPT or Bard, it’s impossible for a company to know exactly where that request will be processed, potentially running afoul of national data residency requirements.
Enterprises already have several ways to deal with the problem, says DAS42’s Amabile, whose firm helps companies with data residency issues.
“We’re actually seeing a lot of trusted enterprise vendors enter the space,” he says. “Instead of bringing the data to the AI, we’re bringing AI to the data.”
And cloud providers like AWS and Azure have long offered geographically based infrastructure to their users. Microsoft’s Azure OpenAI service, for example, allows customers to keep data in the data source and location they designate, with no data copied into the Azure OpenAI service itself. Data vendors like Snowflake and Databricks, which have historically focused on helping companies with the privacy, residency, and other compliance implications of data management, are also getting into the gen AI space.
“We’re seeing a lot of vendors offering this on top of their platform,” says Amabile.
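In practice, keeping regulated prompts inside the required jurisdiction often comes down to pinning each request to infrastructure in an approved region. A simple sketch of that routing, with placeholder region names and endpoint URLs:

```python
# Hypothetical sketch: route each request to a deployment in the customer's
# required region. Region keys and endpoint URLs are placeholders.
RESIDENCY_ENDPOINTS = {
    "eu": "https://your-resource-westeurope.openai.azure.com",
    "us": "https://your-resource-eastus.openai.azure.com",
}

def endpoint_for(customer_region: str) -> str:
    """Return the approved endpoint for a customer's region, or fail loudly."""
    try:
        return RESIDENCY_ENDPOINTS[customer_region]
    except KeyError:
        raise ValueError(f"No approved deployment for region '{customer_region}'") from None
```

Failing loudly when no approved region exists is deliberate: silently falling back to a default endpoint is exactly how residency requirements get violated.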
Identifying indemnification
Some vendors, understanding that companies are wary of risky AI models, are offering indemnification.
For example, image gen AIs, which became popular a few months before language models did, have been accused of violating the copyrights of works in their training data.
While the lawsuits are playing out in courts, Adobe, Shutterstock, and other enterprise-friendly platforms have been deploying AIs trained only on fully-licensed data, or data in the public domain.
In addition, in June, Adobe announced it would indemnify enterprises for content generated by AI, allowing them to deploy it confidently across their organization.
Other enterprise vendors, including Snowflake and Databricks, also offer various degrees of indemnification to their customers. In its terms of service, for example, Snowflake promises to defend its customers against any third-party claims of services infringing on any intellectual property right of such third party.
“The existing vendors I’m working with today, like Snowflake and Databricks, are offering protection to their customers,” says Amabile. When he buys his AI models through his existing contracts with those vendors, all the same indemnification provisions are in place.
“That’s really a benefit to the enterprise,” he says. “And a benefit of working with some of the established vendors.”
Board-level attention
According to Gibson, Dunn & Crutcher’s Vandevelde, AI requires top-level attention.
“This is not just a CIO problem or a chief privacy officer problem,” he says. “This is a whole-company issue that needs to be grappled with from the board down.”
This is the same trajectory that cybersecurity and privacy followed, and the industry is now just at the beginning of the journey, he says.
“It was foreign for boards 15 years ago to think about privacy and have chief privacy officers, and have privacy at the design level of products and services,” he says. “The same thing is going to happen with AI.”
And it might need to happen faster than it currently is, he adds.
“The new models are and feel very different in terms of their power, and the public consciousness sees that,” he says. “This has bubbled up in all facets of regulations, legislation, and government action. Whether fair or not, there’s been criticism that regulations around data privacy and data security were too slow, so regulators are seeking to move much quicker to establish themselves and their authority.”