The care and feeding of your AI models is crucial

Taylor Armerding
Published in Nerd For Tech
7 min read · Mar 4, 2024


Everybody knows that the “you are what you eat” cliché is meant figuratively, not literally. But in the world of artificial intelligence (AI) and machine learning (ML) tools, they literally are what they’ve been fed. Which can be good or bad — in some cases very bad.

The so-called large language models used to create AI chatbots have been “fed” massive amounts of data scraped from the internet. And as we all know, the internet is a chaotic repository of data infected with random bias, errors, misinformation, bugs — what experts frequently describe as “poison.”

So, as experts also regularly warn, AI needs constant and intense supervision. It is, after all, still in its infancy. Which means the care and feeding of your AI/ML systems can get very complicated, especially when it comes to unknowingly bringing exploitable vulnerabilities into those systems — more on that momentarily.

But even leaving vulnerabilities aside, any doubts about poisoned datasets are regularly dispelled by stories that range from hilarious to shocking to frightening. AI chatbots are known to “hallucinate” — deliver completely fictitious responses to prompts — with the same “confidence” as when they deliver something accurate.

Just this past week, Google had to suspend its AI image generator Gemini after it refused to portray images of people of European descent. Ask for an image of the Pope and it showed a woman. Ask for an image of a Nazi, and it delivered a Black man. As a columnist for The Free Press put it, “Ask it to show the American Founding Fathers and wow, it turns out the Founding Fathers were a diverse racial and gender mix including, of course, a Native American.”

Rampant immaturity

Not that such absurdities create an existential threat to anything — they’re comedy gold. But they illustrate that AI remains a long way from, uh, maturity. And when it comes to the datasets that companies use to develop their own in-house AI/ML models, it’s not so amusing. Poisoned data could easily endanger their products, reputation, profit, and even existence.

Indeed, a recent post in The Cyber Edge noted that “everything starts with data, and sometimes critical information is a life-and-death issue. Malicious actors targeting a capability that relies on AI can damage it by poisoning the inputs the model will run on.”

It recommends that organizations “assist programmers with AIs that will not take training data from large sets that could bring new risks but instead use sets from controlled enterprise environments where results have an extra layer of safety.”

Another post, this one in VAR India, called for “security specialists — companies, researchers, security vendors, and governments — to put their best efforts into limiting as much as possible the use of artificial intelligence, including generative AI, by hackers for offensive purposes.”

The post noted that security researchers had demonstrated that they “can too easily break through so-called guardrails instituted in AI software and manipulate the software into ignoring safety restraints and then revealing private information.”

Then there is AI cybersecurity startup Protect AI, which recently posted details of eight significant vulnerabilities in the open source software supply chain used to develop AI/ML models.

The vulnerabilities are now public, and all have been assigned Common Vulnerabilities and Exposures (CVE) identifiers on the list maintained by the MITRE Corporation. One is rated critical in severity; the other seven are rated high.

Patching? It’s complicated

The good news is that patches are available that fix all eight vulnerabilities. But patching and updating can be more complicated for AI/ML systems than for software that doesn’t involve them.

Open source software, since it began, has been more challenging to update or patch than proprietary or commercial code. For the latter two, patches are “pushed” out — users don’t have to go looking for them. It’s what happens on your smartphone when you get notified that there are updates available for apps you’re using. All you have to do is tap an icon.

But users of open source components aren’t notified individually about updates — they have to “pull” them from a repository maintained by the creators. That means, of course, they won’t know they need to do so if they don’t know they’re using a component for which a patch just became available.
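What “pulling” looks like varies by ecosystem, but the pattern is the same: the consumer has to go ask. Below is a minimal sketch in Python of that kind of check, assuming a hypothetical list of pinned packages and using PyPI’s public JSON metadata endpoint; it illustrates the model, it isn’t a recommended tool.

```python
# Minimal sketch of the "pull" model: nobody notifies you, so you have to
# go ask the package index whether newer releases exist for what you use.
# The pinned list below is hypothetical; in practice it would come from a
# requirements file, lockfile, or SBOM.
import requests
from packaging.version import Version

pinned = {"numpy": "1.24.0", "requests": "2.28.1"}  # example pins only

for name, used in pinned.items():
    # PyPI exposes release metadata as JSON at this well-known endpoint.
    info = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10).json()
    latest = info["info"]["version"]
    if Version(latest) > Version(used):
        print(f"{name}: pinned {used}, latest {latest} -- check the changelog and advisories")
    else:
        print(f"{name}: up to date at {used}")
```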

That’s why there has been a years-long effort to get organizations to create a Software Bill of Materials (SBOM) for every software product they make. Two major annual reports from Synopsys, the “Building Security In Maturity Model” (BSIMM) and the “Open Source Security and Risk Analysis” (OSSRA), have consistently stressed the importance of creating and maintaining an SBOM: a list of all the open source and third-party components in a codebase, along with the licenses that govern those components, the versions in use, and their patch status. That inventory lets security teams identify any associated security or license risks.
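To make the inventory idea concrete, here is a rough sketch of the kind of record a single SBOM entry holds and how a team might use it once an advisory lands. The field names, component, and versions are illustrative, loosely modeled on formats such as CycloneDX and SPDX rather than copied from either.

```python
from packaging.version import Version

# Illustrative sketch only: the kinds of fields a single SBOM entry records.
# Field names are simplified, not an official schema.
component = {
    "name": "example-ml-lib",                  # hypothetical component
    "version": "1.2.3",
    "purl": "pkg:pypi/example-ml-lib@1.2.3",   # package URL identifier
    "license": "Apache-2.0",
    "origin": "third-party",
}

# Hypothetical advisory data: the version in which a flaw was fixed.
fixed_in = {"example-ml-lib": "1.2.4"}

# With the inventory in hand, matching an advisory to affected software
# becomes a lookup rather than a scramble.
fix = fixed_in.get(component["name"])
if fix and Version(component["version"]) < Version(fix):
    print(f"{component['name']} {component['version']} predates the fixed release {fix}")
```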

The idea is the same as in manufacturing, where a Bill of Materials (BOM) details all the items in a product. In the automotive industry, a BOM for each vehicle lists the parts built by the original equipment manufacturer itself and those from third-party suppliers. When a defective part is discovered, the auto manufacturer knows precisely which vehicles are affected and can notify owners of the need for repair or replacement.

But according to Protect AI, the current SBOM model doesn’t work for open source components used in ML/AI development.

Polluted pipeline

“Traditional SBOMs only tell you what’s happening in the code pipeline,” Daryan Dehghanpisheh, cofounder of Protect AI, told SecurityWeek. “When you build an AI application, you have three distinctly different pipelines where you must have full provenance.”

Dehghanpisheh said those three pipelines are for code, data, and ML. “The ML pipeline is where the AI/ML model is created. It relies on the data pipeline, and it feeds the code pipeline, but companies are blind to that middle machine learning pipeline,” he said.

In a blog post, he called for the development of AI/ML BOMs that would “specifically target the element of AI and machine learning systems. It addresses risks unique to AI, such as data poisoning and model bias, and requires continuous updates due to the evolving nature of AI models.”

Jamie Boote, senior consultant with the Synopsys Software Integrity Group, agrees about the need for an AI/ML BOM. “Just as AI is different from mobile and desktop software, the threats against AI will also differ, so listing where the training data came from would be essential in an AI/ML BOM, but it might not be listed in a regular SBOM,” he said, adding, “I’d expect responsible developers who integrate AI into their software and portfolios will start asking for those BOMs so they can begin to make informed risk decisions, just like they’re asking for SBOMs from other software vendors.”
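No settled standard yet defines an AI/ML BOM, so the following is a purely hypothetical sketch of the extra information such an entry might carry beyond a traditional SBOM, such as model identity and training-data provenance. Every field name and value here is invented for illustration.

```python
# Hypothetical sketch of what an AI/ML BOM entry might add on top of a
# traditional SBOM. There is no settled standard; every field below is
# illustrative only.
ml_bom_entry = {
    "model": {"name": "example-classifier", "version": "0.4.1"},
    "training_datasets": [
        {
            "name": "example-curated-corpus",         # hypothetical dataset
            "source": "internal enterprise archive",  # where the data came from
            "sha256": "<pinned digest of the vetted snapshot>",
            "collected": "2023-11-01",
            "license": "internal use only",
        },
    ],
    "known_risks": ["data poisoning", "model bias"],  # risks the entry is meant to surface
}
```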

But he added a caveat, noting that the issue is much more complicated than that. “BOMs are listings of what items are present in the shipped software,” he said, “and those listings omit software that is only used to build the shipped software, like compilers, IDEs [integrated development environments], and the Windows or Linux version the developers were using to write the shipped software’s code but aren’t included on the software’s installation CD.”

A matter for the courts

In the evolving world of AI, the data used to build the shipped software is a matter of intense dispute. On one side are content creators ranging from newspaper publishers to authors, artists, and musicians, whose intellectual property is being used to train AI models without attribution or compensation.

On the other side are the AI developers who contend that training data isn’t present in the software they deliver — that it is just used to create it, much like using Windows to create software.

That dispute remains unresolved, “and as of today the AI developers are the ones who are saying what goes into the BOM. And it’s likely they aren’t including data used in training in that listing because they feel training data isn’t present in deployed AI/ML models,” he said.

So it sounds like the content of an AI/ML BOM will eventually be sorted out by the courts.

On the recommendation to train AI/ML models from datasets that have been vetted for both quality and security, Boote said he agrees. He said the AI landscape is evolving so rapidly that he suspects OpenAI’s ChatGPT — the most famous AI chatbot — will be the last to have been trained on “the mass harvesting of semipublic datasets,” given the efforts of popular websites such as X (formerly Twitter) and Reddit to prevent it.

“In the future, finding trusted datasets is going to be vital to prevent watering hole-type attacks. Vetting good training data will be just as important as vetting good software,” he said, “and I expect the industry to come to a consensus in the future about what ‘good’ means when it comes to training data.”
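One basic control that follows from that advice is pinning a cryptographic digest of a vetted training-data snapshot and refusing to train if the data on disk no longer matches. Here is a minimal sketch, with a hypothetical path and a placeholder digest.

```python
# Minimal sketch of one control for "trusted datasets": verify that a
# training-data snapshot matches a digest pinned when the data was vetted,
# so a silently swapped (poisoned) file is caught before training starts.
# The path and digest below are hypothetical placeholders.
import hashlib
from pathlib import Path

DATASET_PATH = Path("data/training_corpus.parquet")   # hypothetical location
PINNED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large datasets don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if sha256_of(DATASET_PATH) != PINNED_SHA256:
    raise RuntimeError("Training data does not match the vetted snapshot; refusing to train.")
```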

But he gave a flat “no” to the idea that any organization, group of organizations, or governments, no matter how hard they try, will be able to limit the use of AI by cybercriminals, as suggested in The Cyber Edge.

“Hammer manufacturers haven’t been able to stop hammers from being used in crimes no matter how useful they are to building framers, drywall contractors, mechanics, and other tradespeople,” he said. “Attackers will always find malicious uses for tools, and as long as the benign benefits outweigh the risks posed by criminals, governments will go after the criminals and not the tooling.”


Taylor Armerding

I’m a security advocate at the Synopsys Software Integrity Group. I write mainly about software security, data security and privacy.