AI Guardrails Will Shape Society. Here’s How They Work.
Thursday, January 23rd, 2025
You will be hearing a lot about AI guardrails. There will be political battles over what they do and whether they must be disclosed publicly.
Prominent venture capitalist and computer scientist Marc Andreessen recently said, “AI is highly likely to be the control layer for everything in the world.” Technology providers are pushing AI into everyday products, such as the “AI Overview” in Google search results and “Apple Intelligence” in iOS 18 on recent iPhones.
As people become accustomed to getting information from a generative AI (“GenAI”) such as ChatGPT rather than searching multiple sources, whoever controls GenAI outputs through guardrails will have a powerful society-shaping tool.
There are many areas where certain GenAI guardrails might be required by law or effectively required to avoid violating existing law. You could fill a large book just describing the areas where that may be the case: privacy and data protection, anti-discrimination, preventing intellectual property infringement, protecting national security and cybersecurity, consumer protection, obscenity laws, child-safety laws (such as those combating child sexual abuse material), and other public-safety concerns.
Today, let’s learn about GenAI guardrails and how they work. This will help you better understand the debates and regulatory action around them.
What are guardrails? They are mechanisms that manipulate or bias a GenAI’s outputs away from what its trained neural network would otherwise produce. For example, a guardrail might try to prevent a GenAI from recommending suicide or cause it to adjust its output to conform to DEI principles.
There are five broad ways to do this in GenAI. The first two are true guardrails. The last three are not technically guardrails but are other ways to manipulate GenAI output.
#1. Implement a Soft Guardrail. In GenAI, a soft guardrail is a set of hidden prompts implemented by the GenAI provider that operate alongside your prompt to mold the output’s content. The user does not see and is not given access to these co-prompts. For example, a hidden prompt might be “Do not present information about how to produce illegal substances.” When a popular GenAI such as ChatGPT says it can’t address some topic, that’s likely due to a soft guardrail.
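To make this concrete, here is a rough sketch in Python of how a hidden co-prompt rides along with the user’s prompt. The prompt text and the call_model() function are illustrative stand-ins I made up, not any provider’s actual system:

    # A minimal sketch of a soft guardrail: the provider silently prepends hidden
    # instructions to every request. The prompt text and call_model() are
    # illustrative stand-ins, not any provider's real setup.
    HIDDEN_CO_PROMPT = (
        "Do not present information about how to produce illegal substances. "
        "If asked, politely decline."
    )

    def call_model(messages):
        """Hypothetical stand-in for the provider's chat model; returns a canned reply."""
        return "(model response)"

    def answer(user_prompt):
        messages = [
            {"role": "system", "content": HIDDEN_CO_PROMPT},  # the user never sees this
            {"role": "user", "content": user_prompt},         # what the user typed
        ]
        return call_model(messages)

    print(answer("Tell me about chemistry."))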
#2. Implement a Hard Guardrail. A hard guardrail is a filter applied after the GenAI produces its output. If the filter flags what the GenAI’s neural network produces, the GenAI will likely output nothing or stop generating mid-response.
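A rough sketch of a hard guardrail follows; the blocked-pattern list is an illustrative assumption, not any provider’s real filter:

    import re

    # A minimal sketch of a hard guardrail: a filter inspects the finished output
    # and blocks anything matching disallowed patterns. The pattern list is an
    # illustrative assumption.
    BLOCKED_PATTERNS = [
        re.compile(r"step-by-step instructions for", re.IGNORECASE),
    ]

    def apply_hard_guardrail(model_output):
        """Pass the output through unchanged, or refuse if it trips the filter."""
        for pattern in BLOCKED_PATTERNS:
            if pattern.search(model_output):
                return "I can't help with that."  # or output nothing at all
        return model_output

    print(apply_hard_guardrail("Here are step-by-step instructions for ..."))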
#3. Use Biased Training Data. GenAIs learn by studying training data and then create their outputs based on the relationships they learn among pieces of that data. Feeding a GenAI biased training data pushes its output in the desired direction. The bias in data selection can come from the person selecting it, whether subconsciously or intentionally, such as believing that all legitimate news comes from the New York Times or, on the other hand, the New York Post.
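Here is a sketch of how data selection alone can tilt a model; the source names and corpus format are made up for illustration:

    # A minimal sketch of biased data selection: only documents from "approved"
    # sources survive the filter and ever reach training. Source names are made up.
    APPROVED_SOURCES = {"outlet-a.example.com"}

    def select_training_docs(corpus):
        """Keep only documents whose source is on the approved list."""
        return [doc for doc in corpus if doc["source"] in APPROVED_SOURCES]

    corpus = [
        {"source": "outlet-a.example.com", "text": "..."},
        {"source": "outlet-b.example.com", "text": "..."},  # silently dropped
    ]
    print(select_training_docs(corpus))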
#4. Weight Different Pieces of Training Data Differently. Giving favored data more weight makes the model’s outputs track that data more closely. This is like a pollster weighting some polls more heavily than others in building an election model.
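A sketch of per-example weighting during training; the source names and weight values are illustrative assumptions:

    # A minimal sketch of weighting training data: examples from favored sources
    # get larger weights, so their errors count more during training.
    SOURCE_WEIGHTS = {"favored.example.com": 2.0, "other.example.com": 0.5}

    def weighted_loss(per_example_losses, sources):
        """Scale each example's training loss by its source weight before summing."""
        weights = [SOURCE_WEIGHTS.get(s, 1.0) for s in sources]
        return sum(w * l for w, l in zip(weights, per_example_losses))

    # Two examples with the same raw error contribute unequally after weighting.
    print(weighted_loss([0.8], ["favored.example.com"]))  # 1.6
    print(weighted_loss([0.8], ["other.example.com"]))    # 0.4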
#5. Mess with the GenAI’s Reward Function. During training, a GenAI’s outputs are evaluated by an internal scoring (reward) function that guides its behavior. The designer can build rewards into that function that align with the designer’s goals, such as making the output woke or conservative.
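A sketch of a tilted reward function; the phrase lists and point values are illustrative assumptions, not anything a real vendor has disclosed:

    # A minimal sketch of a tilted reward function: on top of a base quality score,
    # the designer adds bonuses and penalties for answers they favor or disfavor.
    # The phrase lists and point values are illustrative assumptions.
    FAVORED_PHRASES = ["phrasing the designer wants to encourage"]
    DISFAVORED_PHRASES = ["phrasing the designer wants to suppress"]

    def reward(output_text, base_quality):
        """Return the base quality score adjusted by the designer's preferences."""
        score = base_quality
        text = output_text.lower()
        for phrase in FAVORED_PHRASES:
            if phrase in text:
                score += 1.0   # steer the model toward answers like this
        for phrase in DISFAVORED_PHRASES:
            if phrase in text:
                score -= 1.0   # steer the model away from answers like this
        return score

    print(reward("An answer with phrasing the designer wants to encourage.", 0.7))  # 1.7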
How can you discover a GenAI’s soft or hard guardrails? They are not published, and GenAIs are instructed not to divulge them. You can attempt to “jailbreak” the GenAI, meaning manipulate it into ignoring its guardrail rules, and then get it to divulge them.
Most likely, politically biased outputs come from soft guardrails. The other ways of shaping outputs are harder to implement practically and efficiently.
Turning back to law and policy, few dispute that GenAI guardrails are needed in some situations, such as preventing a GenAI from encouraging terrorism. Also, in some areas the law is clear, so conforming GenAI output to legal requirements makes sense. But there are fuzzy areas where some would claim the law requires doing or not doing something, and others would disagree.
I predict there will be a big fight over whether providers of widely used GenAIs must disclose their guardrails publicly. Some will claim the public has a right to know how their GenAI outputs are being shaped. But disclosing guardrails generally makes them easier to end-run. Opponents of disclosure will argue that divulging guardrails will encourage people to engage in the very behaviors the guardrails are meant to discourage.
I don’t know which side will win that fight, but its outcome might be the most consequential thing in shaping society since the invention of the Internet.
NOTE: A longer, more detailed version of this column is available on John Farmer’s Substack, which is here.
Written on January 23, 2025
by John B. Farmer
© 2025 Leading-Edge Law Group, PLC. All rights reserved.