The AI Fundamentalists

AI in practice: Guardrails and security for LLMs

Dr. Andrew Clark & Sid Mangalik Season 1 Episode 35

In this episode, we talk about practical guardrails for LLMs with data scientist Nicholas Brathwaite. We focus on how to stop PII leaks, retrieve data, and evaluate safety with real limits. We weigh managed solutions like AWS Bedrock against open-source approaches and discuss when to skip LLMs altogether.

• Why guardrails matter for PII, secrets, and access control
• Where to place controls across prompt, training, and output
• Prompt injection, jailbreaks, and adversarial handling
• RAG design with vector DB separation and permissions
• Evaluation methods, risk scoring, and cost trade-offs
AWS Bedrock guardrails vs open-source customization
• Domain-adapted safety models and policy matching
• When deterministic systems beat LLM complexity

This episode is part of our "AI in Practice” series, where we invite guests to talk about the reality of their work in AI. From hands-on development to scientific research, be sure to check out other episodes under this heading in our listings.

Related research:


What did you think? Let us know.

Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:

  • LinkedIn - Episode summaries, shares of cited articles, and more.
  • YouTube - Was it something that we said? Good. Share your favorite quotes.
  • Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
SPEAKER_02:

The AI Fundamentalists. A podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mungalik. Welcome everybody. In this episode, we're talking about deploying guardrails and security measures on LLMs. As part of our practitioner series, we're excited to bring on Nicholas Brocklate to join us for this discussion. Nick is a data scientist currently working in New York at a Fortune 500 insurance company. He's worked several internships previously in the Bay Area and obtained his undergrad and graduate degrees from UC Berkeley, studying data science and information. He was involved in previous BESSA and BGESS chapters, along with various projects and startups. Nick, welcome to our show.

SPEAKER_00:

Hello, Susan. Thank you all for having me today.

SPEAKER_01:

Awesome. Super glad to have you today. I guess for our first fun question, I'm just going to go around and just ask, you know, what are people reading? What have you been reading, Nick?

SPEAKER_00:

So currently I've been reading Principles for Changing World Order by Ray Dalio. And a couple other books kind of similar to the topic were Machine, Platform, and Crowd, The Alignment Problem, and The Coming Wave by Most of Us. Test of East.

SPEAKER_01:

Awesome. Can you is anything you want to share from those books like you've learned or taken away?

SPEAKER_00:

A couple big takeaways in terms of guardrails and responsible AI, but uh the Ray Dalio book has been also interesting given the current climate of the world um and how things are kind of shifting.

SPEAKER_02:

So um Yeah, Nick, I was gonna say it sounds pretty heavy. And Sid, what are you reading?

SPEAKER_01:

I've been reading uh At the Mountains of Madness by H. P. Lovecraft, which is it's been a good time. I'm a big lover of sci-fi and just trying to like read a little bit more of this Eldritch horror type content. Uh and it's been it's been really interesting. I think like it's very dry, but I think that's the delivery. It reads like you're reading like the expedition notes for like uh like that you expect like out of like Moby Dick. It's very dry, it's very like factual. Um, so the the horror is kind of like baked into like the content. So it's very interesting.

SPEAKER_03:

Um I'm currently working to do the Aeneid again from Virgil. So um definitely interesting. I read Odyssey earlier this year. I know there's a movie coming out on that. I'm actually really excited about that. Let's see if they've messed it up or not. But you never know. So it could be really epic or could be a really big flop to be determined. But um definitely getting back into the classics a little bit, which I think is is uh is great. And definitely there's need to have more people making like good solid literature, right? Like, because you you we're already at the point of like the LLMs aren't gonna keep getting better because there's not even any literature to train them on, and synthetic data is never as good. So, like, and it's just it's I'm also working on uh Thoreau Wallin. So it's just very the juxtaposition, kind of like I think, Nick, what you were talking about a little bit as well of like where the what I really like about Thoreau is it's kind of it's really a perspective shifter of are we all even looking at the world around, or why does it matter some of the stuff we're doing, right? So it's just a good like the whole go into the woods and have a think type type situation. It's just interesting, all the different perspectives, and that's what I really love about history and different things, and um, like even the Virgil is just that it really is a different perspective on the world, and it's just very easy for us to get on this little like myoptic hamster wheel train of the next thing happening or whatever. But it's like you zoom out in history, you do, as you mentioned, like the tectonic shape plate shift a little bit, but you know, history does repeat itself, and just seeing the macro view in literature really helps us zoom out. And that's where I think then the age of AI becomes more and more important for us to be going back into literature and really reading and really trying to think the macro and learn from different areas and and different disciplines. Yeah, I think just as a culture, we're very myoptic right now, so it's really good. And like a lot of the insights, uh, I forget the quote exactly, but I think it was from the founder, FedEx, maybe. Someone's like, well, how do you come up with great ideas? It's like, I don't know about you, but I I had I think, or or re or or I forget exactly what the quote was a really good quote. I'll put it in the show notes. But other thing is, most people's great ideas are because they're synthesized, they've read a lot and they synthesized information. Some of like Warren Buffett and things like that. It's like reading gets the perspective and synthesizes, and then just think that's kind of like the lost art these days is actually reading and understanding and stepping back. So apologies for a little bit of the ramble there, but um I think just becomes more important this literature review. And uh, no matter what it's what it is we're reading, the perspective from fiction too, like said what you're reading, it's like all of that is just opens your mind and kind of gets us in a good spot. So yeah, excited to hear about all the books you guys are reading.

SPEAKER_01:

Awesome. Thank you for sharing. I think that I think that is really valuable. So let's let's let's get into like you know some of the meat of the interview here. Uh Nick, tell us a little bit about like, you know, at a high level, the kind of work you do and what kind of problems you're you're solving in the field.

SPEAKER_00:

Yeah, absolutely. So for a little bit about myself and the work I've been doing over at the current company I'm working at here in New York, it's been mostly focused within guardrails, AI, security, and safety. I've been working on the governance and risk team. So for the last two years, since I first got here, it was partially looking at previous statistic projects that we ran on underwriting to kind of verify to tobacco usage for new clients and new potential um people looking to get insurance and get a policy. And then the start of the guardrails process before AWS FedRock, early on when Lomaguard had kind of first came out with their first iteration, and the first couple attempts to come up with security protocols for the new AI systems with GPT 3.5. And I don't think 4.0 has was released at that point as well. Um, so for the last two years, it's been partially going through hugging face and developing as these new models have been either released to the public or just doing research and finding open source solutions that we could potentially implement and other techniques and strategies that will work best to kind of come up with a most robust system and service to protect against potential risks, threats, or any other unforeseen kind of uh repercussions of using these OS systems.

SPEAKER_03:

That's fantastic. Thanks for sharing them. Sounds like you're working on some exciting stuff.

SPEAKER_01:

And and I want to dig in a little bit into the uh the guardrails you're talking about. Tell us a little bit more, you know, for the audience, what are guardrails and what kind of models are they useful for?

SPEAKER_00:

Yeah, absolutely. So for the guardrails that we've been developing, we're seeing as almost like the first line of defense for any potential inputs or outputs to the orchestrator system that we were developing, which consisted of allowing us to use and leverage the various LLM models from providers such as Microsoft and Azure to AWS and the Titan Cloud models, uh, to Databricks and the various models available through their service marketing place. Guardros and their own are like their method and technique for prevention of any potential threats, security breaches, authorizations, and permissions, and the use of these other LLMs that are fine-tuned and or crafted for various different topics.

SPEAKER_01:

Awesome. And can you tell me a little more about like what are the problems that we're hoping to solve with these guardrails? Like, what are some of these kinds of like inputs and outputs that we're trying to be mindful of?

SPEAKER_00:

Yeah, absolutely. So through the different papers that we saw, and for the primary business use case, it's mostly looking at PII prompt security. At one point, we were looking at financial and um financial advice versus uh calculation advice for clients and other home office employees who would be using these models, all the way down to private company information as well. So, whether it be passwords, tokens, API endpoints, or more that people may be potentially using. How do we make sure that as these inputs are being passed as strings into the orchestrator system, we have some form of classification or threshold that's making sure that the model doesn't then output like private information relevant to the company or addresses, credit cards, phone numbers, emails, or anything for employees as well.

SPEAKER_03:

For sure. And that's as we talked about in our recent webinar, um, like the confidentiality of these systems is not great. So like that's a huge risk and like a kind of a non-negotiable, I think, for using these systems like in the financial services, like you're mentioning, is to have these like old school regex, if you will, checkers to make sure you're not doing PII or like making sure that kind of information because the data classification or like how do I make sure I don't accidentally put on a social security or it's the ability to hack these systems is very high on that area, or not even hack, but just like trick them into saying things they shouldn't say. So having that we're really seeing that it's kind of like a any any customer-facing application of these systems in a financial services and type environment at minimum needs to have the basic suite of these you shall not guardrails in place, right? And they can they're relatively simple conceptually, is like almost if statements, if this happens, don't don't respond. So you know it's it's it's very fascinating area of research, and you're definitely doing a lot of um implementation on it, which is which is fantastic.

SPEAKER_00:

Yeah, thank you. I think uh to your point too, the what you mentioned is what I think of is prompt injection, like prompt security, to where you kind of give some other form of instructions and try and get the model to veer off course or veer off the initial intention of what you wanted it to be able to handle. Looking more and more into how do you solve and address those problems has also been fascinating. I know early on there were a couple papers that talked about algorithms like uh an erase and checker version to where you kind of go through different parts of the string input that you have and try to verify if there's any toxicity or any malicious intent within there, and then ultimately try and combine that score and bring it back together and pass it through that classification system. Uh, however, then you get other creative attempts to where they have uh I think it was called damn, like do anything now, abbreviation for that, and then other kind of like jailbreaking attempts that just completely uh avoid any of the other prior instructions for those models.

SPEAKER_03:

Yeah, and I think one thing I'd love to get your thoughts on um for evaluating these systems. I think one thing that often people forget about is all of the different you got to make sure you're doing all these things and people can very much get in the weeds of the tech uh of like the which is great. The the thing that I see often, I want to get your thoughts if you're seeing um when you're deploying these is when when do we step back and say, well, is the juice worth the squeeze, if you will, of like because of all the additional work and all the things on here? Like what is the performance requirement in business? How is it actually performing? What's the cost around all of this stuff and all of your time and everybody else is setting up versus like what you can make it and it looks like a chat bot with literally just deterministic programming that will look like a chat bot and all this is like and you know it's not gonna, if it's looking for an answer, it's gonna say a specific SQL query output, maybe not even SQL query, it's just a set of like what's the operating hours of your business and it will turn it back versus and you don't have to worry about these types of so securities leaking and things. So I'm wondering how you think about that trade-off of evaluating like, is it even worth it? Like it's we great, we have it's something I love it too. It's a very fun area to geek out on all the techniques we can do to prevent this stuff. There's also like setting back like, should we even be going down that route? Is it even worth it, right?

SPEAKER_00:

Yeah, I think evaluation is something that we could should heavily consider going forward based off of some of what the other professional opinions from prior books have mentioned, as well as from what I've seen so far in the attempts of building these guardrails. I'd say for even like classification tasks, for example, right? Building out neural networks and having them specifically trained for a specific type of risk. That also encompasses synthetic data set generation, unless it's available. Um, in most instances, and depending on the industry and the business, the nature of the questions that you may encounter may not necessarily be readily available. So, with that, then there's the before we can get to the evaluation, how do we verify that we have a quality data set that is most similar to what we anticipate encountering in a real world scenario? Then from there is the typical approach of splitting it between your 80 and 20%, depending on if you're trading it or if you're just strictly evaluating one of the either open source or uh straight from the box available guardrails within the market, and you can just go through and evaluate it there.

SPEAKER_01:

I think that you're kind of getting it at something which I've been thinking at throughout this conversation, which is for the for the practitioners, like there's so many places where you can run the guardrail, right? You have the opportunity to run this guardrail at the time of prompting, you have the chance to do it during the fine-tuning stage when you're making the model itself, and you also have the chance of just catching on the way out, right? That inference testing, right? It's like, you know, now you have this model off the shelf, it's a GPT, you didn't get to train it, but you can at least run the guardrails on the other side. So I guess like I want to know like what your thoughts are on the pros and cons of these types of techniques, right? Is you know, ideally we're doing all of them, but if you have to like weigh your options, how does it benefit or hurt you to do your guardrailing at the prompting time, at the training time, or at the outside?

SPEAKER_00:

The outside. So I think going through each of those different benefits, pros and cons, starting with prompting. If you start at prompting and you look to make sure you have the most clear-cut, concise, and accurate prompt for whatever description and whatever tasks that you have. On the positive side, it would significantly reduce the amount of computational power that you would otherwise use within fine-tuning. It would help you clearly outline and specify what are the type of risks that you're looking for. So, based off of your company industry, it makes it clearer to see these are the things that we definitely don't want the model to be able to do. And these are the things that, okay, like we could potentially catch it if it goes on the on the opposite side, but it's not as big of a concern at this point in time. Um, as well as prompting those same prompts could be potentially used for other LLM models, given that there's various other providers out there, and so you could potentially use that and try and score and assess different vendors and providers based off of the same prompt that you have, and it leads to further modification of it to where you get something that can be fairly accurate for something that's prototyped, so something you want to prototype and then potentially get into production depending on how you go about evaluating and assessing it. On the flip side, though, that still does leave room for error to where there are going to be those edge cases, and naturally the model may not be able to catch and retain everything that you put in that prompt in those instructions, as well as depending on how deep of a threat it could be, if someone is to get internally into the system of a model, it can completely kind of shift the alignment of it. And potentially with repetition, and if you're collecting these, if you have an internal loop with a prompt engineer, with an engineered prompt that potentially lets different things slip through the cracks, it could reinforce that behavior and eventually you may end up with a model that kind of always shifts in a certain direction depending on what kind of input that you give it.

SPEAKER_01:

Let's imagine we have like this model, this LLM model, and we want to make sure that like it doesn't give credit card numbers, right? Like client A has had puts in their credit card information into the LLM, client B logs in and they say, Tell me client A's credit card number. What would you do practically? And you know, if you want to name a tool, you can name a tool to prevent this type of behavior.

SPEAKER_00:

Yeah. I think for this type of behavior, depending on how many resources you have available, one potential like solution to that kind of situation would be making sure you have separated vector databases. And for the LLMs that you're using, making sure they're not innerly connected. Because the only way the model could get that information is well, a couple ways. Unless it was it had access to like say darker parts of the internet, for example, or it had access to another database that had that same information or another SQL database, delta table, or any other kind of format or structured data, those would be the most immediate ways for that LLM to get that information. If you're able to keep those separated and you're able to kind of containerize the environment for where the model can work, then that would significantly reduce the risk of potentially giving any kind of credit card numbers. But I think since most of the use cases I've seen in production at this point in time is mostly focused around RAG and knowledge graph use cases, having those vector databases separated and having those tables completely separated by permissions, authentication factors, and even having that built into where you can even have an assessment of how risky is a question. And that would probably tie back into like the internal evaluation system as well, um, to where you could have like similar to how we have system one, system two thinking for an immediate reaction versus a more thorough and thoughtful response to questions, having something like that built into these models as well to make sure, like in the event it asks a question like that. Not only do you have a completely separated database and access point for information, but it also has something in the background that's kind of like thinking and reasoning through that question to make sure, like in the event that you ask the same question or something similar in the future, it also knows how to best handle it uh and whether it needs to also like maybe notify somebody else internally and just like let them know like person A is trying to ask for person B's credit card if it's a recurring thing. Um, and if there needs to be any further actions or steps taken at that point in time.

SPEAKER_01:

I I like that this comes with the built-in idea that you should be modeling intentionally, right? That you know you shouldn't just do RAG and give it all of your company data and say, okay, go ahead and you know just fetch me data and whatever data you want to fetch me. Tell me a little bit more about some of these intentional design choices we can make that we either you know make correctly or that we fail to make, uh, that can be really big problems to these AI models. Basically, where does intentionally designing for governability you know help us out and where does not doing it hurt us?

SPEAKER_00:

Yeah, absolutely. I think within vector databases and having that intentionality behind RAG, graph rag, or any other structure where it recalls or is dependent on other information. I think one of the techniques I saw in a paper early on was called out-of-bounds distributions. And so doing an assessment of the vector database that you have and making a judgment call of based off of the question that you're asking, this database has nothing relevant to it. So therefore, this LLM it may not be equipped to answer that question. Um, I think something along those lines and even subject and topic kind of segmentation as well. So you can have something similar, and there's different parts of that vector database that can easily and accurately recall them, uh, potentially even changing the hierarchical structure for them. Uh from the HF, I'm laying on like the full acronym for it right now. Um but for those structural diagrams for how if you wanted to go through your vector database and go through the different layers and depths of the trees that you have, being able to have something in there as well to where it can catch any of those questions. Um and then even getting into vector embedding spaces too. If there's a if based off of the information or the question that's being asked, there's similarities in say vector spaces. How can we do another assessment to where what you can at best determine the meaning of them based off of how similar those questions are beyond just the cost sign similarity and retrieving the top documents and retrieving the chunks and retrieving those? Uh if there's another additional step that we can add to them to where um you can either shift or adjust those spaces. And one potential solution for that too is well would be like if you catch something that's malicious within uh an attempt to access that rack use case, modifying it and adjusting it. So instead of saying try to get this credit card, um, something along the lines of like where would that kind of information be, to where you don't reveal the information or even changing it further beyond that as well, to where instead of asking and reframing the question of where the information may be, you can change it to be something personal of are you like reframing the question, like are you asking for your own information instead uh to make sure that there's those checks and procedures to not try to access anybody else's information?

SPEAKER_01:

And I guess this this lines up really nicely and very cleanly to what are some of the very material risks to not doing this? If you go in with this unintentional mindset and you build the model and you're all the way at the end, and then you have your oh no moment, yeah. Uh what like you know, what are some of the material consequences with using LLMs that can come from not implementing guard rules from the beginning?

SPEAKER_00:

Oh, so with that, um there are various different consequences that could come from that. Typically, I like to try and be more most optimistic in dealing with these problems. However, again, in from some of the previous literature I was referring to earlier, they talked about different industries and what we could see. So if we started within medicine, for example, and we built a model has access to all these different chemicals, it has access to all these different products, maybe even supply chains in different areas to where they're manufactured, to where you get these ingredients, like the foundational ingredients elements and more, and be able to kind of create and synthesize almost any drug that you want. On the flip side, the malicious part to that is you could potentially garner a system that ultimately leads to creating like more viruses or like harmful biomedical weapons or things of that kind of destructive pattern. Uh and it also mentioned like for almost any good thing that we do, you could also have like a flip side to it as well. So for like general, like the general adversal networks that you potentially create. On the positive side for it, it works great and effectively for like red teaming and potentially stress testing the guardrails that you may have. On the flip side, some of the repercussions between government, for example, and politics. If you're looking to put together the best campaign for a specific event or election or any time coming up, it could try and offscure information that's being fed to different systems, to different regions, to trying to draw up different like kind of um voting areas and restrictions there. To we see it within tech, the flip side to it would be finding loopholes to where you can exploit the type of products that you're trying to get, whether it's subscriptions. We've seen examples with like airlines and uh getting free flights to free cars or like any products within like automobile or locomotive or aerodynamics, there's repercussions there. And then same thing for cybersecurity in terms of information. That would be another risk to where in the event of potentially just a couple malicious like malicious actors with the same access to that information and technology, they could uh potentially disrupt um yeah, they could disrupt multiple like cities, companies, organizations, kind of depending on the industry as well, too. Even for like construction too, based off of construction, if you were able to find the information needed for where the most rubber, where the most rumber is being produced, and somehow was able to kind of like put some kind of string of attacks together, you could kind of disrupt that supply chain. Um, and that could have other repercussions that we may not necessarily know and we may not necessarily want to find out. So it's yeah, it's a combination of partially like industry and the nature of the work that you're you're building.

SPEAKER_01:

For sure. I mean, it can it can almost feel like the risks are unlimited, right? Because this models have so much scope and they potentially could operate in so many ways, it could be very dangerous for us. So I guess like you know, you're coming at this from an expert opinion, and you know, people pay a lot of money for your time to help work on these problems. Some organizations are just using AWS and they're using bedrock, and the models right out of there, and they're just trying to use the built-in bedrock guardrails. What are your feelings about these types of fully automated solutions that basically try and pre-package everything? Is that enough? Is that 80% of the way? Is that 20% of the way? Or how do you feel about these pre-packaged solutions?

SPEAKER_00:

Uh, so for the pre-package solutions, I'd say it's a step in the right direction, given that AWS has um considerably more resources that they can use to build some of these guardrail solutions. However, I don't think it's a one like one-all solution for protection in different companies. I think there would still be a significant amount of development, more focused towards building a model that is also has those safety protocols internally built in, as opposed to using the AWS Lambda with a combination of like the policies that you can develop. And um, having worked with the AWS bedrock guardrails and how you can kind of develop those policies and specify them for the different filters that you want, the different topics and like your high, medium, or low sensitivity for um for different potential filters. I think looking at a combination of open source solutions may be something that becomes more practical, similar to another idea from machine platform crowd, it mentioned there is the expert versus the kind of crowd in terms of sourcing. And one of the like best examples I can recall is when they mentioned Linux and like the development of Linux starting off as open source, and you had a couple. Of like expert leaders who are very enthusiastic about building it. And they were able to kind of monitor the repository and the overall production of that product. For that use case, they mentioned that even as experts and other professionals who get very hyper focused in specific areas and details within the technology, the crowd has an abundance of more just experience and random stress testing ideas that they can throw at the model that you may still not necessarily be equipped to handle, but it at least broadens the robustness and the flexibility and versatility of the models that we be building. So I think ultimately, with that in mind, the open source solution of potentially building something that's in alignment with safety from the ground up may be one of our best solutions. And at the same time, even with the internal evaluation systems you build into it, even with the embedding spaces that you build into it, it would still work with the other providers. So as GPT continues, as Meta continues, as AWS, Databricks, and all these other models continue to be developed, you would have something that's maybe on par capability-wise, but has been in alignment with safety from the very beginning. So that way, in the event of something that becomes more complex and tricky for those models to handle, and any of the other guardrails that are only trained up to a certain point in time, you have this model that's constantly evolving and constantly compatible with any new version system framework, anything that's developed into the future.

SPEAKER_01:

Awesome. Yeah, and I really appreciate that. And I guess, like, you know, as my almost closing thought on this, I really like how much, you know, you're in the LM space, which can feel a little bit like the Wild West in the AI world, but you're still bringing a lot of that methodical thinking, intentional design, and building for security from the beginning, right? It's not like we're patching something at the end. The research you're talking about is showing that guard rules are best applied at all stages of the process, not just as a last little API call at the end.

SPEAKER_00:

Yeah. Exactly. Um, I think even to an addition to that as well, is uh potentially having like the there was another paper that talked about like Blade is like domain-specific models. If we were to train a domain model solely on just responsible building ethics and other like well-intended code, and we're able to append it and attach it to some of the models that are currently out there, that could be a start given that trying to build something from the ground up solely focused within AI safety and fully in alignment may require uh significantly more resources, time, and depending on which companies or providers or people who would want to attempt it could be a considerable monitoring effort as well. However, something that's more uh domain adapted for these type of questions could be something considered as well too. So something similar to like a policy matcher or something that can go into the different industries for like medicine, um medicine, insurance, construction, real estate, and more, and take a summarization or an assessment of all the different regulations and standards that need to be adhered to and be its own domain specific model that you can attach to like the other bigger providers that are out there.

SPEAKER_02:

Well, Nick, um, thank you for joining us today. I think yeah, we've you've shared so much information, and it's been such a such an honor to have you here to really kind of talk about like the practical application of these systems and the securities and the considerations around all the thought and design that goes into building a resilient system.

SPEAKER_00:

Oh nice. Thank you. Thank you all for having me. This has been a pleasure. Uh, I think it's uh critical and important to keep thinking through these kinds of topics as things continuously develop.

SPEAKER_02:

For our listeners, thank you for joining us today. If you have any questions about this episode or the episodes you've previously listened to, please contact us. Until next time.

People on this episode

Podcasts we love

Check out these other fine podcasts recommended by us, not an algorithm.

The Shifting Privacy Left Podcast Artwork

The Shifting Privacy Left Podcast

Debra J. Farber (Shifting Privacy Left)
The Audit Podcast Artwork

The Audit Podcast

Trent Russell
Almost Nowhere Artwork

Almost Nowhere

The CAS Institute