The AI Fundamentalists

Preparing AI for the unexpected: Lessons from recent IT incidents

Dr. Andrew Clark & Sid Mangalik Season 1 Episode 22

Can your AI models survive a big disaster? While a recent major IT incident involving CrowdStrike wasn't AI-related, its magnitude and the reaction to it reminded us that no system, no matter how proven, is immune to failure. AI modeling systems are no different. Neglecting the best practices of building models can lead to unrecoverable failures. Discover how the three-tiered framework of robustness, resiliency, and anti-fragility can guide your approach to creating AI infrastructures that not only perform reliably under stress but also fail gracefully when the unexpected happens.

Show Notes



  • Model robustness (00:10:03)
    • Robustness is a very important but often overlooked component of building modeling systems. We suspect that part of the problem is due to: 
      • The Kaggle-driven upbringing of data scientists
      • Assumed generalizability of modeling systems: models are optimized to perform well on their training data but do not generalize enough to perform well on unseen data.


  • Model resilience (00:16:10)
    • Resiliency is the ability to absorb adverse stimuli without destruction and return to its pre-event state.
    • In practice, robustness and resiliency testing and planning are often easy components to leave out. This is where risks and threats are exposed.
    • See also, Episode 8. Model validation: Robustness and resilience


  • Models and antifragility (00:25:04)
    • Unlike resiliency, which is the ability to absorb damaging inputs without breaking, antifragility is the ability of a system to improve from challenging stimuli (e.g., the human body).
    • A key question we need to ask ourselves: if we are not actively building our AI systems to be antifragile, why are we using AI systems at all?


What did you think? Let us know.

Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:

  • LinkedIn - Episode summaries, shares of cited articles, and more.
  • YouTube - Was it something that we said? Good. Share your favorite quotes.
  • Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
Speaker 1:

The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik. Hello everyone, welcome to today's episode on preventing AI incidents, following our last podcast. If you haven't had a chance to check it out, I highly encourage you to do so, because since we recorded and published that podcast, the world experienced a global IT incident with CrowdStrike.

Speaker 1:

We had a lot of feelings about that because, while it was an IT incident and not a cyber attack per se like we're used to seeing in the news, watching the news roll out reminded us that something very similar can also happen in an AI world, at the same scale, maybe even much larger. And following on to that, it was a strange shift into the zeitgeist of the Olympics happening at the same time. Andrew and I are huge running fans, and we've been watching the men's distance team, especially the US men's distance team, really dominate this year in a way that they never have before. In particular, there was one race, Andrew, right? The men's 1500. That race really mimicked life: when something gets hyped up so high, the basics can get missed, right?

Speaker 2:

Oh yeah, don't get me started. Even if you're not a running fan, I highly recommend going to YouTube and looking up the 1500-meter final. It was crazy.

Speaker 2:

So basically, the high-level synopsis is you have two of the top expected runners. One of them, close to the world record and the European record in the 1500 meters, goes out really hard and leads the whole race. The other has been the world champion the last two years. The two of them don't like each other, and they're just looking at each other. Well, it comes down to the final sprint at the end, and they're so focused on each other that a US distance guy that nobody thought had a chance at all, even his own family was shocked, just blows past the two of them while they're busy watching each other.

Speaker 2:

There's one headline I saw: how does, I think it was, a 3:30 guy beat a 3:27 guy and a 3:28 guy over that distance? Essentially he had a PR by about two seconds, which is not something you normally see in an Olympic final. So he had the race of his life, and the people who can run significantly faster than he did weren't focused on the race itself. You can easily be focusing on the wrong things, so obsessed with a specific competitor that you miss the broader forest for the trees. Or you're trying to be secure with CrowdStrike, and you don't realize that you didn't have the right controls in place to prevent that kind of thing. So it's a very good metaphor in a lot of different areas, but also just a great watch for any Olympics and running fans out there.

Speaker 1:

It's a fun race to watch. And in the spirit of that race imitating, maybe, AI life: think about it, that wasn't just a decision made by them. There was a whole hype cycle around those two runners that enabled them focusing on each other, similar to the past couple of years of AI hype. It gets you so focused, like AI is a hammer and everything's a nail, that you forget: oh wait, there are actual goals here, there are actual things that need to be fixed, and you can't skip the basics, which is also the spirit of this podcast. Bringing that all together, that's why we're here to discuss the preparedness IT teams practice whenever they're focused on IT incidents, disaster recovery processes, and all the things that go with that. We want to talk today a little bit about the similar things that can happen should there be an AI incident and what similar preparedness can do for you should that type of incident happen.

Speaker 3:

You know, it's almost hard to believe it was a month ago, but on July 19th we had the CrowdStrike IT incident, right? And to some this really felt like our generation's Y2K. Y2K had a lot of promise: all the systems in the world are going to go down.

Speaker 3:

You know, hospitals are going to be closed, gas stations won't be able to pump gas, people won't be able to get on flights. And when that time came, it didn't happen, but it did happen a month ago. And, you know, while this isn't directly an AI incident, it really reminds us that we're always on the cusp of the big AI downturn, where our reliance on these AI systems has gotten so tight and so interwoven that when one of these goes down, and they do go down, what is that going to look like, and who is it going to happen to?

Speaker 3:

And it's really a question of when it's going to happen. There's no doubt in our minds that this is not an "if" scenario. Not everyone in the space is building with safety in mind, so someone's going to get caught, someone's going to get their foot stuck, and then we're going to have to deal with the fallout of that situation. So the goal of today is basically to walk us through the fundamentals of insulating our groups and our models from being the people who are part of that AI disaster, which is going to happen, and the goal is to have it not be you.

Speaker 2:

For sure, and this is where it's always easy to armchair quarterback and go back and replay things. But it also exposes some serious gaps. Take CrowdStrike as an example. It's common practice for ITGC, IT general controls, blocking and tackling: you don't deploy software updates to computers without them being tested and validated. It's a lot of hard yards and doing things the hard way, but there's a process to do that. What's crazy is that so many Fortune 500, Fortune 100, Fortune 50 companies didn't do that. They were just accepting config files from a security vendor who supposedly does things right. Also, Susan and I have a hypothesis, with absolutely no basis, we have nothing to support this, wondering if the company was using GitHub Copilot or something to do their QA, but that's neither here nor there; it's speculation. But there are basic controls. You have cybersecurity teams, you have all these controls you could have in place. It's easy to point out the gaps they had now, and you can't mitigate all of the risk in these areas, but the more proper preparation you have, the more you can mitigate.

Speaker 2:

And this reminds me a little bit of our last podcast with Dr. Patrick Hall. If you did not catch that one, I highly recommend you go back and listen to it. He's one of the great brains in the AI space at the moment. One of the key things he highlighted was a lack of accountability, especially in the culture we have in tech right now; a lot of it is that no one's really accountable. It's a team thing, which is good. There are definitely positives to "we win as a team, we lose as a team," things like that.

Speaker 2:

But when you have every decision made by committee and no one's actually accountable for anything, it's very easy to pass the buck: "I think somebody else is going to do it," or "I don't have to do that because Susan's doing it," or whatever. And if there's nobody that ends up having accountability, when the music stops, somebody's without a chair, and in some cases that means your computers are frozen, right? So, and of course AI is part of this, there need to be proper processes and things in place for how we prevent things from happening. Someone does have to be responsible at the end, because if they're responsible, they'll be more motivated to put in the proper controls and make sure they exist. Yeah, and I believe his most striking statement was: if everyone's responsible, nobody's responsible. Exactly.

Speaker 2:

So that's what we're going to do.

Speaker 2:

Based on the inspiration from CrowdStrike, we did a quick blog post on it, and then we're going to start doing a series across multiple areas about how you would prevent this type of AI CrowdStrike from happening, because Gary Marcus, who we talk about a lot, definitely believes that this will be happening.

Speaker 2:

All of us think that there is going to be some AI incident at some point. Y2K never happened, but we basically just had our Y2K; what everybody was scared of has now happened. There's going to be something that happens with AI, and it could be a lot worse from an individual perspective. So today we're mostly going to be framing some of the principles to keep in mind for preventing that and for building our systems in a proper way for these situations. We'll dive into the policies and procedures and the validation techniques at other times, but really, setting the stage is the goal of this podcast.

Speaker 3:

We have the accompanying blog post, so I'll turn us over to an old adage which comes back from World War II, and it really aligns with our philosophy of doing things the hard way, right? That's our approach to AI governance and modeling systems. And so I'm going to use our podcast PG-13 card. Go for it. Sure: proper planning and preparation prevents piss-poor performance. And so, you know, what is proper planning and preparation? These are the pieces that are going to let you be on top of that curve and not get caught in the incident. We're going to talk about this as, basically, three levels of building a performant system, where each level builds on the last and is a higher standard. At the lowest level we'll talk about robustness, then resiliency, and then, finally, anti-fragility. So see these as steps and tiers on your path to making sure that you're not going to be the one who causes an incident. We'll start at the beginning, with robustness, and we'll look to NIST.

Speaker 3:

NIST gives a really good definition of robustness here, at least in the AI space. They specifically call it the ability of an information assurance entity to operate correctly and reliably across a wide range of operational conditions, and to fail gracefully outside of that operational range. What does that mean? Put simply, it means your model will be robust across a set of scenarios, from the ideal case to the slightly bad case to the very bad case, and if you're outside that range, your model will fail, but it won't fail in a way that's catastrophic, right? If a bridge has to fail, it needs to fail at points where people aren't, where we can warn people ahead of time and get people to safety. So that's what we're going to talk about with robustness. I'll give it over to Andrew to talk a little bit about why this is becoming a problem and why, in AI specifically, we're having trouble with robustness.
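
As a minimal sketch of what "operate correctly inside the range, fail gracefully outside it" could look like in code for a tabular model: the feature bounds, the fallback value, and the model object are hypothetical, chosen for illustration rather than anything prescribed in the episode.

    import numpy as np

    # Hypothetical operating range derived from the training data (assumption for illustration).
    TRAIN_FEATURE_MIN = np.array([0.0, 18.0, 300.0])
    TRAIN_FEATURE_MAX = np.array([1.0, 90.0, 850.0])
    FALLBACK_PREDICTION = 0.0  # conservative default, e.g. "decline and route to a human"

    def predict_with_guardrail(model, x):
        """Return the model's prediction only when the input sits inside the
        range the model was trained on; otherwise fail gracefully."""
        x = np.asarray(x, dtype=float)
        out_of_range = (x < TRAIN_FEATURE_MIN) | (x > TRAIN_FEATURE_MAX)
        if out_of_range.any():
            # Graceful failure: log it, return a safe default, flag for human review.
            print(f"Input outside operational range at features {np.where(out_of_range)[0]}")
            return FALLBACK_PREDICTION
        return model.predict(x.reshape(1, -1))[0]

The design choice is simply that the model declines to answer outside its operational range instead of returning a confident but untrustworthy prediction, which is the bridge-failure analogy in miniature.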

Speaker 2:

Definitely. Robustness is such a key component, and it seems like just a basic building block. But oftentimes when people are building systems, and I think this is more common in the AI world, specifically with how we've created that trend of citizen data science, the data science boot camps, and then the rise of Kaggle, the competitive data science website for building the most performant systems, with money and bragging rights behind it. It's just this culture in data science of "let's get the most performant system": here's a set of data, make the model as absolutely, crazily good on this set of data as possible. But that's the concept of overfitting: it's not generalizable to a wider use case. You make it super optimized for a specific thing, and it can't work outside of that.

Speaker 2:

A great example: data scientists have gotten really good at building Formula One race cars. That is great when you optimize it; you've tuned every aspect for that Monte Carlo course in Monaco, that very specific race course. Great. Now take it out in the rain and drive it down to your local Starbucks. Oh, and there's a pothole. It's very different; you're not used to having potholes on that race course, right? So you've built something,

Speaker 2:

a Formula One car that might be super good at that specific task, but when it meets the messy real world of data, it's not going to be the same. There are going to be potholes, there are going to be things you don't expect. You know, get a Ford Mustang or something that can definitely handle the potholes. It's still very performant, it's going to be better than some old Jeep from the sixties in speed and zero-to-60 and all that kind of fun stuff, but it's not going to be so delicate that

Speaker 2:

you put a little bit of sand anywhere near it and it has an issue, right? So that's the major problem: the generalization of models. And Christoph Molnar, who's a friend of the program, and we had a great episode with him a while back, has a new book in progress on supervised machine learning for science, which has a great chapter, I think it's chapter eight, on generalization. This whole concept of overfitting, which you've probably heard of in the past, really has that genesis of "let's make something hyper-performant for a specific area." But the problem is that the real world in modeling and AI is often messy, and it's more like you should be building something for cross-country racing, not for a Formula One racetrack. That's the problem: the mismatch between how companies actually use something versus where it's being designed in a lab. Sid, any thoughts on robustness?

Speaker 3:

Yeah, I think that's absolutely right. So let's walk through: what does it look like to take your model outside of that perfect racetrack, and what does it mean to get your model ready for all-road conditions? That probably looks like doing something a little bit counterintuitive: we're going to build models that aren't about getting that last 1% of performance.

Speaker 3:

We're not going to squeeze the hyperparameters to get that perfect training set or validation set performance, because even validation isn't what the real world is. In fact, we might even want to go a step further and make our data intentionally, quote unquote, worse by adding noise to it. That noise is going to simulate what it's going to be like when the model is out there in the real world, so we're basically training our models to operate in less-than-ideal conditions. Robustness, and planning for robustness, means that you need to build your model around not just the happy-path situation, but around the adversarial situation and the different-distribution situation, and then see how your model performs on unseen and difficult data, because that's going to give your model a real chance to have seen something outside of ideal data conditions.
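
One way to picture this kind of stress testing is a rough sketch like the one below, assuming a scikit-learn-style classifier and toy data; the noise scales and any pass/fail budget are assumptions you would set for your own system, not numbers from the episode.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Toy stand-in data; in practice this would be your own train/test split.
    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    clean_acc = accuracy_score(y_test, model.predict(X_test))

    rng = np.random.default_rng(0)
    for noise_scale in (0.1, 0.5, 1.0):  # hypothetical "slightly bad" to "very bad" conditions
        X_noisy = X_test + rng.normal(0.0, noise_scale, size=X_test.shape)
        noisy_acc = accuracy_score(y_test, model.predict(X_noisy))
        print(f"noise={noise_scale}: accuracy {noisy_acc:.3f} (drop {clean_acc - noisy_acc:.3f})")
        # A robustness gate might fail the candidate model if the drop exceeds an agreed budget.

The point is not the specific noise model; it is that the evaluation deliberately leaves the "perfect racetrack" before the model ever does.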

Speaker 2:

Exactly, and I won't go too deep into running analogies, but there are a lot of different running analogies you could use here on training. Last one on the Olympics, I promise: from this year's Paris Olympics, the marathons were great. The women's race was also one to look up; it was a nail-biting sprint finish, which you don't often get in marathons. So that's definitely one to look at as well. But the course was hugely hyped up because of the hills.

Speaker 2:

Most marathoners aren't used to hills, and there were a lot of DNFs, did-not-finishes, more than usual in that race, because people psyched themselves out on the hills and don't know how to run them. They're used to a very flat course in, you know, 50-degree weather, a very optimized thing. They're not resilient or robust to hills and things like that which make it messier. So you definitely need to embrace the real world. Not everything is run on a track in perfect scenarios, and that's a big problem I think the data science world needs to address. And so I think that cues up really nicely our next level of preparedness, which is resiliency.

Speaker 3:

Right. So if robustness is about your model being ready to act in the world with adverse stimuli, able to handle itself and to fail gracefully if anything bad happens, resiliency is the next step up: not only do we absorb that adverse stimulus without destroying the model, but we can then naturally return back to the pre-event state. Basically, an elasticity of the model, right? When a bad incident happens, the model is able to come back to a point of safety, and so the model is basically able to take care of itself and we don't have to monitor it in that way. We'll still be responsible for our models, but we expect the model to be able to operate on its own.

Speaker 2:

If it has a bad day, it'll still be there tomorrow. Exactly, and this is a huge concept in engineering. We think the first use of the concept was around 1818, by someone named Thomas Tredgold, about the resilience of timber. It's been a key component of civil engineering and other disciplines for a long time: how do you make bridges and similar structures that will get hit by a big storm, or by earthquakes, bounce back and be resilient? So it's been an accepted concept in other areas, and in the future we're also going to talk about complex systems, where it's a component as well. But in AI we don't often have that.

Speaker 2:

In software engineering we're getting better, with auto-scaling and things like that, but still, as CrowdStrike painfully makes us aware, there are a lot of areas to grow there. But that component of resiliency in modeling: when we talk to individuals, even people within the space, about resiliency, they automatically go to IT, "oh, I have auto-scaling in place." Well, that's not the same thing as resiliency of your modeling system. So it's definitely an area we need to be focusing on more, and an area with a lot of room for improvement, and civil engineering, I think, is a great case study for it.

Speaker 3:

Yeah, absolutely, and I'll give a quick note on what this might look like for a modeling system. Model resilience doesn't just mean downtime, like Andrew's saying. It doesn't mean, "okay, people are hitting the model more, we give it more compute." The failures we're talking about are different: failures of predicting correctly. If some major event happens in the community your model works on and you don't touch your model, and now your model is doing way worse, it wasn't resilient. And so to add that layer, that dampener, to give the model a chance of recovering, we might look at something like continuous learning or responsive learning: if we see that the model is doing significantly worse, it either sends off an alarm or it retrains itself. It fetches the latest data, creates a new model, and adjusts to the new environment that it's in, rather than assuming that the old environment will persist and that we should just keep operating in that old condition.
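
A rough sketch of that "alarm or retrain" loop, assuming ground-truth labels eventually arrive for recent predictions; the baseline accuracy, tolerated drop, window size, and the retrain_fn/alert_fn hooks are hypothetical placeholders that show the shape of the idea rather than a prescribed implementation.

    from collections import deque
    from sklearn.metrics import accuracy_score

    BASELINE_ACCURACY = 0.90   # hypothetical accuracy measured at deployment time
    ALERT_DROP = 0.05          # hypothetical tolerated degradation before we react
    WINDOW = 500               # number of recent labeled examples to evaluate on

    recent = deque(maxlen=WINDOW)  # holds (features, true_label) pairs as labels arrive

    def on_labeled_example(model, features, true_label, retrain_fn, alert_fn):
        """Resiliency loop: watch live performance and react when it degrades."""
        recent.append((features, true_label))
        if len(recent) < WINDOW:
            return model
        X = [f for f, _ in recent]
        y = [t for _, t in recent]
        live_accuracy = accuracy_score(y, model.predict(X))
        if BASELINE_ACCURACY - live_accuracy > ALERT_DROP:
            alert_fn(live_accuracy)   # page a human...
            model = retrain_fn()      # ...and/or refit on the latest data
        return model

Whether the reaction is an alert, an automatic refit, or both is a governance decision; the sketch only shows that the monitoring and the response live in the same loop.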

Speaker 3:

It can be a tough sell for a lot of teams to do this type of work. You generally won't be rewarded for it; there aren't any medals given away, and when it comes time to do it again, people are going to say, "well, nothing happened, so why am I going to spend money on it?" So there's a lot of incentive to cut corners on resilience and just say, "well, at least we did a little robustness checking." But in practice, when you're missing these pieces, when you're missing that resilience and robustness, you leave yourself open to these types of threats and to the risk of failure. And so when you're dealing with models which are incredibly high risk and important to you, your customers, or society at large, over-engineering a little bit is actually what you should expect and want, as an almost-insurance policy on your model.

Speaker 1:

Can I ask something about that? Because when the CrowdStrike incident happened, there were a lot of online thought leaders trying to explain what this means and where we go from here. A few analysts fired up a LinkedIn Live, and one of the analogies they used, I think it's relevant here, is how teams delegate, and delegate is not the right word, how they manage the risk of how much of this type of testing and backup to have. The example they used: in Atlanta, you're not going to maintain large fleets of snow equipment because it just doesn't get used often enough; they'll mitigate with a plan a different way. Versus a city like Buffalo, where a lot of infrastructure spend is on that equipment because they cannot function without it. I appreciate the spirit of it, but I felt like there's a different scenario at play here: there are just too many blind spots. What do you think?

Speaker 2:

Yeah, I think that analogy is more for the mitigations versus the resiliency. For us it's more that when you didn't do it, you're exposed, like CrowdStrike. It's really in the preparation. There's no large physical infrastructure you have to have to do these things; it's just like adding an extra day to your ML project, right? They're just preparation items. This is more about preparation and process.

Speaker 2:

I understand some cities aren't going to be able to keep a ton of equipment on hand, because of depreciation and costing and skills. But saying "oh, we can't have the equipment" is not really an excuse from an AI perspective not to do this. Having no accountability is the easy way to point fingers after the fact: "oh, no one told me to do this, I wasn't required to do it." That's why it didn't happen for the CrowdStrike types of incidents.

Speaker 2:

But if you have these built into your process and you spend a little bit more time here, you're not going to have Fukushima disasters; you're going to have resilient bridges that can survive earthquakes, right? The problem is there's a large lack of data science and modeling leadership around right now, I think, where it's always "just cut corners, cut corners, cut corners" versus a little more foresight, which means your systems are going to be more performant later. Thankfully, with this type of testing it's more forethought, and a little bit of proper planning prevents poor performance, versus major infrastructure investments like you'd have to make in the snow case.

Speaker 1:

Yeah, and I think that's an important distinction, because you're saying the costs don't match, right? That's a big infrastructure cost in the analogy of the snow equipment, but here you're talking hours or maybe a day, and that's not too much to spend, in my opinion.

Speaker 3:

Yeah, absolutely, and I think that gets to this point: it's not a capacity problem, it's a process problem. It's not that we don't know how to do this, or that we can't do this, or that we don't have the tools. It just wasn't part of your process: you deploy models and you don't check for these things. And you shouldn't have to negotiate for these things; you shouldn't be saying, "well, I need to negotiate to have time to build resiliency and robustness."

Speaker 2:

That's just part of the pipeline, and that's a key thing Sid's highlighting: this is a knowledge gap more than anything else. Once you learn how to do this and set it up, it's not going to add much more time to your deployments. It's more a question of how we do it. We're not asking for "hey, I need an extra three months to do resiliency testing." That's not what we're saying. You don't need to spin up massive GPU clusters to do a resiliency test. It's just part of your process: okay, it used to take me 30 seconds to run the CI/CD pipeline on this model, now it takes 35. You really think that's going to be a problem?

Speaker 2:

It's an extra five seconds. It's about the process and understanding and just having proper planning around it, and that's where I think a lot of the gap is: it's more of an awareness gap, a leadership gap, and a knowledge gap than a resource constraint. Although the snow thing definitely makes sense, even for CrowdStrike: you're really telling me you can't do a quick five-second test, and see the computer screen freeze, before you deploy it to every system? There are basic things; it's process. If you have that process in place, it's really not crazy. It's a lot more time and effort and money and resources to try and fix those issues than to prevent them in the first place. But if no one's accountable for having good processes in place, everybody points their finger somewhere else, and then nobody puts these things in place in the first place.
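
To make the "thirty seconds versus thirty-five" point concrete, here is a sketch of the kind of pre-deployment gate that could run in a CI/CD pipeline as an ordinary test; the artifact path, the stress-case file, and the accuracy floor are hypothetical, not a recipe from the hosts.

    # test_predeploy.py -- hypothetical gate that runs alongside unit tests in CI.
    import joblib
    import pandas as pd
    from sklearn.metrics import accuracy_score

    def test_model_survives_stress_suite():
        model = joblib.load("artifacts/candidate_model.joblib")  # hypothetical artifact path
        suite = pd.read_csv("tests/data/stress_cases.csv")       # curated hard / noisy cases
        preds = model.predict(suite.drop(columns=["label"]))
        assert accuracy_score(suite["label"], preds) >= 0.80     # made-up floor for illustration

The check costs seconds, but it forces someone to own the stress cases and the threshold, which is the accountability point being made here.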

Speaker 1:

Exactly.

Speaker 3:

And then I think we're on to our last piece in our pipeline here, which is anti-fragility. Anti-fragility is the highest standard you can hold a model to. Unlike robustness, which is gentle failure, and resiliency, which is coming back from failure to a normal state, anti-fragility is about going the next step up: when you encounter an adverse scenario, your model actually becomes stronger against that adverse scenario in the future. And so what might that look like? A really good example is the human body, within reason: if you stress your body with exercise and give it appropriate recovery, you will not only recover from the injury or the stress or the exercise, you will actually become stronger in the place where you applied the stress. And so anti-fragility gives us a vision of what it looks like to have a model or a system that not only recovers, not only comes back to its original state, but also becomes stronger in the places where it needs to be stronger.

Speaker 2:

Definitely. It's a great concept, and it's hard to build some models to be anti-fragile, but that's beside the point; it should be the area we're striving toward. It's that continuum: robustness can handle those bumps in the road and deal with sand particles and potholes. Resiliency is being able to get back to the same state; you hit a pothole, but your shock absorbers work, versus your Formula One car that implodes. And anti-fragile is that runner, not to use any more Olympic analogies because we promised we wouldn't, who trains and gets faster over time because they're allowing their body to heal and recover, and those stresses make it stronger. And that's really what we want in these systems: making your systems better over time, so that when they get bad data, they're able to generalize better, not just respond to it but thrive on it and improve. It's what we really should be doing when we talk about AI governance as well: how do we get our policies to improve? Any technical practice like retros is all about this: how can we make our operating system better by learning from the past and from failures, and grow as a team? That's where the team stuff is great. But how do we make things better over time?

Speaker 2:

It's a hard thing to do for a system, and there's not a set process that every single modeling system can use for this. But it's definitely the thing to strive for: make your systems able to get better over time. That might even look like the team understanding how the model is operating and learning from that exposure: how should we modify it in the future? We don't want everything to be self-learning, morphing AI systems, that's not what we're looking for, but if your control structure and processes around the model let you see how it responds to conditions, then the whole system is anti-fragile, and you're making modifications as needed.

Speaker 3:

That's right, and, like Andrew is saying, that can look like manual review: you look at your own model and you do a gap analysis on it. Where is my model weak? Where is it making mistakes? And then you build in protections for those spaces. Not every team is going to have the capacity to build out a model that identifies its own weaknesses and then retrains itself on that data. But at least build in these practices internally, so that models are reviewed and those reviews are consistently applied across a variety of risk points and failure points.
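
As a small sketch of the manual gap analysis described above, assuming an evaluation DataFrame with hypothetical "prediction", "label", and segment columns; the idea is simply to surface the slices where errors concentrate so they can get targeted fixes.

    import pandas as pd

    def error_by_segment(eval_df: pd.DataFrame, segment_col: str) -> pd.DataFrame:
        """Group evaluation results by a segment (e.g. region, product line)
        and surface the slices where the model is weakest."""
        eval_df = eval_df.assign(error=(eval_df["prediction"] != eval_df["label"]).astype(int))
        summary = (eval_df.groupby(segment_col)["error"]
                          .agg(["mean", "count"])
                          .rename(columns={"mean": "error_rate", "count": "n"})
                          .sort_values("error_rate", ascending=False))
        return summary

    # Usage sketch: weak = error_by_segment(results, "region")
    # Segments with an unusually high error_rate become candidates for targeted
    # data collection, retraining, or added guardrails.

Feeding those weak segments back into the next training or control cycle is what turns a one-off review into the improvement loop being described.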

Speaker 1:

It brings up a key question: if we're not using anti-fragility to build our AI systems, then why are we using AI systems at all?

Speaker 3:

Yeah, I think this is a very common thing that we talk about in software engineering too, which is that the only code that will never break and never has any problems is the code that never gets written. So the only model that will never fail and never have any problems is the model that doesn't exist and doesn't get deployed, because you didn't need it, because you had a process change you could make, because you could build systems that weren't as complex and were easier to explain. If we're going to build AI systems and we're going to put them in higher-risk situations, they need to be built with the appropriate guardrails. They need to be robust, resilient, and anti-fragile, and if we're not going to commit to that, we should ask why we're using an AI system for this problem in the first place.

Speaker 2:

That's a great perspective. I think that's a good way to end it right there. The broader issue we're uncovering with AI is the desire, and we've talked about this a lot, of "I'll plug in an LLM and I can turn my brain off." The industry is starting to see that they're not this crazy productivity enhancer, they're not the one tool for everything, this magic of what AI can do. We're not there, and having to engage your brain and know what you're doing, why you're building it, and how it's going to work is required, no matter what model paradigm you're using.

Speaker 1:

Even with the fanciest ChatGPT-style modeler, you've still got to do the hard yards of setting the process up.

Speaker 1:

That's right. And for the listeners hearing that differently, from a productivity standpoint: with using AI, this often comes up as people psychologically becoming uneasy that, by delegating to AI, and in particular LLMs and ChatGPT, they're feeling the effects of not having to work as much.

Speaker 1:

My argument, though, is that it's not working less; the work has shifted. Maybe AI has been helping you in one place, but now you've got to watch very carefully in another, so it's to be determined from a time-productivity standpoint. But, like Andrew was saying, the LLMs and the tools built off of them that we're seeing today are no excuse for looking away. And just to wrap it up: we want to encourage the community. That sounded pretty downer, but we do want to encourage you. The reason we talk about these issues is because we want to encourage you to build systems and then hold them to a higher standard. If you're really looking for big gains from AI, you've got to put in big expectations and see them through.

Speaker 2:

Yeah, well, that's the thing: these things are possible. Let's end on a positive note. You can make robust, resilient, and anti-fragile systems and processes, and we've seen them done many times before. We see it all the time with bridges. We have all the favorite examples from NASA and building complex dynamic systems, and taking interdisciplinary inspiration from control theory and the like: how do we build a system that performs as expected and can be resilient, and even anti-fragile, as we make the process better over time? This is just what high performance looks like and what high-performance teams do, and how you elicit the goals of the system. It's definitely possible.

Speaker 2:

Our main takeaway here is there is no easy button. It's hard yards, but it's definitely possible. And, as this podcast has said the whole time, it's about bringing stats back and looking at what's the ideal paradigm for the job, because there is no magic answer. But you can definitely make these systems highly performant. And, to close on the running examples, we're seeing runners get faster over time because we're getting better at training methodology, at not just being robust and resilient but anti-fragile and improving. We can do that in any discipline. But ask any runner, or anybody high-performance in any field: it's a lot of work. And I think, as an industry, we've gotten caught up in this "productivity-enhancing easy button, AI is the answer to everything" news cycle, hype train. That's just categorically false. The technologies are improving; you can make some really, really good systems, but you must remember the basics. The fundamentals matter. Do things the hard way and make really good systems.

Speaker 1:

Well said. Thank you, guys. This was a really engaging discussion, and I love it when we can pull our real-life passions into these conversations. For the listeners, we do have this in article form, and I'll share it in the show notes. A lot of good questions were asked in this episode, in particular pointing out process and practice problems versus cost and infrastructure problems in really being able to pay attention to resilience and building performant systems. So let's hear it in the comments, or respond to us on our feedback form. We'd love to hear your experiences, and maybe your opinions on where those problems lie. Until next time, thank you.


Podcasts we love

Check out these other fine podcasts recommended by us, not an algorithm.


The Shifting Privacy Left Podcast

Debra J. Farber (Shifting Privacy Left)