The AI Fundamentalists
A podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses.
Model validation: Robustness and resilience
Episode 8. This is the first in a series of episodes dedicated to model validation. Today, we focus on model robustness and resilience. From complex financial systems to why your gym might be overcrowded at New Year's, you've been directly affected by these aspects of model validation.
AI hype and consumer trust (0:03)
- FTC article highlights consumer concerns about AI's impact on lives and businesses (Oct 3, FTC)
- Increased public awareness of AI, and of the masses of data needed to train it, has heightened concern about potential misuse.
- Need for transparency and trust in AI's development and deployment.
Model validation and its importance in AI development (3:42)
- Importance of model validation in AI development, ensuring models are doing what they're supposed to do.
- FTC's heightened awareness of responsibility and the need for fair and unbiased AI practices.
- Model validation (targeted, specific) vs model evaluation (general, open-ended).
Model validation and resilience in machine learning (8:26)
- Collaboration between engineers and businesses to validate models for resilience and robustness.
- Resilience: how well a model handles adverse data scenarios.
- Robustness: model's ability to generalize to unforeseen data.
- Aerospace Engineering: models must be resilient and robust to perform well in real-world environments.
Statistical evaluation and modeling in machine learning (14:09)
- Statistical evaluation involves modeling a distribution without knowing everything about it, using methods like Monte Carlo sampling.
- Monte Carlo simulations originated in physics and are used to assess risk and uncertainty in decision-making.
Monte Carlo methods for analyzing model robustness and resilience (17:24)
- Monte Carlo simulations allow exploration of potential input spaces and estimation of underlying distribution.
- Useful when analytical solutions are unavailable.
- Sensitivity analysis and uncertainty analysis as the two major flavors of analysis.
Monte Carlo techniques and model validation (21:31)
- Versatility of Monte Carlo simulations in various fields.
- Using Monte Carlo experiments to explore semantic space vectors of language models like GPT.
- Importance of validating machine learning models through negative scenario analysis.
Stress testing and resiliency in finance and engineering (25:48)
- Importance of stress testing in finance, combining traditional methods with Monte Carlo techniques.
- Synthetic data's potential in modeling critical systems.
- Identifying potential gaps and vulnerabilities in critical systems.
Using operations research and model validation in AI development (30:13)
- Operations research can help find a capacity equilibrium for overcrowded gyms.
- Robust methods for solving complex problems in logistics and healthcare.
What did you think? Let us know.
Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:
- LinkedIn - Episode summaries, shares of cited articles, and more.
- YouTube - Was it something that we said? Good. Share your favorite quotes.
- Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses.

Susan Peich: Hi, everybody. Welcome to today's episode of The AI Fundamentalists. I'm here with Sid and Andrew, and today's episode is about model validation. But before we get into the topic, there was an interesting article released yesterday, October 3, by the FTC that ties into why we're talking about model validation, and maybe even into our next couple of episodes, where we'll really dig into the guts of models. We've talked about the inputs to models and the ecosystem around them, but these next few episodes will dig into the models themselves. In the article, the FTC is really taking the pulse of consumers and the concerns they're voicing about AI. It was interesting to see the comments they cited, because it got me thinking: the FTC taking this pulse is starting to hit on the backswing of what happens when you give a lot of hype and awareness to a thing like artificial intelligence. The hype did such a good job of making it known to everybody that artificial intelligence is integrated into your life and the products you use. Cool. But then look at some of the examples they call out in the article, which we'll post in the episode notes. One of the situations they mention is the common one people have dealt with for years: "this call is being recorded" on a customer service call. Long ago, if you had older relatives, it was kind of cute for them to say, "Well, I'm not letting them record my call in the name of customer service," and you'd laugh. But now that this awareness is out there, the pulse is getting stronger, according to this report, because people understand that the call they allowed to be recorded "for training purposes" is now going into other systems, because the news has made that plain. So I really do think we're coming into what the FTC has highlighted: the mainstream consumer now understands enough to ask questions. And I say that thanks to the hype, or no thanks to the hype, depending on how you look at it, because these questions might not have come up in years or decades past, when machines and artificial intelligence were the unsexy back end of the products you used. I just thought that was interesting. What do you guys think?
Sid Mangalik: Yeah, I think this cues really well into how public perception of AI changes as hype goes up. As hype goes up, we see these doomsday scenarios come to people's minds: people are going to use AI to scam me, people are going to use AI to create false and fallacious products and images and distribute them to me. So with hype comes the responsibility of doing things in a way that ensures trust and a good image around using artificial intelligence. Because right now, as we see from the FTC, people are feeling pretty negative, and maybe even a little scared of what these models can do, and worried that they're not working in the best interests of people.
Andrew Clark: Yeah, and that's a really interesting point. I think because of the over-mediatization, if that's a word, of AI at the moment, people are giving it a bit of a life of its own. From the FTC's perspective, their job is to protect American consumers from bad business practices, discriminatory action, and things like that, and I think they're doing a good job of it. It's less about the technology: it doesn't matter whether a company has individuals doing things, RPA, machine learning, or ChatGPT. It's the outcomes that matter: that you're upholding the laws of the land, that you're being fair and not biased or anti-competitive, that consumers are getting the best experience. So, to tamp down the "I'm scared of AI" reaction: the FTC doesn't really have a say on whether you use AI or not, but they do have a say on how you treat your customers. Are you consistently doing what you should be doing, pricing fairly and properly, and not being discriminatory?
Susan Peich: That's an honest answer. And Sid, I loved how you put that: with the hype comes the responsibility, like you said, the heightened awareness around that responsibility. Andrew, I know off-episode you've been talking to us a lot about the expectations of the FTC, and you follow it pretty closely. So let's take a look at that in terms of why we're covering model validation, and maybe why we're taking this approach first.
Andrew Clark: Definitely, and that's a key part, because model validation is really making sure that your models are doing what you think they're doing, given their specific expectations. We've talked a little bit about models in the past, and we need to do a couple more deep-dive episodes, but a model is just an abstraction of reality. Because reality is a whole world of complex interactions, with parts we don't even fully understand in physics and quantum physics, you have to abstract it into something manageable that you can build a model on. The only reason companies use models is to be more efficient or more reliable than having individuals do things. The end goal of AI-driven systems is that they can actually be more fair, unbiased, and consistent than humans, because humans have our own biases and inconsistencies. Is it a Friday afternoon, do you have great plans for the weekend, have you mentally checked out? Modeling systems don't have that problem. So the key goal is that modeling systems can actually put us in a more fair, better, more equitable place, but we have to build them properly. And this is one of the big problems: none of us on this podcast are big fans of LLMs in their current generation. They're trying to do everything, and models cannot do everything; the smaller the scope you give a model, the more accurate it can be. Turning to the FTC perspective on pricing: Amazon is getting some heat right now because they allegedly used an unfair, anti-competitive pricing algorithm to raise prices across the board on consumers, in kind of a first antitrust action of this sort. That's the kind of thing the FTC is concerned about. Whether a company does that through meetings at Starbucks or via an algorithm doesn't really matter from the FTC's perspective. But from our perspective, the model question is: how do you make sure your model is serving its intended purpose, following all the laws of the land, and staying within its limited scope? That's the key fundamental aspect to look at. And what we're digging into today is validation of the model: if you know what the goals and parameters are for this subset of the world, how do we make sure it's doing what you think it should be doing to meet business objectives and regulatory requirements?
Sid Mangalik: Yeah, that's exactly right. And this lends itself to saying that, in order to validate a model, you have to know what it's supposed to do. If your model is supposed to do everything, how can you possibly validate everything? You're in a situation where the scope of validation is so large that it's basically impossible, so you're just letting your model do whatever it does, and then you really can't verify that it does what it does. So model validation, then, is a specific form of verification against specific expectations and model specifications. It's targeted; it has specific goals, targets, and objectives. Rather than "well, the model works well some of the time when I chat with it," which is very different from a more formal verification.
Susan Peich: Speaking of validation, there's one thing I wish one or both of you would clear up. In searches you'll even see these terms conflated: model validation versus model evaluation. Can one of you, for the sake of this podcast and the ongoing discussion, distinguish between the two so that we're clear about model validation?
Andrew Clark: That's very context-specific, and it depends on what you mean by evaluation. From my perspective anyway — Sid, please weigh in with your thoughts — evaluation is more like a consumer asking: is it doing what I want it to do? You're not actually formally interrogating it, stress testing the edges, and figuring out where the model works or doesn't work. You're asking: does this work for me? Is it helpful? Did it make my to-do list properly, did it misspell something? Versus validation, which is something more from the engineering perspective that we do to make sure the system is in compliance with the business objectives and regulatory requirements, to the point that we're confident in it. We've talked before — when I made all those errors about the Richter scale — about the Fukushima disaster, where engineers will traditionally say: this is valid up to this specification. In civil engineering I think that's a common practice, and in aerospace engineering too: it can handle X amount of G-forces, or whatever. We want to do that from a modeling perspective and bring those inspirations in, so we can validate that the model does what we want it to do, and then put the appropriate guardrails in place to make sure it can't go outside those safe bounds. And then, when the consumer evaluates it, it either meets their use cases or it doesn't.
Sid Mangalik: Yeah, and I'll just bring that the other way: model evaluation is an engineering problem, while model validation is a business problem. Model evaluation asks, does my model give me the correct answer? But model validation targets bigger, broader questions: is my model resilient, is it robust, does it work in all scenarios, does it work the way I expected it to work? Whereas evaluation is just the literal "did it give me the right answer?" So it falls to the engineering problem to help solve the business problem.
Susan Peich: Perfect, and I think that really helps sum it up. As we dig into validation — and like we said, in future episodes we'll talk about performance, bias, and fairness — today we're really going to focus on resilience and robustness in terms of model validation. Andrew, talk to us a little bit about the difference between resilience and robustness.
Andrew Clark: Definitely. So resilience is really how the model handles adverse data scenarios: how resilient is it to seeing bad data, and how does it perform under those less-than-ideal conditions? In athletics, maybe you can do a bunch of pull-ups on your own bar at home, but the second you go outside in the rain, with someone watching over your shoulder, your score comes down. It's the same with any more stressful environment and those adverse scenarios. So how does your model do when it hits adverse scenarios — is it resilient to them? Robustness, on the other hand, is really how well the model generalizes to unforeseen data, things it hasn't seen before. This is a really big problem in machine learning: overfitting. Models are very good on the data they've seen, but they don't generalize well to new data. When we're building these performant, safe systems, we want to make sure they're resilient and robust, because if your model doesn't have those properties, it's really not terribly useful in the real world. That's a big dichotomy we see in academia: you have these nice toy datasets you optimize on, it looks great — it's the Kaggle effect — and then you bring it outside of its environment and it falls apart. But you get the paper credit, because it looked great in a pristine environment. We don't want pristine; we prefer lower performance in pristine conditions but better performance in the messy real world.
Susan Peich: You've put this in the context of aerospace engineering before. Can you continue and put it into those terms?
Andrew Clark: Definitely. So in aerospace engineering, the physical properties are mostly worked out. We know from physics things like the curvature of the earth — I'm not a physicist, so I don't know all the details, but these properties have been worked out by Isaac Newton and the people after him. We know what we're evaluating against: wind speed, G-forces, things like that. To stress test the system, they'll use flight simulators — which are also kind of like the digital twin we talked about last episode — and that gives a benchmark to validate against: hey, I want to make sure this fighter plane can handle 10 Gs; well, we can test against 10 Gs and see what happens. You can also build physical models of some of these things, not just digital ones, and test those. The problem with machine learning, and why as a general rule we prefer other types of modeling systems when possible, is that machine learning is really about mining for correlations versus modeling causation. Causation says this causes that; machine learning is often used just to extract value from data. One of the common examples: people eat ice cream at the beach, and there are more shark attacks. So does ice cream cause sharks? No. It just means it's hot, it's summer, more people are at the beach eating ice cream on vacation, so the probability of getting bitten by a shark — which is still ridiculously low — is a bit higher. But a machine learning algorithm, or the ChatGPTs of the world, will infer that ice cream equals shark attacks. That's a correlation, not causality. A lot of other modeling paradigms do model causality, and in the physical realm we know those relationships better. So I think there's a lot of misconception there, and that's why these validations become so important, especially for machine learning models: they're not built on specific mathematical specifications that we know prove out, so we have to be really intentional about ensuring resilience and robustness so we don't get crazy effects.
Sid Mangalik: Yeah. So it sounds like we're not doing physical evaluation, we're doing statistical evaluation, right? In a physical system there are hard-and-fast things you can calculate — we know gravity, it has a fixed value, and we can test against that number. That's not the case with statistical models and machine learning. We're dealing with a statistical level of evaluation: is this close enough, statistically, to our expected result — much more so than in a physical system or even a simulated physical system. So let's talk about how you do this type of statistical evaluation. How can you learn about the properties of a distribution, so you can model it, without knowing everything? Because we can't know everything. If we want to build a product that's going to be used all across America, we can't ask every American how many car accidents they've been in. We have to statistically model that and get as reasonably close as we can. So what's a good method for that? We'll go back to Monte Carlo as a way of sampling to learn about the distribution we care about. If the population is every single American, we can't ask every single American, so we strategically sample a group of Americans to learn about the true distribution of the data. I'll let Andrew talk a little about the history, and then maybe I'll come back and talk about how it works.
Andrew Clark: Definitely — we should do a separate episode specifically on probability distributions and the sampling aspect, because this gets very deep and technical. But really it comes back to what we've said since day one of the podcast: we're bringing stats back. Statistics has fallen out of favor a little bit in some of these quantitative fields, because "machine learning is better," right? Wrong. Machine learning is a tool in your toolbox. It's good for some things, like predicting vending machine usage — great, I don't care whether there's causation there — but there are areas where I do care. So for us, using Monte Carlo to determine the robustness of a model and its parameters, and to tweak the statistical distributions: it's really a type of numerical analysis that lets us perturb those distributions, check different parameters, and see, given a wide variety of parameter inputs, what the respective outputs are. Monte Carlo came out of physics — physicists working on the Manhattan Project is the first documented use we've seen, in the 1940s. A lot of these technologies came out of the World War II era: the first computers, the first computer-driven models and simulations. So Monte Carlo is really an underlying mathematical approach for identifying the space of possible outcome events, which allows you to assess the risk and impact of those events and make better decisions under uncertain conditions. It's really "what if" scenario testing. NASA pioneered this with the Apollo program and subsequent programs, and it's one of the main ways they validate systems before they evaluate whether those systems were successful on spacecraft.
Sid Mangalik: So what is Monte Carlo doing that tests robustness rather than resilience? If resilience is about testing the negative situations and the bad operating conditions, Monte Carlo is testing holistically — as close as you can get to everything. It lets you get a taste of the good times and the bad times, exploring the space of potential inputs your model is going to run into. Think about flipping a coin: if you flip a coin 10 times, realistically you're not going to get five and five; you might even get nine and one. And if you flip a coin 10,000 times, you're still not going to get exactly 5,000 and 5,000. But what Monte Carlo gives us is the notion that as we increase the number of samples we take, we at least approach the correct underlying distribution. Even though we can never examine every coin flip that will ever happen, we can learn something about the general behavior and the general space these models exist in by looking at a representative sample of the outcome distribution, even if we can't know the true distribution.
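To make the convergence idea concrete, here is a minimal sketch (not from the episode) that estimates the probability of heads from increasingly large Monte Carlo samples; the fair coin is just an illustrative stand-in for an unknown distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Estimate P(heads) from increasingly large Monte Carlo samples.
for n_flips in (10, 100, 10_000, 1_000_000):
    flips = rng.integers(0, 2, size=n_flips)  # 0 = tails, 1 = heads
    print(f"{n_flips:>9} flips -> estimated P(heads) = {flips.mean():.4f}")

# Small samples can land far from 0.5 (even nine-one), but the estimate
# approaches the true underlying probability as the sample count grows.
```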
Andrew Clark: In mathematical terms, one reason we do this is that if you can't analytically solve an equation — you may not even know the equation to solve with algebra — Monte Carlo lets you approximate the underlying solution numerically rather than analytically. That's the mathy part of why we do it. So, Sid, do you want to talk a little about the two resulting analyses we usually do in conjunction with Monte Carlo? When we do Monte Carlo, we create a representative dynamical system model of whatever we're trying to evaluate, we figure out the parameters we want to sweep over along with their distributions, and we run it many times — as Sid was mentioning, maybe 1,000 runs per variation of a parameter — to understand what the true distribution looks like on the other side. And then what are the analysis techniques we usually use to evaluate that?
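To illustrate the "numerically rather than analytically" point, here is a hedged sketch that estimates an integral by random sampling; the integrand is an arbitrary example chosen because it has no elementary closed-form antiderivative, not something discussed in the episode.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Estimate the integral of exp(-x**2) over [0, 1] by averaging the integrand
# at uniformly sampled points instead of solving it analytically.
samples = rng.uniform(0.0, 1.0, size=1_000_000)
estimate = np.exp(-samples**2).mean()  # interval length is 1, so mean == integral
print(f"Monte Carlo estimate: {estimate:.5f}")  # ~0.7468, matching numerical quadrature
```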
Sid Mangalik: Yeah, so there are basically two major flavors of analysis, and you can think of them as trying to do the same thing from opposite ends of the problem: sensitivity analysis on the front end, uncertainty analysis on the back end. The idea with sensitivity analysis is that we want to see how the outcomes change relative to changes in the inputs; with uncertainty analysis, we're looking at what the outputs themselves look like — what is the shape of the output distribution? Digging into sensitivity, again, that's about our inputs. You may have seen this as feature importance if you've used scikit-learn, or even some of the nicer Keras models will do this type of work for you. It's the idea of: if I tweak the age parameter, do my outcomes shift drastically? If I put a 13-year-old and a 68-year-old into my model, do their outcomes flip, or do the outcomes move smoothly along a curve? This lets us explore the sensitivity of specific inputs on the output space, so we learn a little about the features we're feeding the model. The other side is uncertainty analysis: we look at all the outcomes we've generated and the associated probability of each. If the output is a continuous variable, that's a shape; if it's categorical, it's a set of probabilities. So we learn, as the inputs change, how the outputs change, and what the outputs look like overall. Both let us take our exploratory Monte Carlo data — which is a really nice sample of data — tweak the inputs to see how the outputs respond, and then examine how the outputs are distributed in general.
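A compact sketch of both flavors on a toy model. The scoring function and input distributions below are made up for illustration, not anything discussed on the show.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def toy_model(age, income):
    # Hypothetical stand-in for a trained model's prediction.
    return 1 / (1 + np.exp(-(0.03 * age + 0.00001 * income - 2.0)))

n = 10_000
age = rng.uniform(18, 80, size=n)            # assumed input distributions
income = rng.normal(60_000, 15_000, size=n)

# Uncertainty analysis: what does the output distribution look like?
outputs = toy_model(age, income)
print("output percentiles (5th, 50th, 95th):",
      np.percentile(outputs, [5, 50, 95]).round(3))

# Sensitivity analysis: sweep one input while holding the other fixed
# and watch how much the output moves.
age_sweep = np.linspace(18, 80, 50)
income_sweep = np.linspace(20_000, 120_000, 50)
age_effect = toy_model(age_sweep, np.full_like(age_sweep, 60_000))
income_effect = toy_model(np.full_like(income_sweep, 40), income_sweep)
print(f"age sweep moves output by {age_effect.max() - age_effect.min():.3f}")
print(f"income sweep moves output by {income_effect.max() - income_effect.min():.3f}")
```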
Andrew Clark: For me it's a very fascinating field. I love running Monte Carlo simulations; I use them for a lot of the different validations and things we do. But it's one of those areas that isn't really taught within single disciplines too much — and definitely not, I think, in computer science; Sid, correct me if I'm wrong, if you've had any experience doing it there. It's very much something taught in the traditional engineering disciplines — civil, electrical, and the systems engineering we talk a lot about — and it's used a lot in economics as well. So it's one of those interdisciplinary tools that we think should be used far more often for validating digital systems, not just physical systems.
Susan Peich: Before we move on — that's a really good synopsis of robustness and some of the testing involved. Is there anything we'd want to add, to wrap up before we get to the resilience part, about Monte Carlo techniques or any of the analyses we run that would help people understand the limitations of an LLM?
Andrew Clark: We could do an experiment on that: figure out the different parameters of what we want to test, different questions and different scenarios, and run them through. Monte Carlo can essentially be a glorified for loop over an LLM system: you have a bunch of different inputs — we talked about the feature analysis — so you could take a bunch of different questions, for example, run them all through the model, see what the outputs are, analyze those, and see what percentage are accurate, and so on. It's a little harder with a large language model, especially if it's prebuilt and you don't have parameters you can tweak, but it's still a useful technique for running some validations. Sid, I'd love for you to jump in on that.
Sid Mangalik: Yeah, and that's genuinely hard for models that are totally black box, right? If you want to do this for GPT-4, good luck. But if you want to do it on a LLaMA, or even an older GPT that was made openly available, there are opportunities to explore those semantic space vectors. You can do a traditional type of Monte Carlo experiment, moving through that semantic space vector: let's tweak dimension one, let's tweak dimension 50, explore that space a little, and try to apply Monte Carlo to it. Now, again, this is going to be tough to validate, because we don't know what the correct answer is for a lot of these questions. But we can at least explore a little when we're allowed a bit more insight into the model.
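Here is a rough sketch of the "glorified for loop" framing from this exchange, assuming you can call whatever LLM you are validating. The `query_model` stub, the question set, and the paraphrase templates are all placeholders, not anything from the episode.

```python
import random

def query_model(prompt: str) -> str:
    # Placeholder: swap in a real call to the LLM you are validating
    # (an API request, a local LLaMA checkpoint, etc.).
    return "Abraham Lincoln was born in 1809."

factual_questions = {
    "In what year was Abraham Lincoln born?": "1809",
    "Who delivered the Gettysburg Address?": "Abraham Lincoln",
}

paraphrase_templates = [
    "{q}", "Answer briefly: {q}", "Please respond with only the fact: {q}",
]

# Monte Carlo over prompt variations: same facts, many phrasings,
# then measure how often the expected answer appears in the response.
results = []
for question, expected in factual_questions.items():
    for _ in range(100):
        prompt = random.choice(paraphrase_templates).format(q=question)
        answer = query_model(prompt)
        results.append(expected.lower() in answer.lower())

print(f"estimated factual accuracy: {sum(results) / len(results):.1%}")
```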
Andrew Clark: And if it's weird — if it falls off the rails with a crazy answer — you'll see it. But Sid, I think you hit on the main point: the main goal of Monte Carlo is normally the parameter sweeps, and that's where it's less useful at the black-box stage, when the model is already pre-built, versus when I'm building or creating a system. That's a key part of how we advocate for validation: validating through the whole model development lifecycle, not just "hey, we shipped a thing, now let's look at it post hoc." The Monte Carlo technique itself is most valuable in that parameter sweep stage — sweeping hyperparameters, the things you can change that are not learned from the data — and seeing where the model performs best.
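One common way to run that kind of hyperparameter sweep in practice is randomized search; this is a minimal scikit-learn sketch on a synthetic dataset, and the model choice and parameter ranges are purely illustrative.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Randomly sample hyperparameter combinations (a Monte Carlo-style sweep)
# instead of exhaustively gridding the space.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(2, 20),
        "max_features": uniform(0.1, 0.9),
    },
    n_iter=20, cv=5, scoring="accuracy", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```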
Susan Peich: Thank you for that. That's really helpful for putting this in the context of what people might be experiencing as they're pushed to evolve their modeling systems and solve problems with them. It's also a good segue into stress testing and resilience, and I'll leave it to you guys, since I seem to be getting tongue-tied.
Sid Mangalik: Well, I'll start this one, but I know this is Andrew's wheelhouse through and through. So, resilience again contrasts with robustness: where robustness asks how your model does across the whole input space, resilience focuses on the bad inputs. We're talking about things your model has never seen — catastrophic events, black swan events, things that are really going to stress your model. And "stress" means something it hasn't seen before, which really pushes to see: what has the model learned? Has it learned anything intrinsic and inherent about the world? This is testing that those things are part of the model. I'll hand it over to Andrew now to talk about some specific examples of what this negative scenario analysis might look like.
Andrew Clark: Definitely. So stress testing — that resiliency testing — is a very common thing in finance and financial settings. CCAR is what a lot of large banks have to go through with the Federal Reserve every year: banks over a critical size have to submit to stress testing, where the Federal Reserve essentially defines different scenarios — what happens to your bank's balance sheet if interest rates go up by 2 percent, and things like that — and walks through them. So it's really scenario analysis. Personally, when we're validating systems, I like to use Monte Carlo and stress testing in conjunction. In a lot of settings, especially at large banks, stress testing means they just have traditional models and they're seeing what happens under one scenario at a time. The reason I like to combine the two is that traditional stress testing is: we create a handful of scenarios, you tell us how you do on those specific scenarios, and you're done. That isn't actually stressing the edges of your system. With Monte Carlo, we're literally trying to do that sweep: we want to break your system so we know where it breaks, so we know how to dial it back a little. Like in the airplane example — we want it to be digital so we're not hurting anybody, but we want to push until the wings literally fall off the thing, so we know: okay, it can only handle Mach 2, it can't handle Mach 3, or it can only handle this many G-forces. Then you know what the safe operating parameters are, and you can put them in the manual and so on. So stress testing is really that resiliency to different scenarios, and it's very common in finance. But there's a lot in traditional quantitative finance that can be improved, like using Monte Carlo to enhance your stress testing instead of just a fixed set of scenarios — and, as we talked about in the synthetic data episode, Gaussian copulas and things like that. We're about bringing stats back, but bringing them back in a robust way — nonparametrics is a whole other conversation — and getting away from the easy button, even in the statistical realm.
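A hedged sketch of the "break it, then dial it back" idea: sweep an escalating shock to one input of a fitted model until its error degrades past a tolerance, which marks the edge of the safe operating range. The toy data, model, and tolerance are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1_000, n_features=5, noise=5.0, random_state=0)
model = Ridge().fit(X, y)
baseline_mae = mean_absolute_error(y, model.predict(X))

# Sweep an escalating shock on one feature until error exceeds a tolerance,
# i.e. deliberately "break" the model to find its safe operating range.
tolerance = 3 * baseline_mae
for shock in np.linspace(0, 5, 26):              # shock measured in std devs
    X_stressed = X.copy()
    X_stressed[:, 0] += shock * X[:, 0].std()
    mae = mean_absolute_error(y, model.predict(X_stressed))
    if mae > tolerance:
        print(f"degrades past tolerance at a shock of ~{shock:.1f} std devs")
        break
else:
    print("stayed within tolerance across the whole sweep")
```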
Sid Mangalik: Yeah, I think that's good. And calling back to the synthetic data episode is perfect, right? If you want to explore data that doesn't exist in the real world — say you have a financial model and you want to see what happens if the unemployment rate increases by 10 percent — you need to make some synthetic data to make that happen. So please refer back to that episode on what it means to make good synthetic data. With good synthetic data that really models the world under a major shift like that, running it through your model again gives you a real sense of your model's resiliency to major shocks.
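A minimal sketch of that kind of synthetic shock scenario: shift one economic input in otherwise-observed data and compare the model's predicted outcome rates before and after. The feature names, the data-generating process, and the model are hypothetical stand-ins, not the episode's example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=7)

# Hypothetical historical data: unemployment rate and a loan-default outcome.
history = pd.DataFrame({
    "unemployment": rng.normal(5.0, 1.0, size=5_000),
    "debt_to_income": rng.uniform(0.1, 0.6, size=5_000),
})
history["default"] = (rng.random(5_000) < 0.01 * history["unemployment"]).astype(int)

features = ["unemployment", "debt_to_income"]
model = LogisticRegression().fit(history[features], history["default"])

# Synthetic stress scenario: unemployment jumps 10 percentage points,
# everything else stays as observed; compare predicted default rates.
stressed = history.copy()
stressed["unemployment"] += 10.0

base_rate = model.predict_proba(history[features])[:, 1].mean()
stress_rate = model.predict_proba(stressed[features])[:, 1].mean()
print(f"mean predicted default rate: baseline {base_rate:.2%}, stressed {stress_rate:.2%}")
```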
Susan Peich: Andrew, you mentioned earlier some of the stress tests in the financial world. Are there any other significant examples you can think of that might be tested this way?
Andrew Clark: I know in a lot of areas, like aerospace and civil engineering, they're doing these things, and I hope that for any critical system people are doing these stress tests — and doing them in a "let's break it and then scale it back" way. That's also where monitoring comes in. In the discussion of MLOps and monitoring, monitoring is basically meaningless unless you know what you're monitoring. One of the key ways we see it is like bowling: we need guardrails up because we don't want gutter balls, so we need to actually stress test and figure out, oh, if I throw this right-hook thing it's going to go into the gutter — great, don't do that, and let's put a guardrail up so it bounces off the rail and stays in the lane. Without doing these types of stress tests, we won't know where those guardrails should be. Stress testing is used across engineering and in a lot of financial areas. I used to work on dynamical system models and simulations for building economic communities and figuring out where the gaps are — a lot of my PhD work was on that. So it is being used in a lot of other spaces, including high finance. The question is whether it's being done on all critical systems and where we need it. It's really a case-by-case basis because, as we talked about earlier, there aren't regulations — and there can't really be a regulation that says you must run this exact stress test, because it's so specific to your use case. You are held accountable for the outputs.
Sid Mangalik: And I'll give a fun example that's not financial of where you might do this type of resilience modeling. Gyms and ERs have a very similar problem, which is that everyone can't go at once — they're not designed for that. I think a study published recently said that if 5 percent of people who hold a gym membership went to the gym, the gym would be full. So there's a modeling problem here: how much equipment do I buy, how much space do I need, how many memberships can I accept, how much should I charge to discourage people past a certain point? These are problems with a resiliency aspect to them: learning where the happy equilibrium is, and validating that your model is going to be fine on January 2, when everyone wants to go to the gym for the first time.
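A quick sketch of how you might Monte Carlo that capacity question; the membership count, floor capacity, and show-up rate are assumed numbers, with the 5 percent figure borrowed from the study Sid mentions.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

members, capacity = 5_000, 250          # assumed membership size and floor capacity
january_show_up_rate = 0.05             # the "5% of members show up" figure

# Monte Carlo over simulated days: how often does peak-season demand
# exceed the gym's capacity?
daily_attendance = rng.binomial(members, january_show_up_rate, size=100_000)
p_overcrowded = (daily_attendance > capacity).mean()
print(f"P(attendance exceeds capacity on a January day) ~ {p_overcrowded:.1%}")
```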
Susan Peich: Oh, good point.
Andrew Clark: Sid, thanks for bringing that up — that's a great example. Operations research, with all the different queueing theories and arrival-time problems, is a fascinating, huge, rich area. Great point. Logistics is definitely another area where people use these methods. And I honestly forget about it: I've taken a lot of operations research classes, I love operations research — I'm a member of INFORMS and have their certification — but I still forget about it. It's a great example of an area where I wish the machine learning, AI, and computer science communities would look at other disciplines more, because operations research has tackled exactly the kinds of examples Sid gave, about gyms and hospitals. Literally, how do we staff hospitals? That's a solved — well, it's not a solved problem, it never fully will be — but we have robust methods, tested over years, for preventing overcapacity. You only hear about operations research or supply chains when something goes wrong, and you rarely hear about something going wrong. COVID messed up shipping, and we know about that. But think about how complex it all is: your UPS packages being routed around the US, or airlines, where one delay has a cascading effect. That's all operations research, and algorithms that determine how to update these things, and we rarely hear about major issues. Operations research is a rich area the ML community should be looking at for how to validate, how to stress test, and for different ways to solve problems, because it's kind of a silent but successful discipline.
Susan Peich: As you're saying this, I'm sitting here thinking about our "rose and thorn" framing. I'll start with the thorn: so much hype has been dedicated to the things generative AI and LLMs have brought up — but are the problems that have been pushed to the forefront really the major problems that need solving? The rose I see, in what you're saying, is that there is so much opportunity to solve real problems — whether with generative AI or with a myriad of different modeling tools and types, and ways to stress test and validate them — that can really get to the heart of problems that need to be solved, like the examples you both just raised.
Andrew Clark: Yeah, and that's where AI in general really frustrates me, because operations research has gotten this down to almost a science — I think we can call it a science. Think about manufacturing, shipping, any type of logistics, even military operations, or massive civil engineering projects like building dams. People have been doing this stuff since the pyramids, and there are specific methods. We can always improve those methods, but there are experts in this field; we just forget about them because ChatGPT can make something up. So the real takeaway here is: use a model that's specific to its task. The biggest pro and the biggest con of the LLM, ChatGPT world is the idea that one model can do everything. People like the easy button. There are no easy buttons, but you can get very good models that solve specific problems. So, as a society, what do we think is the problem? Let's solve that problem, but let's figure out the best model for it, and stop mistaking correlation for causality. Just because something is correlated — shark attacks and ice cream — doesn't make it fact, and just because an LLM can sound like a human and get something right does not mean it will always be right. Without doing the stress testing, and without doing the robustness and resilience validation, we won't know whether it's saying Abraham Lincoln was born in 1902. You have to test those things. Just because it can recite the Gettysburg Address when asked does not mean it knows when Abraham Lincoln was born or what his accomplishments were. Maybe it says he was the first man on the moon — it might sound plausible. There are a lot of plausible things coming out of these systems that are not factual.
Susan Peich: This slight tangent we went on is, I think, also a necessary one, because it gives us a little bit of hope.
Sid Mangalik: Yeah, I mean, the hope of model validation is that we're going to be the anti-hype: we're going to know what the model does, we're going to give people what they asked for, and we're going to have strong, robust, and resilient models — which people expect everywhere else in their lives, but something we haven't felt we had to do in AI. So, making that a fundamental part of model building.
Andrew Clark: One of the modern marvels that nobody talks about is Google Maps: network optimization at its finest, a true modern marvel. It's all modeling.
Susan Peich: Interesting. So we'll dig into that in future episodes. Along these lines, we're going to get into performance, and bias and fairness, in upcoming episodes, so please stay tuned, and thanks for tuning in today. If you have any questions about what we discussed — there's a lot of good information here about really improving and evolving your modeling systems — please reach out to us at The AI Fundamentalists.