The AI Fundamentalists

Baseline modeling and its critical role in AI and business performance

April 16, 2024 | Dr. Andrew Clark & Sid Mangalik | Season 1, Episode 17

Baseline modeling is a necessary part of model validation. In our expert opinion, it should be required before model deployment. There are many baseline modeling types and in this episode, we're discussing their use cases, strengths, and weaknesses. We're sure you'll appreciate a fresh take on how to improve your modeling practices.

Show notes

Introductions and news: why reporting and visibility is a good thing for AI 0:03

  • Spoiler alert: Providing visibility to AI bias audits does NOT mean exposing trade secrets. Some reports claim otherwise.
  • Discussion about AI regulation in the context of current events and how regulation is playing out between Boeing and the FAA (tbc)

Understanding baseline modeling for machine learning 7:41

  • Establishing baselines allows us to understand how models perform relative to simple rules-based models, aka heuristics.
  • Reporting results without baselines to compare against is like giving a movie a rating of 5 without telling the listener that you were using a 10-point scale.
  • Baseline modeling comparisons are part of rigorous model validations and should always be conducted during early model development and final production deployment.
  • Baseline comparisons pair with analyses of theoretical upper bounds for modeling performance, showing where your technique falls between acceptable worst-case and best-case performance.
  • We often find complex models deployed in the real world that haven’t proven their value over simpler, explainable baseline models.

Classification baselines and model performance comparison 19:40

  • Uniform Random Selection - simulate how your model does against a baseline that guesses classes randomly, like rolling a die.
  • Most Frequent Class (MFC) - often the most telling test, especially for highly skewed data paired with inappropriate metrics.
  • Single-feature modeling - Validates how much the complex signal from your data and model improves over a bare minimum explainable model.
  • And more…

Exploring regression and more advanced baselines for modeling 24:11

  • Regression baselines: mean, median, mode, single-variable linear regression, lag 1, and last-5 smoothing
  • Advanced baselines in language and vision

Conclusions 35:39

  • Baseline modeling is a necessary part of model validation
  • There are differing flavors of baselines that are appropriate for all types of modeling
  • Baselines are needed to establish fair and realistic lower bounds for performance
  • If your model can’t perform significantly better than a baseline, consider scrapping the model and trying a new approach

What did you think? Let us know.

Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:

  • LinkedIn - Episode summaries, shares of cited articles, and more.
  • YouTube - Was it something that we said? Good. Share your favorite quotes.
  • Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
Transcript

Speaker 1:

The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik. Hello everyone, welcome to today's episode of the AI Fundamentalists, where today we are going to dig into the topic of baseline modeling. Baselining is one of those things that we say we do and we know should be done, but there are a lot more nuances and practices around it, and I think, Andrew, you've seen some good examples, both in building models and in the press, of why baselining is so important.

Speaker 2:

A lot of industries do this kind of thing, and the OCC actually requires it for financial models, but we don't see it that often in the data science world. What am I comparing this to? Oftentimes data scientists will grab a set of data, import scikit-learn, and they're off to the races: ooh, I read about XGBoost, let's roll with that. But really understanding that business perspective matters: what am I currently doing now? What is a good baseline other businesses are using? Or even, as we're going to get into some of the specifics today of what we can mean by that, the point is just having some sort of comparison.

Speaker 2:

I need to compare it to something, because as modelers we often get in this trap of: I'm just going to start using a cool tool, and whatever results I get I can optimize, feature engineer, and make better. But could you have done it so much simpler? What are you baselining against? Having something to compare to, versus just looking at accuracy and specific metrics for the sake of the metrics. Oftentimes you read these statistics about data science, that most projects fail and things like that, because they're so disconnected from the value to the business or what you're actually trying to accomplish. So in our mind, baseline modeling is a good way to really start hammering that in and solidifying it.

Speaker 1:

With regard to baselining, we're going to get into the details later, but let's take discrimination and model bias, for example. We've seen an article quoted in the AP arguing that a more intensive but accurate way to identify discrimination in a model would be to require bias audits, which are tests to determine whether AI is discriminating or not, and to make those results public. And that's where the industry starts to push back, arguing that that would expose trade secrets.

Speaker 2:

I fully understand, and I think people are taking it to extremes. Asking any part of a company, "hey, I want you to audit this and publish the results," that's not going to fly for anything; let's take bias out of the conversation. There are parallels in how financial reporting currently works, but no company is ever going to be okay with somebody looking at their IP, exposing it, or publicly declaring whether it's good or not. However, companies are now taking requests that might be a little naive about what they're asking for and responding with "we're never going to do that," when really what's being asked for is exposing results.

Speaker 2:

Well, there's definitely a middle ground, like how publicly traded companies have financial audits, where a trusted third party, and of course there are issues with them too, there always are when humans are involved, will say that in all material respects you're meeting whatever framework and there are no material exceptions. That's very legalese wording, but those auditors are heavily regulated; a new fine just came out against one of the Big Four firms for not doing some things properly, so people are watching them closely right now. Some sort of independent audit with a standard they're auditing against, that will tell investors, or the public, or whatever body we want, whether you're meeting it or not, in a way that keeps the IP protected but still protects the consumer, that's what we should be aiming for. People are sometimes going a little too extreme; publishing results about anything inside a company is a little extreme in my mind.

Speaker 3:

Yeah, and it's just so classic that if you're not forced or required to release this type of information, people are going to dig in their heels and say, well, I'm not going to do anything. Maybe they feel they have IP rights, research rights, development they've done in secret, and they don't want to give it up, so there's going to be a lot of pushback where they act like, oh, we definitely can't do this, until regulation rolls around and says, well, you have to do it, so here you go.

Speaker 2:

Yeah, and there's definitely a middle ground that could work. What is the end goal we're trying to achieve? Let's think about that. If we're trying to protect consumers and make sure everything is fair and unbiased, let's start from that goal: how do we make that happen? Versus, sometimes I do think it's about political victories, or "I can make a company do something," things like that which muddy the waters. If we're trying to protect consumers and individuals and make sure everything's fair and unbiased, let's optimize for that and find something companies can't disagree with and can get passed, because if you say something as extreme as "we're exposing your IP, you must publish it," there's going to be enough lobbying to shoot that down. Do something they can't shoot down, versus going for a moonshot. Sometimes it seems like, does anybody really think that's going to happen, or are they just trying to get airtime right now? I honestly am not sure.

Speaker 1:

I'll speak on behalf of the consumer. There's enough awareness out there now about how model-driven some of the decisions about them are, and about the data they provide to make those decisions. With that awareness, there are just going to be more questions for companies that say, well, we can't publish that or we can't do that. That's actually going to become a demand. That's my opinion, that's my prediction.

Speaker 2:

Yeah, I definitely think we could see that. We could also see more general regulations being enforced. GDPR and things like that are enforced; Virginia and California have data protection acts for consumers. Those things are enforced in law, but they're nebulous enough that companies can protect IP while regulators make sure they're not doing things that harm consumers. So I think we can hit it from two ways: one is companies voluntarily subjecting themselves to it, like you're saying, Susan.

Speaker 2:

My concern about that is that they can kind of game the system and do something that makes consumers happy but doesn't actually do anything for them. So I really think we do have to get real regulations with teeth at some point, which is still fine, but I think we can write those in a way that protects both customers and companies, and then regulators enforce them to make sure companies are following the rules. It shouldn't be a political point-scoring exercise, and I think we're approaching it from the wrong angle.

Speaker 1:

Totally agree. And, to be continued, we're seeing this now with things like the FAA and Boeing. We won't go into much detail here, but even just drawing examples from the news and the ongoing story of that investigation, I think there are a lot of parallels. If we go at this from the right direction, you get some of the same protections you would see in the physical space.

Speaker 2:

Agreed, and I think Boeing is a great example for many reasons. I haven't been following it closely enough to know whether Boeing is obviously in the wrong or whether the FAA missed things, and I don't want to get into the details or mislead anyone. But I think there's sometimes a misconception that because something is regulated, nothing bad will ever happen. That's not the case, because we have to juxtapose that against the fact that nobody wants a big-brother state looking at everything everybody is doing, or a 100% tax rate to fund thousands of government agents checking everything; that's not practical. However, Boeing is in deep trouble, and I can guarantee that incident is not going to happen again. Sometimes something does happen that then spurs making sure it doesn't happen again.

Speaker 1:

Let's get into today's topic. So, segueing from our discussion about bias and some of the things that we want to do to ensure safe systems, let's talk about baseline modeling. What is it?

Speaker 3:

Yeah. So sometimes we get a little bit lost in wanting to build these models quickly and get them out there, and we forget a little bit about what it means to set a baseline. When we're talking about a baseline, we're basically talking about setting expectations for our models: what is reasonable performance, what is acceptable performance, and what is exemplary performance? You wouldn't let someone put a product out in the world without having some sense of whether it's good and whether people care about it. So baselining is a way to validate that your model is doing something, or doing anything, really. We want to compare models to simple, rules-based systems; people may just call these heuristics. That simplified model we would call the baseline, and our goal is to show that our model, with all its complexities, is doing better than that baseline, so that, for our chosen metrics, we can show we are adding value.

Speaker 2:

And those baselines also help you anchor in what problem you're trying to solve and what an average or okay result looks like. We're going to get into some of the mechanics, but from a business perspective, "what am I currently doing" could be a good baseline. Or if it's fraud detection and I don't currently have anything detecting fraud, then if I detect anything it's an upgrade, as long as I don't have too many false positives. Think about what those business anchors are; that heuristic is the key.

Speaker 2:

There are ways to do good baseline models. Especially if your journey is more advanced, you might have existing models, but even then ask: what's the business outcome, what's a good heuristic? So instead of a complex model, even for something like predicting cat versus dog, you can just describe an aspect of the animal, a simple true/false check, or any sort of basic decision tree; any of those can be very basic heuristics. It just depends on the use case. But having something as your baseline and having it rooted in reality is key. One of the main reasons you want to do baseline modeling is to ask how this fits your use case, versus just importing scikit-learn and going from there.

Speaker 1:

The use case is important, and I like the way you phrased it: rooted in reality. So let's talk a little more about the actual testing. There's the obvious question of why I need to do it, but what all is involved?

Speaker 3:

When we're thinking about what's involved with baselining.

Speaker 3:

We're thinking about using the data that we already have to establish what the floor is going to be. Establishing the floor is like if you're talking to your friend and you ask them, what did you think of that movie you just watched, and they say, I give it a five. And you're like, oh wow, five, that's pretty good, like five stars. But your friend was talking about a 10-point rating scale and actually they didn't like the movie at all. What baselines let you do is have some understanding of the scope of success you're expecting. Metrics don't exist in a vacuum; they exist against other measurements, and so the goal of the baseline is to do that rigorous model validation and give you a ceiling and a floor of what to expect from your model.

Speaker 2:

We just really need those baselines of what we can expect out of this modeling system, and they need to be realistic as well. You can't say, I want this thing to be a hundred percent accurate under all circumstances; that's not going to be realistic. But you want a solid baseline. For instance, flipping a coin should be around 50%, and if you're trying to do something on Wall Street you're trying to get a 51% flip; well, 50% would be your baseline. Making it context-specific is really good. Of course, the more upper and lower bounds you can establish, the fancier you can get.

Speaker 2:

But we're going to talk primarily about the more basic methods and start with classification baselines today. There are some mathematical methods we can layer on top. If you have really good domain knowledge, like the stock market example, flipping a coin is a good baseline if you're a stock picker. But for most things it's not quite as cut and dried. So if you don't have an easy business heuristic, which would be the ideal case, say "I have a current loan officer who is finding this rate of whatever," those would be the ideal baselines. If you don't have those, you can start constructing baselines using mathematical methods and knowing your upper and lower bounds. We'll jump into some of those methods now.

Speaker 3:

Yeah, I think that totally checks out, and we'll get a little more of that in our conclusions about business purpose and what baselining gives us in that regard. So how do we do baselining in the real world? Baselining is super specific to the problem you have and the model you're using. Let's take, for example, a classification model: this is your dog-versus-cat model, your stock-market-goes-up-or-down model. There are a few outcomes you expect, and we want to model the correct prediction for the next state or for labeling a particular entity. So let's go through some really simple heuristics, and you might think, well, this is so simple, and then you'd be shocked to learn that most people aren't even doing this much.

Speaker 3:

So the first and simplest one you can do is uniform random selection. What is uniform random selection? It's flipping a coin, it's rolling a die. You look at your training data, look at all the outcomes that are in there, and you say, I don't know anything about this data, so let me just randomly guess one of the answers that appears in the training data. At least you'll have a sense of how you would do against chance, against someone who clearly knows nothing about the data and is just guessing randomly. Can you beat that? This gives you a really good check: does your model have any performance ability that you couldn't just get from having your pet pick the answer? We've all seen those goldfish stock pickers, right? Truly, can your model do better than pure chance?
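As a rough illustration of this first baseline (our own sketch, not from the episode; the synthetic data, scikit-learn calls, and logistic regression stand-in are assumptions), a uniform random baseline can sit next to your candidate model like this:

```python
# Minimal sketch of a uniform random baseline next to a candidate model.
# Synthetic data and logistic regression are stand-ins, not from the episode.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: guess classes uniformly at random, ignoring the features entirely.
baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X_train, y_train)

# Whatever model you are actually evaluating; logistic regression stands in here.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("random-guess baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("candidate model accuracy:      ", accuracy_score(y_test, model.predict(X_test)))
```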

Speaker 2:

Definitely. This is like your baseline of baselines for classification models. If you don't have those business outcomes, start here. It's a great starting spot.

Speaker 3:

And that builds up really naturally to the next one, which is a little bit smarter. This is called the most frequent class baseline. This time we're going to say, well, we have a good amount of training data; let me look at the training data and find the most frequent class. In this dataset of cats and dogs, it's 70% dogs, so let me just always guess dog, because I'll get at least 70% accuracy. If I'm looking at stock market data and the stock usually goes up, let me just always guess up. Then let's go from there and see how our model does against that.

Speaker 3:

And this is a really good test because it catches a lot of mistakes we get with imbalanced data. For example, take radiology measurements, these cancer-screening technologies. If you're naive enough not to think about it and you just report accuracy, you'll find that you get something like 99% accuracy and think, wow, my model is so good. But if you run a simple baseline that always guesses "no cancer" and see that it gets 98% or 99.9% accuracy, actually higher than your model, now you understand that the true baseline was not what you thought it was. It's not that 50% is what's considered okay; what's considered okay is beating 99%. So most frequent class baselining is usually the correct simplest baseline: if the baseline of baselines is random guessing, the true baseline is at least guessing the most common outcome.
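To make the imbalanced-data trap concrete, here is a small sketch of our own (not the hosts'); the 99%/1% class split and the scikit-learn usage are invented for illustration:

```python
# Minimal sketch of a most frequent class (MFC) baseline on heavily skewed labels.
# The 99%/1% split loosely mimics a rare-event screening problem; illustrative only.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

mfc = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = mfc.predict(X_test)

# Accuracy looks impressive while F1 on the rare class is zero:
# exactly the trap described above when accuracy is used on imbalanced data.
print("MFC baseline accuracy:", accuracy_score(y_test, pred))
print("MFC baseline F1 (rare class):", f1_score(y_test, pred, zero_division=0))
```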

Speaker 2:

I'm surprised how many times models fail this one. Most models will at least beat uniform random, but most frequent class? I've seen way too many models fail it, so this is definitely a solid heuristic baseline.

Speaker 1:

Aside from it being the most common miss, are there any other causes of this?

Speaker 3:

Yeah, the thing you see most frequently is that you build this model, you think it's super smart, and you don't check the outputs manually; when you actually look at what your model outputs, it's just outputting the majority class anyway, so you're doing the same as the baseline. Models are basically only as smart as the data you give them and the parameters they have to store that information, so you get these models that just do what this heuristic does, and we want to beat that.

Speaker 2:

And this becomes more of an issue with models that have imbalanced data.

Speaker 2:

I think we've talked about this some in the past: if you have something that's true maybe 80% of the time and false 20% of the time, then just using this heuristic you're correct 80% of the time, 80% accuracy, and models will do that. It's a tougher use case, but when you start getting into multi-class problems, where you don't just have true and false but true, false, maybe, who knows, or whatever, that's when you can really run into issues and have to test your models more. Oftentimes, as mentioned, people aren't manually looking at their outputs; they're just looking at "oh, my accuracy score is 80%, yay, I'm doing great," when the model is only ever picking "true" out of those four options, and the model is dumb. If you don't do that additional checking, well, modeling is very powerful, but you have to have all of these checks and balances and controls around it.

Speaker 3:

So let's say you've gotten past your true bottom-of-the-barrel baselines. Let's try something that's at least competent. Let's use some really rudimentary statistical techniques and run logistic regression against your model. But let's not even do full logistic regression, because that's true modeling; let's just go feature by feature, and you tell me which feature actually performed the best. So you're just using a single feature to predict the outcome value.

Speaker 3:

So let's say we're looking at stocks, and for some reason weather is one of your variables, and weather does the best. Your baseline model is simply: is the temperature over 70 or under 70? Tell me what the outcome is. That's a new type of baseline: if I had the simplest possible model with only one feature, can I beat that? Can I at least beat something that understands the data a little bit beyond just the most frequent class and is doing some kind of modeling? So now we've raised the baseline to at least having done a little modeling.
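Here's a minimal sketch of the single-feature idea (ours, not the episode's), assuming a scikit-learn workflow with synthetic data: fit a one-variable model per column and keep the strongest one as the baseline.

```python
# Minimal sketch of a single-feature baseline: fit a one-variable logistic regression
# per column and keep the strongest one. Synthetic data; illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)

scores = []
for j in range(X.shape[1]):
    # One column at a time: how far can a single feature get on its own?
    score = cross_val_score(LogisticRegression(max_iter=1000), X[:, [j]], y, cv=5).mean()
    scores.append(score)

best = int(np.argmax(scores))
print(f"best single-feature baseline: feature {best}, CV accuracy {scores[best]:.3f}")
# A more complex model should clearly beat this, or its extra complexity isn't earning its keep.
```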

Speaker 2:

What's interesting is, with some of these fancier machine learning algorithms, people think they're doing really solid stuff with XGBoost and things like that, but if you actually look at the feature importances, sometimes it really does come down to this. Understand your data first, because you might not need the extra complexity, and there are more complexities with more complex models, if weather is the main predictor, and sometimes it is. This is why you need to do exploratory analysis, correlation analysis, and things like that: you might find that one of the 20 things you were going to plug in really determines most of it anyway. Do you even need to go further, or do you just use that?

Speaker 3:

And so, going to the next step, going even further: this is the best baseline you can do. For most problems we're solving in data science, it's not the first time the problem has been solved. If you're in a research lab or in academia, you're probably working on new problems, but most people are working on problems that have been done before.

Speaker 3:

So compare your model to what's considered state of the art. Find someone doing spam detection, dog detection, digit detection, and see: is your model performing as well as or better than that model? That's going to give you a sense, for the data you're using, of whether your model performs better than what you could have just picked up off the shelf and paid for. I've rarely seen this one used; it's a great approach. And if you read papers, this one is required. You can't publish a paper and just say our model is really good; you have to answer, okay, but is your model better than what's out there? Why should I care about your model?

Speaker 2:

This is probably just anecdotal evidence, but I often see people refining existing methods, say, if you're in statistics and you work on statistical methods, they do a great job of this, whereas in data science, or when bringing data science to an existing discipline like economics, you get the papers that say "I used this XGBoost model and didn't compare it to anything, it's the best thing ever." First round of review comments: well, what are you comparing it to? Versus the people who know they're optimizing their ARIMA model and trying a new optimization for periodicity or whatever; they're going to have the baseline model. So it's kind of interesting to see whether you're a rookie or not by looking at how well you leverage existing models.

Speaker 3:

Totally, and I'll say you don't want to get egg on your face by doing this and finding out, oh, I'm not even better than state of the art. But the catch is, and we see this a lot in NLP papers now, you can take your approach and blend it with the state of the art and often do better than the state of the art alone. It's like, oh, when GPT is less confident than some threshold, use my model. When you do these types of blended methods, you often do better than the state of the art. So this is actually not strictly a negative thing; it can help you understand how your model will fit in with the state of the art.

Speaker 1:

Oh, interesting. I mean, why does it do better? I'm just curious.

Speaker 3:

That's a great question. Usually when we're looking at these bigger, state-of-the-art models, they're state of the art because they're really good at generalizing; they're really good for the average problem. But the in-house modeling you're usually doing is better for the data that you own and that's relevant to your problem. So you have the home field advantage.

Speaker 1:

Okay, the home field advantage being the data.

Speaker 2:

This dovetails nicely with one of the reasons we don't like LLMs. There are architectural reasons too, but one of them is that ChatGPT can be great for helping you draft emails, yet a lot of companies trying to build businesses off of ChatGPT are always going to be at a disadvantage versus getting someone like Sid to build you a specific, optimized large language model on your data for your use case. This is the general-versus-specific modeling question, maybe an interesting podcast topic in itself, where we're very much for fit-for-purpose specific modeling: making something custom versus buying the Walmart version. The Walmart thing is fantastic and very applicable across a wide range of things, but it's not really great at anything. It's like buying a Ford versus buying a Maserati; they're good at different things, and it comes down to your use case. If you want high performance, you probably have to get a special Formula One car.

Speaker 3:

But it's a good question that really gets at this idea of why state of the art exists and why everyone doesn't just use it. If it's so good, why isn't it basically plug and play? It's that you should solve the problem that you have; often the state of the art can be part of your solution, and often the goal is not just to beat it.

Speaker 2:

And state of the art is often very finicky. You wouldn't want to take your Formula One racer down to the grocery store to get groceries; first off, you can't fit them in the back, and it's hard to get in and out of. There are a lot of things that don't work where your Ford minivan is fantastic. It depends on your use case, and you can't do one size fits everybody. A lot of times, if you have a very specific problem you're solving, a very specific high-performance model will be best, but it might not work well for everything. Setting aside our thoughts on LLMs in general, that's why ChatGPT is not the answer for everything: it fixes a lot of people's problems adequately, but it doesn't rock and roll on anything specifically.

Speaker 1:

Good call. So now that we've gone through some of the classification baselines, let's go through and understand a little bit more about regression baselines.

Speaker 3:

Yeah, so this is the other side of the problem. This time we're not just guessing cat, dog, mouse; we're guessing a number. How expensive is this house in Boston? Not just is the stock going up, but what's the stock's price going to be tomorrow? Now we're guessing a value itself, and it's somewhat obvious that you can't use the classification metrics; you can't just guess the most common number, because that usually won't tell you anything. So let's try some new approaches, and one that might jump out at you is: let's just guess the mean value. Look at the training data, look at the mean price of NVIDIA, always guess the mean price, and see how that does against your model. Does your model do better than just guessing the average? This at least gives you the sense of a model that captures no variance, that has no ups and downs. Does your model at least track with that on average?
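A minimal sketch of the mean baseline, again our own illustration under assumptions (scikit-learn, synthetic data, linear regression standing in for "your model"):

```python
# Minimal sketch of a mean-value regression baseline versus a candidate model.
# Synthetic data and linear regression are stand-ins, not from the episode.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the training-set mean, capturing no variance at all.
mean_baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

print("mean baseline MAE:", mean_absolute_error(y_test, mean_baseline.predict(X_test)))
print("candidate MAE:    ", mean_absolute_error(y_test, model.predict(X_test)))
```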

Speaker 2:

Another solid baseline that is not used enough, and then we'll get into some more specific ones. Just to do a quick high-level recap in case anybody following along is a little confused: classification essentially means categories, those distinct values, cat, dog, ostrich, whatever, specific things you're classifying against. Regression means the target is continuous, like the prices of houses in Boston. We're using the terms classification and regression because that's what's normally used in industry, but discrete versus continuous is really the differentiator.

Speaker 3:

Yeah, totally. And that kind of informs why you use different types of baselines; you want the right baseline for the problem you have. Let's say you have a really skewed dataset. You don't want to guess the average price of NVIDIA, because one day NVIDIA had a really bad press release, they got a really bad report, and everything is skewed down. Maybe you'll want to use the median instead; that will help you control for skew, and at least you'll be guessing the median value over time, which still gives you that type of tracking behavior.

Speaker 3:

Building off the median, there's also just the mode. Those are the three measures of centrality: mean, median, mode. The mode gives you the chance to say, if you have a very specific kind of value-based data where there's just a magic number, say it's houses and the magic number is six hundred thousand dollars, let's just always guess six hundred thousand dollars. And these tests don't have to be your only baseline; you can have all three of these baselines, because you want to beat all of them. You don't want to beat just the mean; you want to beat the mean and the median and the mode. We want to show that your model truly beats these.

Speaker 3:

These are brain-turned-off heuristics, and just like before, we're going to compare our models against the bare minimum of modeling. The same as with categorical data, where we used a single-variable logistic regression to try to predict the outcome, let's do the same thing here but with linear regression: regress on a single continuous variable and predict the outcome. Maybe the one variable is average number of drinks per day and you're predicting life expectancy. Just a single variable; does your model beat single-variable modeling?
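A compact sketch (ours, not the hosts') of the median, magic-number, and single-variable regression baselines just described; the synthetic data, the constant value, and the chosen column are all arbitrary placeholders:

```python
# Minimal sketch of median, "magic number" constant, and single-variable
# linear regression baselines. Synthetic data; illustrative only.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

median_baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
constant_baseline = DummyRegressor(strategy="constant", constant=0.0).fit(X_train, y_train)
single_feature = LinearRegression().fit(X_train[:, [0]], y_train)  # regress on one column only

print("median baseline MAE:        ", mean_absolute_error(y_test, median_baseline.predict(X_test)))
print("constant baseline MAE:      ", mean_absolute_error(y_test, constant_baseline.predict(X_test)))
print("single-feature baseline MAE:", mean_absolute_error(y_test, single_feature.predict(X_test[:, [0]])))
```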

Speaker 2:

And with all of these, the same as we talked about before, you'll see so many models that would do better to just stick with this method or come back to it. Next, there are a couple of baselines that are a little more interesting, from econometrics, and time series is a whole thing in itself; we should do a podcast on time series someday. But basically there's the concept of lags, which are previous observations. Where this intuitively makes sense is if you think about year over year. If you're doing temperature forecasting, you can't really say I'm going to use data from January through June to predict July; it's not going to be accurate. What you do is a lag of a year: January over January and things like that, where you would actually see trends correctly and be able to predict the weather. Otherwise, if you're trying to predict temperature throughout the year, you're always going to be inaccurate, because there's that periodicity lag for the seasonality.

Speaker 2:

There's a whole bunch we could talk about on lags, but the simplest one is lag 1, which just repeats the previous observation. In other terms this is actually called a random walk. Stock price is an example: if it's $50 today, I'm going to predict it's $50 tomorrow and then go from there. And in a lot of studies, the stock market really is that random, a random walk. Any time you're trying to do stock market prediction, you have to use a random walk as your baseline, because you'd be surprised how many trading methodologies don't beat it, or don't beat it repeatedly over a long period of time. Unless you're a massive hedge fund that's front-running and doing things like that, it's very hard to make money in the stock market long term by stock picking, because of the randomness and the information that's already built in.

Speaker 3:

Yeah, it's a totally data-driven heuristic, right?

Speaker 3:

If your model is not able to capture this nature of, oh, it's always warmer in June, oh, stock prices always go up in January, that's what this baseline will capture by doing a one-step lag: let's go back one day and guess it again, which washes out a lot of these changes and gives you a bare-minimum baseline.

Speaker 3:

An interesting version of this that I've liked and been using a little bit is the last 5%, or the last five. It's a little smarter than lag 1: it looks five days into the past, takes the average, median, or mode of those values, and reports that as your guess. This gives you a little smoothing behavior, so we're not just going back one day, we're using information from five days. Let's say yesterday was a weird anomaly; within five days maybe we can smooth out that anomaly and get a really good guess. Does your model do better than that? To some extent, a last-five model is a legitimate model, simple but legitimate, so does your model do better than that?
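A small sketch of the lag-1 (random walk) and last-5 rolling-mean baselines on an invented random-walk series (ours, not from the episode); any real forecasting model would be compared against these on the same data:

```python
# Minimal sketch of lag-1 (random walk) and last-5 rolling-mean baselines
# on a synthetic price series; the series itself is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
prices = 50 + np.cumsum(rng.normal(0.0, 1.0, size=500))  # synthetic "price" series

actual = prices[5:]                                   # values we try to predict
lag1 = prices[4:-1]                                   # yesterday's value as today's forecast
last5 = np.array([prices[i - 5:i].mean() for i in range(5, len(prices))])  # 5-step rolling mean

def mae(pred):
    return float(np.mean(np.abs(actual - pred)))

print("lag-1 (random walk) MAE:", round(mae(lag1), 3))
print("last-5 rolling mean MAE:", round(mae(last5), 3))
# Any forecasting model should be judged against these on the same series.
```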

Speaker 2:

Awesome. So, to wrap up, we'll hit just a couple of advanced baselines. Since we have our resident language expert here, we can talk about a couple for large language models and NLP-type processing, then talk about vision for a little bit, and then wrap up for today.

Speaker 3:

Yeah. So these are baselines that are very problem-specific, and you're probably not going to use them directly; if you want to use something like this, it's probably best to read the literature and see what other papers are doing right now. Let's say you're working with language: you're generating language with your model and you want a baseline. How do you baseline language? A really simple way would be to randomly sample some of the expected outcomes. If you have a training set with reference outputs, just randomly pick one of those answers, like we did before, use it as the answer, and see whether it's satisfactory, whether that's by some human evaluation, perplexity, or whatever metric you're using. If I just give you some random language that's in domain, was that an acceptable answer, and did your model do better than that? If you have access to the model weights themselves, maybe you look at the entropy that comes out of the decoding and see how this nonsense-but-valid language would score. Another common technique we've seen is paraphrasing. Paraphrasing models are in abundance on Hugging Face, and you can also write your own if you have a domain-specific one; just paraphrase the question or paraphrase the answer. Did your model do better than that?
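As a very rough sketch of our own (not the hosts'), the random-sample baseline for language can be as simple as pulling a random reference answer from the training set and scoring it with whatever metric you already use; every string, name, and the token-overlap metric below is an invented placeholder:

```python
# Minimal sketch of a random-sample baseline for a text generation task.
# The reference answers, the "model" output, and the token-overlap metric are
# all invented stand-ins; substitute your own corpus and evaluation metric.
import random

train_answers = [
    "Your refund will be processed within five business days.",
    "Please restart the device and try again.",
    "You can update your billing address in account settings.",
]

def token_overlap(pred: str, ref: str) -> float:
    """Crude stand-in metric: fraction of reference tokens that appear in the prediction."""
    pred_tokens, ref_tokens = set(pred.lower().split()), set(ref.lower().split())
    return len(pred_tokens & ref_tokens) / max(len(ref_tokens), 1)

random.seed(0)
reference = "Please power-cycle the device and retry."
baseline_answer = random.choice(train_answers)  # in-domain but uninformed guess
model_answer = "Try restarting the device, then retry the update."  # pretend model output

print("random-sample baseline score:", round(token_overlap(baseline_answer, reference), 2))
print("model score:                 ", round(token_overlap(model_answer, reference), 2))
```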

Speaker 3:

On the other side, for my vision folks: you've been thinking about baselines for a long time, because that's what your field is built on, signal processing. You've thought about, how does my model compare to just outputting noise? How does my model compare to taking the output image and putting a Gaussian blur on it, a smoothed-over, blurred version of the image; how does that score? Or there are more intelligent forms of noise, where you take the mean of a couple of images from the training set and just report that, and how does my model do against that? This is the classic signal-versus-noise comparison that you've already been doing for a long time as the baseline.

Speaker 1:

With these more advanced baselines, is this what we're more commonly going to see as baseline tests in generative AI and LLMs, or are these baseline tests used across the board?

Speaker 3:

The ideas behind these baselines, the principles, are really just the same principles as for the categorical and regression models, so we're going to see these kinds of tests everywhere, for anything using language, vision, or images. Obviously generative AI is a great use case for them; there's great overlap. We're still in the same field, still in the same domain, and these tests are totally still usable.

Speaker 1:

If we had to really distill this down: we covered a lot of ground, and like you said earlier, there are a lot of baseline tests and a lot of things you can do, but it's surprising how much of this is not done. So, in conclusion, what are some of the recommendations you would have?

Speaker 2:

If you're going to take one or two things away from this podcast, it's: have some sort of a baseline. There are the many methods we've talked about, and many more we didn't; we just gave some examples today. But really try to understand, first and foremost, what you are trying to do with this model. If there's any existing business information or anything you can use as a baseline, great, use that.

Speaker 2:

If you don't have any business-related information, like what are we doing today, or what's required for this to be viable, we have to make X amount of money, whatever; if nothing like that exists, use the methods we covered here, depending on whether it's classification or regression: most common class, or try using the one feature you intuitively think is going to be the most predictive, those sorts of things. Then keep that as your baseline and don't change it. Any other model you build, you compare to that baseline, and then you can see how you're doing, whether you're progressing and making something better, because you don't want to just be optimizing for its own sake. Having that baseline also helps you avoid overfitting and things like that, because you want to make sure you're fulfilling the business need.

Speaker 2:

Ultimately, that's why data science and modeling exist: you're fulfilling a need for something. So being able to use something as your baseline for "how good am I doing?" matters. In any field or discipline, you have a baseline you come back to. If it's running or skiing, there's the easy bunny hill; if I'm falling on that, well, that's a problem if I'm supposedly a black diamond skier. You need something to come back to and level-set on, and oftentimes in modeling we don't do that. Just having something there is what counts; what specifically you use is less important than having something.

Speaker 3:

Totally. And let's think about baseline modeling as a necessary step, not a nice to have, not a thing to show off. Let's start thinking about this as a requirement of deployment. Do not deploy a model if you haven't validated that your model is at least better than a baseline model.

Speaker 1:

Thanks again for a wonderful episode. This topic is sure to spur a lot of questions, and you, our listeners, never disappoint in delivering those questions, so keep them coming. We're happy to answer them. Please drop us a line on the feedback form on the AI Fundamentalists podcast page. Until next time, thank you.
