
The AI Fundamentalists
A podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses.
Supervised machine learning for science with Christoph Molnar and Timo Freiesleben, Part 2
Part 2 of this series could easily have been renamed "AI for science: The expert's guide to practical machine learning." We continue our discussion with Christoph Molnar and Timo Freiesleben, looking at how scientists can apply the supervised machine learning techniques from the previous episode in their research.
Introduction to supervised ML for science (0:00)
- Welcome back to Christoph Molnar and Timo Freiesleben, co-authors of “Supervised Machine Learning for Science: How to Stop Worrying and Love Your Black Box”
The model as the expert? (1:00)
- Evaluation metrics have profound downstream effects on all modeling decisions
- Data augmentation offers a simple yet powerful way to incorporate domain knowledge
- Domain expertise is often undervalued in data science despite being crucial
Measuring causality: Metrics and blind spots (10:10)
- Causality approaches in ML range from exploring associations to inferring treatment effects
Connecting models to scientific understanding (18:00)
- Interpretation methods must stay within realistic data distributions to yield meaningful insights
Robustness across distribution shifts (26:40)
- Robustness requires understanding what distribution shifts affect your model
- Pre-trained models and transfer learning provide promising paths to more robust scientific ML
Reproducibility challenges in ML and science (35:00)
- Reproducibility challenges differ between traditional science and machine learning
Go back to listen to part one of this series for the conceptual foundations that support these practical applications.
Check out Christoph and Timo's book “Supervised Machine Learning for Science: How to Stop Worrying and Love Your Black Box” available online now.
What did you think? Let us know.
Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:
- LinkedIn - Episode summaries, shares of cited articles, and more.
- YouTube - Was it something that we said? Good. Share your favorite quotes.
- Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik. Welcome back for part two of Supervised Machine Learning for Science. Christoph Molnar and Timo Freiesleben are here with us again to take the concepts that we talked about in part one and discuss how they apply in practice. Now, if you didn't listen to our first episode, you'll be fine, but we encourage you to go back to that episode at some point for the concepts that support the practical applications we're talking about today. Let's get right to it.
Speaker 2:So let's start off with this idea of how scientists can embed their expert knowledge into the modeling paradigm and the modeling system. As we discussed previously, that can be done through study design: when a scientist creates a study design, they intimately understand how the model works and what the outcomes need to be. But let's talk a little more about how we can do this in a fine-grained, specific, model-engineering way, with techniques like improved generalizability and inductive biases. What are your thoughts on how an expert can really make the model look more like the expert?
Speaker 3:So one thing I already mentioned is these seemingly trivial decisions: what are your features, what is your target, how do you frame the prediction task? But one part that I would say is really overlooked is the evaluation metric that you use in the end and optimize your models against. It's often an off-the-shelf choice: you have a regression setting, so you pick the mean squared error, because it's just one of the metrics you typically use. But the thing with machine learning is that this decides a lot. You pick it once in the beginning, maybe you pick two metrics and look at how your models compare on them, but this choice makes a lot of downstream decisions for you.
Speaker 3:If you do automatic feature selection, or model selection, it's all based on the evaluation metric. So whatever you pick there will have this downstream effect, and you won't even notice, because you only look at this number and maybe no longer question it. At least I have that; I'm just looking at the number. I'm also coming a little bit from the competition setup, where the metric is always given: OK, this is how we evaluate your models. But you can also design this evaluation metric yourself, so you have a lot of choices in how to do this.
Speaker 3:You can look at your prediction task and really ask: what is the evaluation metric I want to have here? One thing you can work with, for example, is weights, so that you have a cost-sensitive function and say: based on my domain knowledge, certain cases just need to be right, and for the others a wrong prediction is not as bad. I think that's a good way to pour your domain knowledge into all of your modeling choices. And if you don't do that, you won't see it. That's the bad part about it: you will never see it, because it's always just the number you look at, and you no longer question it.
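A cost-sensitive evaluation metric like the one Christoph describes can be sketched in a few lines of Python; the weights and numbers here are hypothetical, purely for illustration:

```python
import numpy as np

def weighted_squared_error(y_true, y_pred, weights):
    """Mean squared error where each case carries a domain-chosen cost weight."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    return float(np.average((y_true - y_pred) ** 2, weights=weights))

# Hypothetical setting: errors on the first, high-stakes case (weight 10)
# matter far more than errors on the routine cases (weight 1).
y_true = np.array([100.0, 50.0, 80.0])
y_pred = np.array([110.0, 55.0, 80.0])
weights = np.array([10.0, 1.0, 1.0])

plain_mse = weighted_squared_error(y_true, y_pred, np.ones_like(weights))
cost_mse = weighted_squared_error(y_true, y_pred, weights)
```

Selecting features or models against `cost_mse` instead of `plain_mse` propagates the domain judgment, that some cases simply must be right, into every downstream choice.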
Speaker 4:I think data is usually a really cool way of incorporating your domain knowledge. If you really know what you want to represent, then be really careful about getting enough of the right data, because on the modeling side you can only do so much. If there are biases in your data, you will never get them out, at least not easily, by modeling.
Speaker 4:And a very cool approach there is data augmentation. This is a very nice way where you can think about what potential alternative data you also want your model to perform well on, especially if you know certain kinds of transformations that might regularly happen.
Speaker 4:For example, you have only pictures taken during the day, and the system should also work during the night.
Speaker 4:So there are certain kinds of filters you can use to just extend your database, and I think this is a really simple approach to incorporating domain knowledge, but it's also really effective. What I like is the idea that the same domain knowledge you can incorporate in the form of data, you can sometimes also incorporate in the modeling stage. For example, say you want an image classifier that assigns a picture and its horizontal flip to the same class. That's a natural requirement, and you can do data augmentation, taking every picture and flipping it, and then have a better classifier. But alternatively, you can incorporate this inductive bias into your modeling assumptions by choosing a model class that has this invariance encoded into it. The only thing is, the latter is usually the tougher approach, usually more difficult to do, but I think both are legitimate ways to deal with it.
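The flip augmentation Timo describes can be sketched as follows; a tiny array stands in for real image data:

```python
import numpy as np

def augment_with_flips(images):
    """Return the original images plus their horizontal mirrors.

    Training on the doubled set nudges the model toward the invariance
    discussed above: a picture and its mirror should get the same label.
    """
    flipped = [np.fliplr(img) for img in images]
    return list(images) + flipped

# A 2x3 "image" as a stand-in for real pixel data.
img = np.array([[1, 2, 3],
                [4, 5, 6]])
augmented = augment_with_flips([img])
```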
Speaker 5:Yeah, I love this point.
Speaker 5:I think one thing we've talked a lot about on the podcast, and see a lot in industry, is the lack of domain knowledge.
Speaker 5:So there's a major focus on tooling in the data science world, and what's great is your book is really coming from the perspective of scientists. But then we have a lot of practicing data scientists, people under the umbrella of data scientist, not necessarily the same scientists we're talking about in the other realm, who get put onto these different problems where they don't actually have domain expertise. I know when I was an early data scientist, in my first class and such, there was very much that Venn diagram of: you need to know computer science, statistics, and domain knowledge. I really think the domain knowledge component for working data scientists gets dropped a lot of the time. I know today we're really focusing on traditional scientists using machine learning for their tasks, but I love what you're saying there, and I really think the machine learning and data science industry as a whole has kind of forgotten that you actually need to know what you're modeling, and that's, I think, what causes some of the issues we see today.
Speaker 3:Yeah, and this is certainly a spectrum, because even if you say you have no domain knowledge, maybe that's true if it's something really niche, but typically you do have some idea of what you're modeling, even if you're not an expert in marketing or whatever. You always have some domain knowledge which the machine doesn't have, and you can incorporate it. You do this already by doing feature engineering and so on. One other way to incorporate domain knowledge is by constraining the model. Timo mentioned things like monotonicity, and there's also how you encode your features, like cyclical features: if you know that one of the features represents a day or a month or a year and you want the ends to meet, or if you know that one of the features must have a purely monotonic effect, just positive or just negative, you can incorporate that into your model.
Speaker 3:These constraints always come at a cost: if your domain knowledge turns out to be false, or too restrictive, then you might pay for it with predictive performance. On the other hand, you always have to weigh it, because maybe the difference is just a fluctuation, a variance in your performance measurement, or maybe the constraint will yield a more robust or more interpretable model that outweighs a small drop in performance. So you also have these decisions to make when pouring your domain knowledge into the model.
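The cyclical-feature encoding Christoph mentions, making the ends of a day, month, or year meet, is often done with a sine/cosine pair. A minimal sketch:

```python
import numpy as np

def encode_cyclical(values, period):
    """Encode a cyclical feature (e.g. month, hour) as sin/cos pairs so the
    ends meet: December sits next to January instead of eleven units away."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

months = [0, 6, 11]  # January, July, December (0-indexed)
encoded = encode_cyclical(months, period=12)

# December and January end up close in the encoded space; July is far from both.
dec_to_jan = np.linalg.norm(encoded[2] - encoded[0])
jul_to_jan = np.linalg.norm(encoded[1] - encoded[0])
```

Monotonicity constraints, the other example, are usually a one-line option in gradient-boosting libraries, e.g. `monotone_constraints` in XGBoost or `monotonic_cst` in scikit-learn's histogram-based gradient boosting.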
Speaker 4:Yeah, andrew, I think that's really a great point, also something that we should maybe have mentioned earlier. We had longer discussions about how we should call the book and it ended up being supervised machine learning for science, and clearly we have scientists as our target audience, but it's not restricted to scientists, not in the strict sense of you need to be have a postdoc or PhD position at the uni. It's more like everyone who wants to use machine learning to find something out about their data or the phenomenon they study is for us, the scientists, it's a relatively broad idea of what a scientist is. I think that we actually had intended and for your second point, I think people really often.
Speaker 4:It's because it started like this because often just the data approach was really getting better than the classical traditional domain knowledge approach. But I think in the future we will see that the combination of having really good machine learning models and this data approach combined with domain models, will be even better than each single one of them.
Speaker 2:I really like that example because, yeah, that's from IBM's machine translation team, right, where they were trying to make the best translator and getting better results when they fired the experts who were building these complex if-else statement systems and just used a data approach. And I think this highlights the difference between the product world and the science world. In the product world, we just ask: what is the best model I can make, and can I make it better than everyone else's model? But that's often not the goal of the scientist. The goal of the scientist is to understand the underlying relationships between these features and the outcome. So how can scientists practically use these machine learning tools we have available to measure causality between treatments and outcomes, and maybe measure the uncertainty in these connections? How can they take the next step of not only making a model that's good, but understanding how that model thinks as a model of the world?
Speaker 5:Right, before you switch to that, one comment on something you just said, Sid: is it really making a better model in the product world? Or, to the point Christoph made a few minutes ago, it's all about the metric you're using. Another friend of the program, Patrick Hall, did a post recently on LinkedIn basically calling into question these leaderboards for LLMs and things like that. So what are you optimizing for? Is this really getting better? Sid, to your point about firing the linguists: you decided on some arbitrary product metric that it's improving on. Yes, it's improving that, but is it actually improving? And Christoph, to your point, with all the things we've talked about so far, you can't really have one solitary metric that says your model is better, especially if you're trying to do something like language.
Speaker 4:Yeah, I totally agree. I'm actually working on a project at the moment on benchmarks and their interpretation, and the crucial aspect we want to emphasize is construct validity. Is the benchmark actually measuring the thing we're interested in? The literature on that is really huge, and there are certain conditions you have to satisfy as a benchmark in order to make real progress on your problem, rather than just getting better on your metric.
Speaker 2:So basically, the question is that scientists want to take the next step, right, beyond evaluation hacking and reporting a nice outcome. You get these problems where you have models that are really good on some outcome, like these machines that were really good at translating but still making really fundamental mistakes, because they didn't have a strong fundamental base. They had a very strong statistical base, and that was enough to win the race at the time. But in the modern day we're looking for a little bit more, and we're expecting a lot more nuance from these models, and they're not always giving it. So how are scientists using these supervised methods to get at causal relationships and the underlying uncertainties in the models?
Speaker 4:So yeah, great question, I think. Causality, at my home university here in Tübingen, is one of the core topics that lots of groups are working on. And I think causality is a big promise in machine learning that everyone expects at some point: oh, this will be the thing, this will be the solution to many of our problems, like robustness. If we have causal models, maybe our models will also be more robust. That's one of the hopes, I think.
Speaker 4:From the perspective I got here, I think causality can come in at various places. The most simple one, I would say: let's say we have a purely associational machine learning model, trained on purely observational data. Then it can learn all kinds of information-theoretic dependencies and associations. That's the cool thing about the interpretation methods that Christoph and I worked on for quite some years: we can actually use these techniques to find interesting, relevant associations among the ones present in the model. And this kind of approach says: well, maybe we found an interesting association, maybe the reason for this association is a causal one, maybe there's a causal dependency between those variables, and that's a scientific hypothesis we can start to explore. We can then run experiments, or we can just try to get more data. But this is the most simple case: we use it for exploration of causal relationships.
Speaker 4:The second usage, where it's already pretty widely used, which is interesting, is causal inference. Assume you know there are certain causal dependencies, you have causal assumptions in place, you know the relevant confounders. Then what you can do is treatment effect estimation with machine learning. You could also do this with classical statistical modeling, but sometimes, depending on what data you have, you can actually do it better with machine learning.
Speaker 4:There are all these different kinds of learners, where you find different ways of estimating your quantity of interest, and those all have different biases. The most popular ones at the moment are these doubly robust methods; double machine learning is the most prominent among them, which gives you a very data-efficient, unbiased estimate of your treatment effect. So that's the second camp, I would say. And the third one, the big hope, is that maybe we can also learn causally relevant representations, maybe learn the causal dependencies directly from the data. What we see there at the moment is that it's more of a theoretical possibility, so there's not much that's used in practice.
Speaker 4:And the reason is very simple: those methods are not super stable.
Speaker 4:If you think about causal discovery, the standard approaches check conditional independencies in the data. To understand it better: in causality you have certain causal graphs that you think might have generated the data, and the data that you see is, in a sense, only compatible with some of these graphs. So not every graph is possible; we can rule out some of the graphs based on the data we have. However, a fundamental insight from Judea Pearl and his colleagues is that even if we get more and more observational data, we will often not be able to identify one unique causal model that generated the data. But at least we can restrict the number of possible models. And if we make further assumptions, for example about the noise, which is what people here at my place are working on, then we can also identify the unique true causal model.
Speaker 4:But, as I said, if you make one wrong choice at some point, if you draw an arrow in the wrong direction, you are a bit fucked and you will just end up with the wrong model.
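The partialling-out idea at the core of the double machine learning Timo mentioned can be sketched on simulated data. This is a deliberately simplified version, with linear nuisance models and no cross-fitting, not the full estimator used in practice (packages such as DoubleML or EconML implement that properly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simulated data: confounder x0 drives both treatment T and outcome Y;
# the true treatment effect is 2.0.
n = 2000
X = rng.normal(size=(n, 3))
T = X[:, 0] + rng.normal(size=n)
Y = 2.0 * T + 3.0 * X[:, 0] + rng.normal(size=n)

# Naive regression of Y on T alone is confounded and overstates the effect.
naive = LinearRegression().fit(T.reshape(-1, 1), Y).coef_[0]

# Partialling out: predict Y and T from the confounders, then regress
# residual on residual. Double machine learning does exactly this, but
# with flexible learners for the nuisance models plus cross-fitting.
y_res = Y - LinearRegression().fit(X, Y).predict(X)
t_res = T - LinearRegression().fit(X, T).predict(X)
effect = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
```

Here `effect` recovers approximately the true 2.0, while `naive` is biased upward by the confounder.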
Speaker 2:I want to dig in a little bit here, right, so I'm pushing you a little. Let's think about the case where the scientist is going to really complain and say: I absolutely cannot give you a causal model of this. Say they have a deep learning network. It takes in RoBERTa-large embeddings, a totally uninterpretable dimensional space, and then passes through 17 layers of feed-forward neural network, and they get some outcome. What can that person do? What tools does this person have for understanding the model they've created?
Speaker 3:What did you say the input embeddings were for?
Speaker 2:Let's say it's language embeddings, right? They have some large NLP network, they're just throwing the embeddings in, then there are some feed-forward layers. And now they're saying: well, I can't possibly do causal analysis on this. How would you respond to that person?
Speaker 3:Yeah, it really depends: is the goal causal analysis? Because it doesn't have to be, right; it can also be totally fine if it's not. Otherwise, it really depends on the research question the person has and whether it can be answered with this type of model, because there can be research endeavors where it's just about prediction, and that can be fine. For that hypothetical case, it's a bit difficult for me to say exactly what this person could do. Timo, do you have some insight?
Speaker 4:Yeah, I would agree with Christoph that it's not necessarily the case that we're interested in causal analysis. But usually, I think, in the end we should be; as scientists, we should try to build causal, understandable models.
Speaker 4:I mean, in case of language it's really difficult because I'm not even sure what causality means on the level of words.
Speaker 4:It's more like we build a theory for grammar, for example, and semantics. So causality is maybe not really the issue there; it's more about theory, but I think they match very closely. And in these cases, if you say, OK, we're still interested in causality, but at the moment we're not there yet, we cannot get to the causal stuff we're actually interested in, what can we do? I think the explorative idea of causality I mentioned earlier: we use interpretation techniques, for example saliency maps, and maybe based on that we start hypothesizing and build up simple models again. This could be a route. And something that people have high hopes for at the moment is looking inside the network, using ideas like probing, where you try to give meaning to certain layers in the classifier or in a generative network, for example in an LLM. I think this is very popular at the moment.
Speaker 4:The question there is also that what you get is not super robust. You get anecdotal evidence. I think probing is one of the better cases, but you have many other cases where they try to attribute meaning to individual units, and very often you see: OK, this individual unit at this instance really did what you think it would be doing, and oh, I have a nice explanation, I can dissect my network somehow. But if you try it on a different unit, even in the same network, or if you retrain the network on the same data, it will not generalize. So I think it's very interesting, I'm really intrigued by this kind of research, but I'm a bit pessimistic.
Speaker 3:Yeah, especially with mechanistic interpretability, the question also becomes: what's the object of study? Is it still your data and your question, or are you moving now to this one specific model? Which can be fine, but then you need a theory which tells you how to connect whatever you find out about the model to what's true in the real world, right? You have to make the argument that whatever you find out about the model is relevant to your original research question. If your object of study is a neural network, then it's clear, you're doing the right thing. But if the neural network and the interpretability techniques were only tools you used to find something out, then you really need a theory that connects these insights to your question. And mechanistic interpretability is very model-specific, right? You're looking at what activates certain parts of the network, so it's very specific to that architecture. What we also write about in the book is that we're advocating for model-agnostic methods, which make more sense for tabular data, this traditional machine learning, or whenever you have interpretable features going into your model. There I think you can make a better case for why it's OK to interpret your model, and how you can bridge that interpretation to your data and the thing you're studying.
Speaker 3:You have to make a few assumptions, like that your model is not biased, which is a strong assumption, and you have to check the variance.
Speaker 3:Then, if you look at things like feature importance, you may make conclusions about your data. That's a really difficult thing to do, but it's also something people are already doing. I mean, how often do people fit a machine learning model, look at the feature importance, and then, of course, they say this describes the model, but what you want in the end is to use this and learn something about the world. Even if it's not in a scientific setup, maybe it's just for the marketing team: hey, these factors were the most important in our churn model, now you can do something with it. And for that as well you need this bridge that tells you: OK, the permutation feature importance says something about your model, but you want it to be true, or useful, for the real world, so there should be this connection, obviously.
Speaker 2:I think that really lines up with a lot of how we think about these problems. Right, you have these large models and you can only do these ablations: you can turn off parts of the model, you can turn off parts of the inputs, you can try to scramble some layers in the model. But at the end of the day, we're just understanding the model, right? Are we gaining causal inferences? That's actually a very open field of research for people looking into this kind of problem.
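Permutation feature importance, the method Christoph refers to, is simple enough to sketch directly; the data here are synthetic, constructed so the true ordering of importances is known:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
n = 1000
X = rng.normal(size=(n, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def permutation_importance(model, X, y, rng, n_repeats=5):
    """Importance of a feature = increase in error when its column is shuffled."""
    base = mean_squared_error(y, model.predict(X))
    importances = []
    for j in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            deltas.append(mean_squared_error(y, model.predict(Xp)) - base)
        importances.append(np.mean(deltas))
    return np.array(importances)

imp = permutation_importance(model, X, y, rng)
```

As discussed above, this ranking describes the model; the bridge from `imp` to a claim about the world needs the extra assumptions Christoph names. scikit-learn ships a more complete version as `sklearn.inspection.permutation_importance`.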
Speaker 4:I just want to mention that this question, under what conditions you can interpret your model and learn something about your data, is something Christoph and I worked on a lot in the last years of our PhDs. I think this is really interesting, especially because people in the interpretability research field acknowledge something that's called the correlation problem, or extrapolation problem.
Speaker 4:If you have dependencies in your input features, many of the methods are much harder to interpret, and what we can basically show is: you can analyze the model in these off-distribution areas, but there's no corresponding thing it represents on the data side. That was, I think, one of the bigger insights: if you want to gain insights about your data, only certain interpretation methods are actually helping you.
Speaker 4:And those are the ones that, as we say, sample from the conditional distribution, the ones that stay within your data manifold, that stay realistic. Because as soon as you start probing in unrealistic regions, your interpretation is really difficult. I think even on the model side it's difficult, but on the data side it's impossible.
Speaker 4:So yeah, it's something I just wanted to add, because I think it's a topic that both Christoph and I really, really like.
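The extrapolation problem Timo describes can be illustrated with two strongly dependent features: marginal perturbation (what naive permutation-based interpretation does) probes the model far off the data manifold, while conditional sampling would stay near it. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly dependent features: x2 is nearly a copy of x1,
# so the observed data lives close to the diagonal x1 == x2.
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)

# Marginal perturbation: shuffle x1 independently of x2.
# The probe points land far from the diagonal, off the data manifold.
x1_shuffled = rng.permutation(x1)
off_manifold_gap = np.abs(x1_shuffled - x2).mean()

# Conditional sampling would draw x1 given x2, staying near the diagonal;
# the observed data itself shows how small that gap should be.
on_manifold_gap = np.abs(x1 - x2).mean()
```

Any interpretation built from the shuffled points asks the model about regions where no real data exists, which is exactly why only the manifold-respecting methods support conclusions about the data.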
Speaker 2:Yeah, yeah, I really appreciate that. I guess this is our penultimate question here and this is a bit of a selfish question. Can you tell us a little bit about robustifying? I think Andrew and I are really excited about the idea of robustifying.
Speaker 4:I'm also really curious about robustness, and I already told you about adversarial examples as one of the classical cases where machine learning fails. But actually, at least in science, there's usually no real adversary, usually there's no one who's manipulating your data. It's more like you want models to work not only in one environment, but also in a changing environment, in a dynamic environment.
Speaker 4:Actually, even in an environment that reacts to the model's decisions, maybe, which is called performativity. At some point I just thought, OK, I need to systematize this, and I wondered: what even is robustness in the first place? I tried to unify these different kinds of robustness into one general notion, and I think we always talk about some stability in performance across distribution shifts of a certain kind. So this is what I would understand by robustness: if we change certain things in our data, how does our model react to it? And there's a huge number of strategies, almost infinite, I think. To really pick the right one, you need to understand what the robustness issue is.
Speaker 4:So what is actually changing in your distribution? There's, for example, what we call a covariate shift. That's a shift only in X, your input features, but it's a relatively nice case, because the relationship between your input features and your target variable remains intact. But there are also much worse shifts, for example what's called concept drift, where even this relationship no longer holds.
Speaker 4:So think of COVID-19, where you had these different variants. In the beginning, you maybe had a lot of data and learned about the relationship between the symptoms people have and whether they have COVID. OK, so far, so good. But say you look two months later, and now your data has shifted, and it has shifted in a really bad way, because there's a different variant. The dependencies you found in the first case don't hold anymore: for example, dry cough was a highly predictive feature in the beginning, but with a new variant, dry cough wasn't really a symptom. So the thing the model learned actually became wrong after some time, or not wrong, but a bit overfit to the data from the beginning.
Speaker 4:So yeah, it's really important to understand your distribution shifts, and if you understand it, then there's various ways to deal with it, and I can highlight some that we listed in the book. So one thing is, for example, that you try to avoid the distribution shift altogether. So sometimes you can do that. A classical distribution shift is, for example, in medical diagnosis. You buy a new imaging device and now you have a distribution shift. You could have avoided that in the first place by just buying the same brand, for example.
Speaker 4:That's a very nice controllable case, but this is not often the case. Often you have shifts that just happen and you have no control over them, and you can either say, okay, if it's just a few cases, I can filter them out.
Speaker 4:If it happens more regularly, maybe I have to completely retrain my model, or I can use data augmentation again to account for different shifts that I might expect to occur in practice. And something that is relatively popular at the moment for robustifying your models is to use pre-trained models, using knowledge from other data sources you don't have access to by taking the weights of other models as a starting point and building your features on top of that, because usually your model will then be more robust. At least, that's something we've seen very often in image classification and many other cases as well. So I think this kind of transfer learning is really promising, and my hope for the future is that we get more of a modular understanding of machine learning, where we play like Lego.
Speaker 4:So we have a very robust submodel for this task and then we use it as a base for this and compose it for this other task.
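To make the "understand your distribution shifts" advice concrete, here is a minimal sketch, not from the book, of one common way to flag covariate shift: compare each feature's distribution in the original training data against newly collected data with a two-sample Kolmogorov-Smirnov test. The feature names, synthetic data, and significance threshold below are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_shift(train_X, new_X, feature_names, alpha=0.01):
    """Flag features whose distribution differs between training and new data.

    Runs a two-sample Kolmogorov-Smirnov test per feature; a small p-value
    suggests the new data was drawn from a different distribution.
    """
    shifted = []
    for j, name in enumerate(feature_names):
        stat, p = ks_2samp(train_X[:, j], new_X[:, j])
        if p < alpha:
            shifted.append((name, stat, p))
    return shifted

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 2))
new = np.column_stack([
    rng.normal(0.0, 1.0, 500),   # first feature: unchanged distribution
    rng.normal(2.0, 1.0, 500),   # second feature: mean has shifted
])
flags = detect_feature_shift(train, new, ["age", "dry_cough_rate"])
print([name for name, _, _ in flags])
```

A check like this only tells you that the inputs have drifted; as in the COVID example, the relationship between features and labels can also change, which no input-only test will catch.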
Speaker 5:Thank you so much for that explanation. I think that section of your book is really well done, and robustifying, or robustness, or however you want to say it, is definitely something that's not talked about nearly enough. I really like that.
Speaker 5:One thing in general, I know you go into some methods, but one thing I really like about this book, and then, Christoph, the model paradigms we had you talk about last time, is that you guys are more identifying how you should think than what you should do, right? A lot of machine learning books are like these playbooks or cookbooks, right? Follow this, do that. But you're more like: here are the issues, Timo, just given your explanation, here are the things you should be aware of, and then, as scientists, you need to think about what's the right way to solve them. Because what's the right answer?
Speaker 5:It depends, right? You're just helping teach the reader how to think and what issues they should be aware of, versus how to solve them. Well, you give options for how to solve them, but it's not like, this is the way you shall do it. So I really appreciated that.
Speaker 3:It would have been a much bigger book if we had also offered the code, Python solutions and so on, but it was already very difficult to write, I think, because the topics we write about are so different and each of them is a huge topic, like robustness, interpretability, causality. There are books which are only about one of those topics, and even those are really big books, and we really tried to condense it down to the essence and to the big ideas of each of these add-ons to machine learning that you need to make your models more robust, interpretable, and so on. So yeah, this was also a huge learning experience for us. We learned a lot, and we split up the chapters; for example, Timo wrote the robustness chapter, so I could just read the chapter and get a super quick overview of robustness, which I hadn't learned too much about before, which was really cool.
Speaker 4:Yeah, I mean, I'm always very curious about the concepts, especially with my background on the theory side, even though I appreciate the methods side as well, and I think it's super important to actually implement these things. But the cool thing about concepts, and this is my original math background coming in, is that I want things to be persistent over time, and I think concepts are persistent. So if you have these ideas about robustness, if you have these different kinds of concepts to analyze robustness, you get these certain kinds of strategies for tackling robustness.
Speaker 4:I think this will stay over time. I think even in 10 years you will find different versions of these strategies implemented in methods. But the methods, at least at the moment in machine learning, get obsolete very quickly, and so this was part of the motivation: to have something that stays over a longer time, because methods are just outdated very quickly.
Speaker 2:And I think that brings us to our final question, which I think is a good wrap up of kind of what we did in part one of the series and now in part two of this series. We see in both scientific learning and machine learning, that we're in the midst of a reproducibility crisis, that there's a lot of issues with study design, with repeating studies, with having generalizable results, and so, as a send off here, can you talk to me a little bit about the best practices you want to see on both sides and how they meet up together to give us a really strong, robust set of methods for doing scientific learning with machine learning?
Speaker 3:so I think that's a really good question because, um, I'm, I have a statistics background and I think, like, statistics is also at the center of this crisis, like with also the p hacking and yeah. So the question is also a bit for me, like, how do we not repeat all these, like these mistakes? So I mean, statisticians are obviously very aware of that, but a lot of people are using statistics and they just have little time to learn about it maybe and they are like, okay, you have to publish this paper now and they're just learning about this new hypothesis test and just using it and then maybe it doesn't come out right in the end, because these things are complicated and things like these tools don't get easier when you suddenly use machine learning. I would say you could argue maybe both ways that they become easier to use, but also harder. And things like reproducibility is also really difficult. Because you're well for linear regression model, you can just report OK, these are the five coefficients and everyone can see that at least, so you can at least report it on your entire model.
Speaker 3:But doing this with complex machine learning models is really difficult, especially if it's about large language models. So there's one thing I'm a bit concerned about, or, yeah, seeing critically: all these papers that say something like, okay, we studied ChatGPT in this and this version. And you read the paper and, like, okay, this version no longer exists anywhere. Nobody can reproduce this paper, basically, because OpenAI no longer has this model accessible anywhere. So yeah, that's a case of a closed model that is not openly shared. But even with openly shared models, it's very difficult if you don't have the resources to reproduce it, for example. So yeah, it is a very difficult thing to do.
Speaker 3:So there are best practices emerging when it comes to reporting. There are, for example, guidelines for what you should report when you do a medical paper using machine learning, checklist-based approaches. Or we also have model cards and these datasheets. Most of the time these are checklists that you can go through, line by line, and at least report what you're doing and see what's important to share with others, so that they are able to reproduce your results. And it's also for you: some of these questions make you check, oh, maybe I should improve my model. So I think these checklists could be a good start, at least.
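To make the checklist idea tangible, here is a small sketch, our own illustration rather than any official guideline: a fixed list of required reporting fields, loosely inspired by model cards and datasheets, and a helper that tells you which fields are still missing. All field names are assumptions for the example.

```python
# Illustrative model-card checklist; the field names are our own
# assumptions, loosely inspired by model cards and datasheets.
REQUIRED_FIELDS = [
    "model_type",        # e.g. "random forest", "fine-tuned ResNet-50"
    "training_data",     # source, collection period, known biases
    "evaluation_data",   # held-out set and how it was split
    "metrics",           # which metrics, and on which subgroups
    "intended_use",      # what the model should (not) be used for
    "software_versions", # libraries and versions, for reproducibility
    "random_seeds",      # seeds used for training and data splitting
]

def missing_fields(model_card: dict) -> list:
    """Return the required fields that are absent or left empty."""
    return [f for f in REQUIRED_FIELDS
            if not str(model_card.get(f, "")).strip()]

card = {
    "model_type": "gradient-boosted trees",
    "training_data": "clinic A records, 2019-2021",
    "metrics": "AUROC overall and per age group",
}
print(missing_fields(card))
# -> ['evaluation_data', 'intended_use', 'software_versions', 'random_seeds']
```

The value is less in the code than in the habit: a blank field is a concrete prompt to either fill in the information or improve the model before publishing.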
Speaker 4:Yeah, maybe something to add as a clarificatory point, sorry, that's a very difficult word for me as a German. There's replicability and reproducibility; at least that's a distinction people sometimes make. In science we clearly have, for example in psychology, this replicability crisis: using the same or a similar experimental setup, we don't get the same results, and this is pretty bad. But in machine learning we often have a reproducibility crisis: we use the very same data, we try to use the exact same methods as reported in the paper, and we cannot even get the same performance. This is even worse, because we have the very same data. So I think in science, at least on this side, we are usually fine, but in machine learning we have problems even with reproducibility, and when it comes to replicability, I think it's sometimes even worse.
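One small, concrete piece of the ML reproducibility problem is unreported randomness: the same code on the same data can yield a different model every run if random seeds are not fixed and reported. A minimal sketch of this, using a toy one-step "training" procedure we made up for illustration:

```python
import numpy as np

def train_tiny_model(X, y, seed):
    """Toy training whose result depends on a random initialization:
    one gradient-descent step on squared error from random weights."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # random initialization
    grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
    return w - 0.1 * grad                    # one update step

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

w_a = train_tiny_model(X, y, seed=42)
w_b = train_tiny_model(X, y, seed=42)   # same seed: identical weights
w_c = train_tiny_model(X, y, seed=7)    # different seed: different weights

print(np.allclose(w_a, w_b))  # True
print(np.allclose(w_a, w_c))  # False
```

Real training pipelines have many more randomness sources (data shuffling, dropout, GPU nondeterminism), which is exactly why checklists ask for seeds and software versions.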
Speaker 4:So there are really cool studies that I recently found when I looked into benchmarking in machine learning. For example, there's this paper by Ben Recht: everyone knows the ImageNet dataset, and what they did is they followed very closely the instructions for how the ImageNet dataset was created and produced something that's almost the same, they hoped. But the performance on this alternative dataset of all the models that were pretty good on
Speaker 4:ImageNet was like 10 to 15% worse. So this is relatively bad. The positive outlook they gave: well, at least the rankings, so models that are better on the original ImageNet dataset are also better on this alternative ImageNet dataset, at least that stays intact. And then they say, oh, that was what we were after anyway; we didn't want the exact performance metrics to persist, only the rankings. And yeah, it's a bit of an interesting move they made, I think.
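The distinction Timo draws here, absolute scores dropping while rankings persist, can be checked directly with a rank correlation. A small sketch with made-up accuracies (the numbers below are illustrative, not taken from the paper):

```python
from scipy.stats import spearmanr

# Hypothetical accuracies of five models on the original benchmark
# and on a recollected version of it (numbers are made up).
original    = [0.76, 0.71, 0.69, 0.64, 0.60]
recollected = [0.63, 0.59, 0.55, 0.52, 0.47]  # roughly 10-15 points lower

rho, _ = spearmanr(original, recollected)
drop = [a - b for a, b in zip(original, recollected)]

print(f"rank correlation: {rho:.2f}")   # 1.00 here: rankings fully preserved
print(f"mean accuracy drop: {sum(drop) / len(drop):.3f}")
```

A Spearman correlation near 1 with a large mean drop is exactly the pattern described: the benchmark no longer tells you how accurate a model is in absolute terms, but it still orders models consistently.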
Speaker 2:Well, I want to thank you both for your time here. This has been a really great two-part series on the interactions between science and machine learning, and we're really glad we could have you on to talk about your book.
Speaker 4:Thanks so much for having us.
Speaker 3:Yeah, thanks for having us, it was fun.
Speaker 1:Sure thing. And for our listeners, if you haven't already, please listen to part one of this podcast. You're bound to find a ton more information that supports everything you heard in this episode. Also, be sure to check out Christoph and Timo's latest book, Supervised Machine Learning for Science: How to Stop Worrying and Love Your Black Box, available online, and we'll put a link in the show notes. If you have any questions about this episode or any of our other topics, please contact us. Until next time.