The AI Fundamentalists
A podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses.
Beyond Boosted Trees: Christoph Molnar on the Rise of Tabular Foundation Models
As the AI landscape evolves, the methods we use to process structured data are undergoing a silent revolution. Join us to explore how Tabular Foundation Models (TFMs) are challenging the decade-long reign of tree-based algorithms, why the traditional "train and predict" workflow is being replaced by "in-context learning," and what this shift means for the future of resilient modeling.
To help us, Christoph Molnar, renowned expert in machine learning interpretability and author of the Mindful Modeler newsletter, joins us to share his perspective on the emergence of tabular transformers, the surprising power of synthetic data, and how to maintain model safety in a world without parameter updates.
- The decline of the "fit and predict" paradigm in tabular data
- Transformer architectures vs. traditional models like XGBoost and LightGBM
- In-context learning: Predicting without traditional training steps
- The role of Structural Causal Models (SCMs) in generating training data
- Why models trained on "math and probability" succeed on real-world datasets
- Hardware accessibility and running foundation models on local MacBooks
- Integrating SHAP values and conformal prediction for model interpretability
- The future of the data science workflow: One tool among many or a total shift?
This episode is full of technical insights and forward-looking predictions that are sure to change how you approach your next dataset. As we move into a new era of AI, it’s the perfect time to explore the fundamentals of the next frontier!
What did you think? Let us know.
Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:
- LinkedIn - Episode summaries, shares of cited articles, and more.
- YouTube - Was it something that we said? Good. Share your favorite quotes.
- Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
Welcome And The Big Idea
SPEAKER_02The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik. Welcome to today's episode of the AI Fundamentalists. Today we are excited to bring on Christoph Molnar, a returning guest of the show and an expert in machine learning and interpretability. He's now written his fifth book on machine learning, and we highly recommend his Mindful Modeler newsletter on Substack to listeners of this podcast. Christoph, welcome back.
SPEAKER_00Thanks for having me again.
SPEAKER_01Yeah, we're so glad to have you back. And you've been on a bit of a streak now on your newsletter talking about these tabular foundation models. I think they're really exciting, and I think it's definitely time for us to start talking about them. So I'll give a quick little intro and then we can start digging into this. Just for the audience: tabular foundation models might be the next frontier in tabular prediction. This field has traditionally been dominated by tree-based algorithms like XGBoost and LightGBM, and in some simpler use cases, just logistic and linear regression. This is a fundamentally different way of doing things. Do you want to start walking us through how this method works and why it might be better?
SPEAKER_00Yeah, for me it has also been a learning experience over the last few weeks, just by writing about it. As I learned about tabular foundation models, I also wrote about them in my blog. And it was quite a different way of thinking about supervised machine learning. In traditional regression and classification, we have this train-and-predict paradigm: we train a model with an algorithm, get a model, and use it for prediction. Tabular foundation models change how we approach this. These are basically large neural networks based on transformer models, which are pre-trained on a lot of data. They then skip the actual training process and do just prediction, which is basically in-context learning. So you provide, at prediction time, your entire training data set plus your test data set, and in a single forward pass you push it through the tabular foundation model to get out the predictions. During this process there are no parameter updates or anything. All the learning happened before, and during this new type of prediction step, the model basically does in-context learning: it compares the test data to the training data to get to the predictions. Coming from how we used to think in traditional machine learning, it's quite a paradigm shift.
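To make that in-context idea concrete, here is a deliberately tiny editorial sketch (not any real tabular foundation model, just an illustration of the API shape): `fit` only stores the data, and all computation happens inside `predict`, where inverse-distance-weighted voting stands in for the transformer's single forward pass over the training-plus-test rows.

```python
import math

class InContextToyClassifier:
    """Toy stand-in for a tabular foundation model's workflow: fit()
    only stores the data; all computation happens at predict() time."""

    def fit(self, X, y):
        # No parameter updates -- the "training" data is simply kept
        # as context for later prediction calls.
        self.X_ctx, self.y_ctx = X, y
        return self

    def predict(self, X_test):
        preds = []
        for x in X_test:
            # Single "forward pass": compare the test row against the
            # whole context (inverse-distance weighted voting here,
            # where a real model would use transformer attention).
            votes = {}
            for ctx_row, label in zip(self.X_ctx, self.y_ctx):
                d = math.dist(x, ctx_row)
                votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
            preds.append(max(votes, key=votes.get))
        return preds

# Two well-separated clusters, labels 0 and 1
X_train = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]]
y_train = [0, 0, 1, 1]
model = InContextToyClassifier().fit(X_train, y_train)
print(model.predict([[0.1, 0.1], [5.1, 5.0]]))  # -> [0, 1]
```

The point of the sketch is only the shape of the workflow: nothing is learned in `fit`, so swapping the context changes the "model" without any retraining.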
In-Context Learning For Tabular Data
SPEAKER_01Yeah, and for listeners at home, you might be hearing some really deep parallels with LLMs. That's because this paradigm is actually extremely similar. The same idea with LLMs is that we don't want to train an LLM for every little task we do. There should be some foundation model underneath it, trained on a massive amount of highly diverse data, which we can then use to either run inference directly on our tasks or do a little bit of extra fine-tuning on top for, let's say, insurance use cases or casualty use cases, or really anything else that these tabular models are used for, even some time series. And we'll get into that a little bit later. For people out there who are looking to just hop off this podcast and start doing things, some examples might be CARTE, XTab, TabLLM, TabPFN, or TabICL. There are a lot of great models out there, and I'm sure by the time we shoot this, two more will already have come out. It's a very happening space right now. But one thing that's going to be really interesting, which you talk about in your posts, is that unlike with language data, where the scope of human language is so vast that we have plenty of language to build a great probabilistic model, how do you get enough data to train a tabular foundation model?
SPEAKER_00Yeah, I think that's also a very exciting part of this whole development with tabular foundation models, and it also changed my view on synthetic data. There are different ways in which you can train these tabular foundation models, but a lot of them are trained purely on synthetic data. It can be shown that it helps to also train them on real data, but you get a lot of performance already out of training them on synthetic data, meaning you have a process from which you draw new data sets. Most of these work with structural causal models, where you sample nodes and edges; the nodes represent your features and the edges the relationships between them. You sample from those and then create a huge collection of data sets. We're talking millions of data sets. And these are then all used in pre-training these models. I think it's quite interesting that this actually works, because you have to come up with a data-generating process that later helps to make actual predictions with real data. Many of these tabular foundation models are only trained on synthetic data and still deliver very high predictive performance on real data, which I find very surprising. So this was certainly a new thing to me, that it works so well.
SPEAKER_03Yeah. One thing that, just conceptually at a very high level, thinking about it without the architecture: it does sound like this makes more sense than an LLM, like how it could work this way. Because with text the potential space is huge, but if you think of a matrix of numbers, being able to do really robust covariance and correlation analysis between the inputs and the target, we don't know exactly what's happening inside these things, but to Sid's point, it seems a little bit more bounded than you would have in the language context. So at a non-mathy level, it does intuitively make sense that this could be an interesting paradigm.
SPEAKER_00Yeah, and also if you look at the architectures and what they do: we have these transformer-based architectures with the attention mechanism. And while with large language models, as you said, we have this diverse text as input, with tables we also have different types of diversity, like that they can come with different numbers of columns, for example, and different types of columns. The way most of these foundation models work is that they have attention mechanisms that can work within a row or within a column. To make predictions, they represent each table cell internally with an embedding vector, and enrich this vector over multiple transformer steps by paying attention to other columns and other rows. That's the basic unit on which these transformer-based networks for tabular data make predictions, or build up the information needed.
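A rough editorial sketch of that two-axis idea, with uniform averaging standing in for learned attention (real models use content-dependent attention weights and much higher-dimensional cell embeddings):

```python
def mix(cells):
    # Uniform "attention": every cell mixes equally with its peers.
    # Real models learn content-dependent attention weights instead.
    n, dim = len(cells), len(cells[0])
    mean = [sum(c[k] for c in cells) / n for k in range(dim)]
    return [[(v + m) / 2 for v, m in zip(c, mean)] for c in cells]

def two_axis_step(table):
    """One transformer-style update over a table of cell embeddings:
    mix information within each row, then within each column."""
    rows, cols = len(table), len(table[0])
    table = [mix(row) for row in table]          # row-wise mixing
    for j in range(cols):                        # column-wise mixing
        col = mix([table[i][j] for i in range(rows)])
        for i in range(rows):
            table[i][j] = col[i]
    return table

# A 2x2 table where each cell is a 1-d "embedding"
table = [[[1.0], [3.0]], [[5.0], [7.0]]]
print(two_axis_step(table))  # -> [[[2.5], [3.5]], [[4.5], [5.5]]]
```

After one step, every cell's embedding already carries information from its whole row and its whole column, which is the intuition behind stacking several such layers.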
SPEAKER_03And when digging into your post, one thing that I thought was really interesting, and I'd love for us to touch on as well, is the fact that it uses some structural causal modeling under the hood to help model the relationships.
SPEAKER_00Yeah, that's for creating the synthetic data, actually. Also interesting: these are called priors, because you can motivate these tabular foundation models, with pre-training and in-context learning later, through Bayesian inference. You're basically modeling the posterior predictive distribution of your test target, and the prior is basically the data sets, or the mechanism that generates your data sets. Many use these structural causal models, which kind of makes sense, because these could be, or seemingly are, a good model of how tabular data is actually created in reality: that there is some underlying causal model. A structural causal model can in part be represented by a directed acyclic graph, where you have nodes, which represent variables that later become the features and the target, and edges between them. These are directed edges, so we can say that one node influences another; in your tabular data set, that would mean you have correlated columns. By using these structural causal models, you can create a lot of diverse data sets: small ones, large ones, ones with highly correlated features, and so on, and also with different data types. And in this data-generating process, they also usually sample the connections between these features. Is it a linear relationship? Is it maybe a tree-based relationship with a huge jump, and so on? So the labs that pre-train these models create a diverse set of these data sets to train on.
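A minimal sketch of that sampling recipe (an editorial simplification; real priors such as TabPFN's are far more elaborate): draw a random DAG over a few nodes, attach a random linear or step mechanism to each edge, then generate rows by walking the nodes in topological order.

```python
import random

def sample_scm_dataset(n_rows, n_nodes=4, seed=0):
    """Sketch of SCM-based synthetic tabular data: random DAG, random
    edge mechanisms, rows sampled by traversing nodes in order.
    Node order 0..n-1 guarantees acyclicity (edges only point forward)."""
    rng = random.Random(seed)
    # parents[j] = list of (parent_index, mechanism) pairs
    parents = {j: [] for j in range(n_nodes)}
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < 0.5:  # sample an edge i -> j
                w = rng.uniform(-2, 2)
                if rng.random() < 0.5:
                    mech = lambda x, w=w: w * x        # linear edge
                else:
                    mech = lambda x, w=w: w * (x > 0)  # step/"jump" edge
                parents[j].append((i, mech))
    rows = []
    for _ in range(n_rows):
        vals = [0.0] * n_nodes
        for j in range(n_nodes):
            noise = rng.gauss(0, 1)
            vals[j] = noise + sum(m(vals[i]) for i, m in parents[j])
        rows.append(vals)
    return rows

data = sample_scm_dataset(n_rows=5)
print(len(data), len(data[0]))  # -> 5 4
```

Repeating this with fresh random graphs, mechanisms, and sizes yields the "millions of data sets" flavor of diversity described above, with correlated columns arising naturally from the sampled edges.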
SPEAKER_01And this really rings true with an episode we just did in the podcast about metaphysics and causality, which is this idea that there's networks of causal relationships. So we might look at like height, weight, smoking, and how that interacts with heart attacks, and then how heart attacks interact with mortality. And it's more than just simple correlation. Like there's actually probabilistic dynamics between these variables. And if we can model this in synthetic data, we can create a kind of synthetic data which is actually learnable and actually delivers real insights to tabular models. We've talked a lot about synthetic data on this podcast and how like synthetic language data is actually really questionable, right? Because it doesn't necessarily capture a lot of this diversity or a lot of these like relationships in language. But in math and in numbers, we can really directly derive those relationships out in data again.
SPEAKER_00Yeah, I also think it's really interesting because if you have tables, you capture your variables in whatever industry you're in. But they are generated through some process which you can make assumptions about from your domain knowledge: okay, A influences B, and so on. And I think these structural causal models are a good, or seemingly a good, way to mimic how data in the real world is generated. What they also do, which I find interesting, is that they also sample which of these nodes later become a feature; only a subsample of nodes from such a graph is actually used in a table that then goes into pre-training the tabular foundation models. So we get these effects of, hey, the target could actually be a cause of another feature, if we think in causality, and not the other way around; or maybe we don't observe all the confounders. If we again think in causal language, we only observe our causal network partially, so we don't have all the knowledge, which is true for the real world: we sometimes cannot measure all the features that would be necessary, and we don't know all the causal structures, but only observe part of them. And during pre-training, the model doesn't know this structural graph. It's just a table that goes into the pre-training.
Model Size, Speed, And Prompting
SPEAKER_01And, just to round back on scale: foundation models can be a little bit scary. We talk about these language foundation models; these are, as far as we know, trillion-parameter models. These don't fit on a single GPU; they have to go on an entire cluster just to run inference. With these models, we're talking about millions of tabular data sets. How much training data is going in, and how big are these final models? Could I just run this on my laptop?
SPEAKER_00Yeah, you can. I don't know the parameter numbers by heart, but the model weights are a couple of hundred megabytes. And I tried a few of them on my MacBook M1, so without a graphics card, just on the CPU. It can be a tad slow, definitely slower than using boosted trees, for example. But you can run this on your machine, yeah.
SPEAKER_01Yeah, I think that's really exciting. And if we can just use scikit-learn, download this model, and maybe it's just a gigabyte large, that makes this extremely accessible. So even if these aren't better, I think this makes them absolutely worth being part of our toolbox. But digging in a little bit on what you were saying: if this model now just exists on our computer as something we can immediately run predict on, and we no longer need to do this fit piece, do you think that moves us a little bit more towards a prompting era for tabular models? Or an era where people are just doing few-shot learning, like really just, here's 10, 20 samples?
SPEAKER_00I think the prompt analogy only works in part, because the prompt is the training data, like your data points, and you can definitely play around with this. So you can provide a smaller context, which basically means a smaller training data part. And when I say training data in this context, I mean the labeled part of your data that goes into the in-context learning. You can think of it like this: if you have a data set, you basically paste together your training and test data. Actually, you keep them separated, because most of these tabular foundation models still use the scikit-learn API, but internally it's basically pasted together, or concatenated, and then the test data can pay attention to the training data for the model to make predictions. And of course, you can now change the context for making predictions. That's also one of the tricks, because inference is expensive with tabular foundation models, especially compared to training, because the fit process doesn't really exist; it's basically loading model weights and doing some pre-processing of the data. Pre-training, again, is expensive, of course. But if you already have the pre-trained model, then you no longer need the training process and only do the prediction. And this is more expensive than boosted trees, at least for now. So one of the tricks to bring this time down, to make faster predictions, is to provide a smaller context window. Maybe you know that a subset of your training data is enough to provide as context for your predictions. There are a few of these tricks you can use.
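One hedged sketch of that context-shrinking trick (the function name and the stratified-sampling heuristic are illustrative, not from any specific library): subsample the labeled context, stratified by class, before handing it to the model's predict step.

```python
import random

def shrink_context(X_train, y_train, max_rows, seed=0):
    """Sketch of the context-shrinking trick: subsample the training
    data (stratified by label) before providing it as context to a
    tabular foundation model, trading some accuracy for speed."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(X_train, y_train):
        by_label.setdefault(y, []).append(x)
    per_label = max(1, max_rows // len(by_label))
    X_ctx, y_ctx = [], []
    for y, rows in by_label.items():
        for x in rng.sample(rows, min(per_label, len(rows))):
            X_ctx.append(x)
            y_ctx.append(y)
    return X_ctx, y_ctx

# 100 labeled rows shrunk to a 20-row context, keeping both classes
X = [[i, i % 3] for i in range(100)]
y = [i % 2 for i in range(100)]
X_small, y_small = shrink_context(X, y, max_rows=20)
print(len(X_small))  # -> 20
```

The shrunken pair would then be passed as the `fit` context in place of the full training set; smarter selection (e.g. keeping rows most similar to the queries) is where the real tricks live.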
SPEAKER_01Yeah, and in terms of practical performance: when I use this kind of model, I'm definitely finding we're getting 80, 90, sometimes 100% as accurate as a traditional model, and I'm not having to do any training, which is really wonderful. But on some use cases, I'm definitely getting significantly degraded performance. What's our recourse for trying to fine-tune one of these models? Is that accessible?
SPEAKER_00So these are neural networks that you can fine-tune. And I think that's one of the big selling points, that you can fine-tune them if you have very specific data. Beforehand, you also don't know: will it work for my case or not? In many cases, the tabular foundation models are beating or on par with the state of the art, which is mostly boosted trees, but that's of course not guaranteed. In other cases, you might find it doesn't outperform them, or is actually much worse. And then there's the potential to fine-tune those models on very specific tabular data that may look similar to what your data looks like. But this also means you have to put in assumptions to generate new data, if you want to use synthetic data; or maybe you already have a lot of similar, labeled data sets that you can use for fine-tuning. So this really depends on whether you have other data that looks like your data set, or can create synthetic data with a similar structure.
SPEAKER_01Yeah, this is definitely starting to smell a little bit like a business, like what we saw with MedBERT. It sounds like there's room for organizations to collect a lot of the synthetic data for a specific use case and make their own tabular foundation models. Have you seen any of that going around?
SPEAKER_00Yeah, there are actually a lot of companies now, or startups, that center on training these foundation models. That's something I haven't seen with other types of models; I haven't seen any specific random forest company. Probably there's one out there, and I'm not complaining. But there are now a lot of startups pre-training these models. And I think the main thing, the IP, what's valuable, is really the prior: how do you generate the data to then cover certain use cases? Yeah.
SPEAKER_01And then I guess a question for them, and for us here: it sounds like we might be taking on some trade-offs with these models. Since we are going to a foundation model, we are inherently putting a black box in our pipeline, right? There's some part of the model whose mechanisms we don't fully understand, just because transformer models are so dense; they're neural networks, and we can run inference on them. But what are our options? How can we get back some of our explainability, in a space where with trees and with boosting we have almost perfect explainability?
SPEAKER_00In a way, we are in the same situation, because arguably those are also black boxes, and we just learned to use a lot of tools, like feature importance or Shapley values or partial dependence plots, to increase the interpretability. Inherently, these tools are just describing how the models behave. And since most of these interpretability tools are model-agnostic, we can still use them for tabular foundation models. One of the problems is that predicting is more expensive now, and for tree-based models we have a lot of faster implementations, especially for Shapley values, where we have TreeSHAP, which is much faster than the model-agnostic implementations. So we're now in a situation where we can still do interpretability for these tabular foundation models, but it's more expensive and just takes much longer. But I expect there will be more research going on in this area, and maybe we'll get some faster implementations there as well.
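As a concrete example of a model-agnostic tool that carries over unchanged, here's a small permutation-importance sketch; it only needs a black-box `predict` function, so it works for a tabular foundation model too, at the price of repeated (expensive) prediction passes.

```python
import random

def permutation_importance(predict, X, y, feature, n_repeats=5, seed=0):
    """Model-agnostic importance: how much does accuracy drop when one
    feature column is shuffled? Works for any black-box `predict`,
    tabular foundation models included -- just slower, since every
    repeat costs a full prediction pass."""
    rng = random.Random(seed)

    def accuracy(rows):
        preds = predict(rows)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

    base = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)  # break the feature/target relationship
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, col)]
        drops.append(base - accuracy(X_perm))
    return sum(drops) / n_repeats

# A black-box "model" that only uses feature 0
predict = lambda rows: [int(r[0] > 0) for r in rows]
X = [[1, 9], [-1, 9], [2, 9], [-2, 9]] * 5
y = [1, 0, 1, 0] * 5
print(permutation_importance(predict, X, y, feature=0) > 0)   # -> True
print(permutation_importance(predict, X, y, feature=1) == 0)  # -> True
```

The same call works whether `predict` wraps XGBoost or a tabular foundation model; only the cost per call changes, which is exactly the trade-off described above.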
SPEAKER_03I think that's a really interesting and critical point, that you still have the same toolbox. So it's not, yay, it's an LLM; you still have the same sort of explainability, hypothetically, that you would with XGBoost. I think that's a key part: you can still have a lot of that. Yes, it's slower now, but I would imagine that's definitely another area for the startup space we talked about; there are definitely some optimization techniques and things we could do to get that a lot faster longer term.
SPEAKER_00Yeah. And maybe we also have a new area opening up, because for large language models we have mechanistic interpretability and these things, like visualizing the attention and so on, which might be valuable. I haven't seen much on this yet, or done any deep dive, but it could also be a potential area for additional insights into the model.
SPEAKER_01Would we have an opportunity to use stuff like we've talked about before, like the uncertainty quantification or conformal prediction? Are these still options in this world?
SPEAKER_00Since they are also model-agnostic, conformal prediction still works. And there you don't have the drawback of the predictions being slow, because conformal prediction is something you do as a post-processing tool. As an example for someone who doesn't know what conformal prediction is: it's basically a toolbox for conformalizing uncertainty quantification. If you have a quantile regression, say with the 10% and 90% quantiles, and you want to ensure that the predictions have 80% coverage, at least on average, then you can conformalize it with conformal prediction tools. Conformalizing basically means you get guarantees that on average the interval really covers 80% of the real values. And you can still apply these tools to tabular foundation models as well. Interestingly, the way tabular foundation models are trained, especially for regression, you already get a model that can do quantile predictions. At least that's true for TabPFN and TabICL. They are both trained in a way that, if you're using them for regression, you not only get the mean prediction, but actually mostly the entire distribution. So you can also ask the model to give you the prediction for the median or for the 30% quantile, which I think is pretty neat.
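A minimal split-conformal sketch of that idea (a generic recipe, not tied to any particular model or library): score how far each calibration truth falls outside the predicted [lo, hi] interval, take the (1 - alpha) quantile of those scores, and widen every future interval by that amount.

```python
import math
import random

def conformalize(lo_pred, hi_pred, X_cal, y_cal, alpha=0.2):
    """Split conformal prediction: widen a quantile-regression interval
    [lo, hi] so it covers the truth with probability >= 1 - alpha
    on average."""
    # Conformity score: how far outside the interval the truth falls
    # (negative when it is safely inside).
    scores = sorted(max(lo_pred(x) - y, y - hi_pred(x))
                    for x, y in zip(X_cal, y_cal))
    n = len(scores)
    # Finite-sample-corrected (1 - alpha) quantile of the scores
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    return lambda x: (lo_pred(x) - q, hi_pred(x) + q)

rng = random.Random(0)
X_cal = [rng.uniform(0, 10) for _ in range(200)]
y_cal = [x + rng.uniform(-1, 1) for x in X_cal]

# A deliberately overconfident interval predictor (far too narrow)
lo = lambda x: x - 0.1
hi = lambda x: x + 0.1
interval = conformalize(lo, hi, X_cal, y_cal, alpha=0.2)

X_test = [rng.uniform(0, 10) for _ in range(200)]
y_test = [x + rng.uniform(-1, 1) for x in X_test]
covered = sum(l <= y <= h
              for (l, h), y in ((interval(x), y)
                                for x, y in zip(X_test, y_test)))
print(covered / len(X_test))  # roughly 0.8 on this toy data
```

Because the recipe only touches the interval endpoints, the same post-processing applies whether those quantiles come from quantile regression, boosted trees, or a tabular foundation model's distributional output.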
Uncertainty, Defaults, And The Future
SPEAKER_01Yeah, I think that's super exciting. And I think it shows, even further, that this is expanding our toolbox of machine learning tools past our simple, good old-fashioned AI techniques. In the landscape of tabular modeling, then, how do you see this fitting in? Do you think this is going to be an utter replacement? Or do you think this is a new and unique way of doing things?
SPEAKER_00Yeah, in my latest post I wrote about this: okay, how will this change data science, or especially tabular machine learning? The first scenario is that it's just going to be one of many models that you try out. So just like the latest boosted tree model, maybe it's even state of the art and the best model, but you're still in a situation where you maybe try out 10 models or 10 algorithms, see what works best on your data set, stick with the one that works best, and continue from there. We're clearly there, I would say: the software is there and it's usable, at least for small and medium-sized data sets, and you can just try it out and use it. And if the tabular foundation model outperforms all the other models, then maybe you want to go with it, depending on your use case, of course, like how many predictions you need to make all the time, and so on. The second scenario would be that it becomes the default model, because it works so well; it's like your all-in-one model, because it can already do quantile regression and so on, and maybe there's an ecosystem around it, so it's just convenient to use and becomes your go-to. And the third level would be that we have a huge paradigm shift, a little bit like large language models did for NLP. Such a strong new paradigm that it really becomes the go-to: you would always try this first, this is the area where you feel comfortable, it basically works in 90-plus percent of your cases, and you always use it to solve your tabular machine learning tasks. But this really depends on how it all develops, especially in terms of compute and the ecosystem around it, with all the interpretability and uncertainty quantification tools.
SPEAKER_03Yeah, because at the current state, it sounds like it's a really good option for batch processing, but for most companies that are using any sort of online inference, it's too slow to really be considered at the moment. Would you agree?
SPEAKER_00Yeah, 100%. That comes with the idea of in-context learning. If you have the scenario where you make predictions maybe once a day or once an hour, but only one by one, you always have to provide the entire context. You can do some optimizations there, if maybe a smaller context is enough. But if you don't, you would always provide your entire training data set to make just one prediction. Whereas if you were using a random forest or CatBoost or whatever, it's pretty fast: you take one data point and just make the prediction, because training was expensive but predicting is not so expensive. With the tabular foundation models, the better scenario is that you do, as you said, batch prediction. Maybe just run it once a night, for all your customers or whatever, and it's fine if it takes one hour instead of ten minutes, or one second, maybe. But if you have this ongoing online prediction type of situation, then it's really expensive, yeah.
SPEAKER_01And just to start getting our concluding thoughts out there: I think one really interesting insight that you had in the blog post is that because this is so methodologically distinct, these are potentially really good for ensembling models. And one thing I would add on top of that is that we now have a new strong baseline, right? You have your most frequent class, you have your average, and now you have: can something out of the box do it? And I think that's a question a lot of implementers are looking at now. It's: am I going to build the software solution, or is ChatGPT going to do it just as well or better? And if you can't clear that bar, then, you know, why spend the R&D to develop something? I think this is a new level of proof we need to provide.
SPEAKER_00Yeah, I think so. For me, when I encounter a new machine learning task, I have the reflex to try out a random forest, for the simple reason that it already works without hyperparameter tuning. It's fast, so it's very convenient, and you always get this first idea of, okay, how well can I predict the task? Where's the line? And of course you can improve on it afterwards, maybe by using boosted tree models like XGBoost or CatBoost, but it gives you this first idea. And I think these tabular foundation models can now be used in the same way, because you don't need to hyperparameter-tune, and they work for classification and regression. They also work quite well, in many cases at least. So you can just throw your problem at them when you encounter it the first time, get a first baseline, and know where you stand. Afterwards, if it's a concern that predictions are expensive and so on, you might switch to a different model. But for this first encounter with a new machine learning problem, it's a good way to start. Yeah.
SPEAKER_01I just have one last question, which is, looking back on your Supervised Machine Learning for Science book, would you have added this? Would this have made the inclusion?
SPEAKER_00Yeah, it's an interesting case. The way we wrote the Supervised Machine Learning for Science book, we said, okay, there's this way we do supervised machine learning, but it's not enough for science, so we need all these tools, from interpretability to robustness to uncertainty quantification, and we wrote it in such a way that these are model-agnostic. And even though the paradigm has now changed with the tabular foundation models, these tools remain the same. Just as we talked about, you can still use interpretability tools, you can still use uncertainty quantification in the same way. So many of these tools are actually the same, or at least you can still use them. What's a bit different now is the idea of pre-training, which I think adds a new layer to all of this, because now you could say, hey, maybe my very specific scientific application has a very specific structure, and I want to put more domain knowledge into my modeling approach. With these tabular foundation models, you can do that through pre-training: you could fine-tune them on synthetic data or similar data, and give this prior knowledge to the modeling process in this way. For instilling domain knowledge into your model, I think this adds a new layer. So in this place, I would maybe make an update in the book, yeah.
SPEAKER_01Great. Any other closing thoughts from Andrew or Susan on this?
SPEAKER_03Christoph, it's always a pleasure to have you on. This is a great topic. We thoroughly enjoy following your work, we're excited for whatever next topic or book comes, and we'd love to have you back if you're willing. Thank you for coming on the show.
SPEAKER_02Yeah, I echo that, Christoph. Earlier in the discussion, when you were talking about the compute power and everything like that, I was thinking about what's going on right now with Apple chips and people now actually going out and buying them. Apple can't keep up with their supply of Mac minis, because people have now found that they can do this very locally, on very little compute. I was excited to put that together. So thank you.
SPEAKER_00Yeah, thanks for having me. Always a pleasure talking to you. And especially the tabular foundation models are something I'm just excited about at the moment, so it was really cool to chat with you about this.
Farewell Announcement And Contact
SPEAKER_02Yeah, absolutely. All right. And with that, for our listeners, we have one short announcement for you. This will be my last podcast here with the team. I'm moving on, and I'll actually get to admire our work from afar. So as you keep listening to our episodes and keep contributing your thoughts, you're probably going to hear from a new face. His name is Mike Moore, and he will be helping Sid and Andrew get their thoughts and guests together for the show. I really appreciate all of the feedback that listeners have given us over the past three years. Again, to all our guests, and Sid and Andrew, and everybody listening, thank you. If you have questions about the show, same email address: ai fundamentalists at monitaur.ai. We'll see you next time.