The AI Fundamentalists
A podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses.
Why validity beats scale when building multi‑step AI systems
In this episode, Dr. Sebastian (Seb) Benthall joins us to discuss research from his and Andrew's paper entitled “Validity Is What You Need” for agentic AI that actually works in the real world.
Our discussion connects systems engineering, mechanism design, and requirements engineering to multi‑step AI that delivers measurable outcomes in the enterprise.
- Defining agentic AI beyond LLM hype
- Limits of scale and the need for multi‑step control
- Tool use, compounding errors, and guardrails
- Systems engineering patterns for AI reliability
- Principal–agent framing for governance
- Mechanism design for multi‑stakeholder alignment
- Requirements engineering as the crux of validity
- Hybrid stacks: LLM interface, deterministic solvers
- Regression testing through model swaps and drift
- Moving from universal copilots to fit‑for‑purpose agents
You can also catch more of Seb's research on our podcast. Tune in to Contextual integrity and differential privacy: Theory versus application.
What did you think? Let us know.
Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:
- LinkedIn - Episode summaries, shares of cited articles, and more.
- YouTube - Was it something that we said? Good. Share your favorite quotes.
- Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik. Welcome to today's episode of the AI Fundamentalists. Today we're going to be talking about validity. And to help us, we have today's guest. We're excited to welcome back Sebastian Benthall, or as you're going to hear referenced throughout this podcast, we call him Seb. He's a senior research director at the Information Law Institute at NYU School of Law and a research scientist at the International Computer Science Institute. Seb, welcome back.
SPEAKER_03:Thanks for having me.
SPEAKER_02:So we were reading through this work recently, and we see that it's still in submission and obviously subject to change. But we really liked a lot of what's happening in this work, and we were interested in whether you could help us break down a lot of the concepts in it, since there's a lot we can learn from an academic standpoint and bring over to our worlds, either in industry or in other research fields.
SPEAKER_03:This paper was really born out of conversations with Andrew about frustrations with the way industry has been framing AI, and it was meant to be an intervention in a broader conversation, which is part of the role of academic work. We're seeing now a wave of agentic AI hype. And whenever there's one of these hype waves, it's useful to unpack the history of it, try to figure out what concepts have been feeding into the current moment, and see what's on the other side of it. And so we really had to wrestle with, among other things, the notion of AI and agency. Agentic AI is a new term that sounds like it's the same thing as an AI agent, but it flips it on its head in interesting ways. An agent, classically in AI theory, is anything with a sensorimotor loop: it senses the world and then it acts on the world. And if it's intelligent, then according to Stuart Russell, one of the modern godfathers of AI, practicing professor, textbook author, that agent is going to be acting to achieve goals, typically defined by a utility function. And if it's solving these problems over time, then it will be a kind of intertemporal utility function, expressed as a Bellman equation that relates future value and present reward. So when people were trained in AI over the past several decades and learned about reinforcement learning, this is what they heard: AI agents operating in this way. There's another sense of agency that comes more from law and economics, but I think it's relevant. People think about having an agent, your agent. That's the idea that, legally or in economics, you have someone working for you. They're your employee, your doctor, your lawyer. And legally, there's something called agency law, which means these agents have duties to their principals in order to settle what's called the principal-agent problem: how do you trust your agent to really work in your interest? With agentic AI, I think there's a blending of these ideas, because there's almost a promise that industry is making that your agentic AI will be your agent. Maybe not quite in a legal sense, but almost. But it's also an AI agent in the sense that it's achieving goals. But whose goals? Its own? Yours? So already there's an ambiguity that blurs some of the questions of trust, which I think are actually essential when we're talking about these problems. Other aspects of agentic AI that are essential in the contemporary definition have to do with problem complexity and using multiple steps to accomplish a task. This seems to be a reaction to the limitations of the large language model, the tool that is fueling everything. There was this promise that large language models would be generally intelligent. A trillion-dollar bet has been placed on a particular architecture of neural networks applied to natural language as the path to general intelligence, right? And it hasn't succeeded at many complex tasks. So the idea is, well, if you start layering multiple queries to an LLM, you can accomplish something more complex. That's multi-step reasoning, and there have been some limited successes in this area. There's also, I think, a really important element, which is external tool use.
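To ground the reinforcement learning framing Seb references, here is the textbook form of the Bellman optimality equation for an agent maximizing an intertemporal utility function. The notation (states s, actions a, reward r, transition probabilities P, discount factor gamma) is the standard convention, not anything specific to the paper under discussion.

```latex
% Bellman optimality equation: the value of a state is the best immediate
% reward plus the discounted value of wherever the chosen action leads.
V^{*}(s) = \max_{a} \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big]
```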
So people are saying, rather than just answering questions, your browser could be smarter; it could go shopping for you. That's the holy grail. If these companies could find some way to intermediate consumer spending, they would be very, very happy, right? So how much can this thing buy for you, or operate other aspects of your work life, your productivity suite, and so on? The thing is, when you start implicating multiple stakeholders who care about the use of this system, and the user's data, which might be private data that may or may not be leaked to an external tool, you're not talking about the simple abstraction of the utility-maximizing agent anymore. You're talking about a real embedded system with a social context. And there is a lot of discipline in systems engineering, security engineering, and so on, but these are not native to the AI conversation, and they're starting to leak into these discussions about agentic AI. There are recent NeurIPS papers that are really systems engineering papers, but they're latecomers to this larger conversation, because they're really compensating for what LLMs cannot do themselves.
SPEAKER_00:Seb, thanks for that. That's a great setup. And that's where you and I really connected in talking about this: from our backgrounds in economics and systems engineering, that's the missing link here. As you pointed out, you have this monolithic LLM that's supposedly intelligent and can do everything, and now we're going to make it do multi-step things. That's a little bit of living in a dream world right now. As we talked about on our last podcast, the scaling is showing cracks. The promise of infinite scaling, where we're talking about moving data centers into space or under the water, is being tested, and we're also seeing that the return on more scaling is falling apart on these systems. But where it really comes down to now with agentic AI is using these models for the multi-step world. You mentioned Bellman and the optimization there, but that's still an actual optimization. Any of these algorithms, even an LLM, optimizes on something; LLMs optimize on sounding human, on predicting the next word. So we're running into what I've started calling a flying-car problem with how agentic AI is being talked about today. What's really crazy is that at the beginning of 2025, people were talking about it like it's here already, that you're going to be replacing people, which is a whole other can of worms, and that you're going to be using it in the enterprise setting. But we all know quantum computing isn't here yet, fusion power isn't here yet, self-driving cars aren't really even here yet, and flying cars are definitely not here yet. Yet somehow agentic AI was sold as already being here. So we're having a bit of a crisis now. There was actually a great article I was reading last night; the Wall Street Journal came out with something really interesting: "We let an AI run our office vending machine; it lost hundreds of dollars." A vending machine is one of the examples I use in conversations, alongside the travel agent example we've been talking through, because it sounds like a low-risk system. Well, it turns out this low-risk vending machine experiment that Anthropic did with the Wall Street Journal ended up buying PlayStations and losing a ton of money. The whole idea was just to make enough money to survive, or at least break even, as an autonomous agent running an office vending machine. And it found ways to lose hundreds of dollars and give people PlayStations out of a vending machine. These are the issues you get without a systems engineering approach, and I think that's what Seb and I are really trying to set the stage for with the methodology: to actually have successful multi-step systems, and this has been done in pockets of Seb's research and in agent-based work for a long time, you have to do systems engineering, where individual components are thoroughly validated and then engineered together.
That's why, and I know I talk about it way too much, probably on this podcast too, the Apollo program is so impressive. NASA is still leading on systems engineering, and very few people understand those areas, and that's what's great about it bleeding into an AI world that has historically just scaled. Saying "we just scale" is an overgeneralization, maybe an unfair one, but the computer science industry sometimes operates that way. Contrast that with the fact that we sent people to the moon using something much less powerful than our smartphones, right? By designing, validating, and fitting these systems together, we could build robust multi-step agentic systems, but we're going to have to bring back systems engineering, control theory, and different optimization techniques, even for things like the vending machine example we've talked about a lot on this podcast. If that agent had used utility theory, basic linear programming, and basic ROI-maximization calculations, it would have made money, not bought people PlayStations that weren't even on the SKU list, right? So I think there's a major reset needed, and it's a methodological framing, which is what Seb and I are trying to accomplish with this setup, and I think he's done a great job with this initial draft.
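Andrew's point about utility theory and basic linear programming can be made concrete with a small sketch. Every product name and number below is made up for illustration, and the solver call is just one way to pose the problem; the point is only that a deterministic optimizer with an explicit objective and constraints cannot "decide" to buy PlayStations, because they are not in its decision space.

```python
# A minimal restocking sketch: choose how many units of each product to stock
# so expected profit is maximized under a budget and shelf-space constraint.
# All costs, prices, and capacities are hypothetical.
from scipy.optimize import linprog

products = ["soda", "chips", "candy"]
unit_cost = [0.50, 0.75, 0.60]        # what the agent pays per unit
unit_price = [1.50, 1.75, 1.25]       # what customers pay per unit
margin = [p - c for p, c in zip(unit_price, unit_cost)]

budget = 100.0                         # cash available for restocking
capacity = 120                         # total slots in the machine

# linprog minimizes, so negate the margins to maximize expected profit.
c = [-m for m in margin]
A_ub = [unit_cost,                     # total spend must stay within budget
        [1, 1, 1]]                     # total units must fit in the machine
b_ub = [budget, capacity]
bounds = [(0, None)] * len(products)   # cannot stock negative quantities

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
for name, qty in zip(products, result.x):
    print(f"stock {qty:.0f} x {name}")
print(f"expected profit: ${-result.fun:.2f}")
```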
SPEAKER_02:And what I'd really like to dig into is this: we have these existing issues, compounding errors in multi-step processes, difficulties applying these to complex systems, potential misalignments, and maintaining governability. I guess it's right in the title, but I want to ask why validity is what you need. How does validity help us address these concerns and potentially resolve them?
SPEAKER_03:One of the hardest things about what's implied by agentic AI is how many different actors there are. If agentic AI involves tool use, you have a third party. You also have your end user, who might be one of several stakeholders involved in the procurement decision for the agentic AI system. You've got the developer of the agentic AI system itself, who may or may not control the large language model, and certainly might not be able to trust or evaluate everything the large language model can do. So right now you've got four different actors, all potentially misaligned. And even if each of those actors performs their job perfectly, you might find that the end user is not satisfied, or some of the stakeholders are not. So the point of the title of the paper was just to say: keep your eye on the prize. What is actually being accomplished by this thing? People marketing foundation technologies are going to try to say all you need is this foundation technology. It's a play on the "attention is all you need" argument, that to solve all AI problems you just need this particular transformer technique, which is an incredibly ambitious, perhaps hubristic statement. What an end user actually needs is something that validly accomplishes their goals. And that is just definitional; we're proposing something close to a tautology to answer what might be overhyped. But work backwards from that: how do you show validity to the end user of the system, or the stakeholders of the system? I'd say the principal stakeholders of the system, that is, the principals to whom the AI system is the agent. Well, that requires a lot of testing and a lot of demonstration, and it involves giving the principal the information they need to trust the system, given that the system might be changing over time. Every six months there's a new LLM model release; we've seen this before. End users who loved the chumminess of one version of the OpenAI model were then disappointed because the personality changed later on. If you have a software pipeline, what tests are you running when you change the LLM model in the back end to ensure that it achieves the same level of quality you've come to expect? That needs to be communicated back up to the user by the agentic AI system designer. So I think these are challenges that the LLM engine is not going to address. The question is how do we shift focus to these problems? What is the scientific, rigorous method of solving them? And how do we focus on real human needs?
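Seb's question about what tests run when the back-end model changes points to ordinary regression testing. The sketch below is one possible shape for it, assuming a hypothetical `answer_query` wrapper around whatever LLM backend the pipeline uses; the golden cases, model ids, and thresholds are invented for illustration.

```python
# A minimal regression-test sketch for swapping LLM backends: pin a set of
# behavioral expectations so a model upgrade that silently changes quality
# fails CI instead of surprising the end user. `answer_query(prompt, model=...)`
# is a stand-in for whatever interface the real pipeline exposes.
import pytest

from my_pipeline import answer_query  # hypothetical application wrapper

GOLDEN_CASES = [
    # (prompt, substring the answer must contain)
    ("What is our refund window?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
]

MODELS_UNDER_TEST = ["backend-v1", "backend-v2"]  # current and candidate models


@pytest.mark.parametrize("model", MODELS_UNDER_TEST)
@pytest.mark.parametrize("prompt,expected", GOLDEN_CASES)
def test_answers_preserve_required_facts(model, prompt, expected):
    answer = answer_query(prompt, model=model)
    assert expected.lower() in answer.lower()


@pytest.mark.parametrize("model", MODELS_UNDER_TEST)
def test_refusals_stay_in_policy(model):
    # The agent must keep refusing out-of-scope actions after a model swap.
    answer = answer_query("Please wire $500 to this account.", model=model)
    assert "cannot" in answer.lower() or "not able" in answer.lower()
```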
SPEAKER_00:The more we talk about this, and I so enjoy our conversations, the more we get really abstract, like the metaphysics series we've been doing on the podcast, where we work through the metaphysics of thinking, what thinking actually means, going back to Aristotle and Plato and so on.
SPEAKER_03:So cool.
SPEAKER_00:But you and I do that kind of thing about AI when we talk. So actually, we might want to have you come in on one of the metaphysics-specific episodes; we'll debrief after the recording and get you up to speed on that. But in any case, I think this is really the heart of what AI governance even is. AI governance is not a compliance task; it's not a paperwork, paper-pushing thing, even though people put all this baggage on it. To me, the root of it is understanding that it's a principal-agent problem, whether or not the system is agentic: what is your system doing, how is it performing, who is checking it, and how are you making sure it's fit for your purpose? This is where it comes back to earlier models, or even cruise control, or testing and validation, or why we have engineering at all. We really need to get to AI engineering. There's software engineering, there's civil engineering; we need AI engineering, and we're not there yet. We have hackers, we have programmers, but we haven't really gotten to AI engineering. And I think that's the key to what governance even is: governance is making sure you have alignment on what you're trying to accomplish. Documentation helps, first because writing something down is part of your own thinking process and internalizing it, and second because if I want Seb to understand it, it needs to be written down. But it really comes down to two parts, and this is where validity and validation are the heart of it: what is the objective, and is the system fit for it? You can't get away from that; no LLM will ever think for you. Even if we do get superintelligent LLMs that replace accountants, which I don't see happening anytime soon, but hypothetically, then as we've seen with every other technological innovation, the part that makes us human, the actual thinking, the reasoning, the abstract thinking, is what we'd still be doing. You'd just be leveraging the tool for the other capacities. Whenever you're leveraging a tool as an agent, you're the principal and you're leveraging an agent, and even a one-step generative AI system you prompt is still an agent by that broad definition. You are the one defining what it is you want. The siren song of AI seems to be "I don't want to think." And as a company, validation and validity of any AI system, especially now that we're going from single-step models to potential multi-step models, become so much more critically important. The only way you can really get to the world you talked about, Seb, of agents interacting with other agents, say over the MCP protocol, is to have them independently validated, with an orchestration layer, the systems engineering of how you put the pieces together. Sorry to bring up Apollo again, but you didn't build the rocket in one factory.
It was built all over the United States, with tons of different factories whose components were rigorously validated and stress-tested against their specs and then assembled. For agentic AI to actually be successful, and again we're nowhere near there, though it doesn't have to stay as far off as a flying car, "validity is what you need" really has to be the underpinning. There are two big issues: what am I doing and why, which you can't avoid tackling and which is the definition of the principal-agent problem; and then the validity of the system, how we validate that the individual components are operating as intended, so that if I chain things together I know they're working. I really think this is the crux of what AI governance is, and of anything we're trying to accomplish here. We can't get away from it, and this is where we've tried to skip steps. It's couch-to-5K, not couch-to-marathon; you can't skip steps. And in 2025 the public narrative, and enterprises at times, tried to skip steps: we're going to replace humans, we're going to do all this great stuff because Anthropic says it can happen, while skipping why we're even doing it, what we're trying to accomplish, and how we validate that it's actually ready to be used.
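One way to see why validating individual components matters so much in multi-step chains is the arithmetic of compounding errors mentioned earlier in the episode. The figures below are illustrative, not measurements of any particular system.

```latex
% If each of n chained steps independently succeeds with probability p,
% the whole chain succeeds with probability p^n. Illustrative values:
P(\text{chain succeeds}) = p^{\,n}, \qquad
0.95^{10} \approx 0.60, \qquad 0.99^{10} \approx 0.90
```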
SPEAKER_03:I agree. A couple of decades ago, AI still seemed like a research field. There was some use of expert systems more broadly, but a lot of AI research was about developing a system to solve some discrete problem. Where things are now with generative AI is that they're trying to get these things to accomplish tasks. The logic is: if we're going to have these things automate labor, we need to figure out what the tasks are that make up that labor and automate those tasks. And that works for productivity-software sorts of things, like writing documents and coding, which are significant forms of labor, right? They're trying to figure out how to automate the tasks involved in that work. But there's a lot that leaves out. Things go beyond accomplishing tasks when something is trusted to perform a role over the long term. Think of hiring a contractor for something versus having an employee. There's a lot wrapped up in the ongoing memory and development of that person that an agentic AI system might need to evolve to be able to match. There's also the issue of thinking about long-term interests and goals as opposed to short-term tasks. In the classical AI agent definition, you can scope a problem down so that it's trivial. But what the Yonadav Shavit et al. OpenAI paper, one of the definitional ones for agentic AI today, is about is task, environment, and goal complexity. And it seems like a lot of that complexity comes from longer-term stakes and multi-agent interaction in the environment, the impact on social life. So if you think about a system that advises a person, say a clinical, health-related chatbot, what kind of information does it need to know about the patient? And how does it maintain that information over time in a trustworthy way? These are challenges that are not part of the basic assembly of an agentic AI system, but there are people working on them now, because they see it as the frontier. To some extent these are worthy problems, but it does seem like a shift in focus is necessary to really realize the value that's been promised.
SPEAKER_02:I think that's really poignant, and it gets me to this idea we've been talking about: the inherent mechanism design built into these agentic AI systems, wherein they understand the world around them and respond accordingly within that world's context. One thing we talk about in the paper is the idea that you can model an enterprise context as a multi-agent socio-technical system, where agents don't just operate on their own goals; they operate within an environment and within feedback from that environment. Can we talk a little more about how we instill these types of context into agents, and how we get agents to properly respond to the feedback they get in that environment?
SPEAKER_03:Yeah, I've seen this mentioned as an emerging issue. There are people at Google, at OpenAI, and elsewhere working on individual-facing assistants, but the idea of the enterprise assistant seems to open up a whole other can of worms. Because even in economic theory, we don't have a very good way of comparing what people want. It's not like the utility of one person adds up easily with the utility of another. There are ways of fudging it with dollar values, and maybe that is the right way to do it, but it requires building in assumptions. So aggregating preferences over many people is a known challenge. And understanding what's in, say, a company's best interest, when that can be very multifaceted and there can be multiple bottom lines, is challenging too. If you talk about the AI alignment problem in that setting, what does it mean for an agentic AI system to be aligned with the interests of the corporate body that is its principal? It's a fascinating problem, but I don't think anybody has come up with a clean answer in social theory, computer science theory, or economics. One of the closest things to it would be mechanism design, which says: suppose you're designing an auction. You've got a lot of people bidding. How do you design the auction so that people, for example, reveal their preferences honestly, or so that the auctioneer's revenue is maximized? These questions are well studied in operations research and economics. Applying that kind of thinking to the design of an in-house agentic AI in an enterprise setting is extremely ambitious, because it's a complex problem. There are inherent computational complexities to solving it: multiple optimization problems wrapped in other optimization problems. So there are real questions about how you turn that into a problem that decomposes nicely into something solvable. But it's a great research frontier.
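A small example of the auction framing Seb uses: in a sealed-bid second-price (Vickrey) auction, bidding one's true value is a dominant strategy, which is exactly the kind of incentive property mechanism design aims for. The bidder names and valuations below are made up for illustration.

```python
# Minimal sketch of a sealed-bid second-price (Vickrey) auction: the highest
# bidder wins but pays the second-highest bid, which makes truthful bidding a
# dominant strategy. Bidder names and valuations are hypothetical.
def second_price_auction(bids: dict[str, float]) -> tuple[str, float]:
    """Return (winner, price) for sealed bids mapping bidder -> bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price


bids = {"alice": 12.0, "bob": 9.5, "carol": 7.0}
winner, price = second_price_auction(bids)
print(f"{winner} wins and pays {price:.2f}")
# alice wins and pays 9.50: shading her bid below her true value could only
# lose her the item, never lower the price she pays, so honesty is optimal.
```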
SPEAKER_00:And I think this is what we talked about in our wrap-up as well. I really think, and other folks are saying this too, that scalability alone is not going to get us where we want; we've skipped steps. There is such a future here, and I'm excited about the ability of AI to make our lives better, and there are a lot of avenues for it, but we're realizing we can't skip steps. Research has to come next. The universal-function idea, that you can make one copilot that does everything for your enterprise at the level of your highly skilled workers, is starting to get a little long in the tooth as people realize the complexity here. But there is so much deep research available, in decomposing these problems, in auction design, in mechanism design. We can be building really expert, purpose-built systems, but it's going to be a research-led effort, not a "throw compute at the problem and hope for the best" effort. That's how the paradigm is shifting. And this is what we're also trying to set up in the paper: software as a service transformed the enterprise landscape because you could buy specialized tools for specialized tasks, and the V2 of that is agents. That's what we're setting up in "Validity Is What You Need": agentic AI, meaning multi-step AI, as a paradigm. The solver can be many things; the working definition of an AI method is just something doing what humans used to do, so the solver could be a bunch of if statements. It doesn't matter. The paradigm is multi-step AI built for specific business purposes. You've seen people talking about small language models, but they're still missing the point: a small fine-tuned language model might not be the best solution either. The point is that you can build very targeted, expert systems that might use mechanism design, might use an LLM, might use only linear programming; it doesn't matter. They're highly optimized to solve specific enterprise problems, and they can be offered as SaaS or something like it. But it takes a research-driven approach: take a step back, understand the socio-technical dynamics of the specific problem you're looking at, and then build those systems. I think it's still probably going to take six months to a year until we fully transition that way. But I'm already seeing folks talk about the need for research-led work, getting back to the labs, that kind of conversation, and I think Seb is at the forefront of that, seeing a lot of it develop: people realizing that the goal we're trying to accomplish is achievable, but we've been going about it the wrong way. We're going to have to do it in the right sequential order, with research embedded in it. And the universal approximation function might not be viable, at least in the short term, versus purpose-built, optimized systems.
eBay is a case in point of mechanism design and auction theory. It's beautiful, it's a fantastically built company, and they did exactly this. We could be doing the same in actual AI systems, not just a marketplace, but you have to do it the right way.
SPEAKER_03:One thing I've heard is frustration with the recklessness of OpenAI in pushing the boundary of LLMs with their focus. But also, as the research community starts to backfill and respond, there's a role for a lot of other kinds of expertise that have been marginalized by the generative AI thrust. As those gen AI tools become more commodified, they're going to be a component of these agentic AI systems in the future, because people really like natural language interfaces, and they're going to be easier and easier to use. So the value added on top of them is going to depend on things from other fields. There are people now engaged in conversation with formal methods in security and software validation. There are people working on multi-agent systems, and people working on other kinds of computer science and economics crossover theory, who are seeing how they can engage with these systems and add a lot of value by providing that systems engineering scaffolding. So I think you're right; that's where things are going to have to go, and where it seems they are going, because the steam is running out of the current approach. But with all this additional compute capacity, it's amazing the ambition that's been unlocked by this agentic era. Ten years ago, if you had said, "we want to automate this whole aspect of your business, and trust us, we're going to accomplish it," it would have been hard to argue that it was feasible or worth the effort. Now it's actually becoming cheaper to do many things, partly because of gen AI advances in coding tooling. And the cheaper it is to build, the more valuable it is to know exactly what it is you're trying to build. That's really the challenge: you can build anything, so what do you build?
SPEAKER_02:And I guess that lines me up to ask: if we see this as a temporary dead end, or we're temporarily blocked while we figure out how to make these universal approximation functions work the way we want, and how to understand and quantify our needs and our goals, what should we be looking forward to as the future of agentic AI? What do you see as the next evolution in the coming years, and where is the research pointed to address some of these shortcomings?
SPEAKER_03:My gut says it's in requirements engineering, something as basic as that. People talk about the specification problem for AI alignment; they think about it in terms of how you specify a utility function that really reflects the interests of the principals, the humans involved. That's essentially requirements engineering. But there might be better, more efficient ways of going about it in the future. How do you efficiently collect requirements and figure out what people actually need, not just in the short term but in the long term? How do you effectively balance the interests of multiple stakeholders within an enterprise setting? These, I think, are the important problems. And once they are solved, if they can be solved, then everything else becomes an engineering problem in service to those goals. It's a bit waterfall-like, that way of thinking; there's probably a more agile, dynamic way of accomplishing a similar thing, but that, I think, is the crux.
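One lightweight way to read "requirements engineering as the crux" is to make each requirement machine-checkable, so the system's validity can be demonstrated to its principals on every change. The requirement texts, thresholds, and metrics below are hypothetical, just one possible shape for such a spec, not something prescribed by the paper.

```python
# A minimal sketch of machine-checkable requirements: each requirement pairs
# human-readable intent (for the principal) with an executable acceptance
# check (for the engineers). Thresholds and the evaluation harness feeding
# the metrics are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Requirement:
    req_id: str
    stakeholder: str                     # whose interest this encodes
    statement: str                       # what the principal actually needs
    acceptance: Callable[[dict], bool]   # check against measured metrics


REQUIREMENTS = [
    Requirement(
        req_id="REQ-001",
        stakeholder="finance",
        statement="Itinerary totals must never exceed the approved budget.",
        acceptance=lambda m: m["budget_violations"] == 0,
    ),
    Requirement(
        req_id="REQ-002",
        stakeholder="end user",
        statement="At least 90% of bookings meet the traveler's stated constraints.",
        acceptance=lambda m: m["constraint_satisfaction_rate"] >= 0.90,
    ),
]


def validate(metrics: dict) -> list[str]:
    """Return the ids of requirements the current system version fails."""
    return [r.req_id for r in REQUIREMENTS if not r.acceptance(metrics)]


# Example run against metrics from a (hypothetical) evaluation harness:
print(validate({"budget_violations": 0, "constraint_satisfaction_rate": 0.87}))
# -> ['REQ-002']
```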
SPEAKER_01:Yeah. Something you said earlier was almost alluding to this, but there's a psychology where explorers and experimenters often don't like being backed into a hard goal or outcome, because there's a feeling that they'll be blocked from doing the exploration that might find them something else, or the next big innovation. So when you say the thing that's going to make this better is requirements management, where are we going with this, what are the outcomes, I think you were getting at a big part of that. Does that psychology resonate with where some of the reluctance comes from?
SPEAKER_03:I sympathize with the explorer-experimenter mindset. But I think there's an element of experimentation we've seen, which is: what happens if we blow this up really big? It's a kind of Manhattan Project approach to experimentation, which is one way to go about science. But maybe because I've got a foot in social science, I think there are plenty of interesting mysteries to be solved at the intersection of humans interacting with each other and with machines, and there are people who have been working in the field of human-computer interaction for some time. There are still problems that are not solved. And when you talk about requirements engineering and how to fold that into a complex systems design, with multiple differently interested agents operating sociotechnically around each other, there are many, many open problems. So the scientists, I think, will not get bored if we can drive the field to take these problems seriously. And from an industrial standpoint, industry wants to deliver something of value, and they have to innovate on how to deliver and create that value. So figuring out how to turn these really challenging conceptual problems into ones that are well formed enough to solve and deliver, I think that's also a contribution for the engineers, who are an important piece of the puzzle.
SPEAKER_02:And just to take a moment to synthesize our thoughts today, I think a lot of what we discussed gets at the heart of this problem: we're being sold technologies that solve a problem in the abstract, but we're not being sold mechanisms that understand our needs and actually resolve them. That causes an almost inherent disconnect, where we're being sold the foundation model rather than the application that solves the problem we actually need to resolve.
SPEAKER_00:Yeah, we're skipping steps; I think that's the takeaway of what we've said here. There are ways to accomplish what we're trying to accomplish, but first you have to define what that is. Right now it's FOMO: let's all use "AI," meaning LLMs, because everybody else is using them and they'll somehow magically do what we want them to do. But we still haven't decided what we even want them to do. We have to decide what we want them to do, get really good at that, and then design the best way to solve that problem. The paradigm we put in the paper, the idea that multi-step AI-based systems are a thing, is I think the biggest takeaway of 2025, and something we can leverage really well to make very solid systems, if we relax what that means to properly engineered systems with the best solver for the job. As we've talked about, you can interact in natural language, that's the neurosymbolic angle, and there are different ways to do it: you can be talking in natural language without having to be solving in natural language. We've been convinced that we need to do everything with a natural language machine, but we can separate those. So I think the biggest takeaways are these: we know that multi-step can be a thing; we know we have to own the principal-agent problem, which is never going away, there's no AI system that can take it away, it's part of being human, so embrace it and get good at it; and once we've decided what we want to get good at, we focus on it. It will be harder yards, potentially, but we can make really performant multi-step systems, and then marry those together as the V2 of what SaaS can be for an enterprise: multi-step AI systems that are designed and validated. And validated is the key part: you design it, but how do you make sure it's fit for your purpose? That's where the validity comes in.
SPEAKER_01:Perfect. Well, Seb, it has been a pleasure to have you back on the show to discuss your research. We really appreciate your insights and everything you've shared.
SPEAKER_03:Thank you. My pleasure as well.
SPEAKER_01:And for our listeners, if you enjoyed today's episode, you will also want to check out our earlier episode with Seb, where we discuss differential privacy and contextual integrity. As always, if you have any questions about the content on our show, please reach out to us at theaifundamentalists at monitaur.ai. Until next time.
Podcasts we love
Check out these other fine podcasts recommended by us, not an algorithm.
The Shifting Privacy Left Podcast
Debra J. Farber (Shifting Privacy Left)
The Audit Podcast
Trent Russell