The AI Fundamentalists

Synthetic Data in AI

August 07, 2023 Dr. Andrew Clark & Sid Mangalik Season 1 Episode 5

Episode 5. This episode about synthetic data is very real. The fundamentalists uncover the pros and cons of synthetic data, as well as reliable use cases and the best techniques for safe and effective use in AI. When even SAG-AFTRA and OpenAI make synthetic data a household word, you know this is an episode you can't miss.

Show notes

  • What is synthetic data? 0:03
    • Definition is not a succinct one-liner, which is one of the key issues with assessing synthetic data generation.
    • Using general information scraped from the web for ML is backfiring.
  • Synthetic data generation and data recycling. 3:48
    • OpenAI is running against the problem that they don't have enough data at the scale at which they're trying to operate.
    • The poisoning effect that happens when models are trained on their own recycled output.
    • Synthetic data generation is not a panacea. It is not an exact science. It's more of an art than a science.
  • The pros and cons of using synthetic data. 6:46
    • The pros and cons of using synthetic data to train AI models, and how it differs from traditional medical data.
    • The importance of diversity in the training of AI models.
    • Synthetic data is a nuanced field: beyond the complexity of building data that is representative of a solution, you need the expertise to tweak it so it is fair and unbiased.
  • Differences between randomized and synthetic data. 9:52
    • Differential privacy is a lot more difficult to execute than a lot of people are talking about.
    • Anonymization is a huge piece of the fairness and bias picture, especially with larger deployments.
    • The hardest part is capturing complex interrelationships. (e.g., Fukushima reactor stress testing didn't go high enough)
  • The pros and cons of ChatGPT. 13:54
    • Invalid use cases for synthetic data in more depth.
    • Examples where humans cannot anonymize effectively.
    • Anonymizing existing data versus creating genuinely new data, before diving into the use cases.
  • Meaningful use cases for synthetic data. 16:38
    • Meaningful use cases for synthetic data, using the power of synthetic data correctly to generate outcomes that are important to you.
    • Pros and cons of using synthetic data in controlled environments.
  • The fallacy of "fairness through unawareness". 18:39
    • Synthetic data is helpful for stress testing systems, edge case scenario thought experiments, simulation, stress testing system design, and scenario-based methodologies.
    • The recent push to use synthetic data.
  • Data augmentation and digital twin work. 21:26
    • Synthetic data as the only data is where the difficulties arise.
    • Data augmentation is a better use case for synthetic data.
    • Examples of digital twin methodology, from factory floors to the Apollo program.

What did you think? Let us know.


Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:

  • LinkedIn - Episode summaries, shares of cited articles, and more.
  • YouTube - Was it something that we said? Good. Share your favorite quotes.
  • Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
Transcript
Susan Peich:

The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik. Hello, everyone. Welcome to this episode of the AI Fundamentalists. I'm here with Sid and Andrew, and today's topic is synthetic data. What is synthetic data? That's why we're tackling this topic. And Andrew, you and I were just talking before we started the episode about the different definitions, and the legs this is growing away from what it was truly meant to be.

Andrew Clark:

Yeah, synthetic data is a topic that a lot of people talk about, and I think there's a lot of misunderstanding about what synthetic data actually is. There isn't, even in academia, a really succinct definition of what it is. For some people it's multiple simulations that create a data set. It's really somehow trying to capture, with fake data or data generated from distributions, something that will mimic the specific phenomena you're trying to copy. Sid, do you have maybe a clearer definition that we can work from?

Sid Mangalik:

Yeah, I mean, if I was going to define synthetic data, you can think of synthetic data as basically the utmost of extrapolation or interpolation of the underlying data: finding data that fits the underlying data that you're working with, and expands the bounds of it, or recreates the bounds of the original data. So data that truly fits the mold and the shape of your original data, so that it is indistinguishable from true data.

Andrew Clark:

Well put, but even a very well put definition is still not a succinct one-liner, which is one of the key issues with synthetic data generation. A lot of the confusion is, "let me just randomly generate data," or "just use a random number selector," or "let me just change fields or names of individuals, and now it's synthetic data." That's not necessarily what the actual concept is, and we're seeing that a lot.

Susan Peich:

Yeah, and let's take the two definitions as we've put them and tried to state them simply, and put that in terms of AI news, because there's talk of using open source information scraped from the web for LLMs, and it's backfiring on companies. Can we start there, with our definitions of synthetic data?

Andrew Clark:

Yes, that's a great spot. The open source data that is pulled from the internet, at least in the original versions of GPT, was not synthetic. It was actual people's tweets, actual GitHub information, things like that. The problem is, you can actually argue that LLM output is synthetic data. It's not actual data, it's not actually facts; it's just synthetically generated, as we've talked about, by the probability of the next word occurring. So now that LLMs have been around for a little bit, they're actually starting to ingest LLM output from other models, and you're having this effect where the models are decaying over time, because the quality of the inputs is really different. It's what we talked about in the data podcast: garbage in, garbage out, we're bringing stats back. There's a big fallacy that to do generative AI, or AI, or anything big data, you need lots of data. What we're really realizing is, okay, we need bigger data than the small statistical studies we used to do, but the quality of the data is a lot more important than the quantity. And with massive quantity, you often don't know that the quality has really decreased.

Sid Mangalik:

Yeah, and this plays right in with the OpenAI problem where, you know, even an extremely well funded organization like OpenAI is running against the problem that they don't have enough data at the scale at which they're trying to operate. It has never been seen before; they're trying to collect text that hasn't ever been seen. And now, with the backlash against scraped open source data, they think, well, we'll just use synthetic data. Sam Altman has explicitly been quoted saying he's pretty confident that all the data they're going to be using is this type of LLM-generated synthetic data: LLMs talking with LLMs, LLMs talking with humans, recording those transcripts. These become proprietary, owned datasets, and then they try to train the models on those datasets. But there is this poisoning effect that happens when you take your own data over and over and over again. This has been seen by researchers at Oxford and Cambridge, and it's even been said by the Gretel executive Ali Golshan, who works with synthetic data as a business model. Even he has acknowledged that when you do this recycling of data over and over again, you get these degradation effects. The degradation effects are because every time you sample, a smaller and smaller range of conversation and language is exhibited to the model. So these models become very narrow, and they only see what is the most popular language out there.
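
The recycling effect Sid describes can be illustrated with a toy simulation (a sketch only, not the actual mechanism inside LLMs): each generation fits a normal distribution to a small sample drawn from the previous generation's fit, and the spread of the data tends to collapse over time.

```python
import random
import statistics

def recursive_resample(data, generations, n, seed=0):
    """Fit a normal distribution to the data, then repeatedly draw a
    small sample from the current fit and refit to that sample,
    mimicking a model trained on its own limited output."""
    rng = random.Random(seed)
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    history = [sigma]
    for _ in range(generations):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mu, sigma = statistics.mean(sample), statistics.stdev(sample)
        history.append(sigma)
    return history

rng = random.Random(1)
original = [rng.gauss(0.0, 1.0) for _ in range(1000)]
spread = recursive_resample(original, generations=500, n=25)
print(f"spread at generation 0:   {spread[0]:.3f}")
print(f"spread at generation 500: {spread[-1]:.3f}")
```

Because each refit only ever sees a finite sample, the estimated spread drifts downward on average, which is the narrowing effect the hosts are pointing at.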

Andrew Clark:

I fully agree with all of that. It's really not a panacea. Synthetic data generation, and we'll get into some of the techniques, is not a solved problem. People talk about it like it's just a solution: we all know how to do it, it's fantastic, it works just as well, or close enough, for training. But synthetic data generation is very hard to do methodologically properly. There are several methodologies we'll get into, but it is not an exact science; it's more of an art than a science. I also want to switch gears and make sure we're not only talking about language models. Synthetic data generation has been around for a long, long time, and it's very useful for tabular and time series data. Sid, I believe you were talking earlier about a study on EKG signals, or ECG signals. With medical records, it's been a major issue in the medical industry for a long time to try to train diagnostic tools for radiologists, or even just those time series of pulse and heart measurements, and they've been running into issues with whether synthetic data is accurate enough.

Sid Mangalik:

We're talking about a basis in simulation methodology, right? This is an old school mathematical field which is interested in making good synthetic data: not just making more data, but making data that meets the paradigm of the original data. And with that come a lot of pros and cons. One example was an ECG study that was trying to make really high quality fake ECGs to mitigate the privacy concerns that come with traditional medical data.

Susan Peich:

There was one article a few weeks ago from the Financial Times touting this as a reason why computer-made data is being used to train AI. Talk about that, because from what you just said, if there's no curator, then what about things like diversity? What would happen in the training?

Andrew Clark:

That is a fantastic point, and one that's very often missed, because it's what we've talked about in all of our "bringing stats back," doing things the hard way: the whole ethos of the fundamentalists is that you need to understand the problem to actually know what you're doing. Same with synthetic data. Susan just nailed it on the head: you need to set up this problem to be representative of the problem you're trying to solve. Synthetic data in one place may not be biased at all; in another scenario, it could be very biased. It depends: what are you modeling? What are the goals of the system? Are you looking at the demographics of Texas or Ethiopia? Even if you're modeling Texas demographics, for instance, and you want to accurately show success levels of different socioeconomic backgrounds, you may need to do some up-sampling or down-sampling to be able to train a fair model and capture those nuances. We've talked about data augmentation, and down-sampling and up-sampling. It is a very, very nuanced field. Even setting aside the complexity of building data that is representative of a solution, you then have to have the expertise to tweak it so it accomplishes your objectives and is fair and not biased.
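
The up-sampling Andrew mentions can be sketched in a few lines (a toy rebalancing with a made-up two-group dataset; real fairness-aware resampling is far more involved):

```python
import random

def upsample(rows, key, target_share, rng):
    """Duplicate-resample rows of a minority group (selected by `key`)
    until it reaches roughly `target_share` of the dataset."""
    minority = [r for r in rows if key(r)]
    out = list(rows)
    while len([r for r in out if key(r)]) / len(out) < target_share:
        out.append(rng.choice(minority))
    return out

rng = random.Random(0)
# A hypothetical 90/10 imbalance between two groups.
rows = [{"group": "A"}] * 90 + [{"group": "B"}] * 10
balanced = upsample(rows, key=lambda r: r["group"] == "B",
                    target_share=0.3, rng=rng)
share_b = sum(r["group"] == "B" for r in balanced) / len(balanced)
print(f"share of group B after up-sampling: {share_b:.2f}")
```

Note that duplicating rows only rebalances group counts; it does not add any new information, which is exactly the nuance the hosts warn about.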

Susan Peich:

For the listeners, that was a lot. Let's break this down. Sid, the pros of synthetic data: lay them out.

Sid Mangalik:

Yeah. So, why do we use synthetic data, what is our real gain here? You can get more samples in a limited data environment. If you have a very limited trial that you're running, and you can only collect a week of data but you really need a month of data to make big statements, this can help you get there and fill in those gaps. In many cases it's better than nothing: in scenarios where you collected the data you could and you just need a little bit more, synthetic data can help pad that out when designed correctly. You can also design systems for scenarios that aren't possible in the real world. You can practice what a shock in the economic market looks like; you can practice what a shock looks like to an airplane flight system. And there's potential for synthetic data to be used to get around privacy concerns, but it's not as easy as you may have heard from other groups.

Susan Peich:

Okay, other side. Andrew, lay out the cons for us.

Andrew Clark:

It's a lot more difficult to actually execute than a lot of people are talking about. One of the techniques embedded in there is differential privacy, which the US Census uses, for example; there's been serious research on it for a long time. It's about adequately anonymizing data so you get the representations of a group without being able to identify individuals. Think about looking at census data at the zip code or census block level and being able to see the demographics and socioeconomic backgrounds of an area without being able to identify any individual set of people. There's a common misconception that just because you're using randomized data or synthetic data, you're actually anonymizing. So there's a hidden cost there as well: a lot of people aren't applying differential privacy techniques, or any methodology for that. And differential privacy can't be used everywhere; it's very specific to a type of problem. So there's a lot more complexity than a lot of people realize in the application for fairness, bias, and anonymization. Anonymization is a huge piece we're running into, especially with the larger deployments: people aren't anonymizing data properly. Synthetic data generation can also produce really nonsensical or impossible data that makes no logical sense if how it's set up is not well controlled. For instance, as Sid mentioned, when we're doing stress testing on airplanes or rocket ships, we sometimes want to simulate a magnitude 28 situation on the Richter scale, which, I don't know the scale exactly, I think seven or eight is about as high as it goes, so maybe a 10. We want to hit a 10 sometimes just to see what will happen, to find that safe operating boundary, and synthetic data helps us do that.
But take that instance of a 28, which I misspoke about, but it's actually good to roll with here: if we're building examples to train our LLM or something off of completely nonsensical mumbo jumbo, just because it's possible, we're actually harming the system. Training on a 28 when the Richter scale maxes out at ten or nine, I'm sorry, I don't know the scale offhand, I should have researched that prior, but you see the point.
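
The core trick behind the differential privacy Andrew describes, releasing aggregate statistics with calibrated noise so no individual can be identified, can be sketched with the Laplace mechanism (a toy illustration; the dataset and query here are made up):

```python
import math
import random

def dp_count(values, predicate, epsilon, rng):
    """Differentially private count: the true count plus Laplace noise
    with scale 1/epsilon (a count query has sensitivity 1)."""
    true_count = sum(1 for v in values if predicate(v))
    # Inverse-CDF sampling of the Laplace distribution.
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

rng = random.Random(42)
incomes = [30_000, 45_000, 52_000, 120_000, 75_000, 61_000]
released = dp_count(incomes, lambda x: x > 50_000, epsilon=0.5, rng=rng)
print(f"released count: {released:.2f}")  # the true count is 4; the release is perturbed
```

Smaller epsilon means stronger privacy but noisier answers, which is exactly the accuracy-versus-privacy trade-off the hosts keep returning to.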

Susan Peich:

Yeah. And we can only hope that we're testing things at a Richter scale of 10, max; otherwise, we're all not here.

Andrew Clark:

Exactly. Well, that was the problem with the Fukushima reactor: they didn't test it high enough. It was tested up to a failure point, and the event went slightly above that. But one of the hardest parts, and Sid, so we've talked about the other cons, one of the other cons is that it's difficult to capture the complex interrelationships between inputs. This is a part that is often missed, or not understood as well. Specifically when we're talking about medical data, or any of these complex sets of data that need to really be representative to train off of, you can't just augment your data willy-nilly, and you can't randomly generate samples; there are ways to do it with distributions. But the toughest part, which no one has really fully cracked yet, is creating those proper interrelationships between inputs, capturing the full complexity of the real world while keeping individual data anonymized enough with differential privacy. Capturing that complex relationship, and we can get into some of the methodologies, is extremely difficult to do. If you're just looking at more univariate data and not capturing those interrelationships, your data is not near as good, and this is where we can get some of those decays, which is what those statements we talked about earlier were alluding to. People aren't saying what these cons are; they're alluding to them, but it's often missed.

Sid Mangalik:

Yeah, and I want to paint a picture of what that looks like, that last example of these complex interrelations in data. You can imagine a dataset which has the ages of individuals and their incomes. A simple system that creates synthetic data could create entirely nonsense data points, like a five-year-old with a $200,000 income. Because it's not modeling the interactions between features, you get these outcomes which to the computer look totally fine, but an observer who understands how these data points are related, in a human way, or in a complex way, is going to be able to flag that this is useless or impossible data.
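
Sid's age/income example comes from sampling each column independently instead of sampling rows jointly. A minimal sketch, with a made-up five-row dataset:

```python
import random

rng = random.Random(0)

# A tiny made-up dataset of (age, income) rows with a real dependency:
# children do not earn adult salaries.
real = [(5, 0), (16, 8_000), (30, 60_000), (45, 90_000), (67, 40_000)]
ages = [r[0] for r in real]
incomes = [r[1] for r in real]

# Naive generator: sample each column independently, destroying the
# relationship between age and income.
naive = [(rng.choice(ages), rng.choice(incomes)) for _ in range(10)]

# A row-wise resample preserves the joint relationship by construction.
joint = [real[rng.randrange(len(real))] for _ in range(10)]

# The naive rows can include impossible combinations, e.g. a child
# with a high income; the joint rows are always plausible rows.
impossible = [(a, inc) for a, inc in naive if a < 16 and inc > 50_000]
print("independently sampled rows:", naive)
print("impossible combinations found:", impossible)
print("joint rows all come from real rows:", set(joint) <= set(real))
```

Row-wise resampling keeps the relationships but copies real records verbatim, which is why it does nothing for privacy; serious generators have to model the joint distribution instead.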

Susan Peich:

Interesting. And I know we're going to get into use cases, and invalid use cases for synthetic data, in more depth. But when you guys were going through the pros and cons, it reminded me of a scenario that gets used a lot in marketing with native, public ChatGPT. One of the powers marketers have found is taking a ton of documents, maybe in batches of 10 at a time, and prompting it to find the commonalities, merge the documents, and so on. The subject does come up in communities: when you're doing that, is there any customer-confidential information in those documents you're processing through public ChatGPT? And of course the answer varies. Some people know what they're working with and are very careful; some are still learning and don't really understand the powers of ChatGPT. They get advice to replace company names, revenue values, and maybe other information that's regarded as customer confidential. When you were talking about the pros and cons of synthetic data: what do people think they're actually doing? Is that synthetic? Is that anonymized? Is it helping at all? Walk us through that.

Andrew Clark:

I believe they think they're anonymizing the data, and they very well may be. But it's been proven in studies that humans are not random; it's very easy to tell when someone is acting as their own random number generator. This is where the differential privacy technique that has been around for a while comes in, and there's one researcher, I'm forgetting her name, who has really pioneered the field. There are statistical ways to make sure you perturb the data enough to not identify anyone, but I don't trust myself or any other human to just randomly make things up. It's very much a pattern: if you look at the passwords you've created over the past year, even if you think you're being random, you're not. So people making things up and manually anonymizing data, I would call that anonymizing data, versus synthetic data, which is more like creating new data.

Susan Peich:

Before we dive into the use cases: I think we're really hitting on something, and you're bringing to light a great scenario of what you're hearing about synthetic data in the news, especially as people wrap their heads around their own data governance policies. This is such valuable information to take away. Let's move on to use cases. Sid, you were looking at some things earlier in terms of differential privacy.

Sid Mangalik:

Yeah, so let's talk about specific, and I'll say meaningful, use cases for synthetic data: use cases where you're using the power of synthetic data correctly to generate outcomes that you really care about. With differential privacy, the goal would be, looking at, say, our ECG data again, to generate data that is going to pass muster for medical licensing. If you have private customer data, you might think an ECG is just a bunch of numbers in a graph, but it's real human data, so you can't just publish that and share it with people freely. You can generate similar data which looks just like a real ECG but isn't tied to a real person, and now you can finally conduct those cardiogram studies that you haven't been able to do for years. So this is a meaningful use case of synthetic data, when applied correctly.

Susan Peich:

Can I ask a question about that? You've outlined a good use case for synthetic data. There's an inference that this data can come from a live person, but as long as there's no, some will say PII, or personal data attached to that EKG reading, if it's anonymized and all it is is EKG readings, say an average or a sum, is that still synthetic? Is it still just as usable? What are the pros and cons of that?

Sid Mangalik:

Yeah, so this ties into what we were talking about earlier, right? There's this feeling that if you just drop the person's name, if you just drop their user ID number, oh, isn't this basically as good as anonymized data? It's not, because it's still a real human measurement. In these controlled environments, that's not sufficient; there really is an expectation that you have truly brand-new data, because it's not impossible to reverse engineer who these measurements belong to.

Andrew Clark:

We call that fairness through unawareness. And it's kind of a fallacy.

Susan Peich:

I don't even know if I like that term.

Andrew Clark:

Air quotes. A lot of people think it's fair because we've cut the columns, but as Sid is saying, those relationships are still there. So we can debate the term; that's what we've called it in the past. Maybe it's time to move on to another terminology.

Susan Peich:

I have no dog in this fight. It's just that it struck me the first time I heard it. Oh, yeah.

Sid Mangalik:

Yeah, the spirit of "fairness through unawareness." It sounds really bad, because it sounds like you're doing it explicitly, but it's really just a casual mistake: you do it the way you always have and don't reflect back and say, oh, I guess the entire fairness or privacy of this model is based on me pretending cutting the columns washed away any responsibility. Any other use cases we should mention?

Andrew Clark:

Yes. By the way, I've looked it up: the Richter scale does max out at 10, so apologies for the earlier guess. And as Sid mentioned, stress testing systems is where synthetic data is very helpful: edge case scenarios, thought experiments, simulation, stress testing system design, all of these scenario-based methodologies. It's fantastic, because you're saying: I built my model off of real data, or anonymized data, differential privacy, or whatever we're looking at. Let's say, for the sake of example, it's an airplane and we're doing simulations. We built the model off of actual data we're comfortable with, or very well thought out experiments that generated representative data based on physics and what the flight patterns would be. Now we're going to expand. We know what the underlying distributions are, so we perturb those distributions and start stretching the bounds to see where the system breaks. This is engineering: aerospace, aeronautics, shipbuilding, civil engineering have been using this stuff for years, and you figure out where those breaking patterns are. There are huge analogies to this in machine learning as well: you build off of either models of physical systems where you understand the properties, or off of real data, and then you use synthetic data to stress test and determine how the system is operating. The key difference is we're not using fake data that we basically randomly generated to train our models; we're using it to validate our models.
And we're building those scenarios. So for me, those are two very different things. And this is what's concerning about the recent push of "just use synthetic data," because we don't want to worry about, you know, all the things about privacy, or now some of the writers and actors are coming back saying, "I don't want you to train off my work, because I want rights to it," and all that kind of headache. So OpenAI says, okay, fine, we're just going to use synthetic data. But that's a very big difference: training off data that can't capture the complexity of the real thing is very different from using it to stress test. I think that's a nuance that a lot of people aren't capturing here.

Sid Mangalik:

Yeah, that's spot on. And that leads to the other use case, which is the data augmentation piece. If you're augmenting physical systems data, you have these nice, understandable, smooth curves, and what you're really doing is interpolating: you're saying, I can follow this line and fill in the gaps. That gives you much richer data, which is much better for these AI/ML systems, versus randomly guessing, missing some inflection point, and now your model doesn't correctly model reality. You have a system which follows the trends of the data you already have and just lets you fill in the gaps a little bit more, which models really benefit from. That's very key.
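
The gap-filling Sid describes can be sketched with plain linear interpolation (toy data; real augmentation of a physical system would lean on its actual physics, not straight lines):

```python
def interpolate(xs, ys, x):
    """Linear interpolation at x, given sorted sample points (xs, ys)."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside sampled range")

# Sparse measurements of a smooth curve (say, a sensor reading over time).
xs = [0.0, 2.0, 4.0, 6.0]
ys = [0.0, 4.0, 16.0, 36.0]

# Augment: synthesize readings between the real ones at half-unit steps.
augmented = [(x / 2, interpolate(xs, ys, x / 2)) for x in range(13)]
print(augmented[:4])
```

The synthesized points stay on the line between real measurements, so they can never invent a trend the data doesn't support; they can, as Sid notes, still miss an inflection point that fell between samples.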

Andrew Clark:

Yeah, augmentation is a good use case for synthetic data generation; synthetic-only data is where we run into difficulties.

Susan Peich:

Yeah, a lot of what you're describing, and we may cover this in a later episode, also sounded like discussions I sat in on back in my IoT days, when they would talk about digital twins.

Andrew Clark:

Very related to this, and it could probably be a podcast itself; we haven't even gotten to the techniques yet, so we might need to do another podcast on techniques. There are a lot of corollaries. I've done a lot of digital twin work in my PhD and in previous jobs, where you're essentially creating a digital replica of something such as a factory floor or an economy, where you have some of those physical systems or real input data. It's essentially a methodology where you create that virtual twin, same as with the aerospace simulators we've been talking about. Digital twin is another new marketing term for something that's been around for a long time: in the Apollo program, they simulated all this stuff on really old computers before they launched people into space; they basically built a digital twin, and that's how they stress tested where those parameters are. It's had a resurgence in IoT, as you mentioned, and in industrial systems: let's build that digital representation of a physical system, stress test it, and do that what-if scenario analysis. It lets us see how a system would respond before we actually build it, and it's a very useful tool. In that case, you are using synthetic data to generate those scenarios. Very interesting.

Susan Peich:

I do think we should do an episode on that, because I want to cover techniques, but making that tie between those worlds, especially in industrial solutions, and marrying synthetic data with digital twin work, is going to be very valuable. So let's switch gears and talk about techniques.

Andrew Clark:

Yes, this will be the TL;DR of techniques. Monte Carlo is one of them, and it's very dear to both of our hearts; we could probably do a whole series on Monte Carlo alone. To keep this episode brief, we'll give a little snapshot of these techniques, and then we can go into some of them in more depth in subsequent podcasts.

Sid Mangalik:

Yeah, here's your Wikipedia-first-paragraph style understanding: poke at a thing a bunch of times and try to learn its shape. If I poke around enough inside a circle inside a square, I'll learn the shape of that circle, and maybe I can even learn about pi without having to do any complex geometry; let's just take a bunch of random samples. That's how it started. It's since been expanded with techniques like Latin hypercube sampling, which are basically more intelligent ways of searching the space: instead of just randomly picking, make a grid and then pick around that mesh to sample efficiently. That gets expanded into other, more mathematical methods like Gaussian copulas, which try to capture some of the underlying connections between data points. We also have the machine learning flavor of this, where we let machine learning algorithms learn these distributions via generative adversarial networks, or GANs, if you've heard of those before. And then from the economics world there's the idea of a random walk, which is proceeding through the data space by moving randomly. So these are the kinds of ways you can try to get synthetic data by intelligently sampling from the original dataset.

Andrew Clark:

Excellent, great overview. Those are definitely the main techniques, and there are limitations to a lot of them as well, a lot of it around capturing interrelationships. The Gaussian copula, for instance, got hammered after the financial crisis because it happened to be used in a lot of Wall Street simulations. It still has uses, and it probably got hammered a little too hard; it wasn't really the problem, but it got a bad reputation. Sid mentioned Gaussian: the Gaussian is the statistical distribution that looks like the normal bell curve, so a Gaussian copula captures interrelationships based on the normal distribution.
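Sid's "poke around inside a circle inside a square" picture of Monte Carlo can be written out directly. A minimal sketch (not from the episode) that estimates pi by random sampling:

```python
import random

random.seed(0)

# Sample random points in the unit square; the fraction landing inside the
# quarter circle of radius 1 approaches pi/4 as the sample count grows.
n = 100_000
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)
pi_estimate = 4 * inside / n
print(pi_estimate)  # close to 3.14159 for large n
```

No geometry beyond the distance check is needed; the estimate simply gets sharper as you take more random samples, which is the core Monte Carlo idea.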
So where you run into problems is that not all data is normally distributed. This is where statistics got a bad name for a while: a lot of statistical people, like economists, like to assume an equilibrium, but the world is not in equilibrium. Statistics that assume a normal distribution are easy to calculate, but they're not necessarily accurate. The Gaussian copula is great, and there are other variations of the copula, but Gaussian is really the bedrock, and it doesn't represent every relationship in data. If you train everything off these normal distributions, then when you hit the tail effects of something like a financial crisis, things will fall apart. It's good for generating data, but you're now generating data without the complexities of the real world. You have to hope your data is representative of a normal distribution: if so, Gaussian is great; if not, the Gaussian copula can run into issues.
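A minimal sketch of the Gaussian copula idea, assuming NumPy is available. The correlation value and the exponential marginals are illustrative choices, not anything from the episode:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Gaussian copula sketch: draw correlated standard normals, then push each
# margin through the normal CDF to get correlated uniforms. Any marginal
# distribution can then be applied via its inverse CDF, but the *dependence*
# stays Gaussian, so joint tail events (crisis scenarios) are understated.
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)

norm_cdf = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))
u = norm_cdf(z)                      # correlated uniforms in (0, 1)

# Example: exponential marginals with the Gaussian dependence structure.
samples = -np.log(1.0 - u)           # inverse CDF of Exponential(1)
print(np.corrcoef(samples.T)[0, 1])  # strong positive correlation survives
```

The non-normal marginals come through fine; what the construction cannot change is the bell-curve dependence itself, which is exactly the limitation discussed above.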

Susan Peich:

Interesting. And yeah, we definitely need to dig into some of those techniques, because just putting them into the context of the financial crisis, like you said, back with Enron, I almost get the sense that we need to re-familiarize ourselves with that history, especially as we're building models and exercising these techniques for much faster-paced machine learning and AI.

Andrew Clark:

Agreed. And this is one of the big issues I have with the field in general, and often with the machine learning field or the media in general: we forget history. There's a famous quote by George Santayana: those who cannot remember the past are condemned to repeat it. It's a very common thing in modeling; people have about a five-year timespan, so if nothing bad has happened in five years, we pretend it didn't happen and don't remember why we're not doing something, or what the interrelationships are. That can cause some issues. Relatedly, synthetic data is actually most useful on tabular datasets, I would argue, and Sid, I'd love to hear your opinion on this. Images are the second most useful. The least useful place for synthetic data is actually textual data, which is where everybody wants to use it now.

Sid Mangalik:

Yeah, that's absolutely right. These kinds of hardcore mathematical understandings of data don't happen in language. Because what is a randomly perturbed version of a sentence? How do you change a sentence? If you change one word of a sentence, it has a totally new meaning. So a really easy technique people have been trying is: you take a sentence, you have a paraphraser that attempts to restate the sentence, you run that a couple of times, and you get a couple of versions of the same sentence that should all mean the same thing. But there are still struggles, because we don't know which direction the sentence is being pulled in. Is it getting more positive? Is it getting more mean? Is it more factual? We just know that we're randomly moving the data around.
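The paraphrase-and-repeat loop Sid describes can be sketched as follows. The `paraphrase` function here is a toy synonym substitution standing in for a real paraphrasing model (in practice a seq2seq transformer); the synonym table and the sentence are hypothetical:

```python
import random

# Toy stand-in for a real paraphrasing model: random synonym substitution.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "pleased"],
}

def paraphrase(sentence: str, rng: random.Random) -> str:
    """Restate a sentence by swapping known words for random synonyms."""
    words = sentence.split()
    return " ".join(rng.choice(SYNONYMS.get(w, [w])) for w in words)

rng = random.Random(0)
original = "the quick fox is happy"

# Run the paraphraser several times to collect variants that (hopefully)
# preserve meaning -- with no guarantee about tone or factual drift.
variants = {paraphrase(original, rng) for _ in range(10)}
print(variants)
```

Even in this toy version, the core problem is visible: nothing in the loop measures whether a variant drifted in sentiment or factuality; we only know the text moved.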

Susan Peich:

In summary, there is a reason that synthetic data exists, and one of the key areas where it can be very powerful is tabular data, really looking to the past. When we're looking at models now, it comes down to the difference between the use cases and knowing where synthetic data is going to shine.

Andrew Clark:

Exactly. There are no magic bullets anywhere in life, and machine learning is no exception. If someone tells you something that sounds too good to be true, it probably is, and that's the whole ethos of the fundamentalists: do things the hard way and you can build really successful models. Despite what a cursory listen to this podcast might suggest, we are not anti-LLM; we are anti some of the practices being used today. It's a powerful technology that can definitely be used to increase productivity for the economy, but you need to know how to do it properly, and a lot of the paranoia out there, and even some of the techniques floating around, aren't really grounded in reality. That's what we're trying to say: there are no magic bullets. Let's build these systems responsibly, well governed, and with proper techniques, and you can make really great systems. But there is no panacea in modeling.

Susan Peich:

And before we wrap up this episode today: really great thoughts. Any final thoughts from you, Sid?

Sid Mangalik:

Yeah, I would love to see synthetic data being used in meaningful, correct, and valid use cases. We're going to keep seeing it in our NLP and LLM space, and there's a lot of frustration out there about it basically making models that are less diverse, less understanding, and less apt to generate fair outcomes.

Susan Peich:

We'll wrap this up. Sid, Andrew, it's been a pleasure discussing the topic of synthetic data. I think we found more topics and more podcasts through this episode than any episode we've done to date. Thank you all for listening. We'll be back next time with more.

Andrew Clark:

Thank you
