Synthetic Data for AI

Released Wednesday, 17th April 2024
Episode Transcript

0:00

Cloudcast Media presents from the Massive

0:02

Studios in Raleigh, North Carolina. This

0:04

is the Cloudcast with Aaron Delp

0:06

and Brian Gracely, bringing you the

0:08

best of cloud computing from around

0:10

the world. Good

0:13

morning, good evening, everyone. Welcome back to the

0:15

Cloudcast. We're coming to you live from

0:17

our Massive Cloudcast Studios here in Raleigh, North

0:19

Carolina. And it is Aaron for the intro

0:22

this week. We are full on into spring

0:24

conference season. Lots of shows

0:26

and events already happened and

0:28

more on the horizon. I'm in work

0:30

crunch mode preparing for upcoming events. And

0:32

it's that time of the year I

0:35

kind of hate, but I also kind of love all

0:37

at the same time. Our topic

0:40

for this week is the generation

0:42

of synthetic data, specifically around AI.

0:44

But this also really applies to

0:46

any application that needs testing data

0:48

at scale. And that conversation is

0:51

coming up right now. And

0:55

we're back. And Aaron, how you doing,

0:57

man? Spring's in full bloom. How have you

0:59

been so far? I'm good. I'm good. Yeah, I've kind

1:01

of said it once or

1:04

twice on the shows before, but

1:06

it definitely feels like spring. It's

1:08

warming up. The spring conference

1:10

season is here. Lots

1:12

going on. It's been a pretty exciting spring,

1:14

actually. Yeah, no, it's been good. It's been

1:16

good. Now, you and I obviously

1:18

have been talking a lot about the

1:22

topic du jour. And one of the things I was

1:24

thinking about recently is, you

1:26

know, as

1:28

a data scientist or anybody who's managing teams

1:30

of people that are trying to

1:32

manage data and their data

1:34

may not necessarily be in the same place

1:37

as their GPUs or the place where they're

1:39

going to do acceleration and training and so

1:41

forth. I got to

1:43

thinking about that, especially for enterprises. And you think about

1:45

just how restrictive companies are

1:47

with the most simple things. Like, what

1:51

happens if you take your laptop on the road with

1:53

you? There's a whole set of training that goes with,

1:55

you know, don't do this and don't lose the data.

1:57

Or, you know, what do I do with my cell phone?

1:59

Like, how can you do that? What can I put on there?

2:01

What can't I put on there? Then I think

2:03

about the massive amounts of data that are

2:05

involved with, whether it's fine

2:08

tuning or training or even RAG with

2:10

some of these AI environments, and

2:13

you start to wonder like, how are

2:15

they going to deal with this? How are they

2:17

going to deal with potentially moving their

2:19

data outside of the four walls or

2:21

in secure environments? We

2:24

got to think about this like this would be a perfect

2:27

opportunity if there was some way to create a

2:29

representation of your data that the

2:32

security teams would feel a little more confident with, and

2:34

that would be probably an interesting topic for us to

2:36

dig into. Our

2:39

topic today is synthetic data. Of

2:41

course, as we want to go and talk about that topic,

2:44

we go and find somebody and we have

2:46

a fantastic guest for today. We have Kalyan,

2:50

CEO and co-founder at DataCebo.

2:53

First of all, welcome to the show. Thank

2:55

you, guys. Thank you. Yes,

2:57

absolutely. Let's start

3:00

at the start because some may

3:02

have heard about synthetic data or they're already

3:04

using it but maybe they've not referred

3:06

to it in that way before, but

3:08

tell everyone a little bit about what

3:10

is the use case

3:12

and the need and the problem synthetic

3:14

data is solving today. Great.

3:17

Thank you. I love the introduction.

3:19

Brian just mentioned the laptop on the road, and

3:21

I'll come back to that in a minute. Because

3:24

we actually had that situation

3:26

when we founded this project.

3:29

Synthetic data is data generated

3:32

from a generative model that

3:34

is trained on the real data. You train a

3:37

generative model on the real data, a little bit

3:39

of real data, and then you get a model

3:41

and from that model, you can generate data

3:44

that looks like the real data, has

3:46

all the statistical properties and format,

3:49

like the real data, but cannot be linked

3:51

to the real data. It's actually a sample

3:53

that is generated from a model, which

3:55

is a mathematical representation of the data.
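
To make that workflow concrete, here is a minimal sketch using the open-source SDV library that the guest's team maintains. The CSV path, column handling, and choice of synthesizer are illustrative assumptions, and the exact API can differ between SDV versions.

    import pandas as pd
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    # Load a small sample of real data (path and schema are hypothetical).
    real_data = pd.read_csv("customers.csv")

    # Describe the table so the synthesizer knows each column's type.
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)

    # Train a generative model on the real data...
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real_data)

    # ...then sample as many synthetic rows as you need, far more than the
    # original table if you want load-test volumes.
    synthetic_data = synthesizer.sample(num_rows=1_000_000)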

4:00

The great thing that it provides is an

4:02

alternative when access to

4:04

the real data is restricted as we were talking

4:06

about it or needs to

4:08

be restricted or is not available. So

4:11

that's the immediate, apparent need that enterprises have

4:13

for this kind of data or for this

4:15

kind of modeling technique. What

4:18

is less known, however, is

4:21

how dramatically it changes that data

4:23

access and availability. So

4:25

just to give you an example, once you

4:27

build a generative model, you can

4:29

actually sample as much data as

4:31

you want. So you're not

4:34

limited by your original data size. So you

4:36

can sample 10 million

4:38

records, 1 billion records of

4:40

customers, even though you train

4:43

probably the model only on

4:45

just 10,000 customers' data. What

4:48

that enables people right away is to

4:50

create data for performance testing. I mean,

4:52

that's just the day zero ROI

4:55

that enterprises see with this kind of

4:57

technology where they can create

5:00

a lot of data in their lower

5:02

environments and not have to port the

5:05

data or a database, large database,

5:07

but just use this model that's

5:09

usually very, very small, maybe

5:11

a couple of gigabytes, even though your data could be

5:14

terabytes, and sample from it

5:16

like billions of records and test it for

5:18

performance, your software application for

5:20

performance. So that's just one

5:22

kind of example for that particular

5:24

sort of game-changing technology that this

5:26

is, that it's used for. The

5:30

other example is more involved

5:33

and interesting. It's even less known,

5:35

but it's actually a very powerful use

5:38

case, which is it's

5:41

known, generative models are known to

5:43

have the ability to generate data for a

5:45

particular scenario. So what you can do to

5:48

this model is specify a particular scenario.

5:50

So you can say, for example, an

5:52

insurance company, if you're an insurance company,

5:54

you can say, hey, what if I

5:56

had 10% more smokers take

6:01

the policy, what will happen to the

6:03

distribution of claims that I have? You

6:06

can generate such scenarios and generate

6:08

data and from that estimate the

6:10

distribution and make some analysis and

6:13

do decision making. That's what we

6:15

call classic, it used to be

6:17

called digital twin, it used to

6:19

be called a variety of names.

6:22

But if you get the model

6:24

right, you should

6:26

be able to create some scenarios

6:28

and analyze.
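
As a rough illustration of that scenario capability, SDV exposes conditional sampling. The sketch below reuses the synthesizer fit in the earlier example; the "smoker" and "claim_amount" columns and the row counts are invented for illustration, and behavior depends on the synthesizer and SDV version.

    from sdv.sampling import Condition

    # Hypothetical "what if" scenario: skew the synthetic population so a
    # larger share of policyholders are smokers, then study the claims.
    more_smokers = Condition(num_rows=11_000, column_values={"smoker": True})
    non_smokers = Condition(num_rows=89_000, column_values={"smoker": False})

    scenario_data = synthesizer.sample_from_conditions(
        conditions=[more_smokers, non_smokers]
    )

    # Downstream, estimate the claims distribution from the scenario data.
    print(scenario_data["claim_amount"].describe())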

6:31

Yeah. So, I hope you guys understand, obviously there

6:33

is, you have tremendous background

6:35

in this space having done

6:38

a ton of work and research over at MIT, a

6:40

number of different projects, we'll put some links to those

6:42

in the show notes. People

6:46

have been doing things around synthetic data for disaster

6:49

recovery, things like you said for modeling.

6:52

What's really evolved in this space that

6:55

obviously, now with AI being a big deal,

6:58

people are thinking about it more, but have

7:01

there been any big

7:03

breakthroughs that are allowing people to be

7:05

more flexible in saying, for

7:07

example, I start with this

7:10

real set of data of whatever size,

7:12

and I can now do 10, 12, 15 different variations

7:14

of it. What's

7:18

evolving in this space to

7:20

bring this back to the forefront of being important for

7:22

people? One

7:25

of the things that has evolved in this

7:27

space is, I mean, technology-wise, what has evolved

7:29

is generative modeling has evolved a lot. I

7:32

think it has become more and more capable

7:35

of building a

7:37

model like this for a multi-table complex

7:40

dataset. So, we focus on a lot

7:42

of relational and data warehouses

7:44

kind of data, not necessarily language.

7:47

Sure. So, we invented the

7:49

first relational model, generative model

7:52

for relational databases. That

7:55

technology has matured over the last few years

7:57

in my group at MIT, so that has enabled

8:00

us to make this possible.
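
For the multi-table, relational case described here, SDV also ships multi-table synthesizers. A hedged sketch follows; the table names, key columns, and CSV paths are placeholders, and the API may differ by version.

    import pandas as pd
    from sdv.metadata import MultiTableMetadata
    from sdv.multi_table import HMASynthesizer

    # Two related tables (schemas are hypothetical).
    tables = {
        "customers": pd.read_csv("customers.csv"),
        "orders": pd.read_csv("orders.csv"),
    }

    # Detect per-table schemas; relationships are inferred from matching key
    # names, or can be declared explicitly with metadata.add_relationship().
    metadata = MultiTableMetadata()
    metadata.detect_from_dataframes(tables)

    # Fit one generative model over the whole schema and sample a synthetic
    # database that preserves referential integrity between the tables.
    synthesizer = HMASynthesizer(metadata)
    synthesizer.fit(tables)
    synthetic_tables = synthesizer.sample(scale=1.0)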

8:03

So that's one thing. I think the bigger

8:05

item that from the use case point of view,

8:07

the bigger item is that I think a

8:10

lot of applications, software applications, we

8:12

find a prominent use case of

8:14

testing software applications. And a

8:16

lot of these applications over the last decade

8:18

have become data-driven. So what I mean by

8:20

that is there's a lot of conditional logic

8:23

inside the software that

8:25

is dependent on what data

8:27

it sees. Regardless of

8:30

what application, they make

8:32

conditions saying that if you see this in

8:34

the data, this is particular type

8:36

of demographic or this particular type of

8:39

session or any, what have

8:41

you, they make

8:44

decisions within the software application.

8:47

Now to test those applications, you need the data.

8:50

And that data in the past

8:52

has been either generated by copying,

8:54

masking, and anonymizing. That was

8:56

one of the classic ways. Or

8:58

it was generated by writing

9:00

manual rules;

9:02

you almost maintain a script

9:05

on the side, writing rules and creating the

9:07

JIT data, which is what they call synthetic

9:09

data. And that became

9:11

untenable because the number of things that

9:14

you have to add to that

9:16

script just became way too many

9:18

every release. You're adding more functionality,

9:20

data-driven functionality, and you're, as

9:24

again, you're adding a lot more to that

9:26

script as well. So this comes

9:28

in really handy because you can try to train

9:30

a generative model on the database next to the

9:32

database. You don't have to move the data around.

9:34

And once you have that model, now you can

9:37

create synthetic data from it just

9:40

by, you know, just one prompt or

9:42

one call to the model, you

9:44

can create the data. So that's where we are

9:46

seeing a lot of demands for it. Then

9:49

the second demand we are seeing a lot

9:51

is training machine learning models.

9:54

Most of the machine learning models, you

9:57

know, the good ones, I think, try

9:59

to predict an outcome that is of interest

10:02

to the business. And usually that outcome is rare

10:06

in the number of times it happens. So

10:08

what happens is when machine learning models are trained

10:10

on data like that, where you have only

10:13

a few examples of that rare occurrence

10:16

and a ton of examples when that did not happen

10:18

and you want to predict that rare occurrence, you

10:21

end up overfitting the machine learning model and overfitting

10:23

to that data that it has seen. There,

10:26

synthetic data, adding synthetic data, provides

10:28

a lot of benefit in

10:31

terms of reducing that overfit.

10:34

So to give you an example, one of our customers

10:36

was trying to predict from

10:39

ServiceNow data. So they had ServiceNow

10:41

database for their local enterprise, for

10:45

their local DevOps work. And

10:47

they're trying to predict what change

10:50

may cause an incident of priority

10:52

P0 and P1, or major incidents.

10:55

Now, the major incidents, thankfully, were

10:58

just like 100, maybe less than 100. Whereas

11:00

the rest of the incidents that are not that major

11:03

were like 80,000 or so. And

11:05

the predictive model was not able to produce a

11:08

good accuracy because it was overfitting to

11:10

the data, the few

11:13

examples it had. In that

11:15

case, they added synthetic data to it.

11:17

They learned a generative model on that

11:19

database. They created the data. They added

11:21

more synthetic examples for

11:24

the major incidents. So they just increased the size

11:26

of that data and then trained the machine learning

11:28

model. And they were able to get a

11:30

better predictive accuracy.
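
A hedged sketch of that augmentation pattern: oversample the rare outcome from a fitted synthesizer and mix it with the real rows before training a classifier. The "is_major_incident" label, the counts, and the classifier choice are invented for illustration, and this assumes a synthesizer already fit on the incident table as in the earlier sketches, with feature columns already numeric.

    import pandas as pd
    from sdv.sampling import Condition
    from sklearn.ensemble import RandomForestClassifier

    # Generate extra examples of the rare outcome (label name is hypothetical).
    rare = Condition(num_rows=5_000, column_values={"is_major_incident": True})
    synthetic_rare = synthesizer.sample_from_conditions(conditions=[rare])

    # Mix real and synthetic rows, then train the predictive model as usual.
    train_df = pd.concat([real_data, synthetic_rare], ignore_index=True)
    X = train_df.drop(columns=["is_major_incident"])
    y = train_df["is_major_incident"]

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)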

11:33

And how does the concept of bias

11:36

fit in with something like that? I guess

11:38

what I mean by that is I can

11:40

absolutely see synthetic data being created. But

11:43

if it's not created from

11:45

a wide enough initial

11:47

data set, then it seems like bias

11:49

could creep in at scale over time.

11:51

Is that a valid concern? Or

11:53

how would I think about something like that? That's

11:57

a valid concern. I think we... sort

12:00

of focus on how to sub-sample

12:02

the data from the database. So your

12:05

bias could be when you sub-sample to train

12:07

this model, the generative model, it could come

12:10

from that. So we provide

12:12

ability to do proper sampling. I

12:14

mean, there's ways to overcome that.

12:17

There's bias when

12:19

inherently your data collection is biased.

12:22

So that's the even more problematic

12:24

situation in which case, I think

12:26

a lot of downstream

12:28

applications or usage of data

12:31

also have measures to measure bias and

12:34

be able to control for it. What

12:37

happens is when you do that, you

12:39

can actually sample from the model focusing

12:42

on reducing that bias in an application.

12:44

So you can try to do that.

12:46

The models allow you to create data

12:50

that are of a particular

12:52

demographic or a particular sort

12:54

of category so that you can reduce the

12:56

bias in the application itself. So two things,

12:58

if your data itself is good in the

13:00

database, we

13:03

provide functionality to sample it properly so that

13:06

you don't create biased data to train this

13:08

model. The second thing is if

13:10

your database itself is biased, in

13:12

your application, you can actually sample

13:15

more data of one kind, of one

13:17

category to control for that bias. And

13:21

maybe I'll take it one step further, too. In

13:23

addition to the bias, and

13:26

I don't know what the term is, but how do

13:28

we verify, if you

13:30

will, or how do we verify the

13:32

quality of all of that? And

13:35

so take everyone maybe one step further

13:37

because I kind of think of this as you've

13:40

got the

13:42

generative AI, and you describe what you

13:44

need, I need this kind of data set, and it's

13:46

going to return that. And

13:49

then it will create this fake data

13:51

set or synthetic data set. Well,

13:54

how would an organization,

13:56

if I'm doing this, how

13:58

do they then turn around and verify that

14:00

this data set that was created was

14:02

of an acceptable level of quality. So

14:05

I guess maybe not just the bias, but as

14:07

well as was it quality data, and how

14:10

do you validate that? Great

14:12

question. So the

14:15

open source library, I think

14:18

it's probably the first,

14:20

it's called SDMetrics, synthetic data metrics.

14:23

So what it does is that if you

14:25

give real data and synthetic data, it doesn't

14:27

matter where the synthetic data was generated from.

14:30

It measures a variety of

14:32

metrics saying that how similar are

14:34

they statistically. And what

14:36

is interesting is that you would have to

14:38

balance that with two other measures. One

14:41

of that is diagnostic, we call

14:43

it diagnostic. So basically maintaining

14:45

the format, the properties, the referential integrity.

14:48

So those are must haves when

14:50

you generate synthetic data. The

14:52

second property that we have to

14:54

balance is its privacy. So if

14:56

you make it statistically really, really,

14:59

really close to the real data,

15:01

there is a chance you could

15:03

leak privacy, you leak private information.

15:06

So as a result, you counterbalance that

15:08

with some privacy measures. And those metrics

15:10

also can be calculated by passing the

15:12

real data and synthetic data. So

15:15

we have three sort of metrics

15:17

in that library. One

15:20

is quality metrics, which are statistical

15:22

measures of comparing those two data

15:25

sets. Diagnostic measures, which compares if the

15:27

properties are met, or certain basic properties

15:29

that are must haves, and privacy metrics

15:32

that actually calculates the privacy for that.
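
As a rough sketch of how those checks run in the companion SDMetrics library: the quality and diagnostic reports take the real table, the synthetic table, and the table metadata, and privacy metrics live in the same library. Report names and score scales may differ by version, so treat this as illustrative.

    from sdmetrics.reports.single_table import DiagnosticReport, QualityReport

    # The same table description used to fit the synthesizer, as a plain dict.
    metadata_dict = metadata.to_dict()

    # Statistical similarity between real and synthetic columns and column pairs.
    quality = QualityReport()
    quality.generate(real_data, synthetic_data, metadata_dict)
    print(quality.get_score())      # overall score between 0 and 1

    # Must-have structural checks: formats, value ranges, key validity, etc.
    diagnostic = DiagnosticReport()
    diagnostic.generate(real_data, synthetic_data, metadata_dict)
    print(diagnostic.get_score())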

15:34

And then the enterprises can choose from

15:36

the trade off points,

15:39

because you can actually manage this

15:41

generative model in a way that you can

15:43

create multiple synthetic data sets. And as a

15:45

result, you can actually see the

15:47

metrics and choose the trade off point. Do

15:49

you want more data with high fidelity, high

15:52

quality, but a little less privacy? Or do

15:54

you want a lot of privacy, like you

15:56

want to be able to have the

15:59

data be perfectly foolproof while

16:01

forgoing some amount of quality? No,

16:06

it makes sense. I'm

16:08

curious as I'm thinking about this and learning about

16:10

it as you go. At

16:13

some point, the

16:15

ability to create the synthetic data

16:17

feels, especially if you can,

16:19

like you said, do things like say, hey,

16:22

I'd like to essentially experiment

16:24

or manipulate it if the

16:27

model looked like this, 10 percent

16:29

more in this direction or 15 percent

16:31

more with this weighting to it. At

16:36

what point does creating

16:38

synthetic data and essentially

16:41

creating a model or training a model

16:43

start to look like the same thing

16:45

in terms of the

16:47

amount of time and resources it takes

16:49

or how much ability you

16:52

have to manipulate the weights and

16:54

so forth? Do they start

16:56

to look, the process

16:58

of doing this begin to look the

17:01

same or is there very,

17:03

very different creating

17:05

synthetic data and creating models? They

17:08

just become very different things. Creating

17:13

synthetic data and creating the model probably

17:16

overlap significantly once you have

17:18

trained the generative model. I

17:23

think the classical way of

17:25

creating synthetic data was very manual.

17:28

I think the generative model, the additional advantage

17:30

it brought was that it

17:32

essentially automated that process so you don't

17:35

have to worry about anything

17:38

to write manually or anything. The

17:41

second thing is it gives you a very succinct

17:43

representation of the data in terms of model parameters

17:45

so you can tweak those as well. It

17:48

looks like a very different workflow when you try

17:50

to interact with the generative model that's trained

17:53

and create synthetic data from it versus when

17:55

you just go and write

17:58

everything from scratch and do things

18:00

like that. Yeah. I don't

18:02

know if that answered your question, but... No, it's

18:04

helpful. I'm sort of... My

18:06

thought process went down the path of I can

18:08

imagine that companies

18:11

could potentially start thinking about, for

18:13

example, like their competition or think

18:15

about a new market and

18:18

go and create a model that

18:20

they think looks like that and then do

18:22

scenario planning against it and other types of

18:24

things. So yeah. Okay. That's

18:26

actually one of the very common use cases

18:28

where you have launched a product and you

18:31

had like, let's say a million customers in

18:33

US region. And

18:35

now you launch the same product in

18:37

Europe and you only have 10,000 customers.

18:42

It's coming, but the rate at

18:44

which you're getting is slower, but you have a lot more

18:46

data at the US. So what you

18:48

can do is you can see if there

18:50

is some overlapping patterns between the two

18:53

populations. And also,

18:55

even better, if you

18:58

have overlapping customers that have both lived

19:00

in the US and are also living in

19:02

UK or traveling between both places and

19:04

use that generative

19:07

model to augment the data

19:09

when trying to do analysis for the UK

19:12

or Europe region. That's

19:15

a very popular use case where people try

19:17

to augment the data for new

19:19

launches or new places where you haven't

19:21

got enough data yet. But you have from

19:24

other places, you have some data to anchor

19:26

on. Gotcha. Okay.

19:29

Okay. That's really fascinating. I love

19:31

that. One last topic before we close out

19:33

for the day. Let's talk about

19:36

resources a bit. Whenever I think of

19:39

GenAI and specifically around any kind

19:41

of training or fine tuning, I

19:43

immediately think about large amounts

19:45

of hardware, whether it's GPUs

19:47

or, and certainly GPUs these

19:49

days, they tend to have

19:52

limited availability. And is that something that is

19:54

true here? And also, is this something

19:57

that is on-prem or is

19:59

it in the cloud or is it both

20:01

or you know tell me a little bit

20:03

about the resources required and maybe the differences

20:06

between using synthetic data

20:08

and real world for something like this.

20:12

Yeah, we created algorithms

20:14

for especially this kind of data

20:16

that do not require GPUs, so

20:18

we use very classical techniques, statistical

20:22

techniques, but then we did re-purpose

20:24

them, or reinvent them I guess,

20:26

for relational tables. They

20:28

are called probabilistic graphical models so they

20:30

can be trained on CPUs and much

20:32

faster they can be trained really fast

20:34

on these CPU machines

20:36

classic off-the-shelf CPU machines. So

20:39

as a result, one

20:41

of the customers actually, or

20:44

many of them actually, complimented us

20:46

saying that I think because it

20:48

takes so much less compute power and

20:51

it actually can train effectively and efficiently on

20:53

large amounts of data, the number

20:55

of applications that they can imagine

20:58

this generative model being used

21:00

for is huge because

21:02

otherwise it was a lot of expense and then

21:04

it cut down the number of use cases they

21:07

can use it for. So to

21:09

answer your question, yeah, we are able

21:11

to train on just CPUs alone and GPUs

21:13

are not required. To answer

21:15

the second question cloud

21:18

on-prem so it's on-prem and

21:20

private cloud so whatever cloud

21:22

service provider that they're using we

21:25

can run the software there they can run

21:27

the software there integrate with their data and

21:30

run it or they can just run an

21:32

on-prem cluster or compute or machine that they

21:34

have. One of the important things

21:37

here is that very early on we made

21:39

this call saying that it can't

21:41

be a hosted service, because it doesn't make

21:43

sense for somebody to say oh

21:45

to protect your data, upload it to us. And

21:49

we are not a cloud provider so it doesn't make

21:51

sense for them to upload the data and then be

21:53

able to get the generative model. So we

21:56

the provider software so that they can install and run

21:58

it on their in- data environment. Yeah,

22:01

makes sense. Makes sense. Yeah, and I think that's,

22:04

that's maybe one of the under-covered,

22:06

or maybe under discussed aspects of,

22:09

of, you know, AI is people are kind of

22:11

following well, all the GPUs are in the cloud,

22:13

so everything must happen in the cloud. And they

22:16

sort of forget that so much

22:18

of company data is still on

22:20

prem and kind of kept under lock and key.

22:22

And so yeah, that, that where do I, where

22:25

do I do the processing? Where do I keep

22:27

data? Where can I do synthetic data is really,

22:29

really important. Before

22:32

we before we wrap up, want to ask real quick,

22:34

because we didn't spend a lot of

22:36

time on DataCebo and how you

22:38

guys deliver this. Real quick, if folks,

22:40

you know, are interested in in how

22:42

you guys deliver synthetic data, what's, what's

22:45

the best way to engage with the

22:47

company? What's sort of the typical engagement

22:49

model? You know, how would how

22:51

people reach out and work with your team? Because

22:53

you guys have tremendous, tremendous amount of experience in

22:55

this space. Yeah,

22:58

the first thing a lot of people

23:00

do is use our source available software

23:04

called the Synthetic Data Vault (SDV), which

23:06

has a number of modeling techniques,

23:08

generated modeling techniques for a variety

23:10

of enterprise level data sets. And

23:13

they train them on-prem,

23:15

they install and train them on prem.

23:17

Without us even being involved at that

23:19

stage, there's documentation available that's picked up

23:22

by sort of thousands

23:24

of folks in the in

23:26

the community. So the best

23:29

site is sdv.dev, like that's where all

23:31

the resources are right now, as

23:33

well. So once they establish

23:36

a use case, and they

23:38

find some success with it, then they

23:40

reach out to us for the SDV Enterprise

23:42

version, where it allows them

23:44

to scale to many tables, multiple

23:46

tables, and most, you know,

23:48

a lot of cool enterprise features, multiple

23:51

data types, and a lot

23:53

of complexity as well in enterprises, and they're

23:55

able to address all of those, and they

23:57

reach out to us to get SDV Enterprise.

24:00

On subscription. Awesome. Thank

24:02

you. That's really great to hear. So, go

24:04

ahead. Go ahead. I

24:08

was just

24:10

going to say, so that's a really

24:12

good summary of everything, and

24:14

thank you for kind of taking

24:17

us through all of that today. So I

24:21

think we'll close it out there, Brian. Yeah,

24:23

no, this has been very, really

24:25

helpful, and it helps, at least in my mind,

24:27

you know, kind of close a

24:29

gap or at least fill in a hole

24:31

in terms of thinking about how our team's

24:35

going to sort of manage this challenge

24:38

of security, where's my

24:40

data, if my data's not perfect for

24:42

what somebody has in mind as to

24:44

what's possible with some of these AI

24:46

models, this feels

24:48

like it fills in so many

24:51

spaces, or at least it starts to unlock so

24:53

many spaces. So, Kalyan, thank you so much for

24:55

the time today, for the insights. We really appreciate

24:57

it. And folks, as we mentioned,

24:59

you know, him and his team have a ton

25:01

of background, both obviously what they've been doing with

25:03

DataCebo, but, you know, the

25:05

work they had done previously and still continue to

25:07

do with MIT and the labs over there. So

25:09

very, very excited to have the opportunity to speak

25:11

with you today. So thank you for the time.

25:14

Aaron, you want to wrap us up and take us home? Yeah,

25:16

absolutely. So, first of all, everyone out there,

25:18

thank you very much for listening. We certainly

25:21

appreciate it. And if you enjoy the show,

25:23

please tell a friend and also please leave

25:25

us a review if it's possible wherever you

25:27

get your podcasts. For

25:29

Brian and myself, I'm going to close this out

25:31

for this week and you can follow us. Of

25:34

course, all the links are in the show notes, as

25:37

always, and on the website. And

25:39

we thank everyone for their time and we will talk

25:41

to everyone next week. The

25:44

Cloudcast. Please visit

25:46

thecloudcast.net for more

25:48

shows, show notes, videos and

25:50

everything social media.
