Episode Transcript
0:00
Cloudcast Media presents from the Massive
0:02
Studios in Raleigh, North Carolina. This
0:04
is the Cloudcast with Aaron Delp
0:06
and Brian Gracely, bringing you the
0:08
best of cloud computing from around
0:10
the world. Good
0:13
morning, good evening, everyone. Welcome back to the
0:15
Cloudcast. We're coming to you live from
0:17
our Massive Cloudcast Studios here in Raleigh, North
0:19
Carolina. And it is Aaron for the intro
0:22
this week. We are full on into spring
0:24
conference season. Lots of shows
0:26
and events already happened and
0:28
more on the horizon. I'm in work
0:30
crunch mode preparing for upcoming events. And
0:32
it's that time of the year I
0:35
kind of hate, but I also kind of love all
0:37
at the same time. Our topic
0:40
for this week is the generation
0:42
of synthetic data, specifically around AI.
0:44
But this also really applies to
0:46
any application that needs testing data
0:48
at scale. And that conversation is
0:51
coming up right now. And
0:55
we're back. And Aaron, how you doing,
0:57
man? Spring's in full bloom. How have you
0:59
been so far? I'm good. I'm good. Yeah, I've kind
1:01
of said it once or
1:04
twice on the shows before, but
1:06
it definitely feels like spring. It's
1:08
warming up. The spring conference
1:10
season is here. Lots
1:12
going on. It's been a pretty exciting spring,
1:14
actually. Yeah, no, it's been good. It's been
1:16
good. Now, you and I obviously
1:18
have been talking a lot about the
1:22
topic du jour. And one of the things I was
1:24
thinking about recently is, you
1:26
know, as
1:28
a data scientist or anybody who's managing teams
1:30
of people that are trying to
1:32
manage data and their data
1:34
may not necessarily be in the same place
1:37
as their GPUs or the place where they're
1:39
going to do acceleration and training and so
1:41
forth. I got to
1:43
thinking about that, especially for enterprises. And you think about
1:45
just how restrictive companies are
1:47
with the most simple things. Like, what
1:51
happens if you take your laptop on the road with
1:53
you? There's a whole set of training that goes with,
1:55
you know, don't do this and don't lose the data.
1:57
Or, you know, what do I do with my cell phone?
1:59
Like, how can you do that? What can I put on there?
2:01
What can't I put on there? Then I think
2:03
about the massive amounts of data that are
2:05
involved with, whether it's fine
2:08
tuning or training or even RAG with
2:10
some of these AI environments, and
2:13
you start to wonder like, how are
2:15
they going to deal with this? How are they
2:17
going to deal with potentially moving their
2:19
data outside of the four walls or
2:21
in secure environments? We
2:24
got to thinking about this, like, this would be a perfect
2:27
opportunity if there was some way to create a
2:29
representation of your data that the
2:32
security teams would feel a little more confident with, and
2:34
that would be probably an interesting topic for us to
2:36
dig into. Our
2:39
topic today is synthetic data. Of
2:41
course, as we want to go and talk about that topic,
2:44
we go and find somebody and we have
2:46
a fantastic guest for today. We have Kalyan,
2:50
CEO and co-founder at DataCebo.
2:53
First of all, welcome to the show. Thank
2:55
you, guys. Thank you. Yes,
2:57
absolutely. Let's start
3:00
at the start because some may
3:02
have heard about synthetic data or they're already
3:04
using it but maybe they've not referred
3:06
to it in that way before, but
3:08
tell everyone a little bit about what
3:10
is the use case
3:12
and the need and the problem synthetic
3:14
data is solving today. Great.
3:17
Thank you. I love the introduction. Brian just mentioned the laptop on the road, and
3:21
I'll come back to that in a minute. Because
3:24
we actually had that situation
3:26
when we founded this project.
3:29
Synthetic data is data generated
3:32
from a generative model that is trained on the real data. You train a generative model on the real data, a little bit
3:39
of real data, and then you get a model
3:41
and from that model, you can generate data
3:44
that looks like the real data, has all the statistical properties and format of the real data, but cannot be linked
3:51
to the real data. It's actually a sample
3:53
that is generated from a model, which
3:55
is a mathematical representation of the data.
4:00
The great thing that it provides is an
4:02
alternative when access to
4:04
the real data is restricted as we were talking
4:06
about it or needs to
4:08
be restricted or is not available. So
4:11
that's the immediate, apparent need that enterprises have
4:13
for this kind of data or for this
4:15
kind of modeling technique. What
4:18
is less known, however, is
4:21
how dramatically it changes that data
4:23
access and availability. So
4:25
just to give you an example, once you
4:27
build a generative model, you can
4:29
actually sample as much data as
4:31
you want. So you're not
4:34
limited by your original data size. So you
4:36
can sample 10 million
4:38
records, 1 billion records of
4:40
customers, even though you probably trained the model on just 10,000 customers' data.
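To make that concrete, here is a minimal sketch of what this workflow might look like with DataCebo's open-source SDV library, assuming its documented single-table API; the file name, synthesizer choice, and row counts are illustrative, not something discussed on the show:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical table of ~10,000 real customer records.
real_data = pd.read_csv("customers.csv")

# Describe the table (column types, keys) so the model knows the schema.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Train a generative model on the real data.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Sample far more rows than the original table held, e.g. for load testing.
synthetic_data = synthesizer.sample(num_rows=10_000_000)
```

The fitted synthesizer itself is small enough to move between environments, which is the property the performance-testing use case that follows relies on.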
4:48
What that enables people right away is to
4:50
create data for performance testing. I mean,
4:52
that's just the day zero ROI
4:55
that enterprises see with this kind of
4:57
technology where they can create
5:00
a lot of data in their lower
5:02
environments and not have to port the data or a large database, but just use this model, which is usually very, very small, maybe a couple of gigabytes, even though your data could be terabytes, and sample billions of records from it to test your software application for performance. So that's just one example of what this game-changing technology is used for. The
5:30
other example is more involved
5:33
and interesting. It's even less known, but it's actually a very powerful use case. Generative models are known to
5:43
have the ability to generate data for a
5:45
particular scenario. So what you can do with this model is specify a particular scenario. For example, if you're an insurance company, you can say, hey, what if I had 10% more smokers taking the policy? What will happen to the distribution of claims that I have? You
6:06
can generate such scenarios and generate
6:08
data, and from that estimate the distribution, do some analysis, and make decisions. Classically, it used to be called a digital twin; it used to be called a variety of names.
6:22
But if you get the model
6:24
right, you should
6:26
be able to create some scenarios
6:28
and analyze. Yeah.
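As a rough sketch of that scenario idea, SDV exposes conditional sampling; the `smoker` and `claim_amount` columns are hypothetical stand-ins for the insurance example, and this reuses the `synthesizer` fitted in the earlier sketch:

```python
from sdv.sampling import Condition

# Hypothetical what-if: shift the smoker share upward and regenerate records.
smokers = Condition(column_values={"smoker": True}, num_rows=55_000)
non_smokers = Condition(column_values={"smoker": False}, num_rows=45_000)

scenario = synthesizer.sample_from_conditions(conditions=[smokers, non_smokers])

# Estimate the claims distribution under the new mix.
print(scenario["claim_amount"].describe())
```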
6:31
So, I hope you guys understand, obviously, you have a tremendous background
6:35
in this space having done
6:38
a ton of work and research over at MIT, a
6:40
number of different projects, we'll put some links to those
6:42
in the show notes. People
6:46
have been doing things around synthetic data for disaster
6:49
recovery, things like you said for modeling.
6:52
What's really evolved in this space? Obviously, now with AI being a big deal, people are thinking about it more. But have
7:01
there been any big
7:03
breakthroughs that are allowing people to be
7:05
more flexible in saying, for
7:07
example, I start with this
7:10
real set of data of whatever size,
7:12
and I can now do 10, 12, 15 different variations
7:14
of it? What's
7:18
evolving in this space to
7:20
bring this back to the forefront of being important for
7:22
people? One
7:25
of the things that is involved in this
7:27
space is, I mean, technology-wise, what has evolved
7:29
is generative modeling has evolved a lot. I
7:32
think it has become more and more capable
7:35
of building a
7:37
model like this for a multi-table complex
7:40
dataset. So, we focus a lot on relational and data warehouse kinds of data, not necessarily language.
7:47
Sure. So, we invented the
7:49
first generative model
7:52
for relational databases. That
7:55
technology has matured over the last few years
7:57
in my group at MIT, and that has enabled us to make this possible.
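For the relational case, SDV's multi-table API follows the same pattern; a sketch, assuming a hypothetical two-table schema of users and transactions:

```python
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

# tables: a dict of pandas DataFrames, e.g.
# {"users": users_df, "transactions": transactions_df}
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(tables)

# Fit one model across all tables, learning the cross-table relationships.
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(tables)

# scale=2.0 roughly doubles the data volume while keeping referential integrity.
synthetic_tables = synthesizer.sample(scale=2.0)
```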
8:03
That's one thing. The bigger item, from the use case point of view, is that across a lot of software applications, we find a prominent use case in testing those applications. And a
8:16
lot of these applications over the last decade
8:18
have become data-driven. So what I mean by
8:20
that is there's a lot of conditional logic
8:23
inside the software that
8:25
is dependent on what data
8:27
it sees. Regardless of
8:30
what application, they have conditions saying that if you see this in the data, this particular type of demographic or this particular type of session or what have you, they make decisions within the software application.
8:47
Now to test those applications, you need the data.
8:50
And that data in the past
8:52
has been generated either by copying, masking, and anonymizing, which was one of the classic ways, or by writing manual rules: you almost maintain a script on the side, writing rules and creating the data just in time (JIT), which is what they call synthetic data. And that became
9:11
untenable, because the number of things you have to add to that script just became way too many. Every release, you're adding more functionality, data-driven functionality, and again, you're adding a lot more to that script as well. So this comes
9:28
in really handy, because you can train a generative model right next to the
9:32
database. You don't have to move the data around.
9:34
And once you have that model, now you can
9:37
create synthetic data from it with, you know, just one prompt or one call to the model. So that's where we are
9:46
seeing a lot of demand for it. Then
9:49
the second demand we are seeing a lot
9:51
is training machine learning models.
9:54
Most of the machine learning models, you
9:57
know, the good ones, try to predict an outcome that is of interest
10:02
to the business. And usually that outcome is rare
10:06
in the number of times it happens. So
10:08
what happens is when machine learning models are trained
10:10
on data like that, where you have only
10:13
a few examples of that rare occurrence
10:16
and a ton of examples when that did not happen
10:18
and you want to predict that rare occurrence, you
10:21
end up overfitting the machine learning model and overfitting
10:23
to that data that it has seen. There, adding synthetic data provides
10:28
a lot of benefit in
10:31
terms of reducing that overfit.
10:34
So to give you an example, one of our customers
10:36
was trying to predict from
10:39
ServiceNow data. So they had a ServiceNow
10:41
database for their local enterprise, for
10:45
their local DevOps work. And
10:47
they're trying to predict what change
10:50
may cause an incident of priority
10:52
P0 or P1, or major incidents.
10:55
Now, the major incidents, thankfully, were
10:58
just like 100, maybe less than 100. Whereas
11:00
the rest of the incidents that are not that major
11:03
were like 80,000 or so. And
11:05
the predictive model was not able to produce a
11:08
good accuracy because it was overfitting to
11:10
the data, the few
11:13
examples it had. In that
11:15
case, they added synthetic data to it.
11:17
They learned a generative model on that
11:19
database. They created the data. They added
11:21
more synthetic examples for
11:24
the major incidents. So they just increased the size of that data and then trained the machine learning model, and they were able to get better predictive accuracy.
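A sketch of that augmentation step, again using SDV's conditional sampling; the column name and row counts are hypothetical, and `synthesizer` is a model already fitted on the incident table from the earlier sketches:

```python
import pandas as pd
from sdv.sampling import Condition

# Generate extra examples of the rare outcome (major incidents).
rare = Condition(column_values={"major_incident": True}, num_rows=5_000)
synthetic_rare = synthesizer.sample_from_conditions(conditions=[rare])

# Train the downstream classifier on real data plus the synthetic rare examples.
training_data = pd.concat([real_data, synthetic_rare], ignore_index=True)
```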
11:33
And how does the concept of bias
11:36
fit in with something like that? I guess
11:38
what I mean by that is I can
11:40
absolutely see synthetic data being created. But
11:43
if it's not created from
11:45
a wide enough initial
11:47
data set, then it seems like bias
11:49
could creep in at scale over time.
11:51
Is that a valid concern? Or
11:53
how would I think about something like that? That's
11:57
a valid concern. I think we... sort
12:00
of focus on how to sub-sample
12:02
the data from the database. So your
12:05
bias could be when you sub-sample to train
12:07
this model, the generative model, it could come
12:10
from that. So we provide the ability to do proper sampling. I
12:14
mean, there's ways to overcome that.
12:17
There's bias when
12:19
inherently your data collection is biased.
12:22
So that's the even more problematic
12:24
situation, in which case, I think a lot of downstream applications or usages of data also have measures to quantify bias and
12:34
be able to control for it. What
12:37
happens is when you do that, you
12:39
can actually sample from the model focusing
12:42
on reducing that bias in an application.
12:44
So you can try to do that.
12:46
The models allow you to create data
12:50
that are of a particular
12:52
demographic or a particular sort
12:54
of category so that you can reduce the
12:56
bias in the application itself. So two things,
12:58
if your data itself is good in the
13:00
database, we provide functionality to sample it properly so that you don't create biased data to train this
13:08
model. The second thing is if
13:10
your database itself is biased, in
13:12
your application, you can actually sample
13:15
more data of one kind, of one
13:17
category to control for that bias. And
13:21
maybe I'll take it one step further. In
13:23
addition to the bias, and
13:26
I don't know what the term is, but how do
13:28
we verify, if you
13:30
will, or how do we verify the
13:32
quality of all of that? And
13:35
so take everyone maybe one step further
13:37
because I kind of think of this as: you've got the generative AI, you describe your need, I need this kind of data set, and it's going to return that. And
13:49
then it will create this fake data
13:51
set or synthetic data set. Well,
13:54
how would an organization,
13:56
if I'm doing this, how
13:58
do they then turn around and verify that
14:00
this data set that was created was
14:02
of an acceptable level of quality? So
14:05
I guess maybe not just the bias, but also, was it quality data, and how
14:10
do you validate that? Great
14:12
question. So we have an open-source library, I think it's probably the first, called Synthetic Data Metrics (SDMetrics).
14:23
So what it does is that if you
14:25
give real data and synthetic data, it doesn't
14:27
matter where the synthetic data was generated from.
14:30
It measures a variety of
14:32
metrics saying how similar they are statistically. And what
14:36
is interesting is that you would have to
14:38
balance that with two other measures. One
14:41
of that is diagnostic, we call
14:43
it diagnostic. So basically maintaining
14:45
the format, the properties, the referential integrity.
14:48
So those are must-haves when
14:50
you generate synthetic data. The
14:52
second property that we have to
14:54
balance is its privacy. So if
14:56
you make it statistically really, really,
14:59
really close to the real data,
15:01
there is a chance you could
15:03
leak privacy, you leak private information.
15:06
So as a result, you counterbalance that
15:08
with some privacy measures. And those metrics
15:10
also can be calculated by passing the
15:12
real data and synthetic data. So
15:15
we have three sorts of metrics in that library. One is quality metrics, which are statistical measures comparing those two data sets; diagnostic measures, which check whether the properties, certain basic must-have properties, are met; and privacy metrics, which actually calculate the privacy.
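A minimal sketch of how those first two reports are run with the SDMetrics library, assuming its documented report API; `real_data` and `synthetic_data` are the frames from the earlier sketches, and the privacy metrics live in the same package:

```python
from sdmetrics.reports.single_table import DiagnosticReport, QualityReport

# metadata here is the dict form of the table schema (e.g. SDV metadata.to_dict()).
# Quality: statistical similarity between real and synthetic data, scored 0 to 1.
quality = QualityReport()
quality.generate(real_data, synthetic_data, metadata)
print(quality.get_score())

# Diagnostic: must-have checks such as column formats and validity.
diagnostic = DiagnosticReport()
diagnostic.generate(real_data, synthetic_data, metadata)
print(diagnostic.get_score())
```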
15:34
And then the enterprises can choose from
15:36
the trade-off points,
15:39
because you can actually manage this
15:41
generative model in a way that you can
15:43
create multiple synthetic data sets. And as a
15:45
result, you can actually see the
15:47
metrics and choose the trade-off point. Do
15:49
you want more data with high fidelity, high
15:52
quality, but a little less privacy? Or do
15:54
you want a lot of privacy, like you want the data to be perfectly foolproof, while forgoing some amount of quality? No,
16:06
it makes sense. I'm
16:08
curious as I'm thinking about this and learning about
16:10
it as you go. At
16:13
some point, the
16:15
ability to create the synthetic data
16:17
feels, especially if you can,
16:19
like you said, do things like say, hey,
16:22
I'd like to essentially experiment
16:24
or manipulate it if the
16:27
model looked like this, 10 percent
16:29
more in this direction or 15 percent
16:31
more with this weighting to it. At
16:36
what point does creating
16:38
synthetic data and essentially
16:41
creating a model or training a model
16:43
start to look like the same thing
16:45
in terms of the
16:47
amount of time and resources it takes
16:49
or how much ability you
16:52
have to manipulate the weights and
16:54
so forth? Do they start
16:56
to look, the process
16:58
of doing this begin to look the
17:01
same or is there very,
17:03
very different creating
17:05
synthetic data and creating models? They
17:08
just become very different things. It's
17:13
the creating synthetic data model. They probably
17:16
overlap significantly once you have
17:18
trained the generating model. I
17:23
think the classical way of
17:25
creating synthetic data was very manual.
17:28
I think generative model, the additional advantage what
17:30
it did was that it
17:32
essentially automated that process so you don't
17:35
have to worry about anything
17:38
to write manually or anything. The
17:41
second thing is it gives you a very succinct
17:43
representation of the data in terms of model parameters
17:45
so you can tweak those as well. It
17:48
is a very different workflow when you interact with the trained generative model and create synthetic data from it, versus when you just go and write everything from scratch, and things like that. Yeah. I don't
18:02
know if that answered your question, but... No, it's
18:04
helpful. I'm sort of... My
18:06
thought process went down the path of I can
18:08
imagine that companies
18:11
could potentially start thinking about, for
18:13
example, like their competition or think
18:15
about a new market and
18:18
go and create a model that
18:20
they think looks like that and then do
18:22
scenario planning against it and other types of
18:24
things. So yeah. Okay. That's
18:26
actually one of the very common use cases
18:28
where you have launched a product and you
18:31
had like, let's say a million customers in
18:33
the US region. And
18:35
now you launch the same product in
18:37
Europe and you only have 10,000 customers.
18:42
It's coming, but the rate at
18:44
which you're getting them is slower, but you have a lot more data in the US. So what you
18:48
can do is you can see if there
18:50
is some overlapping patterns between the two
18:53
populations. And also,
18:55
even better, if you have overlapping customers that have lived in the US and are also living in the UK, or are traveling between both places, you can use that generative model to augment the data when trying to do analysis for the UK
19:12
or Europe region. That's
19:15
a very popular use case where people try
19:17
to augment the data for new
19:19
launches or new places where you haven't got enough data yet, but you have some data from other places to anchor
19:26
on. Gotcha. Okay.
19:29
Okay. That's really fascinating. I love
19:31
that. One last topic before we close out
19:33
for the day. Let's talk about
19:36
resources a bit. Whenever I think of
19:39
GenAI and specifically around any kind
19:41
of training or fine tuning, I
19:43
immediately think about large amounts
19:45
of hardware, whether it's GPUs
19:47
or, and certainly GPUs these
19:49
days, they tend to have
19:52
limited availability. And is that something that is
19:54
true here? And also, is this something
19:57
that is on-prem or is
19:59
it in the cloud, or is it both? You know, tell me a little bit about the resources required, and maybe the differences between using synthetic data and real-world data for something like this.
20:12
Yeah, we created algorithms especially for this kind of data that do not require a GPU. We use very classical statistical techniques, but we repurposed them, or reinvented them I guess, for relational tables. They
20:28
are called probabilistic graphical models, so they can be trained on CPUs, and much faster; they can be trained really fast on classic, off-the-shelf CPU machines. So
20:39
as a result, one of the customers actually, or many of them actually, compliment us, saying that because it takes so little compute power and can train effectively and efficiently on large amounts of data, the number of applications they can imagine this generative model being used for is huge. Because otherwise, it was a lot of expense, and that cut down the number of use cases they could use it for. So to
21:09
answer your question, yeah, we are able
21:11
to train on just CPUs alone and GPUs
21:13
are not required. To answer
21:15
the second question, cloud versus on-prem: it's on-prem and private cloud. So, whatever cloud service provider they're using, they can run the software there, integrate it with their data, and run it, or they can just run it on an on-prem cluster or compute machine that they have. One of the important things
21:37
here is that very early on we made
21:39
this call, saying that it can't be hosted, because it doesn't make sense for somebody to say, oh, to protect your data, upload it to us. And we are not a cloud provider, so it doesn't make sense for them to upload the data and then get the generative model back. So we provide the software so that they can install and run it in their own data environment. Yeah,
22:01
makes sense. Makes sense. Yeah, and I think that's,
22:04
that's maybe one of the under-covered, or maybe under-discussed, aspects of, you know, AI: people are kind of going, well, all the GPUs are in the cloud,
22:13
so everything must happen in the cloud. And they
22:16
sort of forget that so much
22:18
of company data is still on
22:20
prem and kind of kept under lock and key.
22:22
And so yeah, where do I do the processing? Where do I keep data? Where can I do synthetic data? That's really,
22:29
really important. Before
22:32
we wrap up, I want to ask real quick,
22:34
because we didn't spend a lot of
22:36
time on DataCebo and how you
22:38
guys deliver this. Real quick, if folks,
22:40
you know, are interested in how you guys deliver synthetic data, what's
22:45
the best way to engage with the
22:47
company? What's sort of the typical engagement
22:49
model? You know, how would
22:51
people reach out and work with your team? Because
22:53
you guys have tremendous, tremendous amount of experience in
22:55
this space. Yeah,
22:58
the first thing a lot of people
23:00
do is use our source available software
23:04
called the Synthetic Data Vault (SDV), which has a number of generative modeling techniques for a variety of enterprise-level data sets. And
23:13
they install and train them on-prem.
23:17
Without us even being involved at that
23:19
stage, there's documentation available that's picked up
23:22
by sort of thousands of folks in the community. So the best
23:29
site is sdv.dev; that's where all the resources are right now, as
23:33
well. So once they establish a use case and find some success with it, then they reach out to us for the SDV Enterprise version, which allows them to scale to many tables, multiple tables, and, you know, a lot of cool enterprise features, multiple data types, and a lot of complexity in enterprises as well. They're able to address all of those, and they reach out to us to get an SDV Enterprise subscription. Awesome. Thank
24:02
you. That's really great to hear. So, go
24:04
ahead. Go ahead. I
24:08
was just
24:10
going to say, so that's a really
24:12
good summary of everything, and
24:14
thank you for kind of taking
24:17
us through all of that today. So I
24:21
think we'll close it out there, Brian. Yeah,
24:23
no, this has been really
24:25
helpful, and it helps, at least in my mind,
24:27
you know, kind of close a
24:29
gap or at least fill in a hole
24:31
in terms of thinking about how our team's
24:35
going to sort of manage this challenge
24:38
of security, where's my
24:40
data, if my data's not perfect for
24:42
what somebody has in mind as to
24:44
what's possible with some of these AI
24:46
models, this feels
24:48
like it fills in so many
24:51
spaces, or at least it starts to unlock so
24:53
many spaces. So, Kalyan, thank you so much for
24:55
the time today, for the insights. We really appreciate
24:57
it. And folks, as we mentioned,
24:59
you know, him and his team have a ton
25:01
of background, both obviously what they've been doing with
25:03
DataCebo, but, you know, the
25:05
work they had done previously and still continue to
25:07
do with MIT and the labs over there. So
25:09
very, very excited to have the opportunity to speak
25:11
with you today. So thank you for the time.
25:14
Aaron, you want to wrap us up and take us home? Yeah,
25:16
absolutely. So, first of all, everyone out there,
25:18
thank you very much for listening. We certainly
25:21
appreciate it. And if you enjoy the show,
25:23
please tell a friend and also please leave
25:25
us a review if it's possible wherever you
25:27
get your podcasts. For
25:29
Brian and myself, I'm going to close this out
25:31
for this week and you can follow us. Of
25:34
course, all the links are in the show notes, as
25:37
always, and on the website. And
25:39
we thank everyone for their time and we will talk
25:41
to everyone next week. The Cloudcast. Please visit theCloudcast.net for more shows, show notes, videos, and everything social media.