Episode Transcript
0:13
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, forty-gigabit networking, and dedicated CPU and GPU instances. And now you can launch a managed MySQL, Postgres, or MongoDB cluster in minutes to keep your critical data safe with automated backups and failover. Go to python podcast dot com slash linode today to get a one-hundred-dollar credit to try out their new database service, and don't forget to thank them for their continued support of this show.

0:55
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star's data discovery platform solves that out of the box, with a fully automated catalog that includes lineage from where the data originated all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your dbt, Snowflake, Tableau, Looker, or whatever you're using, and Select Star will set everything up in just a few hours. Go to python podcast dot com slash select star today to double the length of your free trial and get a swag package when you convert to a paid plan.

1:30
Your host, as usual, is Tobias Macey, and this month I'm running a series about Python's use in machine learning. If you enjoyed this episode, you can explore further on my new show, The Machine Learning Podcast, which helps you go from idea to production with machine learning. To find out more, you can go to the machine learning podcast dot com.

1:48
Your host is Tobias Macey, and today I'm interviewing Max Halford about River, a Python toolkit for streaming and online machine learning. So Max, can you start by introducing yourself?
1:58
Oh, hi there. I'm Max. Yes, I consider myself a data scientist; my day job is doing data science. I actually measure the carbon footprint of clothing items. But I have a wide interest in, you know, technical topics, be it software engineering or data engineering; I dabble in a lot of different stuff. My academic background leans towards finance and computer science and statistics. I actually did a PhD in applied machine learning, which I finished a couple of years ago. So, yeah, all around that, basically.
2:33
And do you remember how you first got started working in machine learning?
2:36
Kind of. I was a late bloomer. I got started maybe when I was twenty-one, twenty-two, when I was at university. I basically had no idea what machine learning was, but I started this curriculum that was around statistics. And we had a course, which was maybe two or three hours a week, about machine learning. And it did kind of blow our minds. It was around the time when, well, machine learning, and particularly deep learning, was starting to explode. So I got taught at university, and I was lucky enough to get a theoretical training.
3:11
And in terms of the River project, can you describe a bit more about what it is that you've built, some of the story behind how it came to be, and why you decided that you wanted to build it in the first place?
3:23
When I was at university, I received a kind of normal introduction, a regular introduction, to machine learning. And then I did some internships, and I signed up for a PhD after my internships. I also did a lot of Kaggle competitions on the side, so I was kind of hooked on machine learning. And it always felt to me that something was off, because when we were learning machine learning, everything made sense, but then when you do it in practice, you often find that, well, it's not the same problem. Right? The playground scenarios that they describe at university, when you're learning the field, just do not apply in the real world. In the real world, you have data that's coming in, like a flow of data, or events carrying new data. There's, like, an interactive aspect to the world around us, to the way the data arrives. It's not like a CSV file. It just felt like fitting a square peg in a round hole.

4:19
So I was always curious, in the back of my mind, about how you could do online machine learning. Or rather, I didn't know it was called online machine learning, because when I was a kid, I remember growing up and thinking that AI was this kind of intelligent machine that would keep learning as it went on and as it experienced the world around it. Anyway, when I started my PhD, I had a lot of time to read. I read a lot of papers and blog posts and whatnot, and I can't remember the exact day or week that I stumbled upon it, but I just started learning about online machine learning, maybe from some blog post or something. And then it was like a big explosion in my head, and I was like, wow, this is crazy. Right? This actually exists. And I was so curious as to why it wasn't more popular.

5:07
At the time, I did a lot of open source as a way to learn, and so it just felt natural to me to start implementing the algorithms I was reading about in papers and elsewhere, just writing code to learn, basically, just to confirm what I'd learned and whatnot. That's just the way I learn. And it kind of evolved into what it is now, which is, well, an open source package that people use.

5:30
Now, if I may expand a little bit: River is actually the merger between two projects. The first project is called scikit-multiflow. It was a package that was developed before I even got into machine learning; it has roots in academia in New Zealand, and it comes from an older Java package called MOA. Anyway, I wasn't aware of that. On my end, I started a package called creme at the time, after the French crème, which plays nicely with "incremental", which is another way to say online. So, yeah, I developed creme on my own, and at some point it just made sense to reach out to the guys from scikit-multiflow and propose a merger. It took us quite a while, but after nine months of negotiation and, you know, figuring out the details, we merged, and we called the new package River.
6:24
You mentioned that it's built around this idea of online machine learning, and in the documentation you also refer to it as streaming machine learning. I'm curious if you can talk through what that really means in the context of building a machine learning system, and some of the practical differences between that and the typical batch-oriented workflow that most folks who are working in ML are going to be familiar with.
6:47
First, just to recap on machine learning: the whole point of machine learning is to teach a model to learn from data and to take decisions. So, you know, monkey see, monkey do. And the typical way you do that is that you fit a model to a bunch of data, and that's it, really. Online machine learning is the equivalent of that, but for streaming data. So you stop thinking about data as a file or a table in a database, and you think of it as a stream of data. You could call online machine learning incremental machine learning; you could call it streaming machine learning. I mostly see "online machine learning" being used, although if you google that, you kind of find online courses instead, so that's the downside of the word "online". Anyway, it's just a way to say: can I do machine learning, but with streaming data? And so the rule is that an online model is one that can learn one sample at a time. Usually, you show a model a whole dataset, and it can work with that dataset; it can even calculate the average of the data, it can do a whole bunch of stuff. But the restriction here, with online machine learning, is that the model cannot see the whole dataset. It can't hold it in memory. It can only see one sample at a time, and it has to work like that. So it's a restriction. Right?
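(For illustration, here is a minimal sketch of that one-sample-at-a-time constraint using River; the built-in dataset and the model choice are illustrative, not something prescribed in the conversation.)

```python
from river import datasets, linear_model, preprocessing

# A built-in stream of labeled samples; each x is a plain dict of features.
dataset = datasets.Phishing()

# Scale features on the fly, then feed them to a logistic regression.
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

for x, y in dataset:
    y_pred = model.predict_proba_one(x)  # predict using what was seen so far
    model.learn_one(x, y)                # then update on this single sample
```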
8:13
So it makes it harder for the model to learn, but it also has many, many implications. If you have a model that can learn that way, well, you can have a model that just keeps learning as you go. Because with a regular machine learning model, once it's been fitted to a dataset, you have to retrain it from scratch if you want to incorporate new samples into your model, and that can be a source of frustration. That's why I was calling it the square peg in the round hole before. So say you have an online model that is just as performant as a batch model, accuracy-wise. That has many implications, and it actually makes things easier, because if you have a model that can keep learning, well, you don't have to, for instance, schedule retraining of your model. Every time a new sample arrives, you can just ask your model to learn from it, and it learns. And so that ensures that your model is always as up to date as possible, and that obviously has many, many benefits. If you think about people working on the stock market, trying to forecast the evolution of a particular stock, they've actually been doing online machine learning since the eighties, but obviously they have a lot to lose by making all this public. It just never became a big thing, and it always stayed within stock market companies.

9:38
So the practical differences are that you are working over a stream of data; you're not working over a static dataset. This stream of data has an order, meaning that the fact that one sample arrives before another one carries a lot of meaning, and that's actually reflecting what's happening in the real world. In the real world, you have data that's arriving in a certain order. Well, if you train your model offline on that data, you want to process it in the same order, and that ensures that you are actually reproducing the conditions that happen in the real world.

10:14
Now, another practical consideration is that online learning is much less popular or predominant than batch learning, and so a lot less research and software work has been put into online learning. So if you are a newcomer to the field, well, there's just not a lot of resources to learn from. Actually, you could just spend a day on Google and you'd probably find all the resources there are, because there are just not that many of them. From memory, there are probably just ten links on Google you can learn from about online learning. So it's a bit of a niche topic.
10:49
In terms of the fact that batch is such a predominant mode of building these ML systems, despite the fact that it's not very reflective of the way that the real world actually operates: why do you think it's the case that streaming or online machine learning is still such a niche topic and hasn't been more broadly adopted?
11:11
Sometimes it feels like I'm trying to preach a new religion, which is just a bit weird, because there are not a lot of us doing it. I also never try to force people into this, because there are obviously many good reasons why batch learning is still done.

11:27
Now, from a historical point of view, I think it's interesting, because we always used to use statistical models to explain data, and not necessarily to predict. You just had datasets, and you just wanted to understand, you know, what variables were affecting a particular outcome. So for instance, if you take linear regression, historically it's been used to explain the impact of global warming on the rise of the sea level, but not necessarily to predict what the impact on the sea level would be if the temperature of the globe were higher. But then someone said, let's use machine learning to predict outcomes in a business context, and that's why we have this big advent of machine learning. And we've kind of been using the tools that have been lying around. So we've been using all these tools that we used to explain data, and now we've been using them for predicting.

12:27
But these models are static. The people who started doing linear regression never really worried about streaming data, because their datasets were small, their datasets were static. Well, the Internet didn't even exist, so there was no real notion of IoT or sensors or streaming data. So the fact is that we've never needed online models. And as a field, if you look at academia and the industry, we're very used to batch learning, and we're very comfortable with it. There's a lot of good software, and this is what people are being taught at university. So I'm not saying that online learning is necessarily better than batch learning, but I do think that the reason why batch is actually so predominant in comparison is because we are too used to it, basically. And I do think, and I see it every week, that there are people who are trying to rethink their job or their projects and say, maybe I could be doing online learning, it actually makes more sense. So I think it's a question of habit, basically.
13:36
For people who are assessing which approach to take in their ML projects, what are some of the use cases where online or streaming ML is the more logical approach, and what do the decision factors look like for somebody deciding: do I go with a batch-oriented process, where I'm going to have this large set of tooling available to me, or do I want to use online or streaming ML, because the benefits outweigh the potential costs of plugging into this ecosystem of tooling?
14:07
I'll be honest: I think it always makes sense to start with a batch model. Why? Because, you know, if you're pragmatic, and you actually have deadlines to meet, and you just want to be productive, there are so many good solutions to train a batch model and deploy it. So, you know, I would just go with that to start with. And then, yeah, there's the question of: could I be doing this online? So I think there are two cases. There are cases where you need it.

14:36
I have a great example: Netflix. When they do recommendations, you know, you arrive on the website and Netflix recommends movies to you. Netflix actually retrains a model every night, or every week, but they have many models anyway. They are learning from your behavior to kind of retrain their models and update their recommendations. Right? And there's a team at Netflix working on learning instantly. So if you are scrolling on the Netflix website and you see a recommendation from Netflix, the fact that you did not click on that recommendation is a signal that you do not want to watch that movie, and that the recommendations should be changed. So if you were able to have a model, for instance maybe in your browser, that would learn in real time from your browsing activity, and that could update and learn on the fly, that would be really powerful. And the only way to do that is to have a model per user that is learning online. You cannot just use batch models for that. Yeah, you can't just take all the history of data and fit a model every time a user scrolls past or ignores a movie; it would be much too heavy and not at all practical. So sometimes the only way forward is to do online learning. But again, this is quite niche, like Netflix recommendations, which, I mean, are obviously working reasonably well, I believe, judging just from their market value. But if you are pushing the envelope, then sometimes you need online learning.
16:07
Now, another case is when you do not necessarily need it, but you want it because it makes things easier. A good example I have is: imagine you're working on an app that categorizes tickets, for instance in help desk software. So, you know, you go on a website and you fill in a form, or you send a message or an email, and you're saying you have some problem, maybe with a reimbursement for something you bought on Amazon. And then, you know, there's a customer service team behind that, human beings that are actually answering those questions. And it's really important to be able to categorize each request and put it into a bucket, so that it gets assigned to the right person.

16:47
And maybe the product manager has decided that we need a new category. So there's this new category, and your model is classifying tickets into one of several categories. If you introduce a new category, it means that you have to retrain the model to incorporate it. I was in discussion with a company, and their budget was such that they were only able to retrain the model every few months. So if you introduced a new category into your system, the model would only pick it up and predict it after a few months. That sounded kind of insane, and, you know, it wasn't reactive at all. I'm not aware of the exact details, but it just seemed too expensive for them to retrain their model from scratch. If they were using an online model, well, potentially that model could just learn the new tickets on the fly. You know, you get this feedback loop where you introduce a new category, people send an email, maybe a human assigns that ticket to a category, so that becomes a signal for the model; the model picks it up, learns, and, yeah, it's going to incorporate that category into its next predictions. So that's a scenario where you don't necessarily need online learning, but actually online learning just makes more sense and makes your system easier to maintain and more reactive, basically.
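(As a rough illustration of that ticket scenario: a sketch of a River text classifier picking up a brand-new category the moment a human labels one example with it. The category names and messages are made up.)

```python
from river import feature_extraction, naive_bayes

# Bag-of-words features piped into a multinomial Naive Bayes classifier.
model = feature_extraction.BagOfWords(lowercase=True) | naive_bayes.MultinomialNB()

model.learn_one("I was charged twice for my order", "billing")
model.learn_one("My package never arrived", "shipping")

# The product manager adds a "returns" category; a single labeled ticket is
# enough for the model to start predicting it, with no retraining from scratch.
model.learn_one("How do I send this jacket back?", "returns")
print(model.predict_one("I want to send my order back"))
```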
18:13
There are a number of interesting things to dig into there. One of the things that you mentioned is the idea of having a model per user in the Netflix example. And I'm wondering if you could maybe talk through some of the conceptual elements of saying: okay, I've got this baseline structure for how I'm going to build the model, here is the initial training set, I've got this model deployed, and now every time a specific user interacts with this model, it is going to learn their specific behaviors and be tuned to the data that they are generating. Would you then take that information and feed it back into the baseline model that gets loaded at the time the browser interacts with the website? What are some of the ways to think about the potential approaches for saying: okay, I've got a model, but it's going to be customized per user? And how do you manage the kind of fan-out, fan-in topologies that might result from those event-based interactions spread out across a number of users or entities?
19:20
It sounds insane when you say it, because to have one model per user and, you know, have it deployed in the user's browser, or on their phone, or even on an Apple Watch, it does kind of sound insane. But it is interesting, I guess. I don't think there are so many... I mean, I'm not aware of a lot of companies that would have the justification to actually do this, and I've never had the occasion to actually work in a setting where I would do this.

19:48
But I had one good example, where I was kind of doing some pro bono consulting. It was this car company where, for the onboard navigation system, they wanted to build a model that could guess where you're going. So basically, depending on where you left from, if you left home in the morning, you're probably going to work. And they would then use this to, you know, send you news about your itinerary or things like that. They really needed a model that would just be able to learn online, and they made the bold decision to say, okay, we're going to embed the model into the car. It's not going to be, like, a central model that's, you know, hosted on some big server that the car interrogates; the intelligence is actually happening in the car. And when you think about that, it's interesting, because it creates a decentralized system. It actually creates a system where you don't even need the Internet for the model to work, and that removes so many operational requirements. Actually, now that I think about it, since I'm talking about cars: that's actually what Tesla is doing. Right? They're computing and making decisions inside the car, you know, doing a bunch of stuff, and they're also communicating with other servers. But for the actual compute, they actually have GPUs in the car, doing compute with their deep learning models and whatnot. So it's definitely possible to do this. Right? But it's clearly not something that, I don't know, every company would go through or would have the need to do.
21:17
It's interesting also to think about how something like that would play into the federated learning approach, where you have federated models: there is that core model that's being built and maintained at the core, and then, as users interact at the edge, whether it's on device or in their browser, it loads a federated learning component that has that streaming ML capability. So the model evolves as the user is interacting with it on their device, and then that information is sent back to the centralized system to feed back into the core model. That way you can have these kind of parallel streams, where the model the user is interacting with is being customized to their behavior at the time that they're interacting with it, but it does still get propagated back into the larger system, so that the new intelligence is able to generate an updated experience for everybody who then goes and interacts with it in the future.
22:17
Yeah, that's really, really interesting. So, first off, the fact is that I'm actually still young, and there are so many things that I don't know; I don't have, like, the technical savvy to be able to suggest ways forward. But this is obviously the kind of thing that federated learning is about; I think about the federated learning work from Google, they have a paper discussing these things.

22:41
I think there's a really simple pattern, if you, the listener, want to do something like this, which is to have a model that is retrained, maybe once a month, and that you just train in batch. And that model is going to be like a hydra: you're going to copy it and send it to each user. Then each copy, for each user, is going to be able to learn in its own environment. And, for instance, a good idea would maybe be to increase your model's learning rate, so that every sample that the user gives you matters a lot. So, if we take the Netflix example, you would have your main recommendation system model that you would just train in batch, you know, using all the tools that the community uses. But then you would embed that into each person's browser, and maybe you do this once a month. And then that model, for each user, would be a copy, a clone, just a separate model now. And, you know, it would keep learning, in an online manner. So maybe your model was trained in batch initially, but now, for each user, it's actually being trained online. For instance, you can do this with factorization machines, which can be trained in batch but also online. And, yeah, you would use a high learning rate, so that every sample matters a lot, basically. And so you, the user, end up tuning your model. And I don't know how YouTube does it, but I do imagine they have some sort of core model that's just learning how to make good recommendations. But obviously, at YouTube, there are some rules that make it so that, you know, recommendations are tailored to each user. I don't know if this is done online, and I don't know if it's actually machine learning; it's probably just rules or scores. But, yeah, I think it's a really fun idea to play around with, and I do think that online learning enables this even more.
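(A rough sketch of that clone-per-user idea. Max mentions factorization machines; for brevity this uses one of River's linear models instead, and the learning-rate value, user id, and feature names are made-up assumptions.)

```python
import copy
from river import linear_model, optim

# Base model, hypothetically trained centrally in batch once a month.
# A deliberately high learning rate makes each user's feedback matter a lot.
base_model = linear_model.LogisticRegression(optimizer=optim.SGD(lr=0.3))

user_models = {}

def get_user_model(user_id):
    # Each user gets their own clone, which then learns from their clicks alone.
    if user_id not in user_models:
        user_models[user_id] = copy.deepcopy(base_model)
    return user_models[user_id]

model = get_user_model("alice")
x = {"hour": 21, "is_weekend": 1}     # features behind one recommendation
y_pred = model.predict_proba_one(x)   # probability that she clicks
model.learn_one(x, True)              # she clicked: update her clone only
```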
24:35
As far as the operational model for people who are using online and streaming machine learning: if they're coming from a batch background, they're going to be used to dealing with the train, test, deploy cycle, where I have my dataset, I build this model, I validate the model against the test dataset that I've held out from the training data. Everything looks good based on the area under the curve, or whatever metrics I'm using to validate that model, so then I'm going to put it into my production environment, maybe being served as a Flask app or a FastAPI app. And then I'm going to monitor it for concept drift, but eventually I say, okay, this is no longer performing up to the specifications, so now I need to go back and retrain the model based on my updated datasets. And I'm wondering what that process looks like for somebody building a streaming ML model with something like River, and how you address things like concept drift, and how concept drift manifests in this streaming environment where you are continually learning, so that you don't have to worry about, you know, the real-world data that I'm seeing being widely divergent from the data that I used to train against.
25:44
I use to train against. There's so many
25:46
things to
25:46
dig into and I'll try to give a comprehensive
25:49
answer. So first
25:51
off, it's important to understand that whoever
25:53
itself is to
25:55
online learning, what psychic learn is to batch
25:57
learning. So it's
25:58
only
25:59
desires to be a machine
26:02
learning library. Right?
26:03
So it just contains
26:06
basically algorithms routines
26:08
to train a model and
26:10
to have models that can learn and
26:12
predict. And what you're going towards
26:14
to your question is in our ops. So
26:17
how does the
26:19
life cycle look like for an online
26:21
model? So this is obviously something
26:23
that I'm
26:23
spending a lot of time to look into.
26:26
The answer is that the
26:28
first problem is that online learning
26:30
enables different
26:32
patterns. And I
26:34
believe
26:34
that these patterns are simple to
26:36
reason about. So as you said,
26:38
you usually start off by
26:41
training
26:41
a model, then evaluating against the
26:43
test set and maybe going
26:45
to report to your stakeholders and show
26:47
them the performance and guarantee that
26:50
you
26:50
know, the the centers of full
26:51
positives is underneath a
26:53
certain threshold, then yes, we can
26:56
diagnose cancer with this model or
26:58
not. And yeah, and then you kind of
27:00
deploy it. Maybe if you get lucky, if you get the
27:02
approval and you sleep, you
27:05
know, well or not well at night depending
27:07
on how much you trust your model, but there's this
27:09
notion of you deploy a model and
27:11
it's like a baby in the world
27:13
and this baby is not meant to keep
27:15
learning. So you know, it's a
27:17
lie to believe that if you deploy a
27:19
batch model, you're going
27:21
to be able to just let
27:23
it you know, learn by itself. There's actually main things
27:25
maintenance has to happen there. So the
27:28
reality is that any machine learning projects,
27:30
you know, any serious projects, is
27:33
never finished. It's like software,
27:35
basically. We have to think of machine learning.
27:37
Projects have software engineering. And
27:39
obviously, what we all know that you
27:41
never just deploy a feature, a software
27:43
engineering feature, and just never look at it.
27:45
You monitor it. You
27:47
take care of that, well, investigating
27:49
bugs and whatnot. So batch learning in that
27:51
sense is a bit it's a
27:53
bit difficult to work with because, obviously, you
27:55
can have if your model is
27:57
drifting, So meaning that its performance
27:59
is dropping because the data
28:01
that it's looking at is different
28:03
than the training set it was trained on.
28:06
basically have to be very lucky if you want your model
28:08
to pick up performance. So you're
28:10
gonna have to do something about it. And, yeah, you can just
28:12
retrain it. But what
28:14
you do online learning is that you can
28:16
have the model just keep learning as you
28:18
go. So
28:19
there is no distinction
28:22
between training and testing.
28:23
What online learning encourages to do
28:26
is to deploy your
28:28
model as soon as possible.
28:30
So say you have
28:32
a model, and it's not been trained on anything. Well,
28:34
you can put it into production straight
28:36
away when samples arrive, it's gonna make
28:38
a prediction. So
28:39
maybe you
28:40
know, user arrives on your website, you make
28:43
recommendations, that's your prediction.
28:44
And then your
28:45
user is going to click or
28:48
not on something you recommended to
28:50
her. And
28:51
that's
28:52
gonna be feedback your free model to keep
28:55
training.
28:55
So that
28:56
is already a guarantee that your model is
28:58
kind of up to date and kind of learning.
29:00
And so that's been interesting because
29:02
just enables
29:03
so many good patterns.
29:05
You
29:06
can still monitor your the
29:08
performance of your model. If the performance of
29:10
your online model is dropping,
29:12
I mean, I haven't seen that yet, but it probably
29:15
means that your problem is
29:17
really hard to solve. So
29:19
So, a really cool thing I stumbled upon was this idea of test-then-train. Imagine the scenario where you have a classification model that is running online. What happens is that your model is generating features. Say the user arrives on the website, and the features are: what's the time of day, what's the weather like, what are the top films at the moment? You generate these features at a certain point in time t, and then later on you get the feedback: was your recommendation a success or not? That's training data for your model. You use the same features that you used for prediction, and you use those features for training. And so you can see that there's a clear feedback loop: the event happens, the user comes on the website, your model generates features, and then, at some later point in time, the feedback arrives. Was the prediction successful or not, and if not, by how much was it off? And then, yeah, you can take this feedback, join it with the features that were generated for predicting, and use that as training data. So you essentially have, like, a small queue or database that's storing your predictions, your features, and the labels that make up your training data. And the big difference here is that you do not necessarily have to do a train-test phase before deploying your model. You can just deploy the model initially, it just learns online, and then you can monitor it.
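(In code, that test-then-train loop is usually written as: predict first, score the prediction once the feedback is known, then learn. A minimal sketch, with a built-in dataset standing in for live feedback:)

```python
from river import datasets, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)  # test: predict before seeing the label
    metric.update(y, y_pred)       # score against the feedback
    model.learn_one(x, y)          # train: only then learn from the sample

print(metric)
```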
31:00
A really cool thing is that, if you do this, you have a log of people coming on your website, of you making predictions and generating features, of people clicking around and interacting with your recommendations. This creates a log of what's happening on your website. And this log, what's really cool, is that you can process it offline, after the fact, in the same order it arrived in, and you can replay the history of what happened. It means that, on the side, when you're developing a better model, you can just take this log of events, run through it, and do this prediction-and-training dance, the whole life cycle. You know, you're replaying the feedback loop, and then you have a very accurate representation of how your model would have performed on that sequence of events. That's really powerful, because the way you're designing your model there is that you have a rough sketch of a model, which you deploy, and then you log what that model sees. You can evaluate the performance of that model, but more importantly, you have a log of the events. And then, when you're designing version two of your model, you have a very reliable way to estimate how your new model would have performed. And that's really cool, because when you are doing train-test splits in batch learning, that is not representative of the real world. The problem with train and test is that people are spending so much time making sure that their train-test split is correct, when in fact even a good train-test split is not a good proxy of the real world. A good proxy of the real world is to just replay through history. And that's something that you can only do with online learning. That's really cool.
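(River ships a helper for exactly this replay-and-evaluate pattern, called progressive validation. A minimal sketch, again with a built-in dataset standing in for a real event log:)

```python
from river import datasets, evaluate, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

# Replays the stream in order, predicting on each sample before learning from it.
score = evaluate.progressive_val_score(
    dataset=datasets.Phishing(),
    model=model,
    metric=metrics.ROCAUC(),
)
print(score)
```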
32:52
Now, to come to your point about concept drift. There are many different kinds of concept drift, and Chip Huyen has a very good rundown on her blog. What matters, really, is that the result of concept drift is usually that your model is not performing as well; there's going to be a drop in performance. So the first thing you see in your monitoring dashboard is that the metric has dropped. And then, when you dig into it, you see that maybe there's a class imbalance, or the correlation between a feature and the class has changed, or something like that. Essentially, it's saying that the data the model has been trained on is not representative of the new data that it's seeing in production. Again, I have said this a few times, but online models, assuming you put them in place with the correct infrastructure around them, are able to learn from new data as soon as possible. That just guarantees that your model is as up to date as possible, so you're basically mitigating drift the best you can. Drift is always possible; you can always, obviously, have a model that's degrading or that's just going haywire. That's not necessarily related to the online learning aspect of things.

33:59
And there are also ways to cope with this. For instance, Dan Crankshaw and his team at Berkeley developed a system called Clipper. It's kind of an ops tool; it's a research project, and I think it's been deprecated, but there's a story there. It's a project where they have a meta-model, which is kind of looking at many models that are in production and deciding online which model should be used for making a prediction. So it's kind of like a teacher selecting the best student at a certain point in time, and, you know, seeing throughout the year how the students are evolving, which students are getting better or not. And you can do this with bandits, for instance. So, just to say that there are many ways to deal with concept drift, but online models, again, help to cope with concept drift in a just-in-time way; it actually just makes more sense than with batch models.
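(River also bundles drift detectors that can run alongside a model and flag such drops. A sketch using ADWIN on a synthetic stream whose mean shifts halfway through; the drift_detected attribute follows recent River releases, so treat the exact API as an assumption on older versions.)

```python
import random
from river import drift

random.seed(1)
detector = drift.ADWIN()

# A stream whose distribution changes at index 1000.
stream = [random.gauss(0, 1) for _ in range(1000)] + \
         [random.gauss(3, 1) for _ in range(1000)]

for i, value in enumerate(stream):
    detector.update(value)       # feed e.g. the model's rolling error signal
    if detector.drift_detected:  # recent River API; older versions differ
        print(f"Drift detected around index {i}")
        break
```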
34:57
And so, digging now into River itself, can you talk through how you've implemented that framework, and some of the design considerations that went into: how do I think about exposing this online learning capability in a way that is accessible and understandable to people who are used to building batch models?
35:16
So, I like to think of River more as a library than a framework. If I'm not mistaken, a framework kind of forces you into a certain behavior or way of doing things; there's an inversion of control, where the framework is kind of designing things for you. You know, if you look at Keras and PyTorch, Keras is much more of a framework in comparison to PyTorch, because with PyTorch, for me, the reason why it was successful is that it kind of gave the inversion of control back to the user. You can do so many things with PyTorch; it's very flexible, and it doesn't impose a single way of working. So we have that in mind with River. River, again, is just a library to do online machine learning; it just contains the algorithms. It doesn't really force you to load your data a certain way. You could use it in a web app, you could use it offline, you could use it on an offline IoT sensor. River is not concerned with that. It's just a library that is agnostic to all of that.

36:15
Now, to come to what River is: in terms of online machine learning, it's general purpose. It's not dedicated to anomaly detection or forecasting or classification; it covers all of that. That's the ambition, at least. Just a note there: it's actually really hard to develop and maintain, because the other maintainers and I are not necessarily specialized in all these different domains, and we kind of have to, you know, one day I'm working on the forecasting module, the other day I'm working on something else. It's kind of crazy, but it's still fun. What we do provide is a common interface. So, just like scikit-learn, every piece of the puzzle in River follows a certain interface. We have transformers, we have regressors, we have anomaly detectors, and of course we have classifiers, binary and multi-class. We also do forecasting on time series. And we guarantee to the user that each model follows a certain API. Every model is going to have a learn method, so it can just learn from new data, and they usually have a predict method to make predictions. Forecasters will have a forecast method; the anomaly detectors will have a score method, which is supposed to output an anomaly score. So the strength of River is to provide this consistent API for doing online machine learning. And it's a bit opinionated, because, well, much like scikit-learn, it just says: okay, you're going to have learn and predict. But that's a reasonable thing to impose, and it makes it easier for users to swap in new models, because they have the same interface.
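(To illustrate that shared API, here is a minimal sketch with one classifier and one anomaly detector; the model choices and feature values are illustrative.)

```python
from river import anomaly, linear_model

clf = linear_model.LogisticRegression()
detector = anomaly.HalfSpaceTrees(seed=42)

x = {"amount": 42.0, "hour": 3}

# Classifiers: learn_one(x, y) and predict_one(x).
clf.learn_one(x, False)
print(clf.predict_one(x))

# Anomaly detectors: learn_one(x) and score_one(x); higher means more anomalous.
detector.learn_one(x)
print(detector.score_one(x))
```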
37:56
So again, just to conclude on what I said at the start: we made the explicit choice to follow the single responsibility principle, in that River only manages the machine learning aspect of things, not the deployment and whatnot. And so if you want to use River in production, and we do see people doing this, you have to worry about some of the details yourself. Right? If you want to deploy it in a web app, well, we do not help with any of that at the moment; you have to deploy your own web app.
38:24
As far as the overall design of the framework: you mentioned that it actually started off as two separate projects, and then you went through the long process of merging them together. I'm wondering how the overall design and goals of the project have changed or evolved since you first started working on this idea.
38:41
The reason, actually, why the merger between creme and scikit-multiflow took us a long time was that, although we were both online learning libraries, there were some subtle differences which were kind of important. My vision, with creme at the time and River now, is that we should only cater to models which are what I call pure online models, in that they can learn from a single sample of data at a time. But there are also mini-batch models, so models which can learn from streaming data, but in chunks, like in mini-batches of data. And scikit-multiflow was kind of doing this, much like PyTorch and TensorFlow and, you know, deep learning models. And so I kind of had to convince them that there were reasons why pure online was just a bit better. Why? Because, you know, if you think about, again, a user interacting with a website, or just any web request, or things that are happening in real life, you want to learn as soon as possible. You don't want to have to wait for, you know, thirty-two samples to arrive, to have a batch to be able to feed to your model. You could, obviously, but it just made sense to me to have something simpler, where we only care about pure online learning, because it means that you don't have to store anything; you just learn on the fly. And so, I guess, the interactions I had with users and whatnot kind of confirmed this idea. There was a bit of a standoff when we did the merger, because maybe I was a bit too opinionated, but it has since been proved that it actually made sense, and it's not a decision that we look back on; we're really happy with this stuff.

40:21
So River is, you know, arguably now a success. It's working, it's alive, it's breathing; the project has been going on for two and a half years. We have a steady intake of users that are adopting it, and we see this from the emails we receive, from the GitHub discussions and issues, and just general feedback. So the idea of having a library that is only focused on the ML and just the algorithms is something that I'm just going to keep going with, because it looks like it's working and it looks like this is what people want. A simple example is: hey, I want to compute a metric online; well, River aims to be the go-to library to answer those kinds of questions. Right?
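(For example, River's stats module maintains running statistics one value at a time, without storing the data. A minimal sketch:)

```python
from river import stats

mean = stats.Mean()
var = stats.Var()

# Update the running statistics one value at a time.
for x in [5.0, 7.5, 9.0, 6.25]:
    mean.update(x)
    var.update(x)

print(mean.get(), var.get())
```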
41:08
The truth is that people don't just need that; they also need ways to deploy these models and do MLOps online. So the next steps are for us to build new tools in that direction.

41:22
Now, the initial development of River was a bit fast and furious. The aim was to implement as many algorithms as possible, you know, just to cover the wide spectrum of machine learning. Now that we've covered quite a few topics, and we also have day jobs (when I was developing creme I was doing a PhD, so, ironically, I had more time then than I do now that I have a proper job), we value our time a bit more, and we're not in this fast and furious mode. We kind of just focus on picking certain models which we see value in, and spending the time to implement them properly.

42:00
And the final aspect is that we see that people, our user base, don't just want algorithms. They also want us to educate them. They have general questions towards, you know, what is online learning? How do I do it? How do I decide what model to use? All the questions that we're covering in this podcast, basically. So I think there's a huge need for us to kind of move into the educational aspect. When I was younger, scikit-learn was my bible. I kind of just spent so much time, not even using the code, but actually just reading through the documentation, because it's just so excellent. Obviously, that takes a lot of time, a lot of energy, people, contributors, and help, but it's definitely something towards which we are moving.
42:46
In terms of the overall process of building a model using something like River: when people are building a batch model, they end up getting a binary artifact out that is the entire state of that model after it has gone through that training process. And I'm curious if you can talk to how River manages the stateful aspect of the model as it goes through this continual learning process, both in a, you know, sandbox use case where somebody's just playing around on their laptop, but also as you push it into production, where maybe you want to be able to scale out serving this model across a fleet of different servers, and just some of the state management that goes into being able to continually learn as new information is presented to it.
43:36
So the great advantage of batch learning is that, once you train your model, it's essentially a pure function. There are no side effects; the, you know, decision process that's underlying the model is not going to change. So you can push the envelope and compile it. You can pickle it, you can convert it to another format; that's what ONNX does. You can compile it so it can run on a mobile device. I mean, it does not need the ability to train anymore. It's basically just a function that takes an input and outputs something. So there's also a good reason why batch learning is predominant there. But with River, it's different, because online models need to keep this ability to learn, as we've been saying. It's actually kind of straightforward: the internal representation of most models in River is fluid, dynamic. It's usually stored in dictionaries that can increase and decrease in size. So imagine you have a new feature that arrives in your stream; well, every model we have copes with that. They're not static: if a new feature appears, they handle it gracefully. For instance, a linear regression model is just going to add a new weight to its internal dictionary of weights.

44:57
Now, in terms of serialization and pickling and whatnot, River is mostly pure Python. River basically stands on the shoulders of Python, very much so. We do not depend very much on NumPy or pandas or SciPy; we mostly depend on the Python standard library, and we use dictionaries a lot. And that plays really nicely with the standard library: it's very easy, you can take any River model, pickle it, and just save it. You can also just dump it to JSON or whatnot.
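(A minimal sketch of that, using only the standard library; the toy sample is made up.)

```python
import pickle
from river import linear_model

model = linear_model.LinearRegression()
model.learn_one({"x": 1.0}, 2.0)

# Snapshot the live model; the restored copy can keep learning where it left off.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
restored.learn_one({"x": 2.0}, 4.0)
print(restored.predict_one({"x": 3.0}))
```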
45:30
Also, the paradigm of "you train a model, you pickle it, and you have an artifact that you can upload anywhere" is a bit different with online learning, because you would treat this differently. You would maintain your model in memory. So if you have a web server serving your model, you would not just load the model to make a prediction; you would just keep it in memory and make predictions with it. Because it's in memory, you don't have to load it every time. And then, when a sample arrives, your model is right there in memory, and you can make it learn from that. So, yeah, I think the big difference is that you hold your model in memory, rather than pickling it to disk and loading it when necessary.
46:08
In terms of the use of dictionaries as that internal state representation: as you said, it gives you the flexibility to be able to evolve with the updates in the data. But at the same time, you have this heterogeneous data structure that can be mutated while the system is in flight, and you don't necessarily have a strict schema being applied to it. And I'm just curious if you can talk to the trade-offs of having that flexibility, but also lacking some of the validation and, you know, schema and structure information that you might want in something that's dealing with these volumes of data.
46:49
So, yeah, we do use dictionaries, and the advantages of dictionaries are manifold. First of all, a very important thing is that dictionaries are to lists what pandas DataFrames are to NumPy arrays: a dictionary has names, and that's very important. It means that each one of your features actually has a name attached to it. And I find that hugely important, because, you know, we always see features as just numbers, but they also have names, and that's just really important. Imagine you have a bunch of features coming in. If that was a list or a NumPy array, you'd have no real way of knowing which column corresponds to which variable. If you switched two columns with each other, that could just be a really silent bug, which would affect you; whereas if you name each feature, then if the column order changes, well, the names of the columns are permuted too, so you can identify that. So what's really cool with dictionaries, and with River, is that the order of the features you receive doesn't matter, because we access every feature by name and not by position. Dictionaries also allow you to be mutable in size. So, you know, if a new feature arrives, or a feature disappears between two different samples, that just works, which is really cool. Also, dictionaries, when you think about it, are naturally sparse. Imagine that on the Netflix product, the features that you receive are the name of the user, some counts, or, you know, the date. You can just store sparse information in a dictionary. That's kind of really useful.
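(A small sketch of that flexibility: consecutive samples with different, sparse key sets are handled gracefully, and the model's own state is a dict of named weights. Feature names here are made up.)

```python
from river import linear_model

model = linear_model.LogisticRegression()

model.learn_one({"clicks": 3.0, "hour": 21.0}, True)
model.learn_one({"hour": 9.0, "is_weekend": 1.0}, False)  # new key appears

# Missing features are simply absent; new ones got a weight on the fly.
print(model.weights)
```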
48:29
There's this robustness principle that we
48:32
follow with ever. So robustness
48:34
principle is that we are used to
48:36
be conservative in what you do,
48:38
but labeling what you accept.
48:40
So we will be very labeling that
48:42
accepts heterogeneous
48:44
data as you said. So dictionaries are
48:46
different size. Dictionaries would have which
48:48
have different orders or whatnot, but
48:50
that is really flexible for users.
48:53
So a common use cases to deploy mover in
48:55
the web app. And in the web app, you're
48:58
receiving JSON data a lot of
49:00
times. So
49:02
the fact that JSON data has a one to one relationship with
49:04
hyphen dictionaries makes it very easy
49:06
to integrate with into a
49:08
web app. If you have a
49:11
regular batch model, you have to
49:13
mess about with casting the
49:15
JSON data to
49:16
a numpy array
49:18
first. And, you know, that has a cost actually. It actually has a cost
49:20
because although numpy,
49:22
PyTorch, TensorFlow are,
49:25
you know, good at processing matrices, there's actually
49:28
a cost that comes with
49:30
taking native data, such as dictionaries,
49:32
and casting
49:34
them to a higher order data structure, such as
49:36
a numpy array. That has a
49:38
real cost in a web app where, you
49:40
know, if you're measuring things in terms of
49:42
milliseconds, well,
49:44
you're spending a lot of your time just
49:46
converting your JSON data to Numpy.
49:49
Well, with River, because it
49:51
consumes dictionaries as well, the
49:53
data you receive, whether you're coding
49:55
in Django, Flask, FastAPI. The data you
49:57
receive in your request is a dictionary,
49:59
so you don't have to convert the data.
50:02
It just runs. So actually, if you take a River
50:04
model, like a linear regression in River and a
50:06
linear regression in torch, it's
50:08
actually gonna be much faster in
50:10
River because there's no
50:12
conversion cost. Plus, the features
50:14
are named. And plus, you just
50:16
don't have to worry about these features being in a mixed
50:18
order or anything. So it just makes a lot
50:20
of sense to use dictionaries in that sense.
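As a hedged sketch of the web app point (the endpoints and payload shape here are made up, not River's or the show's), a River model can sit directly behind a JSON API with no conversion step:

```python
from flask import Flask, jsonify, request
from river import linear_model, preprocessing

app = Flask(__name__)
model = preprocessing.StandardScaler() | linear_model.LinearRegression()

@app.route("/predict", methods=["POST"])
def predict():
    x = request.get_json()  # already a dict: no casting to numpy needed
    return jsonify({"prediction": model.predict_one(x)})

@app.route("/learn", methods=["POST"])
def learn():
    payload = request.get_json()  # assumed shape: {"x": {...}, "y": 3.2}
    model.learn_one(payload["x"], payload["y"])
    return jsonify({"ok": True})
```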
50:22
But the pitfalls. Obviously, it's not
50:24
perfect. Although, actually, I kind
50:26
of disagree that there's a problem with the
50:28
lack of structure in dictionaries. I actually think that
50:30
dictionaries are fine, because if you wanted
50:32
to, you could create, like, a schema in
50:34
Python. You can actually use a data class,
50:36
and you can convert that to a dictionary and feed
50:38
that into your
50:39
model. The data
50:40
class helps you to create structure. So I
50:42
don't think that's really a problem. Quite the contrary, I think.
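A quick sketch of that data class idea (the class and fields here are illustrative, not from the episode):

```python
import dataclasses

@dataclasses.dataclass
class Trip:
    distance_km: float
    hour: int
    is_weekend: bool

trip = Trip(distance_km=3.2, hour=9, is_weekend=False)
x = dataclasses.asdict(trip)  # {'distance_km': 3.2, 'hour': 9, 'is_weekend': False}
# The dict can now be fed to model.learn_one(x, y) / model.predict_one(x).
```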
50:44
The fact is also that dictionaries
50:47
can be nested. Maybe the features that you're feeding to the
50:49
model don't have to be a flat dictionary. They can
50:51
actually be nested. And that's really cool too. You know, you
50:53
can have features for your user. You can have
50:55
features for your page,
50:57
features for the day, or
50:59
anything. Things that
51:00
you cannot necessarily do with a flat
51:02
structure, such as a data frame
51:04
or a numpy array. Anyway, I'm talking about benefits. I
51:06
should be talking about cons. But yeah,
51:08
I guess just the main con of
51:11
processing
51:11
dictionaries is that,
51:13
you know, if you want
51:13
a model to process a
51:16
million samples, it would
51:18
take much more time than
51:20
processing a million
51:22
samples with pandas or numpy. Because,
51:24
yeah, the point of River is to
51:26
process one sample
51:28
at a time,
51:30
but
51:30
not necessarily to process a million samples at a time. But
51:32
those are two
51:33
different problems. So although,
51:36
you know, if you take, say,
51:38
scikit-learn, their
51:40
goal is to be able to process
51:42
offline data really quick. But the
51:44
goal of River is
51:46
to process online data, single samples, as fast as possible. And you're
51:48
comparing apples and oranges if you wanna do the
51:50
comparison. It just doesn't work. So yeah.
51:52
Actually, you know what? I don't think
51:54
there are downsides to using
51:56
dictionaries. It just helps a lot.
51:58
And to confirm this, we
51:59
have a lot of users who tell us
52:01
about this. They say, well, it's actually fun to use
52:04
River. It just makes sense because
52:06
it's very close to the data structures I use in Python.
52:08
I don't have to introduce a new data structure
52:11
to my system. So
52:13
for somebody who is
52:16
using River to build a machine
52:18
learning model, can you just talk through
52:20
the overall process of going from idea through
52:22
development to deployment? I'm going
52:24
to rehash what I said before,
52:25
but I think the great
52:27
benefit of online learning and
52:30
River is that you can cut the R and
52:32
D phase. So I've seen so many projects
52:34
where there's an R and D phase
52:38
and the model, you know, gets validated at some point in time,
52:40
but there's, like, a real
52:42
big gap of time between the
52:44
start of the R and D phase and
52:48
the moment when the model is deployed. And the process
52:50
of using River, and a
52:52
streaming model in general, is
52:55
to actually, as I said, deploy the model as soon
52:57
as possible, monitor its predictions,
53:00
and
53:00
it's okay because
53:01
sometimes that model will, you
53:03
know, you can deploy it
53:06
in production, and those predictions
53:08
do not necessarily have to be served to
53:10
the user. So you just made the
53:12
predictions, you monitor them, and it creates
53:14
a log of training data
53:17
and predictions and features. And that's what
53:19
you call shadow deployments. You have
53:21
a model which is deployed,
53:24
is making predictions, but those predictions
53:26
are not being used, you know, to inform
53:28
decisions or to
53:30
influence the behavior of users. They just exist for the sake of
53:33
existing and for monitoring. One thing to
53:35
mention is that once you
53:35
deploy this model, that's
53:37
the phase where you want to
53:39
maybe design a new model. And
53:42
you're going to have
53:44
this model replace the
53:46
existing model in production or coexist with it,
53:48
because you have a meta-model or not. So
53:50
I mentioned that you can
53:52
take your log of events and
53:54
replay it in the order in which events
53:56
arrived and have a good idea of how
53:59
well your model would
54:01
have performed. That's
54:03
called progressive validation. So
54:06
it's just this idea that if you
54:08
have your log of events, every time
54:10
you're first gonna make a prediction and then you're
54:12
gonna learn from it.
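A minimal sketch of progressive validation with River, using one of the library's bundled datasets; the hand-rolled loop below it spells out the predict-then-learn order explicitly:

```python
from river import datasets, evaluate, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LinearRegression()

# Built-in helper: streams the data, predicts before learning, tracks the metric.
print(evaluate.progressive_val_score(
    dataset=datasets.TrumpApproval(),
    model=model,
    metric=metrics.MAE(),
))

# Equivalent hand-rolled loop.
model = preprocessing.StandardScaler() | linear_model.LinearRegression()
metric = metrics.MAE()
for x, y in datasets.TrumpApproval():
    y_pred = model.predict_one(x)  # predict first, before seeing the label
    metric.update(y, y_pred)
    model.learn_one(x, y)          # only then learn from the sample
print(metric)
```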
54:12
So I have a good example.
54:15
There's
54:15
a dataset on Kaggle called the New
54:17
York taxis dataset, and it's basically
54:19
a log of people
54:21
hailing a taxi. And
54:23
they depart from a position, and
54:26
they arrive at another position later in
54:28
time. And
54:30
so the goal of a machine learning system
54:32
in this case could be to predict how long the taxi
54:34
trip is going to last. So
54:37
when the taxi departs, you want
54:40
your model to make a prediction: how long is
54:42
this trip
54:42
going to last? And so maybe that's gonna
54:44
inform, I don't know, the cost
54:47
of the trip or
54:48
it's gonna help decision makers, you
54:50
know, manage the taxis or, I don't
54:52
know, whatever. But you can imagine
54:54
that this is a great feedback loop because you
54:56
have your model make a prediction. And then later,
54:59
maybe eighteen minutes later or something, you
55:01
have the ground truth arrive. So you know
55:03
how long the taxi trip actually lasted, and
55:05
that's your ground truth. And then
55:07
you can compare your prediction with the ground truth. And that
55:10
enables progressive validation because you have
55:12
a log of events. You have: when the
55:14
taxi trip
55:16
departs, what was the value my model predicted, what
54:18
features did I use
54:19
at prediction time?
55:21
And later on, I
55:23
have the ground truth. And
55:25
so I can just replay the
55:27
log of events for, you know, I
55:29
don't know, seven
55:32
days and progressively
55:34
evaluate my model. So I like
55:36
this taxi example because it's easy
55:38
to reason about and, you know, taxis are
55:40
easy
55:41
to understand. But the taxi example is really what online learning is
55:43
about. It's about this feedback
55:45
loop between predicting and
55:48
learning. And just to
55:50
remind you how it would be
55:52
with a batch model is that you would have your taxi
55:54
dataset and well, I don't
55:56
know. You would split your dataset in two. You
55:58
would have start of the week, end of the week, train your
56:00
model on the start of the week, evaluate on the rest
56:03
of the week. Oh, no. The data
56:05
I trained on for the start of the week
56:07
is not representative of the weekend.
56:09
Yeah. And it just becomes a
56:11
bit weird. It becomes this situation where
56:13
you're trying to reproduce
56:16
conditions in real life, but you're never really sure
56:18
of it. And you can only really know how well your batch model is
56:20
going to do once it's in
56:22
production. And
56:23
online learning, it just kind
56:25
of encourages you to go
56:28
for it, deploy your model straight away and not have to have
56:30
this weird R and D phase where you live
56:32
in a lab. You think you might be
56:35
right, but then you're not really sure. And, yeah, online learning
56:37
just brings you closer to deployment in my
56:40
opinion.
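As a hedged sketch of that taxi-style replay with delayed ground truth (the `trips` iterable and field name here are made up), River's progressive validation helper can account for the lag between a prediction and its label:

```python
import datetime as dt
from river import evaluate, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LinearRegression()

# trips is assumed to be an iterable of (features, duration_in_seconds)
# pairs, where the features include a "pickup_datetime" field.
print(evaluate.progressive_val_score(
    dataset=trips,
    model=model,
    metric=metrics.MAE(),
    moment="pickup_datetime",                    # when the prediction is made
    delay=lambda x, y: dt.timedelta(seconds=y),  # the label only arrives once the trip ends
))
```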
56:40
As
56:41
you have been developing
56:44
this project and helping
56:46
people understand and adopt it,
56:48
what do you see as some
56:50
of the conceptual challenges or complexities that
56:52
people experience as they're
56:54
starting to adapt their thinking to how
56:56
to build a machine learning model in this
57:00
streaming format versus the
57:02
batch oriented workflow where they do have to
57:04
think about the train test split, just
57:06
the overall shift in the
57:09
way that they think about developing these models?
57:11
It's a hard question, but I think
57:13
there's two
57:13
aspects. There's the online learning aspect, and
57:15
then there's the
57:18
ML ops aspect. Now in terms of ML
57:19
ops, I think I covered it enough,
57:22
but it's much like a batch model.
57:23
You
57:24
have to
57:26
deploy your model, which means maybe serving it behind an API. As I mentioned,
57:29
the ideal situation is to
57:31
have your model loaded
57:34
into memory, with it making predictions and
57:36
training. But that is really
57:38
easier said
57:40
than
57:42
done. The truth is that there's actually no framework out
57:44
there, which allows you to do this. You could
57:46
do this yourself. This is what we
57:48
see. I mean, we get users who ask
57:50
us questions by way of contact
57:52
on GitHub or in emails, and
57:54
they're asking us
57:55
how do I deploy a model? What
57:57
should
57:57
I be doing? And we always give
57:59
the same
57:59
answers. But
58:00
the fact is
58:02
that, you know, we have these users who have
58:04
basically embraced River and they understand it,
58:06
but then they get into the production phase.
58:08
And
58:08
that's not what we're trying to solve. Well, we
58:11
feel bad because
58:12
they are all making the same mistakes
58:13
in some way.
58:16
And River is not there to help them because that's
58:18
not the purpose of River. So, yeah,
58:20
there's
58:21
a lack of tooling
58:23
to actually just, you know,
58:25
deploy an online model. Oh, yeah.
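For what it's worth, here is a hedged sketch of one pattern users end up hand-rolling in the absence of such tooling (all names are illustrative): keep the model in memory, learn continuously, and checkpoint it so a restart doesn't lose everything. River models are regular Python objects, so pickling works:

```python
import pickle
import threading
import time

from river import linear_model, preprocessing

CHECKPOINT_PATH = "model.pkl"  # illustrative path

model = preprocessing.StandardScaler() | linear_model.LinearRegression()
lock = threading.Lock()  # the serving code should hold this around learn_one too

def checkpoint_forever(every_seconds=60):
    """Periodically snapshot the live model to disk."""
    while True:
        time.sleep(every_seconds)
        with lock:  # avoid pickling mid-update
            blob = pickle.dumps(model)
        with open(CHECKPOINT_PATH, "wb") as f:
            f.write(blob)

threading.Thread(target=checkpoint_forever, daemon=True).start()
```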
58:27
That's the ML ops aspect. I think
58:29
in terms of online learning,
58:31
a big challenge is that not everyone has the
58:34
luxury to have a
58:36
PhD during which you can spend
58:38
days and
58:40
nights going through online learning papers and trying to understand it. And
58:42
that's what I
58:43
and others had the
58:44
chance to do, but a lot of River
58:47
users, you
58:48
know, they see the
58:49
value of online learning and they want to
58:51
put it into production, but they have deadlines to meet.
58:53
Right? They have to ship their project in
58:55
six weeks and they just don't
58:57
have the time to understand things in detail.
59:00
So things like, well, I
59:02
just described progressive validation.
59:04
It kind of takes them a bit of
59:06
time to understand. And so
59:07
again, what we need to do is
59:10
to spend
59:10
more time creating resources or,
59:13
you know, just diagrams
59:16
to explain what online learning is about. And then, in terms
59:18
of library design, it's
59:20
really important. Right? If we want to
59:22
introduce a new method to all our
59:24
estimators, I would
59:26
be against it. Like, the whole point of River is to make it as simple as
59:28
possible so that people can
59:29
just, you know, be
59:31
productive,
59:31
and understand it.
59:33
So, yeah, I think, just to encapsulate those two
59:36
problems: it's that people do not necessarily
59:38
have the resources to
59:40
learn about online learning. And
59:42
then there are operational problems
59:44
around serving these models
59:45
in production. So it's
59:48
kind of
59:48
like a batch model because you
59:50
have to serve the model behind an API, you know, and you have
59:53
to monitor it. And these are things well, you know, that
59:55
are common to a batch model,
59:57
but there's the added complexity
1:00:00
of having your model being, you know,
1:00:02
maintained in memory and kept learning
1:00:04
and stuff and things that are
1:00:07
basically not common. I mean, if you have to
1:00:09
Google it or find something on GitHub, you
1:00:11
just kind of find these hacky
1:00:13
projects, but no real good tool allowing you
1:00:15
to do that, at least not yet. One of the things that we
1:00:17
didn't
1:00:17
discuss yet is the types
1:00:20
of machine learning use
1:00:22
cases that
1:00:24
River supports, where I'm speaking specifically to things like logistic
1:00:26
regressions and decision trees versus deep
1:00:28
learning and neural networks. And I'm just
1:00:30
wondering if you can talk to the
1:00:34
types of machine learning approaches
1:00:36
that River is designed to support
1:00:38
and some of the reasoning
1:00:40
that went into where you decided
1:00:43
to put your focus?
1:00:45
Well, River
1:00:45
again is a general purpose
1:00:47
library, so there's quite a few things.
1:00:49
There are some cases
1:00:50
or flavors of machine learning, which are
1:00:53
especially interesting when you
1:00:55
cast them in an online
1:00:57
learning scenario. So
1:01:00
If you're doing anomaly detection, for instance: you have
1:01:03
people
1:01:03
doing transactions in
1:01:05
a banking system, so they're
1:01:07
making payments. And you might
1:01:09
want to be doing anomaly detection to detect
1:01:12
fraudulent
1:01:14
payments. That is very much
1:01:16
a situation where you have
1:01:18
streaming data. And
1:01:19
so in that case,
1:01:20
you would like to be doing online anomaly
1:01:23
detection.
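A minimal sketch in the spirit of the payments example (the feature names and values are made up), using River's HalfSpaceTrees detector; the MinMaxScaler is there because HalfSpaceTrees expects features scaled to the [0, 1] range:

```python
from river import anomaly, preprocessing

scaler = preprocessing.MinMaxScaler()
detector = anomaly.HalfSpaceTrees(seed=42)

for payment in [{"amount": 12.0, "hour": 14}, {"amount": 9500.0, "hour": 3}]:
    scaler.learn_one(payment)
    x = scaler.transform_one(payment)
    score = detector.score_one(x)  # higher score = more anomalous
    detector.learn_one(x)          # unsupervised: learn from every event
    print(score)
```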
1:01:24
So we see that
1:01:26
every time we put out a notebook
1:01:29
or a new anomaly detection method,
1:01:31
a lot of people start using it.
1:01:33
We start having bug reports
1:01:36
and whatnot. So it's kind of surprising.
1:01:38
It's a good thing. But, yeah, I think there are
1:01:40
modules and aspects of
1:01:43
River which clearly bring a
1:01:46
lot of value to users. So that would be a
1:01:48
anomaly detection, but
1:01:50
also we have forecasting models. So
1:01:52
when you do online forecasting, that just
1:01:54
makes sense when you have sensors which are, I don't know, measuring the
1:01:56
temperature of something, keeping you up to date in
1:01:59
real
1:01:59
time. There's also a
1:02:01
good example I have is that we
1:02:03
have this engineer who's working on water pipes
1:02:05
in Italy. He's
1:02:07
trying to predict
1:02:08
how much water is going to
1:02:10
flow through certain points in his pipeline.
1:02:14
So he has sensors all over the
1:02:16
pipeline, and he's trying to just do a
1:02:18
forecasting model. And so it just makes so much sense for him
1:02:20
to be able to have his model run
1:02:22
online
1:02:22
inside the
1:02:24
sensors, or inside the IoT
1:02:26
system he's running.
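A hedged sketch of online forecasting in that spirit, using River's time_series module (the readings are made up, and the SNARIMAX hyperparameters would need tuning in practice):

```python
from river import time_series

# A simple autoregressive model: order-2 AR, no differencing, no moving average.
model = time_series.SNARIMAX(p=2, d=0, q=0)

for flow in [4.2, 4.1, 4.4, 4.3, 4.6]:  # pretend water flow readings
    model.learn_one(flow)

print(model.forecast(horizon=3))  # predictions for the next 3 readings
```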
1:02:28
So just all that to say
1:02:30
that there are some
1:02:32
more exotic parts of River,
1:02:34
such as anomaly detection and forecasting, which
1:02:36
probably bring more
1:02:38
value than the
1:02:39
classic modules such as,
1:02:42
you know,
1:02:42
linear regression and classification. Again, at
1:02:44
the start, I talked about
1:02:47
Netflix recommendations. So
1:02:49
we
1:02:49
have some very
1:02:51
basic
1:02:52
bricks to be able
1:02:53
to make recommendations. Well, we
1:02:55
have factorization
1:02:56
machines and we have some kind
1:02:58
of
1:02:58
ranking system so that if you have users and
1:03:00
items, you can build
1:03:02
a ranking of preferred
1:03:05
items for each user.
1:03:07
these kinds of exotic machine
1:03:09
learning cases, which provide
1:03:12
value, but require us
1:03:14
to spend a lot of time to
1:03:16
work
1:03:17
on them, basically. So
1:03:19
It's
1:03:19
very difficult for me and for
1:03:21
other contributors to be specialized in a
1:03:23
anomaly detection, time series
1:03:26
forecasting, recommendation. But,
1:03:27
yeah, all this to say that
1:03:29
River covers a wide spectrum:
1:03:31
you can do preprocessing, you can
1:03:33
extract features, you can do classification, regression,
1:03:35
and forecasting, anything. But
1:03:38
because it's online, it's just a bit
1:03:40
unique. In your
1:03:40
experience of working
1:03:43
with the River library and
1:03:45
working with end users of
1:03:47
the tool? What are some of the most
1:03:49
interesting or innovative or unexpected ways that you've
1:03:51
seen it used? Well, unexpected
1:03:52
is a good one. There's one thing
1:03:54
that comes to mind. We have this person
1:03:56
who is a beekeeper. So a person who is, you know,
1:03:58
taking care of bees
1:03:59
and, I guess, once a week or
1:04:02
every two
1:04:04
weeks, they go to the beehive and they pick up the honey
1:04:06
in the beehive. And this
1:04:08
person has many beehives
1:04:10
and so they don't want to waste
1:04:12
their time going into the
1:04:13
beehive and actually checking if there's honey or
1:04:16
not. So they have sensors. They have
1:04:18
sensors that are in each
1:04:20
beehive.
1:04:21
They're kinda measuring how much honey
1:04:23
is in each beehive, and he likes to forecast how much
1:04:25
honey he is expected to have
1:04:27
in, you know, the
1:04:29
weeks to come, based on the weather, based on
1:04:32
past data, based on, I don't
1:04:34
know, whatever information
1:04:36
it uses. It was really, really
1:04:38
just fun to see this person
1:04:40
doing this hackish project where they
1:04:42
just thought it would be fun to use online learning to
1:04:44
do it. And again, it was in
1:04:46
an IoT context, so that made sense. As for innovative,
1:04:48
I was kind of impressed when I heard about
1:04:50
this project of having a
1:04:54
you know, a model within each car to determine where your
1:04:57
destination would be. So, I don't know, you
1:04:59
wake up in the morning, you take
1:05:02
your car, you
1:05:03
know, if it's a weekday, you're probably going to
1:05:05
work? It sounds silly obviously,
1:05:07
but having this
1:05:10
idea of having one model per user is kind of fun.
1:05:12
The most impactful project I heard
1:05:14
about, and which I know is being used,
1:05:18
is a situation where this company, they
1:05:20
prevent cyberattacks. So they
1:05:23
monitor this grid of
1:05:25
servers and computers. And
1:05:28
they're monitoring traffic between
1:05:30
machines. And so they're trying to
1:05:32
understand
1:05:34
when some
1:05:35
of the traffic is malicious and a hacker is basically trying
1:05:37
to get into a system. So you
1:05:40
can tell
1:05:40
by
1:05:42
looking at the patterns of
1:05:44
this traffic.
1:05:45
Right? And the
1:05:47
trick is that
1:05:49
behind the malicious
1:05:52
traffic, there are hackers, and they're constantly
1:05:54
changing their patterns,
1:05:56
their
1:05:56
access patterns to actually not be
1:05:59
detected. And so
1:05:59
if
1:05:59
you manage to label
1:06:02
traffic as, you
1:06:03
know, malicious,
1:06:04
well, you want your model to
1:06:06
keep learning. So they have this system where they
1:06:08
have, like, thousands of machines and they have
1:06:10
a few machines that are dedicated to
1:06:12
just learning from the traffic and
1:06:15
in real time they're
1:06:16
learning, detecting anomalous traffic,
1:06:18
sending it to human beings so
1:06:21
that they can actually verify it themselves,
1:06:24
label it, etcetera.
1:06:25
And so it's really cool to know that River has been used
1:06:28
in that context. Like, it
1:06:30
just made so much sense for them
1:06:32
to say, wow, we can
1:06:34
actually do this online and we don't have to
1:06:36
retrain all the time. And
1:06:36
batch learning was getting in their
1:06:39
way. They
1:06:39
had this system which was going a
1:06:41
thousand miles an hour, just hundreds
1:06:43
of thousands of data points coming in all the time. And batch learning
1:06:45
was just,
1:06:45
you know, it was just, again,
1:06:47
annoying for
1:06:48
them.
1:06:50
Having
1:06:50
a system that enabled them to do all this online just made sense for them. And to
1:06:52
know that you can do this at such a
1:06:55
high amount of traffic. It was
1:06:57
really cool and exciting. In
1:06:59
your
1:06:59
own experience of building the
1:07:02
project and using it for your own work, what are some of
1:07:04
the most interesting or unexpected or
1:07:06
challenging lessons that you've learned in
1:07:08
the process? I
1:07:08
think I'm just gonna focus a bit on the human aspect there. But overall, I
1:07:10
have been doing open source,
1:07:13
you know, quite a bit.
1:07:15
I've always had this approach
1:07:18
where I probably work too
1:07:20
much on new projects that I
1:07:22
make myself rather than on existing
1:07:25
projects. So I'd rather just do my own thing by
1:07:27
myself, rather than contribute to
1:07:29
existing stuff. And it's
1:07:31
not always that great, but it's just the way I
1:07:34
work. And so River is really the
1:07:36
first open source project where I work with
1:07:38
other people. So, like,
1:07:40
probably many people, a lot of my
1:07:42
open source work is, I just work on it
1:07:44
myself. And obviously, I've worked at
1:07:46
companies where
1:07:48
you probably have a review process and you work with people. But
1:07:50
this is the first open source
1:07:52
project where I really work with a a team
1:07:54
of people. And
1:07:56
it's fun. It's just really so much
1:07:58
fun. Like, just a month ago,
1:07:59
we
1:07:59
actually got to meet all together and
1:08:02
to have this like
1:08:04
informal reunion. So that was really
1:08:06
fun. And you realize
1:08:08
that, you know, after three years, there
1:08:10
are ups and downs, and there's moments where you
1:08:12
just do not want to work on River anymore,
1:08:14
and you want to, you know, you have work, you have friends,
1:08:16
girlfriends, whatnot. And so the only
1:08:18
way to subsist as an open source
1:08:20
project in the long term is to
1:08:23
have multiple people working on it. So doing
1:08:25
open source, you know, it's not realistic to do it on your own if you
1:08:27
want something to be successful and to actually
1:08:29
have an impact in the
1:08:31
long term. So it's
1:08:35
actually really important to just be nice
1:08:37
and to
1:08:37
have people
1:08:38
around you who help you. And
1:08:41
although not everyone contributes as much
1:08:43
as I do or core maintainers
1:08:45
do, people help a lot, and they keep things alive. Like, it's always a
1:08:47
joy when I open an
1:08:51
issue on GitHub. And I see that someone
1:08:53
from the community has answered the question, and I don't have to do anything. It
1:08:55
helps tremendously.
1:08:56
Yeah. We've already
1:08:58
talked
1:08:59
a bit about some of
1:09:01
the situations where online learning might not be the right choice. But for the case
1:09:03
where somebody is going to use an
1:09:08
online streaming machine learning approach, what are the cases
1:09:10
where River is the wrong choice and maybe there's a different library or framework that
1:09:14
would be better suited? Well, yeah. Again,
1:09:15
honestly, I think that online learning is the wrong
1:09:17
choice in ninety five percent of cases.
1:09:19
Like, you do not want to make
1:09:21
the mistake of thinking that your problem
1:09:24
is an online problem. You probably, most
1:09:26
of the time, have a batch problem that you can solve with a batch library. You know, I
1:09:28
mean, take scikit-learn now.
1:09:30
If you open it and you just
1:09:33
run it, it is always going to work reasonably well. So sometimes I would
1:09:35
just go for that. One thing we do get a lot is people asking how you
1:09:37
can do deep learning with River. So they
1:09:39
want to train deep
1:09:43
learning models online. So the answer is that we do
1:09:45
have a sister library that is called
1:09:47
river-torch, and
1:09:50
it's
1:09:52
dedicated to training torch
1:09:52
models online. But again, that is a bit
1:09:54
unfinished at the moment and still needs some work being done
1:09:56
on it. But, yeah, if
1:09:58
you want to
1:09:59
be doing deep
1:10:00
learning and you want to be working
1:10:03
with images and sound and, you know, unstructured data, River is not the right
1:10:05
choice right now, and you probably
1:10:07
have to be looking at something else. As
1:10:10
you
1:10:11
continue to build and iterate on the River
1:10:14
project, what are some of the things you
1:10:16
have planned for the near to
1:10:18
medium term or any applications of this online learning
1:10:20
approach that you're excited to dig
1:10:22
into? We have a public road
1:10:24
map. It's a Notion page.
1:10:26
We have a list of stuff we're working
1:10:28
on. That one mostly
1:10:30
has a list of algorithms to implement, and it's mostly there to,
1:10:34
you
1:10:35
know, let
1:10:36
people know what we're working on
1:10:39
and to encourage new contributors to work on something. So the
1:10:41
few contributors
1:10:44
we have just pick what they want
1:10:46
to work on and, you know, in their own order of preference. So for instance, me, this
1:10:48
summer, I decided
1:10:51
to work on online covariance
1:10:53
matrix estimation. An online covariance matrix
1:10:55
is kinda useful
1:11:00
because it's used a lot in financial trading, for instance. And if you
1:11:02
have an inverse covariance matrix that you can estimate online,
1:11:04
that unlocks so many
1:11:07
other algorithms, such as Bayesian
1:11:09
regression, the elliptic envelope method for outlier detection, Gaussian
1:11:11
processes, whatnot.
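To make the idea concrete, here is a hand-rolled sketch of online covariance estimation (a general Welford-style update, not River's actual implementation):

```python
import numpy as np

class OnlineCovariance:
    """Running mean and (biased) covariance, updated one sample at a time."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.cov = np.zeros((n_features, n_features))

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Rank-1 update: cov_n = cov_{n-1} + (outer(delta, x - new_mean) - cov_{n-1}) / n
        self.cov += (np.outer(delta, x - self.mean) - self.cov) / self.n

est = OnlineCovariance(n_features=2)
for x in [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5)]:
    est.update(x)
print(est.cov)
```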
1:11:14
think I'm still
1:11:15
in the nitty
1:11:17
gritty details of implementing algorithms and not necessarily applying them to stuff. I'm
1:11:20
kinda counting on users
1:11:22
to do the applications.
1:11:26
That's just where things are at the moment. Now
1:11:28
one thing that I'm working on in the
1:11:30
mid to long term is beaver. So
1:11:33
eventually, I want to try to
1:11:35
spend less time on River and work on a tool I'm building called
1:11:37
Beaver. So Beaver is a
1:11:40
tool to
1:11:42
deploy and maintain online learning models. So essentially an
1:11:44
ML ops
1:11:46
tool for online
1:11:47
learning.
1:11:50
So it's in its infancy, but it's something I've
1:11:52
been thinking about a lot. So I
1:11:54
recently gave a talk in Sweden. I
1:11:57
sketched out a blog post and
1:11:59
some designs where I tried to describe what it's
1:12:02
going to look like. But the goal of
1:12:03
this project is to
1:12:05
create a very simple user
1:12:08
friendly tool to deploy a model, and
1:12:10
I'm hoping that that is going to
1:12:12
encourage people to actually
1:12:14
use River and to use online learning because they're gonna say, hey, okay, I can learn, but I can
1:12:16
also just deploy
1:12:19
the model and you
1:12:21
know, and both tools play nicely together. So,
1:12:23
yeah, the future of River is to have
1:12:24
this companion tool
1:12:27
to deploy online models.
1:12:30
It's not going to be catered just towards River.
1:12:32
The goal is to be able to,
1:12:34
you know, run it with any model
1:12:37
that can learn online. Well,
1:12:38
for anybody who wants to get in touch
1:12:40
with you and follow along with the work that you're doing,
1:12:42
I'll have you add your preferred contact information to
1:12:45
the show notes. And as the final question, I'd
1:12:47
like to get your perspective on what you see
1:12:49
as being the biggest barrier to adoption for
1:12:51
machine learning today. I'm
1:12:52
always impressed by how much
1:12:55
the field is maturing. I think
1:12:56
that there's a clear separation now between
1:12:58
regular machine learning, like business machine learning, I
1:13:01
might call it, and deep
1:13:03
learning. I think the two are becoming
1:13:04
separate fields. So I've
1:13:06
kind of stayed
1:13:07
away from deep learning
1:13:10
because it's just not my
1:13:12
specialty,
1:13:12
but I'm still
1:13:13
very interested in business machine learning, as
1:13:15
I call it. And I think
1:13:17
I'm impressed by how much
1:13:20
the community has
1:13:22
evolved in terms of knowledge. The average ML practitioner today is
1:13:24
just so much more
1:13:26
professional than five years ago.
1:13:30
And I think it's a big question of
1:13:32
education and tooling. The tricky
1:13:34
thing about
1:13:34
an ML model is that
1:13:37
it's not deterministic. And so it's
1:13:39
difficult
1:13:39
to guarantee that its performance
1:13:41
over time is going to
1:13:43
be good,
1:13:44
let alone certify the model
1:13:46
or convince stakeholders that they should adopt it.
1:13:48
So in the real world, you
1:13:50
don't just deploy a model and cross your
1:13:52
fingers. So
1:13:54
although we've gone past
1:13:56
the testing and R and D
1:13:58
phase of a model, we are still not there in terms of
1:13:59
deploying. We have still not handled the other side
1:14:02
of the coin, maintaining the model.
1:14:03
And so the
1:14:04
reality is that there's usually
1:14:06
a feedback loop where you monitor your model and
1:14:08
possibly retrain it,
1:14:10
be it online or
1:14:13
you
1:14:13
know, offline retraining. It doesn't matter. And so, I'm not gonna go into that right now. But I
1:14:15
don't think that we have great
1:14:16
tools to have human
1:14:19
beings in the loop, to have
1:14:21
human beings work
1:14:23
hand in hand with machine learning models.
1:14:25
So I
1:14:26
think that tools like Prodigy,
1:14:29
which is
1:14:30
a tool to have a user work hand in
1:14:32
hand with an ML system by labeling
1:14:34
data that the model is unsure about.
1:14:36
They're crucial. They're game changers
1:14:38
because they create real systems where
1:14:41
you care
1:14:42
about, you know, new data coming in, retraining your model,
1:14:44
having
1:14:44
a human validate
1:14:47
predictions, stuff like that. So
1:14:51
I think we have to move away from only having
1:14:53
tools that
1:14:55
are destined towards training models, but
1:14:57
we also need to get better
1:14:59
at tools that, you
1:15:01
know, encourage you
1:15:01
to monitor your model, to keep training it, to work with it.
1:15:04
Yeah,
1:15:06
again, just treat machine learning
1:15:08
as software engineering
1:15:10
and not just as some research projects. Alright. Well, thank you very much for taking
1:15:12
the time today to join
1:15:13
me and share the work that you've
1:15:15
been doing on River and
1:15:19
helping to introduce the overall concept of online
1:15:21
machine learning. It's definitely a very
1:15:23
interesting space, and it's great to
1:15:25
have tools like River available
1:15:27
to help people take advantage of this approach. So thank
1:15:29
you for all of the time and effort that you and the other maintainers are putting into
1:15:31
the project, and I hope
1:15:33
you enjoy the rest of
1:15:35
your day. Oh,
1:15:36
thank you. Thanks for having me.
1:15:38
It was great. Thank you for listening. Don't forget
1:15:40
to check out our other
1:15:42
shows. The data engineering podcast which
1:15:45
covers the latest on modern data management and the machine learning podcast, which helps you go
1:15:47
from idea to production with machine
1:15:52
learning. Visit the site at pythonpodcast dot com
1:15:54
to subscribe to the show, sign up for the mailing list and read the show notes. And if you learned
1:15:56
something or tried out a project from
1:15:58
the show, then tell us
1:15:59
about it. Email
1:16:02
hosts at python podcast dot com with your story. And to help other people find the show, please leave a review
1:16:07
on Apple Podcasts and tell
1:16:11
your friends and
1:16:14
coworkers.