Update Your Model's View Of The World In Real Time With Streaming Machine Learning Using River

Released Monday, 12th December 2022

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

0:13

Hello,

0:13

and welcome to Podcast.__init__, the

0:15

podcast about Python and the people who make it

0:17

great. When

0:18

you're ready to launch your next app or want to try a

0:20

project you hear about on the show, you'll need some more to

0:22

deploy it. So check out our friends over at Linode.

0:24

With their managed Kubernetes platform,

0:27

it's easy to get started with the next generation

0:29

of deployment and scaling powered by the battle

0:31

tested Linode platform including simple

0:33

pricing, node balancers, forty gigabit

0:36

networking, and dedicated CPU and GPU

0:38

instances. And now you can

0:40

launch a managed MySQL, Postgres, or

0:42

MongoDB cluster in minutes to keep your

0:44

critical data safe with automated backups

0:46

and failover. Go to python podcast

0:49

dot com slash linode today to get a

0:51

one hundred dollar credit to try out their new database

0:53

service, and don't forget to thank them for the continued

0:55

support of this show. The biggest challenge

0:58

with modern data systems is understanding what

1:00

data you have, where it is located,

1:02

and who is using it.

1:03

Select Star's data discovery platform

1:06

solves that out of the box with a fully automated

1:08

catalog that includes lineage from where the data

1:10

originated, all the way to which dashboards rely

1:13

on it and who is viewing them every day.

1:15

Just connected to your DBT, Snowflake,

1:18

Tableau, Looker, or whatever you're using,

1:20

and select star will set everything up in just

1:22

a few hours. Go to python

1:24

podcast dot com slash select star

1:26

today to double the length of your free trial

1:28

and get a swag package when you convert to a

1:30

paid plan. Your host, as usual,

1:32

is Tobias Macey, and this month, I'm running

1:34

a series about python's use in machine learning.

1:37

If you enjoyed this episode, you can explore

1:39

further on my new show, The Machine Learning Podcast, which

1:42

helps you go from idea to production with machine

1:44

learning. To find out more, you can go

1:46

to the machine learning podcast dot com.

1:48

Your

1:48

host is Tobias Macey, and today

1:50

I'm interviewing Max Halford about River,

1:53

a python toolkit for streaming and online

1:55

machine learning. So Max, can you start by introducing

1:57

yourself?

1:58

Oh, hi there.

1:59

I'm Max. Yes. I consider myself

2:02

as a data scientist.

2:03

My day job is doing data science. I

2:06

actually measure the carbon footprint of

2:09

clothing items. But

2:10

I have a wide interest in, you know,

2:12

technical topics, be it software engineering

2:15

or

2:16

data engineering,

2:17

build a lot of different source. My academic

2:19

background is leaning towards finance and

2:21

computer science and statistics. I

2:23

actually did a PhD in

2:25

applied machine learning, which

2:27

I finished a couple of years ago. So,

2:29

yeah, all around that, basically. And

2:33

do you remember how you first got started working in

2:35

machine learning?

2:36

Kind of just I was a late bloomer.

2:38

I got started when maybe when I was twenty

2:40

one, twenty two when I was at university.

2:43

I basically had no idea what

2:45

machine learning was, but I started this curriculum

2:48

that revolved around statistics. And

2:50

we had a course, which was

2:52

maybe two or three hours a week about machine learning.

2:55

And it did kind of blow

2:57

our minds. It was around

2:59

the time when well, machine

3:02

learning and particularly deep learning was

3:04

starting to explode.

3:05

So I

3:06

kind of stopped at university. So I was lucky

3:09

enough to get a theoretical training.

3:11

And in terms of the

3:13

river project, can you describe a bit more

3:15

about what it is that you've built and some of

3:17

the story behind how it came to be and

3:19

why you decided that you wanted to build it

3:21

in the first place? When I was

3:23

at university, I received

3:25

a kind of normal

3:27

introduction to a regular introduction

3:29

to machine learning. And then I did

3:31

some internships, I signed up for a

3:33

PhD after my internships. And

3:36

I also did a lot of Kaggle competitions on the

3:38

side. So I was kind of hooked into

3:40

machine learning. And

3:42

it always felt to me

3:44

that

3:44

something was off because

3:47

when we were learning machine learning, it

3:49

all made sense, but then when you get

3:51

to do it in practice, you often

3:53

find that, well, it's not the same

3:55

thing. Right? The playground

3:57

scenarios that they describe at university,

3:59

when you learn the field, just do not apply in

4:02

the real world. In the real world, you have

4:04

data that's coming in. Like, a flow

4:06

of data or evidence is new data

4:08

or yeah. There's, like, an interactive

4:10

aspect to the world around

4:12

us, the way the data is. It's not

4:14

like a CSV file. Yeah.

4:16

It just felt like fitting a square peg

4:18

in a round hole. So I

4:19

was always curious in the back of

4:22

my mind about

4:23

how

4:24

you could do online machine learning?

4:27

Or I didn't know it was called online machine learning

4:29

because when I was a kid, I remember going up and

4:31

thinking that AI was this kind of intelligent

4:34

machine that would keep learning as it went

4:36

on and as they experience

4:37

the world around this.

4:38

Anyway, when I started my PhD, I was

4:41

lucky to have a lot of time to read. I read a

4:43

lot of papers and blog posts and

4:45

whatnot. And I can't remember the

4:47

exact day a week that I

4:49

stumbled upon it, but I just started learning

4:51

about online machine learning. Maybe

4:52

some blog posts or something. And

4:54

then it was like a big explosion

4:57

in my head and I was like, wow, this is

4:59

crazy. Right? This this actually exists.

5:02

And I was so curious as to why it wasn't

5:04

more popular.

5:05

And at

5:07

the time, I did a lot of open source as

5:09

a way to learn. And so it

5:11

just felt natural to me to start

5:13

implementing algorithms that I was

5:15

reading about in papers and everywhere.

5:17

I was just writing code to learn,

5:19

basically just to to confirm

5:21

what I've learned and whatnot. That's just the

5:23

way I learn. And it kind

5:25

of evolved into what it is, which is a

5:28

well, an open source package

5:30

that people use. Now if I may expand

5:32

a little bit, River is actually the

5:34

merger between two projects. So

5:36

the first project is called scikit-multiflow.

5:38

It was a package that

5:40

was developed before I even got

5:42

into machine learning. It has roots in

5:45

academia in New Zealand. It

5:47

comes from an old package called MOA in Java.

5:49

Anyway, I wasn't aware of that.

5:51

On my end, I started to

5:54

build a package called creme at the time.

5:56

So in French, creme, as in the cream,

5:58

and it plays friendly with incremental,

5:59

which is another way to say online.

6:02

So, yeah, I developed creme

6:04

on my own. And

6:06

at some point, it just made sense to

6:08

reach out to the guys from scikit-multiflow

6:11

and to propose a merger. So it

6:14

took us quite a while. But after

6:16

nine months of negotiation

6:18

and, you know, figuring out the details,

6:21

we merged and we called the new package

6:23

River. You

6:24

mentioned that it's built around this

6:26

idea of online machine learning

6:28

and in the documentation you also refer

6:30

to it as streaming machine learning. I'm

6:32

curious if you can just talk through

6:34

what that really means in the context

6:36

of building a machine learning system

6:38

and some of the practical differences

6:40

between that and the typical

6:42

batch oriented workflow that most folks

6:44

who are working in ML are going to be

6:46

familiar with.

6:47

First, just to recap on machine

6:50

learning, the whole point of machine learning is to

6:52

teach a model to

6:55

learn from data and to take decisions.

6:57

So you know, monkey see, monkey

6:59

do. And the typical way you do

7:01

that is that you fit a model

7:03

to a bunch of data and That's

7:06

it, really. But online

7:09

machine learning is the equivalent of that, but

7:11

for streaming data. So you

7:13

now stop thinking about

7:15

data as a file or a

7:17

table in a database, but you think of it as

7:19

a flow of data stream.

7:22

So online machine learning, you could call it incremental

7:24

machine learning. You could call it streaming, machine

7:26

learning. I mean, I mostly

7:28

see online machine learning

7:30

being used. Although, if you google

7:32

that, you kinda find these online courses.

7:35

So that's online machine learning. So that's that

7:37

kinda cool online. Anyway,

7:39

yeah, it's just this way to

7:41

say, can I do machine learning but

7:43

with streaming data? And so the

7:45

rule is that an online model

7:48

is one that can learn one

7:50

sample at a time. So usually,

7:53

you show a model a whole data set,

7:55

and they can work with that data sets. They

7:57

can even calculate the average of the data. They can

7:59

do a whole bunch of stuff. But

8:01

the restriction here with online machine learning

8:03

is that the model

8:05

cannot see the whole data. It can't hold it

8:07

in memory. It can only see one sample

8:09

at a time, and it has to work like

8:11

that. So it's a restriction. Right?

8:13

So it makes it harder for the models to learn,

8:16

but it also has many, many implications.
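
To make that restriction concrete, here is a minimal sketch of the one-sample-at-a-time loop, assuming River's documented learn_one/predict_one API and one of its bundled datasets; the dataset and model choices are only illustrative.

```python
from river import datasets, linear_model, metrics, preprocessing

# An online model: a scaler piped into a logistic regression.
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

# The model never sees the whole dataset at once: it receives one
# (features, label) pair at a time and updates itself immediately.
for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)   # predict before learning
    metric.update(y, y_pred)        # track performance on the fly
    model.learn_one(x, y)           # update the model with this single sample

print(metric)
```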

8:18

If you have a model that can learn that way,

8:20

well, you can have a model that just can keep

8:22

learning as you go. Because

8:24

a regular machine learning

8:26

model, once it's been

8:28

fitted to a dataset, you

8:30

have to retrain it from scratch. If

8:32

you want to incorporate new samples

8:35

into your model, that can be a

8:37

source of frustration. And that's why I was

8:39

calling the square peg in the round

8:41

hole before. So say you have a

8:43

model, an online model that is just as

8:45

performant as a batch model.

8:47

Well, you know, if

8:48

you if you just regardless of

8:50

performance, accuracy. That has

8:52

many implications and it

8:55

actually makes things easier because if you have

8:57

a model that can keep learning, well,

9:00

you don't have to, for instance,

9:02

schedule retraining on your

9:04

model. You can just every time you have a

9:06

new sample that arrives, you can just ask your model

9:08

to learn from that, and then it

9:10

learns. And so that ensures that your

9:12

model is always as up to date as possible and

9:14

that has obviously many,

9:16

many benefits. If you think

9:18

about people working on the

9:20

stock market, so trying to

9:22

forecast the evolution

9:24

of a particular stock. They've

9:27

actually been doing online machine learning since the

9:29

eighties because obviously

9:31

they have a lot to lose by making

9:33

all this public. It just never got

9:35

into

9:35

a big thing and it always stayed within the stock

9:37

market companies. So

9:38

the practical differences

9:41

are that you are

9:43

working over stream of data. You're not working

9:45

over static datasets. This stream

9:47

of data has an order, meaning that

9:49

the time that one sample arrives before

9:51

the other one, that has a lot of meaning and that's

9:53

actually reflecting what's happening

9:55

in the real world. In the real world, well,

9:57

you have data that's arriving in a certain order.

10:00

Well, if you train your model

10:02

offline

10:02

on that

10:03

data, you want to you

10:05

know, process it in the same order.

10:08

And so that ensures that you are actually

10:10

reproducing the conditions

10:12

that happen in the world.

10:14

Now another practical consideration is

10:16

that online learning is

10:18

much less popular

10:20

or predominant than batch

10:22

learning. And so a lot less research and

10:25

software work has been put into online

10:27

learning. So if you are a newcomer to

10:29

the field, well, there's just not a lot

10:31

of resources to learn from. Actually, you

10:33

could just spend a day on Google and you

10:35

probably find all the resources that there

10:37

are because there's just not so many

10:39

of them. It's probably like by

10:42

memory, just ten

10:43

links on Google that you can

10:45

learn from about online learning.

10:47

So it's a bit of a niche topic.

10:49

In

10:49

terms of the fact that batch

10:52

is such a predominant mode

10:54

of building these m l systems despite

10:56

the fact that it's not very

10:57

reflective of the way

11:00

that the real world actually

11:02

operates. Why do you think

11:04

that's the case that streaming or

11:06

online machine learning is still such

11:08

a niche topic and hasn't been

11:10

more broadly adopted. Sometimes

11:11

it feels like I'm trying to preach

11:14

a new religion. It was just a

11:16

bit weird because there's not a lot of us doing

11:18

it. So I'm also very

11:20

I never try to force people into

11:22

this because obviously there are many good reasons

11:25

why batch learning is still

11:27

done. And now from a historical

11:29

point of view, I think it's interesting

11:31

because we always used

11:33

to use statistical models to

11:35

explain data. And not

11:37

necessarily to predict. So you just

11:39

have datasets and you just like

11:41

to understand, you know, what

11:43

variables are affecting a particular

11:45

outcome. So for instance, if you take linear

11:47

regression, historically, it's been used

11:49

to explain the

11:51

impact of global warming

11:53

on the rise of sea level. But not

11:55

necessarily to predict if,

11:57

you know, the

11:58

temperature of the

11:59

globe was higher,

12:00

what would be this impact from the sea

12:03

level? But then someone

12:05

said, let's use machine learning to

12:07

predict outcomes in a business

12:09

context, and that's why we

12:11

have this big advent of machine learning.

12:14

And we've kind of been using the tools that

12:16

have been lying around. So

12:18

we've been using all these tools that

12:20

we used to, you know,

12:22

understand and explain data, and now we've been

12:24

using them for predicting. So

12:27

these models are

12:28

static. Like,

12:30

the people who when we

12:31

started doing linear regression, never really worried

12:34

about streaming data because the datasets

12:36

were small, the datasets were static.

12:38

Well, Internet didn't even exist. So there

12:40

was no real notion of IoT or

12:43

sensors or

12:44

streaming data.

12:46

So the

12:47

fact is that we've

12:50

never needed online models.

12:52

And so as a

12:54

field, you look at the academia and

12:56

the industry, we're very used to batch

12:58

learning, and we're very

13:00

comfortable with it. There's a lot of good

13:02

software. And this is what people are being

13:04

taught at university. So

13:07

I'm not

13:07

saying that online learning is necessarily

13:09

better than batch learning, but I do

13:12

think that the reason why batch learning is

13:15

actually so predominant in comparison

13:17

is because we are too used

13:19

to it, basically. And

13:20

I do think that, and I see

13:22

it every week, people who

13:24

are trying to rethink their

13:27

job or their projects and say,

13:29

maybe I could be doing online learning.

13:31

It actually makes more sense. So

13:34

I think it's a question of how it's

13:36

been. For

13:36

people who are assessing

13:39

which approach to take in their

13:41

ML projects, what are some of the use

13:43

cases where online or

13:45

streaming ML is the

13:47

more logical approach or

13:49

what the decision factors look like for somebody

13:52

deciding, do I go with a batch oriented

13:54

process where I'm going to have this

13:56

large set of tooling available to

13:58

me or do I want to use

14:00

online or streaming ML because the benefits

14:02

outweigh the potential costs

14:04

of plugging into this ecosystem

14:06

of tooling? So

14:07

I'll be honest. I think it always

14:10

makes sense to start with

14:12

a batch model. Why

14:14

because, you know, if you're pragmatic and

14:16

you actually have deadlines to meet and you

14:18

just want to be productive. There's so

14:20

many good solutions to train

14:22

a batch model and deploy it. So,

14:24

you know, I would just go over

14:26

that to start with. And

14:28

then, yeah, there's the question of,

14:30

could I be doing this online? So I

14:32

think there's two cases. There's cases where you

14:34

need it. So I have a great

14:36

example. So Netflix, when they

14:38

do recommendations, you know, you

14:40

arrive on the website and Netflix recommends

14:42

movies to you. Netflix actually

14:44

retrains a model every night.

14:46

Or every week, but they have many models

14:48

anyway. But they

14:50

are learning from your behavior to kind of

14:53

retrain their models to update

14:55

their recommendations. Right?

14:56

There's a

14:57

team at Netflix that is working on

15:00

learning instantly. So if you

15:02

are scrolling on the Netflix website

15:05

and you see your

15:07

recommendation from Netflix. The

15:09

fact that you did not click on that recommendation

15:11

is a signal that you do not

15:13

want to watch that movie, or

15:15

that the recommendations will be changed.

15:17

So if you're able to have a

15:19

model, for instance, maybe in your browser,

15:21

that would learn in real

15:23

time from your browsing

15:25

activity and that could update

15:27

and learn on the fly, that'd

15:29

be really powerful. And the only way to

15:31

do that is to have a model

15:33

per user that is learning

15:35

online. And so

15:36

you cannot just use batch models

15:39

for that. Yeah. You can't just

15:41

every time a user scrolls past or

15:43

ignores a movie, you can't just take all the history

15:45

of data and fit the model. It

15:47

would be much too heavy and not at all practical.

15:49

So sometimes the only

15:51

way forward is to do

15:53

online learning. But again, this is quite

15:55

niche, like Netflix recommendations. I

15:57

mean, obviously, they're working reasonably

15:59

well, I

15:59

believe because you

16:01

know, just from their market value. But

16:03

if you are pushing the envelope, then sometimes

16:06

you need online learning.

16:07

Now another case is when you do

16:10

not necessarily need it, but you want it

16:12

because it makes things easier. So

16:14

a

16:14

good example I have is imagine

16:17

you're working on the app that categorizes

16:19

tickets. So, for instance,

16:21

on the help support software. So, you know,

16:23

you go on the website and you're sending form

16:25

or you're sending a message or an email, and

16:27

you're asking you have some problem maybe with

16:29

the reimbursement from your absolute people in

16:31

Amazon? And then, you know,

16:33

there's a customer service behind

16:35

that human beings that are actually answering those

16:37

questions. And

16:38

then it's

16:39

really important to be able to categorize

16:42

each request and put into a

16:44

bucket so that it gets assigned to the right

16:46

person.

16:47

And maybe the product manager

16:50

has decided that we need a new

16:52

category. And so there's this new

16:54

category. And your

16:57

model is classifying tickets

17:00

into one of several categories.

17:02

If you introduce a new category,

17:04

it means that you have to retrain the model to

17:06

incorporate it. I was in

17:08

discussion with a company and they were only

17:10

able to or their budget

17:12

was that they were only able to retrain the model every

17:15

few months. So if you

17:17

introduced a new ticket and new category into your

17:19

system, the model would only pick

17:21

it up and predict it after

17:23

a few months. So that sounded

17:25

kind of insane. And, you know, it wasn't

17:27

collected at all.

17:28

And I'm not aware

17:30

of the exact details, but it it just seemed

17:33

too expensive for them to retrain their model

17:35

from scratch. So if they were

17:36

using an online model, well, potentially

17:39

that model could just learn the

17:41

new tickets on the fly. And, you know,

17:43

if you just introduced it and people started

17:45

you know, you get this feedback loop where

17:48

you introduce a new category, people

17:50

send the email, maybe a human assigns

17:52

that ticket to a category, so

17:55

that becomes a signal for the model. The model picks it

17:57

up, learns. And, yeah, it's

17:59

gonna incorporate that category into its next

18:01

predictions. So that's

18:03

a scenario where you don't necessarily need online

18:05

learning but actually online learning just makes

18:07

more sense and makes your

18:09

system easier to maintain and

18:11

more quick, basically. There

18:13

are a number

18:14

of interesting things to dig into

18:17

there. One of the things that you

18:19

mentioned is the idea of

18:21

having a model per user

18:23

in the Netflix example.

18:26

And I'm wondering if you could maybe talk

18:28

through some of the conceptual

18:30

elements of saying, okay, I've got this

18:32

baseline structure for this is how I'm

18:34

going to build the model here

18:36

as the initial training set, I've

18:38

got this model deployed, and now

18:41

every time a specific user

18:43

interacts with this model, it is going to learn

18:45

their specific behaviors and

18:47

be tuned to the data that they

18:49

are generating, would you then take that

18:51

information and feed that back into the

18:55

baseline model that gets loaded at the time the browser

18:57

interacts with the website or just some of the ways to

18:59

think about the potential

19:02

approaches for how to say, okay, I've got

19:05

a model, but it's going to be customized

19:07

per user and just managing the kind

19:09

of fan out fan in topologies that

19:11

might result from those

19:14

event based interactions

19:16

spread out across a number

19:18

of users or entities? No.

19:20

It

19:20

sounds insane when you say it because to

19:23

have one model per user and, you

19:25

know, have it deployed on the user

19:27

browser or mobile phone or even

19:29

with an Apple Watch. It does kinda

19:31

sound insane, but it is interesting, I guess.

19:33

I don't think there are so many I

19:35

mean, I'm not aware of a lot of

19:38

companies that would have the

19:40

justification to actually do this,

19:42

and I've never had the occasion to actually work

19:44

in the kind of setting where I would

19:46

do this. But I had one good

19:48

example where I was kind of doing some pro bono

19:50

consulting. It was this car

19:52

company where the

19:54

onboard navigation system

19:56

they wanted to build a model where they could guess

19:59

where you're going

19:59

to. So basically, depending on

20:01

where you left, if you left home in

20:03

the morning, you're probably going to

20:06

work. And they would then use this to,

20:08

you know, just send you

20:10

news about your itinerary or

20:13

things like that. They

20:14

really needed a model that would just be able to learn online

20:17

and they made the

20:18

bold decision to say, okay, we're going

20:20

to embed the model into the

20:23

car. It's not going to be,

20:25

like, a central model

20:27

that's, you know, hosted on some

20:29

big server that the car

20:31

connects to. Actually, the intelligence

20:32

is actually happening in the car. And so when you

20:34

think about that,

20:35

it's been interesting because it creates a decentralized

20:38

system. There's not like a single

20:40

it

20:41

actually creates a system where you don't even need the Internet for

20:43

the model to work. So there's

20:45

so many operational requirements

20:47

for that. Actually, now that I

20:49

think about it, I'm talking about cars. I mean, that's

20:51

Tesla that's actually what Tesla's doing. Right? They're

20:54

they're computing

20:55

and making decisions inside

20:56

the car. You

20:58

know, doing a bunch of stuff, and they're also communicating with

21:01

other servers. But the

21:03

actual compute, they actually have GPUs in the

21:05

car doing compute with their deep learning

21:07

models and whatnot. So it's

21:09

definitely

21:09

possible to do this. Right? But

21:11

clearly not something that, I don't

21:13

know. Our

21:13

company would go through or would

21:15

have the need to

21:17

do. It's

21:17

interesting also to think about

21:19

how something like that would

21:21

play into the federated

21:24

learning approach where you have

21:26

federated models where there is that core model that's

21:28

being built and maintained at

21:30

the core. And then as users

21:33

interact, at the edge, whether it's on device

21:35

or in their browser, it loads

21:37

a federated learning component that

21:39

has that streaming ML capability

21:42

So the model evolves as the user

21:44

is interacting with it on their device, and

21:46

then that information is then sent

21:49

back to the centralized

21:51

system to be able to feed back into

21:53

the core model so that you can

21:55

have these kind of parallel streams

21:57

of the model that users interacting

21:59

with is being customized to their behavior

22:01

at the time that they're interacting with it,

22:03

but it does still get

22:05

propagated back into the larger

22:07

system so that the new

22:10

intelligence is able

22:12

to generate an updated experience

22:14

for everybody who then goes and interacts with it

22:16

in the future.

22:17

Yeah. That's really, really interesting.

22:20

So I think first off, the the fact is that

22:22

I I'm actually

22:23

still young and there's so many things that I don't

22:25

know and I don't have, like, the technical

22:27

savvy to be able to suggest

22:29

ways forward. But this is obviously things

22:32

like, you know,

22:33

I think about

22:34

this project from

22:35

Google, for instance —

22:37

they have a paper discussing

22:39

these things. But I think there's

22:41

a really simple thing that if you wanted to do

22:43

this — if you, the listener, want to do

22:45

something like this, I think there's a simple

22:47

pattern which is to maybe once a

22:49

month have a model that is retrained and

22:51

that you just train in batch. And that

22:53

model is going to be

22:55

it's gonna be like a hydra. Like, you're gonna

22:58

copy it and you're gonna send it to each

23:00

user. And then each

23:02

copy for each user is going to be able

23:04

to learn in this whole environment.

23:06

And for instance, a good idea would maybe

23:08

to tweak your model to increase,

23:10

like, the learning rates so

23:12

that every sample that the user

23:14

gives you matters a lot. So

23:16

for instance, if we take the Netflix example, you

23:18

would have, you know, your

23:20

one main recommendation system

23:23

model that you would just train in

23:25

batch, you know, and you just use all the

23:27

tools that the community uses. But

23:29

then you would embed that into

23:31

each person's browser and

23:33

maybe you do this once a month. And then

23:35

that model per user

23:37

would be a copy, a clone, like just

23:39

a separate model now.

23:41

And, you know, it would

23:42

keep learning in an online manner.

23:45

So maybe your model was trained in batch

23:47

initially. But now for each user, it's yeah.

23:49

It's actually it's been trained online. And so

23:51

for instance, you can do this with

23:53

factorization machines, which can be trained in batch, but also

23:56

online. And, yeah, you would use a

23:58

a high learning rate. So

24:00

that every sample matters a lot basically.

24:02

And so you, the user, tuning

24:04

your model. And so

24:06

I don't know how YouTube does it

24:08

trend but I do imagine they have some sort of

24:10

core model. They're just learning how

24:12

to make good recommendations. But

24:14

obviously, YouTube,

24:15

there are some rules that

24:18

make it so that's, you know, recommendations

24:20

are

24:20

tailored to each user. And I

24:22

don't know if this is done online, and I don't

24:24

know if it's actually machine learning. It's probably just

24:26

rules or scores. But, yeah, I think it's

24:28

a really fun idea to play around

24:31

with, and I do think that online learning

24:33

enables this even more.
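
As a rough sketch of that pattern — not tied to any real recommender, and with hypothetical helper names — a batch-pretrained base model can be deep-copied per user and then updated online with that user's own feedback; a fairly aggressive learning rate makes each individual interaction count more, as described above.

```python
import copy
from river import linear_model, optim

# Shared base model, e.g. (re)trained in batch once a month.
# A high SGD learning rate means each user's feedback moves the weights a lot.
base_model = linear_model.LogisticRegression(optimizer=optim.SGD(lr=0.1))

user_models = {}  # one independent clone per user

def get_model(user_id):
    # Each user gets their own copy of the base model the first time we see them.
    if user_id not in user_models:
        user_models[user_id] = copy.deepcopy(base_model)
    return user_models[user_id]

def on_impression(user_id, features):
    # Score an item for this user with their personal model.
    return get_model(user_id).predict_proba_one(features).get(True, 0.0)

def on_feedback(user_id, features, clicked):
    # The click / no-click signal updates only this user's clone.
    get_model(user_id).learn_one(features, clicked)
```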

24:35

As far

24:35

as the

24:37

operational model for people who are

24:39

using online and streaming machine

24:42

learning. If they're coming from a

24:44

batch background, they're going to be used to dealing

24:46

with the train

24:48

test deploy cycle where

24:51

I have my dataset. I build this model.

24:53

I validate the model against

24:55

the test dataset that I've held out from

24:57

the training data. Everything

24:59

looks good based on the

25:01

area under curve or whatever metrics I'm

25:03

using to validate that model, then I'm going

25:05

to put it into my production environment

25:08

or maybe being served as a Flask

25:10

app or a FastAPI app.

25:12

And then I'm going to monitor it for

25:14

concept drift. But eventually, I say,

25:16

okay, this is no longer performing up to these

25:18

specifications. So now I need to go back and

25:20

retrain the model based on my updated data

25:22

sets. And I'm wondering, what

25:24

that process looks like for somebody building a

25:26

streaming ML model with something like River and

25:28

how you address things like

25:31

concept drift and, you know, how concept

25:33

drift manifests in this

25:35

streaming environment where you are continually

25:37

learning and you don't have to worry

25:40

about you know, the real world data that I'm seeing

25:42

is widely divergent from the data that

25:44

I use to train against. There's so many

25:46

things to

25:46

dig into and I'll try to give a comprehensive

25:49

answer. So first

25:51

off, it's important to understand that River

25:53

itself is to

25:55

online learning what scikit-learn is to batch

25:57

learning. So it

25:58

only

25:59

desires to be a machine

26:02

learning library. Right?

26:03

So it just contains

26:06

basically algorithms routines

26:08

to train a model and

26:10

to have models that can learn and

26:12

predict. And what you're going towards

26:14

to your question is MLOps. So

26:17

how does the

26:19

life cycle look like for an online

26:21

model? So this is obviously something

26:23

that I'm

26:23

spending a lot of time to look into.

26:26

The answer is that the

26:28

first problem is that online learning

26:30

enables different

26:32

patterns. And I

26:34

believe

26:34

that these patterns are simple to

26:36

reason about. So as you said,

26:38

you usually start off by

26:41

training

26:41

a model, then evaluating against the

26:43

test set and maybe going

26:45

to report to your stakeholders and show

26:47

them the performance and guarantee that

26:50

you

26:50

know, the percentage of false

26:51

positives is underneath a

26:53

certain threshold, then yes, we can

26:56

diagnose cancer with this model or

26:58

not. And yeah, and then you kind of

27:00

deploy it. Maybe if you get lucky, if you get the

27:02

approval and you sleep, you

27:05

know, well or not well at night depending

27:07

on how much you trust your model, but there's this

27:09

notion of you deploy a model and

27:11

it's like a baby in the world

27:13

and this baby is not meant to keep

27:15

learning. So you know, it's a

27:17

lie to believe that if you deploy a

27:19

batch model, you're going

27:21

to be able to just let

27:23

it, you know, live by itself. There's actually

27:25

maintenance that has to happen there. So the

27:28

reality is that any machine learning projects,

27:30

you know, any serious projects, is

27:33

never finished. It's like software,

27:35

basically. We have to think of machine learning

27:37

projects as software engineering. And

27:39

obviously, what we all know that you

27:41

never just deploy a feature, a software

27:43

engineering feature, and just never look at it.

27:45

You monitor it. You

27:47

take care of that, well, investigating

27:49

bugs and whatnot. So batch learning in that

27:51

sense is a bit it's a

27:53

bit difficult to work with because, obviously, you

27:55

can have if your model is

27:57

drifting, So meaning that its performance

27:59

is dropping because the data

28:01

that it's looking at is different

28:03

than the training set it was trained on.

28:06

You basically have to be very lucky if you want your model

28:08

to pick up performance. So you're

28:10

gonna have to do something about it. And, yeah, you can just

28:12

retrain it. But what

28:14

you get with online learning is that you can

28:16

have the model just keep learning as you

28:18

go. So

28:19

there is no distinction

28:22

between training and testing.

28:23

What online learning encourages to do

28:26

is to deploy your

28:28

model as soon as possible.

28:30

So say you have

28:32

a model, and it's not been trained on anything. Well,

28:34

you can put it into production straight

28:36

away when samples arrive, it's gonna make

28:38

a prediction. So

28:39

maybe you

28:40

know, user arrives on your website, you make

28:43

recommendations, that's your prediction.

28:44

And then your

28:45

user is going to click or

28:48

not on something you recommended to

28:50

her. And

28:51

that's

28:52

gonna be feedback for your model to keep

28:55

training.

28:55

So that

28:56

is already a guarantee that your model is

28:58

kind of up to date and kind of learning.

29:00

And so that's been interesting because

29:02

just enables

29:03

so many good patterns.

29:05

You

29:06

can still monitor your the

29:08

performance of your model. If the performance of

29:10

your online model is dropping,

29:12

I mean, I haven't seen that yet, but it probably

29:15

means that your problem is

29:17

really hard to solve. So

29:19

a really cool thing

29:21

I stumbled upon was this idea of test

29:23

then

29:23

train. So the idea that

29:26

imagine the

29:26

scenario where you

29:29

you have a classification model that is running online. And so

29:31

what would happen is that you have

29:34

your model, your model

29:35

is generating features. So

29:38

say the user lives on the website and

29:40

the features are what's the time of the day? What's

29:42

the weather like? What are the top films at the

29:44

moment? And these are features and your

29:46

features that you have at a certain point in time

29:49

t, you generate these features and

29:51

then later on when you get

29:53

the feedback. So did your

29:55

recommendation was it a success or not? That's

29:57

training data for your

29:59

model. You use

29:59

the same feature that you used for

30:02

predictions, you use those

30:04

features for training. And

30:06

so

30:06

you can see

30:06

here that there's a clear feedback loop.

30:09

The event happens, the user comes on the website,

30:11

your model generates features,

30:13

and then at some later

30:15

point in time, the feedback arrives.

30:18

So was the prediction successful model.

30:20

If so, if not, by by how much

30:22

was the error.

30:23

And then,

30:24

yeah, you can use this

30:26

feedback, join it with the features that it

30:28

generated when predicting, and use

30:30

that as training data. So

30:32

and you essentially have, like, a small

30:34

queue or database that's storing your

30:37

predictions, your features,

30:39

and your training data, and

30:41

the labels that make your training

30:44

data.
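
A minimal sketch of that feedback loop, with a plain dict standing in for the queue or database of pending predictions; the event IDs and helper names are made up for illustration, and the model is just a placeholder.

```python
from river import linear_model

model = linear_model.LogisticRegression()

# Features generated at prediction time, keyed by an event/request id,
# waiting for their label to arrive later.
pending = {}

def handle_event(event_id, features):
    # 1. Predict with the features available right now...
    y_pred = model.predict_proba_one(features).get(True, 0.0)
    # 2. ...and store those same features so they can be joined with the
    #    feedback when it eventually arrives.
    pending[event_id] = features
    return y_pred

def handle_feedback(event_id, outcome):
    # 3. The label arrives later (click / no click, success / failure, ...):
    #    join it with the stored features and learn from the pair.
    features = pending.pop(event_id, None)
    if features is not None:
        model.learn_one(features, outcome)
```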

30:44

And the

30:46

big difference here is that

30:49

you do

30:50

not necessarily have to do

30:52

a training test phase before they

30:54

train your model. You'd like to just deploy model initially and

30:56

it just learns online and then you can

30:59

monitor. A really cool

31:00

thing is that if you do this,

31:02

you have a log of people

31:05

coming on your website,

31:06

you making predictions, you generating features,

31:08

people

31:09

clicking around and

31:10

interact with your recommendations. This

31:13

creates a log of what's happening on your website.

31:15

And so this

31:16

log, what's really cool, is that you can

31:19

offline afterwards After the

31:21

fact, you can process it in the same order it arrives

31:24

in and you can

31:24

replay what the history and

31:27

what happens. So it means that

31:29

if you On

31:30

the side, when you're redeveloping a model or you want to

31:32

develop a better model, you can just take this log

31:34

of events. Run for

31:36

it. And do this prediction

31:38

and training dance the whole

31:40

life cycle. You know, you replaying

31:42

the feedback loop, and then you have a very

31:44

accurate representation of how

31:47

your model would have performed

31:49

on that sequence of events.

31:51

So that's really powerful because

31:53

the way you're designing your model

31:55

there is that you

31:56

have a rough sketch of a model, which you deploy, and then

31:58

you have a log in that model.

32:01

So you know you

32:03

can evaluate the performance of that model, but more importantly, you

32:06

can have a log of the

32:08

events. And then when

32:09

you're designing the version two of

32:12

your model, you have

32:15

a very reliable way to

32:17

estimate

32:17

how your new model would

32:20

have performed. And that's really cool because when you are

32:22

doing

32:22

train test splits

32:24

and batch learning, that is

32:26

not representative of the real world. The

32:29

problem is that what you do with train and test

32:31

is people are

32:31

spending so much time making

32:33

sure that their train test split

32:36

is correct. When in fact even

32:38

having a good train test split is

32:40

not a good proxy of

32:42

the real world. A good proxy of the real world

32:44

is to just

32:45

replay through history.

32:47

So and

32:47

that's something that you can only do on online

32:50

learning. That's

32:50

really cool. Now

32:52

to come to your point about concept

32:54

drift. So concept

32:55

drift is there was many

32:57

different kind of concept drift, and

32:59

Chip Huyen has a very good rundown

33:01

on her blog. What matters really is

33:03

that concept drift, the result of it

33:05

is usually that your model is not performing as well.

33:08

Right? It's gonna be a drop in

33:10

performance. And so the first thing you

33:12

see on your in your monitoring dashboard

33:14

is that the metric has dropped. And

33:16

then when you dig into it, you

33:18

see that maybe there's a class imbalance or

33:20

the

33:20

correlation between the feature and the class has

33:23

changed or something like

33:25

that.
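
One way to put numbers on that monitoring step, sketched here with River's ADWIN drift detector fed with the model's per-sample errors; the attribute names follow recent River releases and may differ slightly in older versions.

```python
from river import drift

detector = drift.ADWIN()

def monitor(y_true, y_pred):
    # Feed the running error signal (1 = wrong, 0 = right) into the detector.
    detector.update(int(y_true != y_pred))
    if detector.drift_detected:
        # The error distribution has shifted: alert, inspect the incoming
        # features, or reset / swap the model.
        print("Concept drift suspected - investigate the incoming data")
```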

33:25

So essentially saying

33:27

that the data the model has been trained

33:29

on

33:29

is not representative of the

33:32

new data that it's been seen

33:34

in production. Again, I have said this a

33:35

few times, but online models, we should put

33:38

them in place with the correct camera offsets up

33:40

to you. They are able to learn as

33:42

soon as possible. So that

33:43

just guarantees that your model is as

33:46

up to date as possible. So you basically made it

33:48

in the best you can. So the drift

33:49

is always possible. You can always obviously, have

33:51

a model that's degrading or that's just going haywire.

33:54

That's not related necessarily to the

33:56

online learning aspects of things.

33:59

And

33:59

so there are also ways

34:00

to cope with this. So

34:03

Dan

34:03

Crankshaw and his team at Berkeley, they

34:05

developed a system called Clipper.

34:08

It's

34:08

a kind of an ops tool. It's a research

34:10

project, but it's it's a it's I think

34:12

it's been deprecated, but it's yeah. There's

34:15

a store there. It's a project where they have a

34:17

meta model, which is kind of looking at

34:20

many models being with in production and

34:22

deciding online which model should

34:24

be used making

34:26

prediction. So it's kind

34:26

of like a teacher selecting the

34:29

best

34:29

student at certain point in time and,

34:31

you know, kind of seeing throughout

34:33

the year how the students are evolving and,

34:35

like, which students

34:36

are getting better or

34:37

not good. And so you can do this

34:39

with Bandits for

34:42

instance. Yeah.

34:42

So just to say that there are

34:43

many ways to deal with concept

34:45

drift and but

34:47

online models, again, help

34:50

to

34:50

cope with concept drift and in just the

34:52

way, actually, it just makes sense more

34:55

so than batch models. And

34:57

so digging now into River itself, can you talk

34:59

through how you've implemented that framework

35:01

and some of the design

35:04

considerations that went into, how do I

35:06

think about exposing this online learning

35:09

capability in a way that

35:11

is accessible and understandable

35:13

to people who are used to building

35:15

batch models. So

35:16

I like to think of River more as

35:18

a library than a framework. If

35:21

I'm not mistaken, framework kind of forces you into

35:23

a certain behavior or way

35:26

to do things. I think there's an inversion of

35:28

control where the framework is kind of designing things

35:30

for you.

35:32

You know, you could look at Keras

35:34

and PyTorch. Keras is very much more a framework in

35:36

comparison to PyTorch, because PyTorch,

35:40

for me, the reason why it was successful is that it kind of

35:42

gave the inversion of

35:43

control towards the user. You can do so many

35:45

things in PyTorch.

35:45

It is very flexible. And it doesn't

35:48

really impose a

35:50

single way of working. So we

35:52

had that in mind with River. River, again,

35:54

is just a library to do

35:56

machine learning, online machine learning, but it's it just contains the

35:58

algorithms. It doesn't really

35:59

force you to, you know,

36:01

load your data

36:04

a certain way or you could use it in a web app. You could use it

36:06

offline. You could use it on an offline

36:08

IoT sensor. River is

36:10

not concerned with that. It's just

36:13

a library that is agnostic

36:15

to all of that. So now to

36:17

come to what River

36:19

is: in

36:20

terms of online machine learning, it's general purpose, so it's not dedicated

36:22

to on the anomaly detection or

36:25

forecasting or classification, it covers all

36:27

of that. That's the ambition at

36:30

least. So just a note there is that

36:32

it's actually really hard to develop and maintain because other maintainers and

36:34

I, we are not necessarily

36:36

asleep

36:37

specialized in different domains, and we kinda have

36:40

to, you know, one day, I'm going to be

36:42

doing working on the

36:43

forecasting module other than

36:46

the other I'm gonna work on it's kinda

36:48

crazy. So it's still

36:50

fun. What we do

36:50

provide is a common interface. So just

36:53

like like it on every

36:56

piece of the puzzle in River follows a certain interface. So

36:58

we have transformers, we have regressors, we

37:01

have anomaly detectors, of course,

37:04

we have classifiers, so binary and multi-class. We

37:06

are forecasting not also time series. And

37:09

so

37:09

we guarantee

37:10

to the user that Each

37:13

model follows a certain API. So every model is

37:15

gonna be able to

37:16

have a learn method, so we can just learn from

37:20

new data. And

37:20

they usually have a predict method to make prediction. So

37:23

forecasters will have a forecast

37:25

method. The anomaly detectors will have a

37:27

score

37:27

method, which is supposed

37:30

to give an anomaly score. And so the

37:32

strength of river is to, yeah,

37:34

provide this

37:36

consistent API for

37:38

doing online machine learning. And it's a bit opinionated

37:40

because

37:40

it's well, it just it likes,

37:43

like, it learn minute. It just says, okay. You're gonna

37:45

have learn and predict, but that's

37:47

a reasonable thing to impose. And

37:50

that makes it easier for users

37:52

to switch in new models

37:54

because they have the

37:56

same interface.
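
In code, that shared interface looks roughly like this — a classifier and an anomaly detector side by side, both driven one sample at a time; the method names follow River's documentation (learn_one, predict_one, score_one), while the feature values are only illustrative.

```python
from river import anomaly, linear_model

classifier = linear_model.LogisticRegression()
detector = anomaly.HalfSpaceTrees(seed=42)

x = {"duration": 4.2, "n_requests": 17}

# Classifier: learn from labelled samples, predict on new ones.
classifier.learn_one(x, True)
print(classifier.predict_one(x))

# Anomaly detector: same one-sample-at-a-time rhythm,
# but it exposes score_one instead of predict_one.
print(detector.score_one(x))
detector.learn_one(x)
```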

37:56

So again, just to conclude on what I said at the start, we made the

37:58

explicit choice to

37:59

follow

37:59

the single responsibility principle

38:04

in that River only manages the machine learning aspect of things

38:06

and not the deployment and whatnot. And

38:08

so if you want to use River

38:10

in production — and we do see

38:12

people doing this, you have to worry about some

38:14

of the details yourself. Right? If you want to

38:16

deploy in a in a web app, well,

38:19

we do not help at the moment. At all

38:21

of that, you have to deploy your own web

38:23

app. As

38:24

far as the overall design

38:26

of the framework, you mentioned, that actually started

38:28

off as two separate projects and

38:30

then you went through the long process

38:33

of merging them together. I'm wondering

38:35

how the overall design and goals

38:37

of the object changed or evolved since you first started working

38:39

on this idea. The reason

38:41

actually

38:41

why the merger between creme and

38:43

scikit-multiflow took us a long

38:45

time was that although

38:47

we were both online

38:48

learning libraries, there

38:50

were some subtle differences which were

38:53

kind of important. So my

38:56

vision with creme

38:56

at the time, and River now, is that we should

38:58

only cater to

38:59

models which are what I call pure online

39:01

models. That is, they can

39:03

learn from a single sample on

39:05

data at a time. But there are also many

39:08

batch models, so models

39:10

which can learn from streaming data,

39:12

but in chunks, so, like, in many

39:14

batches of data. And scikit-

39:16

multiflow was kind of doing

39:18

this. So much like Pytorch

39:20

and TensorFlow and, you know, deep

39:22

learning models. And so I

39:24

kind of have to convince them

39:26

that there were reasons why it was

39:28

just a bit better.

39:30

Why? Because you know, if you think about, again, a user impact on the

39:32

website or just any

39:34

web requests or things that are happening

39:36

in the real life, you

39:39

want to learn as soon as possible. You don't

39:41

want to have to wait

39:43

for, you know,

39:44

thirty two samples to arrive to have

39:47

a batch to be able to feed that

39:49

to your model. You could obviously, but it just made sense to me to have something

39:51

simpler where we only care

39:53

about pure online learning

39:54

because it means that you don't have to store

39:58

anything you

39:59

just learn on the fly.

39:59

And so, I guess,

40:01

the interaction

40:02

I had with Falcon

40:03

whatsoever kind of confirmed

40:05

this idea and

40:07

then I guess,

40:08

you know, they were a bit doubtful when we did the merger

40:10

because and maybe I was a bit too opinionated, but

40:12

it is to be proved that it

40:15

actually made sense. And it's

40:17

not a decision that we look back on.

40:19

Like, we're really happy with this stuff.

40:21

So River is, you

40:23

know,

40:23

arguably a success. It's

40:26

working. It's alive. It's breathing. It's

40:28

been going on for two years and a

40:30

half of the project.

40:31

And so we have a

40:33

steady intake of users

40:34

that are adopting it. And, you know, we get we

40:36

see this from emails we receive, from GitHub

40:38

the discussions and issues and just

40:40

general feedback again. So

40:42

the idea

40:44

of having a library that is only focused

40:47

towards ML and

40:48

just the algorithms is something

40:51

that we I'm just gonna keep

40:51

going with because it just it looks like it's

40:54

working and it looks like this is what people

40:56

want. You know, a simple

40:58

example is, hey, I want to compute a metric

41:00

online. Well, River aims to be

41:02

the

41:02

go to the library to answer those

41:05

kind of questions. Right?

41:08

The

41:08

the truth is that people they don't just

41:10

need that. They also need ways to

41:12

deploy these models and do

41:14

MLOps online. So, well,

41:17

we basically did the the next steps are for

41:19

us to build new tools in

41:21

that direction.

41:22

Now we also

41:24

think that The

41:24

initial development of River was a bit more fast and furious. The

41:26

aim was to implement as many

41:29

algorithms as possible and, you know, just

41:31

to cover the wide spectrum of

41:34

machine

41:34

learning, now that we've,

41:35

you know, covered quite a

41:37

few topics, and we also have

41:39

day jobs. So when I

41:41

was developing a I was in a

41:43

PhD. So, ironically, I had more time than now, because I have a proper

41:46

job. But we value our time a

41:48

bit more, and we're not in this fast and furious

41:50

mode. We kind

41:52

of just focused on taking certain models which are

41:54

valuable and we see value in and just

41:56

spending time to implement

41:58

them properly.

42:00

And we also see the final aspect is that we see that people

42:02

they don't just want well, our user

42:04

base doesn't just want algorithms.

42:07

They also want us to educate them. So they have

42:09

general questions as towards, you know, what

42:11

is online

42:11

learning? And how do I do it? And how

42:13

do I decide what model to

42:16

use? And all

42:16

the questions that we're covering in this podcast, basically. So I

42:18

think there's a huge need

42:20

for us to kind of move

42:23

into educational aspect. So when

42:26

I was younger, scikit-learn was my bible.

42:28

I kind of just spent so much time

42:30

not even using it. Not

42:32

even just using the code, but actually just reading through the documentation because it's

42:35

just so excellent. So, obviously,

42:37

that takes a lot of time, a

42:39

lot of energy.

42:41

People, contributors

42:42

and help, but definitely something towards

42:45

which we are moving. In

42:46

terms

42:47

of the overall

42:50

process of building the model using something like

42:52

River. When people are building a

42:54

batch model, they end up getting a

42:56

binary artifact

42:58

out that is the entire state of that model after it

43:00

has gone through that training process. And I'm

43:02

curious if you can talk to how

43:06

River manages the stateful aspect of that

43:08

model as it goes through this

43:11

continual learning process, both

43:14

in a, you know, sandbox use case where somebody's

43:16

just playing around on their laptop, but

43:18

also as you push it into production where

43:21

maybe you want to able to use this model

43:24

and scale out serving it across a

43:26

fleet of different servers and just some

43:28

of the state management

43:30

that goes into being able to

43:32

continually learn as new information is

43:34

presented to it.

43:36

So

43:38

the great advantage of batch learning is that once you train your model,

43:40

it's essentially a pure function.

43:42

There are no side effects.

43:46

The,

43:46

you know, decision

43:47

process that's underlying the model is not

43:49

gonna change. So, you know, you

43:51

can push the envelope and compile

43:53

it. You can pickle it, you can convert

43:55

it to another format. So that's what ONNX does. You

43:58

can compile it so they can run

43:59

on a

43:59

mobile device.

44:02

I mean, it

44:03

does not need the ability

44:05

to

44:05

train anymore. It's just basically a

44:08

Python function. Oh, just it's just a

44:10

function basically that takes an

44:11

input and amp or something. So there's

44:13

also

44:14

a good reason why batch learning is predominant.

44:16

But with river, it's

44:19

different because online they need to keep this

44:21

ability to learn. So that's what

44:23

you've been saying. So it's

44:25

actually kind of

44:26

straightforward, but

44:28

the internal representation of

44:31

most models in River is

44:34

fluid, dynamic. It's usually stored in

44:36

dictionaries that can increase

44:37

and decrease in size.

44:39

So

44:39

imagine you have a new feature that arrives in

44:42

your stream, well, every model we have a

44:44

copes with that. They're not static.

44:46

There's a new feature that appears? Well,

44:48

They handle it gracefully. So for instance,

44:50

and then the regression model is just

44:52

going to add a new weight to its internal dictionary

44:55

of weights. Now in terms

44:57

of serialization and pickling

45:00

and whatnot, River

45:01

is mostly pure Python. Well,

45:03

basically, River stands on

45:05

the shoulders of

45:06

Python,

45:07

very much so. So we do not

45:09

depend very much on NumPy or pandas or scikit-learn. We mostly depend

45:12

on Python standard library. We use

45:14

dictionaries a

45:16

lot. And that plays

45:17

really nicely with the

45:17

standard library. It's very easy just to,

45:20

you can take any River model,

45:22

pickle it,

45:23

and just save it. You can

45:25

also just dump it to JSON or

45:27

whatnot. Also, the

45:30

paradigm of you train a

45:32

model, you pickle it, and you have an

45:34

artifact that you can upload anywhere. It's a

45:36

bit different with online learning because you

45:38

would treat this differently.

45:40

You would maintain your model in memory. So if you have a

45:42

web server serving your model, you

45:44

would not just load the model to

45:46

make prediction, you would just keep it

45:48

in memory. And

45:50

make predictions with it because it's in memory, so you don't

45:52

have to load the anymore, preload

45:54

it. And then when a

45:56

sample arrives, your model is in

45:58

a position to make it learn from that. So, yeah, I think the big difference is that you

46:00

hold your model in memory

46:02

rather than pickling it to the disk

46:06

and loading it when necessary.
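
A bare-bones sketch of that pattern with Flask, which was mentioned earlier: the model lives in the process's memory, and every request either queries it or updates it in place. The route names and the lock are illustrative, and a real deployment would still need to think about persistence and concurrency.

```python
import threading
from flask import Flask, jsonify, request
from river import linear_model

app = Flask(__name__)
model = linear_model.LogisticRegression()
lock = threading.Lock()  # learn_one mutates the model, so guard updates

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()  # JSON object maps directly to a dict
    proba = model.predict_proba_one(features).get(True, 0.0)
    return jsonify({"probability": proba})

@app.route("/learn", methods=["POST"])
def learn():
    payload = request.get_json()  # {"features": {...}, "label": true}
    with lock:
        model.learn_one(payload["features"], payload["label"])
    return jsonify({"status": "ok"})
```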

46:08

In terms of the use of

46:08

the dictionary as that internal

46:12

state representation as you said, it gives

46:14

you the flexibility to be able

46:16

to evolve with the

46:18

updates and data. But at

46:20

the same time, you have this

46:22

heterogeneous data structure

46:25

that can be mutated as

46:27

the system is in flight

46:29

and you don't necessarily have strict

46:31

schema being applied to it. And I'm just curious if you

46:34

can talk to the trade

46:36

offs of being able to

46:38

add that flexibility, but also lacking in some of

46:40

the validation and, you

46:42

know, schema and structure information that

46:44

you might want in something

46:47

that's dealing with these volumes of data.

46:49

So, yeah,

46:50

we will use dictionaries. So

46:52

the advantages of dictionaries are

46:55

manyfold. First of all, a very important thing

46:57

is that dictionaries are

46:59

to lists what data frames

47:01

are to NumPy arrays. So

47:04

a dictionary has names

47:06

that's very important. So it means that each

47:08

one of your features actually has a

47:10

name to it. And I find that hugely important because, you know, we

47:12

always see features as just numbers, but they also

47:14

have names, and that's just

47:16

really important. Imagine

47:18

you have a bunch of features coming

47:20

in. Now if that was

47:22

a list or a NumPy array, you

47:24

have no real way of

47:26

knowing which column corresponds to which

47:28

variable. If you switch two columns with each other,

47:30

that could just be a really silent

47:32

bug, which will affect you, whereas

47:36

if you name each feature, then if the column order changes,

47:38

well, the names of the columns are being

47:40

permuted too. So you can

47:42

kind of identify that. So

47:45

what's really cool with dictionaries, and

47:47

with River, is that the order

47:49

of the features that you're receiving

47:51

doesn't matter. Because

47:52

we access every feature by name and not by

47:54

position. Dictionaries

47:55

are also mutable in size. So, you know, if

47:57

a new feature arrives

47:58

or a feature

47:59

disappears between two

48:02

different samples, that

48:05

just works. So it's really cool. Also, dictionaries,

48:07

when you think about

48:09

it, are naturally sparse. So

48:12

imagine that on a Netflix-like product, the features that

48:15

you receive are the name of the

48:17

user,

48:17

some count: one,

48:19

or, you

48:21

know,

48:21

the date, and so on.

48:23

You can

48:23

just store sparse information

48:26

in a dictionary. That's kind of really useful.
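For illustration, a small sketch with made-up feature names, showing that keys can come, go, and arrive in any order:

    from river import linear_model

    model = linear_model.LogisticRegression()

    # Features are sparse dicts looked up by name, never by position.
    model.learn_one({"user_age": 33, "watch_minutes": 42}, True)
    model.learn_one({"watch_minutes": 7, "is_weekend": 1}, False)  # a key appears, another is missing: both fine

    print(model.predict_proba_one({"is_weekend": 1, "user_age": 20}))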

48:29

There's this robustness principle that we

48:32

follow with River. The robustness

48:34

principle is that you should

48:36

be conservative in what you do,

48:38

but liberal in what you accept.

48:40

So we are very liberal in that

48:42

we accept heterogeneous

48:44

data, as you said. So dictionaries of

48:46

different sizes, dictionaries which

48:48

have different orders or whatnot, but

48:50

that is really flexible for users.

48:53

So a common use case is to deploy River in

48:55

a web app. And in a web app, you're

48:58

receiving JSON data a lot of

49:00

times. So

49:02

the fact that JSON data has a one to one relationship with

49:04

Python dictionaries makes it very easy

49:06

to integrate it into a

49:08

web app. If you have a

49:11

regular batch model, you have to

49:13

mess about with casting the

49:15

JSON data to

49:16

a NumPy array.

49:18

And, you know, that has a cost actually. It actually has a cost

49:20

because although NumPy,

49:22

PyTorch, and TensorFlow are

49:25

you know, good at processing matrices, there is actually

49:28

a cost that comes with

49:30

taking native data, such as dictionaries,

49:32

and casting

49:34

them to a higher-order data structure, such as

49:36

a NumPy array. That has a

49:38

real cost in a web app where, you

49:40

know, where you're measuring things in terms of

49:42

milliseconds. Well,

49:44

you're spending a lot of your time just

49:46

converting your JSON data to Numpy.

49:49

Well, with River, because it

49:51

consumes dictionaries as well, the

49:53

data you receive, I don't know if you're coding

49:55

in Django, Flask, FastAPI. The data you

49:57

receive in your request is a dictionary,

49:59

so you don't have to convert the data.
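As a hedged sketch of what that looks like in practice (the route names and payload shape are made up):

    from flask import Flask, request, jsonify
    from river import linear_model

    app = Flask(__name__)
    model = linear_model.LinearRegression()  # held in memory for the lifetime of the process

    @app.route("/predict", methods=["POST"])
    def predict():
        x = request.get_json()  # already a plain dict, no conversion to arrays needed
        return jsonify({"prediction": model.predict_one(x)})

    @app.route("/learn", methods=["POST"])
    def learn():
        payload = request.get_json()  # e.g. {"features": {...}, "y": 3.2}
        model.learn_one(payload["features"], payload["y"])
        return jsonify({"status": "ok"})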

50:02

It just runs. So actually, if you take a River

50:04

model, like a linear regression in River, and a

50:06

linear regression in Torch, it's

50:08

actually gonna be much faster with

50:10

River because there's no

50:12

conversion costs. Plus, the features

50:14

are named. And plus, you just

50:16

don't have to worry about these features being in a mixed

50:18

order or anything. So it just makes a lot

50:20

of sense, really, to use dictionaries. But

50:22

there are pitfalls, obviously; it's not

50:24

perfect. Well, actually, I kind

50:26

of disagree that there's a problem with the

50:28

structure of dictionaries. I actually think that

50:30

dictionaries, well, if you wanted

50:32

to, you could create, like, a schema in

50:34

Python. You can actually use a data class,

50:36

and you can convert that to a dictionary and feed

50:38

that into your

50:39

model. The data

50:40

class helps you to create structure. So I

50:42

don't think that's really a problem. Quite the contrary, I

50:44

think. The fact is also that dictionaries

50:47

can be nested. Maybe the features that you're feeding to the

50:49

model don't have to be a flat dictionary. They can

50:51

actually be nested. And that's really cool too. You know, you

50:53

can have features for your user. You can have

50:55

features for your page,

50:57

the features for the day, if

50:59

anything. Those are things that

51:00

you cannot necessarily do with a flat

51:02

structure, such as a data frame

51:04

or a NumPy array.
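A small sketch of both ideas; the dataclass fields and the flattening convention are made up, and depending on the estimator you may need the flat form:

    from dataclasses import dataclass, asdict

    @dataclass
    class Ride:
        distance_km: float
        hour: int
        passengers: int

    x = asdict(Ride(distance_km=3.2, hour=18, passengers=1))  # a plain dict, ready for learn_one

    # Nested dictionaries can organise features by group; flatten if the model expects flat keys.
    nested = {"user": {"age": 33, "tenure_days": 120}, "page": {"category": 4}}
    flat = {f"{group}_{name}": value
            for group, feats in nested.items()
            for name, value in feats.items()}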

51:06

Anyway, I'm talking about benefits when I should be talking about cons. But yeah,

51:08

I guess just the main con of

51:11

processing

51:11

dictionaries is that

51:13

you know, if you want

51:13

today, to have a model process a

51:16

million samples, it would

51:18

take much more time than

51:20

processing a million

51:22

samples with pandas or NumPy. Because,

51:24

yeah, the point of River is to process data

51:26

one sample

51:28

at a time.

51:30

But

51:30

not necessarily processing a million samples at a time. But

51:32

those are two

51:33

different problems. So although,

51:36

you know, if you take tools

51:38

like scikit-learn, their

51:40

goal is to be able to process

51:42

offline data really quick. But the

51:44

goal of River is

51:46

to process online data, single samples, as fast as possible. And you're

51:48

comparing apples and oranges if you wanna do the

51:50

comparison. It just doesn't work. So yeah.

51:52

Actually, you know what? I don't think

51:54

there are downsides to using

51:56

dictionaries. It just helps a lot.

51:58

And to confirm this, we

51:59

have a lot of users who tell us

52:01

about this. They say, well, it's actually fun to use

52:04

River. It just makes sense because

52:06

it's very close to the data structures I use in Python.

52:08

I don't have to introduce a new data structure

52:11

to my system. So

52:13

for somebody who is

52:16

using River to build a machine

52:18

learning model, can you just talk through

52:20

the overall process of going from idea through

52:22

development to deployment? I'm going

52:24

to rehash what I said before,

52:25

but I think the great

52:27

benefit of online learning and

52:30

the river is that you can cut the R and

52:32

D phase. So I've seen so many projects

52:34

where there's an R and D phase

52:38

and the model, you know, gets validated at some point in time,

52:40

but there's, like, a real

52:42

big gap of time between the

52:44

start of the R and D phase and

52:48

the moment when the model is deployed. And the process

52:50

of using River and

52:52

streaming models in general is

52:55

to actually as I said, deploy the model as soon

52:57

as possible, monitor its predictions,

53:00

and

53:00

it's okay because

53:01

sometimes that model will, you

53:03

know, you can deploy it

53:06

in production, and those predictions

53:08

do not necessarily have to be served to

53:10

the user. So you just made the

53:12

predictions, you monitor them, and it creates

53:14

a log of training data

53:17

and predictions and features. And that's what

53:19

you call shadow deployments. You have

53:21

a model which is deployed,

53:24

is making predictions, but those predictions

53:26

are not being used, you know, to inform

53:28

decisions or to change, to

53:30

influence the behavior of users. They just exist for the sake of

53:33

existing and for monitoring. One thing to

53:35

mention is that once you

53:37

deploy this model, you have your

53:39

log of events. That's the phase where you want to

53:42

maybe design a new model. And

53:44

you're going to have

53:46

this model replace the

53:48

existing model in production or coexist with it

53:50

in one way or another. So

53:52

I mentioned that you can

53:54

take your log of events and

53:56

replay it in the order in which

53:58

it arrived and have a good idea of how

53:59

well your model would

54:01

have performed. That's

54:03

called progressive validation. So

54:06

it's just this idea that if you

54:08

have your log of events, every time

54:10

you're first gonna make a prediction and then you're

54:12

gonna learn from it.
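In River that loop is packaged up as progressive validation; a minimal sketch using one of the bundled datasets (assuming the current evaluate API):

    from river import datasets, evaluate, linear_model, metrics, preprocessing

    model = preprocessing.StandardScaler() | linear_model.LinearRegression()

    # Replays the stream in order: predict on each sample first, then learn from it.
    metric = evaluate.progressive_val_score(
        dataset=datasets.TrumpApproval(),
        model=model,
        metric=metrics.MAE(),
        print_every=500,
    )
    print(metric)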

54:12

So I have a good example.

54:15

There's

54:15

a dataset on Kaggle called the New

54:17

York taxis dataset and it's basically

54:19

a log of people

54:21

hailing a taxi to arrive:

54:23

they depart from a position, and

54:26

they arrive at another position later in

54:28

time. And

54:30

so the goal of a machine learning system

54:32

in this case could be to predict how long the taxi

54:34

trip is going to last. So

54:37

when the taxi departs, you want

54:40

your model to make a prediction: how long is

54:42

this taxi

54:42

trip going to last? And so maybe that's gonna

54:44

inform, I don't know, the cost

54:47

of the trip or

54:48

it's gonna help decision makers, you

54:50

know, manage the taxis or, I don't

54:52

know, whatever. But you can imagine

54:54

that this is a great feedback loop because you

54:56

have your model make a prediction. And then later,

54:59

maybe eighteen minutes later or something, you

55:01

have the ground truth arrive. So you know

55:03

how long the taxi trip actually lasts, and

55:05

that's your ground truth. And then

55:07

you can compare your model's prediction with it. And that

55:10

enables progressive validation because you have

55:12

a log of events. You have when the

55:14

taxi trip

55:16

departed, what value my model predicted, what

55:18

features it used,

55:19

at prediction time.

55:21

And later on, I

55:23

have the ground truth. And

55:25

so I can just replay the

55:27

logs of events for, you know, I

55:29

don't know, seven

55:32

days and progressively

55:34

evaluate my model. So I like

55:36

this taxi example because it's easy

55:38

to reason about and, you know, taxis are

55:40

easy

55:41

to understand. But the taxi example is really what online learning is

55:43

about. It's about this feedback

55:45

loop between predicting and

55:48

learning. And just to

55:50

remind you how it would be

55:52

with a batch model is that you would have your taxi

55:54

dataset and well, I don't

55:56

know. You would split your dataset in two. You

55:58

would have start of the week, end of the week, train your

56:00

model on the start of the week, evaluate on the rest

56:03

of the week. Oh, no. The data

56:05

I trained on for the start of the week

56:07

is not representative of the weekend.

56:09

Yeah. And it just becomes a

56:11

bit weird. It becomes this situation where

56:13

you're trying to reproduce

56:16

conditions in real life, but you're never really sure

56:18

of it. And you can only really know how well your batch model is

56:20

going to do in

56:22

production. And

56:23

online learning, it just kind

56:25

of encourages you to go

56:28

for it, deploy your model straight away and not have to have

56:30

this weird R and D phase where you live

56:32

in a lab. You think you might be

56:35

right, but then you're not really sure. And, yeah, online learning

56:37

just brings you closer to deployment, in my

56:40

opinion.

56:40

As

56:41

you have been developing

56:44

this project and helping

56:46

people understand and adopt it.

56:48

What do you see is some

56:50

of the conceptual challenges or complexities that

56:52

people experience as they're

56:54

starting to adapt their thinking to how

56:56

to build a machine learning model in this

57:00

streaming format versus the

57:02

batch oriented workflow where they do have to

57:04

think about the train test split, just

57:06

the overall shift in the

57:09

way that they think about developing these models?

57:11

It's a hard question, but I think

57:13

there's two

57:13

aspects. There's the online learning aspect, and

57:15

then there's the

57:18

MLOps aspect. Now in terms of ML

57:19

Ops, I think I covered it enough,

57:22

but it's much like a batch model.

57:23

You

57:24

have to

57:26

deploy your model, which maybe means serving it behind an app. As I mentioned,

57:29

the ideal situation is to

57:31

have your model loaded

57:34

into memory, making predictions and

57:36

training. All of that is really

57:38

harder to do than

57:40

to say.

57:42

The truth is that there's actually no framework out

57:44

there, which allows you to do this. You could

57:46

do this yourself. This is what we

57:48

see. I mean, we get users who ask

57:50

us questions in a variety of contexts,

57:52

on GitHub or in emails, and

57:54

they're asking us

57:55

how do I deploy a model? What

57:57

should

57:57

I be doing? And we always give

57:59

the same

57:59

answers. But

58:00

the fact is

58:02

that, you know, we have these users who have

58:04

basically embraced River and they understand it,

58:06

but then they get into the production phase.

58:08

And

58:08

that's not what we're trying to solve. Well, we

58:11

feel bad because

58:12

they are all making the same mistakes

58:13

in some way.

58:16

And River is not there to help them because that's

58:18

not the purpose of River. So, yeah,

58:20

there's

58:21

a lack of tooling

58:23

to actually just, you know,

58:25

deploy an online model. Oh, yeah.

58:27

That's the MLOps aspect. I think

58:29

in terms of online learning itself,

58:31

a big challenge is that not everyone has the

58:34

luxury to have a

58:36

PhD during which you can spend

58:38

days and

58:40

nights going through online learning papers and trying to understand it. And

58:42

that's what I

58:43

and others had the

58:44

chance to do, but a lot of River

58:47

users, our users, you

58:48

know, they see the

58:49

value of online learning and they want to

58:51

put it into production, but they have deadlines to meet.

58:53

Right? They have to ship their project in

58:55

six weeks and they just don't

58:57

have the time to understand things in detail.

59:00

So things like, well, I

59:02

just described progressive validation.

59:04

It kind of takes them a bit of

59:06

time to understand. And so

59:07

again, what we need to do is

59:10

to spend

59:10

more time creating resources or,

59:13

you know, just diagrams

59:16

to explain what online learning is about. And that in terms

59:18

of library design, it's

59:20

really important. Right? If we want to

59:22

introduce a new method to all our

59:24

estimators, I would

59:26

be against it; like, the whole point of River is to make it as simple as

59:28

possible so that people can

59:29

just, you know, be

59:31

productive,

59:31

understand that.

59:33

So, yeah, I think that just to encapsulate those two

59:36

problems is that people do not necessarily

59:38

have the resources to

59:40

learn about online learning. And

59:42

then there are operational problems

59:44

around serving these models

59:45

into production. So it's

59:48

kind of

59:48

like a batch model because you

59:50

have to serve the model behind an API, you know, and you have

59:53

to monitor it. And these are things well, you know, that

59:55

are common to a batch model,

59:57

but there's the added complexity

1:00:00

of having your model being, you know,

1:00:02

maintained in memory and kept learning

1:00:04

and stuff and things that are

1:00:07

basically not common. I mean, if you have to

1:00:09

Google it or find something on GitHub, you

1:00:11

just kind of find these hacky

1:00:13

projects, but no real good tool allowing you

1:00:15

to do that, at least not yet. One of the things that we

1:00:17

didn't

1:00:17

discuss yet is the types

1:00:20

of machine learning use

1:00:22

cases that

1:00:24

River supports, where I'm speaking specifically to things like logistic

1:00:26

regressions and decision trees versus deep

1:00:28

learning and neural networks. And I'm just

1:00:30

wondering if you can talk to the

1:00:34

types of machine learning approaches

1:00:36

that River is designed to support

1:00:38

and some of the reasoning

1:00:40

that went into where you decided

1:00:43

to put your focus?

1:00:45

Well, River,

1:00:45

again, is a general purpose

1:00:47

library, so there's quite a few things.

1:00:49

There are some cases

1:00:50

or flavors of machine learning, which are

1:00:53

especially interesting when you

1:00:55

cast them in an online

1:00:57

learning scenario. So

1:01:00

If you're doing anomaly detection, for instance, you have

1:01:03

people

1:01:03

doing transactions in

1:01:05

a banking system, so they're

1:01:07

making the payments. And you might

1:01:09

want to be doing anomaly detection to detect

1:01:12

fraudulent

1:01:14

payments. That is very much

1:01:16

a situation where you have

1:01:18

streaming data. And

1:01:19

so in that case,

1:01:20

you would like to be doing online anomaly

1:01:23

detection.
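For example, a minimal sketch along the lines of River's documented anomaly detection recipe; `payments`, the threshold, and `flag_for_review` are stand-ins:

    from river import anomaly, compose, preprocessing

    # Half-space trees expect features scaled to [0, 1], hence the scaler in front.
    model = compose.Pipeline(
        preprocessing.MinMaxScaler(),
        anomaly.HalfSpaceTrees(seed=42),
    )

    for x in payments:                # a stream of feature dicts describing transactions
        score = model.score_one(x)    # higher score means more anomalous
        model.learn_one(x)
        if score > 0.8:               # threshold chosen purely for illustration
            flag_for_review(x, score)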

1:01:24

So we see that

1:01:26

every time we put out a notebook

1:01:29

or a new anomaly detection method,

1:01:31

a lot of people start using it.

1:01:33

We start having bug reports

1:01:36

and whatnot. So it's kind of surprising.

1:01:38

It's a good thing. But, yeah, I think there are

1:01:40

modules and aspects of

1:01:43

River which clearly bring a

1:01:46

lot of value to users. So that would be a

1:01:48

anomaly detection, but

1:01:50

also we have forecasting models. So

1:01:52

when you do online forecasting, that just

1:01:54

makes sense when you have sensors which are, I don't know, measuring the

1:01:56

temperature of something, keeping up to

1:01:58

date in real

1:01:59

time. There's also a

1:02:01

good example I have is we

1:02:03

have this engineer who's working on water pipes

1:02:05

in Italy. He's

1:02:07

trying to predict

1:02:08

how much water is going to

1:02:10

flow through certain points in his pipeline.

1:02:14

So he has sensors all over the

1:02:16

pipeline, and he's trying to just do a

1:02:18

forecasting model. And so it just makes so much sense for him

1:02:20

to be able to have his model run

1:02:22

online

1:02:22

inside the

1:02:24

sensors, right where the IoT

1:02:26

system is running.
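As a hedged sketch, using one of River's time series models (the orders and the `sensor_readings` stream are made up):

    from river import time_series

    model = time_series.SNARIMAX(p=2, d=1, q=2)

    for y in sensor_readings:          # a stream of flow measurements from one sensor
        model.learn_one(y)

    print(model.forecast(horizon=12))  # forecast the next 12 readings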

1:02:28

So just all that to say

1:02:30

that there are some

1:02:32

more exotic parts of River,

1:02:34

such as anomaly detection and forecasting, which

1:02:36

probably bring more

1:02:38

value than the

1:02:39

classic modules such as,

1:02:42

you know,

1:02:42

classification and regression. Again, at

1:02:44

the start, I talked about

1:02:47

Netflix recommendations. So

1:02:49

we

1:02:49

have some very

1:02:51

basic

1:02:52

bricks to be able

1:02:53

to make recommendations. Well, we

1:02:55

have factorization

1:02:56

machines and we have some kind

1:02:58

of

1:02:58

ranking system so that if you have users and

1:03:00

items, you can build

1:03:02

a ranking of preferred

1:03:05

items for a user. So we have

1:03:07

this kind of exotic machine

1:03:09

learning cases, which provide

1:03:12

value, but require us

1:03:14

to spend a lot of time to

1:03:16

work

1:03:17

on them, basically. So

1:03:19

It's

1:03:19

very difficult for me and for

1:03:21

other contributors to be specialized in

1:03:23

anomaly detection, time series

1:03:26

forecasting recommendation. But,

1:03:27

yeah, all this to say that

1:03:29

River covers a wide spectrum:

1:03:31

you can do preprocessing, you can

1:03:33

extract features, you can do classification, regression,

1:03:35

and forecasting, anything. It's just that,

1:03:38

well, because it's online, it's a bit

1:03:40

unique. In your

1:03:40

experience of working

1:03:43

with the River library and

1:03:45

working with end users of

1:03:47

the tool, what are some of the most

1:03:49

interesting or innovative or unexpected ways that you've

1:03:51

seen it used? Well, unexpected

1:03:52

is a good one. There's one thing

1:03:54

that comes to mind. We have this person

1:03:56

who is a beekeeper. So a person who is, you know,

1:03:58

taking care of bees

1:03:59

and, I guess, once a week or

1:04:02

every two

1:04:04

weeks, they go to the beehive and they pick up the honey

1:04:06

in the beehive. And this

1:04:08

person has many beehives

1:04:10

and so they don't have to waste

1:04:12

their time going into the

1:04:13

beehive and actually checking if there's honey or

1:04:16

not. So they have a sensor. They have

1:04:18

sensors that are in each

1:04:20

beehive.

1:04:21

They're kinda measuring how much honey

1:04:23

is in each beehive, and he likes to forecast how much

1:04:25

honey he is expected to have

1:04:27

in, you know, the

1:04:29

weeks to come, based on the weather, based on

1:04:32

past data, based on, I don't

1:04:34

know, what information

1:04:36

he uses. It was really, really

1:04:38

just fun just to see this person

1:04:40

doing this hackish project where they

1:04:42

just thought it would be fun to use online learning to

1:04:44

do it. And again, there was

1:04:46

an IoT context, so that made sense. Another one:

1:04:48

I was kind of impressed when I heard about

1:04:50

this project of having a

1:04:54

you know, a model within each car to determine where your

1:04:57

destination would be. So, I don't know, you

1:04:59

wake up in the morning, you take

1:05:02

your car, you

1:05:03

know, if it's a weekday you're going to

1:05:05

work? It sounds silly obviously,

1:05:07

but having this

1:05:10

idea of having one model per user is is kind of fun.

1:05:12

The most impactful project I heard

1:05:14

about, and which I know is being used,

1:05:18

is a situation where this company, they

1:05:20

prevent cyberattacks. So they

1:05:23

monitor this grid of

1:05:25

servers and computers. And

1:05:28

they're monitoring traffic between

1:05:30

machines. And so they're trying to

1:05:32

understand

1:05:34

when some

1:05:35

of the traffic is malicious and a hacker is basically trying

1:05:37

to get into the system. You

1:05:40

can

1:05:40

tell by

1:05:42

looking at the patterns of

1:05:44

this traffic.

1:05:45

Right? And the

1:05:47

trick is that

1:05:49

behind the traffic, the malicious

1:05:52

traffic, there are hackers, and they're constantly

1:05:54

changing their patterns,

1:05:56

their

1:05:56

access patterns, to actually not be

1:05:59

detected. And so

1:05:59

if

1:05:59

you manage to label

1:06:02

traffic as, you

1:06:03

know, malicious,

1:06:04

well, you want your model to

1:06:06

keep learning. So they have this system where they

1:06:08

have, like, thousands of machines and they have

1:06:10

a few machines that are dedicated to

1:06:12

just learning from the traffic and

1:06:15

in real time. That's

1:06:16

learning, detecting anomalous traffic,

1:06:18

sending it to human beings so

1:06:21

that they can actually verify it themselves,

1:06:24

label it, etcetera.

1:06:25

And so it's really cool to know that River has been used

1:06:28

in that context. Like, it

1:06:30

just made so much sense for them

1:06:32

to say, wow, we can

1:06:34

actually do this online and we don't have to

1:06:36

retrain the model. And

1:06:36

batch learning was getting in their

1:06:39

way. They

1:06:39

had this system which was going a

1:06:41

thousand miles an hour, just hundreds

1:06:43

of thousands of data points coming in all the time. And batch learning

1:06:45

was just,

1:06:45

you know, it was just, again,

1:06:47

annoying for

1:06:48

them.

1:06:50

Having

1:06:50

a system that enables them to do all this online just made sense for them. And to

1:06:52

know that you can do this at such a

1:06:55

high amount of traffic. It was

1:06:57

really cool and exciting. In

1:06:59

your

1:06:59

own experience of building the

1:07:02

project and using it for your own work, what are some of

1:07:04

the most interesting or unexpected or

1:07:06

challenging lessons that you've learned in

1:07:08

the process? I

1:07:08

think I'm just gonna focus a bit on a human aspect there. But overall, I

1:07:10

have been doing open source,

1:07:13

you know, quite a bit.

1:07:15

I've always had this approach

1:07:18

where I probably work too

1:07:20

much on new projects that I

1:07:22

make myself rather than on existing

1:07:25

projects. I'd just rather do my own thing by

1:07:27

myself, rather than contribute just to

1:07:29

existing stuff. It's

1:07:31

not always a good thing, but it's just the way I

1:07:34

work. And so River is really the

1:07:36

first open source project where I work with

1:07:38

other people. So, like,

1:07:40

probably many people, a lot of my

1:07:42

open source work is stuff I just work on by

1:07:44

myself. And obviously, when you're working

1:07:46

in companies,

1:07:48

you probably have a review process and you work with other people. But

1:07:50

this is the first open source

1:07:52

project where I really work with a team

1:07:54

of people. And

1:07:56

it's fun. It's just really so much

1:07:58

fun. Like, just a month ago,

1:07:59

we

1:07:59

actually got to meet all together and

1:08:02

to have this like

1:08:04

informal get-together. So that was really

1:08:06

fun. And you've realized

1:08:08

that, you know, after three years, there

1:08:10

are ups and downs, and there's moments where you

1:08:12

just do not want to work on River anymore,

1:08:14

and you want to, you know, you have work, you have friends,

1:08:16

girlfriends, whatnot. And so the only

1:08:18

way to subsist as an open source

1:08:20

project in the long term is to

1:08:23

have multiple people working on it. So doing

1:08:25

open source, you know, it's not realistic to do it on your own if you

1:08:27

want something to be successful and to actually

1:08:29

have an impact in the

1:08:31

long term. So it's

1:08:35

actually really important to just be nice

1:08:37

and to

1:08:37

have people

1:08:38

around you who help you. And

1:08:41

although not everyone contributes as much

1:08:43

as I do or core maintainers

1:08:45

do. People help a lot, and they make things alive. Like, it's always a

1:08:47

joy when I open an

1:08:51

issue on GitHub. And I see that someone

1:08:53

from the community has answered the question, and I don't have to do anything. It

1:08:55

helps tremendously.

1:08:56

Yeah. We've already

1:08:58

talked

1:08:59

a bit about some of

1:09:01

the situations where online learning might not be the right choice. But for the case

1:09:03

where somebody is going to use an

1:09:08

online streaming machine learning approach? What are the cases

1:09:10

where River is the wrong choice and maybe there's a different library or framework that

1:09:14

would be better suited? Well, yeah. Again,

1:09:15

honestly, I think that online learning is the wrong

1:09:17

choice in ninety five percent of cases.

1:09:19

Like, you do not want to make

1:09:21

the mistake of thinking that your problem

1:09:24

is a online problem. You probably, most

1:09:26

of the time, have a batch problem that you can solve with a batch library. You know, I

1:09:28

mean, with something like scikit-learn.

1:09:30

If you pick it up and you just

1:09:33

use it, it is always going to work reasonably well. So sometimes I would

1:09:35

just go for that. One thing we do get a lot is people asking how you

1:09:37

can do deep learning with River. So they

1:09:39

want to train deep

1:09:43

learning models online. So the answer is that we do

1:09:45

have a sister library,

1:09:47

and

1:09:50

it's

1:09:52

dedicated to training Torch

1:09:52

models online. But again, that is a bit

1:09:54

unfinished at the moment and still needs some work being done

1:09:56

on it. But, yeah, if

1:09:58

you want to

1:09:59

be doing deep

1:10:00

learning and you want to be working

1:10:03

with images and sound and, you know, unstructured data, River is not the right

1:10:05

choice. That's not its domain, and you probably

1:10:07

have to look at other tools.

1:10:10

As you

1:10:11

continue to build and iterate on the River

1:10:14

project, what are some of the things you

1:10:16

have planned for the near to

1:10:18

medium term, or any applications of this online learning

1:10:20

approach that you're excited to dig

1:10:22

into? We have a public road

1:10:24

map; it's a Notion page.

1:10:26

We have a list of stuff we're working

1:10:28

on. That one mostly

1:10:30

has a list of algorithms to implement, and it's mostly there to,

1:10:34


1:10:35

you know, make

1:10:36

people know what we're working on

1:10:39

and to encourage new contributors to work on something. So the

1:10:41

few contributors

1:10:44

we have just pick what they want

1:10:46

to work on and, you know, just in general order of preference. So for instance, me this

1:10:48

summer, I decided

1:10:51

to work on online covariance

1:10:53

matrix estimation. Actually, an online covariance matrix

1:10:55

is kinda useful

1:11:00

because it's very useful in financial trading, for instance. And if you

1:11:02

have an inverse covariance matrix that you can estimate online,

1:11:04

that unlocks so many

1:11:07

other algorithms such as Bayesian

1:11:09

inference, the elliptic envelope method for outlier detection, Gaussian

1:11:11

processes, and whatnot.
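A from-scratch sketch of the idea (not River's implementation): keep running sums so the covariance matrix can be updated one sample at a time.

    import itertools
    from collections import defaultdict

    n = 0
    mean = defaultdict(float)
    comoment = defaultdict(float)  # keyed by (feature_i, feature_j), with i <= j

    def update(x: dict):
        """Welford-style update; assumes every sample carries the same features."""
        global n
        n += 1
        delta = {k: v - mean[k] for k, v in x.items()}  # deltas against the old means
        for k in x:
            mean[k] += delta[k] / n
        for i, j in itertools.combinations_with_replacement(sorted(x), 2):
            comoment[i, j] += delta[i] * (x[j] - mean[j])  # old delta times new residual

    def covariance(i, j):
        i, j = min(i, j), max(i, j)
        return comoment[i, j] / (n - 1) if n > 1 else 0.0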

1:11:14

So I think I'm still

1:11:15

in the nitty

1:11:17

gritty details of implementing algorithms and not necessarily applying them to stuff. I'm

1:11:20

kinda counting on users

1:11:22

to do the applications.

1:11:26

That's just how it is at the moment. Now

1:11:28

one thing that I'm working on in the

1:11:30

mid to long term is Beaver. So

1:11:33

eventually, I want to

1:11:35

spend less time on River and more time on a tool I'm building called

1:11:37

Beaver. So Beaver is a

1:11:40

tool to

1:11:42

deploy and maintain online learning models. So, essentially, an

1:11:44

MLOps tool, an MLOps

1:11:46

tool for online

1:11:47

learning.

1:11:50

So it's in its infancy, but it's something I've

1:11:52

been thinking about a lot. So I

1:11:54

recently gave a talk in Sweden. I

1:11:57

sketched a blog post and

1:11:59

some designs where I tried to describe what it's

1:12:02

going to look like. But the goal of

1:12:03

this project is to

1:12:05

create a very simple user

1:12:08

friendly tool to deploy a model, and

1:12:10

I'm hoping that that is going to

1:12:12

encourage people to actually

1:12:14

use River and to use online learning because they're gonna say, hey, okay, I can learn online, but I can

1:12:16

also just deploy

1:12:19

the model and you

1:12:21

know, and both tools play nicely together. So,

1:12:23

yeah, the future of River is to have

1:12:24

this companion tool

1:12:27

to deploy online models.

1:12:30

It's not going to be catered just towards River.

1:12:32

The goal is to be able to,

1:12:34

you know, run it with any model

1:12:37

that can learn online. Well,

1:12:38

for anybody who wants to get in touch

1:12:40

with you and follow along with the work that you're doing,

1:12:42

I'll have you add your preferred contact information to

1:12:45

the show notes. And as the final question, I'd

1:12:47

like to get your perspective on what you see

1:12:49

as being the biggest barrier to adoption for

1:12:51

machine learning today. I'm

1:12:52

always impressed by how much

1:12:55

the field is maturing. I think

1:12:56

that there's a clear separation now between

1:12:58

regular machine learning, like business machine learning, I

1:13:01

might call it, and deep

1:13:03

learning. I think the two are becoming

1:13:04

separate fields. So I've

1:13:06

kind of stayed

1:13:07

away from deep learning

1:13:10

because it's just not my

1:13:12

area,

1:13:12

but I'm still

1:13:13

very interested in business machine learning, as

1:13:15

I call it. And I think

1:13:17

I'm impressed by how much

1:13:20

the community has

1:13:22

evolved in terms of knowledge. The average ML practitioner today is

1:13:24

just so much more

1:13:26

professional than five years ago.

1:13:30

And I think it's a big question of

1:13:32

education and tooling. The tricky

1:13:34

thing about

1:13:34

an ML model is that

1:13:37

it's not deterministic. And so it's

1:13:39

difficult

1:13:39

to guarantee that its performance

1:13:41

over time is going to

1:13:43

be good. And

1:13:44

let alone certify the model

1:13:46

or convince stakeholders that they should adopt it.

1:13:48

So in the real world, you

1:13:50

don't just deploy a model and cross your

1:13:52

fingers. So

1:13:54

although we've gone past

1:13:56

the tests and r and d

1:13:58

phase of a model, we are still not there in terms of

1:13:59

deploying: we have still not figured out the other side

1:14:02

of the coin, which is maintaining models.

1:14:03

And so the

1:14:04

reality is that there's usually

1:14:06

a feedback loop where you monitor your model and

1:14:08

possibly retrain it,

1:14:10

be it online or,

1:14:13

you

1:14:13

know, offline retraining, it doesn't matter. And so I don't think we're there right now. I

1:14:15

don't think that we have great

1:14:16

tools to have human

1:14:19

beings in the loop, to have

1:14:21

human beings work

1:14:23

hand in hand with machine learning models.

1:14:25

So I

1:14:26

think that tools like Prodigy,

1:14:29

which is

1:14:30

a tool to have a user work hand in

1:14:32

hand with an ML system by labeling

1:14:34

data that the model is unsure about.

1:14:36

They're crucial. They're game changers

1:14:38

because they create real systems where

1:14:41

you care

1:14:42

about, you know, new data coming in, retraining your model,

1:14:44

having

1:14:44

a human validate

1:14:47

predictions, stuff like that. So

1:14:51

I think we have to move away from only having

1:14:53

tools that

1:14:55

are destined towards training models, but

1:14:57

we also need to get better

1:14:59

at tools that you

1:15:01

know, encourage you

1:15:01

to monitor your model, to keep training it, to work with it, to

1:15:04

-- Yeah. --

1:15:06

again, just treat machine learning

1:15:08

as software engineering

1:15:10

and not just as some research projects. Alright. Well, thank you very much for taking

1:15:12

the time today to join

1:15:13

me and share the work that you've

1:15:15

been doing on River and

1:15:19

helping to introduce the overall concept of online

1:15:21

machine learning. It's definitely a very

1:15:23

interesting space, and it's great to

1:15:25

have tools like River available

1:15:27

to help people take advantage of this approach. So thank

1:15:29

you for all of the time and effort that you and the other maintainers are putting into

1:15:31

the project, and I hope

1:15:33

you enjoy the rest of

1:15:35

your day. Oh,

1:15:36

thank you. Thanks for having me.

1:15:38

It was great. Thank you for listening. Don't forget

1:15:40

to check out our other

1:15:42

shows: the Data Engineering Podcast, which

1:15:45

covers the latest on modern data management and the machine learning podcast, which helps you go

1:15:47

from idea to production with machine

1:15:52

learning. Visit the site at pythonpodcast dot com

1:15:54

to subscribe to the show, sign up for the mailing list and read the show notes. And if you learned

1:15:56

something or tried out a project from

1:15:58

the show, then tell us

1:15:59

about it. Email

1:16:02

hosts at python podcast dot com with your story. And to help other people find the show, please leave a review

1:16:07

on Apple Podcasts and tell

1:16:11

your friends and

1:16:14

coworkers.
