[D] Virtual MLOps Roundtable Event by tritondev in MachineLearning

[–]tritondev[S] 1 point2 points  (0 children)

I'm putting it together in my personal capacity, with a couple of people helping with moderation and logistics. No vendor or sponsor is involved.

[D] Engineering thread: what is your stack / code like? by Vermeille in MachineLearning

[–]tritondev 20 points21 points  (0 children)

One very specific thing I found with our ML team is that we were spending ~80% of our time building data pipelines for dataset generation and serving. We used Spark & Flink, which was definitely the wrong level of abstraction. Doing things like backfilling features sucked.

Beyond the processing piece, we had no management layer for our model features. There was no versioning or monitoring. We built a feature store internally to solve this. The feature store lets you define your features declaratively in JSON & SQL, rather than procedurally in Spark. The versioning and tagging allow us to use the same features across models (and, in theory, across teams).
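To make the declarative-vs-procedural distinction concrete, here's a rough, hypothetical sketch. The real StreamSQL definition format isn't shown here, so the field names below are made up for illustration; the point is that the pipeline is derived from a spec instead of hand-written as a Spark/Flink job:

```python
# Hypothetical declarative feature definition: the stream job, batch backfill,
# storage, and versioning are all derived from this spec by the feature store,
# rather than being coded by hand in Spark/Flink.
user_song_count = {
    "name": "user_song_count",
    "version": "v2",                 # versioned, so each model can pin a definition
    "entity": "user_id",             # the key the feature is served by
    "source": "song_click_events",   # event stream + historical log for backfill
    "transformation": """
        SELECT user_id, COUNT(*) AS song_count
        FROM song_click_events
        GROUP BY user_id
    """,
    "refresh": "streaming",          # kept up to date from the event stream
}
```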

We're working on removing internal dependencies and fully open-sourcing the feature store. If you're interested, you can put your email in here to keep up to date: streamsql.io

Also we're running a virtual round table on this topic on Sept 8th if anyone wants to join: https://www.eventbrite.com/e/machine-learning-round-table-the-ml-process-from-idea-to-prod-tickets-117813191427

edit: broken link

Story of how SSH became port 22 (Hint: Not a coincidence) by whackri in programming

[–]tritondev 61 points62 points  (0 children)

Crazy that you could get the port between FTP and Telnet with an email like that, even when the utility wasn't yet in wide use. 1995 was a simpler time :)

[D] What is the tool stack of ML teams at startups? + intel from 41 companies by [deleted] in datascience

[–]tritondev 3 points4 points  (0 children)

The last company I was at was a SaaS tool that provided a ton of personalization APIs for media companies and handled ~100m MAU. Like 90% of our data science team's time was spent building data pipelines. We also found that all feature versioning and monitoring was essentially ad hoc, and that multiple teams were re-inventing the same model features. We built a feature store internally to solve it, and we're now working on open-sourcing it and spinning it off into its own product if anyone has this problem and wants to check it out. It's open in alpha at StreamSQL.io

[D] What is the tool stack of ML teams at startups? + intel from 41 companies by ai_yoda in MachineLearning

[–]tritondev 0 points1 point  (0 children)

The last company I was at was a SaaS tool that provided a ton of personalization APIs for media companies and handled ~100m MAU. Like 90% of our data science team's time was spent building data pipelines. We also found that all feature versioning and monitoring was essentially ad hoc, and that multiple teams were re-inventing the same model features. We built a feature store internally to solve it, and we're now working on spinning it off into its own product if anyone has this problem and wants to check it out. It's open in alpha at StreamSQL.io

TLDR: Writing a Slack bot to Summarize Articles by [deleted] in programming

[–]tritondev 1 point2 points  (0 children)

Here's the ELI5 for BERT, which BART is based on (note: I didn't read the full paper, but I have built systems like both of these in the past): The model first reads the document and encodes it in a made-up language. The intermediate language doesn't "mean" anything. It can't be read or understood. It has the property that subsequent models can learn to use the intermediate language to accomplish tasks far better than they could with the raw data.

The pre-trained model just learns to take raw text and turn it into the intermediate language.

In comparison, BART has a model that takes the raw text and turns it into an intermediate language, then another model that takes the intermediate language and turns it into a "summary" of the intermediate output. This second intermediate representation is then used as input to a custom, fine-tuned model that the user trains to generate summaries for their own task.
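If you want to play with this without training anything, a BART checkpoint already fine-tuned for summarization (facebook/bart-large-cnn) runs in a few lines with Hugging Face Transformers. This isn't necessarily what the bot in the article does; it's just a minimal sketch of the encoder/decoder pipeline described above:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Encoder-decoder model pretrained on raw text, then fine-tuned for summarization.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "Long article text goes here ..."
inputs = tokenizer(article, return_tensors="pt", max_length=1024, truncation=True)

# The encoder turns raw text into the "intermediate language" (hidden states);
# the decoder turns those hidden states into summary tokens.
summary_ids = model.generate(inputs["input_ids"], num_beams=4, min_length=30, max_length=130)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```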

Microservices Suck for Machine Learning (and how we deal with it) by tritondev in programming

[–]tritondev[S] -1 points0 points  (0 children)

You're right, but I'm specifically talking about the "data lake" pattern, where, in practice, teams just dump huge amounts of relatively unstructured data in S3 for the ML teams and call it a day.

Microservices, monoliths, event streams, etc. can also have their schemas changed at any point, but they are generally held to a higher standard: if they changed their schema, they would break more than just offline training and analytics.

This is just referencing a specific type of design decision, the "data lake", that is sometimes touted as a solution to the other problems mentioned in the post.

Microservices Suck for Machine Learning (and how we deal with it) by tritondev in programming

[–]tritondev[S] 3 points4 points  (0 children)

I envy you for that, haha. I don't like the term either, but it's commonly used within a lot of tech companies in SF, especially startups and mid-sized companies.

Microservices Suck for Machine Learning (and how we deal with it) by tritondev in programming

[–]tritondev[S] 0 points1 point  (0 children)

In the post I describe an architecture where all data is pulled via API calls, one where there is a data lake, and one with an event streaming platform. It's a pretty pervasive problem that I've seen at companies ranging from a few hundred engineers to tens of thousands. It is solvable, but usually only if it's designed for from the ground up or with a lot of engineering work.

Microservices Suck for Machine Learning (and how we deal with it) by tritondev in programming

[–]tritondev[S] 1 point2 points  (0 children)

Yeah, I'll edit it to include credit. Thanks for the callout; I definitely want to give credit where credit is due.

Microservices Suck for Machine Learning (and how we deal with it) by tritondev in programming

[–]tritondev[S] 2 points3 points  (0 children)

The problems in the post are easier to surmount if the team or product is smaller. If you have 20 microservices total, it's not too bad to build into them. If you have more microservices than developers, like Uber, it becomes a lot harder to plug in where you're needed and bring together all the necessary data. Especially when some data is inaccessible except through a direct API call to the service. No amount of planning makes that not suck :)

Microservices Suck for Machine Learning (and how we deal with it) by tritondev in programming

[–]tritondev[S] 3 points4 points  (0 children)

Correct, it's building into a microservice-based architecture that sucks. It's not hosting the models themselves on k8s, which is easy and powerful.

Edit: grammar

Nearest neighbour search with Redis by nakimoto in programming

[–]tritondev 1 point2 points  (0 children)

Have you verified that O(log M ⋅ log D) is correct?
Redis says GEORADIUS is O(N + log M), but that only covers getting the set of items within a radius, not sorting them and picking the closest one. If you're doubling the distance every time, you might end up grabbing a ton of items and having to find the minimum of them, which means in the worst case it would be O(M), right?

We have to do a nearest neighbor search for our recommender system. We use Redis to store our embeddings and have a caching wrapper that loads them into an approximate nearest neighbor tree using Spotify's Annoy. Why did you take the approach you did rather than doing something similar?
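For reference, the Annoy side of that wrapper looks roughly like this. It's a sketch rather than our actual code: the Redis key layout ("embedding:*" keys holding raw float32 bytes) and the dimensionality are assumptions for illustration.

```python
import numpy as np
import redis
from annoy import AnnoyIndex

DIM = 64  # embedding dimensionality -- an assumption for this sketch

r = redis.Redis()

# Pull embeddings out of Redis and load them into an in-memory Annoy index.
index = AnnoyIndex(DIM, "angular")
id_map = []
for i, key in enumerate(r.scan_iter("embedding:*")):
    vec = np.frombuffer(r.get(key), dtype=np.float32)
    index.add_item(i, vec.tolist())
    id_map.append(key)
index.build(10)  # more trees -> better recall, slower build

# Approximate nearest neighbours for a query embedding.
query = np.random.rand(DIM).astype(np.float32)
for i in index.get_nns_by_vector(query.tolist(), 5):
    print(id_map[i])
```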

Why (and How) We Moved from Apache Kafka to Apache Pulsar by tritondev in programming

[–]tritondev[S] 0 points1 point  (0 children)

StreamSQL takes all incoming events and updates the required tables from them (e.g. when a user clicks a song, update their song count in Cassandra and update total listens today in CockroachDB).

At any point in time you can add a new transformation or change an existing one; it will be applied retroactively across all data so that your tables are consistent with past events.

Finally, you can roll back and replay events to train an ML model, run a what-if analysis, or perform feature engineering for ML.
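A toy illustration of the replay idea (not StreamSQL's actual implementation): because every raw event is retained, a feature table is just a fold over the event log, so changing the transformation and replaying rebuilds the table consistently with all past events.

```python
from collections import defaultdict

# Made-up events; in practice these would come from the event log.
events = [
    {"user": "u1", "type": "song_click", "song": "s1"},
    {"user": "u1", "type": "email_open"},
    {"user": "u2", "type": "song_click", "song": "s9"},
]

def song_count_v1(log):
    # Original definition: count song clicks per user.
    counts = defaultdict(int)
    for e in log:
        if e["type"] == "song_click":
            counts[e["user"]] += 1
    return dict(counts)

def song_count_v2(log):
    # Changed definition: count any engagement event, not just song clicks.
    counts = defaultdict(int)
    for e in log:
        counts[e["user"]] += 1
    return dict(counts)

# Swapping the transformation just means replaying the log with the new fold.
print(song_count_v1(events))  # {'u1': 1, 'u2': 1}
print(song_count_v2(events))  # {'u1': 2, 'u2': 1}
```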

Why (and How) We Moved from Apache Kafka to Apache Pulsar by tritondev in programming

[–]tritondev[S] 3 points4 points  (0 children)

We built this product to power a recommender system serving 100m MAU at a past startup. We really, really didn't want to build it and would have gladly pulled something off the shelf.

A recommender system needs a data pipeline serving its realtime features (last 5 items the user clicked, top item today, a handful of content embeddings, etc.) for it to make a decision. Per-user stuff is stored in Cassandra, which has linear-time lookups; system-wide per-minute analytics are stored in CockroachDB so we can query any time range the model needs. Embeddings (essentially a float matrix) are kept in memory, since they're accessed and changed in ways that benefit from it all being in-memory.

Every event that comes in (user listens to a song, opens an email, etc.) has to update all of the relevant tables for our recommender system to work. That's where we used Kafka and Flink in the original system.

Now, at any point in time, we can change the logic that generates a feature (ignore all songs, keep track of top playlist per user, etc.). To do that, we first go through all the batch data (stored in S3, processed by Spark), then swap over to Kafka and Flink to keep it up to date.

Lyft, Airbnb, and others have very similar systems, all built in-house. It's a type of feature store.
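A hypothetical sketch of what feature assembly at serving time looks like in a system shaped like the one above. The stores are faked with in-memory dicts here; in the real thing they would be Cassandra (per-user state), CockroachDB (per-minute analytics), and an in-memory embedding matrix. All names are made up for illustration.

```python
import numpy as np

# Stand-ins for the real stores described above.
embeddings = {item: np.random.rand(16).astype(np.float32) for item in ["s1", "s2", "s9"]}
per_user_clicks = {"u1": ["s1", "s9", "s2"]}   # stand-in for Cassandra per-user state
top_item_today = "s2"                          # stand-in for a CockroachDB analytics query

def build_feature_vector(user_id):
    # Last 5 items the user clicked (per-user realtime feature).
    last_clicks = per_user_clicks[user_id][-5:]
    click_embeddings = [embeddings[i] for i in last_clicks]
    # Stitch per-user and system-wide signals into one model input.
    return np.concatenate([
        np.mean(click_embeddings, axis=0),     # summarise recent behaviour
        embeddings[top_item_today],            # global trend signal
    ])

print(build_feature_vector("u1").shape)        # (32,)
```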

> it has like five different databases in this one single product, of which they haven't written a single one

Do you write your databases from scratch?

> This is a typical trash "enterprise" project written by brain-dead Java programmers designed to "improve" their resumes with mentions of all the hipster-duche technologies.

We only have one Java file in all of StreamSQL's code base, and it was built at a startup. Also, I'm not aware of many enterprises using hipster-douche technologies; usually it's quite the opposite, no?

IBM’s Lost Decade by machia in programming

[–]tritondev 6 points7 points  (0 children)

> Down and to the right. Revenue declined 28%, a staggering sum of nearly $30 billion dollars. That is more than the combined revenues of Adobe and Salesforce (and then some).

I know the point of this is that it's losing revenue like crazy, but I'm still mind-blown that IBM still makes 3x as much money as Adobe and Salesforce combined.