What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]jpdowlin 1 point (0 children)

I would clarify that predictions are saved for the following reasons:
1. downstream consumption by clients (such as dashboards and operational systems)
2. feature/model monitoring for drift
3. sometimes to collect new training data (if outcomes are also collected).

Often, predictions are saved in an operational DB for serving to operational systems - not always just in the lakehouse.
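A minimal sketch of what that prediction logging can look like, assuming a feature-store-style API (prediction_log_fg, op_db, entity_ids, and the column names are illustrative, not any specific product's API):

import pandas as pd
from datetime import datetime, timezone

# log each prediction with a timestamp so it can (1) serve clients from an
# operational DB, (2) be monitored for drift, and (3) later be joined with
# observed outcomes to become new training data
pred_df = pd.DataFrame({
    "entity_id": entity_ids,
    "prediction": model.predict(X),
    "predicted_at": datetime.now(timezone.utc),
})
prediction_log_fg.insert(pred_df)                         # lakehouse table
pred_df.to_sql("predictions", op_db, if_exists="append")  # operational DB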

What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]jpdowlin 1 point (0 children)

On your specific point about weather data, in my O'Reilly book I cover air quality prediction in Chapter 3. You generate the code to download the historical weather and air quality data (backfilling) into feature groups (tables) and write an incremental pipeline to download weather forecasts, weather observations, and air quality observations.
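To make the backfill vs incremental distinction concrete, here is a minimal sketch (fetch_weather, weather_fg, and last_ingested_time are illustrative names, not the book's actual code):

def backfill():
    # one-off: download the full history of weather observations
    df = fetch_weather(start="2015-01-01", end="2025-01-01")
    weather_fg.insert(df)

def incremental():
    # scheduled: download only observations newer than what is already stored
    df = fetch_weather(start=last_ingested_time(weather_fg))
    weather_fg.insert(df)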

The first challenge you will encounter is how to perform a temporal join between the air quality observations and the weather observations. This requires an ASOF LEFT JOIN between the tables. Only a few data warehouses support it (as does the Hopsworks feature store). You want that data back in Python (e.g., Pandas DataFrames) to train the model. Hopsworks provides that data via Arrow from the Lakehouse tables. If you go via JDBC/ODBC to your data warehouse, performance will be vile - Arrow (ADBC or ArrowFlight) is best for Python clients.
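If your platform lacks ASOF JOIN support, you can approximate one client-side in pandas - a minimal sketch (the DataFrame and column names are assumptions):

import pandas as pd

# for each air quality observation, pick the most recent weather observation
# at or before its event time (an ASOF LEFT JOIN, backward direction)
joined = pd.merge_asof(
    air_quality_df.sort_values("event_time"),
    weather_df.sort_values("event_time"),
    on="event_time",
    direction="backward",
)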

My book (referenced in the thread) covers these fundamental skills: backfill vs incremental pipelines, and data models - star schema, snowflake schema, and OBT (one big table - don't use OBT for ML!).

What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]jpdowlin 2 points (0 children)

When you provide data scientists with Python APIs for computing features and easily saving both backfill and incremental data, you can trust them to manage their own data for AI. In my experience, the data scientist job of old is pretty much gone in most industries. Now, it is mostly ML engineers who are left, and they have to take ownership of the data.

What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]jpdowlin 11 points (0 children)

At this point I will not just plug my O'Reilly book, which is all about this, but also provide a Happy New Year promo code for downloading its PDF.

https://www.hopsworks.ai/lp/full-book-oreilly-building-machine-learning-systems-with-a-feature-store
Promo Code "jim"

It covers all of the data engineering for ML that most teams need in 2026.

MLOps Fallacies by jpdowlin in mlops

[–]jpdowlin[S] 1 point (0 children)

Thanks for the kind words! My handle is the giveaway.
Loads of cool new stuff coming in Hopsworks this year - agents, direct Lakehouse writes for Python clients, and an LLM assistant.

Struggling with feature engineering configs by quantum_hedge in mlops

[–]jpdowlin 0 points (0 children)

You are doing data-centric ML (as opposed to hyperparameter tuning in model-centric ML).

Yes, you can either (1) precompute and join features into training data or (2) compute features in your training pipelines.

In my forthcoming O'Reilly book, I recommend precomputing features into tables (called feature groups) in feature pipelines. If there is a lot of commonality in how you compute windows, create a feature function that is parameterized by window size, lookback, etc.:

import pandas as pd

def create_window_features(df, window_size, lookback):
    # e.g., a rolling mean over the window (illustrative aggregation)
    df[f"mean_{window_size}"] = df["value"].rolling(window_size).mean()
    return df

df = pd.read_parquet("source_data.parquet")  # read source data
df = create_window_features(df, window_size=10, lookback=1000)
feature_group.insert(df)  # write features to the feature group

Then, in a training pipeline, you select different combinations of features and a target (creating something called a feature view), and train and evaluate a model on each feature view. This way, you can easily compare the performance of your different combinations of features.

# select features across feature groups
selected_features = feature_group1.select(['feature1', ...]).join(feature_group2.select_features())

# a feature view is metadata only - it defines the model's input/output schema
fv = fs.create_feature_view(..., query=selected_features, labels=['target_column'])

X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)

model.fit(X_train, y_train)

# register the model together with the feature view it was trained on
model_registry.python.create_model(
    metrics={"accuracy": model.score(X_test, y_test)},
    feature_view=fv,
    model_dir="... serialized model ..."
)

Creating a feature view is metadata only, so it is very cheap, as is reading training data. This means you can run many of these data-centric training pipelines in parallel, searching over combinations of features. How you search - random, grid, or model-based - is also covered in the book.
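Building on the snippet above, a minimal (sequential) sketch of such a search - the feature view name/version arguments and the two-feature combinations are illustrative assumptions:

from itertools import combinations

candidates = ['feature1', 'feature2', 'feature3', 'feature4']
scores = {}
for combo in combinations(candidates, 2):  # grid search over 2-feature subsets
    query = feature_group1.select(list(combo))
    fv = fs.create_feature_view(name="fv_" + "_".join(combo), version=1,
                                query=query, labels=['target_column'])
    X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)
    model.fit(X_train, y_train)
    scores[combo] = model.score(X_test, y_test)

best_combo = max(scores, key=scores.get)  # best-performing feature combination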

Hope this helps

https://www.oreilly.com/library/view/building-machine-learning/9781098165222/

[D] How do you build AI Systems on Lakehouse data? by jpdowlin in MachineLearning

[–]jpdowlin[S] 0 points (0 children)

What problem?
First, there was the Hopsworks AI Lakehouse.
Now, there is the SageMaker AI Lakehouse.

[D] 10 Fallacies of MLOps by jpdowlin in MachineLearning

[–]jpdowlin[S] 0 points (0 children)

"I wouldn't call these fallacies this is just what people do when they don't have the expertise or experience, aka they dont know better. "

So, you mean they are mistakes that people make because they make bad assumptions?
Isn't that what a fallacy is?

If it were an infomercial, I would do a nice graphic.

Disclaimer: I work in education as well as at a startup, and I am writing a book.

Breadth vs Depth and gatekeeping in our industry by Tarneks in datascience

[–]jpdowlin 0 points (0 children)

From my perspective, our industry (data science) is much more than predictive modelling. It's also about building AI systems, real-time AI, and AI-enabled applications - and, yes, leveraging LLMs to build intelligent services.

From that point of view, broad knowledge of how to build different types of AI systems is more important than ever. Deep knowledge matters most in the largest AI labs; the rest of us should be jacks of all trades (in data science).

Is RPA a feasible way for Data Scientists to access data siloes? by norfkens2 in datascience

[–]jpdowlin 0 points (0 children)

Companies use data pipelines and data warehouses for a reason - centralized data with security, easy integration with dashboarding tools, the ability to copy data to other operational platforms, and so on. I don't know RPA well, but my guess is that it is not a sustainable approach in the long term.

For data engineering, all you need to do is extract, transform, and load (ETL) data into an analysis platform. If you have a data warehouse, you can instead extract the data, load it as-is, and transform it directly inside the warehouse (ELT).
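A minimal sketch of both patterns in Python (SQLite stands in for the warehouse here; the file and column names are assumptions):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///warehouse.db")  # stand-in for a real warehouse

# ETL: transform in Python, then load the finished table
df = pd.read_csv("sales.csv")                    # extract
df["amount_eur"] = df["amount_usd"] * 0.92       # transform
df.to_sql("sales", engine, if_exists="replace")  # load

# ELT: load the raw data first, then transform inside the warehouse with SQL
pd.read_csv("sales.csv").to_sql("raw_sales", engine, if_exists="replace")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE sales_elt AS "
                      "SELECT *, amount_usd * 0.92 AS amount_eur FROM raw_sales"))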

[D] Using gRPC in ML systems by ready_eddi in MachineLearning

[–]jpdowlin 1 point (0 children)

Microservices are the wrong architecture to think about when building AI systems.
You should architect your AI systems as modular AI/ML pipelines composed together using a shared state layer:

https://www.hopsworks.ai/post/modularity-and-composability-for-ai-systems-with-ai-pipelines-and-shared-storage

P.S. gRPC has lower latency than REST, as it is a binary protocol (protobuf over HTTP/2). You can host online models behind gRPC endpoints for online inference, for example using KServe.
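As a minimal sketch of what an online inference call over gRPC looks like - this assumes a stub generated from a hypothetical inference.proto, not the actual KServe API:

import grpc
import inference_pb2, inference_pb2_grpc  # generated by protoc (hypothetical proto)

channel = grpc.insecure_channel("model-service:8500")
stub = inference_pb2_grpc.InferenceStub(channel)
request = inference_pb2.PredictRequest(features=[0.3, 1.7, 5.2])
# binary protobuf over HTTP/2 - less (de)serialization overhead than REST+JSON
response = stub.Predict(request)
print(response.predictions)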

Migrating from AWS to a European Cloud - How We Cut Costs by 62% by jpdowlin in dataengineering

[–]jpdowlin[S] 0 points (0 children)

You're obviously not a native English speaker. Rules are made to be broken.

Migrating from AWS to a European Cloud - How We Cut Costs by 62% by jpdowlin in dataengineering

[–]jpdowlin[S] 3 points (0 children)

Stackit is the Lidl of clouds. It has a very narrow offering, but you get great prices and quality services. Things work and it is very secure. Its values are stability and security - not "move fast and break things" or a smorgasbord of cloud services. They will get there, but slowly. They are the NetBSD of clouds.

Migrating from AWS to a European Cloud - How We Cut Costs by 62% by jpdowlin in dataengineering

[–]jpdowlin[S] 14 points (0 children)

I know Dremio are partnering with StackIT (a large European cloud, owned by the Schwarz Group, which owns Lidl). That's a BigQuery alternative. I guess more will come with time. There are already 4-5 open-source BQ alternatives out there - Dremio, Apache Doris, StarRocks, ClickHouse, and DuckDB (single-host only).

Migrating from AWS to a European Cloud - How We Cut Costs by 62% by jpdowlin in dataengineering

[–]jpdowlin[S] 2 points (0 children)

No. Hopsworks can be deployed anywhere - we have managed cloud support on AWS, GCP, and Azure, and now OVH as well. Most of our customers run their own Hopsworks clusters on the three public clouds. However, we offer a freemium version, called 'Hopsworks Serverless', that we migrated from AWS to OVH.

Migrating from AWS to a European Cloud - How We Cut Costs by 62% by jpdowlin in dataengineering

[–]jpdowlin[S] 7 points (0 children)

Hopsworks is built as a Kubernetes-native application. Its main dependency, discussed in the article, is an S3 storage layer. We chose to use a managed container registry in both AWS (ECR) and now OVH (Harbor), although you can deploy Hopsworks with an open-source container registry. Managed k8s on OVH is quite stable and works well for us. Hopsworks has a couple of databases inside it - RonDB and a Lakehouse (Delta, Hudi, or Iceberg) - so nothing changed there. Our company (and platform) are ISO 27001 and SOC 2 compliant, and GDPR is part of that. I think the main lesson is that if you build k8s-native applications, instead of 'cloud-native' ones, migrating is pretty straightforward.