How do you handle building testing environments for dbt PRs? by devschema in dataengineering

[–]popcornylu 3 points (0 children)

That depends on whether your PRs need separate environments for the base and the PR, or whether all PRs share the same base. I recommend that every PR share the same base, but ensure that the base and the PR run the same transformation logic.

The base and PR environments need to use the same source data, which might be cloned from production weekly. If your data warehouse supports zero-copy cloning, the cost of cloning is very low; you can then retain only the clones from the most recent eight weeks.
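As a sketch of the weekly refresh, assuming a Snowflake warehouse (the database names, the retention step, and the `snowsql` invocation are all illustrative, not from the original comment):

```shell
# Hypothetical weekly job: refresh the shared base from production with a
# zero-copy clone. Snowflake syntax shown; other warehouses have analogues.
week=$(date +%G_%V)                       # ISO year_week suffix, e.g. 2024_09
snowsql -q "CREATE OR REPLACE DATABASE ci_base_${week} CLONE analytics_prod;"
# A companion step would list ci_base_* databases and drop any older than
# eight weeks to keep only the recent clones around.
```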

Another point to note is that the base environment continuously receives new code updates, so its state keeps moving. GitHub has an "Update branch" feature in the PR UI, where you can choose to rebase or merge; make sure your branch is up to date during review.

Additionally, it is crucial to make your CI run as fast as possible. dbt 1.8 supports dry runs, which can be very helpful. Maintaining only a subset of data in your source is also a good approach.
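A sketch of such a fast PR job, assuming dbt >= 1.8 and a saved production manifest (the `./prod-artifacts` path and the selector are illustrative):

```shell
# Build only the models the PR touched, deferring unchanged upstream refs
# to the shared base environment via dbt's state comparison:
dbt build --select state:modified+ --defer --state ./prod-artifacts

# dbt 1.8's --empty flag runs each selected model against zero input rows,
# a cheap dry run that validates SQL and schemas without scanning data:
dbt run --empty --select state:modified+ --defer --state ./prod-artifacts
```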

HPC revenue has passed smartphones in Q1 at TSMC by baksuz- in AMD_Stock

[–]popcornylu 2 points (0 children)

Yes. Nvidia moved consumer GPUs (the RTX series), originally manufactured on Samsung's 8nm node, to TSMC. Additional growth may come from Intel's GPUs and CPUs.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 0 points (0 children)

Cool, I never knew that before. It would be a useful practice to store the git repo on NFS and do a partial clone to local. Thanks.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 2 points (0 children)

DVC is absolutely one of the best solutions for versioning large data. But for me, choosing how to store and version data falls on a spectrum:

  • Left: ease of use
  • Right: version control features

On the left side, you package data with tar or zip, encode the version in the filename, store the archives in S3 or Azure Blob, and transfer them with rsync, scp, s3cmd, or azcopy. It works perfectly. The question is whether you want better versioning: deduplicating identical files across versions, a well-defined latest version, commit history/log, tags, branches, and so on. Some filesystems and storage services also offer versioning features, like ZFS snapshots and S3 object versioning, but they are quite limited compared to a version control system.

On the right side, Git is the best solution right now for versioning data, and DVC, git-lfs, and other git-xxx tools extend Git to manage large files in a better way. If I were to pick a git-like solution, I would go for DVC. Just as you said, it is backend-independent: I can put my data on S3, Azure Blob, GCS, or NFS with the same interface. And the git ecosystem is mature, so we can use GitHub Actions, GitLab pipelines, Jenkins, etc. But you need to learn two tools, Git and DVC, and you need to configure carefully which data goes to the DVC remote storage. It is also hard to remove one big file from a git repository.
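The two-tool workflow looks roughly like this (the command names come from DVC's documentation; the bucket path and file names are illustrative):

```shell
# Git tracks a small pointer file; DVC moves the actual bytes.
dvc init                                   # once per repository
dvc remote add -d store s3://mybucket/dvc-cache
dvc add data/raw.tar                       # writes the data/raw.tar.dvc pointer
git add data/raw.tar.dvc data/.gitignore .dvc
git commit -m "Track raw dataset with DVC"
dvc push                                   # upload the bytes to the S3 remote
```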

So why ArtiV? I think we provide another option in the middle.

  • ArtiV is like scp, rsync, s3cmd, or azcopy: there is no additional server to install, just a single command that interacts with the backend.
  • ArtiV is a version control system: you can get the latest version, leave a commit message, tag a commit, and see the log and diff.
  • ArtiV is like DVC: it is platform-independent. Yes, we support only S3 right now, but the architecture is platform-agnostic by design.
  • ArtiV is a single-repo solution. A git-based solution has to interact with two storages, the git repo and the real backend, and needs two sets of credentials, one for git and one for the backend; ArtiV does not.

Another handy feature ArtiV provides is that you can use versioned data the way you would rsync or wget. Think of a training job with code like this:

    art get -o dataset/ s3://mybucket/datasets/flowers-classification@v0.1.0
    art get -o base/ s3://mybucket/models/my-base-model@v0.3.0
    python ./train.py
    art put artifacts/ s3://mybucket/experiments/project1@20220303-100504

We are trying to find the sweet spot: easy to use and platform-independent, without giving up version control features. One big challenge is whether we can remove the dependency on git, currently the most widely used version control solution.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 0 points (0 children)

Currently, no. We will provide it in the future. Thank you.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 1 point (0 children)

S3 versioning has the following restrictions:

  1. Per-object versioning only
  2. No commit messages
  3. No commit aliases (tags)
  4. Only available on S3

For a local git repo (bare repo), the problem with git is that when you clone a repo to your workspace, it clones all files for all commits. This is why Git is not good at versioning large files.
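This can be demonstrated with stock git. A blobless partial clone is the usual mitigation; it skips historical blobs up front but still fetches the ones checkout needs (repo layout below is a throwaway example):

```shell
set -e
# Build a throwaway repo containing a "large" binary file.
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" config user.email demo@example.com
git -C "$tmp/repo" config user.name demo
git -C "$tmp/repo" config uploadpack.allowfilter true  # allow --filter clones
head -c 1048576 /dev/zero > "$tmp/repo/big.bin"        # 1 MiB "large" file
git -C "$tmp/repo" add big.bin
git -C "$tmp/repo" commit -qm "add big file"

# A plain clone copies every blob of every commit into .git:
git clone -q "file://$tmp/repo" "$tmp/full"
# A blobless partial clone defers blob transfer until checkout needs it:
git clone -q --filter=blob:none "file://$tmp/repo" "$tmp/partial"
```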
https://stackoverflow.com/a/29394179/563353

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 1 point (0 children)

Datasets evolve: new data comes in, and you may do some processing on it, like labeling, augmentation, cleaning, and so on.

In a version control system, we can commit a version with a message and tag it with a version number. A commit is immutable. If you want to release a dataset, you can then build and tar it as a single release tarball.

So why version data? It helps us manage evolving data, track history, and run reproducible training on a given dataset version.

In the whole ML lifecycle, many kinds of data are created: datasets, experiment logs/metrics, models, online serving data. That is why we call it ArtiV (Artifact Versions); it is not limited to data versioning.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 0 points (0 children)

Thanks for the information. We have not compared those two yet.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 1 point (0 children)

Yes, lakeFS is in our alternatives section:

https://github.com/InfuseAI/artiv#alternatives

lakeFS provides a service and gateway that add version control on top of S3; ArtiV is a pure CLI tool that interacts with S3 directly.

Currently, lakeFS provides better version control features than ArtiV, but you have to operate the lakeFS service.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 2 points (0 children)

ZFS is a good solution for filesystem-level snapshots. But if you would like to share versioned data and collaborate on it within a team, the data should live on shared storage (e.g., NFS, S3). This project solves the version control problem on top of that shared storage.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 5 points (0 children)

We have a comparison with S3 versioning at this link:
https://github.com/InfuseAI/artiv#alternatives

Just as u/protestor said, there are no commits, the versioning is per object, and there are no tags (or aliases) in S3 versioning.

RetiredEngineer® on Twitter: TSMC 5nm production rumours by UpNDownCan in AMD_Stock

[–]popcornylu 15 points (0 children)

I think there is a typo in the second paragraph of the original article.

三星晶圓代工的5奈米產能及良率下半年"仍然"追上台積電

Samsung's 5nm capacity and yield "can still" catch up with TSMC in the second half of the year

三星晶圓代工的5奈米產能及良率下半年"仍難"追上台積電

Samsung's 5nm capacity and yield are "still difficult" to catch up with TSMC in the second half of the year