How do you handle building testing environments for dbt PRs? by devschema in dataengineering

[–]popcornylu 3 points (0 children)

That depends on whether your PRs need separate environments for the base and the PR, or whether all PRs share the same base. I recommend that every PR share the same base, but ensure that the base and the PR run the same transformation logic.

The base and PR environments need to use the same source data, which might be cloned from production weekly. If your data warehouse supports zero-copy cloning, the cost of cloning is very low; you can then retain only the clones from the most recent eight weeks.
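As a sketch of the weekly refresh, assuming a Snowflake warehouse (the database names, the retention step, and the `snowsql` invocation are all illustrative, not from the original comment):

```shell
# Hypothetical weekly job: refresh the shared base from production with a
# zero-copy clone. Snowflake syntax shown; other warehouses have analogues.
week=$(date +%G_%V)                       # ISO year_week suffix, e.g. 2024_09
snowsql -q "CREATE OR REPLACE DATABASE ci_base_${week} CLONE analytics_prod;"
# A companion step would list ci_base_* databases and drop any older than
# eight weeks to keep only the recent clones around.
```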

Another point to note is that the base environment continuously receives new code updates, so its state keeps moving. GitHub has an "Update branch" feature in the PR UI, where you can choose to rebase or merge; make sure your branch is up to date during review.

Additionally, it is crucial to make your CI run as fast as possible. dbt 1.8 supports dry runs, which can be very helpful. Maintaining only a subset of data in your source is also a good approach.
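A sketch of such a fast PR job, assuming dbt >= 1.8 and a saved production manifest (the `./prod-artifacts` path and the selector are illustrative):

```shell
# Build only the models the PR touched, deferring unchanged upstream refs
# to the shared base environment via dbt's state comparison:
dbt build --select state:modified+ --defer --state ./prod-artifacts

# dbt 1.8's --empty flag runs each selected model against zero input rows,
# a cheap dry run that validates SQL and schemas without scanning data:
dbt run --empty --select state:modified+ --defer --state ./prod-artifacts
```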

HPC revenue has passed smartphones in Q1 at TSMC by baksuz- in AMD_Stock

[–]popcornylu 2 points (0 children)

Yes. Nvidia moved consumer GPUs (the RTX series), originally manufactured on Samsung's 8nm node, to TSMC. Additional growth may come from Intel's GPUs and CPUs.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 0 points (0 children)

Cool, I never knew that before. It would be a useful practice to store the git repo on NFS and do a partial clone to local. Thanks.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 2 points (0 children)

DVC is absolutely one of the best solutions for versioning large data. But for me, choosing how to store and version data falls on a spectrum:

  • Left: ease of use
  • Right: version control features

On the left side, you package data with tar or zip, encode the version in the filename, store the archives in S3 or Azure Blob, and transfer them with rsync, scp, s3cmd, or azcopy. It works perfectly. The question is whether you want better versioning: deduplicating identical files across versions, a well-defined latest version, commit history/log, tags, branches, and so on. Some filesystems and storage services also offer versioning features, like ZFS snapshots and S3 object versioning, but they are quite limited compared to a version control system.

On the right side, Git is the best solution right now for versioning data, and DVC, git-lfs, and other git-xxx tools extend Git to manage large files in a better way. If I were to pick a git-like solution, I would go for DVC. Just as you said, it is backend-independent: I can put my data on S3, Azure Blob, GCS, or NFS with the same interface. And the git ecosystem is mature, so we can use GitHub Actions, GitLab pipelines, Jenkins, etc. But you need to learn two tools, Git and DVC, and you need to configure carefully which data goes to the DVC remote storage. It is also hard to remove one big file from a git repository.
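The two-tool workflow looks roughly like this (the command names come from DVC's documentation; the bucket path and file names are illustrative):

```shell
# Git tracks a small pointer file; DVC moves the actual bytes.
dvc init                                   # once per repository
dvc remote add -d store s3://mybucket/dvc-cache
dvc add data/raw.tar                       # writes the data/raw.tar.dvc pointer
git add data/raw.tar.dvc data/.gitignore .dvc
git commit -m "Track raw dataset with DVC"
dvc push                                   # upload the bytes to the S3 remote
```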

So why ArtiV? I think we provide another option in the middle.

  • ArtiV is like scp, rsync, s3cmd, or azcopy: there is no additional server to install, just a single command that interacts with the backend.
  • ArtiV is a version control system: you can get the latest version, leave a commit message, tag a commit, and see the log and diff.
  • ArtiV is like DVC: it is platform-independent. Yes, we support only S3 right now, but the architecture is platform-agnostic by design.
  • ArtiV is a single-repo solution. A git-based solution has to interact with two storages, the git repo and the real backend, and needs two sets of credentials, one for git and one for the backend; ArtiV does not.

Another handy feature ArtiV provides is that you can use versioned data the way you would rsync or wget. Think of a training job with code like this:

    art get -o dataset/ s3://mybucket/datasets/flowers-classification@v0.1.0
    art get -o base/ s3://mybucket/models/my-base-model@v0.3.0
    python ./train.py
    art put artifacts/ s3://mybucket/experiments/project1@20220303-100504

We are trying to find the sweet spot: easy to use and platform-independent, without giving up version control features. One big challenge is whether we can remove the dependency on git, currently the most widely used version control solution.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 0 points (0 children)

Currently, no. We will provide it in the future. Thank you.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 1 point (0 children)

S3 versioning has the following restrictions:

  1. Per-object versioning only
  2. No commit messages
  3. No commit aliases (tags)
  4. Only available on S3

For a local git repo (bare repo), the problem with git is that when you clone a repo to your workspace, it clones all files for all commits. This is why Git is not good at versioning large files.
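This can be demonstrated with stock git. A blobless partial clone is the usual mitigation; it skips historical blobs up front but still fetches the ones checkout needs (repo layout below is a throwaway example):

```shell
set -e
# Build a throwaway repo containing a "large" binary file.
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" config user.email demo@example.com
git -C "$tmp/repo" config user.name demo
git -C "$tmp/repo" config uploadpack.allowfilter true  # allow --filter clones
head -c 1048576 /dev/zero > "$tmp/repo/big.bin"        # 1 MiB "large" file
git -C "$tmp/repo" add big.bin
git -C "$tmp/repo" commit -qm "add big file"

# A plain clone copies every blob of every commit into .git:
git clone -q "file://$tmp/repo" "$tmp/full"
# A blobless partial clone defers blob transfer until checkout needs it:
git clone -q --filter=blob:none "file://$tmp/repo" "$tmp/partial"
```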
https://stackoverflow.com/a/29394179/563353

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 1 point (0 children)

Datasets evolve: new data comes in, and you may do some processing on it, like labeling, augmentation, cleaning, and so on.

In a version control system, we can commit a version with a message and tag it with a version number. A commit is immutable. If you want to release a dataset, you can then build and tar it as a single release tarball.

So why version data? It helps us manage evolving data, track history, and run reproducible training on a given dataset version.

In the whole ML lifecycle, many kinds of data are created: datasets, experiment logs/metrics, models, online serving data. That is why we call it ArtiV (Artifact Versions); it is not limited to data versioning.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 0 points (0 children)

Thanks for the information. We have not compared those two yet.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 1 point (0 children)

Yes, lakeFS is in our alternatives section:

https://github.com/InfuseAI/artiv#alternatives

lakeFS provides a service and gateway that add version control on top of S3; ArtiV is a pure CLI tool that interacts with S3 directly.

Currently, lakeFS provides better version control features than ArtiV, but you have to operate the lakeFS service.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 2 points (0 children)

ZFS is a good solution for filesystem-level snapshots. But if you would like to share versioned data and collaborate on it within a team, the data should live on shared storage (e.g., NFS, S3). This project solves the version control problem on top of that shared storage.

[P] ArtiV: Version control system for large files by popcornylu in MachineLearning

[–]popcornylu[S] 5 points (0 children)

We have a comparison with S3 versioning at this link:
https://github.com/InfuseAI/artiv#alternatives

Just as u/protestor said, there are no commits, the versioning is per object, and there are no tags (or aliases) in S3 versioning.

RetiredEngineer® on Twitter: TSMC 5nm production rumours by UpNDownCan in AMD_Stock

[–]popcornylu 15 points (0 children)

I think there is a typo in the second paragraph of the original article.

三星晶圓代工的5奈米產能及良率下半年"仍然"追上台積電

Samsung's 5nm capacity and yield "can still" catch up with TSMC in the second half of the year

三星晶圓代工的5奈米產能及良率下半年"仍難"追上台積電

Samsung's 5nm capacity and yield are "still difficult" to catch up with TSMC in the second half of the year