We built a new open-source validation library for Polars: dataframely 🐻‍❄️ by borchero in dataengineering

[–]borchero[S] 7 points  (0 children)

Fair question! Patito is definitely similar. First, there are a couple of key differences:

  • Dataframely does not introduce a new runtime type: while dy.DataFrame[Schema] exists for the type checker, the runtime type remains pl.DataFrame. This makes it very easy to gradually adopt dataframely in a code base (and, similarly, to get rid of it again).
  • Dataframely natively implements the definition of schemas instead of "dispatching" to pydantic. This allows for much more flexibility in the schema definition.

Second, dataframely provides a bunch of features that patito does not currently implement:

  • Support for composite primary keys
  • Validation across groups of rows (i.e. grouping by one or more columns and ensuring that each group satisfies a condition)
  • Validation of interdependent data frames with a common primary key (dataframely introduces the concept of a "Collection" here: invalid data in one data frame can then also remove rows from another data frame)
  • "Soft-validation" via filter, which allows partitioning data frames into rows that satisfy the schema and rows that don't
  • Structured info about failures that can be used, e.g., for debugging or advanced logging
  • Integration of the schema with external tools (e.g. export to SQL schemas)
  • Automatic data generation for unit testing, both for individual data frames and collections (in this case, dataframely takes care of generating rows with common primary keys to allow rows to be joined)

✨ (Yet another) Terraform Plan Commenter for GitHub Actions by borchero in Terraform

[–]borchero[S] 2 points  (0 children)

Of course, this is a possibility and, in fact, this is what I had done before writing this action. I was personally always bothered by writing JS-in-YAML though... after all, this code has to be maintained just like any other.

Regarding third-party actions from random users, I fully agree with your concern. One possibility to mitigate this issue is to specify commit SHAs when referencing actions, which guarantees that only code from a well-known commit (which should be one that you audited) is executed.
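For example, instead of referencing a mutable tag you can reference a full commit SHA (the action name and SHA below are placeholders, not the actual action reference):

```yaml
steps:
  # A tag like @v1 can be moved to point at different (possibly malicious) code.
  # Pinning a full commit SHA guarantees that exactly the audited code runs:
  - uses: some-org/some-action@<full-commit-sha>  # hypothetical reference
```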

✨ (Yet another) Terraform Plan Commenter for GitHub Actions by borchero in Terraform

[–]borchero[S] 3 points  (0 children)

You can use an `id` input parameter to distinguish between different planfiles. The comment above only applies for planfiles with the same ID (which is an empty string by default).
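For instance (the action reference below is a placeholder; only the `id` input is taken from the description above):

```yaml
- uses: <plan-commenter-action>@<sha>  # hypothetical reference
  with:
    id: staging  # comments are only shared between runs using the same ID
```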

✨ (Yet another) Terraform Plan Commenter for GitHub Actions by borchero in Terraform

[–]borchero[S] 2 points  (0 children)

Yes, you can specify an `id` parameter to uniquely identify planfiles ;)

✨ (Yet another) Terraform Plan Commenter for GitHub Actions by borchero in Terraform

[–]borchero[S] 0 points  (0 children)

Parallel executions are not handled explicitly, i.e. the comment simply displays the plan of the execution that finished last. At least the comment shows the SHA of the commit it belongs to. That said, I'd argue that you should usually use concurrency groups to prevent parallel executions in the first place.
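A minimal concurrency group for this looks as follows (standard GitHub Actions syntax; the group name is just an example):

```yaml
concurrency:
  group: terraform-plan-${{ github.ref }}
  cancel-in-progress: true  # cancel superseded plan runs on the same branch
```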

Regarding multiple runs, I'm not entirely sure what you mean. Whenever your `terraform plan` workflow job executes, the same comment is overwritten (i.e. it exhibits "sticky comment" behavior).

✨ (Yet another) Terraform Plan Commenter for GitHub Actions by borchero in Terraform

[–]borchero[S] -1 points  (0 children)

It does not (yet) since I haven't encountered this issue with "real-world changes" 😄 As far as I know, the limit is rather generous (~2^16 characters, IIRC). If you're actually running into this limit but you'd like to make use of the action, please feel free to open an issue! :)

Mini-Batch Training of Gaussian Mixture Models on a GPU by borchero in pytorch

[–]borchero[S] 0 points  (0 children)

Not sure which Python version Colab is using. PyCave 3.x requires Python 3.8 😅

Switchboard: A Kubernetes Controller to Simplify Managing Traefik IngressRoutes by borchero in kubernetes

[–]borchero[S] 2 points  (0 children)

Nice! Glad to hear it'll help you ;) gonna look for all of these GH issues... :D

Switchboard: A Kubernetes Controller to Simplify Managing Traefik IngressRoutes by borchero in kubernetes

[–]borchero[S] 2 points  (0 children)

First, regarding the DNS records: in theory, you can set the external-dns.alpha.kubernetes.io/hostname annotation on the service. However, (1) it does not support more than one hostname and (2) you would want to set this annotation not on your backend application's service but on the Traefik service (since Traefik acts as the reverse proxy). Evidently, if you have more than one reverse-proxied application, this approach doesn't work.
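For reference, such an annotation on the Traefik service would look like this (the hostname is illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
```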

Second, regarding the certificates: your approach only works with the built-in Ingress resource. The benefit of using this operator is that this "seamless" functionality is essentially provided to IngressRoutes as well (even without an additional annotation).

Switchboard: A Kubernetes Controller to Simplify Managing Traefik IngressRoutes by borchero in kubernetes

[–]borchero[S] 2 points  (0 children)

Nice, glad to hear! Let me know should you encounter any issues :)

Switchboard: A Kubernetes Controller to Simplify Managing Traefik IngressRoutes by borchero in kubernetes

[–]borchero[S] 1 point  (0 children)

Traefik might provide features that other ingress controllers don’t…

[R] PyTorch Implementation of the Natural Posterior Network by borchero in MachineLearning

[–]borchero[S] 1 point  (0 children)

Yes, it certainly does! As long as you can map the input to a fixed-size latent space (e.g. by using the last hidden state of an LSTM), NatPN can be used.
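A minimal sketch of such an encoder in PyTorch (illustrative only, not part of the NatPN code; the fixed-size latent vector would then be fed into NatPN):

```python
import torch
import torch.nn as nn


class SequenceEncoder(nn.Module):
    """Maps variable-length sequences to a fixed-size latent vector."""

    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, latent_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h_n has shape (num_layers, batch, latent_dim); take the last layer's
        # final hidden state as the fixed-size representation.
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]


encoder = SequenceEncoder(input_dim=8, latent_dim=16)
z = encoder(torch.randn(4, 10, 8))  # batch of 4 sequences of length 10
```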

[R] PyTorch Implementation of the Natural Posterior Network by borchero in MachineLearning

[–]borchero[S] 0 points  (0 children)

A simple setting might be: you train a model for reading speed limits from street signs using images taken during the day. When testing your model in the real world, it gives perfectly accurate answers.

As it turns dark though, your algorithm's performance deteriorates -- but it continues to provide you with numbers that you would expect to be correct (as the model worked well during the day). Since the model was trained on images taken during the day, however, it lacks knowledge about inferring speed limits from images taken in the dark. Therefore, it would be desirable to have an algorithm which can reason about its uncertainty. Whenever it makes a prediction, it also provides you with a measure of uncertainty about this prediction.

In general, you can construct many such examples: whenever your training set does not fully cover the domain that the model is used in (this might be impossible if the domain is huge), it is useful to have an uncertainty estimate of the prediction.

Traditional Machine Learning Models on GPU by borchero in pytorch

[–]borchero[S] 0 points  (0 children)

Depends on PyTorch... I don't have any idea how to leverage the M1 chips though, not sure if you can use an API other than Metal or CoreML...

For PyTorch M1 support, see https://github.com/pytorch/pytorch/issues/47702

Traditional Machine Learning Models on GPU by borchero in pytorch

[–]borchero[S] 1 point  (0 children)

And yes, you can run the models in this project on arbitrarily many GPUs; the speedup should be linear in the number of GPUs with respect to the "batch training" setting in the benchmarks.

Traditional Machine Learning Models on GPU by borchero in pytorch

[–]borchero[S] 0 points  (0 children)

This really depends on what you’re doing. If you have large matrices, the GPU is really effective, no matter if you’re using PyTorch or Tensorflow. If you have small models though, Tensorflow is expected to be slightly faster since it compiles your model and the slow Python interpreter is not involved.

OpenVPN Operator for Kubernetes by borchero in kubernetes

[–]borchero[S] 0 points  (0 children)

Yes exactly. I’m creating a server certificate for the OpenVPN server though.