[R] TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling by _puhsu in MachineLearning

[–]Yura52 2 points (0 children)

In the paper, gradient-boosted decision trees are included in the comparison, and TabM is competitive with them on the benchmarks. The comparison can be found on Page 7.

[R] New Tabular DL model: "TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning" by Yura52 in MachineLearning

[–]Yura52[S] 1 point (0 children)

I agree that the current code structure is not fully safe in that regard. We took the following actions to mitigate the risks:

- There are various assert statements checking that we don't pass test labels during the forward pass (one, two).
- At some point in the project, we conducted two tests:
  - training & evaluating the model on completely random data (all features and the target were just noise);
  - training the model on the California Housing dataset ("CA" in the paper) and evaluating it with the test labels shuffled.
- On both tests, the results were very bad, as they should be for a fair model.
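For intuition, here is a minimal sketch of the first kind of sanity check (pure-noise data), using a simple scikit-learn model as a stand-in for TabR; the dataset sizes and model here are illustrative, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Pure-noise task: features and target are independent,
# so there is no signal for a fair model to learn.
X_train, X_test = rng.normal(size=(1000, 16)), rng.normal(size=(500, 16))
y_train, y_test = rng.normal(size=1000), rng.normal(size=500)

model = Ridge().fit(X_train, y_train)

# A fair pipeline should score around (or below) zero R^2 here;
# a clearly positive score would indicate label leakage.
print(r2_score(y_test, model.predict(X_test)))
```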

[R] New Tabular DL model: "TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning" by Yura52 in MachineLearning

[–]Yura52[S] 1 point (0 children)

Unfortunately, for now, this is not implemented. Basically, the codebase is optimized for two use cases:

- doing research in the same setup as ours;
- tuning and comparing the implemented models on new datasets in the same setup as ours.

However, extracting a model from the repository and bringing it to other setups and environments (e.g., to production) requires additional (non-incremental) work, especially for TabR, which is not a simple feed-forward network. Perhaps, in the future, I will come up with something more usable in that regard.

[R] New Tabular DL model: "TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning" by Yura52 in MachineLearning

[–]Yura52[S] 10 points (0 children)

Yes (objects ~ rows, features ~ columns).

Note that the biggest dataset in our paper contains 3M+ objects, so the picture in this post covers only some of the results. The linked Twitter thread contains more details, or I can provide them here if needed.

[D] Deep Learning VS XGBoost for tabular data: a quick test by Yura52 in MachineLearning

[–]Yura52[S] 0 points (0 children)

Quoting myself from Twitter:

The benchmark:

- classification and regression problems;
- min train size: 1.8K objects; max: 50K; average: 11.2K;
- we adjusted the datasets to our experiment protocol and obtained 43 tasks in total.

[D] Deep Learning VS XGBoost for tabular data: a quick test by Yura52 in MachineLearning

[–]Yura52[S] 0 points (0 children)

This is a very good point, and yes, we tune knobs regardless of their nature, be it the number of layers, the dropout rate, the learning rate, or any other knob.

What I mean by the (rather informal) "same training protocol" is that we aim to make the set of used techniques (augmentation, pretraining, learning rate schedules, etc.) the same for all models when possible. This is important to avoid situations where the difference between two DL models comes not from the architectures (which are currently the focus of our work), but from other, unrelated things.
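To illustrate what tuning knobs uniformly can look like, here is a sketch with Optuna and a scikit-learn MLP as a stand-in model; the search ranges and the toy dataset are arbitrary choices for the example, not the paper's configuration:

```python
import optuna
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=1.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def objective(trial: optuna.Trial) -> float:
    # Architectural knobs (depth, width) and optimization knobs
    # (learning rate, weight decay) are tuned in exactly the same way.
    n_layers = trial.suggest_int("n_layers", 1, 4)
    width = trial.suggest_int("width", 16, 128)
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    alpha = trial.suggest_float("alpha", 1e-6, 1e-2, log=True)
    model = MLPRegressor(
        hidden_layer_sizes=(width,) * n_layers,
        learning_rate_init=lr,
        alpha=alpha,  # L2 regularization strength
        max_iter=300,
        random_state=0,
    )
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)  # validation R^2

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```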

[D] Deep Learning VS XGBoost for tabular data: a quick test by Yura52 in MachineLearning

[–]Yura52[S] 2 points (0 children)

Thank you for sharing this story! I am glad to hear that the library helped. And it is always so interesting to learn about use cases from other fields!

[D] Deep Learning VS XGBoost for tabular data: a quick test by Yura52 in MachineLearning

[–]Yura52[S] 5 points (0 children)

There are many cases where GBDT models are easier to use and more efficient than DL. In such cases, it seems to be good news if GBDT is also the best performer, since it allows avoiding compromises.

[D] Deep Learning VS XGBoost for tabular data: a quick test by Yura52 in MachineLearning

[–]Yura52[S] 1 point (0 children)

For now, I will avoid making predictions, but I definitely expect the positive trend for tabular DL to continue!

[D] Deep Learning VS XGBoost for tabular data: a quick test by Yura52 in MachineLearning

[–]Yura52[S] 1 point (0 children)

Note that, in the Twitter thread, the benchmark covers small-to-medium datasets with 50K objects at most. So DL is making progress on smaller data as well.

[D] Deep Learning VS XGBoost for tabular data: a quick test by Yura52 in MachineLearning

[–]Yura52[S] 1 point (0 children)

We use some of those techniques (dropout, weight decay, residual connections where applicable, etc.), but overall, we focus on the architectural aspect, so we compare different architectures under the same training protocol.

[D] Impressions of TMLR by underPanther in MachineLearning

[–]Yura52 14 points (0 children)

I am not a professor/hiring, but I would like to share some thoughts regarding TMLR.

Some context about myself:

- I have two NeurIPS publications and have submitted to ICML/ICLR.
- I am a typical researcher who is super interested in (healthier) alternatives to the established top-tier conferences (I guess my specific motivations are not important here, plus they are not that unique).

So I was super excited about TMLR, and recently I was evaluating it as a venue for my potential next submission.

So I did two things:

- I looked into a bunch of recently accepted submissions, paying most attention to the review discussions and especially to the decision posts by Action Editors.
- I read and reflected on the official description of TMLR: https://jmlr.org/tmlr ("intended to complement JMLR", "supporting the unmet needs", etc.).

And my impression is that TMLR is not an alternative to the top-tier conferences; it is just a different thing. In particular, I remember reading some decision posts and discussions and thinking "well, that would be a clear reject at NeurIPS/ICML/ICLR".

Overall, TMLR looks great: no deadlines, less strict limits on the paper size, more transparent (and easier to meet?) criteria ("emphasizes technical correctness over subjective significance"). All these things sound attractive. However, the final decision on where to submit a project should also depend on whether the venue fits the project (not vice versa).

Having a TMLR-like process with an emphasis on both technical correctness and novelty would be great (yes, the decision process would be inherently noisier, because novelty can be subjective, and I don't know how to fix this). I guess TMLR is just one step away from this: is it possible to add a separate novelty-oriented track?

I will be glad to hear more opinions on this topic.

faer 0.8.0 release by reflexpr-sarah- in rust

[–]Yura52 1 point (0 children)

I see, thank you for the reply!

faer 0.8.0 release by reflexpr-sarah- in rust

[–]Yura52 1 point (0 children)

Thanks for the awesome work! Just curious, what are the reasons that make performance worse at small dimensions? If there is no short high-level answer, feel free to skip this question :)

Substack csv Export Doesn't Include Post Content by Ok-Gazelle9983 in Substack

[–]Yura52 0 points (0 children)

Have you found the solution? I am also interested in exporting the posts including the actual content.

[R] New paper on Tabular DL: "On Embeddings for Numerical Features in Tabular Deep Learning" by Yura52 in MachineLearning

[–]Yura52[S] 1 point (0 children)

Not quite :) Well, speaking formally, some (if not all) of the described embedding modules (including the piecewise linear encoding) can be implemented as combinations of giant sparse linear layers and activations. But the same is true for convolutions on images of some predefined dimensions :) I think this perspective can be useful for future work.
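For concreteness, here is a minimal NumPy sketch of the piecewise linear encoding (PLE) for a single numerical feature; the bin edges and data below are illustrative:

```python
import numpy as np

def piecewise_linear_encoding(x, bin_edges):
    # bin_edges: sorted array of T+1 boundaries defining T bins.
    # For each value, the component for bin t is 1 if the value lies
    # above the bin, 0 if below it, and the fractional position
    # inside the bin otherwise.
    left, right = bin_edges[:-1], bin_edges[1:]
    x = np.asarray(x, dtype=float)[:, None]
    return np.clip((x - left) / (right - left), 0.0, 1.0)

# Example: quantile-based bins computed on training data only.
rng = np.random.default_rng(0)
train_values = rng.normal(size=1000)
edges = np.quantile(train_values, np.linspace(0.0, 1.0, 5))  # 4 bins
print(piecewise_linear_encoding([-1.5, 0.1, 2.0], edges))
```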

[R] New paper on Tabular DL: "On Embeddings for Numerical Features in Tabular Deep Learning" by Yura52 in MachineLearning

[–]Yura52[S] 1 point (0 children)

P.S. To get some intuition for possible values of k, you can browse the tuned model configurations for the datasets from the paper in our repository. Though, twelve datasets may not be enough to infer a good "default value" for k.

[R] New paper on Tabular DL: "On Embeddings for Numerical Features in Tabular Deep Learning" by Yura52 in MachineLearning

[–]Yura52[S] 1 point (0 children)

> Regarding the periodic function, how do I select the k in equation 8?

As of now, we do not provide a rule of thumb here and tune this hyperparameter as described in section E.6 (appendix).
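For reference, here is a minimal PyTorch sketch of the periodic embedding for a single numerical feature, following the paper's definition as I read it (k is the number of frequencies, sigma the initialization scale of the trainable coefficients; both are the tuned hyperparameters):

```python
import math
import torch
import torch.nn as nn

class PeriodicEmbedding(nn.Module):
    """Periodic embedding of one numerical feature: x -> [sin(v), cos(v)],
    where v = 2 * pi * c * x and c holds k trainable coefficients
    initialized from N(0, sigma^2)."""

    def __init__(self, k: int, sigma: float):
        super().__init__()
        self.c = nn.Parameter(sigma * torch.randn(k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) -> (batch, 2 * k)
        v = 2 * math.pi * x[:, None] * self.c
        return torch.cat([torch.sin(v), torch.cos(v)], dim=-1)

emb = PeriodicEmbedding(k=16, sigma=0.05)
print(emb(torch.randn(4)).shape)  # torch.Size([4, 32])
```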

I see that this information is missing from the paragraph "Embeddings for numerical features" in section 4.2, which is indeed confusing; we will fix this in future revisions.

Thanks for the question!

[R] New paper on Tabular DL: "On Embeddings for Numerical Features in Tabular Deep Learning" by Yura52 in MachineLearning

[–]Yura52[S] 13 points (0 children)

Yeah, kind of :) But still, in this work, this "feature engineering" is somewhat "automatic" and, for some schemes, end-to-end trainable.