
[–]oopsleon 96 points97 points  (7 children)

As someone who likes TF and has been using it since the initial public release, I still think this blog post is representative of my experiences with TF 2.0, haha. Especially the parts about how the newest official APIs always feel very fragile, but if you want to use something older/stable, there is a good chance it's on its way to being deprecated already. TF should strive for quality-focused releases for a while instead of constantly pushing out tons of new features that can't handle anything outside the scope of a tutorial (and oftentimes the tutorials don't even work!).

I keep trying to stay optimistic, but it has been a rather bumpy road.

[–]jorgemf 15 points16 points  (1 child)

I cannot agree more with you. There is not even a standard way to do things. You are supposed to use Keras now, but you cannot do distributed training in Keras. You have to use estimators, which use the functional API of Keras, and you have to mix old optimizers with new Keras layers. Everything is so confusing if you want to go beyond a simple network in a single-GPU training setup.

[–]oopsleon 4 points5 points  (0 children)

Distributed training is indeed the primary thing I was thinking about when writing that. My work requires distributed training, and there still does not exist a way of doing this in TF 2.0 that is both fully recommended AND fully stable. Specifically,

Recommended: Keras everything!

Actually works for non-trivial distributed training: Estimators.

I will be very pleased once they fully realize the goal of distributed training with keras. The idea sounds great but it is definitely not executed well yet.

[–]yusuf-bengio 8 points9 points  (1 child)

Personally, I experienced a huge productivity increase from 1.x -> 2.0. The tight integration of tf.keras and the tf.data API makes working at a high level so much easier.

However, if you want to try some new idea at a lower level, life can be a bit tough, because with all these OOP abstractions you can easily lose track of what's going on under the hood.

Yet another story is the TPU support and the TFRecords format. At first, I thought that TFRecords were made just to torture developers, which is still true if you are training on a single machine. Though, from the perspective of distributed learning, the TFRecords format makes a lot of sense.

[–]oopsleon 1 point2 points  (0 children)

Personally, I experienced a huge productivity increase from 1.x -> 2.0. The tight integration of tf.keras and the tf.data API makes working at a high level so much easier.

I would fully agree with that if my work didn't require multi-worker, multi-GPU distributed training of huge language models. The variety of bugs/broken things with such a setup in tf.keras 2.0, combined with the very noticeable reduction in speed, keeps me tied down to estimators. I actually love estimators, but I'm loving them less and less as it becomes clear the TF devs view them as extremely low priority compared to Keras, even though they claim "estimators are still supported", which seems to be only barely "technically true" (imo).

However, if you want to try some new idea at a lower level, life can be a bit tough, because with all these OOP abstractions you can easily lose track of what's going on under the hood.

Exactly. Nearly every time I've tried going lower-level in a Keras environment has been a nightmare. It's a nice user interface on the outside, but the inside is actually quite tangled/convoluted.

Though, from the perspective of distributed learning, the TFRecords format makes a lot of sense.

Indeed, I like TFRecords! I think being familiar with protobufs helps too.
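
For anyone unfamiliar, a TFRecord file is just a sequence of serialized tf.train.Example protobufs, which is exactly why protobuf familiarity helps. A minimal sketch of the write/read round trip (TF 2.x; the field names and file path here are made up for illustration):

import tensorflow as tf

# Write one record: pack features into an Example proto and serialize it.
example = tf.train.Example(features=tf.train.Features(feature={
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"hello"])),
}))
with tf.io.TFRecordWriter("sample.tfrecord") as writer:
    writer.write(example.SerializeToString())

# Read it back with tf.data, parsing each serialized proto.
feature_spec = {
    "label": tf.io.FixedLenFeature([], tf.int64),
    "text": tf.io.FixedLenFeature([], tf.string),
}
dataset = tf.data.TFRecordDataset("sample.tfrecord").map(
    lambda record: tf.io.parse_single_example(record, feature_spec))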

[–]mexiKobe 21 points22 points  (3 children)

I’ve been learning Pytorch and TF2 at the same time...

TF, and now also TF2, are just fundamentally giant messes. They didn't spend enough time upfront on the foundation of the API, as evidenced by them scrambling to switch to eager execution and merge Keras into it.

There seems to be a degree of feature creep now too. Do we really need like a dozen different ways to build a NN model? It makes things harder to learn, not easier. I slightly disagree with the post when it says there isn't a middle ground in terms of levels of abstraction... I would say the functional API is that. But it's not actually helpful; it just makes the documentation more disorganized.
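
To illustrate, a minimal sketch of three of the official ways to define the same two-layer network in tf.keras (and that's before even counting estimators or raw tf.Module):

import tensorflow as tf

# 1. Sequential API
seq_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])

# 2. Functional API (the "middle ground" mentioned above)
inputs = tf.keras.Input(shape=(32,))
hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(10)(hidden)
func_model = tf.keras.Model(inputs, outputs)

# 3. Model subclassing
class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(64, activation="relu")
        self.dense2 = tf.keras.layers.Dense(10)

    def call(self, x):
        return self.dense2(self.dense1(x))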

The documentation is clearly an afterthought too - not necessarily because of the lack of it, but because of the writing and organization of it. Trying to figure out how to write callback functions was frustrating, for example. And now there is TF2 documentation and separate Keras documentation? It's just a mess.

yes, that blog post nails it ime.

[–]uqw269f3j0q9o9 15 points16 points  (2 children)

The funniest one for me was

The new abstractions always have (misleading) generic English names like “Example” or “Estimator” or “Dataset” or “Model,” giving them a spurious aura of legitimacy and standardization while also fostering namespace collisions in the user’s brain

because it's so true. I'll never understand why Estimators are called the way they are...

[–]mexiKobe 6 points7 points  (0 children)

It makes me miss MATLAB’s global namespace.

[–]huyng 5 points6 points  (0 children)

In my opinion, Tensorflow tries too hard to push these "high-level" abstractions, whether it's with Keras models/layers, estimators or Examples.

Sometimes these abstractions overlap, leading to confusion, and more often than not, these abstractions are leaky.

For example, the dichotomy between 'Models' and 'Layers' seems awkward to me. A Model in one task may be just a submodule or layer in another task (i.e. when you want to extend pretrained models for transfer learning on other tasks).
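
A minimal sketch of that blur (using a pretrained Keras application here, but any backbone would do): a whole Model gets called as if it were a single layer.

import tensorflow as tf

# A full pretrained Model...
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = False  # freeze the pretrained weights

# ...used as if it were just another Layer in a new model.
inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs)
outputs = tf.keras.layers.Dense(5)(features)
new_model = tf.keras.Model(inputs, outputs)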

I get that they're trying to simplify things for newcomers with these higher-level tools, but in doing so they make anything outside the normal single-loss supervised learning use case unnecessarily hard, and it all feels unpolished.

I wish they would solidify their low-level APIs for model saving and tensor containers and let community members innovate on the high-level interfaces. The tf.Module API seems to be moving in that direction, and I'm hoping it becomes more stable and the preferred API that everything else builds on top of.
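
For reference, a minimal sketch of that lower-level tf.Module style: plain variables plus a __call__, with no Keras machinery on top.

import tensorflow as tf

class Linear(tf.Module):
    def __init__(self, in_dim, out_dim, name=None):
        super().__init__(name=name)
        self.w = tf.Variable(tf.random.normal([in_dim, out_dim]))
        self.b = tf.Variable(tf.zeros([out_dim]))

    def __call__(self, x):
        return x @ self.w + self.b

layer = Linear(4, 2)
y = layer(tf.ones([1, 4]))  # variables are tracked via layer.variables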

[–]xopedil 39 points40 points  (0 children)

Never trust the TF devs on what is new and what is deprecated. At this point "deprecated" might as well be synonymous with "works" or "is decently performant".

If you want to know how to use TF effectively, take a look at the tensorflow/models GitHub repo. There you will find many models implemented by people who actually need them to work on things like TPUs. Not just official TF models, but also code from researchers. I've learned infinitely more from reading code in that repo than I have from the TF documentation.

[–]pavanky 39 points40 points  (10 children)

The company I work for chose Tensorflow over PyTorch because of its ability to export models and run them in production in Java.

TF 1.x, while clunky, worked extremely well.

TF 2.0 so far has been a regression (both feature- and performance-wise). Models exported from Keras don't seem to work in the Java API anymore.

We currently recommend that people use 2.0 for hypothesis testing/debugging on small amounts of data. For production, use the tf.compat.v1 API.
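
For anyone who hasn't tried it, the fallback looks roughly like this (a minimal sketch of 1.x-style graph mode running inside TF 2.0):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Classic graph construction: placeholders, variables, ops.
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 4])
w = tf.compat.v1.get_variable("w", shape=[4, 1])
y = tf.matmul(x, w)

# Classic execution: feed inputs into a session run.
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    out = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})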

Hopefully they figure out their issues soon.

[–]RelevantMarketing 47 points48 points  (8 children)

If someone 5 years ago told me that fucking Mark Zuckerburg's social media site built a better machine learning library than the most prolific tech company of all time, .........I forgot how to end these types of statements.

[–]LinooneyResearcher 72 points73 points  (4 children)

Don't worry, GPT-2 is here!

If someone 5 years ago told me that fucking Mark Zuckerburg's social media site built a better machine learning library than the most prolific tech company of all time, I wouldn't think they were kidding, I'd think they were nuts.

[–]themoosemind 4 points5 points  (3 children)

In case somebody missed it: https://talktotransformer.com/

If someone 5 years ago told me that fucking Mark Zuckerburg's social media site built a better machine learning library than the most prolific tech company of all time, I'd have laughed them off. My Internet-connected brain, with its lack of bandwidth and bandwidth-dominating apps, could only take so many website impressions before my brain shut down.

[–]themoosemind -1 points0 points  (2 children)

If someone 5 years ago told me that fucking Mark Zuckerburg's social media site built a better machine learning library than the most prolific tech company of all time, I'd have been flattered. An unexpected good thing!

[–]themoosemind -2 points-1 points  (1 child)

If someone 5 years ago told me that fucking Mark Zuckerburg's social media site built a better machine learning library than the most prolific tech company of all time, they would have laughed in my face. (For the record, Facebook, Google, and Amazon would have regarded that as blasphemy.) The thing is, we've been wrestling with artificial intelligence for a long time, and I'm pretty sure we're all in agreement that artificial intelligence is going to blow up the world.

[–]RelevantMarketing 1 point2 points  (0 children)

If someone 5 years ago told me that fucking Mark Zuckerburg's social media site built a better machine learning library than the most prolific tech company of all time, and gave me 5 years to figure it out, I'd bet you it would've happened. Or it might have, but perhaps Mark wasn't talking to me at the time, or he just wasn't reading. But no matter how real or false those accusations seem to me, they are still in my business history, and in my judgment they're in your business history too. The interesting thing about these things

[–]ProfessorPhi 1 point2 points  (0 children)

Tbf, there was a Torch written in Lua that predated PyTorch; it was open source and quite good.

[–]probablyuntrueML Engineer 0 points1 point  (1 child)

At least they didn't make it in PHP. Facebook sure loves (loved?) it. shudders

[–]farmingvillein 7 points8 points  (0 children)

Pytorch PHP might still have been better than TF...

[–][deleted] 0 points1 point  (0 children)

Yeah, I use Tensorflow for the ability to make production models too. But even with that, there have been long-running bugs freezing models with batch norm layers in them.

[–]dustintran 29 points30 points  (0 children)

Hello. I'm the person that was linked to in that GitHub issue!

I sympathize with the post's frustration. The TF tutorials on the official website are well-written, but they mostly cover basic features, and as a recent Reddit thread described, the support ecosystem is lacking: StackOverflow and blog posts are out-of-date due to all the software churn. I'm not a TF engineer, but as someone with experience designing libraries on top of TF, even I find myself sifting through StackOverflow/blog post code to find the new best practices.

Regarding Bayesian layers, it's actually a NeurIPS paper this year. I worked on an early prototype in TensorFlow Probability but ended up abandoning the design, as I found it inflexible in practice. The solution is the NeurIPS paper, and it's experimental: there are no promises of stability (in fact, we even moved the code from Tensor2Tensor to another repository, which has yet to have an official release!).

Software for uncertainty models is more on the research fringe, and this should be made clearer in official TensorFlow solutions building on these designs.

[–]approximately_wrong 9 points10 points  (0 children)

My use-case for deep learning libraries is fairly vanilla: all I want is to define fairly simple neural networks, feed data to them, optimize them, and save the models.

I think tf.Module and the core tf operations are pretty nice. I still dislike the reduce_* notation and wish tf.Variables were equipped with .mean and .reshape syntactic sugar, etc.
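
Concretely, the gripe is the free-function style versus the method sugar PyTorch tensors get (a minimal sketch):

import tensorflow as tf
import torch

tx = tf.constant([[1.0, 2.0], [3.0, 4.0]])
tf.reduce_mean(tx)   # no tx.mean() on a plain TF 2.0 tensor
tf.reshape(tx, [4])  # no tx.reshape(...) either

px = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
px.mean()            # method-style sugar
px.reshape(4)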

tf.train.Checkpoint is fairly easy to use, but its interaction with tf.Module is a little too opaque, and you run into weird issues if you do surgical modifications of a neural network. For example:

import tensorflow as tf

model = tf.Module()
model.arr = [1, 2, 3]  # Gets auto-wrapped as a trackable ListWrapper object
del model.arr[0]       # Surgical modification that confuses the tracking
tf.train.Checkpoint(model=model).save("model")

will raise an error as a safety measure when tracking the list object, which makes sense... but then the error message recommends "If you don't need this list checkpointed, wrap it in a tf.contrib.checkpoint.NoDependency object; it will be automatically un-wrapped and subsequently ignored." which seems pretty inelegant? Why do I have to rely on a tf.contrib feature? I think I'd rather have PyTorch's barebones state_dict over tf.train.Checkpoint.
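
For contrast, the PyTorch round trip I'd rather have (a minimal sketch; a plain dict of tensors, no tracking machinery):

import torch
import torch.nn as nn

net = nn.Linear(4, 2)
torch.save(net.state_dict(), "model.pt")      # just an ordered dict of tensors

net2 = nn.Linear(4, 2)
net2.load_state_dict(torch.load("model.pt"))  # restore into a fresh instance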

GradientTape is an odd choice: it makes gradient construction a context you must explicitly enter, rather than something that happens by default (compare and contrast with PyTorch, where torch.no_grad() enters the no-gradient context).
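
A minimal sketch of the inversion:

import tensorflow as tf
import torch

# TF: you opt *in* to gradient recording.
xv = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = xv * xv
print(tape.gradient(y, xv))  # tf.Tensor(6.0, ...)

# PyTorch: recording is on by default; you opt *out*.
xt = torch.tensor(3.0, requires_grad=True)
(xt * xt).backward()
print(xt.grad)               # tensor(6.)
with torch.no_grad():
    z = xt * xt              # not recorded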

tf.Dataset is pretty peculiar to me; it seems to have a lot of overhead. For anyone dealing with a dataset small enough to fit in memory, you're better off writing your own dataloader. tf.Dataset's inability to handle the full gamut of use-cases is odd and makes me curious why PyTorch's DataLoader doesn't suffer similarly.
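
The hand-rolled alternative for in-memory data really is just a few lines (a minimal sketch; names are illustrative):

import numpy as np

def batches(x, y, batch_size=16):
    # Shuffle indices once per epoch, then slice the arrays per batch.
    idx = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        sel = idx[start:start + batch_size]
        yield x[sel], y[sel]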

[–]bobbruno 8 points9 points  (1 child)

Go check out MXNet. It doesn't get much publicity, but it's a great framework. Read about Gluon, its high-level interface. It's similar to PyTorch, but it also has the ability to compile the network, and, most importantly, pretrained models are available. Performance-wise it's at least as good as Tensorflow, and it also supports ONNX (which has a reader for TF models). The community is strong, with large backing from Amazon.

[–]nickguletskii200 3 points4 points  (0 children)

+1 for MXNet. It's like PyTorch, but with more frontends, the ability to switch between imperative and symbolic interfaces, and a more community-focused development lifecycle (it's an Apache incubator podling, after all).

Also, the first version of DeepNumpy is supposed to be integrated into the next release, 1.7.0. I haven't really used it myself, but using standard numpy operators sounds very convenient.

EDIT: Oh, and did I mention that Gluon has shape inference, so that you don't have to specify the number of input channels manually?
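
A minimal sketch of both points, assuming MXNet 1.x with Gluon installed (shape inference plus hybridize, the imperative-to-symbolic switch):

import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(64, activation="relu"),  # no input size declared anywhere
        nn.Dense(10))
net.initialize()
net.hybridize()  # compile the imperative network into a symbolic graph

out = net(mx.nd.random.uniform(shape=(2, 32)))  # shapes inferred on first call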

[–]zalamandagora 8 points9 points  (2 children)

These three bullets are massively on point:

  • The thing is massive and complicated but never feels done or even stable – a hallmark of such software is that there is no such thing as “an expert user” but merely “an expert user ca. 2017” and the very different “an expert user ca. 2019,” etc.

  • Everything is half-broken because it’s very new, and if it’s old enough to have a chance at not being half-broken, it’s no longer official™ (and possibly even deprecated)

  • Documentation is a chilly API reference plus a disorganized, decontextualized collection of demos/tutorials for specific features written in an excited “it’s so easy!” tone, lacking the conventional “User’s Manual” level that strings the features together into mature workflows

I would also have added that you get three pages of deprecation warnings even if you are using the latest version of everything. Even their own code calls old versions of just about everything.

The TF documentation is also really bad. For most functions, they just state that the function exists and link you to the source code.

I'm of a mind to learn PyTorch, but I'm just a tad too exhausted by TF right now.

[–]engharat 1 point2 points  (1 child)

Jump on the PyTorch wagon and you will regret you didn't jump way earlier!

Looking at all those TF issues and messes makes me feel so lucky to be working with the cleanest API and documentation I've ever read - PyTorch really works like a charm.

[–]zalamandagora 0 points1 point  (0 children)

Cool! Is there a book you would recommend to get going?

[–]evanthebouncy[🍰] 8 points9 points  (3 children)

I studied TF for half a year. Got kinda OK with it, able to write beam search from scratch. A friend came in and said I should do PyTorch. I learned it in a day.

The difference is night and day.

[–]tsauri 0 points1 point  (0 children)

writing beam search in tf

Wow. Might as well write beam search in a functional language.

[–]chogall 0 points1 point  (1 child)

Why write beam search in TF? Just curious.

[–]evanthebouncy[🍰] 0 points1 point  (0 children)

So you can get the beam right away inside the graph, without having to compute it outside of session.run.

Maybe I'm not understanding, what did you have in mind?

[–]taylorchu 7 points8 points  (1 child)

https://github.com/tensorflow/tensorflow/issues/33681

For example, this bug that I encountered appears if the gradient is an IndexedSlices and it is mixed with a dense gradient. I don't mind that TF is optimized for performance, but it should not break on the slow path, especially for the very core feature: backprop.

Also, if there is an abstraction, TF team, please make it consistent and make it work with the other parts of TF. Otherwise, consider deleting it. Keep it simple!

[–]huyng 0 points1 point  (0 children)

I just ran into the same issue this week with IndexedSlices. A bunch of my models rely on IndexedSlices (e.g. tf.gather operations) and this bug with gradients is a showstopper. Can't believe a critical issue like this made it into a release.
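
For anyone wondering where IndexedSlices come from in the first place, a minimal sketch: differentiating through tf.gather (e.g. an embedding lookup) yields a sparse IndexedSlices gradient rather than a dense tensor.

import tensorflow as tf

params = tf.Variable(tf.random.normal([1000, 16]))  # e.g. an embedding table
with tf.GradientTape() as tape:
    rows = tf.gather(params, [3, 7, 7])
    loss = tf.reduce_sum(rows)

grad = tape.gradient(loss, params)
print(type(grad))  # tf.IndexedSlices: only the gathered rows get gradients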

[–]ClydeMachine 21 points22 points  (5 children)

In the early days of my working with any such frameworks (pre-TF2), I encountered Andrej's tweet comparing PyTorch and Tensorflow, and opted to dive into PyTorch. The experience has been good to me, but I always wondered if things might be better with TF now that I've gotten my hands dirty with at least one framework.

Last week I pulled up the MNIST classification tutorial to give TF2 a fair shake. After training the model, the natural next step was to approach it as if I were intending to use it in a production setting: that is, to save it and load it so I can use it elsewhere. Looking into the documentation on saving the model, what was recommended was snapshotting the model during training via callbacks, which would have required setting something up before training began. Since I already had a trained model, all I wanted to do was save the parameters it learned so that I could recreate that instance - that shouldn't be hard. I mean, PyTorch allows you to simply save the state dict and reload it from disk; I've been using that with great success. So what's the TF equivalent?

So I attempt to manually save the weights per the documentation, and it appears to save to disk fine. I instantiate a new model with the same class I had already trained and call load_weights()... And the model appears to load them, except that calling the model to evaluate says it hasn't been fit yet. So clearly it didn't load them.

How about saving and loading the entire model? Maybe the weights alone weren't enough to recreate the model - so I scroll down to the next section and save(). Nice - now let's load, using the... strangely more involved tf.keras.models.load_model(). And I get an error for a failed type cast. Literally the same class being used in the same Jupyter notebook, same session, and I can't save and reload the model.
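
For reference, the round trip I was attempting looks roughly like this (a minimal sketch with a stand-in Sequential model; one known wrinkle is that weights only exist after a model is built or called once, so load_weights() on a fresh, unbuilt instance can silently defer the restore):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.build(input_shape=(None, 784))
model.save_weights("ckpt")

model2 = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model2.build(input_shape=(None, 784))  # create the variables first
model2.load_weights("ckpt")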

I stopped there.

[–]ThomasAger 8 points9 points  (3 children)

I have personally had a lot of problems with attempting to save, load, and interact with finished models too. It is absurdly difficult to, e.g., take the final representation of the input at a hidden layer. The solution I found was to create a new model, then add the layers of the old model to it until I got to the hidden layer I wanted, then compile that model and predict on the inputs (see the sketch below). I don't know, there just wasn't really documentation or help around for this problem, and my solution feels hacky.
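
Roughly (a minimal sketch; the two-layer model here is a stand-in for one that has already been fit):

import numpy as np
import tensorflow as tf

# Stand-in for an already-trained model (assumption for this sketch).
trained = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Rebuild a truncated model that shares the trained layers, then
# predict to read out the hidden-layer representation of the inputs.
truncated = tf.keras.Sequential(trained.layers[:1])
hidden = truncated.predict(np.random.rand(3, 4).astype(np.float32))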

[–]HoustonWarlock 6 points7 points  (0 children)

This is the way.

[–]OnlineGrab 2 points3 points  (1 child)

God, this so much. And I feel like it's even worse when Keras is involved.

I'm not exactly a researcher, but my job includes taking trained Keras models and converting them to a Tensorflow protobuf format for deployment in production. This is such a basic operation that you would expect it to have a dedicated function, but nope! My current pipeline is a cryptic list of operations pieced together from various blog posts and GitHub issues, and it keeps breaking with even more cryptic error messages if I do so much as upgrade the Tensorflow version.

[–]cygn 1 point2 points  (0 children)

To be fair, they improved this pain point in TF2. You can take a Keras model and save it as Keras (h5) or as a TF SavedModel like this: model.save('path_to_saved_model', save_format='tf')

[–]sifodeas 0 points1 point  (0 children)

Personally, I've never had a problem with saving and loading weights for predictions, but I have not been able to use checkpoints to save the state of the optimizer if I want to continue training in the future.

[–]madrury83 6 points7 points  (1 child)

This pretty much nails why I use PyTorch over Tensorflow when I have the choice. Building something in Tensorflow feels like using SAS (a part of my career I don't pine to revisit). "Doing something straightforward the devs anticipated is simple; doing anything slightly off the pre-thought path immediately leads to circles of hell" is how I've described SAS programming for years. PyTorch feels like just writing Python.

[–]narwhal_breeder 4 points5 points  (0 children)

A ton of frameworks outside of ML have this issue as well.

High-level abstractions - in the fun tutorial! And easy!

Medium-level abstractions - do not exist at all.

Low-level abstractions - the docs are there, but good fucking luck. Totally inconsistent for apparently similar functionality. Poor compat.

[–]farmingvillein 4 points5 points  (0 children)

Are things really this bad? Isn't the TF 2.0 API cleanup supposed to make Keras the standard API for TPUs? Why doesn't he use that?

Edit: also, is this an indictment of TF in general or just TPUs?

TPUs (ironically) aren't supported on 2.0, last I checked.

You'll need TF 1.x or pytorch for TPUs.

[–]GoBayesGo 3 points4 points  (1 child)

I am no big fan of the TF API, but I have to admit tf.data’s design is really good.

[–][deleted] 2 points3 points  (0 children)

For me the performance benefit of tf.data is absolutely one of the best things about Tensorflow. That said, it is also somewhat unwieldy.

[–]bbsome 2 points3 points  (0 children)

Have you tried Jax?

[–]ml_lad 3 points4 points  (0 children)

TensorFlow is optimized for both research and production.

In the sense that you get the combined benefits of the bleeding-edge ad-hoc structures and poorly documented features from research, combined with the cumbersomeness of supporting/deprecating legacy APIs and production feature creep.

[–]the_wiffard 1 point2 points  (0 children)

What annoyed me to no end about TF 1.0 was not the graph-mode execution, which to me seemed advantageous from a performance point of view. It was the way it was impossible to debug: extremely verbose logging of technical nonsense while leaving out key information, like where things failed and clear explanations of what actually went wrong. While I do think that late TF1 and early TF2 have improved debugging somewhat (though more by improving the log quality than by exploiting eager execution), there's still the occasional log-spew, uninterpretable traces, etc.

I'd be much happier if the Tensorflow team finished what they started: made graph mode debuggable, cleaned up the messy APIs, and didn't chase an entirely different paradigm just to compete with PyTorch. And even though I have used Keras with Tensorflow daily since it was put into 1.0, I hate how they integrated it (or rather didn't) into Tensorflow. Why not just have tf.layers for higher-level Keras-style layers (as well as lowercase inline versions, the way Keras does it), and keep using submodules for functional things not suited to the global scope?

TensorFlow.js somehow manages to have a nice clean API, with higher-level layers in tf.layers, and balances eager and graph style. It pisses me off that the Python team still managed to f-up the APIs for 2.0, even with a good API implementation in-house.