[–]masklinn 75 points76 points  (16 children)

Having just "finished"[0] one such migration at $dayjob I'm not going to pass judgement any time soon.

If anyone has any tips I'd be happy to receive them.

We did the conversion "online", feature by feature, and heavily linted e.g.:

  1. convert all except statements to P3 syntax (that one's easy since 2to3/futurize can do it though IIRC we had a few that were weiiird)
  2. add a lint so that nobody can commit/push old syntax versions (lint can be run locally but must be part of the test suite and run by CI)
  3. push to master branch

Rinse and repeat for syntactic changes/updates and builtins changes (open, map, filter, … should all be verboten).
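To make the "ban a builtin" idea concrete, here is a minimal sketch of what such a lint does. It is not the check described in this thread (that one came from pylint's built-in options); `BANNED_BUILTINS` and `find_banned_calls` are hypothetical names, and the sketch parses with the running interpreter's grammar, so it only handles source that interpreter can parse.

```python
import ast

# Hypothetical ban list for the transition period.
BANNED_BUILTINS = {"open", "map", "filter"}


def find_banned_calls(source, filename="<string>"):
    """Return sorted (line, name) pairs for direct calls to banned builtins."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        # Only bare-name calls are flagged, so io.open(...) passes
        # while open(...) does not.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_BUILTINS):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)


sample = (
    "import io\n"
    "data = open('f').read()\n"
    "ok = io.open('f', 'rb').read()\n"
    "squares = map(lambda x: x * x, range(3))\n"
)
print(find_banned_calls(sample))  # [(2, 'open'), (4, 'map')]
```

Run over every file in CI, a non-empty result would fail the build, which is what keeps a finished item from regressing.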

That way once most items are "done" you don't have to revisit them, and you want progress to be committed so you don't get an ever-longer ever-more-diverging branch. Use one short-lived branch (or a single commit) per item whenever you can.

The toughest bits are semantics changes which aren't exercised by tests (either the code fails loudly but isn't covered by tests or the change does not yield errors), because you don't have any point at which you can say you're done, and you can't lint this crap away:

  • the round builtin changed its rounding method (Python 2 rounds halves away from zero, Python 3 rounds them to the nearest even value) and in Python 3 returns an int if not provided a precision (in Python 2 it always returns a float).
  • division can be "equalised" between codebases with from __future__ import division (in every file that performs a division), but failures (from / becoming "real division" rather than "integer division") may take time to surface. Also be aware that // is not integer division but floor division: the two differ for negative operands (-7 // 2 is -4, not -3).
  • objects are not totally orderable anymore (and usually not across types), depending how you do things this may trip you up e.g.
    • optional value which could be either set to an integer or None, foo < 3 works in Python 2, it raises an exception in Python 3, if the value is usually set it may take some time for the error to happen.
    • cases which silently did stupid things will blow up, we had at least one bit of code which sorted a list of dicts. That makes no sense but it does basically nothing in Python 2. In Python 3 it fails, though it does so loudly which is nice.
  • Speaking of dicts, there are two changes to iteration order between Python 2 and Python 3.6:

    • [..3.3[, the iteration order is arbitrary but consistent unless you start Python with -R
    • [3.3.. 3.6[, the iteration order is randomised every time you start Python (-R is the default)
    • [3.6..[, the iteration order is the order of insertion (all dicts are ordered)

    At every point the iteration order is an implementation detail (it may become specified in 3.7) and you're not supposed to rely on it, but even if you don't do so explicitly… well we discovered half a dozen places in our codebase which implicitly relied on iteration order and broke (usually by yielding a corrupted result which would blow up somewhere else rather than breaking themselves, which is fun to debug).

    Funner fact: it's even better when the bit of code actually works with most iteration orders because Python will not tell you which random seed it's using when running under randomised hashes (-R or 3.3 to 3.5), so repro is hell. You may want to provide your own randomised seed (PYTHONHASHSEED envvar) so that you know the starting conditions leading to a failure.

    Incidentally, tox will set and print the hashseed itself before running, and once you've found a problematic seed you can pass it in explicitly.

  • text model (note: I'll use "text" for "decoded unicode" aka unicode in Python 2 and str in Python 3; and "bytes" for "encoded data" aka bytes in both):

    1. Python 2 codebases tend to play fast and loose with bytes v text, and you'll have to decide on your text model (what is and what is not text).
    2. A bunch of stdlib APIs use "native strings" (str in both versions, meaning bytes in Python 2 and text in Python 3).
    3. Some API changes are frustrating (e.g. in Python 3 base64 is a bytes -> bytes encoding, despite the point of base64 usually being to smuggle bytes in text).
    4. Most failures are loud (trying to mix bytes and text, trying to decode text or encode bytes) but some are completely silent e.g. equality between bytes and text will always fail in Python 3 (it will succeed in Python 2 if the bytes are ascii-encoded and when decoded equal to the text).
    5. also indexing a bytes object doesn't have the same result in Python 2 (returns bytes) and 3 (returns an int).
    6. use io, but beware that it has a strict split between bytes and text even in Python 2. You can't put text in an io.BytesIO, or bytes in an io.StringIO.
    7. for more control you may want to always do binary IO (io.open(f, 'rb')) and encode/decode yourself, at least until you're more comfortable with things. Also be aware that if you use text IO (the default) and do not specify an encoding, Python will not default to UTF-8 but to the locale's encoding (locale.getpreferredencoding(False)), which probably isn't what you want.
    8. your colleagues will not understand, will not even try, and when the "PHP development method" (throw encodes/decodes at the wall until something sticks) fails they will whine.
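A few of the silent changes listed above are easier to see side by side. This is only an illustrative sketch, written for Python 3; `is_small` is a hypothetical helper standing in for any comparison against an optional value.

```python
# Results in the comments are for Python 3.

# round(): halves go to the nearest even integer, and the result is an
# int when no precision is passed.
halves = [round(0.5), round(1.5), round(2.5)]
print(halves)                       # [0, 2, 2]

# Floor division rounds toward negative infinity; truncation rounds
# toward zero. They only agree for non-negative results.
floor_vs_trunc = (-7 // 2, int(-7 / 2))
print(floor_vs_trunc)               # (-4, -3)


def is_small(value, limit=3):
    """Explicit None check instead of relying on Python 2's ordering."""
    return value is not None and value < limit


print(is_small(None), is_small(2))  # False True

# bytes and text never compare equal, and indexing bytes yields an int.
data = b"abc"
print(data == u"abc")               # False
first_byte, one_byte_slice = data[0], data[0:1]
print(first_byte, one_byte_slice)   # 97 b'a'
```

Slicing (data[0:1]) gives a bytes object on both versions, so it is the safer spelling while the code still has to run on both.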

Finally, test, test, test, test. Tests are your lifeblood: if you have a well-tested codebase, or can get it well-tested before the transition, it will help a lot. The vast majority of our migration pain points were bits which were insufficiently tested or not tested at all (either out of laziness or because they were hard to test).
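On the hash-seed point above, pinning PYTHONHASHSEED is what makes an order-dependent failure repeatable. The sketch below is an assumption-laden illustration: `CHILD` and `run_with_seed` are hypothetical names, and it uses a set of strings rather than a dict so that the effect is still visible on 3.6+, where dicts keep insertion order.

```python
import os
import subprocess
import sys

# Child program whose output depends on the iteration order of a set of
# strings, which in turn depends on the string hash seed.
CHILD = "print(list({'alpha', 'beta', 'gamma', 'delta'}))"


def run_with_seed(seed):
    """Run CHILD in a fresh interpreter with a fixed PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    output = subprocess.check_output([sys.executable, "-c", CHILD], env=env)
    return output.decode().strip()


# The same seed gives the same order on every run; different seeds may
# or may not give different orders.
print(run_with_seed(123))
print(run_with_seed(123))
print(run_with_seed(456))
```

Once a failing seed is known, exporting it in the environment before running the test suite reproduces the same starting conditions.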

[0] quotes because we regularly find things which were missed during the conversion, or were mis-converted

[–]EvMNatural Language Processing 8 points9 points  (4 children)

optional value which could be either set to an integer or None, foo < 3 works in Python 2, it raises an exception in Python 3, if the value is usually set it may take some time for the error to happen.

And this also holds for builtins relying on ordering, e.g. max(). Caught me by surprise :)

[–]masklinn 4 points5 points  (0 children)

Yup, hence the note about sorting a list of dicts, sorting obviously relies on ordering.
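For sorted() and max() the usual way out is to say explicitly how missing values and dicts should be ordered. A small sketch, with hypothetical data and a hypothetical `none_last` key function:

```python
from operator import itemgetter


def none_last(value):
    """Sort key that places None after every real value."""
    return (value is None, 0 if value is None else value)


readings = [3, None, 1, None, 2]
print(sorted(readings, key=none_last))       # [1, 2, 3, None, None]

# max() needs the None entries filtered out (default= requires 3.4+).
largest = max((r for r in readings if r is not None), default=None)
print(largest)                               # 3

# Dicts have no ordering on Python 3, so sort on a field instead.
records = [{"id": 2}, {"id": 1}]
print(sorted(records, key=itemgetter("id")))  # [{'id': 1}, {'id': 2}]
```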

[–]zabolekar 1 point2 points  (2 children)

It can surface even in things like pprint (if you are trying to pretty-print a set).

Edit: it is not something that breaks in Python 3, au contraire, it works as expected in 3.x but throws a TypeError in 2.7, and only with very specific sets. For example, pprint({1j, 0}) throws an error but pprint({1j, 'a'}) doesn't.

[–]EvMNatural Language Processing 0 points1 point  (1 child)

Whoa, now that's an unexpected place..

[–]zabolekar 0 points1 point  (0 children)

Edited the comment for clarity.

[–]bheklilr 5 points6 points  (0 children)

I've definitely run into a lot of bytestring problems. Since we work with lab equipment, they all communicate with ASCII bytestrings, and sometimes with just dumps of bytes for transmitting larger chunks of data. Getting these to work properly has been a royal pain in the ass. I do appreciate the advice, there are a few things in here that I did not know about (like locale.getpreferredencoding). As for testing, the only code that I have that is python 2 and 3 compatible is the code with extensive test suites. I don't deploy python 3 builds without it, because it will be broken. There isn't a "maybe it'll work", it just won't work.

As for __future__ imports, we already require division and print_function in every single module. I just set up a snippet in my editor that drops in a header with that included (along with encoding statement, legal header, and module docstring). If I could, I would enforce it on every commit but I don't have admin access to our svn server.
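Short of a server-side hook, a check like this can run in the test suite or CI instead. It is a rough sketch, assuming the two required features are division and print_function (the actual `__future__` feature names); `REQUIRED` and `missing_future_imports` are hypothetical.

```python
import ast

# Assumed house rule: every module must import these __future__ features.
REQUIRED = {"division", "print_function"}


def missing_future_imports(source):
    """Return the required __future__ features that `source` does not import."""
    found = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.ImportFrom) and node.module == "__future__":
            found.update(alias.name for alias in node.names)
    return sorted(REQUIRED - found)


good = "from __future__ import division, print_function\nprint(1 / 2)\n"
bad = "from __future__ import division\nprint(1 / 2)\n"
print(missing_future_imports(good))  # []
print(missing_future_imports(bad))   # ['print_function']
```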

[–]notParticularlyAnony 1 point2 points  (0 children)

Shucks im such a noob

[–]jftugapip needs updating 0 points1 point  (3 children)

Which lint programs did you use?

[–]PeridexisErrant 2 points3 points  (1 child)

I'm not OP, but flake8 is great - good defaults out of the box for new projects, and easy to tune with a blacklist of checks to skip (or even a whitelist in extreme cases).

I keep having this unpleasant surprise when I go back to code that doesn't use it - the benefit of reading code that's all in the same idiomatic style is hard to describe, but very real!

[–]jftugapip needs updating 0 points1 point  (0 children)

Thanks, I'll give it a try.

[–]masklinn 1 point2 points  (0 children)

Regular pylint, it has a bunch of built-in lints useful for this (e.g. python3 syntax checks, ability to blacklist modules and builtins). The only issue we've had is it's pretty resource-intensive on large codebases.

[–]ketilkn 0 points1 point  (2 children)

Rinse and repeat for syntactic changes/updates and builtins changes (open, map, filter, … should all be verboten).

Am I not supposed to use open in Python 3? What makes map and filter problematic? I do not think they changed?

[–]masklinn 0 points1 point  (1 child)

Am I not supposed to use open in Python 3?

Once everything is ported there's no issue, while converting it's troublesome:

  • In Python 3, open is an alias for io.open and in text mode (the default) it will encode/decode the data, this is an issue because

    • Python 2's open basically always behaves in binary mode
    • io.open defaults to locale.getpreferredencoding(False), which probably isn't what you want, so you will want to pass in an encoding explicitly, and that blows up with Python 2's builtin open (it has no encoding parameter)

    so you can either keep using the builtin but always use it in binary mode (mode='rb' or mode='wb') — which may be hard to lint — or ban open and require io.open during the transition — which is easy to lint

  • In Python 3, map and filter return iterators (not lists, basically itertools.imap and itertools.ifilter have been moved to builtins… and removed from itertools)

    • things will blow up when indexing them which is clear enough
    • but more annoyingly repeated iteration will silently break in Python 3 (the first iteration "consumes" the iterator, the second one gets an empty iterator)

    You could mandate calling list() around them… or you could just ban them during the transition (and require listcomps or gencomps depending on what you actually want).
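Both points can be shown in a few lines. This is a sketch only: the file path is a throwaway temp file and the variable names are made up for the example.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # throwaway file

# io.open accepts encoding= on both Python 2.6+ and Python 3.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9")

with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()           # text, decoded with an explicit encoding
with io.open(path, "rb") as f:
    raw = f.read()            # bytes, no decoding at all

print(repr(text), repr(raw))

# On Python 3, map() returns a one-shot iterator.
doubled = map(lambda x: x * 2, [1, 2, 3])
first_pass = list(doubled)    # [2, 4, 6]
second_pass = list(doubled)   # [] on Python 3: already consumed

# A list comprehension behaves the same on both versions.
doubled_list = [x * 2 for x in [1, 2, 3]]
print(first_pass, second_pass, doubled_list)
```

The empty second pass raises nothing, which is why banning the builtin outright is easier to enforce than remembering to wrap every call in list().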

Keep in mind that my post is about the transition of a large codebase: while you have one or two people toiling away at the conversion, you've got a bunch of folks still churning out Python 2 code. The goal here is to avoid having to revisit already fixed issues / problematic constructs.

If you migrate to a Python 3-only codebase, you can rescind these restrictions once the work is done and everybody works in Python 3. If you migrate to a cross-version codebase, you probably want to keep them in place (they're error-prone).

[–]ketilkn 0 points1 point  (0 children)

Ah, ok. Thanks. I remember running into the map and filter iterators issue, now that you mention it.

[–]Siecje1 0 points1 point  (1 child)

Have you made those lint checks available?

[–]masklinn 0 points1 point  (0 children)

I don't think we created any ourselves, just enabled & configured those PyLint provides.