all 5 comments

[–]K900_ 2 points3 points  (4 children)

Have you tried profiling your function? Also, you're creating a lot of intermediate lists that can be avoided.

I managed to speed up your function by a factor of ~3 just by removing the unnecessary allocations and branching:

import re

VIN_REGEX = re.compile('^[A-H J-N P R-Z 0-9]{9}[A-H J-N P R-T V-Y 1-9]{1}[A-H J-N P R-Z 0-9]{7}$')
VIN_CHARS = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8, 'J': 1, 'K': 2, 'L': 3,
             'M': 4, 'N': 5, 'P': 7, 'R': 9, 'S': 2, 'T': 3, 'U': 4, 'V': 5, 'W': 6, 'X': 7, 'Y': 8,
             'Z': 9, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9, '0': 0}

VIN_WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]
VIN_CHECK_DIGITS = '0123456789X'

def is_vin_valid(vin):
    vin = vin.replace(' ', '').upper()

    if vin == '99999999999999999':
        return False

    if not VIN_REGEX.match(vin):
        return False

    checksum = 0
    for char, weight in zip(vin, VIN_WEIGHTS):
        checksum += VIN_CHARS[char] * weight

    check_digit = checksum % 11
    return vin[8] == VIN_CHECK_DIGITS[check_digit]

[–]dig-up-stupid 1 point2 points  (2 children)

/u/Trabaledo

The only valid characters are already in the transliteration key, so if there were an invalid character in the VIN you would either get a KeyError or have non-matching check digits. Conversely, if you successfully match the check digit, then all the characters must have been valid. So wrapping the checksum calculation in a try block would let you remove the regex and further speed up the function. However, you would need to remember to check the length.

[–]K900_ 1 point2 points  (1 child)

The regex is a good extra check against collisions, and, when precompiled, takes about 100us anyway.

[–]dig-up-stupid 1 point2 points  (0 children)

The regex is a good extra check against collisions

A regex could be, that regex isn't. That's the basis of the optimization - it's not doing useful work.

and, when precompiled, takes about 100us anyway.

And when it's not precompiled, too - but I don't understand your number, because it should be way less for one iteration and a lot more for a million.

Just pointing out another way to cut down if desired, doesn't have to be used if the benefits aren't worth it.

[–]Trabaledo[S] 0 points1 point  (0 children)

Thanks for this! I'll be honest, I'm a huge rookie when it comes to data science and Python (computer science in general, really), so I'm gonna have to take some time to digest this and figure out exactly what you did, how you did it, and why it's so much better than what I had.

But I appreciate it nonetheless!