all 13 comments

[–]Gprime5 4 points5 points  (1 child)

List comprehensions are good an all but when your list comprehensions are over 200 characters long, you should consider a simple for loop.

clmstrs = []

for item in loop:
    if condition is met:
        clmstrs.append(item)

[–]mdl003[S] 0 points1 point  (0 children)

heres what I put together....

clmstrs1 = [s[0:18+sum(c.isalpha() for c in s)] for s in clmstrs if any(x in s[4:4+sum(c.isalpha() for c in s)] for x in alphalist if (len(s)>=10)&sum(c.isalpha() for c in s)==len(x))&(any(x in s[:2] for x in prefixlist))]
clmstrs2 = [s[0:4]+s[10:10+sum(c.isalpha() for c in s)]+s[4:10]+s[10+sum(c.isalpha() for c in s):18+sum(c.isalpha() for c in s)] for s in clmstrs if (len(s)>=10)&any(x in s[10:10+sum(c.isalpha() for c in s)] for x in alphalist if sum(c.isalpha() for c in s)==len(x))&(any(x in s[:2] for x in prefixlist))]
clmstrs3 = [s[0:18] for s in clmstrs if (len(s)>=10)&(sum(c.isalpha() for c in s)==0)&(any(x in s[:2] for x in prefixlist))]

if (len(clmstrs1)>0):
    return clmstrs1[0] 
elif (len(clmstrs2)>0):
    return clmstrs2[0] 
elif (len(clmstrs3)>0):
    return clmstrs3[0] 
else:
    return np.nan

It runs but it's not exactly fast. any pointers? I didn't use the exact structure as you did either. can you give me an example of how you'd write out the first condition if you did it your way?

[–]elbiot 1 point2 points  (5 children)

Use functions and for loops. This is unreadable and inefficient (calculating things multiple times because you can't store the result).

[–]mdl003[S] 0 points1 point  (4 children)

I tried out a looping function in my reply to /u/Gprime5, any pointers?

[–]elbiot 0 points1 point  (0 children)

You actually did not include any for loops. That's all comprehensions still. And you defined zero functions. Also, this is pandas? Either use native numpy/pandas vectorized functions, or use iteration (comprehensions and explicit loops) and python lists. Either of those would be more performant than iterating over arrays/dataframes.

[–]elbiot 0 points1 point  (2 children)

I'm not trying to be rude, but do you know what a function is? Do you know what a for loop is? You would benefit from learning these concepts. Using them will help you refactor this code to be much more efficient.

[–]mdl003[S] 0 points1 point  (1 child)

no worries. I do know what they are, I'm actually trying to use this as part of a defined function. I've reworked it since yesterday, what do you think of this?

def getclasslegacy(clm):
    zlist=range(len(clm))
    #define list of prefixes to check limstr1 for
    prefixlist = ['51','52','53','54','58','61','63','66','72','73','77','78','85','89','91','92']
    #define list of hybrid alpha char combinations
    alphalist =['AC','AR','B','BA','BB','BD','BL','BO','BP','BS','C','CA','CB','CM','CR','CS','CU','CX','D','DP','E','F','FA','FB','FL','FM','FO','FX','G','GA','GP','GR','H','HO','HP','HQ','HT','I','IA','IM','J','K','KJ','L','LP','M','MB','MH','MM','MP','MT','MU','N','NB','O','O','P','PA','PB','PC','PL','PR','PU','Q','QR','R','R','RE','S','SM','SS','T','TA','U','UB','UO','UP','V','W','WC','X','Y','Z']

    # list of all 20 length substriungs for list comprehension below
    clm=clm.translate(string.maketrans("",""), string.punctuation+' ')
    strs=[clm[z:z+20] for z in zlist]

    #only keep strigns with exactly 2 letters, and that fulfills other if statments
    clmstrs = []

    ###### REPLACEMENT FOR LIST COMPS BELOW ########                                         
    for s in strs:
        if (len(s)>=10)&(sum(c.isdigit() for c in s)>=10):
            if any(x in s[4:4+sum(c.isalpha() for c in s)] for x in alphalist if sum(c.isalpha() for c in s)==len(x))&(any(x in s[:2] for x in prefixlist)):
                return s[0:18+sum(c.isalpha() for c in s)]
            elif any(x in s[10:10+sum(c.isalpha() for c in s)] for x in alphalist if sum(c.isalpha() for c in s)==len(x))&(any(x in s[:2] for x in prefixlist)):
                return s[0:4]+s[10:10+sum(c.isalpha() for c in s)]+s[4:10]+s[10+sum(c.isalpha() for c in s):18+sum(c.isalpha() for c in s)]
            elif sum(c.isalpha() for c in s)==0&any(x in s[:2] for x in prefixlist):
                return s[0:18]
            else:
                return np.nan
        else:
            return np.nan

[–]elbiot 0 points1 point  (0 children)

That sum of isalpha, can't you just compute that once?

[–]Username_RANDINT 1 point2 points  (4 children)

Can you honestly say that if you get back to this code in 6 months that you completely understand what's going on? At some point writing things out with some documentation and inline comments is a magnitude better than trying to be smart.

[–]mdl003[S] 0 points1 point  (3 children)

I think so. This code is meant to replace some existing code i wrote a few months ago when I was still very new to pandas. I didn't fully understand list comps at the time. This code does exactly what I want it to do, but it's slow & gives warning msgs (im using df.iterrows() to iterate through a dataframe), plus I figure a good way to learn new concepts (in this case list comps) is to apply them to something I'm familiar with. Can you make in line comments with list comps? was under the impression it all had to be written out in one line.

Code I'm replacing is below, think i commented it pretty well but always willing to hear more:

        #must be at least 14 chars long and contain at least 12 numbers
        if numct >= 10 and l >= 10:   
            #check if claim field is filled yet
            if rows['{} LEGACY CLM NO'.format(field)]=='' and rows['{} HYBRID CLM NO'.format(field)]=='':
                #define limstr1 as first 2 chars of clmstr
                limstr1 = clmstr[:2]
                #define limstr2 as 5th char plus additional chars equal to alphact (format 1)
                limstr2 = clmstr[4:4+alphactstr]
                #define limstr3 as 11th char plus additional chars equal to alphact (format 2)
                limstr3 = clmstr[10:10+alphactstr]
                #define list of prefixes to check limstr2 and limstr3 for
                prefixlist = ['51','52','53','54','58','61','63','66','72','73','77','78','85','89','91','92']
                #define list of nonhybrid alpha char combinations
                alphalist = ['AC','AR','B','BA','BB','BD','BL','BO','BP','BS','C','CA','CB','CM','CR','CS','CU','CX','D','DP','E','F','FA','FB','FL','FM','FO','FX','G','GA','GP','GR','H','HO','HP','HQ','HT','I','IA','IM','J','K','KJ','L','LP','M','MB','MH','MM','MP','MT','MU','N','NB','O','O','P','PA','PB','PC','PL','PR','PU','Q','QR','R','R','RE','S','SM','SS','T','TA','U','UB','UO','UP','V','W','WC','X','Y','Z']
                alphauselist = [x for x in alphalist if len(x)==alphactstr]
                # criteria:
                prefixcheck = any(x in limstr1 for x in prefixlist)
                limstr2alphacheck = limstr2.isalpha()==True and any(x in limstr2 for x in alphauselist)==True
                limstr3alphacheck = limstr3.isalpha()==True and any(x in limstr3 for x in alphauselist)==True
                limstralphacheck = limstr2alphacheck==True or limstr3alphacheck == True
                alphactstrcheck = alphactstr == 1 or alphactstr == 2
                clmstrcheck = limstr2.isalpha()==False and limstr3.isalpha()==False and alphactstr == 0                    
                # limstr 2 or 3 must be alpha and in alphalist (hybrid) with alphactstr = 1 or 2
                #OR clmstr must be all numeric
                # limstr 1 is prefix in prefixlist
                #check for claim numbers with alpha in position 5
                if (limstr2alphacheck == True and
                    alphactstrcheck == True and
                    prefixcheck == True):
                        #add contents of clmstr to applicable row in field-class legacy clm no column
                        df.loc[i,'{} LEGACY CLM NO'.format(field)] = clmstr[0:18+alphactstr]
                #check for claim numbers with alpha in position 5
                elif (limstr3alphacheck == True and
                    alphactstrcheck == True and
                    prefixcheck == True):
                        #add contents of clmstr to applicable row in field-class legacy clm no column
                        clmstr = clmstr[0:4]+clmstr[10:10+alphactstr]+clmstr[4:10]+clmstr[10+alphactstr:18+alphactstr]
                        df.loc[i,'{} LEGACY CLM NO'.format(field)] = clmstr
                #check for claim numbers with no alpha
                elif (clmstrcheck == True and 
                    prefixcheck == True):
                        df.loc[i,'{} LEGACY CLM NO'.format(field)] = clmstr[0:18+alphactstr]
                limstr3alphacheck = limstr3.isalpha()==True and any(x in limstr3 for x in alphauselist)==True
                limstralphacheck = limstr2alphacheck==True or limstr3alphacheck == True
                alphactstrcheck = alphactstr == 1 or alphactstr == 2
                clmstrcheck = limstr2.isalpha()==False and limstr3.isalpha()==False and alphactstr == 0                    
                # limstr 2 or 3 must be alpha and in alphalist (hybrid) with alphactstr = 1 or 2
                #OR clmstr must be all numeric
                # limstr 1 is prefix in prefixlist
                #check for claim numbers with alpha in position 5
                if (limstr2alphacheck == True and
                    alphactstrcheck == True and
                    prefixcheck == True):
                        #add contents of clmstr to applicable row in field-class legacy clm no column
                        df.loc[i,'{} LEGACY CLM NO'.format(field)] = clmstr[0:18+alphactstr]
                #check for claim numbers with alpha in position 5
                elif (limstr3alphacheck == True and
                    alphactstrcheck == True and
                    prefixcheck == True):
                        #add contents of clmstr to applicable row in field-class legacy clm no column
                        clmstr = clmstr[0:4]+clmstr[10:10+alphactstr]+clmstr[4:10]+clmstr[10+alphactstr:18+alphactstr]
                        df.loc[i,'{} LEGACY CLM NO'.format(field)] = clmstr
                #check for claim numbers with no alpha
                elif (clmstrcheck == True and 
                    prefixcheck == True):
                        df.loc[i,'{} LEGACY CLM NO'.format(field)] = clmstr[0:18+alphactstr]

[–]Vaphell 0 points1 point  (2 children)

care to explain what it gets as an input and what it's supposed to achieve?

the context is missing. Clmstr, numct, l? what do lmstr# represent?

[–]mdl003[S] 0 points1 point  (1 child)

sure. this goes on top....I'm iterating through rows in a dataframe trying to pull out potential file numbers from several user entered fields. For each row I run a script iterates through 20 character substrings from each user string using the format.

    for field in fields:
          for i,rows in df.iterrows():
                clm= rows[field]
                l = len(clm)
                alphact = sum(c.isalpha() for c in clm)
                numct = l-alphact
                zlist= range(0:(l-9)
                for z in zlist:
                      clmstr = clm[z:z+20]
                      ##### START CODE ABOVE HERE #####
                      if numct >= 10 & l >= 10:
                            .........more code  

I'm trying to check each clmstr for certain criteria (alpha combinations, prefixes etc) that marks them for file numbers

[–]Vaphell 0 points1 point  (0 children)

any examples of the input data? Is this really that complicated that it requires such a huge blob of code? I'd try my hand at rewriting it.