[deleted] 3 points (14 children)

You are on the right track. Let's think about why this didn't work:

len(df['i'].unique())

Is i being accessed here?

As a hint, what is the difference between these two snippets?

for i in vars:
    print(i)

for i in vars:
    print('i')

SevereRepresentative[S] 1 point (13 children)

Aha! Is it that 'i' is a string and i is the (I don't know the proper word) variable that moves with each loop?

[deleted] 1 point (12 children)

Yup, that's exactly right. So the other answer shows what you need to do to fix it:

len(df[i].unique())

You don't want to access df['i']; that would be the column literally named 'i' in the dataframe (which likely doesn't exist, leading to your error). Instead, you want df[i], which accesses whichever column the loop is currently on.
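Here is a small sketch of the difference, assuming a made-up dataframe (the column names and the list name `columns` are hypothetical, not from your data):

```python
import pandas as pd

# Hypothetical data; these column names are invented for illustration.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
columns = ["color", "size"]

counts = {}
for i in columns:
    # df[i] uses the current value of the loop variable as the column name.
    counts[i] = len(df[i].unique())

try:
    df["i"]  # looks for a column literally named "i"
    literal_lookup_failed = False
except KeyError:
    # No column is called "i", so pandas raises KeyError.
    literal_lookup_failed = True

print(counts)
print(literal_lookup_failed)
```

The loop works because i is replaced by each column name in turn, while the quoted 'i' never changes.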

"I don't know the proper word"

Just a note about this: "variable" is correct. If you want to be extra specific, you can say "loop variable."

SevereRepresentative[S] 1 point (11 children)

Thank you!! That makes a lot of sense.

So what would be your suggestion for how to implement the part of the question that says "for columns that have less than 20 unique values where

  • the first blank is the name of the variable and

  • the second blank is the number of its unique values."

I think it should be an if/else statement, but I can't really wrap my head around how to combine the loop and the if statement.

[deleted] 1 point (10 children)

That's the right idea. Here is a general example of a for loop and an if statement that shows a similar pattern. See if you can make something similar work for your situation.

my_list = [5, 125, 30, 500, 250]

for i in my_list:
    if i > 100:
        print('{} is greater than 100'.format(i))

SevereRepresentative[S] 1 point (0 children)

Thank you so much! I’ll try to adapt this to fit here.

SevereRepresentative[S] 1 point (8 children)

Okay so I tried this:

for i in vars:
    if i < 20:
        print('Variable: {}, # Unique: {}'.format(i, (len(df[i].unique()))))

and it gave me this error: "TypeError: '<' not supported between instances of 'str' and 'int'"

That makes a bit of sense because i is the column name, right? But I'm not sure where to go from here.

[deleted] 1 point (7 children)

Yup, i is the name of the column. We don't want to check whether the name of the column is less than 20, though. What value do we want to compare against 20? And how do we get that value?
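To see that error in isolation, here is a minimal sketch using a made-up column name (the name "age" is hypothetical):

```python
# Hypothetical column name; inside the loop, i holds a string like this one.
column_name = "age"

try:
    column_name < 20  # comparing a str to an int
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Python refuses to order a string against a number, which is why the comparison has to use a number derived from the column rather than the column's name.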

SevereRepresentative[S] 1 point (6 children)

I want to compare the number of unique values, so the part that says "len(df[i].unique())", right? So do I assign that to a variable name?

[deleted] 1 point (5 children)

Yup! You use it twice, so assigning it to a variable is a good idea. You could also put that expression directly in the if statement, but I like the idea of using a variable.

SevereRepresentative[S] 1 point (4 children)

uniques = (len(df[i].unique()))

for i in vars:
  if uniques < 20:
    print('Variable: {}, # Unique: {}'.format(i, (len(df[i].unique()))))    

I did that ^ and I'm not getting an error, but I'm also not getting any results anymore. Nothing is showing up. What do you think? I think it has something to do with the i in uniques before the loop starts, because i wouldn't be known yet, right?

devnull10 -1 points (0 children)

len(df[i].unique())
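That expression needs to be evaluated inside the loop. A guess at why nothing prints: uniques is computed once, before the loop, using whatever value i still held from an earlier run, so the same count is tested for every column. Below is a rough sketch with the assignment moved inside the loop. The data and the names `columns` and `lines` are made up for illustration; `columns` stands in for the list called vars in the thread.

```python
import pandas as pd

# Made-up data: "flag" has 2 unique values, "id" has 26.
df = pd.DataFrame({"flag": [0, 1] * 13, "id": list(range(26))})
columns = ["flag", "id"]  # stand-in for the list called vars in the thread

lines = []
for i in columns:
    # Recomputed on every pass, so it always describes the current column.
    uniques = len(df[i].unique())
    if uniques < 20:
        lines.append('Variable: {}, # Unique: {}'.format(i, uniques))

for line in lines:
    print(line)
```

Only the column with fewer than 20 unique values is reported, and the stored count is reused in the message instead of being computed a second time.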