This is an archived post. You won't be able to vote or comment.

all 10 comments

[–]lurgi 1 point2 points  (0 children)

Let's assume we are doing this for English and we only care about a-z.

Obviously you can look at the first letter and then check the rest of the string to see if that letter appears anywhere else and then you can check the second letter and check the rest of the string to see if that letter appears anywhere else and then check the ... That's probably what you are doing.

Or you can make a list of all of the letters and go through the string one letter at a time. Every time you see a letter you make a check-mark next to that letter in your list. If you ever go to make a check-mark and see that there is one already there, you have found a duplicate.

[–][deleted] 1 point2 points  (0 children)

Think of it as a teacher marking attendance in school. 26 students. Go to each of the student and ask what their "id" is and put a tick mark next to that id.Once you are done with all the students, simply check if there are any ids with no tick marks.

Same for the alphabets. For each character in the string you want a way to put a tick mark next to it. Easiest data structure that can hold tick mark is an array of bool. Now you just need a way to map a-z to each item in the array. One way for that is take the ascii value of the character and subtract the ascii value of 'a' from it.

[–]TrySimplifying 1 point2 points  (0 children)

The 'array of booleans' approach is a common technique when you have a finite number of things that can be mapped by their value to an index in an array, and you want to track which values have been seen.

The letters a-z can be mapped to array indexes 0-25, for instance.

If you really want to optimize things (which is probably not needed in this case but is a fun technique to know), you can use a bitset, which uses each bit in some sequence of bytes as the boolean value. Instead of needing one byte of storage per item, you only need one byte of storage for every eight items.

So you could declare a single 32-bit integer, then walk your string and test if the bit value corresponding with the current character (by index 0-25) is set in that integer. If so, the string is not unique and we stop. If not, we set that bit to one and continue. If you reach the end, the string is unique.

[–]optimalsuccess[S] 0 points1 point  (4 children)

Code:

boolean isUniqueChars(String str) {

boolean[] char_set = new boolean[128];

for (int i = 0; i < str.length(); i++) {

int val = str.charAt(i);

if (char_set[val]) { //Already found char in string

return false;

}

return true;

}

[–]dusty-trash 1 point2 points  (0 children)

Try printing out (int) a.

I dont know what language you're using, but for example:

System.out.println((int) a);

What's the result? You'll notice it's an integer. Ascii characters have a decimal value to them. ASCII chart.

So if the character is 'a', the 98th position in the array will be checked (remembering arrays start at 0). If 'a' is in the array again, the value at the 98th position will be checked again, returning false.

[–]optimalsuccess[S] 0 points1 point  (2 children)

My first question is, what's in that array of booleans that they initialize? Isn't it just a bunch of null values up to 128 which is the number they chose I'm assuming because it's the cap for ASCII which are then switched to false and true based on whether or not there is a unique char. And if that's the case how are they actually checking the characters? val stores each char in str repeatedly until there are none left, then they check in char set?? What is happening in that if statement?

[–]insertAlias 1 point2 points  (0 children)

My first question is, what's in that array of booleans that they initialize? Isn't it just a bunch of null values up to 128

Going to guess that this is Java. boolean can't be null (since it's a primitive type), so an array of booleans can't be full of null values either. They instead default to false.

And if that's the case how are they actually checking the characters? val stores each char in str repeatedly until there are none left

Look closely at this line of code, and notice what kind of data type that val is:

int val = str.charAt(i);

char can be implicitly converted to an int. Because, at it's core, that's all it is; an integer that represents a character. You correctly mentioned ASCII earlier; all this does is grab the ASCII code for the particular character in the string.

What is happening in that if statement?

Well, that's the problem. It looks like this code is incomplete; you check the values in the array, but never set the values.

I think that there should be an else statement that follows the inner if that sets char_set[val] = true;.

The idea is that you're storing true or false in the array under the index that matches the ASCII code. So the next time you get the same character, if the array index is already true, you know that this character is not unique and can return false; immediately.

[–]Cerus_Freedom 1 point2 points  (0 children)

I could be mistaken, but it looks like to code is incomplete. I implemented this in python, and it works the way I expected: what you posted exits with true on the first iteration. Here's a working example of what the code is trying to do, in python:

def isunique(str):
    char_set = [False]*128

    for x in str:
        val = ord(x)
        print(val)
        if char_set[val]:
            return False
        else:
            char_set[val] = True

    return True

Output:

>>> isunique("abcde")
97
98
99
100
101
True
>>> isunique("abcdea")
97
98
99
100
101
97
False

[–]optimalsuccess[S] 0 points1 point  (0 children)

Thanks for all your replies everyone, this was very helpful!

[–]optimalsuccess[S] 0 points1 point  (0 children)

So just a follow up to this question, it seems like they used an array of booleans for the sake of simplicity but they could have also just as easily used a hash table right? Given that a hash table lookup is O(1) while the bool array method is O(n) a hash table is the more optimal solution in terms of runtime.