Detecting if a string has unique characters: comparing my solution to "Cracking the Coding Interview?"
Asked Answered
J

8

55

I am working through the book "Cracking the Coding Interview" and I have come across questions here asking for answers, but I need help comparing my answer to the solution. My algorithm works, but I am having difficulty understand the solution in the book. Mainly because I don't understand what some of the operators are really doing.

The task is: "Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?"

This is my solution:

public static boolean checkForUnique(String str){
    boolean containsUnique = false;

    for(char c : str.toCharArray()){
        if(str.indexOf(c) == str.lastIndexOf(c)){
            containsUnique = true;
        } else {
            containsUnique = false;
        }
    }

    return containsUnique;
}

It works, but how efficient is this? I saw that the complexity of the index functions for String in Java are O(n*m)

Here is the solution from the book:

public static boolean isUniqueChars(String str) {
    if (str.length() > 256) {
        return false;
    }
    int checker = 0;
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i) - 'a';
        if ((checker & (1 << val)) > 0) return false;
        checker |= (1 << val);
    }
    return true;
}

A couple things I am not quite understanding with the solution. First, what does the "|=" operator do? Why is 'a' subtracted from the current character in the string for the value of "val"? I know "<<" is a bitwise left shift, but what does (checker & (1<<val)) do? I know it is bitwise and, but I am not understanding it since I am not understanding the line where checker gets a value.

I am just not familiar with these operations and unfortunately the book does not give an explanation of the solutions, probably because it assumes you already understand these operations.

Jayson answered 21/10, 2013 at 0:11 Comment(4)
I know Google filters out special characters, but you can use symbolhound.com to do the equivalent of a Google search for special characters like |= or &.Blatman
@Blatman wow than you so much for that. It was driving me crazy to not be able to google it. I wish I thought of googling "how to search with special characters" lolJayson
Moral: don't believe everything you read. The answer isn't necessarily right, or even good, just because it appears in a book. :-)Concernment
Please read the question carefully before posting an answer! The question is asking something very specific. This is not a place for you to dump your own attempt(s) at implementing this algorithm. Such answers will be deleted. If you are looking for suggestions on how to improve your own working code, please visit Code Review.Gene
S
111

There are two separate questions here: what's the efficiency of your solution, and what is the reference solution doing? Let's treat each independently.

First, your solution:

public static boolean checkForUnique(String str){
    boolean containsUnique = false;

    for(char c : str.toCharArray()){
        if(str.indexOf(c) == str.lastIndexOf(c)){
            containsUnique = true;
        } else {
            containsUnique = false;
        }
    }

    return containsUnique;
}

Your solution essentially consists of a loop over all characters in the string (let's say there are n of them), checking on each iteration whether the first and last index of the characters are the same. The indexOf and lastIndexOf methods each take time O(n), because they have to scan across all the characters of the string to determine if any of them match the one you're looking for. Therefore, since your loop runs O(n) times and does O(n) work per iteration, its runtime is O(n2).

However, there's something iffy about your code. Try running it on the string aab. Does it work correctly on this input? As a hint, as soon as you determine that there are two or more duplicated characters, you're guaranteed that there are duplicates and you can return that not all characters are unique.

Now, let's look at the reference:

public static boolean isUniqueChars(String str) {
    if (str.length() > 256) { // NOTE: Are you sure this isn't 26?
        return false;
    }
    int checker = 0;
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i) - 'a';
        if ((checker & (1 << val)) > 0) return false;
        checker |= (1 << val);
    }
    return true;
}

This solution is cute. The basic idea is the following: imagine that you have an array of 26 booleans, each one tracking whether a particular character has appeared in the string already. You start with all of them false. You then iterate across the characters of the string, and each time you see a character you look into the array slot for that character. If it's false, this is the first time you've seen the character and you can set the slot to true. If it's true, you've already seen this character and you can immediately report that there's a duplicate.

Notice that this method doesn't allocate an array of booleans. Instead, it opts for a clever trick. Since there are only 26 different characters possible and there are 32 bits in an int, the solution creates an int variable where each bit of the variable corresponds to one of the characters in the string. Instead of reading and writing an array, the solution reads and writes the bits of the number.

For example, look at this line:

if ((checker & (1 << val)) > 0) return false;

What does checker & (1 << val) do? Well, 1 << val creates an int value that has all bits zero except for the valth bit. It then uses bitwise AND to AND this value with checker. If the bit at position val in checker is already set, then this evaluates to a nonzero value (meaning we've already seen the number) and we can return false. Otherwise, it evaluates to 0, and we haven't seen the number.

The next line is this:

checker |= (1 << val);

This uses the "bitwise OR with assignment" operator, which is equivalent to

checker = checker | (1 << val);

This ORs checker with a value that has a 1 bit set only at position val, which turns the bit on. It's equivalent to setting the valth bit of the number to 1.

This approach is much faster than yours. First, since the function starts off by checking if the string has length greater than 26 (I'm assuming the 256 is a typo), the function never has to test any string of length 27 or greater. Therefore, the inner loop runs at most 26 times. Each iteration does O(1) work in bitwise operations, so the overall work done is O(1) (O(1) iterations times O(1) work per iteration), which is significantly faster than your implementation.

If you haven't seen bitwise operations used this way, I'd recommend searching for "bitwise operators" on Google to learn more.

Seibold answered 21/10, 2013 at 0:23 Comment(14)
thank you for the comprehensive answer. unfortunately the book does not give any information on restrictions other than the sentence I provided. I don't think I would have ever thought of the book's solution, but I am not sure if I should be worried about that..Jayson
@Seephor- I wouldn't worry about it. That solution is a bit hacky and hard to read, and you wouldn't want to write it unless you were 100% sure that the input consisted purely of lower-case letters and that the function was so time-critical it warranted an aggressively optimized solution.Seibold
The 'int checker' is a bit hacky, but actually it can be replaced by a array of boolean or even a hash map with almost equivalent performance.Nonpros
@PhamTrung yes it is tricky but a good example to learn. The problem has a constraint, say, without using extra space, then you can't really use the hash map.Natie
It wasnt addressed here, but i assume the subtraction of 'a' is to create ints in the proper range [0,25]?Ketty
@Seibold Thank you for the detailed explanation :)Erin
Great answer. This video helps explain some of the core bit operator concepts: youtube.com/watch?v=d0AwjSpNXR0Blanket
Please edit this answer to reflect that this solution is O(n), not O(1). n iterations of O(1) computations results in O(n) time complexity.Grindle
@Grindle This solution actually isn't O(n) because the number of iterations of the loop is never greater than 26 (notice that we immediately return true if the string length is at most 26). Therefore, we could replace the upper bound of the loop with Math.min(str.length(), 26) and it would behave identically, hence the O(1) bound.Seibold
How do we test this code with all the ascii characters? I was able to create a string with only 96 characters. I am not able to create the first 32 characters.Age
@Seibold UNICODE characters are a total of 256, that is being considered in the given example. If ASCII characters were being considered, then 128 would be used. Normal human characters are just 26. For an interview question it's important to clarify the type of string you want to work with, according to the book.Galatea
@DerryckDX Correct me if I'm wrong, but aren't there significantly more than 256 Unicode characters? And I totally agree with you. The purpose of this answer was to explain how the provided code operates rather than to discuss the broader question.Seibold
What int val = str.charAt(i) - 'a'; does? Is there anybody what is the equivalence of this in javascript?Originate
This computes the numerical index of the letter, assuming we’re using the English alphabet. I’m not sure what the JS equivalent is, but it probably involves something about getting a character code?Seibold
A
14

The book solution is one I don't like and I believe is dysfunctional..... templatetypedef has posted a comprehensive answer which indicates that the solution is a good one. I disagree, since the book's answer assumes that the string only has lower-case characters, (ascii) and does no validation to ensure that.

public static boolean isUniqueChars(String str) {
    // short circuit - supposed to imply that
    // there are no more than 256 different characters.
    // this is broken, because in Java, char's are Unicode,
    // and 2-byte values so there are 32768 values
    // (or so - technically not all 32768 are valid chars)
    if (str.length() > 256) {
        return false;
    }
    // checker is used as a bitmap to indicate which characters
    // have been seen already
    int checker = 0;
    for (int i = 0; i < str.length(); i++) {
        // set val to be the difference between the char at i and 'a'
        // unicode 'a' is 97
        // if you have an upper-case letter e.g. 'A' you will get a
        // negative 'val' which is illegal
        int val = str.charAt(i) - 'a';
        // if this lowercase letter has been seen before, then
        // the corresponding bit in checker will have been set and
        // we can exit immediately.
        if ((checker & (1 << val)) > 0) return false;
        // set the bit to indicate we have now seen the letter.
        checker |= (1 << val);
    }
    // none of the characters has been seen more than once.
    return true;
}

The bottom line, given templatedef's answer too, is that there's not actually enough information to determine whether the book's answer is right.

I distrust it though.

templatedef's answer on the complexity is one I agree with though.... ;-)

EDIT: As an exercise, I converted the book's answer to one that will work (albeit slower than the book's answer - BigInteger is slow).... This version does the same logic as the book's, but does not have the same validation and assumption problems (but it is slower). It is useful to show the logic too.

public static boolean isUniqueChars(String str) {
    if (str.length() > 32768) {
        return false;
    }
    BigInteger checker = new BigInteger(0);
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (checker.testBit(val)) return false;
        checker = checker.setBit(val);
    }
    // none of the characters has been seen more than once.
    return true;
}
Acetaldehyde answered 21/10, 2013 at 0:32 Comment(6)
The question was stated without more context, and I'm assuming in the book that it made these restrictions more explicit.Seibold
I am guessing you are right (and I was part way though my answer whn your's came up). But, the book's solution is a really good example of what not to do.Acetaldehyde
@Seibold the book makes no other mention on restrictions. It is literally the one sentence I provided.Jayson
2^16 = 65536, not 32768Perverted
@Jayson You should check the book again. Took book says: "We will asume, in the below code, that the string is only lower case 'a' through 'z'. This will allow us to use just a single int."Untidy
@AdrianLiu there have been two newer versions of this book released since I made this question. Old version doesn't have that statement :)Jayson
C
4

Since a char value can hold one of only 256 different values, any string that's longer than 256 characters must contain at least one duplicate.

The remainder of the code uses checker as a sequence of bits, with each bit representing one character. It seems to convert each character to an integer, starting with a = 1. It then checks the corresponding bit in checker. If it's set, it means that character has already been seen, and we therefore know that the string contains at least one duplicate character. If the character hasn't yet been seen, the code sets the corresponding bit in checker and continues.

Specifically, (1<<val) generates an integer with a single 1 bit in position val. For example, (1<<3) would be binary 1000, or 8. The expression checker & (1<<val) will return zero if the bit in position val is not set (that is, has value 0) in checker, and (1<<val), which is always non-zero, if that bit is set. The expression checker |= (1<<val) will set that bit in checker.

However, the algorithm seems to be flawed: it doesn't seem to account for the uppercase characters and punctuation (which generally come before the lowercase ones lexicographically). It would also seem to require a 256-bit integer, which is not standard.

As rolfl mentions in the comment below, I prefer your solution because it works. You can optimize it by returning false as soon as you identify a non-unique character.

Concernment answered 21/10, 2013 at 0:17 Comment(6)
I would add to that: the checker is a 32-bit variable, and the high bit is the sign bit, so the value could also be negative (> 0 test is broken). As far as I am concerned, the OP has the better solution simply because it works, even though it is O(n^2)Acetaldehyde
@rolfl- I think that the 256 is an error. There are only 26 possible characters that can occur, and I don't think the solution is designed to work with characters outside that range.Seibold
As the algo substract letter a, I think it checks only lower case letters, not numbers, uppercases, special characters...Alic
Agreed, as templatetypedef mentioned. But if that's the case, it should verify its input.Concernment
As a second comment, a char in Java is 2 bytes, and has 32768 possible values, not 256Acetaldehyde
Yes, right. So doubly broken. Until Java offers 32768-bit integers. :-\Concernment
L
1

6th edition update

    public static void main(String[] args) {
        System.out.println(isUniqueChars("abcdmc")); // false
        System.out.println(isUniqueChars("abcdm")); // true
        System.out.println(isUniqueChars("abcdm\u0061")); // false because \u0061 is unicode a
    }


    public static boolean isUniqueChars(String str) {
        /*
         You should first ask your interviewer if the string is an ASCII string or a Unicode string.
         Asking this question will show an eye for detail and a solid foundation in computer science.
         We'll assume for simplicity the character set is ASCII.
         If this assumption is not valid, we would need to increase the storage size.
         */
        // at 6th edition of the book, there is no pre condition on string's length
        /*
         We can reduce our space usage by a factor of eight by using a bit vector.
         We will assume, in the below code, that the string only uses the lowercase letters a through z.
         This will allow us to use just a single int.
          */
        // printing header to provide nice csv format log, you may uncomment
//        System.out.println("char,val,valBinaryString,leftShift,leftShiftBinaryString,checker");
        int checker = 0;
        for (int i = 0; i < str.length(); i++) {
            /*
                Dec Binary Character
                97  01100001    a
                98  01100010    b
                99  01100011    c
                100 01100100    d
                101 01100101    e
                102 01100110    f
                103 01100111    g
                104 01101000    h
                105 01101001    i
                106 01101010    j
                107 01101011    k
                108 01101100    l
                109 01101101    m
                110 01101110    n
                111 01101111    o
                112 01110000    p
                113 01110001    q
                114 01110010    r
                115 01110011    s
                116 01110100    t
                117 01110101    u
                118 01110110    v
                119 01110111    w
                120 01111000    x
                121 01111001    y
                122 01111010    z
             */
            // a = 97 as you can see in ASCII table above
            // set val to be the difference between the char at i and 'a'
            // b = 1, d = 3.. z = 25
            char c = str.charAt(i);
            int val = c - 'a';
            // means "shift 1 val numbers places to the left"
            // for example; if str.charAt(i) is "m", which is the 13th letter, 109 (g in ASCII) minus 97 equals 12
            // it returns 1 and 12 zeros = 1000000000000 (which is also the number 4096)
            int leftShift = 1 << val;
            /*
                An integer is represented as a sequence of bits in memory.
                For interaction with humans, the computer has to display it as decimal digits, but all the calculations
                are carried out as binary.
                123 in decimal is stored as 1111011 in memory.

                The & operator is a bitwise "And".
                The result is the bits that are turned on in both numbers.

                1001 & 1100 = 1000, since only the first bit is turned on in both.

                It will be nicer to look like this

                1001 &
                1100
                =
                1000

                Note that ones only appear in a place when both arguments have a one in that place.

             */
            int bitWiseAND = checker & leftShift;
            String leftShiftBinaryString = Integer.toBinaryString(leftShift);
            String checkerBinaryString = leftPad(Integer.toBinaryString(checker), leftShiftBinaryString.length());
            String leftShiftBinaryStringWithPad = leftPad(leftShiftBinaryString, checkerBinaryString.length());
//            System.out.printf("%s &\n%s\n=\n%s\n\n", checkerBinaryString, leftShiftBinaryStringWithPad, Integer.toBinaryString(bitWiseAND));
            /*
            in our example with string "abcdmc"
            0 &
            1
            =
            0

            01 &
            10
            =
            0

            011 &
            100
            =
            0

            0111 &
            1000
            =
            0

            0000000001111 &
            1000000000000
            =
            0

            1000000001111 &
            0000000000100
            =
            100
             */
//            System.out.println(c + "," + val + "," + Integer.toBinaryString(val) + "," + leftShift + "," + Integer.toBinaryString(leftShift) + "," + checker);
            /*
            char val valBinaryString leftShift leftShiftBinaryString checker
            a   0       0               1       1                       0
            b   1       1               2       10                      1
            c   2       10              4       100                     3
            d   3       11              8       1000                    7
            m   12      1100            4096    1000000000000           15
            c   2       10              4       100                     4111
             */
            if (bitWiseAND > 0) {
                return false;
            }
            // setting 1 on on the left shift
            /*
            0000000001111 |
            1000000000000
            =
            1000000001111
             */
            checker = checker | leftShift;
        }
        return true;
        /*
        If we can't use additional data structures, we can do the following:
        1. Compare every character of the string to every other character of the string.
            This will take 0( n 2 ) time and 0(1) space
        2. If we are allowed to modify the input string, we could sort the string in O(n log(n)) time and then linearly
            check the string for neighboring characters that are identical.
            Careful, though: many sorting algorithms take up extra space.

        These solutions are not as optimal in some respects, but might be better depending on the constraints of the problem.
         */
    }

    private static String leftPad(String s, int i) {
        StringBuilder sb = new StringBuilder(s);
        int charsToGo = i - sb.length();
        while (charsToGo > 0) {
            sb.insert(0, '0');
            charsToGo--;
        }
        return sb.toString();
    }
Levon answered 7/9, 2017 at 21:58 Comment(1)
Great explanation I was almost skipping this answer because I tend to skip answers their text is inside the code but for this, it is better for understanding.Originate
P
0

The solution from the book is case insensitive. 'A' and 'a' is considered duplicate as per the implementation.

Explanation: for input string with char 'A', 'A' - 'a' is -32 so '1 << val' will be evaluated as 1 << -32. shift on any negative number will shift the bits in opposite direction. thus 1 << -32 will be 1 >> 32. Which will set the first bit to 1. This is also the case with char 'a'. Thus 'A' and 'a' are considered duplicate characters. Similarly for 'B' and 'b' second bit is set to 1 and so on.

Parabola answered 16/3, 2017 at 18:48 Comment(0)
S
0

As referenced from 'Cracking the Coding Interview', an alternative solution exists:

boolean isUniqueChars(String str) {
  if(str.length() > 128) return false;

  boolean[] char_set = new boolean[128];
  for(int i = 0; i < str.length(); i++) {
    int val = str.charAt(i);

    if(char_set[val]) {
      return false;
    }
    char_set[val] = true;
  }
  return true;
}

Of course, to achieve better space complexity, please refer to the above example by @templatetypedef

Shopworn answered 25/7, 2018 at 16:35 Comment(0)
C
0

In case somebody needs this as Javascript implementation. The explanation is already clear out in the @whoopdedoo and others. https://mcmap.net/q/333102/-detecting-if-a-string-has-unique-characters-comparing-my-solution-to-quot-cracking-the-coding-interview-quot

function isUniqueChars(str) {
    let ASCIICodeOfa = 'a'.charCodeAt(0);
    let checker = 0;
    for (let i = 0; i < str.length; i++) {
        let currentChar = str.charCodeAt(i);
        let val = currentChar - ASCIICodeOfa;
        if ((checker & (1 << val)) > 0) {
            return false;
        }
        checker |= (1 << val);
    }
    return true;
}

console.log(isUniqueChars('abcdef'))
console.log(isUniqueChars('abcdefb'))
Christopher answered 7/6, 2022 at 21:20 Comment(0)
A
-1

This is the necessary correction to the book's code:

public static boolean checkForUnique(String str){
    boolean containsUnique = false;

    for(char c : str.toCharArray()){
        if(str.indexOf(c) == str.lastIndexOf(c)){
            containsUnique = true;
        } else {
            return false;
        }
    }

    return containsUnique;
}
Athalia answered 21/11, 2016 at 21:7 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.