I was solving an arrays problem from GFG, where we need to find the index of the first repeating element. The question is as follows:
Given an array arr[] of size n, find the first repeating element. The element should occur more than once and the index of its first occurrence should be the smallest.
I tried to use the hashmaps in python to go ahead with the problem. The following are my two codes:
Code Snippet 1
class Solution:#Function to return the position of the first repeating element.def firstRepeated(self,arr, n):m = {}for i in arr:if i not in m:m[i] = 1else:m[i] += 1for i in arr:if m[i] > 1:return arr.index(i)+1return -1
Code Snippet 2
class Solution:#Function to return the position of the first repeating element.def firstRepeated(self,arr, n):m = {}for i in arr:if i not in arr:m[i] = 1else:m[i] += 1for i in arr:if m[i] > 1:return arr.index(i)+1return -1
Code snippet 2 runs just fine but not code snippet 1. In code snippet 2, I have traversed through the array while in the former one, I have traversed the hashmap. When I tried in colab, both the codes are giving the correct answers. Not able to understand why they took the arr and not the hashmap m.
Updated answer
Your code snippets only differ in one word i. e. using arr
(the array) in your second snippet instead or m
(the dictionary) in your first.
By reason of that your first snippet will run and return a correct result (assuming you want a 1-based index; for 0-based index remove the +1
) while your second snippet will fail with a KeyError
.
Why will snippet #2 fail with KeyError
?
Well, you're looping over the array and then check for every element in the array whether it's in the array or not. This check doesn't make any sense, since the element is guaranteed to be in the array, otherwise you wouldn't receive the value within the for
loop.
Consequently, the condition if i not in arr:
will always be false
which results in the the value at m[i]
never being initialized and the else
branch always being executed. Due to the missing initialization there is no key i
in the dictionary m
which you could increase by 1 (see line m[i] += 1
) and therefore a KeyError
is thrown.
Additonal remarks
You do have one unused parameter n
which you can remove, since you don't need the array size as a parameter in Python.
The rest of the post can be ignored and is only left there to have some kind of version history and explanation for the discussion in the comment section.
Edit #1
As some comments by Kelly Bundy indicate I need to explain my original answer in more detail:
Comment 1: Fails for example input [10, 5, 3, 4, 3, 5, 6]
, reporting the 3
instead of the 5
Good point. I did not think of that. However, I think this is a specification problem. [...] find the first repeating element [...] (positionFirst, positionRepeated)
could be interpreted in two ways:
Consider [10, 5, 3, 4, 3, 5, 6]
:
(1, 5)
with positionFirst
being minimal (the first value of the pair is encountered first)
(2, 4)
with positionRepeated
being minimal (this is the first pair to be encountered)
I've used the latter interpretation, in which case, the output is not wrong.
Comment #2: It's O(n)
, since index()
is called only once
For better understanding: this was proposed by Kelly Bundy and was found to be much faster (approx. 1.5-2 times faster).
from time import time
# my proposed solution
def with_enumerate(arr):encountered_elements = dict()for i, value in enumerate(arr):if value in encountered_elements:return encountered_elements[value]encountered_elements[value] = ireturn -1# solution proposed by Kelly Bundy
def with_index(arr):encountered_elements = set()for value in arr:if value in encountered_elements:return arr.index(value)encountered_elements.add(value)return -1
# The max_value variable was introduced by me to allow for better comparison
max_value = 10000
arr = [*range(max_value), 0]funcs = with_enumerate, with_indexfor f in funcs * 3:t = time()for _ in range(100):f(arr)print(round((time()-t)*10, 3), f.__name__)
Sample output:
1.76 with_enumerate
0.747 with_index
1.011 with_enumerate
0.619 with_index
0.922 with_enumerate
0.594 with_index
Yes, I should have been more precise here. I assumed someone was using just index()
without any set or dictionary which would result in a runtime of O(n²)
. If you use index()
together with a set or dictionary you would of course also get a runtime of O(n)
. But that implementation does not make much sense (from an algorithmic point of view), because you are already storing the encountered values, why not also store their position then? Instead you use index()
to compute that position again. That would be a second loop over the array which is not necessary.
Here also comes my second observation into play in how you measured the performance. By setting up the array like this
max_value = 10
arr = [*range(max_value), 0]
# results in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
you made sure that the linear search by index()
will always be successful on the first element which is the best case of O(1)
.
If you instead setup the array that the duplicate would occur late in the array which is the worst case (and I was talking about the worst case):
max_value = 10
arr = [*range(max_value), max_value - 1]
# results in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
the measurements look a lot different since the linear search will now have the worst case runtime of O(n)
. I again used a max_value
of 10000
:
1.851 with_enumerate
1.035 with_index
1.246 with_enumerate
1.29 with_index
1.11 with_enumerate
0.984 with_index
One can already see, that the two versions are now much closer together.
Now, there is another thing we need to have a look at if we wanna properly compare both algorithms: You are using index()
which is implemented in C
and is therefore of course faster than the interpreted Python. If we were to replace the lst:index()
method with the following functionally equivalent Python function index()
def index(arr, value):for i, arrValue in enumerate(arr):if arrValue == value:return ireturn -1
so that the line return arr.index(value)
will end up being index(arr, value)
and then measure again with a max_value
of 10000
the results turn in favor of my implementation (especially for larger n
):
1.906 with_enumerate
1.189 with_index
0.915 with_enumerate
1.011 with_index
1.128 with_enumerate
1.692 with_index
So here two takeaways for everyone reading this:
- I don't wanna say my solution is faster, I just wanted to say that from an algorithmic point of view (on an abstract mathematical level) it is the fasted soltion. Execution times on a real-world machine do not always reflect that due to a number of reasons (C code vs. Python code, other stuff running etc.) so you always wanna do some measurements for your use cases
- Be careful with how you setup your performance tests as you may introduce some bias by (accidentally) setting up your test data to yield your (expected) result: I've shown two completely different results for the same algorithm, just by varying the input.
Original answer
Both implementations are odd if you ask me. Why not use enumerate()
and store the position in the dictionary? then return immediately once you have a match in the dictionary. There's no point in always iterating over the whole array except wasting compute resources. Also you do not need the parameter n
so remove it.
The worst case runtime (in case there is no repeating element or the repeating element is the last element) will be O(n)
. Depending on how long the array is and how the values are distributed the average runtime should be (considerably) faster.
FYI: solutions utilizing count()
will have a worst case runtime of O(n²)
since count()
itself takes O(n)
. The same holds true for solutions utilzing index()
.
class Solution:#Function to return the position of the first repeating element.def firstRepeated(self, arr):encountered_elements = {}for i, value in enumerate(arr):if value in encountered_elements:return (encountered_elements[value], i)encountered_elements[value] = ireturn -1solution = Solution()
print(solution.firstRepeated([1, 1, 2, 3, 4])) # (0, 1)
print(solution.firstRepeated([1, 2, 1, 3, 4])) # (0, 2)
print(solution.firstRepeated([1, 2, 3, 4, 1])) # (0, 4)
print(solution.firstRepeated([1, 2, 3, 4, 5])) # -1
This function returns the two positions of the first repeating element. You can easily adjust it to return only the first/ second position of the repeating element.