I'm trying to do a fairly simple task in Python that I have already done in Julia. It consists of taking an array of 3D elements and building a dictionary that maps each unique value to the indexes at which it occurs (note that the list is 6,000,000 elements long). I have done this in Julia and it is reasonably fast (6 seconds). Here is the code:
function unique_ids(itr)
    # create a dictionary whose keys have the element type of itr
    d = Dict{eltype(itr), Vector}()
    # iterate through the values in itr together with their indexes
    for (index, val) in enumerate(itr)
        # check whether the dictionary already has this value as a key
        if haskey(d, val)
            push!(d[val], index)
        else
            # add the value as a new key if it is not in d yet
            d[val] = [index]
        end
    end
    return collect(values(d))
end
So far so good. However, when I try doing this in Python, it seems to take forever, so long that I can't even tell you how long. So the question is, am I doing something dumb here, or is this just the reality of the differences between these two languages? Here is my Python code, a translation of the Julia code.
from tqdm import tqdm  # progress bar used in the loop below

def unique_ids(row_list):
    d = {}
    for (index, val) in tqdm(enumerate(row_list)):
        if str(val) in d:
            d[str(val)].extend([index])
        else:
            d[str(val)] = [index]
    return list(d.values())
Note that I use strings for the keys of the dict in Python as it is not possible to have an array as a key in Python.
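To illustrate the point about keys: a list (or any other mutable array) is rejected as a dict key because it is unhashable, whereas a tuple holding the same values is accepted. The sketch below is only a small demonstration with a made-up sample row; the real contents of row_list are not shown in the question.

```python
row = [1.0, 2.0, 3.0]  # made-up sample row, not the real data

d = {}
error_message = None
try:
    d[row] = [0]  # a list is mutable, so Python refuses it as a key
except TypeError as exc:
    error_message = str(exc)

d[tuple(row)] = [0]  # a tuple of the same values is hashable
d[str(row)] = [0]    # the string form used in the question also works

print(error_message)
print(d)
```

Whether a tuple key is cheaper than a string key for the real data is an assumption that would need to be measured.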
[...] Strings quite likely you can still speed up the Julia code by using Symbols or using ShortStrings.jl instead of just using Strings (depends on particular use case scenario but the speedup could be significant) – Sprain

[...] row_list is like? people will explain what you are doing wrong in python. – Quickfreeze

[...] tqdm call causes the whole list to be read into memory at once, where you'd like to keep a generator. Does it help if you take it out? It's definitely not necessary, and your Julia code does not seem to be doing anything like this. Perhaps see also #49320507 – Oppugnant

Vector by itself is not fully specified. You presumably mean Dict{eltype(itr), Vector{Int}}() – Uella

d[str(val)].extend([index]) is silly. Just use d[str(val)].append(index). Also, if val is already a str, stop calling str on it; if it's not a str, call it once upfront and don't convert it multiple times. And collections.defaultdict(list) exists for avoiding the needlessly complex code you've got checking for the existence of a key each time. – Heer

[...] pairs(itr) rather than enumerate(itr) in the loop header, to be properly generic. Using pairs is just as performant too (it actually improves the performance very slightly, by 1-2%), so it's a good practice to get into. – Tutor

julia --threads=auto starts with 4 threads available, and with that, this code runs about 2.5x faster than the serial version. – Tutor

[...] Dict{eltype(itr), Vector{Int}}(). When compared to the code in the question, it's 3.5x faster for me.) – Tutor

[...] Vector{Int} in the dictionary signature, that's a fundamental concern in Julia. – Uella
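The Python suggestions from the comments (use append instead of extend, convert the key only once per element, and let collections.defaultdict handle missing keys) could be combined roughly as follows. This is a sketch under assumptions, not a measured fix: the sample rows are made up, and how much it helps depends on what row_list actually contains.

```python
from collections import defaultdict

def unique_ids(row_list):
    """Group the indexes of row_list by the string form of each element."""
    d = defaultdict(list)  # missing keys start out as an empty list
    for index, val in enumerate(row_list):
        key = str(val)        # convert once per element, not up to three times
        d[key].append(index)  # append a single index instead of extend([index])
    return list(d.values())

rows = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]  # made-up sample data
groups = unique_ids(rows)
print(groups)
```

Note that Python indexes start at 0 here, whereas the Julia version produces 1-based indexes.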