I had this piece of code:
new_all_ids = set(x for x in all_ids if x not in to_process_ids)
The all_ids
is a list of 1.1 million IDs, some of them repeated. And to_process_ids
is a list of 1,000 IDs randomly sampled from that list, all unique. The objective of the code was to first extract 1,000 IDs, do something with them, then remove those 1,000 from the original list and, once that's done, update all_ids
so it no longer contains the IDs in to_process_ids.
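To make the setup concrete, here's a minimal sketch of what those two objects looked like. The numbers and the random sampling are just illustrative; the real IDs came from elsewhere:

import random

# Roughly 1.1 million IDs, with repeats (illustrative data only).
all_ids = [random.randint(1, 700000) for _ in range(1100000)]

# 1,000 unique IDs randomly picked from that list.
to_process_ids = random.sample(list(set(all_ids)), 1000)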
The only problem was that I noticed that this operation took 30+ seconds! How can it take 30 seconds to do a little bit of list comprehension? The explanation is that checking x not in to_process_ids against a list is not a fast lookup; it has to scan through the list one element at a time.
What the code actually does is this:
new_all_ids = set()
for x in all_ids:
    found = False
    # Membership testing on a list means scanning it element by element.
    for y in to_process_ids:
        if y == x:
            found = True
            break
    if not found:
        new_all_ids.add(x)
Can you see it? It's doing a nested loop. Or, O(n * m)
in computer lingo. That's up to 1.1 million * 1,000 iterations of that comparison.
What a set does, on the other hand, is store its keys in a hash table. Kinda similar to how a dict
works except it doesn't point to a value. That means that asking a set
if it contains a certain key is an O(1)
operation.
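You can see the difference with a quick, illustrative micro-benchmark. The sizes here are made up and the exact figures will vary by machine:

import timeit

haystack_list = list(range(100000))
haystack_set = set(haystack_list)

# Worst case for the list: the needle isn't there, so every element gets checked.
print(timeit.timeit('-1 in haystack_list', globals=globals(), number=1000))
print(timeit.timeit('-1 in haystack_set', globals=globals(), number=1000))

The list version slows down in proportion to the length of the list; the set version stays flat.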
You might have spotted the solution already.
If you instead do:
to_process_ids_set = set(to_process_ids)
new_all_ids = set(x for x in all_ids if x not in to_process_ids_set)
Now, in my particular example, instead of taking 30+ seconds this implementation only takes 0.026 seconds. That's more than 1,000 times faster.
You might also have noticed that there's a much more convenient way to do this very same thing, namely set operations! The desired new_all_ids
is going to become a set anyway. And if we first convert all_ids to a set too, we can use a set difference and avoid checking the repeated IDs more than once.
Final solution:
new_all_ids = set(all_ids) - set(to_process_ids)
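To make the behaviour concrete, here's a tiny made-up example of that set difference:

all_ids = [1, 2, 2, 3, 4, 4, 5]
to_process_ids = [2, 5]

# Duplicates collapse and order is lost, which is fine since we wanted a set anyway.
print(set(all_ids) - set(to_process_ids))  # {1, 3, 4}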
You can try it yourself with this little benchmark:
"""Demonstrate three ways how to reduce a non-unique list of integersby 1,000 randomly selected unique ones.Demonstrates the huge difference between lists and set lookups."""importtimeimportrandom# Range numbers chosen quite arbitrarily to get a benchmark that lasts# a couple of seconds with a realistic proportion of unique and # repeated integers.items=200000all_ids=[random.randint(1,items*2)for_inrange(items)]print('About',100.*len(set(all_ids))/len(all_ids),'% unique')to_process_ids=random.sample(set(all_ids),1000)deff1(to_process_ids):returnset(xforxinall_idsifxnotinto_process_ids)deff2(to_process_ids):to_process_ids_set=set(to_process_ids)returnset(xforxinall_idsifxnotinto_process_ids_set)deff3(to_process_ids):returnset(all_ids)-set(to_process_ids)functions=[f1,f2,f3]results={f.__name__:[]forfinfunctions}foriinrange(10):random.shuffle(functions)forfinfunctions:t0=time.time()f(to_process_ids)t1=time.time()results[f.__name__].append(t1-t0)forfunctioninsorted(results):print(function,sum(results[function])/len(results[function]))
When I run that with my Python 3.5.1 I get:
About 78.616 % unique
f1 3.9494598865509034
f2 0.041156983375549315
f3 0.02245485782623291
Seems to match expectations.