death and gravity: Has your password been pwned? Or, how I almost failed to search a 37 GB text file in under 1 millisecond (in Python)


So, there's this website, Have I Been Pwned, where you can check if your email address has appeared in a data breach.

There's also a Pwned Passwords section for checking passwords ... but, typing your password on a random website probably isn't such a great idea, right?

Of course, you could read about how HIBP protects the privacy of searched passwords1, and understand how k-Anonymity works (it does), and check that the website uses the k-Anonymity API (it does), but that's waaay too much work.

Of course, we could use the API ourselves, but where's the fun in that?
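(For reference, a minimal sketch of what using the range API would look like – we only send the first five hex characters of the password's SHA-1, and get back matching hash suffixes with counts, so the full hash never leaves the machine. No error handling, and pwned_count() is just an illustrative name:)

import hashlib
import urllib.request

def pwned_count(password):
    # only the first 5 hex chars of the hash are sent; the rest stays local
    digest = hashlib.sha1(password.encode()).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]
    url = f"https://api.pwnedpasswords.com/range/{prefix}"
    with urllib.request.urlopen(url) as response:
        for line in response.read().decode().splitlines():
            candidate, _, count = line.partition(':')
            if candidate == suffix:
                return int(count)
    return 0

(pwned_count('password') returns a depressingly large number.)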

Instead, we'll do it the hard way – we'll check for the password offline.

And we're not stopping until it's fast. (This is a threat.)

The Pwned Passwords list#

OK, first we need to get the password list.

Go to Pwned Passwords, scroll to Downloading the Pwned Passwords list, and download the SHA-1 ordered by hash file – be nice and use the torrent, if you can.

Note

You can also use the downloader to get an up-to-date version of the file, but that wasn't available a year ago when I started writing this.

The 16 GB archive extracts to a 37 GB text file:

$ pushd ~/Downloads
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.7z
16257755606
$ 7zz e pwned-passwords-sha1-ordered-by-hash-v8.7z
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.txt
37342268646
$ popd

... that looks like this:

$ head -n5 ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt
000000005AD76BD555C1D6D771DE417A4B87E4B4:10
00000000A8DAE4228F821FB418F59826079BF368:4
00000000DD7F2A1C68A35673713783CA390C9E93:873
00000001E225B908BAC31C56DB04D892E47536E0:6
00000006BAB7FC3113AA73DE3589630FC08218E7:3

Each line has the format:

<SHA-1 of the password>:<times it appeared in various breaches>

Note

See the article mentioned in the introduction for why use a hash instead of the actual password, and what the counts are good for.

To make commands using it shorter, we link it in the current directory:

$ ln -s ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt pwned.txt

A minimal plausible solution#

We'll take an iterative, problem-solution approach to this. But since right now we don't have any solution, we start with the simplest thing that could possibly work.

First, we get the imports out of the way:2

import os
import sys
import time
import getpass
import hashlib

Then, we open the file:

path = sys.argv[1]
file = open(path, 'rb')

By opening the file early, we get free error checking – no point in reading the password if the file isn't there.

Opening it in binary mode makes searching a bit faster, as it skips needless decoding. It's a minimal plausible solution, but that doesn't mean we can't do some obvious things.

Next, the password:

try:
    password = sys.argv[2]
except IndexError:
    password = getpass.getpass()

hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password

print("looking for", hexdigest)

We either take it as the second argument to the script, or read it with getpass(), so your actual password doesn't remain in the shell history.

After computing the password's hash, we delete it. We won't go as far as to zero it,3 but at least we prevent accidental printing (e.g. in a traceback).

Let's see if it works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
$ python pwned.py pwned.txt
Password:
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8

The shell sha1sum seems to agree:

$ echo -n password | sha1sum
5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8  -

To find the hash, we just go through the file line by line (remember, simplest thing that could possibly work). We put this in a function; it might be useful later on.

def find_line(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None

If a line was found, we print the count:

line = find_line(file, hexdigest.upper().encode())

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")

Before giving it a whirl, let's add some timing code:

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode())
end = time.monotonic()

print(f"in {end-start:.6f} seconds")

And, it works (the code so far):

$ python pwned.py pwned.txt blocking
looking for 000085013a02852372159cb94101b99ccaec59e1
pwned! seen 587 times before
in 0.002070 seconds

Problem: it's slow#

...kinda.

You may have noticed I switched from password to blocking. That's because I was cheating – I deliberately chose a password whose hash is at the beginning of the file.

On my 2013 laptop, searching for password actually takes 86 seconds!

Let's put a lower bound on the time it takes to go through the file:

$ time wc -l pwned.txt
 847223402 pwned.txt

real	1m7.234s
user	0m31.133s
sys	0m16.379s

There are faster implementations of wc out there, but Python can't even beat this one:

$ time python3 -c 'for line in open("pwned.txt", "rb"): pass'

real	1m28.325s
user	1m10.049s
sys	0m15.577s

There must be a better way.

Skipping#

There's a hint about that in the file name: the hashes are ordered.

That means we don't have to check all the lines; we can skip ahead until we're past the hash, go back one step, and only check each line from there.

Lines are different lengths, so we can't skip exactly X lines without reading them. But we don't need to – skipping to any line that's a reasonable amount ahead will do.

def skip_to_before_line(file, prefix, offset):
    old_position = file.tell()
    while True:
        file.seek(offset, os.SEEK_CUR)
        file.readline()

        line = file.readline()
        # print("jumped to", (line or b'<eof>').decode().rstrip())

        if not line or line >= prefix:
            file.seek(old_position)
            break

        old_position = file.tell()

So we just seek() ahead a set number of bytes. Since that might not leave us at the start of a line, we discard the incomplete line, and use the next one.

Finally, we wrap the original find_line() with one that does the skipping:

def find_line(file, prefix):
    skip_to_before_line(file, prefix, 16 * 2**20)
    return find_line_linear(file, prefix)

def find_line_linear(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None

It works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.027203 seconds

I found the magic 16 MiB offset by trying a bunch of different values:

offset (MiB)   time (s)
           1      0.05
           4      0.035
           8      0.030
          16      0.027  <-- sweet spot
          32      0.14
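For the curious, the sweep was just something along these lines – a hypothetical harness, not part of the final script; it assumes a variant of find_line() that takes the skip offset as a third parameter, and the numbers are noisy because of caching:

import time

prefix = b'5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8'  # SHA-1 of "password"

for mib in (1, 4, 8, 16, 32):
    with open('pwned.txt', 'rb') as file:
        start = time.monotonic()
        find_line(file, prefix, mib * 2**20)  # offset-aware find_line()
        print(f"{mib:2d} MiB: {time.monotonic() - start:.3f} s")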

The code so far.

Problem: it needs tuning, it's still slow#

While three orders of magnitude faster, we still have a bunch of issues:

  1. The ideal skip size depends on the computer you're running this on.
  2. The run time still increases linearly with file size; we haven't really solved the problem so much as made it smaller by a (large, but) constant factor.
  3. The run time still increases linearly with where the hash is in the file.
  4. To be honest, it's still kinda slow. ¯\_(ツ)_/¯

There must be a better way.

Binary skipping#

To make the "linear" part painfully obvious, uncomment the jumped to line.

$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
 139 jumped to 0
 139 jumped to 1
 139 jumped to 2
 139 jumped to 3
 139 jumped to 4
 103 jumped to 5

Surely, after we've jumped to 0 once, we don't need to do it 138 more times, right?

We could jump directly to a line in the middle of the file; if the hash is there at all, it will be in one of the two halves. We then jump to the middle of that half, and to the middle of that half, and so on, until we either find the hash or there's nowhere left to jump.

Note

If that sounds a lot like binary search, that's because it is – it's just not wearing its usual array clothes.

And most of the work is already done: we can jump to a line at most X bytes from where the hash should be, we only need to do it repeatedly, in smaller and smaller fractions of the file size:

def skip_to_before_line(file, prefix, offset):
    while offset > 2**8:
        offset //= 2
        skip_to_before_line_linear(file, prefix, offset)

def skip_to_before_line_linear(file, prefix, offset):
    old_position = file.tell()

The only thing left is to get the file size:

def find_line(file, prefix):
    file.seek(0, os.SEEK_END)
    size = file.tell()
    file.seek(0)
    skip_to_before_line(file, prefix, size)
    return find_line_linear(file, prefix)

It works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.009559 seconds
$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.000268 seconds

The huge time difference is due to operating system and/or disk caches – on the second run, the (same) parts of the file are likely already in memory.

Anyway, look again at the jumped to output: instead of jumping blindly through the whole file, now we're jumping around the hash, getting closer and closer to it.

$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
   1 jumped to 7
   1 jumped to 3
   1 jumped to 7
   1 jumped to 5
   1 jumped to 4
  39 jumped to 5
$ python pwned.py pwned.txt password | grep -o 'jumped to ..' | uniq -c
   1 jumped to 7F
   1 jumped to 3F
   1 jumped to 7F
   1 jumped to 5F
   1 jumped to 4F
   1 jumped to 5F
   1 jumped to 57
   1 jumped to 5F
   1 jumped to 5B
   1 jumped to 59
   1 jumped to 5B
   1 jumped to 5A
  32 jumped to 5B

You may notice we end up at the same 7F... prefix twice; this makes sense – we skip ahead by half the file size, then back, then ahead by a quarter two times. It shouldn't really change anything; the second time, the data is likely already cached.

The code so far.

Better timing#

Given the way caching muddies the waters, how fast is it really?

This shell function averages a hundred runs, each with a different password:

function average-many {
    for _ in {1..100}; do
        python $@ $( python -c 'import time; print(time.time())' )
    done \
    | grep seconds \
    | cut -d' ' -f2 \
    | paste -sd+ - \
    | sed 's#^#scale=6;(#' \
    | sed 's#$#)/100#' \
    | bc
}

After a few repeats, the average time settles around 3 ms.

$ for _ in {1..20}; do average-many pwned.py pwned.txt; done
.004802
.003904
.004088
.003486
.003451
.003476
.003414
.003442
.003169
.003297
.002931
.003077
.003092
.003011
.002980
.003147
.003112
.002942
.002984
.002934

Again, this is due to caching; the more we run it, the more likely it is that the pages at half, quarters, eighths, sixteenths, and so on of the file size are already in memory, and the road to any line starts through a subset of those.


I wave my hands, get a 2020 laptop, and a miracle happens. It's far enough into the totally unpredictable future, now, that you can search any password in under 1 millisecond. You can do anything you want.

So, there we go. Wasn't that an interesting story? That's the end of the article. Don't look at the scroll bar. Don't worry about it.

If you came here to find a somewhat inconvenient way of checking if your password has been compromised, you can go.

Subscribe on the way out if you'd like, but take care :)

Go solve Advent of Code or something.

I'm just chilling out.

See ya.

Want to know when new articles come out? Subscribe here to get new stuff straight to your inbox!

Failing to get to under 1 millisecond#

OK, I think they're gone.

I swear this was supposed to be the end; this really was supposed to be a short one.

Here's what a friend of mine said – chronologically it belongs way later in the article, but it makes a great summary of what follows:

And now that you have arrived at this point, spend a moment to ponder the arbitrary nature of 1 millisecond given its dependency on the current year and the choice of your particular hardware.

After that moment, continue celebrating.

Nah, fuck it, it has to take less than 1 millisecond on the old laptop.

... so yeah, here's a bunch of stuff that didn't work.

Profile before optimizing#

Now, with the obvious improvements out of the way, it's probably a good time to stop and find out where time is being spent.

$ python -m cProfile -s cumulative pwned.py pwned.txt "$( date )"
looking for 3960626a8c59fe927d3cf2e991d67f4c505ae198
not found
in 0.004902 seconds
         1631 function calls (1614 primitive calls) in 0.010 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
        1    0.000    0.000    0.005    0.005 02-binary-search.py:8(find_line)
        1    0.000    0.000    0.005    0.005 02-binary-search.py:22(skip_to_before_line)
       28    0.000    0.000    0.005    0.000 02-binary-search.py:28(skip_to_before_line_linear)
       86    0.004    0.000    0.004    0.000 {method 'readline' of '_io.BufferedReader' objects}
      ...
       71    0.000    0.000    0.000    0.000 {method 'seek' of '_io.BufferedReader' objects}
      ...
       44    0.000    0.000    0.000    0.000 {method 'tell' of '_io.BufferedReader' objects}
      ...

From the output above, we learn that most of the time is spent in readline().

readline() is implemented in C, so there's not much we can change there.

What we can change, however, is how often we call it.


Another thing of interest is how long individual readline() calls take.

In skip_to_before_line_linear():

start = time.monotonic()
file.readline()
line = file.readline()
end = time.monotonic()
print("jumped to", (line[:5] or b'<eof>').decode().rstrip(),
      f"in {(end-start)*1000000:4.0f} us",
      f"at offset {file.tell():16,}")
The output is pretty enlightening:
$ python pwned.py pwned.txt asdf
looking for 3da541559918a808c2402bba5012f6c60b27661c
jumped to 7FF9E in   10 us at offset   18,671,134,394  <-- 1/2 file size
jumped to 3FF89 in    4 us at offset    9,335,567,234  <-- 1/4 file size
jumped to 1FFBA in    3 us at offset    4,667,783,663  <-- 1/8 file size
jumped to 3FF89 in    3 us at offset    9,335,567,322  <-- 1/4 file size
jumped to 2FFA4 in    5 us at offset    7,001,675,508
jumped to 3FF89 in    4 us at offset    9,335,567,366  <-- 1/4 file size
jumped to 37F98 in    4 us at offset    8,168,621,453
jumped to 3FF89 in    3 us at offset    9,335,567,410  <-- 1/4 file size
jumped to 3BF94 in    3 us at offset    8,752,094,477
jumped to 3FF89 in    2 us at offset    9,335,567,498  <-- 1/4 file size
jumped to 3DF8E in    3 us at offset    9,043,831,007
jumped to 3CF90 in    3 us at offset    8,897,962,782
jumped to 3DF8E in    2 us at offset    9,043,831,095
jumped to 3D790 in    3 us at offset    8,970,896,964
jumped to 3DF8E in    2 us at offset    9,043,831,139
jumped to 3DB90 in  253 us at offset    9,007,364,072
jumped to 3D990 in  206 us at offset    8,989,130,552
jumped to 3DB90 in    6 us at offset    9,007,364,160
jumped to 3DA8F in  270 us at offset    8,998,247,402  <-- page 2,196,837
jumped to 3DA0F in  189 us at offset    8,993,689,007
jumped to 3DA8F in    5 us at offset    8,998,247,446  <-- page 2,196,837
jumped to 3DA4F in  212 us at offset    8,995,968,274
jumped to 3DA8F in    5 us at offset    8,998,247,534  <-- page 2,196,837
jumped to 3DA6F in  266 us at offset    8,997,107,921
jumped to 3DA5F in  203 us at offset    8,996,538,139
jumped to 3DA57 in  195 us at offset    8,996,253,241
jumped to 3DA53 in  197 us at offset    8,996,110,772
jumped to 3DA57 in    6 us at offset    8,996,253,285
jumped to 3DA55 in  193 us at offset    8,996,182,045
jumped to 3DA54 in  178 us at offset    8,996,146,471
jumped to 3DA54 in  189 us at offset    8,996,128,666
jumped to 3DA54 in  191 us at offset    8,996,119,760  <-- page 2,196,318
jumped to 3DA54 in   32 us at offset    8,996,128,710
jumped to 3DA54 in    5 us at offset    8,996,124,259
jumped to 3DA54 in   10 us at offset    8,996,122,057  <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,955  <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,382  <-- page 2,196,318
jumped to 3DA54 in    9 us at offset    8,996,120,112  <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,470  <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,338  <-- page 2,196,318
pwned! seen 324,774 times before
in 0.003654 seconds

At least half the reads are pretty fast:

  • In the beginning, because searches start with the same few pages.
  • At the end, because searches end on the same page (the macOS page size is 4K).
  • Reads on the same page, after the first one.

So, it's the reads in the middle that we need to get rid of.
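(If you want to check the page size on your own machine, the standard library has it in a couple of places; 4K is the macOS value mentioned above:)

import mmap
import resource

print(mmap.PAGESIZE)           # 4096 on most systems
print(resource.getpagesize())  # same value, via a different module (Unix only)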

Position heuristic#

In theory, the output of a good hash function should be uniformly distributed.

This means that with a bit of math, we can estimate where a hash would be – a hash that's ~1/5 in the range of all possible hashes should be at ~1/5 of the file.

Here's a tiny example:

>>> digest = '5b'                    # 1-byte hash (2 hex digits)
>>> size = 1000                      # 1000-byte long file
>>> int_digest = int(digest, 16)     # == 91
>>> int_end = 16 ** len(digest)      # == 0xff + 1 == 256
>>> int(size * int_digest / int_end)
355
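Scaled up to the real file, the estimate is a one-liner – a sketch, where size is the file size from find_line() above, and taking only the first 12 hex digits is an arbitrary choice to keep the integers small:

def estimate_position(hexdigest, size):
    # assumes hashes are uniformly distributed over the whole range
    prefix = hexdigest[:12]
    return size * int(prefix, 16) // 16 ** len(prefix)

For the password hash (5baa61e4...), this points a bit more than a third of the way into the file, consistent with the 5B... prefixes in the jumped to output earlier.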

We can do this once, and then binary search a safety interval around that position. Alas, this only gets rid of the fast jumps at the beginning of the binary search (and for some reason, it ends up being slightly slower than binary search alone). (code)

We can also narrow down around the estimated position iteratively, making the interval smaller by a constant factor each time. This seems to work: a factor of 1000 yields 1.7 ms, and a factor of 8000 yields 1.2 ms (both in 2 steps). (code)

However, it has other issues:

  • Having arbitrary start/end offsets complicates the code quite a bit.
  • I don't know how to reliably determine the factor.4
  • I don't know how to prove it's correct (especially for smaller intervals, where the hashes are less uniform). To be honest, I don't think it can be 100% correct.5

Anyway, if the implementation is hard to explain, it's a bad idea.

Index file#

An arbitrary self-imposed restriction I had was that any solution should mostly use the original passwords list, with little to no preparation.

By relaxing this a bit, and going through the file once, we can build an index like:

<SHA-1 of the password>:<offset in file>

... that we can search with skip_to_before_line().

Of course, we can't include all the hashes, since the index will have just as many lines as the original file. But we don't need to – by including only lines a few kilobytes apart from each other, we can seek directly to within a few kilobytes in the big file.

The only thing left to figure out is how much "a few kilobytes" is.

After my endless harping about pages and caching, the answer should be obvious: one page size (4K). And this actually gets us 0.8 ms (with a 452M index file).

Back when I wrote the code, that thought hadn't really sunk in, so after getting 1.2 ms with a 32K "block size", I deemed this not good enough, and moved on.

(Code: search, index.)

Binary file#

At this point, I was grasping for straws.

Since I was already on the external file slippery slope, I thought of converting the file to binary, mostly to make the file smaller – smaller file, fewer reads.

I packed each line into 24 bytes:

| binary hash (20 bytes) | count (4 bytes) |
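The packing is mechanical; here's a sketch of converting one text line into such a record, and back (the actual convert script is linked below – pack_line() and unpack_record() are just illustrative names):

def pack_line(line):
    # b'5BAA61E4...:9545824\n' -> 20 hash bytes + 4 count bytes
    digest, _, count = line.rstrip().partition(b':')
    return bytes.fromhex(digest.decode()) + int(count).to_bytes(4, 'big')

def unpack_record(record):
    return record[:20].hex().upper(), int.from_bytes(record[20:], 'big')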

This halved the file, but only lowered the runtime to 2.6 ms :(

But more importantly, it made the code much, much simpler: because items are fixed size, you can know where the Nth item is, so I was able to use bisect for the binary search.

(Code: search, convert.)

Getting to under 1 millisecond#

OK, what now? We've tried some things, we've learned some stuff:

  • The position heuristic kinda (maybe?) works, but is hard to reason about.
  • The index file gets us there in the end, but barely, and the index is pretty big.
  • The binary file isn't much faster, and it creates a huge file. But, less code!

I don't know what to do with the first one, but we can combine the last two.

Generating the index#

Let's start with the script I made for the text index:

import os, sys

file = sys.stdin.buffer
outf = sys.stdout.buffer

while True:
    pos = file.tell()
    line = file.readline()
    if not line:
        break

    outf.write(line.partition(b':')[0] + b':' + str(pos).encode() + b'\n')

    file.seek(2**12, os.SEEK_CUR)
    file.readline()

The output looks like this:

$ python generate-index.py < pwned.txt 2>/dev/null | head -n5
000000005AD76BD555C1D6D771DE417A4B87E4B4:0
00000099A4D3034E14DF60EF50799F695C27C0EC:4157
00000172E8E1D1BD54AC23B3F9AB4383F291CA17:8312
000002C8F808A7DB504BBC3C711BE8A8D508C0F9:12453
0000047139578F13D70DD96BADD425C372DB64A9:16637

We need to pack that into bytes.

A hash takes 20 bytes. But, looking at the values above, we only need slightly more than 3 bytes (6 hex digits) to distinguish between the index lines:

$ python generate-index.py < pwned.txt 2>/dev/null | cut -c-6 | uniq -c | head
   2 000000
   1 000001
   1 000002
   1 000004
   1 000005
   1 000007
   1 000008
   1 00000A
   1 00000B
   1 00000C

To represent all the offsets in the file, we need log2(35G) / 8 = 4.39... bytes, which results in a total of 9 bytes (maybe even 8, if we mess with individual bits).
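(A quick sanity check of that arithmetic, using the file size from the stat output at the beginning:)

import math

size = 37_342_268_646          # pwned.txt size in bytes
print(math.log2(size) / 8)     # ~4.39 bytes to represent any offset
print(2 ** (6 * 8) / 10**15)   # 6 offset bytes cover ~0.28 petabytes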

But, let's make it future-proof: 6 bytes for the hash buys at least 2.4 petabyte files, and 6 for the offset buys 0.28 petabyte files.

digest, _, _ = line.partition(b':')
outf.write(bytes.fromhex(digest.decode())[:6])
outf.write(pos.to_bytes(6, 'big'))

If you look at the text index, you'll notice we skip 4K + ~50 bytes; this results in sometimes having to read 2 pages from the big file, because not all pages have an index entry. Let's fix that by reading the first whole line after a 4K boundary instead:

file.seek((pos // 4096 + 1) * 4096)
file.readline()

OK, we're done:

$ time python generate-index.py < pwned.txt > index.bin

real	1m2.729s
user	0m34.292s
sys	0m21.392s

Using the index#

We start off with a skeleton that's functionally identical to the naive "go through every line" script; the only difference is that I've added stubs for passing and using the index:

def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)

def skip_to_before_line_index(file, prefix, index):
    ...

index_path = sys.argv[2]
index = ...

line = find_line(file, hexdigest.upper().encode(), index)
(In case you want to see the whole script.)
import os
import sys
import time
import bisect
import getpass
import hashlib


def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)

def find_line_linear(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None

def skip_to_before_line_index(file, prefix, index):
    ...


path = sys.argv[1]
file = open(path, 'rb')

index_path = sys.argv[2]
index = ...

try:
    password = sys.argv[3]
except IndexError:
    password = getpass.getpass()

hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password

print("looking for", hexdigest)

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode(), index)
end = time.monotonic()

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")

print(f"in {end-start:.6f} seconds")

As mentioned before, we'll use the standard library bisect module to search the index.

We could read the entire index in memory, as a list of 12-byte bytes. But that has at least two issues:

  • It'd still be slow, even if outside the current timing code.
  • Memory usage would increase with the size of the file.

Although the module documentation says they work with lists, the bisect* functions work with any arbitrary sequence; a sequence is an object that implements a couple special methods that allow it to behave like a list.

We'll need such an object; it wraps the index file, and gets its length from the file size:

class BytesArray:

    item_size = 12

    def __init__(self, file):
        self.file = file

We can go ahead and plug it in:

index_path = sys.argv[2]
index = BytesArray(open(index_path, 'rb'))

The first special method is __getitem__(), to implement a[i]:6

    def __getitem__(self, index):
        self.file.seek(index * self.item_size)
        buffer = self.file.read(self.item_size)
        if len(buffer) != self.item_size:
            raise IndexError(index)  # out of bounds
        return buffer

The second special method is __len__(), to implement len(a):

    def __len__(self):
        self.file.seek(0, os.SEEK_END)
        size = self.file.tell()
        return size // self.item_size

Using the index becomes straightforward:

def skip_to_before_line_index(file, prefix, index):
    item_prefix = bytes.fromhex(prefix.decode())[:6]
    item = find_lt(index, item_prefix)
    offset = int.from_bytes(item[6:], 'big') if item else 0
    file.seek(offset)

def find_lt(a, x):
    i = bisect.bisect_left(a, x)
    if i:
        return a[i-1]
    return None

We get the first 6 bytes of the hash, find the rightmost value less than that, extract the offset from it, and seek to there. find_lt() comes from bisect's recipes for searching sorted lists.

And we're done:

$ average-many pwned.py pwned.txt index.bin
.002546

Huh? ... that's unexpected...

Oh.

I said we won't read the index in memory. But we can force it into the OS cache by getting the OS to read it a bunch of times:

$ for _ in {1..10}; do cat index.bin > /dev/null; done

Finally:

$ average-many pwned.py pwned.txt index.bin
.000421

I herd you like indexes (the end)#

Hmmm... isn't that cold start bugging you? It is bugging me.

I tried it out, and if we make an index (313K) for the big index, we get 1.2 ms from a cold start. (Forcing the big index into memory doesn't get us past the ~500 μs we hit before, because most of that is spent reading from the big file.)

Maybe another smaller index can take us to below 1 ms from a cold start?

...

Just kidding, this is it, this really is the end.

Let's take a look at what we've accomplished:

method          statements   time (ms, order of magnitude)
linear               29      100,000
linear+skip          42          100
binary search        49           10
binary index    59 (72)            1

For twice the code, it's 5 orders of magnitude faster! (I'm not counting bisect or the OS cache, but that's kinda the point, those already exist, they're basically free.)

I mean, we could do indexception, but to do it even remotely elegantly, I need at least five more hours, for something that's definitely not going to be twice the code. And for what? Just slightly faster cold start? I'm good.

Turns out, you can get pretty far with just a few tricks.


That's it for now.

Learned something new today? Share this with others, it really helps!

Want to know when new articles come out? Subscribe here to get new stuff straight to your inbox!

  1. It's linked right above the textbox, after all. [return]

  2. I remember reading that Guido van Rossum thinks sorting imports by length looks cool (I know I do). In production code, though, maybe stick to alphabetical :) [return]

  3. Which is kinda difficult to do in Python anyway. [return]

  4. Something like (size / 4096) ** (1 / int(log(size, 4096))) should work, but I didn't have enough patience to debug the infinite loop it caused. [return]

  5. I mean, I did cross-check it with the binary search solution for a few thousand values, and it seems correct, but that's not proof. [return]

  6. We're only implementing the parts of the sequence protocol that bisect uses.

    For the full protocol, __getitem__() would need to also implement negative indexes and slicing. To get even more sequence methods like count() and index() for free, we can inherit collections.abc.Sequence.

    Interestingly, our class will work in a for loop without needing an __iter__() method. That's because there are actually two iteration protocols: an older one, using __getitem__(), and a newer one, added in Python 2.1, using __iter__() / __next__(); for backwards compatibility, Python still supports the old one (more details on the logic in Unravelling for statements by Brett Cannon). [return]

