Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 22872

David Amos: Tree Traversals And Anagrams In Python

$
0
0
Tree Traversals And Anagrams In Python

This article contains affiliate links. See my affiliate disclosure for more information.

I know it&aposs all just an accident of language, but the fact that you can rearrange the letters in "the eyes" into "they see" feels… magical.

Ever since I was a kid, I&aposve enjoyed anagrams. So, I thought I&aposd try to build an anagram generator in Python. It turned into a fun little weekend project with a dash of tree traversals and recursion.

✉️
This article was originally published in my Curious About Code newsletter. Never miss an issue. Subscribe here →

Here&aposs what I came up with.

How To Check If Two Strings Are Anagrams

string1 is an anagram of string2 if the characters in string1 are a rearrangement of the characters in string2.

The order of the characters doesn&apost really matter, though. What matters is that the count of each character in string1 is the same as the count of each character in string2. If each character appears the same number of times in each string, then the two strings are anagrams of each other.

Python&aposs collections.Counter can calculate this:

from collections import Counter

string1 = "the eyes"
string2 = "they see"

Counter(string1)
# Counter({
#     &apose&apos: 3,
#     &apost&apos: 1,
#     &aposh&apos: 1,
#     &apos &apos: 1,
#     &aposy&apos: 1,
#     &aposs&apos: 1
# })

Counter(string1) == Counter(string2)
# True ✅

See that empty string &apos &apos in the Counter object? That&aposs a problem. "Theater" and "the tear" are anagrams, even though one has a space and the other doesn&apost. Anagrams are case-insensitive. They may also have different punctuation, as in "vitalise" and "IT&aposS ALIVE!"

A robust solution needs to take all this into account, so collections.Counter won&apost cut it without some preprocessing:

string1 = "vitalise"
string2 = "IT&aposS ALIVE!"

Counter(string1) == Counter(string2)
# False ❌

Like most problems, there are many ways to solve this one.

I wanted a solution that is easy to understand and easy to change in case it doesn&apost handle every case it should. I went with the string .translate() method, which replaces each character in a string according to a translation dictionary. Characters mapped to None are removed from the string. Plus, I can easily create translation dictionaries with str.maketrans().

This gives me fine-grained control over how the strings are processed before being passed to Counter():

from collections import Counter
from string import punctuation, whitespace
from typing import Callable

def process(string_: str) -> str:
    chars_to_remove = punctuation + whitespace
    translation = str.maketrans("", "", chars_to_remove)
    return string_.translate(translation).lower()

def are_anagrams(
    string1: str,
    string2: str,
    process: Callable = process
) -> bool:
    processed1 = process(string1)
    processed2 = process(string2)
    return Counter(processed1) == Counter(processed2)

It might be better to use .casefold() instead of .lower() in the process() function. But overall, this makes for a robust solution. It&aposs easy to understand, provided you know about Counter, .translate(), and str.maketrans(). And it&aposs easy to swap out processing functions if needed.

But what about finding anagrams for a particular string?

How To Generate Anagrams

It&aposs harder than checking if two strings are anagrams of each other.

Simply generating all possible rearrangements of a string isn&apost enough. You have to insert spaces and punctuation in places that make sense. The resulting words in the string need to be actual dictionary words.

We need a way to efficiently generate words that use letters from a string.

Solving It On Paper

Set aside the problem of obtaining a list of words and start simple:

eyes
the
they
see
sea

The process goes like this:

  1. Write down the phrase for which to generate anagrams, such as "the eyes."
  2. Pick a letter from the phrase (say, "t") and scan the list for words that start with that letter (words = "the" and "they"). Cross off one instance of the letter in the phrase, and write down the letter as the first letter of an anagram (anagram = "t").
  3. Pick a letter from the phrase that hasn&apost been crossed off yet (say, "h"), and filter the words selected in the last step to those whose second letter is the new letter you picked (words = "the" and "they"). Cross off the letter in the phrase and add it to your anagram (anagram = "th").
  4. Continue this process of selecting unused letters and filtering the list of words. When you reach the end of a word, check if the phrase you&aposve generated so far is an anagram of the original phrase. If it is, you&aposre done. If not, start over with the complete list of words but only use letters that haven&apost been crossed off.
  5. Repeat the whole process using different initial letters from the phrase.

If you follow this algorithm step-by-step on the small list of words above, you&aposll end up with four anagrams:

  • the eyes
  • eyes the
  • they see
  • see they

Of course, "the eyes" is the phrase you started with. That&aposs fine. Just discard it.

Wouldn&apost it be great if, instead of scanning the entire list for all words starting with the letter T, we could jump straight to them? Or jump to all the words that start with TH? Or begin with THE?

Sounds like we need a hash table!

What we really want is a hash table of hash tables.

A hash table of hash tables is one way to represent a tree. And, if you stop and think about it, the method I&aposve described for generating anagrams feels a lot like a depth-first search. That&aposs a big clue for how to implement the algorithm.

And Python has all of the pieces we need.

📚
Need a refresher on algorithms?

My go-to book is Grokking Algorithms by Aditya Bhargava. It covers hash tables, trees, recursion, depth-first search, and so much more, in a friendly and approachable way.

Get instant access from Manning or pick up a copy from Amazon.

Building a Tree Of Dictionaries

The tree we&aposre going to build will look something like this:

Tree Traversals And Anagrams In Python

Python dictionaries are hash tables, so they&aposre the natural choice for the basis of our tree. The keys for each dictionary are strings containing a single letter, and the values are more dictionaries. We&aposll need to know when we&aposre at the end of a word, so we&aposll mark those places with the dictionary {" ": {}}.

The tree of dictionaries for the words eyes, the, they, see, and sea looks like this:

{&apose&apos: {&aposy&apos: {&apose&apos: {&aposs&apos: {&apos &apos: {}}}}},
 &aposs&apos: {&apose&apos: {&aposa&apos: {&apos &apos: {}}, &apose&apos: {&apos &apos: {}}}},
 &apost&apos: {&aposh&apos: {&apose&apos: {&apos &apos: {}, &aposy&apos: {&apos &apos: {}}}}}}

But how do you construct a dictionary like this?

The key observation is that you can add words to the dictionary recursively. Start with an empty dictionary. For each word, pop off the first letter — call it char. If char is a key in the dictionary, grab its value. Otherwise, set the key char in the dictionary with an empty dictionary for its value. Then repeat the process, using the dictionary mapped to char and the word without its first character.

Assume word has no whitespace and tack the string " " on to the end so that we get the terminal dictionaries right.

Here&aposs the code:

from typing import Iterable

def add_word(tree: dict, word: str) -> None:
    if word != "":
        char = word[0]
        branch = tree.get(char, {})
        tree[char] = branch
        _add_word(branch, word[1:])

def build_word_tree(words: Iterable[str]) -> dict:
    word_tree = {}
    for word in words:
        add_word(word_tree, word + " ")
    return word_tree

Recursion is nice because it can compactly describe a repetitive process. But it can be difficult to understand what a recursive algorithm does without reasoning through the steps by hand. Documentation with some good examples can help a reader (including your future self!) understand a recursive function.

📚
Want to get better at recursion?

Al Sweigart, of Automate The Boring Stuff fame, has a new book all about recursion with examples in Python and JavaScript.

Get The Recursive Book of Recursion from No Starch Press or Amazon, or read it for free on Al&aposs website.

Now that we&aposve got a tree to work with, we can start generating anagrams.

Traversing The Tree To Make Anagrams

You can implement the anagram algorithm as a walk over the nodes in a word tree.

We pick an initial node — one of the characters in the phrase we want anagrams for — and mark that node as visited in an array. Then move to the branch in the tree rooted at that node. Keep repeating this until you visit all of the characters in the phrase or reach the end of a word. More recursion!

It&aposs essentially a depth-first search algorithm, except that the nodes available to walk to are limited by the nodes in the tree and the characters in the phrase that haven&apost been visited yet.

This is captured in the following Python code:

from typing import Generator

def walks(
    tree: dict,
    filter_: Callable,
    _visited: tuple = ()
) -> Generator:
    if tree == {}:
        yield _visited
    else:
        for node in filter_(tree, _visited):
            yield from walks(
                tree[node], filter_, (*_visited, node)
            )

The filter_() function should return a tuple of nodes available to walk to. There&aposs some nuance to how exactly this works for finding anagrams. We&aposll get to that in a bit and hold off on defining a filter function for now.

We&aposre still not done implementing the walk for the anagram algorithm.

walks() yields tuples of nodes visited by a walk — characters from a phrase in this case — but walks end as soon as they reach the end of a word.

To produce multi-word anagrams you need to circle around and continue walking the tree from the root. I call this "walking around" the tree. You keep walking around and around, collecting words until the walk is done.

Here&aposs the code:

def walk_around(
    tree: dict,
    done: Callable,
    filter_: Callable,
    _visited: tuple = (),
) -> Generator:
    for walk in walks(tree, filter_, _visited):
        if done(walk):
            yield walk
        else:
            yield from walk_around(
                tree, done, filter_, walk
            )

We need to implement done and filter functions to pass to walk_around(), both of which depend on the phrase we&aposre generating anagrams for.

A walk is done when the string of characters visited in the walk is an anagram of the phrase. We already know how to do that! We can use the are_anagrams() function:

def done(phrase: str, walk: tuple) -> bool:
    return are_anagrams(phrase, "".join(walk))

Now we need to define filter_().

I said there was some nuance to this.

At each step in the walk, we only need to move to a character in the phrase that we haven&apost visited yet. Or a punctuation character, since we want to include words like contractions. Oh, and we can also move to a " " character, since that is how we know we&aposre at the end of a word.

We can express these rules cleanly using sets:

def filter_(
    phrase: str,
    tree: dict,
    _visited: tuple
) -> tuple:
    remaining = set(Counter(phrase) - Counter(_visited))
    _punctuation = set(punctuation)
    end_of_word = {" "}
    allowed = remaining | _punctuation | end_of_word
    nodes = set(tree.keys())
    anagram_nodes = nodes.intersection(allowed)
    return tuple(anagram_nodes)

Counter makes it easy to calculate the characters in the phrase that haven&apost been visited yet.

Subtracting a Counter from another returns a Counter with the counts subtracted. Characters whose count is zero are removed. For instance, Counter("tea") - Counter("e") returns Counter({&apost&apos: 1, &aposa&apos: 1}).

The | is the dictionary merge operator available since Python 3.9.

We have all the pieces in place to write an anagram generator:

from functools import partial

def anagrams(phrase: str, words: Iterable) -> Generator:
    tree = build_word_tree(words)
    _done = partial(done, phrase)
    _filter = partial(filter_, phrase)
    for walk in walk_around(tree, _done, _filter):
        anagram = "".join(walk).rstrip()
        if anagram != phrase:
            yield anagram

functools.partial() is used here to "fill in" the phrase parameters of the done() and filter_() functions. We have to do this because the done and filter_ parameters of walk_around() expect functions without a phrase parameter. Whitespace in the anagram is removed with .rstrip() since walks produced by walk_around() end with a space.

Let&aposs take anagrams() for a spin on a small list of words:

words = ["eyes", "the", "they", "see", "sea"]
phrase = "the eyes"

for anagram in anagrams(phrase, words):
    print(anagram.rstrip())

# eyes the
# they see
# see they

Hey, it works!

Testing On a Large Set Of Words

If your computer has a Unix-like OS, you have a large list of words available right now.

On most machines, there&aposs a words file located at /usr/share/dict/words or /usr/dict/words. It&aposs a newline delimited list of dictionary words. If you have a Windows machine or don&apost have a words file, you can find a tarball for the file in your language on the GNU website.

Let&aposs try it out:

from pathlib import Path

words_path = Path("/usr/share/dict/words")
words = words_path.read_text()
# The words file ends with a blank line so we
# need to remove trailing whitespace
words = words.lower().strip().split("\n")
phrase = "tea"

for anagram in anagrams(phrase, words):
    print(anagram)

# a t e
# a te
# a e t
# at e
# ate
# ae t
# t a e
# ...

It&aposs working great, but the words file includes single letters and two-letter words like "te." That adds a bunch of noise to our anagrams. We should probably clean it up.

Let&aposs filter out some words we know that we don&apost want:

from string import ascii_lowercase

SINGLE_LETTERS = set(ascii_lowercase) - {"a", "i"}
TWO_LETTER_WORDS = {"te", "ae", "ea"}
FORBIDDEN = SINGLE_LETTERS | TWO_LETTER_WORDS

words_path = Path("/usr/share/dict/words")
words = words_path.read_text()
words = words.lower().strip().split("\n")
words = (word for word in words if word not in FORBIDDEN)

phrase = "tea"

for anagram in anagrams(phrase, words):
    print(anagram)
    
# eat
# eta
# ate
# tae

This definitely gives us better results. For a really good anagram generator, you&aposll need to spend some time curating words. And the words file isn&apost necessarily the best set of words to start with, since it doesn&apost contain any contractions.

You can generate more interesting word lists using aspell&aposs SCOWL tool.

Happy anagram hunting!



Viewing all articles
Browse latest Browse all 22872

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>