If you missed it, read the first post of this series.
What do you do when you need to filter a long list of records for your users?
That was the question we set out to answer in a previous post. We saw that, for simple queries, the built-in filtering provided by your framework of choice (think Django) is just fine. Most of the time, though, you'll need something more powerful. This is where PostgreSQL's full text search facilities come in handy.
We also saw that just using the to_tsvector and to_tsquery functions goes a long way toward filtering your records. But what about documents that contain accented characters? What can we do to optimize performance? And how do we integrate all of this with Django?
Hola, Mundo!
We have found that the need to search documents in multiple languages is fairly common. You can query your data using to_tsquery without passing a language configuration name, but remember that, under the hood, the text search functions always use one. The default configuration is english, and you have to use the stemmer that matches your documents' language or you might not get any matches.
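You can check which configuration your server falls back to when none is passed explicitly; on a stock installation it looks like this:

=> SHOW default_text_search_config;
 default_text_search_config
-----------------------------
 pg_catalog.english
(1 row)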
If, for example, we search for física in Spanish documents that contain this word and its variations, we only see exact matches for this query:
=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('física');
                                   text
-------------------------------------------------------------------------
 física (aparatos e instrumentos de —)
 física (educación —)
 física (investigación en —)
 rehabilitación física (aparatos de —) para uso médico
 educación física
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(8 rows)
And worse, if we search just for fisica, unaccented:
=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('fisica');
 text
------
(0 rows)
To get results that contain a variation of física (físicas, físico, físicamente, etc.) we have to use the right stemmer. Remember that if our stemmer doesn't have a word in its dictionary, it won't stem it:
=> SELECT ts_lexize('english_stem', 'programming');
 ts_lexize
-----------
 {program}
(1 row)

=> SELECT ts_lexize('spanish_stem', 'programming');
   ts_lexize
---------------
 {programming}
(1 row)
But if we use the right language, it will:
=> SELECT text FROM terms
   WHERE to_tsvector('spanish', text) @@ to_tsquery('spanish', 'física');
                                      text
--------------------------------------------------------------------------------
 física (aparatos e instrumentos de —)
 ejercicios físicos (aparatos para —)
 entrenamiento físico (aparatos de —)
 físicos (aparatos para ejercicios —)
 física (educación —)
 preparador físico personal [mantenimiento físico] (servicios de —)
 física (investigación en —)
 ejercicio físico (aparatos de —) para uso médico
 rehabilitación física (aparatos de —) para uso médico
 aparatos para ejercicios físicos
 almacenamiento de soportes físicos de datos o documentos electrónicos
 clases de mantenimiento físico
 clubes deportivos [entrenamiento y mantenimiento físico]
 educación física
 conversión de datos o documentos de un soporte físico a un soporte electrónico
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(18 rows)
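Under the hood this works because the Spanish stemmer reduces física and its variants to one common lexeme, which we can verify directly with ts_lexize (the lexeme below is what we would expect from the Snowball Spanish stemmer):

=> SELECT ts_lexize('spanish_stem', 'física');
 ts_lexize
-----------
 {fisic}
(1 row)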
Working with multiple languages
Of course, we don't want to fill in the language manually every time. A straightforward solution is to store each record's language in its own column. to_tsvector and friends accept not only a configuration name but also the name of a column holding one, so we can write:
=> SELECT text FROM terms
   WHERE to_tsvector(language, text) @@ to_tsquery(language, 'entrena');
                           text
-----------------------------------------------------------
 entrenamiento físico (aparatos de —)
 clubes deportivos [entrenamiento y mantenimiento físico]
(2 rows)
The only catch is that the column must be of type regconfig. If you are using South or Django migrations to manage your database schema, remember to change the type of your language column:
=> ALTER TABLE terms ALTER COLUMN language TYPE regconfig USING language::regconfig;
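The USING clause casts the existing values; the cast only succeeds for names of text search configurations that actually exist, which you can check beforehand (a quick illustration, not part of the migration itself):

=> SELECT 'spanish'::regconfig;
 regconfig
-----------
 spanish
(1 row)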
This way we can query records in their own language.
Even Better: Accented Characters
But what about accented words? If our users don't type them carefully (and most users don't), they won't find anything.
If we search for the word fisica again, even with all the previous setup, nothing shows up:
=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('fisica');
 text
------
(0 rows)
We found this behavior funny (not in a good way): some words work and some don't. We suspect it depends on the completeness of the dictionary in use, but we can't rely on a word being in the dictionary. It would be hit or miss (and, in our experience, mostly miss).
But don't worry. There is a PostgreSQL extension for that™.
CREATE EXTENSION unaccent;
The unaccent extension can be used to extend the default language configurations to filter out accented characters:
CREATE TEXT SEARCH CONFIGURATION sp ( COPY = spanish );

ALTER TEXT SEARCH CONFIGURATION sp
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, spanish_stem;
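As a quick sanity check, the accented and unaccented spellings should now reduce to the same lexeme (the output below is what we would expect from unaccent followed by the Spanish stemmer):

=> SELECT to_tsvector('sp', 'física'), to_tsquery('sp', 'fisica');
 to_tsvector | to_tsquery
-------------+------------
 'fisic':1   | 'fisic'
(1 row)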
And, behold! It works great:
=> SELECT text FROM terms
   WHERE to_tsvector('sp', text) @@ to_tsquery('sp', 'fisica');
                                      text
--------------------------------------------------------------------------------
 física (aparatos e instrumentos de —)
 ejercicios físicos (aparatos para —)
 entrenamiento físico (aparatos de —)
 físicos (aparatos para ejercicios —)
 física (educación —)
 preparador físico personal [mantenimiento físico] (servicios de —)
 física (investigación en —)
 ejercicio físico (aparatos de —) para uso médico
 rehabilitación física (aparatos de —) para uso médico
 aparatos para ejercicios físicos
 almacenamiento de soportes físicos de datos o documentos electrónicos
 clases de mantenimiento físico
 clubes deportivos [entrenamiento y mantenimiento físico]
 educación física
 conversión de datos o documentos de un soporte físico a un soporte electrónico
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(18 rows)
Note that you must be a superuser to install this extension on a normal installation (on Heroku Postgres this is not necessary). If PostgreSQL complains about not finding the extension, install the postgresql-contrib package from your distro.
A Note on Performance
To make text search work well on a large dataset, consider creating a GIN index:
CREATE INDEX terms_idx ON terms USING gin(to_tsvector(language, text));
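One caveat worth knowing: the planner can only use an expression index when the query uses the exact same expression, so the two-argument to_tsvector(language, text) form must appear both in the index and in the WHERE clause. You can confirm with EXPLAIN (output elided here) that a Bitmap Index Scan on terms_idx is chosen:

=> EXPLAIN SELECT text FROM terms
   WHERE to_tsvector(language, text) @@ to_tsquery(language, 'fisica');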
GIN indexes can take longer to build than GiST indexes, but lookups are a lot faster.
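Conversely, if the table changes frequently and build/update time matters more than lookup speed, a GiST index over the same expression is the alternative (terms_gist_idx is just an illustrative name):

CREATE INDEX terms_gist_idx ON terms USING gist(to_tsvector(language, text));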
Wrapping Up: Using It in Django
Finally, to make all of this easy to use from Python, we wrote a custom queryset and used django-model-utils to attach it to the model's manager:
from django.db import models
from django.utils.translation import ugettext_lazy as _
from model_utils.managers import PassThroughManager


class TermSearchQuerySet(models.query.QuerySet):
    def search(self, query, raw=False):
        function = "to_tsquery" if raw else "plainto_tsquery"
        # Pass the user's input as a bound parameter instead of interpolating
        # it into the SQL string, so it can't be used for SQL injection.
        where = "to_tsvector(language, text) @@ %s(language, %%s)" % function
        return self.extra(where=[where], params=[query])


class Term(models.Model):
    text = models.CharField(_('text'), max_length=255)
    # Declared as a CharField in Django; the actual column type is regconfig
    # (see the ALTER TABLE above).
    language = models.CharField(max_length=20, default='sp', editable=False)

    objects = PassThroughManager.for_queryset_class(TermSearchQuerySet)()
Remember our original example from the first part?
>>> Entry.objects.search('man is biting a tail')
[<Entry: Man Bites Dogs Tails>]