For a change of pace, I figured I would do a basic chemistry lesson about molecular structures, instead of a more computer-oriented blog post.
Chemists often think about a molecule as a core structure (usually a ring system) and a set of R-groups. Each R-group is attached to an atom in the core structure by a bond. Typically that bond is a single bond, and often "rotatable".
Here's an example of what I mean. The first image below shows the structure of vanillin, which is the primary taste behind vanilla. In the second image, I've circled (well, ellipsed) the three R-groups in the structure.
[Figures: vanillin structure (the primary taste of vanilla); vanillin with three R-groups identified]
The R-groups in this case are R1=a carbonyl group (*-CH=O), R2=a methoxy group (*-O-CH3), and R3=a hydroxyl group (*-OH), where the "*" indicates where the R-group attaches to the core structure.
The R-group concept is flexible. Really it just means that you have a fixed group of connected atoms, which are connected along some bond to a variable group of atoms, and where the variable group is denoted R. Instead of looking at the core structure and a set of R-groups, I can invert the thinking and think of an R-group, like the carbonyl group, as "the core structure", and the rest of the vanillin as its R-group.
With that in mind, I'll replace the "*" with the "R" to get the groups "R-CH=O", "R-O-CH3", and "R-OH". (The "*" means that the fragment is connected to an atom at this point, but it's really just an alternative naming scheme for "R".)
All three of these groups are also functional groups. Quoting Wikipedia, "functional groups are specific groups (moieties) of atoms or bonds within molecules that are responsible for the characteristic chemical reactions of those molecules. The same functional group will undergo the same or similar chemical reaction(s) regardless of the size of the molecule it is a part of."
The three corresponding functional groups are R1 = aldehyde, R2 = ether, and R3 = hydroxyl.
As the Wikipedia quote pointed out, if you have a reaction which acts on an aldehyde, you can likely use it on the aldehyde group of vanillin.
Vanillyl group and capsaicin
A functional group can also contain functional groups. I pointed to the three functional groups attached to the central ring of vanillin, but most of the vanillin structure is itself another functional group, a vanillyl:
Structures which contain a vanillyl group are called vanilloids. Vanillin is of course a vanilloid, but surprisingly so is capsaicin, the source of the "heat" in many a spicy food. Here's the capsaicin structure, with the vanillyl group circled:
The feeling of heat comes because the capsaicin binds to TRPV1 (the transient receptor potential cation channel subfamily V member 1), also known as the "capsaicin receptor". It's a nonselective receptor, which means that many things can cause it to activate. Quoting that Wikipedia page: "The best-known activators of TRPV1 are: temperature greater than 43 °C (109 °F); acidic conditions; capsaicin, the irritating compound in hot chili peppers; and allyl isothiocyanate, the pungent compound in mustard and wasabi." The same receptor detects temperature, capsaicin, and a compound in hot mustard and wasabi, which is why your body interprets them all as "hot."
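Capsaicin is a member of the capsaicinoid family. All capsaicinoids are vanilloids, and all vanilloids are phenols, because the vanillyl group carries a hydroxyl on its aromatic ring. (Capsaicin itself is an amide, not an aldehyde.) This sort of is-a family membership relationship in chemistry has led to many taxonomies and ontologies, including ChEBI.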
But don't let my example or the existence of nomenclature lead you to the wrong conclusion that all R-groups are functional groups! An R-group, at least with the people I usually work with, is a more generic term used to describe a way of thinking about molecular structures.
QSAR modeling
QSAR (pronounced "QUE-SAR") is short for "quantitative structure-activity relationship", which is a mouthful. (I once travelled to the UK for a UK-QSAR meeting. The border inspector asked me where I was going, and I said "the UK-QSAR meeting; QSAR is .." and I blanked on the expansion of that term! I was allowed across the border, so it couldn't have been that big of a mistake.)
QSAR deals with the development of models which relate chemical structure to its activity in a biological or chemical system. Looking at that, I realize I just moved the words around a bit, so I'll give a simple example.
Consider an activity, which I'll call "molecular weight". (This is more of a physical property than a chemical one, but I am trying to make it simple.) My model for molecular weight assumes that each atom has its own weight, and the total molecular weight is the sum of the individual atom weights. I can create a training set of molecules, and for each molecule determine its structure and molecular weight. With a bit of least-squares fitting, I can determine the individual atom weight contribution. Once I have that model, I can use it to predict the molecular weight of any molecule which contains atoms which the model knows about.
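As a rough sketch of that fitting step, here is a toy version in plain Python. The training set is my own small illustrative choice (five simple molecules with their standard molecular weights in g/mol), and the helper names `solve_linear`, `fit_atom_weights`, and `predict_weight` are made up for this example. It solves the least-squares normal equations by Gaussian elimination, which is fine for three unknowns but not how you would do it at scale.

```python
# Toy additive model: molecular weight = sum of per-element contributions.
# Illustrative training data: (atom counts, molecular weight in g/mol).
TRAINING = [
    ({"C": 1, "H": 4}, 16.043),          # methane
    ({"H": 2, "O": 1}, 18.015),          # water
    ({"C": 1, "O": 2}, 44.009),          # carbon dioxide
    ({"C": 1, "H": 4, "O": 1}, 32.042),  # methanol
    ({"C": 2, "H": 6, "O": 1}, 46.069),  # ethanol
]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("training set cannot separate the elements")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_atom_weights(training):
    """Least-squares fit via the normal equations (X^T X) w = X^T y."""
    elements = sorted({el for counts, _ in training for el in counts})
    rows = [[counts.get(el, 0) for el in elements] for counts, _ in training]
    ys = [y for _, y in training]
    n = len(elements)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return dict(zip(elements, solve_linear(xtx, xty)))


def predict_weight(weights, counts):
    """Sum the fitted per-element contributions for one molecule."""
    return sum(weights[el] * n for el, n in counts.items())


weights = fit_atom_weights(TRAINING)
vanillin = {"C": 8, "H": 8, "O": 3}  # C8H8O3, not in the training set
predicted = predict_weight(weights, vanillin)
print(weights)
print(round(predicted, 2))
```

The fitted contributions should come out very close to the usual atomic weights of carbon, hydrogen, and oxygen, and the vanillin prediction close to its tabulated molecular weight of about 152.15. The model can only predict molecules built from elements it saw during training.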
Obviously this model will be pretty accurate. It won't be perfect, because isotopic ratios can vary. (A chemical synthesized from fossil oil is slightly lighter and less radioactive than the same chemical derived from environmental sources, because the heavier radioactive 14C in fossil oil has decayed.) But for most uses it will be good enough.
A more chemically oriented property is the partition coefficient, measured in log units as "log P", which is a measure of the solubility in water compared to a type of oil. This gives a rough idea of whether the molecule will tend to end up in hydrophobic regions like a cell membrane, or in aqueous regions like blood. One way to predict log P is with the atom-based approach I sketched for the molecular weight, where each atom type has a contribution to the overall measured log P. (This is sometimes called AlogP.)
In practice, atom-based solutions are not as accurate as fragment-based solutions. The molecular weight can be atom-centered because nearly all of the mass is in the atom's nucleus, which is well localized to the atom. But chemistry isn't really about atoms but about the electron density around atoms, and electrons are much less localized than nucleons. The density around an atom depends on the neighboring atoms and the configuration of the atoms in space.
As a way to improve on that, some methods look at the extended local environment (this is sometimes called XlogP) or at larger fragment contributions (for example, BioByte's ClogP). The more complex it is, the more compounds you need for the training and the slower the model. But hopefully the result is more accurate, so long as you don't overfit the model.
If you're really interested in the topic, Paul Beswick of the Sussex Drug Discovery Centre wrote a nice summary on the different nuances in log P prediction.
Matched molecular pairs
Every major method from data mining, and most of the minor methods, has been applied to QSAR models. The history is also quite long. There are cheminformatics papers from as far back as the 1970s looking at supervised and unsupervised learning, building on even earlier work on clustering applied to biological systems.
A problem with most of these is the black-box nature. The data is noisy, and the quantum nature of chemistry isn't that good of a match to data mining tools, so these predictions are used more often to guide a pharmaceutical chemist than to make solid predictions. This means the conclusions should be interpretable by the chemist. Try getting your neural net to give a chemically reasonable explanation of why it predicted as it did!
Matched molecular pair (MMP) analysis is a more chemist-oriented QSAR method, with relatively little mathematics beyond simple statistics. Chemists have long looked at activities in simple series, like replacing a methyl (*-CH3) with an ethyl (*-CH2-CH3) or propyl (*-CH2-CH2-CH3), or replacing a fluorine with a heavier halogen like a chlorine or bromine. These can form consistent trends across a wide range of structures, and chemists have used these observations to develop techniques for how to, say, improve the solubility of a drug candidate.
MMP systematizes this analysis over all considered fragments, including not just R-groups (which are connected to the rest of the structure by one bond) but also so-called "core" structures with two or three R-groups attached to it. For example, if the known structures can be described as "A-B-C", "A-D-C", "E-B-F" and "E-D-F" with activities of 1.2, 1.5, 2.3, and 2.6 respectively then we can do the following analysis:
A-B-C transforms to A-D-C with an activity shift of 0.3. E-B-F transforms to E-D-F with an activity shift of 0.3. Both transforms can be described as R1-B-R2 to R1-D-R2. Perhaps R1-B-R2 to R1-D-R2 in general causes a shift of 0.3?
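The bookkeeping in that analysis can be sketched in a few lines of Python. This is only a toy: it assumes every structure has already been split into an (R1, core, R2) triple, and the function name `find_core_transforms` is made up for this example. Real MMP tools have to find the fragments first.

```python
from collections import defaultdict
from itertools import combinations

# Toy data from the text: (R1, core, R2) triples and their activities.
ACTIVITIES = {
    ("A", "B", "C"): 1.2,
    ("A", "D", "C"): 1.5,
    ("E", "B", "F"): 2.3,
    ("E", "D", "F"): 2.6,
}


def find_core_transforms(activities):
    """Collect activity shifts per core swap for pairs sharing both R-groups."""
    shifts = defaultdict(list)
    pairs = combinations(sorted(activities.items()), 2)
    for (mol1, act1), (mol2, act2) in pairs:
        (r1, core1, r2), (s1, core2, s2) = mol1, mol2
        # A matched pair: same R-groups on both sides, different core.
        if (r1, r2) == (s1, s2) and core1 != core2:
            # Round away floating-point noise in the difference.
            shifts[(core1, core2)].append(round(act2 - act1, 6))
    return dict(shifts)


transforms = find_core_transforms(ACTIVITIES)
for (old, new), deltas in transforms.items():
    mean = sum(deltas) / len(deltas)
    print(f"R1-{old}-R2 >> R1-{new}-R2: n={len(deltas)}, mean shift={mean:+.2f}")
```

With the four structures above this finds a single transform, B to D, supported by two pairs that each shift the activity by 0.3. Two pairs is of course far too few to trust; the point is only to show how the pairs get grouped by transform before any statistics are applied.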
It's not quite as easy as this, because the molecular fragments aren't so easily identified. A molecule might be described as "A-B-C", as well as "E-Q-F" and "E-H" and "C-T(-P)-A", where "T" has three R-groups connected to it.
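To get a feel for how quickly the number of descriptions grows, here is a minimal sketch that enumerates the ways to cut a hypothetical linear chain of four units at one or two bonds. The chain, the unit labels, and the function name `enumerate_fragmentations` are all assumptions for illustration. Real molecules have rings and branches, and which bonds may be cut is a chemical decision this toy ignores.

```python
from itertools import combinations


def enumerate_fragmentations(chain):
    """Yield every way to cut one or two bonds of a linear chain of units."""
    bonds = range(1, len(chain))  # bond i sits between chain[i-1] and chain[i]
    for i in bonds:
        # One cut: an R-group and the rest of the structure.
        yield ("-".join(chain[:i]), "-".join(chain[i:]))
    for i, j in combinations(bonds, 2):
        # Two cuts: a core in the middle with an R-group on each side.
        yield ("-".join(chain[:i]), "-".join(chain[i:j]), "-".join(chain[j:]))


toy_molecule = ["a", "b", "c", "d"]  # hypothetical four-unit chain
fragmentations = list(enumerate_fragmentations(toy_molecule))
for pieces in fragmentations:
    print(" | ".join(pieces))
```

Even this tiny chain has six descriptions: three single cuts and three double cuts. Each one is a different way to label the same molecule as an R-group plus a rest, or as a core with two R-groups.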
Thanks
Thanks to EPAM Life Sciences for their Ketcher tool, which I used for the structure depictions that weren't public domain on Wikipedia.