A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. This pattern is typically used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation (1).
Depending on the syntax used, a character or group of characters within a word can be repeated zero, one or several times, which defines a set of words with similar spellings.
Regular expressions used in search queries make it possible to find results despite spelling variations or mistakes. They must be enclosed in slashes, as follows:
au.\*:/joh?n(ath[oa]n)/
They also make it possible to work around a limitation of the Elasticsearch search engine: truncation cannot be used inside an exact-phrase or phrase-type (proximity) search. This is useful, for example, when searching for all the publications of an author with a common name.
To search for the documents associated with “Martin J.”, one would write:
au.raw:/[Mm][Aa][Rr][Tt][Ii][Nn][, ]+[Jj].*/
Building a Regular Expression, Step by Step
This section goes step by step through the construction of the regular expression that retrieves the documents associated with the author “Martin J.”:
au.raw:/[Mm][Aa][Rr][Tt][Ii][Nn][, ]+[Jj].*/
Regular expressions follow specific writing rules. The following symbols have these meanings only inside a regular expression:
? means that the previous item occurs 0 or 1 time
* means that the previous item occurs 0 or more times
+ means that the previous item occurs 1 or more times
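The three quantifiers above can be tried out with a short sketch. It uses Python's re module as a stand-in for the Elasticsearch engine, which is an assumption: the two syntaxes differ in places, but ?, * and + behave the same way. re.fullmatch is used because an Elasticsearch regular expression has to match the whole indexed term. The helper name matches and the sample terms are hypothetical.

```python
import re

def matches(pattern: str, term: str) -> bool:
    """Return True when the pattern matches the whole term (anchored match)."""
    return re.fullmatch(pattern, term) is not None

terms = ["jon", "john", "johhn"]

# ? : the previous item ("h") occurs 0 or 1 time
optional_h = [t for t in terms if matches(r"joh?n", t)]

# * : the previous item occurs 0 or more times
star_h = [t for t in terms if matches(r"joh*n", t)]

# + : the previous item occurs 1 or more times
plus_h = [t for t in terms if matches(r"joh+n", t)]

print(optional_h)  # ['jon', 'john']
print(star_h)      # ['jon', 'john', 'johhn']
print(plus_h)      # ['john', 'johhn']
```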
In this example, we first have to choose between searching all the subindexes of the “au” author index and searching a single subindex. Here it is good practice to use only the “raw” subindex, because it does not split the field into terms and does not process the character string. The author’s name, “Martin J”, is therefore left unchanged.
For further information, see: Expert Search
1) Given this choice, we write
au.raw:
2) The regular expression is always written between two slashes
au.raw:/ /
3) Each letter may be in upper or lower case
au.raw:/[Mm][Aa][Rr][Tt][Ii][Nn]/
4) The name may be followed by a comma or a space
au.raw:/[Mm][Aa][Rr][Tt][Ii][Nn][, ]/
5) The separator may be repeated (for example, a comma followed by a space), hence the +, and it must be followed by the initial J in upper or lower case
au.raw:/[Mm][Aa][Rr][Tt][Ii][Nn][, ]+[Jj]/
6) The initial may be followed by any characters, such as a period or the rest of the first name: . matches any single character and * repeats it zero or more times
au.raw:/[Mm][Aa][Rr][Tt][Ii][Nn][, ]+[Jj].*/
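The effect of each step can be checked with a small sketch, again using Python's re module as an approximation of the Elasticsearch engine (an assumption), with re.fullmatch to mimic the fact that the expression must match the whole content of the “raw” subindex. The sample author strings are invented for the illustration.

```python
import re

def matches(pattern: str, term: str) -> bool:
    """Return True when the pattern matches the whole term (anchored match)."""
    return re.fullmatch(pattern, term) is not None

name = "[Mm][Aa][Rr][Tt][Ii][Nn]"

# The expression as it stands after steps 3 to 6
steps = {
    "step 3": name,
    "step 4": name + "[, ]",
    "step 5": name + "[, ]+[Jj]",
    "step 6": name + "[, ]+[Jj].*",
}

# Hypothetical author strings, as they might be stored in the subindex
samples = ["Martin", "MARTIN ", "Martin, J", "martin J.", "Martin, Jean", "Martinez, J."]

results = {
    label: [s for s in samples if matches(pattern, s)]
    for label, pattern in steps.items()
}

for label, accepted in results.items():
    print(label, accepted)
```

Only the final expression accepts “Martin, J”, “martin J.” and “Martin, Jean” together, while “Martinez, J.” is rejected at every step because the letter after “Martin” is neither a comma nor a space.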
Square brackets list the alternatives allowed for a single character, while parentheses group one or more characters.
In the following regular expression:
au.\*:/joh?n(ath[oa]n)/
? indicates that the letter “h” may be absent or present once,
[oa] means that either an “o” or an “a” appears at this position in the author’s name.
With this regular expression, one will be able to retrieve the authors named jonathan, jonathon, johnathon or johnathan.
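As a quick check, the expression can be run against a few candidate names. This is a sketch that uses Python's re module in place of the Elasticsearch engine (an assumption), with re.fullmatch so that the whole name has to match. The list of candidates, including the names that should be rejected, is hypothetical.

```python
import re

pattern = r"joh?n(ath[oa]n)"

# Hypothetical candidate names; the last three should be rejected
candidates = [
    "jonathan", "jonathon", "johnathon", "johnathan",
    "john",        # rejected: the group (ath[oa]n) is mandatory
    "jonathen",    # rejected: "e" is not among the choices [oa]
    "johnathans",  # rejected: the expression must match the whole name
]

matched = [c for c in candidates if re.fullmatch(pattern, c)]
print(matched)  # ['jonathan', 'jonathon', 'johnathon', 'johnathan']
```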
Note that this search in the author index “au” is not limited to the “raw” subindex: it covers all the subindexes declared for this index, namely “raw”, “fold” and “rich”.
For further information, see: Index and Subindex List
If one wishes to allow more than two alternatives for a character, one may write them as follows:
(é|e|è)
This notation handles the choice between several accented forms of a letter, or the possibility that commas are preceded or followed by a space in the metadata.
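Both uses of this notation can be illustrated with a minimal sketch. Python's re module stands in for the Elasticsearch engine (an assumption), and the surname, the separator pattern and the test strings are invented for the example.

```python
import re

def matches(pattern: str, term: str) -> bool:
    """Return True when the pattern matches the whole term (anchored match)."""
    return re.fullmatch(pattern, term) is not None

# Choice between several accented forms of the same letter
accent_pattern = "Lef(é|e|è)vre"
accent_hits = [n for n in ["Lefèvre", "Lefevre", "Lefévre", "Lefebvre"]
               if matches(accent_pattern, n)]

# Comma alone, comma followed by a space, or comma preceded by a space
separator_pattern = "martin( ,|, |,)j"
separator_hits = [n for n in ["martin,j", "martin, j", "martin ,j", "martin j"]
                  if matches(separator_pattern, n)]

print(accent_hits)     # ['Lefèvre', 'Lefevre', 'Lefévre']
print(separator_hits)  # ['martin,j', 'martin, j', 'martin ,j']
```

When every alternative is a single character, as in the accent example, the square-bracket notation [éeè] gives the same result. The alternation with | is needed as soon as an alternative is longer than one character, as in the separator example.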
For further information on the rules governing writing of regular expressions in Elasticsearch, see:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-regexp-query.html#regexp-syntax