
Key: CMIS-144
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Ethan Gur-esh
Reporter: David Caruana
Watchers: 0
OASIS Content Management Interoperability Services (CMIS) TC

Full text search syntax and semantics

Created: 15/Apr/09 01:24 PM   Updated: 11/Jan/10 06:41 PM
Component/s: Domain Model
Affects Version/s: Draft 0.60
Fix Version/s: Draft 0.62

Proposal:
As defined by JSR-283, but modified to the style of the CMIS SQL grammar.

Note:
<space> & <non space char> definitions taken directly from JSR-283, but consideration of whitespace is probably needed.

<text search expression> ::= <disjunct> {<space> OR <space> <disjunct>}
<disjunct> ::= <term> {<space> <term>}
<term> ::= ['-'] <simple term>
<simple term> ::= <word> | <phrase>
<word> ::= <non space char> {<non space char>}
<phrase> ::= '"' <word> {<space> <word>} '"'
<space> ::= <space char> {<space char>}
<non space char> ::= <char> - <space char> /* Any Char except SpaceChar */
<space char> ::= ' '
<char> ::= /* Any character */
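
For illustration only, the grammar above can be sketched as a small hand-written parser. This is a minimal sketch, not part of the proposal: the names `tokenize` and `parse` and the `(negated, text)` tuple representation are hypothetical, escape sequences are kept verbatim rather than unescaped, and only the single space character is treated as whitespace, as in the `<space char>` rule.

```python
from typing import List, Tuple

Term = Tuple[bool, str]  # (negated, text); hypothetical representation


def tokenize(expr: str) -> List[str]:
    """Split on spaces, keeping a quoted phrase (with optional '-') as one token."""
    tokens, i, n = [], 0, len(expr)
    while i < n:
        if expr[i] == " ":
            i += 1
            continue
        start = i
        if expr[i] == "-":
            i += 1
        if i < n and expr[i] == '"':
            i += 1
            while i < n and expr[i] != '"':
                # Skip a backslash together with the character it escapes.
                i += 2 if expr[i] == "\\" else 1
            if i >= n:
                raise ValueError("unterminated phrase")
            i += 1  # consume the closing quote
        else:
            while i < n and expr[i] != " ":
                i += 1
        tokens.append(expr[start:i])
    return tokens


def parse(expr: str) -> List[List[Term]]:
    """Return a list of disjuncts; each disjunct is a list of (negated, text)."""
    disjuncts: List[List[Term]] = [[]]
    for tok in tokenize(expr):
        if tok == "OR":
            if not disjuncts[-1]:
                raise ValueError("OR must follow a term")
            disjuncts.append([])
            continue
        negated = tok.startswith("-")
        body = tok[1:] if negated else tok
        if len(body) >= 2 and body[0] == '"' and body[-1] == '"':
            body = body[1:-1]  # strip the phrase quotes
        if not body:
            raise ValueError("empty term")
        disjuncts[-1].append((negated, body))
    if not disjuncts[-1]:
        raise ValueError("expression must end with a term")
    return disjuncts
```

For example, `parse('foo bar OR -"baz qux"')` yields two disjuncts: one with the terms `foo` and `bar`, and one with the negated phrase `baz qux`.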

## Proposal 1 Semantics

All proposal 0 semantics, plus

- Terms separated by whitespace are implicitly "ANDed"
- Terms separated by "OR" are "ORed"
- "AND" has higher precedence than "OR"
- Within a word, each double quote (") must also be escaped by a preceding "\" (backslash)
Resolution: Proposal 1 in doc in tc
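
As a rough illustration of the Proposal 1 semantics listed above, the sketch below evaluates an already-parsed expression against a single text value. It assumes the expression is represented as a list of disjuncts, each a list of `(negated, text)` pairs; that representation and the names `contains` and `matches` are hypothetical. "Contains" is modelled here as a case-insensitive substring match, which is purely an assumption, since actual matching is repository-specific.

```python
from typing import List, Tuple

Term = Tuple[bool, str]  # (negated, text); hypothetical representation


def contains(value: str, text: str) -> bool:
    # Assumption: case-insensitive substring match. Real repositories differ
    # (tokenization, stemming, etc.).
    return text.lower() in value.lower()


def matches(disjuncts: List[List[Term]], value: str) -> bool:
    # OR across disjuncts, AND within each disjunct, so AND binds tighter
    # than OR. A negated term is satisfied when the value does NOT contain it.
    return any(
        all(contains(value, text) != negated for negated, text in terms)
        for terms in disjuncts
    )


# 'foo bar OR baz' parsed by hand into the assumed representation.
query = [[(False, "foo"), (False, "bar")], [(False, "baz")]]
print(matches(query, "foo and bar"))    # True: first disjunct satisfied
print(matches(query, "only baz here"))  # True: second disjunct satisfied
print(matches(query, "foo only"))       # False: neither disjunct satisfied
```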


Description:
The text search expression is defined as a <character string literal> (as defined by SQL-92). However, the syntax and semantics of the full text search expression are repo specific.


I remember there was some resistance to defining a 'lowest common denominator' full text search language, but I don't remember why.


Given that we define SQL, and that query is a key use case, I think there's value in a deeper FTS definition.


As a starting point, JCR provides a minimal definition. I'm not sure we would need to go much further than that to start with.

Comments:
Al Brown added a comment - 15/Apr/09 02:28 PM
I think you are asking to clarify CONTAINS(foo) syntax. In particular whether or not a client can do 'foo AND bar', '+foo +bar', 'foo <NOT> bar', '+foo -bar' and the particular syntax a client can use to express:
AND
OR
NOT

as well as phrases. If we standardize the above behavior, we also need to add a capability indicating the level of support: simple/black box vs. enhanced.

I am not sure it is required at this point to support a standard contains syntax in 1.0. A lot of us implement FTS in a variety of ways - Autonomy/IDOL, Autonomy/Verity, Lucene, Omnifind, DB-specific FTS (DB2, Oracle, etc), FAST, etc.

The main use case I am aware of is a client wanting to use single keywords in contains() - e.g., contains(foo) rather than complex FTS queries. I do not think that warrants specifying the syntax for AND, OR, NOT, phrases in contains.

David Caruana added a comment - 15/Apr/09 04:01 PM
That's correct, I think there's value in defining a minimal syntax for foo.

I appreciate the differing implementations and the constraint this may impose. However, as you state, the main use case is limited which requires only a handful of FTS constructs.

If the standardised FTS language were to only include AND, OR, NOT, phrases (quoted terms) and single keywords, can our implementations (which provide FTS) support this subset of capabilities?

If not, then we can't standardise. If so, it shouldn't be difficult to define a syntax, which JCR has already done.

Al Brown added a comment - 15/Apr/09 06:28 PM
JCR supports: (AND, OR, NOT, Phrases)
- terms separated by space are AND'ed
- terms preceded by '-' are NOT
- terms can be separated by OR

If there is consensus that this is needed (for a use case in scope), I can live with adding support for AND, OR, NOT and phrases. I would like to defer this to 2.0.

From JCR:
A query satisfies a FullTextSearch constraint if the value (or values) of the full-text indexed properties within the full-text search scope satisfy the specified fullTextSearchExpression, evaluated as follows:
- A term not preceded with "-" (minus sign) is satisfied only if the value contains that term.
- A term preceded with "-" (minus sign) is satisfied only if the value does not contain that term.
- Terms separated by whitespace are implicitly "ANDed".
- Terms separated by "OR" are "ORed".
- "AND" has higher precedence than "OR".
- Within a term, each double quote ("), "-" (minus sign), and "\" (backslash) must be escaped by a preceding "\" (backslash).
The query is invalid if:
- selectorName is not the name of a selector in the query, or
- fullTextSearchExpression does not conform to the above grammar (as augmented by the implementation).
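
To make the quoted JCR escaping rule concrete, here is a minimal sketch of how a client might escape a user-supplied term before embedding it in a full-text search expression. The helper name `escape_term` is hypothetical; it covers only the three characters named in the rule above.

```python
def escape_term(term: str) -> str:
    """Prefix each double quote, minus sign, and backslash with a backslash."""
    out = []
    for ch in term:
        if ch in '"-\\':
            out.append("\\")
        out.append(ch)
    return "".join(out)


print(escape_term("e-mail"))  # e\-mail
```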

Ethan Gur-esh added a comment - 15/Apr/09 06:42 PM
I'd also like to see this deferred. While I agree that there's value in having this included in the CMIS specification at some point, I think it's actually a larger change to the spec than I'd feel comfortable taking for V1 at this point.

The implication of this change is that:
1) Many repositories would have to implement an FTS query parser/translator.
2) It's not clear what a repository would do if there are subsets of the query language we define that it does NOT support (e.g. what if the repository doesn't support the "-" indicated above?).

Florent Guillaume added a comment - 04/May/09 11:34 AM
Could we at least define a common ground on implicit ANDs, and say that FTS query:
  foo bar
means
  search for foo AND bar
?

This semantic knowledge would be very useful for clients (and there's no way to efficiently do an AND from individual terms if you don't know how to express it to the server in a single query).

David Caruana added a comment - 13/May/09 10:09 AM - edited

Florent Guillaume added a comment - 22/May/09 09:19 AM
I updated David's proposal to include a simpler version than JSR-283 that does not deal with phrase search nor ORing of terms.

http://www.oasis-open.org/apps/org/workgroup/cmis/download.php/32637/cmis_fts_proposal_v0_3.txt

David Choy added a comment - 22/May/09 04:17 PM
What is the precise semantics for a Boolean conjunction (AND)?

The CONTAINS() function does not specify the FT search scope, simply because a repository's FTS capability largely depends on what objects and which values of an object are FT-indexed, and what source identities are captured in the index. For some repositories, a single document may have multiple FT-index values: the separate values of a multi-valued property, the values of different properties, the content-stream. This offers a finer-grain FTS capability. For others, all these are treated as parts of a single logical FT value.

Now, a Boolean expression "proposition-A AND proposition-B" is true iff both prop-A and prop-B are true for the same value. What is "the same value" here? Does it mean the same property value or the same content stream? An FTS implementation that treats everything indexed for a document as a part of a single logical value would not be able to support this fine-grain semantics. Then, should it mean everything associated with a document that is FT-indexed? An FTS implementation that uses property value id as source id but not document id would have difficulty supporting this coarse semantics without a great deal of post processing.

The same ambiguity applies to disjunction (OR) and negation (-), although the problem may be less severe in practice.

Should there be a fixed semantics, or would this be repository-specific?
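
The two readings of "the same value" described above can be contrasted with a small sketch. It models a document as a dictionary of separately indexed values and uses a case-insensitive substring match as a stand-in for real full-text matching; the function names, the dictionary model, and the sample document are all hypothetical.

```python
from typing import Dict, List


def and_per_value(values: Dict[str, str], terms: List[str]) -> bool:
    # Fine-grained reading: a single indexed value must contain every term.
    return any(all(t in v.lower() for t in terms) for v in values.values())


def and_per_document(values: Dict[str, str], terms: List[str]) -> bool:
    # Coarse reading: each term may be satisfied by a different indexed value.
    return all(any(t in v.lower() for v in values.values()) for t in terms)


# Hypothetical document: "foo" occurs only in the title, "bar" only in the content.
doc = {"title": "Foo report", "content": "analysis of bar"}
print(and_per_value(doc, ["foo", "bar"]))     # False
print(and_per_document(doc, ["foo", "bar"]))  # True
```

The same query therefore matches or misses the same document depending on which scope the repository applies.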

Norrie Quinn added a comment - 03/Jun/09 08:42 PM
I would also like to see enhanced FT deferred for v1, to be included in a future specification with the right capability balance.

For Documentum, terms separated by whitespace are implicitly "ORed". This is the opposite of the current proposals and JCR. Given that there is such diverse fulltext grammar, perhaps we should consider a grammar that avoids implicit behavior?

Also, the current proposal does not mention case or wildcard expectations.

Florent Guillaume added a comment - 08/Jun/09 06:22 AM
If we don't specify a syntax, every single CMIS client will have to have a hardcoded list of vendors and each one's behavior. I don't feel that's a good way of doing interop.

What I see in projects around makes me think that ANDing of terms is what's most useful to humans using a search engine, so if there should be a minimal syntax I suggest that ANDing be the default. Note that you can implement ORing by doing several queries without great loss of performance, but only the server can AND efficiently.

Therefore my proposal 0. Specifics can be discussed, for instance if people feel that exclusion is already too much then we can have a proposal -1 that doesn't deal with term exclusion.

Case: I would leave it to repositories. Most are case insensitive in fulltext search however, aren't they?
Wildcard: that's gotta be another can of worms, so I'd leave unspecified, but I agree that it's a concern.
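
The point above about emulating OR on the client can be sketched as a union of single-term queries. This is only an illustration: `run_query` is a hypothetical callable standing in for whatever executes one single-term CONTAINS query and returns object ids, and the in-memory `index` is made-up sample data.

```python
from typing import Callable, Iterable, List


def or_by_union(run_query: Callable[[str], Iterable[str]], terms: List[str]) -> List[str]:
    """Run one query per term and merge the results, dropping duplicate ids."""
    seen, result = set(), []
    for term in terms:
        for object_id in run_query(term):
            if object_id not in seen:
                seen.add(object_id)
                result.append(object_id)
    return result


# Hypothetical stand-in for a repository: term -> matching object ids.
index = {"foo": ["d1", "d2"], "bar": ["d2", "d3"]}
print(or_by_union(lambda t: index.get(t, []), ["foo", "bar"]))  # ['d1', 'd2', 'd3']
```

Emulating AND the same way would mean fetching each term's complete result set and intersecting them on the client, which is why a server-side AND is the more valuable default.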
 

Florent Guillaume added a comment - 08/Jun/09 06:27 AM
David:
ANDing should be considered loosely, "best effort", with the intent that the user wants to see all documents containing both terms, for whatever definition of "containing" the repository feels is best suited.

We cannot mandate strict semantics in any case, as already most of fulltext matching (scope, canonicalization, stemming, thesaurus, etc) is repository-specific.

Al Brown added a comment - 15/Jun/09 01:10 PM
accepted

Al Brown added a comment - 16/Jun/09 01:08 PM
Updated proposal field with Proposal #1 from doc

Al Brown added a comment - 05/Jan/10 12:27 AM
JIRA Cleanup