|
That's correct, I think there's value in defining a minimal syntax for foo.
I appreciate the differing implementations and the constraint this may impose. However, as you state, the main use case is limited and requires only a handful of FTS constructs. If the standardised FTS language were to include only AND, OR, NOT, phrases (quoted terms) and single keywords, can our implementations (which provide FTS) support this subset of capabilities? If not, then we can't standardise. If so, it shouldn't be difficult to define a syntax, which JCR has already done. JCR supports: (AND, OR, NOT, Phrases)
- terms separated by space are AND'ed
- terms preceded by '-' are NOT
- terms can be separated by OR

If there is consensus that this is needed (for a use case in scope), I can live with adding support for AND, OR, NOT and phrases. I would like to defer this to 2.0.

From JCR:

A query satisfies a FullTextSearch constraint if the value (or values) of the full-text indexed properties within the full-text search scope satisfy the specified fullTextSearchExpression, evaluated as follows:
- A term not preceded with "-" (minus sign) is satisfied only if the value contains that term.
- A term preceded with "-" (minus sign) is satisfied only if the value does not contain that term.
- Terms separated by whitespace are implicitly "ANDed".
- Terms separated by "OR" are "ORed".
- "AND" has higher precedence than "OR".
- Within a term, each double quote ("), "-" (minus sign), and "\" (backslash) must be escaped by a preceding "\" (backslash).

The query is invalid if:
- selectorName is not the name of a selector in the query, or
- fullTextSearchExpression does not conform to the above grammar (as augmented by the implementation).

I'd also like to see this deferred. While I agree that there's value in having this included in the CMIS specification at some point, I think it's actually a larger change to the spec than I'd feel comfortable taking for V1 at this point.
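The JCR evaluation rules quoted above can be sketched in a few lines of Python. This is only an illustration, not any repository's implementation: the names `tokenize` and `matches` are hypothetical, and "contains" is assumed here to mean a case-insensitive substring match, which JCR itself leaves to the implementation.

```python
def tokenize(expr):
    """Split an expression into 'OR' markers and (negated, text) terms."""
    tokens = []
    i, n = 0, len(expr)
    while i < n:
        if expr[i].isspace():
            i += 1
            continue
        negated = expr[i] == "-"          # unescaped leading minus = NOT
        if negated:
            i += 1
        quoted = i < n and expr[i] == '"'  # opening quote starts a phrase
        if quoted:
            i += 1
        chars, escaped = [], False
        while i < n:
            c = expr[i]
            if c == "\\" and i + 1 < n:    # backslash escapes the next char
                chars.append(expr[i + 1])
                escaped = True
                i += 2
                continue
            if quoted and c == '"':        # closing quote ends the phrase
                i += 1
                break
            if not quoted and c.isspace():
                break
            chars.append(c)
            i += 1
        text = "".join(chars)
        if not text:
            raise ValueError("empty term")
        if text == "OR" and not (negated or quoted or escaped):
            tokens.append("OR")
        else:
            tokens.append((negated, text))
    return tokens


def matches(expr, value):
    """Evaluate expr against one indexed value (assumed substring match)."""
    groups = [[]]                          # OR-separated groups of ANDed terms
    for tok in tokenize(expr):
        if tok == "OR":
            groups.append([])
        else:
            groups[-1].append(tok)
    if any(not g for g in groups):
        raise ValueError("OR needs a term on both sides")
    haystack = value.lower()

    def term_ok(term):
        negated, text = term
        return (text.lower() in haystack) != negated

    # AND binds tighter than OR: any group whose terms all hold is a match.
    return any(all(term_ok(t) for t in g) for g in groups)
```

For example, `matches('foo bar OR baz', 'baz alone')` holds because the expression reads as (foo AND bar) OR baz, while `matches('foo -bar', 'foo bar')` does not.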
The implication of this change is that:
1) Many repositories would have to implement an FTS query parser/translator.
2) It's not clear what a repository would do if there are subsets of the query language we define that they do NOT support (e.g. what if the repository doesn't support the "-" indicated above?).

Could we at least define a common ground on implicit ANDs, and say that FTS query:
foo bar
means search for foo AND bar? This semantic knowledge would be very useful for clients (and there's no way to efficiently do an AND from individual terms if you don't know how to express it to the server in a single query).

As requested, proposal for FTS including BNF.
http://www.oasis-open.org/apps/org/workgroup/cmis/download.php/32526/cmis_fts_proposal_v0_2.txt

I updated David's proposal to include a simpler version than JSR-283, one that deals with neither phrase search nor ORing of terms.
http://www.oasis-open.org/apps/org/workgroup/cmis/download.php/32637/cmis_fts_proposal_v0_3.txt

What is the precise semantics for a Boolean conjunction (AND)?
The CONTAINS() function does not specify the FT search scope, simply because a repository's FTS capability largely depends on what objects and which values of an object are FT-indexed, and what source identities are captured in the index. For some repositories, a single document may have multiple FT-index values: the separate values of a multi-valued property, the values of different properties, the content stream. This offers a finer-grained FTS capability. For others, all these are treated as parts of a single logical FT value.

Now, a Boolean expression "proposition-A AND proposition-B" is true iff both prop-A and prop-B are true for the same value. What is "the same value" here? Does it mean the same property value or the same content stream? An FTS implementation that treats everything indexed for a document as a part of a single logical value would not be able to support this fine-grained semantics. Then, should it mean everything associated with a document that is FT-indexed? An FTS implementation that uses the property value id as source id, but not the document id, would have difficulty supporting this coarse semantics without a great deal of post-processing. The same ambiguity applies to disjunction (OR) and negation (-), although the problem may be less severe in practice. Should there be a fixed semantics, or would this be repository-specific?

I would also like to see enhanced FT deferred for v1, to be included in a future specification with the right capability balance.
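The ambiguity about "the same value" can be made concrete with a small sketch. Everything here is hypothetical: the document is modelled as a plain dict of separately indexed values, and the helper names `and_per_value` and `and_per_document` are invented for illustration, with word-level, case-insensitive matching assumed.

```python
def contains(text, term):
    """Assumed matching rule: case-insensitive whole-word match."""
    return term.lower() in text.lower().split()


def and_per_value(doc_values, terms):
    """Fine-grained AND: all terms must occur in one and the same value."""
    return any(
        all(contains(value, t) for t in terms)
        for value in doc_values.values()
    )


def and_per_document(doc_values, terms):
    """Coarse AND: each term may occur in any indexed value of the document."""
    return all(
        any(contains(value, t) for value in doc_values.values())
        for t in terms
    )


# One document with two separately indexed values (hypothetical data).
doc = {
    "title": "Quarterly report",
    "content_stream": "Budget figures for the finance team",
}

terms = ["report", "budget"]
fine = and_per_value(doc, terms)       # False: no single value has both terms
coarse = and_per_document(doc, terms)  # True: the document as a whole has both
```

The same query therefore matches or misses the document depending only on which scope the repository's index supports, which is the interoperability question raised above.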
For Documentum, terms separated by whitespace are implicitly "ORed". This is the opposite of the current proposals and JCR. Given that full-text grammars are so diverse, perhaps we should consider a grammar that avoids implicit behavior? Also, the current proposal does not mention case or wildcard expectations.

If we don't specify a syntax, every single CMIS client will have to have a hardcoded list of vendors and each one's behavior. I don't feel that's a good way of doing interop.
What I see in projects makes me think that ANDing of terms is what's most useful to humans using a search engine, so if there should be a minimal syntax I suggest that ANDing be the default. Note that you can implement ORing by doing several queries without great loss of performance, but only the server can AND efficiently. Hence my proposal 0. Specifics can be discussed; for instance, if people feel that exclusion is already too much, then we can have a proposal -1 that doesn't deal with term exclusion.
Case: I would leave it to repositories. Most are case-insensitive in full-text search, though, aren't they?
Wildcard: that's gotta be another can of worms, so I'd leave it unspecified, but I agree that it's a concern.

David:
ANDing should be considered loosely, "best effort", with the intent that the user wants to see all documents containing both terms, for whatever definition of "containing" the repository feels is best suited. We cannot mandate strict semantics in any case, as most of full-text matching (scope, canonicalization, stemming, thesaurus, etc.) is already repository-specific. |
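The point made in this thread, that ORing can be emulated with several queries while ANDing is only efficient on the server, can be illustrated with a rough sketch. The `search` function and the in-memory `INDEX` are hypothetical stand-ins for issuing one single-keyword CONTAINS() query against a repository; no real CMIS client API is implied.

```python
# Hypothetical in-memory stand-in for a repository's full-text index.
INDEX = {
    "doc1": "foo bar",
    "doc2": "foo only",
    "doc3": "bar only",
}


def search(term):
    """Stand-in for one single-keyword query; returns matching document ids."""
    return {doc_id for doc_id, text in INDEX.items() if term in text.split()}


def client_side_or(terms):
    """OR: one query per term, results merged. Every hit fetched is useful."""
    result = set()
    for term in terms:
        result |= search(term)
    return result


def client_side_and(terms):
    """AND: the client must fetch every term's full hit list, then intersect.

    Most fetched hits may be discarded, which is why only the server can
    evaluate a conjunction efficiently.
    """
    result_sets = [search(term) for term in terms]
    return set.intersection(*result_sets) if result_sets else set()
```

With this toy index, `client_side_or(["foo", "bar"])` yields all three documents, whereas `client_side_and(["foo", "bar"])` retrieves four hits across two queries only to keep `doc1`.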
AND
OR
NOT
as well as phrases. If we standardize the above behavior, we also need to add a capability for the level of support: simple/black box vs. enhanced.
I am not sure it is required at this point to support a standard contains syntax in 1.0. A lot of us implement FTS in a variety of ways: Autonomy/IDOL, Autonomy/Verity, Lucene, Omnifind, DB-specific FTS (DB2, Oracle, etc.), FAST, etc.
The main use case I am aware of is a client wanting to use single keywords in contains() - e.g., contains(foo) rather than complex FTS queries. I do not think that warrants specifying the syntax for AND, OR, NOT, phrases in contains.