Discussion:
[mongodb-dev] Adding free-implementation of Arabic Language support to Text search.
Kefah Issa
2018-03-02 00:38:23 UTC
Hello,

Currently, text search support for Arabic is only possible on MongoDB
Enterprise, with a dependency on a third-party proprietary component that
requires a separate license: Basis Technology Rosette Linguistics Platform
(RLP) is used to perform normalization, word breaking, sentence breaking,
and stemming or tokenization, depending on the language.

I would like to champion a free / open source implementation of Arabic
text search for MongoDB. The support would include normalization,
stemming, word breaking, etc.

As such, I would like some basic guidance / hints on how that can be done
for MongoDB:

1. What are the possible implementation languages: C++, JavaScript?
2. What is the required interface / API / ABI?
3. Is there an available sample language codebase that I can use as a
skeleton, e.g. English?
4. How can I set up MongoDB to use a custom language support extension so
I can test it locally before submitting?

That implementation could then easily be extended by others to support
other languages such as Farsi (Iranian/Persian) and Urdu.

Thank you in advance for your help and guidance.

Regards,
- Kefah.
--
You received this message because you are subscribed to the Google Groups "mongodb-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-dev+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-dev/65c52503-ab53-46f0-97c8-67c975f3e255%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
'Mark Benvenuto' via mongodb-dev
2018-03-02 22:31:54 UTC
To answer your questions:
1. It would need to be C/C++.
2. The basic MongoDB interface is pretty simple because we rely on
third-party libraries to do the tokenization and stemming.

The basic tokenizer interface is here:
https://github.com/mongodb/mongo/blob/master/src/mongo/db/fts/fts_unicode_tokenizer.cpp
Language registration is here:
https://github.com/mongodb/mongo/blob/master/src/mongo/db/fts/fts_language.cpp

3. The main library we use for English and other languages is Snowball
which is based on Porter's work. MongoDB does not actually have any
stemming code itself, just code to integrate Snowball and do scoring. See
http://snowballstem.org/. I do not know how well Arabic fits this stemming
model.

4. To set up a custom language, just modify this registration function:
https://github.com/mongodb/mongo/blob/70e200e98474d1a29339bf536f348257e8f83a9d/src/mongo/db/fts/fts_language.cpp#L141-L151

Mark
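
For readers following this thread: the normalization step mentioned in the original post can be illustrated with a short C++ sketch. This is not MongoDB or Snowball code. The function name normalizeArabic is hypothetical, and the three rules it applies (dropping the diacritics U+064B to U+0652, dropping tatweel U+0640, and folding the alef variants U+0622, U+0623, and U+0625 to bare alef U+0627) are only an assumed, commonly used subset of what a real Arabic analyzer might do before stemming. It works on already-decoded code points, so UTF-8 decoding is left out.

```cpp
#include <string>

// Hypothetical helper for illustration only; not part of MongoDB or Snowball.
// Applies a few common Arabic orthographic normalizations to a sequence of
// Unicode code points.
std::u32string normalizeArabic(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        // Drop diacritics (fathatan through sukun).
        if (c >= 0x064B && c <= 0x0652) {
            continue;
        }
        // Drop tatweel (kashida), a purely typographic elongation mark.
        if (c == 0x0640) {
            continue;
        }
        // Fold alef with madda, hamza above, or hamza below to bare alef.
        if (c == 0x0622 || c == 0x0623 || c == 0x0625) {
            out.push_back(0x0627);
            continue;
        }
        out.push_back(c);
    }
    return out;
}
```

A normalizer along these lines would run on each token before it is handed to whatever stemmer is chosen, so that vocalized and unvocalized spellings of the same word index to the same term.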