In this post I want to talk about multilingual indexing.
There are two main architectures for indexing. The first, the centralized architecture, appears adequate for indexing multilingual documents because it uses a single index, but it has been shown to have some problems. One major problem is that term weights are usually overweighted. This happens because the total number of documents (N) increases while the number of documents containing a term (DF) stays unchanged. For example, consider a collection containing 6,000 monolingual Arabic documents along with 70,000 documents in English. The IDF of a term, which estimates the term's importance, is computed as log(N/DF), where N is the number of all documents. When all documents are placed together in a single collection, the N value used for Arabic terms increases to 76,000 instead of 6,000. This inflates the weights of those terms, so documents from small collections are preferred.
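To make the overweighting concrete, here is a small Python sketch of the arithmetic. The collection sizes come from the example above; the document frequency of 60 and the choice of base-10 logarithms are assumptions made only for illustration.

```python
import math

def idf(total_docs: int, doc_freq: int) -> float:
    """Inverse document frequency, log(N / DF), using base-10 logs (an assumption)."""
    return math.log10(total_docs / doc_freq)

# Hypothetical: an Arabic term that appears in 60 Arabic documents.
DF_ARABIC_TERM = 60

N_ARABIC_ONLY = 6_000            # Arabic sub-collection indexed on its own
N_CENTRALIZED = 6_000 + 70_000   # Arabic + English in a single index

idf_monolingual = idf(N_ARABIC_ONLY, DF_ARABIC_TERM)
idf_centralized = idf(N_CENTRALIZED, DF_ARABIC_TERM)

print(f"IDF in Arabic-only index: {idf_monolingual:.3f}")
print(f"IDF in centralized index: {idf_centralized:.3f}")
```

The term occurs in exactly the same documents in both cases, yet its IDF is larger in the centralized index simply because N grew from 6,000 to 76,000.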
The second approach to indexing is the distributed approach. For multilingual querying, the dominant approach in a distributed architecture is to translate the user query into the target language(s), run a monolingual, language-specific search on each sub-collection, and then merge the results.
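The translate, search, and merge flow can be sketched roughly as follows. Everything here is a toy stand-in: the sub-collections, the translation table, the term-count scoring, and the merge by raw score are all hypothetical, since the post does not say which translation or merging method is used.

```python
from collections import Counter

# Hypothetical sub-collections: one monolingual index per language.
SUB_COLLECTIONS = {
    "en": {"en1": "multilingual indexing of documents", "en2": "weather report"},
    "ar": {"ar1": "فهرسة الوثائق", "ar2": "تقرير الطقس"},
}

# Hypothetical toy translation table standing in for a real query translator.
TRANSLATIONS = {
    "ar": {"indexing": "فهرسة", "documents": "الوثائق"},
}

def translate(query_terms, target_lang):
    """Translate term by term; English queries pass through unchanged."""
    if target_lang == "en":
        return list(query_terms)
    table = TRANSLATIONS.get(target_lang, {})
    return [table[t] for t in query_terms if t in table]

def search(sub_collection, query_terms):
    """Monolingual search: score = number of query-term occurrences in the doc."""
    results = []
    for doc_id, text in sub_collection.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            results.append((doc_id, score))
    return results

def distributed_search(query_terms):
    """Translate, search every sub-collection, then merge by raw score."""
    merged = []
    for lang, sub_collection in SUB_COLLECTIONS.items():
        merged.extend(search(sub_collection, translate(query_terms, lang)))
    # Simplest possible merge (an assumption): sort by score, then by doc id.
    return sorted(merged, key=lambda pair: (-pair[1], pair[0]))

ranked = distributed_search(["indexing", "documents"])
print(ranked)
```

Merging by raw score is the crudest option and is used here only to keep the sketch short; scores from different sub-collections are generally not directly comparable, which is why the merging step is a design problem of its own.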
The approach proposed in the paper referenced at the end of this post combines the centralized and distributed architectures. On the centralized side, only multilingual documents are placed in a centralized index, instead of indexing both monolingual and multilingual documents there.
Unlike a centralized architecture, a typical distributed architecture does not favor collections with a small number of documents. The retrieval performance of each monolingual run is also much better than in a centralized architecture, because the queries and the documents in each distributed sub-collection index are in the same language. Therefore, the proposed combined index creates a distributed monolingual sub-collection for each language, and each one holds monolingual documents only. Documents written in multiple languages are not included in these monolingual sub-collections. The significant benefit of indexing only monolingual documents in the distributed part is efficient retrieval in each sub-collection, because queries and documents share the same language. In addition, since multilingual documents are kept out of the monolingual indexes, these documents do not have to be partitioned and do not overlap across the individual result lists, unlike in a normal distributed index, which does not take the multilingual nature of such documents into account.
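As a rough sketch of how documents could be routed in such a combined index: monolingual documents go to a per-language sub-collection, and multilingual documents go to a single centralized index. The function name, the document ids, and the assumption that each document already comes with a known set of language codes are all hypothetical.

```python
from collections import defaultdict

def build_combined_index(documents):
    """Route each document by the set of languages it contains.

    `documents` maps doc_id -> set of language codes. The language sets are
    assumed to be known already; language identification is out of scope here.
    """
    monolingual = defaultdict(list)  # one distributed sub-collection per language
    multilingual = []                # single centralized index for mixed documents
    for doc_id, languages in documents.items():
        if not languages:
            raise ValueError(f"document {doc_id!r} has no language")
        if len(languages) == 1:
            (lang,) = languages
            monolingual[lang].append(doc_id)
        else:
            multilingual.append(doc_id)
    return dict(monolingual), multilingual

# Hypothetical input: three monolingual documents and one mixed document.
docs = {
    "d1": {"ar"},
    "d2": {"en"},
    "d3": {"ar", "en"},
    "d4": {"en"},
}
mono_index, multi_index = build_combined_index(docs)
print(mono_index)   # per-language sub-collections
print(multi_index)  # centralized index of multilingual documents
```

Because the mixed document d3 lands only in the centralized index, it is never split between the Arabic and English sub-collections and cannot appear in both result lists.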
Reference: Mohammed Mustafa, Izzedin Osman, Hussein Suleman, "Indexing and Weighting of Multilingual and Mixed Documents", ACM, 2011.