Workflow of the project

As you know, I have a project about indexing in my search engine course.

In this post I am sharing the link to the Prezi I presented in class. The Prezi shows our workflow so far.

Multilingual indexing

In this post I want to talk about multilingual indexing.

There are two architectures for indexing. First, a centralized architecture appears adequate for indexing multilingual documents because it makes use of a single index, but it has been shown to have some problems. One major problem with the centralized architecture is that term weights become inflated. This is because the total number of documents (N) increases while the document frequency of a term (DF) stays unchanged, and thus weights are overestimated. For example, consider a collection containing 6,000 monolingual Arabic documents along with 70,000 documents in English. In a centralized architecture, the value of N in the IDF of a term, which is computed as log(N/DF) in order to estimate a term's importance, increases from 6,000 to 76,000 for the Arabic collection when all documents are placed together in a single index. This inflates term weights, and thus documents from small collections are preferred.
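The arithmetic behind this can be checked with a short sketch (the term's DF of 100 is a made-up value for illustration, and the base-10 log is just one common choice):

```python
import math

def idf(n_docs, df):
    """IDF of a term: log(N / DF)."""
    return math.log10(n_docs / df)

# Hypothetical Arabic term occurring in 100 documents (DF = 100).
df = 100

# Separate Arabic-only index: N = 6,000.
idf_separate = idf(6_000, df)

# Centralized index: N = 6,000 + 70,000 = 76,000, while DF is unchanged
# because the term still occurs only in the Arabic documents.
idf_centralized = idf(76_000, df)

# The centralized index inflates the term's weight, which biases
# retrieval toward documents from the smaller collection.
print(idf_separate, idf_centralized)   # ~1.78 vs ~2.88
```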

The second approach to indexing is the distributed approach. With respect to multilingual querying, the dominant approach in a distributed architecture is clear: translate the user query into the target language(s), then run a monolingual, language-specific search over each sub-collection, and finally merge the results.
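One simple merging baseline is raw-score merging; the sketch below assumes each monolingual run already returns (doc_id, score) pairs, and all names and scores are made up:

```python
def merge_results(per_language_runs, k=5):
    """Merge ranked lists from per-language sub-collections by raw score
    (raw-score merging, one simple merging strategy among several)."""
    merged = [
        (doc_id, score, lang)
        for lang, run in per_language_runs.items()
        for doc_id, score in run
    ]
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:k]

# Hypothetical monolingual runs produced after query translation.
runs = {
    "en": [("en-12", 0.91), ("en-40", 0.55)],
    "ar": [("ar-03", 0.87), ("ar-19", 0.42)],
}
top = merge_results(runs, k=3)
print(top)   # documents from both sub-collections interleaved by score
```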

Combined Indexing:

This approach combines the centralized and distributed architectures. On the centralized side, only multilingual documents are indexed in the centralized index, instead of indexing both monolingual and multilingual documents.

A typical distributed architecture does not prefer collections with a small number of documents, as a centralized architecture does. The retrieval performance of each monolingual run is also much better than in a centralized architecture, because both the query and the documents in each distributed sub-collection index are in the same language. Therefore, the proposed combined index creates a distributed monolingual sub-collection for each language that is used in monolingual documents only, but not for documents in multiple languages. Thus, multilingual documents are not included in these distributed monolingual sub-collections. The significant benefit of indexing only monolingual documents in the distributed part is efficient retrieval in each sub-collection, due to the language match between queries and documents. In addition, since multilingual documents are not included in the monolingual indexes, partitioning these documents and duplicating them across individual lists is avoided, unlike in a normal distributed index, which does not consider the multilingual nature of such documents.

Reference: Mohammed Mustafa, Izzedin Osman, Hussein Suleman, "Indexing and Weighting of Multilingual and Mixed Documents", ACM, 2011.

Indexing in big data

In this post I want to talk about indexing in big data. Before that, you should know what approaches big data systems use.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Example of basic indexing:

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:
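A hedged Python sketch of that word-count pseudo-code (the in-memory run_mapreduce driver and the document names are made up for illustration; a real MapReduce runtime does the grouping across many machines):

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    """Map: emit an intermediate (word, 1) pair per word occurrence."""
    for word in text.split():
        yield word, 1

def reduce_word_count(word, counts):
    """Reduce: merge all counts emitted for the same word."""
    return word, sum(counts)

def run_mapreduce(docs, map_fn, reduce_fn):
    """Toy single-machine driver: apply the map function, group the
    intermediate pairs by key, then apply the reduce function per key."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

docs = {"d1": "to be or not to be", "d2": "to do"}
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(counts)   # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```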

Inverted Index: the map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index.
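The same toy setup can sketch this inverted index example; here the map function emits one pair per distinct word, to keep the lists duplicate-free, and the grouping step that the MapReduce runtime would do is simulated in memory:

```python
from collections import defaultdict

def map_inverted(doc_id, text):
    """Map: parse a document and emit <word, document ID> pairs
    (one pair per distinct word, to keep the lists duplicate-free)."""
    for word in set(text.split()):
        yield word, doc_id

def reduce_inverted(word, doc_ids):
    """Reduce: sort the document IDs for a word into a posting list."""
    return word, sorted(doc_ids)

docs = {"d1": "to be or not to be", "d2": "to do"}

# Group intermediate pairs by word (normally done by the runtime).
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_inverted(doc_id, text):
        groups[word].append(d)

inverted = dict(reduce_inverted(w, ids) for w, ids in groups.items())
print(inverted["to"])   # ['d1', 'd2']
print(inverted["do"])   # ['d2']
```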



Content based video: part 2

In this post I want to talk about the index-tree construction step for video.

To construct the tree we need to define a window size, known as winsize. For example, if winsize is 3, the next 3 patterns {B, C, A} after A are contained in the window {A, B, C, A} of clip 1 from the previous post. winsize can be static or dynamic. The main advantage of a static window size is that the index-tree can be built off-line, so the tree construction cost is saved while performing the pattern matching operation. In contrast to the static window size, a dynamic window size is an adaptive method that adjusts the window size according to the length of the query clip. Two shot-patterns are enough to solve the sequence matching problem; the picture below shows the two shot-patterns for winsize 3.

Fast pattern index tree (FPI-tree): more generally, the FPI-tree can be regarded as a 2-pattern-based prefix-tree, and its construction can be viewed as an iterative operation. For each clip in the database, we have to generate all two-shot patterns, each represented as a "2-pattern". If a 2-pattern is shared by multiple clips, the related clip ids form a queue prefixed by that specific 2-pattern.
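A minimal sketch of this construction, using a flat dictionary in place of the prefix-tree; clip 1 is the {A, B, C, A} example from the text, and the second clip is made up:

```python
from collections import defaultdict

def two_patterns(symbols, winsize):
    """All 2-patterns of a clip: pair each shot with each of the next
    winsize shots inside its window."""
    patterns = set()
    for i, first in enumerate(symbols):
        for second in symbols[i + 1 : i + 1 + winsize]:
            patterns.add((first, second))
    return patterns

def build_fpi_index(clips, winsize):
    """Map each 2-pattern to the queue of clip ids sharing it; a flat
    dictionary stand-in for the 2-pattern-based prefix-tree."""
    index = defaultdict(list)
    for clip_id, symbols in clips.items():
        for pattern in sorted(two_patterns(symbols, winsize)):
            index[pattern].append(clip_id)
    return index

clips = {1: ["A", "B", "C", "A"], 2: ["B", "C", "D"]}
index = build_fpi_index(clips, winsize=3)
print(index[("B", "C")])   # [1, 2] -- both clips share the 2-pattern (B, C)
```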

Example of an FPI-tree


Content based video: part 1

Following my previous posts, in this post I want to talk about another aspect of indexing: content-based video indexing.

Shot detection
In this operation, for the query clip and the target videos, we perform transitional shot detection to divide each video into a set of sequential shots. Finally, the key-frame of each shot is defined. Hence, a shot within a video clip is represented by its key-frame in the remainder of this post.

Shot clustering and encoding

To construct the pattern-based index tree, encoding the shots is necessary. The main contribution of this work is that the feature dimensionality can be reduced substantially and the pattern matching cost becomes very low. In this work, the shots are clustered by the well-known k-means algorithm, and each shot is assigned a symbol according to its cluster number.
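A minimal sketch of this encoding step, with a naive k-means (seeded with the first k points for determinism) and made-up two-dimensional key-frame features; a real system would cluster high-dimensional visual features:

```python
def dist2(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Minimal k-means: returns centroids and each point's cluster id."""
    centroids = [points[i] for i in range(k)]   # first-k initialization
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, labels

# Hypothetical low-dimensional key-frame features, one per shot.
shot_features = [(0.1, 0.2), (0.15, 0.22), (0.9, 0.8), (0.88, 0.82), (0.12, 0.18)]
_, labels = kmeans(shot_features, k=2)

# Each shot is assigned a symbol named after its cluster number.
symbols = [chr(ord("A") + label) for label in labels]
print(symbols)   # ['A', 'A', 'B', 'B', 'A'] -- nearby shots share a symbol
```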

Indexing stage

After the video clips in the database are symbolized, the table below is a simple example of a clip-transaction list that contains 4 target clips. Each clip consists of several sequential shot-patterns. From this clip-transaction list we can build the index-tree, namely the FPI-tree.

The task of building the index-tree can be divided into two parts: the generation of temporal patterns and the construction of the index-tree.

I will talk about constructing index tree in the next post. 


Context indexing

I am giving a seminar about indexing in my search engine course. One aspect of indexing is context-based indexing, so I want to talk about it in this post.

In a context-based indexing architecture, web pages are stored in the crawled web page repository. The indexes hold the valuable compressed information for each web page. Preprocessing steps are performed on the documents (i.e. stemming as well as removal of stop words). The keywords are extracted from each document, and their corresponding multiple contexts are identified from WordNet. The indexer maintains the index of the keywords using binary search trees.

Documents are arranged by the keywords they contain, and the index is maintained in lexical order. For every letter of the alphabet there is one BST (binary search tree) containing the keywords whose first letter matches that letter. Each node in the BST points to a structure that contains the list of contextual meanings corresponding to that keyword, with pointers to the documents that match each particular meaning.

Keyword – the keyword that appears in one or more documents in the local database and will match the user's query keyword.

List of Contexts – the list of all different usages/senses of the keyword, obtained from WordNet.

C1 – stands for contextual sense 1. With each contextual sense (C), a list of pointers to the documents in which that sense appears is associated, where

 D1 – stands for the pointer to document 1

Steps to search the index to resolve a query

1. For the query keyword given by the user, search the index for a match on the first letter of the keyword.

2. The corresponding BST is selected for further searching.

3. If a match is found with some entry, the corresponding list of meanings, i.e. C1, C2, C3, etc., is displayed to the user; after the user picks a specific sense, the corresponding list of pointers is accessed to fetch the matching documents from the repository, which are finally displayed to the user as the result of the query.

4. Otherwise, if the keyword is not found in the corresponding BST, the appropriate insertion is done in the BST and "no match found" is displayed to the user.
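The steps above can be sketched with per-letter BSTs (the keywords, senses, and document ids below are made up; a real system would take the senses from WordNet):

```python
class Node:
    """BST node: a keyword plus its contexts (sense -> document ids)."""
    def __init__(self, keyword, contexts):
        self.keyword = keyword
        self.contexts = contexts
        self.left = self.right = None

def insert(root, keyword, contexts):
    """Standard BST insertion, ordered by keyword."""
    if root is None:
        return Node(keyword, contexts)
    if keyword < root.keyword:
        root.left = insert(root.left, keyword, contexts)
    elif keyword > root.keyword:
        root.right = insert(root.right, keyword, contexts)
    return root

def search(root, keyword):
    """Walk the BST; return the node for the keyword, or None."""
    while root is not None and root.keyword != keyword:
        root = root.left if keyword < root.keyword else root.right
    return root

# One BST per first letter, as described in the index above.
index = {}
def add_keyword(keyword, contexts):
    letter = keyword[0]
    index[letter] = insert(index.get(letter), keyword, contexts)

add_keyword("bank", {"financial institution": ["D1", "D3"], "river side": ["D2"]})
add_keyword("bat", {"animal": ["D4"], "sports equipment": ["D1"]})

node = search(index["b"], "bank")
print(list(node.contexts))          # the senses shown to the user (step 3)
print(node.contexts["river side"])  # documents for the chosen sense
```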



Solr config

In this post I want to talk about how to index wiki data in Solr from the command line.

2. Go to the bin folder of Solr with the cd command.

3. Run solr.cmd start to start Solr.

4. Run solr.cmd create -c wiki to create a core in Solr.

5. Go to <<solr-6.2.1\server\solr\wiki\conf>> to configure the wiki core for indexing the wiki data.

6. In this folder you have the managed-schema file; rename it to schema.xml, then write the code below

<field name="_version_" type="long" indexed="true" stored="true"/>

<field name="id" type="string" indexed="true" stored="true" required="true"/>

 <field name="title" type="string" indexed="true" stored="true"/>

 <field name="revision" type="int" indexed="true" stored="false"/>

<field name="user" type="string" indexed="true" stored="false"/>

<field name="userId" type="int" indexed="true" stored="false"/>

<field name="_text_" type="text_en" indexed="true" stored="false"/>

Instead of

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

 <field name="_version_" type="long" indexed="true" stored="false"/>

 <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />

 <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

7. Go to <<solr-6.2.1\server\solr\wiki\conf\solrconfig.xml>>.

After  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-velocity-\d.*\.jar" />

 write  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

After the /browse request handler, which starts with

<requestHandler name="/browse" class="solr.SearchHandler" useParams="query,facets,velocity,browse">

    <lst name="defaults">

      <str name="echoParams">explicit</str>

write, right after its closing </requestHandler> tag:

<requestHandler name="/dihupdate" class="org.apache.solr.handler.dataimport.DataImportHandler" startup="lazy">

    <lst name="defaults">

      <str name="config">data-config.xml</str>

    </lst>

</requestHandler>

8. Create a data-config.xml file in <<solr-6.2.1\server\solr\wiki\conf>> and write the code below (change the url attribute to point to your downloaded wiki XML dump)


<dataConfig>

    <dataSource type="FileDataSource" encoding="UTF-8" />

    <document>

        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="path/to/your/wiki-dump.xml"
                transformer="RegexTransformer,DateFormatTransformer">

            <field column="id"        xpath="/mediawiki/page/id" />

            <field column="title"     xpath="/mediawiki/page/title" />

            <field column="revision"  xpath="/mediawiki/page/revision/id" />

            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />

            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />

            <field column="text"      xpath="/mediawiki/page/revision/text" />

            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />

            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>

        </entity>

    </document>

</dataConfig>




9. Restart Solr with solr stop -all and then solr start, open your browser, and go to localhost:8983/solr/wiki/dihupdate to index your data.

10. Go to the wiki core in the browser and see numDocs, the number of indexed documents.

About the Solr search engine

I have a project about indexing with Solr in my search engine course, so I want to talk about Solr in this post.

Solr is an open source search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, real-time indexing, dynamic clustering, database integration, NoSQL features, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is the second-most popular enterprise search engine after Elasticsearch.

Solr runs as a standalone full-text search server. It uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages. Solr's external configuration allows it to be tailored to many types of applications without Java coding, and it has a plugin architecture to support more advanced customization.
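As a tiny illustration of the HTTP API, the sketch below just builds a query URL for a local core named wiki (the host, port, and core name are assumptions based on a default local install); q, rows, and wt are standard Solr request parameters:

```python
from urllib.parse import urlencode

# Base URL of a local Solr core named "wiki" (default local install).
core_url = "http://localhost:8983/solr/wiki"

def select_url(query, rows=10, response_format="json"):
    """Build a Solr /select query URL from standard request parameters:
    q (the query), rows (result count), wt (response format)."""
    params = urlencode({"q": query, "rows": rows, "wt": response_format})
    return f"{core_url}/select?{params}"

print(select_url("title:python"))
# http://localhost:8983/solr/wiki/select?q=title%3Apython&rows=10&wt=json
```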


Techniques of inverted index: part 1

I talked about the inverted index in my previous post, and I want to talk about some inverted index compression techniques in this post.

Var-Byte Coding: variable-byte compression represents an integer in a variable number of bytes, where each byte consists of one status bit, indicating whether another byte follows the current one, followed by 7 data bits. Thus, 142 = 1·2^7 + 14 is represented as 10000001 00001110, while 2 is represented as 00000010. Var-byte compression does not achieve a very good compression ratio, but it is simple, allows for fast decoding, and is thus used in many systems.
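A sketch of this scheme in Python, using the convention above where a status bit of 1 means another byte follows:

```python
def varbyte_encode(n):
    """Encode a non-negative integer as var-bytes: each byte holds 7 data
    bits; the status (high) bit is 1 when another byte follows."""
    out = [n & 0x7F]          # last byte: status bit 0
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))   # earlier bytes: status bit 1
        n >>= 7
    return list(reversed(out))

def varbyte_decode(data):
    """Decode a stream of var-bytes back into a list of integers."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if not byte & 0x80:   # status bit 0: last byte of this number
            numbers.append(n)
            n = 0
    return numbers

print([f"{b:08b}" for b in varbyte_encode(142)])  # ['10000001', '00001110']
print(varbyte_decode(varbyte_encode(142) + varbyte_encode(2)))  # [142, 2]
```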

Rice Coding: this method compresses a sequence of integers by first choosing a b such that 2^b is close to their average value. Each integer n is then encoded in two parts: a quotient q = ⌊n/2^b⌋ stored in unary code using q+1 bits, and a remainder r = n mod 2^b stored in binary using b bits. Rice coding achieves very good compression on standard unordered collections but is slower than var-byte, though the gap in speed can be reduced by using an optimized implementation.
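A sketch of Rice coding over bit strings (a real implementation would pack the bits); it assumes b >= 1 and writes the unary quotient as q ones followed by a zero:

```python
def rice_encode(n, b):
    """Rice-encode n: unary quotient q = n >> b (q ones then a zero),
    followed by the remainder in b binary bits."""
    q, r = n >> b, n & ((1 << b) - 1)
    return "1" * q + "0" + format(r, f"0{b}b")

def rice_decode(bits, b):
    """Decode one Rice-coded integer from a bit string;
    returns (value, remaining bits)."""
    q = 0
    while bits[q] == "1":     # count the unary part
        q += 1
    r = int(bits[q + 1 : q + 1 + b], 2)
    return (q << b) + r, bits[q + 1 + b:]

# Encode 9 with b = 2 (so 2^b = 4): q = 2, r = 1 -> "110" + "01"
code = rice_encode(9, 2)
print(code)            # 11001
n, rest = rice_decode(code, 2)
print(n, repr(rest))   # 9 ''
```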



Inverted index compression

Most techniques for inverted index compression first replace each docID (except the first in a list) by the difference between it and the preceding docID, called the d-gap, and then encode the d-gap using some integer compression algorithm. Using d-gaps instead of docIDs decreases the average value that needs to be compressed, resulting in a higher compression ratio. Of course, these values have to be summed up again during decompression, but this can usually be done very efficiently. Thus, inverted index compression techniques are concerned with compressing sequences of integers whose average value is small. The resulting compression ratio depends on the exact properties of these sequences, which depend on the way in which docIDs are assigned to documents. The basic idea here is that if we assign docIDs such that many similar documents (i.e., documents that share a lot of terms) are close to each other in the docID assignment, then the resulting sequence of d-gaps will become more skewed, with large clusters of many small values interrupted by a few larger values, resulting in better compression. In contrast, if docIDs are assigned at random, the distribution of gaps will be basically exponential, and small values will not be clustered together.
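The d-gap transformation itself is simple in both directions; a minimal sketch with a made-up posting list:

```python
def to_dgaps(doc_ids):
    """Replace each docID (except the first) by its gap to the previous."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_dgaps(gaps):
    """Sum the gaps back up to recover the original docIDs."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

posting_list = [3, 7, 11, 23, 29]          # docIDs, sorted ascending
gaps = to_dgaps(posting_list)
print(gaps)                                 # [3, 4, 4, 12, 6]
print(from_dgaps(gaps) == posting_list)     # True
```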

Inverted index

An inverted index for a collection of documents is a structure that stores, for each term (word) occurring somewhere in the collection, information about the locations where it occurs. In particular, for each term t, the index contains an inverted list I_t consisting of a number of index postings. Each posting in I_t contains information about the occurrences of t in one particular document d: usually the ID of the document (the docID), the number of occurrences of t in d (the frequency), and possibly other information about the locations of the occurrences within the document and their contexts. The postings in each list are usually sorted by docID.
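A small sketch of such an index, with docID-sorted postings of (docID, frequency) pairs (the three documents are made up):

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index: term -> inverted list of (docID, frequency)
    postings, sorted by docID."""
    index = defaultdict(list)
    for doc_id in sorted(docs):     # visiting docIDs in order sorts postings
        for term, freq in sorted(Counter(docs[doc_id].split()).items()):
            index[term].append((doc_id, freq))
    return dict(index)

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "new home construction",
}
index = build_index(docs)
print(index["home"])   # [(1, 1), (2, 1), (3, 1)]
print(index["new"])    # [(1, 1), (3, 1)]
```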

Search engine and its issues

Search engines come in a number of configurations that reflect the applications they are designed for. Web search engines, such as Google and Yahoo!, must be able to capture, or crawl, many terabytes of data, and then provide subsecond response times to millions of queries submitted every day from around the world. The "big issues" in the design of search engines include the ones identified for information retrieval: effective ranking algorithms, evaluation, and user interaction. There are, however, a number of additional critical features of search engines that result from their deployment in large-scale, operational environments. Foremost among these features is the performance of the search engine in terms of measures such as response time, query throughput, and indexing speed. Response time is the delay between submitting a query and receiving the result list, throughput measures the number of queries that can be processed in a given time, and indexing speed is the rate at which text documents can be transformed into indexes for searching. An index is a data structure that improves the speed of search. The design of indexes for search engines is one of the major topics in this blog.



The main purpose of my blog is to gather the latest news and information in the field of the semantic web. I hope this weblog is useful for those who are interested in this field.


If you can read this post, it means that the registration process was successful and that you can start blogging.