hbase

In this post I want to talk about HBase, which I need to read my input data from.
First I set up HBase and create a table named BTC in it; the code below connects to that table:
Configuration config = HBaseConfiguration.create();
HTable hTable = new HTable(config, "BTC" );
Then, using the code below, I pick a row key for each row of the table:
Put p = new Put(Bytes.toBytes(split[0]));
Using the code below, I add a value to the subject column of the ranking column family:
Bytes.toBytes("subject" ),Bytes.toBytes(split[0]));
 
The link below contains the complete code, which reads the data from the BTC input file and puts it into the HBase table.
https://fumdrive.um.ac.ir/index.php/s/e5ZXUTsZJd4zIRs
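For reference, here is a minimal sketch of how those snippets fit together (my own reconstruction, not the linked code). It assumes the pre-1.0 HBase HTable API shown above, a whitespace-separated input file whose first field is the subject, and a placeholder file name btc-input.txt:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BtcLoader {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTable hTable = new HTable(config, "BTC");   // the BTC table is assumed to exist

        // "btc-input.txt" is a placeholder for the actual BTC input file
        try (BufferedReader reader = new BufferedReader(new FileReader("btc-input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] split = line.split("\\s+");          // assumed whitespace-separated fields
                Put p = new Put(Bytes.toBytes(split[0]));     // row key = subject
                p.add(Bytes.toBytes("ranking"),               // column family
                      Bytes.toBytes("subject"),               // column qualifier
                      Bytes.toBytes(split[0]));               // value
                hTable.put(p);
            }
        }
        hTable.close();
    }
}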
 


Installing Hadoop

In this post I want to talk about this term's project, which is implementing ranking in an online fashion.

For this implementation I need to install Hadoop, and I ran into several problems along the way,

including permission levels, the native library, the incompatibility of the 32-bit Hadoop build with my Linux machine, and so on.

So I installed it on a Mac instead; before installing it, I had to install brew and the JDK for the Mac.

The steps for installing Hadoop via brew are described in full below.

$ brew install hadoop

Now open hadoop-env.sh at /usr/local/Cellar/hadoop/2.8.0/libexec/etc/hadoop/hadoop-env.sh and find the following line:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

and replace it with the following:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Open core-site.xml and copy the following into it:

<configuration> 
<property>
     <name>hadoop.tmp.dir</name>
     <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
     <description>A base for other temporary directories.</description>
  </property>
  <property>
     <name>fs.default.name</name>                                    
     <value>hdfs://localhost:9000</value>                            
  </property>
</configuration> 
 

Open mapred-site.xml in the same directory and copy the following into it:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9010</value>
</property>
</configuration>

Open hdfs-site.xml in the same directory and copy the following into it:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Add the following lines to ~/.profile:

alias hstart="/usr/local/Cellar/hadoop/2.6.0/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.6.0/sbin/start-yarn.sh"
alias hstop="/usr/local/Cellar/hadoop/2.6.0/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.6.0/sbin/stop-dfs.sh"

Now run the following command:

source ~/.profile

Now, before running Hadoop, you have to format HDFS:

$ hdfs namenode -format

Now you need to create an SSH key:

$ ssh-keygen -t rsa

Enable remote login as follows:

“System Preferences” -> “Sharing” -> check “Remote Login”, then add your key to the authorized keys so that SSH to localhost works without a password:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now start Hadoop:

$ hstart

Run one of its example programs (the pi estimator from the bundled examples jar):

$ hadoop jar <path to the hadoop-mapreduce-examples jar> pi 10 100

 

Use the following command to see the running Hadoop daemons:

$ jps 

You can see the Hadoop web interface at http://localhost:50070.

 


C-sparql

In this post I want to talk about my short in-class presentation. The topic was C-SPARQL.
C-SPARQL is used for querying data that is continuously being produced and flows as a stream, such as the number of cars entering a gate each minute, or the readings coming from sensors.
C-SPARQL is an extension of SPARQL that adds concepts such as windows, timestamps, aggregation, and so on; each of these concepts is explained separately in the slides at the link below.
https://fumdrive.um.ac.ir/index.php/s/o8s9vBkOXuwzaOa
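To give a flavor of the syntax, a continuous query for the cars-per-gate example above might look roughly like this (the stream URI and the property name are made up for illustration):

REGISTER QUERY CarsPerGate AS
PREFIX ex: <http://example.org/traffic#>
SELECT ?gate (COUNT(?car) AS ?numCars)
FROM STREAM <http://example.org/traffic/gates> [RANGE 1m STEP 1m]
WHERE { ?car ex:entersThrough ?gate }
GROUP BY ?gate

The RANGE/STEP clause is the window: every minute the query is re-evaluated over the triples that arrived in the last minute, and the aggregation is computed over that window.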


Indexing in big data

In this post I want to talk about the presentation I gave in class about indexing in big data.
At the beginning of the presentation, the concepts needed for working with big data were introduced, including Hadoop, MapReduce programming, HBase, and so on.
MapReduce programming: a programming model used for working with big data; it takes a set of keys and values and, after processing them, produces output in the form of <key, value> pairs.
Hadoop: a platform used for MapReduce programming.
HBase: a table store that we use for fast access to the data.
After that, several indexing models for big data were presented, including indexing based on shared predicates, indexing based on labeled nodes, unified indexing, and indexing based on partitioning by predicate. Then, borrowing the idea of the first approach, a new method was presented which the authors have implemented.
This method consists of three parts: extracting structure from the data, storing the data, and retrieving the data.
In the structure-extraction part, the existing RDF data is clustered using a particular method, and each cluster is kept in its own HBase table.
In the data-storage part, the patterns of the entities in the documents are extracted and each one is placed in the most similar cluster.
In the data-retrieval part, the query is received, its pattern is found, and the answer is returned to the user.
The link to the slides of this presentation, with the explanations and tables, is given below:
https://drive.google.com/open?id=0B5nhdxcxQ6Ajckl5VGF1RmZEeGc


Ontology

In this post I want to talk about the ontology assignment I had for the Semantic Web course.
For this assignment we had to use the Protégé tool. Protégé is used for building ontologies and provides the facilities for expressing the relations in an ontology. I had to read one of the chapters of the SWEBOK book and build its ontology. To do this, I made each chapter heading a subclass of Thing and each subsection a subclass of its chapter. Then, for each chapter, I extracted the relations and concepts, connected them with object properties or data properties, and made the newly created classes subclasses of a more general concept. In this assignment I tried to find the relations shared between chapters so that I could create fewer data properties and keep the file size down; for example, in every subsection I found a concept called means, which was the definition of that subsection.
The link to this ontology is given below.
https://drive.google.com/open?id=0B5nhdxcxQ6AjbGRhckZobWY1WUE


workflow of project

As you know, I have a project about indexing in my search engine course.

In this post I am putting the link to the Prezi that I presented in class; it shows our workflow so far.

https://drive.google.com/open?id=0B5nhdxcxQ6AjNUxqZGtuUm5nWjg


multilingual indexing

In this post I want to talk about multilingual indexing.

There are two architectures for indexing. First, a centralized architecture appears adequate for indexing multilingual documents, because it uses a single index, but it has been shown to have some problems. One major problem with the centralized architecture is that term weights are usually inflated (overweighted). This is because the total number of documents (N) increases while a term's document frequency (DF) stays unchanged, and thus the weights are overweighted. For example, consider a collection containing 6,000 monolingual Arabic documents along with 70,000 documents in English. In a centralized architecture, the N value (the number of all documents) in the IDF of a term, computed as log(N/DF) to estimate the term's importance, increases to 76,000 for the Arabic collection, instead of 6,000, once all documents are placed together in a single collection. This causes term weights to be overweighted, and thus documents from the smaller collections are preferred.
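To make the numbers concrete, here is a small worked example in Java (the document frequency DF = 100 is made up, and a base-10 logarithm is assumed):

// Worked example of the IDF inflation described above: IDF = log(N / DF).
public class IdfExample {
    public static void main(String[] args) {
        double df = 100.0;                                   // assumed DF of some Arabic term
        double idfDistributed = Math.log10(6000.0 / df);     // Arabic sub-collection alone
        double idfCentralized = Math.log10(76000.0 / df);    // all 76,000 documents pooled
        // prints roughly: distributed IDF = 1.78, centralized IDF = 2.88
        System.out.printf("distributed IDF = %.2f, centralized IDF = %.2f%n",
                idfDistributed, idfCentralized);
    }
}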

The second approach to indexing is the distributed approach. With respect to multilingual querying, the dominant approach in a distributed architecture is to translate the user query into the target language(s); a monolingual, language-specific search is then carried out on each sub-collection, followed by a merging step.

Combined Indexing:

It combines the centralized and distributed architectures. On the centralized side, only the multilingual documents are indexed in a centralized index, instead of indexing both monolingual and multilingual documents.

A typical distributed architecture does not prefer collections with a small number of documents, as a centralized architecture does. The retrieval performance of each monolingual run is much better than in a centralized architecture, because both the queries and the documents in each distributed sub-collection index are in the same language. Therefore, the proposed combined index creates a distributed monolingual sub-collection for each language that is used in monolingual documents only, but not for documents in multiple languages. Thus, multilingual documents are not included in these distributed monolingual sub-collections. The significant benefit of indexing only monolingual documents in the distributed part is efficient retrieval in each sub-collection, due to the match in language between queries and documents. In addition, since multilingual documents are not included in the monolingual indexes, partitioning these documents, and the resulting overlap across individual posting lists, is avoided, unlike in a normal distributed index, which does not consider the multilingual nature of multilingual documents.

Reference: Mohammed Mustafa, Izzedin Osman, Hussein Suleman, "Indexing and Weighting of Multilingual and Mixed Documents", ACM, 2011.


indexing in bigdata

In this post I want to talk about indexing in big data. Before that, you should know what approaches big data processing uses.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Example of basic indexing:

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write a map function and a reduce function along the lines of the description below.

Inverted Index: The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index.
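As a concrete illustration, here is a minimal Hadoop MapReduce sketch of that inverted index (my own code, not from the paper). The input format of one document per line as "docId<TAB>text" is an assumption, and the job driver is omitted:

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Map: parse a document and emit <word, docId> pairs.
    public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2); // assumed: docId \t document text
            if (parts.length < 2) return;
            String docId = parts[0];
            for (String word : parts[1].toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new Text(docId));
                }
            }
        }
    }

    // Reduce: collect and sort the document IDs for each word into a posting list.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            Set<String> postings = new TreeSet<>();
            for (Text id : docIds) postings.add(id.toString());
            context.write(word, new Text(String.join(",", postings)));
        }
    }
}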

 

 


Content based video: part 2

In this post I want to talk about the index-tree construction step for video.

To construct the tree we need to define a window size, known as winsize. For example, if winsize is 3, the next 3 patterns {B, C, A} after A are contained in the window {A, B, C, A} in clip 1 from the previous post. Winsize can be static or dynamic. The main advantage of a static window size is that the index-tree can be built off-line, so the tree construction cost is saved while performing the pattern matching operation. In contrast, a dynamic window size is an adaptive method that adjusts the window size according to the length of the query clip. Two shot-patterns are enough to solve the sequence matching problem; for winsize 3, each shot is paired with the shots inside its window to form 2-patterns, as sketched below.
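The following is my own illustrative sketch (not the paper's code) of generating the 2-patterns inside a fixed window over a symbolized clip such as {A, B, C, A}:

import java.util.ArrayList;
import java.util.List;

public class TwoPatternGenerator {
    // For each shot, pair it with every shot that falls inside the next `winsize` positions.
    static List<String> twoPatterns(String[] shots, int winsize) {
        List<String> patterns = new ArrayList<>();
        for (int i = 0; i < shots.length; i++) {
            for (int j = i + 1; j <= i + winsize && j < shots.length; j++) {
                patterns.add(shots[i] + shots[j]); // e.g. "AB", "AC", "AA"
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        // Clip 1 from the previous post: A B C A, with winsize = 3
        // prints [AB, AC, AA, BC, BA, CA]
        System.out.println(twoPatterns(new String[] {"A", "B", "C", "A"}, 3));
    }
}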

Fast-pattern-index (FPI) tree: more generally, the FPI-tree can be regarded as a 2-pattern-based prefix tree, and its construction can be viewed as an iterative operation. For each clip in the database, we have to generate all two-shot patterns, represented as "2-patterns". If a 2-pattern is shared by multiple clips, the related clip ids form the queue prefixed by that specific 2-pattern.

(Figure: an example of an FPI-tree.)

 


Content based video: part 1

Following my previous post, in this post I want to talk about another aspect of indexing: content-based video indexing.

Shot detection
In this operation, for the query clip and the target videos, we perform transitional shot detection to divide each video into a set of sequential shots. Finally, the key-frame of each shot is determined. Hence, a shot within a video clip is represented by its key-frame in the rest of this post.

Shot clustering and encoding

To construct the pattern-based index tree, the shots need to be encoded. The main benefit of this is that the feature dimensionality can be reduced substantially and the pattern matching cost becomes very low. In this work, the shots are clustered with the well-known k-means algorithm and each shot is assigned a symbol according to the cluster it belongs to.
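A rough sketch of this encoding step (illustrative only; the k-means clustering itself and the key-frame feature extraction are assumed to have been done already): each shot's feature vector is mapped to the symbol of its nearest centroid.

public class ShotEncoder {
    // Return the symbol ('A', 'B', 'C', ...) of the centroid closest to the shot's feature vector.
    static char encode(double[] shotFeature, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < shotFeature.length; i++) {
                double diff = shotFeature[i] - centroids[c][i];
                d += diff * diff;                 // squared Euclidean distance
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return (char) ('A' + best);               // shot symbol = cluster id
    }
}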

Indexing stage

After the video clips in the database are symbolized, a clip-transaction list can be built; a simple example in the paper contains 4 target clips, each consisting of several sequential shot-patterns. From this clip-transaction list we can build the index-tree, namely the FPI-tree.

The task of building the index-tree can be divided into two parts: generating the temporal patterns and constructing the index-tree.

I will talk about constructing the index-tree in the next post.

 


Context indexing

I have a seminar about indexing in my search engine course. One aspect of indexing is context-based indexing, so I want to talk about it in this post.

In the context-based indexing architecture, web pages are stored in the crawled web page repository. The indexes hold the valuable compressed information for each web page. The preprocessing steps (i.e. stemming and removal of stop words) are performed on the documents. The keywords are extracted from each document, and their multiple contexts are identified from WordNet. The indexer maintains the index of the keywords using binary search trees.

Documents are arranged by the keywords they contain, and the index is maintained in lexical order. For every letter of the alphabet there is one BST (binary search tree) containing the keywords whose first letter matches that letter. Each node in the BST points to a structure that contains the list of contextual meanings corresponding to that keyword and the pointers to the documents that match each particular meaning.

Keyword – the keyword that appears in one or more documents in the local database and that will match the user's query keyword.

List of Contexts – the list of all the different usages/senses of the keyword, obtained from WordNet.

C1 – stands for contextual sense 1. With each contextual sense (C), a list of pointers to the documents in which that sense appears is associated, where:

 D1 – stands for the pointer to document 1
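Below is a rough sketch (my own, not from the paper) of this structure in Java; a TreeMap stands in for the hand-built BST, and the document pointers are simplified to integer IDs.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ContextIndex {

    // One context (WordNet sense) of a keyword, with pointers to the documents
    // (here simply document IDs) in which that sense appears.
    static class ContextEntry {
        final String sense;
        final List<Integer> docIds = new ArrayList<>();
        ContextEntry(String sense) { this.sense = sense; }
    }

    // One ordered tree per first letter; a TreeMap stands in for the per-letter BST.
    private final Map<Character, TreeMap<String, List<ContextEntry>>> letterTrees = new TreeMap<>();

    // Look up a keyword: returns its list of contexts, or null if it is not indexed.
    List<ContextEntry> lookup(String keyword) {
        char letter = Character.toLowerCase(keyword.charAt(0));
        TreeMap<String, List<ContextEntry>> bst = letterTrees.get(letter);
        return bst == null ? null : bst.get(keyword.toLowerCase());
    }

    // Add a (keyword, sense, docId) triple to the index.
    void add(String keyword, String sense, int docId) {
        char letter = Character.toLowerCase(keyword.charAt(0));
        List<ContextEntry> contexts = letterTrees
                .computeIfAbsent(letter, k -> new TreeMap<>())
                .computeIfAbsent(keyword.toLowerCase(), k -> new ArrayList<>());
        ContextEntry entry = contexts.stream()
                .filter(c -> c.sense.equals(sense)).findFirst().orElse(null);
        if (entry == null) { entry = new ContextEntry(sense); contexts.add(entry); }
        entry.docIds.add(docId);
    }
}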

Steps to search the index to resolve a query

1. For the query keyword given by the user, search the index for the entry matching the first letter of the keyword.

2. The corresponding BST is selected for further searching.

3. If a match is found with some entry, the corresponding list of meanings, i.e. C1, C2, C3, etc., is displayed to the user; after getting a specific choice from the user, the corresponding list of pointers is accessed to fetch the documents from the repository, and they are finally displayed to the user as the result of the query.

4. Otherwise, if the keyword is not found in the corresponding BST, it is inserted at the appropriate place in the BST and "no match found" is displayed to the user.

 

 


solr config

In this post I want to talk about how to index Wikipedia data in Solr.

1. Run cmd.

2. Go to the bin folder of Solr with the cd command.

3. Run solr.cmd start to start Solr.

4. Run solr.cmd create -c wiki to create the wiki core in Solr.

5. Go to <<solr-6.2.1\server\solr\wiki\conf>> to configure the wiki core for indexing the wiki dump.

6. In this folder there is a managed-schema file; rename it to schema.xml and write the field definitions below

<field name="_version_" type="long" indexed="true" stored="true"/>

<field name="id" type="string" indexed="true" stored="true" required="true"/>

 <field name="title" type="string" indexed="true" stored="true"/>

 <field name="revision" type="int" indexed="true" stored="false"/>

<field name="user" type="string" indexed="true" stored="false"/>

<field name="userId" type="int" indexed="true" stored="false"/>

<field name="_text_" type="text_en" indexed="true" stored="false"/>

instead of the following defaults:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

 <field name="_version_" type="long" indexed="true" stored="false"/>

 <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />

 <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

7. Go to <<solr-6.2.1\server\solr\wiki\conf\solrconfig.xml>>.

After the line  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-velocity-\d.*\.jar" />

add  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

After the /browse request handler block, which looks like this:

<requestHandler name="/browse" class="solr.SearchHandler" useParams="query,facets,velocity,browse">

    <lst name="defaults">

      <str name="echoParams">explicit</str>

    </lst>

</requestHandler>

add the data import handler:

<requestHandler name="/dihupdate" class="org.apache.solr.handler.dataimport.DataImportHandler" startup="lazy">

    <lst name="defaults">

      <str name="config">data-config.xml</str>

    </lst>

  </requestHandler>

8. Create a data-config.xml file in <<solr-6.2.1\server\solr\wiki\conf>> and write the code below:

<dataConfig>

       <dataSource type="FileDataSource" encoding="UTF-8" />

        <document>

        <entity name="page"

                processor="XPathEntityProcessor"

                stream="true"

                forEach="/mediawiki/page/"

                url="F:\solr-6.2.1\server\solr\wiki6\enwiki-20160113-pages-articles1.xml"

                transformer="RegexTransformer,DateFormatTransformer"

                >

            <field column="id"        xpath="/mediawiki/page/id" />

            <field column="title"     xpath="/mediawiki/page/title" />

            <field column="revision"  xpath="/mediawiki/page/revision/id" />

            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />

            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />

            <field column="text"      xpath="/mediawiki/page/revision/text" />

            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />

            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>

       </entity>

        </document>

</dataConfig>

9. Restart Solr with solr stop -all followed by solr start.

10. Open your browser and go to localhost:8983/solr/wiki/dihupdate to index your data.

11. Go to the wiki core in the browser and check numDocs to see how many documents have been indexed.


About solr search engine

I have a project about indexing with Solr in my search engine course, so I want to talk about Solr in this post.

Solr is an open source search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is the second-most popular enterprise search engine after Elasticsearch.

Solr runs as a standalone full-text search server. It uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages. Solr's external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.

 


Techniques of inverted index:part1

I talked about the inverted index in my previous post, and in this post I want to talk about some of the compression techniques used for it.

Var-Byte Coding: Variable-byte compression represents an integer in a variable number of bytes, where each byte consists of one status bit, indicating whether another byte follows the current one, followed by 7 data bits. Thus, 142 = 1·2^7 + 14 is represented as 10000001 00001110, while 2 is represented as 00000010. Var-byte compression does not achieve a very good compression ratio, but it is simple and allows for fast decoding, and is thus used in many systems.
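A small sketch of this scheme (my own code, following the convention above of emitting the most significant 7-bit group first and setting the status bit on all but the last byte):

import java.util.ArrayList;
import java.util.List;

public class VarByte {
    // Encode a non-negative integer into var-byte form.
    static byte[] encode(int n) {
        List<Integer> groups = new ArrayList<>();
        do {
            groups.add(0, n & 0x7F);          // collect 7-bit groups, most significant first
            n >>>= 7;
        } while (n != 0);
        byte[] out = new byte[groups.size()];
        for (int i = 0; i < groups.size(); i++) {
            int b = groups.get(i);
            if (i < groups.size() - 1) b |= 0x80; // status bit: another byte follows
            out[i] = (byte) b;
        }
        return out;
    }

    // Decode a single var-byte value back into an integer.
    static int decode(byte[] bytes) {
        int value = 0;
        for (byte b : bytes) value = (value << 7) | (b & 0x7F);
        return value;
    }

    public static void main(String[] args) {
        byte[] enc = encode(142);
        // prints "10000001 00001110 -> 142"
        for (byte b : enc) {
            System.out.print(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0') + " ");
        }
        System.out.println("-> " + decode(enc));
    }
}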

Rice Coding: This method compresses a sequence of integers by first choosing a b such that 2^b is close to the average value. Each integer n is then encoded in two parts: a quotient q = floor(n/2^b) stored in unary code using q + 1 bits, and a remainder r = n mod 2^b stored in binary using b bits. Rice coding achieves very good compression on standard unordered collections but is slower than var-byte, though the gap in speed can be reduced by using an optimized implementation.
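A sketch of the encoder (my own code; the unary convention of q one-bits followed by a terminating zero is an assumption, as the exact bit convention varies between implementations):

import java.util.ArrayList;
import java.util.List;

public class RiceCode {
    // Encode n with parameter b; the bits are returned as a list of 0/1 for clarity.
    static List<Integer> encode(int n, int b) {
        List<Integer> bits = new ArrayList<>();
        int q = n >>> b;                         // quotient  = floor(n / 2^b)
        int r = n & ((1 << b) - 1);              // remainder = n mod 2^b
        for (int i = 0; i < q; i++) bits.add(1); // unary part: q ones ...
        bits.add(0);                             // ... plus a terminating zero (q + 1 bits)
        for (int i = b - 1; i >= 0; i--) bits.add((r >>> i) & 1); // remainder in b bits
        return bits;
    }

    public static void main(String[] args) {
        // 142 with b = 5: q = 4, r = 14 -> unary 11110, remainder 01110
        System.out.println(encode(142, 5));
    }
}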

 

 


inverted index compression

Most techniques for inverted index compression first replace each docID (except the first in a list) by the difference between it and the preceding docID, called the d-gap, and then encode the d-gap using some integer compression algorithm. Using d-gaps instead of docIDs decreases the average value that needs to be compressed, resulting in a higher compression ratio. Of course, these values have to be summed up again during decompression, but this can usually be done very efficiently. Thus, inverted index compression techniques are concerned with compressing sequences of integers whose average value is small. The resulting compression ratio depends on the exact properties of these sequences, which in turn depend on the way docIDs are assigned to documents. The basic idea here is that if we assign docIDs such that many similar documents (i.e., documents that share a lot of terms) are close to each other in the docID assignment, then the resulting sequence of d-gaps will become more skewed, with large clusters of many small values interrupted by a few larger values, resulting in better compression. In contrast, if docIDs are assigned at random, the distribution of gaps will be basically exponential, and small values will not be clustered together.
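For illustration, here is a minimal sketch of the d-gap transform and its inverse (my own code; compressing the gaps themselves, e.g. with var-byte as in the previous post, would follow):

public class DGaps {
    // Replace each docID (except the first) with its difference from the preceding docID.
    static int[] toGaps(int[] sortedDocIds) {
        int[] gaps = sortedDocIds.clone();
        for (int i = gaps.length - 1; i > 0; i--) gaps[i] -= sortedDocIds[i - 1];
        return gaps; // e.g. {3, 7, 11, 20} -> {3, 4, 4, 9}
    }

    // Sum the gaps back up to recover the original docIDs during decompression.
    static int[] fromGaps(int[] gaps) {
        int[] docIds = gaps.clone();
        for (int i = 1; i < docIds.length; i++) docIds[i] += docIds[i - 1];
        return docIds;
    }
}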

