A Proposed Method for Documents Indexing

In this paper, a new method for documents indexing is proposed, based on constructing two tables: a words-information table and a pages-information table. These two tables represent the first step in information retrieval, which prepares the document set (preprocessing). In information retrieval systems, tokenization is an integral part whose prime objective is to identify the tokens and their counts. An effective tokenization approach is proposed, based on the new documents indexing method, and the results show the efficiency of the proposed algorithm. Tokenization of documents helps to satisfy the user's information need more precisely and sharply reduces the search space. Preprocessing of the input document is an integral part of tokenization: it generates the document's tokens, and on the basis of these tokens probabilistic IR generates its scoring and gives a reduced search space. The comparative analysis is based on the reduction of search time, preprocessing time, and memory size.


Introduction
The amount of operational data has been increasing exponentially over the past few decades, and the expectations of data users are changing proportionally as well. The data user expects deeper, more exact, and more detailed results. Retrieval of relevant results is always affected by the way in which the documents are stored and indexed. Various techniques have been designed to index documents; indexing is done on the tokens identified within the documents, and newer techniques use an inverted index. Information retrieval (IR) handles the representation, storage, organization of, and access to information items [1]. In IR, one of the main problems is to determine which documents are relevant to the user's needs and which are not. In practice, this problem is usually treated as a ranking problem, which is solved according to the degree of relevance (matching) between each document and the user's query [1] [2]. This is the problem that information retrieval deals with [3]. The general structure of information retrieval is shown in Figure 1.

Information Retrieval Process Model
The proposed system consists of two stages. The first stage is preprocessing: the dataset is prepared and stored in a database, which is used as input to the second stage to retrieve the documents relevant to the user query. For the first stage, a new method for indexing documents, called the proposed documents indexing algorithm, is introduced and described in detail in this paper.

Related work
In 2008, Uematsu used an inverted index that stores the position of each word and the document ID. Word position data is a list of the offsets or positions at which the word occurs in the document. Such occurrence information (i.e., document ID and word position data) for each word is expressed as a list, called the "inverted list", and all the inverted lists taken together are referred to as the inverted index. In addition, A. Dallal in 2014 used EII, which stores for each word its weight, frequency, number of unique words, total weight, etc., but this method needs a large amount of memory (100 MB) and needs time to store the results. For these reasons, these models were not adopted in this work.
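The inverted list described above can be illustrated with a minimal sketch. It is not the implementation used in the cited works: the function name `build_inverted_index`, the whitespace tokenization, and the sample documents are assumptions made only for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to {document ID: [word positions]}.

    `docs` is a dict of document ID -> text. Tokenization by whitespace
    is a simplifying assumption, not part of the cited methods.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            # Each inverted list entry keeps the document ID and the
            # offsets at which the word occurs in that document.
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical two-document collection.
docs = {
    1: "information retrieval systems",
    2: "retrieval of information items",
}
index = build_inverted_index(docs)
print(index["retrieval"])    # {1: [1], 2: [0]}
print(index["information"])  # {1: [0], 2: [2]}
```

Each value in `index` plays the role of one inverted list; all of them together form the inverted index.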

Documents Indexing
This stage consists of two processes: dataset reading and the proposed documents indexing algorithm. Figure 2 shows a block diagram describing the main processes of this stage. The proposed system was implemented using a free dataset for simulation purposes. This dataset contains World Wide Web pages gathered from the computer science departments of various universities. The dataset consists of 8280 semi-structured documents (webpages) written in Hypertext Markup Language (HTML), which were manually classified into seven directories: department, students, staff, faculty, projects, courses, and others (see Table 1). Inside each directory there are five classes, each of which represents a university name (see Table 2). Table 3 shows the document tags. Each of these tags reflects a specific level of importance within the document, and these tags contain essential information close to the terms of the user query. Figure 3 shows a document before and after reading.
\\ compute total weight from equation 1.
- Pages-information table.Total-count-word = summation (W)
- Words-information table.Word = W
- Words-information table.Pages-list = each page that contains the same W

End if
End while
- Store in pages-information table (id, p-name, total-weight, and total-count-word).
- Store in words-information table (word, pages-list).

After this preprocessing, the indexing process begins. The pages-information table is generated by finding the total-tag-weight using equation 1.

$$\text{Total-tag-weight} = \text{Weight}(W)_{\text{tag}} + \sum \text{Weight}(W)_{\text{tag}} \tag{1}$$
Where Weight (W) tag is the weight of word W, taken from the weight of the tag that contains it. Table 4 shows an example of the pages-information table. The words-information table is constructed by first storing the word and then finding the pages-list for this word; Table 5 is an example of the words-information table.
The proposed method is able to index all meaningful terms and then add the information needed for each term. This information is stored in two tables. The first table stores the words and the pages-list of each word. Each pages-list consists of a list of IDs, and each ID consists of three parts (directory number, university number, and page number); this ID makes retrieving related documents faster than in the traditional approach. The second table holds the page information: it stores the name of the page, the total word count, and the total weight of each page. This helps to build the proposed fitness function.
As a result, the memory space of the proposed method is smaller than that of the traditional approach and Ref. [6]. As shown in Figure 5, the traditional approach needs a large memory space: each entry read from a document needs 2 bytes (2*8 = 16 bits), and the dataset used contains 8280 documents. Multiplying the total number of words in the dataset (67,672 words) by the number of documents in the dataset (8280 documents) and by 16 gives 8,965,186,560 bits, or about 1,120 MB. The memory needed to store the index in Ref. [6] is 100 MB, while the proposed documents indexing, in the first stage of the proposed system, needs only 19.9 MB to store the data in memory. This reduction in memory size is achieved by removing every word outside the tags (html, head, sub-headers (h1, h2, and h3), and body). In addition, every word that contains digits, such as operating565, or that contains any special character is removed.
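The storage estimate for the traditional approach can be checked with a short calculation. The sketch below assumes, as stated in the text, that every word-document entry takes 2 bytes (16 bits) and that the product therefore counts bits; the constant and function names are illustrative only.

```python
TOTAL_WORDS = 67_672      # total words in the dataset (from the text)
TOTAL_DOCS = 8_280        # documents in the dataset (from the text)
BITS_PER_ENTRY = 2 * 8    # 2 bytes per entry, expressed in bits


def matrix_size_bits(words, docs, bits_per_entry):
    """Size of a full word-by-document table, one entry per pair."""
    return words * docs * bits_per_entry


bits = matrix_size_bits(TOTAL_WORDS, TOTAL_DOCS, BITS_PER_ENTRY)
size_bytes = bits // 8

print(bits)                                  # 8965186560
print(f"{size_bytes / 1_000_000:.1f} MB")    # 1120.6 MB

# Reported sizes from the text, for comparison (in MB).
REF6_MB = 100
PROPOSED_MB = 19.9
print(f"{REF6_MB / PROPOSED_MB:.1f}x smaller than Ref. [6]")  # 5.0x smaller than Ref. [6]
```

Under these assumptions the full table is roughly 56 times larger than the 19.9 MB reported for the proposed index.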
By using the proposed documents indexing in this paper, the memory space becomes smaller than that of the traditional approach and Ref. [6], because the method works as follows: open the source code of each

Conclusion
In this paper, a method for indexing documents is proposed that can be used to index webpage documents for information retrieval. The simulation results of the proposed algorithm were compared with the traditional algorithm and Ref. [6], and they confirmed the efficiency of the proposed algorithm in terms of storage space and processing time.

Figure 1: general structure of information retrieval

Figure 4 shows a block diagram of the main steps of the proposed documents indexing. The proposed documents indexing process begins with the following process: 1. Special-word table construction by doing the following process:

Figure 4: Block-diagram of documents indexing algorithm

Vol: 13 No: 2, April 2017 DOI: http://dx.doi.org/10.24237/djps.1302.144A P-ISSN: 2222-8373 E-ISSN: 2518-9255
webpage, then determine where the start tag of html, title, head, or body begins, because some webpages do not begin with the html tag, so this case is taken into account. Every word outside these tags is removed, as are special words and sentence delimiters, but the stemming process is not applied. Then each tag is entered and a weight is assigned to each word according to the degree of the tag that contains it, so that the webpage receives a value when the keywords of the query are found in it, which addresses two problems of information retrieval. Then the total word weight and the total word count are computed. Finally, the principle of the pages-list is applied to each word, which is used in the evaluation function (fitness function). The pages-list principle is considered essential in IR systems.
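The steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag weights in `TAG_WEIGHTS`, the special-word set `SPECIAL_WORDS`, the class and function names, and the sample pages are all hypothetical, and the total weight is taken as the sum of the tag weights of the kept words, which is one reading of equation 1.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag weights; the values of the paper's Table 3 are not known here.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2, "body": 1}
# Hypothetical special-word (stop-word) table.
SPECIAL_WORDS = {"a", "and", "of", "the", "to"}
# Keeps purely alphabetic tokens; drops e.g. "operating565" or "c++".
WORD_RE = re.compile(r"[a-z]+")


class WeightedTextParser(HTMLParser):
    """Collect (word, weight) pairs; weight comes from the innermost known tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.weighted_words = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to, and including, the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.stack:
            return  # words outside the weighted tags are discarded
        weight = TAG_WEIGHTS[self.stack[-1]]
        for token in data.lower().split():
            if WORD_RE.fullmatch(token) and token not in SPECIAL_WORDS:
                self.weighted_words.append((token, weight))


def index_pages(pages):
    """Build the pages-information and words-information tables.

    `pages` maps an ID (directory no., university no., page no.)
    to a (page name, HTML source) pair.
    """
    pages_info = {}
    words_info = defaultdict(list)
    for page_id, (name, source) in pages.items():
        parser = WeightedTextParser()
        parser.feed(source)
        parser.close()
        pages_info[page_id] = {
            "p_name": name,
            "total_weight": sum(w for _, w in parser.weighted_words),
            "total_count_word": len(parser.weighted_words),
        }
        for word in sorted({w for w, _ in parser.weighted_words}):
            words_info[word].append(page_id)  # pages-list of the word
    return pages_info, dict(words_info)


# Hypothetical sample pages.
pages = {
    (1, 2, 1): ("index.html",
                "<html><head><title>Database Courses</title></head>"
                "<body><h1>Courses</h1> intro to database systems operating565</body></html>"),
    (3, 1, 7): ("staff.html", "<html><body>database staff</body></html>"),
}
pages_info, words_info = index_pages(pages)
print(pages_info[(1, 2, 1)]["total_weight"])  # 17
print(words_info["database"])                 # [(1, 2, 1), (3, 1, 7)]
```

Looking up a query keyword in `words_info` returns its pages-list directly, and the three-part ID identifies the directory, university, and page without scanning the documents.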

Figure 5: Memory space for document indexing