Signed Approach for Mining Web Content Outliers
Post: #1

The emergence of the Internet has revolutionized information storage and retrieval. As most of the data on the web is unstructured and contains a mix of text, video, audio, etc., there is a need to mine information that caters to the specific needs of users without losing important hidden information. Developing user-friendly, automated tools that provide relevant information quickly is therefore a major challenge in web mining research. Most existing web mining algorithms concentrate on finding frequent patterns and neglect the less frequent ones, which are likely to contain outlying data such as noise, irrelevant data and redundant data. This paper focuses on a Signed approach with full word matching against an organized domain dictionary for mining web content outliers. The Signed approach yields both the relevant web documents and the outlying web documents. Because the dictionary is organized by the number of characters in a word, searching and retrieving documents takes less time and less space.

Presented By
G. Poonkuzhali, K. Thiagarajan, K. Sarukesi and G. V. Uma


G. Poonkuzhali is Assistant Professor in the Department of Computer Science and Engineering, Rajalakshmi Engineering College, affiliated to Anna University Chennai, Tamil Nadu, India, phone: 9444836861, email: Kuzhal_s[at]
K. Thiagarajan is Senior Lecturer in the Department of Mathematics, Rajalakshmi Engineering College, affiliated to Anna University Chennai, Tamil Nadu, India, email: vidhyamannan[at]
K. Sarukesi is Vice Chancellor of Hindusthan University, Chennai, email: profsaru[at]
G. V. Uma is Professor in the Department of Computer Science and Engineering, Anna University-Chennai, email: gvuma[at]

With the exponential growth of information available on the Internet, updating incoming data and retrieving relevant information from the web quickly and efficiently is a growing concern. Most web search engines employ conventional information retrieval and data mining techniques to automatically discover useful and previously unknown information from web content. With the enormous growth of the web, users easily get lost in its rich hyperlink structure. In addition, as most of the data on the web is unstructured and contains a mix of text, video, audio, etc., there is a need to mine information that caters to the specific needs of users [9]. Efforts are being made to make such data available, usually in some structured form, such as matrix form, for further manipulation. Web mining is an emerging research area focused on resolving these problems. The proposed work aims to develop a new methodology for mining useful knowledge from web documents quickly and effectively.

In general, web mining tasks can be classified into three major categories: web structure mining, web usage mining and web content mining. Web structure mining tries to discover useful knowledge from the structure of hyperlinks. Web usage mining refers to the discovery of user access patterns from web usage logs.
Web content mining aims to extract useful information from web pages based on their contents [1],[4],[10],[11]. Web content mining techniques fall into two groups: those that directly mine the content of documents and those that improve on the content search of other tools such as search engines. The data for web content mining can be image, audio, text and video [15]-[16]. Existing web mining algorithms do not consider documents whose contents vary from others in the same category; such documents are called web content outliers.

Generally, outliers are data that clearly deviate from the rest, do not follow the general mode or behavior of the data, and are inconsistent with the other existing data. Outliers may also reflect true properties of the data, such as a rare weather disaster recorded in a meteorological database, where one or more attributes have values that deviate seriously from the normal values. Such data may contain more valuable information than normal data. Research on outlier detection broadly falls into the following categories:

A. Distribution-based methods come from the statistics community. These methods assume a known distribution model and flag as outliers the points that deviate from the model.
B. Depth-based algorithms organize objects in convex hull layers in the data space according to peeling depth; outliers are expected to have shallow depth values [13].
C. Deviation-based techniques detect outliers by examining the characteristics of the objects and identify an object that deviates from these characteristics as an outlier.
D. Distance-based algorithms rank all points by the distance of each point from its k-th nearest neighbor. The top n points in the ranked list are identified as outliers. Alternative approaches compute the outlier factor as the sum of distances from the k nearest neighbors.
E. Density-based methods rely on the local outlier factor (LOF) of each point, which depends on the local density of its neighborhood. Points with a high factor are flagged as outliers.

Unlike traditional outlier mining algorithms, which are designed only for numeric data sets, a web outlier mining algorithm should be applicable to various types of data, including text, hypertext, image, video, etc. Web pages whose contents differ from the category from which they were taken constitute web content outliers [7]-[8]. Web content outlier mining concentrates on finding outliers such as noise, irrelevant pages and redundant pages among web documents [10]-[11]. It can also be used to determine pages whose contents are entirely different from those of their parent web sites.

In the proposed system, web documents are retrieved from search engines in response to a query given by the user. The retrieved set of web documents D is then preprocessed: stop words are removed, words are stemmed, and all non-text data such as hyperlinks, sound and images is discarded. The output is a set of documents consisting of whitespace-separated words, indexed in a two-dimensional format (i,j), where 'i' represents the web page and 'j' represents the word. Thus the first word of the first web page is indexed as (1,1), the second word of the first page as (1,2), and so on. The domain dictionary is arranged so that all 1-letter words are indexed first, followed by 2-letter words, then 3-letter words, and so on up to 15-letter words, which is a reasonable upper bound on the number of characters in a word. Each page is mined individually to detect relevant and irrelevant documents using the Signed approach. Finally, the relevant web documents are obtained, which contain the required information catering to the user's needs.
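
The preprocessing, (i,j) indexing and length-organized dictionary described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the stop-word list, the regular expressions and all function names (`preprocess`, `index_documents`, `build_dictionary`, `in_dictionary`) are assumptions made for the example, stemming is left out, and the Signed approach itself is not shown because its details are not given in this excerpt.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the text does not specify one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "on", "for"}
MAX_WORD_LEN = 15  # upper bound on word length stated in the text


def preprocess(html):
    """Strip markup and stop words; return the remaining lowercase words."""
    # Drop script/style blocks, then any remaining tags (hyperlinks, images, ...).
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    words = re.findall(r"[a-z]+", text.lower())
    # Stemming is omitted in this sketch; a real stemmer would be applied here.
    return [w for w in words if w not in STOP_WORDS]


def index_documents(pages):
    """Map (i, j) -> word, with i the 1-based page number and j the word position."""
    index = {}
    for i, page in enumerate(pages, start=1):
        for j, word in enumerate(preprocess(page), start=1):
            index[(i, j)] = word
    return index


def build_dictionary(domain_words):
    """Bucket domain words by length (1 to 15 letters)."""
    buckets = defaultdict(set)
    for w in domain_words:
        w = w.lower()
        if 1 <= len(w) <= MAX_WORD_LEN:
            buckets[len(w)].add(w)
    return buckets


def in_dictionary(word, buckets):
    """Full-word match: only the bucket for len(word) is searched."""
    return word in buckets.get(len(word), ())


# Example usage with made-up pages and a made-up domain dictionary.
pages = [
    "<p>The mining of <a href='x'>web</a> outliers</p>",
    "<p>Cooking recipes</p>",
]
index = index_documents(pages)
buckets = build_dictionary(["web", "mining", "outliers", "data"])
matches = {key: in_dictionary(word, buckets) for key, word in index.items()}
```

Grouping the dictionary by word length means a lookup for an n-letter word only touches the n-letter bucket, which is one plausible reading of why the text says searching takes less time. How the per-word matches are combined into a relevant or outlier decision for each page is the job of the Signed approach and is outside this sketch.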

Post: #2
I want to implement this paper, but I have some problems and questions:
How can I preprocess the web content?
How can I obtain a dataset, and how do I use it?
Please help me.

Thank you.
Post: #3
You can refer to the page details of "Signed Approach for Mining Web Content Outliers" at the link below.
