Web Forum Crawling using Index Thread Page Flipping Algorithm
Published: 2014
Author(s) Name: A. Anny Leema, P. Iswarya
Author(s) Affiliation: Department of Computer Applications, B.S. Abdur Rahman University, Chennai, Tamil Nadu, India.
Abstract
Internet forums are important platforms where users can post requests and exchange information from different sources. The issue in existing systems is the URL type recognition problem: forums contain duplicate links and uninformative pages, and a crawler must decide which URLs are worth following. The Index Thread Page Flipping (ITF) algorithm is used to overcome this issue. URL layout and page layout are used to recognize whether a URL is valid or invalid. This project (Phase-I) provides web forum crawling using the ITF algorithm, with the goal of crawling only relevant content. The crawler learns the correct path or URL by using regular expression patterns together with training sets created from page type classifiers. The modules implemented are the user interface design module, the page flipping module, the entry URL discovery module, the index/thread URL detection module, and the generic crawler module. In the user interface design module, users must give their user name and password to connect to the server. In the page flipping module, a long forum is split across multiple pages that are linked by page-flipping links; generic crawlers process each page individually and ignore the relationships between such pages. In the entry URL discovery module, an entry URL must be specified before the process can start, and a set of rules is defined to find it. In the index and thread URL detection module, index URLs and thread URLs are identified by their URL patterns. In the generic crawler module, given a forum, the crawler enters the thread pages and crawls them while avoiding duplicate links and page-flipping links. The front end for all the modules in the project (Phase-I) is designed using Eclipse, and the back end is designed using SQL Server 2005. The two modules in the project (Phase-I) are implemented using Java Servlet and JSP, with the code-behind written in Java. The main benefit of this project (Phase-I) is that it saves bandwidth and time.
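The URL-pattern idea described in the abstract can be illustrated with a minimal Java sketch. It is not the authors' implementation. The class name `ForumUrlSketch`, the URL shapes (`/forum/<id>`, `/thread/<id>`, `?page=<n>`) and the fragment-stripping rule are all assumptions made for illustration, and the patterns are hard-coded here rather than learned from training sets. The sketch labels each link as an index, thread, or page-flipping URL, discards links that match no pattern, and keeps only the first occurrence of each link.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class ForumUrlSketch {

    public enum UrlType { INDEX, THREAD, PAGE_FLIPPING, OTHER }

    // Hypothetical URL patterns for an imaginary forum. A real crawler would
    // learn such patterns from training sets instead of hard-coding them.
    private static final Pattern INDEX = Pattern.compile("/forum/\\d+");
    private static final Pattern THREAD = Pattern.compile("/thread/\\d+");
    private static final Pattern PAGE_FLIP =
            Pattern.compile("/(forum|thread)/\\d+\\?page=\\d+");

    // Labels a URL path by the first pattern that matches the whole path.
    public static UrlType classify(String path) {
        if (PAGE_FLIP.matcher(path).matches()) return UrlType.PAGE_FLIPPING;
        if (INDEX.matcher(path).matches()) return UrlType.INDEX;
        if (THREAD.matcher(path).matches()) return UrlType.THREAD;
        return UrlType.OTHER;
    }

    // Assumed normalization rule: drop the fragment, so "/thread/7#post3"
    // and "/thread/7" are treated as the same link.
    public static String normalize(String path) {
        int hash = path.indexOf('#');
        return hash >= 0 ? path.substring(0, hash) : path;
    }

    // Keeps the first occurrence of every recognized link, in input order,
    // and drops duplicates and links that match no pattern.
    public static Map<String, UrlType> filter(List<String> links) {
        Map<String, UrlType> kept = new LinkedHashMap<>();
        for (String link : links) {
            String path = normalize(link);
            UrlType type = classify(path);
            if (type != UrlType.OTHER) {
                kept.putIfAbsent(path, type);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "/forum/12", "/thread/7", "/thread/7#post3",
                "/thread/7?page=2", "/login", "/forum/12");
        filter(links).forEach((url, type) -> System.out.println(type + " " + url));
    }
}
```

Because page-flipping links get their own label, a caller can skip them, as the generic crawler module in the abstract does, or follow them to reach later pages of the same thread.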
Keywords: Forum Crawling, Index URL, Thread URL, Page Flipping URL