ISSearch
Class ISDBCrawler

java.lang.Object
  extended by ISSearch.ISDBCrawler
All Implemented Interfaces:
ISCrawlerInterface, ISDBCrawlerInterface, java.lang.Runnable

public class ISDBCrawler
extends java.lang.Object
implements ISDBCrawlerInterface

The Crawler class of the Web search engine. This class is used to start and stop the Crawler, to reset the engine and to control crawling parameters.
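A typical use of this class, based only on the methods documented below, might look as follows. This is an illustrative sketch, not a verified example: it assumes the ISSearch classes are on the classpath, that the seed URL is reachable, and that the timeout value is in the unit the implementation expects.

```java
import java.net.URL;
import ISSearch.ISDBCrawler;

ISDBCrawler crawler = new ISDBCrawler();
crawler.setCrawlingDepth(2);      // limit how deep the crawl may go
crawler.setTimeout(5000);         // hypothetical value; unit not specified by this documentation
crawler.setQueueMaxSize(500);
if (crawler.openDB()) {           // open the built-in database connection
    crawler.addLink(new URL("http://example.com/"));  // seed URL
    crawler.start();              // engine state becomes RUNNING
    // ... monitor progress via crawler.getState() / crawler.getQueueSize() ...
    crawler.stop();               // engine state becomes STOPPED
    crawler.closeDB();
}
```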


Field Summary
 
Fields inherited from interface ISSearch.ISCrawlerInterface
RUNNING, STOPPED
 
Constructor Summary
ISDBCrawler()
          Creates a new instance of ISDBCrawler.
 
Method Summary
 void addLink(java.net.URL link)
          Adds a new link to the URL queue, if the link is not yet visited.
 void closeDB()
          Closes the database connection of the built-in database interface.
 java.net.URL getBest()
          Returns the best candidate to be visited next.
 java.lang.String getContentType(java.net.URLConnection urlConnection)
          Returns the ContentType of the current document.
 int getCrawlingDepth()
          Returns the current maximum allowed crawling depth.
 ISDocumentInterface getCurrentDocument()
          Returns the last document visited by the Crawler.
 java.net.URL getCurrentURL()
          Returns the last URL visited by the Crawler.
 ISDBinterface getDBInterface()
          Returns the built-in database interface of the crawler.
 int getMaxQueueSize()
          Returns the maximum allowed size of the URL queue.
 java.net.URL getNextURL()
          Returns the next URL to be searched.
 int getQueueSize()
          Returns the current size of the URL queue.
 int getState()
          Returns the current state of the crawler.
 int getTimeout()
          Returns the current timeout of the crawler.
 boolean isDataStructureEmpty()
          Checks whether the data structure is empty.
 boolean isVisited(java.net.URL doc)
          Checks if the URL of the given document is already visited by the crawler.
 boolean openDB()
          Initializes the internal database interface and opens its database connection.
 void reset()
          Resets the crawler.
 boolean robotSafe(java.net.URL url)
          Checks whether a robots.txt exists on the server and whether it contains a "Disallow:" entry.
 void run()
           
 ISDocumentInterface runParser(java.io.Reader r)
          Starts the parser.
 void setCrawlingDepth(int depth)
          Sets the maximum allowed crawling depth.
 void setCurrentDocument(ISDocumentInterface isd)
          Sets the last document visited by the Crawler.
 void setQueueMaxSize(int m)
          Sets the maximum allowed size of the URL queue.
 void setState(int state_code)
          Sets the current state of the crawler.
 void setTimeout(int t)
          Sets the current timeout of the crawler.
 void start()
          Starts the thread of the crawler and changes the engine state to RUNNING.
 void stop()
          Stops the crawler.
 boolean store(java.net.URL link, ISDocumentInterface doc)
          Stores the crawled document and its URL into the database.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ISDBCrawler

public ISDBCrawler()
Creates a new instance of ISDBCrawler.

Method Detail

store

public boolean store(java.net.URL link,
                     ISDocumentInterface doc)
Stores the crawled document and its URL into the database.

Specified by:
store in interface ISDBCrawlerInterface
Parameters:
link - the URL of the crawled document
doc - extracted terms and links from the document
Returns:
true if the storage was successful; false otherwise.

openDB

public boolean openDB()
Initializes the internal database interface and opens its database connection.

Specified by:
openDB in interface ISDBCrawlerInterface
Returns:
true if the connection to the database was successful; false otherwise.

closeDB

public void closeDB()
Closes the database connection of the built-in database interface.

Specified by:
closeDB in interface ISDBCrawlerInterface

getDBInterface

public ISDBinterface getDBInterface()
Returns the built-in database interface of the crawler.

Specified by:
getDBInterface in interface ISDBCrawlerInterface
Returns:
the database interface of the crawler

addLink

public void addLink(java.net.URL link)
Adds a new link to the URL queue, if the link is not yet visited.

Specified by:
addLink in interface ISCrawlerInterface
Parameters:
link - The URL link representation of the new target

getBest

public java.net.URL getBest()
Returns the best candidate to be visited next. The result must have the highest priority (in the sense of the selected ordering strategy) under all available links.

Specified by:
getBest in interface ISCrawlerInterface
Returns:
The best target to be visited by the Crawler next, null if the queue is empty.
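The pair addLink()/getBest() describes a classic crawler frontier: a queue of pending URLs plus a set of already-seen URLs. The sketch below shows the kind of logic involved; the class and method names are illustrative only and not part of the ISSearch API. It keys the visited set on URL strings rather than java.net.URL objects, because URL.equals() can trigger DNS resolution, and it uses FIFO order (a breadth-first crawl); an ordering strategy as described for getBest() would use a priority queue instead.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical frontier sketch: pending queue plus seen set, capped in size.
class UrlFrontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private int maxQueueSize = 1000;

    // Mirrors addLink(): enqueue only URLs never seen before, respecting the cap.
    // URLs are marked as seen at enqueue time, so duplicates are never queued twice.
    boolean addLink(String url) {
        if (seen.contains(url) || queue.size() >= maxQueueSize) {
            return false;
        }
        queue.add(url);
        seen.add(url);
        return true;
    }

    // Mirrors getBest()/getNextURL(): next candidate, or null if the queue is empty.
    String getBest() {
        return queue.poll();
    }

    // Mirrors isVisited(): has this URL already been enqueued or crawled?
    boolean isVisited(String url) {
        return seen.contains(url);
    }
}
```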

getCrawlingDepth

public int getCrawlingDepth()
Returns the current maximum allowed crawling depth.

Specified by:
getCrawlingDepth in interface ISCrawlerInterface
Returns:
The current maximum allowed crawling depth.

getCurrentDocument

public ISDocumentInterface getCurrentDocument()
Returns the last document visited by the Crawler.

Specified by:
getCurrentDocument in interface ISCrawlerInterface
Returns:
The last visited document as object that implements ISDocumentInterface (and contains all extracted links, words and their stems); null if no documents were crawled yet.

setCurrentDocument

public void setCurrentDocument(ISDocumentInterface isd)
Description copied from interface: ISCrawlerInterface
Sets the last document visited by the Crawler.

Specified by:
setCurrentDocument in interface ISCrawlerInterface

getCurrentURL

public java.net.URL getCurrentURL()
Returns the last URL visited by the Crawler.

Specified by:
getCurrentURL in interface ISCrawlerInterface
Returns:
The last visited URL; null if no links were crawled yet.

getMaxQueueSize

public int getMaxQueueSize()
Returns the maximum allowed size of the URL queue.

Specified by:
getMaxQueueSize in interface ISCrawlerInterface
Returns:
The maximum allowed queue size.

getQueueSize

public int getQueueSize()
Returns the current size of the URL queue

Specified by:
getQueueSize in interface ISCrawlerInterface
Returns:
The current size of the URL queue.

getState

public int getState()
Returns the current state of the crawler. Possible states are RUNNING and STOPPED.

Specified by:
getState in interface ISCrawlerInterface
Returns:
The current state of the crawler: RUNNING or STOPPED.

setState

public void setState(int state_code)
Description copied from interface: ISCrawlerInterface
Sets the current state of the crawler. Possible states are RUNNING and STOPPED.

Specified by:
setState in interface ISCrawlerInterface

getTimeout

public int getTimeout()
Description copied from interface: ISCrawlerInterface
Returns the current timeout of the crawler.

Specified by:
getTimeout in interface ISCrawlerInterface
Returns:
The current timeout of the crawler.

setTimeout

public void setTimeout(int t)
Description copied from interface: ISCrawlerInterface
Sets the current timeout of the crawler.

Specified by:
setTimeout in interface ISCrawlerInterface

isVisited

public boolean isVisited(java.net.URL doc)
Checks if the URL of the given document is already visited by the crawler.

Specified by:
isVisited in interface ISCrawlerInterface
Returns:
true if the given URL was already visited by the crawler; false otherwise.

setCrawlingDepth

public void setCrawlingDepth(int depth)
Sets the maximum allowed crawling depth.

Specified by:
setCrawlingDepth in interface ISCrawlerInterface
Parameters:
depth - The maximum allowed crawling depth.

setQueueMaxSize

public void setQueueMaxSize(int m)
Sets the maximum allowed size of the URL queue.

Specified by:
setQueueMaxSize in interface ISCrawlerInterface
Parameters:
m - The maximum allowed queue size.

start

public void start()
Starts the thread of the crawler and changes the engine state to RUNNING.

Specified by:
start in interface ISCrawlerInterface

stop

public void stop()
Stops the crawler. This method stops crawling and sets the engine status to STOPPED.

Specified by:
stop in interface ISCrawlerInterface

reset

public void reset()
Resets the crawler. This method stops crawling and resets the URL queue and the list of visited links. Finally, it sets the crawler status to STOPPED.

Specified by:
reset in interface ISCrawlerInterface

run

public void run()
Specified by:
run in interface java.lang.Runnable

getNextURL

public java.net.URL getNextURL()
Description copied from interface: ISCrawlerInterface
Returns the next URL to be searched. It delegates the selection to getBest().

Specified by:
getNextURL in interface ISCrawlerInterface
Returns:
The next URL in the queue.

isDataStructureEmpty

public boolean isDataStructureEmpty()
Description copied from interface: ISCrawlerInterface
Checks whether the data structure is empty.

Specified by:
isDataStructureEmpty in interface ISCrawlerInterface
Returns:
true if the data structure is empty; false otherwise.

runParser

public ISDocumentInterface runParser(java.io.Reader r)
Description copied from interface: ISCrawlerInterface
Starts the parser.

Specified by:
runParser in interface ISCrawlerInterface
Returns:
The parsed document as an ISDocumentInterface object.

getContentType

public java.lang.String getContentType(java.net.URLConnection urlConnection)
Description copied from interface: ISCrawlerInterface
Returns the ContentType of the current document.

Specified by:
getContentType in interface ISCrawlerInterface
Parameters:
urlConnection - The current document given by a URLConnection.
Returns:
The ContentType of the current document as String.
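The value a URLConnection reports for the "Content-Type" header often carries parameters (e.g. "text/html; charset=UTF-8"), while a crawler usually only needs the bare MIME type to decide whether to parse a document. The helper below is a hypothetical sketch of that normalization step, not part of the ISSearch API.

```java
// Hypothetical helper: reduce a raw Content-Type header value to its MIME type.
class ContentTypeUtil {
    static String mimeType(String rawContentType) {
        if (rawContentType == null) {
            return null;  // URLConnection.getContentType() may return null
        }
        // Cut off parameters such as "; charset=UTF-8", then normalize case.
        int semicolon = rawContentType.indexOf(';');
        String mime = semicolon >= 0 ? rawContentType.substring(0, semicolon) : rawContentType;
        return mime.trim().toLowerCase();
    }
}
```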

robotSafe

public boolean robotSafe(java.net.URL url)
Description copied from interface: ISCrawlerInterface
Checks whether a robots.txt exists on the server and whether it contains a "Disallow:" entry.

Specified by:
robotSafe in interface ISCrawlerInterface
Parameters:
url - URL which should be checked.
Returns:
true if the document may be fetched and parsed.
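The core of the check robotSafe() describes can be sketched as follows. This hypothetical version only scans an already-downloaded robots.txt body for "Disallow:" rules that prefix the requested path; fetching the file from "http://host/robots.txt" and honoring User-agent sections and wildcards, which a production crawler must do, are omitted.

```java
// Hypothetical sketch of a robots.txt check (not the ISSearch implementation).
class RobotsCheck {
    // Returns true if no "Disallow:" rule in robotsTxt prefixes the given path.
    static boolean robotSafe(String robotsTxt, String path) {
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("disallow:")) {
                String rule = trimmed.substring("disallow:".length()).trim();
                // An empty Disallow rule means "allow everything".
                if (!rule.isEmpty() && path.startsWith(rule)) {
                    return false;
                }
            }
        }
        return true;  // no matching rule: the path may be crawled
    }
}
```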