ISSearch
Interface ISCrawlerInterface

All Superinterfaces:
java.lang.Runnable
All Known Implementing Classes:
ISCrawler

public interface ISCrawlerInterface
extends java.lang.Runnable

Interface of the main Crawler class of the Web search engine. It is used to start and stop the Crawler, to reset the engine, and to control the crawling parameters.

See Also:
Runnable, Thread, InetAddress, URL, HttpURLConnection, InputStreamReader, BufferedReader, Exception

Field Summary
static int RUNNING
          The Running state of the current thread
static int STOPPED
          The Idle state of the current thread
 
Method Summary
 void addLink(java.net.URL link)
          Adds a new link to the URL queue, if the link is not yet visited.
 java.net.URL getBest()
          Returns the best candidate to be visited next.
 java.lang.String getContentType(java.net.URLConnection urlConnection)
          Returns the ContentType of the current document.
 int getCrawlingDepth()
          Returns the current maximum allowed crawling depth.
 ISDocumentInterface getCurrentDocument()
          Returns the last document visited by the Crawler.
 java.net.URL getCurrentURL()
          Returns the last URL visited by the Crawler.
 int getMaxQueueSize()
          Returns the maximum allowed size of the URL queue.
 java.net.URL getNextURL()
          Returns the next URL to be searched.
 int getQueueSize()
          Returns the current size of the URL queue.
 int getState()
          Returns the current state of the crawler.
 int getTimeout()
          Returns the current timeout of the crawler.
 boolean isDataStructureEmpty()
          Checks if our data structure is empty or not.
 boolean isVisited(java.net.URL doc)
          Checks if the URL of the given document is already visited by the crawler.
 void reset()
          Resets the crawler.
 boolean robotSafe(java.net.URL url)
           
 ISDocumentInterface runParser(java.io.Reader r)
          Starts the parser.
 void setCrawlingDepth(int depth)
          Sets the maximum allowed crawling depth.
 void setCurrentDocument(ISDocumentInterface isd)
          Sets the last document visited by the Crawler.
 void setQueueMaxSize(int m)
          Sets the maximum allowed size of the URL queue.
 void setState(int state_code)
          Sets the current state of the crawler.
 void setTimeout(int t)
          Sets the current timeout of the crawler.
 void start()
          Starts the thread of the crawler and changes the engine state to RUNNING.
 void stop()
          Stops the crawler.
 
Methods inherited from interface java.lang.Runnable
run
 

Field Detail

RUNNING

static final int RUNNING
The Running state of the current thread

See Also:
Constant Field Values

STOPPED

static final int STOPPED
The Idle state of the current thread

See Also:
Constant Field Values
Method Detail

start

void start()
Starts the thread of the crawler and changes the engine state to RUNNING.


stop

void stop()
Stops the crawler. This method stops crawling and sets the engine status to STOPPED.


reset

void reset()
Resets the crawler. This method stops the crawling and resets the URL queue and the list of visited links. Finally, it sets the crawler status to STOPPED.
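
The start/stop/reset behaviour described above can be illustrated with a minimal sketch. It is not the ISCrawler implementation: the class name LifecycleSketch, the numeric values of RUNNING and STOPPED, and the use of String instead of java.net.URL are assumptions made only for this example.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of the crawler life cycle; not the actual ISCrawler class.
class LifecycleSketch implements Runnable {
    // Assumed values; the real ones are listed under "Constant Field Values".
    static final int RUNNING = 1;
    static final int STOPPED = 0;

    private volatile int state = STOPPED;
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();
    private Thread worker;

    // Starts the worker thread and switches the state to RUNNING.
    synchronized void start() {
        if (state == RUNNING) {
            return;
        }
        state = RUNNING;
        worker = new Thread(this, "crawler-sketch");
        worker.start();
    }

    // Switches the state to STOPPED; the worker loop exits on its next check.
    synchronized void stop() {
        state = STOPPED;
        if (worker != null) {
            worker.interrupt();
        }
    }

    // Stops crawling, then clears the URL queue and the visited list.
    synchronized void reset() {
        stop();
        queue.clear();
        visited.clear();
    }

    synchronized void addLink(String link) {
        if (!visited.contains(link)) {
            queue.add(link);
        }
    }

    synchronized int getQueueSize() {
        return queue.size();
    }

    int getState() {
        return state;
    }

    @Override
    public void run() {
        while (state == RUNNING) {
            synchronized (this) {
                String next = queue.poll();
                if (next != null) {
                    visited.add(next); // a real crawler would fetch and parse here
                }
            }
            try {
                Thread.sleep(10); // avoid busy-waiting on an empty queue
            } catch (InterruptedException e) {
                return; // stop() interrupts the worker
            }
        }
    }
}
```

In this sketch stop() does not wait for the worker thread to finish; an implementation that restarts quickly after stopping would need to handle that case.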


addLink

void addLink(java.net.URL link)
Adds a new link to the URL queue, if the link is not yet visited.

Parameters:
link - The URL link representation of the new target
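
A possible shape for the "add only if not yet visited" rule is sketched below. The class LinkQueueSketch is hypothetical. Two choices are assumptions of this example and not statements about ISCrawler: links are stored as java.net.URI (java.net.URL.equals and hashCode may resolve host names, which is expensive inside a hash set), and a link is silently dropped when the queue is full. The boolean return value exists only to make the outcome visible; the interface method returns void.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of a URL queue with a visited set and a size limit.
class LinkQueueSketch {
    private final Queue<URI> queue = new ArrayDeque<>();
    private final Set<URI> visited = new HashSet<>();
    private int maxQueueSize = 100; // assumed default

    void setQueueMaxSize(int m) {
        maxQueueSize = m;
    }

    int getQueueSize() {
        return queue.size();
    }

    boolean isVisited(URI link) {
        return visited.contains(link);
    }

    // Queues the link unless it was already visited or the queue is full.
    boolean addLink(URI link) {
        if (visited.contains(link) || queue.size() >= maxQueueSize) {
            return false;
        }
        queue.add(link);
        return true;
    }

    // Takes the next link from the queue and marks it as visited.
    URI getNextURL() {
        URI next = queue.poll();
        if (next != null) {
            visited.add(next);
        }
        return next;
    }
}
```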

getState

int getState()
Returns the current state of the crawler. Possible states are RUNNING and STOPPED.

Returns:
The current state of the crawler, RUNNING or STOPPED

setState

void setState(int state_code)
Sets the current state of the crawler. Possible states are RUNNING and STOPPED.

Parameters:
state_code - The new state of the crawler, RUNNING or STOPPED

getTimeout

int getTimeout()
Returns the current timeout of the crawler.

Returns:
The current timeout of the crawler.

setTimeout

void setTimeout(int t)
Sets the current timeout of the crawler.

Parameters:
t - The timeout of the crawler in ms.
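
How a timeout in milliseconds might be applied to a connection can be sketched as follows. The class TimeoutSketch, its default value, and the decision to use the same value for the connect and the read timeout are assumptions of this example; the documentation does not say how ISCrawler uses the value.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical sketch: applying a crawler timeout (in ms) to a URLConnection.
class TimeoutSketch {
    private int timeoutMs = 5000; // assumed default

    void setTimeout(int t) {
        if (t < 0) {
            throw new IllegalArgumentException("timeout must not be negative");
        }
        timeoutMs = t;
    }

    int getTimeout() {
        return timeoutMs;
    }

    // Creates a connection object with both timeouts set; it does not connect yet.
    URLConnection open(String spec) {
        try {
            URLConnection connection = URI.create(spec).toURL().openConnection();
            connection.setConnectTimeout(timeoutMs);
            connection.setReadTimeout(timeoutMs);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```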

getQueueSize

int getQueueSize()
Returns the current size of the URL queue.

Returns:
The current size of the URL queue.

setQueueMaxSize

void setQueueMaxSize(int m)
Sets the maximum allowed size of the URL queue.

Parameters:
m - The maximum allowed queue size.

getMaxQueueSize

int getMaxQueueSize()
Returns the maximum allowed size of the URL queue.

Returns:
The maximum allowed queue size.

setCrawlingDepth

void setCrawlingDepth(int depth)
Sets the maximum allowed crawling depth.

Parameters:
depth - The maximum allowed crawling depth.

getCrawlingDepth

int getCrawlingDepth()
Returns the current maximum allowed crawling depth.

Returns:
The current maximum allowed crawling depth.
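
The effect of a maximum crawling depth can be shown on an in-memory link graph. This is only a sketch: the class DepthLimitSketch is hypothetical, and it assumes that the seed has depth 0 and that the depth counts link hops from the seed, which the documentation does not specify.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: breadth-first traversal that honours a depth limit.
class DepthLimitSketch {
    // links maps a page to its outgoing links; returns pages in visiting order.
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<String> urls = new ArrayDeque<>();
        Queue<Integer> depths = new ArrayDeque<>(); // depth of the URL at the same position

        urls.add(seed);
        depths.add(0);
        seen.add(seed);

        while (!urls.isEmpty()) {
            String url = urls.poll();
            int depth = depths.poll();
            order.add(url);
            if (depth >= maxDepth) {
                continue; // do not follow links beyond the allowed depth
            }
            for (String out : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(out)) { // add() returns false for known links
                    urls.add(out);
                    depths.add(depth + 1);
                }
            }
        }
        return order;
    }
}
```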

getBest

java.net.URL getBest()
Returns the best candidate to be visited next. The result must have the highest priority (in the sense of the selected ordering strategy) among all available links.

Returns:
The best target to be visited by the Crawler next, null if the queue is empty.
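
One way to pick the highest-priority link under a pluggable ordering strategy is a priority queue driven by a comparator. The sketch below is an assumption-laden illustration: the class BestFirstSketch, the Candidate type, the numeric score, and the choice to remove the returned link from the queue are not taken from ISCrawler.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch: best-first selection with an exchangeable ordering strategy.
class BestFirstSketch {
    static final class Candidate {
        final String url;
        final double score;

        Candidate(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    private final PriorityQueue<Candidate> queue;

    BestFirstSketch(Comparator<Candidate> strategy) {
        // PriorityQueue returns the smallest element according to the comparator.
        queue = new PriorityQueue<>(strategy);
    }

    // Example strategy: the candidate with the highest score comes first.
    static Comparator<Candidate> highestScoreFirst() {
        return Comparator.comparingDouble((Candidate c) -> c.score).reversed();
    }

    void add(String url, double score) {
        queue.add(new Candidate(url, score));
    }

    // Returns and removes the best candidate, or null if the queue is empty.
    String getBest() {
        Candidate best = queue.poll();
        return best == null ? null : best.url;
    }

    boolean isDataStructureEmpty() {
        return queue.isEmpty();
    }
}
```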

isVisited

boolean isVisited(java.net.URL doc)
Checks if the URL of the given document is already visited by the crawler.

Returns:
true if the engine recognizes the given URL as already visited, false otherwise.

getCurrentDocument

ISDocumentInterface getCurrentDocument()
Returns the last document visited by the Crawler.

Returns:
The last visited document as an object that implements ISDocumentInterface (and contains all extracted links, words and their stems); null if no documents were crawled yet.

setCurrentDocument

void setCurrentDocument(ISDocumentInterface isd)
Sets the last document visited by the Crawler.

Parameters:
isd - The last visited document as an object that implements ISDocumentInterface (and contains all extracted links, words and their stems).

getCurrentURL

java.net.URL getCurrentURL()
Returns the last URL visited by the Crawler.

Returns:
The last visited URL; null if no links were crawled yet.

getNextURL

java.net.URL getNextURL()
Returns the next URL to be searched. This method does the work for getBest().

Returns:
The next URL in the queue.

isDataStructureEmpty

boolean isDataStructureEmpty()
Checks if our data structure is empty or not.

Returns:
true if the data structure is empty, false otherwise.

runParser

ISDocumentInterface runParser(java.io.Reader r)
Starts the parser.

Parameters:
r - The Reader which contains the URL to be parsed.
Returns:
The parsed document as an object that implements ISDocumentInterface.
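
Parsing from a Reader can be sketched as follows. This is not the ISSearch parser: the class ParserSketch, the ParsedDocument type (a stand-in for ISDocumentInterface, whose methods are not shown here), and the regular expressions are assumptions of this example, and stemming is left out. A regular expression is a simplification; a production parser would use a real HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extracts links and words from HTML supplied by a Reader.
class ParserSketch {
    // Stand-in for ISDocumentInterface; holds links and words only (no stems).
    static final class ParsedDocument {
        final List<String> links;
        final List<String> words;

        ParsedDocument(List<String> links, List<String> words) {
            this.links = links;
            this.words = words;
        }
    }

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static ParsedDocument runParser(Reader r) {
        // Read the whole document into memory.
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(r)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collect the values of double-quoted href attributes.
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(text);
        while (m.find()) {
            links.add(m.group(1));
        }

        // Strip tags, lower-case the rest, and split on non-alphanumeric characters.
        List<String> words = new ArrayList<>();
        String plain = TAG.matcher(text).replaceAll(" ").toLowerCase(Locale.ROOT);
        for (String w : plain.split("[^\\p{L}\\p{Nd}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return new ParsedDocument(links, words);
    }
}
```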

robotSafe

boolean robotSafe(java.net.URL url)

getContentType

java.lang.String getContentType(java.net.URLConnection urlConnection)
Returns the ContentType of the current document.

Parameters:
urlConnection - The current document given by a URLConnection.
Returns:
The ContentType of the current document as String.
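
A minimal sketch of reading the content type from a URLConnection is given below. URLConnection.getContentType() is the standard library call; the class ContentTypeSketch and the normalization step (dropping parameters such as the charset and lower-casing the result) are assumptions of this example and may differ from what ISCrawler returns.

```java
import java.net.URLConnection;
import java.util.Locale;

// Hypothetical sketch: returns the bare MIME type of a connection's document.
class ContentTypeSketch {
    static String getContentType(URLConnection urlConnection) {
        // May be null when the content type is unknown; may open the connection.
        String raw = urlConnection.getContentType();
        if (raw == null) {
            return null;
        }
        // Drop parameters, e.g. "text/html; charset=UTF-8" becomes "text/html".
        int semicolon = raw.indexOf(';');
        String mime = semicolon >= 0 ? raw.substring(0, semicolon) : raw;
        return mime.trim().toLowerCase(Locale.ROOT);
    }
}
```

A caller could use the result to skip documents the parser cannot handle, for example anything that is not "text/html".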