java.lang.Object
  ISSearch.ISDBCrawler
public class ISDBCrawler
The Crawler class of the Web search engine. This class is used to start and stop the Crawler, to reset the engine and to control crawling parameters.
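The start/stop life cycle described above can be illustrated with a small stand-in class. This is only a sketch under stated assumptions: `LifecycleSketch`, the numeric values of its constants, and the loop body are hypothetical, since the ISDBCrawler source is not part of this documentation.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a crawler life cycle; not the ISDBCrawler source. */
class LifecycleSketch implements Runnable {
    // Assumed values; the real RUNNING and STOPPED constants come from ISCrawlerInterface.
    static final int RUNNING = 1;
    static final int STOPPED = 0;

    private final AtomicInteger state = new AtomicInteger(STOPPED);
    private final AtomicInteger visited = new AtomicInteger();
    private Thread worker;

    /** Starts the worker thread and switches the state to RUNNING (no-op if already running). */
    synchronized void start() {
        if (state.compareAndSet(STOPPED, RUNNING)) {
            worker = new Thread(this, "crawler-sketch");
            worker.start();
        }
    }

    /** Switches the state to STOPPED and waits for the worker thread to finish. */
    void stop() {
        Thread t;
        synchronized (this) {
            state.set(STOPPED);
            t = worker;
        }
        if (t != null) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            }
        }
    }

    /** Stops the worker and clears the counter. */
    void reset() {
        stop();
        visited.set(0);
    }

    int getState() { return state.get(); }

    int getVisitedCount() { return visited.get(); }

    @Override
    public void run() {
        // Placeholder for "take next URL, fetch, parse, store"; here we only count iterations.
        while (state.get() == RUNNING) {
            visited.incrementAndGet();
            Thread.yield();
        }
    }
}
```

A caller would typically invoke `start()`, let the worker run for a while, and then call `stop()` or `reset()`; the state can be polled with `getState()` at any time.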
| Field Summary |
|---|
| Fields inherited from interface ISSearch.ISCrawlerInterface |
|---|
RUNNING, STOPPED |
| Constructor Summary | |
|---|---|
ISDBCrawler()
Creates a new instance of ISDBCrawler |
|
| Method Summary | |
|---|---|
void |
addLink(java.net.URL link)
Adds a new link to the URL queue, if the link is not yet visited. |
void |
closeDB()
Closes the database connection of the built-in database interface. |
java.net.URL |
getBest()
Returns the best candidate to be visited next. |
java.lang.String |
getContentType(java.net.URLConnection urlConnection)
Returns the ContentType of the current document. |
int |
getCrawlingDepth()
Returns the current maximum allowed crawling depth. |
ISDocumentInterface |
getCurrentDocument()
Returns the last document visited by the Crawler. |
java.net.URL |
getCurrentURL()
Returns the last URL visited by the Crawler. |
ISDBinterface |
getDBInterface()
Returns the built-in database interface of the crawler |
int |
getMaxQueueSize()
Returns the maximum allowed size of the URL Queue |
java.net.URL |
getNextURL()
Returns the next URL to be searched. |
int |
getQueueSize()
Returns the current size of the URL queue |
int |
getState()
Returns the current state of the crawler. |
int |
getTimeout()
Returns the current timeout of the crawler. |
boolean |
isDataStructureEmpty()
Checks if our data structure is empty or not. |
boolean |
isVisited(java.net.URL doc)
Checks if the URL of the given document is already visited by the crawler. |
boolean |
openDB()
Initializes the internal database interface and opens its database connection |
void |
reset()
Resets the crawler. |
boolean |
robotSafe(java.net.URL url)
Checks whether a robots.txt file exists on the server and whether it contains a "Disallow:" entry. |
void |
run()
|
ISDocumentInterface |
runParser(java.io.Reader r)
Starts the parser. |
void |
setCrawlingDepth(int depth)
Sets the maximum allowed crawling depth. |
void |
setCurrentDocument(ISDocumentInterface isd)
Sets the last document visited by the Crawler. |
void |
setQueueMaxSize(int m)
Sets the maximum allowed size of the URL queue |
void |
setState(int state_code)
Sets the current state of the crawler. |
void |
setTimeout(int t)
Sets the current timeout of the crawler. |
void |
start()
Starts the thread of the crawler and changes the engine state to RUNNING |
void |
stop()
Stops the crawler. |
boolean |
store(java.net.URL link,
ISDocumentInterface doc)
Stores the crawled document and its URL into the database |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
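The queue-related methods listed above (addLink, getNextURL, isVisited, getQueueSize, setQueueMaxSize, isDataStructureEmpty) suggest a bounded queue of pending URLs combined with a record of visited ones. The following is a minimal sketch of that idea, not the actual implementation: the class name `UrlFrontierSketch`, the FIFO ordering, and the use of `java.net.URI` instead of `java.net.URL` as the key type are all assumptions (`URL.equals` may perform host name resolution, which makes it a poor set key).

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical bounded URL queue with a visited set; not the ISDBCrawler source. */
class UrlFrontierSketch {
    private final ArrayDeque<URI> queue = new ArrayDeque<>();
    private final Set<URI> visited = new HashSet<>();
    private int maxQueueSize;

    UrlFrontierSketch(int maxQueueSize) {
        this.maxQueueSize = maxQueueSize;
    }

    /** Adds a link unless it was already visited, is already queued, or the queue is full. */
    boolean addLink(URI link) {
        URI key = link.normalize();
        if (visited.contains(key) || queue.contains(key) || queue.size() >= maxQueueSize) {
            return false;
        }
        queue.addLast(key);
        return true;
    }

    /** Returns the next URL in FIFO order and marks it visited, or null if the queue is empty. */
    URI getNextURL() {
        URI next = queue.pollFirst();
        if (next != null) {
            visited.add(next);
        }
        return next;
    }

    boolean isVisited(URI link) { return visited.contains(link.normalize()); }

    boolean isDataStructureEmpty() { return queue.isEmpty(); }

    int getQueueSize() { return queue.size(); }

    int getMaxQueueSize() { return maxQueueSize; }

    void setQueueMaxSize(int m) { maxQueueSize = m; }
}
```

A ranking strategy such as the one implied by getBest() could replace the FIFO order by swapping the `ArrayDeque` for a `PriorityQueue` with a suitable comparator.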
| Constructor Detail |
|---|
public ISDBCrawler()
| Method Detail |
|---|
public boolean store(java.net.URL link,
                     ISDocumentInterface doc)
Stores the crawled document and its URL into the database.
Specified by: store in interface ISDBCrawlerInterface
Parameters:
link - the URL of the crawled document
doc - extracted terms and links from the document
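
public boolean openDB()
Initializes the internal database interface and opens its database connection.
Specified by: openDB in interface ISDBCrawlerInterface

public void closeDB()
Closes the database connection of the built-in database interface.
Specified by: closeDB in interface ISDBCrawlerInterface

public ISDBinterface getDBInterface()
Returns the built-in database interface of the crawler.
Specified by: getDBInterface in interface ISDBCrawlerInterface

public void addLink(java.net.URL link)
Adds a new link to the URL queue, if the link is not yet visited.
Specified by: addLink in interface ISCrawlerInterface
Parameters:
link - The URL link representation of the new target

public java.net.URL getBest()
Returns the best candidate to be visited next.
Specified by: getBest in interface ISCrawlerInterface
Returns:
the best candidate to be visited next, or null if the queue is empty.

public int getCrawlingDepth()
Returns the current maximum allowed crawling depth.
Specified by: getCrawlingDepth in interface ISCrawlerInterface

public ISDocumentInterface getCurrentDocument()
Returns the last document visited by the Crawler.
Specified by: getCurrentDocument in interface ISCrawlerInterface

public void setCurrentDocument(ISDocumentInterface isd)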
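Description copied from interface: ISCrawlerInterface
Sets the last document visited by the Crawler.
Specified by: setCurrentDocument in interface ISCrawlerInterface

public java.net.URL getCurrentURL()
Returns the last URL visited by the Crawler.
Specified by: getCurrentURL in interface ISCrawlerInterface

public int getMaxQueueSize()
Returns the maximum allowed size of the URL queue.
Specified by: getMaxQueueSize in interface ISCrawlerInterface

public int getQueueSize()
Returns the current size of the URL queue.
Specified by: getQueueSize in interface ISCrawlerInterface

public int getState()
Returns the current state of the crawler. Possible states are RUNNING and STOPPED.
Specified by: getState in interface ISCrawlerInterface
Returns:
RUNNING or STOPPED

public void setState(int state_code)
Description copied from interface: ISCrawlerInterface
Sets the current state of the crawler. Possible states are RUNNING and STOPPED.
Specified by: setState in interface ISCrawlerInterface

public int getTimeout()
Description copied from interface: ISCrawlerInterface
Returns the current timeout of the crawler.
Specified by: getTimeout in interface ISCrawlerInterface

public void setTimeout(int t)
Description copied from interface: ISCrawlerInterface
Sets the current timeout of the crawler.
Specified by: setTimeout in interface ISCrawlerInterface

public boolean isVisited(java.net.URL doc)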
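Checks if the URL of the given document is already visited by the crawler.
Specified by: isVisited in interface ISCrawlerInterface
Returns:
true if the engine was able to recognize the given URL as already visited, false otherwise.

public void setCrawlingDepth(int depth)
Sets the maximum allowed crawling depth.
Specified by: setCrawlingDepth in interface ISCrawlerInterface
Parameters:
depth - The maximum allowed crawling depth.

public void setQueueMaxSize(int m)
Sets the maximum allowed size of the URL queue.
Specified by: setQueueMaxSize in interface ISCrawlerInterface
Parameters:
m - The maximum allowed queue size

public void start()
Starts the thread of the crawler and changes the engine state to RUNNING.
Specified by: start in interface ISCrawlerInterface

public void stop()
Stops the crawler and changes the engine state to STOPPED.
Specified by: stop in interface ISCrawlerInterface

public void reset()
Resets the crawler. ... STOPPED, ...
Specified by: reset in interface ISCrawlerInterface

public void run()
Specified by: run in interface java.lang.Runnable

public java.net.URL getNextURL()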
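Description copied from interface: ISCrawlerInterface
Returns the next URL to be searched. ... getBest().
Specified by: getNextURL in interface ISCrawlerInterface

public boolean isDataStructureEmpty()
Description copied from interface: ISCrawlerInterface
Checks if our data structure is empty or not.
Specified by: isDataStructureEmpty in interface ISCrawlerInterface

public ISDocumentInterface runParser(java.io.Reader r)
Description copied from interface: ISCrawlerInterface
Starts the parser.
Specified by: runParser in interface ISCrawlerInterface
Returns:
an ISDocumentInterface.

public java.lang.String getContentType(java.net.URLConnection urlConnection)
Description copied from interface: ISCrawlerInterface
Returns the ContentType of the current document.
Specified by: getContentType in interface ISCrawlerInterface
Parameters:
urlConnection - The current document given by a URLConnection.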
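
As an illustration of how a content type can be obtained from a `java.net.URLConnection`, the sketch below reads the Content-Type header and strips optional parameters such as the charset. The class `ContentTypeSketch` and its helper names are hypothetical; how ISDBCrawler actually derives the content type is not documented here.

```java
import java.net.URLConnection;
import java.util.Locale;

/** Hypothetical helper; the real getContentType implementation is not shown in this documentation. */
final class ContentTypeSketch {
    private ContentTypeSketch() {}

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the media type. */
    static String baseType(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        int semicolon = headerValue.indexOf(';');
        String type = semicolon >= 0 ? headerValue.substring(0, semicolon) : headerValue;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Reads the Content-Type header of a connection; the result is null if none is known. */
    static String getContentType(URLConnection urlConnection) {
        return baseType(urlConnection.getContentType());
    }

    /** Convenience check a crawler might use to decide whether to run an HTML parser. */
    static boolean isHtml(URLConnection urlConnection) {
        return "text/html".equals(getContentType(urlConnection));
    }
}
```

Normalizing the header value first keeps later comparisons simple, because servers commonly send values such as `text/html; charset=UTF-8`.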
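
public boolean robotSafe(java.net.URL url)
Description copied from interface: ISCrawlerInterface
Checks whether a robots.txt file exists on the server and whether it contains a "Disallow:" entry.
Specified by: robotSafe in interface ISCrawlerInterface
Parameters:
url - URL which should be checked.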
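
The kind of robots.txt check described for robotSafe can be sketched as follows. This is a simplified illustration, not the ISDBCrawler implementation: the class `RobotsSketch` is hypothetical, it works on robots.txt content that has already been downloaded, it only honors rules for `User-agent: *`, and it treats each Disallow value as a plain path prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical robots.txt check; simplified and not the ISDBCrawler source. */
final class RobotsSketch {
    private RobotsSketch() {}

    /** Collects the Disallow path prefixes that apply to all user agents ("User-agent: *"). */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean appliesToAll = false;
        boolean lastWasAgent = false;
        for (String line : robotsTxt.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop comments
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean star = value.equals("*");
                // Consecutive User-agent lines share one rule group.
                appliesToAll = lastWasAgent ? (appliesToAll || star) : star;
                lastWasAgent = true;
            } else {
                if (field.equals("disallow") && appliesToAll && !value.isEmpty()) {
                    prefixes.add(value);
                }
                lastWasAgent = false;
            }
        }
        return prefixes;
    }

    /** Returns true if no collected Disallow prefix matches the given URL path. */
    static boolean robotSafe(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

In a crawler, the content would typically be fetched once per host from `/robots.txt` and cached; a missing file can be treated as an empty string, which makes every path safe in this sketch.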