vicorious/WebCrawler

WebCrawler

  1. Runs as a stand-alone jar (no frameworks required)
  2. Supports Java 1.8

Test cases

  • To run the jar, pass four arguments in this order: URL file, ARG PARAM, matcher, output path. Example arguments:

  • /Users/digital/Documents/java_belatrix/urls.txt HASH #ff /Users/digital/Documents/java_belatrix

First param

Path to the URL file (in the example, "/Users/digital/Documents/java_belatrix/urls.txt")

Second param (in the example, "HASH")

  • The second param is the ARG PARAM. It must be one of:
    • HASH
    • TWITTER
    • ANY

Third param (in the example, "#ff")

  • The third param is the matcher. Its form depends on the ARG PARAM:
    • HASH (#anything)
    • TWITTER (@account)
    • ANY (anything else)
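As a rough illustration of how the three matcher forms could be checked and searched for, here is a minimal sketch using java.util.regex. The class name MatcherSketch, both method names, and the rule that a match must not be followed by a word character are all assumptions made for this example, not the crawler's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: class, method names and matching rules are assumptions,
// not taken from the WebCrawler source.
class MatcherSketch {

    // Checks that the matcher has the shape expected for the given ARG PARAM.
    static boolean isValidMatcher(String mode, String matcher) {
        switch (mode) {
            case "HASH":
                return matcher.length() > 1 && matcher.startsWith("#");
            case "TWITTER":
                return matcher.length() > 1 && matcher.startsWith("@");
            case "ANY":
                return !matcher.isEmpty();
            default:
                return false;
        }
    }

    // Counts literal occurrences of the matcher that are not immediately
    // followed by a word character, so "#ff" does not match inside "#ffoo".
    static int countMatches(String text, String matcher) {
        Pattern pattern = Pattern.compile(Pattern.quote(matcher) + "(?!\\w)");
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "Follow #ff today, #ff-friends welcome, but not #ffoo";
        System.out.println(isValidMatcher("HASH", "#ff"));
        System.out.println(countMatches(sample, "#ff"));
    }
}
```

Pattern.quote keeps characters such as "#" or "@" literal, so the matcher is never interpreted as a regular expression.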

Fourth param (in the example, "/Users/digital/Documents/java_belatrix")

  • The fourth param is the output path (path ONLY, file names NOT included)
    • /My/Relative/or/Absolute/Output/Path
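The four parameters described above could be read and validated roughly as follows. This is only a sketch: the class name CrawlerArgs, its field names, and the error messages are assumptions for illustration and may not match how the jar actually parses its arguments.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: names and validation rules are assumptions,
// not taken from the WebCrawler source.
class CrawlerArgs {
    final Path urlFile;    // first param: file listing the URLs to crawl
    final String mode;     // second param: HASH, TWITTER or ANY
    final String matcher;  // third param: e.g. "#ff"
    final Path outputDir;  // fourth param: output path (no file name)

    CrawlerArgs(Path urlFile, String mode, String matcher, Path outputDir) {
        this.urlFile = urlFile;
        this.mode = mode;
        this.matcher = matcher;
        this.outputDir = outputDir;
    }

    // Parses the raw command-line arguments in the documented order.
    static CrawlerArgs parse(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "Expected 4 arguments: <urlFile> <HASH|TWITTER|ANY> <matcher> <outputPath>");
        }
        String mode = args[1];
        if (!mode.equals("HASH") && !mode.equals("TWITTER") && !mode.equals("ANY")) {
            throw new IllegalArgumentException("Unknown ARG PARAM: " + mode);
        }
        return new CrawlerArgs(Paths.get(args[0]), mode, args[2], Paths.get(args[3]));
    }

    public static void main(String[] args) {
        CrawlerArgs parsed = parse(new String[] {
            "/Users/digital/Documents/java_belatrix/urls.txt",
            "HASH",
            "#ff",
            "/Users/digital/Documents/java_belatrix"
        });
        System.out.println(parsed.mode + " " + parsed.matcher);
    }
}
```

When typing the arguments in a shell, a matcher such as #ff may need quoting, because many shells treat a word starting with "#" as a comment.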

Results

  • The HTML folder will contain the HTML index of every page listed in the URL file

  • Result files are named after the ARG PARAM.

  • Example:

    • CNN_TWITTER_FINAL.txt -> contains all the latest Twitter accounts found in the CNN index
    • CNN_HASH_FINAL.txt -> contains all the latest hashtags found in the CNN index
    • CNN_BELATRIX_FINAL -> contains all the latest occurrences of the word BELATRIX found in the CNN index
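The naming pattern in the examples above can be sketched as below. How the crawler really derives the site label from a URL is not documented here, so the host-based rule, the ".txt" extension on every file, and the names ResultFileNames, siteLabel and resultFileName are all assumptions for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch: the naming rule is inferred from the README examples,
// not taken from the WebCrawler source.
class ResultFileNames {

    // Assumed rule: drop a leading "www." and keep the first host label, upper-cased.
    static String siteLabel(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        int dot = host.indexOf('.');
        String label = dot >= 0 ? host.substring(0, dot) : host;
        return label.toUpperCase(Locale.ROOT);
    }

    // HASH and TWITTER use the ARG PARAM name; ANY is assumed to use the searched word.
    static String resultFileName(String url, String mode, String matcher) {
        String kind = mode.equals("ANY") ? matcher.toUpperCase(Locale.ROOT) : mode;
        return siteLabel(url) + "_" + kind + "_FINAL.txt";
    }

    public static void main(String[] args) {
        System.out.println(resultFileName("https://www.cnn.com", "TWITTER", "@cnn"));
        System.out.println(resultFileName("https://www.cnn.com", "ANY", "belatrix"));
    }
}
```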

About

Web Crawler JAVA
