Darkweb Crawler Project
darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies and header fields, and bundles selenium to provide a fully rendered web page and a screenshot of that view.
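The two capture styles can be illustrated with plain requests and selenium calls (a minimal sketch only: the target URL is a placeholder, Google Chrome and a matching ChromeDriver are assumed to be installed, and darc's actual bookkeeping is far more involved)::

    import requests
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    URL = 'https://www.example.com/'  # placeholder target

    # requests side: HTTP-level information
    with requests.Session() as session:
        response = session.get(URL)
        print(response.headers)            # response header fields
        print(session.cookies.get_dict())  # session cookies

    # selenium side: fully rendered page and screenshot
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(URL)
        html = driver.page_source          # rendered HTML source
        driver.save_screenshot('example.png')
    finally:
        driver.quit()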
Since websites can be troublesome with their anti-robot verification, login requirements, and the like, the darc project also provides hooks to customise crawling behaviour around both requests and selenium.
See also

Such customisation, called site hooks in the darc project, is site specific: users can set up their own hooks for a certain site, cf. darc.sites for more information.
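One plausible shape for such a hook pair is sketched below. Everything here is an illustrative assumption, including the names crawler_hook, loader_hook and register; the actual signatures and registration API are defined in darc.sites::

    # Hypothetical sketch of a site-specific hook pair; consult
    # darc.sites for the actual API -- these names are assumptions.
    import requests
    from selenium.webdriver import Chrome


    def crawler_hook(session: requests.Session, url: str) -> requests.Response:
        """Customise the requests-based crawl, e.g. log in first."""
        session.post('https://www.example.com/login',  # placeholder login flow
                     data={'user': 'alice', 'pass': '...'})
        return session.get(url)


    def loader_hook(driver: Chrome, url: str) -> Chrome:
        """Customise the selenium-based load, e.g. wait for scripts."""
        driver.get(url)
        driver.implicitly_wait(10)  # crude wait for anti-robot checks
        return driver

    # register('www.example.com', crawler_hook, loader_hook)  # hypothetical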
Still, since the network is a world full of mysteries and miracles, crawling speed depends heavily on the response speed of the target website. To boost performance while staying within system capacity, the darc project supports multiprocessing, multithreading, and a fallback single-threaded (slowest) solution when crawling; a sketch of the three strategies follows the note below.
Note

When rendering the target website using selenium powered by the renowned Google Chrome, much memory is required. Thus, the three solutions mentioned above only toggle the behaviour around the use of selenium.
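The idea behind the three strategies can be sketched with standard-library primitives (illustrative only: darc's actual scheduling lives in darc.process, and the parameter names here are assumptions)::

    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


    def run_round(urls, worker, *, processes=0, threads=0):
        """Dispatch `worker` over `urls` with one of three strategies."""
        if processes:   # multiprocessing: isolates each worker's memory
            with ProcessPoolExecutor(max_workers=processes) as pool:
                list(pool.map(worker, urls))
        elif threads:   # multithreading: lighter, shares one interpreter
            with ThreadPoolExecutor(max_workers=threads) as pool:
                list(pool.map(worker, urls))
        else:           # fallback: the slowest single-threaded loop
            for url in urls:
                worker(url)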
To keep the darc project a Swiss Army knife, only the main entrypoint function darc.process.process() is exported in the global namespace (renamed to darc.darc()), as described below:
- darc.darc()

  Main process.
The function will register _signal_handler() for SIGTERM, and start the main process of the darc darkweb crawlers. The general process can be described as follows:
1. process(): obtain URLs from the requests link database (cf. load_requests()), and feed such URLs to crawler() with multiprocessing support.

2. crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (cf. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests. (A condensed sketch of this decision chain appears after this list.)

   If the URL is from a brand-new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (cf. save_robots() and save_sitemap()), then extract the links from the sitemaps (cf. read_sitemap()) and save them into the link database for future crawling (cf. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

   If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

   Note

   The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.

   At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information using save_headers().

   Note

   If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing is dropped.

   If the content type of the response document is not ignored (cf. MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. And if the submission API is provided, submit_requests() will be called to submit the document just fetched.

   If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (cf. save_requests()).

   If the response status code is between 400 and 600, the URL will be saved back to the link database (cf. save_requests()). If not, the URL will be saved into the selenium link database to proceed to the next steps (cf. save_selenium()).

3. process(): after the obtained URLs have all been crawled, darc will obtain URLs from the selenium link database (cf. load_selenium()), and feed such URLs to loader().

4. loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

   At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.Chrome object.

   If successful, the rendered source HTML document will be saved using save_html(), and a full-page screenshot will be taken and saved.

   If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

   Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (cf. save_requests()).
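The decision chain of a single crawler() round, referenced in step 2 above, can be condensed into the following sketch. It is illustrative only: the filter variables and their matching semantics are assumptions standing in for darc's configuration, the elided branches stand in for darc's save/submit routines, and only requests and urllib.robotparser are used as-is::

    import re
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    import requests

    # Stand-ins for darc's link and MIME filters (assumed shapes).
    LINK_WHITE_LIST = [re.compile(r'.*\.onion$')]   # hosts to crawl
    LINK_BLACK_LIST = []                            # hosts to refuse
    MIME_BLACK_LIST = {'image/png', 'image/jpeg'}   # content types to ignore
    FORCE = False                                   # skip robots.txt if True


    def match_host(url):
        """Assumed semantics: the black list wins, then the white list."""
        host = urlparse(url).netloc
        if any(pattern.match(host) for pattern in LINK_BLACK_LIST):
            return False
        return any(pattern.match(host) for pattern in LINK_WHITE_LIST)


    def robots_allowed(url):
        """Check robots.txt with the standard-library parser."""
        parts = urlparse(url)
        parser = urllib.robotparser.RobotFileParser(
            urljoin(f'{parts.scheme}://{parts.netloc}', '/robots.txt'))
        parser.read()
        # The root path is always crawled, ignoring robots.txt.
        return parts.path in ('', '/') or parser.can_fetch('*', url)


    def crawl_once(session, url):
        """One requests-side round for a single URL."""
        if not match_host(url):
            return
        if not FORCE and not robots_allowed(url):
            return
        response = session.get(url)  # the darc.sites hook would run here
        mime = response.headers.get('Content-Type', '').split(';')[0]
        if mime not in MIME_BLACK_LIST:
            ...  # save_html() / save_file(), then submit_requests()
        if 400 <= response.status_code < 600:
            ...  # requeue into the requests link database (save_requests())
        else:
            ...  # promote into the selenium link database (save_selenium())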
Finally, if in reboot mode, i.e. REBOOT is True, the function will exit after the first round. Otherwise, it will renew the Tor connections (if bootstrapped), cf. renew_tor_session(), and start another round (see the sketch below).
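Renewing a Tor circuit is conventionally done by sending the NEWNYM signal over the control port, which is presumably what renew_tor_session() boils down to. A sketch with the stem library, assuming a local Tor instance with its control port enabled (not darc's verbatim implementation)::

    from stem import Signal
    from stem.control import Controller

    # Assumes Tor is running locally with ControlPort 9051 enabled.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()         # or authenticate(password='...')
        controller.signal(Signal.NEWNYM)  # request a fresh circuit

The whole pipeline above is then driven by the single exported entrypoint (assuming the required environment, link databases and proxies are already configured)::

    import darc

    darc.darc()  # runs round after round, or a single round in reboot mode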