Technical Documentation
darc is designed as a Swiss Army knife for darkweb crawling.
It integrates requests to collect HTTP request and response
information, such as cookies and header fields. It also bundles
selenium to provide a fully rendered web page and a screenshot
of that view.
Because websites can be hard to crawl owing to anti-robot
verification, login requirements, etc., the darc project
also provides hooks to customise crawling behaviour around both
requests and selenium.
See also
Such customisation, called site hooks in the darc project,
is site specific: users can set up their own hooks for a
certain site; cf. darc.sites for more information.
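To make the idea of site-specific hooks concrete, here is a minimal sketch of a hostname-keyed hook registry. It is an illustration only and does not use darc's API: `register_hook`, `default_hook`, `crawl`, `login_first` and the hostname `example.onion` are all hypothetical names chosen for this example; the actual registration interface is described in darc.sites.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# A hook takes a URL and returns a short description of what it did.
Hook = Callable[[str], str]

# Hypothetical registry mapping hostnames to hooks (not darc's internals).
_HOOKS: Dict[str, Hook] = {}


def register_hook(hostname: str, hook: Hook) -> None:
    """Associate a hook with a single hostname."""
    _HOOKS[hostname.lower()] = hook


def default_hook(url: str) -> str:
    """Fallback used when no site-specific hook is registered."""
    return f"default crawl of {url}"


def crawl(url: str) -> str:
    """Dispatch to the hook registered for the URL's hostname."""
    hostname = (urlsplit(url).hostname or "").lower()
    hook = _HOOKS.get(hostname, default_hook)
    return hook(url)


def login_first(url: str) -> str:
    """Hypothetical hook for a site that requires a login step."""
    return f"log in, then crawl {url}"


register_hook("example.onion", login_first)
```

With this registry, `crawl("http://example.onion/page")` is handled by `login_first`, while a URL on any other host falls back to `default_hook`.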
Still, since the network is a world full of mysteries and miracles,
the speed of crawling depends heavily on the response speed of
the target website. To speed up crawling while staying within system capacity,
the darc project offers multiprocessing, multithreading
and, as the slowest fallback, a single-threaded solution.
Note
Rendering the target website using selenium powered by
the renowned Google Chrome requires a lot of memory.
Thus, the three solutions mentioned above only toggle the
behaviour around the use of selenium.
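The choice between the three solutions described above can be sketched with the standard library. This is a hedged illustration of the general pattern, not darc's implementation: `run_pool`, its `mode` and `workers` parameters, and the stand-in worker `fetch_length` are hypothetical names introduced for this example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List


def fetch_length(url: str) -> int:
    """Hypothetical stand-in for a real crawl step; returns the URL length."""
    return len(url)


def run_pool(worker: Callable[[str], int], urls: Iterable[str],
             mode: str = "single", workers: int = 2) -> List[int]:
    """Run ``worker`` over ``urls`` using the selected concurrency mode."""
    if mode == "single":
        # Slowest fallback: plain sequential loop in the current thread.
        return [worker(url) for url in urls]
    if mode == "process":
        executor_cls = ProcessPoolExecutor
    elif mode == "thread":
        executor_cls = ThreadPoolExecutor
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Executor.map preserves the input order of the results.
    with executor_cls(max_workers=workers) as executor:
        return list(executor.map(worker, urls))
```

All three modes return results in input order, so callers can switch modes without changing how they consume the output. When using `mode="process"` from a script, the call should sit under an `if __name__ == "__main__":` guard, because worker processes may re-import the main module.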
To keep the darc project a true Swiss Army knife, only the
main entrypoint function darc.process.process() is exported
to the global namespace (and renamed to darc.darc()), see below:
The necessary hook registration functions are also exported to the global namespace, see below:
For more information on the hooks, please refer to the customisation documentation.