darc is designed as a swiss army knife for darkweb crawling.
It integrates `requests` to collect HTTP request and response
information, such as cookies, header fields, etc. It also bundles
`selenium` to provide a fully rendered web page and a screenshot
of the rendered view.
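For a concrete flavour of the two collection engines, here is a minimal, hedged sketch using the underlying libraries directly; this is not `darc`'s internal code, and the target URL is a placeholder:

```python
# a minimal sketch of the two engines darc builds on -- plain
# requests and selenium, not darc's internal implementation
import requests
from selenium import webdriver

# requests: collect HTTP request and response information
response = requests.get('https://www.example.com')
print(response.headers)   # response header fields
print(response.cookies)   # cookies set by the server

# selenium: fully render the page and capture a screenshot
driver = webdriver.Chrome()
try:
    driver.get('https://www.example.com')
    driver.save_screenshot('example.png')
finally:
    driver.quit()
```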
The documentation of the `darc` project is organised into the following sections:

- URL Utilities
- Source Parsing
- Link Database
- Proxy Utilities
- Sites Customisation
- Module Constants
- Custom Exceptions
- Data Models
As websites can sometimes be irritating with their anti-robot
verification, login requirements, etc., the `darc` project
also provides hooks to customise crawling behaviours around both
`requests` and `selenium`.
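For a flavour of what such a hook might look like, below is a hedged sketch: the `BaseSite` class, the `register()` helper and the `crawler`/`loader` signatures are assumptions modelled on the Sites Customisation section, not a verbatim API.

```python
# a hedged sketch of a sites customisation hook -- BaseSite, register
# and the crawler/loader signatures are assumptions; consult the
# Sites Customisation docs for the authoritative API
from darc.sites import BaseSite, register

class MySite(BaseSite):
    """Customised crawling behaviour for a fussy website."""

    # hostnames this customisation applies to
    hostname = ['mysite.example', 'www.mysite.example']

    @staticmethod
    def crawler(timestamp, session, link):
        """Tweak the requests session, e.g. attach login cookies."""
        session.cookies.set('session-id', 'dummy-value')
        return session.get(link.url)

    @staticmethod
    def loader(timestamp, driver, link):
        """Drive the selenium webdriver past anti-robot checks."""
        driver.get(link.url)
        return driver

register(MySite)
```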
Still, since the network is a world full of mysteries and miracles,
the speed of crawling will largely depend on the response speed of
the target website. To speed crawling up, as well as to fit the
capacity of the system, the `darc` project introduced multiprocessing,
multithreading and a fallback (and slowest) single-threaded solution
for crawling.
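As an illustration only (not `darc`'s internal implementation), the three strategies can be pictured as follows:

```python
# an illustrative sketch of the three concurrency strategies --
# not darc's internal implementation
import multiprocessing
from multiprocessing.pool import ThreadPool

def crawl(url: str) -> None:
    """Placeholder for the per-URL crawling routine."""
    print('crawling', url)

URLS = ['https://www.example.com', 'https://example.org']

if __name__ == '__main__':
    # 1. multiprocessing: one worker process per CPU core
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        pool.map(crawl, URLS)

    # 2. multithreading: lighter weight, all within one process
    with ThreadPool(4) as pool:
        pool.map(crawl, URLS)

    # 3. single-threaded fallback: slowest, smallest footprint
    for url in URLS:
        crawl(url)
```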
When rendering the target website using `selenium` powered by
the renowned Google Chrome, a considerable amount of memory is
required. Thus, the three solutions mentioned above only toggle the
behaviour around the use of `selenium`.
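For reference, here is a hedged sketch of Chrome options commonly used to keep the browser's memory footprint manageable; these flags are general Chrome/`selenium` practice, not necessarily `darc`'s exact configuration:

```python
# common Chrome flags for constrained environments -- general
# practice, not necessarily darc's exact configuration
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')               # no display server needed
options.add_argument('--disable-gpu')            # skip the GPU process
options.add_argument('--disable-dev-shm-usage')  # avoid small /dev/shm in containers

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.example.com')
finally:
    driver.quit()  # release the memory-hungry browser process
```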
To keep the `darc` project the swiss army knife it is, only the
main entrypoint function `darc.process.process()` is exported
in the global namespace (and renamed to `darc.darc()`), see below:
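(A minimal sketch; whether the entrypoint accepts arguments may differ between versions.)

```python
# a minimal sketch -- assuming the entrypoint takes no required
# arguments in the installed version
from darc import darc  # alias of darc.process.process()

if __name__ == '__main__':
    darc()  # start the crawling workers
```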
The necessary hook registration functions are also exported to the global namespace, see below:
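(The name below is a hypothetical placeholder for one of the exported registration helpers; see the customisation documentation for the actual names.)

```python
# a hedged sketch -- register_sites is a hypothetical placeholder for
# one of the exported hook registration helpers
import darc

darc.register_sites(MySite)  # e.g. register the MySite hook sketched earlier
```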
For more information on the hooks, please refer to the customisation documentation.