Technical Documentation

darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies and header fields. It also bundles selenium to provide a fully rendered web page and a screenshot of the rendered view.

As websites can be troublesome to crawl because of their anti-robot verification, login requirements, etc., the darc project also provides hooks to customise crawling behaviours around both requests and selenium.

See also

Such customisation, called site hooks in the darc project, is site specific: users can set up their own hooks for a certain site, c.f. darc.sites for more information.

Still, since the network is a world full of mysteries and miracles, the speed of crawling depends heavily on the response speed of the target website. To speed things up while staying within the system capacity, the darc project introduced multiprocessing, multithreading and a fallback (slowest) single-threaded solution for crawling.

Note

Rendering the target website using selenium, powered by the renowned Google Chrome, requires much memory. Thus, the three solutions mentioned above only toggle the behaviour around the use of selenium.
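The three concurrency modes can be illustrated with a minimal sketch. It is not darc's own code: the names run_round, fetch and size are hypothetical, and the flag_mp and flag_th arguments merely stand in for the FLAG_MP and FLAG_TH settings described further below.

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for the real per-URL work."""
    return f"done:{url}"


def run_round(urls, worker, flag_mp=False, flag_th=False, size=4):
    """Run ``worker`` over ``urls`` using the selected concurrency mode."""
    if flag_mp:
        # Multiprocessing: ``worker`` must be picklable,
        # i.e. a function defined at module level.
        with multiprocessing.Pool(processes=size) as pool:
            return pool.map(worker, urls)
    if flag_th:
        # Multithreading: a thread pool of bounded size.
        with ThreadPoolExecutor(max_workers=size) as pool:
            return list(pool.map(worker, urls))
    # Fallback: plain sequential loop, the slowest option.
    return [worker(url) for url in urls]


if __name__ == "__main__":
    print(run_round(["http://a.onion", "http://b.onion"], fetch, flag_th=True))
```

All three branches return the results in input order, so callers need not care which mode is active.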

To keep the darc project a true Swiss Army knife, only the main entrypoint function darc.process.process() is exported in the global namespace (and renamed to darc.darc()), see below:

darc.darc(worker)

Main process.

The function will register _signal_handler() for SIGTERM, and start the main process of the darc darkweb crawlers.

Parameters: worker (Literal[crawler, loader]) – Worker process type.
Raises: ValueError – If worker is not a valid value.
Return type: None

Before starting the workers, the function will start the proxies through:

darc.proxy.tor.tor_proxy()
darc.proxy.i2p.i2p_proxy()
darc.proxy.zeronet.zeronet_proxy()
darc.proxy.freenet.freenet_proxy()
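The start-up sequence described above (validate the worker type, register a SIGTERM handler, start the proxies) might look roughly like the following sketch. Everything here is a stand-in written for illustration: validate_worker, main and proxy_starters are hypothetical names, the handler body is invented, and the relative order of validation and handler registration is an assumption.

```python
import signal

VALID_WORKERS = ("crawler", "loader")


def validate_worker(worker):
    """Reject anything other than the two documented worker types."""
    if worker not in VALID_WORKERS:
        raise ValueError(f"invalid worker type: {worker!r}")
    return worker


def _signal_handler(signum, frame):
    """Hypothetical stand-in: exit cleanly when SIGTERM arrives."""
    raise SystemExit(0)


def main(worker, proxy_starters=()):
    """Validate, install the SIGTERM handler, then start each proxy in order."""
    validate_worker(worker)
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, _signal_handler)
    for start_proxy in proxy_starters:
        start_proxy()
    # The worker loop itself would start here.
```

Passing the proxy starters as a sequence of callables keeps the sketch independent of which proxies are actually installed.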

The general process can be described as follows for workers of crawler type:

process_crawler(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler().

Note

If FLAG_MP is True, the function will be called with multiprocessing support; if FLAG_TH is True, the function will be called with multithreading support; if neither is set, the function will be called in single-threaded mode.

crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

Note

The root path (e.g. / in https://www.example.com/) will always be crawled ignoring robots.txt.
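The robots.txt rule above, including the root-path exception, can be sketched with the standard library's urllib.robotparser. This is an illustration only: may_crawl is a hypothetical helper, the force argument stands in for FORCE, and the "*" user agent is an assumption, since this page does not say which agent darc identifies as.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def may_crawl(url, robots_txt, force=False, user_agent="*"):
    """Decide whether ``url`` may be crawled given the text of robots.txt."""
    if force:
        # FORCE set: robots.txt is not consulted at all.
        return True
    if urlsplit(url).path in ("", "/"):
        # The root path is always crawled.
        return True
    if robots_txt is None:
        # No robots.txt was found for this host.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(may_crawl("https://www.example.com/private/page", rules))
```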

At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information, using save_headers().

Note

If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing is dropped.

If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.

If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).

If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). Otherwise, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
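The last two decisions of the crawler step can be summarised in a short sketch. The helper names is_html and next_database are hypothetical, and reading "between 400 and 600" as the half-open range 400 to 599 is an assumption; the text does not state whether the bounds are inclusive.

```python
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type):
    """Check a Content-Type header value, ignoring parameters such as charset."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in HTML_TYPES


def next_database(status_code):
    """Pick the link database a crawled URL is saved into."""
    if 400 <= status_code < 600:
        # Client or server error: back to the requests database for a retry.
        return "requests"
    # Anything else moves on to the selenium database for rendering.
    return "selenium"
```

For instance, next_database(404) selects the requests database, while next_database(200) hands the URL over to the loader side.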

The general process can be described as follows for workers of loader type:

process_loader(): meanwhile, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().

Note

If FLAG_MP is True, the function will be called with multiprocessing support; if FLAG_TH is True, the function will be called with multithreading support; if neither is set, the function will be called in single-threaded mode.

loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

At this point, darc will call the customised hook function from darc.sites to load and return the original Chrome object.

If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.

If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).

After each round, darc will call the registered hook functions in sequential order, with the type of worker ('crawler' or 'loader') and the current link pool as their parameters; see register() for more information.

If in reboot mode, i.e. REBOOT is True, the function will exit after the first round. Otherwise, it will renew the Tor connections (if bootstrapped), c.f. renew_tor_session(), and start another round.
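The round structure described above can be condensed into a small, self-contained sketch. It does not reproduce darc's implementation: run_rounds, load_pool, process and renew are hypothetical names, and the WorkerBreak class defined here is a local stand-in for the exception of the same name mentioned in the hook documentation.

```python
class WorkerBreak(Exception):
    """Local stand-in for the exception a hook raises to stop the worker."""


def run_rounds(worker, load_pool, process, hooks, reboot=False, renew=None):
    """Run rounds until a hook raises WorkerBreak or reboot mode ends the loop.

    Returns the number of completed rounds.
    """
    rounds = 0
    while True:
        pool = load_pool()              # links for this round
        for link in pool:
            process(link)               # crawl or load one link
        rounds += 1
        try:
            for hook in hooks:          # hooks run in registration order
                hook(worker, pool)      # return values are ignored
        except WorkerBreak:
            break                       # stop once the current round is done
        if reboot:
            break                       # reboot mode: a single round only
        if renew is not None:
            renew()                     # e.g. renew the Tor session
    return rounds
```

A hook that counts its invocations and raises WorkerBreak on the third call would make this loop return 3.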

See also

The function is renamed from darc.process.process().

The necessary hook registration functions are also exported to the global namespace, see below:

darc.register_hooks(hook, *, _index=None)

Register hook function.

Parameters

hook (Callable[[Literal[crawler, loader], List[darc.link.Link]], None]) – Hook function to be registered.
_index (Optional[int]) –

Keyword Arguments

_index – Position index for the hook function.

Return type

None

The hook function takes two parameters:

a str object indicating the type of worker, i.e. 'crawler' or 'loader';
a list object containing Link objects, i.e. the link pool processed in the current round.

The hook function may raise WorkerBreak so that the worker breaks out of its indefinite loop once the current round finishes. Any value returned from the hook function will be ignored by the workers.

See also

The hook functions will be saved into _HOOK_REGISTRY.

See also

The function is renamed from darc.process.register().

darc.register_proxy(proxy, session=<function null_session>, driver=<function null_driver>)

Register new proxy type.

Parameters

proxy (str) – Proxy type.
session (Callable[[bool], requests.sessions.Session]) – Session factory function, c.f. darc.requests.null_session().
driver (Callable[[], selenium.webdriver.chrome.webdriver.WebDriver]) – Driver factory function, c.f. darc.selenium.null_driver().

Return type

None

See also

The function is renamed from darc.proxy.register().

darc.register_sites(site, *hostname)

Register new site map.

Parameters

site (Type[darc.sites._abc.BaseSite]) – Sites customisation class inherited from BaseSite.
*hostname (Tuple[str]) – Optional list of hostnames the sites customisation should be registered with. By default, we use site.hostname.

Return type

None

See also

The function is renamed from darc.sites.register().
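The fallback from explicit hostnames to site.hostname can be sketched as follows. The BaseSite class here is a bare stand-in, not the real darc.sites._abc.BaseSite; SITE_MAP, ExampleSite and the case-insensitive lookup are assumptions made for the example.

```python
class BaseSite:
    """Bare stand-in for the real base class; only ``hostname`` is modelled."""

    hostname = []


class ExampleSite(BaseSite):
    """Hypothetical customisation for two hostnames."""

    hostname = ["www.example.com", "example.com"]


SITE_MAP = {}  # hypothetical hostname -> site class mapping


def register_sites(site, *hostname):
    """Register ``site`` for the given hostnames, or for ``site.hostname``."""
    for host in hostname or site.hostname:
        SITE_MAP[host.casefold()] = site


register_sites(ExampleSite)                     # uses ExampleSite.hostname
register_sites(ExampleSite, "Mirror.Example")   # explicit hostname instead
```

After these two calls, SITE_MAP maps www.example.com, example.com and mirror.example to ExampleSite.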

For more information on the hooks, please refer to the customisation documentation.
