Main Processing

The darc.process module contains the main processing logic of the darc module.

darc.process._dump_last_word(errors=True)

Dump data in queue.

Parameters

errors (bool) – Whether the function is being called upon a raised error.

If errors is True, the function will first copy the backup of the requests database _queue_requests.txt.tmp (if exists) and the backup of the selenium database _queue_selenium.txt.tmp (if exists) back to the corresponding databases, so that the queued links are not lost.

The function will then remove the backup of the requests database _queue_requests.txt.tmp (if exists) and the backup of the selenium database _queue_selenium.txt.tmp (if exists).

The function will also remove the darc.pid PID file.
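The cleanup logic described above can be sketched as follows. This is a minimal illustration, not darc's actual implementation: the path parameter and the function body are assumptions based on the behaviour documented here.

```python
import os
import shutil

def dump_last_word(errors=True, path='.'):
    """Sketch of darc.process._dump_last_word (illustrative only)."""
    for name in ('_queue_requests.txt', '_queue_selenium.txt'):
        backup = os.path.join(path, name + '.tmp')
        if os.path.isfile(backup):
            if errors:
                # on error, restore the backup into the live database first
                shutil.copyfile(backup, os.path.join(path, name))
            os.remove(backup)
    # remove the PID file regardless of whether an error occurred
    pid_file = os.path.join(path, 'darc.pid')
    if os.path.isfile(pid_file):
        os.remove(pid_file)
```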

Fetch links from queue.

Returns

List of links from the requests database.

Return type

List[str]

Deprecated since version 0.1.0: Use darc.db.load_requests() instead.

Fetch links from queue.

Returns

List of links from the selenium database.

Return type

List[str]

Deprecated since version 0.1.0: Use darc.db.load_selenium() instead.

darc.process._load_last_word()

Load data to queue.

The function will copy the backup of the requests database _queue_requests.txt.tmp (if exists) and the backup of the selenium database _queue_selenium.txt.tmp (if exists) to the corresponding database.

The function will also save the process ID to the darc.pid PID file.
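The start-up logic can be sketched in the same way; again, load_last_word and its path parameter are illustrative assumptions, not the actual darc.process code.

```python
import os
import shutil

def load_last_word(path='.'):
    """Sketch of darc.process._load_last_word (illustrative only)."""
    for name in ('_queue_requests.txt', '_queue_selenium.txt'):
        backup = os.path.join(path, name + '.tmp')
        if os.path.isfile(backup):
            # restore the backup into the live database
            shutil.copyfile(backup, os.path.join(path, name))
    # record the current process ID in the PID file
    with open(os.path.join(path, 'darc.pid'), 'w') as file:
        file.write(str(os.getpid()))
```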

darc.process._signal_handler(signum=None, frame=None)

Signal handler.

The function will call _dump_last_word() so that the process can exit gracefully without losing the queued links.

If the current process is not the main process, the function shall do nothing.

Parameters
  • signum (Union[int, signal.Signals, None]) – The signal to handle.

  • frame (types.FrameType) – The traceback frame from the signal.
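The handler and its registration can be sketched as below; this is an illustrative approximation of the documented behaviour, and the print call stands in for the real _dump_last_word() cleanup.

```python
import os
import signal

MAIN_PID = os.getpid()  # PID of the main process, recorded at start-up

def signal_handler(signum=None, frame=None):
    """Sketch of darc.process._signal_handler (illustrative only)."""
    if os.getpid() != MAIN_PID:
        return  # not the main process: do nothing
    # the real handler calls _dump_last_word() here to dump the queues
    print('dumping queues before exit')

# register the handler for SIGTERM, as process() does
signal.signal(signal.SIGTERM, signal_handler)
```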

darc.process.process()

Main process.

The function will register _signal_handler() for SIGTERM, and start the main process of the darc darkweb crawler.

The general process can be described as follows:

  1. process(): obtain URLs from the requests link database (cf. load_requests()), and feed such URLs to crawler() with multiprocessing support.

  2. crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (cf. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

    If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (cf. save_robots() and save_sitemap()), then extract and save the links from the sitemaps (cf. read_sitemap()) into the link database for future crawling (cf. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

    If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

    Note

    The root path (e.g. / in https://www.example.com/) will always be crawled ignoring robots.txt.
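The robots.txt gate described above, including the root-path exemption, can be sketched with the standard-library parser. This is a minimal illustration under stated assumptions: can_crawl and its arguments are hypothetical names, not darc's actual API.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def can_crawl(link, robots_text, force=False):
    """Check whether a URL may be crawled (illustrative sketch)."""
    if force:
        return True  # FORCE bypasses the robots.txt check
    if urlsplit(link).path in ('', '/'):
        return True  # the root path is always crawled, ignoring robots.txt
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch('*', link)
```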

    At this point, darc will call the customised hook function from darc.sites to crawl the URL and get the final response object. darc will save the session cookies and header information using save_headers().

    Note

    If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing is dropped.

    If the content type of the response document is not ignored (cf. MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. If the submission API is provided, submit_requests() will be called to submit the document just fetched.

    If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save such links into the database (cf. save_requests()).

    If the response status code is between 400 and 600, the URL will be saved back into the requests link database for a later retry (cf. save_requests()). If not, the URL will be saved into the selenium link database to proceed to the next steps (cf. save_selenium()).
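The routing rule above reduces to a simple status-code check; route is a hypothetical helper for illustration, not part of darc.

```python
def route(status_code):
    """Decide which database a crawled URL goes to (illustrative sketch)."""
    if 400 <= status_code < 600:
        return 'requests'  # failed: re-queue for a later retry
    return 'selenium'      # succeeded: proceed to loading
```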

  3. process(): after the obtained URLs have all been crawled, darc will obtain URLs from the selenium link database (cf. load_selenium()), and feed such URLs to loader().

    Note

    If FLAG_MP is True, the function will be called with multiprocessing support; if FLAG_TH is True, the function will be called with multithreading support; if neither, the function will be called in a single thread.
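The dispatch described in this note can be sketched as follows; dispatch, flag_mp, flag_th and the worker count are illustrative names, not darc's actual API.

```python
import multiprocessing
import multiprocessing.pool

def dispatch(worker, links, flag_mp=False, flag_th=False, processes=4):
    """Run worker over links with the configured concurrency (sketch)."""
    if flag_mp:
        # multiprocessing support
        with multiprocessing.Pool(processes) as pool:
            pool.map(worker, links)
    elif flag_th:
        # multithreading support
        with multiprocessing.pool.ThreadPool(processes) as pool:
            pool.map(worker, links)
    else:
        # single thread
        for link in links:
            worker(link)
```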

  4. loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

    At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.Chrome object.

    If successful, the rendered source HTML document will be saved using save_html(), and a full-page screenshot will be taken and saved.

    If the submission API is provided, submit_selenium() will be called and submit the document just loaded.

    Later, extract_links() will be called to extract all possible links from the HTML document and save such links into the requests database (cf. save_requests()).
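A minimal stand-in for the link-extraction step, built on the standard library; the real extract_links() may use a different parser and extract more than anchor tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (illustrative sketch)."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # resolve relative links against the page URL
                    self.links.append(urljoin(self.base, value))

extractor = LinkExtractor('https://www.example.com/')
extractor.feed('<a href="/about">About</a><a href="https://other.example/">x</a>')
```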

If in reboot mode, i.e. REBOOT is True, the function will exit after the first round. If not, it will renew the Tor connections (if bootstrapped, cf. renew_tor_session()) and start another round.
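The overall round loop can be sketched as below. The callables stand in for the darc functions of the same names, and the bounded rounds parameter is an illustration: the real loop runs indefinitely unless in reboot mode.

```python
def process(load_requests, crawler, load_selenium, loader,
            renew_tor_session, reboot=False, rounds=2):
    """Sketch of the darc.process.process() round loop (illustrative only)."""
    for _ in range(rounds):
        # step 1: crawl the queued URLs with requests
        for link in load_requests():
            crawler(link)
        # step 3: load the queued URLs with selenium
        for link in load_selenium():
            loader(link)
        if reboot:
            break  # reboot mode: exit after the first round
        renew_tor_session()  # renew Tor connections, then start another round
```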