Main Processing
The darc.process module contains the main processing
logic of the darc module.
-
darc.process._process(worker)
Wrapper function to start the worker process.
- Parameters
worker (Union[darc.process.process_crawler, darc.process.process_loader]) –
-
darc.process._signal_handler(signum=None, frame=None)
Signal handler.
If the current process is not the main process, the function shall do nothing.
- Parameters
signum (Optional[Union[int, signal.Signals]]) – The signal to handle.
frame (types.FrameType) – The traceback frame from the signal.
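The main-process guard described above can be sketched as follows. This is a minimal illustration and not darc's implementation: `_MAIN_PID`, `should_handle()`, `sketch_signal_handler()` and `register_handler()` are hypothetical names, and comparing process IDs is only an assumed way of telling the main process apart from its children.

```python
import os
import signal

# Assumption: the PID recorded at import time identifies the main process
# (forked children inherit this recorded value but have a different PID).
_MAIN_PID = os.getpid()


def should_handle(current_pid: int, main_pid: int) -> bool:
    """Return True only when the signal arrives in the main process."""
    return current_pid == main_pid


def sketch_signal_handler(signum=None, frame=None) -> bool:
    """Hypothetical handler: do nothing unless running in the main process."""
    if not should_handle(os.getpid(), _MAIN_PID):
        return False
    # A real handler would shut down the active workers here (assumption).
    return True


def register_handler():
    """Register the handler for SIGTERM and return the previous handler."""
    return signal.signal(signal.SIGTERM, sketch_signal_handler)
```

Python ignores the return value of a signal handler; the boolean is returned here only so the guard can be checked directly.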
See also
-
darc.process.process(worker)
Main process.
The function will register _signal_handler() for SIGTERM, and start the main process of the darc darkweb crawlers.
- Parameters
worker (Literal[crawler, loader]) – Worker process type.
- Raises
ValueError – If worker is not a valid value.
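As a rough illustration of the validation described above, the sketch below maps a worker type to a callable and raises ValueError for anything else. The `WORKERS` table, the stub functions and `select_worker()` are hypothetical stand-ins, not part of darc.

```python
def _crawler_stub():
    """Placeholder standing in for the crawler worker (hypothetical)."""
    return 'crawler started'


def _loader_stub():
    """Placeholder standing in for the loader worker (hypothetical)."""
    return 'loader started'


# Assumed lookup table from worker type to the function that runs it.
WORKERS = {
    'crawler': _crawler_stub,
    'loader': _loader_stub,
}


def select_worker(worker: str):
    """Return the callable for ``worker`` or raise ValueError if unknown."""
    try:
        return WORKERS[worker]
    except KeyError:
        raise ValueError(f'unsupported worker type: {worker!r}') from None
```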
Before starting the workers, the function will start proxies through:
- darc.proxy.tor.tor_proxy()
- darc.proxy.i2p.i2p_proxy()
- darc.proxy.zeronet.zeronet_proxy()
- darc.proxy.freenet.freenet_proxy()
The general process can be described as follows for workers of crawler type:
- process_crawler(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler().
- crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.
  If the URL is from a brand new host, darc will first try to fetch and save robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.
  If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.
  Note
  The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.
  At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information using save_headers().
  Note
  If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing is dropped.
  If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.
  If the response document is HTML (text/html and application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save such links into the database (c.f. save_requests()).
  And if the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). If not, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
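Three of the crawler decisions above (the robots.txt check, the HTML test and the choice of the next link database) can be sketched with the standard library alone. This is an illustration under stated assumptions, not darc's code: `may_crawl()`, `is_html()` and `next_database()` are hypothetical helpers, the `force` argument stands in for FORCE, and "between 400 and 600" is read as 400 <= code < 600.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

HTML_TYPES = ('text/html', 'application/xhtml+xml')


def may_crawl(url, robots_txt=None, force=False, user_agent='*'):
    """Decide whether robots.txt permits crawling ``url``."""
    if force or robots_txt is None:
        return True  # FORCE is set, or no robots.txt was found
    if urlsplit(url).path in ('', '/'):
        return True  # the root path is always crawled
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def is_html(content_type):
    """Return True for the two HTML content types named above."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    return content_type.split(';', 1)[0].strip().lower() in HTML_TYPES


def next_database(status_code):
    """Pick the link database that receives the URL after the request."""
    # Assumption: "between 400 and 600" means 400 <= code < 600.
    if 400 <= status_code < 600:
        return 'requests'  # saved back to the requests link database
    return 'selenium'  # handed over to the loader workers
```

For example, `may_crawl('https://www.example.com/', 'User-agent: *\nDisallow: /\n')` is allowed because of the root-path exception, even though the rules disallow everything.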
The general process can be described as follows for workers of loader type:
- process_loader(): meanwhile, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().
- loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.
  At this point, darc will call the customised hook function from darc.sites to load and return the original Chrome object.
  If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.
  If the submission API is provided, submit_selenium() will be called to submit the document just loaded.
  Later, extract_links() will be called to extract all possible links from the HTML document and save such links into the requests database (c.f. save_requests()).
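The link-extraction step mentioned above can be approximated with the standard library. The sketch below is not darc's extract_links(); `collect_links()` and `_AnchorCollector` are hypothetical names, and it assumes that only the href attributes of anchor tags count as links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def collect_links(base_url, html):
    """Return the unique links found in ``html``, in first-seen order."""
    collector = _AnchorCollector(base_url)
    collector.feed(html)
    collector.close()
    return list(dict.fromkeys(collector.links))
```

The resulting list is what a loader-like worker would then feed back into the requests link database.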
If in reboot mode, i.e. REBOOT is True, the function will exit after the first round. If not, it will renew the Tor connections (if bootstrapped), c.f. renew_tor_session(), and start another round.
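The round structure described above can be sketched as a simple loop. `run_rounds()` and its parameters are hypothetical: `reboot` stands in for REBOOT, `renew_session` for the Tor renewal step, and `max_rounds` is an extra cap added here only so that the example terminates.

```python
def run_rounds(do_round, reboot, renew_session=None, max_rounds=None):
    """Run worker rounds and return how many rounds were executed."""
    rounds = 0
    while True:
        do_round()
        rounds += 1
        if reboot:
            break  # reboot mode: exit after the first round
        if max_rounds is not None and rounds >= max_rounds:
            break  # illustrative cap, not part of the described behaviour
        if renew_session is not None:
            renew_session()  # e.g. renew the Tor connections between rounds
    return rounds
```

With `reboot=True` the loop body runs exactly once and `renew_session` is never called; otherwise the renewal happens between consecutive rounds.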
-
darc.process._WORKER_POOL = None
List of active child processes and/or threads.
- Type
List[Union[multiprocessing.Process, threading.Thread]]