Main Processing¶
The darc.process module contains the main processing logic of the darc module.
-
darc.process._process(worker)[source]¶
Wrapper function to start the worker process.
- Parameters
worker (Union[darc.process.process_crawler, darc.process.process_loader]) – The worker function to start.
- Return type
None
-
darc.process._signal_handler(signum=None, frame=None)[source]¶
Signal handler.
If the current process is not the main process, the function shall do nothing.
- Parameters
signum (Optional[Union[int, signal.Signals]]) – The signal to handle.
frame (Optional[types.FrameType]) – The traceback frame from the signal.
- Return type
None
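To make the relationship between these two private helpers concrete, below is a minimal sketch of the pattern, assuming a multiprocessing-based worker pool; names such as _MAIN_PID are local illustrations, not darc's actual implementation.

```python
import multiprocessing
import os
import signal

_MAIN_PID = os.getpid()  # assumption: record the main PID at import time
_WORKER_POOL = []        # local stand-in for darc.process._WORKER_POOL

def _process(worker):
    """Start the given worker in a child process and track it in the pool."""
    proc = multiprocessing.Process(target=worker)
    proc.start()
    _WORKER_POOL.append(proc)

def _signal_handler(signum=None, frame=None):
    """On SIGTERM, stop tracked workers; do nothing outside the main process."""
    if os.getpid() != _MAIN_PID:
        return
    for proc in _WORKER_POOL:
        if proc.is_alive():
            proc.terminate()  # ask each live child to exit
            proc.join()

signal.signal(signal.SIGTERM, _signal_handler)
```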
-
darc.process.process(worker)[source]¶
Main process.
The function will register _signal_handler() for SIGTERM, and start the main process of the darc darkweb crawlers.
- Parameters
worker (Literal['crawler', 'loader']) – Worker process type.
- Raises
ValueError – If worker is not a valid value.
- Return type
None
Before starting the workers, the function will start proxies through:
- darc.proxy.tor.tor_proxy()
- darc.proxy.i2p.i2p_proxy()
- darc.proxy.zeronet.zeronet_proxy()
- darc.proxy.freenet.freenet_proxy()
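As a quick orientation, here is a hedged sketch of how process() might be driven from an entry point; running the crawler and loader in two separate processes is an illustrative assumption, not necessarily darc's own startup layout.

```python
import multiprocessing

from darc.process import process

if __name__ == '__main__':
    # one crawler and one loader worker; process() itself registers
    # _signal_handler() for SIGTERM and boots the proxies listed above
    workers = [
        multiprocessing.Process(target=process, args=('crawler',)),
        multiprocessing.Process(target=process, args=('loader',)),
    ]
    for proc in workers:
        proc.start()
    for proc in workers:
        proc.join()
    # process('spider') would raise ValueError: not a valid worker type
```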
The general process can be described as follows for workers of 'crawler' type:
1. process_crawler(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler().
2. crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.
If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract and save the links from the sitemaps (c.f. read_sitemap()) into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.
If a robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.
Note
The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.
At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will then save the session cookies and header information using save_headers().
Note
If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing is dropped.
If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.
If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save such links into the database (c.f. save_requests()).
If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()); otherwise, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
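The routing at the end of this flow can be condensed into a short, self-contained sketch; every name below is a local stand-in for the corresponding darc helper, and only the control flow follows the description above.

```python
import requests

REQUESTS_DB = []  # stand-in for the `requests` link database
SELENIUM_DB = []  # stand-in for the `selenium` link database

def crawl_one(url):
    """Crawl a single URL and route it as described above."""
    try:
        response = requests.get(url, timeout=30)
    except requests.exceptions.InvalidSchema:
        return  # darc would record the link via save_invalid() and stop here
    except requests.RequestException:
        return  # other transport errors are out of scope for this sketch
    # darc would save cookies/headers (save_headers()) and, for accepted
    # MIME types, submit the document (submit_requests()) at this point
    if 400 <= response.status_code < 600:
        REQUESTS_DB.append(url)  # error response: queue for another crawl round
    else:
        SELENIUM_DB.append(url)  # success: hand the URL over to the loader worker

crawl_one('https://www.example.com/')
```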
The general process can be described as follows for workers of 'loader' type:
1. process_loader(): meanwhile, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().
2. loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.
At this point, darc will call the customised hook function from darc.sites to load and return the original Chrome object.
If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.
If the submission API is provided, submit_selenium() will be called to submit the document just loaded.
Later, extract_links() will be called to extract all possible links from the HTML document and save such links into the requests database (c.f. save_requests()).
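For illustration, a minimal selenium sketch of the save step described above, using headless Google Chrome; the file names and Chrome options are assumptions, and darc's own loader handles far more (timeouts, proxies, full-page capture).

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)  # darc.sites hooks receive this Chrome object
try:
    driver.get('https://www.example.com/')
    # save the rendered source, as darc does on success
    with open('example.html', 'w', encoding='utf-8') as file:
        file.write(driver.page_source)
    # viewport screenshot; a true full-page capture needs extra setup
    driver.save_screenshot('example.png')
finally:
    driver.quit()
```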
After each round, darc will call the registered hook functions in sequential order, with the type of worker ('crawler' or 'loader') and the current link pool as parameters; see register() for more information.
If in reboot mode, i.e. REBOOT is True, the function will exit after the first round. If not, it will renew the Tor connections (if bootstrapped), c.f. renew_tor_session(), and start another round.
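The round loop can be summarised in a self-contained sketch; do_round() and renew_tor_session() below are stand-ins for darc's own logic, and only the control flow mirrors the documentation.

```python
from typing import Callable, List

_HOOK_REGISTRY: List[Callable] = []  # see register() below
REBOOT = False                       # mirrors darc's REBOOT flag

def do_round(worker):
    """Stand-in for one crawler/loader round; returns the processed link pool."""
    return []

def renew_tor_session():
    """Stand-in for darc.proxy.tor.renew_tor_session()."""

def run(worker):
    while True:
        link_pool = do_round(worker)
        for hook in _HOOK_REGISTRY:  # registered hooks run in sequential order
            hook(worker, link_pool)
        if REBOOT:
            break                    # reboot mode: exit after the first round
        renew_tor_session()          # otherwise renew Tor connections and loop
```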
-
darc.process.process_crawler()[source]¶
A worker to run the crawler() process.
- Warns
HookExecutionFailed – When a hook function raises an error.
- Return type
None
-
darc.process.process_loader()[source]¶
A worker to run the loader() process.
- Warns
HookExecutionFailed – When a hook function raises an error.
- Return type
None
-
darc.process.register(hook, *, _index=None)[source]¶
Register a hook function.
- Parameters
hook (Callable[[Literal['crawler', 'loader'], List[darc.link.Link]], None]) – Hook function to be registered.
_index (Optional[int]) –
- Keyword Arguments
_index – Position index for the hook function.
- Return type
None
The hook function takes two parameters:
- a str object indicating the type of worker, i.e. 'crawler' or 'loader';
- a list object containing Link objects, as the currently processed link pool.
The hook function may raise WorkerBreak so that the worker shall break from its indefinite loop upon finishing the current round. Any value returned from the hook function will be ignored by the workers.
See also
The hook functions will be saved into _HOOK_REGISTRY.
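A hedged example of defining and registering a hook follows; the darc.error import path for WorkerBreak is an assumption based on darc's error module layout.

```python
from typing import List

from darc.error import WorkerBreak  # assumption: WorkerBreak lives in darc.error
from darc.link import Link
from darc.process import register

def my_hook(worker: str, pool: List[Link]) -> None:
    """Log the pool after each round and stop once it drains."""
    print(f'{worker} round finished with {len(pool)} links')
    if not pool:
        raise WorkerBreak  # tell the worker to break from its loop

register(my_hook)  # appended to _HOOK_REGISTRY; use _index to control position
```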
-
darc.process._HOOK_REGISTRY: List[Callable[[Literal['crawler', 'loader'], List[Link]], None]] = []¶
List of hook functions to be called between each round.
- Type
List[Callable[[Literal['crawler', 'loader'], List[Link]], None]]
-
darc.process._WORKER_POOL = None¶
List of active child processes and/or threads.
- Type
List[Union[multiprocessing.Process, threading.Thread]]