Technical Documentation¶
darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies and header fields, and bundles selenium to provide a fully rendered web page and a screenshot of that view.
- Main Processing
_process()
process()
process_crawler()
process_loader()
register()
_HOOK_REGISTRY
_WORKER_POOL
- Web Crawlers
crawler()
loader()
- URL Utilities
Link
parse_link()
quote()
unquote()
urljoin()
urlparse()
urlsplit()
- Source Parsing
_check()
_check_ng()
check_robots()
extract_links()
extract_links_from_text()
get_content_type()
match_host()
match_mime()
match_proxy()
darc.parse.URL_PAT
- Source Saving
sanitise()
save_headers()
save_link()
darc.save._SAVE_LOCK
- Link Database
_db_operation()
_drop_hostname_db()
_drop_hostname_redis()
_drop_requests_db()
_drop_requests_redis()
_drop_selenium_db()
_drop_selenium_redis()
_gen_arg_msg()
_have_hostname_db()
_have_hostname_redis()
_load_requests_db()
_load_requests_redis()
_load_selenium_db()
_load_selenium_redis()
_redis_command()
_redis_get_lock()
_save_requests_db()
_save_requests_redis()
_save_selenium_db()
_save_selenium_redis()
drop_hostname()
drop_requests()
drop_selenium()
have_hostname()
load_requests()
load_selenium()
save_requests()
save_selenium()
darc.db.BULK_SIZE
darc.db.LOCK_TIMEOUT
darc.db.MAX_POOL
darc.db.REDIS_LOCK
darc.db.RETRY_INTERVAL
- Data Submission
get_hosts()
get_robots()
get_sitemaps()
save_submit()
submit()
submit_new_host()
submit_requests()
submit_selenium()
darc.submit.PATH_API
darc.submit.SAVE_DB
darc.submit.API_RETRY
darc.submit.API_NEW_HOST
darc.submit.API_REQUESTS
darc.submit.API_SELENIUM
- Requests Wrapper
default_user_agent()
i2p_session()
null_session()
request_session()
tor_session()
- Selenium Wrapper
get_capabilities()
get_options()
i2p_driver()
null_driver()
request_driver()
tor_driver()
darc.selenium.BINARY_LOCATION
- Proxy Utilities
- Bitcoin Addresses
save_bitcoin()
darc.proxy.bitcoin.PATH
darc.proxy.bitcoin.LOCK
- Data URI Schemes
save_data()
darc.proxy.data.PATH
- ED2K Magnet Links
save_ed2k()
darc.proxy.ed2k.PATH
darc.proxy.ed2k.LOCK
- Ethereum Addresses
save_ethereum()
darc.proxy.ethereum.PATH
darc.proxy.ethereum.LOCK
- Freenet Proxy
_freenet_bootstrap()
freenet_bootstrap()
launch_freenet()
darc.proxy.freenet.FREENET_PORT
darc.proxy.freenet.FREENET_RETRY
darc.proxy.freenet.BS_WAIT
darc.proxy.freenet.FREENET_PATH
darc.proxy.freenet.FREENET_ARGS
darc.proxy.freenet._MNG_FREENET
darc.proxy.freenet._FREENET_BS_FLAG
darc.proxy.freenet._FREENET_PROC
darc.proxy.freenet._FREENET_ARGS
- I2P Proxy
_i2p_bootstrap()
fetch_hosts()
get_hosts()
have_hosts()
i2p_bootstrap()
launch_i2p()
read_hosts()
save_hosts()
darc.proxy.i2p.I2P_REQUESTS_PROXY
darc.proxy.i2p.I2P_SELENIUM_PROXY
darc.proxy.i2p.I2P_PORT
darc.proxy.i2p.I2P_RETRY
darc.proxy.i2p.BS_WAIT
darc.proxy.i2p.I2P_ARGS
darc.proxy.i2p._MNG_I2P
darc.proxy.i2p._I2P_BS_FLAG
darc.proxy.i2p._I2P_PROC
darc.proxy.i2p._I2P_ARGS
- IRC Addresses
save_irc()
darc.proxy.irc.PATH
darc.proxy.irc.LOCK
- Magnet Links
save_magnet()
darc.proxy.magnet.PATH
darc.proxy.magnet.LOCK
- Email Addresses
save_mail()
darc.proxy.mail.PATH
darc.proxy.mail.LOCK
- No Proxy
fetch_sitemap()
get_sitemap()
have_robots()
have_sitemap()
read_robots()
read_sitemap()
save_invalid()
save_robots()
save_sitemap()
darc.proxy.null.PATH
darc.proxy.null.LOCK
- JavaScript Links
save_script()
darc.proxy.script.PATH
darc.proxy.script.LOCK
- Telephone Numbers
save_tel()
darc.proxy.tel.PATH
darc.proxy.tel.LOCK
- Tor Proxy
_tor_bootstrap()
print_bootstrap_lines()
renew_tor_session()
tor_bootstrap()
darc.proxy.tor.TOR_REQUESTS_PROXY
darc.proxy.tor.TOR_SELENIUM_PROXY
darc.proxy.tor.TOR_PORT
darc.proxy.tor.TOR_CTRL
darc.proxy.tor.TOR_PASS
darc.proxy.tor.TOR_RETRY
darc.proxy.tor.BS_WAIT
darc.proxy.tor.TOR_CFG
darc.proxy.tor._MNG_TOR
darc.proxy.tor._TOR_BS_FLAG
darc.proxy.tor._TOR_PROC
darc.proxy.tor._TOR_CTRL
darc.proxy.tor._TOR_CONFIG
- ZeroNet Proxy
_zeronet_bootstrap()
launch_zeronet()
zeronet_bootstrap()
darc.proxy.zeronet.ZERONET_PORT
darc.proxy.zeronet.ZERONET_RETRY
darc.proxy.zeronet.BS_WAIT
darc.proxy.zeronet.ZERONET_PATH
darc.proxy.zeronet.ZERONET_ARGS
darc.proxy.zeronet._MNG_ZERONET
darc.proxy.zeronet._ZERONET_BS_FLAG
darc.proxy.zeronet._ZERONET_PROC
darc.proxy.zeronet._ZERONET_ARGS
darc.proxy.LINK_MAP
- Sites Customisation
- Base Sites Customisation
BaseSite
- Default Hooks
DefaultSite
- Bitcoin Addresses
Bitcoin
- Data URI Schemes
DataURI
- ED2K Magnet Links
ED2K
- Ethereum Addresses
Ethereum
- IRC Addresses
IRC
- Magnet Links
Magnet
- Email Addresses
Email
- JavaScript Links
Script
- Telephone Numbers
Tel
crawler_hook()
loader_hook()
register()
darc.sites.SITEMAP
_get_site()
- Module Constants
- Custom Exceptions
APIRequestFailed
DatabaseOperaionFailed
FreenetBootstrapFailed
HookExecutionFailed
I2PBootstrapFailed
LinkNoReturn
LockWarning
RedisCommandFailed
SiteNotFoundWarning
TorBootstrapFailed
TorRenewFailed
UnsupportedLink
UnsupportedPlatform
UnsupportedProxy
WorkerBreak
ZeroNetBootstrapFailed
_BaseException
_BaseWarning
- Data Models
- Task Queues
- Submission Data Models
- Hostname Records
HostnameModel
- URL Records
URLModel
URLThroughModel
robots.txt
RecordsRobotsModel
sitemap.xml
RecordsSitemapModel
hosts.txt
RecordsHostsModel
- Crawler Records
RequestsHistoryModel
RequestsHistoryModel.DoesNotExist
RequestsHistoryModel.cookies
RequestsHistoryModel.document
RequestsHistoryModel.id
RequestsHistoryModel.index
RequestsHistoryModel.method
RequestsHistoryModel.model
RequestsHistoryModel.model_id
RequestsHistoryModel.reason
RequestsHistoryModel.request
RequestsHistoryModel.response
RequestsHistoryModel.status_code
RequestsHistoryModel.timestamp
RequestsHistoryModel.url
RequestsModel
RequestsModel.DoesNotExist
RequestsModel.cookies
RequestsModel.document
RequestsModel.history
RequestsModel.id
RequestsModel.is_html
RequestsModel.method
RequestsModel.mime_type
RequestsModel.reason
RequestsModel.request
RequestsModel.response
RequestsModel.session
RequestsModel.status_code
RequestsModel.timestamp
RequestsModel.url
RequestsModel.url_id
- Loader Records
SeleniumModel
- Base Model
BaseMeta
BaseMetaWeb
BaseModel
BaseModelWeb
- Miscellaneous Utilities
IPField
IntEnumField
JSONField
PickleField
Proxy
table_function()
As websites can sometimes be irritating with their anti-robot verification, login requirements, etc., the darc project also provides hooks to customise crawling behaviours around both requests and selenium.
See also
Such customisation, called site hooks in the darc project, is site specific: users can set up their own hooks for a certain site, c.f. darc.sites for more information.
Still, crawling speed largely depends on the response speed of the target website. To boost throughput while staying within system capacity, the darc project introduces multiprocessing, multithreading and a fallback single-threaded solution for crawling.
Note
When rendering the target website using selenium powered by the renowned Google Chrome, much memory is required. Thus, the three solutions mentioned above only toggle the behaviour around the use of selenium.
To keep the darc project a Swiss Army knife, only the main entrypoint function darc.process.process() is exported in the global namespace (renamed to darc.darc()), see below:
- darc.darc(worker)¶
Main process.
The function will register _signal_handler() for SIGTERM, and start the main process of the darc darkweb crawlers.
- Parameters:
worker (Literal['crawler', 'loader']) – Worker process type.
- Raises:
ValueError – If worker is not a valid value.
- Return type:
Before starting the workers, the function will start proxies through
darc.proxy.tor.tor_proxy()
darc.proxy.i2p.i2p_proxy()
darc.proxy.zeronet.zeronet_proxy()
darc.proxy.freenet.freenet_proxy()
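The dispatch described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `workers` table and its no-op entries are hypothetical stand-ins for the real `process_crawler()` / `process_loader()` loops, while the `_signal_handler()` name and the ValueError behaviour come from the documentation above.

```python
import signal
import threading
from typing import Callable, Dict


def _signal_handler(signum, frame):
    """Handle SIGTERM by asking the process to exit gracefully."""
    raise SystemExit(signum)


def darc(worker: str) -> None:
    """Main process: register the SIGTERM handler, then dispatch to the
    requested worker loop (sketch)."""
    # signal handlers may only be registered from the main thread
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, _signal_handler)

    # hypothetical worker table; the real project runs
    # process_crawler() / process_loader() here
    workers: Dict[str, Callable[[], None]] = {
        'crawler': lambda: None,
        'loader': lambda: None,
    }
    if worker not in workers:
        raise ValueError(f'invalid worker type: {worker!r}')
    workers[worker]()
```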
The general process for workers of the crawler type can be described as follows:

process_crawler(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler().

crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

If the URL is from a brand new host, darc will first try to fetch and save robots.txt and the sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract and save the links from the sitemaps (c.f. read_sitemap()) into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

Note
The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.

At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information using save_headers().

Note
If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing is dropped.

If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.

If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).

If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). If not, the URL will be saved into the selenium link database to proceed with the next steps (c.f. save_selenium()).
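The white-list/black-list check mentioned above might look like the following sketch. The function name `match_host` appears in the Source Parsing index, but its semantics here are an assumption: we take it to return True when the host should be skipped, and the two pattern lists are hypothetical stand-ins for the configuration-driven LINK_WHITE_LIST / LINK_BLACK_LIST.

```python
import re
from typing import List, Pattern

# hypothetical filter lists mirroring LINK_WHITE_LIST / LINK_BLACK_LIST;
# the real project loads these from environment configuration
LINK_WHITE_LIST: List[Pattern[str]] = [re.compile(r'\.onion$', re.IGNORECASE)]
LINK_BLACK_LIST: List[Pattern[str]] = [re.compile(r'^localhost$', re.IGNORECASE)]


def match_host(host: str) -> bool:
    """Return True if the hostname should be SKIPPED (assumed semantics)."""
    # white list takes precedence: explicitly allowed hosts are always kept
    if any(pat.search(host) for pat in LINK_WHITE_LIST):
        return False
    # black-listed hosts are dropped
    if any(pat.search(host) for pat in LINK_BLACK_LIST):
        return True
    # default: keep the link
    return False
```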
The general process for workers of the loader type can be described as follows:

process_loader(): in the meanwhile, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().

loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

At this point, darc will call the customised hook function from darc.sites to load and return the original Chrome object.

If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved. If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

Later, extract_links() will then be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
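The link-extraction step shared by both workers can be approximated with the standard library alone. This is only a sketch of what extract_links() does, assuming anchor hrefs are resolved against the page URL; the real implementation may use a full HTML parsing library and additional filtering.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags, resolved against a base URL."""

    def __init__(self, base: str) -> None:
        super().__init__()
        self.base = base
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # resolve relative links against the page URL
                    self.links.append(urljoin(self.base, value))


def extract_links(url: str, html: str) -> List[str]:
    """Return all anchor targets found in an HTML document (sketch)."""
    parser = _LinkExtractor(url)
    parser.feed(html)
    return parser.links
```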
After each round, darc will call the registered hook functions in sequential order, with the type of worker ('crawler' or 'loader') and the current link pool as their parameters; see register() for more information.

If in reboot mode, i.e. REBOOT is True, the function will exit after the first round. If not, it will renew the Tor connections (if bootstrapped), c.f. renew_tor_session(), and start another round.

See also
The function is renamed from darc.process.process().
We also export the necessary hook registration functions to the global namespace, see below:
- darc.register_hooks(hook, *, _index=None)¶
Register hook function.
- Parameters:
- Keyword Arguments:
_index – Position index for the hook function.
- Return type:
The hook function takes two parameters:

a str object indicating the type of worker, i.e. 'crawler' or 'loader';

a list object containing Link objects, as the current processed link pool.

The hook function may raise WorkerBreak so that the worker breaks from its indefinite loop upon finishing the current round. Any value returned from the hook function will be ignored by the workers.

See also
The hook functions will be saved into _HOOK_REGISTRY.

See also
The function is renamed from darc.process.register().
- darc.register_proxy(proxy, session=<function null_session>, driver=<function null_driver>)¶
Register new proxy type.
- Parameters:
proxy (str) – Proxy type.
session (Callable[[bool], Union[Session, FuturesSession]]) – Session factory function, c.f. darc.requests.null_session().
driver (Callable[[], WebDriver]) – Driver factory function, c.f. darc.selenium.null_driver().
- Return type:
See also
The function is renamed from darc.proxy.register().
- darc.register_sites(site, *hostname)¶
Register new site map.
- Parameters:
- Return type:
See also
The function is renamed from darc.sites.register().
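Site registration maps hostnames onto a customised site class, keyed into the SITEMAP listed in the index. The sketch below assumes a BaseSite with a `hostname` attribute and the crawler_hook()/loader_hook() methods named in the Sites Customisation section; the hook signatures and the case-folded keys are assumptions, not the project's exact API.

```python
from typing import Dict, Type


class BaseSite:
    """Sketch of darc.sites.BaseSite: subclasses override the hooks."""
    hostname: list = []

    @staticmethod
    def crawler_hook(*args, **kwargs):
        raise NotImplementedError

    @staticmethod
    def loader_hook(*args, **kwargs):
        raise NotImplementedError


# conceptual stand-in for darc.sites.SITEMAP: hostname -> site class
SITEMAP: Dict[str, Type[BaseSite]] = {}


def register_sites(site: Type[BaseSite], *hostname: str) -> None:
    """Map the site class's own hostnames, plus any extra ones given,
    onto the site class."""
    for host in (*site.hostname, *hostname):
        SITEMAP[host.casefold()] = site
```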
For more information on the hooks, please refer to the customisation documentations.