Technical Documentation
darc is designed as a Swiss Army knife for darkweb crawling.
It integrates requests to collect HTTP request and response
information, such as cookies, header fields, etc. It also bundles
selenium to provide a fully rendered web page and a screenshot
of that view.
- Main Processing
  _process(), process(), process_crawler(), process_loader(), register(), _HOOK_REGISTRY, _WORKER_POOL
- Web Crawlers
  crawler(), loader()
- URL Utilities
  Link, parse_link(), quote(), unquote(), urljoin(), urlparse(), urlsplit()
- Source Parsing
  _check(), _check_ng(), check_robots(), extract_links(), extract_links_from_text(), get_content_type(), match_host(), match_mime(), match_proxy(), darc.parse.URL_PAT
- Source Saving
  sanitise(), save_headers(), save_link(), darc.save._SAVE_LOCK
- Link Database
  _db_operation(), _drop_hostname_db(), _drop_hostname_redis(), _drop_requests_db(), _drop_requests_redis(), _drop_selenium_db(), _drop_selenium_redis(), _gen_arg_msg(), _have_hostname_db(), _have_hostname_redis(), _load_requests_db(), _load_requests_redis(), _load_selenium_db(), _load_selenium_redis(), _redis_command(), _redis_get_lock(), _save_requests_db(), _save_requests_redis(), _save_selenium_db(), _save_selenium_redis(), drop_hostname(), drop_requests(), drop_selenium(), have_hostname(), load_requests(), load_selenium(), save_requests(), save_selenium(), darc.db.BULK_SIZE, darc.db.LOCK_TIMEOUT, darc.db.MAX_POOL, darc.db.REDIS_LOCK, darc.db.RETRY_INTERVAL
- Data Submission
  get_hosts(), get_robots(), get_sitemaps(), save_submit(), submit(), submit_new_host(), submit_requests(), submit_selenium(), darc.submit.PATH_API, darc.submit.SAVE_DB, darc.submit.API_RETRY, darc.submit.API_NEW_HOST, darc.submit.API_REQUESTS, darc.submit.API_SELENIUM
- Requests Wrapper
  default_user_agent(), i2p_session(), null_session(), request_session(), tor_session()
- Selenium Wrapper
  get_capabilities(), get_options(), i2p_driver(), null_driver(), request_driver(), tor_driver(), darc.selenium.BINARY_LOCATION
- Proxy Utilities
  - Bitcoin Addresses
    save_bitcoin(), darc.proxy.bitcoin.PATH, darc.proxy.bitcoin.LOCK
  - Data URI Schemes
    save_data(), darc.proxy.data.PATH
  - ED2K Magnet Links
    save_ed2k(), darc.proxy.ed2k.PATH, darc.proxy.ed2k.LOCK
  - Ethereum Addresses
    save_ethereum(), darc.proxy.ethereum.PATH, darc.proxy.ethereum.LOCK
  - Freenet Proxy
    _freenet_bootstrap(), freenet_bootstrap(), launch_freenet(), darc.proxy.freenet.FREENET_PORT, darc.proxy.freenet.FREENET_RETRY, darc.proxy.freenet.BS_WAIT, darc.proxy.freenet.FREENET_PATH, darc.proxy.freenet.FREENET_ARGS, darc.proxy.freenet._MNG_FREENET, darc.proxy.freenet._FREENET_BS_FLAG, darc.proxy.freenet._FREENET_PROC, darc.proxy.freenet._FREENET_ARGS
  - I2P Proxy
    _i2p_bootstrap(), fetch_hosts(), get_hosts(), have_hosts(), i2p_bootstrap(), launch_i2p(), read_hosts(), save_hosts(), darc.proxy.i2p.I2P_REQUESTS_PROXY, darc.proxy.i2p.I2P_SELENIUM_PROXY, darc.proxy.i2p.I2P_PORT, darc.proxy.i2p.I2P_RETRY, darc.proxy.i2p.BS_WAIT, darc.proxy.i2p.I2P_ARGS, darc.proxy.i2p._MNG_I2P, darc.proxy.i2p._I2P_BS_FLAG, darc.proxy.i2p._I2P_PROC, darc.proxy.i2p._I2P_ARGS
  - IRC Addresses
    save_irc(), darc.proxy.irc.PATH, darc.proxy.irc.LOCK
  - Magnet Links
    save_magnet(), darc.proxy.magnet.PATH, darc.proxy.magnet.LOCK
  - Email Addresses
    save_mail(), darc.proxy.mail.PATH, darc.proxy.mail.LOCK
  - No Proxy
    fetch_sitemap(), get_sitemap(), have_robots(), have_sitemap(), read_robots(), read_sitemap(), save_invalid(), save_robots(), save_sitemap(), darc.proxy.null.PATH, darc.proxy.null.LOCK
  - JavaScript Links
    save_script(), darc.proxy.script.PATH, darc.proxy.script.LOCK
  - Telephone Numbers
    save_tel(), darc.proxy.tel.PATH, darc.proxy.tel.LOCK
  - Tor Proxy
    _tor_bootstrap(), print_bootstrap_lines(), renew_tor_session(), tor_bootstrap(), darc.proxy.tor.TOR_REQUESTS_PROXY, darc.proxy.tor.TOR_SELENIUM_PROXY, darc.proxy.tor.TOR_PORT, darc.proxy.tor.TOR_CTRL, darc.proxy.tor.TOR_PASS, darc.proxy.tor.TOR_RETRY, darc.proxy.tor.BS_WAIT, darc.proxy.tor.TOR_CFG, darc.proxy.tor._MNG_TOR, darc.proxy.tor._TOR_BS_FLAG, darc.proxy.tor._TOR_PROC, darc.proxy.tor._TOR_CTRL, darc.proxy.tor._TOR_CONFIG
  - ZeroNet Proxy
    _zeronet_bootstrap(), launch_zeronet(), zeronet_bootstrap(), darc.proxy.zeronet.ZERONET_PORT, darc.proxy.zeronet.ZERONET_RETRY, darc.proxy.zeronet.BS_WAIT, darc.proxy.zeronet.ZERONET_PATH, darc.proxy.zeronet.ZERONET_ARGS, darc.proxy.zeronet._MNG_ZERONET, darc.proxy.zeronet._ZERONET_BS_FLAG, darc.proxy.zeronet._ZERONET_PROC, darc.proxy.zeronet._ZERONET_ARGS
  darc.proxy.LINK_MAP
- Sites Customisation
  - Base Sites Customisation
    BaseSite
  - Default Hooks
    DefaultSite
  - Bitcoin Addresses
    Bitcoin
  - Data URI Schemes
    DataURI
  - ED2K Magnet Links
    ED2K
  - Ethereum Addresses
    Ethereum
  - IRC Addresses
    IRC
  - Magnet Links
    Magnet
  - Email Addresses
    Email
  - JavaScript Links
    Script
  - Telephone Numbers
    Tel
  crawler_hook(), loader_hook(), register(), darc.sites.SITEMAP, _get_site()
- Module Constants
- Custom Exceptions
  APIRequestFailed, DatabaseOperaionFailed, FreenetBootstrapFailed, HookExecutionFailed, I2PBootstrapFailed, LinkNoReturn, LockWarning, RedisCommandFailed, SiteNotFoundWarning, TorBootstrapFailed, TorRenewFailed, UnsupportedLink, UnsupportedPlatform, UnsupportedProxy, WorkerBreak, ZeroNetBootstrapFailed, _BaseException, _BaseWarning
- Data Models
  - Task Queues
  - Submission Data Models
    - Hostname Records
      HostnameModel
    - URL Records
      URLModel, URLThroughModel
    - robots.txt Records
      RobotsModel
    - sitemap.xml Records
      SitemapModel
    - hosts.txt Records
      HostsModel
    - Crawler Records
      RequestsHistoryModel, RequestsHistoryModel.DoesNotExist, RequestsHistoryModel.cookies, RequestsHistoryModel.document, RequestsHistoryModel.id, RequestsHistoryModel.index, RequestsHistoryModel.method, RequestsHistoryModel.model, RequestsHistoryModel.model_id, RequestsHistoryModel.reason, RequestsHistoryModel.request, RequestsHistoryModel.response, RequestsHistoryModel.status_code, RequestsHistoryModel.timestamp, RequestsHistoryModel.url
      RequestsModel, RequestsModel.DoesNotExist, RequestsModel.cookies, RequestsModel.document, RequestsModel.history, RequestsModel.id, RequestsModel.is_html, RequestsModel.method, RequestsModel.mime_type, RequestsModel.reason, RequestsModel.request, RequestsModel.response, RequestsModel.session, RequestsModel.status_code, RequestsModel.timestamp, RequestsModel.url, RequestsModel.url_id
    - Loader Records
      SeleniumModel
  - Base Model
    BaseMeta, BaseMetaWeb, BaseModel, BaseModelWeb
  - Miscellaneous Utilities
    IPField, IntEnumField, JSONField, PickleField, Proxy, table_function()
Because websites can be troublesome to crawl, with anti-robot
verification, login requirements, etc., the darc project
also provides hooks to customise crawling behaviour around both
requests and selenium.
See also
Such customisations are called site hooks in the darc project.
They are site specific: users can set up their own hooks for a
particular site, c.f. darc.sites for more information.
Still, since the network is a world full of mysteries and miracles,
the speed of crawling depends largely on the response speed of
the target website. To speed things up while staying within system capacity,
the darc project offers multiprocessing, multithreading
and, as the slowest fallback, a single-threaded mode for crawling.
Note
Rendering the target website using selenium powered by
the renowned Google Chrome requires a lot of memory.
Thus, the three solutions mentioned above only toggle the
behaviour around the use of selenium.
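The choice between the three execution modes can be pictured with the standard library's concurrent.futures. This is only a rough sketch, not darc's actual implementation: the function run_batch() and its mode and workers parameters are hypothetical names introduced for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar('T')
R = TypeVar('R')


def run_batch(task: Callable[[T], R], items: Iterable[T],
              mode: str = 'single', workers: int = 4) -> List[R]:
    """Apply ``task`` to every item using the chosen execution mode.

    ``run_batch``, ``mode`` and ``workers`` are hypothetical names; darc's
    own switches are not reproduced here.
    """
    batch = list(items)
    if mode == 'process':
        # One OS process per worker; ``task`` and the items must be picklable.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(task, batch))
    if mode == 'thread':
        # Threads share memory, which suits I/O-bound page loading.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(task, batch))
    if mode == 'single':
        # Slowest fallback: handle the items one after another.
        return [task(item) for item in batch]
    raise ValueError(f'unknown mode: {mode!r}')
```

All three branches return results in input order, so a caller can switch modes to fit the memory available without changing the surrounding code.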
To keep the darc project a true Swiss Army knife, only the
main entrypoint function darc.process.process() is exported
in the global namespace (and renamed to darc.darc()), see below:
- darc.darc(worker)
Main process.
The function will register _signal_handler() for SIGTERM, and start the main process of the darc darkweb crawlers.
- Parameters:
worker (Literal['crawler', 'loader']) – Worker process type.
- Raises:
ValueError – If worker is not a valid value.
- Return type:
Before starting the workers, the function will start proxies through
- darc.proxy.tor.tor_proxy()
- darc.proxy.i2p.i2p_proxy()
- darc.proxy.zeronet.zeronet_proxy()
- darc.proxy.freenet.freenet_proxy()
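The start-up sequence described so far (validate worker, register a SIGTERM handler, start the proxies, then run the worker) might look roughly like the sketch below. It does not import darc: main_sketch(), PROXY_STARTERS and the body of _signal_handler() are stand-ins invented for illustration, and the real proxy helpers are replaced by functions that merely record that they ran.

```python
import signal
import threading
from typing import Callable, Dict, List

STARTED: List[str] = []  # records which stand-in proxy starters have run


def _make_starter(name: str) -> Callable[[], None]:
    """Build a hypothetical replacement for one proxy start-up helper."""
    def start() -> None:
        STARTED.append(name)
    return start


# Stand-ins for the four proxy helpers listed above, in the same order.
PROXY_STARTERS: Dict[str, Callable[[], None]] = {
    name: _make_starter(name) for name in ('tor', 'i2p', 'zeronet', 'freenet')
}


def _signal_handler(signum, frame) -> None:
    """Hypothetical handler body: simply exit when SIGTERM arrives."""
    raise SystemExit(0)


def main_sketch(worker: str,
                run_worker: Callable[[str], None] = lambda worker: None) -> None:
    """Illustrative entry point following the order described in the text."""
    if worker not in ('crawler', 'loader'):
        raise ValueError(f'unknown worker type: {worker!r}')
    # signal.signal() may only be called from the main thread.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, _signal_handler)
    for start in PROXY_STARTERS.values():
        start()
    run_worker(worker)
```

With the real package installed, the documented call itself is simply darc.darc('crawler') or darc.darc('loader').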
The general process for workers of crawler type can be described as follows:
- process_crawler(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler().
- crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.
  If the URL is from a brand new host, darc will first try to fetch and save robots.txt and the sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract and save the links from the sitemaps (c.f. read_sitemap()) into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.
  If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.
  Note
  The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.
  At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information using save_headers().
  Note
  If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing is dropped.
  If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.
  If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).
  If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). If not, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
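Two of the decisions in the crawler steps above, the robots.txt check with its root-path exemption and the routing of a URL after the response arrives, can be sketched with the standard library alone. This is an illustration rather than darc's code: allowed_by_robots() and next_steps() are hypothetical helpers, and reading "between 400 and 600" as 400 <= status < 600 is an assumption.

```python
from typing import Iterable, List
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

HTML_TYPES = ('text/html', 'application/xhtml+xml')


def allowed_by_robots(url: str, robots_lines: Iterable[str],
                      user_agent: str = '*', force: bool = False) -> bool:
    """Hypothetical helper: may ``url`` be crawled under these robots.txt lines?"""
    if force:
        return True  # mirrors FORCE being True: robots.txt is not consulted
    if urlsplit(url).path in ('', '/'):
        return True  # the root path is always crawled
    parser = RobotFileParser()
    parser.parse(list(robots_lines))
    return parser.can_fetch(user_agent, url)


def next_steps(status_code: int, content_type: str) -> List[str]:
    """Hypothetical helper: name the follow-up actions for one response."""
    steps = []
    if content_type in HTML_TYPES:
        steps.append('extract_links')
    # Assumption: "between 400 and 600" means 400 <= status_code < 600.
    if 400 <= status_code < 600:
        steps.append('save_requests')  # back to the requests link database
    else:
        steps.append('save_selenium')  # hand over to the loader worker
    return steps


robots = ['User-agent: *', 'Disallow: /private']
print(allowed_by_robots('https://www.example.com/private/page', robots))  # False
print(allowed_by_robots('https://www.example.com/', robots))              # True
print(next_steps(200, 'text/html'))  # ['extract_links', 'save_selenium']
```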
The general process for workers of loader type can be described as follows:
- process_loader(): in the meantime, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().
- loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.
  At this point, darc will call the customised hook function from darc.sites to load and return the original Chrome object.
  If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.
  If the submission API is provided, submit_selenium() will be called to submit the document just loaded.
  Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
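The final link-extraction step can be illustrated without selenium. The sketch below is not darc's extract_links(): it is a simplified stand-in built on the standard library's html.parser, and the names extract_links_sketch() and _AnchorCollector are hypothetical. It collects href values from anchor tags, resolves them against the page URL and drops duplicates.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def extract_links_sketch(base_url: str, html: str) -> List[str]:
    """Return absolute, de-duplicated links in document order."""
    collector = _AnchorCollector()
    collector.feed(html)
    seen = set()
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative references
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


page = '<a href="/a">A</a> <a href="b.html">B</a> <a href="/a">again</a>'
print(extract_links_sketch('https://www.example.com/dir/page', page))
# ['https://www.example.com/a', 'https://www.example.com/dir/b.html']
```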
After each round, darc will call the registered hook functions in sequential order, with the type of worker ('crawler' or 'loader') and the current link pool as parameters, see register() for more information.
If in reboot mode, i.e. REBOOT is True, the function will exit after the first round. If not, it will renew the Tor connections (if bootstrapped), c.f. renew_tor_session(), and start another round.
See also
The function is renamed from darc.process.process().
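The round structure described above can be summarised as a small loop. The sketch is an approximation under stated assumptions: run_worker() and its parameters are hypothetical, WorkerBreak is redefined locally instead of being imported from darc, and letting the remaining hooks run after one of them raises WorkerBreak is a guess rather than documented behaviour.

```python
from typing import Callable, List, Optional, Sequence


class WorkerBreak(Exception):
    """Local stand-in for darc's WorkerBreak exception."""


def run_worker(worker: str,
               load_pool: Callable[[], List[str]],
               process: Callable[[str], None],
               hooks: Sequence[Callable[[str, List[str]], None]],
               reboot: bool = False,
               renew: Optional[Callable[[], None]] = None) -> int:
    """Run rounds until told to stop; return the number of rounds completed."""
    rounds = 0
    while True:
        pool = load_pool()       # e.g. URLs taken from the link database
        for link in pool:
            process(link)        # crawl or load a single link
        rounds += 1
        stop = False
        for hook in hooks:       # hooks are called in registration order
            try:
                hook(worker, pool)
            except WorkerBreak:
                stop = True      # finish this round, then leave the loop
        if stop or reboot:       # reboot mode exits after the first round
            return rounds
        if renew is not None:
            renew()              # e.g. renew the Tor connections
```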
The necessary hook registration functions are also exported to the global namespace, see below:
- darc.register_hooks(hook, *, _index=None)
Register hook function.
- Parameters:
- Keyword Arguments:
_index – Position index for the hook function.
- Return type:
The hook function takes two parameters:
- a str object indicating the type of worker, i.e. 'crawler' or 'loader';
- a list object containing Link objects, as the currently processed link pool.
The hook function may raise WorkerBreak so that the worker breaks from its indefinite loop once the current round finishes. Any value returned from the hook function will be ignored by the workers.
See also
The hook functions will be saved into _HOOK_REGISTRY.
See also
The function is renamed from darc.process.register().
- darc.register_proxy(proxy, session=<function null_session>, driver=<function null_driver>)
Register new proxy type.
- Parameters:
proxy (str) – Proxy type.
session (Callable[[bool], Union[Session, FuturesSession]]) – Session factory function, c.f. darc.requests.null_session().
driver (Callable[[], WebDriver]) – Driver factory function, c.f. darc.selenium.null_driver().
- Return type:
See also
The function is renamed from darc.proxy.register().
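Conceptually, registering a proxy type associates a name with a session factory and a driver factory. The following sketch shows that idea with plain dictionaries standing in for real Session and WebDriver objects. It makes several assumptions: the local LINK_MAP layout, the futures parameter name, the 'socks' proxy type and the localhost:1080 address are all invented for illustration and are not taken from darc.

```python
from typing import Any, Callable, Dict, Tuple

SessionFactory = Callable[[bool], Dict[str, Any]]
DriverFactory = Callable[[], Dict[str, Any]]


def null_session(futures: bool = False) -> Dict[str, Any]:
    """Stand-in session factory: describes a session without any proxy."""
    return {'proxies': None, 'futures': futures}


def null_driver() -> Dict[str, Any]:
    """Stand-in driver factory: describes a driver without any proxy."""
    return {'proxy': None}


# Assumed layout: proxy type -> (session factory, driver factory).
LINK_MAP: Dict[str, Tuple[SessionFactory, DriverFactory]] = {}


def register_proxy(proxy: str,
                   session: SessionFactory = null_session,
                   driver: DriverFactory = null_driver) -> None:
    """Record the factories to use for links of the given proxy type."""
    LINK_MAP[proxy] = (session, driver)


def socks_session(futures: bool = False) -> Dict[str, Any]:
    """Hypothetical factory for a local SOCKS proxy (address is made up)."""
    address = 'socks5h://localhost:1080'
    return {'proxies': {'http': address, 'https': address}, 'futures': futures}


register_proxy('socks', session=socks_session)
```

Leaving driver at its default mirrors the documented signature, where both factories fall back to the null variants.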
- darc.register_sites(site, *hostname)
Register new site map.
- Parameters:
- Return type:
See also
The function is renamed from darc.sites.register().
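Site registration can be thought of as mapping hostnames to a class that carries the site-specific hooks. The sketch below illustrates this with simplified local classes. The hook signature crawler(session, link), the get_site() helper, the case-folding of hostnames and the fallback to DefaultSite are assumptions made for the example, not darc's documented behaviour.

```python
from typing import Dict, Type


class BaseSite:
    """Simplified stand-in for a site customisation base class."""

    @staticmethod
    def crawler(session, link):  # hypothetical hook signature
        raise NotImplementedError


class DefaultSite(BaseSite):
    @staticmethod
    def crawler(session, link):
        return f'default crawl of {link}'


class ExampleSite(BaseSite):
    @staticmethod
    def crawler(session, link):
        return f'custom crawl of {link}'


SITEMAP: Dict[str, Type[BaseSite]] = {}  # local stand-in for the site map


def register_sites(site: Type[BaseSite], *hostname: str) -> None:
    """Associate every given hostname with the site class."""
    for name in hostname:
        SITEMAP[name.casefold()] = site  # assumption: case-insensitive keys


def get_site(host: str) -> Type[BaseSite]:
    """Look up the site class, falling back to the default hooks."""
    return SITEMAP.get(host.casefold(), DefaultSite)


register_sites(ExampleSite, 'www.example.com', 'example.com')
print(get_site('example.com').crawler(None, 'https://example.com/'))
# custom crawl of https://example.com/
```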
For more information on the hooks, please refer to the customisation documentation.