darc - Darkweb Crawler Project
darc is designed as a Swiss Army knife for darkweb crawling.
It integrates requests to collect HTTP request and response
information, such as cookies and header fields. It also bundles
selenium to provide a fully rendered web page and a screenshot
of that view.
The general process of darc can be described as follows:
- process(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler() with multiprocessing support.
- crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.
  If the URL is from a brand new host, darc will first try to fetch and save robots.txt and the sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.
  If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.
  Note
  The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.
  At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information using save_headers().
  Note
  If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(), and further processing is dropped.
  If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. If the submission API is provided, submit_requests() will be called to submit the document just fetched.
  If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).
  If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). Otherwise, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
- process(): in the meantime, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().
- loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.
  At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.Chrome object.
  If successful, the rendered source HTML document will be saved using save_html(), and a full-page screenshot will be taken and saved.
  If the submission API is provided, submit_selenium() will be called to submit the document just loaded.
  Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
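The robots.txt rule in the crawler() step can be illustrated with the standard library alone. The following is a minimal sketch, not darc's own code: allowed_by_robots() is a hypothetical helper, and using urllib.robotparser with the "*" user agent is an assumption made for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, robots_txt: str, force: bool = False,
                      user_agent: str = "*") -> bool:
    """Return True if the URL may be crawled under the rules described above.

    Hypothetical helper; darc's actual check may be implemented differently.
    """
    # FORCE disables the robots.txt check entirely.
    if force:
        return True
    # The root path is always crawled, regardless of robots.txt.
    if urlsplit(url).path in ("", "/"):
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots("https://www.example.com/", rules))
print(allowed_by_robots("https://www.example.com/private/page", rules))
print(allowed_by_robots("https://www.example.com/private/page", rules, force=True))
```

Only the second call is rejected: the first targets the root path and the third sets force.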
Installation
Note
darc supports Python 3.6 and all later versions.
Currently, it is only supported and tested on Linux (Ubuntu 18.04)
and macOS (Catalina).
When installing in Python versions below 3.8, darc will
use walrus to compile itself for backport compatibility.
pip install darc
Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system.
Important
Starting from version 0.3.0, we introduced Redis for the task
queue database backend. Please make sure you have it installed, configured,
and running when using the darc project.
However, the darc project is shipped with Docker and Compose support.
Please see Docker Integration for more information.
Or, you may refer to and/or install from the Docker Hub repository:
docker pull jsnbzh/darc[:TAGNAME]
Usage
The darc project provides a simple CLI:
usage: darc [-h] [-f FILE] ...

the darkweb crawling swiss army knife

positional arguments:
  link                  links to crawl

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  read links from file
It can also be called through the module entry point:
python -m darc ...
Note
The link files can contain comment lines, which should start with #.
Empty lines and comment lines will be ignored when loading.
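The loading rule for link files can be sketched as follows. read_links() is a hypothetical helper written for illustration; stripping surrounding whitespace before testing for # is an assumption, as darc's own loader may treat indented comments differently.

```python
from typing import Iterable, List


def read_links(lines: Iterable[str]) -> List[str]:
    """Collect links from a link file, skipping empty and comment lines."""
    links = []
    for line in lines:
        text = line.strip()
        # Skip blank lines and comment lines that start with '#'.
        if not text or text.startswith("#"):
            continue
        links.append(text)
    return links


sample = """\
# seed links
https://www.example.com/

http://example.onion/
"""
print(read_links(sample.splitlines()))
```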
Configuration
Although the CLI is simple, the darc project can be further configured through
environment variables.
General Configurations
- DARC_REBOOT
  Type: bool (int)
  Default: 0
  Whether to exit the program after the first round, i.e. once all links from the requests link database have been crawled and all links from the selenium link database have been loaded.
  This can be useful, especially when capacity is limited and you wish to free some space before continuing with the next round. See Docker Integration for more information.
- DARC_DEBUG
  Type: bool (int)
  Default: 0
  Whether to run the program in debugging mode.
- DARC_VERBOSE
  Type: bool (int)
  Default: 0
  Whether to run the program in verbose mode. If DARC_DEBUG is True, verbose mode is always enabled.
- DARC_CHECK
  Type: bool (int)
  Default: 0
  Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
  If DARC_CHECK_CONTENT_TYPE is True, this environment variable is always set to True.
- DARC_CHECK_CONTENT_TYPE
  Type: bool (int)
  Default: 0
  Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
- DARC_CPU
  Type: int
  Default: None
  Number of concurrent processes. If not provided, the number of system CPUs is used.
- DARC_MULTIPROCESSING
  Type: bool (int)
  Default: 1
  Whether to enable multiprocessing support.
- DARC_MULTITHREADING
  Type: bool (int)
  Default: 0
  Whether to enable multithreading support.
Note
DARC_MULTIPROCESSING and DARC_MULTITHREADING cannot
be enabled at the same time.
- DARC_USER
  Type: str
  Default: current login user (c.f. getpass.getuser())
  Non-root user for proxies.
- DARC_MAX_POOL
  Type: int
  Default: 1_000
  Maximum number of links loaded from the database.
  Note
  If set to infinity (inf), no limit is applied.
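How these variables interact can be sketched as below. This is an illustrative reading of the descriptions above, not darc's implementation: env_bool() and load_general() are hypothetical names, and raising an error when both concurrency flags are set is an assumption, since the text only says they cannot be enabled together.

```python
import os
from typing import Any, Dict, Mapping


def env_bool(env: Mapping[str, str], name: str, default: str = "0") -> bool:
    """Read a bool(int) variable: 0 is False, any other integer is True."""
    return bool(int(env.get(name, default)))


def load_general(env: Mapping[str, str]) -> Dict[str, Any]:
    """Resolve the general options from a mapping such as os.environ."""
    debug = env_bool(env, "DARC_DEBUG")
    check_content_type = env_bool(env, "DARC_CHECK_CONTENT_TYPE")
    multiprocessing = env_bool(env, "DARC_MULTIPROCESSING", "1")
    multithreading = env_bool(env, "DARC_MULTITHREADING")
    if multiprocessing and multithreading:
        # Assumption: reject the conflict; darc may resolve it differently.
        raise ValueError("enable only one of DARC_MULTIPROCESSING and DARC_MULTITHREADING")
    cpu = env.get("DARC_CPU")
    return {
        "debug": debug,
        # Debug mode always implies verbose mode.
        "verbose": debug or env_bool(env, "DARC_VERBOSE"),
        # Content type checking always implies proxy and hostname checking.
        "check": check_content_type or env_bool(env, "DARC_CHECK"),
        "check_content_type": check_content_type,
        "cpu": int(cpu) if cpu is not None else os.cpu_count(),
        "multiprocessing": multiprocessing,
        "multithreading": multithreading,
        # float() accepts both "1000" and "inf"; inf means no limit.
        "max_pool": float(env.get("DARC_MAX_POOL", "1000")),
    }


settings = load_general({"DARC_DEBUG": "1", "DARC_MAX_POOL": "inf"})
print(settings["verbose"], settings["max_pool"])
```

Passing a plain mapping instead of reading os.environ directly keeps the sketch easy to try with different values; in practice load_general(os.environ) would be the typical call.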
Web Crawlers
- DARC_WAIT
  Type: float
  Default: 60
  Time interval between each round when the requests and/or selenium databases are empty.
- DARC_SAVE
  Type: bool (int)
  Default: 0
  Whether to save processed links back to the database.
  Note
  If DARC_SAVE is True, then DARC_SAVE_REQUESTS and DARC_SAVE_SELENIUM are forced to be True.
  See also
  See darc.db for more information about the link database.
- DARC_SAVE_REQUESTS
  Type: bool (int)
  Default: 0
  Whether to save links crawled by crawler() back to the requests database.
  See also
  See darc.db for more information about the link database.
- DARC_SAVE_SELENIUM
  Type: bool (int)
  Default: 0
  Whether to save links loaded by loader() back to the selenium database.
  See also
  See darc.db for more information about the link database.
- TIME_CACHE
  Type: float
  Default: 60
  Time delta for caches, in seconds.
  The darc project supports caching for fetched files. TIME_CACHE specifies for how long fetched files are cached and NOT fetched again.
  Note
  If TIME_CACHE is None, caches never expire.
- SE_WAIT
  Type: float
  Default: 60
  Time to wait for selenium to finish loading pages.
  Note
  Internally, selenium waits for the browser to finish loading the page before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.
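Two of the rules above, DARC_SAVE forcing the other two save flags and TIME_CACHE controlling cache expiry, can be sketched as follows. Both helpers are hypothetical and only mirror the descriptions; treating a file as expired exactly at the TIME_CACHE boundary is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def resolve_save_flags(save: bool, save_requests: bool,
                       save_selenium: bool) -> Tuple[bool, bool]:
    """Return the effective (DARC_SAVE_REQUESTS, DARC_SAVE_SELENIUM) pair."""
    # DARC_SAVE forces both specific flags to True.
    return (save or save_requests, save or save_selenium)


def is_cached(fetched_at: datetime, now: datetime,
              time_cache: Optional[float] = 60.0) -> bool:
    """Return True if a file fetched at fetched_at should NOT be fetched again."""
    # None means the cache never expires.
    if time_cache is None:
        return True
    return now - fetched_at < timedelta(seconds=time_cache)


fetched = datetime(2020, 1, 1, 12, 0, 0)
print(is_cached(fetched, fetched + timedelta(seconds=30)))
print(is_cached(fetched, fetched + timedelta(seconds=90)))
print(resolve_save_flags(True, False, False))
```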
White / Black Lists
- LINK_WHITE_LIST
  Type: List[str] (JSON)
  Default: []
  White list of hostnames that should be crawled.
  Note
  Regular expressions are supported.
- LINK_BLACK_LIST
  Type: List[str] (JSON)
  Default: []
  Black list of hostnames that should NOT be crawled.
  Note
  Regular expressions are supported.
- LINK_FALLBACK
  Type: bool (int)
  Default: 0
  Fallback value for match_host().
- MIME_WHITE_LIST
  Type: List[str] (JSON)
  Default: []
  White list of content types that should be crawled.
  Note
  Regular expressions are supported.
- MIME_BLACK_LIST
  Type: List[str] (JSON)
  Default: []
  Black list of content types that should NOT be crawled.
  Note
  Regular expressions are supported.
- MIME_FALLBACK
  Type: bool (int)
  Default: 0
  Fallback value for match_mime().
- PROXY_WHITE_LIST
  Type: List[str] (JSON)
  Default: []
  White list of proxy types that should be crawled.
  Note
  The proxy types are case insensitive.
- PROXY_BLACK_LIST
  Type: List[str] (JSON)
  Default: []
  Black list of proxy types that should NOT be crawled.
  Note
  The proxy types are case insensitive.
- PROXY_FALLBACK
  Type: bool (int)
  Default: 0
  Fallback value for match_proxy().
Note
If provided,
LINK_WHITE_LIST, LINK_BLACK_LIST,
MIME_WHITE_LIST, MIME_BLACK_LIST,
PROXY_WHITE_LIST and PROXY_BLACK_LIST
should all be JSON encoded strings.
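The following sketch shows one plausible way a white list, a black list and a fallback could combine. It rests on assumptions this section does not confirm: that the match functions report whether an entry is excluded, that the white list is consulted first, and that patterns must match the whole value, ignoring case. load_list() and is_excluded() are hypothetical names, not darc's match_host(), match_mime() or match_proxy().

```python
import json
import re
from typing import List


def load_list(raw: str) -> List[str]:
    """Decode a JSON encoded list; an empty value gives []."""
    return json.loads(raw) if raw else []


def is_excluded(value: str, white_list: List[str], black_list: List[str],
                fallback: bool = False) -> bool:
    """Return True if value should be skipped (assumed convention)."""
    # A white list match always allows the value.
    if any(re.fullmatch(p, value, re.IGNORECASE) for p in white_list):
        return False
    # Otherwise a black list match excludes it.
    if any(re.fullmatch(p, value, re.IGNORECASE) for p in black_list):
        return True
    # Neither list matched: use the configured fallback.
    return fallback


# Backslashes in a regular expression must be doubled inside JSON strings.
white = load_list(r'[".*\\.onion"]')
black = load_list(r'["ads\\..*"]')
for host in ("example.onion", "ads.example.com", "www.example.com"):
    print(host, is_excluded(host, white, black))
```

With plain strings such as "tor" as patterns, the same function behaves as a case-insensitive exact match, which fits the proxy type lists.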
Data Submission
- API_RETRY
  Type: int
  Default: 3
  Number of retries for API submission on failure.
- API_NEW_HOST
  Type: str
  Default: None
  API URL for submit_new_host().
- API_REQUESTS
  Type: str
  Default: None
  API URL for submit_requests().
- API_SELENIUM
  Type: str
  Default: None
  API URL for submit_selenium().
Note
If any of API_NEW_HOST, API_REQUESTS
and API_SELENIUM is None, the corresponding
submit function will save the JSON data in the path
specified by PATH_DATA.
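The retry and local-save behaviour can be sketched as below. This is a hedged illustration rather than darc's submit code: submit() and its post callable are hypothetical, API_RETRY is read as the total number of attempts, and saving locally after every attempt has failed is an assumption (the text only covers the case where the API URL is None).

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def submit(data: dict, api_url: Optional[str], post: Callable[[str, dict], None],
           data_dir: Path, name: str, retries: int = 3) -> Optional[Path]:
    """Submit data to api_url; return the local path if it was saved instead."""
    if api_url is not None:
        for _ in range(retries):
            try:
                post(api_url, data)
                return None  # submitted successfully
            except OSError:
                continue  # network-style failure: try again
    # No API configured (or, by assumption, every attempt failed): save locally.
    path = data_dir / f"{name}.json"
    path.write_text(json.dumps(data))
    return path


def failing_post(url: str, payload: dict) -> None:
    """Stand-in for an HTTP POST that always fails."""
    raise ConnectionError("API unreachable")


with tempfile.TemporaryDirectory() as tmp:
    saved = submit({"url": "https://www.example.com/"}, None, failing_post,
                   Path(tmp), "example")
    print(saved.name, json.loads(saved.read_text()))
```

The post callable is injected so the sketch runs without network access; a real caller would wrap its HTTP client of choice.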
Tor Proxy Configuration
- TOR_PORT
  Type: int
  Default: 9050
  Port for Tor proxy connection.
- TOR_CTRL
  Type: int
  Default: 9051
  Port for Tor controller connection.
- TOR_PASS
  Type: str
  Default: None
  Tor controller authentication token.
  Note
  If not provided, it will be requested at runtime.
- TOR_RETRY
  Type: int
  Default: 3
  Number of retries for Tor bootstrap on failure.
- TOR_WAIT
  Type: float
  Default: 90
  Time after which the attempt to start Tor is aborted.
  Note
  If not provided, there will be NO timeouts.
- TOR_CFG
  Type: Dict[str, Any] (JSON)
  Default: {}
  Tor bootstrap configuration for stem.process.launch_tor_with_config().
  Note
  If provided, it should be a JSON encoded string.
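As a rough illustration of how these values might be consumed, the sketch below decodes TOR_CFG and builds a proxy mapping in the format accepted by requests. tor_settings() is a hypothetical helper; the localhost address and the socks5h scheme (SOCKS5 with DNS resolved through the proxy) are assumptions, not values stated in this section.

```python
import json
from typing import Any, Dict, Mapping


def tor_settings(env: Mapping[str, str]) -> Dict[str, Any]:
    """Resolve Tor options from a mapping such as os.environ."""
    port = int(env.get("TOR_PORT", "9050"))
    # Assumption: Tor listens on localhost and is used as a SOCKS5 proxy.
    proxy = f"socks5h://localhost:{port}"
    return {
        "proxies": {"http": proxy, "https": proxy},
        "control_port": int(env.get("TOR_CTRL", "9051")),
        "retry": int(env.get("TOR_RETRY", "3")),
        # TOR_CFG is a JSON encoded mapping of Tor configuration options.
        "config": json.loads(env.get("TOR_CFG", "{}")),
    }


settings = tor_settings({"TOR_PORT": "9150", "TOR_CFG": '{"ControlPort": "9151"}'})
print(settings["proxies"]["http"], settings["config"])
```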
I2P Proxy Configuration
- I2P_PORT
  Type: int
  Default: 4444
  Port for I2P proxy connection.
- I2P_RETRY
  Type: int
  Default: 3
  Number of retries for I2P bootstrap on failure.
- I2P_WAIT
  Type: float
  Default: 90
  Time after which the attempt to start I2P is aborted.
  Note
  If not provided, there will be NO timeouts.
- I2P_ARGS
  Type: str (Shell)
  Default: ''
  I2P bootstrap arguments for i2prouter start.
  If provided, it will be parsed as command line arguments (c.f. shlex.split).
  Note
  The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
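A minimal sketch of how I2P_ARGS might be turned into a command line is shown below. i2p_command() is a hypothetical helper, and switching user through su is an assumption; the text only states that the command runs as DARC_USER when the current user is root. ZERONET_ARGS and FREENET_ARGS are documented with the same shlex.split convention.

```python
import shlex
from typing import List


def i2p_command(args: str = "", user: str = "darc",
                current_user: str = "") -> List[str]:
    """Build the I2P bootstrap command from I2P_ARGS."""
    # shlex.split honours shell quoting, so quoted values stay together.
    command = ["i2prouter", "start", *shlex.split(args)]
    if current_user == "root":
        # Assumption: drop privileges with su; darc may use another mechanism.
        quoted = " ".join(shlex.quote(part) for part in command)
        return ["su", "-", user, "-c", quoted]
    return command


print(i2p_command("--name 'my router'", current_user="alice"))
print(i2p_command("", user="darc", current_user="root"))
```

In real use, current_user would come from getpass.getuser() and user from DARC_USER.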
ZeroNet Proxy Configuration
- ZERONET_PORT
  Type: int
  Default: 4444
  Port for ZeroNet proxy connection.
- ZERONET_RETRY
  Type: int
  Default: 3
  Number of retries for ZeroNet bootstrap on failure.
- ZERONET_WAIT
  Type: float
  Default: 90
  Time after which the attempt to start ZeroNet is aborted.
  Note
  If not provided, there will be NO timeouts.
- ZERONET_PATH
  Type: str (path)
  Default: /usr/local/src/zeronet
  Path to the ZeroNet project.
- ZERONET_ARGS
  Type: str (Shell)
  Default: ''
  ZeroNet bootstrap arguments for ZeroNet.sh main.
  Note
  If provided, it will be parsed as command line arguments (c.f. shlex.split).
Freenet Proxy Configuration
- FREENET_PORT
  Type: int
  Default: 8888
  Port for Freenet proxy connection.
- FREENET_RETRY
  Type: int
  Default: 3
  Number of retries for Freenet bootstrap on failure.
- FREENET_WAIT
  Type: float
  Default: 90
  Time after which the attempt to start Freenet is aborted.
  Note
  If not provided, there will be NO timeouts.
- FREENET_PATH
  Type: str (path)
  Default: /usr/local/src/freenet
  Path to the Freenet project.
- FREENET_ARGS
  Type: str (Shell)
  Default: ''
  Freenet bootstrap arguments for run.sh start.
  If provided, it will be parsed as command line arguments (c.f. shlex.split).
  Note
  The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.