darc - Darkweb Crawler Project

darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies, header fields, etc. It also bundles selenium to provide a fully rendered web page and a screenshot of such view.
The general process of darc can be described as follows:

1. process(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler() with multiprocessing support.

2. crawler(): parse the URL using parse_link(), and check whether the URL should be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

   If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

   If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

   Note: The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.

   At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information using save_headers().

   Note: If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing is dropped.

   If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. And if the submission API is provided, submit_requests() will be called to submit the document just fetched.

   If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).

   If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). If NOT, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).

3. process(): after the obtained URLs have all been crawled, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().

4. loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

   At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.Chrome object.

   If successful, the rendered source HTML document will be saved using save_html(), and a full-page screenshot will be taken and saved.

   If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

   Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
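The two-phase round described above can be sketched in a few lines of Python. The queue and function names below mirror the documentation but are simplified stand-ins, not darc's actual implementation (which adds multiprocessing, databases and per-site hooks):

```python
# A minimal, self-contained sketch of one darc round: drain the
# requests queue with crawler(), then the selenium queue with loader().
# These are illustrative stand-ins, not darc's real functions.

requests_queue = ["http://example.com/"]   # links pending a requests crawl
selenium_queue = []                        # links pending a selenium load

def crawler(url):
    """Pretend to fetch *url* with requests, then queue it for selenium."""
    # ... fetch, save headers/documents, extract sitemap links ...
    selenium_queue.append(url)

def loader(url):
    """Pretend to render *url* with selenium and take a screenshot."""
    # ... render page, save HTML and screenshot, extract links ...
    return url

def process():
    """One round: crawl every queued URL, then load every rendered URL."""
    while requests_queue:
        crawler(requests_queue.pop())
    loaded = []
    while selenium_queue:
        loaded.append(loader(selenium_queue.pop()))
    return loaded
```

In the real project the two phases are driven by the same process() entry point, with the extracted links feeding the next round.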
Installation

Note: darc supports all Python versions 3.6 and above. Currently, it is only supported and tested on Linux (Ubuntu 18.04) and macOS (Catalina).

When installing on Python versions below 3.8, darc will use walrus to compile itself for backport compatibility.

pip install darc

Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system. Alternatively, the darc project is shipped with Docker and Compose support. Please see the project root for relevant files and more information.
Usage

The darc project provides a simple CLI:

usage: darc [-h] [-f FILE] ...

the darkweb crawling swiss army knife

positional arguments:
  link                  links to crawl

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  read links from file

It can also be called through the module entrypoint:

python -m darc ...

Note: The link files can contain comment lines, which should start with #. Empty lines and comment lines will be ignored when loading.
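The link-file loading rule above is simple enough to sketch directly; the helper below is a hypothetical illustration, not darc's actual loader:

```python
def read_links(text):
    """Return the links in a link file, skipping blanks and comments."""
    links = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # empty lines and '#' comment lines are ignored
        links.append(line)
    return links
```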
Configuration

Though the CLI is simple, the darc project is further configurable through environment variables.
General Configurations

- DARC_REBOOT: bool (int)
  Whether to exit the program after the first round, i.e. after all links from the requests link database have been crawled and all links from the selenium link database have been loaded.
  Default: 0

- DARC_DEBUG: bool (int)
  Whether to run the program in debugging mode.
  Default: 0

- DARC_VERBOSE: bool (int)
  Whether to run the program in verbose mode. If DARC_DEBUG is True, then verbose mode will always be enabled.
  Default: 0

- DARC_CHECK: bool (int)
  Whether to check proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
  If DARC_CHECK_CONTENT_TYPE is True, then this environment variable will always be set as True.
  Default: 0

- DARC_CHECK_CONTENT_TYPE: bool (int)
  Whether to check content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
  Default: 0

- DARC_CPU: int
  Number of concurrent processes. If not provided, the number of system CPUs will be used.
  Default: None

- DARC_MULTIPROCESSING: bool (int)
  Whether to enable multiprocessing support.
  Default: 1

- DARC_MULTITHREADING: bool (int)
  Whether to enable multithreading support.
  Default: 0

Note: DARC_MULTIPROCESSING and DARC_MULTITHREADING can NOT be toggled at the same time.

- DARC_USER: str
  Non-root user for proxies.
  Default: current login user (c.f. getpass.getuser())
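Since these are plain environment variables, they are typically exported before launching darc; the values below are purely illustrative:

```shell
# Illustrative values only; darc reads these from its environment at startup.
export DARC_DEBUG=1            # debugging mode on (implies verbose)
export DARC_CPU=4              # four concurrent processes
export DARC_MULTIPROCESSING=1  # multiprocessing on...
export DARC_MULTITHREADING=0   # ...so multithreading must stay off
```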
Data Storage

Web Crawlers

- TIME_CACHE: float
  Time delta for caches in seconds.
  The darc project supports caching of fetched files. TIME_CACHE specifies for how long the fetched files will be cached and NOT fetched again.
  Note: If TIME_CACHE is None, then caching will be marked as forever.
  Default: 60

- SE_WAIT: float
  Time to wait for selenium to finish loading pages.
  Note: Internally, selenium will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time running after the event.
  Default: 60
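The TIME_CACHE semantics (a freshness window in seconds, with None meaning cache forever) can be sketched as follows; this is a hypothetical helper, not darc's internal cache check:

```python
import os
import time

# TIME_CACHE arrives through the environment; 60 s is the documented default.
TIME_CACHE = float(os.getenv('TIME_CACHE', '60'))

def is_cached(fetched_at, now=None, delta=TIME_CACHE):
    """Return True if a file fetched at *fetched_at* need not be refetched."""
    if delta is None:
        return True  # cache marked as forever
    now = time.time() if now is None else now
    return (now - fetched_at) < delta
```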
White / Black Lists

- LINK_WHITE_LIST: List[str] (json)
  White list of hostnames that should be crawled.
  Default: []
  Note: Regular expressions are supported.

- LINK_BLACK_LIST: List[str] (json)
  Black list of hostnames that should NOT be crawled.
  Default: []
  Note: Regular expressions are supported.

- LINK_FALLBACK: bool (int)
  Fallback value for match_host().

- MIME_WHITE_LIST: List[str] (json)
  White list of content types that should be crawled.
  Default: []
  Note: Regular expressions are supported.

- MIME_BLACK_LIST: List[str] (json)
  Black list of content types that should NOT be crawled.
  Default: []
  Note: Regular expressions are supported.

- MIME_FALLBACK: bool (int)
  Fallback value for match_mime().

- PROXY_WHITE_LIST: List[str] (json)
  White list of proxy types that should be crawled.
  Default: []
  Note: The proxy types are case insensitive.

- PROXY_BLACK_LIST: List[str] (json)
  Black list of proxy types that should NOT be crawled.
  Default: []
  Note: The proxy types are case insensitive.

- PROXY_FALLBACK: bool (int)
  Fallback value for match_proxy().

Note: If provided, LINK_WHITE_LIST, LINK_BLACK_LIST, MIME_WHITE_LIST, MIME_BLACK_LIST, PROXY_WHITE_LIST and PROXY_BLACK_LIST should all be JSON encoded strings.
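The fallback values make sense once the matching logic is spelled out. The helper below is a hypothetical sketch of white/black list matching with a fallback (here the white list takes precedence; darc's actual match_host()/match_mime()/match_proxy() may order checks differently):

```python
import json
import os
import re

def match(value, white_list, black_list, fallback):
    """Decide whether *value* should be processed.

    A white-list hit accepts, a black-list hit rejects, and if
    neither list matches the *fallback* value decides.
    """
    if any(re.search(pattern, value) for pattern in white_list):
        return True
    if any(re.search(pattern, value) for pattern in black_list):
        return False
    return fallback

# The lists arrive as JSON encoded environment variables, e.g.:
white = json.loads(os.getenv('LINK_WHITE_LIST', '[]'))
black = json.loads(os.getenv('LINK_BLACK_LIST', '[]'))
```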
Data Submission

- API_RETRY: int
  Retry times for API submission upon failure.
  Default: 3

- API_NEW_HOST: str
  API URL for submit_new_host().
  Default: None

- API_REQUESTS: str
  API URL for submit_requests().
  Default: None

- API_SELENIUM: str
  API URL for submit_selenium().
  Default: None

Note: If API_NEW_HOST, API_REQUESTS and API_SELENIUM are None, the corresponding submit function will save the JSON data in the path specified by PATH_DATA.
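The submit-or-save fallback described in the note can be sketched as below; the function and file names are hypothetical stand-ins, not darc's actual submitters:

```python
import json
import os

def submit(data, api_url, path_data='data'):
    """Submit *data* to *api_url*, or save it under *path_data* if unset."""
    if api_url is None:
        # No API configured: fall back to dumping the JSON data on disk.
        os.makedirs(path_data, exist_ok=True)
        dest = os.path.join(path_data, 'dump.json')
        with open(dest, 'w') as file:
            json.dump(data, file)
        return dest
    # ... otherwise perform the HTTP POST, retrying up to API_RETRY times ...
```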
Tor Proxy Configuration

- TOR_PORT: int
  Port for Tor proxy connection.
  Default: 9050

- TOR_CTRL: int
  Port for Tor controller connection.
  Default: 9051

- TOR_PASS: str
  Tor controller authentication token.
  Default: None
  Note: If not provided, it will be requested at runtime.

- TOR_RETRY: int
  Retry times for Tor bootstrap upon failure.
  Default: 3

- TOR_WAIT: float
  Time after which the attempt to start Tor is aborted.
  Default: 90
  Note: If not provided, there will be NO timeouts.

- TOR_CFG: Dict[str, Any] (json)
  Tor bootstrap configuration for stem.process.launch_tor_with_config().
  Default: {}
  Note: If provided, it should be a JSON encoded string.
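Putting the Tor settings together, the environment might look like this (illustrative values; the TOR_CFG keys shown are standard torrc options passed through to stem):

```shell
export TOR_PORT=9050
export TOR_CTRL=9051
export TOR_RETRY=3
export TOR_WAIT=90
# JSON encoded string handed to stem.process.launch_tor_with_config()
export TOR_CFG='{"SocksPort": "9050", "ControlPort": "9051"}'
```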
I2P Proxy Configuration

- I2P_PORT: int
  Port for I2P proxy connection.
  Default: 4444

- I2P_RETRY: int
  Retry times for I2P bootstrap upon failure.
  Default: 3

- I2P_WAIT: float
  Time after which the attempt to start I2P is aborted.
  Default: 90
  Note: If not provided, there will be NO timeouts.

- I2P_ARGS: str (shell)
  I2P bootstrap arguments for i2prouter start.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  Default: ''
  Note: The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
ZeroNet Proxy Configuration

- ZERONET_PORT: int
  Port for ZeroNet proxy connection.
  Default: 4444

- ZERONET_RETRY: int
  Retry times for ZeroNet bootstrap upon failure.
  Default: 3

- ZERONET_WAIT: float
  Time after which the attempt to start ZeroNet is aborted.
  Default: 90
  Note: If not provided, there will be NO timeouts.

- ZERONET_PATH: str (path)
  Path to the ZeroNet project.
  Default: /usr/local/src/zeronet

- ZERONET_ARGS: str (shell)
  ZeroNet bootstrap arguments for ZeroNet.sh main.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  Default: ''
Freenet Proxy Configuration

- FREENET_PORT: int
  Port for Freenet proxy connection.
  Default: 8888

- FREENET_RETRY: int
  Retry times for Freenet bootstrap upon failure.
  Default: 3

- FREENET_WAIT: float
  Time after which the attempt to start Freenet is aborted.
  Default: 90
  Note: If not provided, there will be NO timeouts.

- FREENET_PATH: str (path)
  Path to the Freenet project.
  Default: /usr/local/src/freenet

- FREENET_ARGS: str (shell)
  Freenet bootstrap arguments for run.sh start.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  Default: ''
  Note: The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
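The I2P_ARGS, ZERONET_ARGS and FREENET_ARGS variables above are all parsed as command lines. How shlex.split() handles them is standard library behaviour, illustrated here with made-up flags:

```python
import shlex

# Quoted arguments survive splitting as single tokens, so a value like
# "--max-mem '512 MB' --verbose" yields exactly three arguments.
args = shlex.split("--max-mem '512 MB' --verbose")
```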