`darc` - Darkweb Crawler Project¶

darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies and header fields. It also bundles selenium to produce a fully rendered web page and a screenshot of that rendered view.

The general process of darc can be described as follows:

process(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler() with multiprocessing support.
crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

Note

The root path (e.g. / in https://www.example.com/) will always be crawled, regardless of robots.txt.
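The robots.txt check described above can be sketched with the standard library. This is a minimal illustration, not darc's own implementation: the helper name may_crawl() and its parameters are hypothetical, and urllib.robotparser stands in for whatever parser darc uses internally.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def may_crawl(url, robots_txt, force=False, user_agent="*"):
    """Hypothetical helper: decide whether `url` may be crawled."""
    # When FORCE is set, robots.txt rules are ignored entirely.
    if force:
        return True
    # The root path is always crawled, whatever robots.txt says.
    if urlparse(url).path in ("", "/"):
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS = "User-agent: *\nDisallow: /private/\n"
print(may_crawl("https://www.example.com/private/page", ROBOTS))
print(may_crawl("https://www.example.com/private/page", ROBOTS, force=True))
```

The first call is rejected by the Disallow rule, while the second is accepted because force bypasses the check.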

At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information, using save_headers().

Note

If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing is dropped.

If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. If the submission API is provided, submit_requests() will be called to submit the document just fetched.

If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).
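As a rough illustration of link extraction, the sketch below collects the href targets of anchor tags and resolves them against the page URL. It uses only the standard library; the names LinkCollector, collect_links() and is_html() are hypothetical, and darc's extract_links() may use a different parser and consider more elements.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Content types treated as HTML, as listed in the text above.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type):
    """Compare the media type only, ignoring parameters such as charset."""
    return content_type.split(";")[0].strip().lower() in HTML_TYPES


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def collect_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```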

If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). Otherwise, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
process(): in the meantime, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().

Note

If FLAG_MP is True, the function will be called with multiprocessing support; if FLAG_TH is True, the function will be called with multithreading support; if neither is set, the function will be called in a single thread.
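The choice between the three execution modes can be sketched as follows. This is an assumption-laden illustration: run_all() and its parameters are hypothetical, and concurrent.futures is used here for brevity, which may differ from the pooling mechanism darc actually uses.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def run_all(func, links, flag_mp=False, flag_th=False, workers=2):
    """Hypothetical dispatcher: apply `func` to every link in the chosen mode."""
    if flag_mp and flag_th:
        # The two modes are mutually exclusive.
        raise ValueError("multiprocessing and multithreading cannot both be enabled")
    if flag_mp:
        # `func` must be picklable (e.g. a module-level function) in this mode.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(func, links))
    if flag_th:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(func, links))
    # Neither flag set: plain sequential execution in the current thread.
    return [func(link) for link in links]
```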
loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.Chrome object.

If successful, the rendered source HTML document will be saved using save_html(), and a full-page screenshot will be taken and saved.

If the submission API is provided, submit_selenium() will be called and submit the document just loaded.

Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).

Installation¶

Note

darc supports Python 3.6 and above. Currently, it is only supported and tested on Linux (Ubuntu 18.04) and macOS (Catalina).

When installed on Python versions below 3.8, darc will use walrus to compile itself for backport compatibility.

pip install darc

Please make sure you have Google Chrome and the corresponding version of Chrome Driver installed on your system.

Important

Starting from version 0.3.0, Redis is used as the task queue database backend. Please make sure you have it installed, configured, and running when using the darc project.

However, the darc project is shipped with Docker and Compose support. Please see Docker Integration for more information.

Or, you may refer to and/or install from the Docker Hub repository:

docker pull jsnbzh/darc[:TAGNAME]

Usage¶

The darc project provides a simple CLI:

usage: darc [-h] [-f FILE] ...

the darkweb crawling swiss army knife

positional arguments:
  link                  links to craw

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  read links from file

It can also be called through module entrypoint:

python -m darc ...

Note

The link files can contain comment lines, which should start with #. Empty lines and comment lines will be ignored when loading.
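The rule for link files is simple enough to sketch directly. The function name read_links() is hypothetical; it only illustrates the filtering described in the note, not darc's actual loader.

```python
def read_links(text):
    """Return the links in `text`, skipping blank lines and # comments."""
    links = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        links.append(line)
    return links


SAMPLE = """\
# seed links
https://www.example.com/

http://example.onion/
"""
print(read_links(SAMPLE))
```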

Configuration¶

Although the CLI is simple, the darc project can be configured further through environment variables.

General Configurations¶

DARC_REBOOT¶

Type: bool (int)
Default: 0

Whether to exit the program after the first round, i.e. after all links from the requests link database have been crawled and all links from the selenium link database have been loaded.

This can be useful especially when storage capacity is limited and you wish to free some space before continuing with the next round. See Docker Integration for more information.
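Options typed as bool (int) are given as integers in the environment, with 0 meaning disabled. A minimal sketch of reading such a flag is shown below; env_flag() is a hypothetical helper, and how darc treats non-integer values is not specified here (this sketch simply lets int() raise ValueError).

```python
import os


def env_flag(name, default=False, environ=os.environ):
    """Hypothetical helper: read a `bool (int)` style environment variable."""
    raw = environ.get(name)
    if raw is None:
        return default
    # "0" -> False, any other integer -> True.
    return bool(int(raw))


print(env_flag("DARC_REBOOT", environ={"DARC_REBOOT": "1"}))
```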

DARC_DEBUG¶

Type: bool (int)
Default: 0

Whether to run the program in debugging mode.

DARC_VERBOSE¶

Type: bool (int)
Default: 0

Whether to run the program in verbose mode. If DARC_DEBUG is True, verbose mode will always be enabled.

DARC_FORCE¶

Type: bool (int)
Default: 0

Whether to ignore robots.txt rules when crawling (c.f. crawler()).

DARC_CHECK¶

Type: bool (int)
Default: 0

Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

If DARC_CHECK_CONTENT_TYPE is True, then this environment variable will always be set to True.

DARC_CHECK_CONTENT_TYPE¶

Type: bool (int)
Default: 0

Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

DARC_CPU¶

Type: int
Default: None

Number of concurrent processes. If not provided, then the number of system CPUs will be used.

DARC_MULTIPROCESSING¶

Type: bool (int)
Default: 1

Whether to enable multiprocessing support.

DARC_MULTITHREADING¶

Type: bool (int)
Default: 0

Whether to enable multithreading support.

Note

DARC_MULTIPROCESSING and DARC_MULTITHREADING can NOT be enabled at the same time.

DARC_USER¶

Type: str
Default: current login user (c.f. getpass.getuser())

Non-root user for proxies.

DARC_MAX_POOL¶

Type: int
Default: 1_000

Maximum number of links loaded from the database.

Note

If set to infinity (inf), no limit will be applied.
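One way to accept both a finite limit and inf for DARC_MAX_POOL is to parse the value as a float first. The helpers parse_max_pool() and take() below are hypothetical and only sketch the idea.

```python
import math


def parse_max_pool(raw="1_000"):
    """Hypothetical parser: return an int limit, or math.inf for no limit."""
    value = float(raw)  # accepts "1_000", "250" and "inf"
    if math.isinf(value):
        return math.inf
    return int(value)


def take(links, limit):
    """Return at most `limit` links; math.inf means all of them."""
    if limit == math.inf:
        return list(links)
    return list(links)[:limit]


print(parse_max_pool())
print(take(["a", "b", "c"], parse_max_pool("2")))
```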

See also

Data Storage¶

REDIS_URL¶

Type: str (url)
Default: redis://127.0.0.1

URL to the Redis database.

PATH_DATA¶

Type: str (path)
Default: data

Path to data storage.

See also

See darc.save for more information about source saving.

Web Crawlers¶

DARC_WAIT¶

Type: float
Default: 60

Time interval between each round when the requests and/or selenium database are empty.

DARC_SAVE¶

Type: bool (int)
Default: 0

Whether to save processed links back to the database.

Note

If DARC_SAVE is True, then DARC_SAVE_REQUESTS and DARC_SAVE_SELENIUM will be forced to be True.

See also

See darc.db for more information about link database.
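The way DARC_SAVE overrides the two more specific flags can be sketched as below. resolve_save_flags() is a hypothetical function operating on a plain mapping; it illustrates the forcing rule from the note rather than darc's actual start-up code.

```python
def resolve_save_flags(environ):
    """Hypothetical resolver for the three DARC_SAVE* flags."""

    def flag(name):
        return bool(int(environ.get(name, "0")))

    save = flag("DARC_SAVE")
    return {
        "DARC_SAVE": save,
        # DARC_SAVE forces both specific flags to True.
        "DARC_SAVE_REQUESTS": save or flag("DARC_SAVE_REQUESTS"),
        "DARC_SAVE_SELENIUM": save or flag("DARC_SAVE_SELENIUM"),
    }


print(resolve_save_flags({"DARC_SAVE": "1"}))
```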

DARC_SAVE_REQUESTS¶

Type: bool (int)
Default: 0

Whether to save links crawled by crawler() back to the requests database.

See also

See darc.db for more information about link database.

DARC_SAVE_SELENIUM¶

Type: bool (int)
Default: 0

Whether to save links loaded by loader() back to the selenium database.

See also

See darc.db for more information about link database.

TIME_CACHE¶

Type: float
Default: 60

Time delta for caches in seconds.

The darc project supports caching for fetched files. TIME_CACHE specifies for how long fetched files will be cached and NOT fetched again.

Note

If TIME_CACHE is None, cached files will never expire.
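A cache check along these lines might look like the sketch below. The function is_fresh() and its arguments are hypothetical; the sketch assumes the fetch time of a file is known and compares it with TIME_CACHE in seconds.

```python
from datetime import datetime, timedelta


def is_fresh(fetched_at, now, time_cache=60.0):
    """Hypothetical check: is a file fetched at `fetched_at` still cached?"""
    if time_cache is None:
        # None marks the cache as valid forever.
        return True
    return now - fetched_at < timedelta(seconds=time_cache)


t0 = datetime(2020, 1, 1, 0, 0, 0)
print(is_fresh(t0, t0 + timedelta(seconds=30)))
```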

SE_WAIT¶

Type: float
Default: 60

Time to wait for selenium to finish loading pages.

Note

Internally, selenium will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.

White / Black Lists¶

LINK_WHITE_LIST¶

Type: List[str] (JSON)
Default: []

White list of hostnames that should be crawled.

Note

Regular expressions are supported.

LINK_BLACK_LIST¶

Type: List[str] (JSON)
Default: []

Black list of hostnames that should NOT be crawled.

Note

Regular expressions are supported.

LINK_FALLBACK¶

Type: bool (int)
Default: 0

Fallback value for match_host().

MIME_WHITE_LIST¶

Type: List[str] (JSON)
Default: []

White list of content types that should be crawled.

Note

Regular expressions are supported.

MIME_BLACK_LIST¶

Type: List[str] (JSON)
Default: []

Black list of content types that should NOT be crawled.

Note

Regular expressions are supported.

MIME_FALLBACK¶

Type: bool (int)
Default: 0

Fallback value for match_mime().

PROXY_WHITE_LIST¶

Type: List[str] (JSON)
Default: []

White list of proxy types that should be crawled.

Note

The proxy types are case insensitive.

PROXY_BLACK_LIST¶

Type: List[str] (JSON)
Default: []

Black list of proxy types that should NOT be crawled.

Note

The proxy types are case insensitive.

PROXY_FALLBACK¶

Type: bool (int)
Default: 0

Fallback value for match_proxy().

Note

If provided, LINK_WHITE_LIST, LINK_BLACK_LIST, MIME_WHITE_LIST, MIME_BLACK_LIST, PROXY_WHITE_LIST and PROXY_BLACK_LIST should all be JSON encoded strings.
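To illustrate how JSON encoded pattern lists and a fallback value can combine into a crawl decision, consider the sketch below. The helpers load_patterns() and should_crawl() are hypothetical and are not darc's match_host(); in particular, the precedence of the white list over the black list, the use of full matches, and the meaning of the return value are assumptions made for this example.

```python
import json
import re


def load_patterns(raw):
    """Decode a JSON list of regular expressions; empty input means no patterns."""
    return [re.compile(p, re.IGNORECASE) for p in json.loads(raw or "[]")]


def should_crawl(host, white, black, fallback=False):
    """Hypothetical decision: white list first, then black list, else fallback."""
    if any(p.fullmatch(host) for p in white):
        return True
    if any(p.fullmatch(host) for p in black):
        return False
    return fallback


# Backslashes must be doubled inside the JSON string.
white = load_patterns(r'["(.*\\.)?example\\.com"]')
black = load_patterns(r'[".*\\.onion"]')
print(should_crawl("www.example.com", white, black))
```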

Data Submission¶

API_RETRY¶

Type: int
Default: 3

Number of retries for API submission on failure.

API_NEW_HOST¶

Type: str
Default: None

API URL for submit_new_host().

API_REQUESTS¶

Type: str
Default: None

API URL for submit_requests().

API_SELENIUM¶

Type: str
Default: None

API URL for submit_selenium().

Note

If API_NEW_HOST, API_REQUESTS or API_SELENIUM is None, the corresponding submit function will save the JSON data in the path specified by PATH_DATA.
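The interplay of the API URL, the retry count and the local fallback can be sketched as follows. Everything here is illustrative: submit(), the send callable (standing in for an HTTP POST), the file name, and the choice to also save locally once all retries have failed are assumptions, as is counting API_RETRY as retries after one initial attempt.

```python
import json
from pathlib import Path


def submit(data, api_url, send, path_data="data", retry=3, name="submit.json"):
    """Hypothetical submit: try the API, otherwise save JSON under path_data."""
    if api_url is not None:
        # One initial attempt plus `retry` retries (an assumption).
        for _ in range(retry + 1):
            try:
                send(api_url, data)
                return api_url
            except OSError:
                continue
    # No API configured, or every attempt failed: keep the data locally.
    target = Path(path_data) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data), encoding="utf-8")
    return str(target)
```

Passing the sender in as a callable keeps the sketch independent of any particular HTTP client.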

Tor Proxy Configuration¶

TOR_PORT¶

Type: int
Default: 9050

Port for Tor proxy connection.

TOR_CTRL¶

Type: int
Default: 9051

Port for Tor controller connection.

TOR_STEM¶

Type: bool (int)
Default: 1

Whether to manage the Tor proxy through stem.

TOR_PASS¶

Type: str
Default: None

Tor controller authentication token.

Note

If not provided, it will be requested at runtime.

TOR_RETRY¶

Type: int
Default: 3

Number of retries for Tor bootstrap on failure.

TOR_WAIT¶

Type: float
Default: 90

Time after which the attempt to start Tor is aborted.

Note

If not provided, there will be NO timeouts.

TOR_CFG¶

Type: Dict[str, Any] (JSON)
Default: {}

Tor bootstrap configuration for stem.process.launch_tor_with_config().

Note

If provided, it should be a JSON encoded string.
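As a sketch of how TOR_CFG could feed into a configuration dictionary of the kind stem.process.launch_tor_with_config() accepts, the function below decodes the JSON and layers it over the port settings. build_tor_config() is hypothetical, and merging TOR_PORT and TOR_CTRL into SocksPort and ControlPort this way is an assumption, not a description of darc's internals.

```python
import json


def build_tor_config(environ):
    """Hypothetical builder for a Tor configuration dictionary."""
    extra = json.loads(environ.get("TOR_CFG", "{}"))
    if not isinstance(extra, dict):
        raise ValueError("TOR_CFG must be a JSON encoded object")
    config = {
        "SocksPort": environ.get("TOR_PORT", "9050"),
        "ControlPort": environ.get("TOR_CTRL", "9051"),
    }
    # Entries from TOR_CFG take precedence in this sketch.
    config.update(extra)
    return config


print(build_tor_config({"TOR_CFG": '{"DataDirectory": "/tmp/tor"}'}))
```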

I2P Proxy Configuration¶

I2P_PORT¶

Type: int
Default: 4444

Port for I2P proxy connection.

I2P_RETRY¶

Type: int
Default: 3

Number of retries for I2P bootstrap on failure.

I2P_WAIT¶

Type: float
Default: 90

Time after which the attempt to start I2P is aborted.

Note

If not provided, there will be NO timeouts.

I2P_ARGS¶

Type: str (Shell)
Default: ''

I2P bootstrap arguments for i2prouter start.

If provided, it should be parsed as command line arguments (c.f. shlex.split).

Note

The command will be run as DARC_USER, if current user (c.f. getpass.getuser()) is root.
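The shell-style argument string can be split with shlex.split and appended to the base command. The helper i2p_command() and the example flag are hypothetical; the sketch shows only the parsing step, not how darc launches the process or switches to DARC_USER.

```python
import shlex


def i2p_command(i2p_args=""):
    """Hypothetical builder: base command plus the parsed I2P_ARGS."""
    return ["i2prouter", "start", *shlex.split(i2p_args)]


# The flag below is made up purely to show quoting behaviour.
print(i2p_command('--example-flag "two words"'))
```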

ZeroNet Proxy Configuration¶

ZERONET_PORT¶

Type: int
Default: 43110

Port for ZeroNet proxy connection.

ZERONET_RETRY¶

Type: int
Default: 3

Number of retries for ZeroNet bootstrap on failure.

ZERONET_WAIT¶

Type: float
Default: 90

Time after which the attempt to start ZeroNet is aborted.

Note

If not provided, there will be NO timeouts.

ZERONET_PATH¶

Type: str (path)
Default: /usr/local/src/zeronet

Path to the ZeroNet project.

ZERONET_ARGS¶

Type: str (Shell)
Default: ''

ZeroNet bootstrap arguments for ZeroNet.sh main.

Note

If provided, it should be parsed as command line arguments (c.f. shlex.split).

Freenet Proxy Configuration¶

FREENET_PORT¶

Type: int
Default: 8888

Port for Freenet proxy connection.

FREENET_RETRY¶

Type: int
Default: 3

Number of retries for Freenet bootstrap on failure.

FREENET_WAIT¶

Type: float
Default: 90

Time after which the attempt to start Freenet is aborted.

Note

If not provided, there will be NO timeouts.

FREENET_PATH¶

Type: str (path)
Default: /usr/local/src/freenet

Path to the Freenet project.

FREENET_ARGS¶

Type: str (Shell)
Default: ''

Freenet bootstrap arguments for run.sh start.

If provided, it should be parsed as command line arguments (c.f. shlex.split).

Note

The command will be run as DARC_USER, if current user (c.f. getpass.getuser()) is root.
