darc - Darkweb Crawler Project

darc is designed as a swiss army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies, header fields, etc., and bundles selenium to provide a fully rendered web page and a screenshot of the rendered view.

The general process of darc can be described as follows (a simplified sketch appears after the list):

  1. process(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler() with multiprocessing support.

  2. crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

    If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

    If a robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

    Note

    The root path (e.g. / in https://www.example.com/) will always be crawled, regardless of robots.txt.

    At this point, darc will call the customised hook function from darc.sites to crawl the URL and get the final response object. darc will save the session cookies and header information using save_headers().

    Note

    If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing will be skipped.

    If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. If the submission API is provided, submit_requests() will be called to submit the document just fetched.

    If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).

    If the response status code is between 400 and 600, the URL will be saved back to the requests link database (c.f. save_requests()); otherwise, the URL will be saved into the selenium link database for the next steps (c.f. save_selenium()).

  3. process(): after the obtained URLs have all been crawled, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().

    Note

    If FLAG_MP is True, the function will be called with multiprocessing support; if FLAG_TH is True, the function will be called with multithreading support; if neither is set, the function will be called in a single thread.

  4. loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

    At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.Chrome object.

    If successful, the rendered source HTML document will be saved using save_html(), and a full-page screenshot will be taken and saved.

    If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

    Finally, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
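
To make the flow above concrete, here is a minimal, self-contained sketch of one such round. It is not the actual darc implementation: the two in-memory sets stand in for the requests and selenium link databases, a thread pool stands in for the multiprocessing support, and the regular expression is a crude stand-in for extract_links(); new-host handling (robots.txt, sitemaps) and the darc.sites hooks are omitted.

import re
from concurrent.futures import ThreadPoolExecutor

import requests

requests_queue = {'https://www.example.com/'}   # stand-in for the requests link database
selenium_queue = set()                          # stand-in for the selenium link database


def crawler(url):
    """Crawl one URL with requests (step 2) and report what was found."""
    response = requests.get(url, timeout=30)
    new_links = []
    if 'html' in response.headers.get('Content-Type', ''):
        # crude stand-in for extract_links()
        new_links = re.findall(r'href="(https?://[^"]+)"', response.text)
    # 4xx/5xx responses are retried later; everything else moves on to selenium
    return url, new_links, not 400 <= response.status_code < 600


def process():
    """One round (steps 1 to 4): crawl the queued URLs, then hand them to selenium."""
    pending = tuple(requests_queue)
    requests_queue.clear()                          # load_requests() empties the queue
    with ThreadPoolExecutor() as pool:
        for url, new_links, loadable in pool.map(crawler, pending):
            requests_queue.update(new_links)        # save_requests()
            (selenium_queue if loadable else requests_queue).add(url)
    for url in tuple(selenium_queue):
        ...  # loader(url): render with selenium, save HTML and screenshot, extract links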

Installation

Note

darc supports all Python versions 3.6 and above. Currently, it is only supported and tested on Linux (Ubuntu 18.04) and macOS (Catalina).

When installing on Python versions below 3.8, darc will use walrus to compile itself for backport compatibility.

pip install darc

Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system.

Alternatively, the darc project is shipped with Docker and Compose support. Please see Docker Integration for more information.

Or, you may refer to and/or install from the Docker Hub repository:

docker pull jsnbzh/darc[:TAGNAME]

Usage

The darc project provides a simple CLI:

usage: darc [-h] [-f FILE] ...

the darkweb crawling swiss army knife

positional arguments:
  link                  links to crawl

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  read links from file

It can also be called through module entrypoint:

python -m darc ...

Note

The link files can contain comment lines, which should start with #. Empty lines and comment lines will be ignored when loading.
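
For example, the following link file (with purely hypothetical URLs) could be fed to the crawler with darc -f links.txt:

# seed links for darc
https://www.example.com/
http://example.onion/

# comment lines and the blank line above are ignored when loading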

Configuration

Though it provides a simple CLI, the darc project is mainly configured through environment variables.

General Configurations

DARC_REBOOT
  Type: bool (int)
  Default: 0

Whether to exit the program after the first round, i.e. after crawling all links from the requests link database and loading all links from the selenium link database.

This can be useful especially when capacity is limited and you wish to save some space before continuing the next round. See Docker integration for more information.

DARC_DEBUG
  Type: bool (int)
  Default: 0

Whether to run the program in debugging mode.

DARC_VERBOSE
  Type: bool (int)
  Default: 0

Whether to run the program in verbose mode. If DARC_DEBUG is True, verbose mode will always be enabled.

DARC_FORCE
  Type: bool (int)
  Default: 0

Whether to ignore robots.txt rules when crawling (c.f. crawler()).

DARC_CHECK
  Type: bool (int)
  Default: 0

Whether to check the proxy type and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

If DARC_CHECK_CONTENT_TYPE is True, this environment variable will always be set to True.

DARC_CHECK_CONTENT_TYPE
  Type: bool (int)
  Default: 0

Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
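
This check roughly amounts to issuing a HEAD request and inspecting the Content-Type header before deciding whether to fetch the full document. A minimal illustration with requests (not the actual darc code):

import requests

def content_type(url, timeout=30):
    """Return the MIME type reported by a HEAD request, without fetching the body."""
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return response.headers.get('Content-Type', '').split(';')[0].strip()

if content_type('https://www.example.com/') == 'text/html':
    ...  # worth crawling in full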

DARC_CPU
  Type: int
  Default: None

Number of concurrent processes. If not provided, then the number of system CPUs will be used.

DARC_MULTIPROCESSING
  Type: bool (int)
  Default: 1

Whether to enable multiprocessing support.

DARC_MULTITHREADING
  Type: bool (int)
  Default: 0

Whether to enable multithreading support.

Note

DARC_MULTIPROCESSING and DARC_MULTITHREADING can NOT be enabled at the same time.
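
For example, to crawl with four worker processes (illustrative values):

export DARC_MULTIPROCESSING=1
export DARC_MULTITHREADING=0
export DARC_CPU=4
darc -f links.txt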

DARC_USER
  Type: str
  Default: current login user (c.f. getpass.getuser())

Non-root user for proxies.

DARC_MAX_POOL
  Type: int
  Default: 1_000

Maximum number of links loaded from the database.

Note

If set to inf (infinity), no limit will be applied.

Data Storage

PATH_DATA
  Type: str (path)
  Default: data

Path to data storage.

See also

See darc.save for more information about source saving.

Web Crawlers

DARC_SAVE
  Type: bool (int)
  Default: 0

Whether to save the processed link back to the database.

Note

If DARC_SAVE is True, then DARC_SAVE_REQUESTS and DARC_SAVE_SELENIUM will be forced to be True.

See also

See darc.db for more information about link database.

DARC_SAVE_REQUESTS
  Type: bool (int)
  Default: 0

Whether to save links crawled by crawler() back to the requests database.

See also

See darc.db for more information about link database.

DARC_SAVE_SELENIUM
  Type: bool (int)
  Default: 0

Whether to save links loaded by loader() back to the selenium database.

See also

See darc.db for more information about link database.

TIME_CACHE
  Type: float
  Default: 60

Time delta for caches in seconds.

The darc project supports caching of fetched files. TIME_CACHE specifies for how long the fetched files will be cached and NOT fetched again.

Note

If TIME_CACHE is None, fetched files will be cached forever (and thus never fetched again).
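
A minimal illustration of the rule (not the actual darc cache logic):

import os
import time

TIME_CACHE = 60.0  # seconds, as configured by the environment variable

def is_cached(path):
    """A file fetched less than TIME_CACHE seconds ago will not be fetched again."""
    return os.path.exists(path) and time.time() - os.path.getmtime(path) < TIME_CACHE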

SE_WAIT
  Type: float
  Default: 60

Time to wait for selenium to finish loading pages.

Note

Internally, selenium waits for the browser to finish loading the page before returning, i.e. until the DOMContentLoaded event of the web API has fired. However, some extra scripts may take more time to run after that event.
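
SE_WAIT adds a fixed extra delay on top of that event. A minimal illustration of the idea (a simplified stand-in for the project's actual loader):

import time

from selenium import webdriver

SE_WAIT = 60  # seconds, as configured by the environment variable

driver = webdriver.Chrome()
driver.get('https://www.example.com/')  # returns once DOMContentLoaded has fired
time.sleep(SE_WAIT)                     # give late-running scripts time to finish
html = driver.page_source               # rendered source, as saved by save_html()
driver.quit()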

White / Black Lists

LINK_WHITE_LIST
  Type: List[str] (JSON)
  Default: []

White list of hostnames that should be crawled.

Note

Regular expressions are supported.

LINK_BLACK_LIST
  Type: List[str] (JSON)
  Default: []

Black list of hostnames that should never be crawled.

Note

Regular expressions are supported.

LINK_FALLBACK
  Type: bool (int)
  Default: 0

Fallback value for match_host().
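
The three values combine roughly as follows. This is an illustrative reading of the matching rules, not the exact darc implementation (in particular, the order in which the two lists are consulted may differ):

import re

def match_host(host, white_list, black_list, fallback):
    """Decide whether a hostname should be crawled."""
    if any(re.search(pattern, host) for pattern in black_list):
        return False    # explicitly rejected
    if any(re.search(pattern, host) for pattern in white_list):
        return True     # explicitly allowed
    return fallback     # neither list matched: use the fallback value

match_host('www.example.com', white_list=[r'\.onion$'], black_list=[], fallback=False)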

MIME_WHITE_LIST
  Type: List[str] (JSON)
  Default: []

White list of content types that should be crawled.

Note

Regular expressions are supported.

MIME_BLACK_LIST
  Type: List[str] (JSON)
  Default: []

Black list of content types that should never be crawled.

Note

Regular expressions are supported.

MIME_FALLBACK
  Type: bool (int)
  Default: 0

Fallback value for match_mime().

PROXY_WHITE_LIST
  Type: List[str] (JSON)
  Default: []

White list of proxy types that should be crawled.

Note

The proxy types are case insensitive.

PROXY_BLACK_LIST
  Type: List[str] (JSON)
  Default: []

Black list of proxy types that should never be crawled.

Note

The proxy types are case insensitive.

PROXY_FALLBACK
  Type: bool (int)
  Default: 0

Fallback value for match_proxy().

Note

If provided, LINK_WHITE_LIST, LINK_BLACK_LIST, MIME_WHITE_LIST, MIME_BLACK_LIST, PROXY_WHITE_LIST and PROXY_BLACK_LIST should all be JSON encoded strings.
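
For example (the patterns are illustrative only):

export LINK_WHITE_LIST='["\\.onion$"]'
export MIME_WHITE_LIST='["text/html", "application/xhtml\\+xml"]'
export PROXY_WHITE_LIST='["tor"]'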

Data Submission

API_RETRY
  Type: int
  Default: 3

Number of times to retry API submission upon failure.

API_NEW_HOST
  Type: str
  Default: None

API URL for submit_new_host().

API_REQUESTS
  Type: str
  Default: None

API URL for submit_requests().

API_SELENIUM
  Type: str
  Default: None

API URL for submit_selenium().

Note

If API_NEW_HOST, API_REQUESTS or API_SELENIUM is None, the corresponding submit function will instead save the JSON data under the path specified by PATH_DATA.
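
For example, with hypothetical endpoints:

export API_RETRY=3
export API_NEW_HOST='https://api.example.com/darc/new_host'
export API_REQUESTS='https://api.example.com/darc/requests'
export API_SELENIUM='https://api.example.com/darc/selenium'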

Tor Proxy Configuration

TOR_PORT
  Type: int
  Default: 9050

Port for Tor proxy connection.

TOR_CTRL
  Type: int
  Default: 9051

Port for Tor controller connection.

TOR_STEM
  Type: bool (int)
  Default: 1

Whether to manage the Tor proxy through stem.

TOR_PASS
  Type: str
  Default: None

Tor controller authentication token.

Note

If not provided, it will be requested at runtime.

TOR_RETRY
  Type: int
  Default: 3

Number of times to retry Tor bootstrap upon failure.

TOR_WAIT
  Type: float
  Default: 90

Time after which the attempt to start Tor is aborted.

Note

If not provided, there will be NO timeouts.

TOR_CFG
  Type: Dict[str, Any] (JSON)
  Default: {}

Tor bootstrap configuration for stem.process.launch_tor_with_config().

Note

If provided, it should be a JSON encoded string.
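
For example, the following (illustrative) setting forwards two standard torrc options to the config argument of stem.process.launch_tor_with_config():

export TOR_CFG='{"SocksPort": "9050", "ControlPort": "9051"}'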

I2P Proxy Configuration

I2P_PORT
  Type: int
  Default: 4444

Port for I2P proxy connection.

I2P_RETRY
  Type: int
  Default: 3

Number of times to retry I2P bootstrap upon failure.

I2P_WAIT
  Type: float
  Default: 90

Time after which the attempt to start I2P is aborted.

Note

If not provided, there will be NO timeouts.

I2P_ARGS
  Type: str (Shell)
  Default: ''

I2P bootstrap arguments for i2prouter start.

If provided, it will be parsed as command line arguments (c.f. shlex.split()).

Note

The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
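
For instance, shlex.split() turns such a shell-style string into an argument list; the flag below is purely hypothetical:

import shlex

# A hypothetical I2P_ARGS value; the parsed arguments are appended to the
# 'i2prouter start' command line.
shlex.split("--some-flag 'quoted value'")
# -> ['--some-flag', 'quoted value']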

ZeroNet Proxy Configuration

ZERONET_PORT
  Type: int
  Default: 4444

Port for ZeroNet proxy connection.

ZERONET_RETRY
  Type: int
  Default: 3

Number of times to retry ZeroNet bootstrap upon failure.

ZERONET_WAIT
  Type: float
  Default: 90

Time after which the attempt to start ZeroNet is aborted.

Note

If not provided, there will be NO timeouts.

ZERONET_PATH
  Type: str (path)
  Default: /usr/local/src/zeronet

Path to the ZeroNet project.

ZERONET_ARGS
  Type: str (Shell)
  Default: ''

ZeroNet bootstrap arguments for ZeroNet.sh main.

Note

If provided, it will be parsed as command line arguments (c.f. shlex.split()).

Freenet Proxy Configuration

FREENET_PORT
  Type: int
  Default: 8888

Port for Freenet proxy connection.

FREENET_RETRY
  Type: int
  Default: 3

Number of times to retry Freenet bootstrap upon failure.

FREENET_WAIT
  Type: float
  Default: 90

Time after which the attempt to start Freenet is aborted.

Note

If not provided, there will be NO timeouts.

FREENET_PATH
  Type: str (path)
  Default: /usr/local/src/freenet

Path to the Freenet project.

FREENET_ARGS
  Type: str (Shell)
  Default: ''

Freenet bootstrap arguments for run.sh start.

If provided, it will be parsed as command line arguments (c.f. shlex.split()).

Note

The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
