darc - Darkweb Crawler Project

darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies and header fields, and bundles selenium to provide a fully rendered web page along with a screenshot of that view.

The general process of darc can be described as follows:

  1. process(): obtain URLs from the requests link database (c.f. load_requests()) and feed them to crawler() with multiprocessing support.

  2. crawler(): parse the URL using parse_link() and check whether the URL should be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

    If the URL is from a brand new host, darc will first fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

    If a robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

    Note

    The root path (e.g. / in https://www.example.com/) will always be crawled, regardless of robots.txt.

    At this point, darc will call the customised hook function from darc.sites to crawl the URL and obtain the final response object. darc will then save the session cookies and header information using save_headers().

    Note

    If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing will be dropped.

    If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. If the submission API is provided, submit_requests() will also be called to submit the document just fetched.

    If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).

    If the response status code is between 400 and 600, the URL will be saved back into the requests link database for a later retry (c.f. save_requests()); otherwise, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).

  3. process(): after the obtained URLs have all been crawled, darc will obtain URLs from the selenium link database (c.f. load_selenium()) and feed them to loader().

    Note

    If FLAG_MP is True, the function will be called with multiprocessing support; if FLAG_TH is True, with multithreading support; if neither, the function will be called in a single thread.

  4. loader(): parse the URL using parse_link() and load the URL using selenium with Google Chrome.

    At this point, darc will call the customised hook function from darc.sites to load the URL and return the original selenium.webdriver.Chrome object.

    If successful, the rendered source HTML document will be saved using save_html(), and a full-page screenshot will be taken and saved.

    If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

    Afterwards, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
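
The link-extraction step can be sketched with the standard library's html.parser; this is an illustrative model, not darc's actual extract_links() implementation:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Resolve relative references against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> list:
    """Return all hyperlinks found in *html*, as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


extract_links('https://www.example.com/', '<a href="/about">About</a>')
# ['https://www.example.com/about']
```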

Installation

Note

darc supports all Python versions 3.8 and above. Currently, it only supports and is tested on Linux (Ubuntu 18.04) and macOS (Catalina).

pip install darc

Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system.

Alternatively, the darc project ships with Docker and Compose support. Please see the project root for the relevant files and more information.

Usage

The darc project provides a simple CLI:

usage: darc [-h] [-f FILE] ...

darkweb swiss knife crawler

positional arguments:
  link                  links to crawl

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  read links from file

It can also be called through module entrypoint:

python -m darc ...

Note

The link files can contain comment lines, which start with #. Empty lines and comment lines are ignored when loading.
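
Loading such a link file can be sketched as follows (illustrative only, not darc's actual loader):

```python
def read_links(text: str) -> list:
    """Return links from a link file, skipping empty and comment lines."""
    links = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # empty lines and comments are ignored
        links.append(line)
    return links


sample = """\
# seed list
https://www.example.com/

http://example.onion/
"""
read_links(sample)
# ['https://www.example.com/', 'http://example.onion/']
```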

Configuration

Though the CLI is simple, the darc project is chiefly configured through environment variables.

General Configurations

DARC_REBOOT: bool (int)

If set, exit the program after the first round, i.e. after all links from the requests link database have been crawled and all links from the selenium link database have been loaded.

Default

0

DARC_DEBUG: bool (int)

If set, run the program in debugging mode.

Default

0

DARC_VERBOSE: bool (int)

If set, run the program in verbose mode. If DARC_DEBUG is True, verbose mode is always enabled.

Default

0

DARC_FORCE: bool (int)

If set, ignore robots.txt rules when crawling (c.f. crawler()).

Default

0
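
The robots.txt check that DARC_FORCE bypasses can be sketched with the standard library's urllib.robotparser; the function name, the force parameter and the root-path exemption are modeled here for illustration and are not darc's actual API:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, robots_lines: list, force: bool = False) -> bool:
    """Decide whether *url* may be crawled given the host's robots.txt lines."""
    # The root path is always crawled, regardless of robots.txt.
    if urlparse(url).path in ('', '/'):
        return True
    # With FORCE enabled, robots.txt rules are ignored altogether.
    if force:
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch('*', url)


robots = ['User-agent: *', 'Disallow: /private']
allowed_by_robots('https://www.example.com/', robots)           # root: always allowed
allowed_by_robots('https://www.example.com/private/a', robots)  # disallowed
```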

DARC_CHECK: bool (int)

If set, check the proxy type and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

If DARC_CHECK_CONTENT_TYPE is True, this environment variable will always be set to True.

Default

0

DARC_CHECK_CONTENT_TYPE: bool (int)

If set, check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

Default

0

DARC_CPU: int

Number of concurrent processes. If not provided, the number of system CPUs will be used.

Default

None

DARC_MULTIPROCESSING: bool (int)

If set, enable multiprocessing support.

Default

1

DARC_MULTITHREADING: bool (int)

If set, enable multithreading support.

Default

0

Note

DARC_MULTIPROCESSING and DARC_MULTITHREADING can NOT be enabled at the same time.
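
The mutually exclusive concurrency modes can be sketched as follows (a simplified model; the function signature and pool size are illustrative):

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def dispatch(func, urls, flag_mp=False, flag_th=False, processes=4):
    """Apply *func* to every URL with the configured concurrency model."""
    if flag_mp:                       # DARC_MULTIPROCESSING
        with Pool(processes) as pool:
            return pool.map(func, urls)
    if flag_th:                       # DARC_MULTITHREADING
        with ThreadPool(processes) as pool:
            return pool.map(func, urls)
    # Neither flag set: run in a single thread.
    return [func(url) for url in urls]
```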

DARC_USER: str

Non-root user for proxies.

Default

current login user (c.f. getpass.getuser())

Data Storage

PATH_DATA: str (path)

Path to data storage.

Default

data

See also

See darc.save for more information about source saving.

Web Crawlers

TIME_CACHE: float

Time delta for caches in seconds.

The darc project supports caching of fetched files. TIME_CACHE specifies for how long fetched files will be cached and NOT fetched again.

Note

If TIME_CACHE is None, cached files will be kept forever and never fetched again.

Default

60
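
The caching rule can be sketched with file modification times (a simplified model; darc's actual cache layout differs):

```python
import os
import time


def is_cached(path: str, time_cache: float = 60.0) -> bool:
    """Return True if *path* exists and is younger than *time_cache* seconds.

    A *time_cache* of None marks the cache as valid forever.
    """
    if not os.path.exists(path):
        return False
    if time_cache is None:  # cache forever
        return True
    age = time.time() - os.path.getmtime(path)
    return age < time_cache
```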

SE_WAIT: float

Time to wait for selenium to finish loading pages.

Note

Internally, selenium waits for the browser to finish loading the page before returning (i.e. until the DOMContentLoaded event fires). However, some scripts may take more time to run after that event.

Default

60

White / Black Lists

LINK_WHITE_LIST: List[str] (json)

White list of hostnames that should be crawled.

Default

[]

Note

Regular expressions are supported.

LINK_BLACK_LIST: List[str] (json)

Black list of hostnames that should not be crawled.

Default

[]

Note

Regular expressions are supported.

MIME_WHITE_LIST: List[str] (json)

White list of content types that should be crawled.

Default

[]

Note

Regular expressions are supported.

MIME_BLACK_LIST: List[str] (json)

Black list of content types that should not be crawled.

Default

[]

Note

Regular expressions are supported.

PROXY_WHITE_LIST: List[str] (json)

White list of proxy types that should be crawled.

Default

[]

Note

The proxy types are case insensitive.

PROXY_BLACK_LIST: List[str] (json)

Black list of proxy types that should not be crawled.

Default

[]

Note

The proxy types are case insensitive.

Note

If provided, LINK_WHITE_LIST, LINK_BLACK_LIST, MIME_WHITE_LIST, MIME_BLACK_LIST, PROXY_WHITE_LIST and PROXY_BLACK_LIST should all be JSON encoded strings.
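
Setting and decoding such a JSON-encoded list can be sketched as follows (the environment variable name is from the table above; the parsing and matching helpers, and whether darc uses full or partial regex matches, are illustrative assumptions):

```python
import json
import os
import re

# A JSON-encoded white list, as it would appear in the environment.
os.environ['LINK_WHITE_LIST'] = json.dumps([r'.*\.onion', r'example\.com'])


def get_list(name: str) -> list:
    """Parse a JSON-encoded list from the environment, defaulting to []."""
    return json.loads(os.environ.get(name, '[]'))


def match_host(host: str, patterns: list) -> bool:
    """Check a hostname against a list of regular expressions."""
    return any(re.fullmatch(pattern, host) for pattern in patterns)


white_list = get_list('LINK_WHITE_LIST')
match_host('example.onion', white_list)  # True
match_host('example.net', white_list)    # False
```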

Data Submission

API_RETRY: int

Retry times for API submission upon failure.

Default

3

API_NEW_HOST: str

API URL for submit_new_host().

Default

None

API_REQUESTS: str

API URL for submit_requests().

Default

None

API_SELENIUM: str

API URL for submit_selenium().

Default

None

Note

If any of API_NEW_HOST, API_REQUESTS and API_SELENIUM is None, the corresponding submit function will instead save the JSON data in the path specified by PATH_DATA.

Tor Proxy Configuration

TOR_PORT: int

Port for Tor proxy connection.

Default

9050

TOR_CTRL: int

Port for Tor controller connection.

Default

9051

TOR_STEM: bool (int)

If set, manage the Tor proxy through stem.

Default

1

TOR_PASS: str

Tor controller authentication token.

Default

None

Note

If not provided, it will be requested at runtime.

TOR_RETRY: int

Retry times for Tor bootstrap upon failure.

Default

3

TOR_WAIT: float

Time after which the attempt to start Tor is aborted.

Default

90

Note

If not provided, there will be NO timeouts.

TOR_CFG: Dict[str, Any] (json)

Tor bootstrap configuration for stem.process.launch_tor_with_config().

Default

{}

Note

If provided, it should be a JSON encoded string.
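
For example, a TOR_CFG value could be constructed and decoded like this (the specific Tor options are illustrative; the decoded dict is what would be handed to stem.process.launch_tor_with_config()):

```python
import json
import os

# A JSON-encoded configuration, as it would appear in the environment.
os.environ['TOR_CFG'] = json.dumps({'SocksPort': '9050', 'ControlPort': '9051'})

# Decode it back into a plain dict, defaulting to {}.
tor_cfg = json.loads(os.environ.get('TOR_CFG', '{}'))
tor_cfg  # {'SocksPort': '9050', 'ControlPort': '9051'}

# darc would then bootstrap Tor roughly as:
#   stem.process.launch_tor_with_config(config=tor_cfg)
```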

I2P Proxy Configuration

I2P_PORT: int

Port for I2P proxy connection.

Default

4444

I2P_RETRY: int

Retry times for I2P bootstrap upon failure.

Default

3

I2P_WAIT: float

Time after which the attempt to start I2P is aborted.

Default

90

Note

If not provided, there will be NO timeouts.

I2P_ARGS: str (shell)

I2P bootstrap arguments for i2prouter start.

If provided, it should be parsed as command line arguments (c.f. shlex.split).

Default

''

Note

The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
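
The shlex.split parsing mentioned above behaves as follows (the flags themselves are hypothetical, chosen only to show the quoting rules):

```python
import shlex

# A hypothetical I2P_ARGS value with a quoted argument.
args = shlex.split('--port 7657 --name "my router"')
args  # ['--port', '7657', '--name', 'my router']
```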

ZeroNet Proxy Configuration

ZERONET_PORT: int

Port for ZeroNet proxy connection.

Default

4444

ZERONET_RETRY: int

Retry times for ZeroNet bootstrap upon failure.

Default

3

ZERONET_WAIT: float

Time after which the attempt to start ZeroNet is aborted.

Default

90

Note

If not provided, there will be NO timeouts.

ZERONET_PATH: str (path)

Path to the ZeroNet project.

Default

/usr/local/src/zeronet

ZERONET_ARGS: str (shell)

ZeroNet bootstrap arguments for ZeroNet.sh main.

Default

''

Note

If provided, it should be parsed as command line arguments (c.f. shlex.split).

Freenet Proxy Configuration

FREENET_PORT: int

Port for Freenet proxy connection.

Default

8888

FREENET_RETRY: int

Retry times for Freenet bootstrap upon failure.

Default

3

FREENET_WAIT: float

Time after which the attempt to start Freenet is aborted.

Default

90

Note

If not provided, there will be NO timeouts.

FREENET_PATH: str (path)

Path to the Freenet project.

Default

/usr/local/src/freenet

FREENET_ARGS: str (shell)

Freenet bootstrap arguments for run.sh start.

If provided, it should be parsed as command line arguments (c.f. shlex.split).

Default

''

Note

The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
