darc - Darkweb Crawler Project

darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies, header fields, etc. It also bundles selenium to provide a fully rendered web page and a screenshot of such view.
The general process of darc can be described as follows:

1. process(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler() with multiprocessing support.

2. crawler(): parse the URL using parse_link(), and check whether the URL should be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

   If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

   If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

   Note: The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.

   At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information using save_headers().

   Note: If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing is dropped.

   If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. And if the submission API is provided, submit_requests() will be called to submit the document just fetched.

   If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).

   If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). If NOT, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).

3. process(): after the obtained URLs have all been crawled, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().

4. loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

   At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.Chrome object.

   If successful, the rendered source HTML document will be saved using save_html(), and a full-page screenshot will be taken and saved.

   If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

   Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
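The two-phase round described above can be sketched in a few lines of Python. The queue and function names below mirror the documentation but are simplified stand-ins, not darc's actual implementation (which adds multiprocessing, databases and per-site hooks):

```python
# A minimal, self-contained sketch of one darc round: drain the
# requests queue with crawler(), then the selenium queue with loader().
# These are illustrative stand-ins, not darc's real functions.

requests_queue = ["http://example.com/"]   # links pending a requests crawl
selenium_queue = []                        # links pending a selenium load

def crawler(url):
    """Pretend to fetch *url* with requests, then queue it for selenium."""
    # ... fetch, save headers/documents, extract sitemap links ...
    selenium_queue.append(url)

def loader(url):
    """Pretend to render *url* with selenium and take a screenshot."""
    # ... render page, save HTML and screenshot, extract links ...
    return url

def process():
    """One round: crawl every queued URL, then load every rendered URL."""
    while requests_queue:
        crawler(requests_queue.pop())
    loaded = []
    while selenium_queue:
        loaded.append(loader(selenium_queue.pop()))
    return loaded
```

In the real project the two phases are driven by the same process() entry point, with the extracted links feeding the next round.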
Installation

Note: darc supports all Python versions 3.6 and above. Currently, it is only supported and tested on Linux (Ubuntu 18.04) and macOS (Catalina).

When installing on Python versions below 3.8, darc will use walrus to compile itself for backport compatibility.

pip install darc

Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system. Alternatively, the darc project is shipped with Docker and Compose support. Please see the project root for relevant files and more information.
Usage

The darc project provides a simple CLI:

usage: darc [-h] [-f FILE] ...

the darkweb crawling swiss army knife

positional arguments:
  link                  links to crawl

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  read links from file

It can also be called through the module entrypoint:

python -m darc ...

Note: The link files can contain comment lines, which should start with #. Empty lines and comment lines will be ignored when loading.
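The link-file loading rule above is simple enough to sketch directly; the helper below is a hypothetical illustration, not darc's actual loader:

```python
def read_links(text):
    """Return the links in a link file, skipping blanks and comments."""
    links = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # empty lines and '#' comment lines are ignored
        links.append(line)
    return links
```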
Configuration

Though the CLI is simple, the darc project is further configurable through environment variables.
General Configurations

- DARC_REBOOT: bool (int)
  Whether to exit the program after the first round, i.e. after all links from the requests link database have been crawled and all links from the selenium link database have been loaded.
  Default: 0

- DARC_DEBUG: bool (int)
  Whether to run the program in debugging mode.
  Default: 0

- DARC_VERBOSE: bool (int)
  Whether to run the program in verbose mode. If DARC_DEBUG is True, then verbose mode will always be enabled.
  Default: 0

- DARC_CHECK: bool (int)
  Whether to check proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
  If DARC_CHECK_CONTENT_TYPE is True, then this environment variable will always be set as True.
  Default: 0

- DARC_CHECK_CONTENT_TYPE: bool (int)
  Whether to check content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
  Default: 0

- DARC_CPU: int
  Number of concurrent processes. If not provided, the number of system CPUs will be used.
  Default: None

- DARC_MULTIPROCESSING: bool (int)
  Whether to enable multiprocessing support.
  Default: 1

- DARC_MULTITHREADING: bool (int)
  Whether to enable multithreading support.
  Default: 0

Note: DARC_MULTIPROCESSING and DARC_MULTITHREADING can NOT be toggled at the same time.

- DARC_USER: str
  Non-root user for proxies.
  Default: current login user (c.f. getpass.getuser())
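Since these are plain environment variables, they are typically exported before launching darc; the values below are purely illustrative:

```shell
# Illustrative values only; darc reads these from its environment at startup.
export DARC_DEBUG=1            # debugging mode on (implies verbose)
export DARC_CPU=4              # four concurrent processes
export DARC_MULTIPROCESSING=1  # multiprocessing on...
export DARC_MULTITHREADING=0   # ...so multithreading must stay off
```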
Data Storage

Web Crawlers

- TIME_CACHE: float
  Time delta for caches in seconds.
  The darc project supports caching of fetched files. TIME_CACHE specifies for how long the fetched files will be cached and NOT fetched again.
  Note: If TIME_CACHE is None, then caching will be marked as forever.
  Default: 60

- SE_WAIT: float
  Time to wait for selenium to finish loading pages.
  Note: Internally, selenium will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time running after the event.
  Default: 60
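The TIME_CACHE semantics (a freshness window in seconds, with None meaning cache forever) can be sketched as follows; this is a hypothetical helper, not darc's internal cache check:

```python
import os
import time

# TIME_CACHE arrives through the environment; 60 s is the documented default.
TIME_CACHE = float(os.getenv('TIME_CACHE', '60'))

def is_cached(fetched_at, now=None, delta=TIME_CACHE):
    """Return True if a file fetched at *fetched_at* need not be refetched."""
    if delta is None:
        return True  # cache marked as forever
    now = time.time() if now is None else now
    return (now - fetched_at) < delta
```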
White / Black Lists

- LINK_WHITE_LIST: List[str] (json)
  White list of hostnames that should be crawled.
  Default: []
  Note: Regular expressions are supported.

- LINK_BLACK_LIST: List[str] (json)
  Black list of hostnames that should NOT be crawled.
  Default: []
  Note: Regular expressions are supported.

- LINK_FALLBACK: bool (int)
  Fallback value for match_host().

- MIME_WHITE_LIST: List[str] (json)
  White list of content types that should be crawled.
  Default: []
  Note: Regular expressions are supported.

- MIME_BLACK_LIST: List[str] (json)
  Black list of content types that should NOT be crawled.
  Default: []
  Note: Regular expressions are supported.

- MIME_FALLBACK: bool (int)
  Fallback value for match_mime().

- PROXY_WHITE_LIST: List[str] (json)
  White list of proxy types that should be crawled.
  Default: []
  Note: The proxy types are case insensitive.

- PROXY_BLACK_LIST: List[str] (json)
  Black list of proxy types that should NOT be crawled.
  Default: []
  Note: The proxy types are case insensitive.

- PROXY_FALLBACK: bool (int)
  Fallback value for match_proxy().

Note: If provided, LINK_WHITE_LIST, LINK_BLACK_LIST, MIME_WHITE_LIST, MIME_BLACK_LIST, PROXY_WHITE_LIST and PROXY_BLACK_LIST should all be JSON encoded strings.
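The fallback values make sense once the matching logic is spelled out. The helper below is a hypothetical sketch of white/black list matching with a fallback (here the white list takes precedence; darc's actual match_host()/match_mime()/match_proxy() may order checks differently):

```python
import json
import os
import re

def match(value, white_list, black_list, fallback):
    """Decide whether *value* should be processed.

    A white-list hit accepts, a black-list hit rejects, and if
    neither list matches the *fallback* value decides.
    """
    if any(re.search(pattern, value) for pattern in white_list):
        return True
    if any(re.search(pattern, value) for pattern in black_list):
        return False
    return fallback

# The lists arrive as JSON encoded environment variables, e.g.:
white = json.loads(os.getenv('LINK_WHITE_LIST', '[]'))
black = json.loads(os.getenv('LINK_BLACK_LIST', '[]'))
```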
Data Submission

- API_RETRY: int
  Retry times for API submission upon failure.
  Default: 3

- API_NEW_HOST: str
  API URL for submit_new_host().
  Default: None

- API_REQUESTS: str
  API URL for submit_requests().
  Default: None

- API_SELENIUM: str
  API URL for submit_selenium().
  Default: None

Note: If API_NEW_HOST, API_REQUESTS and API_SELENIUM are None, the corresponding submit function will save the JSON data in the path specified by PATH_DATA.
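The submit-or-save fallback described in the note can be sketched as below; the function and file names are hypothetical stand-ins, not darc's actual submitters:

```python
import json
import os

def submit(data, api_url, path_data='data'):
    """Submit *data* to *api_url*, or save it under *path_data* if unset."""
    if api_url is None:
        # No API configured: fall back to dumping the JSON data on disk.
        os.makedirs(path_data, exist_ok=True)
        dest = os.path.join(path_data, 'dump.json')
        with open(dest, 'w') as file:
            json.dump(data, file)
        return dest
    # ... otherwise perform the HTTP POST, retrying up to API_RETRY times ...
```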
Tor Proxy Configuration

- TOR_PORT: int
  Port for Tor proxy connection.
  Default: 9050

- TOR_CTRL: int
  Port for Tor controller connection.
  Default: 9051

- TOR_PASS: str
  Tor controller authentication token.
  Default: None
  Note: If not provided, it will be requested at runtime.

- TOR_RETRY: int
  Retry times for Tor bootstrap upon failure.
  Default: 3

- TOR_WAIT: float
  Time after which the attempt to start Tor is aborted.
  Default: 90
  Note: If not provided, there will be NO timeouts.

- TOR_CFG: Dict[str, Any] (json)
  Tor bootstrap configuration for stem.process.launch_tor_with_config().
  Default: {}
  Note: If provided, it should be a JSON encoded string.
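Putting the Tor settings together, the environment might look like this (illustrative values; the TOR_CFG keys shown are standard torrc options passed through to stem):

```shell
export TOR_PORT=9050
export TOR_CTRL=9051
export TOR_RETRY=3
export TOR_WAIT=90
# JSON encoded string handed to stem.process.launch_tor_with_config()
export TOR_CFG='{"SocksPort": "9050", "ControlPort": "9051"}'
```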
I2P Proxy Configuration

- I2P_PORT: int
  Port for I2P proxy connection.
  Default: 4444

- I2P_RETRY: int
  Retry times for I2P bootstrap upon failure.
  Default: 3

- I2P_WAIT: float
  Time after which the attempt to start I2P is aborted.
  Default: 90
  Note: If not provided, there will be NO timeouts.

- I2P_ARGS: str (shell)
  I2P bootstrap arguments for i2prouter start.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  Default: ''
  Note: The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
ZeroNet Proxy Configuration

- ZERONET_PORT: int
  Port for ZeroNet proxy connection.
  Default: 4444

- ZERONET_RETRY: int
  Retry times for ZeroNet bootstrap upon failure.
  Default: 3

- ZERONET_WAIT: float
  Time after which the attempt to start ZeroNet is aborted.
  Default: 90
  Note: If not provided, there will be NO timeouts.

- ZERONET_PATH: str (path)
  Path to the ZeroNet project.
  Default: /usr/local/src/zeronet

- ZERONET_ARGS: str (shell)
  ZeroNet bootstrap arguments for ZeroNet.sh main.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  Default: ''
Freenet Proxy Configuration

- FREENET_PORT: int
  Port for Freenet proxy connection.
  Default: 8888

- FREENET_RETRY: int
  Retry times for Freenet bootstrap upon failure.
  Default: 3

- FREENET_WAIT: float
  Time after which the attempt to start Freenet is aborted.
  Default: 90
  Note: If not provided, there will be NO timeouts.

- FREENET_PATH: str (path)
  Path to the Freenet project.
  Default: /usr/local/src/freenet

- FREENET_ARGS: str (shell)
  Freenet bootstrap arguments for run.sh start.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  Default: ''
  Note: The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
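The I2P_ARGS, ZERONET_ARGS and FREENET_ARGS variables above are all parsed as command lines. How shlex.split() handles them is standard library behaviour, illustrated here with made-up flags:

```python
import shlex

# Quoted arguments survive splitting as single tokens, so a value like
# "--max-mem '512 MB' --verbose" yields exactly three arguments.
args = shlex.split("--max-mem '512 MB' --verbose")
```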