darc - Darkweb Crawler Project

darc is designed as a Swiss Army knife for darkweb crawling.
It integrates requests to collect HTTP request and response
information, such as cookies, header fields, etc. It also bundles
selenium to provide a fully rendered web page and a screenshot
of the rendered view.
Main Processing

The darc.process module contains the main processing logic of the
darc module.
darc.process._dump_last_word(errors=True)
    Dump data in queue.

    Parameters:
        errors (bool) – If the function is called upon a raised error.

    The function will remove the backup of the requests database
    _queue_requests.txt.tmp (if it exists) and the backup of the
    selenium database _queue_selenium.txt.tmp (if it exists).

    If errors is True, the function will first copy the backup of the
    requests database _queue_requests.txt.tmp (if it exists) and the
    backup of the selenium database _queue_selenium.txt.tmp (if it
    exists) to the corresponding databases.

    The function will also remove the PID file darc.pid.
darc.process._get_requests_links()
    Fetch links from queue.

    Returns:
        List of links from the requests database.
    Return type:
        List[str]

    Deprecated since version 0.1.0: Use darc.db.load_requests() instead.
darc.process._get_selenium_links()
    Fetch links from queue.

    Returns:
        List of links from the selenium database.
    Return type:
        List[str]

    Deprecated since version 0.1.0: Use darc.db.load_selenium() instead.
darc.process._load_last_word()
    Load data to queue.

    The function will copy the backup of the requests database
    _queue_requests.txt.tmp (if it exists) and the backup of the
    selenium database _queue_selenium.txt.tmp (if it exists) to the
    corresponding databases.

    The function will also save the process ID to the darc.pid PID file.
darc.process._signal_handler(signum=None, frame=None)
    Signal handler.

    The function will call _dump_last_word() to ensure a graceful exit.
    If the current process is not the main process, the function does
    nothing.

    Parameters:
        signum (Union[int, signal.Signals, None]) – The signal to handle.
        frame (types.FrameType) – The traceback frame from the signal.
darc.process.process()
    Main process.

    The function will register _signal_handler() for SIGTERM, and start
    the main process of the darc darkweb crawlers.

    The general process can be described as follows:

    1. process(): obtain URLs from the requests link database (c.f.
       load_requests()), and feed such URLs to crawler() with
       multiprocessing support.

    2. crawler(): parse the URL using parse_link(), and check if the URL
       needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST,
       LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with
       requests.

       If the URL is from a brand new host, darc will first try to fetch
       and save robots.txt and the sitemaps of the host (c.f.
       save_robots() and save_sitemap()), then extract and save the
       links from the sitemaps (c.f. read_sitemap()) into the link
       database for future crawling (c.f. save_requests()). Also, if the
       submission API is provided, submit_new_host() will be called to
       submit the documents just fetched.

       If robots.txt is present and FORCE is False, darc will check
       whether it is allowed to crawl the URL.

       Note: The root path (e.g. / in https://www.example.com/) will
       always be crawled, ignoring robots.txt.

       At this point, darc will call the customised hook function from
       darc.sites to crawl and get the final response object. darc will
       save the session cookies and header information using
       save_headers().

       Note: If requests.exceptions.InvalidSchema is raised, the link
       will be saved by save_invalid(); further processing is dropped.

       If the content type of the response document is not ignored (c.f.
       MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document
       using save_html() or save_file() accordingly. And if the
       submission API is provided, submit_requests() will be called to
       submit the document just fetched.

       If the response document is HTML (text/html or
       application/xhtml+xml), extract_links() will be called to extract
       all possible links from the HTML document and save them into the
       database (c.f. save_requests()).

       If the response status code is between 400 and 600, the URL will
       be saved back to the link database (c.f. save_requests()). If
       not, the URL will be saved into the selenium link database to
       proceed to the next steps (c.f. save_selenium()).

    3. process(): after the obtained URLs have all been crawled, darc
       will obtain URLs from the selenium link database (c.f.
       load_selenium()), and feed such URLs to loader().

    4. loader(): parse the URL using parse_link() and start loading the
       URL using selenium with Google Chrome.

       At this point, darc will call the customised hook function from
       darc.sites to load and return the original
       selenium.webdriver.Chrome object.

       If successful, the rendered source HTML document will be saved
       using save_html(), and a full-page screenshot will be taken and
       saved.

       If the submission API is provided, submit_selenium() will be
       called to submit the document just loaded.

       Later, extract_links() will be called to extract all possible
       links from the HTML document and save them into the requests
       database (c.f. save_requests()).

    If in reboot mode, i.e. REBOOT is True, the function will exit after
    the first round. If not, it will renew the Tor connections (if
    bootstrapped), c.f. renew_tor_session(), and start another round;
    a sketch of this loop follows below.
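A minimal sketch of the round-based main loop described above, using the
documented function names; signal handling, PID bookkeeping and error
recovery are omitted, and the multiprocessing details are simplified
assumptions:

    import multiprocessing

    from darc.const import DARC_CPU, REBOOT
    from darc.crawl import crawler, loader
    from darc.db import load_requests, load_selenium
    from darc.proxy.tor import renew_tor_session

    def process():
        """Round-based main loop, simplified."""
        while True:
            # crawl all pending URLs with requests
            with multiprocessing.Pool(processes=DARC_CPU) as pool:
                pool.map(crawler, load_requests())
            # then render the surviving URLs with selenium
            with multiprocessing.Pool(processes=DARC_CPU) as pool:
                pool.map(loader, load_selenium())
            if REBOOT:  # exit after the first round in reboot mode
                break
            renew_tor_session()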
Web Crawlers

The darc.crawl module provides two types of crawlers.
darc.crawl.crawler(url)
    Single requests crawler for an entry link.

    Parameters:
        url (str) – URL to be crawled by requests.

    The function will first parse the URL using parse_link(), and check
    if the URL needs to be crawled (c.f. PROXY_WHITE_LIST,
    PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl
    the URL with requests.

    If the URL is from a brand new host, darc will first try to fetch
    and save robots.txt and the sitemaps of the host (c.f. save_robots()
    and save_sitemap()), then extract and save the links from the
    sitemaps (c.f. read_sitemap()) into the link database for future
    crawling (c.f. save_requests()). Also, if the submission API is
    provided, submit_new_host() will be called to submit the documents
    just fetched.

    If robots.txt is present and FORCE is False, darc will check whether
    it is allowed to crawl the URL.

    Note: The root path (e.g. / in https://www.example.com/) will always
    be crawled, ignoring robots.txt.

    At this point, darc will call the customised hook function from
    darc.sites to crawl and get the final response object. darc will
    save the session cookies and header information using
    save_headers().

    Note: If requests.exceptions.InvalidSchema is raised, the link will
    be saved by save_invalid(); further processing is dropped.

    If the content type of the response document is not ignored (c.f.
    MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document
    using save_html() or save_file() accordingly. And if the submission
    API is provided, submit_requests() will be called to submit the
    document just fetched.

    If the response document is HTML (text/html or
    application/xhtml+xml), extract_links() will be called to extract
    all possible links from the HTML document and save them into the
    database (c.f. save_requests()).

    If the response status code is between 400 and 600, the URL will be
    saved back to the link database (c.f. save_requests()). If not, the
    URL will be saved into the selenium link database to proceed to the
    next steps (c.f. save_selenium()).
darc.crawl.loader(url)
    Single selenium loader for an entry link.

    Parameters:
        url (str) – URL to be loaded by selenium.

    The function will first parse the URL using parse_link() and start
    loading the URL using selenium with Google Chrome.

    At this point, darc will call the customised hook function from
    darc.sites to load and return the original selenium.webdriver.Chrome
    object.

    If successful, the rendered source HTML document will be saved using
    save_html(), and a full-page screenshot will be taken and saved.

    Note: When taking the full-page screenshot, loader() will use
    document.body.scrollHeight to get the total height of the web page.
    If the page height is less than 1,000 pixels, darc will by default
    set the height to 1,000 pixels.

    Later, darc will tell selenium to resize the window (in headless
    mode) to 1,024 pixels in width and 110% of the page height in
    height, and take a PNG screenshot.

    If the submission API is provided, submit_selenium() will be called
    to submit the document just loaded.

    Later, extract_links() will be called to extract all possible links
    from the HTML document and save them into the requests database
    (c.f. save_requests()).
URL Utilities

The Link class is the key data structure of the darc project; it
contains all information required to identify a URL's proxy type,
hostname, path prefix when saving, etc.

The darc.link module also provides several wrapper functions around
urllib.parse.
class darc.link.Link(url, proxy, url_parse, host, base, name)
    Bases: object

    Parsed link.

    Parameters:
        url (str) – original link
        proxy (str) – proxy type
        host (str) – URL's hostname
        base (str) – base folder for saving files
        name (str) – hashed link for saving files
        url_parse (urllib.parse.ParseResult) – parsed URL from
            urllib.parse.urlparse()

    Returns:
        Parsed link object.
    Return type:
        darc.link.Link

    base: str = None
        base folder for saving files

    host: str = None
        URL's hostname

    name: str = None
        hashed link for saving files

    proxy: str = None
        proxy type

    url: str = None
        original link

    url_parse: urllib.parse.ParseResult = None
        parsed URL from urllib.parse.urlparse()
darc.link.parse_link(link, host=None)
    Parse link.

    Parameters:
        link (str) – link to be parsed
        host (Optional[str]) – hostname of the link

    Returns:
        The parsed link object.
    Return type:
        darc.link.Link

    Note: If host is provided, it will override the hostname of the
    original link.

    The parsing process of the proxy type is as follows:

    1. If host is None and the parse result from
       urllib.parse.urlparse() has no netloc (or hostname) specified,
       set hostname as (null); else set it as is.
    2. If the scheme is data, then the link is a data URI; set hostname
       as data and proxy as data.
    3. If the scheme is javascript, then the link is some JavaScript
       code; set proxy as script.
    4. If the scheme is bitcoin, then the link is a Bitcoin address;
       set proxy as bitcoin.
    5. If the scheme is ed2k, then the link is an ED2K magnet link; set
       proxy as ed2k.
    6. If the scheme is magnet, then the link is a magnet link; set
       proxy as magnet.
    7. If the scheme is mailto, then the link is an email address; set
       proxy as mail.
    8. If the scheme is irc, then the link is an IRC link; set proxy as
       irc.
    9. If the scheme is NOT any of http or https, then set proxy to the
       scheme.
    10. If the host is None, set hostname to (null) and proxy to null.
    11. If the host is an onion (.onion) address, set proxy to tor.
    12. If the host is an I2P (.i2p) address, or any of localhost:7657
        and localhost:7658, set proxy to i2p.
    13. If the host is localhost on ZERONET_PORT, and the path is not /,
        i.e. NOT the root path, set proxy to zeronet; and set the first
        part of its path as hostname.

        Example: for a ZeroNet address, e.g.
        http://127.0.0.1:43110/1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D,
        parse_link() will parse the hostname as
        1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D.
    14. If the host is localhost on FREENET_PORT, and the path is not /,
        i.e. NOT the root path, set proxy to freenet; and set the first
        part of its path as hostname.

        Example: for a Freenet address, e.g.
        http://127.0.0.1:8888/USK@nwa8lHa271k2QvJ8aa0Ov7IHAV-DFOCFgmDt3X6BpCI,DuQSUZiI~agF8c-6tjsFFGuZ8eICrzWCILB60nT8KKo,AQACAAE/sone/77/,
        parse_link() will parse the hostname as
        USK@nwa8lHa271k2QvJ8aa0Ov7IHAV-DFOCFgmDt3X6BpCI,DuQSUZiI~agF8c-6tjsFFGuZ8eICrzWCILB60nT8KKo,AQACAAE.
    15. If none of the above cases is satisfied, the proxy will be set
        as null, marking it a plain normal link.

    The base for the parsed Link object is defined as

        <root>/<proxy>/<scheme>/<hostname>/

    where root is PATH_DB.

    The name for the parsed Link object is the sha256 hash (c.f.
    hashlib.sha256()) of the original link.
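As an illustration of the base and name rules above, a minimal sketch;
the attribute semantics follow the documented Link class, and PATH_DB
here is a stand-in for the configured data root:

    import hashlib
    import urllib.parse

    PATH_DB = 'data'  # assumed data root, c.f. darc.const.PATH_DB

    def base_and_name(url, proxy, hostname):
        """Compute the save folder and hashed file name for a URL."""
        parse = urllib.parse.urlparse(url)
        base = f'{PATH_DB}/{proxy}/{parse.scheme}/{hostname}/'
        name = hashlib.sha256(url.encode()).hexdigest()
        return base, name

    # e.g. a Tor link
    base, name = base_and_name('http://example.onion/page', 'tor',
                               'example.onion')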
darc.link.quote(string, safe='/', encoding=None, errors=None)
    Wrapper function for urllib.parse.quote().

    Parameters:
        string (AnyStr) – string to be quoted
        safe (AnyStr) – characters not to escape
        encoding (Optional[str]) – string encoding
        errors (Optional[str]) – encoding error handler

    Returns:
        The quoted string.
    Return type:
        str

    Note: The function suppresses possible errors when calling
    urllib.parse.quote(). If any occur, it will return the original
    string.
darc.link.unquote(string, encoding='utf-8', errors='replace')
    Wrapper function for urllib.parse.unquote().

    Parameters:
        string (AnyStr) – string to be unquoted
        encoding (str) – string encoding
        errors (str) – encoding error handler

    Returns:
        The unquoted string.
    Return type:
        str

    Note: The function suppresses possible errors when calling
    urllib.parse.unquote(). If any occur, it will return the original
    string.
darc.link.urljoin(base, url, allow_fragments=True)
    Wrapper function for urllib.parse.urljoin().

    Parameters:
        base (AnyStr) – base URL
        url (AnyStr) – URL to be joined
        allow_fragments (bool) – whether to allow fragments

    Returns:
        The joined URL.
    Return type:
        str

    Note: The function suppresses possible errors when calling
    urllib.parse.urljoin(). If any occur, it will return the
    concatenation base/url directly.
darc.link.urlparse(url, scheme='', allow_fragments=True)
    Wrapper function for urllib.parse.urlparse().

    Parameters:
        url (str) – URL to be parsed
        scheme (str) – URL scheme
        allow_fragments (bool) – whether to allow fragments

    Returns:
        The parse result.
    Return type:
        urllib.parse.ParseResult

    Note: The function suppresses possible errors when calling
    urllib.parse.urlparse(). If any occur, it will return
    urllib.parse.ParseResult(scheme=scheme, netloc='', path=url,
    params='', query='', fragment='') directly.
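The wrappers above all follow the same suppress-and-fall-back pattern; a
minimal sketch of how such a wrapper might look, illustrative rather
than the exact implementation:

    import contextlib
    import urllib.parse

    def urlparse(url, scheme='', allow_fragments=True):
        """urllib.parse.urlparse() that never raises."""
        with contextlib.suppress(ValueError):
            return urllib.parse.urlparse(url, scheme=scheme,
                                         allow_fragments=allow_fragments)
        # fall back to a result that carries the raw URL in its path
        return urllib.parse.ParseResult(scheme=scheme, netloc='', path=url,
                                        params='', query='', fragment='')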
Source Parsing

The darc.parse module provides auxiliary functions to read robots.txt,
sitemaps and HTML documents. It also contains utility functions to
check if the proxy type, hostname and content type are in any of the
black and white lists.
darc.parse._check(temp_list)
    Check hostname and proxy type of links.

    Parameters:
        temp_list (List[str]) – List of links to be checked.

    Returns:
        List of links matching the requirements.
    Return type:
        List[str]

    Note: If CHECK_NG is True, the function will directly call
    _check_ng() instead.
darc.parse._check_ng(temp_list)
    Check content type of links through HEAD requests.

    Parameters:
        temp_list (List[str]) – List of links to be checked.

    Returns:
        List of links matching the requirements.
    Return type:
        List[str]
darc.parse.check_robots(link)
    Check if link is allowed by robots.txt.

    Parameters:
        link (darc.link.Link) – The link object to be checked.

    Returns:
        If link is allowed by robots.txt.
    Return type:
        bool

    Note: The root path of a URL will always return True.
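A minimal sketch of such a check using the standard library; this is
illustrative, and the actual implementation may differ:

    import urllib.robotparser

    def check_robots(url, robots_text, user_agent='*'):
        """Return True if user_agent may fetch url per robots_text."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_text.splitlines())
        return parser.can_fetch(user_agent, url)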
darc.parse.extract_links(link, html, check=False)
    Extract links from an HTML document.

    Parameters:
        link (str) – Original link of the HTML document.
        html (Union[str, bytes]) – Content of the HTML document.
        check (bool) – Whether to perform checks on extracted links,
            default to CHECK.

    Returns:
        An iterator of extracted links.
    Return type:
        Iterator[str]
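A minimal sketch of link extraction using only the standard library; an
assumption for illustration, as darc's actual implementation may use a
dedicated HTML parser:

    import urllib.parse
    from html.parser import HTMLParser

    class _AnchorParser(HTMLParser):
        """Collect href attributes from <a> tags."""
        def __init__(self):
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.hrefs.append(value)

    def extract_links(link, html):
        """Yield absolute URLs found in <a href="..."> of the document."""
        if isinstance(html, bytes):  # the document may arrive as bytes
            html = html.decode('utf-8', errors='replace')
        parser = _AnchorParser()
        parser.feed(html)
        # resolve relative references against the origin document
        return (urllib.parse.urljoin(link, href) for href in parser.hrefs)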
darc.parse.get_content_type(response)
    Get content type from response.

    Parameters:
        response (requests.Response) – Response object.

    Returns:
        The content type from response.
    Return type:
        str

    Note: If the Content-Type header is not defined in response, the
    function will utilise magic to detect its content type.
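A minimal sketch of this fallback behaviour, assuming the python-magic
package provides the magic module hinted at above:

    import magic  # python-magic

    def get_content_type(response):
        """Prefer the Content-Type header; fall back to libmagic sniffing."""
        content_type = response.headers.get('Content-Type')
        if content_type is None:
            return magic.from_buffer(response.content, mime=True)
        # strip parameters such as '; charset=utf-8'
        return content_type.split(';', maxsplit=1)[0].strip().casefold()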
darc.parse.get_sitemap(link, text, host=None)
    Fetch links to other sitemaps from a sitemap.

    Parameters:
        link (str) – Original link to the sitemap.
        text (str) – Content of the sitemap.
        host (Optional[str]) – Hostname of the URL to the sitemap; the
            value may not be the same as in link.

    Returns:
        List of links to sitemaps.
    Return type:
        List[darc.link.Link]

    Note: As specified in the sitemap protocol, a sitemap may contain
    links to other sitemaps.
darc.parse.match_host(host)
    Check if a hostname is in the black list.

    Parameters:
        host (str) – Hostname to be checked.

    Returns:
        If host is in the black list.
    Return type:
        bool

    Note: If host is None, it will always return True.
darc.parse.match_mime(mime)
    Check if a content type is in the black list.

    Parameters:
        mime (str) – Content type to be checked.

    Returns:
        If mime is in the black list.
    Return type:
        bool
darc.parse.match_proxy(proxy)
    Check if a proxy type is in the black list.

    Parameters:
        proxy (str) – Proxy type to be checked.

    Returns:
        If proxy is in the black list.
    Return type:
        bool

    Note: If proxy is script, it will always return True.
darc.parse.read_robots(link, text, host=None)
    Read robots.txt to fetch links to sitemaps.

    Parameters:
        link (str) – Original link to robots.txt.
        text (str) – Content of robots.txt.
        host (Optional[str]) – Hostname of the URL to robots.txt; the
            value may not be the same as in link.

    Returns:
        List of links to sitemaps.
    Return type:
        List[darc.link.Link]

    Note: If the link to the sitemap is not specified in robots.txt,
    the fallback link /sitemap.xml will be used.
Source Saving

The darc.save module contains the core utilities for managing fetched
files and documents.

The data storage under the root path (PATH_DB) is typically as follows:
data
├── _queue_requests.txt
├── _queue_requests.txt.tmp
├── _queue_selenium.txt
├── _queue_selenium.txt.tmp
├── api
│ └── <proxy>
│ └── <scheme>
│ └── <hostname>
│ ├── new_host
│ │ └── <hash>_<timestamp>.json
│ ├── requests
│ │ └── <hash>_<timestamp>.json
│ └── selenium
│ └── <hash>_<timestamp>.json
├── link.csv
├── misc
│ ├── bitcoin.txt
│ ├── data
│ │ └── <hash>_<timestamp>.<ext>
│ ├── ed2k.txt
│ ├── invalid.txt
│ ├── irc.txt
│ ├── magnet.txt
│ └── mail.txt
└── <proxy>
└── <scheme>
└── <hostname>
├── <hash>_<timestamp>.dat
├── <hash>_<timestamp>.json
├── <hash>_<timestamp>_raw.html
├── <hash>_<timestamp>.html
├── <hash>_<timestamp>.png
├── robots.txt
└── sitemap_<hash>.xml
darc.save.has_folder(link)
    Check if the link is a new host.

    Parameters:
        link (darc.link.Link) – Link object to check.

    Returns:
        If link is a new host, return link.base. If not, return None.
    Return type:
        Optional[str]
darc.save.has_html(time, link)
    Check if we need to re-crawl the link with selenium.

    Parameters:
        time (datetime) – Timestamp to check the cached document against.
        link (darc.link.Link) – Link object to check.

    Returns:
        If there is no need, return the path to the document, i.e.
        <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.
        If re-crawling is needed, return None.
    Return type:
        Optional[str]
darc.save.has_raw(time, link)
    Check if we need to re-crawl the link with requests.

    Parameters:
        time (datetime) – Timestamp to check the cached document against.
        link (darc.link.Link) – Link object to check.

    Returns:
        If there is no need, return the path to the document, i.e.
        <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html,
        or <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.
        If re-crawling is needed, return None.
    Return type:
        Optional[str]
darc.save.has_robots(link)
    Check if robots.txt already exists.

    Parameters:
        link (darc.link.Link) – Link object to check.

    Returns:
        If robots.txt exists, return the path to robots.txt, i.e.
        <root>/<proxy>/<scheme>/<hostname>/robots.txt.
        If not, return None.
    Return type:
        Optional[str]
darc.save.has_sitemap(link)
    Check if the sitemap already exists.

    Parameters:
        link (darc.link.Link) – Link object to check.

    Returns:
        If the sitemap exists, return the path to the sitemap, i.e.
        <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.
        If not, return None.
    Return type:
        Optional[str]
darc.save.sanitise(link, time=None, raw=False, data=False, headers=False, screenshot=False)
    Sanitise link to path.

    Parameters:
        link (darc.link.Link) – Link object to sanitise the path for.
        time (datetime) – Timestamp for the path.
        raw (bool) – If this is a raw HTML document from requests.
        data (bool) – If this is a generic content type document.
        headers (bool) – If this is response headers from requests.
        screenshot (bool) – If this is the screenshot from selenium.

    Returns:
        If raw is True,
            <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.
        If data is True,
            <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.
        If headers is True,
            <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.
        If screenshot is True,
            <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png.
        If none of the above,
            <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.
    Return type:
        str
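A minimal sketch of the suffix selection described above; link.base and
link.name follow the documented Link attributes, while the timestamp
formatting is an assumption:

    import os

    def sanitise(link, time=None, raw=False, data=False,
                 headers=False, screenshot=False):
        """Map a Link object to its on-disk path."""
        timestamp = time.isoformat() if time is not None else 'null'
        if raw:
            suffix = f'_{timestamp}_raw.html'
        elif data:
            suffix = f'_{timestamp}.dat'
        elif headers:
            suffix = f'_{timestamp}.json'
        elif screenshot:
            suffix = f'_{timestamp}.png'
        else:
            suffix = f'_{timestamp}.html'
        return os.path.join(link.base, f'{link.name}{suffix}')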
darc.save.save_file(time, link, content)
    Save file.

    The function will also try to make symbolic links from the saved
    file's standard path to the relative path as in the URL.

    Parameters:
        time (datetime) – Timestamp of the generic file.
        link (darc.link.Link) – Link object of the original URL.
        content (bytes) – Content of the generic file.

    Returns:
        Saved path to the generic content type file, i.e.
        <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.
    Return type:
        str
darc.save.save_headers(time, link, response, session)
    Save HTTP response headers.

    Parameters:
        time (datetime) – Timestamp of the response.
        link (darc.link.Link) – Link object of the response.
        response (requests.Response) – Response object to be saved.
        session (requests.Session) – Session object of the response.

    Returns:
        Saved path to the response headers, i.e.
        <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.
    Return type:
        str

    The JSON data saved is as follows:

        {
            "[metadata]": {
                "url": "...",
                "proxy": "...",
                "host": "...",
                "base": "...",
                "name": "..."
            },
            "Timestamp": "...",
            "URL": "...",
            "Method": "GET",
            "Status-Code": "...",
            "Reason": "...",
            "Cookies": {"...": "..."},
            "Session": {"...": "..."},
            "Request": {"...": "..."},
            "Response": {"...": "..."}
        }
darc.save.save_html(time, link, html, raw=False)
    Save response.

    Parameters:
        time (datetime) – Timestamp of the HTML document.
        link (darc.link.Link) – Link object of the original URL.
        html (Union[str, bytes]) – Content of the HTML document.
        raw (bool) – If the document is fetched from requests.

    Returns:
        Saved path to the HTML document. If raw is True,
        <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.
        If not,
        <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.
    Return type:
        str
darc.save.save_link(link)
    Save link to the link hash database link.csv.

    The CSV file has the following fields:

        proxy type: link.proxy
        URL scheme: link.url_parse.scheme
        hostname: link.base
        link hash: link.name
        original URL: link.url

    Parameters:
        link (darc.link.Link) – Link object to be saved.
darc.save.save_robots(link, text)
    Save robots.txt.

    Parameters:
        link (darc.link.Link) – Link object of robots.txt.
        text (str) – Content of robots.txt.

    Returns:
        Saved path to robots.txt, i.e.
        <root>/<proxy>/<scheme>/<hostname>/robots.txt.
    Return type:
        str
darc.save.save_sitemap(link, text)
    Save sitemap.

    Parameters:
        link (darc.link.Link) – Link object of the sitemap.
        text (str) – Content of the sitemap.

    Returns:
        Saved path to the sitemap, i.e.
        <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.
    Return type:
        str
darc.save._SAVE_LOCK: multiprocessing.Lock
    I/O lock for saving the link hash database link.csv.
Link Database

The darc project utilises a file-system based database to provide
inter-process communication.

Note: In its first implementation, the darc project used
multiprocessing.Queue to support such communication. However, as
noticed at runtime, the multiprocessing.Queue object is much affected
by the lack of memory.

There will be two databases, both located at the root of the data
storage path PATH_DB: the requests database _queue_requests.txt and
the selenium database _queue_selenium.txt.

At runtime, after reading such a database, darc will keep a backup of
the database with a .tmp suffix appended to its file extension.
darc.db.load_requests()
    Load links from the requests database.

    After loading, darc will back up the original database
    _queue_requests.txt as _queue_requests.txt.tmp and empty the loaded
    database.

    Returns:
        List of loaded links from the requests database.
    Return type:
        List[str]

    Note: Lines starting with # will be considered comments. Empty
    lines and comment lines will be ignored when loading. A sketch of
    this behaviour follows below.
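A minimal sketch of the load-and-backup behaviour; the path follows
PATH_QR as documented further below, and error handling is omitted:

    import shutil

    PATH_QR = 'data/_queue_requests.txt'  # c.f. darc.const.PATH_QR

    def load_requests():
        """Load pending links, keeping a .tmp backup of the database."""
        shutil.copy(PATH_QR, f'{PATH_QR}.tmp')  # backup before truncation
        with open(PATH_QR) as file:
            lines = [line.strip() for line in file]
        open(PATH_QR, 'w').close()              # empty the loaded database
        # drop blank lines and '#' comments
        return [line for line in lines
                if line and not line.startswith('#')]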
darc.db.load_selenium()
    Load links from the selenium database.

    After loading, darc will back up the original database
    _queue_selenium.txt as _queue_selenium.txt.tmp and empty the loaded
    database.

    Returns:
        List of loaded links from the selenium database.
    Return type:
        List[str]

    Note: Lines starting with # will be considered comments. Empty
    lines and comment lines will be ignored when loading.
darc.db.save_requests(entries, single=False)
    Save links to the requests database.

    Parameters:
        entries (Iterable[str]) – Links to be added to the requests
            database. It can be either an iterable of links, or a
            single link string (if single is set to True).
        single (bool) – Indicates whether entries is an iterable of
            links or a single link string.
darc.db.save_selenium(entries, single=False)
    Save links to the selenium database.

    Parameters:
        entries (Iterable[str]) – Links to be added to the selenium
            database. It can be either an iterable of links, or a
            single link string (if single is set to True).
        single (bool) – Indicates whether entries is an iterable of
            links or a single link string.
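A minimal sketch of the save side; QR_LOCK and PATH_QR follow the
documented lock and path names, but the append-based storage is an
assumption:

    import multiprocessing

    PATH_QR = 'data/_queue_requests.txt'  # c.f. darc.const.PATH_QR
    QR_LOCK = multiprocessing.Lock()      # c.f. darc.db.QR_LOCK

    def save_requests(entries, single=False):
        """Append links to the requests database under the I/O lock."""
        if single:
            entries = [entries]
        with QR_LOCK:
            with open(PATH_QR, 'a') as file:
                for link in entries:
                    print(link, file=file)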
darc.db.QR_LOCK: multiprocessing.Lock
    I/O lock for the requests database _queue_requests.txt.
darc.db.QS_LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
    I/O lock for the selenium database _queue_selenium.txt.

    If FLAG_MP is True, it will be an instance of multiprocessing.Lock.
    If FLAG_TH is True, it will be an instance of threading.Lock.
    If neither, it will be an instance of contextlib.nullcontext.
Data Submission

The darc project integrates the capability of submitting fetched data
and information to a web server, to support real-time cross-analysis
and status display.

There are three submission events:

    New Host Submission – API_NEW_HOST
        Submitted in a crawler() function call, when the crawling URL
        is marked as a new host.

    Requests Submission – API_REQUESTS
        Submitted in a crawler() function call, after the crawling
        process of the URL using requests.

    Selenium Submission – API_SELENIUM
        Submitted in a loader() function call, after the loading
        process of the URL using selenium.
darc.submit.get_html(link, time)
    Read the HTML document.

    Parameters:
        link (darc.link.Link) – Link object to read the document
            rendered by selenium for.
        time (str) – Timestamp of the document.

    Returns:
        If the document exists, return the data from the document:

            path – relative path from the document to the root of data
                storage PATH_DB, i.e.
                <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html
            data – base64 encoded content of the document

        If not, return None.
    Return type:
        Optional[Dict[str, Union[str, ByteString]]]
darc.submit.get_metadata(link)
    Generate the metadata field.

    Parameters:
        link (darc.link.Link) – Link object to generate metadata for.

    Returns:
        The metadata from link:

            url – original URL, link.url
            proxy – proxy type, link.proxy
            host – hostname, link.host
            base – base path, link.base
            name – link hash, link.name
    Return type:
        Dict[str, str]
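A minimal sketch of the metadata field; the attribute names follow the
documented Link class:

    def get_metadata(link):
        """Generate the [metadata] field for a submission payload."""
        return {
            'url': link.url,      # original URL
            'proxy': link.proxy,  # proxy type
            'host': link.host,    # hostname
            'base': link.base,    # base path
            'name': link.name,    # link hash
        }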
darc.submit.get_raw(link, time)
    Read the raw document.

    Parameters:
        link (darc.link.Link) – Link object to read the document
            fetched by requests for.
        time (str) – Timestamp of the document.

    Returns:
        If the document exists, return the data from the document:

            path – relative path from the document to the root of data
                storage PATH_DB, i.e.
                <proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html
                or <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat
            data – base64 encoded content of the document

        If not, return None.
    Return type:
        Optional[Dict[str, Union[str, ByteString]]]
darc.submit.get_robots(link)
    Read robots.txt.

    Parameters:
        link (darc.link.Link) – Link object to read robots.txt for.

    Returns:
        If robots.txt exists, return the data from robots.txt:

            path – relative path from robots.txt to the root of data
                storage PATH_DB, i.e.
                <proxy>/<scheme>/<hostname>/robots.txt
            data – base64 encoded content of robots.txt

        If not, return None.
    Return type:
        Optional[Dict[str, Union[str, ByteString]]]
darc.submit.get_screenshot(link, time)
    Read the screenshot picture.

    Parameters:
        link (darc.link.Link) – Link object to read the screenshot
            taken by selenium for.
        time (str) – Timestamp of the screenshot.

    Returns:
        If the screenshot exists, return the data from the screenshot:

            path – relative path from the screenshot to the root of
                data storage PATH_DB, i.e.
                <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png
            data – base64 encoded content of the screenshot

        If not, return None.
    Return type:
        Optional[Dict[str, Union[str, ByteString]]]
darc.submit.get_sitemap(link)
    Read sitemaps.

    Parameters:
        link (darc.link.Link) – Link object to read sitemaps for.

    Returns:
        If sitemaps exist, return a list of the data from the sitemaps:

            path – relative path from the sitemap to the root of data
                storage PATH_DB, i.e.
                <proxy>/<scheme>/<hostname>/sitemap_<hash>.xml
            data – base64 encoded content of the sitemap

        If not, return None.
    Return type:
        Optional[List[Dict[str, Union[str, ByteString]]]]
darc.submit.save_submit(domain, data)
    Save failed submission data.

    Parameters:
        domain ('new_host', 'requests' or 'selenium') – Domain of the
            submission data.
        data (Dict[str, Any]) – Submission data.
darc.submit.submit(api, domain, data)
    Submit data.

    Parameters:
        api (str) – API URL.
        domain ('new_host', 'requests' or 'selenium') – Domain of the
            submission data.
        data (Dict[str, Any]) – Submission data.
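A minimal sketch of the submission flow with retries and a local
fallback; API_RETRY and save_submit() follow the documented names, but
the request format and the failed.json file layout in the stub are
assumptions:

    import json
    import os

    import requests

    API_RETRY = 3  # c.f. darc.submit.API_RETRY

    def save_submit(domain, data):
        """Hypothetical stand-in for the documented local fallback."""
        os.makedirs(os.path.join('data', 'api', domain), exist_ok=True)
        with open(os.path.join('data', 'api', domain,
                               'failed.json'), 'w') as file:
            json.dump(data, file)

    def submit(api, domain, data):
        """POST the payload; fall back to disk if all attempts fail."""
        for _ in range(API_RETRY + 1):
            try:
                if requests.post(api, json=data).ok:
                    return
            except requests.RequestException:
                continue
        save_submit(domain, data)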
darc.submit.submit_new_host(time, link)
    Submit a new host.

    When a new host is discovered, the darc crawler will submit the
    host information. Such information includes robots.txt (if it
    exists) and sitemap.xml (if any).

    Parameters:
        time (datetime.datetime) – Timestamp of submission.
        link (darc.link.Link) – Link object of submission.

    If API_NEW_HOST is None, the data for submission will directly be
    saved through save_submit().

    The data submitted should have the following format:

        {
            // metadata of URL
            "[metadata]": {
                // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
                "url": ...,
                // proxy type - null / tor / i2p / zeronet / freenet
                "proxy": ...,
                // hostname / netloc, c.f. ``urllib.parse.urlparse``
                "host": ...,
                // base folder, relative path (to data root path ``PATH_DATA``)
                // in container - <proxy>/<scheme>/<host>
                "base": ...,
                // sha256 of URL as name for saved files (timestamp is in ISO format)
                // JSON log as this one - <base>/<name>_<timestamp>.json
                // HTML from requests - <base>/<name>_<timestamp>_raw.html
                // HTML from selenium - <base>/<name>_<timestamp>.html
                // generic data files - <base>/<name>_<timestamp>.dat
                "name": ...
            },
            // requested timestamp in ISO format as in name of saved file
            "Timestamp": ...,
            // original URL
            "URL": ...,
            // robots.txt from the host (if not exists, then ``null``)
            "Robots": {
                // path of the file, relative path (to data root path ``PATH_DATA``) in container
                // - <proxy>/<scheme>/<host>/robots.txt
                "path": ...,
                // content of the file (**base64** encoded)
                "data": ...
            },
            // sitemaps from the host (if none, then ``null``)
            "Sitemaps": [
                {
                    // path of the file, relative path (to data root path ``PATH_DATA``) in container
                    // - <proxy>/<scheme>/<host>/sitemap_<name>.txt
                    "path": ...,
                    // content of the file (**base64** encoded)
                    "data": ...
                },
                ...
            ],
            // hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)
            "Hosts": {
                // path of the file, relative path (to data root path ``PATH_DATA``) in container
                // - <proxy>/<scheme>/<host>/hosts.txt
                "path": ...,
                // content of the file (**base64** encoded)
                "data": ...
            }
        }
darc.submit.submit_requests(time, link, response, session)
    Submit requests data.

    When crawling, we'll first fetch the URL using requests, to check
    its availability and to save its HTTP header information. Such
    information will be submitted to the web UI.

    Parameters:
        time (datetime.datetime) – Timestamp of submission.
        link (darc.link.Link) – Link object of submission.
        response (requests.Response) – Response object of submission.
        session (requests.Session) – Session object of submission.

    If API_REQUESTS is None, the data for submission will directly be
    saved through save_submit().

    The data submitted should have the following format:

        {
            // metadata of URL
            "[metadata]": {
                // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
                "url": ...,
                // proxy type - null / tor / i2p / zeronet / freenet
                "proxy": ...,
                // hostname / netloc, c.f. ``urllib.parse.urlparse``
                "host": ...,
                // base folder, relative path (to data root path ``PATH_DATA``)
                // in container - <proxy>/<scheme>/<host>
                "base": ...,
                // sha256 of URL as name for saved files (timestamp is in ISO format)
                // JSON log as this one - <base>/<name>_<timestamp>.json
                // HTML from requests - <base>/<name>_<timestamp>_raw.html
                // HTML from selenium - <base>/<name>_<timestamp>.html
                // generic data files - <base>/<name>_<timestamp>.dat
                "name": ...
            },
            // requested timestamp in ISO format as in name of saved file
            "Timestamp": ...,
            // original URL
            "URL": ...,
            // request method
            "Method": "GET",
            // response status code
            "Status-Code": ...,
            // response reason
            "Reason": ...,
            // response cookies (if any)
            "Cookies": {...},
            // session cookies (if any)
            "Session": {...},
            // request headers (if any)
            "Request": {...},
            // response headers (if any)
            "Response": {...},
            // requested file (if not exists, then ``null``)
            "Document": {
                // path of the file, relative path (to data root path ``PATH_DATA``) in container
                // - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html
                // or if the document is of generic content type, i.e. not HTML
                // - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat
                "path": ...,
                // content of the file (**base64** encoded)
                "data": ...
            }
        }
darc.submit.submit_selenium(time, link)
    Submit selenium data.

    After crawling with requests, we'll then render the URL using
    selenium with Google Chrome and its web driver, to provide a fully
    rendered web page. Such information will be submitted to the web UI.

    Parameters:
        time (datetime.datetime) – Timestamp of submission.
        link (darc.link.Link) – Link object of submission.

    If API_SELENIUM is None, the data for submission will directly be
    saved through save_submit().

    Note: This information is optional, and only provided if the
    content type from requests is HTML, the status code is not between
    400 and 600, and the HTML data is not empty.

    The data submitted should have the following format:

        {
            // metadata of URL
            "[metadata]": {
                // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
                "url": ...,
                // proxy type - null / tor / i2p / zeronet / freenet
                "proxy": ...,
                // hostname / netloc, c.f. ``urllib.parse.urlparse``
                "host": ...,
                // base folder, relative path (to data root path ``PATH_DATA``)
                // in container - <proxy>/<scheme>/<host>
                "base": ...,
                // sha256 of URL as name for saved files (timestamp is in ISO format)
                // JSON log as this one - <base>/<name>_<timestamp>.json
                // HTML from requests - <base>/<name>_<timestamp>_raw.html
                // HTML from selenium - <base>/<name>_<timestamp>.html
                // generic data files - <base>/<name>_<timestamp>.dat
                "name": ...
            },
            // requested timestamp in ISO format as in name of saved file
            "Timestamp": ...,
            // original URL
            "URL": ...,
            // rendered HTML document (if not exists, then ``null``)
            "Document": {
                // path of the file, relative path (to data root path ``PATH_DATA``) in container
                // - <proxy>/<scheme>/<host>/<name>_<timestamp>.html
                "path": ...,
                // content of the file (**base64** encoded)
                "data": ...
            },
            // web page screenshot (if not exists, then ``null``)
            "Screenshot": {
                // path of the file, relative path (to data root path ``PATH_DATA``) in container
                // - <proxy>/<scheme>/<host>/<name>_<timestamp>.png
                "path": ...,
                // content of the file (**base64** encoded)
                "data": ...
            }
        }
darc.submit.PATH_API = '{PATH_DB}/api/'
    Path to the API submission records, i.e. the api folder under the
    root of data storage.
darc.submit.API_RETRY: int
    Retry times for API submission on failure.

    Default: 3
    Environ:

darc.submit.API_NEW_HOST: str
    API URL for submit_new_host().

    Default: None
    Environ:

darc.submit.API_REQUESTS: str
    API URL for submit_requests().

    Default: None
    Environ:

darc.submit.API_SELENIUM: str
    API URL for submit_selenium().

    Default: None
    Environ:
Note: If API_NEW_HOST, API_REQUESTS or API_SELENIUM is None, the
corresponding submit function will save the JSON data in the path
specified by PATH_API.

See also: The darc project provides a demo on how to implement a
darc-compliant web backend for the data submission module. See the
demo page for more information.
Requests Wrapper

The darc.requests module wraps around the requests module, and
provides a simple interface for the darc project.
darc.requests.i2p_session(futures=False)
    I2P (.i2p) session.

    Parameters:
        futures (bool) – If True, return a
            requests_futures.FuturesSession.

    Returns:
        The session object with I2P proxy settings.
    Return type:
        requests.Session

    See also: darc.proxy.i2p.I2P_REQUESTS_PROXY
darc.requests.null_session(futures=False)
    No proxy session.

    Parameters:
        futures (bool) – If True, return a
            requests_futures.FuturesSession.

    Returns:
        The session object with no proxy settings.
    Return type:
        requests.Session
darc.requests.request_session(link, futures=False)
    Get a requests session.

    Parameters:
        link (darc.link.Link) – Link requesting a requests.Session.
        futures (bool) – If True, return a
            requests_futures.FuturesSession.

    Returns:
        The session object with the corresponding proxy settings.
    Return type:
        requests.Session

    Raises:
        UnsupportedLink – If the proxy type of link is not specified in
            the LINK_MAP.
darc.requests.tor_session(futures=False)
    Tor (.onion) session.

    Parameters:
        futures (bool) – If True, return a
            requests_futures.FuturesSession.

    Returns:
        The session object with Tor proxy settings.
    Return type:
        requests.Session

    See also: darc.proxy.tor.TOR_REQUESTS_PROXY
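A minimal sketch of a Tor-proxied session; the SOCKS port 9050 is the
conventional Tor default and an assumption here, as darc derives its
actual proxy settings from the bootstrap configuration:

    import requests  # the SOCKS scheme requires requests[socks]

    TOR_REQUESTS_PROXY = {  # c.f. darc.proxy.tor.TOR_REQUESTS_PROXY
        'http': 'socks5h://localhost:9050',
        'https': 'socks5h://localhost:9050',
    }

    def tor_session():
        """Session whose traffic is routed through the Tor SOCKS proxy."""
        session = requests.Session()
        session.proxies.update(TOR_REQUESTS_PROXY)
        return session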
Selenium Wrapper

The darc.selenium module wraps around the selenium module, and
provides a simple interface for the darc project.
darc.selenium.get_capabilities(type='null')
    Generate desired capabilities.

    Parameters:
        type (str) – Proxy type for the capabilities.

    Returns:
        The desired capabilities for the web driver
        selenium.webdriver.Chrome.
    Return type:
        dict

    Raises:
        UnsupportedProxy – If the proxy type is NOT null, tor or i2p.

    See also:
        darc.proxy.tor.TOR_SELENIUM_PROXY
        darc.proxy.i2p.I2P_SELENIUM_PROXY
darc.selenium.get_options(type='null')
    Generate options.

    Parameters:
        type (str) – Proxy type for the options.

    Returns:
        The options for the web driver selenium.webdriver.Chrome.
    Return type:
        selenium.webdriver.ChromeOptions

    Raises:
        UnsupportedPlatform – If the operating system is NOT macOS or
            Linux.
        UnsupportedProxy – If the proxy type is NOT null, tor or i2p.

    See also:
        darc.proxy.tor.TOR_PORT
        darc.proxy.i2p.I2P_PORT

    References:
        Disable sandbox (--no-sandbox) when running as the root user.
        Disable usage of /dev/shm.
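A minimal sketch of assembling such options; the sandbox and /dev/shm
flags follow the references above, while the headless flag is an
assumption based on the loader's headless-mode description:

    from selenium import webdriver

    def get_options():
        """Chrome options suitable for containerised, headless crawling."""
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')             # required as root
        options.add_argument('--disable-dev-shm-usage')  # avoid small /dev/shm
        return options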
darc.selenium.i2p_driver()
    I2P (.i2p) driver.

    Returns:
        The web driver object with I2P proxy settings.
    Return type:
        selenium.webdriver.Chrome

darc.selenium.null_driver()
    No proxy driver.

    Returns:
        The web driver object with no proxy settings.
    Return type:
        selenium.webdriver.Chrome

darc.selenium.request_driver(link)
    Get a selenium driver.

    Parameters:
        link (darc.link.Link) – Link requesting a
            selenium.webdriver.Chrome.

    Returns:
        The web driver object with the corresponding proxy settings.
    Return type:
        selenium.webdriver.Chrome

    Raises:
        UnsupportedLink – If the proxy type of link is not specified in
            the LINK_MAP.

darc.selenium.tor_driver()
    Tor (.onion) driver.

    Returns:
        The web driver object with Tor proxy settings.
    Return type:
        selenium.webdriver.Chrome
Proxy Utilities

The darc.proxy module provides various proxy support to the darc
project.
Bitcoin Addresses

The darc.proxy.bitcoin module contains the auxiliary functions around
managing and processing Bitcoin addresses.

Currently, the darc project directly saves the Bitcoin addresses
extracted to the data storage file PATH without further processing.

darc.proxy.bitcoin.save_bitcoin(link)
    Save a Bitcoin address.

    The function will save the Bitcoin address to the file as defined
    in PATH.

    Parameters:
        link (darc.link.Link) – Link object representing the Bitcoin
            address.

darc.proxy.bitcoin.PATH = '{PATH_MISC}/bitcoin.txt'
    Path to the data storage of Bitcoin addresses.
Data URI Schemes

The darc.proxy.data module contains the auxiliary functions around
managing and processing data URI schemes.

Currently, the darc project directly saves the data URI schemes
extracted to the data storage path PATH without further processing.

darc.proxy.data.save_data(link)
    Save a data URI.

    The function will save the data URI to the data storage as defined
    in PATH.

    Parameters:
        link (darc.link.Link) – Link object representing the data URI.

darc.proxy.data.PATH = '{PATH_MISC}/data/'
    Path to the data storage of data URI schemes.
ED2K Magnet Links

The darc.proxy.ed2k module contains the auxiliary functions around
managing and processing ED2K magnet links.

Currently, the darc project directly saves the ED2K magnet links
extracted to the data storage file PATH without further processing.

darc.proxy.ed2k.save_ed2k(link)
    Save an ED2K magnet link.

    The function will save the ED2K magnet link to the file as defined
    in PATH.

    Parameters:
        link (darc.link.Link) – Link object representing the ED2K
            magnet link.

darc.proxy.ed2k.PATH = '{PATH_MISC}/ed2k.txt'
    Path to the data storage of ED2K magnet links.
Freenet Proxy

The darc.proxy.freenet module contains the auxiliary functions around
managing and processing the Freenet proxy.

darc.proxy.freenet._freenet_bootstrap()
    Freenet bootstrap.

    The bootstrap arguments are defined as _FREENET_ARGS.

    Raises:
        subprocess.CalledProcessError – If the return code of
            _FREENET_PROC is non-zero.

darc.proxy.freenet.freenet_bootstrap()
    Bootstrap wrapper for Freenet.

    The function will bootstrap the Freenet proxy. It will retry
    FREENET_RETRY times in case of failure.

    Also, it will NOT re-bootstrap the proxy, as is guaranteed by
    _FREENET_BS_FLAG.

    Warns:
        FreenetBootstrapFailed – If failed to bootstrap the Freenet
            proxy.
    Raises:
        UnsupportedPlatform – If the system is not supported, i.e. not
            macOS or Linux.

darc.proxy.freenet.has_freenet(link_pool)
    Check if the pool contains Freenet links.

    Parameters:
        link_pool (Iterable[str]) – Link pool to check.
    Returns:
        If the link pool contains Freenet links.
    Return type:
        bool

The following constants are configurable through environment variables:

darc.proxy.freenet.FREENET_PORT: int
    Port for the Freenet proxy connection.

    Default: 8888
    Environ:

darc.proxy.freenet.FREENET_RETRY: int
    Retry times for Freenet bootstrap on failure.

    Default: 3
    Environ:

darc.proxy.freenet.BS_WAIT: float
    Time after which the attempt to start Freenet is aborted.

    Default: 90
    Environ: FREENET_WAIT

    Note: If not provided, there will be NO timeout.

darc.proxy.freenet.FREENET_PATH: str
    Path to the Freenet project.

    Default: /usr/local/src/freenet
    Environ:

darc.proxy.freenet.FREENET_ARGS: List[str]
    Freenet bootstrap arguments for run.sh start.

    If provided, it should be parsed as command line arguments (c.f.
    shlex.split()).

    Default: ''
    Environ:

    Note: The command will be run as DARC_USER if the current user
    (c.f. getpass.getuser()) is root.

The following constants are defined for internal usage:

darc.proxy.freenet._FREENET_BS_FLAG: bool
    Whether the Freenet proxy is bootstrapped.

darc.proxy.freenet._FREENET_PROC: subprocess.Popen
    Freenet proxy process running in the background.

darc.proxy.freenet._FREENET_ARGS: List[str]
    Freenet proxy bootstrap arguments.
I2P Proxy

The darc.proxy.i2p module contains the auxiliary functions around
managing and processing the I2P proxy.

darc.proxy.i2p._i2p_bootstrap()
    I2P bootstrap.

    The bootstrap arguments are defined as _I2P_ARGS.

    Raises:
        subprocess.CalledProcessError – If the return code of _I2P_PROC
            is non-zero.

darc.proxy.i2p.fetch_hosts(link)
    Fetch hosts.txt.

    Parameters:
        link (darc.link.Link) – Link object to fetch hosts.txt for.

darc.proxy.i2p.get_hosts(link)
    Read hosts.txt.

    Parameters:
        link (darc.link.Link) – Link object to read hosts.txt for.

    Returns:
        If hosts.txt exists, return the data from hosts.txt:

            path – relative path from hosts.txt to the root of data
                storage PATH_DB, i.e.
                <proxy>/<scheme>/<hostname>/hosts.txt
            data – base64 encoded content of hosts.txt

        If not, return None.
    Return type:
        Optional[Dict[str, Union[str, ByteString]]]

darc.proxy.i2p.has_hosts(link)
    Check if hosts.txt already exists.

    Parameters:
        link (darc.link.Link) – Link object to check.

    Returns:
        If hosts.txt exists, return the path to hosts.txt, i.e.
        <root>/<proxy>/<scheme>/<hostname>/hosts.txt.
        If not, return None.
    Return type:
        Optional[str]

darc.proxy.i2p.has_i2p(link_pool)
    Check if the pool contains I2P links.

    Parameters:
        link_pool (Set[str]) – Link pool to check.
    Returns:
        If the link pool contains I2P links.
    Return type:
        bool

darc.proxy.i2p.i2p_bootstrap()
    Bootstrap wrapper for I2P.

    The function will bootstrap the I2P proxy. It will retry I2P_RETRY
    times in case of failure.

    Also, it will NOT re-bootstrap the proxy, as is guaranteed by
    _I2P_BS_FLAG.

    Warns:
        I2PBootstrapFailed – If failed to bootstrap the I2P proxy.
    Raises:
        UnsupportedPlatform – If the system is not supported, i.e. not
            macOS or Linux.

darc.proxy.i2p.read_hosts(text, check=False)
    Read hosts.txt.

    Parameters:
        text (Iterable[str]) – Content of hosts.txt.
        check (bool) – Whether to perform checks on extracted links,
            default to CHECK.

    Returns:
        List of links extracted.
    Return type:
        Iterable[str]

darc.proxy.i2p.save_hosts(link, text)
    Save hosts.txt.

    Parameters:
        link (darc.link.Link) – Link object of hosts.txt.
        text (str) – Content of hosts.txt.

    Returns:
        Saved path to hosts.txt, i.e.
        <root>/<proxy>/<scheme>/<hostname>/hosts.txt.
    Return type:
        str

darc.proxy.i2p.I2P_REQUESTS_PROXY: Dict[str, Any]
    Proxy for I2P sessions.

darc.proxy.i2p.I2P_SELENIUM_PROXY: selenium.webdriver.Proxy
    Proxy (selenium.webdriver.Proxy) for I2P web drivers.

The following constants are configurable through environment variables:

darc.proxy.i2p.I2P_RETRY: int
    Retry times for I2P bootstrap on failure.

    Default: 3
    Environ:

darc.proxy.i2p.BS_WAIT: float
    Time after which the attempt to start I2P is aborted.

    Default: 90
    Environ: I2P_WAIT

    Note: If not provided, there will be NO timeout.

darc.proxy.i2p.I2P_ARGS: List[str]
    I2P bootstrap arguments for i2prouter start.

    If provided, it should be parsed as command line arguments (c.f.
    shlex.split()).

    Default: ''
    Environ:

    Note: The command will be run as DARC_USER if the current user
    (c.f. getpass.getuser()) is root.

The following constants are defined for internal usage:

darc.proxy.i2p._I2P_BS_FLAG: bool
    Whether the I2P proxy is bootstrapped.

darc.proxy.i2p._I2P_PROC: subprocess.Popen
    I2P proxy process running in the background.

darc.proxy.i2p._I2P_ARGS: List[str]
    I2P proxy bootstrap arguments.
IRC Addresses

The darc.proxy.irc module contains the auxiliary functions around
managing and processing IRC addresses.

Currently, the darc project directly saves the IRC addresses extracted
to the data storage file PATH without further processing.

darc.proxy.irc.save_irc(link)
    Save an IRC address.

    The function will save the IRC address to the file as defined in
    PATH.

    Parameters:
        link (darc.link.Link) – Link object representing the IRC
            address.

darc.proxy.irc.PATH = '{PATH_MISC}/irc.txt'
    Path to the data storage of IRC addresses.
Magnet Links

The darc.proxy.magnet module contains the auxiliary functions around
managing and processing magnet links.

Currently, the darc project directly saves the magnet links extracted
to the data storage file PATH without further processing.

darc.proxy.magnet.save_magnet(link)
    Save a magnet link.

    The function will save the magnet link to the file as defined in
    PATH.

    Parameters:
        link (darc.link.Link) – Link object representing the magnet
            link.

darc.proxy.magnet.PATH = '{PATH_MISC}/magnet.txt'
    Path to the data storage of magnet links.
Email Addresses

The darc.proxy.mail module contains the auxiliary functions around
managing and processing email addresses.

Currently, the darc project directly saves the email addresses
extracted to the data storage file PATH without further processing.

darc.proxy.mail.save_mail(link)
    Save an email address.

    The function will save the email address to the file as defined in
    PATH.

    Parameters:
        link (darc.link.Link) – Link object representing the email
            address.

darc.proxy.mail.PATH = '{PATH_MISC}/mail.txt'
    Path to the data storage of email addresses.
No Proxy

The darc.proxy.null module contains the auxiliary functions around
managing and processing normal websites with no proxy.

darc.proxy.null.fetch_sitemap(link)
    Fetch sitemap.

    The function will first fetch the robots.txt, then fetch the
    sitemaps accordingly.

    Parameters:
        link (darc.link.Link) – Link object to fetch sitemaps for.

darc.proxy.null.save_invalid(link)
    Save a link with an invalid scheme.

    The function will save the link with an invalid scheme to the file
    as defined in PATH.

    Parameters:
        link (darc.link.Link) – Link object representing the link with
            an invalid scheme.

darc.proxy.null.PATH = '{PATH_MISC}/invalid.txt'
    Path to the data storage of links with invalid schemes.
Tor Proxy

The darc.proxy.tor module contains the auxiliary functions around
managing and processing the Tor proxy.

darc.proxy.tor._tor_bootstrap()
    Tor bootstrap.

    The bootstrap configuration is defined as _TOR_CONFIG.

    If TOR_PASS is not provided, the function will prompt for it.

darc.proxy.tor.has_tor(link_pool)
    Check if the pool contains Tor links.

    Parameters:
        link_pool (Set[str]) – Link pool to check.
    Returns:
        If the link pool contains Tor links.
    Return type:
        bool

darc.proxy.tor.print_bootstrap_lines(line)
    Print Tor bootstrap lines.

    Parameters:
        line (str) – Tor bootstrap line.

darc.proxy.tor.renew_tor_session()
    Renew the Tor session.
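A minimal sketch of renewing the Tor circuit via the stem controller;
darc keeps a long-lived controller in _TOR_CTRL as documented below,
whereas this sketch opens one on the conventional control port 9051,
which is an assumption:

    from stem import Signal
    from stem.control import Controller

    def renew_tor_session():
        """Ask the Tor controller for a fresh circuit (new identity)."""
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()  # TOR_PASS may be required here
            controller.signal(Signal.NEWNYM)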
darc.proxy.tor.tor_bootstrap()
    Bootstrap wrapper for Tor.

    The function will bootstrap the Tor proxy. It will retry TOR_RETRY
    times in case of failure.

    Also, it will NOT re-bootstrap the proxy, as is guaranteed by
    _TOR_BS_FLAG.

    Warns:
        TorBootstrapFailed – If failed to bootstrap the Tor proxy.

darc.proxy.tor.TOR_REQUESTS_PROXY: Dict[str, Any]
    Proxy for Tor sessions.

darc.proxy.tor.TOR_SELENIUM_PROXY: selenium.webdriver.Proxy
    Proxy (selenium.webdriver.Proxy) for Tor web drivers.

The following constants are configurable through environment variables:

darc.proxy.tor.TOR_PASS: str
    Tor controller authentication token.

    Default: None
    Environ:

    Note: If not provided, it will be requested at runtime.

darc.proxy.tor.TOR_RETRY: int
    Retry times for Tor bootstrap on failure.

    Default: 3
    Environ:

darc.proxy.tor.BS_WAIT: float
    Time after which the attempt to start Tor is aborted.

    Default: 90
    Environ: TOR_WAIT

    Note: If not provided, there will be NO timeout.

darc.proxy.tor.TOR_CFG: Dict[str, Any]
    Tor bootstrap configuration for
    stem.process.launch_tor_with_config().

    Default: {}
    Environ:

    Note: If provided, it will be parsed from a JSON encoded string.

The following constants are defined for internal usage:

darc.proxy.tor._TOR_BS_FLAG: bool
    Whether the Tor proxy is bootstrapped.

darc.proxy.tor._TOR_PROC: subprocess.Popen
    Tor proxy process running in the background.

darc.proxy.tor._TOR_CTRL: stem.control.Controller
    Tor controller process (stem.control.Controller) running in the
    background.

darc.proxy.tor._TOR_CONFIG: List[str]
    Tor bootstrap configuration for
    stem.process.launch_tor_with_config().
ZeroNet Proxy

The darc.proxy.zeronet module contains the auxiliary functions around
managing and processing the ZeroNet proxy.

darc.proxy.zeronet._zeronet_bootstrap()
    ZeroNet bootstrap.

    The bootstrap arguments are defined as _ZERONET_ARGS.

    Raises:
        subprocess.CalledProcessError – If the return code of
            _ZERONET_PROC is non-zero.

darc.proxy.zeronet.has_zeronet(link_pool)
    Check if the pool contains ZeroNet links.

    Parameters:
        link_pool (Set[str]) – Link pool to check.
    Returns:
        If the link pool contains ZeroNet links.
    Return type:
        bool

darc.proxy.zeronet.zeronet_bootstrap()
    Bootstrap wrapper for ZeroNet.

    The function will bootstrap the ZeroNet proxy. It will retry
    ZERONET_RETRY times in case of failure.

    Also, it will NOT re-bootstrap the proxy, as is guaranteed by
    _ZERONET_BS_FLAG.

    Warns:
        ZeroNetBootstrapFailed – If failed to bootstrap the ZeroNet
            proxy.
    Raises:
        UnsupportedPlatform – If the system is not supported, i.e. not
            macOS or Linux.

The following constants are configurable through environment variables:

darc.proxy.zeronet.ZERONET_PORT: int
    Port for the ZeroNet proxy connection.

    Default: 43110
    Environ:

darc.proxy.zeronet.ZERONET_RETRY: int
    Retry times for ZeroNet bootstrap on failure.

    Default: 3
    Environ:

darc.proxy.zeronet.BS_WAIT: float
    Time after which the attempt to start ZeroNet is aborted.

    Default: 90
    Environ: ZERONET_WAIT

    Note: If not provided, there will be NO timeout.

darc.proxy.zeronet.ZERONET_PATH: str
    Path to the ZeroNet project.

    Default: /usr/local/src/zeronet
    Environ:

darc.proxy.zeronet.ZERONET_ARGS: List[str]
    ZeroNet bootstrap arguments for run.sh start.

    If provided, it should be parsed as command line arguments (c.f.
    shlex.split()).

    Default: ''
    Environ:

    Note: The command will be run as DARC_USER if the current user
    (c.f. getpass.getuser()) is root.

The following constants are defined for internal usage:

darc.proxy.zeronet._ZERONET_BS_FLAG: bool
    Whether the ZeroNet proxy is bootstrapped.

darc.proxy.zeronet._ZERONET_PROC: subprocess.Popen
    ZeroNet proxy process running in the background.

darc.proxy.zeronet._ZERONET_ARGS: List[str]
    ZeroNet proxy bootstrap arguments.
To tell the darc project which proxy settings are to be used for the
requests.Session objects and selenium.webdriver.Chrome objects, you
can specify such information in the darc.proxy.LINK_MAP mapping
dictionary.

darc.proxy.LINK_MAP: DefaultDict[str, Tuple[types.FunctionType, types.FunctionType]]

    LINK_MAP = collections.defaultdict(
        lambda: (darc.requests.null_session, darc.selenium.null_driver),
        dict(
            tor=(darc.requests.tor_session, darc.selenium.tor_driver),
            i2p=(darc.requests.i2p_session, darc.selenium.i2p_driver),
        )
    )

    The mapping dictionary from proxy type to its corresponding
    requests.Session factory function and selenium.webdriver.Chrome
    factory function.

    The fallback value is the no-proxy requests.Session object
    (null_session()) and selenium.webdriver.Chrome object
    (null_driver()).

    See also:
        darc.requests – requests.Session factory functions
        darc.selenium – selenium.webdriver.Chrome factory functions
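For instance, dispatching on a parsed link's proxy type might look like
this short sketch:

    from darc.proxy import LINK_MAP

    session_factory, driver_factory = LINK_MAP['tor']
    session = session_factory()  # Tor-proxied requests.Session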
Sites Customisation

As websites may have authentication requirements, etc., over their
content, the darc.sites module provides sites customisation hooks for
both the requests and selenium crawling processes.

Default Hooks

The darc.sites.default module is the fallback for sites customisation.
darc.sites.default.crawler(session, link)
    Default crawler hook.

    Parameters:
        session (requests.Session) – Session object with proxy settings.
        link (darc.link.Link) – Link object to be crawled.

    Returns:
        The final response object with crawled data.
    Return type:
        requests.Response
darc.sites.default.loader(driver, link)
    Default loader hook.

    When loading, if SE_WAIT is a valid time lapse, the function will
    sleep for that long to wait for the page to finish loading its
    contents.

    Parameters:
        driver (selenium.webdriver.Chrome) – Web driver object with
            proxy settings.
        link (darc.link.Link) – Link object to be loaded.

    Returns:
        The web driver object with loaded data.
    Return type:
        selenium.webdriver.Chrome

    Note: Internally, selenium will wait for the browser to finish
    loading the page before returning (i.e. the web API event
    DOMContentLoaded). However, some extra scripts may take more time
    to run after the event.
To customise behaviours over requests, your sites customisation module
should have a crawler() function, e.g. the default crawler() hook.

The function takes the requests.Session object with proxy settings and
a Link object representing the link to be crawled, then returns a
requests.Response object containing the final data of the crawling
process.
darc.sites.crawler_hook(link, session)
    Customisation entry point for requests sessions.

    Parameters:
        link (darc.link.Link) – Link object to be crawled.
        session (requests.Session) – Session object with proxy settings.

    Returns:
        The final response object with crawled data.
    Return type:
        requests.Response

    See also: darc.sites.SITE_MAP
To customise behaviours over selenium, your sites customisation module
should have a loader() function, e.g. the default loader() hook.

The function takes the selenium.webdriver.Chrome object with proxy
settings and a Link object representing the link to be loaded, then
returns the selenium.webdriver.Chrome object containing the final data
of the loading process.
darc.sites.loader_hook(link, driver)
    Customisation entry point for selenium drivers.

    Parameters:
        link (darc.link.Link) – Link object to be loaded.
        driver (selenium.webdriver.Chrome) – Web driver object with
            proxy settings.

    Returns:
        The web driver object with loaded data.
    Return type:
        selenium.webdriver.Chrome

    See also: darc.sites.SITE_MAP
To tell the darc
project which sites customisation
module it should use for a certain hostname, you can register
such module to the SITEMAP
mapping dictionary.
-
darc.sites.
SITEMAP
: DefaultDict[str, str]¶ SITEMAP = collections.defaultdict(lambda: 'default', { # 'www.sample.com': 'sample', # darc.sites.sample })
The mapping dictionary for hostname to sites customisation modules.
The fallback value is
default
, c.f.darc.sites.default
.
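For example, to route a hostname to the hypothetical sample module sketched above:

import darc.sites

# map the hostname to the darc.sites.sample module
darc.sites.SITEMAP['www.sample.com'] = 'sample'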
-
darc.sites.
_get_spec
(link)¶ Load spec if any.
If the sites customisation fails to import, it will fall back to the default hooks,
default
.- Parameters
link (darc.link.Link) – Link object to fetch sites customisation module.
- Returns
The sites customisation module.
- Return type
types.ModuleType
- Warns
SiteNotFoundWarning – If the sites customisation failed to import.
See also
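Conceptually, the lookup behaves like the simplified sketch below (an illustration only, not the actual implementation):

import importlib
import warnings

from darc.error import SiteNotFoundWarning
from darc.sites import SITEMAP

def get_spec_sketch(host: str):
    """Simplified module lookup for illustration."""
    name = SITEMAP[host]  # defaultdict falls back to 'default'
    try:
        return importlib.import_module(f'darc.sites.{name}')
    except ImportError:
        warnings.warn(f'site customisation not found: {name}', SiteNotFoundWarning)
        return importlib.import_module('darc.sites.default')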
Module Constants¶
General Configurations¶
-
darc.const.
REBOOT
: bool¶ Whether to exit the program after the first round, i.e. once all links from the
requests
link database have been crawled and all links from the
selenium
link database have been loaded. This can be useful especially when capacity is limited and you wish to save some space before continuing to the next round. See Docker integration for more information.
- Default
False
- Environ
DARC_REBOOT
-
darc.const.
DEBUG
: bool¶ Whether to run the program in debugging mode.
- Default
False
- Environ
DARC_DEBUG
-
darc.const.
VERBOSE
: bool¶ Whether to run the program in verbose mode. If
DEBUG
is
True
, then verbose mode is always enabled.
- Default
False
- Environ
DARC_VERBOSE
-
darc.const.
FORCE
: bool¶ Whether to ignore
robots.txt
rules when crawling (c.f. crawler()
).
- Default
False
- Environ
DARC_FORCE
-
darc.const.
CHECK
: bool¶ Whether to check proxy type and hostname before crawling (when calling
extract_links()
, read_sitemap()
and read_hosts()
). If
CHECK_NG
is True
, then this variable will always be set to True
.
- Default
False
- Environ
DARC_CHECK
-
darc.const.
CHECK_NG
: bool¶ Whether to check content type through
HEAD
requests before crawling (when calling extract_links()
, read_sitemap()
and read_hosts()
).
- Default
False
- Environ
DARC_CHECK_CONTENT_TYPE
-
darc.const.
ROOT
: str¶ The root folder of the project.
-
darc.const.
CWD
= '.'¶ The current working directory.
-
darc.const.
DARC_CPU
: int¶ Number of concurrent processes. If not provided, then the number of system CPUs will be used.
- Default
None
- Environ
DARC_CPU
-
darc.const.
FLAG_MP
: bool¶ Whether to enable multiprocessing support.
- Default
True
- Environ
DARC_MULTIPROCESSING
-
darc.const.
FLAG_TH
: bool¶ Whether to enable multithreading support.
- Default
False
- Environ
DARC_MULTITHREADING
-
darc.const.
DARC_USER
: str¶ Non-root user for proxies.
- Default
current login user (c.f.
getpass.getuser()
)- Environ
DARC_USER
Data Storage¶
-
darc.const.
PATH_DB
: str¶ Path to data storage.
- Default
data
- Environ
PATH_DATA
See also
See
darc.save
for more information about source saving.
-
darc.const.
PATH_MISC
= '{PATH_DB}/misc/'¶ Path to miscellaneous data storage, i.e.
misc
folder under the root of data storage.See also
-
darc.const.
PATH_LN
= '{PATH_DB}/link.csv'¶ Path to the link CSV file,
link.csv
.See also
-
darc.const.
PATH_QR
= '{PATH_DB}/_queue_requests.txt'¶ Path to the
requests
database,_queue_requests.txt
.
-
darc.const.
PATH_QS
= '{PATH_DB}/_queue_selenium.txt'¶ Path to the
selenium
database,_queue_selenium.txt
.
-
darc.const.
PATH_ID
= '{PATH_DB}/darc.pid'¶ Path to the process ID file,
darc.pid
.See also
Web Crawlers¶
-
darc.const.
TIME_CACHE
: float¶ Time delta for caches in seconds.
The
darc
project supports caching for fetched files.TIME_CACHE
will specify for how long the fetched files will be cached and NOT fetched again.Note
If
TIME_CACHE
isNone
then caching will be marked as forever.- Default
60
- Environ
TIME_CACHE
-
darc.const.
SE_WAIT
: float¶ Time to wait for
selenium
to finish loading pages.Note
Internally,
selenium
will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded
). However, some extra scripts may take more time running after the event.- Default
60
- Environ
SE_WAIT
White / Black Lists¶
-
darc.const.
LINK_WHITE_LIST
: List[re.Pattern]¶ White list of hostnames that should be crawled.
- Default
[]
- Environ
LINK_WHITE_LIST
Note
Regular expressions are supported.
-
darc.const.
LINK_BLACK_LIST
: List[re.Pattern]¶ Black list of hostnames that should not be crawled.
- Default
[]
- Environ
LINK_BLACK_LIST
Note
Regular expressions are supported.
-
darc.const.
LINK_FALLBACK
: bool¶ Fallback value for
match_host()
.- Default
False
- Environ
LINK_FALLBACK
-
darc.const.
MIME_WHITE_LIST
: List[re.Pattern]¶ White list of content types that should be crawled.
- Default
[]
- Environ
MIME_WHITE_LIST
Note
Regular expressions are supported.
-
darc.const.
MIME_BLACK_LIST
: List[re.Pattern]¶ Black list of content types that should not be crawled.
- Default
[]
- Environ
MIME_BLACK_LIST
Note
Regular expressions are supported.
-
darc.const.
MIME_FALLBACK
: bool¶ Fallback value for
match_mime()
.- Default
False
- Environ
MIME_FALLBACK
-
darc.const.
PROXY_WHITE_LIST
: List[str]¶ White list of proxy types that should be crawled.
- Default
[]
- Environ
PROXY_WHITE_LIST
Note
The proxy types are case insensitive.
-
darc.const.
PROXY_BLACK_LIST
: List[str]¶ Black list of proxy types that should not be crawled.
- Default
[]
- Environ
PROXY_BLACK_LIST
Note
The proxy types are case insensitive.
-
darc.const.
PROXY_FALLBACK
: bool¶ Fallback value for
match_proxy()
.- Default
False
- Environ
PROXY_FALLBACK
Custom Exceptions¶
The render_error()
function can be used to render
multi-line error messages with stem.util.term
colours.
The darc
project provides the following custom exceptions:
-
exception
darc.error.
APIRequestFailed
¶ Bases:
Warning
API submit failed.
-
exception
darc.error.
FreenetBootstrapFailed
¶ Bases:
Warning
Freenet bootstrap process failed.
-
exception
darc.error.
I2PBootstrapFailed
¶ Bases:
Warning
I2P bootstrap process failed.
-
exception
darc.error.
SiteNotFoundWarning
¶ Bases:
ImportWarning
Site customisation not found.
-
exception
darc.error.
TorBootstrapFailed
¶ Bases:
Warning
Tor bootstrap process failed.
-
exception
darc.error.
UnsupportedLink
¶ Bases:
Exception
The link is not supported.
-
exception
darc.error.
UnsupportedPlatform
¶ Bases:
Exception
The platform is not supported.
-
exception
darc.error.
UnsupportedProxy
¶ Bases:
Exception
The proxy is not supported.
-
exception
darc.error.
ZeroNetBootstrapFailed
¶ Bases:
Warning
ZeroNet bootstrap process failed.
-
darc.error.
render_error
(message, colour)¶ Render error message.
The function wraps the
stem.util.term.format()
function to provide multi-line formatting support.- Parameters
message (str) – Multi-line message to be rendered with
colour
colour (stem.util.term.Color) – Foreground colour of text, c.f.
stem.util.term.Color
.
- Returns
The rendered error message.
- Return type
str
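A quick usage sketch (the message and colour are illustrative):

from darc.error import render_error
from stem.util import term

# render a multi-line message in red and print it
print(render_error('Tor bootstrap failed:\nconnection refused', term.Color.RED))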
As websites can sometimes be irritating with their anti-robot
verification, login requirements, etc., the darc
project
also provides hooks to customise crawling behaviours around both
requests
and selenium
.
See also
Such customisation, called site
hooks in the darc
project, is site specific; users can set up their own hooks for a
certain site, c.f. darc.sites
for more information.
Still, since the network is a world full of mysteries and miracles,
the speed of crawling will depend largely on the response speed of
the target website. To boost performance while respecting system capacity,
the darc
project introduced multiprocessing, multithreading
and a fallback single-threaded solution for crawling.
Note
When rendering the target website using selenium
powered by
the renowned Google Chrome, rendering requires a considerable amount of memory.
Thus, the three solutions mentioned above only toggle the
behaviour around the use of selenium
.
To keep the darc
project a swiss army knife, only the
main entrypoint function darc.process.process()
is exported
in the global namespace (renamed to darc.darc()
), see below:
-
darc.
darc
()¶ Main process.
The function will register
_signal_handler()
forSIGTERM
, and start the main process of thedarc
darkweb crawlers.The general process can be described as follows:
process()
: obtain URLs from therequests
link database (c.f.load_requests()
), and feed such URLs tocrawler()
with multiprocessing support.crawler()
: parse the URL usingparse_link()
, and check if need to crawl the URL (c.f.PROXY_WHITE_LIST
,PROXY_BLACK_LIST
,LINK_WHITE_LIST
andLINK_BLACK_LIST
); if true, then crawl the URL withrequests
.If the URL is from a brand new host,
darc
will first try to fetch and saverobots.txt
and sitemaps of the host (c.f.save_robots()
andsave_sitemap()
), and extract then save the links from sitemaps (c.f.read_sitemap()
) into link database for future crawling (c.f.save_requests()
). Also, if the submission API is provided,submit_new_host()
will be called and submit the documents just fetched.If
robots.txt
is present, and FORCE
isFalse
,darc
will check if allowed to crawl the URL.Note
The root path (e.g.
/
in https://www.example.com/) will always be crawled ignoringrobots.txt
.At this point,
darc
will call the customised hook function fromdarc.sites
to crawl and get the final response object.darc
will save the session cookies and header information, usingsave_headers()
.Note
If
requests.exceptions.InvalidSchema
is raised, the link will be saved bysave_invalid()
. Further processing is dropped.If the content type of response document is not ignored (c.f.
MIME_WHITE_LIST
andMIME_BLACK_LIST
),darc
will save the document usingsave_html()
orsave_file()
accordingly. And if the submission API is provided,submit_requests()
will be called and submit the document just fetched.If the response document is HTML (
text/html
andapplication/xhtml+xml
),extract_links()
will be called then to extract all possible links from the HTML document and save such links into the database (c.f.save_requests()
).And if the response status code is between
400
and600
, the URL will be saved back to the link database (c.f.save_requests()
). If NOT, the URL will be saved intoselenium
link database to proceed next steps (c.f.save_selenium()
).process()
: after the obtained URLs have all been crawled,darc
will obtain URLs from theselenium
link database (c.f.load_selenium()
), and feed such URLs toloader()
.loader()
: parse the URL usingparse_link()
and start loading the URL usingselenium
with Google Chrome.At this point,
darc
will call the customised hook function fromdarc.sites
to load and return the originalselenium.webdriver.Chrome
object.If successful, the rendered source HTML document will be saved using
save_html()
, and a full-page screenshot will be taken and saved.If the submission API is provided,
submit_selenium()
will be called and submit the document just loaded.Later,
extract_links()
will be called then to extract all possible links from the HTML document and save such links into therequests
database (c.f.save_requests()
).
If in reboot mode, i.e.
REBOOT
isTrue
, the function will exit after first round. If not, it will renew the Tor connections (if bootstrapped), c.f.renew_tor_session()
, and start another round.
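With that, the simplest programmatic invocation is just the following sketch; seed links would normally be supplied through the CLI or the link databases:

import darc

if __name__ == '__main__':
    # start the main crawling process (equivalent to darc.process.process())
    darc.darc()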
Web Backend Demo¶
This is a demo of the API for communication between the
darc
crawlers (darc.submit
) and the web UI.
Assuming the web UI is developed using the Flask
microframework.
# -*- coding: utf-8 -*-
import flask # pylint: disable=import-error
# Flask application
app = flask.Flask(__file__)
@app.route('/api/new_host', methods=['POST'])
def new_host():
"""When a new host is discovered, the :mod:`darc` crawler will submit the
host information. Such includes ``robots.txt`` (if exists) and
``sitemap.xml`` (if any).
Data format::
{
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
// base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// robots.txt from the host (if not exists, then ``null``)
"Robots": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/robots.txt
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
// sitemaps from the host (if none, then ``null``)
"Sitemaps": [
{
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/sitemap_<name>.txt
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
...
],
// hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)
"Hosts": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/hosts.txt
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
}
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
@app.route('/api/requests', methods=['POST'])
def from_requests():
"""When crawling, we'll first fetch the URl using ``requests``, to check
its availability and to save its HTTP headers information. Such information
will be submitted to the web UI.
Data format::
{
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
// base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// request method
"Method": "GET",
// response status code
"Status-Code": ...,
// response reason
"Reason": ...,
// response cookies (if any)
"Cookies": {
...
},
// session cookies (if any)
"Session": {
...
},
// request headers (if any)
"Request": {
...
},
// response headers (if any)
"Response": {
...
},
// requested file (if not exists, then ``null``)
"Document": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html
// or if the document is of generic content type, i.e. not HTML
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
}
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
@app.route('/api/selenium', methods=['POST'])
def from_selenium():
"""After crawling with ``requests``, we'll then render the URl using
``selenium`` with Google Chrome and its driver, to provide a fully rendered
web page. Such information will be submitted to the web UI.
Note:
This information is optional, only provided if the content type from
``requests`` is HTML, status code < 400, and HTML data not empty.
Data format::
{
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
// base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// rendered HTML document (if not exists, then ``null``)
"Document": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.html
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
// web page screenshot (if not exists, then ``null``)
"Screenshot": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.png
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
}
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
if __name__ == "__main__":
app.run()
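To exercise the demo end to end, one can mimic a submission from the crawler side; the payload below is a minimal, illustrative subset of the documented data format, and the port is Flask's default:

import requests

payload = {
    '[metadata]': {
        'url': 'https://www.example.com/',      # illustrative values only
        'proxy': 'null',
        'host': 'www.example.com',
        'base': 'null/https/www.example.com',
        'name': '...',
    },
    'Timestamp': '2020-01-01T00:00:00',
    'URL': 'https://www.example.com/',
    'Robots': None,
    'Sitemaps': None,
    'Hosts': None,
}
requests.post('http://localhost:5000/api/new_host', json=payload)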
Docker Integration¶
The darc
project is integrated with Docker and
Compose. Though not published to Docker Hub, you can
still build by yourself.
The Docker image is based on Ubuntu Bionic (18.04 LTS),
setting up all Python dependencies for the darc
project, installing Google Chrome (version
79.0.3945.36) and corresponding ChromeDriver, as well as
installing and configuring Tor, I2P, ZeroNet, FreeNet,
NoIP proxies.
Note
NoIP is currently not fully integrated in the
darc
due to a misunderstanding in the configuration
process. Contributions are welcome.
When building the image, there is an optional argument
for setting up a non-root user, c.f. environment variable
DARC_USER
and module constant DARC_USER
.
By default, the username is darc
.
Content of Dockerfile
FROM ubuntu:bionic
LABEL Name=darc \
Version=0.1.4
STOPSIGNAL SIGINT
HEALTHCHECK --interval=1h --timeout=1m \
CMD wget https://httpbin.org/get -O /dev/null || exit 1
ARG DARC_USER="darc"
ENV LANG="C.UTF-8" \
LC_ALL="C.UTF-8" \
PYTHONIOENCODING="UTF-8" \
DEBIAN_FRONTEND="teletype" \
DARC_USER="${DARC_USER}"
# DEBIAN_FRONTEND="noninteractive"
COPY extra/retry.sh /usr/local/bin/retry
COPY extra/install.py /usr/local/bin/pty-install
COPY vendor/jdk-13.0.2_linux-x64_bin.tar.gz /var/cache/oracle-jdk13-installer/
RUN set -x \
&& retry apt-get update \
&& retry apt-get install --yes --no-install-recommends \
apt-utils \
&& retry apt-get install --yes --no-install-recommends \
gcc \
g++ \
libmagic1 \
make \
software-properties-common \
tar \
unzip \
zlib1g-dev \
&& retry add-apt-repository ppa:deadsnakes/ppa --yes \
&& retry add-apt-repository ppa:linuxuprising/java --yes \
&& retry add-apt-repository ppa:i2p-maintainers/i2p --yes
RUN retry apt-get update \
&& retry apt-get install --yes --no-install-recommends \
python3.8 \
python3-pip \
python3-setuptools \
python3-wheel \
&& ln -sf /usr/bin/python3.8 /usr/local/bin/python3
RUN retry pty-install --stdin '6\n70' apt-get install --yes --no-install-recommends \
tzdata \
&& retry pty-install --stdin 'yes' apt-get install --yes \
oracle-java13-installer
RUN retry apt-get install --yes --no-install-recommends \
sudo \
&& adduser --disabled-password --gecos '' ${DARC_USER} \
&& adduser ${DARC_USER} sudo \
&& echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
## Tor
RUN retry apt-get install --yes --no-install-recommends tor
COPY extra/torrc.bionic /etc/tor/torrc
## I2P
RUN retry apt-get install --yes --no-install-recommends i2p
COPY extra/i2p.bionic /etc/defaults/i2p
## ZeroNet
COPY vendor/ZeroNet-py3-linux64.tar.gz /tmp
RUN set -x \
&& cd /tmp \
&& tar xvpfz ZeroNet-py3-linux64.tar.gz \
&& mv ZeroNet-linux-dist-linux64 /usr/local/src/zeronet
COPY extra/zeronet.bionic.conf /usr/local/src/zeronet/zeronet.conf
## FreeNet
USER darc
COPY vendor/new_installer_offline.jar /tmp
RUN set -x \
&& cd /tmp \
&& ( pty-install --stdin '/home/darc/freenet\n1' java -jar new_installer_offline.jar || true ) \
&& sudo mv /home/darc/freenet /usr/local/src/freenet
USER root
## NoIP
COPY vendor/noip-duc-linux.tar.gz /tmp
RUN set -x \
&& cd /tmp \
&& tar xvpfz noip-duc-linux.tar.gz \
&& mv noip-2.1.9-1 /usr/local/src/noip \
&& cd /usr/local/src/noip \
&& make
# && make install
# # set up timezone
# RUN echo 'Asia/Shanghai' > /etc/timezone \
# && rm -f /etc/localtime \
# && ln -snf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
# && dpkg-reconfigure -f noninteractive tzdata
COPY vendor/chromedriver_linux64-79.0.3945.36.zip \
vendor/google-chrome-stable_current_amd64.deb /tmp/
RUN set -x \
## ChromeDriver
&& unzip -d /usr/bin /tmp/chromedriver_linux64-79.0.3945.36.zip \
&& which chromedriver \
## Google Chrome
&& ( dpkg --install /tmp/google-chrome-stable_current_amd64.deb || true ) \
&& retry apt-get install --fix-broken --yes --no-install-recommends \
&& dpkg --install /tmp/google-chrome-stable_current_amd64.deb \
&& which google-chrome
# Using pip:
COPY requirements.txt /tmp
RUN python3 -m pip install -r /tmp/requirements.txt --no-cache-dir
RUN set -x \
&& rm -rf \
## APT repository lists
/var/lib/apt/lists/* \
## Python dependencies
/tmp/requirements.txt \
/tmp/pip \
## ChromeDriver
/tmp/chromedriver_linux64-79.0.3945.36.zip \
## Google Chrome
/tmp/google-chrome-stable_current_amd64.deb \
## Vendors
/tmp/new_installer_offline.jar \
/tmp/noip-duc-linux.tar.gz \
/tmp/ZeroNet-py3-linux64.tar.gz \
#&& apt-get remove --auto-remove --yes \
# software-properties-common \
# unzip \
&& apt-get autoremove -y \
&& apt-get autoclean \
&& apt-get clean
ENTRYPOINT [ "python3", "-m", "darc" ]
#ENTRYPOINT [ "bash", "/app/run.sh" ]
CMD [ "--help" ]
WORKDIR /app
COPY darc/ /app/darc/
COPY LICENSE \
MANIFEST.in \
README.rst \
extra/run.sh \
setup.cfg \
setup.py \
test_darc.py /app/
RUN python3 -m pip install -e .
Note
retry
is a shell script that retries the given command until it succeeds.
Content of retry
#!/usr/bin/env bash
while true; do
>&2 echo "+ $@"
$@ && break
>&2 echo "exit: $?"
done
>&2 echo "exit: 0"
pty-install
is a Python script simulating user input for APT package installation withDEBIAN_FRONTEND
set as teletype
.
Content of pty-install
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Install packages requiring interactions."""
import argparse
import os
import subprocess
import sys
import tempfile
def get_parser():
"""Argument parser."""
parser = argparse.ArgumentParser('install',
description='pseudo-interactive package installer')
parser.add_argument('-i', '--stdin', help='content for input')
parser.add_argument('command', nargs=argparse.REMAINDER, help='command to execute')
return parser
def main():
"""Entrypoint."""
parser = get_parser()
args = parser.parse_args()
text = args.stdin.encode().decode('unicode_escape')
path = tempfile.mktemp(prefix='install-')
with open(path, 'w') as file:
file.write(text)
with open(path, 'r') as file:
proc = subprocess.run(args.command, stdin=file) # pylint: disable=subprocess-run-check
os.remove(path)
return proc.returncode
if __name__ == "__main__":
sys.exit(main())
As always, you can also use Docker Compose to manage the darc
image. Environment variables can be set as described in the
configuration section.
Content of docker-compose.yml
version: '3'
services:
darc:
image: darc
build:
context: .
args:
# non-root user
DARC_USER: "darc"
command: [ "--file", "/app/text/market.txt",
"--file", "/app/text/i2p.txt",
"--file", "/app/text/zeronet.txt",
"--file", "/app/text/freenet.txt" ]
environment:
## [PYTHON] force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# reboot mode
DARC_REBOOT: 1
# debug mode
DARC_DEBUG: 0
# verbose mode
DARC_VERBOSE: 1
# force mode (ignore robots.txt)
DARC_FORCE: 1
# check mode (check proxy and hostname before crawling)
DARC_CHECK: 1
# check mode (check content type before crawling)
DARC_CHECK_CONTENT_TYPE: 0
# processes
DARC_CPU: 16
# multiprocessing
DARC_MULTIPROCESSING: 0
# multithreading
DARC_MULTITHREADING: 0
# time lapse
DARC_WAIT: 60
# data storage
PATH_DATA: "data"
# Tor proxy & control port
TOR_PORT: 9050
TOR_CTRL: 9051
# Tor management method
TOR_STEM: 1
# Tor authentication
TOR_PASS: "16:B9D36206B5374B3F609045F9609EE670F17047D88FF713EFB9157EA39F"
# Tor bootstrap retry
TOR_RETRY: 10
# Tor bootstrap wait
TOR_WAIT: 90
# Tor bootstrap config
TOR_CFG: "{}"
# I2P port
I2P_PORT: 4444
# I2P bootstrap retry
I2P_RETRY: 10
# I2P bootstrap wait
I2P_WAIT: 90
# I2P bootstrap config
I2P_ARGS: ""
# ZeroNet port
ZERONET_PORT: 43110
# ZeroNet bootstrap retry
ZERONET_RETRY: 10
# ZeroNet project path
ZERONET_PATH: "/usr/local/src/zeronet"
# ZeroNet bootstrap wait
ZERONET_WAIT: 90
# ZeroNet bootstrap config
ZERONET_ARGS: ""
# Freenet port
FREENET_PORT: 8888
# Freenet bootstrap retry
FREENET_RETRY: 0
# Freenet project path
FREENET_PATH: "/usr/local/src/freenet"
# Freenet bootstrap wait
FREENET_WAIT: 90
# Freenet bootstrap config
FREENET_ARGS: ""
# time delta for caches in seconds
TIME_CACHE: 86400 # 1 day
# time to wait for selenium
SE_WAIT: 5
# extract link pattern
LINK_WHITE_LIST: '[
"(?!(.*\\.)?facebookcorewwwi).*\\.onion",
"(?!(.*\\.)?nytimes3xbfgragh).*\\.onion",
".*?\\.i2p", "127\\.0\\.0\\.1:7657", "localhost:7657", "127\\.0\\.0\\.1:7658", "localhost:7658",
"127\\.0\\.0\\.1:43110", "localhost:43110",
"127\\.0\\.0\\.1:8888", "localhost:8888"
]'
# link black list
LINK_BLACK_LIST: '[ "(.*\\.)?facebookcorewwwi\\.onion", "(.*\\.)?nytimes3xbfgragh\\.onion" ]'
# link fallback flag
LINK_FALLBACK: 1
# content type white list
MIME_WHITE_LIST: '[ "text/html", "application/xhtml+xml" ]'
# content type black list
MIME_BLACK_LIST: '[ "text/css", "application/javascript", "text/json" ]'
# content type fallback flag
MIME_FALLBACK: 0
# proxy type white list
PROXY_WHITE_LIST: '[ "tor", "i2p", "freenet", "zeronet" ]'
# proxy type black list
PROXY_BLACK_LIST: '[ "null", "data" ]'
# proxy type fallback flag
PROXY_FALLBACK: 0
# API retry times
API_RETRY: 10
# API URLs
#API_NEW_HOST: 'https://example.com/api/new_host'
#API_REQUESTS: 'https://example.com/api/requests'
#API_SELENIUM: 'https://example.com/api/selenium'
restart: "always"
volumes:
- ./text:/app/text
- ./extra:/app/extra
- /data/darc:/app/data
# - ./cache:/app/cache
# ## change timezone
# - /etc:/etc
# - /usr/share/zoneinfo:/usr/share/zoneinfo
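With this docker-compose.yml in place, the usual docker-compose build and docker-compose up commands will build the image and start the crawler with the environment configured above.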
Note
Should you wish to run darc
in reboot mode, i.e. set
DARC_REBOOT
and/or REBOOT
as True
, you may wish to change the entrypoint to
bash /app/run.sh
where run.sh
is a shell script that wraps around darc
especially for reboot mode.
Content of run.sh
#!/usr/bin/env bash
set -e
# time lapse
WAIT=${DARC_WAIT=10}
# signal handlers
trap '[ -f ${PATH_DATA}/darc.pid ] && kill -2 $(cat ${PATH_DATA}/darc.pid)' SIGINT SIGTERM SIGKILL
# initialise
echo "+ Starting application..."
python3 -m darc $@
sleep ${WAIT}
# mainloop
while true; do
echo "+ Restarting application..."
python3 -m darc
sleep ${WAIT}
done
In such a scenario, you can customise your run.sh
to, for
instance, archive and upload the data crawled by darc
elsewhere, saving some disk space.
Or you may wish to look into the _queue_requests.txt
and
_queue_selenium.txt
databases (c.f. darc.db
), and make
some minor adjustments to, perhaps, narrow down the crawling targets.
darc
is designed as a swiss army knife for darkweb crawling.
It integrates requests
to collect HTTP request and response
information, such as cookies, header fields, etc. It also bundles
selenium
to provide a fully rendered web page and screenshot
of such view.
The general process of darc
can be described as follows:
process()
: obtain URLs from therequests
link database (c.f.load_requests()
), and feed such URLs tocrawler()
with multiprocessing support.crawler()
: parse the URL usingparse_link()
, and check if need to crawl the URL (c.f.PROXY_WHITE_LIST
,PROXY_BLACK_LIST
,LINK_WHITE_LIST
andLINK_BLACK_LIST
); if true, then crawl the URL withrequests
.If the URL is from a brand new host,
darc
will first try to fetch and saverobots.txt
and sitemaps of the host (c.f.save_robots()
andsave_sitemap()
), and extract then save the links from sitemaps (c.f.read_sitemap()
) into link database for future crawling (c.f.save_requests()
). Also, if the submission API is provided,submit_new_host()
will be called and submit the documents just fetched.If
robots.txt
is present, and FORCE
isFalse
,darc
will check if allowed to crawl the URL.Note
The root path (e.g.
/
in https://www.example.com/) will always be crawled ignoringrobots.txt
.At this point,
darc
will call the customised hook function fromdarc.sites
to crawl and get the final response object.darc
will save the session cookies and header information, usingsave_headers()
.Note
If
requests.exceptions.InvalidSchema
is raised, the link will be saved bysave_invalid()
. Further processing is dropped.If the content type of response document is not ignored (c.f.
MIME_WHITE_LIST
andMIME_BLACK_LIST
),darc
will save the document usingsave_html()
orsave_file()
accordingly. And if the submission API is provided,submit_requests()
will be called and submit the document just fetched.If the response document is HTML (
text/html
andapplication/xhtml+xml
),extract_links()
will be called then to extract all possible links from the HTML document and save such links into the database (c.f.save_requests()
).And if the response status code is between
400
and600
, the URL will be saved back to the link database (c.f.save_requests()
). If NOT, the URL will be saved intoselenium
link database to proceed next steps (c.f.save_selenium()
).process()
: after the obtained URLs have all been crawled,darc
will obtain URLs from theselenium
link database (c.f.load_selenium()
), and feed such URLs toloader()
.loader()
: parse the URL usingparse_link()
and start loading the URL usingselenium
with Google Chrome.At this point,
darc
will call the customised hook function fromdarc.sites
to load and return the originalselenium.webdriver.Chrome
object.If successful, the rendered source HTML document will be saved using
save_html()
, and a full-page screenshot will be taken and saved.If the submission API is provided,
submit_selenium()
will be called and submit the document just loaded.Later,
extract_links()
will be called then to extract all possible links from the HTML document and save such links into therequests
database (c.f.save_requests()
).
Installation¶
Note
darc
supports all Python versions 3.6 and above.
Currently, it only supports and is tested on Linux (Ubuntu 18.04)
and macOS (Catalina).
When installing in Python versions below 3.8, darc
will
use walrus
to compile itself for backport compatibility.
pip install darc
Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system.
However, the darc
project is shipped with Docker and Compose support.
Please see Docker Integration for more information.
Usage¶
The darc
project provides a simple CLI:
usage: darc [-h] [-f FILE] ...
the darkweb crawling swiss army knife
positional arguments:
link links to crawl
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE read links from file
It can also be called through module entrypoint:
python -m darc ...
Note
The link files can contain comment lines, which should start with #
.
Empty lines and comment lines will be ignored when loading.
Configuration¶
Though simple CLI, the darc
project is more configurable by
environment variables.
General Configurations¶
-
DARC_REBOOT
¶ - Type
bool
(int
)- Default
0
Whether to exit the program after the first round, i.e. once all links from the
requests
link database have been crawled and all links from the
selenium
link database have been loaded. This can be useful especially when capacity is limited and you wish to save some space before continuing to the next round. See Docker integration for more information.
-
DARC_DEBUG
¶ - Type
bool
(int
)- Default
0
Whether to run the program in debugging mode.
-
DARC_VERBOSE
¶ - Type
bool
(int
)- Default
0
Whether to run the program in verbose mode. If
DARC_DEBUG
is True
, then verbose mode is always enabled.
-
DARC_CHECK
¶ - Type
bool
(int
)- Default
0
Whether to check proxy type and hostname before crawling (when calling
extract_links()
, read_sitemap()
and read_hosts()
). If
DARC_CHECK_CONTENT_TYPE
is True
, then this environment variable will always be set to True
.
-
DARC_CHECK_CONTENT_TYPE
¶ - Type
bool
(int
)- Default
0
Whether to check content type through
HEAD
requests before crawling (when callingextract_links()
,read_sitemap()
andread_hosts()
).
-
DARC_CPU
¶ - Type
int
- Default
None
Number of concurrent processes. If not provided, then the number of system CPUs will be used.
-
DARC_MULTIPROCESSING
¶ - Type
bool
(int
)- Default
1
Whether to enable multiprocessing support.
-
DARC_MULTITHREADING
¶ - Type
bool
(int
)- Default
0
Whether to enable multithreading support.
Note
DARC_MULTIPROCESSING
and DARC_MULTITHREADING
can
NOT be toggled at the same time.
-
DARC_USER
¶ - Type
str
- Default
current login user (c.f.
getpass.getuser()
)
Non-root user for proxies.
Data Storage¶
-
PATH_DATA
¶ - Type
str
(path)- Default
data
Path to data storage, c.f. PATH_DB.
Web Crawlers¶
-
TIME_CACHE
¶ - Type
float
- Default
60
Time delta for caches in seconds.
The
darc
project supports caching for fetched files.TIME_CACHE
will specify for how long the fetched files will be cached and NOT fetched again.Note
If
TIME_CACHE
isNone
then caching will be marked as forever.
-
SE_WAIT
¶ - Type
float
- Default
60
Time to wait for
selenium
to finish loading pages.Note
Internally,
selenium
will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded
). However, some extra scripts may take more time running after the event.
White / Black Lists¶
-
LINK_WHITE_LIST
¶ - Type
List[str]
(JSON)- Default
[]
White list of hostnames that should be crawled.
Note
Regular expressions are supported.
-
LINK_BLACK_LIST
¶ - Type
List[str]
(JSON)- Default
[]
Black list of hostnames that should not be crawled.
Note
Regular expressions are supported.
-
LINK_FALLBACK
¶ - Type
bool
(int
)- Default
0
Fallback value for
match_host()
.
-
MIME_WHITE_LIST
¶ - Type
List[str]
(JSON)- Default
[]
White list of content types that should be crawled.
Note
Regular expressions are supported.
-
MIME_BLACK_LIST
¶ - Type
List[str]
(JSON)- Default
[]
Black list of content types that should not be crawled.
Note
Regular expressions are supported.
-
MIME_FALLBACK
¶ - Type
bool
(int
)- Default
0
Fallback value for
match_mime()
.
-
PROXY_WHITE_LIST
¶ - Type
List[str]
(JSON)- Default
[]
White list of proxy types that should be crawled.
Note
The proxy types are case insensitive.
-
PROXY_BLACK_LIST
¶ - Type
List[str]
(JSON)- Default
[]
Black list of proxy types that should not be crawled.
Note
The proxy types are case insensitive.
-
PROXY_FALLBACK
¶ - Type
bool
(int
)- Default
0
Fallback value for
match_proxy()
.
Note
If provided,
LINK_WHITE_LIST
, LINK_BLACK_LIST
,
MIME_WHITE_LIST
, MIME_BLACK_LIST
,
PROXY_WHITE_LIST
and PROXY_BLACK_LIST
should all be JSON encoded strings.
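For instance, they can be set from Python before launching darc (illustrative values; an equivalent shell export works just as well):

import json
import os

# JSON-encode the lists before darc reads them from the environment
os.environ['LINK_WHITE_LIST'] = json.dumps([r'.*\.onion'])
os.environ['PROXY_BLACK_LIST'] = json.dumps(['null'])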
Data Submission¶
-
API_RETRY
¶ - Type
int
- Default
3
Retry times for API submission upon failure.
-
API_NEW_HOST
¶ - Type
str
- Default
None
API URL for
submit_new_host()
.
-
API_REQUESTS
¶ - Type
str
- Default
None
API URL for
submit_requests()
.
-
API_SELENIUM
¶ - Type
str
- Default
None
API URL for
submit_selenium()
.
Note
If API_NEW_HOST
, API_REQUESTS
and API_SELENIUM
are None
, the corresponding
submit function will save the JSON data in the path
specified by PATH_DATA
.
Tor Proxy Configuration¶
-
TOR_PORT
¶ - Type
int
- Default
9050
Port for Tor proxy connection.
-
TOR_CTRL
¶ - Type
int
- Default
9051
Port for Tor controller connection.
-
TOR_PASS
¶ - Type
str
- Default
None
Tor controller authentication token.
Note
If not provided, it will be requested at runtime.
-
TOR_RETRY
¶ - Type
int
- Default
3
Retry times for Tor bootstrap upon failure.
-
TOR_WAIT
¶ - Type
float
- Default
90
Time after which the attempt to start Tor is aborted.
Note
If not provided, there will be NO timeouts.
-
TOR_CFG
¶ - Type
Dict[str, Any]
(JSON)- Default
{}
Tor bootstrap configuration for
stem.process.launch_tor_with_config()
.Note
If provided, it should be a JSON encoded string.
I2P Proxy Configuration¶
-
I2P_PORT
¶ - Type
int
- Default
4444
Port for I2P proxy connection.
-
I2P_RETRY
¶ - Type
int
- Default
3
Retry times for I2P bootstrap upon failure.
-
I2P_WAIT
¶ - Type
float
- Default
90
Time after which the attempt to start I2P is aborted.
Note
If not provided, there will be NO timeouts.
-
I2P_ARGS
¶ - Type
str
(Shell)- Default
''
I2P bootstrap arguments for
i2prouter start
.If provided, it should be parsed as command line arguments (c.f.
shlex.split
).Note
The command will be run as
DARC_USER
, if current user (c.f.getpass.getuser()
) is root.
ZeroNet Proxy Configuration¶
-
ZERONET_PORT
¶ - Type
int
- Default
43110
Port for ZeroNet proxy connection.
-
ZERONET_RETRY
¶ - Type
int
- Default
3
Retry times for ZeroNet bootstrap upon failure.
-
ZERONET_WAIT
¶ - Type
float
- Default
90
Time after which the attempt to start ZeroNet is aborted.
Note
If not provided, there will be NO timeouts.
-
ZERONET_PATH
¶ - Type
str
(path)- Default
/usr/local/src/zeronet
Path to the ZeroNet project.
-
ZERONET_ARGS
¶ - Type
str
(Shell)- Default
''
ZeroNet bootstrap arguments for
ZeroNet.sh main
.Note
If provided, it should be parsed as command line arguments (c.f.
shlex.split
).
Freenet Proxy Configuration¶
-
FREENET_PORT
¶ - Type
int
- Default
8888
Port for Freenet proxy connection.
-
FREENET_RETRY
¶ - Type
int
- Default
3
Retry times for Freenet bootstrap upon failure.
-
FREENET_WAIT
¶ - Type
float
- Default
90
Time after which the attempt to start Freenet is aborted.
Note
If not provided, there will be NO timeouts.
-
FREENET_PATH
¶ - Type
str
(path)- Default
/usr/local/src/freenet
Path to the Freenet project.
-
FREENET_ARGS
¶ - Type
str
(Shell)- Default
''
Freenet bootstrap arguments for
run.sh start
.If provided, it should be parsed as command line arguments (c.f.
shlex.split
).Note
The command will be run as
DARC_USER
, if current user (c.f.getpass.getuser()
) is root.