darc - Darkweb Crawler Project

darc is designed as a Swiss Army knife for darkweb crawling. It
integrates requests to collect HTTP request and response information,
such as cookies, header fields, etc. It also bundles selenium to
provide a fully rendered web page and a screenshot of such view.

Technical Documentation
Main Processing

The darc.process module contains the main processing logic of the
darc module.
darc.process._process(worker)
Wrapper function to start the worker process.
- Parameters:
  worker (Union[darc.process.process_crawler, darc.process.process_loader]) – Worker process to start.
darc.process._signal_handler(signum=None, frame=None)
Signal handler.
If the current process is not the main process, the function shall do
nothing.
- Parameters:
  signum (Optional[Union[int, signal.Signals]]) – The signal to handle.
  frame (types.FrameType) – The traceback frame from the signal.
darc.process.process(worker)
Main process.
The function will register _signal_handler() for SIGTERM, and start
the main process of the darc darkweb crawlers.
- Parameters:
  worker (Literal['crawler', 'loader']) – Worker process type.
- Raises:
  ValueError – If worker is not a valid value.
Before starting the workers, the function will start proxies through:
- darc.proxy.tor.tor_proxy()
- darc.proxy.i2p.i2p_proxy()
- darc.proxy.zeronet.zeronet_proxy()
- darc.proxy.freenet.freenet_proxy()
The general process for workers of the crawler type can be described
as follows:

1. process_crawler(): obtain URLs from the requests link database
   (c.f. load_requests()), and feed such URLs to crawler().

2. crawler(): parse the URL using parse_link(), and check if the URL
   needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST,
   LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with
   requests.

   If the URL is from a brand new host, darc will first try to fetch
   and save robots.txt and the sitemaps of the host (c.f.
   save_robots() and save_sitemap()), then extract and save the links
   from the sitemaps (c.f. read_sitemap()) into the link database for
   future crawling (c.f. save_requests()). Also, if the submission API
   is provided, submit_new_host() will be called to submit the
   documents just fetched.

   If robots.txt is present and FORCE is False, darc will check
   whether it is allowed to crawl the URL.

   Note: The root path (e.g. / in https://www.example.com/) will
   always be crawled, ignoring robots.txt.

   At this point, darc will call the customised hook function from
   darc.sites to crawl and get the final response object. darc will
   save the session cookies and header information using
   save_headers().

   Note: If requests.exceptions.InvalidSchema is raised, the link will
   be saved by save_invalid(). Further processing is dropped.

   If the content type of the response document is not ignored (c.f.
   MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be
   called to submit the document just fetched.

   If the response document is HTML (text/html or
   application/xhtml+xml), extract_links() will be called to extract
   all possible links from the HTML document and save them into the
   database (c.f. save_requests()).

   If the response status code is between 400 and 600, the URL will be
   saved back to the link database (c.f. save_requests()). If NOT, the
   URL will be saved into the selenium link database to proceed to the
   next steps (c.f. save_selenium()).
The general process for workers of the loader type can be described as
follows:

1. process_loader(): in the meanwhile, darc will obtain URLs from the
   selenium link database (c.f. load_selenium()), and feed such URLs
   to loader().

2. loader(): parse the URL using parse_link() and start loading the
   URL using selenium with Google Chrome.

   At this point, darc will call the customised hook function from
   darc.sites to load and return the original Chrome object.

   If successful, the rendered source HTML document will be saved, and
   a full-page screenshot will be taken and saved. If the submission
   API is provided, submit_selenium() will be called to submit the
   document just loaded.

   Later, extract_links() will be called to extract all possible links
   from the HTML document and save them into the requests database
   (c.f. save_requests()).
After each round, darc will call the registered hook functions in
sequential order, with the type of worker ('crawler' or 'loader') and
the current link pool as its parameters; see register() for more
information.

If in reboot mode, i.e. REBOOT is True, the function will exit after
the first round. If not, it will renew the Tor connections (if
bootstrapped), c.f. renew_tor_session(), and start another round.
darc.process.process_crawler()
A worker to run the crawler() process.
- Warns:
  HookExecutionFailed – When a hook function raises an error.
darc.process.process_loader()
A worker to run the loader() process.
- Warns:
  HookExecutionFailed – When a hook function raises an error.
darc.process.register(hook, *, _index=None)
Register a hook function.
- Parameters:
  hook (Callable[[Literal['crawler', 'loader'], List[darc.link.Link]], None]) – Hook function to be registered.
- Keyword Arguments:
  _index (Optional[int]) – Position index for the hook function.

The hook function takes two parameters:
- a str object indicating the type of worker, i.e. 'crawler' or
  'loader';
- a list object containing Link objects, as the currently processed
  link pool.

The hook function may raise WorkerBreak so that the worker shall break
from its indefinite loop upon finishing the current round. Any value
returned from the hook function will be ignored by the workers.

See also: The hook functions will be saved into _HOOK_REGISTRY.
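Below is a minimal sketch of such a hook function, assuming the darc
package is importable and that WorkerBreak lives in darc.error (its
exact location is an assumption here):

    from darc.process import register
    from darc.error import WorkerBreak  # assumed import location

    def my_hook(worker_type, link_pool):
        """Log the pool size, then stop after the current round."""
        print(f'[{worker_type}] processed {len(link_pool)} links')
        raise WorkerBreak  # break the worker loop once this round ends

    register(my_hook)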
darc.process._HOOK_REGISTRY: List[Callable[[Literal['crawler', 'loader'], List[Link]], None]] = []
List of hook functions to be called between each round.

darc.process._WORKER_POOL: List[Union[multiprocessing.Process, threading.Thread]] = None
List of active child processes and/or threads.
Web Crawlers

The darc.crawl module provides two types of crawlers.
darc.crawl.crawler(link)
Single requests crawler for an entry link.
- Parameters:
  link (darc.link.Link) – URL to be crawled by requests.

The function will first parse the URL using parse_link(), and check if
the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST,
LINK_WHITE_LIST and LINK_BLACK_LIST); if so, it crawls the URL with
requests.

If the URL is from a brand new host, darc will first try to fetch and
save robots.txt and the sitemaps of the host (c.f. save_robots() and
save_sitemap()), then extract and save the links from the sitemaps
(c.f. read_sitemap()) into the link database for future crawling (c.f.
save_requests()).

Note: A host is new if have_hostname() returns True. If
darc.proxy.null.fetch_sitemap() and/or darc.proxy.i2p.fetch_hosts()
failed when fetching such documents, the host will be removed from the
hostname database through drop_hostname(), and considered new on the
next encounter.

Also, if the submission API is provided, submit_new_host() will be
called to submit the documents just fetched.

If robots.txt is present and FORCE is False, darc will check whether
it is allowed to crawl the URL.

Note: The root path (e.g. / in https://www.example.com/) will always
be crawled, ignoring robots.txt.

At this point, darc will call the customised hook function from
darc.sites to crawl and get the final response object. darc will save
the session cookies and header information using save_headers().

Note: If requests.exceptions.InvalidSchema is raised, the link will be
saved by save_invalid(); further processing is dropped, and the link
will be removed from the requests database through drop_requests(). If
LinkNoReturn is raised, the link will be removed from the requests
database through drop_requests().

If the content type of the response document is not ignored (c.f.
MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called
to submit the document just fetched.

If the response document is HTML (text/html or application/xhtml+xml),
extract_links() will be called to extract all possible links from the
HTML document and save them into the database (c.f. save_requests()).

If the response status code is between 400 and 600, the URL will be
saved back to the link database (c.f. save_requests()). If NOT, the
URL will be saved into the selenium link database to proceed to the
next steps (c.f. save_selenium()).
darc.crawl.loader(link)
Single selenium loader for an entry link.
- Parameters:
  link (darc.link.Link) – URL to be loaded by selenium.

The function will first parse the URL using parse_link() and start
loading the URL using selenium with Google Chrome.

At this point, darc will call the customised hook function from
darc.sites to load and return the original selenium.webdriver.Chrome
object.

Note: If LinkNoReturn is raised, the link will be removed from the
selenium database through drop_selenium().

If successful, the rendered source HTML document will be saved, and a
full-page screenshot will be taken and saved.

Note: When taking the full-page screenshot, loader() will use
document.body.scrollHeight to get the total height of the web page. If
the page height is less than 1,000 pixels, darc will by default set
the height to 1,000 pixels. darc will then tell selenium to resize the
window (in headless mode) to 1,024 pixels in width and 110% of the
page height in height, and take a PNG screenshot.

If the submission API is provided, submit_selenium() will be called to
submit the document just loaded.

Later, extract_links() will be called to extract all possible links
from the HTML document and save them into the requests database
(c.f. save_requests()).
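A hedged sketch of the screenshot sizing described in the note above,
written with plain Selenium rather than darc's own internals (names
here are illustrative):

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    driver.get('https://www.example.com/')

    # Total page height, floored at 1,000 pixels as described above.
    height = max(driver.execute_script('return document.body.scrollHeight'), 1000)

    # Resize to 1,024 px wide and 110% of the page height, then snapshot.
    driver.set_window_size(1024, int(height * 1.1))
    driver.save_screenshot('screenshot.png')
    driver.quit()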
URL Utilities

The Link class is the key data structure of the darc project; it
contains all information required to identify a URL's proxy type,
hostname, path prefix when saving, etc.

The darc.link module also provides several wrapper functions around
the urllib.parse module.
class darc.link.Link(url, proxy, url_parse, host, base, name)
Bases: object

Parsed link.
- Parameters:
  url – original link
  proxy – proxy type
  host – URL's hostname
  base – base folder for saving files
  name – hashed link for saving files
  url_parse – parsed URL from urllib.parse.urlparse()
- Returns:
  Parsed link object.
- Return type:
  Link

url_parse: urllib.parse.ParseResult
Parsed URL from urllib.parse.urlparse().
darc.link.parse_link(link, host=None)
Parse link.
- Parameters:
  link (str) – Link to be parsed.
  host (Optional[str]) – Hostname of the link.
- Returns:
  The parsed link object.
- Return type:
  Link

Note: If host is provided, it will override the hostname of the
original link.

The parsing process of the proxy type is as follows:

1. If host is None and the parse result from urllib.parse.urlparse()
   has no netloc (or hostname) specified, set hostname as (null);
   else, set it as is.
2. If the scheme is data, then the link is a data URI; set hostname as
   data and proxy as data.
3. If the scheme is javascript, then the link is some JavaScript code;
   set proxy as script.
4. If the scheme is bitcoin, then the link is a Bitcoin address; set
   proxy as bitcoin.
5. If the scheme is ed2k, then the link is an ED2K magnet link; set
   proxy as ed2k.
6. If the scheme is magnet, then the link is a magnet link; set proxy
   as magnet.
7. If the scheme is mailto, then the link is an email address; set
   proxy as mail.
8. If the scheme is irc, then the link is an IRC link; set proxy as
   irc.
9. If the scheme is NOT any of http or https, then set proxy to the
   scheme.
10. If the host is None, set hostname to (null) and proxy to null.
11. If the host is an onion (.onion) address, set proxy to tor.
12. If the host is an I2P (.i2p) address, or any of localhost:7657 and
    localhost:7658, set proxy to i2p.
13. If the host is localhost on ZERONET_PORT, and the path is not /,
    i.e. NOT the root path, set proxy to zeronet, and set the first
    part of its path as hostname.

    Example: for the ZeroNet address
    http://127.0.0.1:43110/1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D,
    parse_link() will parse the hostname as
    1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D.
14. If the host is localhost on FREENET_PORT, and the path is not /,
    i.e. NOT the root path, set proxy to freenet, and set the first
    part of its path as hostname.

    Example: for the Freenet address
    http://127.0.0.1:8888/USK@nwa8lHa271k2QvJ8aa0Ov7IHAV-DFOCFgmDt3X6BpCI,DuQSUZiI~agF8c-6tjsFFGuZ8eICrzWCILB60nT8KKo,AQACAAE/sone/77/,
    parse_link() will parse the hostname as
    USK@nwa8lHa271k2QvJ8aa0Ov7IHAV-DFOCFgmDt3X6BpCI,DuQSUZiI~agF8c-6tjsFFGuZ8eICrzWCILB60nT8KKo,AQACAAE.
15. If the host is a proxied onion (.onion.sh) address, set proxy to
    tor2web.
16. If none of the cases above is satisfied, the proxy will be set as
    null, marking it a plain normal link.

The base for the parsed Link object is defined as
<root>/<proxy>/<scheme>/<hostname>/, where root is PATH_DB.

The name for the parsed Link object is the SHA-256 hash (c.f.
hashlib.sha256()) of the original link.
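A minimal sketch of how parse_link() classifies URLs, assuming the
darc package is importable (the onion address is illustrative):

    from darc.link import parse_link

    for url in ('https://www.example.com/',
                'mailto:contact@example.com',
                'bitcoin:1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa'):
        link = parse_link(url)
        # e.g. 'null' / 'mail' / 'bitcoin', the hostname, and the SHA-256 name
        print(link.proxy, link.host, link.name)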
darc.link.quote(string, safe='/', encoding=None, errors=None)
Wrapper function for urllib.parse.quote().
- Returns:
  The quoted string.
- Return type:
  str

Note: The function suppresses possible errors when calling
urllib.parse.quote(). If any, it will return the original string.

darc.link.unquote(string, encoding='utf-8', errors='replace')
Wrapper function for urllib.parse.unquote().
- Returns:
  The unquoted string.
- Return type:
  str

Note: The function suppresses possible errors when calling
urllib.parse.unquote(). If any, it will return the original string.

darc.link.urljoin(base, url, allow_fragments=True)
Wrapper function for urllib.parse.urljoin().
- Parameters:
  base (AnyStr) – base URL
  url (AnyStr) – URL to be joined
  allow_fragments (bool) – whether to allow fragments
- Returns:
  The joined URL.
- Return type:
  AnyStr

Note: The function suppresses possible errors when calling
urllib.parse.urljoin(). If any, it will return base/url directly.

darc.link.urlparse(url, scheme='', allow_fragments=True)
Wrapper function for urllib.parse.urlparse().
- Returns:
  The parse result.
- Return type:
  urllib.parse.ParseResult

Note: The function suppresses possible errors when calling
urllib.parse.urlparse(). If any, it will return
urllib.parse.ParseResult(scheme=scheme, netloc='', path=url,
params='', query='', fragment='') directly.
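A short usage sketch of these wrappers, assuming the darc package is
importable; the point is that they never raise, falling back to the
input (or base/url) on error:

    from darc.link import urljoin, urlparse

    print(urljoin('https://example.com/a/', 'b'))  # https://example.com/a/b

    result = urlparse('https://example.com/path?q=1')
    print(result.netloc, result.path)  # example.com /path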
Source Parsing

The darc.parse module provides auxiliary functions to read robots.txt,
sitemaps and HTML documents. It also contains utility functions to
check if the proxy type, hostname and content type are in any of the
black and white lists.
darc.parse._check(temp_list)
Check the hostname and proxy type of links.
- Parameters:
  temp_list (List[darc.link.Link]) – List of links to be checked.
- Returns:
  List of links matching the requirements.
- Return type:
  List[darc.link.Link]

Note: If CHECK_NG is True, the function will directly call _check_ng()
instead.
darc.parse._check_ng(temp_list)
Check the content type of links through HEAD requests.
- Parameters:
  temp_list (List[darc.link.Link]) – List of links to be checked.
- Returns:
  List of links matching the requirements.
- Return type:
  List[darc.link.Link]
darc.parse.check_robots(link)
Check if link is allowed by robots.txt.
- Parameters:
  link (darc.link.Link) – The link object to be checked.
- Returns:
  If link is allowed by robots.txt.
- Return type:
  bool

Note: The root path of a URL will always return True.
darc.parse.extract_links(link, html, check=False)
Extract links from an HTML document.
- Parameters:
  link (darc.link.Link) – Original link of the HTML document.
  html – Content of the HTML document.
  check (bool) – If True, perform checks on extracted links; defaults
  to CHECK.
- Returns:
  List of extracted links.
- Return type:
  List[darc.link.Link]
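A minimal sketch of link extraction, assuming the darc package is
importable (the HTML snippet is illustrative):

    from darc.link import parse_link
    from darc.parse import extract_links

    page = parse_link('https://www.example.com/')
    html = '<html><body><a href="/about">About</a></body></html>'

    for extracted in extract_links(page, html):
        print(extracted.url)  # e.g. https://www.example.com/about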
darc.parse.get_content_type(response)
Get the content type from response.
- Parameters:
  response (requests.Response) – Response object.
- Returns:
  The content type from response.
- Return type:
  str

Note: If the Content-Type header is not defined in response, the
function will utilise magic to detect its content type.
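A hedged sketch of that fallback behaviour, using the python-magic
package directly (its use here mirrors, but is not, darc's internal
code):

    import magic     # python-magic
    import requests

    response = requests.get('https://www.example.com/')
    mime = response.headers.get('Content-Type')
    if mime is None:
        # Sniff the content when the header is absent.
        mime = magic.from_buffer(response.content, mime=True)
    print(mime)  # e.g. text/html; charset=UTF-8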
Source Saving

The darc.save module contains the core utilities for managing fetched
files and documents.

The data storage under the root path (PATH_DB) is typically as
follows:
data
├── api
│   └── <date>
│       └── <proxy>
│           └── <scheme>
│               └── <hostname>
│                   ├── new_host
│                   │   └── <hash>_<timestamp>.json
│                   ├── requests
│                   │   └── <hash>_<timestamp>.json
│                   └── selenium
│                       └── <hash>_<timestamp>.json
├── link.csv
├── misc
│   ├── bitcoin.txt
│   ├── data
│   │   └── <hash>_<timestamp>.<ext>
│   ├── ed2k.txt
│   ├── invalid.txt
│   ├── irc.txt
│   ├── magnet.txt
│   └── mail.txt
└── <proxy>
    └── <scheme>
        └── <hostname>
            ├── <hash>_<timestamp>.json
            ├── robots.txt
            └── sitemap_<hash>.xml
darc.save.sanitise(link, time=None, raw=False, data=False, headers=False, screenshot=False)
Sanitise link to path.
- Parameters:
  link (darc.link.Link) – Link object to sanitise the path for.
  time (datetime) – Timestamp for the path.
  raw (bool) – If this is the raw HTML document from requests.
  data (bool) – If this is a generic content type document.
  headers (bool) – If this is the response headers from requests.
  screenshot (bool) – If this is the screenshot from selenium.
- Returns:
  - If raw is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.
  - If data is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.
  - If headers is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.
  - If screenshot is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png.
  - If none of the above, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.
- Return type:
  str
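A minimal sketch of the resulting paths, assuming the darc package is
importable and PATH_DB left at its default:

    from datetime import datetime
    from darc.link import parse_link
    from darc.save import sanitise

    link = parse_link('https://www.example.com/')
    print(sanitise(link, time=datetime.now()))
    # e.g. <root>/null/https/www.example.com/<sha256>_<timestamp>.html
    print(sanitise(link, time=datetime.now(), screenshot=True))
    # e.g. <root>/null/https/www.example.com/<sha256>_<timestamp>.png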
darc.save.save_headers(time, link, response, session)
Save HTTP response headers.
- Parameters:
  time (datetime) – Timestamp of response.
  link (darc.link.Link) – Link object of response.
  response (requests.Response) – Response object to be saved.
  session (requests.Session) – Session object of response.
- Returns:
  Saved path to response headers, i.e.
  <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.
- Return type:
  str

The JSON data saved is as follows:

    {
        "[metadata]": {
            "url": "...",
            "proxy": "...",
            "host": "...",
            "base": "...",
            "name": "..."
        },
        "Timestamp": "...",
        "URL": "...",
        "Method": "GET",
        "Status-Code": "...",
        "Reason": "...",
        "Cookies": {"...": "..."},
        "Session": {"...": "..."},
        "Request": {"...": "..."},
        "Response": {"...": "..."},
        "History": [{"...": "..."}]
    }
darc.save.save_link(link)
Save link to the link hash database link.csv.
The CSV file has the following fields:
- proxy type: link.proxy
- URL scheme: link.url_parse.scheme
- hostname: link.base
- link hash: link.name
- original URL: link.url
- Parameters:
  link (darc.link.Link) – Link object to be saved.
darc.save._SAVE_LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
I/O lock for saving the link hash database link.csv.
Link Database

The darc project utilises a Redis based database to provide
inter-process communication.

Note: In its first implementation, the darc project used Queue to
support such communication. However, as noticed at runtime, the Queue
object would be much affected by the lack of memory.

There are three databases, all following the same naming convention
with the queue_ prefix:

- the hostname database – queue_hostname (HostnameQueueModel)
- the requests database – queue_requests (RequestsQueueModel)
- the selenium database – queue_selenium (SeleniumQueueModel)

queue_hostname, queue_requests and queue_selenium are all of the Redis
sorted set data type.

If FLAG_DB is True, the module uses the RDS storage described by the
peewee models as backend.
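A hedged sketch of the sorted-set semantics behind these queues, using
redis-py directly; the key name follows the docs above, while the
timestamp scores and pool size are illustrative rather than darc's
actual scoring scheme:

    import time
    import redis

    r = redis.Redis()

    # Enqueue a link hash, scored by the current timestamp.
    r.zadd('queue_requests', {'<link hash>': time.time()})

    # Dequeue up to 1,000 of the lowest-scored (oldest) entries,
    # mirroring how MAX_POOL bounds each load.
    pool = r.zrangebyscore('queue_requests', min=0, max=time.time(),
                           start=0, num=1000)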
darc.db._db_operation(operation, *args, **kwargs)
Retry an operation on the database.
- Parameters:
  operation (Callable[..., T]) – Callable / method to perform.
  *args – Arbitrary positional arguments.
- Keyword Arguments:
  **kwargs – Arbitrary keyword arguments.
- Returns:
  Any return value from a successful operation call.
- Return type:
  T
darc.db._drop_hostname_db(link)
Remove link from the hostname database.
The function updates the HostnameQueueModel table.
- Parameters:
  link (darc.link.Link) – Link to be removed.

darc.db._drop_hostname_redis(link)
Remove link from the hostname database.
The function updates the queue_hostname database.
- Parameters:
  link (darc.link.Link) – Link to be removed.

darc.db._drop_requests_db(link)
Remove link from the requests database.
The function updates the RequestsQueueModel table.
- Parameters:
  link (darc.link.Link) – Link to be removed.

darc.db._drop_requests_redis(link)
Remove link from the requests database.
The function updates the queue_requests database.
- Parameters:
  link (darc.link.Link) – Link to be removed.

darc.db._drop_selenium_db(link)
Remove link from the selenium database.
The function updates the SeleniumQueueModel table.
- Parameters:
  link (darc.link.Link) – Link to be removed.

darc.db._drop_selenium_redis(link)
Remove link from the selenium database.
The function updates the queue_selenium database.
- Parameters:
  link (darc.link.Link) – Link to be removed.
darc.db._gen_arg_msg(*args, **kwargs)
Sanitise the arguments representation string.
- Parameters:
  *args – Arbitrary arguments.
- Keyword Arguments:
  **kwargs – Arbitrary keyword arguments.
- Returns:
  Sanitised arguments representation string.
- Return type:
  str
darc.db._have_hostname_db(link)
Check if the current link is a new host.
The function checks the HostnameQueueModel table.
- Parameters:
  link (darc.link.Link) – Link to check against.
- Returns:
  A tuple of two bool values, representing whether the link is a known
  host and whether it needs a forced refetch, respectively.
- Return type:
  Tuple[bool, bool]

darc.db._have_hostname_redis(link)
Check if the current link is a new host.
The function checks the queue_hostname database.
- Parameters:
  link (darc.link.Link) – Link to check against.
- Returns:
  A tuple of two bool values, representing whether the link is a known
  host and whether it needs a forced refetch, respectively.
- Return type:
  Tuple[bool, bool]
darc.db._load_requests_db()
Load links from the requests database.
The function reads the RequestsQueueModel table.
- Returns:
  List of loaded links from the requests database.
- Return type:
  List[darc.link.Link]

Note: At runtime, the function will load at most MAX_POOL links to
limit the memory usage.

darc.db._load_requests_redis()
Load links from the requests database.
The function reads the queue_requests database.
- Returns:
  List of loaded links from the requests database.
- Return type:
  List[darc.link.Link]

Note: At runtime, the function will load at most MAX_POOL links to
limit the memory usage.

darc.db._load_selenium_db()
Load links from the selenium database.
The function reads the SeleniumQueueModel table.
- Returns:
  List of loaded links from the selenium database.
- Return type:
  List[darc.link.Link]

Note: At runtime, the function will load at most MAX_POOL links to
limit the memory usage.

darc.db._load_selenium_redis()
Load links from the selenium database.
The function reads the queue_selenium database.
- Returns:
  List of loaded links from the selenium database.
- Return type:
  List[darc.link.Link]

Note: At runtime, the function will load at most MAX_POOL links to
limit the memory usage.
darc.db._redis_command(command, *args, **kwargs)
Wrapper function for Redis commands.
- Parameters:
  command (str) – Command name.
  *args – Arbitrary arguments for the Redis command.
- Keyword Arguments:
  **kwargs – Arbitrary keyword arguments for the Redis command.
- Returns:
  Values returned from the Redis command.
- Warns:
  RedisCommandFailed – Warns at each round when the command failed.
- Return type:
  Any

Note: Between each retry, the function sleeps for RETRY_INTERVAL
second(s) if such value is NOT None.
darc.db._redis_get_lock(name, timeout=None, sleep=0.1, blocking_timeout=None, lock_class=None, thread_local=True)
Get a lock for Redis operations.
- Parameters:
  name (str) – Lock name.
  timeout (Optional[float]) – Maximum life for the lock.
  sleep (float) – Amount of time to sleep per loop iteration when the
  lock is in blocking mode and another client is currently holding the
  lock.
  blocking_timeout (Optional[float]) – Maximum amount of time in
  seconds to spend trying to acquire the lock.
  lock_class (Optional[redis.lock.Lock]) – Lock implementation.
  thread_local (bool) – Whether the lock token is placed in
  thread-local storage.
- Returns:
  A new redis.lock.Lock object using key name that mimics the
  behaviour of threading.Lock.
- Return type:
  Union[redis.lock.Lock, contextlib.nullcontext]

See also: If REDIS_LOCK is False, returns a contextlib.nullcontext
instead.
darc.db._save_requests_db(entries, single=False, score=None, nx=False, xx=False)
Save links to the requests database.
The function updates the RequestsQueueModel table.
- Parameters:
  entries (Union[darc.link.Link, List[darc.link.Link]]) – Links to be
  added to the requests database. It can be either a list of links, or
  a single link (if single is set as True).
  single (bool) – Indicate if entries is a list of links or a single
  link.
  score – Score for the Redis sorted set.
  nx – Only create new elements; do not update scores for elements
  that already exist.
  xx – Only update scores of elements that already exist. New elements
  will not be added.

darc.db._save_requests_redis(entries, single=False, score=None, nx=False, xx=False)
Save links to the requests database.
The function updates the queue_requests database.
- Parameters:
  entries (Union[darc.link.Link, List[darc.link.Link]]) – Links to be
  added to the requests database. It can be either a list of links, or
  a single link (if single is set as True).
  single (bool) – Indicate if entries is a list of links or a single
  link.
  score – Score for the Redis sorted set.
  nx – Forces ZADD to only create new elements and not to update
  scores for elements that already exist.
  xx – Forces ZADD to only update scores of elements that already
  exist. New elements will not be added.

darc.db._save_selenium_db(entries, single=False, score=None, nx=False, xx=False)
Save links to the selenium database.
The function updates the SeleniumQueueModel table.
- Parameters:
  entries (Union[darc.link.Link, List[darc.link.Link]]) – Links to be
  added to the selenium database. It can be either a list of links, or
  a single link (if single is set as True).
  single (bool) – Indicate if entries is a list of links or a single
  link.
  score – Score for the Redis sorted set.
  nx – Only create new elements; do not update scores for elements
  that already exist.
  xx – Only update scores of elements that already exist. New elements
  will not be added.

darc.db._save_selenium_redis(entries, single=False, score=None, nx=False, xx=False)
Save links to the selenium database.
The function updates the queue_selenium database.
- Parameters:
  entries (Union[darc.link.Link, List[darc.link.Link]]) – Links to be
  added to the selenium database. It can be either an iterable of
  links, or a single link (if single is set as True).
  single (bool) – Indicate if entries is an iterable of links or a
  single link.
  score – Score for the Redis sorted set.
  nx – Forces ZADD to only create new elements and not to update
  scores for elements that already exist.
  xx – Forces ZADD to only update scores of elements that already
  exist. New elements will not be added.

When entries is a list of Link instances, darc tries to perform bulk
updates to ease the memory consumption. The bulk size is defined by
BULK_SIZE.
darc.db.drop_hostname(link)
Remove link from the hostname database.
- Parameters:
  link (darc.link.Link) – Link to be removed.

darc.db.drop_requests(link)
Remove link from the requests database.
- Parameters:
  link (darc.link.Link) – Link to be removed.

darc.db.drop_selenium(link)
Remove link from the selenium database.
- Parameters:
  link (darc.link.Link) – Link to be removed.
darc.db.have_hostname(link)
Check if the current link is a new host.
- Parameters:
  link (darc.link.Link) – Link to check against.
- Returns:
  A tuple of two bool values, representing whether the link is a known
  host and whether it needs a forced refetch, respectively.
- Return type:
  Tuple[bool, bool]
darc.db.load_requests(check=False)
Load links from the requests database.
- Parameters:
  check (bool) – If True, perform checks on loaded links; defaults to
  CHECK.
- Returns:
  List of loaded links from the requests database.
- Return type:
  List[darc.link.Link]

Note: At runtime, the function will load at most MAX_POOL links to
limit the memory usage.

darc.db.load_selenium(check=False)
Load links from the selenium database.
- Parameters:
  check (bool) – If True, perform checks on loaded links; defaults to
  CHECK.
- Returns:
  List of loaded links from the selenium database.
- Return type:
  List[darc.link.Link]

Note: At runtime, the function will load at most MAX_POOL links to
limit the memory usage.
darc.db.save_requests(entries, single=False, score=None, nx=False, xx=False)
Save links to the requests database.
The function updates the queue_requests database.
- Parameters:
  entries (Union[darc.link.Link, List[darc.link.Link]]) – Links to be
  added to the requests database. It can be either a list of links, or
  a single link (if single is set as True).
  single (bool) – Indicate if entries is a list of links or a single
  link.
  score – Score for the Redis sorted set.
  nx – Only create new elements; do not update scores for elements
  that already exist.
  xx – Only update scores of elements that already exist. New elements
  will not be added.

When entries is a list of Link instances, darc tries to perform bulk
updates to ease the memory consumption. The bulk size is defined by
BULK_SIZE.

darc.db.save_selenium(entries, single=False, score=None, nx=False, xx=False)
Save links to the selenium database.
- Parameters:
  entries (Union[darc.link.Link, List[darc.link.Link]]) – Links to be
  added to the selenium database. It can be either a list of links, or
  a single link (if single is set as True).
  single (bool) – Indicate if entries is a list of links or a single
  link.
  score – Score for the Redis sorted set.
  nx – Only create new elements; do not update scores for elements
  that already exist.
  xx – Only update scores of elements that already exist. New elements
  will not be added.

When entries is a list of Link instances, darc tries to perform bulk
updates to ease the memory consumption. The bulk size is defined by
BULK_SIZE.
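A minimal round-trip sketch for these queues, assuming a running Redis
instance and the darc package importable:

    from darc.db import save_requests, load_requests
    from darc.link import parse_link

    save_requests([parse_link('https://www.example.com/')])

    for link in load_requests():   # at most MAX_POOL links per call
        print(link.url)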
darc.db.LOCK_TIMEOUT: Optional[float]
Lock blocking timeout.
- Default: 10
- Environ: DARC_LOCK_TIMEOUT

Note: If infinite (inf), no timeout will be applied.

See also: Get a lock from darc.db.get_lock().

darc.db.MAX_POOL: int
Maximum number of links loaded from the database.
- Default: 1_000
- Environ

Note: If infinite (inf), no limit will be applied.

darc.db.REDIS_LOCK: bool
If use a Redis (Lua) lock to ensure process/thread-safe operations.
- Default
- Environ: DARC_REDIS_LOCK

See also: Toggles the behaviour of darc.db.get_lock().

darc.db.RETRY_INTERVAL: int
Retry interval between each Redis command failure.
- Default: 10
- Environ: DARC_RETRY

Note: If infinite (inf), no interval will be applied.

See also: Toggles the behaviour of darc.db.redis_command().
Data Submission

The darc project integrates the capability of submitting fetched data
and information to a web server, to support real-time cross-analysis
and status display.

There are three submission events:

- New Host Submission – API_NEW_HOST
  Submitted in the crawler() function call, when the crawling URL is
  marked as a new host.
- Requests Submission – API_REQUESTS
  Submitted in the crawler() function call, after the crawling process
  of the URL using requests.
- Selenium Submission – API_SELENIUM
  Submitted in the loader() function call, after the loading process
  of the URL using selenium.

See also: Please refer to the data schema for more information about
the submission data.
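A hedged sketch of a receiving endpoint for these events; the Flask
routes and paths below are illustrative and are not darc's bundled
demo:

    from flask import Flask, request

    app = Flask(__name__)

    @app.route('/api/new_host', methods=['POST'])
    @app.route('/api/requests', methods=['POST'])
    @app.route('/api/selenium', methods=['POST'])
    def receive():
        data = request.get_json()   # payloads as documented below
        print(data['[metadata]']['url'])
        return '', 204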
darc.submit.get_hosts(link)
Read hosts.txt.
- Parameters:
  link (darc.link.Link) – Link object to read hosts.txt for.
- Returns:
  Contents of hosts.txt.
- Return type:
  Optional[Dict[str, AnyStr]]
darc.submit.get_metadata(link)
Generate the metadata field.
- Parameters:
  link (darc.link.Link) – Link object to generate metadata for.
- Returns:
  The metadata from link:
  - url – original URL, link.url
  - proxy – proxy type, link.proxy
  - host – hostname, link.host
  - base – base path, link.base
  - name – link hash, link.name
- Return type:
  Dict[str, str]
darc.submit.get_robots(link)
Read robots.txt.
- Parameters:
  link (darc.link.Link) – Link object to read robots.txt for.
- Returns:
  Contents of robots.txt.
- Return type:
  Optional[Dict[str, AnyStr]]

darc.submit.get_sitemaps(link)
Read sitemaps.
- Parameters:
  link (darc.link.Link) – Link object to read sitemaps for.
- Returns:
  Contents of sitemaps.
- Return type:
  Optional[List[Dict[str, AnyStr]]]
darc.submit.save_submit(domain, data)
Save failed submission data.
- Parameters:
  domain ('new_host', 'requests' or 'selenium') – Domain of the
  submission data.
  data (Dict[str, Any]) – Submission data.

Note: The saved files will be categorised by the actual runtime day
for better maintenance.
darc.submit.submit_new_host(time, link, partial=False, force=False)
Submit a new host.
When a new host is discovered, the darc crawler will submit the host
information, including robots.txt (if it exists) and sitemap.xml (if
any).
- Parameters:
  time (datetime.datetime) – Timestamp of submission.
  link (darc.link.Link) – Link object of submission.
  partial (bool) – If the data is not complete, i.e. failed when
  fetching robots.txt, hosts.txt and/or sitemaps.
  force (bool) – If the data is force re-fetched, i.e. cache expired
  when checking with darc.db.have_hostname().

If API_NEW_HOST is None, the data for submission will directly be
saved through save_submit().

The data submitted should have the following format:

    {
        // partial flag - true / false
        "$PARTIAL$": ...,
        // force flag - true / false
        "$FORCE$": ...,
        // metadata of URL
        "[metadata]": {
            // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
            "url": ...,
            // proxy type - null / tor / i2p / zeronet / freenet
            "proxy": ...,
            // hostname / netloc, c.f. ``urllib.parse.urlparse``
            "host": ...,
            // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
            "base": ...,
            // sha256 of URL as name for saved files (timestamp is in ISO format)
            //   JSON log as this one - <base>/<name>_<timestamp>.json
            //   HTML from requests - <base>/<name>_<timestamp>_raw.html
            //   HTML from selenium - <base>/<name>_<timestamp>.html
            //   generic data files - <base>/<name>_<timestamp>.dat
            "name": ...
        },
        // requested timestamp in ISO format as in name of saved file
        "Timestamp": ...,
        // original URL
        "URL": ...,
        // robots.txt from the host (if not exists, then ``null``)
        "Robots": {
            // path of the file, relative path (to data root path ``PATH_DATA``) in container
            //   - <proxy>/<scheme>/<host>/robots.txt
            "path": ...,
            // content of the file (**base64** encoded)
            "data": ...
        },
        // sitemaps from the host (if none, then ``null``)
        "Sitemaps": [
            {
                // path of the file, relative path (to data root path ``PATH_DATA``) in container
                //   - <proxy>/<scheme>/<host>/sitemap_<name>.xml
                "path": ...,
                // content of the file (**base64** encoded)
                "data": ...
            },
            ...
        ],
        // hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)
        "Hosts": {
            // path of the file, relative path (to data root path ``PATH_DATA``) in container
            //   - <proxy>/<scheme>/<host>/hosts.txt
            "path": ...,
            // content of the file (**base64** encoded)
            "data": ...
        }
    }
darc.submit.submit_requests(time, link, response, session, content, mime_type, html=True)
Submit requests data.
When crawling, we first fetch the URL using requests, to check its
availability and to save its HTTP headers information. Such
information will be submitted to the web UI.
- Parameters:
  time (datetime.datetime) – Timestamp of submission.
  link (darc.link.Link) – Link object of submission.
  response (requests.Response) – Response object of submission.
  session (requests.Session) – Session object of submission.
  content (bytes) – Raw content from the response.
  mime_type (str) – Content type.
  html (bool) – If the current document is HTML or other files.

If API_REQUESTS is None, the data for submission will directly be
saved through save_submit().

The data submitted should have the following format:

    {
        // metadata of URL
        "[metadata]": {
            // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
            "url": ...,
            // proxy type - null / tor / i2p / zeronet / freenet
            "proxy": ...,
            // hostname / netloc, c.f. ``urllib.parse.urlparse``
            "host": ...,
            // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
            "base": ...,
            // sha256 of URL as name for saved files (timestamp is in ISO format)
            //   JSON log as this one - <base>/<name>_<timestamp>.json
            //   HTML from requests - <base>/<name>_<timestamp>_raw.html
            //   HTML from selenium - <base>/<name>_<timestamp>.html
            //   generic data files - <base>/<name>_<timestamp>.dat
            "name": ...
        },
        // requested timestamp in ISO format as in name of saved file
        "Timestamp": ...,
        // original URL
        "URL": ...,
        // request method
        "Method": "GET",
        // response status code
        "Status-Code": ...,
        // response reason
        "Reason": ...,
        // response cookies (if any)
        "Cookies": { ... },
        // session cookies (if any)
        "Session": { ... },
        // request headers (if any)
        "Request": { ... },
        // response headers (if any)
        "Response": { ... },
        // content type
        "Content-Type": ...,
        // requested file (if not exists, then ``null``)
        "Document": {
            // path of the file, relative path (to data root path ``PATH_DATA``) in container
            //   - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html
            // or if the document is of generic content type, i.e. not HTML
            //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat
            "path": ...,
            // content of the file (**base64** encoded)
            "data": ...
        },
        // redirection history (if any)
        "History": [
            // same record data as the original response
            {"...": "..."}
        ]
    }
darc.submit.submit_selenium(time, link, html, screenshot)
Submit selenium data.
After crawling with requests, we then render the URL using selenium
with Google Chrome and its web driver, to provide a fully rendered web
page. Such information will be submitted to the web UI.
- Parameters:
  time (datetime.datetime) – Timestamp of submission.
  link (darc.link.Link) – Link object of submission.
  html (str) – HTML source of the web page.
  screenshot (Optional[str]) – Base64 encoded screenshot.

If API_SELENIUM is None, the data for submission will directly be
saved through save_submit().

Note: This information is optional, only provided if the content type
from requests is HTML, the status code is not between 400 and 600, and
the HTML data is not empty.

The data submitted should have the following format:

    {
        // metadata of URL
        "[metadata]": {
            // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
            "url": ...,
            // proxy type - null / tor / i2p / zeronet / freenet
            "proxy": ...,
            // hostname / netloc, c.f. ``urllib.parse.urlparse``
            "host": ...,
            // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
            "base": ...,
            // sha256 of URL as name for saved files (timestamp is in ISO format)
            //   JSON log as this one - <base>/<name>_<timestamp>.json
            //   HTML from requests - <base>/<name>_<timestamp>_raw.html
            //   HTML from selenium - <base>/<name>_<timestamp>.html
            //   generic data files - <base>/<name>_<timestamp>.dat
            "name": ...
        },
        // requested timestamp in ISO format as in name of saved file
        "Timestamp": ...,
        // original URL
        "URL": ...,
        // rendered HTML document (if not exists, then ``null``)
        "Document": {
            // path of the file, relative path (to data root path ``PATH_DATA``) in container
            //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.html
            "path": ...,
            // content of the file (**base64** encoded)
            "data": ...
        },
        // web page screenshot (if not exists, then ``null``)
        "Screenshot": {
            // path of the file, relative path (to data root path ``PATH_DATA``) in container
            //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.png
            "path": ...,
            // content of the file (**base64** encoded)
            "data": ...
        }
    }

See also:
- darc.submit.get_html()
- darc.submit.get_screenshot()
darc.submit.PATH_API = '{PATH_DB}/api/'
Path to the API submission records, i.e. the api folder under the root
of data storage.
darc.submit.API_NEW_HOST: str
API URL for submit_new_host().
- Default
- Environ

darc.submit.API_REQUESTS: str
API URL for submit_requests().
- Default
- Environ

darc.submit.API_SELENIUM: str
API URL for submit_selenium().
- Default
- Environ
Note: If API_NEW_HOST, API_REQUESTS and API_SELENIUM are None, the
corresponding submit functions will save the JSON data in the path
specified by PATH_API.
See also: The darc project provides a demo on how to implement a
darc-compliant web backend for the data submission module. See the
demo page for more information.
Requests Wrapper

The darc.requests module wraps around the requests module, and
provides some simple interfaces for the darc project.
darc.requests.default_user_agent(name='python-darc', proxy=None)
Generate the default user agent.
darc.requests.i2p_session(futures=False)
I2P (.i2p) session.
- Parameters:
  futures (bool) – If True, return a requests_futures.FuturesSession.
- Returns:
  The session object with I2P proxy settings.
- Return type:
  Union[requests.Session, requests_futures.FuturesSession]

See also: darc.proxy.i2p.I2P_REQUESTS_PROXY
darc.requests.null_session(futures=False)
No proxy session.
- Parameters:
  futures (bool) – If True, return a requests_futures.FuturesSession.
- Returns:
  The session object with no proxy settings.
- Return type:
  Union[requests.Session, requests_futures.FuturesSession]
darc.requests.request_session(link, futures=False)
Get a requests session.
- Parameters:
  link (darc.link.Link) – Link requesting the requests.Session.
  futures (bool) – If True, return a requests_futures.FuturesSession.
- Returns:
  The session object with corresponding proxy settings.
- Return type:
  Union[requests.Session, requests_futures.FuturesSession]
- Raises:
  UnsupportedLink – If the proxy type of link is not specified in
  LINK_MAP.

See also: darc.proxy.LINK_MAP
darc.requests.tor_session(futures=False)
Tor (.onion) session.
- Parameters:
  futures (bool) – If True, return a requests_futures.FuturesSession.
- Returns:
  The session object with Tor proxy settings.
- Return type:
  Union[requests.Session, requests_futures.FuturesSession]

See also: darc.proxy.tor.TOR_REQUESTS_PROXY
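A minimal sketch of picking a proxied session per link, assuming the
darc package is importable and the relevant proxy is up (the URL is
illustrative):

    from darc.link import parse_link
    from darc.requests import request_session

    link = parse_link('https://www.example.com/')
    session = request_session(link)   # session matching the proxy type
    response = session.get(link.url)
    print(response.status_code)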
Selenium Wrapper

The darc.selenium module wraps around the selenium module, and
provides some simple interfaces for the darc project.
darc.selenium.get_capabilities(type='null')
Generate desired capabilities.
- Parameters:
  type (str) – Proxy type for capabilities.
- Returns:
  The desired capabilities for the web driver Chrome.
- Raises:
  UnsupportedProxy – If the proxy type is NOT null, tor or i2p.
- Return type:
  dict

See also:
- darc.proxy.tor.TOR_SELENIUM_PROXY
- darc.proxy.i2p.I2P_SELENIUM_PROXY
darc.selenium.get_options(type='null')
Generate options.
- Parameters:
  type (str) – Proxy type for options.
- Returns:
  The options for the web driver Chrome.
- Return type:
  selenium.webdriver.ChromeOptions
- Raises:
  UnsupportedPlatform – If the operating system is NOT macOS or Linux
  and CHROME_BINARY_LOCATION is NOT set.
  UnsupportedProxy – If the proxy type is NOT null, tor or i2p.

Important: The function raises UnsupportedPlatform in cases where
BINARY_LOCATION is None. Please provide CHROME_BINARY_LOCATION when
running darc in loader mode on non-macOS and/or non-Linux systems.

See also:
- darc.proxy.tor.TOR_PORT
- darc.proxy.i2p.I2P_PORT

References:
- Disable sandbox (--no-sandbox) when running as the root user.
- Disable usage of /dev/shm.
darc.selenium.i2p_driver()
I2P (.i2p) driver.
- Returns:
  The web driver object with I2P proxy settings.
- Return type:
  selenium.webdriver.Chrome

darc.selenium.null_driver()
No proxy driver.
- Returns:
  The web driver object with no proxy settings.
- Return type:
  selenium.webdriver.Chrome
darc.selenium.request_driver(link)
Get a selenium driver.
- Parameters:
  link (darc.link.Link) – Link requesting the Chrome driver.
- Returns:
  The web driver object with corresponding proxy settings.
- Return type:
  selenium.webdriver.Chrome
- Raises:
  UnsupportedLink – If the proxy type of link is not specified in
  LINK_MAP.

See also: darc.proxy.LINK_MAP
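A minimal sketch mirroring the request_session() example above,
assuming the darc package and a local Chrome / ChromeDriver setup:

    from darc.link import parse_link
    from darc.selenium import request_driver

    link = parse_link('https://www.example.com/')
    driver = request_driver(link)   # driver with matching proxy settings
    try:
        driver.get(link.url)
        print(driver.title)
    finally:
        driver.quit()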
darc.selenium.BINARY_LOCATION: Optional[str]
Path to the Google Chrome binary location.
- Default: google-chrome
- Environ: CHROME_BINARY_LOCATION
Proxy Utilities

The darc.proxy module provides various proxy support to the darc
project.

Bitcoin Addresses

The darc.proxy.bitcoin module contains the auxiliary functions around
managing and processing Bitcoin addresses.

Currently, the darc project directly saves the Bitcoin addresses
extracted to the data storage file PATH without further processing.
darc.proxy.bitcoin.save_bitcoin(link)
Save a Bitcoin address.
The function will save the Bitcoin address to the file as defined in
PATH.
- Parameters:
  link (darc.link.Link) – Link object representing the Bitcoin
  address.

darc.proxy.bitcoin.PATH = '{PATH_MISC}/bitcoin.txt'
Path to the data storage of Bitcoin addresses.

darc.proxy.bitcoin.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
I/O lock for saving Bitcoin addresses to PATH.
Data URI Schemes

The darc.proxy.data module contains the auxiliary functions around
managing and processing data URI schemes.

Currently, the darc project directly saves the data URI schemes
extracted to the data storage path PATH without further processing.

darc.proxy.data.save_data(link)
Save a data URI.
The function will save the data URI to the data storage as defined in
PATH.
- Parameters:
  link (darc.link.Link) – Link object representing the data URI.

darc.proxy.data.PATH = '{PATH_MISC}/data/'
Path to the data storage of data URI schemes.
ED2K Magnet Links

The darc.proxy.ed2k module contains the auxiliary functions around
managing and processing ED2K magnet links.

Currently, the darc project directly saves the ED2K magnet links
extracted to the data storage file PATH without further processing.

darc.proxy.ed2k.save_ed2k(link)
Save an ED2K magnet link.
The function will save the ED2K magnet link to the file as defined in
PATH.
- Parameters:
  link (darc.link.Link) – Link object representing the ED2K magnet
  link.

darc.proxy.ed2k.PATH = '{PATH_MISC}/ed2k.txt'
Path to the data storage of ED2K magnet links.

darc.proxy.ed2k.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
I/O lock for saving ED2K magnet links to PATH.
Freenet Proxy

The darc.proxy.freenet module contains the auxiliary functions around
managing and processing the Freenet proxy.

darc.proxy.freenet._freenet_bootstrap()
Freenet bootstrap.
The bootstrap arguments are defined as _FREENET_ARGS.
- Raises:
  subprocess.CalledProcessError – If the return code of _FREENET_PROC
  is non-zero.

darc.proxy.freenet.freenet_bootstrap()
Bootstrap wrapper for Freenet.
The function will bootstrap the Freenet proxy. It will retry
FREENET_RETRY times in case of failure.
Also, it will NOT re-bootstrap the proxy, as is guaranteed by
_FREENET_BS_FLAG.
- Warns:
  FreenetBootstrapFailed – If failed to bootstrap the Freenet proxy.
- Raises:
  UnsupportedPlatform – If the system is not supported, i.e. not macOS
  or Linux.

The following constants are configurable through environment
variables:
darc.proxy.freenet.FREENET_RETRY: int
Retry times for Freenet bootstrap on failure.
- Default: 3
- Environ

darc.proxy.freenet.BS_WAIT: float
Time after which the attempt to start Freenet is aborted.
- Default: 90
- Environ: FREENET_WAIT

Note: If not provided, there will be NO timeout.

darc.proxy.freenet.FREENET_PATH: str
Path to the Freenet project.
- Default: /usr/local/src/freenet
- Environ

darc.proxy.freenet.FREENET_ARGS: List[str]
Freenet bootstrap arguments for run.sh start.
If provided, it should be parsed as command line arguments (c.f.
shlex.split()).
- Default: ''
- Environ

Note: The command will be run as DARC_USER if the current user (c.f.
getpass.getuser()) is root.
The following constants are defined for internal usage:

darc.proxy.freenet._MNG_FREENET: bool
If manage the Freenet proxy through darc.
- Default
- Environ: DARC_FREENET

darc.proxy.freenet._FREENET_PROC: subprocess.Popen
Freenet proxy process running in the background.
I2P Proxy

The darc.proxy.i2p module contains the auxiliary functions around
managing and processing the I2P proxy.

darc.proxy.i2p._i2p_bootstrap()
I2P bootstrap.
The bootstrap arguments are defined as _I2P_ARGS.
- Raises:
  subprocess.CalledProcessError – If the return code of _I2P_PROC is
  non-zero.

darc.proxy.i2p.fetch_hosts(link, force=False)
Fetch hosts.txt.
- Parameters:
  link (darc.link.Link) – Link object to fetch hosts.txt for.
  force (bool) – Force refetch of hosts.txt.
- Returns:
  Content of the hosts.txt file.
darc.proxy.i2p.get_hosts(link)
Read hosts.txt.
- Parameters:
  link (darc.link.Link) – Link object to read hosts.txt for.
- Returns:
  Contents of hosts.txt.
- Return type:
  Optional[Dict[str, AnyStr]]

darc.proxy.i2p.have_hosts(link)
Check if hosts.txt already exists.
- Parameters:
  link (darc.link.Link) – Link object to check if hosts.txt already
  exists.
- Returns:
  If hosts.txt exists, the path to hosts.txt, i.e.
  <root>/<proxy>/<scheme>/<hostname>/hosts.txt; if not, None.
- Return type:
  Optional[str]
darc.proxy.i2p.i2p_bootstrap()
Bootstrap wrapper for I2P.
The function will bootstrap the I2P proxy. It will retry I2P_RETRY
times in case of failure.
Also, it will NOT re-bootstrap the proxy, as is guaranteed by
_I2P_BS_FLAG.
- Warns:
  I2PBootstrapFailed – If failed to bootstrap the I2P proxy.
- Raises:
  UnsupportedPlatform – If the system is not supported, i.e. not macOS
  or Linux.
darc.proxy.i2p.read_hosts(text, check=False)
Read hosts.txt.
- Parameters:
  text (str) – Content of hosts.txt.
  check (bool) – If True, perform checks on extracted links; defaults
  to CHECK.
- Returns:
  List of links extracted.
- Return type:
  List[darc.link.Link]
darc.proxy.i2p.save_hosts(link, text)
Save hosts.txt.
- Parameters:
  link (darc.link.Link) – Link object of hosts.txt.
  text (str) – Content of hosts.txt.
- Returns:
  Saved path to hosts.txt, i.e.
  <root>/<proxy>/<scheme>/<hostname>/hosts.txt.
- Return type:
  str
darc.proxy.i2p.I2P_SELENIUM_PROXY: selenium.webdriver.common.proxy.Proxy
Proxy for I2P web drivers.
The following constants are configurable through environment
variables:

darc.proxy.i2p.BS_WAIT: float
Time after which the attempt to start I2P is aborted.
- Default: 90
- Environ: I2P_WAIT

Note: If not provided, there will be NO timeout.

darc.proxy.i2p.I2P_ARGS: List[str]
I2P bootstrap arguments for i2prouter start.
If provided, it should be parsed as command line arguments (c.f.
shlex.split()).
- Default: ''
- Environ

Note: The command will be run as DARC_USER if the current user (c.f.
getpass.getuser()) is root.
The following constants are defined for internal usage:

darc.proxy.i2p._I2P_PROC: subprocess.Popen
I2P proxy process running in the background.
IRC Addresses

The darc.proxy.irc module contains the auxiliary functions around
managing and processing IRC addresses.

Currently, the darc project directly saves the IRC addresses extracted
to the data storage file PATH without further processing.

darc.proxy.irc.save_irc(link)
Save an IRC address.
The function will save the IRC address to the file as defined in PATH.
- Parameters:
  link (darc.link.Link) – Link object representing the IRC address.

darc.proxy.irc.PATH = '{PATH_MISC}/irc.txt'
Path to the data storage of IRC addresses.

darc.proxy.irc.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
I/O lock for saving IRC addresses to PATH.
Magnet Links

The darc.proxy.magnet module contains the auxiliary functions around
managing and processing magnet links.

Currently, the darc project directly saves the magnet links extracted
to the data storage file PATH without further processing.

darc.proxy.magnet.save_magnet(link)
Save a magnet link.
The function will save the magnet link to the file as defined in PATH.
- Parameters:
  link (darc.link.Link) – Link object representing the magnet link.

darc.proxy.magnet.PATH = '{PATH_MISC}/magnet.txt'
Path to the data storage of magnet links.

darc.proxy.magnet.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
I/O lock for saving magnet links to PATH.
Email Addresses

The darc.proxy.mail module contains the auxiliary functions around
managing and processing email addresses.

Currently, the darc project directly saves the email addresses
extracted to the data storage file PATH without further processing.

darc.proxy.mail.save_mail(link)
Save an email address.
The function will save the email address to the file as defined in
PATH.
- Parameters:
  link (darc.link.Link) – Link object representing the email address.

darc.proxy.mail.PATH = '{PATH_MISC}/mail.txt'
Path to the data storage of email addresses.

darc.proxy.mail.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
I/O lock for saving email addresses to PATH.
No Proxy

The darc.proxy.null module contains the auxiliary functions around
managing and processing normal websites with no proxy.

darc.proxy.null.fetch_sitemap(link, force=False)
Fetch sitemap.
The function will first fetch robots.txt, then fetch the sitemaps
accordingly.
- Parameters:
  link (darc.link.Link) – Link object to fetch sitemaps for.
  force (bool) – Force refetch of the sitemaps.
- Returns:
  Contents of robots.txt and sitemaps.

See also: darc.parse.get_sitemap()
darc.proxy.null.get_sitemap(link, text, host=None)
Fetch links to other sitemaps from a sitemap.
- Parameters:
  link (darc.link.Link) – Original link to the sitemap.
  text (str) – Content of the sitemap.
  host (Optional[str]) – Hostname of the URL to the sitemap; the value
  may not be the same as in link.
- Returns:
  List of links to sitemaps.
- Return type:
  List[darc.link.Link]

Note: As specified in the sitemap protocol, a sitemap may contain
links to other sitemaps.
darc.proxy.null.have_robots(link)
Check if robots.txt already exists.
- Parameters:
  link (darc.link.Link) – Link object to check if robots.txt already
  exists.
- Returns:
  If robots.txt exists, the path to robots.txt, i.e.
  <root>/<proxy>/<scheme>/<hostname>/robots.txt; if not, None.
- Return type:
  Optional[str]

darc.proxy.null.have_sitemap(link)
Check if the sitemap already exists.
- Parameters:
  link (darc.link.Link) – Link object to check if the sitemap already
  exists.
- Returns:
  If the sitemap exists, the path to the sitemap, i.e.
  <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml; if not, None.
- Return type:
  Optional[str]
darc.proxy.null.read_robots(link, text, host=None)
Read robots.txt to fetch links to sitemaps.
- Parameters:
  link (darc.link.Link) – Original link to robots.txt.
  text (str) – Content of robots.txt.
  host (Optional[str]) – Hostname of the URL to robots.txt; the value
  may not be the same as in link.
- Returns:
  List of links to sitemaps.
- Return type:
  List[darc.link.Link]

Note: If the link to a sitemap is not specified in robots.txt, the
fallback link /sitemap.xml will be used.
darc.proxy.null.read_sitemap(link, text, check=False)
Read sitemap.
- Parameters:
  link (darc.link.Link) – Original link to the sitemap.
  text (str) – Content of the sitemap.
  check (bool) – If True, perform checks on extracted links; defaults
  to CHECK.
- Returns:
  List of links extracted.
- Return type:
  List[darc.link.Link]
darc.proxy.null.save_invalid(link)
Save a link with an invalid scheme.
The function will save the link with an invalid scheme to the file as
defined in PATH.
- Parameters:
  link (darc.link.Link) – Link object representing the link with an
  invalid scheme.
darc.proxy.null.save_robots(link, text)
Save robots.txt.
- Parameters:
  link (darc.link.Link) – Link object of robots.txt.
  text (str) – Content of robots.txt.
- Returns:
  Saved path to robots.txt, i.e.
  <root>/<proxy>/<scheme>/<hostname>/robots.txt.
- Return type:
  str

darc.proxy.null.save_sitemap(link, text)
Save sitemap.
- Parameters:
  link (darc.link.Link) – Link object of the sitemap.
  text (str) – Content of the sitemap.
- Returns:
  Saved path to the sitemap, i.e.
  <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.
- Return type:
  str

darc.proxy.null.PATH = '{PATH_MISC}/invalid.txt'
Path to the data storage of links with invalid scheme.

darc.proxy.null.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
I/O lock for saving links with invalid scheme to PATH.
JavaScript Links¶
The darc.proxy.script
module contains the auxiliary functions
around managing and processing the JavaScript links.
Currently, the darc
project directly saves the JavaScript links
extracted to the data storage path PATH
without further processing.
-
darc.proxy.script.
save_script
(link)[source]¶ Save JavaScript link.
The function will save JavaScript link to the file as defined in
PATH
.- Parameters
link (darc.link.Link) – Link object representing the JavaScript link.
-
darc.proxy.script.
PATH
= '{PATH_MISC}/script.txt'¶ Path to the data storage of JavaScript links.
See also
-
darc.proxy.script.
LOCK
: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶ I/O lock for saving JavaScript links
PATH
.See also
Telephone Numbers¶
The darc.proxy.tel
module contains the auxiliary functions
around managing and processing the telephone numbers.
Currently, the darc
project directly saves the telephone
numbers extracted to the data storage file
PATH
without further processing.
-
darc.proxy.tel.
save_tel
(link)[source]¶ Save telephone number.
The function will save telephone number to the file as defined in
PATH
.- Parameters
link (darc.link.Link) – Link object representing the telephone number.
-
darc.proxy.tel.
PATH
= '{PATH_MISC}/tel.txt'¶ Path to the data storage of telephone numbers.
See also
-
darc.proxy.tel.
LOCK
: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶ I/O lock for saving telephone numbers
PATH
.See also
Tor Proxy¶
The darc.proxy.tor
module contains the auxiliary functions
around managing and processing the Tor proxy.
-
darc.proxy.tor.
_tor_bootstrap
()[source]¶ Tor bootstrap.
The bootstrap configuration is defined as
_TOR_CONFIG
.If
TOR_PASS
is not provided, the function will prompt for it.
-
darc.proxy.tor.
print_bootstrap_lines
(line)[source]¶ Print Tor bootstrap lines.
- Parameters
line (str) – Tor bootstrap line.
-
darc.proxy.tor.
tor_bootstrap
()[source]¶ Bootstrap wrapper for Tor.
The function will bootstrap the Tor proxy. It will retry for
TOR_RETRY
times in case of failure.Also, it will NOT re-bootstrap the proxy as is guaranteed by
_TOR_BS_FLAG
.- Warns
TorBootstrapFailed – If failed to bootstrap Tor proxy.
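The retry-and-warn behaviour described above can be sketched as follows (illustrative only; the actual tor_bootstrap() also manages the _TOR_BS_FLAG guard):

import warnings

def bootstrap_with_retry(bootstrap, retries, warning_cls):
    """Retry a bootstrap callable, emitting a warning upon final failure."""
    last_error = None
    for _ in range(retries + 1):
        try:
            bootstrap()
            return
        except Exception as error:
            last_error = error
    warnings.warn(str(last_error), warning_cls)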
-
darc.proxy.tor.
TOR_SELENIUM_PROXY
: selenium.webdriver.common.proxy.Proxy¶ Proxy
for Tor web drivers.See also
The following constants are configurable through environment variables:
-
darc.proxy.tor.
TOR_PASS
: str¶ Tor controller authentication token.
Note
If not provided, it will be requested at runtime.
-
darc.proxy.tor.
BS_WAIT
: float¶ Time after which the attempt to start Tor is aborted.
- Default
90
- Environ
TOR_WAIT
Note
If not provided, there will be NO timeouts.
-
darc.proxy.tor.
TOR_CFG
: Dict[str, Any]¶ Tor bootstrap configuration for
stem.process.launch_tor_with_config()
.- Default
{}
- Environ
TOR_CFG
Note
If provided, it will be parsed from a JSON encoded string.
The following constants are defined for internal usage:
-
darc.proxy.tor.
_TOR_PROC
: subprocess.Popen¶ Tor proxy process running in the background.
-
darc.proxy.tor.
_TOR_CTRL
: stem.control.Controller¶ Tor controller process (
stem.control.Controller
) running in the background.
-
darc.proxy.tor.
_TOR_CONFIG
: List[str]¶ Tor bootstrap configuration for
stem.process.launch_tor_with_config()
.
ZeroNet Proxy¶
The darc.proxy.zeronet
module contains the auxiliary functions
around managing and processing the ZeroNet proxy.
-
darc.proxy.zeronet.
_zeronet_bootstrap
()[source]¶ ZeroNet bootstrap.
The bootstrap arguments are defined as
_ZERONET_ARGS
.- Raises
subprocess.CalledProcessError – If the return code of
_ZERONET_PROC
is non-zero.
-
darc.proxy.zeronet.
zeronet_bootstrap
()[source]¶ Bootstrap wrapper for ZeroNet.
The function will bootstrap the ZeroNet proxy. It will retry for
ZERONET_RETRY
times in case of failure.Also, it will NOT re-bootstrap the proxy as is guaranteed by
_ZERONET_BS_FLAG
.- Warns
ZeroNetBootstrapFailed – If failed to bootstrap ZeroNet proxy.
- Raises
UnsupportedPlatform – If the system is not supported, i.e. not macOS or Linux.
The following constants are configurable through environment variables:
-
darc.proxy.zeronet.
ZERONET_RETRY
: int¶ Retry times for ZeroNet bootstrap when failure.
- Default
3
- Environ
-
darc.proxy.zeronet.
BS_WAIT
: float¶ Time after which the attempt to start ZeroNet is aborted.
- Default
90
- Environ
ZERONET_WAIT
Note
If not provided, there will be NO timeouts.
-
darc.proxy.zeronet.
ZERONET_PATH
: str¶ Path to the ZeroNet project.
- Default
/usr/local/src/zeronet
- Environ
-
darc.proxy.zeronet.
ZERONET_ARGS
: List[str]¶ ZeroNet bootstrap arguments for
ZeroNet.sh main
.If provided, it should be parsed as command line arguments (c.f.
shlex.split()
).- Default
''
- Environ
ZERONET_ARGS
Note
The command will be run as
DARC_USER
, if current user (c.f.getpass.getuser()
) is root.
The following constants are defined for internal usage:
-
darc.proxy.zeronet.
_MNG_ZERONET
: bool¶ Whether to manage the ZeroNet proxy through
darc
.- Default
- Environ
DARC_ZERONET
-
darc.proxy.zeronet.
_ZERONET_PROC
: subprocess.Popen¶ ZeroNet proxy process running in the background.
To tell the darc
project which proxy settings to use for the
requests.Session
objects and WebDriver
objects, you can specify such information in the darc.proxy.LINK_MAP
mapping dictionary.
-
darc.proxy.
LINK_MAP
: DefaultDict[str, Tuple[types.FunctionType, types.FunctionType]]¶

LINK_MAP = collections.defaultdict(
    lambda: (darc.requests.null_session, darc.selenium.null_driver),
    dict(
        tor=(darc.requests.tor_session, darc.selenium.tor_driver),
        i2p=(darc.requests.i2p_session, darc.selenium.i2p_driver),
    )
)
The mapping dictionary for proxy type to its corresponding
requests.Session
factory function andWebDriver
factory function.The fallback value is the no proxy
requests.Session
object (null_session()
) andWebDriver
object (null_driver()
).See also
darc.requests
–requests.Session
factory functionsdarc.selenium
–WebDriver
factory functions
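For instance, a hedged sketch of how the mapping is consulted (here link is an assumed, already-parsed Link object):

from darc.proxy import LINK_MAP

session_factory, driver_factory = LINK_MAP[link.proxy]
session = session_factory()  # requests.Session with proxy settings applied
driver = driver_factory()    # WebDriver with proxy settings applied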
Sites Customisation¶
As websites may have authentication requirements, etc., over
its content, the darc.sites
module provides sites
customisation hooks to both requests
and selenium
crawling processes.
Important
To create a sites customisation, define your class by inheriting
darc.sites.BaseSite
and register it to the darc
module through darc.sites.register()
.
Base Sites Customisation¶
The darc.sites._abc
module provides the abstract base class
for sites customisation implementations. All sites customisations must
inherit from the BaseSite
exclusively.
Important
The BaseSite
class is NOT intended to
be used directly from the darc.sites._abc
module. Instead,
you are recommended to import it from darc.sites
respectively.
-
class
darc.sites._abc.
BaseSite
[source]¶ Bases:
object
Abstract base class for sites customisation.
-
static
crawler
(session, link)[source]¶ Crawler hook for my site.
- Parameters
session (requests.sessions.Session) – Session object with proxy settings.
link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
Union[NoReturn, requests.models.Response]
-
static
loader
(driver, link)[source]¶ Loader hook for my site.
- Parameters
driver (selenium.webdriver.Chrome) – Web driver object with proxy settings.
link (darc.link.Link) – Link object to be loaded.
- Raises
LinkNoReturn – This link has no return response.
- Return type
Union[NoReturn, selenium.webdriver.chrome.webdriver.WebDriver]
Default Hooks¶
The darc.sites.default
module is the fallback for sites
customisation.
-
class
darc.sites.default.
DefaultSite
[source]¶ Bases:
darc.sites._abc.BaseSite
Default hooks.
-
static
crawler
(session, link)[source]¶ Default crawler hook.
- Parameters
session (requests.Session) – Session object with proxy settings.
link (darc.link.Link) – Link object to be crawled.
- Returns
The final response object with crawled data.
- Return type
See also
-
static
loader
(driver, link)[source]¶ Default loader hook.
When loading, if
SE_WAIT
is a valid time interval, the function will sleep for that long to wait for the page to finish loading its contents.- Parameters
driver (selenium.webdriver.Chrome) – Web driver object with proxy settings.
link (darc.link.Link) – Link object to be loaded.
- Returns
The web driver object with loaded data.
- Return type
selenium.webdriver.Chrome
Note
Internally,
selenium
will wait for the browser to finish loading the pages before return (i.e. the web API eventDOMContentLoaded
). However, some extra scripts may take more time running after the event.See also
Bitcoin Addresses¶
The darc.sites.bitcoin
module is customised to
handle bitcoin addresses.
-
class
darc.sites.bitcoin.
Bitcoin
[source]¶ Bases:
darc.sites._abc.BaseSite
Bitcoin addresses.
-
static
crawler
(session, link)[source]¶ Crawler hook for bitcoin addresses.
- Parameters
session (
requests.Session
) – Session object with proxy settings.link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
NoReturn
-
static
loader
(driver, link)[source]¶ Not implemented.
- Raises
LinkNoReturn – This hook is not implemented.
- Parameters
driver (selenium.webdriver.chrome.webdriver.WebDriver) –
link (darc.link.Link) –
- Return type
NoReturn
Data URI Schemes¶
The darc.sites.data
module is customised to
handle data URI schemes.
-
class
darc.sites.data.
DataURI
[source]¶ Bases:
darc.sites._abc.BaseSite
Data URI schemes.
-
static
crawler
(session, link)[source]¶ Crawler hook for data URIs.
- Parameters
session (
requests.Session
) – Session object with proxy settings.link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
NoReturn
-
static
loader
(driver, link)[source]¶ Not implemented.
- Raises
LinkNoReturn – This hook is not implemented.
- Parameters
driver (selenium.webdriver.chrome.webdriver.WebDriver) –
link (darc.link.Link) –
- Return type
NoReturn
ED2K Magnet Links¶
The darc.sites.ed2k
module is customised to
handle ED2K magnet links.
-
class
darc.sites.ed2k.
ED2K
[source]¶ Bases:
darc.sites._abc.BaseSite
ED2K magnet links.
-
static
crawler
(session, link)[source]¶ Crawler hook for ED2K magnet links.
- Parameters
session (
requests.Session
) – Session object with proxy settings.link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
NoReturn
-
static
loader
(driver, link)[source]¶ Not implemented.
- Raises
LinkNoReturn – This hook is not implemented.
- Parameters
driver (selenium.webdriver.chrome.webdriver.WebDriver) –
link (darc.link.Link) –
- Return type
NoReturn
IRC Addresses¶
The darc.sites.irc
module is customised to
handle IRC addresses.
-
class
darc.sites.irc.
IRC
[source]¶ Bases:
darc.sites._abc.BaseSite
IRC addresses.
-
static
crawler
(session, link)[source]¶ Crawler hook for IRC addresses.
- Parameters
session (
requests.Session
) – Session object with proxy settings.link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
NoReturn
-
static
loader
(driver, link)[source]¶ Not implemented.
- Raises
LinkNoReturn – This hook is not implemented.
- Parameters
driver (selenium.webdriver.chrome.webdriver.WebDriver) –
link (darc.link.Link) –
- Return type
NoReturn
Magnet Links¶
The darc.sites.magnet
module is customised to
handle magnet links.
-
class
darc.sites.magnet.
Magnet
[source]¶ Bases:
darc.sites._abc.BaseSite
Magnet links.
-
static
crawler
(session, link)[source]¶ Crawler hook for magnet links.
- Parameters
session (
requests.Session
) – Session object with proxy settings.link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
NoReturn
-
static
loader
(driver, link)[source]¶ Not implemented.
- Raises
LinkNoReturn – This hook is not implemented.
- Parameters
driver (selenium.webdriver.chrome.webdriver.WebDriver) –
link (darc.link.Link) –
- Return type
NoReturn
Email Addresses¶
The darc.sites.mail
module is customised to
handle email addresses.
-
class
darc.sites.mail.
Email
[source]¶ Bases:
darc.sites._abc.BaseSite
Email addresses.
-
static
crawler
(session, link)[source]¶ Crawler hook for email addresses.
- Parameters
session (
requests.Session
) – Session object with proxy settings.link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
NoReturn
-
static
loader
(driver, link)[source]¶ Not implemented.
- Raises
LinkNoReturn – This hook is not implemented.
- Parameters
driver (selenium.webdriver.chrome.webdriver.WebDriver) –
link (darc.link.Link) –
- Return type
NoReturn
JavaScript Links¶
The darc.sites.script
module is customised to
handle JavaScript links.
-
class
darc.sites.script.
Script
[source]¶ Bases:
darc.sites._abc.BaseSite
JavaScript links.
-
static
crawler
(session, link)[source]¶ Crawler hook for JavaScript links.
- Parameters
session (
requests.Session
) – Session object with proxy settings.link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
NoReturn
-
static
loader
(driver, link)[source]¶ Not implemented.
- Raises
LinkNoReturn – This hook is not implemented.
- Parameters
driver (selenium.webdriver.chrome.webdriver.WebDriver) –
link (darc.link.Link) –
- Return type
NoReturn
Telephone Numbers¶
The darc.sites.tel
module is customised to
handle telephone numbers.
-
class
darc.sites.tel.
Tel
[source]¶ Bases:
darc.sites._abc.BaseSite
Telephone numbers.
-
static
crawler
(session, link)[source]¶ Crawler hook for telephone numbers.
- Parameters
session (
requests.Session
) – Session object with proxy settings.link (darc.link.Link) – Link object to be crawled.
- Raises
LinkNoReturn – This link has no return response.
- Return type
NoReturn
-
static
loader
(driver, link)[source]¶ Not implemented.
- Raises
LinkNoReturn – This hook is not implemented.
- Parameters
driver (selenium.webdriver.chrome.webdriver.WebDriver) –
link (darc.link.Link) –
- Return type
NoReturn
To start with, you just need to define your sites customisation by
inheriting BaseSite
and overriding the corresponding
crawler()
and/or
loader()
methods.
To customise behaviours over requests
, your sites customisation
class should have a crawler()
method, e.g.
DefaultSite.crawler
.
The function takes the requests.Session
object with proxy settings and
a Link
object representing the link to be
crawled, then returns a requests.Response
object containing the final
data of the crawling process.
-
darc.sites.
crawler_hook
(link, session)[source]¶ Customisation as to
requests
sessions.- Parameters
link (darc.link.Link) – Link object to be crawled.
session (requests.Session) – Session object with proxy settings.
- Returns
The final response object with crawled data.
- Return type
See also
darc.sites.SITE_MAP
To customise behaviours over selenium
, your sites customisation
class should have a loader()
method, e.g.
DefaultSite.loader
.
The function takes the WebDriver
object with proxy settings and a Link
object representing
the link to be loaded, then returns the WebDriver
object containing the final data of the loading process.
-
darc.sites.
loader_hook
(link, driver)[source]¶ Customisation as to
selenium
drivers.- Parameters
link (darc.link.Link) – Link object to be loaded.
driver (selenium.webdriver.Chrome) – Web driver object with proxy settings.
- Returns
The web driver object with loaded data.
- Return type
selenium.webdriver.Chrome
See also
darc.sites.SITE_MAP
To tell the darc
project which sites customisation
module it should use for a certain hostname, you can register
such module to the SITEMAP
mapping dictionary
through register()
:
-
darc.sites.
register
(site, *hostname)[source]¶ Register new site map.
- Parameters
site (Type[darc.sites._abc.BaseSite]) – Sites customisation class inherited from
BaseSite
.*hostname (Tuple[str]) – Optional list of hostnames the sites customisation should be registered with. By default, we use
site.hostname
.
-
darc.sites.
SITEMAP
: DefaultDict[str, Type[darc.sites._abc.BaseSite]]¶

from darc.sites.default import DefaultSite

SITEMAP = collections.defaultdict(lambda: DefaultSite, {
    # 'www.sample.com': SampleSite,  # local customised class
})
The mapping dictionary for hostname to sites customisation classes.
The fallback value is
darc.sites.default.DefaultSite
.
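A quick sketch of the lookup behaviour (the hostnames below are illustrative assumptions):

from darc.sites import SITEMAP

site = SITEMAP['www.mysite.com']    # a registered customisation class, if any
fallback = SITEMAP['unknown.host']  # DefaultSite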
-
darc.sites.
_get_site
(link)[source]¶ Load sites customisation if any.
If the sites customisation does not exist, it will fallback to the default hooks,
DefaultSite
.- Parameters
link (darc.link.Link) – Link object to fetch sites customisation class.
- Returns
The sites customisation class.
- Return type
Type[darc.sites._abc.BaseSite]
See also
Please refer to Customisations for more examples and explanations.
Module Constants¶
Auxiliary Function¶
General Configurations¶
-
darc.const.
REBOOT
: bool¶ Whether to exit the program after the first round, i.e. once all links from the
requests
link database have been crawled and all links from theselenium
link database have been loaded.This can be useful especially when the capacity is limited and you wish to save some space before continuing the next round. See Docker integration for more information.
- Default
- Environ
-
darc.const.
VERBOSE
: bool¶ Whether to run the program in verbose mode. If
DEBUG
isTrue
, then the verbose mode will be always enabled.- Default
- Environ
-
darc.const.
CHECK
: bool¶ Whether to check proxy and hostname before crawling (when calling
extract_links()
,read_sitemap()
andread_hosts()
).If
CHECK_NG
isTrue
, then this environment variable will be always set asTrue
.- Default
- Environ
-
darc.const.
CHECK_NG
: bool¶ Whether to check content type through
HEAD
requests before crawling (when callingextract_links()
,read_sitemap()
andread_hosts()
).- Default
- Environ
-
darc.const.
CWD
= '.'¶ The current working directory.
-
darc.const.
DARC_CPU
: int¶ Number of concurrent processes. If not provided, then the number of system CPUs will be used.
-
darc.const.
DARC_USER
: str¶ Non-root user for proxies.
- Default
current login user (c.f.
getpass.getuser()
)- Environ
Data Storage¶
See also
See darc.db
for more information about database integration.
-
darc.const.
REDIS
: redis.Redis¶ URL to the Redis database.
- Default
redis://127.0.0.1
- Environ
-
darc.const.
DB
: peewee.Database¶ URL to the RDS storage.
- Default
sqlite://{PATH_DB}/darc.db
- Environ
DB_URL
-
darc.const.
DB_WEB
: peewee.Database¶ URL to the data submission storage.
- Default
sqlite://{PATH_DB}/darcweb.db
- Environ
DB_URL
-
darc.const.
FLAG_DB
: bool¶ Whether RDS is used as the task queue backend. If
REDIS_URL
is provided, thenFalse
; else,True
.
-
darc.const.
PATH_DB
: str¶ Path to data storage.
- Default
data
- Environ
See also
See
darc.save
for more information about source saving.
-
darc.const.
PATH_MISC
= '{PATH_DB}/misc/'¶ Path to miscellaneous data storage, i.e.
misc
folder under the root of data storage.See also
-
darc.const.
PATH_LN
= '{PATH_DB}/link.csv'¶ Path to the link CSV file,
link.csv
.See also
-
darc.const.
PATH_ID
= '{PATH_DB}/darc.pid'¶ Path to the process ID file,
darc.pid
.See also
Web Crawlers¶
-
darc.const.
DARC_WAIT
: Optional[float]¶ Time interval between each round when the
requests
and/orselenium
database are empty.- Default
60
- Environ
-
darc.const.
TIME_CACHE
: float¶ Time delta for caches in seconds.
The
darc
project supports caching for fetched files.TIME_CACHE
will specify for how long the fetched files will be cached and NOT fetched again.Note
If
TIME_CACHE
isNone
then caching will be marked as forever.- Default
60
- Environ
-
darc.const.
SE_WAIT
: float¶ Time to wait for
selenium
to finish loading pages.Note
Internally,
selenium
will wait for the browser to finish loading the pages before return (i.e. the web API eventDOMContentLoaded
). However, some extra scripts may take more time running after the event.- Default
60
- Environ
-
darc.const.
SE_EMPTY
= '<html><head></head><body></body></html>'¶ The empty page from
selenium
.See also
White / Black Lists¶
-
darc.const.
LINK_WHITE_LIST
: List[re.Pattern]¶ White list of hostnames that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.
LINK_BLACK_LIST
: List[re.Pattern]¶ Black list of hostnames that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.
LINK_FALLBACK
: bool¶ Fallback value for
match_host()
.- Default
- Environ
-
darc.const.
MIME_WHITE_LIST
: List[re.Pattern]¶ White list of content types that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.
MIME_BLACK_LIST
: List[re.Pattern]¶ Black list of content types that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.
MIME_FALLBACK
: bool¶ Fallback value for
match_mime()
.- Default
- Environ
-
darc.const.
PROXY_WHITE_LIST
: List[str]¶ White list of proxy types that should be crawled.
- Default
[]
- Environ
Note
The proxy types are case insensitive.
-
darc.const.
PROXY_BLACK_LIST
: List[str]¶ Black list of proxy types that should be crawled.
- Default
[]
- Environ
Note
The proxy types are case insensitive.
-
darc.const.
PROXY_FALLBACK
: bool¶ Fallback value for
match_proxy()
.- Default
- Environ
Custom Exceptions¶
The render_error()
function can be used to render
multi-line error messages with stem.util.term
colours.
The darc
project provides the following custom exceptions:
Note
All exceptions inherit from _BaseException
.
The darc
project provides the following custom warnings:
Note
All warnings inherit from _BaseWarning
.
-
exception
darc.error.
APIRequestFailed
[source]¶ Bases:
darc.error._BaseWarning
API submit failed.
-
exception
darc.error.
DatabaseOperaionFailed
[source]¶ Bases:
darc.error._BaseWarning
Database operation execution failed.
-
exception
darc.error.
FreenetBootstrapFailed
[source]¶ Bases:
darc.error._BaseWarning
Freenet bootstrap process failed.
-
exception
darc.error.
HookExecutionFailed
[source]¶ Bases:
darc.error._BaseWarning
Failed to execute hook function.
-
exception
darc.error.
I2PBootstrapFailed
[source]¶ Bases:
darc.error._BaseWarning
I2P bootstrap process failed.
-
exception
darc.error.
LinkNoReturn
[source]¶ Bases:
darc.error._BaseException
The link has no return value from the hooks.
-
exception
darc.error.
LockWarning
[source]¶ Bases:
darc.error._BaseWarning
Failed to acquire Redis lock.
-
exception
darc.error.
RedisCommandFailed
[source]¶ Bases:
darc.error._BaseWarning
Redis command execution failed.
-
exception
darc.error.
SiteNotFoundWarning
[source]¶ Bases:
darc.error._BaseWarning
,ImportWarning
Site customisation not found.
-
exception
darc.error.
TorBootstrapFailed
[source]¶ Bases:
darc.error._BaseWarning
Tor bootstrap process failed.
-
exception
darc.error.
TorRenewFailed
[source]¶ Bases:
darc.error._BaseWarning
Tor renew request failed.
-
exception
darc.error.
UnsupportedLink
[source]¶ Bases:
darc.error._BaseException
The link is not supported.
-
exception
darc.error.
UnsupportedPlatform
[source]¶ Bases:
darc.error._BaseException
The platform is not supported.
-
exception
darc.error.
UnsupportedProxy
[source]¶ Bases:
darc.error._BaseException
The proxy is not supported.
-
exception
darc.error.
WorkerBreak
[source]¶ Bases:
darc.error._BaseException
Break from the worker loop.
-
exception
darc.error.
ZeroNetBootstrapFailed
[source]¶ Bases:
darc.error._BaseWarning
ZeroNet bootstrap process failed.
-
darc.error.
render_error
(message, colour)[source]¶ Render error message.
The function wraps the
stem.util.term.format()
function to provide multi-line formatting support.
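A hypothetical usage sketch (the colour constant comes from stem.util.term):

from stem.util import term

from darc.error import render_error

print(render_error('message line 1\nmessage line 2', term.Color.RED))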
Data Models¶
The darc.model
module contains all data models defined for the
darc
project, including RDS-based task queue and data submission.
Task Queues¶
The darc.model.tasks
module defines the data models
required for the task queue of darc
.
See also
Please refer to darc.db
module for more information
about the task queues.
Hostname Queue¶
Important
The hostname queue is a set named queue_hostname
in
a Redis-based task queue.
The darc.model.tasks.hostname
model contains the data model
defined for the hostname queue.
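In the Redis flavour, enqueueing a hostname is essentially a set insertion; a minimal sketch (the key name follows the note above; the hostname value is illustrative):

import redis

client = redis.Redis()
client.sadd('queue_hostname', 'example.onion')  # returns 1 if newly added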
-
class
darc.model.tasks.hostname.
HostnameQueueModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModel
Hostname task queue.
-
DoesNotExist
¶ alias of
HostnameQueueModelDoesNotExist
-
hostname
: Union[str, peewee.TextField] = <TextField: HostnameQueueModel.hostname>¶ Hostname (c.f.
link.host
).
-
id
= <AutoField: HostnameQueueModel.id>¶
-
timestamp
: Union[datetime.datetime, peewee.DateTimeField] = <DateTimeField: HostnameQueueModel.timestamp>¶ Timestamp of last update.
-
Crawler Queue¶
The darc.model.tasks.requests
model contains the data model
defined for the crawler
queue.
-
class
darc.model.tasks.requests.
RequestsQueueModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModel
Task queue for
crawler()
.-
DoesNotExist
¶ alias of
RequestsQueueModelDoesNotExist
-
hash
: Union[str, peewee.CharField] = <CharField: RequestsQueueModel.hash>¶ Sha256 hash value (c.f.
Link.name
).
-
id
= <AutoField: RequestsQueueModel.id>¶
-
link
: Union[darc.link.Link, darc.model.utils.PickleField] = <PickleField: RequestsQueueModel.link>¶ Pickled target
Link
instance.
-
text
: Union[str, peewee.TextField] = <TextField: RequestsQueueModel.text>¶ URL as raw text (c.f.
Link.url
).
-
timestamp
: Union[datetime.datetime, peewee.DateTimeField] = <DateTimeField: RequestsQueueModel.timestamp>¶ Timestamp of last update.
-
Loader Queue¶
The darc.model.tasks.selenium
model contains the data model
defined for the loader
queue.
-
class
darc.model.tasks.selenium.
SeleniumQueueModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModel
Task queue for
loader()
.-
DoesNotExist
¶ alias of
SeleniumQueueModelDoesNotExist
-
hash
: Union[str, peewee.CharField] = <CharField: SeleniumQueueModel.hash>¶ Sha256 hash value (c.f.
Link.name
).
-
id
= <AutoField: SeleniumQueueModel.id>¶
-
link
: Union[darc.link.Link, darc.model.utils.PickleField] = <PickleField: SeleniumQueueModel.link>¶ Pickled target
Link
instance.
-
text
: Union[str, peewee.TextField] = <TextField: SeleniumQueueModel.text>¶ URL as raw text (c.f.
Link.url
).
-
timestamp
: Union[datetime.datetime, peewee.DateTimeField] = <DateTimeField: SeleniumQueueModel.timestamp>¶ Timestamp of last update.
-
Submission Data Models¶
The darc.model.web
module defines the data models
to store the data crawled from the darc
project.
See also
Please refer to darc.submit
module for more information
about data submission.
Hostname Records¶
The darc.model.web.hostname
module defines the data model
representing hostnames, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host()
for more
information.
-
class
darc.model.web.hostname.
HostnameModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModelWeb
Data model for a hostname record.
Important
The alive state of a hostname is toggled if
crawler()
successfully requested a URL with that hostname.-
DoesNotExist
¶ alias of
HostnameModelDoesNotExist
-
alive
¶ If the hostname is still active.
We consider the hostname inactive only if all subsidiary URLs are inactive.
-
discovery
: datetime.datetime = <DateTimeField: HostnameModel.discovery>¶ Timestamp of first
new_host
submission.
-
hosts
¶
-
id
= <AutoField: HostnameModel.id>¶
-
last_seen
: datetime.datetime = <DateTimeField: HostnameModel.last_seen>¶ Timestamp of last related submission.
-
proxy
: darc.model.utils.Proxy = <IntEnumField: HostnameModel.proxy>¶ Proxy type (c.f.
link.proxy
).
-
robots
¶
-
since
¶ The hostname is active/inactive since such timestamp.
We determine the timestamp from the earliest timestamp of the related subsidiary active/inactive URLs.
-
sitemaps
¶
-
urls
¶
-
URL Records¶
The darc.model.web.url
module defines the data model
representing URLs, specifically from requests
and
selenium
submission.
See also
Please refer to darc.submit.submit_requests()
and
darc.submit.submit_selenium()
for more information.
-
class
darc.model.web.url.
URLModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModelWeb
Data model for a requested URL.
Important
The alive state of a URL is toggled if
crawler()
successfully requested such URL and the status code isok
.-
DoesNotExist
¶ alias of
URLModelDoesNotExist
-
discovery
: datetime.datetime = <DateTimeField: URLModel.discovery>¶ Timestamp of first submission.
-
hostname
: darc.model.web.hostname.HostnameModel = <ForeignKeyField: URLModel.hostname>¶ Hostname (c.f.
link.host
).
-
hostname_id
= <ForeignKeyField: URLModel.hostname>¶
-
id
= <AutoField: URLModel.id>¶
-
last_seen
: datetime.datetime = <DateTimeField: URLModel.last_seen>¶ Timestamp of last submission.
-
proxy
: darc.model.utils.Proxy = <IntEnumField: URLModel.proxy>¶ Proxy type (c.f.
link.proxy
).
-
requests
¶
-
selenium
¶
-
since
: datetime.datetime = <DateTimeField: URLModel.since>¶ The URL is active/inactive since this timestamp.
-
robots.txt
Records¶
The darc.model.web.robots
module defines the data model
representing robots.txt
data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host()
for more
information.
-
class
darc.model.web.robots.
RobotsModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModelWeb
Data model for
robots.txt
data.-
DoesNotExist
¶ alias of
RobotsModelDoesNotExist
-
host
: darc.model.web.hostname.HostnameModel = <ForeignKeyField: RobotsModel.host>¶ Hostname (c.f.
link.host
).
-
host_id
= <ForeignKeyField: RobotsModel.host>¶
-
id
= <AutoField: RobotsModel.id>¶
-
timestamp
: datetime.datetime = <DateTimeField: RobotsModel.timestamp>¶ Timestamp of the submission.
-
sitemap.xml
Records¶
The darc.model.web.sitemap
module defines the data model
representing sitemap.xml
data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host()
for more
information.
-
class
darc.model.web.sitemap.
SitemapModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModelWeb
Data model for
sitemap.xml
data.-
DoesNotExist
¶ alias of
SitemapModelDoesNotExist
-
host
: darc.model.web.hostname.HostnameModel = <ForeignKeyField: SitemapModel.host>¶ Hostname (c.f.
link.host
).
-
host_id
= <ForeignKeyField: SitemapModel.host>¶
-
id
= <AutoField: SitemapModel.id>¶
-
timestamp
: datetime.datetime = <DateTimeField: SitemapModel.timestamp>¶ Timestamp of the submission.
-
hosts.txt
Records¶
The darc.model.web.hosts
module defines the data model
representing hosts.txt
data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host()
for more
information.
-
class
darc.model.web.hosts.
HostsModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModelWeb
Data model for
hosts.txt
data.-
DoesNotExist
¶ alias of
HostsModelDoesNotExist
-
host
: darc.model.web.hostname.HostnameModel = <ForeignKeyField: HostsModel.host>¶ Hostname (c.f.
link.host
).
-
host_id
= <ForeignKeyField: HostsModel.host>¶
-
id
= <AutoField: HostsModel.id>¶
-
timestamp
: datetime.datetime = <DateTimeField: HostsModel.timestamp>¶ Timestamp of the submission.
-
Crawler Records¶
The darc.model.web.requests
module defines the data model
representing crawler
, specifically
from requests
submission.
See also
Please refer to darc.submit.submit_requests()
for more
information.
-
class
darc.model.web.requests.
RequestsHistoryModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModelWeb
Data model for history records from
requests
submission.-
DoesNotExist
¶ alias of
RequestsHistoryModelDoesNotExist
Response cookies.
-
id
= <AutoField: RequestsHistoryModel.id>¶
-
model
: darc.model.web.requests.RequestsModel = <ForeignKeyField: RequestsHistoryModel.model>¶ Original record.
-
model_id
= <ForeignKeyField: RequestsHistoryModel.model>¶
-
timestamp
: datetime.datetime = <DateTimeField: RequestsHistoryModel.timestamp>¶ Timestamp of the submission.
-
-
class
darc.model.web.requests.
RequestsModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModelWeb
Data model for documents from
requests
submission.-
DoesNotExist
¶ alias of
RequestsModelDoesNotExist
Response cookies.
-
history
¶
-
id
= <AutoField: RequestsModel.id>¶
-
timestamp
: datetime.datetime = <DateTimeField: RequestsModel.timestamp>¶ Timestamp of the submission.
-
url
: darc.model.web.url.URLModel = <ForeignKeyField: RequestsModel.url>¶ Original URL (c.f.
link.url
).
-
url_id
= <ForeignKeyField: RequestsModel.url>¶
-
Loader Records¶
The darc.model.web.selenium
module defines the data model
representing loader
, specifically
from selenium
submission.
See also
Please refer to darc.submit.submit_selenium()
for more
information.
-
class
darc.model.web.selenium.
SeleniumModel
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModelWeb
Data model for documents from
selenium
submission.-
DoesNotExist
¶ alias of
SeleniumModelDoesNotExist
-
id
= <AutoField: SeleniumModel.id>¶
-
screenshot
: Optional[bytes] = <BlobField: SeleniumModel.screenshot>¶ Screenshot in PNG format as
bytes
.
-
timestamp
: datetime.datetime = <DateTimeField: SeleniumModel.timestamp>¶ Timestamp of the submission.
-
url
: darc.model.web.url.URLModel = <ForeignKeyField: SeleniumModel.url>¶ Original URL (c.f.
link.url
).
-
url_id
= <ForeignKeyField: SeleniumModel.url>¶
-
Base Model¶
The darc.model.abc
module contains abstract base class
of all data models for the darc
project.
-
class
darc.model.abc.
BaseMeta
[source]¶ Bases:
object
Basic metadata for data models.
-
table_function
()¶ Generate table name dynamically (c.f.
table_function()
).- Parameters
model_class (peewee.Model) –
- Return type
-
-
class
darc.model.abc.
BaseMetaWeb
[source]¶ Bases:
darc.model.abc.BaseMeta
Basic metadata for data models of data submission.
-
class
darc.model.abc.
BaseModel
(*args, **kwargs)[source]¶ Bases:
peewee.Model
Base model with standard patterns.
Notes
The model will implicitly have a
AutoField
attribute named asid
.-
DoesNotExist
¶ alias of
BaseModelDoesNotExist
-
to_dict
(keep_id=False)[source]¶ Convert record to
dict
.- Parameters
keep_id (bool) – If keep the ID auto field.
- Returns
The data converted through
playhouse.shortcuts.model_to_dict()
.
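A hypothetical usage sketch (assuming an existing record in the hostname task queue):

from darc.model.tasks.hostname import HostnameQueueModel

record = HostnameQueueModel.get_by_id(1)  # assumed existing record
data = record.to_dict()                   # plain dict, ID field dropped by default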
-
Meta
¶ Basic metadata for data models.
-
id
= <AutoField: BaseModel.id>¶
-
-
class
darc.model.abc.
BaseModelWeb
(*args, **kwargs)[source]¶ Bases:
darc.model.abc.BaseModel
Base model with standard patterns for data submission.
Notes
The model will implicitly have a
AutoField
attribute named asid
.-
DoesNotExist
¶ alias of
BaseModelWebDoesNotExist
-
Meta
¶ Basic metadata for data models.
-
id
= <AutoField: BaseModelWeb.id>¶
-
Miscellaneous Utilities¶
The darc.model.utils
module contains several miscellaneous
utility functions and data fields.
-
class
darc.model.utils.
IPField
(null=False, index=False, unique=False, column_name=None, default=None, primary_key=False, constraints=None, sequence=None, collation=None, unindexed=False, choices=None, help_text=None, verbose_name=None, index_type=None, db_column=None, _hidden=False)[source]¶ Bases:
peewee.IPField
IP data field.
-
db_value
(val)[source]¶ Dump the value for database storage.
- Parameters
val (Optional[Union[str, ipaddress.IPv4Address, ipaddress.IPv6Address]]) – Source IP address instance.
- Returns
Integral representation of the IP address.
- Return type
Optional[int]
-
python_value
(val)[source]¶ Load the value from database storage.
- Parameters
val (Optional[int]) – Integral representation of the IP address.
- Returns
Original IP address instance.
- Return type
Optional[Union[ipaddress.IPv4Address, ipaddress.IPv6Address]]
-
-
class
darc.model.utils.
IntEnumField
(null=False, index=False, unique=False, column_name=None, default=None, primary_key=False, constraints=None, sequence=None, collation=None, unindexed=False, choices=None, help_text=None, verbose_name=None, index_type=None, db_column=None, _hidden=False)[source]¶ Bases:
peewee.IntegerField
enum.IntEnum
data field.-
python_value
(value)[source]¶ Load the value from database storage.
- Parameters
value (Optional[int]) – Integral representation of the enumeration.
- Returns
Original enumeration object.
- Return type
Optional[enum.IntEnum]
-
choices
: enum.IntEnum¶ The original
enum.IntEnum
class.
-
-
class
darc.model.utils.
JSONField
(null=False, index=False, unique=False, column_name=None, default=None, primary_key=False, constraints=None, sequence=None, collation=None, unindexed=False, choices=None, help_text=None, verbose_name=None, index_type=None, db_column=None, _hidden=False)[source]¶ Bases:
playhouse.mysql_ext.JSONField
JSON data field.
-
class
darc.model.utils.
PickleField
(null=False, index=False, unique=False, column_name=None, default=None, primary_key=False, constraints=None, sequence=None, collation=None, unindexed=False, choices=None, help_text=None, verbose_name=None, index_type=None, db_column=None, _hidden=False)[source]¶ Bases:
peewee.BlobField
Pickled data field.
-
class
darc.model.utils.
Proxy
(value)[source]¶ Bases:
enum.IntEnum
Proxy types supported by
darc
.-
FREENET
= 5¶ Freenet proxy.
-
I2P
= 3¶ I2P proxy.
-
NULL
= 1¶ No proxy.
-
TOR
= 2¶ Tor proxy.
-
ZERONET
= 4¶ ZeroNet proxy.
-
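Since Proxy is an enum.IntEnum, its members compare equal to their integral database representation; a quick sketch:

from darc.model.utils import Proxy

assert Proxy.TOR == 2             # stored as a plain integer
assert Proxy(4) is Proxy.ZERONET  # loaded back into the enumeration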
-
darc.model.utils.
table_function
(model_class)[source]¶ Generate table name dynamically.
The function strips
Model
from the class name and callspeewee.make_snake_case()
to generate a proper table name.- Parameters
model_class (peewee.Model) – Data model class.
- Returns
Generated table name.
- Return type
str
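As a sketch of the naming scheme described above (the class name is taken from this page; the expected output follows from stripping Model and snake-casing):

from darc.model.tasks.requests import RequestsQueueModel
from darc.model.utils import table_function

print(table_function(RequestsQueueModel))  # expected: 'requests_queue'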
As websites can sometimes be irritating with their anti-robot
verification, login requirements, etc., the darc
project
also provides hooks to customise crawling behaviours around both
requests
and selenium
.
See also
Such customisation, as called in the darc
project, site
hooks, is site specific, user can set up your own hooks unto a
certain site, c.f. darc.sites
for more information.
Still, since the network is a world full of mysteries and miracles,
the crawling speed will largely depend on the response speed of
the target website. To speed things up, as well as to meet the system
capacity, the darc
project introduced multiprocessing, multithreading
and a fallback single-threaded solution for crawling.
Note
When rendering the target website using selenium
powered by
the renowned Google Chrome, it requires a considerable amount of memory to run.
Thus, the three solutions mentioned above would only toggle the
behaviour around the use of selenium
.
To keep the darc
project a swiss army knife, only the
main entrypoint function darc.process.process()
is exported
in the global namespace (and renamed to darc.darc()
), see below:
-
darc.
darc
(worker)¶ Main process.
The function will register
_signal_handler()
forSIGTERM
, and start the main process of thedarc
darkweb crawlers.- Parameters
worker (Literal[crawler, loader]) – Worker process type.
- Raises
ValueError – If
worker
is not a valid value.
Before starting the workers, the function will start proxies through
darc.proxy.tor.tor_proxy()
darc.proxy.i2p.i2p_proxy()
darc.proxy.zeronet.zeronet_proxy()
darc.proxy.freenet.freenet_proxy()
The general process can be described as following for workers of
crawler
type:process_crawler()
: obtain URLs from therequests
link database (c.f.load_requests()
), and feed such URLs tocrawler()
.crawler()
: parse the URL usingparse_link()
, and check if need to crawl the URL (c.f.PROXY_WHITE_LIST
,PROXY_BLACK_LIST
,LINK_WHITE_LIST
andLINK_BLACK_LIST
); if true, then crawl the URL withrequests
.If the URL is from a brand new host,
darc
will first try to fetch and saverobots.txt
and sitemaps of the host (c.f.save_robots()
andsave_sitemap()
), and extract then save the links from sitemaps (c.f.read_sitemap()
) into link database for future crawling (c.f.save_requests()
). Also, if the submission API is provided,submit_new_host()
will be called and submit the documents just fetched.If
robots.txt
presented, andFORCE
isFalse
,darc
will check if allowed to crawl the URL.Note
The root path (e.g.
/
in https://www.example.com/) will always be crawled ignoringrobots.txt
.At this point,
darc
will call the customised hook function fromdarc.sites
to crawl and get the final response object.darc
will save the session cookies and header information, usingsave_headers()
.Note
If
requests.exceptions.InvalidSchema
is raised, the link will be saved bysave_invalid()
. Further processing is dropped.If the content type of response document is not ignored (c.f.
MIME_WHITE_LIST
andMIME_BLACK_LIST
),submit_requests()
will be called and submit the document just fetched.If the response document is HTML (
text/html
andapplication/xhtml+xml
),extract_links()
will be called then to extract all possible links from the HTML document and save such links into the database (c.f.save_requests()
).And if the response status code is between
400
and600
, the URL will be saved back to the link database (c.f.save_requests()
). If NOT, the URL will be saved intoselenium
link database to proceed next steps (c.f.save_selenium()
).
The general process can be described as following for workers of
loader
type:process_loader()
: in the meanwhile,darc
will obtain URLs from theselenium
link database (c.f.load_selenium()
), and feed such URLs toloader()
.loader()
: parse the URL usingparse_link()
and start loading the URL usingselenium
with Google Chrome.At this point,
darc
will call the customised hook function fromdarc.sites
to load and return the originalChrome
object.If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.
If the submission API is provided,
submit_selenium()
will be called and submit the document just loaded.Later,
extract_links()
will be called then to extract all possible links from the HTML document and save such links into therequests
database (c.f.save_requests()
).
After each round,
darc
will call registered hook functions in sequential order, with the type of worker ('crawler'
or'loader'
) and the current link pool as its parameters, seeregister()
for more information.If in reboot mode, i.e.
REBOOT
isTrue
, the function will exit after the first round. If not, it will renew the Tor connections (if bootstrapped), c.f.renew_tor_session()
, and start another round.See also
The function is renamed from
darc.process.process()
.
And we also exported the necessary hook registration functions to the global namespace, see below:
-
darc.
register_hooks
(hook, *, _index=None)¶ Register hook function.
- Parameters
hook (Callable[[Literal[crawler, loader], darc.link.Link], None]) – Hook function to be registered.
_index (Optional[int]) –
- Keyword Arguments
_index – Position index for the hook function.
The hook function takes two parameters:
a
str
object indicating the type of worker, i.e.'crawler'
or'loader'
;a
list
object containingLink
objects, as the current processed link pool.
The hook function may raise
WorkerBreak
so that the worker shall break from its indefinite loop upon finishing the current round. Any value returned from the hook function will be ignored by the workers.See also
The hook functions will be saved into
_HOOK_REGISTRY
.See also
The function is renamed from
darc.process.register()
.
-
darc.
register_proxy
(proxy, session=<function null_session>, driver=<function null_driver>)¶ Register new proxy type.
- Parameters
proxy (str) – Proxy type.
session (Callable[[bool], requests.sessions.Session]) – Session factory function, c.f.
darc.requests.null_session()
.driver (Callable[[], selenium.webdriver.chrome.webdriver.WebDriver]) – Driver factory function, c.f.
darc.selenium.null_driver()
.
See also
The function is renamed from
darc.proxy.register()
.
-
darc.
register_sites
(site, *hostname)¶ Register new site map.
- Parameters
site (Type[darc.sites._abc.BaseSite]) – Sites customisation class inherited from
BaseSite
.*hostname (Tuple[str]) – Optional list of hostnames the sites customisation should be registered with. By default, we use
site.hostname
.
See also
The function is renamed from
darc.sites.register()
.
For more information on the hooks, please refer to the customisation documentations.
Configuration¶
The darc
project is generally configurable through numerous
environment variables. Below is the full list of supported environment
variables you may use to configure the behaviour of darc
.
General Configurations¶
-
DARC_REBOOT
¶ -
Whether to exit the program after the first round, i.e. once all links from the
requests
link database have been crawled and all links from theselenium
link database have been loaded.This can be useful especially when the capacity is limited and you wish to save some space before continuing the next round. See Docker integration for more information.
-
DARC_VERBOSE
¶ -
Whether to run the program in verbose mode. If
DARC_DEBUG
isTrue
, then the verbose mode will be always enabled.
-
DARC_CHECK
¶ -
Whether to check proxy and hostname before crawling (when calling
extract_links()
,read_sitemap()
andread_hosts()
).If
DARC_CHECK_CONTENT_TYPE
isTrue
, then this environment variable will be always set asTrue
.
-
DARC_CHECK_CONTENT_TYPE
¶ -
Whether to check content type through
HEAD
requests before crawling (when callingextract_links()
,read_sitemap()
andread_hosts()
).
-
DARC_CPU
¶ -
Number of concurrent processes. If not provided, then the number of system CPUs will be used.
Note
DARC_MULTIPROCESSING
and DARC_MULTITHREADING
can
NOT be toggled at the same time.
-
DARC_USER
¶ - Type
- Default
current login user (c.f.
getpass.getuser()
)
Non-root user for proxies.
Data Storage¶
See also
See darc.save
for more information about source saving.
See darc.db
for more information about database integration.
-
DB_URL
¶ - Type
str
(url)
URL to the RDS storage.
Important
The task queues will be saved to
darc
database; the data submission will be saved todarcweb
database.Thus, when providing this environment variable, please do NOT specify the database name.
-
LOCK_TIMEOUT
¶ - Type
- Default
10
Lock blocking timeout.
Note
If set to an infinite value
inf
, no timeout will be applied.See also
Get a lock from
darc.db.get_lock()
.
-
DARC_MAX_POOL
¶ - Type
- Default
1_000
Maximum number of links loaded from the database.
Note
If set to an infinite value
inf
, no limit will be applied.
-
REDIS_LOCK
¶ -
Whether to use a Redis (Lua) lock to ensure process/thread-safe operations.
See also
Toggles the behaviour of
darc.db.get_lock()
.
Web Crawlers¶
-
DARC_WAIT
¶ - Type
- Default
60
Time interval between each round when the
requests
and/orselenium
database are empty.
-
DARC_SAVE
¶ -
Whether to save processed links back to the database.
Note
If
DARC_SAVE
isTrue
, thenDARC_SAVE_REQUESTS
andDARC_SAVE_SELENIUM
will be forced to beTrue
.See also
See
darc.db
for more information about link database.
-
DARC_SAVE_REQUESTS
¶ -
Whether to save
crawler()
crawled links back to therequests
database.See also
See
darc.db
for more information about link database.
-
DARC_SAVE_SELENIUM
¶ -
Whether to save
loader()
crawled links back to theselenium
database.See also
See
darc.db
for more information about link database.
-
TIME_CACHE
¶ - Type
- Default
60
Time delta for caches in seconds.
The
darc
project supports caching for fetched files.TIME_CACHE
will specify for how long the fetched files will be cached and NOT fetched again.Note
If
TIME_CACHE
isNone
then caching will be marked as forever.
-
SE_WAIT
¶ - Type
- Default
60
Time to wait for
selenium
to finish loading pages.Note
Internally,
selenium
will wait for the browser to finish loading the pages before return (i.e. the web API eventDOMContentLoaded
). However, some extra scripts may take more time running after the event.
-
CHROME_BINARY_LOCATION
¶ - Type
- Default
google-chrome
Path to the Google Chrome binary location.
Note
This environment variable is mandatory on systems other than macOS and Linux.
See also
See
darc.selenium
for more information.
White / Black Lists¶
-
LINK_WHITE_LIST
¶ - Type
List[str]
(JSON)- Default
[]
White list of hostnames that should be crawled.
Note
Regular expressions are supported.
-
LINK_BLACK_LIST
¶ - Type
List[str]
(JSON)- Default
[]
Black list of hostnames that should be crawled.
Note
Regular expressions are supported.
-
LINK_FALLBACK
¶ -
Fallback value for
match_host()
.
-
MIME_WHITE_LIST
¶ - Type
List[str]
(JSON)- Default
[]
White list of content types that should be crawled.
Note
Regular expressions are supported.
-
MIME_BLACK_LIST
¶ - Type
List[str]
(JSON)- Default
[]
Black list of content types that should be crawled.
Note
Regular expressions are supported.
-
MIME_FALLBACK
¶ -
Fallback value for
match_mime()
.
-
PROXY_WHITE_LIST
¶ - Type
List[str]
(JSON)- Default
[]
White list of proxy types that should be crawled.
Note
The proxy types are case insensitive.
-
PROXY_BLACK_LIST
¶ - Type
List[str]
(JSON)- Default
[]
Black list of proxy types that should be crawled.
Note
The proxy types are case insensitive.
-
PROXY_FALLBACK
¶ -
Fallback value for
match_proxy()
.
Note
If provided,
LINK_WHITE_LIST
, LINK_BLACK_LIST
,
MIME_WHITE_LIST
, MIME_BLACK_LIST
,
PROXY_WHITE_LIST
and PROXY_BLACK_LIST
should all be JSON encoded strings.
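For instance, a hedged sketch of how such a variable might be parsed on the Python side (the pattern value is illustrative):

import json
import os
import re

raw = os.getenv('LINK_WHITE_LIST', '[]')  # e.g. '["(.*\\.)?example\\.com"]'
patterns = [re.compile(item) for item in json.loads(raw)]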
Data Submission¶
-
API_NEW_HOST
¶ -
API URL for
submit_new_host()
.
-
API_REQUESTS
¶ -
API URL for
submit_requests()
.
-
API_SELENIUM
¶ -
API URL for
submit_selenium()
.
Note
If API_NEW_HOST
, API_REQUESTS
and/or API_SELENIUM
is None
, the corresponding
submit function will save the JSON data in the path
specified by PATH_DATA
.
Tor Proxy Configuration¶
-
TOR_PASS
¶ -
Tor controller authentication token.
Note
If not provided, it will be requested at runtime.
-
TOR_WAIT
¶ - Type
- Default
90
Time after which the attempt to start Tor is aborted.
Note
If not provided, there will be NO timeouts.
-
TOR_CFG
¶ - Type
Dict[str, Any]
(JSON)- Default
{}
Tor bootstrap configuration for
stem.process.launch_tor_with_config()
.Note
If provided, it should be a JSON encoded string.
I2P Proxy Configuration¶
-
I2P_WAIT
¶ - Type
- Default
90
Time after which the attempt to start I2P is aborted.
Note
If not provided, there will be NO timeouts.
-
I2P_ARGS
¶ - Type
str
(Shell)- Default
''
I2P bootstrap arguments for
i2prouter start
.If provided, it should be parsed as command line arguments (c.f.
shlex.split()
).Note
The command will be run as
DARC_USER
, if current user (c.f.getpass.getuser()
) is root.
ZeroNet Proxy Configuration¶
-
ZERONET_WAIT
¶ - Type
- Default
90
Time after which the attempt to start ZeroNet is aborted.
Note
If not provided, there will be NO timeouts.
-
ZERONET_ARGS
¶ - Type
str
(Shell)- Default
''
ZeroNet bootstrap arguments for
ZeroNet.sh main
.Note
If provided, it should be parsed as command line arguments (c.f.
shlex.split()
).
Freenet Proxy Configuration¶
-
FREENET_WAIT
¶ - Type
- Default
90
Time after which the attempt to start Freenet is aborted.
Note
If not provided, there will be NO timeouts.
-
FREENET_ARGS
¶ - Type
str
(Shell)- Default
''
Freenet bootstrap arguments for
run.sh start
.If provided, it should be parsed as command line arguments (c.f.
shlex.split()
).Note
The command will be run as
DARC_USER
, if current user (c.f.getpass.getuser()
) is root.
Customisations¶
Currently, darc
provides three major customisation points, besides the
various environment variables.
Hooks between Rounds¶
See also
See darc.process.register()
for technical information.
As the workers are defined as indefinite loops, we introduced hooks between rounds, which are called at the end of each loop. Such hook functions can process all links that were crawled and/or loaded in the past round, or indicate the end of the indefinite loop, so that the workers can be stopped in an elegant way.
A typical hook function can be defined as follows:
from darc.error import WorkerBreak
from darc.process import register

def dummy_hook(node_type, link_pool):
    """A sample hook function that prints the processed links
    in the past round and informs the worker to quit.

    Args:
        node_type (Literal['crawler', 'loader']): Type of worker node.
        link_pool (List[darc.link.Link]): List of processed links.

    Returns:
        NoReturn: The hook function will never return, though return
            values will be ignored anyway.

    Raises:
        darc.error.WorkerBreak: Inform the worker to quit after this round.

    """
    if node_type == 'crawler':
        verb = 'crawled'
    elif node_type == 'loader':
        verb = 'loaded'
    else:
        raise ValueError('unknown type of worker node: %s' % node_type)

    for link in link_pool:
        print('We just %s the link: %s' % (verb, link.url))
    raise WorkerBreak

# register the hook function
register(dummy_hook)
Custom Proxy¶
See also
See darc.proxy.register()
for technical information.
Sometimes, we need proxies to connect to certain targets, such as the Tor
network and I2P proxy. darc
decides if it needs to use a proxy for
connection based on the proxy
value of the target
link.
By default, darc
uses no proxy for requests
sessions
and selenium
drivers. However, you may use your own proxies by
registering and/or customising the corresponding factory functions.
A typical factory function pair (e.g., for Socks5 proxy) can be defined as follows:
import requests
import requests_futures.sessions
import selenium.webdriver
import selenium.webdriver.common.proxy

from darc.const import DARC_CPU
from darc.proxy import register
from darc.requests import default_user_agent
from darc.selenium import BINARY_LOCATION

def socks5_session(futures=False):
    """Socks5 proxy session.

    Args:
        futures: If returns a :class:`requests_futures.FuturesSession`.

    Returns:
        Union[requests.Session, requests_futures.FuturesSession]:
        The session object with Socks5 proxy settings.

    """
    if futures:
        session = requests_futures.sessions.FuturesSession(max_workers=DARC_CPU)
    else:
        session = requests.Session()

    session.headers['User-Agent'] = default_user_agent(proxy='Socks5')
    session.proxies.update({
        'http': 'socks5h://localhost:9293',
        'https': 'socks5h://localhost:9293',
    })
    return session
def socks5_driver():
"""Socks5 proxy driver.
Returns:
selenium.webdriver.Chrome: The web driver object with Socks5 proxy settings.
"""
options = selenium.webdriver.ChromeOptions()
options.binary_location = BINARY_LOCATION
options.add_argument('--proxy-server=socks5://localhost:9293')
options.add_argument('--host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE localhost"')
proxy = selenium.webdriver.Proxy()
proxy.proxyType = selenium.webdriver.common.proxy.ProxyType.MANUAL
proxy.http_proxy = 'socks5://localhost:9293'
proxy.ssl_proxy = 'socks5://localhost:9293'
capabilities = selenium.webdriver.DesiredCapabilities.CHROME.copy()
proxy.add_to_capabilities(capabilities)
driver = selenium.webdriver.Chrome(options=options,
desired_capabilities=capabilities)
return driver
# register proxy
register('socks5', socks5_session, socks5_driver)
Sites Customisation¶
See also
See darc.sites.register()
for technical information.
Since websites may require authentication and/or anti-robot checks,
we need to insert certain cookies, animate some user interactions to
bypass such requirements. darc
decides which customisation to
use based on the hostname, i.e. host
value of
the target link.
By default, darc
uses darc.sites.default
as the no op
for both requests
sessions and selenium
drivers. However,
you may use your own sites customisation by registering and/or customising
the corresponding classes, which inherited from BaseSite
.
A typical sites customisation class (for better demonstration) can be defined as follows:
import time
from darc.const import SE_WAIT
from darc.sites import BaseSite, register
class MySite(BaseSite):
"""This is a site customisation class for demonstration purpose.
You may implement a module as well should you prefer."""
#: List[str]: Hostnames the sites customisation is designed for.
hostname = ['mysite.com', 'www.mysite.com']
@staticmethod
def crawler(session, link):
"""Crawler hook for my site.
Args:
session (requests.Session): Session object with proxy settings.
link (darc.link.Link): Link object to be crawled.
Returns:
requests.Response: The final response object with crawled data.
"""
# inject cookies
session.cookies.set('SessionID', 'fake-session-id-value')
response = session.get(link.url, allow_redirects=True)
return response
@staticmethod
def loader(driver, link):
"""Loader hook for my site.
Args:
driver (selenium.webdriver.Chrome): Web driver object with proxy settings.
link (darc.link.Link): Link object to be loaded.
Returns:
selenium.webdriver.Chrome: The web driver object with loaded data.
"""
# land on login page
driver.get('https://%s/login' % link.host)
# animate login attempt
form = driver.find_element_by_id('login-form')
form.find_element_by_id('username').send_keys('admin')
form.find_element_by_id('password').send_keys('p@ssd')
form.click()
driver.get(link.url)
# wait for page to finish loading
if SE_WAIT is not None:
time.sleep(SE_WAIT)
return driver
# register sites
register(MySite)
Important
Please note that you may raise darc.error.LinkNoReturn in the crawler and/or loader methods to indicate that such a link should be ignored and removed from the task queues, as done in darc.sites.data.
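For instance, a minimal sketch (the hostname below is hypothetical) that drops every link of an unwanted host from both task queues:
from darc.error import LinkNoReturn
from darc.sites import BaseSite, register

class DropSite(BaseSite):
    """Sites customisation that discards all matching links."""

    #: List[str]: Hostnames the sites customisation is designed for.
    hostname = ['unwanted.example.com']

    @staticmethod
    def crawler(session, link):
        # remove the link from the ``requests`` task queue
        raise LinkNoReturn

    @staticmethod
    def loader(driver, link):
        # remove the link from the ``selenium`` task queue
        raise LinkNoReturn

# register sites
register(DropSite)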
Docker Integration¶
The darc project is integrated with Docker and Docker Compose. Though published to Docker Hub, you can still build the image by yourself.
Important
The debug image contains miscellaneous documents, i.e. the whole repository, and comes with some useful debugging tools pre-installed, such as IPython, etc.
The Docker image is based on Ubuntu Bionic (18.04 LTS). It sets up all Python dependencies for the darc project, installs Google Chrome (version 79.0.3945.36) and the corresponding ChromeDriver, and installs and configures the Tor, I2P, ZeroNet, FreeNet and NoIP proxies.
Note
NoIP is currently not fully integrated into darc due to a misunderstanding in its configuration process. Contributions are welcome.
When building the image, there is an optional build argument for setting up a non-root user, c.f. the environment variable DARC_USER and the module constant DARC_USER. By default, the username is darc.
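For example, assuming the repository root as working directory, the image may be built locally with a custom username (the tag name is arbitrary):
docker build --build-arg DARC_USER=darc --tag darc:latest .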
Content of Dockerfile
FROM ubuntu:bionic
LABEL Name=darc \
Version=0.8.0
STOPSIGNAL SIGINT
HEALTHCHECK --interval=1h --timeout=1m \
CMD wget https://httpbin.org/get -O /dev/null || exit 1
ARG DARC_USER="darc"
ENV LANG="C.UTF-8" \
LC_ALL="C.UTF-8" \
PYTHONIOENCODING="UTF-8" \
DEBIAN_FRONTEND="teletype" \
DARC_USER="${DARC_USER}"
# DEBIAN_FRONTEND="noninteractive"
COPY extra/retry.sh /usr/local/bin/retry
COPY extra/install.py /usr/local/bin/pty-install
COPY vendor/jdk-11.0.8_linux-x64_bin.tar.gz /var/cache/oracle-jdk11-installer-local/
RUN set -x \
&& retry apt-get update \
&& retry apt-get install --yes --no-install-recommends \
apt-utils \
&& retry apt-get install --yes --no-install-recommends \
gcc \
g++ \
libmagic1 \
make \
software-properties-common \
tar \
unzip \
zlib1g-dev \
&& retry add-apt-repository ppa:deadsnakes/ppa --yes \
&& retry add-apt-repository ppa:linuxuprising/java --yes \
&& retry add-apt-repository ppa:i2p-maintainers/i2p --yes
RUN retry apt-get update \
&& retry apt-get install --yes --no-install-recommends \
python3.8 \
python3-pip \
python3-setuptools \
python3-wheel \
&& ln -sf /usr/bin/python3.8 /usr/local/bin/python3
RUN retry pty-install --stdin '6\n70' apt-get install --yes --no-install-recommends \
tzdata \
&& retry pty-install --stdin 'yes' apt-get install --yes \
oracle-java11-installer-local
RUN retry apt-get install --yes --no-install-recommends \
sudo \
&& adduser --disabled-password --gecos '' ${DARC_USER} \
&& adduser ${DARC_USER} sudo \
&& echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
## Tor
RUN retry apt-get install --yes --no-install-recommends tor
COPY extra/torrc.bionic /etc/tor/torrc
## I2P
RUN retry apt-get install --yes --no-install-recommends i2p
COPY extra/i2p.bionic /etc/defaults/i2p
## ZeroNet
COPY vendor/ZeroNet-linux-dist-linux64.tar.gz /tmp
RUN set -x \
&& cd /tmp \
&& tar xvpfz ZeroNet-linux-dist-linux64.tar.gz \
&& mv ZeroNet-linux-dist-linux64 /usr/local/src/zeronet
COPY extra/zeronet.bionic.conf /usr/local/src/zeronet/zeronet.conf
## FreeNet
USER darc
COPY vendor/new_installer_offline.jar /tmp
RUN set -x \
&& cd /tmp \
&& ( pty-install --stdin '/home/darc/freenet\n1' java -jar new_installer_offline.jar || true ) \
&& sudo mv /home/darc/freenet /usr/local/src/freenet
USER root
## NoIP
COPY vendor/noip-duc-linux.tar.gz /tmp
RUN set -x \
&& cd /tmp \
&& tar xvpfz noip-duc-linux.tar.gz \
&& mv noip-2.1.9-1 /usr/local/src/noip \
&& cd /usr/local/src/noip \
&& make
# && make install
# # set up timezone
# RUN echo 'Asia/Shanghai' > /etc/timezone \
# && rm -f /etc/localtime \
# && ln -snf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
# && dpkg-reconfigure -f noninteractive tzdata
COPY vendor/chromedriver_linux64.zip \
vendor/google-chrome-stable_current_amd64.deb /tmp/
RUN set -x \
## ChromeDriver
&& unzip -d /usr/bin /tmp/chromedriver_linux64.zip \
&& which chromedriver \
## Google Chrome
&& ( dpkg --install /tmp/google-chrome-stable_current_amd64.deb || true ) \
&& retry apt-get install --fix-broken --yes --no-install-recommends \
&& dpkg --install /tmp/google-chrome-stable_current_amd64.deb \
&& which google-chrome
# Using pip:
COPY requirements.txt /tmp
RUN python3 -m pip install -r /tmp/requirements.txt --no-cache-dir
RUN set -x \
&& rm -rf \
## APT repository lists
/var/lib/apt/lists/* \
## Python dependencies
/tmp/requirements.txt \
/tmp/pip \
## ChromeDriver
/tmp/chromedriver_linux64.zip \
## Google Chrome
/tmp/google-chrome-stable_current_amd64.deb \
## Vendors
/tmp/new_installer_offline.jar \
/tmp/noip-duc-linux.tar.gz \
/tmp/ZeroNet-linux-dist-linux64.tar.gz \
#&& apt-get remove --auto-remove --yes \
# software-properties-common \
# unzip \
&& apt-get autoremove -y \
&& apt-get autoclean \
&& apt-get clean
ENTRYPOINT [ "python3", "-m", "darc" ]
#ENTRYPOINT [ "bash", "/app/run.sh" ]
CMD [ "--help" ]
WORKDIR /app
COPY darc/ /app/darc/
COPY LICENSE \
MANIFEST.in \
README.rst \
extra/run.sh \
setup.cfg \
setup.py \
test_darc.py /app/
RUN python3 -m pip install -e .
Note
retry is a shell script that retries the given command until it succeeds.
Content of retry
#!/usr/bin/env bash
while true; do
>&2 echo "+ $@"
$@ && break
>&2 echo "exit: $?"
done
>&2 echo "exit: 0"
pty-install is a Python script simulating user input for APT package installation with DEBIAN_FRONTEND set as Teletype.
Content of pty-install
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Install packages requiring interactions."""
import argparse
import os
import subprocess
import sys
import tempfile
def get_parser():
"""Argument parser."""
parser = argparse.ArgumentParser('install',
description='pseudo-interactive package installer')
parser.add_argument('-i', '--stdin', help='content for input')
parser.add_argument('command', nargs=argparse.REMAINDER, help='command to execute')
return parser
def main():
"""Entrypoint."""
parser = get_parser()
args = parser.parse_args()
text = args.stdin.encode().decode('unicode_escape')
path = tempfile.mktemp(prefix='install-')
with open(path, 'w') as file:
file.write(text)
with open(path, 'r') as file:
proc = subprocess.run(args.command, stdin=file) # pylint: disable=subprocess-run-check
os.remove(path)
return proc.returncode
if __name__ == "__main__":
sys.exit(main())
As always, you can also use Docker Compose to manage the darc
image. Environment variables can be set as described in the
configuration section.
Content of docker-compose.yml
version: '3'
services:
crawler:
image: jsnbzh/darc:latest
build: &build
context: .
args:
# non-root user
DARC_USER: "darc"
container_name: crawler
#entrypoint: [ "bash", "/app/run.sh" ]
command: [ "--type", "crawler",
"--file", "/app/text/tor.txt",
"--file", "/app/text/tor2web.txt",
"--file", "/app/text/i2p.txt",
"--file", "/app/text/zeronet.txt",
"--file", "/app/text/freenet.txt" ]
environment:
## [PYTHON] force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# reboot mode
DARC_REBOOT: 0
# debug mode
DARC_DEBUG: 0
# verbose mode
DARC_VERBOSE: 1
# force mode (ignore robots.txt)
DARC_FORCE: 1
# check mode (check proxy and hostname before crawling)
DARC_CHECK: 1
# check mode (check content type before crawling)
DARC_CHECK_CONTENT_TYPE: 0
# save mode
DARC_SAVE: 0
# save mode (for requests)
      DARC_SAVE_REQUESTS: 0
# save mode (for selenium)
      DARC_SAVE_SELENIUM: 0
# processes
DARC_CPU: 16
# multiprocessing
DARC_MULTIPROCESSING: 1
# multithreading
DARC_MULTITHREADING: 0
# time lapse
DARC_WAIT: 60
# bulk size
DARC_BULK_SIZE: 1000
# data storage
PATH_DATA: "data"
      # save data submission
SAVE_DB: 0
# Redis URL
REDIS_URL: 'redis://:UCf7y123aHgaYeGnvLRasALjFfDVHGCz6KiR5Z0WC0DL4ExvSGw5SkcOxBywc0qtZBHVrSVx2QMGewXNP6qVow@redis'
# database URL
#DB_URL: 'mysql://root:b8y9dpz3MJSQtwnZIW77ydASBOYfzA7HJfugv77wLrWQzrjCx5m3spoaiqRi4kU52syYy2jxJZR3U2kwPkEVTA@db'
# max pool
DARC_MAX_POOL: 10
# Tor proxy & control port
TOR_PORT: 9050
TOR_CTRL: 9051
# Tor management method
TOR_STEM: 1
# Tor authentication
TOR_PASS: "16:B9D36206B5374B3F609045F9609EE670F17047D88FF713EFB9157EA39F"
# Tor bootstrap retry
TOR_RETRY: 10
# Tor bootstrap wait
TOR_WAIT: 90
# Tor bootstrap config
TOR_CFG: "{}"
# I2P port
I2P_PORT: 4444
# I2P bootstrap retry
I2P_RETRY: 10
# I2P bootstrap wait
I2P_WAIT: 90
# I2P bootstrap config
I2P_ARGS: ""
# ZeroNet port
ZERONET_PORT: 43110
# ZeroNet bootstrap retry
ZERONET_RETRY: 10
# ZeroNet project path
ZERONET_PATH: "/usr/local/src/zeronet"
# ZeroNet bootstrap wait
ZERONET_WAIT: 90
# ZeroNet bootstrap config
ZERONET_ARGS: ""
# Freenet port
FREENET_PORT: 8888
# Freenet bootstrap retry
FREENET_RETRY: 0
# Freenet project path
FREENET_PATH: "/usr/local/src/freenet"
# Freenet bootstrap wait
FREENET_WAIT: 90
# Freenet bootstrap config
FREENET_ARGS: ""
# time delta for caches in seconds
TIME_CACHE: 2_592_000 # 30 days
# time to wait for selenium
SE_WAIT: 5
# extract link pattern
LINK_WHITE_LIST: '[
".*?\\.onion",
".*?\\.i2p", "127\\.0\\.0\\.1:7657", "localhost:7657", "127\\.0\\.0\\.1:7658", "localhost:7658",
"127\\.0\\.0\\.1:43110", "localhost:43110",
"127\\.0\\.0\\.1:8888", "localhost:8888"
]'
# link black list
LINK_BLACK_LIST: '[ "(.*\\.)?facebookcorewwwi\\.onion", "(.*\\.)?nytimes3xbfgragh\\.onion" ]'
# link fallback flag
LINK_FALLBACK: 1
# content type white list
MIME_WHITE_LIST: '[ "text/html", "application/xhtml+xml" ]'
# content type black list
MIME_BLACK_LIST: '[ "text/css", "application/javascript", "text/json" ]'
# content type fallback flag
MIME_FALLBACK: 0
# proxy type white list
PROXY_WHITE_LIST: '[ "tor", "i2p", "freenet", "zeronet", "tor2web" ]'
# proxy type black list
PROXY_BLACK_LIST: '[ "null", "data" ]'
# proxy type fallback flag
PROXY_FALLBACK: 0
# API retry times
API_RETRY: 10
# API URLs
#API_NEW_HOST: 'https://example.com/api/new_host'
#API_REQUESTS: 'https://example.com/api/requests'
#API_SELENIUM: 'https://example.com/api/selenium'
restart: "always"
networks: &networks
- darc
volumes: &volumes
- ./text:/app/text
- ./extra:/app/extra
- /data/darc:/app/data
loader:
image: jsnbzh/darc:latest
build: *build
container_name: loader
#entrypoint: [ "bash", "/app/run.sh" ]
command: [ "--type", "loader" ]
environment:
## [PYTHON] force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# reboot mode
DARC_REBOOT: 0
# debug mode
DARC_DEBUG: 0
# verbose mode
DARC_VERBOSE: 1
# force mode (ignore robots.txt)
DARC_FORCE: 1
# check mode (check proxy and hostname before crawling)
DARC_CHECK: 1
# check mode (check content type before crawling)
DARC_CHECK_CONTENT_TYPE: 0
# save mode
DARC_SAVE: 0
# save mode (for requests)
      DARC_SAVE_REQUESTS: 0
# save mode (for selenium)
      DARC_SAVE_SELENIUM: 0
# processes
DARC_CPU: 1
# multiprocessing
DARC_MULTIPROCESSING: 0
# multithreading
DARC_MULTITHREADING: 0
# time lapse
DARC_WAIT: 60
# data storage
PATH_DATA: "data"
# Redis URL
REDIS_URL: 'redis://:UCf7y123aHgaYeGnvLRasALjFfDVHGCz6KiR5Z0WC0DL4ExvSGw5SkcOxBywc0qtZBHVrSVx2QMGewXNP6qVow@redis'
# database URL
#DB_URL: 'mysql://root:b8y9dpz3MJSQtwnZIW77ydASBOYfzA7HJfugv77wLrWQzrjCx5m3spoaiqRi4kU52syYy2jxJZR3U2kwPkEVTA@db'
# max pool
DARC_MAX_POOL: 10
      # save data submission
SAVE_DB: 0
# Tor proxy & control port
TOR_PORT: 9050
TOR_CTRL: 9051
# Tor management method
TOR_STEM: 1
# Tor authentication
TOR_PASS: "16:B9D36206B5374B3F609045F9609EE670F17047D88FF713EFB9157EA39F"
# Tor bootstrap retry
TOR_RETRY: 10
# Tor bootstrap wait
TOR_WAIT: 90
# Tor bootstrap config
TOR_CFG: "{}"
# I2P port
I2P_PORT: 4444
# I2P bootstrap retry
I2P_RETRY: 10
# I2P bootstrap wait
I2P_WAIT: 90
# I2P bootstrap config
I2P_ARGS: ""
# ZeroNet port
ZERONET_PORT: 43110
# ZeroNet bootstrap retry
ZERONET_RETRY: 10
# ZeroNet project path
ZERONET_PATH: "/usr/local/src/zeronet"
# ZeroNet bootstrap wait
ZERONET_WAIT: 90
# ZeroNet bootstrap config
ZERONET_ARGS: ""
# Freenet port
FREENET_PORT: 8888
# Freenet bootstrap retry
FREENET_RETRY: 0
# Freenet project path
FREENET_PATH: "/usr/local/src/freenet"
# Freenet bootstrap wait
FREENET_WAIT: 90
# Freenet bootstrap config
FREENET_ARGS: ""
# time delta for caches in seconds
TIME_CACHE: 2_592_000 # 30 days
# time to wait for selenium
SE_WAIT: 5
# extract link pattern
LINK_WHITE_LIST: '[
".*?\\.onion",
".*?\\.i2p", "127\\.0\\.0\\.1:7657", "localhost:7657", "127\\.0\\.0\\.1:7658", "localhost:7658",
"127\\.0\\.0\\.1:43110", "localhost:43110",
"127\\.0\\.0\\.1:8888", "localhost:8888"
]'
# link black list
LINK_BLACK_LIST: '[ "(.*\\.)?facebookcorewwwi\\.onion", "(.*\\.)?nytimes3xbfgragh\\.onion" ]'
# link fallback flag
LINK_FALLBACK: 1
# content type white list
MIME_WHITE_LIST: '[ "text/html", "application/xhtml+xml" ]'
# content type black list
MIME_BLACK_LIST: '[ "text/css", "application/javascript", "text/json" ]'
# content type fallback flag
MIME_FALLBACK: 0
# proxy type white list
PROXY_WHITE_LIST: '[ "tor", "i2p", "freenet", "zeronet", "tor2web" ]'
# proxy type black list
PROXY_BLACK_LIST: '[ "null", "data" ]'
# proxy type fallback flag
PROXY_FALLBACK: 0
# API retry times
API_RETRY: 10
# API URLs
#API_NEW_HOST: 'https://example.com/api/new_host'
#API_REQUESTS: 'https://example.com/api/requests'
#API_SELENIUM: 'https://example.com/api/selenium'
restart: "always"
networks: *networks
volumes: *volumes
# network settings
networks:
darc:
driver: bridge
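With the compose file above, both services can then be brought up in the background with, e.g.:
docker-compose up --build -d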
Note
Should you wish to run darc in reboot mode, i.e. set DARC_REBOOT and/or REBOOT as True, you may wish to change the entrypoint to bash /app/run.sh, where run.sh is a shell script that wraps around darc especially for reboot mode.
Content of run.sh
#!/usr/bin/env bash
set -e
# time lapse
WAIT=${DARC_WAIT=10}
# signal handlers
trap '[ -f ${PATH_DATA}/darc.pid ] && kill -2 $(cat ${PATH_DATA}/darc.pid)' SIGINT SIGTERM
# initialise
echo "+ Starting application..."
python3 -m darc $@
sleep ${WAIT}
# mainloop
while true; do
echo "+ Restarting application..."
python3 -m darc
sleep ${WAIT}
done
In such a scenario, you can customise your run.sh to, for instance, archive and upload the data crawled by darc to somewhere else, so as to save disk space.
Web Backend Demo¶
This is a demo of the API for communication between the darc crawlers (darc.submit) and the web UI.
See also
Please refer to the data schema for more information about the submission data.
Assuming the web UI is developed using the Flask microframework.
# -*- coding: utf-8 -*-
import sys
import flask # pylint: disable=import-error
# Flask application
app = flask.Flask(__file__)
@app.route('/api/new_host', methods=['POST'])
def new_host():
"""When a new host is discovered, the :mod:`darc` crawler will submit the
host information. Such includes ``robots.txt`` (if exists) and
``sitemap.xml`` (if any).
Data format::
{
// partial flag - true / false
"$PARTIAL$": ...,
// force flag - true / false
"$FORCE$": ...,
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// robots.txt from the host (if not exists, then ``null``)
"Robots": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/robots.txt
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
// sitemaps from the host (if none, then ``null``)
"Sitemaps": [
{
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/sitemap_<name>.xml
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
...
],
// hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)
"Hosts": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/hosts.txt
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
}
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
@app.route('/api/requests', methods=['POST'])
def from_requests():
"""When crawling, we'll first fetch the URl using ``requests``, to check
its availability and to save its HTTP headers information. Such information
will be submitted to the web UI.
Data format::
{
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// request method
"Method": "GET",
// response status code
"Status-Code": ...,
// response reason
"Reason": ...,
// response cookies (if any)
"Cookies": {
...
},
// session cookies (if any)
"Session": {
...
},
// request headers (if any)
"Request": {
...
},
// response headers (if any)
"Response": {
...
},
// content type
"Content-Type": ...,
// requested file (if not exists, then ``null``)
"Document": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html
// or if the document is of generic content type, i.e. not HTML
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
// redirection history (if any)
"History": [
// same records as the original response
{"...": "..."}
]
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
@app.route('/api/selenium', methods=['POST'])
def from_selenium():
"""After crawling with ``requests``, we'll then render the URl using
``selenium`` with Google Chrome and its driver, to provide a fully rendered
web page. Such information will be submitted to the web UI.
Note:
This information is optional, only provided if the content type from
``requests`` is HTML, status code < 400, and HTML data not empty.
Data format::
{
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// rendered HTML document (if not exists, then ``null``)
"Document": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.html
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
// web page screenshot (if not exists, then ``null``)
"Screenshot": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.png
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
}
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
if __name__ == "__main__":
sys.exit(app.run()) # type: ignore
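In the handlers above, every file payload shares the same {"path": ..., "data": ...} shape with base64 encoded content, so it may be decoded along these lines (a sketch; error handling omitted):
import base64

def decode_document(document):
    """Decode a ``{"path": ..., "data": ...}`` payload into its file content."""
    if document is None:  # e.g. missing ``robots.txt``
        return None, None
    return document['path'], base64.b64decode(document['data'])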
Data Models Demo¶
This is a demo of data models for database storage of
the submitted data from the darc
crawlers.
Assuming the database layer uses peewee as the ORM and MySQL as the backend.
# -*- coding: utf-8 -*-
import datetime
import os
import peewee
import playhouse.db_url
import playhouse.shortcuts
# database client
DB = playhouse.db_url.connect(os.getenv('DB_URL', 'mysql://127.0.0.1'))
def table_function(model_class: peewee.Model) -> str:
"""Generate table name dynamically.
The function strips ``Model`` from the class name and
calls :func:`peewee.make_snake_case` to generate a
proper table name.
Args:
model_class: Data model class.
Returns:
Generated table name.
"""
name: str = model_class.__name__
if name.endswith('Model'):
name = name[:-5] # strip ``Model`` suffix
return peewee.make_snake_case(name)
class BaseMeta:
"""Basic metadata for data models."""
#: Reference database storage (c.f. :class:`~darc.const.DB`).
database = DB
#: Generate table name dynamically (c.f. :func:`~darc.model.table_function`).
table_function = table_function
class BaseModel(peewee.Model):
"""Base model with standard patterns.
Notes:
The model will implicitly have a :class:`~peewee.AutoField`
attribute named as :attr:`id`.
"""
#: Basic metadata for data models.
Meta = BaseMeta
def to_dict(self, keep_id: bool = False):
"""Convert record to :obj:`dict`.
Args:
keep_id: If keep the ID auto field.
Returns:
The data converted through :func:`playhouse.shortcuts.model_to_dict`.
"""
data = playhouse.shortcuts.model_to_dict(self)
if keep_id:
return data
if 'id' in data:
del data['id']
return data
class HostnameModel(BaseModel):
"""Data model for a hostname record."""
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
hostname: str = peewee.TextField()
#: Proxy type (c.f. :attr:`link.proxy <darc.link.Link.proxy>`).
proxy: str = peewee.CharField(max_length=8)
#: Timestamp of first ``new_host`` submission.
discovery: datetime.datetime = peewee.DateTimeField()
#: Timestamp of last related submission.
last_seen: datetime.datetime = peewee.DateTimeField()
class RobotsModel(BaseModel):
"""Data model for ``robots.txt`` data."""
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
host: HostnameModel = peewee.ForeignKeyField(HostnameModel, backref='robots')
#: Timestamp of the submission.
timestamp: datetime.datetime = peewee.DateTimeField()
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
class SitemapModel(BaseModel):
"""Data model for ``sitemap.xml`` data."""
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
host: HostnameModel = peewee.ForeignKeyField(HostnameModel, backref='sitemaps')
#: Timestamp of the submission.
timestamp: datetime.datetime = peewee.DateTimeField()
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
class HostsModel(BaseModel):
"""Data model for ``hosts.txt`` data."""
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
host: HostnameModel = peewee.ForeignKeyField(HostnameModel, backref='hosts')
#: Timestamp of the submission.
timestamp: datetime.datetime = peewee.DateTimeField()
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
class URLModel(BaseModel):
"""Data model for a requested URL."""
#: Timestamp of last related submission.
last_seen: datetime.datetime = peewee.DateTimeField()
#: Original URL (c.f. :attr:`link.url <darc.link.Link.url>`).
url: str = peewee.TextField()
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
host: HostnameModel = peewee.ForeignKeyField(HostnameModel, backref='urls')
#: Proxy type (c.f. :attr:`link.proxy <darc.link.Link.proxy>`).
proxy: str = peewee.CharField(max_length=8)
#: Base path (c.f. :attr:`link.base <darc.link.Link.base>`).
base: str = peewee.CharField()
#: Link hash (c.f. :attr:`link.name <darc.link.Link.name>`).
name: str = peewee.FixedCharField(max_length=64)
class RequestsDocumentModel(BaseModel):
"""Data model for documents from ``requests`` submission."""
#: Original URL (c.f. :attr:`link.url <darc.link.Link.url>`).
url: URLModel = peewee.ForeignKeyField(URLModel, backref='requests')
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
class SeleniumDocumentModel(BaseModel):
"""Data model for documents from ``selenium`` submission."""
#: Original URL (c.f. :attr:`link.url <darc.link.Link.url>`).
url: URLModel = peewee.ForeignKeyField(URLModel, backref='selenium')
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
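With these models in place, and continuing the script above, the tables can be created and a record stored roughly as follows (a sketch; field values are placeholders):
# create tables on first run
DB.create_tables([HostnameModel, RobotsModel, SitemapModel, HostsModel,
                  URLModel, RequestsDocumentModel, SeleniumDocumentModel])

# record a newly discovered host
now = datetime.datetime.now()
host = HostnameModel.create(hostname='example.onion', proxy='tor',
                            discovery=now, last_seen=now)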
Submission Data Schema¶
To better describe the submitted data, darc provides several JSON schemas generated from pydantic models.
New Host Submission¶
The data submission from darc.submit.submit_new_host()
.
{
"title": "new_host",
"description": "Data submission from :func:`darc.submit.submit_new_host`.",
"type": "object",
"properties": {
"$PARTIAL$": {
"title": "$Partial$",
"description": "partial flag - true / false",
"type": "boolean"
},
"$RELOAD$": {
"title": "$Reload$",
"description": "reload flag - true / false",
"type": "boolean"
},
"[metadata]": {
"title": "[Metadata]",
"description": "metadata of URL",
"allOf": [
{
"$ref": "#/definitions/Metadata"
}
]
},
"Timestamp": {
"title": "Timestamp",
"description": "requested timestamp in ISO format as in name of saved file",
"type": "string",
"format": "date-time"
},
"URL": {
"title": "Url",
"description": "original URL",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"Robots": {
"title": "Robots",
"description": "robots.txt from the host (if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/RobotsDocument"
}
]
},
"Sitemaps": {
"title": "Sitemaps",
"description": "sitemaps from the host (if none, then ``null``)",
"type": "array",
"items": {
"$ref": "#/definitions/SitemapDocument"
}
},
"Hosts": {
"title": "Hosts",
"description": "hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/HostsDocument"
}
]
}
},
"required": [
"$PARTIAL$",
"$RELOAD$",
"[metadata]",
"Timestamp",
"URL"
],
"definitions": {
"Proxy": {
"title": "Proxy",
"description": "Proxy type.",
"enum": [
"null",
"tor",
"i2p",
"zeronet",
"freenet"
],
"type": "string"
},
"Metadata": {
"title": "metadata",
"description": "Metadata of URL.",
"type": "object",
"properties": {
"url": {
"title": "Url",
"description": "original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"proxy": {
"$ref": "#/definitions/Proxy"
},
"host": {
"title": "Host",
"description": "hostname / netloc, c.f. ``urllib.parse.urlparse``",
"type": "string"
},
"base": {
"title": "Base",
"description": "base folder, relative path (to data root path ``PATH_DATA``) in containter - <proxy>/<scheme>/<host>",
"type": "string"
},
"name": {
"title": "Name",
"description": "sha256 of URL as name for saved files (timestamp is in ISO format) - JSON log as this one: <base>/<name>_<timestamp>.json; - HTML from requests: <base>/<name>_<timestamp>_raw.html; - HTML from selenium: <base>/<name>_<timestamp>.html; - generic data files: <base>/<name>_<timestamp>.dat",
"type": "string"
}
},
"required": [
"url",
"proxy",
"host",
"base",
"name"
]
},
"RobotsDocument": {
"title": "RobotsDocument",
"description": "``robots.txt`` document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/robots.txt",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
},
"SitemapDocument": {
"title": "SitemapDocument",
"description": "Sitemaps document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/sitemap_<name>.xml",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
},
"HostsDocument": {
"title": "HostsDocument",
"description": "``hosts.txt`` document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/hosts.txt",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
}
}
}
Requests Submission¶
The data submission from darc.submit.submit_requests()
.
{
"title": "requests",
"description": "Data submission from :func:`darc.submit.submit_requests`.",
"type": "object",
"properties": {
"$PARTIAL$": {
"title": "$Partial$",
"description": "partial flag - true / false",
"type": "boolean"
},
"[metadata]": {
"title": "[Metadata]",
"description": "metadata of URL",
"allOf": [
{
"$ref": "#/definitions/Metadata"
}
]
},
"Timestamp": {
"title": "Timestamp",
"description": "requested timestamp in ISO format as in name of saved file",
"type": "string",
"format": "date-time"
},
"URL": {
"title": "Url",
"description": "original URL",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"Method": {
"title": "Method",
"description": "request method",
"type": "string"
},
"Status-Code": {
"title": "Status-Code",
"description": "response status code",
"exclusiveMinimum": 0,
"type": "integer"
},
"Reason": {
"title": "Reason",
"description": "response reason",
"type": "string"
},
"Cookies": {
"title": "Cookies",
"description": "response cookies (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Session": {
"title": "Session",
"description": "session cookies (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Request": {
"title": "Request",
"description": "request headers (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Response": {
"title": "Response",
"description": "response headers (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Content-Type": {
"title": "Content-Type",
"description": "content type",
"pattern": "[a-zA-Z0-9.-]+/[a-zA-Z0-9.-]+",
"type": "string"
},
"Document": {
"title": "Document",
"description": "requested file (if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/RequestsDocument"
}
]
},
"History": {
"title": "History",
"description": "redirection history (if any)",
"type": "array",
"items": {
"$ref": "#/definitions/HistoryModel"
}
}
},
"required": [
"$PARTIAL$",
"[metadata]",
"Timestamp",
"URL",
"Method",
"Status-Code",
"Reason",
"Cookies",
"Session",
"Request",
"Response",
"Content-Type",
"History"
],
"definitions": {
"Proxy": {
"title": "Proxy",
"description": "Proxy type.",
"enum": [
"null",
"tor",
"i2p",
"zeronet",
"freenet"
],
"type": "string"
},
"Metadata": {
"title": "metadata",
"description": "Metadata of URL.",
"type": "object",
"properties": {
"url": {
"title": "Url",
"description": "original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"proxy": {
"$ref": "#/definitions/Proxy"
},
"host": {
"title": "Host",
"description": "hostname / netloc, c.f. ``urllib.parse.urlparse``",
"type": "string"
},
"base": {
"title": "Base",
"description": "base folder, relative path (to data root path ``PATH_DATA``) in containter - <proxy>/<scheme>/<host>",
"type": "string"
},
"name": {
"title": "Name",
"description": "sha256 of URL as name for saved files (timestamp is in ISO format) - JSON log as this one: <base>/<name>_<timestamp>.json; - HTML from requests: <base>/<name>_<timestamp>_raw.html; - HTML from selenium: <base>/<name>_<timestamp>.html; - generic data files: <base>/<name>_<timestamp>.dat",
"type": "string"
}
},
"required": [
"url",
"proxy",
"host",
"base",
"name"
]
},
"RequestsDocument": {
"title": "RequestsDocument",
"description": ":mod:`requests` document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html; or if the document is of generic content type, i.e. not HTML - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
},
"HistoryModel": {
"title": "HistoryModel",
"description": ":mod:`requests` history data.",
"type": "object",
"properties": {
"URL": {
"title": "Url",
"description": "original URL",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"Method": {
"title": "Method",
"description": "request method",
"type": "string"
},
"Status-Code": {
"title": "Status-Code",
"description": "response status code",
"exclusiveMinimum": 0,
"type": "integer"
},
"Reason": {
"title": "Reason",
"description": "response reason",
"type": "string"
},
"Cookies": {
"title": "Cookies",
"description": "response cookies (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Session": {
"title": "Session",
"description": "session cookies (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Request": {
"title": "Request",
"description": "request headers (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Response": {
"title": "Response",
"description": "response headers (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Document": {
"title": "Document",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"URL",
"Method",
"Status-Code",
"Reason",
"Cookies",
"Session",
"Request",
"Response",
"Document"
]
}
}
}
Selenium Submission¶
The data submission from darc.submit.submit_selenium()
.
{
"title": "selenium",
"description": "Data submission from :func:`darc.submit.submit_requests`.",
"type": "object",
"properties": {
"$PARTIAL$": {
"title": "$Partial$",
"description": "partial flag - true / false",
"type": "boolean"
},
"[metadata]": {
"title": "[Metadata]",
"description": "metadata of URL",
"allOf": [
{
"$ref": "#/definitions/Metadata"
}
]
},
"Timestamp": {
"title": "Timestamp",
"description": "requested timestamp in ISO format as in name of saved file",
"type": "string",
"format": "date-time"
},
"URL": {
"title": "Url",
"description": "original URL",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"Document": {
"title": "Document",
"description": "rendered HTML document (if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/SeleniumDocument"
}
]
},
"Screenshot": {
"title": "Screenshot",
"description": "web page screenshot (if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/ScreenshotDocument"
}
]
}
},
"required": [
"$PARTIAL$",
"[metadata]",
"Timestamp",
"URL"
],
"definitions": {
"Proxy": {
"title": "Proxy",
"description": "Proxy type.",
"enum": [
"null",
"tor",
"i2p",
"zeronet",
"freenet"
],
"type": "string"
},
"Metadata": {
"title": "metadata",
"description": "Metadata of URL.",
"type": "object",
"properties": {
"url": {
"title": "Url",
"description": "original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"proxy": {
"$ref": "#/definitions/Proxy"
},
"host": {
"title": "Host",
"description": "hostname / netloc, c.f. ``urllib.parse.urlparse``",
"type": "string"
},
"base": {
"title": "Base",
"description": "base folder, relative path (to data root path ``PATH_DATA``) in containter - <proxy>/<scheme>/<host>",
"type": "string"
},
"name": {
"title": "Name",
"description": "sha256 of URL as name for saved files (timestamp is in ISO format) - JSON log as this one: <base>/<name>_<timestamp>.json; - HTML from requests: <base>/<name>_<timestamp>_raw.html; - HTML from selenium: <base>/<name>_<timestamp>.html; - generic data files: <base>/<name>_<timestamp>.dat",
"type": "string"
}
},
"required": [
"url",
"proxy",
"host",
"base",
"name"
]
},
"SeleniumDocument": {
"title": "SeleniumDocument",
"description": ":mod:`selenium` document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/<name>_<timestamp>.html",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
},
"ScreenshotDocument": {
"title": "ScreenshotDocument",
"description": "Screenshot document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/<name>_<timestamp>.png",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
}
}
}
Model Definitions¶
# -*- coding: utf-8 -*-
"""JSON schema generator."""
# pylint: disable=no-member
import enum
import pydantic.schema
import darc.typing as typing
__all__ = ['NewHostModel', 'RequestsModel', 'SeleniumModel']
###############################################################################
# Miscellaneous auxiliaries
###############################################################################
class Proxy(str, enum.Enum):
"""Proxy type."""
null = 'null'
tor = 'tor'
i2p = 'i2p'
zeronet = 'zeronet'
freenet = 'freenet'
class Metadata(pydantic.BaseModel):
"""Metadata of URL."""
url: pydantic.AnyUrl = pydantic.Field(
description='original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>')
proxy: Proxy = pydantic.Field(
description='proxy type - null / tor / i2p / zeronet / freenet')
host: str = pydantic.Field(
description='hostname / netloc, c.f. ``urllib.parse.urlparse``')
base: str = pydantic.Field(
        description=('base folder, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>'))
name: str = pydantic.Field(
description=('sha256 of URL as name for saved files (timestamp is in ISO format) '
'- JSON log as this one: <base>/<name>_<timestamp>.json; '
'- HTML from requests: <base>/<name>_<timestamp>_raw.html; '
'- HTML from selenium: <base>/<name>_<timestamp>.html; '
'- generic data files: <base>/<name>_<timestamp>.dat'))
class Config:
title = 'metadata'
class RobotsDocument(pydantic.BaseModel):
"""``robots.txt`` document data."""
path: str = pydantic.Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/robots.txt'))
data: str = pydantic.Field(
description='content of the file (**base64** encoded)')
class SitemapDocument(pydantic.BaseModel):
"""Sitemaps document data."""
path: str = pydantic.Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/sitemap_<name>.xml'))
data: str = pydantic.Field(
description='content of the file (**base64** encoded)')
class HostsDocument(pydantic.BaseModel):
"""``hosts.txt`` document data."""
path: str = pydantic.Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/hosts.txt'))
data: str = pydantic.Field(
description='content of the file (**base64** encoded)')
class RequestsDocument(pydantic.BaseModel):
""":mod:`requests` document data."""
path: str = pydantic.Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html; '
'or if the document is of generic content type, i.e. not HTML '
'- <proxy>/<scheme>/<host>/<name>_<timestamp>.dat'))
data: str = pydantic.Field(
description='content of the file (**base64** encoded)')
class HistoryModel(pydantic.BaseModel):
""":mod:`requests` history data."""
URL: pydantic.AnyUrl = pydantic.Field(
description='original URL')
Method: str = pydantic.Field(
description='request method')
status_code: pydantic.PositiveInt = pydantic.Field(
alias='Status-Code',
description='response status code')
Reason: str = pydantic.Field(
description='response reason')
Cookies: typing.Cookies = pydantic.Field(
description='response cookies (if any)')
Session: typing.Cookies = pydantic.Field(
description='session cookies (if any)')
Request: typing.Headers = pydantic.Field(
description='request headers (if any)')
Response: typing.Headers = pydantic.Field(
description='response headers (if any)')
Document: str = pydantic.Field(
description='content of the file (**base64** encoded)')
class SeleniumDocument(pydantic.BaseModel):
""":mod:`selenium` document data."""
path: str = pydantic.Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/<name>_<timestamp>.html'))
data: str = pydantic.Field(
description='content of the file (**base64** encoded)')
class ScreenshotDocument(pydantic.BaseModel):
"""Screenshot document data."""
path: str = pydantic.Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/<name>_<timestamp>.png'))
data: str = pydantic.Field(
description='content of the file (**base64** encoded)')
###############################################################################
# JSON schema definitions
###############################################################################
class NewHostModel(pydantic.BaseModel):
"""Data submission from :func:`darc.submit.submit_new_host`."""
partial: bool = pydantic.Field(
alias='$PARTIAL$',
description='partial flag - true / false')
reload: bool = pydantic.Field(
alias='$RELOAD$',
description='reload flag - true / false')
metadata: Metadata = pydantic.Field(
alias='[metadata]',
description='metadata of URL')
Timestamp: typing.Datetime = pydantic.Field(
description='requested timestamp in ISO format as in name of saved file')
URL: pydantic.AnyUrl = pydantic.Field(
description='original URL')
Robots: typing.Optional[RobotsDocument] = pydantic.Field(
description='robots.txt from the host (if not exists, then ``null``)')
Sitemaps: typing.Optional[typing.List[SitemapDocument]] = pydantic.Field(
description='sitemaps from the host (if none, then ``null``)')
Hosts: typing.Optional[HostsDocument] = pydantic.Field(
description='hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)')
class Config:
title = 'new_host'
class RequestsModel(pydantic.BaseModel):
"""Data submission from :func:`darc.submit.submit_requests`."""
partial: bool = pydantic.Field(
alias='$PARTIAL$',
description='partial flag - true / false')
metadata: Metadata = pydantic.Field(
alias='[metadata]',
description='metadata of URL')
Timestamp: typing.Datetime = pydantic.Field(
description='requested timestamp in ISO format as in name of saved file')
URL: pydantic.AnyUrl = pydantic.Field(
description='original URL')
Method: str = pydantic.Field(
description='request method')
status_code: pydantic.PositiveInt = pydantic.Field(
alias='Status-Code',
description='response status code')
Reason: str = pydantic.Field(
description='response reason')
Cookies: typing.Cookies = pydantic.Field(
description='response cookies (if any)')
Session: typing.Cookies = pydantic.Field(
description='session cookies (if any)')
Request: typing.Headers = pydantic.Field(
description='request headers (if any)')
Response: typing.Headers = pydantic.Field(
description='response headers (if any)')
content_type: str = pydantic.Field(
alias='Content-Type',
regex='[a-zA-Z0-9.-]+/[a-zA-Z0-9.-]+',
description='content type')
Document: typing.Optional[RequestsDocument] = pydantic.Field(
description='requested file (if not exists, then ``null``)')
History: typing.List[HistoryModel] = pydantic.Field(
description='redirection history (if any)')
class Config:
title = 'requests'
class SeleniumModel(pydantic.BaseModel):
"""Data submission from :func:`darc.submit.submit_requests`."""
partial: bool = pydantic.Field(
alias='$PARTIAL$',
description='partial flag - true / false')
metadata: Metadata = pydantic.Field(
alias='[metadata]',
description='metadata of URL')
Timestamp: typing.Datetime = pydantic.Field(
description='requested timestamp in ISO format as in name of saved file')
URL: pydantic.AnyUrl = pydantic.Field(
description='original URL')
Document: typing.Optional[SeleniumDocument] = pydantic.Field(
description='rendered HTML document (if not exists, then ``null``)')
Screenshot: typing.Optional[ScreenshotDocument] = pydantic.Field(
description='web page screenshot (if not exists, then ``null``)')
class Config:
title = 'selenium'
if __name__ == "__main__":
import json
import os
os.makedirs('schema', exist_ok=True)
with open('schema/new_host.schema.json', 'w') as file:
print(NewHostModel.schema_json(indent=2), file=file)
with open('schema/requests.schema.json', 'w') as file:
print(RequestsModel.schema_json(indent=2), file=file)
with open('schema/selenium.schema.json', 'w') as file:
print(SeleniumModel.schema_json(indent=2), file=file)
schema = pydantic.schema.schema([NewHostModel, RequestsModel, SeleniumModel],
title='DARC Data Submission JSON Schema')
with open('schema/darc.schema.json', 'w') as file:
json.dump(schema, file, indent=2)
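With these models, an incoming submission may also be validated before further processing, e.g. (a sketch assuming the pydantic v1 API; field values are placeholders):
payload = {
    '$PARTIAL$': False,
    '[metadata]': {
        'url': 'http://example.onion/',
        'proxy': 'tor',
        'host': 'example.onion',
        'base': 'tor/http/example.onion',
        'name': '0' * 64,  # sha256 of the URL
    },
    'Timestamp': '2020-01-01T00:00:00',
    'URL': 'http://example.onion/',
    'Document': None,
    'Screenshot': None,
}
record = SeleniumModel.parse_obj(payload)  # raises ValidationError if malformed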
Auxiliary Scripts¶
Since the darc project can be deployed through Docker Integration, we provide some auxiliary scripts to help with the deployment.
Health Check¶
- File location
Entry point:
extra/healthcheck.py
System V service:
extra/healthcheck.service
usage: healthcheck [-h] [-f FILE] [-i INTERVAL] ...
health check running container
positional arguments:
services name of services
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE path to compose file
-i INTERVAL, --interval INTERVAL
interval (in seconds) of health check
This script watches the running status of containers managed by Docker Compose. If the containers are stopped or in unhealthy status, it will bring them back alive.
Also, as the internal program may halt unexpectedly whilst the container remains healthy, the script also checks whether the program is still active through its output messages; if inactive, it will restart the containers.
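A minimal sketch of the watchdog idea (the actual extra/healthcheck.py is more involved; service names and the check interval are assumptions):
import subprocess
import time

def container_running(service):
    """Check whether the Docker Compose service has a running container."""
    cid = subprocess.run(
        ['docker-compose', 'ps', '-q', service],
        capture_output=True, text=True, check=True).stdout.strip()
    if not cid:  # no container exists for the service yet
        return False
    state = subprocess.run(
        ['docker', 'inspect', '--format', '{{.State.Running}}', cid],
        capture_output=True, text=True, check=True).stdout.strip()
    return state == 'true'

while True:
    for service in ('crawler', 'loader'):
        if not container_running(service):
            # bring the stopped container back alive
            subprocess.run(['docker-compose', 'up', '-d', service], check=True)
    time.sleep(3600)  # c.f. the --interval option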
Upload API Submission Files¶
- File location
Entry point:
extra/upload.py
Helper script:
extra/upload.sh
Cron sample:
extra/upload.cron
usage: upload [-h] [-p PATH] -H HOST [-U USER]
upload API submission files
optional arguments:
-h, --help show this help message and exit
-p PATH, --path PATH path to data storage
-H HOST, --host HOST upstream hostname
-U USER, --user USER upstream user credential
This script automatically uploads API submission files, c.f. darc.submit, using curl(1). The --user option is passed through to the same option of curl(1).
Important
As darc.submit.save_submit() categorises saved API submission files by their actual dates, the script also uploads such files by the saved dates. Therefore, as the cron(8) sample suggests, the script is best run every day slightly after 12:00 AM (0:00 in 24-hour format).
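For reference, uploading one saved submission file amounts to something like the following, shown here with requests instead of curl(1) (the URL, credentials and file path are placeholders):
import requests

with open('data/api/2020-01-01/new_host/example.json', 'rb') as file:
    response = requests.post('https://example.com/api/new_host', data=file,
                             headers={'Content-Type': 'application/json'},
                             auth=('user', 'password'))
response.raise_for_status()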
Rationale¶
There are two types of workers:
crawler – runs darc.crawl.crawler() to provide a fresh view of a link and test its connectability
loader – runs darc.crawl.loader() to provide an in-depth view of a link and more visual information
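For instance, one worker of each type may be started as follows (file paths follow the Docker Compose example above):
python3 -m darc --type crawler --file /app/text/tor.txt
python3 -m darc --type loader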
The general process for workers of the crawler type can be described as follows:
process_crawler(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler().
crawler(): parse the URL using parse_link(), and check if it needs to crawl the URL (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if true, then crawl the URL with requests.
If the URL is from a brand new host, darc will first try to fetch and save robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), and extract then save the links from sitemaps (c.f. read_sitemap()) into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called and submit the documents just fetched.
If robots.txt is present, and FORCE is False, darc will check if it is allowed to crawl the URL.
Note
The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.
At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information, using save_headers().
Note
If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing is dropped.
If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called and submit the document just fetched.
If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save such links into the database (c.f. save_requests()).
And if the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). If NOT, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
The general process for workers of the loader type can be described as follows:
process_loader(): in the meanwhile, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().
loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.
At this point, darc will call the customised hook function from darc.sites to load and return the original WebDriver object.
If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.
If the submission API is provided, submit_selenium() will be called and submit the document just loaded.
Later, extract_links() will then be called to extract all possible links from the HTML document and save such links into the requests database (c.f. save_requests()).
Important
For more information about the hook functions, please refer to the customisation documentation.
Installation¶
Note
darc supports all Python versions 3.6 and above. Currently, it only supports and is tested on Linux (Ubuntu 18.04) and macOS (Catalina).
When installing under Python versions below 3.8, darc will use walrus to compile itself for backport compatibility.
pip install darc
Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system.
Important
Starting from version 0.3.0, we introduced Redis as the task queue database backend.
Since version 0.6.0, we introduced relational database storage (e.g. MySQL, SQLite, PostgreSQL, etc.) for the task queue database backend, besides the Redis database, since Redis can be too memory-costly when the task queue becomes very large.
Please make sure you have one of the backend databases installed, configured,
and running when using the darc
project.
However, the darc
project is shipped with Docker and Compose support.
Please see Docker Integration for more information.
Or, you may refer to and/or install from the Docker Hub repository:
docker pull jsnbzh/darc[:TAGNAME]
Usage¶
Important
Though the CLI is simple, the darc project is more configurable through environment variables. For more information, please refer to the environment variable configuration documentation.
The darc project provides a simple CLI:
usage: darc [-h] [-v] -t {crawler,loader} [-f FILE] ...
the darkweb crawling swiss army knife
positional arguments:
link links to crawl
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-t {crawler,loader}, --type {crawler,loader}
type of worker process
-f FILE, --file FILE read links from file
It can also be called through the module entrypoint:
python -m darc ...
Note
The link files can contain comment lines, which should start with #
.
Empty lines and comment lines will be ignored when loading.
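For example, a link file may look like the following (the URLs are placeholders):
# seed links for darc
http://example.onion/
http://example.i2p/
http://127.0.0.1:43110/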