darc - Darkweb Crawler Project¶
darc is designed as a swiss army knife for darkweb crawling.
It integrates requests to collect HTTP request and response
information, such as cookies, header fields, etc. It also bundles
selenium to provide a fully rendered web page and a screenshot
of the rendered view.

How to …¶
This is the knowledge base for the darc project. Should you wish
to request more articles, please create an issue at the
GitHub repository.
How to gracefully deploy darc?¶
Important
It is NOT necessary to work in the darc repository
folder directly. You can just use darc with your
customised code somewhere else as you wish.
However, for simplicity, all relative paths referred to in this
article are relative to the project root of the darc
repository.
To deploy darc, there are generally three basic steps:

1. deploy the darc Docker image;
2. set up the healthcheck watchdog service;
3. install the upload cron job (optional).
To Start With¶
Per best practice, the system must have at least 2 GB RAM
and 2 CPU cores to handle the loader worker properly.
The capacity of the RAM will also heavily impact the performance
of the selenium integration, as Google Chrome is a
renowned memory monster.
Note
I'll assume that you're using a *NIX system, as I don't believe a Windows user is going to see this ;)
Firstly, you will need to clone the repository to your system:
git clone https://github.com/JarryShaw/darc.git
# change your working directory
cd darc
then set up the folders you need for the log files:
mkdir -p logs
mkdir -p logs/cron
And now, you will need to decide where you would like to store
the data (documents crawled and saved by darc); let's assume
that you have a /data disk mounted on the system – since that's
what I have on mine xD – which would be big enough to use as a
safe, separate storage place from the system, so that darc will
not crash your system by exhausting its storage:
mkdir /data/darc
# and make a shortcut
ln -s /data/darc data
Therefore, your data will be saved in the /data/darc folder.
Software Dependency¶
After setting up the local system, there are some software dependencies you shall install:
darc is primarily deployed through a Docker environment, even
though it can also be deployed directly on a host machine, either
Linux or macOS, and perhaps Windows, though I have never tested that.
Database
darc needs a database backend for task queue management and
other purposes. It is highly recommended to deploy darc with
Redis; but if you insist, you may use a relational database
(e.g. MySQL, SQLite, PostgreSQL, etc.) instead.
Important
In this article, I will not discuss the usage of relational
databases as they're just too miserable for darc in terms of
availability anyway.
As per best practice, 4 GB RAM would be the minimal requirement
for the Redis database. It is suggested to use a Redis database
hosted by a cloud provider directly, instead of running it on
the same server as darc.
Deploy darc Docker Image¶
As discussed in Docker Integration, darc is exclusively
integrated with the Docker workflow. So basically, just pull
the image from Docker Hub or GitHub Container Registry:
# Docker Hub
docker pull jsnbzh/darc:latest
# GitHub Container Registry
docker pull ghcr.io/jarryshaw/darc:latest
In cases where you would like to use a debug image, which
changes the apt sources to China-hosted mirrors and has IPython
and other auxiliaries installed, you can also pull such an image
instead:
# Docker Hub
docker pull jsnbzh/darc:debug
# GitHub Container Registry
docker pull ghcr.io/jarryshaw/darc-debug:latest
Then you will need to customise the docker-compose.yml
based on your needs. Default values and descriptive help
messages can be found in the file.
The rest is as easy as calling the docker-compose
command to manage the deployed containers, thus I shall
not discuss it further.
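For instance, a typical round trip may look like this (standard docker-compose commands; the exact service names depend on your docker-compose.yml):
# start the containers in the background
docker-compose up -d
# follow the logs to see what darc is doing
docker-compose logs -f
# stop and remove the containers
docker-compose down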
Deploy with Customisations¶
Important
I've made a sample customisation in the demo/deploy/ folder,
which can be used directly as a new repository to start with
your customisation; please check it out before moving forward.
As in the sample customisation, you can simply use the Dockerfile
there as your Docker environment declaration. And the entrypoint file
market/run.py has the sites customisations registered and the
CLI bundled.
Setup healthcheck Daemon Service¶
Since darc can be quite a burden to its host system,
I introduced this healthcheck service as discussed in
Auxiliary Scripts.
For a normal systemd based service system, you can
simply install the darc-healthcheck service to
/etc/systemd/system/:
ln -s extra/healthcheck.service /etc/systemd/system/darc-healthcheck.service
then enable it to run at startup:
sudo systemctl enable darc-healthcheck.service
And from now on, you can simply manage the darc-healthcheck
service through the systemctl or service command
as you prefer.
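For example, with the usual systemctl subcommands:
# start the watchdog right away
sudo systemctl start darc-healthcheck.service
# check that it is up and running
sudo systemctl status darc-healthcheck.service
# restart it, e.g. after changing the service file
sudo systemctl restart darc-healthcheck.service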
Install upload Cron Job¶
In certain cases, you might wish to upload the API submission
JSON files to your FTP server, which has much more space than
the deploy server; then you can utilise the upload cron
job as mentioned in Auxiliary Scripts.
Simply type the following command:
crontab -e
and add the cron job to the file opened:
10 0 * * * ( cd /path/to/darc/ && /path/to/python3 /path/to/darc/extra/upload.py --host ftp://hostname --user username:password ) >> /path/to/darc/logs/cron/darc-upload.log 2>&1
just remember to change the paths, hostname and credentials accordingly; and at last, to activate the new cron job:
sudo systemctl restart cron.service
Now, darc API submission JSON files will be uploaded
to the target FTP server every day at 0:10 am.
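You can double-check the installed entry afterwards with:
crontab -l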
Bonus Tip¶
There is a Makefile at the project root. You can play with it
and try to exploit it. A very useful target is
make reload
for when you wish to pull the remote repository and restart
darc gracefully.
How to implement a sites customisation?¶
As has been discussed already in the documentation, the implementation
of a sites customisation is dead simple: just inherit from the
darc.sites.BaseSite class and override
the corresponding crawler() and loader()
abstract static methods.
See below an example from the documentation.
from darc.sites import BaseSite, register
As the class below suggests, you may implement and register your sites
customisation for mysite.com and www.mysite.com using the
MySite class, where the hostname
attribute contains the list of hostnames with which the class
should be associated.
NB: Implementation details of the crawler and loader
methods will be discussed in the following sections.
class MySite(BaseSite):
    """This is a site customisation class for demonstration purpose.

    You may implement a module as well should you prefer."""

    #: List[str]: Hostnames the sites customisation is designed for.
    hostname = ['mysite.com', 'www.mysite.com']

    @staticmethod
    def crawler(timestamp, session, link): ...

    @staticmethod
    def loader(timestamp, driver, link): ...
Should your sites customisation be associated with multiple sites, you can
just add them all to the hostname attribute; when you call
darc.sites.register() to register your sites customisation, the
function will automatically handle the registry association information.
# register sites implicitly
register(MySite)
Nonetheless, in cases where you would rather specify the hostnames at
runtime (instead of adding them to the hostname attribute), you may
just leave the hostname attribute as None and specify
your list of hostnames at the darc.sites.register() function call.
# register sites explicitly
register(MySite, 'mysite.com', 'www.mysite.com')
Crawler Hook¶
The crawler method is based on requests.Session objects
and returns a requests.Response instance representing the
crawled web page.
Type annotations of the method can be described as
@staticmethod
def crawler(timestamp: datetime.datetime, session: requests.Session, link: darc.link.Link) -> requests.Response: ...
where timestamp is the timestamp of the worker node reference,
session is the requests.Session instance with proxy
presets and link is the target link (parsed by
darc.link.parse_link() to provide more information than a mere string).
For example, let's say you would like to inject a cookie named SessionID
and an Authentication header with some fake identity; then you may write
the crawler method as below.
@staticmethod
def crawler(timestamp, session, link):
    """Crawler hook for my site.

    Args:
        timestamp (datetime.datetime): Timestamp of the worker node reference.
        session (requests.Session): Session object with proxy settings.
        link (darc.link.Link): Link object to be crawled.

    Returns:
        requests.Response: The final response object with crawled data.

    """
    # inject cookies
    session.cookies.set('SessionID', 'fake-session-id-value')
    # insert headers
    session.headers['Authentication'] = 'Basic fake-identity-credential'

    response = session.get(link.url, allow_redirects=True)
    return response
In this case, when darc crawls the link, the HTTP(S) request will be
provided with a session cookie and an HTTP header, so that it may bypass
potential authorisation checks and land on the target page.
Loader Hook¶
The loader method is based on selenium.webdriver.Chrome objects
and returns the original web driver instance containing the loaded web
page.
Type annotations of the method can be described as
@staticmethod
def loader(timestamp: datetime.datetime, driver: selenium.webdriver.Chrome, link: darc.link.Link) -> selenium.webdriver.Chrome: ...
where timestamp is the timestamp of the worker node reference,
driver is the selenium.webdriver.Chrome instance with
proxy presets and link is the target link (parsed by
darc.link.parse_link() to provide more information than a mere string).
For example, let's say you would like to animate a user login and go to the
target page after a successful attempt; then you may write the loader
method as below.
@staticmethod
def loader(timestamp, driver, link):
    """Loader hook for my site.

    Args:
        timestamp (datetime.datetime): Timestamp of the worker node reference.
        driver (selenium.webdriver.Chrome): Web driver object with proxy settings.
        link (darc.link.Link): Link object to be loaded.

    Returns:
        selenium.webdriver.Chrome: The web driver object with loaded data.

    """
    # land on login page
    driver.get('https://%s/login' % link.host)

    # animate login attempt
    form = driver.find_element_by_id('login-form')
    form.find_element_by_id('username').send_keys('admin')
    form.find_element_by_id('password').send_keys('p@ssd')
    form.click()

    # check if the attempt succeeded
    if driver.title == 'Please login!':
        raise ValueError('failed to login %s' % link.host)

    # go to the target page
    driver.get(link.url)

    # wait for page to finish loading
    import time  # should've been put with the top-level import statements
    from darc.const import SE_WAIT
    if SE_WAIT is not None:
        time.sleep(SE_WAIT)

    return driver
In this case, when darc loads the link, the web driver will first
perform the user login, so that it may bypass potential authorisation checks
and land on the target page.
In case you need to drop the link from the task queue…¶
In some scenarios, you may want to remove the target link from the task queue; there are basically two ways:
do it like a wildling and remove it directly from the database
As there are three task queues used in darc, namely the task
queues for the crawler (requests database) and loader
(selenium database) worker nodes, plus a track record for known
hostnames (the hostname database), you will need to call the corresponding
functions to remove the target link from the desired database.
Possible functions are as below;
all take one positional argument link, i.e. the darc.link.Link
object to be removed.
Say you would like to remove https://www.mysite.com from the
requests database; then you may just run
from darc.db import drop_requests
from darc.link import parse_link
link = parse_link('https://www.mysite.com')
drop_requests(link)
or do it in an elegant way
When implementing the sites customisation, you may wish to drop certain
links at runtime; then you may simply raise darc.error.LinkNoReturn
in the corresponding crawler and/or loader methods.
For instance, should you like to proceed with mysite.com but NOT
www.mysite.com in the sites customisation, you may implement your
class as
from darc.error import LinkNoReturn

class MySite(BaseSite):
    ...

    @staticmethod
    def crawler(timestamp, session, link):
        if link.host == 'www.mysite.com':
            raise LinkNoReturn(link)
        ...

    @staticmethod
    def loader(timestamp, driver, link):
        if link.host == 'www.mysite.com':
            raise LinkNoReturn(link)
        ...
Then what should I do to include my sites customisation?¶
Simple as well!
Just install your code to where you're running darc, e.g. the
Docker container, remote server, etc.; then change the startup by injecting
your code before the entrypoint.
Say the structure of the working directory is as below:
.
|-- .venv/
| |-- lib/python3.8/site-packages
| | |-- darc/
| | | |-- ...
| | |-- ...
| |-- ...
|-- mysite.py
|-- ...
where .venv is the folder of the virtual environment with darc
installed and mysite.py is the file with your sites customisation.
Then you just need to extend your mysite.py with some additional lines
as below:
# mysite.py
import sys

from darc.__main__ import main
from darc.sites import BaseSite, register


class MySite(BaseSite):
    ...


# register sites
register(MySite)

if __name__ == '__main__':
    sys.exit(main())
And now, you can start darc through python mysite.py [...] instead
of python -m darc [...] with your sites customisation registered to the
system.
How to implement a custom proxy middleware?¶
As has been discussed already in the documentation, the implementation
of a custom proxy is merely two factory functions: one yields a
requests.Session and/or
requests_futures.sessions.FuturesSession instance, the other yields a
selenium.webdriver.Chrome instance; both with proxy presets.
See below an example from the documentation.
from darc.proxy import register
Session Factory¶
The session factory returns a requests.Session and/or
requests_futures.sessions.FuturesSession instance with presets,
e.g. proxies, user agent, etc.
Type annotations of the function can be described as
@typing.overload
def get_session(futures=False) -> requests.Session: ...
@typing.overload
def get_session(futures=True) -> requests_futures.sessions.FuturesSession: ...
For example, let's say you're implementing a Socks5 proxy at localhost:9293,
with the other presets the same as the default factory function, c.f.
darc.requests.null_session().
import requests
import requests_futures.sessions

from darc.const import DARC_CPU
from darc.requests import default_user_agent


def socks5_session(futures=False):
    """Socks5 proxy session.

    Args:
        futures: If returns a :class:`requests_futures.FuturesSession`.

    Returns:
        Union[requests.Session, requests_futures.FuturesSession]:
        The session object with Socks5 proxy settings.

    """
    if futures:
        session = requests_futures.sessions.FuturesSession(max_workers=DARC_CPU)
    else:
        session = requests.Session()

    session.headers['User-Agent'] = default_user_agent(proxy='Socks5')
    session.proxies.update({
        'http': 'socks5h://localhost:9293',
        'https': 'socks5h://localhost:9293',
    })
    return session
In this case, when darc needs to use a Socks5 session for its crawler
worker nodes, it will call the socks5_session function to obtain a preset
session instance.
Driver Factory¶
The driver factory returns a selenium.webdriver.Chrome instance with
presets, e.g. proxies, options/switches, etc.
Type annotations of the function can be described as
def get_driver() -> selenium.webdriver.Chrome: ...
For example, let's say you're implementing a Socks5 proxy at localhost:9293,
with the other presets the same as the default factory function, c.f.
darc.selenium.null_driver().
import selenium.webdriver
import selenium.webdriver.common.proxy

from darc.selenium import BINARY_LOCATION


def socks5_driver():
    """Socks5 proxy driver.

    Returns:
        selenium.webdriver.Chrome: The web driver object with Socks5 proxy settings.

    """
    options = selenium.webdriver.ChromeOptions()
    options.binary_location = BINARY_LOCATION
    options.add_argument('--proxy-server=socks5://localhost:9293')
    options.add_argument('--host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE localhost"')

    proxy = selenium.webdriver.Proxy()
    proxy.proxyType = selenium.webdriver.common.proxy.ProxyType.MANUAL
    proxy.http_proxy = 'socks5://localhost:9293'
    proxy.ssl_proxy = 'socks5://localhost:9293'

    capabilities = selenium.webdriver.DesiredCapabilities.CHROME.copy()
    proxy.add_to_capabilities(capabilities)

    driver = selenium.webdriver.Chrome(options=options,
                                       desired_capabilities=capabilities)
    return driver
In this case, when darc needs to use a Socks5 driver for its loader
worker nodes, it will call the socks5_driver function to obtain a preset
driver instance.
What should I do to register the proxy?¶
All proxies are managed in the darc.proxy module and you can register
your own proxy through darc.proxy.register():
# register proxy
register('socks5', socks5_session, socks5_driver)
As the code above suggests, darc.proxy.register() takes three
positional arguments: the proxy type, and the session and driver factory functions.
Technical Documentation¶
darc is designed as a swiss army knife for darkweb crawling.
It integrates requests to collect HTTP request and response
information, such as cookies, header fields, etc. It also bundles
selenium to provide a fully rendered web page and a screenshot
of the rendered view.
URL Utilities¶
The Link class is the key data structure
of the darc project; it contains all information
required to identify a URL's proxy type, hostname, path prefix
when saving, etc.
The darc.link module also provides several wrapper
functions around the urllib.parse module.
- class darc.link.Link(url, proxy, host, base, name, url_parse, url_backref=None)[source]¶
  Bases: object
  Parsed link.
  - Parameters
    url (str) – original link
    proxy (str) – proxy type
    host (str) – hostname
    base (str) – base folder for saving files
    name (str) – hashed link for saving files
    url_parse (ParseResult) – parsed URL from urllib.parse.urlparse()
    url_backref (Optional[Link]) – optional Link instance from which the current link was extracted
  - Returns
    Parsed link object.
  - Return type
    Link
  - url_backref: Optional[darc.link.Link] = None¶
    Optional Link instance from which the current link was extracted.
  - url_parse: urllib.parse.ParseResult¶
    Parsed URL from urllib.parse.urlparse().
- darc.link.parse_link(link, host=None, *, backref=None)[source]¶
  Parse link.
  - Parameters
    link (str) – link to be parsed
    host (Optional[str]) – hostname of the link
  - Keyword Arguments
    backref (Optional[darc.link.Link]) – optional Link instance from which the current link was extracted
  - Return type
    Link
  - Returns
    The parsed link object.
Note
If host is provided, it will override the hostname of the original link.
The parsing process of the proxy type is as follows:

1. If host is None and the parse result from urllib.parse.urlparse() has no netloc (or hostname) specified, then set hostname as (null); else set it as is.
2. If the scheme is data, then the link is a data URI; set hostname as data and proxy as data.
3. If the scheme is javascript, then the link is some JavaScript code; set proxy as script.
4. If the scheme is bitcoin, then the link is a Bitcoin address; set proxy as bitcoin.
5. If the scheme is ethereum, then the link is an Ethereum address; set proxy as ethereum.
6. If the scheme is ed2k, then the link is an ED2K magnet link; set proxy as ed2k.
7. If the scheme is magnet, then the link is a magnet link; set proxy as magnet.
8. If the scheme is mailto, then the link is an email address; set proxy as mail.
9. If the scheme is irc, then the link is an IRC link; set proxy as irc.
10. If the scheme is NOT any of http or https, then set proxy to the scheme.
11. If the host is None, set hostname to (null) and proxy to null.
12. If the host is an onion (.onion) address, set proxy to tor.
13. If the host is an I2P (.i2p) address, or any of localhost:7657 and localhost:7658, set proxy to i2p.
14. If the host is localhost on ZERONET_PORT, and the path is not /, i.e. NOT the root path, set proxy to zeronet and set the first part of its path as hostname.
    Example: for a ZeroNet address, e.g. http://127.0.0.1:43110/1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D, parse_link() will parse the hostname as 1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D.
15. If the host is localhost on FREENET_PORT, and the path is not /, i.e. NOT the root path, set proxy to freenet and set the first part of its path as hostname.
    Example: for a Freenet address, e.g. http://127.0.0.1:8888/USK@nwa8lHa271k2QvJ8aa0Ov7IHAV-DFOCFgmDt3X6BpCI,DuQSUZiI~agF8c-6tjsFFGuZ8eICrzWCILB60nT8KKo,AQACAAE/sone/77/, parse_link() will parse the hostname as USK@nwa8lHa271k2QvJ8aa0Ov7IHAV-DFOCFgmDt3X6BpCI,DuQSUZiI~agF8c-6tjsFFGuZ8eICrzWCILB60nT8KKo,AQACAAE.
16. If the host is a proxied onion (.onion.sh) address, set proxy to tor2web.
17. If none of the cases above is satisfied, the proxy will be set as null, marking it a plain normal link.
The base for the parsed Link object is defined as
<root>/<proxy>/<scheme>/<hostname>/
where root is PATH_DB.
The name for the parsed Link object is the sha256 hash (c.f. hashlib.sha256()) of the original link.
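As a quick illustration of these attributes (the URL below is a made-up example; base assumes the default PATH_DB of data):
from darc.link import parse_link

link = parse_link('https://www.mysite.com/some/page')
link.proxy  # 'null', i.e. a plain normal link
link.host   # 'www.mysite.com'
link.base   # 'data/null/https/www.mysite.com/'
link.name   # sha256 hex digest of the original link, equivalent to
            # hashlib.sha256(b'https://www.mysite.com/some/page').hexdigest()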
- darc.link.quote(string, safe='/', encoding=None, errors=None)[source]¶
  Wrapper function for urllib.parse.quote().
  - Parameters
  - Return type
  - Returns
    The quoted string.
  Note
  The function suppresses possible errors when calling urllib.parse.quote(). If any, it will return the original string.
- darc.link.unquote(string, encoding='utf-8', errors='replace')[source]¶
  Wrapper function for urllib.parse.unquote().
  - Parameters
  - Return type
  - Returns
    The unquoted string.
  Note
  The function suppresses possible errors when calling urllib.parse.unquote(). If any, it will return the original string.
- darc.link.urljoin(base, url, allow_fragments=True)[source]¶
  Wrapper function for urllib.parse.urljoin().
  - Parameters
  - Return type
  - Returns
    The joined URL.
  Note
  The function suppresses possible errors when calling urllib.parse.urljoin(). If any, it will return base/url directly.
- darc.link.urlparse(url, scheme='', allow_fragments=True)[source]¶
  Wrapper function for urllib.parse.urlparse().
  - Parameters
  - Return type
  - Returns
    The parse result.
  Note
  The function suppresses possible errors when calling urllib.parse.urlparse(). If any, it will return urllib.parse.ParseResult(scheme=scheme, netloc='', path=url, params='', query='', fragment='') directly.
- darc.link.urlsplit(url, scheme='', allow_fragments=True)[source]¶
  Wrapper function for urllib.parse.urlsplit().
  - Parameters
  - Return type
  - Returns
    The split result.
  Note
  The function suppresses possible errors when calling urllib.parse.urlsplit(). If any, it will return urllib.parse.SplitResult(scheme=scheme, netloc='', path=url, query='', fragment='') directly.
Source Parsing¶
The darc.parse module provides auxiliary functions
to read robots.txt, sitemaps and HTML documents. It
also contains utility functions to check if the proxy type,
hostname and content type are in any of the black and white
lists.
- darc.parse._check(temp_list)[source]¶
  Check the hostname and proxy type of links.
  - Parameters
  - Return type
  - Returns
    List of links matching the requirements.
  Note
  If CHECK_NG is True, the function will directly call _check_ng() instead.
- darc.parse.check_robots(link)[source]¶
  Check if link is allowed by robots.txt.
  - Parameters
    link (Link) – The link object to be checked.
  - Return type
    bool
  - Returns
    If link is allowed by robots.txt.
  Note
  The root path of a URL will always return True.
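For instance (a sketch only; the link is a made-up example):
from darc.link import parse_link
from darc.parse import check_robots

link = parse_link('https://www.mysite.com/some/page')
if check_robots(link):
    ...  # safe to proceed with crawling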
- darc.parse.extract_links_from_text(link, text)[source]¶
  Extract links from a raw text source.
  - Parameters
  - Return type
  - Returns
    List of extracted links.
  Important
  The extraction is NOT as reliable, since we do not perform TLD checks on the extracted links and we cannot guarantee that all links will be extracted.
  The URL patterns used to extract links are defined by
  darc.parse.URL_PAT and you may register your own expressions through
  DARC_URL_PAT.
- darc.parse.get_content_type(response)[source]¶
  Get content type from response.
  - Parameters
    response (requests.Response) – Response object.
  - Return type
    str
  - Returns
    The content type from response.
  Note
  If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
- darc.parse.URL_PAT: List[re.Pattern]¶
  Regular expression patterns to match all reasonable URLs.
  Currently, we have two builtin patterns:
  HTTP(S) and other regular URLs, e.g. WebSocket, IRC, etc.
  re.compile(r'(?P<url>((https?|wss?|irc):)?(//)?\w+(\.\w+)+/?\S*)', re.UNICODE)
  Bitcoin accounts, data URIs, (ED2K) magnet links, email addresses, telephone numbers, JavaScript functions, etc.
  re.compile(r'(?P<url>(bitcoin|data|ed2k|magnet|mailto|script|tel):\w+)', re.ASCII)
  - Environ
    DARC_URL_PAT
  See also
  The patterns are used in darc.parse.extract_links_from_text().
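For instance, a rough sketch of how these patterns may be applied to a piece of raw text (the sample text is made up):
import re

URL_PAT = [
    re.compile(r'(?P<url>((https?|wss?|irc):)?(//)?\w+(\.\w+)+/?\S*)', re.UNICODE),
    re.compile(r'(?P<url>(bitcoin|data|ed2k|magnet|mailto|script|tel):\w+)', re.ASCII),
]

text = 'See https://www.mysite.com/some/page or contact mailto:admin'
links = [match.group('url') for pattern in URL_PAT
         for match in pattern.finditer(text)]
# ['https://www.mysite.com/some/page', 'mailto:admin']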
- darc.save._SAVE_LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving the link hash database link.csv.
  See also
  darc.save.save_link()
Link Database¶
The darc project utilises a Redis based database
to provide inter-process communication.
Note
In its first implementation, the darc project used
Queue to support such communication.
However, as noticed at runtime, the Queue
object is much affected by the lack of memory.
There will be three databases, all following the same naming
convention with the queue_ prefix:

- the hostname database – queue_hostname (HostnameQueueModel)
- the requests database – queue_requests (RequestsQueueModel)
- the selenium database – queue_selenium (SeleniumQueueModel)

queue_hostname, queue_requests and queue_selenium
are all of the Redis sorted set data type.
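For instance, you can inspect these sorted sets directly with standard Redis commands (a quick sanity check, not something darc requires):
# number of links pending in the crawler task queue
redis-cli ZCARD queue_requests
# peek at the first ten entries together with their scores
redis-cli ZRANGE queue_requests 0 9 WITHSCORES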
If FLAG_DB is True, then the
module uses the RDS storage described by the peewee
models as its backend.
- darc.db._db_operation(operation, *args, **kwargs)[source]¶
  Retry operation on database.
  - Parameters
    operation (Callable[..., _T]) – Callable / method to perform.
    *args – Arbitrary positional arguments.
  - Keyword Arguments
    **kwargs – Arbitrary keyword arguments.
  - Return type
    _T
  - Returns
    Any return value from a successful operation call.
- darc.db._drop_hostname_db(link)[source]¶
  Remove link from the hostname database.
  The function updates the HostnameQueueModel table.

- darc.db._drop_hostname_redis(link)[source]¶
  Remove link from the hostname database.
  The function updates the queue_hostname database.

- darc.db._drop_requests_db(link)[source]¶
  Remove link from the requests database.
  The function updates the RequestsQueueModel table.

- darc.db._drop_requests_redis(link)[source]¶
  Remove link from the requests database.
  The function updates the queue_requests database.

- darc.db._drop_selenium_db(link)[source]¶
  Remove link from the selenium database.
  The function updates the SeleniumQueueModel table.

- darc.db._drop_selenium_redis(link)[source]¶
  Remove link from the selenium database.
  The function updates the queue_selenium database.
- darc.db._gen_arg_msg(*args, **kwargs)[source]¶
  Sanitise the arguments' representation string.
  - Parameters
    *args – Arbitrary arguments.
  - Keyword Arguments
    **kwargs – Arbitrary keyword arguments.
  - Return type
  - Returns
    Sanitised arguments representation string.
- darc.db._have_hostname_db(link)[source]¶
  Check if the current link is a new host.
  The function checks the HostnameQueueModel table.

- darc.db._have_hostname_redis(link)[source]¶
  Check if the current link is a new host.
  The function checks the queue_hostname database.
- darc.db._load_requests_db()[source]¶
  Load links from the requests database.
  The function reads the RequestsQueueModel table.
  Note
  At runtime, the function will load at most MAX_POOL links to limit the memory usage.

- darc.db._load_requests_redis()[source]¶
  Load links from the requests database.
  The function reads the queue_requests database.
  Note
  At runtime, the function will load at most MAX_POOL links to limit the memory usage.

- darc.db._load_selenium_db()[source]¶
  Load links from the selenium database.
  The function reads the SeleniumQueueModel table.
  Note
  At runtime, the function will load at most MAX_POOL links to limit the memory usage.

- darc.db._load_selenium_redis()[source]¶
  Load links from the selenium database.
  The function reads the queue_selenium database.
  - Parameters
    check – Whether to perform checks on loaded links; defaults to CHECK.
  - Return type
  - Returns
    List of loaded links from the selenium database.
  Note
  At runtime, the function will load at most MAX_POOL links to limit the memory usage.
- darc.db._redis_command(command, *args, **kwargs)[source]¶
  Wrapper function for Redis commands.
  - Parameters
    command (str) – Command name.
    *args – Arbitrary arguments for the Redis command.
  - Keyword Arguments
    **kwargs – Arbitrary keyword arguments for the Redis command.
  - Return type
  - Returns
    Values returned from the Redis command.
  - Warns
    RedisCommandFailed – Warns at each round when the command failed.
  See also
  Between each retry, the function sleeps for RETRY_INTERVAL second(s) if such value is NOT None.
- darc.db._redis_get_lock(key)[source]¶
  Get a lock for Redis operations.
  - Parameters
    key (Literal['queue_hostname', 'queue_requests', 'queue_selenium']) – Lock target key.
  - Return type
    Union[Redlock, AbstractContextManager]
  - Returns
    A new pottery.redlock.Redlock object using key key that mimics the behaviour of threading.Lock.
  See also
  If REDIS_LOCK is False, returns a contextlib.nullcontext instead.
- darc.db._save_requests_db(entries, single=False, score=None, nx=False, xx=False)[source]¶
  Save links to the requests database.
  The function updates the RequestsQueueModel table.
  - Parameters
    entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link string (if single is set as True).
    single (bool) – Indicates if entries is a list of links or a single link string.
    score (Optional[float]) – Score for the Redis sorted set.
    nx (bool) – Only create new elements and do not update scores for elements that already exist.
    xx (bool) – Only update scores of elements that already exist. New elements will not be added.
  - Return type

- darc.db._save_requests_redis(entries, single=False, score=None, nx=False, xx=False)[source]¶
  Save links to the requests database.
  The function updates the queue_requests database.
  - Parameters
    entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link string (if single is set as True).
    single (bool) – Indicates if entries is a list of links or a single link string.
    score (Optional[float]) – Score for the Redis sorted set.
    nx (bool) – Forces ZADD to only create new elements and not update scores for elements that already exist.
    xx (bool) – Forces ZADD to only update scores of elements that already exist. New elements will not be added.
  - Return type

- darc.db._save_selenium_db(entries, single=False, score=None, nx=False, xx=False)[source]¶
  Save links to the selenium database.
  The function updates the SeleniumQueueModel table.
  - Parameters
    entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either a list of links, or a single link string (if single is set as True).
    single (bool) – Indicates if entries is a list of links or a single link string.
    score (Optional[float]) – Score for the Redis sorted set.
    nx (bool) – Only create new elements and do not update scores for elements that already exist.
    xx (bool) – Only update scores of elements that already exist. New elements will not be added.
  - Return type

- darc.db._save_selenium_redis(entries, single=False, score=None, nx=False, xx=False)[source]¶
  Save links to the selenium database.
  The function updates the queue_selenium database.
  - Parameters
    entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either an iterable of links, or a single link string (if single is set as True).
    single (bool) – Indicates if entries is an iterable of links or a single link string.
    score (Optional[float]) – Score for the Redis sorted set.
    nx (bool) – Forces ZADD to only create new elements and not update scores for elements that already exist.
    xx (bool) – Forces ZADD to only update scores of elements that already exist. New elements will not be added.
  - Return type
  Note
  When entries is a list of Link instances, we try to perform a bulk update to ease the memory consumption. The bulk size is defined by BULK_SIZE.
- darc.db.drop_hostname(link)[source]¶
  Remove link from the hostname database.
  - Parameters
    link (darc.link.Link) – Link to be removed.
  - Return type

- darc.db.drop_requests(link)[source]¶
  Remove link from the requests database.
  - Parameters
    link (darc.link.Link) – Link to be removed.
  - Return type

- darc.db.drop_selenium(link)[source]¶
  Remove link from the selenium database.
  - Parameters
    link (darc.link.Link) – Link to be removed.
  - Return type
- darc.db.load_requests(check=False)[source]¶
  Load links from the requests database.
  - Parameters
    check (bool) – Whether to perform checks on loaded links; defaults to CHECK.
  - Return type
  - Returns
    List of loaded links from the requests database.
  Note
  At runtime, the function will load at most MAX_POOL links to limit the memory usage.

- darc.db.load_selenium(check=False)[source]¶
  Load links from the selenium database.
  - Parameters
    check (bool) – Whether to perform checks on loaded links; defaults to CHECK.
  - Return type
  - Returns
    List of loaded links from the selenium database.
  Note
  At runtime, the function will load at most MAX_POOL links to limit the memory usage.
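A minimal sketch of how a custom worker loop might consume these queues (the processing logic is left as a placeholder):
from darc.db import load_requests

for link in load_requests(check=True):
    ...  # hand each darc.link.Link over to your own crawler logic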
- darc.db.save_requests(entries, single=False, score=None, nx=False, xx=False)[source]¶
  Save links to the requests database.
  The function updates the queue_requests database.
  - Parameters
    entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link string (if single is set as True).
    single (bool) – Indicates if entries is a list of links or a single link string.
    score (Optional[float]) – Score for the Redis sorted set.
    nx (bool) – Only create new elements and do not update scores for elements that already exist.
    xx (bool) – Only update scores of elements that already exist. New elements will not be added.
  - Return type
  Note
  When entries is a list of Link instances, we try to perform a bulk update to ease the memory consumption. The bulk size is defined by BULK_SIZE.

- darc.db.save_selenium(entries, single=False, score=None, nx=False, xx=False)[source]¶
  Save links to the selenium database.
  - Parameters
    entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either a list of links, or a single link string (if single is set as True).
    single (bool) – Indicates if entries is a list of links or a single link string.
    score (Optional[float]) – Score for the Redis sorted set.
    nx (bool) – Only create new elements and do not update scores for elements that already exist.
    xx (bool) – Only update scores of elements that already exist. New elements will not be added.
  - Return type
  Note
  When entries is a list of Link instances, we try to perform a bulk update to ease the memory consumption. The bulk size is defined by BULK_SIZE.
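For instance, to seed the crawler queue manually (a sketch using the documented single-link form):
from darc.db import save_requests
from darc.link import parse_link

save_requests(parse_link('https://www.mysite.com'), single=True)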
- darc.db.LOCK_TIMEOUT: Optional[float]¶
  Lock blocking timeout.
  - Default
    10
  - Environ
    DARC_LOCK_TIMEOUT
  Note
  If it is infinite (inf), no timeout will be applied.
  See also
  Get a lock from darc.db.get_lock().

- darc.db.MAX_POOL: int¶
  Maximum number of links loaded from the database.
  - Default
    1_000
  - Environ
  Note
  If it is infinite (inf), no limit will be applied.

- darc.db.REDIS_LOCK: bool¶
  If use a Redis (Lua) lock to ensure process/thread-safe operations.
  - Default
  - Environ
    DARC_REDIS_LOCK
  See also
  Toggles the behaviour of darc.db.get_lock().

- darc.db.RETRY_INTERVAL: int¶
  Retry interval between each Redis command failure.
  - Default
    10
  - Environ
    DARC_RETRY
  Note
  If it is infinite (inf), no interval will be applied.
  See also
  Toggles the behaviour of darc.db.redis_command().
- darc.submit.PATH_API = '{PATH_DB}/api/'¶
  Path to the API submission records, i.e. the api folder under the root of data storage.
Note
If API_NEW_HOST, API_REQUESTS and API_SELENIUM are None, the corresponding
submit function will save the JSON data in the path specified by PATH_API.
See also
The darc project provides a demo on how to implement a darc-compliant
web backend for the data submission module. See the demo
page for more information.
- darc.selenium.BINARY_LOCATION: Optional[str]¶
  Path to the Google Chrome binary location.
  - Default
    google-chrome
  - Environ
Proxy Utilities¶
The darc.proxy module provides various proxy support
to the darc project.
- darc.proxy.bitcoin.PATH = '{PATH_MISC}/bitcoin.txt'¶
  Path to the data storage of bitcoin addresses.

- darc.proxy.bitcoin.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving bitcoin addresses to PATH.

- darc.proxy.data.PATH = '{PATH_MISC}/data/'¶
  Path to the data storage of data URI schemes.

- darc.proxy.ed2k.PATH = '{PATH_MISC}/ed2k.txt'¶
  Path to the data storage of ED2K magnet links.

- darc.proxy.ed2k.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving ED2K magnet links to PATH.

- darc.proxy.ethereum.PATH = '{PATH_MISC}/ethereum.txt'¶
  Path to the data storage of ethereum addresses.

- darc.proxy.ethereum.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving ethereum addresses to PATH.
The following constants are configurable through environment variables:

- darc.proxy.freenet.FREENET_RETRY: int¶
  Retry times for the Freenet bootstrap on failure.
  - Default
    3
  - Environ

- darc.proxy.freenet.BS_WAIT: float¶
  Time after which the attempt to start Freenet is aborted.
  - Default
    90
  - Environ
    FREENET_WAIT
  Note
  If not provided, there will be NO timeout.

- darc.proxy.freenet.FREENET_PATH: str¶
  Path to the Freenet project.
  - Default
    /usr/local/src/freenet
  - Environ

- darc.proxy.freenet.FREENET_ARGS: List[str]¶
  Freenet bootstrap arguments for run.sh start.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  - Default
    ''
  - Environ
  Note
  The command will be run as DARC_USER, if the current user (c.f. getpass.getuser()) is root.

The following constants are defined for internal usage:

- darc.proxy.freenet._MNG_FREENET: bool¶
  Whether to manage the Freenet proxy through darc.
  - Default
  - Environ
    DARC_FREENET

- darc.proxy.freenet._FREENET_PROC: subprocess.Popen¶
  Freenet proxy process running in the background.
- darc.proxy.i2p.I2P_REQUESTS_PROXY: Dict[str, Any]¶
  Proxy for I2P sessions.
  See also
  darc.requests.i2p_session()

- darc.proxy.i2p.I2P_SELENIUM_PROXY: selenium.webdriver.common.proxy.Proxy¶
  Proxy for I2P web drivers.
  See also
  darc.selenium.i2p_driver()

The following constants are configurable through environment variables:

- darc.proxy.i2p.BS_WAIT: float¶
  Time after which the attempt to start I2P is aborted.
  - Default
    90
  - Environ
    I2P_WAIT
  Note
  If not provided, there will be NO timeout.

- darc.proxy.i2p.I2P_ARGS: List[str]¶
  I2P bootstrap arguments for i2prouter start.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  - Default
    ''
  - Environ
  Note
  The command will be run as DARC_USER, if the current user (c.f. getpass.getuser()) is root.

The following constants are defined for internal usage:

- darc.proxy.i2p._I2P_PROC: subprocess.Popen¶
  I2P proxy process running in the background.
- darc.proxy.irc.PATH = '{PATH_MISC}/irc.txt'¶
  Path to the data storage of IRC addresses.

- darc.proxy.irc.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving IRC addresses to PATH.

- darc.proxy.magnet.PATH = '{PATH_MISC}/magnet.txt'¶
  Path to the data storage of magnet links.

- darc.proxy.magnet.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving magnet links to PATH.

- darc.proxy.mail.PATH = '{PATH_MISC}/mail.txt'¶
  Path to the data storage of email addresses.

- darc.proxy.mail.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving email addresses to PATH.

- darc.proxy.null.PATH = '{PATH_MISC}/invalid.txt'¶
  Path to the data storage of links with invalid scheme.

- darc.proxy.null.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving links with invalid scheme to PATH.

- darc.proxy.script.PATH = '{PATH_MISC}/script.txt'¶
  Path to the data storage of JavaScript links.

- darc.proxy.script.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving JavaScript links to PATH.

- darc.proxy.tel.PATH = '{PATH_MISC}/tel.txt'¶
  Path to the data storage of telephone numbers.

- darc.proxy.tel.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]¶
  I/O lock for saving telephone numbers to PATH.
- darc.proxy.tor.TOR_REQUESTS_PROXY: Dict[str, Any]¶
  Proxy for Tor sessions.
  See also
  darc.requests.tor_session()

- darc.proxy.tor.TOR_SELENIUM_PROXY: selenium.webdriver.common.proxy.Proxy¶
  Proxy for Tor web drivers.
  See also
  darc.selenium.tor_driver()

The following constants are configurable through environment variables:

- darc.proxy.tor.TOR_PASS: str¶
  Tor controller authentication token.
  Note
  If not provided, it will be requested at runtime.

- darc.proxy.tor.BS_WAIT: float¶
  Time after which the attempt to start Tor is aborted.
  - Default
    90
  - Environ
    TOR_WAIT
  Note
  If not provided, there will be NO timeout.

- darc.proxy.tor.TOR_CFG: Dict[str, Any]¶
  Tor bootstrap configuration for stem.process.launch_tor_with_config().
  - Default
    {}
  - Environ
  Note
  If provided, it will be parsed from a JSON encoded string.

The following constants are defined for internal usage:

- darc.proxy.tor._TOR_PROC: subprocess.Popen¶
  Tor proxy process running in the background.

- darc.proxy.tor._TOR_CTRL: stem.control.Controller¶
  Tor controller process (stem.control.Controller) running in the background.

- darc.proxy.tor._TOR_CONFIG: List[str]¶
  Tor bootstrap configuration for stem.process.launch_tor_with_config().
The following constants are configurable through environment variables:

- darc.proxy.zeronet.ZERONET_RETRY: int¶
  Retry times for the ZeroNet bootstrap on failure.
  - Default
    3
  - Environ

- darc.proxy.zeronet.BS_WAIT: float¶
  Time after which the attempt to start ZeroNet is aborted.
  - Default
    90
  - Environ
    ZERONET_WAIT
  Note
  If not provided, there will be NO timeout.

- darc.proxy.zeronet.ZERONET_PATH: str¶
  Path to the ZeroNet project.
  - Default
    /usr/local/src/zeronet
  - Environ

- darc.proxy.zeronet.ZERONET_ARGS: List[str]¶
  ZeroNet bootstrap arguments for run.sh start.
  If provided, it should be parsed as command line arguments (c.f. shlex.split()).
  - Default
    ''
  - Environ
  Note
  The command will be run as DARC_USER, if the current user (c.f. getpass.getuser()) is root.

The following constants are defined for internal usage:

- darc.proxy.zeronet._MNG_ZERONET: bool¶
  Whether to manage the ZeroNet proxy through darc.
  - Default
  - Environ
    DARC_ZERONET

- darc.proxy.zeronet._ZERONET_PROC: subprocess.Popen¶
  ZeroNet proxy process running in the background.
To tell the darc project which proxy settings are to be used for the
requests.Session objects and WebDriver
objects, you can specify such information in the darc.proxy.LINK_MAP
mapping dictionary.

- darc.proxy.LINK_MAP: DefaultDict[str, Tuple[types.FunctionType, types.FunctionType]]¶
  LINK_MAP = collections.defaultdict(
      lambda: (darc.requests.null_session, darc.selenium.null_driver),
      {
          'tor': (darc.requests.tor_session, darc.selenium.tor_driver),
          'i2p': (darc.requests.i2p_session, darc.selenium.i2p_driver),
      }
  )
  The mapping dictionary from proxy type to its corresponding
  requests.Session factory function and WebDriver factory function.
  The fallback value is the no-proxy requests.Session object
  (null_session()) and WebDriver object (null_driver()).
  See also
  darc.requests – requests.Session factory functions
  darc.selenium – WebDriver factory functions
Sites Customisation¶
As websites may have authentication requirements, etc., over
their content, the darc.sites module provides sites
customisation hooks for both the requests and selenium
crawling processes.
Important
To create a sites customisation, define your class by inheriting from
darc.sites.BaseSite and register it to the darc
module through darc.sites.register().
To start with, you just need to define your sites customisation by
inheriting from BaseSite and overloading the corresponding
crawler() and/or loader() methods.
To customise behaviours over requests, your sites customisation
class should have a crawler() method, e.g.
DefaultSite.crawler.
The function takes the requests.Session object with proxy settings and
a Link object representing the link to be
crawled, then returns a requests.Response object containing the final
data of the crawling process.
To customise behaviours over selenium, your sites customisation
class should have a loader() method, e.g.
DefaultSite.loader.
The function takes the WebDriver
object with proxy settings and a Link object representing
the link to be loaded, then returns the WebDriver
object containing the final data of the loading process.
To tell the darc project which sites customisation
it should use for a certain hostname, you can register
such customisation to the SITEMAP mapping dictionary
through register():

- darc.sites.SITEMAP: DefaultDict[str, Type[darc.sites._abc.BaseSite]]¶
  from darc.sites.default import DefaultSite

  SITEMAP = collections.defaultdict(lambda: DefaultSite, {
      # 'www.sample.com': SampleSite,  # local customised class
  })
  The mapping dictionary from hostname to sites customisation classes.
  The fallback value is darc.sites.default.DefaultSite.
See also
Please refer to Customisations for more examples and explanations.
Module Constants¶
Auxiliary Function¶
General Configurations¶
- darc.const.REBOOT: bool¶
  If exit the program after the first round, i.e. after having crawled all links from the
  requests link database and loaded all links from the selenium link database.
  This can be useful especially when the capacity is limited and you wish to save some space before continuing to the next round. See Docker integration for more information.
  - Default
  - Environ
    DARC_REBOOT

- darc.const.VERBOSE: bool¶
  If run the program in verbose mode. If DEBUG is True, then the verbose mode will always be enabled.
  - Default
  - Environ

- darc.const.CHECK: bool¶
  If check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
  If CHECK_NG is True, then this environment variable will always be set as True.
  - Default
  - Environ

- darc.const.CHECK_NG: bool¶
  If check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
  - Default
  - Environ

- darc.const.CWD = '.'¶
  The current working directory.

- darc.const.DARC_CPU: int¶
  Number of concurrent processes. If not provided, then the number of system CPUs will be used.

- darc.const.DARC_USER: str¶
  Non-root user for proxies.
  - Default
    current login user (c.f. getpass.getuser())
  - Environ
Data Storage¶
See also
See darc.db for more information about database integration.

- darc.const.REDIS: redis.Redis¶
  URL to the Redis database.
  - Default
    redis://127.0.0.1
  - Environ
    REDIS_URL

- darc.const.DB: peewee.Database¶
  URL to the RDS storage.
  - Default
    sqlite://{PATH_DB}/darc.db
  - Environ
    DB_URL

- darc.const.DB_WEB: peewee.Database¶
  URL to the data submission storage.
  - Default
    sqlite://{PATH_DB}/darcweb.db
  - Environ
    DB_URL

- darc.const.FLAG_DB: bool¶
  Flag if RDS is used as the task queue backend. If REDIS_URL is provided, then False; else, True.

- darc.const.PATH_DB: str¶
  Path to data storage.
  - Default
    data
  - Environ
See also
See darc.save for more information about source saving.

- darc.const.PATH_MISC = '{PATH_DB}/misc/'¶
  Path to miscellaneous data storage, i.e. the misc folder under the root of data storage.

- darc.const.PATH_LN = '{PATH_DB}/link.csv'¶
  Path to the link CSV file, link.csv.
  See also
  darc.save.save_link

- darc.const.PATH_ID = '{PATH_DB}/darc.pid'¶
  Path to the process ID file, darc.pid.
  See also
  darc.const.getpid()
Web Crawlers¶

- darc.const.DARC_WAIT: Optional[float]¶
  Time interval between each round when the requests and/or selenium database are empty.
  - Default
    60
  - Environ

- darc.const.TIME_CACHE: float¶
  Time delta for caches in seconds.
  The darc project supports caching of fetched files. TIME_CACHE specifies for how long the fetched files will be cached and NOT fetched again.
  Note
  If TIME_CACHE is None, then caching will be marked as forever.
  - Default
    60
  - Environ

- darc.const.SE_WAIT: float¶
  Time to wait for selenium to finish loading pages.
  Note
  Internally, selenium will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.
  - Default
    60
  - Environ

- darc.const.SE_EMPTY = '<html><head></head><body></body></html>'¶
  The empty page from selenium.
  See also
  darc.crawl.loader()
White / Black Lists¶

- darc.const.LINK_WHITE_LIST: List[re.Pattern]¶
  White list of hostnames that should be crawled.
  - Default
    []
  - Environ
  Note
  Regular expressions are supported.

- darc.const.LINK_BLACK_LIST: List[re.Pattern]¶
  Black list of hostnames that should NOT be crawled.
  - Default
    []
  - Environ
  Note
  Regular expressions are supported.

- darc.const.LINK_FALLBACK: bool¶
  Fallback value for match_host().
  - Default
  - Environ

- darc.const.MIME_WHITE_LIST: List[re.Pattern]¶
  White list of content types that should be crawled.
  - Default
    []
  - Environ
  Note
  Regular expressions are supported.

- darc.const.MIME_BLACK_LIST: List[re.Pattern]¶
  Black list of content types that should NOT be crawled.
  - Default
    []
  - Environ
  Note
  Regular expressions are supported.

- darc.const.MIME_FALLBACK: bool¶
  Fallback value for match_mime().
  - Default
  - Environ

- darc.const.PROXY_WHITE_LIST: List[str]¶
  White list of proxy types that should be crawled.
  - Default
    []
  - Environ
  Note
  The proxy types are case insensitive.

- darc.const.PROXY_BLACK_LIST: List[str]¶
  Black list of proxy types that should NOT be crawled.
  - Default
    []
  - Environ
  Note
  The proxy types are case insensitive.

- darc.const.PROXY_FALLBACK: bool¶
  Fallback value for match_proxy().
  - Default
  - Environ
Custom Exceptions¶
The render_error() function can be used to render
multi-line error messages with stem.util.term colours.
The darc project provides the following custom exceptions:
Note
All exceptions are inherited from _BaseException.
The darc project provides the following custom warnings:
Note
All warnings are inherited from _BaseWarning.
- exception darc.error.APIRequestFailed[source]¶
  Bases: darc.error._BaseWarning
  API submit failed.

- exception darc.error.DatabaseOperaionFailed[source]¶
  Bases: darc.error._BaseWarning
  Database operation execution failed.

- exception darc.error.FreenetBootstrapFailed[source]¶
  Bases: darc.error._BaseWarning
  Freenet bootstrap process failed.

- exception darc.error.HookExecutionFailed[source]¶
  Bases: darc.error._BaseWarning
  Failed to execute hook function.

- exception darc.error.I2PBootstrapFailed[source]¶
  Bases: darc.error._BaseWarning
  I2P bootstrap process failed.

- exception darc.error.LinkNoReturn(link=None, *, drop=True)[source]¶
  Bases: darc.error._BaseException
  The link has no return value from the hooks.
  - Parameters
    link (darc.link.Link) – Original link object.
  - Keyword Arguments
    drop (bool) – If drops the link from the task queues.
  - Return type

- exception darc.error.LockWarning[source]¶
  Bases: darc.error._BaseWarning
  Failed to acquire Redis lock.

- exception darc.error.RedisCommandFailed[source]¶
  Bases: darc.error._BaseWarning
  Redis command execution failed.

- exception darc.error.SiteNotFoundWarning[source]¶
  Bases: darc.error._BaseWarning, ImportWarning
  Site customisation not found.

- exception darc.error.TorBootstrapFailed[source]¶
  Bases: darc.error._BaseWarning
  Tor bootstrap process failed.

- exception darc.error.TorRenewFailed[source]¶
  Bases: darc.error._BaseWarning
  Tor renew request failed.

- exception darc.error.UnsupportedLink[source]¶
  Bases: darc.error._BaseException
  The link is not supported.

- exception darc.error.UnsupportedPlatform[source]¶
  Bases: darc.error._BaseException
  The platform is not supported.

- exception darc.error.UnsupportedProxy[source]¶
  Bases: darc.error._BaseException
  The proxy is not supported.

- exception darc.error.WorkerBreak[source]¶
  Bases: darc.error._BaseException
  Break from the worker loop.

- exception darc.error.ZeroNetBootstrapFailed[source]¶
  Bases: darc.error._BaseWarning
  ZeroNet bootstrap process failed.

- darc.error.render_error(message, colour)[source]¶
  Render an error message.
  The function wraps the stem.util.term.format() function to provide multi-line formatting support.
  - Parameters
    message (AnyStr) – Multi-line message to be rendered with colour.
    colour (stem.util.term.Color) – Front colour of text, c.f. stem.util.term.Color.
  - Returns
    The rendered error message.
  - Return type
    str
Data Models¶
The darc.model module contains all data models defined for the
darc project, including the RDS-based task queue and data submission.
Task Queues¶
The darc.model.tasks module defines the data models
required for the task queues of darc.
See also
Please refer to the darc.db module for more information
about the task queues.
Hostname Queue¶
Important
The hostname queue is a set named queue_hostname in
a Redis based task queue.
The darc.model.tasks.hostname model contains the data model
defined for the hostname queue.
- class darc.model.tasks.hostname.HostnameQueueModel(*args, **kwargs)[source]¶
  Bases: darc.model.abc.BaseModel
  Hostname task queue.
  - DoesNotExist¶
    alias of darc.model.tasks.hostname.HostnameQueueModelDoesNotExist
  - id = <AutoField: HostnameQueueModel.id>¶
  - timestamp: datetime.datetime = <DateTimeField: HostnameQueueModel.timestamp>¶
    Timestamp of last update.
Crawler Queue¶
Important
The crawler queue is a sorted set named queue_requests
in a Redis based task queue.
The darc.model.tasks.requests model contains the data model
defined for the crawler queue.

- class darc.model.tasks.requests.RequestsQueueModel(*args, **kwargs)[source]¶
  Bases: darc.model.abc.BaseModel
  Task queue for crawler().
  - DoesNotExist¶
    alias of darc.model.tasks.requests.RequestsQueueModelDoesNotExist
  - id = <AutoField: RequestsQueueModel.id>¶
  - link: darc.link.Link = <PickleField: RequestsQueueModel.link>¶
    Pickled target Link instance.
  - timestamp: datetime.datetime = <DateTimeField: RequestsQueueModel.timestamp>¶
    Timestamp of last update.
Loader Queue¶
Important
The loader queue is a sorted set named queue_selenium
in a Redis based task queue.
The darc.model.tasks.selenium model contains the data model
defined for the loader queue.

- class darc.model.tasks.selenium.SeleniumQueueModel(*args, **kwargs)[source]¶
  Bases: darc.model.abc.BaseModel
  Task queue for loader().
  - DoesNotExist¶
    alias of darc.model.tasks.selenium.SeleniumQueueModelDoesNotExist
  - id = <AutoField: SeleniumQueueModel.id>¶
  - link: darc.link.Link = <PickleField: SeleniumQueueModel.link>¶
    Pickled target Link instance.
  - timestamp: datetime.datetime = <DateTimeField: SeleniumQueueModel.timestamp>¶
    Timestamp of last update.
Submission Data Models¶
The darc.model.web module defines the data models
to store the data crawled by the darc project.
See also
Please refer to the darc.submit module for more information
about data submission.
Base Model¶
The darc.model.abc module contains the abstract base classes
of all data models for the darc project.
- class darc.model.abc.BaseMeta[source]¶
  Bases: object
  Basic metadata for data models.
  - table_function()¶
    Generate table name dynamically (c.f. table_function()).
    - Parameters
      model_class (peewee.Model) –
    - Return type

- class darc.model.abc.BaseMetaWeb[source]¶
  Bases: darc.model.abc.BaseMeta
  Basic metadata for data models of data submission.
  - table_function()¶
    Generate table name dynamically (c.f. table_function()).
    - Parameters
      model_class (peewee.Model) –
    - Return type

- class darc.model.abc.BaseModel(*args, **kwargs)[source]¶
  Bases: peewee.Model
  Base model with standard patterns.
  Notes
  The model will implicitly have an AutoField attribute named id.
  - DoesNotExist¶
    alias of darc.model.abc.BaseModelDoesNotExist
  - Meta¶
    Basic metadata for data models.
  - id = <AutoField: BaseModel.id>¶

- class darc.model.abc.BaseModelWeb(*args, **kwargs)[source]¶
  Bases: darc.model.abc.BaseModel
  Base model with standard patterns for data submission.
  Notes
  The model will implicitly have an AutoField attribute named id.
  - DoesNotExist¶
    alias of darc.model.abc.BaseModelWebDoesNotExist
  - Meta¶
    Basic metadata for data models.
  - id = <AutoField: BaseModelWeb.id>¶
Miscellaneous Utilities¶
The darc.model.utils module contains several miscellaneous
utility functions and data fields.
class darc.model.utils.IPField(null=False, index=False, unique=False, column_name=None, default=None, primary_key=False, constraints=None, sequence=None, collation=None, unindexed=False, choices=None, help_text=None, verbose_name=None, index_type=None, db_column=None, _hidden=False)[source]¶
Bases: peewee.IPField
IP data field.
db_value(val)[source]¶
Dump the value for database storage.
Parameters: val (Optional[Union[str, ipaddress.IPv4Address, ipaddress.IPv6Address]]) – source IP address instance.
Returns: integral representation of the IP address.
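As a minimal sketch of the integral representation mentioned above, using only the standard library:
import ipaddress

addr = ipaddress.ip_address('10.0.0.1')
print(int(addr))                        # 167772161, i.e. what gets stored
print(ipaddress.ip_address(167772161))  # 10.0.0.1, the recovered address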
class darc.model.utils.IntEnumField(null=False, index=False, unique=False, column_name=None, default=None, primary_key=False, constraints=None, sequence=None, collation=None, unindexed=False, choices=None, help_text=None, verbose_name=None, index_type=None, db_column=None, _hidden=False)[source]¶
Bases: peewee.IntegerField
enum.IntEnum data field.
choices: enum.IntEnum¶
The original enum.IntEnum class.
class darc.model.utils.JSONField(null=False, index=False, unique=False, column_name=None, default=None, primary_key=False, constraints=None, sequence=None, collation=None, unindexed=False, choices=None, help_text=None, verbose_name=None, index_type=None, db_column=None, _hidden=False)[source]¶
Bases: playhouse.mysql_ext.JSONField
JSON data field.
class darc.model.utils.PickleField(null=False, index=False, unique=False, column_name=None, default=None, primary_key=False, constraints=None, sequence=None, collation=None, unindexed=False, choices=None, help_text=None, verbose_name=None, index_type=None, db_column=None, _hidden=False)[source]¶
Bases: peewee.BlobField
Pickled data field.
As websites can sometimes be irritating with their anti-robot
verification, login requirements, etc., the darc project
also provides hooks to customise crawling behaviours around both
requests and selenium.
See also
Such customisation, called site hooks in the darc project, is site
specific; users can set up their own hooks for a certain site,
c.f. darc.sites for more information.
Still, since the network is a world full of mysteries and miracles,
the crawling speed will largely depend on the response speed of
the target website. To speed things up, as well as to meet the system
capacity, the darc project introduced multiprocessing, multithreading
and a fallback (slowest) single-threaded solution when crawling.
Note
When rendering the target website using selenium powered by
the renowned Google Chrome, much memory is required. Thus, the
three solutions mentioned above only toggle the behaviour around
the use of selenium.
To keep the darc project the swiss army knife it is, only the
main entrypoint function darc.process.process() is exported
in the global namespace (renamed to darc.darc()). The necessary
hook registration functions are also exported to the global
namespace. For more information on the hooks, please refer to
the customisation documentation.
Configuration¶
The darc project is generally configurable through numerous
environment variables. Below is the full list of supported environment
variables you may use to configure the behaviour of darc.
General Configurations¶
DARC_REBOOT¶
Whether to exit the program after the first round, i.e. once all links
from the requests link database have been crawled and all links from
the selenium link database have been loaded.
This can be useful especially when the capacity is limited and you wish to save some space before continuing the next round. See Docker integration for more information.
DARC_VERBOSE¶
Whether to run the program in verbose mode. If DARC_DEBUG is True,
then verbose mode will always be enabled.
DARC_CHECK¶
Whether to check the proxy and hostname before crawling (when calling
extract_links(), read_sitemap() and read_hosts()).
If DARC_CHECK_CONTENT_TYPE is True, then this environment variable
will always be set as True.
DARC_CHECK_CONTENT_TYPE¶
Whether to check the content type through HEAD requests before
crawling (when calling extract_links(), read_sitemap() and
read_hosts()).
DARC_URL_PAT¶
Regular expression patterns to match all reasonable URLs.
The environment variable should be JSON encoded, as an array of three-element entries. Each entry contains one scheme (str) as the fallback default scheme for the matched URL, one Python regular expression string (str) as described in the builtin re module, and one numeric value (int) representing the flags as defined in the builtin re module as well.
Important
The patterns must have a named match group url, e.g. (?P<url>bitcoin:\w+),
so that the function can extract matched URLs from the given pattern.
The regular expressions will always be compiled in ASCII mode, i.e. with
the re.ASCII flag.
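A minimal sketch of a conforming value, with a purely hypothetical bitcoin pattern:
import json
import os
import re

# One entry: (fallback scheme, pattern with a named ``url`` group, flags).
os.environ['DARC_URL_PAT'] = json.dumps([
    ('bitcoin', r'(?P<url>bitcoin:\w+)', int(re.IGNORECASE)),
])

# Sketch of how such a pattern would be consumed: compiled in ASCII mode.
for scheme, pattern, flags in json.loads(os.environ['DARC_URL_PAT']):
    regex = re.compile(pattern, flags | re.ASCII)
    match = regex.search('donate at bitcoin:1BoatSLRHtKNngkdXEeobR76b53LETtpyT')
    if match is not None:
        print(scheme, match.group('url'))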
DARC_CPU¶
Number of concurrent processes. If not provided, the number of system
CPUs will be used.
Note
DARC_MULTIPROCESSING and DARC_MULTITHREADING can NOT be toggled at
the same time.
DARC_USER¶
Type: str
Default: current login user (c.f. getpass.getuser())
Non-root user for proxies.
Data Storage¶
See also
See darc.save for more information about source saving.
See darc.db for more information about database integration.
DB_URL¶
Type: str (url)
URL to the RDS storage.
Important
The task queues will be saved to the darc database; the data
submission will be saved to the darcweb database.
Thus, when providing this environment variable, please do NOT specify the database name.
LOCK_TIMEOUT¶
Type: float
Default: 10
Lock blocking timeout.
Note
If set to infinity (inf), no timeout will be applied.
See also
Get a lock from darc.db.get_lock().
DARC_MAX_POOL¶
Type: int
Default: 1_000
Maximum number of links loaded from the database.
Note
If set to infinity (inf), no limit will be applied.
REDIS_LOCK¶
Whether to use a Redis (Lua) lock to ensure process/thread-safe
operations.
See also
Toggles the behaviour of darc.db.get_lock().
Web Crawlers¶
DARC_WAIT¶
Type: float
Default: 60
Time interval between each round when the requests and/or selenium
databases are empty.
DARC_SAVE¶
Whether to save the processed link back to the database.
Note
If DARC_SAVE is True, then DARC_SAVE_REQUESTS and DARC_SAVE_SELENIUM
will be forced to be True.
See also
See darc.db for more information about the link database.
DARC_SAVE_REQUESTS¶
Whether to save links crawled by crawler() back to the requests
database.
See also
See darc.db for more information about the link database.
DARC_SAVE_SELENIUM¶
Whether to save links loaded by loader() back to the selenium
database.
See also
See darc.db for more information about the link database.
TIME_CACHE¶
Type: float
Default: 60
Time delta for caches in seconds.
The darc project supports caching of fetched files. TIME_CACHE
specifies for how long the fetched files will be cached and NOT
fetched again.
Note
If TIME_CACHE is None, the cache will be kept forever.
SE_WAIT¶
Type: float
Default: 60
Time to wait for selenium to finish loading pages.
Note
Internally, selenium will wait for the browser to finish loading
the pages before returning (i.e. the DOMContentLoaded web API event).
However, some extra scripts may take more time to run after the event.
White / Black Lists¶
LINK_WHITE_LIST¶
Type: List[str] (JSON)
Default: []
White list of hostnames that should be crawled.
Note
Regular expressions are supported.
LINK_BLACK_LIST¶
Type: List[str] (JSON)
Default: []
Black list of hostnames that should NOT be crawled.
Note
Regular expressions are supported.
LINK_FALLBACK¶
Fallback value for match_host().
MIME_WHITE_LIST¶
Type: List[str] (JSON)
Default: []
White list of content types that should be crawled.
Note
Regular expressions are supported.
MIME_BLACK_LIST¶
Type: List[str] (JSON)
Default: []
Black list of content types that should NOT be crawled.
Note
Regular expressions are supported.
MIME_FALLBACK¶
Fallback value for match_mime().
PROXY_WHITE_LIST¶
Type: List[str] (JSON)
Default: []
White list of proxy types that should be crawled.
Note
The proxy types are case insensitive.
PROXY_BLACK_LIST¶
Type: List[str] (JSON)
Default: []
Black list of proxy types that should NOT be crawled.
Note
The proxy types are case insensitive.
PROXY_FALLBACK¶
Fallback value for match_proxy().
Note
If provided, LINK_WHITE_LIST, LINK_BLACK_LIST, MIME_WHITE_LIST,
MIME_BLACK_LIST, PROXY_WHITE_LIST and PROXY_BLACK_LIST should all
be JSON encoded strings.
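A minimal sketch of preparing such values (list contents borrowed from the Docker Compose example later in this document):
import json
import os

os.environ['LINK_WHITE_LIST'] = json.dumps([r'.*?\.onion', r'.*?\.i2p'])
os.environ['MIME_WHITE_LIST'] = json.dumps(['text/html', 'application/xhtml+xml'])
os.environ['PROXY_BLACK_LIST'] = json.dumps(['null', 'data'])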
Data Submission¶
Note
If API_NEW_HOST, API_REQUESTS and API_SELENIUM are None,
the corresponding submit functions will save the JSON data to the
path specified by PATH_DATA.
Tor Proxy Configuration¶
TOR_PASS¶
Tor controller authentication token.
Note
If not provided, it will be requested at runtime.
TOR_WAIT¶
Type: float
Default: 90
Time after which the attempt to start Tor is aborted.
Note
If not provided, there will be NO timeout.
TOR_CFG¶
Type: Dict[str, Any] (JSON)
Default: {}
Tor bootstrap configuration for stem.process.launch_tor_with_config().
Note
If provided, it should be a JSON encoded string.
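A minimal sketch, assuming you only want to override the standard SOCKS and control ports (both are standard Tor options):
import json
import os

os.environ['TOR_CFG'] = json.dumps({
    'SocksPort': '9050',
    'ControlPort': '9051',
})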
I2P Proxy Configuration¶
I2P_WAIT¶
Type: float
Default: 90
Time after which the attempt to start I2P is aborted.
Note
If not provided, there will be NO timeout.
I2P_ARGS¶
Type: str (Shell)
Default: ''
I2P bootstrap arguments for i2prouter start.
If provided, it will be parsed as command line arguments
(c.f. shlex.split()), as sketched below.
Note
The command will be run as DARC_USER if the current user
(c.f. getpass.getuser()) is root.
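As a sketch of the shlex.split() behaviour (the flags here are purely hypothetical; the same parsing applies to ZERONET_ARGS and FREENET_ARGS below):
import shlex

print(shlex.split('--host 127.0.0.1 --port 4444'))
# -> ['--host', '127.0.0.1', '--port', '4444']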
ZeroNet Proxy Configuration¶
ZERONET_WAIT¶
Type: float
Default: 90
Time after which the attempt to start ZeroNet is aborted.
Note
If not provided, there will be NO timeout.
ZERONET_ARGS¶
Type: str (Shell)
Default: ''
ZeroNet bootstrap arguments for ZeroNet.sh main.
Note
If provided, it will be parsed as command line arguments (c.f. shlex.split()).
Freenet Proxy Configuration¶
FREENET_WAIT¶
Type: float
Default: 90
Time after which the attempt to start Freenet is aborted.
Note
If not provided, there will be NO timeout.
FREENET_ARGS¶
Type: str (Shell)
Default: ''
Freenet bootstrap arguments for run.sh start.
If provided, it will be parsed as command line arguments (c.f. shlex.split()).
Note
The command will be run as DARC_USER if the current user
(c.f. getpass.getuser()) is root.
Customisations¶
Currently, darc provides three major customisation points, besides
the various environment variables.
Hooks between Rounds¶
See also
See darc.process.register() for technical information.
As the workers are defined as indefinite loops, we introduced hooks
between rounds, called at the end of each loop. Such hook functions can
process all links that were crawled and/or loaded in the past round, or
signal the end of the indefinite loop so that we can stop the workers
in an elegant way.
A typical hook function can be defined as follows:
from darc.error import WorkerBreak
from darc.process import register

def dummy_hook(node_type, link_pool):
    """A sample hook function that prints the links processed
    in the past round and informs the worker to quit.

    Args:
        node_type (Literal['crawler', 'loader']): Type of worker node.
        link_pool (List[darc.link.Link]): List of processed links.

    Returns:
        NoReturn: The hook function will never return, though return
        values would be ignored anyway.

    Raises:
        darc.error.WorkerBreak: Inform the worker to quit after this round.

    """
    if node_type == 'crawler':
        verb = 'crawled'
    elif node_type == 'loader':
        verb = 'loaded'
    else:
        raise ValueError('unknown type of worker node: %s' % node_type)

    for link in link_pool:
        print('We just %s the link: %s' % (verb, link.url))
    raise WorkerBreak

# register the hook function
register(dummy_hook)
Custom Proxy¶
See also
See darc.proxy.register() for technical information.
Sometimes, we need proxies to connect to certain targets, such as the
Tor network and the I2P proxy. darc decides if it needs to use a proxy
for the connection based on the proxy value of the target link.
By default, darc uses no proxy for requests sessions
and selenium drivers. However, you may use your own proxies by
registering and/or customising the corresponding factory functions.
A typical factory function pair (e.g., for a Socks5 proxy) can be defined as follows:
import requests
import requests_futures.sessions
import selenium.webdriver
import selenium.webdriver.common.proxy
from darc.const import DARC_CPU
from darc.proxy import register
from darc.requests import default_user_agent
from darc.selenium import BINARY_LOCATION
def socks5_session(futures=False):
"""Socks5 proxy session.
Args:
        futures: Whether to return a :class:`requests_futures.FuturesSession`.
Returns:
Union[requests.Session, requests_futures.FuturesSession]:
The session object with Socks5 proxy settings.
"""
if futures:
session = requests_futures.sessions.FuturesSession(max_workers=DARC_CPU)
else:
session = requests.Session()
session.headers['User-Agent'] = default_user_agent(proxy='Socks5')
session.proxies.update({
'http': 'socks5h://localhost:9293',
'https': 'socks5h://localhost:9293',
})
return session
def socks5_driver():
"""Socks5 proxy driver.
Returns:
selenium.webdriver.Chrome: The web driver object with Socks5 proxy settings.
"""
options = selenium.webdriver.ChromeOptions()
options.binary_location = BINARY_LOCATION
options.add_argument('--proxy-server=socks5://localhost:9293')
options.add_argument('--host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE localhost"')
proxy = selenium.webdriver.Proxy()
proxy.proxyType = selenium.webdriver.common.proxy.ProxyType.MANUAL
proxy.http_proxy = 'socks5://localhost:9293'
proxy.ssl_proxy = 'socks5://localhost:9293'
capabilities = selenium.webdriver.DesiredCapabilities.CHROME.copy()
proxy.add_to_capabilities(capabilities)
driver = selenium.webdriver.Chrome(options=options,
desired_capabilities=capabilities)
return driver
# register proxy
register('socks5', socks5_session, socks5_driver)
Sites Customisation¶
See also
See darc.sites.register() for technical information.
Since websites may require authentication and/or anti-robot checks,
we need to insert certain cookies and animate some user interactions
to bypass such requirements. darc decides which customisation to
use based on the hostname, i.e. the host value of
the target link.
By default, darc uses darc.sites.default as the no-op
for both requests sessions and selenium drivers. However,
you may use your own sites customisation by registering and/or customising
the corresponding classes, which inherit from BaseSite.
A typical sites customisation class (for better demonstration) can be defined as follows:
import time
from darc.const import SE_WAIT
from darc.sites import BaseSite, register
class MySite(BaseSite):
"""This is a site customisation class for demonstration purpose.
You may implement a module as well should you prefer."""
#: List[str]: Hostnames the sites customisation is designed for.
hostname = ['mysite.com', 'www.mysite.com']
@staticmethod
def crawler(session, link):
"""Crawler hook for my site.
Args:
session (requests.Session): Session object with proxy settings.
link (darc.link.Link): Link object to be crawled.
Returns:
requests.Response: The final response object with crawled data.
"""
# inject cookies
session.cookies.set('SessionID', 'fake-session-id-value')
response = session.get(link.url, allow_redirects=True)
return response
@staticmethod
def loader(driver, link):
"""Loader hook for my site.
Args:
driver (selenium.webdriver.Chrome): Web driver object with proxy settings.
link (darc.link.Link): Link object to be loaded.
Returns:
selenium.webdriver.Chrome: The web driver object with loaded data.
"""
# land on login page
driver.get('https://%s/login' % link.host)
# animate login attempt
form = driver.find_element_by_id('login-form')
form.find_element_by_id('username').send_keys('admin')
form.find_element_by_id('password').send_keys('p@ssd')
        form.submit()  # note: calling click() on the form element would not submit it
driver.get(link.url)
# wait for page to finish loading
if SE_WAIT is not None:
time.sleep(SE_WAIT)
return driver
# register sites
register(MySite)
Important
Please note that you may raise darc.error.LinkNoReturn
in the crawler and/or loader
methods to indicate that such a link should be ignored and removed
from the task queues, as done in e.g. darc.sites.data.
A minimal sketch follows.
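A minimal sketch, with a hypothetical hostname, that simply drops every link of a given site:
from darc.error import LinkNoReturn
from darc.sites import BaseSite, register

class DropSite(BaseSite):
    """Hypothetical customisation that discards every link of this host."""

    hostname = ['unwanted.example']

    @staticmethod
    def crawler(session, link):
        raise LinkNoReturn  # remove the link from the task queues

    @staticmethod
    def loader(driver, link):
        raise LinkNoReturn

register(DropSite)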
Docker Integration¶
The darc project is integrated with Docker and
Compose. Though the image is published to Docker Hub, you can
still build it yourself.
Important
The debug image contains miscellaneous documents,
i.e. the whole repository, and comes with some useful
tools pre-installed for debugging, such as IPython, etc.
The Docker image is based on Ubuntu Focal (20.04 LTS), per the
Dockerfile below, setting up all Python dependencies for the darc
project, installing Google Chrome (version
79.0.3945.36) and the corresponding ChromeDriver, as well as
installing and configuring the Tor, I2P, ZeroNet, Freenet
and NoIP proxies.
Note
NoIP is currently not fully integrated into darc
due to a misunderstanding in the configuration
process. Contributions are welcome.
When building the image, there is an optional argument
for setting up a non-root user, c.f. the environment variable
DARC_USER and the module constant DARC_USER.
By default, the username is darc.
Content of Dockerfile
FROM ubuntu:focal
LABEL org.opencontainers.image.title="darc" \
org.opencontainers.image.description="Darkweb Crawler Project" \
org.opencontainers.image.url="https://darc.jarryshaw.me/" \
org.opencontainers.image.source="https://github.com/JarryShaw/darc" \
org.opencontainers.image.version="0.9.4.post2" \
org.opencontainers.image.licenses='BSD 3-Clause "New" or "Revised" License'
STOPSIGNAL SIGINT
HEALTHCHECK --interval=1h --timeout=1m \
CMD wget https://httpbin.org/get -O /dev/null || exit 1
ARG DARC_USER="darc"
ENV LANG="C.UTF-8" \
LC_ALL="C.UTF-8" \
PYTHONIOENCODING="UTF-8" \
DEBIAN_FRONTEND="teletype" \
DARC_USER="${DARC_USER}"
# DEBIAN_FRONTEND="noninteractive"
COPY extra/retry.sh /usr/local/bin/retry
COPY extra/install.py /usr/local/bin/pty-install
COPY vendor/jdk-11.0.11_linux-x64_bin.tar.gz /var/cache/oracle-jdk11-installer-local/
RUN set -x \
&& retry apt-get update \
&& retry apt-get install --yes --no-install-recommends \
apt-utils \
&& retry apt-get install --yes --no-install-recommends \
gcc \
g++ \
libmagic1 \
make \
software-properties-common \
tar \
unzip \
zlib1g-dev \
&& retry add-apt-repository ppa:deadsnakes/ppa --yes \
&& retry add-apt-repository ppa:linuxuprising/java --yes \
&& retry add-apt-repository ppa:i2p-maintainers/i2p --yes
RUN retry apt-get update \
&& retry apt-get install --yes --no-install-recommends \
python3.9-dev \
python3-pip \
python3-setuptools \
python3-wheel \
&& ln -sf /usr/bin/python3.9 /usr/local/bin/python3
RUN retry pty-install --stdin '6\n70' apt-get install --yes --no-install-recommends \
tzdata \
&& retry pty-install --stdin 'yes' apt-get install --yes \
oracle-java11-installer-local
RUN retry apt-get install --yes --no-install-recommends \
sudo \
&& adduser --disabled-password --gecos '' ${DARC_USER} \
&& adduser ${DARC_USER} sudo \
&& echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
## Tor
RUN retry apt-get install --yes --no-install-recommends tor
COPY extra/torrc.focal /etc/tor/torrc
## I2P
RUN retry apt-get install --yes --no-install-recommends i2p
COPY extra/i2p.focal /etc/defaults/i2p
## ZeroNet
COPY vendor/ZeroNet-linux-dist-linux64.tar.gz /tmp
RUN set -x \
&& cd /tmp \
&& tar xvpfz ZeroNet-linux-dist-linux64.tar.gz \
&& mv ZeroNet-linux-dist-linux64 /usr/local/src/zeronet
COPY extra/zeronet.focal.conf /usr/local/src/zeronet/zeronet.conf
## FreeNet
USER darc
COPY vendor/new_installer_offline.jar /tmp
RUN set -x \
&& cd /tmp \
&& ( pty-install --stdin '/home/darc/freenet\n1' java -jar new_installer_offline.jar || true ) \
&& sudo mv /home/darc/freenet /usr/local/src/freenet
USER root
## NoIP
COPY vendor/noip-duc-linux.tar.gz /tmp
RUN set -x \
&& cd /tmp \
&& tar xvpfz noip-duc-linux.tar.gz \
&& mv noip-2.1.9-1 /usr/local/src/noip \
&& cd /usr/local/src/noip \
&& make
# && make install
# # set up timezone
# RUN echo 'Asia/Shanghai' > /etc/timezone \
# && rm -f /etc/localtime \
# && ln -snf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
# && dpkg-reconfigure -f noninteractive tzdata
COPY vendor/chromedriver_linux64.zip \
vendor/google-chrome-stable_current_amd64.deb /tmp/
RUN set -x \
## ChromeDriver
&& unzip -d /usr/bin /tmp/chromedriver_linux64.zip \
&& which chromedriver \
## Google Chrome
&& ( dpkg --install /tmp/google-chrome-stable_current_amd64.deb || true ) \
&& retry apt-get install --fix-broken --yes --no-install-recommends \
&& dpkg --install /tmp/google-chrome-stable_current_amd64.deb \
&& which google-chrome
# Using pip:
COPY requirements.txt /tmp
RUN python3 -m pip install -r /tmp/requirements.txt --no-cache-dir
RUN set -x \
&& rm -rf \
## APT repository lists
/var/lib/apt/lists/* \
## Python dependencies
/tmp/requirements.txt \
/tmp/pip \
## ChromeDriver
/tmp/chromedriver_linux64.zip \
## Google Chrome
/tmp/google-chrome-stable_current_amd64.deb \
## Vendors
/tmp/new_installer_offline.jar \
/tmp/noip-duc-linux.tar.gz \
/tmp/ZeroNet-linux-dist-linux64.tar.gz \
#&& apt-get remove --auto-remove --yes \
# software-properties-common \
# unzip \
&& apt-get autoremove -y \
&& apt-get autoclean \
&& apt-get clean
ENTRYPOINT [ "python3", "-m", "darc" ]
#ENTRYPOINT [ "bash", "/app/run.sh" ]
CMD [ "--help" ]
WORKDIR /app
COPY darc/ /app/darc/
COPY LICENSE \
MANIFEST.in \
README.rst \
extra/run.sh \
setup.cfg \
setup.py \
test_darc.py /app/
RUN python3 -m pip install -e .
Note
retry is a shell script that retries the given command until it succeeds.
Content of retry
#!/usr/bin/env bash
while true; do
    >&2 echo "+ $*"
    "$@" && break
>&2 echo "exit: $?"
done
>&2 echo "exit: 0"
pty-install is a Python script simulating user input for APT package
installation with DEBIAN_FRONTEND set as Teletype.
Content of pty-install
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Install packages requiring interactions."""
import argparse
import os
import subprocess # nosec: B404
import sys
import tempfile
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from argparse import ArgumentParser
def get_parser() -> 'ArgumentParser':
"""Argument parser."""
parser = argparse.ArgumentParser('install',
description='pseudo-interactive package installer')
parser.add_argument('-i', '--stdin', help='content for input')
parser.add_argument('command', nargs=argparse.REMAINDER, help='command to execute')
return parser
def main() -> int:
"""Entrypoint."""
parser = get_parser()
args = parser.parse_args()
text = args.stdin.encode().decode('unicode_escape')
path = tempfile.mktemp(prefix='install-') # nosec: B306
with open(path, 'w') as file:
file.write(text)
with open(path, 'r') as file:
proc = subprocess.run(args.command, stdin=file) # pylint: disable=subprocess-run-check # nosec: B603
os.remove(path)
return proc.returncode
if __name__ == "__main__":
sys.exit(main())
As always, you can also use Docker Compose to manage the darc
image. Environment variables can be set as described in the
configuration section.
Content of docker-compose.yml
version: '3'
services:
crawler:
image: jsnbzh/darc:latest
build: &build
context: .
args:
# non-root user
DARC_USER: "darc"
container_name: crawler
#entrypoint: [ "bash", "/app/run.sh" ]
command: [ "--type", "crawler",
"--file", "/app/text/tor.txt",
"--file", "/app/text/tor2web.txt",
"--file", "/app/text/i2p.txt",
"--file", "/app/text/zeronet.txt",
"--file", "/app/text/freenet.txt",
"--file", "/app/text/clinic.txt" ]
environment:
## [PYTHON] force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# reboot mode
DARC_REBOOT: 0
# debug mode
DARC_DEBUG: 0
# verbose mode
DARC_VERBOSE: 1
# force mode (ignore robots.txt)
DARC_FORCE: 1
# check mode (check proxy and hostname before crawling)
DARC_CHECK: 1
# check mode (check content type before crawling)
DARC_CHECK_CONTENT_TYPE: 0
# save mode
DARC_SAVE: 0
      # save mode (for requests)
      DARC_SAVE_REQUESTS: 0
      # save mode (for selenium)
      DARC_SAVE_SELENIUM: 0
# processes
DARC_CPU: 16
# multiprocessing
DARC_MULTIPROCESSING: 1
# multithreading
DARC_MULTITHREADING: 0
# time lapse
DARC_WAIT: 60
# bulk size
DARC_BULK_SIZE: 1000
# data storage
PATH_DATA: "data"
      # save data submission
SAVE_DB: 0
# Redis URL
REDIS_URL: 'redis://:UCf7y123aHgaYeGnvLRasALjFfDVHGCz6KiR5Z0WC0DL4ExvSGw5SkcOxBywc0qtZBHVrSVx2QMGewXNP6qVow@redis'
# database URL
#DB_URL: 'mysql://root:b8y9dpz3MJSQtwnZIW77ydASBOYfzA7HJfugv77wLrWQzrjCx5m3spoaiqRi4kU52syYy2jxJZR3U2kwPkEVTA@db'
# max pool
DARC_MAX_POOL: 10
# Tor proxy & control port
TOR_PORT: 9050
TOR_CTRL: 9051
# Tor management method
TOR_STEM: 1
# Tor authentication
TOR_PASS: "16:B9D36206B5374B3F609045F9609EE670F17047D88FF713EFB9157EA39F"
# Tor bootstrap retry
TOR_RETRY: 10
# Tor bootstrap wait
TOR_WAIT: 90
# Tor bootstrap config
TOR_CFG: "{}"
# I2P port
I2P_PORT: 4444
# I2P bootstrap retry
I2P_RETRY: 10
# I2P bootstrap wait
I2P_WAIT: 90
# I2P bootstrap config
I2P_ARGS: ""
# ZeroNet port
ZERONET_PORT: 43110
# ZeroNet bootstrap retry
ZERONET_RETRY: 10
# ZeroNet project path
ZERONET_PATH: "/usr/local/src/zeronet"
# ZeroNet bootstrap wait
ZERONET_WAIT: 90
# ZeroNet bootstrap config
ZERONET_ARGS: ""
# Freenet port
FREENET_PORT: 8888
# Freenet bootstrap retry
FREENET_RETRY: 0
# Freenet project path
FREENET_PATH: "/usr/local/src/freenet"
# Freenet bootstrap wait
FREENET_WAIT: 90
# Freenet bootstrap config
FREENET_ARGS: ""
# time delta for caches in seconds
TIME_CACHE: 2_592_000 # 30 days
# time to wait for selenium
SE_WAIT: 5
# extract link pattern
LINK_WHITE_LIST: '[
".*?\\.onion",
".*?\\.i2p", "127\\.0\\.0\\.1:7657", "localhost:7657", "127\\.0\\.0\\.1:7658", "localhost:7658",
"127\\.0\\.0\\.1:43110", "localhost:43110",
"127\\.0\\.0\\.1:8888", "localhost:8888"
]'
# link black list
LINK_BLACK_LIST: '[ "(.*\\.)?facebookcorewwwi\\.onion", "(.*\\.)?nytimes3xbfgragh\\.onion" ]'
# link fallback flag
LINK_FALLBACK: 1
# content type white list
MIME_WHITE_LIST: '[ "text/html", "application/xhtml+xml" ]'
# content type black list
MIME_BLACK_LIST: '[ "text/css", "application/javascript", "text/json" ]'
# content type fallback flag
MIME_FALLBACK: 0
# proxy type white list
PROXY_WHITE_LIST: '[ "tor", "i2p", "freenet", "zeronet", "tor2web" ]'
# proxy type black list
PROXY_BLACK_LIST: '[ "null", "data" ]'
# proxy type fallback flag
PROXY_FALLBACK: 0
# API retry times
API_RETRY: 10
# API URLs
#API_NEW_HOST: 'https://example.com/api/new_host'
#API_REQUESTS: 'https://example.com/api/requests'
#API_SELENIUM: 'https://example.com/api/selenium'
restart: "always"
networks: &networks
- darc
volumes: &volumes
- ./text:/app/text
- ./extra:/app/extra
- /data/darc:/app/data
loader:
image: jsnbzh/darc:latest
build: *build
container_name: loader
#entrypoint: [ "bash", "/app/run.sh" ]
command: [ "--type", "loader" ]
environment:
## [PYTHON] force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# reboot mode
DARC_REBOOT: 0
# debug mode
DARC_DEBUG: 0
# verbose mode
DARC_VERBOSE: 1
# force mode (ignore robots.txt)
DARC_FORCE: 1
# check mode (check proxy and hostname before crawling)
DARC_CHECK: 1
# check mode (check content type before crawling)
DARC_CHECK_CONTENT_TYPE: 0
# save mode
DARC_SAVE: 0
      # save mode (for requests)
      DARC_SAVE_REQUESTS: 0
      # save mode (for selenium)
      DARC_SAVE_SELENIUM: 0
# processes
DARC_CPU: 1
# multiprocessing
DARC_MULTIPROCESSING: 0
# multithreading
DARC_MULTITHREADING: 0
# time lapse
DARC_WAIT: 60
# data storage
PATH_DATA: "data"
# Redis URL
REDIS_URL: 'redis://:UCf7y123aHgaYeGnvLRasALjFfDVHGCz6KiR5Z0WC0DL4ExvSGw5SkcOxBywc0qtZBHVrSVx2QMGewXNP6qVow@redis'
# database URL
#DB_URL: 'mysql://root:b8y9dpz3MJSQtwnZIW77ydASBOYfzA7HJfugv77wLrWQzrjCx5m3spoaiqRi4kU52syYy2jxJZR3U2kwPkEVTA@db'
# max pool
DARC_MAX_POOL: 10
      # save data submission
SAVE_DB: 0
# Tor proxy & control port
TOR_PORT: 9050
TOR_CTRL: 9051
# Tor management method
TOR_STEM: 1
# Tor authentication
TOR_PASS: "16:B9D36206B5374B3F609045F9609EE670F17047D88FF713EFB9157EA39F"
# Tor bootstrap retry
TOR_RETRY: 10
# Tor bootstrap wait
TOR_WAIT: 90
# Tor bootstrap config
TOR_CFG: "{}"
# I2P port
I2P_PORT: 4444
# I2P bootstrap retry
I2P_RETRY: 10
# I2P bootstrap wait
I2P_WAIT: 90
# I2P bootstrap config
I2P_ARGS: ""
# ZeroNet port
ZERONET_PORT: 43110
# ZeroNet bootstrap retry
ZERONET_RETRY: 10
# ZeroNet project path
ZERONET_PATH: "/usr/local/src/zeronet"
# ZeroNet bootstrap wait
ZERONET_WAIT: 90
# ZeroNet bootstrap config
ZERONET_ARGS: ""
# Freenet port
FREENET_PORT: 8888
# Freenet bootstrap retry
FREENET_RETRY: 0
# Freenet project path
FREENET_PATH: "/usr/local/src/freenet"
# Freenet bootstrap wait
FREENET_WAIT: 90
# Freenet bootstrap config
FREENET_ARGS: ""
# time delta for caches in seconds
TIME_CACHE: 2_592_000 # 30 days
# time to wait for selenium
SE_WAIT: 5
# extract link pattern
LINK_WHITE_LIST: '[
".*?\\.onion",
".*?\\.i2p", "127\\.0\\.0\\.1:7657", "localhost:7657", "127\\.0\\.0\\.1:7658", "localhost:7658",
"127\\.0\\.0\\.1:43110", "localhost:43110",
"127\\.0\\.0\\.1:8888", "localhost:8888"
]'
# link black list
LINK_BLACK_LIST: '[ "(.*\\.)?facebookcorewwwi\\.onion", "(.*\\.)?nytimes3xbfgragh\\.onion" ]'
# link fallback flag
LINK_FALLBACK: 1
# content type white list
MIME_WHITE_LIST: '[ "text/html", "application/xhtml+xml" ]'
# content type black list
MIME_BLACK_LIST: '[ "text/css", "application/javascript", "text/json" ]'
# content type fallback flag
MIME_FALLBACK: 0
# proxy type white list
PROXY_WHITE_LIST: '[ "tor", "i2p", "freenet", "zeronet", "tor2web" ]'
# proxy type black list
PROXY_BLACK_LIST: '[ "null", "data" ]'
# proxy type fallback flag
PROXY_FALLBACK: 0
# API retry times
API_RETRY: 10
# API URLs
#API_NEW_HOST: 'https://example.com/api/new_host'
#API_REQUESTS: 'https://example.com/api/requests'
#API_SELENIUM: 'https://example.com/api/selenium'
restart: "always"
networks: *networks
volumes: *volumes
# network settings
networks:
darc:
driver: bridge
Note
Should you wish to run darc in reboot mode, i.e. set
DARC_REBOOT and/or REBOOT as True,
you may wish to change the entrypoint to bash /app/run.sh,
where run.sh is a shell script that wraps around darc
especially for reboot mode.
Content of run.sh
#!/usr/bin/env bash
set -e
# time lapse
WAIT=${DARC_WAIT=10}
# signal handlers
trap '[ -f ${PATH_DATA}/darc.pid ] && kill -2 $(cat ${PATH_DATA}/darc.pid)' SIGINT SIGTERM  # SIGKILL cannot be trapped
# initialise
echo "+ Starting application..."
python3 -m darc "$@"
sleep ${WAIT}
# mainloop
while true; do
echo "+ Restarting application..."
python3 -m darc
sleep ${WAIT}
done
In such a scenario, you can customise your run.sh to,
for instance, archive and upload the data crawled by darc
to somewhere else, saving some disk space.
Web Backend Demo¶
This is a demo of the API for communication between the
darc crawlers (darc.submit) and the web UI.
See also
Please refer to the data schema for more information about the submission data.
The demo assumes the web UI is developed using the Flask
microframework.
# -*- coding: utf-8 -*-
import sys
from typing import TYPE_CHECKING
import flask
if TYPE_CHECKING:
from flask import Response
# Flask application
app = flask.Flask(__file__)
@app.route('/api/new_host', methods=['POST'])
def new_host() -> 'Response':
"""When a new host is discovered, the :mod:`darc` crawler will submit the
host information. Such includes ``robots.txt`` (if exists) and
``sitemap.xml`` (if any).
Data format::
{
// partial flag - true / false
"$PARTIAL$": ...,
// force flag - true / false
"$FORCE$": ...,
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
                // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...,
// originate URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"backref": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// robots.txt from the host (if not exists, then ``null``)
"Robots": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/robots.txt
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
// sitemaps from the host (if none, then ``null``)
"Sitemaps": [
{
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/sitemap_<name>.xml
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
...
],
// hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)
"Hosts": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/hosts.txt
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
}
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
return flask.make_response()
@app.route('/api/requests', methods=['POST'])
def from_requests() -> 'Response':
"""When crawling, we'll first fetch the URl using ``requests``, to check
its availability and to save its HTTP headers information. Such information
will be submitted to the web UI.
Data format::
{
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
                // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...,
// originate URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"backref": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// request method
"Method": "GET",
// response status code
"Status-Code": ...,
// response reason
"Reason": ...,
// response cookies (if any)
"Cookies": {
...
},
// session cookies (if any)
"Session": {
...
},
// request headers (if any)
"Request": {
...
},
// response headers (if any)
"Response": {
...
},
// content type
"Content-Type": ...,
// requested file (if not exists, then ``null``)
"Document": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html
// or if the document is of generic content type, i.e. not HTML
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
// redirection history (if any)
"History": [
// same records as the original response
{"...": "..."}
]
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
return flask.make_response()
@app.route('/api/selenium', methods=['POST'])
def from_selenium() -> 'Response':
"""After crawling with ``requests``, we'll then render the URl using
``selenium`` with Google Chrome and its driver, to provide a fully rendered
web page. Such information will be submitted to the web UI.
Note:
This information is optional, only provided if the content type from
``requests`` is HTML, status code < 400, and HTML data not empty.
Data format::
{
// metadata of URL
"[metadata]": {
// original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"url": ...,
// proxy type - null / tor / i2p / zeronet / freenet
"proxy": ...,
// hostname / netloc, c.f. ``urllib.parse.urlparse``
"host": ...,
                // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
"base": ...,
// sha256 of URL as name for saved files (timestamp is in ISO format)
// JSON log as this one - <base>/<name>_<timestamp>.json
// HTML from requests - <base>/<name>_<timestamp>_raw.html
// HTML from selenium - <base>/<name>_<timestamp>.html
// generic data files - <base>/<name>_<timestamp>.dat
"name": ...,
// originate URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
"backref": ...
},
// requested timestamp in ISO format as in name of saved file
"Timestamp": ...,
// original URL
"URL": ...,
// rendered HTML document (if not exists, then ``null``)
"Document": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.html
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
},
// web page screenshot (if not exists, then ``null``)
"Screenshot": {
// path of the file, relative path (to data root path ``PATH_DATA``) in container
// - <proxy>/<scheme>/<host>/<name>_<timestamp>.png
"path": ...,
// content of the file (**base64** encoded)
"data": ...,
}
}
"""
# JSON data from the request
data = flask.request.json # pylint: disable=unused-variable
# do whatever processing needed
...
return flask.make_response()
if __name__ == "__main__":
sys.exit(app.run()) # type: ignore[func-returns-value]
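For a quick smoke test of the demo API above, one could post a hand-crafted payload; the following is a minimal sketch, assuming the app runs on Flask's default http://localhost:5000 and following the new_host data format (all values are hypothetical):
import requests

payload = {
    '$PARTIAL$': False,
    '$FORCE$': False,
    '[metadata]': {
        'url': 'http://example.onion/',
        'proxy': 'tor',
        'host': 'example.onion',
        'base': 'tor/http/example.onion',
        'name': '0' * 64,  # placeholder for the sha256 of the URL
        'backref': None,
    },
    'Timestamp': '2021-01-01T00:00:00',
    'URL': 'http://example.onion/',
    'Robots': None,
    'Sitemaps': None,
    'Hosts': None,
}
response = requests.post('http://localhost:5000/api/new_host', json=payload)
print(response.status_code)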
Data Models Demo¶
This is a demo of data models for database storage of
the data submitted by the darc crawlers.
The demo assumes the database uses peewee
as the ORM and MySQL as the backend.
# -*- coding: utf-8 -*-
# pylint: disable=ungrouped-imports
import os
from typing import TYPE_CHECKING

import peewee
import playhouse.db_url
import playhouse.mysql_ext
import playhouse.shortcuts

if TYPE_CHECKING:
    from datetime import datetime
    from typing import Any, Dict, List

# database client
DB = playhouse.db_url.connect(os.getenv('DB_URL', 'mysql://127.0.0.1'))
def table_function(model_class: peewee.Model) -> str:
"""Generate table name dynamically.
The function strips ``Model`` from the class name and
calls :func:`peewee.make_snake_case` to generate a
proper table name.
Args:
model_class: Data model class.
Returns:
Generated table name.
"""
name = model_class.__name__ # type: str
if name.endswith('Model'):
name = name[:-5] # strip ``Model`` suffix
return peewee.make_snake_case(name)
class BaseMeta:
"""Basic metadata for data models."""
#: Reference database storage (c.f. :class:`~darc.const.DB`).
database = DB
#: Generate table name dynamically (c.f. :func:`~darc.model.table_function`).
table_function = table_function
class BaseModel(peewee.Model):
"""Base model with standard patterns.
Notes:
The model will implicitly have a :class:`~peewee.AutoField`
attribute named as :attr:`id`.
"""
#: Basic metadata for data models.
Meta = BaseMeta
def to_dict(self, keep_id: bool = False) -> 'Dict[str, Any]':
"""Convert record to :obj:`dict`.
Args:
keep_id: If keep the ID auto field.
Returns:
The data converted through :func:`playhouse.shortcuts.model_to_dict`.
"""
data = playhouse.shortcuts.model_to_dict(self)
if keep_id:
return data
if 'id' in data:
del data['id']
return data
class HostnameModel(BaseModel):
"""Data model for a hostname record."""
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
hostname: str = peewee.TextField()
#: Proxy type (c.f. :attr:`link.proxy <darc.link.Link.proxy>`).
proxy: str = peewee.CharField(max_length=8)
#: Timestamp of first ``new_host`` submission.
discovery: 'datetime' = peewee.DateTimeField()
#: Timestamp of last related submission.
last_seen: 'datetime' = peewee.DateTimeField()
class RobotsModel(BaseModel):
"""Data model for ``robots.txt`` data."""
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
host: 'HostnameModel' = peewee.ForeignKeyField(HostnameModel, backref='robots')
#: Timestamp of the submission.
timestamp: 'datetime' = peewee.DateTimeField()
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
class SitemapModel(BaseModel):
"""Data model for ``sitemap.xml`` data."""
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
host: 'HostnameModel' = peewee.ForeignKeyField(HostnameModel, backref='sitemaps')
#: Timestamp of the submission.
timestamp: 'datetime' = peewee.DateTimeField()
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
class HostsModel(BaseModel):
"""Data model for ``hosts.txt`` data."""
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
host: 'HostnameModel' = peewee.ForeignKeyField(HostnameModel, backref='hosts')
#: Timestamp of the submission.
timestamp: 'datetime' = peewee.DateTimeField()
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
class URLModel(BaseModel):
"""Data model for a requested URL."""
#: Timestamp of last related submission.
last_seen: 'datetime' = peewee.DateTimeField()
#: Original URL (c.f. :attr:`link.url <darc.link.Link.url>`).
url: str = peewee.TextField()
#: Hostname (c.f. :attr:`link.host <darc.link.Link.host>`).
host: HostnameModel = peewee.ForeignKeyField(HostnameModel, backref='urls')
#: Proxy type (c.f. :attr:`link.proxy <darc.link.Link.proxy>`).
proxy: str = peewee.CharField(max_length=8)
#: Base path (c.f. :attr:`link.base <darc.link.Link.base>`).
base: str = peewee.CharField()
#: Link hash (c.f. :attr:`link.name <darc.link.Link.name>`).
name: str = peewee.FixedCharField(max_length=64)
@classmethod
def get_by_url(cls, url: str) -> 'URLModel':
"""Select by URL.
Args:
url: URL to select.
Returns:
Selected URL model.
"""
return cls.get(cls.url == url)
@property
def parents(self) -> 'List[URLModel]':
"""Back reference to where the URL was identified."""
return (URLModel
.select()
.join(URLThroughModel, on=URLThroughModel.parent)
.where(URLThroughModel.child == self)
.order_by(URLModel.url))
    @property
    def children(self) -> 'List[URLModel]':
"""Back reference to which URLs were identified from the URL."""
return (URLModel
.select()
.join(URLThroughModel, on=URLThroughModel.child)
.where(URLThroughModel.parent == self)
.order_by(URLModel.url))
class URLThroughModel(BaseModel):
"""Data model for the map of URL extration chain."""
#: Back reference to where the URL was identified.
parent: 'List[URLModel]' = peewee.ForeignKeyField(URLModel, backref='parents')
#: Back reference to which URLs were identified from the URL.
child: 'List[URLModel]' = peewee.ForeignKeyField(URLModel, backref='children')
class Meta(BaseMeta):
indexes = (
# Specify a unique multi-column index on from/to-url.
(('parent', 'child'), True),
)
class RequestsDocumentModel(BaseModel):
"""Data model for documents from ``requests`` submission."""
#: Original URL (c.f. :attr:`link.url <darc.link.Link.url>`).
url: 'URLModel' = peewee.ForeignKeyField(URLModel, backref='requests')
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
class SeleniumDocumentModel(BaseModel):
"""Data model for documents from ``selenium`` submission."""
#: Original URL (c.f. :attr:`link.url <darc.link.Link.url>`).
url: 'URLModel' = peewee.ForeignKeyField(URLModel, backref='selenium')
#: Document data as :obj:`bytes`.
data: bytes = peewee.BlobField()
#: Path to the document.
path: str = peewee.CharField()
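A minimal usage sketch for the demo models above, assuming DB_URL points at a reachable MySQL server:
from datetime import datetime

# create the tables once, then record a (hypothetical) hostname
DB.create_tables([HostnameModel, RobotsModel, SitemapModel, HostsModel,
                  URLModel, URLThroughModel,
                  RequestsDocumentModel, SeleniumDocumentModel])

now = datetime.now()
host = HostnameModel.create(hostname='example.onion', proxy='tor',
                            discovery=now, last_seen=now)
print(host.to_dict())  # the ID auto field is stripped by default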
Submission Data Schema¶
To better describe the submitted data, darc provides
several JSON schemas generated from pydantic models;
a sketch of how to generate them follows.
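A minimal sketch, assuming the pydantic models from the Model Definitions section below (e.g. NewHostModel) have been imported:
# pydantic (v1) models expose their JSON schema directly, e.g.:
print(NewHostModel.schema_json(indent=2))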
New Host Submission¶
The data submission from darc.submit.submit_new_host().
{
"title": "new_host",
"description": "Data submission from :func:`darc.submit.submit_new_host`.",
"type": "object",
"properties": {
"$PARTIAL$": {
"title": "$Partial$",
"description": "partial flag - true / false",
"type": "boolean"
},
"$RELOAD$": {
"title": "$Reload$",
"description": "reload flag - true / false",
"type": "boolean"
},
"[metadata]": {
"title": "[Metadata]",
"description": "metadata of URL",
"allOf": [
{
"$ref": "#/definitions/Metadata"
}
]
},
"Timestamp": {
"title": "Timestamp",
"description": "requested timestamp in ISO format as in name of saved file",
"type": "string",
"format": "date-time"
},
"URL": {
"title": "Url",
"description": "original URL",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"Robots": {
"title": "Robots",
"description": "robots.txt from the host (if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/RobotsDocument"
}
]
},
"Sitemaps": {
"title": "Sitemaps",
"description": "sitemaps from the host (if none, then ``null``)",
"type": "array",
"items": {
"$ref": "#/definitions/SitemapDocument"
}
},
"Hosts": {
"title": "Hosts",
"description": "hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/HostsDocument"
}
]
}
},
"required": [
"$PARTIAL$",
"$RELOAD$",
"[metadata]",
"Timestamp",
"URL"
],
"definitions": {
"Proxy": {
"title": "Proxy",
"description": "Proxy type.",
"enum": [
"null",
"tor",
"i2p",
"zeronet",
"freenet"
],
"type": "string"
},
"Metadata": {
"title": "metadata",
"description": "Metadata of URL.",
"type": "object",
"properties": {
"url": {
"title": "Url",
"description": "original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"proxy": {
"$ref": "#/definitions/Proxy"
},
"host": {
"title": "Host",
"description": "hostname / netloc, c.f. ``urllib.parse.urlparse``",
"type": "string"
},
"base": {
"title": "Base",
"description": "base folder, relative path (to data root path ``PATH_DATA``) in containter - <proxy>/<scheme>/<host>",
"type": "string"
},
"name": {
"title": "Name",
"description": "sha256 of URL as name for saved files (timestamp is in ISO format) - JSON log as this one: <base>/<name>_<timestamp>.json; - HTML from requests: <base>/<name>_<timestamp>_raw.html; - HTML from selenium: <base>/<name>_<timestamp>.html; - generic data files: <base>/<name>_<timestamp>.dat",
"type": "string"
}
},
"required": [
"url",
"proxy",
"host",
"base",
"name"
]
},
"RobotsDocument": {
"title": "RobotsDocument",
"description": "``robots.txt`` document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/robots.txt",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
},
"SitemapDocument": {
"title": "SitemapDocument",
"description": "Sitemaps document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/sitemap_<name>.xml",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
},
"HostsDocument": {
"title": "HostsDocument",
"description": "``hosts.txt`` document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/hosts.txt",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
}
}
}
Requests Submission¶
The data submission from darc.submit.submit_requests().
{
"title": "requests",
"description": "Data submission from :func:`darc.submit.submit_requests`.",
"type": "object",
"properties": {
"$PARTIAL$": {
"title": "$Partial$",
"description": "partial flag - true / false",
"type": "boolean"
},
"[metadata]": {
"title": "[Metadata]",
"description": "metadata of URL",
"allOf": [
{
"$ref": "#/definitions/Metadata"
}
]
},
"Timestamp": {
"title": "Timestamp",
"description": "requested timestamp in ISO format as in name of saved file",
"type": "string",
"format": "date-time"
},
"URL": {
"title": "Url",
"description": "original URL",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"Method": {
"title": "Method",
"description": "request method",
"type": "string"
},
"Status-Code": {
"title": "Status-Code",
"description": "response status code",
"exclusiveMinimum": 0,
"type": "integer"
},
"Reason": {
"title": "Reason",
"description": "response reason",
"type": "string"
},
"Cookies": {
"title": "Cookies",
"description": "response cookies (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Session": {
"title": "Session",
"description": "session cookies (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Request": {
"title": "Request",
"description": "request headers (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Response": {
"title": "Response",
"description": "response headers (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Content-Type": {
"title": "Content-Type",
"description": "content type",
"pattern": "[a-zA-Z0-9.-]+/[a-zA-Z0-9.-]+",
"type": "string"
},
"Document": {
"title": "Document",
"description": "requested file (if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/RequestsDocument"
}
]
},
"History": {
"title": "History",
"description": "redirection history (if any)",
"type": "array",
"items": {
"$ref": "#/definitions/HistoryModel"
}
}
},
"required": [
"$PARTIAL$",
"[metadata]",
"Timestamp",
"URL",
"Method",
"Status-Code",
"Reason",
"Cookies",
"Session",
"Request",
"Response",
"Content-Type",
"History"
],
"definitions": {
"Proxy": {
"title": "Proxy",
"description": "Proxy type.",
"enum": [
"null",
"tor",
"i2p",
"zeronet",
"freenet"
],
"type": "string"
},
"Metadata": {
"title": "metadata",
"description": "Metadata of URL.",
"type": "object",
"properties": {
"url": {
"title": "Url",
"description": "original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"proxy": {
"$ref": "#/definitions/Proxy"
},
"host": {
"title": "Host",
"description": "hostname / netloc, c.f. ``urllib.parse.urlparse``",
"type": "string"
},
"base": {
"title": "Base",
"description": "base folder, relative path (to data root path ``PATH_DATA``) in containter - <proxy>/<scheme>/<host>",
"type": "string"
},
"name": {
"title": "Name",
"description": "sha256 of URL as name for saved files (timestamp is in ISO format) - JSON log as this one: <base>/<name>_<timestamp>.json; - HTML from requests: <base>/<name>_<timestamp>_raw.html; - HTML from selenium: <base>/<name>_<timestamp>.html; - generic data files: <base>/<name>_<timestamp>.dat",
"type": "string"
}
},
"required": [
"url",
"proxy",
"host",
"base",
"name"
]
},
"RequestsDocument": {
"title": "RequestsDocument",
"description": ":mod:`requests` document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html; or if the document is of generic content type, i.e. not HTML - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
},
"HistoryModel": {
"title": "HistoryModel",
"description": ":mod:`requests` history data.",
"type": "object",
"properties": {
"URL": {
"title": "Url",
"description": "original URL",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"Method": {
"title": "Method",
"description": "request method",
"type": "string"
},
"Status-Code": {
"title": "Status-Code",
"description": "response status code",
"exclusiveMinimum": 0,
"type": "integer"
},
"Reason": {
"title": "Reason",
"description": "response reason",
"type": "string"
},
"Cookies": {
"title": "Cookies",
"description": "response cookies (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Session": {
"title": "Session",
"description": "session cookies (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Request": {
"title": "Request",
"description": "request headers (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Response": {
"title": "Response",
"description": "response headers (if any)",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"Document": {
"title": "Document",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"URL",
"Method",
"Status-Code",
"Reason",
"Cookies",
"Session",
"Request",
"Response",
"Document"
]
}
}
}
Selenium Submission¶
The data submission from darc.submit.submit_selenium().
{
"title": "selenium",
"description": "Data submission from :func:`darc.submit.submit_requests`.",
"type": "object",
"properties": {
"$PARTIAL$": {
"title": "$Partial$",
"description": "partial flag - true / false",
"type": "boolean"
},
"[metadata]": {
"title": "[Metadata]",
"description": "metadata of URL",
"allOf": [
{
"$ref": "#/definitions/Metadata"
}
]
},
"Timestamp": {
"title": "Timestamp",
"description": "requested timestamp in ISO format as in name of saved file",
"type": "string",
"format": "date-time"
},
"URL": {
"title": "Url",
"description": "original URL",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"Document": {
"title": "Document",
"description": "rendered HTML document (if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/SeleniumDocument"
}
]
},
"Screenshot": {
"title": "Screenshot",
"description": "web page screenshot (if not exists, then ``null``)",
"allOf": [
{
"$ref": "#/definitions/ScreenshotDocument"
}
]
}
},
"required": [
"$PARTIAL$",
"[metadata]",
"Timestamp",
"URL"
],
"definitions": {
"Proxy": {
"title": "Proxy",
"description": "Proxy type.",
"enum": [
"null",
"tor",
"i2p",
"zeronet",
"freenet"
],
"type": "string"
},
"Metadata": {
"title": "metadata",
"description": "Metadata of URL.",
"type": "object",
"properties": {
"url": {
"title": "Url",
"description": "original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>",
"minLength": 1,
"maxLength": 65536,
"format": "uri",
"type": "string"
},
"proxy": {
"$ref": "#/definitions/Proxy"
},
"host": {
"title": "Host",
"description": "hostname / netloc, c.f. ``urllib.parse.urlparse``",
"type": "string"
},
"base": {
"title": "Base",
"description": "base folder, relative path (to data root path ``PATH_DATA``) in containter - <proxy>/<scheme>/<host>",
"type": "string"
},
"name": {
"title": "Name",
"description": "sha256 of URL as name for saved files (timestamp is in ISO format) - JSON log as this one: <base>/<name>_<timestamp>.json; - HTML from requests: <base>/<name>_<timestamp>_raw.html; - HTML from selenium: <base>/<name>_<timestamp>.html; - generic data files: <base>/<name>_<timestamp>.dat",
"type": "string"
}
},
"required": [
"url",
"proxy",
"host",
"base",
"name"
]
},
"SeleniumDocument": {
"title": "SeleniumDocument",
"description": ":mod:`selenium` document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/<name>_<timestamp>.html",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
},
"ScreenshotDocument": {
"title": "ScreenshotDocument",
"description": "Screenshot document data.",
"type": "object",
"properties": {
"path": {
"title": "Path",
"description": "path of the file, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>/<name>_<timestamp>.png",
"type": "string"
},
"data": {
"title": "Data",
"description": "content of the file (**base64** encoded)",
"type": "string"
}
},
"required": [
"path",
"data"
]
}
}
}
Model Definitions¶
# -*- coding: utf-8 -*-
# pylint: disable=ungrouped-imports,no-name-in-module
"""JSON schema generator."""
from enum import Enum  # needed at runtime: ``Proxy`` below subclasses it
from typing import TYPE_CHECKING

import pydantic.schema
from pydantic import BaseModel, Field

__all__ = ['NewHostModel', 'RequestsModel', 'SeleniumModel']

if TYPE_CHECKING:
    from datetime import datetime
    from typing import Any, Dict, List, Optional
from pydantic import AnyUrl, PositiveInt
CookiesType = List[Dict[str, Any]]
HeadersType = Dict[str, str]
class Proxy(str, Enum):
"""Proxy type."""
null = 'null'
tor = 'tor'
i2p = 'i2p'
zeronet = 'zeronet'
freenet = 'freenet'
###############################################################################
# Miscellaneous auxiliaries
###############################################################################
class Metadata(BaseModel):
"""Metadata of URL."""
url: 'AnyUrl' = Field(
description='original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>')
proxy: 'Proxy' = Field(
description='proxy type - null / tor / i2p / zeronet / freenet')
host: str = Field(
description='hostname / netloc, c.f. ``urllib.parse.urlparse``')
base: str = Field(
description=('base folder, relative path (to data root path ``PATH_DATA``) in containter '
'- <proxy>/<scheme>/<host>'))
name: str = Field(
description=('sha256 of URL as name for saved files (timestamp is in ISO format) '
'- JSON log as this one: <base>/<name>_<timestamp>.json; '
'- HTML from requests: <base>/<name>_<timestamp>_raw.html; '
'- HTML from selenium: <base>/<name>_<timestamp>.html; '
'- generic data files: <base>/<name>_<timestamp>.dat'))
backref: str = Field(
description='originate URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>')
class Config:
title = 'metadata'
class RobotsDocument(BaseModel):
"""``robots.txt`` document data."""
path: str = Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/robots.txt'))
data: str = Field(
description='content of the file (**base64** encoded)')
class SitemapDocument(BaseModel):
"""Sitemaps document data."""
path: str = Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/sitemap_<name>.xml'))
data: str = Field(
description='content of the file (**base64** encoded)')
class HostsDocument(BaseModel):
"""``hosts.txt`` document data."""
path: str = Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/hosts.txt'))
data: str = Field(
description='content of the file (**base64** encoded)')
class RequestsDocument(BaseModel):
""":mod:`requests` document data."""
path: str = Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html; '
'or if the document is of generic content type, i.e. not HTML '
'- <proxy>/<scheme>/<host>/<name>_<timestamp>.dat'))
data: str = Field(
description='content of the file (**base64** encoded)')
class HistoryModel(BaseModel):
""":mod:`requests` history data."""
URL: 'AnyUrl' = Field(
description='original URL')
Method: str = Field(
description='request method')
status_code: 'PositiveInt' = Field(
alias='Status-Code',
description='response status code')
Reason: str = Field(
description='response reason')
Cookies: 'CookiesType' = Field(
description='response cookies (if any)')
Session: 'CookiesType' = Field(
description='session cookies (if any)')
Request: 'HeadersType' = Field(
description='request headers (if any)')
Response: 'HeadersType' = Field(
description='response headers (if any)')
Document: str = Field(
description='content of the file (**base64** encoded)')
class SeleniumDocument(BaseModel):
""":mod:`selenium` document data."""
path: str = Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/<name>_<timestamp>.html'))
data: str = Field(
description='content of the file (**base64** encoded)')
class ScreenshotDocument(BaseModel):
"""Screenshot document data."""
path: str = Field(
description=('path of the file, relative path (to data root path ``PATH_DATA``) in container '
'- <proxy>/<scheme>/<host>/<name>_<timestamp>.png'))
data: str = Field(
description='content of the file (**base64** encoded)')
###############################################################################
# JSON schema definitions
###############################################################################
class NewHostModel(BaseModel):
"""Data submission from :func:`darc.submit.submit_new_host`."""
partial: bool = Field(
alias='$PARTIAL$',
description='partial flag - true / false')
reload: bool = Field(
alias='$RELOAD$',
description='reload flag - true / false')
metadata: 'Metadata' = Field(
alias='[metadata]',
description='metadata of URL')
Timestamp: 'datetime' = Field(
description='requested timestamp in ISO format as in name of saved file')
URL: 'AnyUrl' = Field(
description='original URL')
Robots: 'Optional[RobotsDocument]' = Field(
description='robots.txt from the host (if not exists, then ``null``)')
Sitemaps: 'Optional[List[SitemapDocument]]' = Field(
description='sitemaps from the host (if none, then ``null``)')
Hosts: 'Optional[HostsDocument]' = Field(
description='hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)')
class Config:
title = 'new_host'
class RequestsModel(BaseModel):
"""Data submission from :func:`darc.submit.submit_requests`."""
partial: bool = Field(
alias='$PARTIAL$',
description='partial flag - true / false')
metadata: 'Metadata' = Field(
alias='[metadata]',
description='metadata of URL')
Timestamp: 'datetime' = Field(
description='requested timestamp in ISO format as in name of saved file')
URL: 'AnyUrl' = Field(
description='original URL')
Method: str = Field(
description='request method')
status_code: 'PositiveInt' = Field(
alias='Status-Code',
description='response status code')
Reason: str = Field(
description='response reason')
Cookies: 'CookiesType' = Field(
description='response cookies (if any)')
Session: 'CookiesType' = Field(
description='session cookies (if any)')
Request: 'HeadersType' = Field(
description='request headers (if any)')
Response: 'HeadersType' = Field(
description='response headers (if any)')
content_type: str = Field(
alias='Content-Type',
regex='[a-zA-Z0-9.-]+/[a-zA-Z0-9.-]+',
description='content type')
Document: 'Optional[RequestsDocument]' = Field(
description='requested file (if not exists, then ``null``)')
History: 'List[HistoryModel]' = Field(
description='redirection history (if any)')
class Config:
title = 'requests'
class SeleniumModel(BaseModel):
"""Data submission from :func:`darc.submit.submit_requests`."""
partial: bool = Field(
alias='$PARTIAL$',
description='partial flag - true / false')
metadata: 'Metadata' = Field(
alias='[metadata]',
description='metadata of URL')
Timestamp: 'datetime' = Field(
description='requested timestamp in ISO format as in name of saved file')
URL: 'AnyUrl' = Field(
description='original URL')
Document: 'Optional[SeleniumDocument]' = Field(
description='rendered HTML document (if not exists, then ``null``)')
Screenshot: 'Optional[ScreenshotDocument]' = Field(
description='web page screenshot (if not exists, then ``null``)')
class Config:
title = 'selenium'
if __name__ == "__main__":
import json
import os
os.makedirs('schema', exist_ok=True)
with open('schema/new_host.schema.json', 'w') as file:
print(NewHostModel.schema_json(indent=2), file=file)
with open('schema/requests.schema.json', 'w') as file:
print(RequestsModel.schema_json(indent=2), file=file)
with open('schema/selenium.schema.json', 'w') as file:
print(SeleniumModel.schema_json(indent=2), file=file)
schema = pydantic.schema.schema([NewHostModel, RequestsModel, SeleniumModel],
title='DARC Data Submission JSON Schema')
with open('schema/darc.schema.json', 'w') as file:
json.dump(schema, file, indent=2)
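The models can also be used in the opposite direction, to build and serialise a record from Python. Below is a hedged sketch, assuming the generator above is saved as schema.py; all field values are illustrative.
from datetime import datetime

from schema import Metadata, Proxy, SeleniumModel  # assuming the script above is ``schema.py``

record = SeleniumModel(
    **{'$PARTIAL$': False,
       '[metadata]': Metadata(
           url='http://example.onion/',
           proxy=Proxy.tor,
           host='example.onion',
           base='tor/http/example.onion',
           name='0123abcd',  # stand-in for the sha256 of the URL
           backref='http://origin.onion/')},
    Timestamp=datetime(2021, 1, 1),
    URL='http://example.onion/')

# export with the submission aliases ($PARTIAL$, [metadata], etc.)
print(record.json(by_alias=True, indent=2))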
Auxiliary Scripts¶
Since the darc project can be deployed through Docker Integration, we provide some auxiliary scripts to help with the deployment.
Health Check¶
- File location
Entry point:
extra/healthcheck.py
systemd service:
extra/healthcheck.service
usage: healthcheck [-h] [-f FILE] [-i INTERVAL] ...
health check running container
positional arguments:
services name of services
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE path to compose file
-i INTERVAL, --interval INTERVAL
interval (in seconds) of health check
This script watches the running status of the containers managed by Docker Compose. If a container has stopped or turned unhealthy, the script will bring it back alive.
In addition, as the internal program may halt unexpectedly whilst the container itself remains healthy, the script also watches whether the program is still active through its output messages, and restarts the containers if it appears inactive.
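For reference, the core idea of the watchdog can be sketched as follows. This is not the actual extra/healthcheck.py; the health test, the service names and the compose file path are simplified assumptions.
import subprocess
import time

def is_healthy(service, compose_file='docker-compose.yml'):
    """Check whether a Docker Compose service is up and not marked unhealthy."""
    output = subprocess.check_output(
        ['docker-compose', '-f', compose_file, 'ps', service], text=True)
    return 'Up' in output and 'unhealthy' not in output

def watchdog(services, interval=300):
    """Poll the services and restart any that have stopped or turned unhealthy."""
    while True:
        for service in services:
            if not is_healthy(service):
                subprocess.check_call(['docker-compose', 'restart', service])
        time.sleep(interval)

watchdog(['crawler', 'loader'])  # hypothetical service names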
Upload API Submission Files¶
- File location
Entry point:
extra/upload.py
Helper script:
extra/upload.sh
Cron sample:
extra/upload.cron
usage: upload [-h] [-p PATH] -H HOST [-U USER]
upload API submission files
optional arguments:
-h, --help show this help message and exit
-p PATH, --path PATH path to data storage
-H HOST, --host HOST upstream hostname
-U USER, --user USER upstream user credential
This script will automatically upload API submission files (c.f. darc.submit) using curl(1). The --user option is passed through to the same option of curl(1).
Important
As darc.submit.save_submit() categorises saved API submission files by their actual dates, the script uploads the files by those saved dates as well. Therefore, as the cron(8) sample suggests, the script is best run daily, slightly after midnight (00:00).
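The gist of the upload flow can be sketched as below. This is not the actual extra/upload.py; the date-stamped file layout, the endpoint and the credentials are illustrative assumptions.
import subprocess
from datetime import date, timedelta
from pathlib import Path

def upload(path, host, user=None):
    """Upload yesterday's API submission files via curl(1)."""
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    for file in sorted(Path(path).glob('*%s*' % yesterday)):
        cmd = ['curl', '-T', str(file), host]  # -T uploads the file
        if user is not None:
            cmd += ['--user', user]  # passed through to curl(1)
        subprocess.check_call(cmd)

upload('/data/darc/api', 'https://upstream.example.com/', user='alice:secret')  # illustrative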
Remove Repeated Lines¶
- File location
extra/uniq.py
This script works the same as uniq(1), except that it filters the input one line at a time, keeping memory pressure low.
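Conceptually, the script boils down to the following sketch (not the actual extra/uniq.py):
import sys

def uniq(stream):
    """Emit each distinct line once, reading the input one line at a time."""
    seen = set()  # note: grows with the number of *distinct* lines
    for line in stream:
        if line not in seen:
            seen.add(line)
            sys.stdout.write(line)

if __name__ == '__main__':
    uniq(sys.stdin)  # e.g. ``python uniq.py < links.txt``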
Redis Clinic¶
- File location
Entry point:
extra/clinic.py
Helper script:
extra/clinic.lua
Cron sample:
extra/clinic.cron
usage: clinic [-h] -r REDIS [-f FILE] [-t TIMEOUT] ...
memory clinic for Redis
positional arguments:
services name of services
optional arguments:
-h, --help show this help message and exit
-r REDIS, --redis REDIS
URI to the Redis server
-f FILE, --file FILE path to compose file
-t TIMEOUT, --timeout TIMEOUT
shutdown timeout in seconds
Since Redis may take up more and more memory as the crawled data and task queues grow, this script will truncate the Redis task queues (queue_requests & queue_selenium), as well as the corresponding pickle caches of darc.link.Link.
Note
We use a Lua script to slightly accelerate the whole procedure, as running it through a regular Redis client would put extra burden on the host server.
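In essence, the truncation could look like the sketch below, using the plain redis client for clarity (not the actual extra/clinic.py; the handling of the pickle caches and the Lua acceleration are omitted):
import redis  # third-party package, installed via ``pip install redis``

def clinic(uri):
    """Drop the darc task queues to reclaim Redis memory."""
    client = redis.Redis.from_url(uri)
    client.delete('queue_requests', 'queue_selenium')  # the task queues named above

clinic('redis://localhost:6379')  # illustrative URI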
Rationale¶
There are two types of workers:
- crawler – runs darc.crawl.crawler() to provide a fresh view of a link and test its connectability
- loader – runs darc.crawl.loader() to provide an in-depth view of a link with more visual information
The general process for workers of the crawler type can be described as follows:
- process_crawler(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler().
- crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.
- If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract and save the links from the sitemaps (c.f. read_sitemap()) into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.
- If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.
  Note: the root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.
- At this point, darc will call the customised hook function from darc.sites to crawl the URL and obtain the final response object; darc will then save the session cookies and header information using save_headers().
  Note: if requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing is dropped.
- If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.
- If the response document is HTML (text/html or application/xhtml+xml), extract_links() will be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).
- If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()); otherwise, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
The general process for workers of the loader type can be described as follows (a simplified sketch of both worker loops is given after this list):
- process_loader(): in the meanwhile, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().
- loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.
- At this point, darc will call the customised hook function from darc.sites to load the URL and return the original WebDriver object.
- If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.
- If the submission API is provided, submit_selenium() will be called to submit the document just loaded.
- Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
Important
For more information about the hook functions, please refer to the customisation documentations.
Installation¶
Note
darc supports all Python versions 3.6 and above.
Currently, it only supports and is tested on Linux (Ubuntu 18.04)
and macOS (Catalina).
When installed under Python versions below 3.8, darc will use walrus to compile itself for backport compatibility.
pip install darc
Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system.
Important
Starting from version 0.3.0, we introduced Redis for the task queue database backend.
Since version 0.6.0, we have also supported relationship database storage (e.g. MySQL, SQLite, PostgreSQL, etc.) for the task queue backend, besides Redis, since Redis can be too memory-costly when the task queue becomes very large.
Please make sure you have one of the backend databases installed, configured, and running when using the darc project.
However, the darc
project is shipped with Docker and Compose support.
Please see Docker Integration for more information.
Or, you may refer to and/or install from the Docker Hub repository:
docker pull jsnbzh/darc[:TAGNAME]
or the GitHub Container Registry, which hosts more up-to-date and comprehensive images:
docker pull ghcr.io/jarryshaw/darc[:TAGNAME]
# or the debug image
docker pull ghcr.io/jarryshaw/darc-debug[:TAGNAME]
Usage¶
Important
Though the CLI is simple, the darc project is far more configurable through environment variables. For more information, please refer to the environment variable configuration documentation.
The darc
project provides a simple CLI:
usage: darc [-h] [-v] -t {crawler,loader} [-f FILE] ...
the darkweb crawling swiss army knife
positional arguments:
link                  links to crawl
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-t {crawler,loader}, --type {crawler,loader}
type of worker process
-f FILE, --file FILE read links from file
It can also be called through module entrypoint:
python -m darc ...
Note
The link files can contain comment lines, which should start with #. Empty lines and comment lines will be ignored when loading.
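For example, given a (hypothetical) link file links.txt:
# seed links; comment lines start with '#'
http://example.onion/
http://example.i2p/
a crawler worker can then be started as:
darc --type crawler --file links.txt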