Module Constants

Auxiliary Function

darc.const.getpid(path='/home/docs/checkouts/readthedocs.org/user_builds/darc/checkouts/latest/docs/source/data/darc.pid')

Get process ID.

The process ID will be saved under the PATH_DB folder, in a file named darc.pid. If no such file exists, -1 will be returned.

Parameters

path (str) – Path to the process ID file.

Return type

int

Returns

The process ID.
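The lookup described above can be sketched as follows. This is a minimal illustration, not the darc implementation: read_pid is a hypothetical helper, and the assumption that the file holds the PID as plain decimal text is not confirmed by this page.

```python
import os
import tempfile


def read_pid(path: str) -> int:
    """Return the PID stored in ``path``, or -1 if the file does not exist."""
    if not os.path.isfile(path):
        return -1
    with open(path) as file:
        # Assumption: the file contains the PID as decimal text.
        return int(file.read().strip())


# Usage: look up a PID file before and after it has been written.
with tempfile.TemporaryDirectory() as folder:
    pid_file = os.path.join(folder, "darc.pid")
    missing = read_pid(pid_file)  # the file has not been created yet
    with open(pid_file, "w") as file:
        file.write(str(os.getpid()))
    found = read_pid(pid_file)  # now holds the current process ID
```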

darc.const.get_lock()

Get a lock.

Return type

Union[Lock, allocate_lock, nullcontext]

Returns

Lock context based on FLAG_MP and FLAG_TH.
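One plausible way to pick a lock context from the two flags is sketched below. choose_lock is a hypothetical stand-in for get_lock(), and the precedence of FLAG_MP over FLAG_TH is an assumption; the three return values mirror the documented return type.

```python
import contextlib
import multiprocessing
import threading


def choose_lock(flag_mp: bool, flag_th: bool):
    """Return a context manager suited to the selected concurrency mode."""
    if flag_mp:
        return multiprocessing.Lock()  # shared between processes
    if flag_th:
        return threading.Lock()  # shared between threads
    return contextlib.nullcontext()  # no concurrency, nothing to guard


# Usage: the caller does not need to know which kind of lock it received.
lock = choose_lock(flag_mp=False, flag_th=True)
with lock:
    guarded = True
```

Returning nullcontext in the sequential case keeps the calling code identical in all three modes, since every return value works in a with statement.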

General Configurations

darc.const.REBOOT: bool

Whether to exit the program after the first round, i.e. once all links from the requests link database have been crawled and all links from the selenium link database have been loaded.

This is especially useful when storage capacity is limited and you wish to free some space before continuing with the next round. See Docker integration for more information.

Default

False

Environ

DARC_REBOOT

darc.const.DEBUG: bool

Whether to run the program in debugging mode.

Default

False

Environ

DARC_DEBUG

darc.const.VERBOSE: bool

Whether to run the program in verbose mode. If DEBUG is True, verbose mode is always enabled.

Default

False

Environ

DARC_VERBOSE
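The relationship between DEBUG and VERBOSE can be sketched as below. env_flag is a hypothetical helper, and the accepted spellings of a true value are an assumption, since this page does not state how the environment variables are parsed.

```python
import os


def env_flag(name, default=False, environ=os.environ):
    """Read a boolean flag from a mapping of environment variables."""
    raw = environ.get(name)
    if raw is None:
        return default
    # Assumption: these spellings count as true; anything else is false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Usage: DARC_DEBUG is set, DARC_VERBOSE is not.
environ = {"DARC_DEBUG": "1"}
DEBUG = env_flag("DARC_DEBUG", environ=environ)
# Debugging mode forces verbose mode on.
VERBOSE = env_flag("DARC_VERBOSE", environ=environ) or DEBUG
```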

darc.const.FORCE: bool

Whether to ignore robots.txt rules when crawling (c.f. crawler()).

Default

False

Environ

DARC_FORCE

darc.const.CHECK: bool

Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

If CHECK_NG is True, then CHECK is always set to True.

Default

False

Environ

DARC_CHECK

darc.const.CHECK_NG: bool

Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

Default

False

Environ

DARC_CHECK_CONTENT_TYPE

darc.const.ROOT: str

The root folder of the project.

darc.const.CWD = '.'

The current working directory.

darc.const.DARC_CPU: int

Number of concurrent processes. If not provided, then the number of system CPUs will be used.

Default

None

Environ

DARC_CPU
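The fallback to the system CPU count might look like the sketch below. resolve_cpu is a hypothetical helper, and the final fallback to 1 (for platforms where the CPU count cannot be determined) is an assumption.

```python
import os


def resolve_cpu(environ=os.environ) -> int:
    """Return the number of worker processes to start."""
    raw = environ.get("DARC_CPU")
    if raw is None:
        # os.cpu_count() may return None, hence the extra fallback.
        return os.cpu_count() or 1
    return int(raw)


explicit = resolve_cpu({"DARC_CPU": "4"})  # value taken from the environment
detected = resolve_cpu({})  # value taken from the system
```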

darc.const.FLAG_MP: bool

Whether to enable multiprocessing support.

Default

True

Environ

DARC_MULTIPROCESSING

darc.const.FLAG_TH: bool

Whether to enable multithreading support.

Default

False

Environ

DARC_MULTITHREADING

Note

FLAG_MP and FLAG_TH cannot both be enabled at the same time.
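A start-up check for this constraint could be written as in the sketch below. execution_mode is a hypothetical helper; how darc itself reacts to both flags being set is not stated here, so raising ValueError is an assumption.

```python
def execution_mode(flag_mp: bool, flag_th: bool) -> str:
    """Name the concurrency mode, rejecting the unsupported combination."""
    if flag_mp and flag_th:
        raise ValueError("FLAG_MP and FLAG_TH cannot both be enabled")
    if flag_mp:
        return "multiprocessing"
    if flag_th:
        return "multithreading"
    return "sequential"


mode = execution_mode(flag_mp=True, flag_th=False)

# Enabling both flags is reported as a configuration error.
try:
    execution_mode(flag_mp=True, flag_th=True)
except ValueError as error:
    message = str(error)
```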

darc.const.DARC_USER: str

Non-root user for proxies.

Default

current login user (c.f. getpass.getuser())

Environ

DARC_USER

Data Storage

See also

See darc.db for more information about database integration.

darc.const.REDIS: redis.Redis

URL to the Redis database.

Default

redis://127.0.0.1

Environ

REDIS_URL

darc.const.DB: peewee.Database

URL to the RDS storage.

Default

sqlite://{PATH_DB}/darc.db

Environ

DB_URL

darc.const.DB_WEB: peewee.Database

URL to the data submission storage.

Default

sqlite://{PATH_DB}/darcweb.db

Environ

DB_URL

darc.const.FLAG_DB: bool

Whether RDS is used as the task queue backend. If REDIS_URL is provided, this is False; otherwise, it is True.
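The rule above reduces to a single presence test, as in this sketch. uses_rds_queue is a hypothetical helper that takes a mapping of environment variables, and treating any set value (even an empty string) as "provided" is an assumption.

```python
def uses_rds_queue(environ) -> bool:
    """Return True when no Redis URL is configured."""
    return environ.get("REDIS_URL") is None


with_redis = uses_rds_queue({"REDIS_URL": "redis://127.0.0.1"})
without_redis = uses_rds_queue({})
```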

darc.const.PATH_DB: str

Path to data storage.

Default

data

Environ

PATH_DATA

See also

See darc.save for more information about source saving.

darc.const.PATH_MISC = '{PATH_DB}/misc/'

Path to miscellaneous data storage, i.e. misc folder under the root of data storage.

darc.const.PATH_LN = '{PATH_DB}/link.csv'

Path to the link CSV file, link.csv.

darc.const.PATH_ID = '{PATH_DB}/darc.pid'

Path to the process ID file, darc.pid.
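The three derived paths all hang off PATH_DB, which can be illustrated as follows. derive_paths is a hypothetical helper; only the path layout itself comes from the definitions above.

```python
import os


def derive_paths(path_db: str = "data") -> dict:
    """Build the storage paths located under the data folder."""
    return {
        # The empty last component keeps the trailing separator.
        "PATH_MISC": os.path.join(path_db, "misc", ""),
        "PATH_LN": os.path.join(path_db, "link.csv"),
        "PATH_ID": os.path.join(path_db, "darc.pid"),
    }


paths = derive_paths("data")
```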

Web Crawlers

darc.const.DARC_WAIT: Optional[float]

Time interval between rounds when the requests and/or selenium databases are empty.

Default

60

Environ

DARC_WAIT

darc.const.TIME_CACHE: float

Time delta for caches in seconds.

The darc project supports caching for fetched files. TIME_CACHE specifies how long a fetched file stays cached, during which time it is NOT fetched again.

Note

If TIME_CACHE is None, cached files never expire.

Default

60

Environ

TIME_CACHE
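A freshness check built on this setting might look like the sketch below. is_cached is a hypothetical helper and its arguments are illustrative; only the meaning of TIME_CACHE, including the None case, is taken from the description above.

```python
import datetime


def is_cached(fetched_at, now, time_cache) -> bool:
    """Return True if a file fetched at ``fetched_at`` is still fresh."""
    if time_cache is None:
        return True  # cached entries never expire
    return now - fetched_at < datetime.timedelta(seconds=time_cache)


fetched = datetime.datetime(2020, 1, 1, 0, 0, 0)
fresh = is_cached(fetched, fetched + datetime.timedelta(seconds=30), 60)
stale = is_cached(fetched, fetched + datetime.timedelta(seconds=90), 60)
forever = is_cached(fetched, fetched + datetime.timedelta(days=365), None)
```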

darc.const.SE_WAIT: float

Time to wait for selenium to finish loading pages.

Note

Internally, selenium waits for the browser to finish loading the page before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.

Default

60

Environ

SE_WAIT

darc.const.SE_EMPTY = '<html><head></head><body></body></html>'

The empty page from selenium.

White / Black Lists

darc.const.LINK_WHITE_LIST: List[re.Pattern]

White list of hostnames that should be crawled.

Default

[]

Environ

LINK_WHITE_LIST

Note

Regular expressions are supported.

darc.const.LINK_BLACK_LIST: List[re.Pattern]

Black list of hostnames that should not be crawled.

Default

[]

Environ

LINK_BLACK_LIST

Note

Regular expressions are supported.

darc.const.LINK_FALLBACK: bool

Fallback value for match_host().

Default

False

Environ

LINK_FALLBACK
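How the two hostname lists and the fallback interact can be sketched as below. classify_host is a hypothetical helper, not match_host() itself: checking the white list first and matching with fullmatch are both assumptions. The sketch only reports which rule decided, because this page does not say what the boolean returned by match_host() means to the crawler.

```python
import re


def classify_host(host, white_list, black_list) -> str:
    """Report which rule applies to ``host``."""
    if any(pattern.fullmatch(host) for pattern in white_list):
        return "white"
    if any(pattern.fullmatch(host) for pattern in black_list):
        return "black"
    return "fallback"  # neither list matched; LINK_FALLBACK decides


# Illustrative patterns only.
white = [re.compile(r".*\.onion")]
black = [re.compile(r"(.*\.)?example\.com")]

decisions = [
    classify_host(host, white, black)
    for host in ("abc.onion", "www.example.com", "python.org")
]
```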

darc.const.MIME_WHITE_LIST: List[re.Pattern]

White list of content types that should be crawled.

Default

[]

Environ

MIME_WHITE_LIST

Note

Regular expressions are supported.

darc.const.MIME_BLACK_LIST: List[re.Pattern]

Black list of content types that should not be crawled.

Default

[]

Environ

MIME_BLACK_LIST

Note

Regular expressions are supported.

darc.const.MIME_FALLBACK: bool

Fallback value for match_mime().

Default

False

Environ

MIME_FALLBACK

darc.const.PROXY_WHITE_LIST: List[str]

White list of proxy types that should be crawled.

Default

[]

Environ

PROXY_WHITE_LIST

Note

The proxy types are case insensitive.

darc.const.PROXY_BLACK_LIST: List[str]

Black list of proxy types that should not be crawled.

Default

[]

Environ

PROXY_BLACK_LIST

Note

The proxy types are case insensitive.

darc.const.PROXY_FALLBACK: bool

Fallback value for match_proxy().

Default

False

Environ

PROXY_FALLBACK
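Because proxy types are case insensitive, a membership test has to normalise both sides before comparing. The sketch below shows one way to do that; proxy_in_list is a hypothetical helper and the proxy names are illustrative values only.

```python
def proxy_in_list(proxy: str, proxy_list) -> bool:
    """Case-insensitive membership test for proxy type names."""
    normalised = {item.casefold() for item in proxy_list}
    return proxy.casefold() in normalised


listed = proxy_in_list("TOR", ["tor", "i2p"])
unlisted = proxy_in_list("freenet", ["Tor", "I2P"])
```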