Module Constants

Auxiliary Function

darc.const.getpid(path='{PATH_DB}/darc.pid')

Get process ID.

The process ID will be saved under the PATH_DB folder, in a file named darc.pid. If no such file exists, -1 will be returned.

Parameters:

path (str) – Path to the process ID file.

Return type:

int

Returns:

The process ID.
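The behaviour described above can be sketched as follows. This is a minimal illustration, not the actual darc implementation; the helper name read_pid is hypothetical, and it assumes the file holds the PID as plain decimal text.

```python
import os
import tempfile


def read_pid(path: str) -> int:
    """Return the PID stored in ``path``, or -1 if the file does not exist.

    Hypothetical helper; assumes the file holds the PID as decimal text.
    """
    if not os.path.isfile(path):
        return -1
    with open(path) as file:
        return int(file.read().strip())


# Usage: look up a PID file before and after it has been written.
with tempfile.TemporaryDirectory() as folder:
    pid_file = os.path.join(folder, 'darc.pid')
    missing = read_pid(pid_file)  # the file has not been created yet
    with open(pid_file, 'w') as file:
        file.write(str(os.getpid()))
    found = read_pid(pid_file)  # now holds the current process ID
```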

darc.const.get_lock()

Get a lock.

Return type:

Union[Lock, allocate_lock, nullcontext]

Returns:

Lock context based on FLAG_MP and FLAG_TH.
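One way such a selection could work is sketched below. The function pick_lock is hypothetical, and the precedence (multiprocessing first, then multithreading, then a no-op context) is an assumption inferred from the return type rather than taken from the darc source.

```python
import contextlib
import multiprocessing
import threading


def pick_lock(flag_mp: bool, flag_th: bool):
    """Return a context manager suited to the concurrency mode.

    Hypothetical helper; the order of the checks is an assumption.
    """
    if flag_mp:
        return multiprocessing.Lock()  # shared between processes
    if flag_th:
        return threading.Lock()  # shared between threads
    return contextlib.nullcontext()  # no concurrency: nothing to guard


# Usage: every returned object works in a ``with`` statement.
with pick_lock(False, True):
    pass  # critical section guarded by a thread lock
with pick_lock(False, False):
    pass  # no-op context when both flags are off
```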

General Configurations

darc.const.REBOOT: bool

Whether to exit the program after the first round, i.e. once all links from the requests link database have been crawled and all links from the selenium link database have been loaded.

This is especially useful when storage capacity is limited and you wish to free some space before continuing to the next round. See Docker integration for more information.

Default:

False

Environ:

DARC_REBOOT

darc.const.DEBUG: bool

Whether to run the program in debugging mode.

Default:

False

Environ:

DARC_DEBUG

darc.const.VERBOSE: bool

Whether to run the program in verbose mode. If DEBUG is True, verbose mode is always enabled.

Default:

False

Environ:

DARC_VERBOSE
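The relationship between DEBUG and VERBOSE can be sketched as below. The helpers env_flag and resolve_verbosity are hypothetical, and the set of accepted truthy strings is an assumption; the actual parsing in darc may differ.

```python
from typing import Mapping, Tuple


def env_flag(environ: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Read a boolean flag from an environment-like mapping.

    The accepted truthy spellings are an assumption for this sketch.
    """
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ('1', 'true', 'yes', 'on')


def resolve_verbosity(environ: Mapping[str, str]) -> Tuple[bool, bool]:
    """Return ``(DEBUG, VERBOSE)``; debugging mode forces verbose mode."""
    debug = env_flag(environ, 'DARC_DEBUG')
    verbose = env_flag(environ, 'DARC_VERBOSE') or debug
    return debug, verbose


# Usage: pass ``os.environ`` in real code; a plain dict is used here.
debug, verbose = resolve_verbosity({'DARC_DEBUG': '1'})
```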

darc.const.FORCE: bool

Whether to ignore robots.txt rules when crawling (c.f. crawler()).

Default:

False

Environ:

DARC_FORCE

darc.const.CHECK: bool

Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

If CHECK_NG is True, then CHECK will always be set to True.

Default:

False

Environ:

DARC_CHECK

darc.const.CHECK_NG: bool

Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

Default:

False

Environ:

DARC_CHECK_CONTENT_TYPE
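The dependency between the two checking flags can be illustrated with a small sketch. The function resolve_checks is hypothetical and only mirrors the rule that enabling CHECK_NG also enables CHECK.

```python
from typing import Tuple


def resolve_checks(check: bool, check_ng: bool) -> Tuple[bool, bool]:
    """Return the effective ``(CHECK, CHECK_NG)`` pair.

    Hypothetical helper: content-type checking implies the basic
    proxy and hostname check.
    """
    return (check or check_ng, check_ng)


# Usage: only DARC_CHECK_CONTENT_TYPE was enabled by the user.
effective_check, effective_check_ng = resolve_checks(False, True)
```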

darc.const.ROOT: str

The root folder of the project.

darc.const.CWD = '.'

The current working directory.

darc.const.DARC_CPU: int

Number of concurrent processes. If not provided, then the number of system CPUs will be used.

Default:

None

Environ:

DARC_CPU
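A possible way to resolve the process count is sketched below. The helper resolve_cpu_count is hypothetical; falling back to 1 when the CPU count cannot be determined is an assumption of this sketch.

```python
import os
from typing import Mapping


def resolve_cpu_count(environ: Mapping[str, str]) -> int:
    """Return the number of worker processes to use.

    Hypothetical helper: uses ``DARC_CPU`` when set, otherwise the
    number of system CPUs (or 1 if that cannot be determined).
    """
    raw = environ.get('DARC_CPU')
    if raw is None:
        return os.cpu_count() or 1
    return int(raw)


# Usage: pass ``os.environ`` in real code; a plain dict is used here.
workers = resolve_cpu_count({'DARC_CPU': '4'})
```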

darc.const.FLAG_MP: bool

Whether to enable multiprocessing support.

Default:

True

Environ:

DARC_MULTIPROCESSING

darc.const.FLAG_TH: bool

Whether to enable multithreading support.

Default:

False

Environ:

DARC_MULTITHREADING

Note

FLAG_MP and FLAG_TH cannot both be enabled at the same time.
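The constraint can be expressed as a simple validation step. The function validate_concurrency and its return labels are hypothetical; how darc itself reacts to a conflicting configuration is not stated here.

```python
def validate_concurrency(flag_mp: bool, flag_th: bool) -> str:
    """Return the selected concurrency mode, rejecting conflicting flags.

    Hypothetical helper; raising ``ValueError`` is an assumption.
    """
    if flag_mp and flag_th:
        raise ValueError('FLAG_MP and FLAG_TH cannot both be enabled')
    if flag_mp:
        return 'multiprocessing'
    if flag_th:
        return 'multithreading'
    return 'sequential'


# Usage: the documented defaults (FLAG_MP=True, FLAG_TH=False).
mode = validate_concurrency(True, False)
```

Given the documented defaults, switching to multithreading would also require disabling multiprocessing explicitly.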

darc.const.DARC_USER: str

Non-root user for proxies.

Default:

current login user (c.f. getpass.getuser())

Environ:

DARC_USER

Data Storage

See also

See darc.db for more information about database integration.

darc.const.REDIS: redis.Redis

URL to the Redis database.

Default:

redis://127.0.0.1

Environ:

REDIS_URL

darc.const.DB: peewee.Database

URL to the RDS storage.

Default:

sqlite://{PATH_DB}/darc.db

Environ:

DB_URL

darc.const.DB_WEB: peewee.Database

URL to the data submission storage.

Default:

sqlite://{PATH_DB}/darcweb.db

Environ:

DB_URL

darc.const.FLAG_DB: bool

Whether RDS is used as the task queue backend. If REDIS_URL is provided, then False; otherwise, True.

darc.const.PATH_DB: str

Path to data storage.

Default:

data

Environ:

PATH_DATA

See also

See darc.save for more information about source saving.

darc.const.PATH_MISC = '{PATH_DB}/misc/'

Path to miscellaneous data storage, i.e. misc folder under the root of data storage.

darc.const.PATH_LN = '{PATH_DB}/link.csv'

Path to the link CSV file, link.csv.

darc.const.PATH_ID = '{PATH_DB}/darc.pid'

Path to the process ID file, darc.pid.
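The derived locations follow directly from PATH_DB. The sketch below simply expands the templates shown above for a given data folder; the function derive_paths is hypothetical.

```python
from typing import Dict


def derive_paths(path_db: str = 'data') -> Dict[str, str]:
    """Expand the documented path templates for a given ``PATH_DB``.

    Hypothetical helper; the templates are copied from the entries above.
    """
    return {
        'PATH_MISC': f'{path_db}/misc/',
        'PATH_LN': f'{path_db}/link.csv',
        'PATH_ID': f'{path_db}/darc.pid',
        'DB': f'sqlite://{path_db}/darc.db',
        'DB_WEB': f'sqlite://{path_db}/darcweb.db',
    }


# Usage: the default data folder.
paths = derive_paths()
```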

Web Crawlers

darc.const.DARC_WAIT: float | None

Time interval between rounds when the requests and/or selenium databases are empty.

Default:

60

Environ:

DARC_WAIT

darc.const.TIME_CACHE: float

Time delta for caches in seconds.

The darc project supports caching for fetched files. TIME_CACHE specifies how long fetched files are cached and NOT fetched again.

Note

If TIME_CACHE is None, then cached files never expire.

Default:

60

Environ:

TIME_CACHE
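The caching rule can be sketched as a freshness check. The function is_cache_fresh is hypothetical and only illustrates the semantics described above, including the None case; it is not the darc implementation.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_cache_fresh(fetched_at: datetime, now: datetime,
                   time_cache: Optional[float]) -> bool:
    """Return True if a file fetched at ``fetched_at`` is still cached.

    Hypothetical helper: ``time_cache`` is a number of seconds, and
    ``None`` means the cache never expires.
    """
    if time_cache is None:
        return True
    return now - fetched_at < timedelta(seconds=time_cache)


# Usage: a file fetched 30 seconds ago with the default of 60 seconds.
fetched = datetime(2020, 1, 1, 0, 0, 0)
fresh = is_cache_fresh(fetched, fetched + timedelta(seconds=30), 60)
```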

darc.const.SE_WAIT: float

Time to wait for selenium to finish loading pages.

Note

Internally, selenium will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.

Default:

60

Environ:

SE_WAIT

darc.const.SE_EMPTY = '<html><head></head><body></body></html>'

The empty page from selenium.

White / Black Lists

darc.const.LINK_WHITE_LIST: List[re.Pattern]

White list of hostnames that should be crawled.

Default:

[]

Environ:

LINK_WHITE_LIST

Note

Regular expressions are supported.

darc.const.LINK_BLACK_LIST: List[re.Pattern]

Black list of hostnames that should not be crawled.

Default:

[]

Environ:

LINK_BLACK_LIST

Note

Regular expressions are supported.

darc.const.LINK_FALLBACK: bool

Fallback value for match_host().

Default:

False

Environ:

LINK_FALLBACK
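How the hostname white list, black list and fallback value might interact is sketched below. Everything here is an assumption for illustration: the function names, the precedence (white list first, then black list), the use of full matches, the case-insensitive compilation, and the convention that True means "blocked". The real match_host() may behave differently.

```python
import re
from typing import List


def compile_patterns(patterns: List[str]) -> List[re.Pattern]:
    """Compile regular expression strings (case-insensitive by assumption)."""
    return [re.compile(pattern, re.IGNORECASE) for pattern in patterns]


def is_blocked(host: str, white_list: List[re.Pattern],
               black_list: List[re.Pattern], fallback: bool = False) -> bool:
    """Return True if ``host`` should be skipped.

    Hypothetical helper: white list wins, then black list, then fallback.
    """
    if any(pattern.fullmatch(host) for pattern in white_list):
        return False
    if any(pattern.fullmatch(host) for pattern in black_list):
        return True
    return fallback


# Usage with illustrative patterns (not darc defaults).
white = compile_patterns([r'.*\.example\.com'])
black = compile_patterns([r'.*\.onion'])
blocked = is_blocked('abc.onion', white, black)
```

Under this reading, empty lists with a fallback of False block nothing, which is consistent with the documented defaults.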

darc.const.MIME_WHITE_LIST: List[re.Pattern]

White list of content types that should be crawled.

Default:

[]

Environ:

MIME_WHITE_LIST

Note

Regular expressions are supported.

darc.const.MIME_BLACK_LIST: List[re.Pattern]

Black list of content types that should not be crawled.

Default:

[]

Environ:

MIME_BLACK_LIST

Note

Regular expressions are supported.

darc.const.MIME_FALLBACK: bool

Fallback value for match_mime().

Default:

False

Environ:

MIME_FALLBACK

darc.const.PROXY_WHITE_LIST: List[str]

White list of proxy types that should be crawled.

Default:

[]

Environ:

PROXY_WHITE_LIST

Note

The proxy types are case insensitive.

darc.const.PROXY_BLACK_LIST: List[str]

Black list of proxy types that should not be crawled.

Default:

[]

Environ:

PROXY_BLACK_LIST

Note

The proxy types are case insensitive.

darc.const.PROXY_FALLBACK: bool

Fallback value for match_proxy().

Default:

False

Environ:

PROXY_FALLBACK
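Because proxy types are plain strings compared case-insensitively, the lookup can be sketched without regular expressions. The function is_proxy_blocked is hypothetical; the precedence (white list first, then black list, then fallback) and the convention that True means "blocked" are assumptions, and the real match_proxy() may differ.

```python
from typing import List


def is_proxy_blocked(proxy: str, white_list: List[str],
                     black_list: List[str], fallback: bool = False) -> bool:
    """Return True if links using ``proxy`` should be skipped.

    Hypothetical helper: names are compared case-insensitively.
    """
    name = proxy.casefold()
    if name in {item.casefold() for item in white_list}:
        return False
    if name in {item.casefold() for item in black_list}:
        return True
    return fallback


# Usage with illustrative proxy type names.
blocked_proxy = is_proxy_blocked('i2p', [], ['I2P'])
```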