Module Constants

Auxiliary Function

darc.const.getpid(path='/home/docs/checkouts/readthedocs.org/user_builds/darc/checkouts/latest/docs/source/data/darc.pid')

Get process ID.

The process ID is stored under the PATH_DB folder, in a file named darc.pid. If no such file exists, -1 is returned.

Parameters: path (str) – Path to the process ID file.
Return type: int
Returns: The process ID.

See also

darc.const.PATH_ID
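
The lookup described above can be sketched as follows. This is a minimal illustration rather than darc's actual implementation: the helper name read_pid is hypothetical, and the sketch assumes the file contains nothing but the decimal process ID.

```python
import os
import tempfile
from pathlib import Path


def read_pid(path: str) -> int:
    """Return the PID stored in ``path``, or -1 if the file does not exist."""
    pid_file = Path(path)
    if not pid_file.is_file():
        return -1
    # Assumption: the file holds only the decimal PID, possibly with whitespace.
    return int(pid_file.read_text().strip())


# Usage: query a missing file, then write the current PID and read it back.
with tempfile.TemporaryDirectory() as tmp:
    pid_path = os.path.join(tmp, "darc.pid")
    missing = read_pid(pid_path)
    Path(pid_path).write_text(str(os.getpid()))
    found = read_pid(pid_path)
```

Returning a sentinel such as -1 lets callers tell "no running instance recorded" apart from a real process ID without handling an exception.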

darc.const.get_lock()

Get a lock.

Return type: Union[Lock, allocate_lock, nullcontext]
Returns: Lock context based on FLAG_MP and FLAG_TH.
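
A minimal sketch of how such a lock could be selected. The helper pick_lock and the precedence of the multiprocessing flag over the multithreading flag are assumptions made for illustration; only the three possible return types come from the signature above.

```python
import contextlib
import multiprocessing
import threading


def pick_lock(flag_mp: bool, flag_th: bool):
    """Return a context manager suited to the active concurrency mode."""
    if flag_mp:
        return multiprocessing.Lock()   # shared between processes
    if flag_th:
        return threading.Lock()         # shared between threads
    return contextlib.nullcontext()     # no concurrency: nothing to guard


# Usage: the calling code is identical in all three modes.
with pick_lock(flag_mp=False, flag_th=True):
    pass  # critical section goes here
```

Returning a no-op context when neither flag is set keeps the calling code free of branches.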

General Configurations

darc.const.REBOOT: bool

Whether to exit the program after the first round, i.e. once all links from the requests link database have been crawled and all links from the selenium link database have been loaded.

This is especially useful when capacity is limited and you wish to free up some space before continuing with the next round. See Docker integration for more information.

Default: False
Environ: DARC_REBOOT

darc.const.DEBUG: bool

Whether to run the program in debugging mode.

Default: False
Environ: DARC_DEBUG

darc.const.VERBOSE: bool

Whether to run the program in verbose mode. If DEBUG is True, verbose mode is always enabled.

Default: False
Environ: DARC_VERBOSE

darc.const.FORCE: bool

Whether to ignore robots.txt rules when crawling (cf. crawler()).

Default: False
Environ: DARC_FORCE

darc.const.CHECK: bool

Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

If CHECK_NG is True, this flag is always set to True.

Default: False
Environ: DARC_CHECK

darc.const.CHECK_NG: bool

Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

Default: False
Environ: DARC_CHECK_CONTENT_TYPE
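
The two overrides described above (DEBUG forces VERBOSE, CHECK_NG forces CHECK) can be sketched as below. The helpers env_bool and load_flags are hypothetical, and the set of accepted truthy spellings is an assumption; this page does not specify how darc parses boolean environment variables.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings, not taken from darc


def env_bool(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def load_flags() -> dict:
    """Resolve the flags, applying the two overrides described above."""
    debug = env_bool("DARC_DEBUG")
    check_ng = env_bool("DARC_CHECK_CONTENT_TYPE")
    return {
        "DEBUG": debug,
        "VERBOSE": env_bool("DARC_VERBOSE") or debug,  # DEBUG forces VERBOSE
        "CHECK": env_bool("DARC_CHECK") or check_ng,   # CHECK_NG forces CHECK
        "CHECK_NG": check_ng,
    }
```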

darc.const.ROOT: str

The root folder of the project.

darc.const.CWD = '.'

The current working directory.

darc.const.DARC_CPU: int

Number of concurrent processes. If not provided, the number of system CPUs is used.

Default: None
Environ: DARC_CPU
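
A small sketch of the fallback described above. The helper resolve_cpu is hypothetical, and falling back to 1 when the CPU count cannot be determined is an assumption.

```python
import os
from typing import Optional


def resolve_cpu(raw: Optional[str]) -> int:
    """Return the requested process count, or the system CPU count if unset."""
    if raw is None or not raw.strip():
        # os.cpu_count() may return None on some platforms; assume 1 then.
        return os.cpu_count() or 1
    return int(raw)


# Usage: read the variable named on this page.
workers = resolve_cpu(os.environ.get("DARC_CPU"))
```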

darc.const.FLAG_MP: bool

Whether to enable multiprocessing support.

Default: True
Environ: DARC_MULTIPROCESSING

darc.const.FLAG_TH: bool

Whether to enable multithreading support.

Default: False
Environ: DARC_MULTITHREADING

Note

FLAG_MP and FLAG_TH cannot be enabled at the same time.

darc.const.DARC_USER: str

Non-root user for proxies.

Default: current login user (cf. getpass.getuser())
Environ: DARC_USER

Data Storage

See also

See darc.db for more information about database integration.

darc.const.REDIS: redis.Redis

URL to the Redis database.

Default: redis://127.0.0.1
Environ: REDIS_URL

darc.const.DB: peewee.Database

URL to the RDS storage.

Default: sqlite://{PATH_DB}/darc.db
Environ: DB_URL

darc.const.DB_WEB: peewee.Database

URL to the data submission storage.

Default: sqlite://{PATH_DB}/darcweb.db
Environ: DB_URL

darc.const.FLAG_DB: bool

Whether RDS is used as the task queue backend. False if REDIS_URL is provided; True otherwise.
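
The rule above amounts to a single check on the environment. In this sketch the helper name use_rds_queue is hypothetical, and treating an unset variable as "not provided" is an assumption.

```python
import os
from typing import Mapping


def use_rds_queue(environ: Mapping[str, str]) -> bool:
    """Return True when no Redis URL is configured, i.e. RDS backs the queue."""
    return environ.get("REDIS_URL") is None


# Usage: evaluate against the real process environment.
flag_db = use_rds_queue(os.environ)
```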

darc.const.PATH_DB: str

Path to data storage.

Default: data
Environ: PATH_DATA

See also

See darc.save for more information about source saving.

darc.const.PATH_MISC = '{PATH_DB}/misc/'

Path to miscellaneous data storage, i.e. misc folder under the root of data storage.

See also

darc.const.PATH_DB

darc.const.PATH_LN = '{PATH_DB}/link.csv'

Path to the link CSV file, link.csv.

See also

darc.const.PATH_DB

darc.const.PATH_ID = '{PATH_DB}/darc.pid'

Path to the process ID file, darc.pid.

See also

darc.const.PATH_DB
darc.const.getpid()
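
The derived paths above all hang off PATH_DB. The sketch below reproduces the templates shown in the signatures; the helper name storage_paths is hypothetical.

```python
def storage_paths(path_db: str = "data") -> dict:
    """Build the derived storage paths from the data storage root."""
    return {
        "PATH_MISC": f"{path_db}/misc/",
        "PATH_LN": f"{path_db}/link.csv",
        "PATH_ID": f"{path_db}/darc.pid",
    }


# Usage: with the documented default root, "data".
paths = storage_paths()
```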

Web Crawlers

darc.const.DARC_WAIT: float | None

Time interval between rounds when the requests and/or selenium databases are empty.

Default: 60
Environ: DARC_WAIT

darc.const.TIME_CACHE: float

Time delta for caches in seconds.

The darc project supports caching for fetched files. TIME_CACHE specifies how long fetched files are cached and NOT fetched again.

Note

If TIME_CACHE is None, cached files never expire.

Default: 60
Environ: TIME_CACHE
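
A minimal sketch of the expiry rule described above. The helper is_cached is hypothetical, and comparing against a stored fetch timestamp is an assumption about how freshness is tracked.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_cached(fetched_at: datetime, now: datetime, time_cache: Optional[float]) -> bool:
    """Return True if a file fetched at ``fetched_at`` is still cached at ``now``."""
    if time_cache is None:
        return True  # None means the cache never expires
    return now - fetched_at < timedelta(seconds=time_cache)


# Usage: a file fetched at noon, checked 30 and 90 seconds later.
fetched = datetime(2020, 1, 1, 12, 0, 0)
still_fresh = is_cached(fetched, fetched + timedelta(seconds=30), 60)
expired = not is_cached(fetched, fetched + timedelta(seconds=90), 60)
```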

darc.const.SE_WAIT: float

Time to wait for selenium to finish loading pages.

Note

Internally, selenium waits for the browser to finish loading the page before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.

Default: 60
Environ: SE_WAIT

darc.const.SE_EMPTY = '<html><head></head><body></body></html>'

The empty page from selenium.

See also

darc.crawl.loader()

White / Black Lists

darc.const.LINK_WHITE_LIST: List[re.Pattern]

White list of hostnames that should be crawled.

Default: []
Environ: LINK_WHITE_LIST

Note

Regular expressions are supported.

darc.const.LINK_BLACK_LIST: List[re.Pattern]

Black list of hostnames that should not be crawled.

Default: []
Environ: LINK_BLACK_LIST

Note

Regular expressions are supported.

darc.const.LINK_FALLBACK: bool

Fallback value for match_host().

Default: False
Environ: LINK_FALLBACK
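
One way the two lists and the fallback could interact is sketched below. This is not the actual match_host(): the helper should_skip_host, the use of full matches, the white list taking precedence, and the meaning of the return value (True means "do not crawl") are all assumptions made for illustration.

```python
import re
from typing import List


def should_skip_host(host: str,
                     white: List[re.Pattern],
                     black: List[re.Pattern],
                     fallback: bool = False) -> bool:
    """Return True if ``host`` should be excluded from crawling."""
    if any(pattern.fullmatch(host) for pattern in white):
        return False   # explicitly allowed
    if any(pattern.fullmatch(host) for pattern in black):
        return True    # explicitly blocked
    return fallback    # on neither list


# Usage with illustrative patterns.
white = [re.compile(r".*\.onion", re.IGNORECASE)]
black = [re.compile(r"(.*\.)?example\.com", re.IGNORECASE)]
```

The same shape applies to the content type lists below, with MIME types in place of hostnames.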

darc.const.MIME_WHITE_LIST: List[re.Pattern]

White list of content types that should be crawled.

Default: []
Environ: MIME_WHITE_LIST

Note

Regular expressions are supported.

darc.const.MIME_BLACK_LIST: List[re.Pattern]

Black list of content types that should not be crawled.

Default: []
Environ: MIME_BLACK_LIST

Note

Regular expressions are supported.

darc.const.MIME_FALLBACK: bool

Fallback value for match_mime().

Default: False
Environ: MIME_FALLBACK

darc.const.PROXY_WHITE_LIST: List[str]

White list of proxy types that should be crawled.

Default: []
Environ: PROXY_WHITE_LIST

Note

The proxy types are case insensitive.

darc.const.PROXY_BLACK_LIST: List[str]

Black list of proxy types that should not be crawled.

Default: []
Environ: PROXY_BLACK_LIST

Note

The proxy types are case insensitive.

darc.const.PROXY_FALLBACK: bool

Fallback value for match_proxy().

Default: False
Environ: PROXY_FALLBACK
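
Unlike the hostname and content type lists, the proxy lists hold plain strings compared without regard to case. The sketch below shows one way to do that; the helper should_skip_proxy, the white list taking precedence, and the meaning of the return value (True means "do not crawl") are assumptions, not the behaviour of match_proxy().

```python
from typing import List


def should_skip_proxy(proxy: str,
                      white: List[str],
                      black: List[str],
                      fallback: bool = False) -> bool:
    """Return True if links using ``proxy`` should be excluded from crawling."""
    name = proxy.casefold()  # normalise case before comparing
    if name in {item.casefold() for item in white}:
        return False
    if name in {item.casefold() for item in black}:
        return True
    return fallback


# Usage with illustrative proxy type names.
skip_tor = should_skip_proxy("TOR", white=["tor"], black=["I2P"])
skip_i2p = should_skip_proxy("i2p", white=["tor"], black=["I2P"])
```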
