Module Constants
Auxiliary Function
General Configurations
-
darc.const.REBOOT: bool
Whether to exit the program after the first round, i.e. after crawling all links from the requests link database and loading all links from the selenium link database.
This can be useful when capacity is limited and you wish to save some space before continuing with the next round. See Docker integration for more information.
- Default
False
- Environ
-
darc.const.DEBUG: bool
Whether to run the program in debugging mode.
- Default
False
- Environ
-
darc.const.VERBOSE: bool
Whether to run the program in verbose mode. If DEBUG is True, verbose mode is always enabled.
- Default
False
- Environ
-
darc.const.FORCE: bool
Whether to ignore robots.txt rules when crawling (c.f. crawler()).
- Default
False
- Environ
-
darc.const.CHECK: bool
Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
If CHECK_NG is True, then this environment variable is always set to True.
- Default
False
- Environ
-
darc.const.CHECK_NG: bool
Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
- Default
False
- Environ
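The dependencies described above (DEBUG forces VERBOSE, CHECK_NG forces CHECK) can be sketched as follows. This is only an illustration of the documented rules, not the project's actual code; the helper name resolve_flags and its parameters are hypothetical.

```python
from typing import Tuple


def resolve_flags(debug: bool, verbose: bool,
                  check: bool, check_ng: bool) -> Tuple[bool, bool]:
    """Return the effective (VERBOSE, CHECK) values.

    Hypothetical helper: it mirrors the documented rules only.
    """
    # DEBUG always enables verbose mode.
    effective_verbose = verbose or debug
    # CHECK_NG always enables CHECK.
    effective_check = check or check_ng
    return effective_verbose, effective_check


if __name__ == '__main__':
    print(resolve_flags(debug=True, verbose=False, check=False, check_ng=True))
```

With this arrangement the stronger flag only needs to be set once; the weaker one follows automatically.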
-
darc.const.ROOT: str
The root folder of the project.
-
darc.const.CWD = '.'
The current working directory.
-
darc.const.DARC_CPU: int
Number of concurrent processes. If not provided, the number of system CPUs is used.
- Default
None
- Environ
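The fallback to the number of system CPUs could look roughly like the sketch below. It assumes the standard library's os.cpu_count() as the source of the CPU count; the helper name resolve_cpu_count is hypothetical and not part of darc.

```python
import os
from typing import Optional


def resolve_cpu_count(darc_cpu: Optional[int]) -> int:
    """Pick the number of worker processes (hypothetical helper)."""
    if darc_cpu is not None:
        # An explicit value always wins.
        return darc_cpu
    # os.cpu_count() may return None when the count cannot be determined,
    # so fall back to a single process in that case.
    return os.cpu_count() or 1


if __name__ == '__main__':
    print(resolve_cpu_count(None))
    print(resolve_cpu_count(4))
```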
-
darc.const.FLAG_MP: bool
Whether to enable multiprocessing support.
- Default
True
- Environ
-
darc.const.FLAG_TH: bool
Whether to enable multithreading support.
- Default
False
- Environ
-
darc.const.DARC_USER: str
Non-root user for proxies.
- Default
current login user (c.f. getpass.getuser())
- Environ
Data Storage
-
darc.const.PATH_DB: str
Path to data storage.
- Default
data
- Environ
See also
See darc.save for more information about source saving.
-
darc.const.PATH_MISC = '{PATH_DB}/misc/'
Path to miscellaneous data storage, i.e. the misc folder under the root of data storage.
See also
-
darc.const.PATH_LN = '{PATH_DB}/link.csv'
Path to the link CSV file, link.csv.
See also
-
darc.const.PATH_QR = '{PATH_DB}/_queue_requests.txt'
Path to the requests database, _queue_requests.txt.
-
darc.const.PATH_QS = '{PATH_DB}/_queue_selenium.txt'
Path to the selenium database, _queue_selenium.txt.
-
darc.const.PATH_ID = '{PATH_DB}/darc.pid'
Path to the process ID file, darc.pid.
See also
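As an illustration of how the paths in this section relate to PATH_DB, the sketch below expands the documented templates for a given storage root. The helper name derive_paths and the TEMPLATES mapping are hypothetical; only the template strings themselves come from the entries above.

```python
from typing import Dict

# Templates copied from the documented constant values.
TEMPLATES = {
    'PATH_MISC': '{PATH_DB}/misc/',
    'PATH_LN': '{PATH_DB}/link.csv',
    'PATH_QR': '{PATH_DB}/_queue_requests.txt',
    'PATH_QS': '{PATH_DB}/_queue_selenium.txt',
    'PATH_ID': '{PATH_DB}/darc.pid',
}


def derive_paths(path_db: str) -> Dict[str, str]:
    """Expand every template against the given data storage root."""
    return {name: template.format(PATH_DB=path_db)
            for name, template in TEMPLATES.items()}


if __name__ == '__main__':
    # 'data' is the documented default of PATH_DB.
    for name, path in derive_paths('data').items():
        print(name, '=', path)
```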
Web Crawlers
-
darc.const.SAVE: bool
Whether to save processed links back to the database.
Note
If SAVE is True, then SAVE_REQUESTS and SAVE_SELENIUM are forced to be True.
- Default
False
- Environ
See also
See darc.db for more information about the link database.
-
darc.const.SAVE_REQUESTS: bool
Whether to save links crawled by crawler() back to the requests database.
- Default
False
- Environ
See also
See darc.db for more information about the link database.
-
darc.const.SAVE_SELENIUM: bool
Whether to save links crawled by loader() back to the selenium database.
- Default
False
- Environ
See also
See darc.db for more information about the link database.
-
darc.const.TIME_CACHE: float
Time delta for caches, in seconds.
The darc project supports caching for fetched files. TIME_CACHE specifies for how long fetched files are cached and NOT fetched again.
Note
If TIME_CACHE is None, then cached files never expire.
- Default
60
- Environ
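The caching rule above can be illustrated with a small freshness check. This is a sketch under stated assumptions, not the project's implementation: the function name is_cache_fresh and the idea of comparing a stored fetch timestamp against the current time are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_cache_fresh(fetched_at: datetime, now: datetime,
                   time_cache: Optional[float] = 60) -> bool:
    """Return True if a file fetched at fetched_at should NOT be fetched again."""
    if time_cache is None:
        # None means the cache is kept forever.
        return True
    # Otherwise the cache is valid for time_cache seconds.
    return now - fetched_at < timedelta(seconds=time_cache)


if __name__ == '__main__':
    fetched = datetime(2020, 1, 1, 12, 0, 0)
    print(is_cache_fresh(fetched, fetched + timedelta(seconds=30)))
    print(is_cache_fresh(fetched, fetched + timedelta(seconds=90)))
```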
-
darc.const.SE_WAIT: float
Time to wait for selenium to finish loading pages.
Note
Internally, selenium waits for the browser to finish loading the page before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.
- Default
60
- Environ
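One simple way to give late-running scripts extra time is to pause after the page load returns, as sketched below. This is an assumption about how such a wait might be applied, not a description of loader(); it also assumes SE_WAIT is expressed in seconds. The helper name load_page is hypothetical, and driver stands for any object exposing the get() method and page_source attribute of a selenium WebDriver.

```python
import time


def load_page(driver, url: str, se_wait: float = 60) -> str:
    """Load url and return the page source after an extra wait.

    Hypothetical helper; se_wait is assumed to be in seconds.
    """
    # get() blocks until the browser reports the page as loaded.
    driver.get(url)
    # Give scripts that run after the load event some extra time.
    time.sleep(se_wait)
    return driver.page_source
```

A fixed pause is the bluntest option; waiting for a specific element or condition is usually faster when the target pages are known in advance.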
White / Black Lists
-
darc.const.LINK_WHITE_LIST: List[re.Pattern]
White list of hostnames that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.LINK_BLACK_LIST: List[re.Pattern]
Black list of hostnames that should not be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.LINK_FALLBACK: bool
Fallback value for match_host().
- Default
False
- Environ
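To show how a white list, a black list and a fallback value can work together, here is a minimal sketch. It is not match_host() itself, whose exact return semantics are not described here. The function name should_crawl, the use of full-string matching, the white list taking priority over the black list, and the reading of the fallback as "crawl or not when nothing matches" are all assumptions.

```python
import re
from typing import List


def should_crawl(host: str,
                 white_list: List[re.Pattern],
                 black_list: List[re.Pattern],
                 fallback: bool = False) -> bool:
    """Decide whether host should be crawled (hypothetical helper)."""
    # Assumption: the white list is consulted first.
    if any(pattern.fullmatch(host) for pattern in white_list):
        return True
    # Then the black list.
    if any(pattern.fullmatch(host) for pattern in black_list):
        return False
    # Neither list matched: use the fallback value.
    return fallback


if __name__ == '__main__':
    white = [re.compile(r'.*\.onion')]
    black = [re.compile(r'.*\.example\.com')]
    for name in ('abc.onion', 'www.example.com', 'other.org'):
        print(name, should_crawl(name, white, black))
```

The same three-step decision applies equally to lists of content-type patterns.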
-
darc.const.MIME_WHITE_LIST: List[re.Pattern]
White list of content types that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.MIME_BLACK_LIST: List[re.Pattern]
Black list of content types that should not be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.MIME_FALLBACK: bool
Fallback value for match_mime().
- Default
False
- Environ
-
darc.const.PROXY_WHITE_LIST: List[str]
White list of proxy types that should be crawled.
- Default
[]
- Environ
Note
The proxy types are case insensitive.
-
darc.const.PROXY_BLACK_LIST: List[str]
Black list of proxy types that should not be crawled.
- Default
[]
- Environ
Note
The proxy types are case insensitive.
-
darc.const.PROXY_FALLBACK: bool
Fallback value for match_proxy().
- Default
False
- Environ
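Because proxy types are plain strings compared case-insensitively, the lookup can be sketched with normalised sets. As before, this is an illustration and not match_proxy() itself: the function name proxy_allowed, the white-list-first ordering, the meaning given to the fallback, and the sample proxy names are assumptions.

```python
from typing import Iterable


def proxy_allowed(proxy: str,
                  white_list: Iterable[str],
                  black_list: Iterable[str],
                  fallback: bool = False) -> bool:
    """Decide whether links using proxy should be crawled (hypothetical helper)."""
    # Normalise everything so that 'Tor', 'TOR' and 'tor' compare equal.
    name = proxy.casefold()
    if name in {item.casefold() for item in white_list}:
        return True
    if name in {item.casefold() for item in black_list}:
        return False
    # Neither list mentions this proxy type.
    return fallback


if __name__ == '__main__':
    # The proxy names below are illustrative only.
    print(proxy_allowed('Tor', ['TOR'], []))
    print(proxy_allowed('i2p', [], ['I2P']))
```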