Module Constants
Auxiliary Function
General Configurations
-
darc.const.REBOOT: bool
Whether to exit the program after the first round, i.e. after crawling all links from the requests link database and loading all links from the selenium link database.
- Default
False
- Environ
-
darc.const.DEBUG: bool
Whether to run the program in debugging mode.
- Default
False
- Environ
-
darc.const.VERBOSE: bool
Whether to run the program in verbose mode. If DEBUG is True, then verbose mode will always be enabled.
- Default
False
- Environ
-
darc.const.FORCE: bool
Whether to ignore robots.txt rules when crawling (c.f. crawler()).
- Default
False
- Environ
-
darc.const.CHECK: bool
Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
If CHECK_NG is True, then this environment variable will always be set to True.
- Default
False
- Environ
-
darc.const.CHECK_NG: bool
Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
- Default
False
- Environ
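The dependencies between these flags (DEBUG forces VERBOSE, CHECK_NG forces CHECK) can be illustrated with a minimal sketch. The helper name resolve_flags and its keyword arguments are hypothetical and are not part of darc; the sketch only mirrors the rules stated above.

```python
def resolve_flags(debug=False, verbose=False, check=False, check_ng=False):
    """Hypothetical helper: apply the documented flag dependencies."""
    return {
        "DEBUG": debug,
        # verbose mode is always enabled when debugging is on
        "VERBOSE": verbose or debug,
        # CHECK is always set when CHECK_NG is on
        "CHECK": check or check_ng,
        "CHECK_NG": check_ng,
    }


flags = resolve_flags(debug=True, check_ng=True)
```

With only debug and check_ng switched on, the resolved VERBOSE and CHECK values are both True.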
-
darc.const.ROOT: str
The root folder of the project.
-
darc.const.CWD = '.'
The current working directory.
-
darc.const.DARC_CPU: int
Number of concurrent processes. If not provided, then the number of system CPUs will be used.
- Default
None
- Environ
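The fallback to the system CPU count could look roughly like the following. This is a sketch, not darc's actual code: resolve_cpu_count is a hypothetical name, and falling back to 1 when the CPU count cannot be determined is an assumption.

```python
import os


def resolve_cpu_count(darc_cpu=None):
    """Hypothetical helper: pick the number of concurrent processes."""
    if darc_cpu is None:
        # os.cpu_count() may return None; using 1 then is an assumption
        return os.cpu_count() or 1
    return darc_cpu
```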
-
darc.const.FLAG_MP: bool
Whether to enable multiprocessing support.
- Default
True
- Environ
-
darc.const.FLAG_TH: bool
Whether to enable multithreading support.
- Default
False
- Environ
-
darc.const.DARC_USER: str
Non-root user for proxies.
- Default
current login user (c.f. getpass.getuser())
- Environ
Data Storage
-
darc.const.PATH_DB: str
Path to data storage.
- Default
data
- Environ
See also
See darc.save for more information about source saving.
-
darc.const.PATH_MISC = '{PATH_DB}/misc/'
Path to miscellaneous data storage, i.e. the misc folder under the root of data storage.
See also
-
darc.const.PATH_LN = '{PATH_DB}/link.csv'
Path to the link CSV file, link.csv.
See also
darc.save.save_link
-
darc.const.PATH_QR = '{PATH_DB}/_queue_requests.txt'
Path to the requests database, _queue_requests.txt.
See also
darc.db.load_requests()
darc.db.save_requests()
-
darc.const.PATH_QS = '{PATH_DB}/_queue_selenium.txt'
Path to the selenium database, _queue_selenium.txt.
See also
darc.db.load_selenium()
darc.db.save_selenium()
-
darc.const.PATH_ID = '{PATH_DB}/darc.pid'
Path to the process ID file, darc.pid.
See also
darc.const.getpid()
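All of the paths above are derived from PATH_DB. The following sketch expands the documented templates for a given storage root; build_paths is a hypothetical helper for illustration, not a darc function.

```python
def build_paths(path_db="data"):
    """Hypothetical helper: expand the documented path templates."""
    return {
        "PATH_MISC": f"{path_db}/misc/",
        "PATH_LN": f"{path_db}/link.csv",
        "PATH_QR": f"{path_db}/_queue_requests.txt",
        "PATH_QS": f"{path_db}/_queue_selenium.txt",
        "PATH_ID": f"{path_db}/darc.pid",
    }


paths = build_paths()
```

With the default storage root, the link CSV file resolves to data/link.csv and the process ID file to data/darc.pid.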
Web Crawlers
-
darc.const.TIME_CACHE: float
Time delta for caches in seconds.
The darc project supports caching for fetched files. TIME_CACHE specifies how long fetched files will be cached and NOT fetched again.
Note
If TIME_CACHE is None, then cached files are kept forever.
- Default
60
- Environ
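A cache freshness test consistent with this description might be sketched as follows. The function is_cache_fresh and its parameters are assumptions made for illustration; how darc records fetch times is not described here.

```python
import datetime


def is_cache_fresh(fetched_at, now, time_cache=60.0):
    """Hypothetical helper: True if a cached file should NOT be fetched again."""
    if time_cache is None:
        # None means the cache never expires
        return True
    return now - fetched_at < datetime.timedelta(seconds=time_cache)


fetched = datetime.datetime(2020, 1, 1, 12, 0, 0)
fresh = is_cache_fresh(fetched, fetched + datetime.timedelta(seconds=30))
stale = is_cache_fresh(fetched, fetched + datetime.timedelta(seconds=90))
```

Here a file fetched 30 seconds ago is still considered cached under the default of 60 seconds, while one fetched 90 seconds ago is not.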
-
darc.const.SE_WAIT: float
Time to wait for selenium to finish loading pages.
Note
Internally, selenium will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.
- Default
60
- Environ
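One simple way to apply such a wait is to sleep after the page load returns, giving late-running scripts time to finish. This is only a sketch under that assumption: load_page is a hypothetical helper, and driver is expected to be a selenium WebDriver-like object exposing get() and page_source.

```python
import time


def load_page(driver, url, se_wait=60.0):
    """Hypothetical helper: load a page, then wait for late-running scripts."""
    driver.get(url)      # returns once the browser reports the page as loaded
    time.sleep(se_wait)  # extra time for scripts that run after the load event
    return driver.page_source
```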
White / Black Lists
-
darc.const.LINK_WHITE_LIST: List[str]
White list of hostnames that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.LINK_BLACK_LIST: List[str]
Black list of hostnames that should not be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.MIME_WHITE_LIST: List[str]
White list of content types that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.MIME_BLACK_LIST: List[str]
Black list of content types that should not be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.PROXY_WHITE_LIST: List[str]
White list of proxy types that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
-
darc.const.PROXY_BLACK_LIST: List[str]
Black list of proxy types that should not be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
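How a value might be tested against a white list and a black list of regular expressions can be sketched as below. Several details are assumptions not stated above: the helper names, the use of full-string case-insensitive matching, the precedence of the white list over the black list, and the fallback value used when neither list matches.

```python
import re


def matches_any(value, patterns):
    """Hypothetical helper: True if any pattern fully matches the value."""
    return any(re.fullmatch(p, value, re.IGNORECASE) for p in patterns)


def should_crawl(value, white_list, black_list, fallback=False):
    """Hypothetical helper: white list first, then black list (assumed order)."""
    if matches_any(value, white_list):
        return True
    if matches_any(value, black_list):
        return False
    # neither list matched; the fallback behaviour is an assumption
    return fallback
```

The same check applies equally to hostnames, content types and proxy types, since all six lists share the same shape.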