Module Constants
Auxiliary Function
General Configurations
- darc.const.REBOOT: bool
Whether to exit the program after the first round, i.e. after crawling all links from the requests link database and loading all links from the selenium link database.
This can be useful when capacity is limited and you wish to free some space before continuing with the next round. See Docker integration for more information.
- Default
False
- Environ
- darc.const.VERBOSE: bool
Whether to run the program in verbose mode. If DEBUG is True, verbose mode is always enabled.
- Default
False
- Environ
- darc.const.FORCE: bool
Whether to ignore robots.txt rules when crawling (c.f. crawler()).
- Default
False
- Environ
- darc.const.CHECK: bool
Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
If CHECK_NG is True, this option is always set to True.
- Default
False
- Environ
- darc.const.CHECK_NG: bool
Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
- Default
False
- Environ
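The dependencies between the flags above (DEBUG forcing VERBOSE, CHECK_NG forcing CHECK) can be sketched as follows. This is a minimal illustration rather than darc's actual implementation: the helper names env_flag() and resolve_flags(), the EXAMPLE_* variable names, and the "0"/"1" integer encoding are all assumptions made for the example.

```python
import os
from typing import Dict, Mapping, Optional


def env_flag(name: str, env: Optional[Mapping[str, str]] = None,
             default: bool = False) -> bool:
    """Read a boolean flag stored as an integer string such as "0" or "1".

    Hypothetical helper; the encoding is an assumption of this sketch.
    """
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    return bool(int(raw))


def resolve_flags(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Resolve the flags and apply the documented "always enabled" rules."""
    debug = env_flag("EXAMPLE_DEBUG", env)
    check_ng = env_flag("EXAMPLE_CHECK_NG", env)
    return {
        # VERBOSE is always enabled when DEBUG is True.
        "VERBOSE": env_flag("EXAMPLE_VERBOSE", env) or debug,
        "CHECK_NG": check_ng,
        # CHECK is always enabled when CHECK_NG is True.
        "CHECK": env_flag("EXAMPLE_CHECK", env) or check_ng,
    }
```

Passing a plain dictionary instead of reading os.environ directly keeps the sketch easy to try out, e.g. resolve_flags({"EXAMPLE_CHECK_NG": "1"}).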
- darc.const.CWD = '.'
The current working directory.
- darc.const.DARC_CPU: int
Number of concurrent processes. If not provided, the number of system CPUs is used.
- Default
None
- Environ
- darc.const.DARC_USER: str
Non-root user for proxies.
- Default
current login user (c.f. getpass.getuser())
- Environ
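The two defaults above (system CPU count for DARC_CPU, current login user for DARC_USER) can be sketched as follows. The function names resolve_cpu() and resolve_user() are hypothetical, and falling back to 1 when the CPU count cannot be determined is an assumption of this sketch, not documented behaviour.

```python
import getpass
import os
from typing import Optional


def resolve_cpu(configured: Optional[int] = None) -> int:
    """Return the configured process count, or the number of system CPUs."""
    if configured is not None:
        return configured
    # os.cpu_count() may return None when the count is undetermined;
    # using 1 in that case is an assumption of this sketch.
    return os.cpu_count() or 1


def resolve_user(configured: Optional[str] = None) -> str:
    """Return the configured user, or the current login user."""
    if configured is not None:
        return configured
    return getpass.getuser()
```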
Data Storage
- darc.const.REDIS: redis.Redis
URL to the Redis database.
- Default
redis://127.0.0.1
- Environ
- darc.const.PATH_DB: str
Path to data storage.
- Default
data
- Environ
See also
See darc.save for more information about source saving.
- darc.const.PATH_MISC = '{PATH_DB}/misc/'
Path to miscellaneous data storage, i.e. the misc folder under the root of data storage.
See also
- darc.const.PATH_LN = '{PATH_DB}/link.csv'
Path to the link CSV file, link.csv.
See also
- darc.const.PATH_QR = '{PATH_DB}/_queue_requests.txt'
Path to the requests database, _queue_requests.txt.
- darc.const.PATH_QS = '{PATH_DB}/_queue_selenium.txt'
Path to the selenium database, _queue_selenium.txt.
- darc.const.PATH_ID = '{PATH_DB}/darc.pid'
Path to the process ID file, darc.pid.
See also
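The path constants in this section are all written as templates relative to PATH_DB. A minimal sketch of how they expand is shown below; the function name storage_paths() is hypothetical, and plain string formatting is used only to mirror the templates as documented.

```python
from typing import Dict


def storage_paths(path_db: str = "data") -> Dict[str, str]:
    """Expand the documented '{PATH_DB}/...' templates for a given root."""
    return {
        "PATH_MISC": f"{path_db}/misc/",
        "PATH_LN": f"{path_db}/link.csv",
        "PATH_QR": f"{path_db}/_queue_requests.txt",
        "PATH_QS": f"{path_db}/_queue_selenium.txt",
        "PATH_ID": f"{path_db}/darc.pid",
    }
```

With the default root, storage_paths()["PATH_LN"] evaluates to 'data/link.csv'.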
Web Crawlers
- darc.const.DARC_WAIT: Optional[float]
Time interval between rounds when the requests and/or selenium databases are empty.
- Default
60
- Environ
- darc.const.SAVE: bool
Whether to save processed links back to the database.
Note
If SAVE is True, then SAVE_REQUESTS and SAVE_SELENIUM are forced to be True.
- Default
False
- Environ
See also
See darc.db for more information about the link database.
- darc.const.SAVE_REQUESTS: bool
Whether to save links crawled by crawler() back to the requests database.
- Default
False
- Environ
See also
See darc.db for more information about the link database.
- darc.const.SAVE_SELENIUM: bool
Whether to save links loaded by loader() back to the selenium database.
- Default
False
- Environ
See also
See darc.db for more information about the link database.
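The relationship between SAVE, SAVE_REQUESTS and SAVE_SELENIUM can be sketched as below. The SaveFlags tuple and resolve_save_flags() are hypothetical names introduced for illustration; only the rule that SAVE forces the other two flags to True comes from the descriptions above.

```python
from typing import NamedTuple


class SaveFlags(NamedTuple):
    """Hypothetical container for the three resolved flags."""
    save: bool
    save_requests: bool
    save_selenium: bool


def resolve_save_flags(save: bool = False,
                       save_requests: bool = False,
                       save_selenium: bool = False) -> SaveFlags:
    """Apply the rule that SAVE forces both stage-specific flags to True."""
    return SaveFlags(
        save=save,
        save_requests=save_requests or save,
        save_selenium=save_selenium or save,
    )
```

Setting only one stage-specific flag leaves the other untouched, so resolve_save_flags(save_requests=True) enables saving for the requests stage alone.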
- darc.const.TIME_CACHE: float
Time delta for caches, in seconds.
The darc project supports caching for fetched files. TIME_CACHE specifies how long fetched files are cached and NOT fetched again.
Note
If TIME_CACHE is None, fetched files are cached forever.
- Default
60
- Environ
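The caching rule described for TIME_CACHE can be sketched as a freshness check. The function is_cached() and its parameters are hypothetical, and treating an entry as expired exactly at the TIME_CACHE boundary is an assumption of this sketch.

```python
import time
from typing import Optional


def is_cached(fetched_at: float, time_cache: Optional[float] = 60.0,
              now: Optional[float] = None) -> bool:
    """Return True if a file fetched at `fetched_at` should NOT be fetched again.

    `fetched_at` and `now` are UNIX timestamps in seconds.
    """
    if time_cache is None:
        # None means the cache never expires.
        return True
    current = time.time() if now is None else now
    return (current - fetched_at) < time_cache
```

The optional now argument exists only so the check can be exercised with fixed timestamps.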
- darc.const.SE_WAIT: float
Time to wait for selenium to finish loading pages.
Note
Internally, selenium waits for the browser to finish loading the page before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.
- Default
60
- Environ
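The extra wait described for SE_WAIT can be sketched as below. The load_page() function, the Driver protocol and the injectable sleep parameter are hypothetical; the sketch relies only on the get() method and page_source attribute that a selenium WebDriver provides, and it is not darc's loader implementation.

```python
import time
from typing import Callable, Protocol


class Driver(Protocol):
    """Minimal subset of the selenium WebDriver interface used below."""

    page_source: str

    def get(self, url: str) -> None: ...


def load_page(driver: Driver, url: str, se_wait: float = 60.0,
              sleep: Callable[[float], None] = time.sleep) -> str:
    """Load `url`, then wait `se_wait` seconds so late scripts can finish."""
    # get() returns once selenium considers the page loaded.
    driver.get(url)
    if se_wait > 0:
        # Give scripts that run after the load event extra time.
        sleep(se_wait)
    return driver.page_source
```

A fixed sleep is the simplest way to express the documented behaviour; it trades crawl speed for a better chance of capturing script-generated content.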
White / Black Lists
- darc.const.LINK_WHITE_LIST: List[re.Pattern]
White list of hostnames that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
- darc.const.LINK_BLACK_LIST: List[re.Pattern]
Black list of hostnames that should not be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
- darc.const.LINK_FALLBACK: bool
Fallback value for match_host().
- Default
False
- Environ
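One way a white list, a black list and a fallback value can combine into a single decision is sketched below. This is not the implementation of match_host(): the function should_crawl_host(), its return convention (True means "crawl"), the white-list-first precedence and the use of full-string matching are all assumptions of this sketch.

```python
import re
from typing import Sequence


def should_crawl_host(host: str,
                      white_list: Sequence[re.Pattern],
                      black_list: Sequence[re.Pattern],
                      fallback: bool = False) -> bool:
    """Decide whether `host` should be crawled.

    Assumed precedence: white list, then black list, then the fallback.
    """
    if any(pattern.fullmatch(host) for pattern in white_list):
        return True
    if any(pattern.fullmatch(host) for pattern in black_list):
        return False
    # Host matched neither list: use the fallback value.
    return fallback
```

For example, with a white list of [re.compile(r".*\.onion")] and an empty black list, a hostname that matches neither list is decided purely by the fallback argument.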
- darc.const.MIME_WHITE_LIST: List[re.Pattern]
White list of content types that should be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
- darc.const.MIME_BLACK_LIST: List[re.Pattern]
Black list of content types that should not be crawled.
- Default
[]
- Environ
Note
Regular expressions are supported.
- darc.const.MIME_FALLBACK: bool
Fallback value for match_mime().
- Default
False
- Environ
- darc.const.PROXY_WHITE_LIST: List[str]
White list of proxy types that should be crawled.
- Default
[]
- Environ
Note
The proxy types are case insensitive.
- darc.const.PROXY_BLACK_LIST: List[str]
Black list of proxy types that should not be crawled.
- Default
[]
- Environ
Note
The proxy types are case insensitive.
- darc.const.PROXY_FALLBACK: bool
Fallback value for match_proxy().
- Default
False
- Environ
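Because the proxy lists hold plain strings compared case-insensitively rather than regular expressions, the check reduces to a normalised membership test. The sketch below is not the implementation of match_proxy(): the function should_crawl_proxy(), its return convention (True means "crawl"), the white-list-first precedence and the example proxy names are assumptions.

```python
from typing import Iterable


def should_crawl_proxy(proxy: str,
                       white_list: Iterable[str],
                       black_list: Iterable[str],
                       fallback: bool = False) -> bool:
    """Decide whether links of the given proxy type should be crawled."""
    # Normalise case so that e.g. "TOR" and "tor" compare equal.
    key = proxy.casefold()
    if key in {item.casefold() for item in white_list}:
        return True
    if key in {item.casefold() for item in black_list}:
        return False
    # Proxy type appears in neither list: use the fallback value.
    return fallback
```

With a white list of ["tor"] and a black list of ["i2p"] (example values only), should_crawl_proxy("TOR", ["tor"], ["i2p"]) accepts the link regardless of letter case.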