Sites Customisation
Because websites may impose authentication and other access requirements on their content, the darc.sites module provides sites customisation hooks for both the requests and selenium crawling processes.
Important
To create a sites customisation, define a class that inherits from darc.sites.BaseSite and register it with the darc module through darc.sites.register().
To start with, you only need to define your sites customisation class by inheriting from BaseSite and overloading the corresponding crawler() and/or loader() methods.
To customise behaviour for requests, your sites customisation class should have a crawler() method, e.g. DefaultSite.crawler.
The function takes the requests.Session object with proxy settings and a Link object representing the link to be crawled, then returns a requests.Response object containing the final data of the crawling process.
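As an illustration, a crawler() that attaches a credential before fetching might look like the sketch below. It is not darc's implementation: Link and FakeSession are stand-ins so the example runs without darc or requests, the link.url attribute and the leading timestamp argument are assumptions, and the token is a placeholder. Only session.headers and session.get(), which requests.Session does provide, are relied upon.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Link:
    """Stand-in for darc's Link; only a `url` attribute is assumed here."""
    url: str


class FakeSession:
    """Tiny stub exposing the two requests.Session features used below."""

    def __init__(self):
        self.headers = {}

    def get(self, url, **kwargs):
        # A real session would perform the HTTP request and return a Response.
        return {"url": url, "sent_headers": dict(self.headers)}


def crawler(timestamp, session, link):
    """Sketch of a crawler(): authenticate, fetch, return the response."""
    # Placeholder credential; a real site would supply whatever it requires.
    session.headers["Authorization"] = "Bearer <token>"
    return session.get(link.url)


response = crawler(datetime.now(), FakeSession(), Link("https://www.sample.com/"))
```

With a real requests.Session in place of the stub, the returned value would be the requests.Response described above.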
- darc.sites.crawler_hook(timestamp, session, link)
  Customisation as to requests sessions.
  - Parameters:
    timestamp (datetime) – Timestamp of the worker node reference.
    session (requests.Session) – Session object with proxy settings.
    link (Link) – Link object to be crawled.
  - Returns:
    The final response object with crawled data.
  - Return type:
    requests.Response
See also
darc.sites.SITEMAP
To customise behaviour for selenium, your sites customisation class should have a loader() method, e.g. DefaultSite.loader.
The function takes the WebDriver object with proxy settings and a Link object representing the link to be loaded, then returns the WebDriver object containing the final data of the loading process.
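A loader() following this description could be sketched as below. Again this is an assumption-laden illustration rather than darc's code: Link and FakeDriver are stand-ins, link.url and the leading timestamp argument are assumed, and the stub mimics only the get(), current_url and page_source members that selenium's WebDriver offers.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Link:
    """Stand-in for darc's Link; only a `url` attribute is assumed here."""
    url: str


class FakeDriver:
    """Stub mimicking the WebDriver members used in this sketch."""

    def __init__(self):
        self.current_url = None
        self.page_source = ""

    def get(self, url):
        # A real driver would navigate the browser and render the page.
        self.current_url = url
        self.page_source = "<html></html>"


def loader(timestamp, driver, link):
    """Sketch of a loader(): navigate, then hand back the same driver."""
    driver.get(link.url)
    # Site-specific steps (waiting, logging in, scrolling) would go here.
    return driver


driver = FakeDriver()
loaded = loader(datetime.now(), driver, Link("https://www.sample.com/"))
```

Note that the driver passed in is the object returned, now holding the loaded page, rather than a separate response object.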
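- darc.sites.loader_hook(timestamp, driver, link)
  Customisation as to selenium drivers.
  - Parameters:
    timestamp (datetime) – Timestamp of the worker node reference.
    driver (selenium.webdriver.Chrome) – Web driver object with proxy settings.
    link (Link) – Link object to be loaded.
  - Returns:
    The web driver object with loaded data.
  - Return type:
    selenium.webdriver.Chrome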
See also
darc.sites.SITEMAP
To tell the darc project which sites customisation class it should use for a certain hostname, register that class in the SITEMAP mapping dictionary through register():
- darc.sites.SITEMAP: DefaultDict[str, Type[darc.sites._abc.BaseSite]]

      import collections

      from darc.sites.default import DefaultSite

      SITEMAP = collections.defaultdict(lambda: DefaultSite, {
          # 'www.sample.com': SampleSite,  # local customised class
      })

  The mapping dictionary from hostnames to sites customisation classes.
  The fallback value is darc.sites.default.DefaultSite.
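The fallback behaviour of such a mapping can be seen in a small standalone sketch. DefaultSite and SampleSite below are empty placeholder classes and the unregistered hostname is made up; only the collections.defaultdict behaviour itself is being demonstrated.

```python
import collections


class DefaultSite:
    """Placeholder for the default customisation class."""


class SampleSite:
    """Placeholder for a locally customised class."""


SITEMAP = collections.defaultdict(lambda: DefaultSite, {
    "www.sample.com": SampleSite,
})

# A registered hostname resolves to its own class.
known = SITEMAP["www.sample.com"]

# An unregistered hostname falls back to the factory's return value.
unknown = SITEMAP["unregistered.example"]

# Side effect of defaultdict: the missing key is now stored in the mapping.
cached = "unregistered.example" in SITEMAP
```

Because subscripting a defaultdict inserts the missing key, a mapping used this way grows by one entry per distinct unregistered hostname; SITEMAP.get(hostname, DefaultSite) would avoid that if it matters.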
- darc.sites._get_site(link)
  Load sites customisation if any.
  If the sites customisation does not exist, it will fall back to the default hooks, DefaultSite.
  - Parameters:
    link (Link) – Link object to fetch sites customisation class.
  - Returns:
    The sites customisation class.
  - Return type:
    Type[darc.sites._abc.BaseSite]
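One plausible way the lookup and the hooks fit together is sketched below; it is an assumption about the overall flow, not a copy of darc's source. The Link stand-in, its host attribute, the casefold() normalisation, and the string return values are all hypothetical and exist only to make the dispatch visible.

```python
import collections
from dataclasses import dataclass


@dataclass
class Link:
    """Stand-in for darc's Link; the `host` attribute is an assumption."""
    url: str
    host: str


class DefaultSite:
    @staticmethod
    def crawler(timestamp, session, link):
        return f"default:{link.url}"


class SampleSite:
    @staticmethod
    def crawler(timestamp, session, link):
        return f"sample:{link.url}"


SITEMAP = collections.defaultdict(lambda: DefaultSite, {
    "www.sample.com": SampleSite,
})


def get_site(link):
    """Look up the class for a hostname, falling back to DefaultSite."""
    # Lower-casing the hostname is an assumed normalisation step.
    return SITEMAP[link.host.casefold()]


def crawler_hook(timestamp, session, link):
    """Dispatch to the crawler() of whichever class handles this link."""
    return get_site(link).crawler(timestamp, session, link)


custom = crawler_hook(None, None, Link("https://WWW.sample.com/a", "WWW.sample.com"))
fallback = crawler_hook(None, None, Link("https://other.example/", "other.example"))
```

A loader hook would follow the same shape, calling loader() on the looked-up class instead of crawler().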
See also
Please refer to Customisations for more examples and explanations.