Sites Customisation
As websites may impose authentication requirements and similar restrictions on their content, the darc.sites module provides sites customisation hooks for both the requests and selenium crawling processes.
Important
To create a sites customisation, define your class by inheriting darc.sites.BaseSite and register it with the darc module through darc.sites.register().
To start with, you just need to define your sites customisation by inheriting BaseSite and overriding the corresponding crawler() and/or loader() methods.
To customise behaviours over requests, your sites customisation class should have a crawler() method, e.g. DefaultSite.crawler.
The function takes the requests.Session object with proxy settings and a Link object representing the link to be crawled, then returns a requests.Response object containing the final data of the crawling process.
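The crawler() contract described above can be sketched as follows. This is a minimal illustration and not the project's own code: the SampleSite class, the X-Sample-Token header and the StubSession helper are hypothetical, the (session, link) argument order follows the description above, and the Link class is a stand-in that only assumes a url attribute. A real customisation would inherit BaseSite and receive a real requests.Session.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical stand-in for darc.link.Link (only the field used here)."""
    url: str


class SampleSite:
    """Illustrative sites customisation; a real one would inherit BaseSite."""

    hostname = ['www.sample.com']

    @staticmethod
    def crawler(session, link):
        """Fetch ``link`` after adding a site-specific header to the session."""
        # Assumption: the site expects a token header; name and value are made up.
        session.headers['X-Sample-Token'] = 'placeholder'
        return session.get(link.url)


class StubSession:
    """Offline stand-in exposing the two requests.Session members used above."""

    def __init__(self):
        self.headers = {}

    def get(self, url):
        # Return a plain dict instead of a requests.Response.
        return {'url': url, 'headers': dict(self.headers)}


response = SampleSite.crawler(StubSession(), Link(url='https://www.sample.com/'))
```

The stub only keeps the sketch runnable without network access; with a real session the method would return whatever session.get() returns.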
darc.sites.crawler_hook(link, session)
Customisation as to requests sessions.
- Parameters
link (darc.link.Link) – Link object to be crawled.
session (requests.Session) – Session object with proxy settings.
- Returns
The final response object with crawled data.
- Return type
requests.Response
See also
darc.sites.SITEMAP
To customise behaviours over selenium, your sites customisation class should have a loader() method, e.g. DefaultSite.loader.
The function takes the WebDriver object with proxy settings and a Link object representing the link to be loaded, then returns the WebDriver object containing the final data of the loading process.
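A loader() method along the lines described above might look like the sketch below. The SampleSite class, the scroll step and the StubDriver helper are assumptions made for illustration, and the link object is a stand-in that only assumes a url attribute. The two driver calls used, get() and execute_script(), are standard selenium WebDriver methods.

```python
from types import SimpleNamespace


class SampleSite:
    """Illustrative sites customisation; a real one would inherit BaseSite."""

    hostname = ['www.sample.com']

    @staticmethod
    def loader(driver, link):
        """Load ``link`` in the browser and return the same driver."""
        driver.get(link.url)
        # Assumption: the page lazy-loads content, so scroll to the bottom once.
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        return driver


class StubDriver:
    """Offline stand-in exposing the two WebDriver methods used above."""

    def __init__(self):
        self.calls = []

    def get(self, url):
        self.calls.append(('get', url))

    def execute_script(self, script):
        self.calls.append(('execute_script', script))


link = SimpleNamespace(url='https://www.sample.com/')  # stand-in for darc.link.Link
driver = SampleSite.loader(StubDriver(), link)
```

Returning the driver itself, rather than the page source, matches the description above: the caller reads the loaded data from the driver afterwards.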
darc.sites.loader_hook(link, driver)
Customisation as to selenium drivers.
- Parameters
link (darc.link.Link) – Link object to be loaded.
driver (selenium.webdriver.Chrome) – Web driver object with proxy settings.
- Returns
The web driver object with loaded data.
- Return type
selenium.webdriver.Chrome
See also
darc.sites.SITEMAP
To tell the darc project which sites customisation class it should use for a certain hostname, you can register that class in the SITEMAP mapping dictionary through register():
darc.sites.register(site, *hostname)
Register new site map.
- Parameters
site (Type[darc.sites._abc.BaseSite]) – Sites customisation class inherited from BaseSite.
*hostname (Tuple[str]) – Optional list of hostnames the sites customisation should be registered with. By default, we use site.hostname.
darc.sites.SITEMAP: DefaultDict[str, Type[darc.sites._abc.BaseSite]]

    from darc.sites.default import DefaultSite

    SITEMAP = collections.defaultdict(lambda: DefaultSite, {
        # 'www.sample.com': SampleSite,  # local customised class
    })

The mapping dictionary from hostnames to sites customisation classes.
The fallback value is darc.sites.default.DefaultSite.
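How register() and SITEMAP fit together can be sketched with stand-in classes. The code below is not the darc implementation: BaseSite, DefaultSite and SampleSite are simplified placeholders, the hostnames are made up, and register() only reproduces the behaviour stated above (use the given hostnames, otherwise fall back to site.hostname).

```python
import collections


class BaseSite:
    """Stand-in for darc.sites.BaseSite."""
    hostname = []


class DefaultSite(BaseSite):
    """Stand-in for the fallback customisation."""


class SampleSite(BaseSite):
    """Hypothetical customisation for two made-up hostnames."""
    hostname = ['www.sample.com', 'sample.com']


# Unknown hostnames resolve to DefaultSite, as in the mapping shown above.
SITEMAP = collections.defaultdict(lambda: DefaultSite)


def register(site, *hostname):
    """Map each hostname to ``site``; default to ``site.hostname``."""
    for host in (hostname or site.hostname):
        SITEMAP[host] = site


register(SampleSite)                       # uses SampleSite.hostname
register(SampleSite, 'mirror.sample.org')  # explicit hostname instead
```

Because the mapping is a defaultdict, looking up a hostname that was never registered yields the default class rather than raising KeyError.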
darc.sites._get_site(link)
Load sites customisation if any.
If the sites customisation does not exist, it will fall back to the default hooks, DefaultSite.
- Parameters
link (darc.link.Link) – Link object to fetch sites customisation class.
- Returns
The sites customisation class.
- Return type
Type[darc.sites._abc.BaseSite]
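The lookup with fallback can be illustrated as below. This is a sketch under stated assumptions: the real function takes a Link object, whereas the hypothetical get_site() here derives the hostname from a plain URL with urllib.parse so that the example stays self-contained, and both site classes are empty placeholders.

```python
import collections
from urllib.parse import urlparse


class DefaultSite:
    """Stand-in for the default hooks."""


class SampleSite:
    """Hypothetical customisation registered for www.sample.com."""


SITEMAP = collections.defaultdict(lambda: DefaultSite, {
    'www.sample.com': SampleSite,
})


def get_site(url):
    """Return the customisation class for the hostname of ``url``."""
    # urlparse(...).hostname is lower-cased, or None when no host is present.
    host = urlparse(url).hostname or ''
    return SITEMAP[host]


known = get_site('https://www.sample.com/login')
unknown = get_site('http://other.example/')
```

Here known is SampleSite and unknown is DefaultSite, which mirrors the fallback behaviour described for _get_site().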
See also
Please refer to Customisations for more examples and explanations.