Source Parsing
The darc.parse module provides auxiliary functions to read robots.txt, sitemaps and HTML documents. It also contains utility functions to check whether the proxy type, hostname and content type are in any of the black and white lists.
-
darc.parse._check(temp_list)
Check hostname and proxy type of links.
- Parameters
temp_list (List[darc.link.Link]) – List of links to be checked.
- Returns
List of links that match the requirements.
- Return type
List[darc.link.Link]
Note
If CHECK_NG is True, the function will directly call _check_ng() instead.
-
darc.parse._check_ng(temp_list)
Check content type of links through HEAD requests.
- Parameters
temp_list (List[darc.link.Link]) – List of links to be checked.
- Returns
List of links that match the requirements.
- Return type
List[darc.link.Link]
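The HEAD-based check can be illustrated with a minimal, standard-library sketch. It is not darc's implementation: the names head_content_type and filter_by_content_type, the allowed tuple and the injectable probe argument are all assumptions made for illustration, and plain URL strings stand in for darc.link.Link objects.

```python
from urllib.request import Request, urlopen


def head_content_type(url, timeout=10.0):
    """Return the media type reported by a HEAD request to ``url``."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        # get_content_type() drops parameters such as "; charset=utf-8"
        return response.headers.get_content_type()


def filter_by_content_type(urls, allowed=("text/html",), probe=head_content_type):
    """Keep the URLs whose probed media type is in ``allowed`` (hypothetical helper)."""
    kept = []
    for url in urls:
        try:
            content_type = probe(url)
        except OSError:
            # Assumption: unreachable hosts are simply skipped in this sketch.
            continue
        if content_type in allowed:
            kept.append(url)
    return kept
```

Passing probe as an argument keeps the filtering logic testable without network access; a HEAD request is used so that no response body has to be downloaded.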
-
darc.parse.check_robots(link)
Check if link is allowed in robots.txt.
- Parameters
link (darc.link.Link) – The link object to be checked.
- Returns
Whether link is allowed in robots.txt.
- Return type
bool
Note
For the root path of a URL, the function always returns True.
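As a rough illustration of this kind of check, the sketch below uses the standard-library urllib.robotparser on robots.txt content that has already been fetched. The function name is_allowed and its parameters are hypothetical, and darc may implement the check differently; only the root-path rule is taken from the note above.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def is_allowed(url, robots_text, user_agent="*"):
    """Return True if ``url`` may be fetched according to ``robots_text``."""
    # Mirrors the documented behaviour: the root path is always allowed.
    if urlsplit(url).path in ("", "/"):
        return True
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with a robots.txt that contains "Disallow: /private/" for all user agents, a URL under /private/ is rejected while other paths are accepted.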
-
darc.parse.extract_links(link, html, check=False)
Extract links from an HTML document.
- Parameters
link (darc.link.Link) – Original link of the HTML document.
check (bool) – Whether to perform checks on extracted links; defaults to CHECK.
- Returns
An iterator of extracted links.
- Return type
Iterator[darc.link.Link]
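Link extraction of this kind can be sketched with the standard library alone. The example below is an assumption-laden illustration, not darc's code: it only looks at the href attribute of anchor tags, resolves each one against the document URL, and yields plain strings instead of darc.link.Link objects. The names _AnchorCollector and iter_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def iter_links(base_url, html):
    """Yield absolute URLs for the anchors found in ``html``."""
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    for href in collector.hrefs:
        # relative references are resolved against the document URL
        yield urljoin(base_url, href)
```

Returning a generator matches the Iterator return type documented above and lets a caller stop early without resolving every link.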
-
darc.parse.get_content_type(response)
Get content type from response.
- Parameters
response (requests.Response) – Response object.
- Returns
The content type from response.
- Return type
str
Note
If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
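The header-first, detection-second logic described in the note might look roughly like the sketch below. The function content_type_of and its sniff parameter are hypothetical; a plain mapping stands in for the response headers, and the detector is passed in by the caller so that the example does not depend on any particular detection library. Whether darc strips parameters such as the charset is also an assumption.

```python
def content_type_of(headers, body, sniff=None):
    """Return the media type from ``headers``, falling back to ``sniff(body)``."""
    # Note: a plain dict is case-sensitive, unlike the headers of a real response.
    raw = headers.get("Content-Type")
    if raw:
        # drop parameters such as "; charset=utf-8" and normalise the case
        return raw.split(";", 1)[0].strip().lower()
    if sniff is not None:
        # Assumption: ``sniff`` maps raw bytes to a media type string.
        return sniff(body)
    # generic default when neither the header nor a detector is available
    return "application/octet-stream"
```

A caller that has a content-detection library available can wrap it in a small function and pass it as sniff.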
-
darc.parse.get_sitemap(link, text, host=None)
Fetch links to other sitemaps from a sitemap.
- Parameters
link (darc.link.Link) – Original link to the sitemap.
text (str) – Content of the sitemap.
host (Optional[str]) – Hostname of the URL to the sitemap; the value may differ from the one in link.
- Returns
List of links to sitemaps.
- Return type
List[darc.link.Link]
Note
As specified in the sitemap protocol, a sitemap may contain links to other sitemaps.
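In the sitemap protocol, nested sitemaps are listed in a sitemapindex document. A minimal sketch of reading one with the standard library is shown below; the function name nested_sitemaps is hypothetical, plain strings replace darc.link.Link objects, and darc's own parsing may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# XML namespace defined by the sitemap protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def nested_sitemaps(base_url, text):
    """Return the sitemap URLs listed in a sitemap index, if ``text`` is one."""
    root = ET.fromstring(text)
    if root.tag != SITEMAP_NS + "sitemapindex":
        # an ordinary <urlset> sitemap lists pages, not other sitemaps
        return []
    links = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        if loc.text and loc.text.strip():
            links.append(urljoin(base_url, loc.text.strip()))
    return links
```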
-
darc.parse.match_host(host)
Check if the hostname is in the black list.
Note
If host is None, then it will always return True.
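One way to picture a black and white list check is the sketch below. Everything in it is an assumption for illustration: the pattern lists, the name is_blocked_host, the rule that the white list takes precedence, and the fallback value. Only the handling of None follows the note above.

```python
import re

# Hypothetical patterns, for illustration only.
HOST_WHITE_LIST = [re.compile(r"docs\.example\.org")]
HOST_BLACK_LIST = [re.compile(r".*\.example\.org")]


def is_blocked_host(host, fallback=False):
    """Return True if ``host`` should be treated as black-listed."""
    if host is None:
        # documented behaviour: a missing hostname always matches
        return True
    # Assumption: the white list is consulted first and overrides the black list.
    if any(pattern.fullmatch(host) for pattern in HOST_WHITE_LIST):
        return False
    if any(pattern.fullmatch(host) for pattern in HOST_BLACK_LIST):
        return True
    # Assumption: hosts matching neither list use a configurable fallback.
    return fallback
```

The same shape would apply to a proxy type check, with proxy names in place of hostname patterns.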
-
darc.parse.match_proxy(proxy)
Check if the proxy type is in the black list.
Note
If proxy is script, then it will always return True.
-
darc.parse.read_robots(link, text, host=None)
Read robots.txt to fetch links to sitemaps.
- Parameters
link (darc.link.Link) – Original link to robots.txt.
text (str) – Content of robots.txt.
host (Optional[str]) – Hostname of the URL to robots.txt; the value may differ from the one in link.
- Returns
List of links to sitemaps.
- Return type
List[darc.link.Link]
Note
If the link to a sitemap is not specified in robots.txt, the fallback link /sitemap.xml will be used.
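The fallback rule can be sketched as follows. This is a simplified, standard-library illustration rather than darc's implementation: the function name sitemap_links is hypothetical, the robots.txt content is scanned line by line for Sitemap directives, and plain strings replace darc.link.Link objects.

```python
from urllib.parse import urljoin


def sitemap_links(robots_url, text):
    """Return sitemap URLs declared in ``text``, or the /sitemap.xml fallback."""
    links = []
    for line in text.splitlines():
        # strip trailing comments before looking at the directive
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            links.append(urljoin(robots_url, value.strip()))
    if not links:
        # no Sitemap directive found: fall back to the conventional location
        links.append(urljoin(robots_url, "/sitemap.xml"))
    return links
```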
-
darc.parse.read_sitemap(link, text, check=False)
Read sitemap.
- Parameters
link (darc.link.Link) – Original link to the sitemap.
text (str) – Content of the sitemap.
check (bool) – Whether to perform checks on extracted links; defaults to CHECK.
- Returns
List of links extracted.
- Return type
Iterator[darc.link.Link]
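For an ordinary urlset sitemap, reading the page links could be sketched as below. The function name iter_sitemap_urls is hypothetical, the optional checks controlled by check are left out, and plain strings are yielded instead of darc.link.Link objects, so this is only an approximation of what the function does.

```python
import xml.etree.ElementTree as ET

# prefix-to-namespace map for the sitemap protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def iter_sitemap_urls(text):
    """Yield the page URLs listed in a <urlset> sitemap."""
    root = ET.fromstring(text)
    # each <url> entry carries its address in a <loc> child
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            yield loc.text.strip()
```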