Source Parsing
The darc.parse module provides auxiliary functions to read robots.txt, sitemaps and HTML documents. It also contains utility functions to check whether the proxy type, hostname and content type are in any of the black and white lists.
- darc.parse._check(temp_list)
Check hostname and proxy type of links.
- Parameters
temp_list (List[darc.link.Link]) – List of links to be checked.
- Returns
List of links that match the requirements.
- Return type
List[darc.link.Link]
Note
If CHECK_NG is True, the function will directly call _check_ng() instead.
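As a rough illustration of the kind of filtering described above, the sketch below keeps only links whose hostname and proxy type pass a white list and a black list. The `Link` dataclass, the list contents, the `FALLBACK` verdict and the precedence rule (white list first, then black list) are all assumptions made for this example, not darc's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Link:
    """Hypothetical stand-in for darc.link.Link."""
    url: str
    host: str
    proxy: str


# Hypothetical lists of regular expressions; a real deployment would
# configure its own.
HOST_WHITE_LIST = [r'.*\.onion']
HOST_BLACK_LIST = [r'.*\.example\.com']
PROXY_WHITE_LIST = [r'tor']
PROXY_BLACK_LIST = [r'null']
FALLBACK = True  # assumed verdict when neither list matches


def allowed(value: str, white: Sequence[str], black: Sequence[str]) -> bool:
    """White list wins, then black list, otherwise the fallback verdict."""
    if any(re.fullmatch(p, value, re.IGNORECASE) for p in white):
        return True
    if any(re.fullmatch(p, value, re.IGNORECASE) for p in black):
        return False
    return FALLBACK


def check_links(links: List[Link]) -> List[Link]:
    """Keep links whose hostname and proxy type are both allowed."""
    return [
        link for link in links
        if allowed(link.host, HOST_WHITE_LIST, HOST_BLACK_LIST)
        and allowed(link.proxy, PROXY_WHITE_LIST, PROXY_BLACK_LIST)
    ]
```

With these example lists, a `.onion` link fetched through `tor` is kept, while any link whose proxy type is `null` is dropped regardless of its hostname.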
- darc.parse._check_ng(temp_list)
Check content type of links through HEAD requests.
- Parameters
temp_list (List[darc.link.Link]) – List of links to be checked.
- Returns
List of links that match the requirements.
- Return type
List[darc.link.Link]
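A minimal sketch of a HEAD-based content type filter is shown below. It works on plain URL strings rather than Link objects, and the `head` callable, the `ALLOWED_MIME` policy and the decision to drop links whose request fails are assumptions for illustration only.

```python
from typing import Callable, Iterable, List, Mapping

# Hypothetical policy: only keep HTML-like documents.
ALLOWED_MIME = ('text/html', 'application/xhtml+xml')


def check_by_content_type(
    urls: Iterable[str],
    head: Callable[[str], Mapping[str, str]],
) -> List[str]:
    """Keep URLs whose HEAD response reports an allowed content type.

    ``head`` is a hypothetical callable that issues a HEAD request and
    returns the response headers. With requests it could be
    ``lambda url: requests.head(url, timeout=10).headers``.
    """
    kept = []
    for url in urls:
        try:
            headers = head(url)
        except OSError:
            # Assumption: a link that cannot be reached is dropped.
            continue
        raw = headers.get('Content-Type', '')
        # Strip parameters such as "; charset=utf-8" before comparing.
        mime = raw.split(';', 1)[0].strip().casefold()
        if mime in ALLOWED_MIME:
            kept.append(url)
    return kept
```

Passing the request function in as a parameter keeps the filter independent of any particular HTTP client, so it can be exercised with canned headers.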
- darc.parse.check_robots(link)
Check if link is allowed in robots.txt.
- Parameters
link (darc.link.Link) – The link object to be checked.
- Returns
Whether link is allowed in robots.txt.
- Return type
bool
Note
For the root path of a URL, the function always returns True.
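The behaviour described above can be approximated with the standard library's `urllib.robotparser`. In this sketch the function name, the `robots_txt` string argument and the default user agent are assumptions; how darc obtains and caches robots.txt is not covered here.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, robots_txt: str, user_agent: str = '*') -> bool:
    """Return True if ``url`` may be fetched according to ``robots_txt``."""
    # Mirror the note above: the root path is always treated as allowed.
    if urlsplit(url).path in ('', '/'):
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with a robots.txt that contains `Disallow: /private/`, a URL under `/private/` is rejected while the site root is still accepted.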
- darc.parse.extract_links(link, html, check=False)
Extract links from HTML document.
- Parameters
link (darc.link.Link) – Original link of the HTML document.
check (bool) – Whether to perform checks on extracted links; defaults to CHECK.
- Returns
List of extracted links.
- Return type
List[darc.link.Link]
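To illustrate the idea of link extraction, here is a minimal sketch built on the standard library's `html.parser` instead of whatever HTML parser darc itself uses. It returns plain URL strings rather than Link objects, and the function name, the restriction to `<a href>` tags and the de-duplication step are assumptions.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def extract_urls(base_url: str, html: str) -> List[str]:
    """Return absolute URLs of all anchors in ``html``, in document order."""
    collector = _AnchorCollector()
    collector.feed(html)
    seen = set()
    urls = []
    for href in collector.hrefs:
        # Resolve relative references against the document's own URL.
        absolute = urljoin(base_url, href)
        if absolute not in seen:  # assumption: duplicates are dropped
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

Resolving each `href` against the original link is what allows relative references such as `page.html` to be followed later.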
- darc.parse.get_content_type(response)
Get content type from response.
- Parameters
response (requests.Response) – Response object.
- Returns
The content type from response.
- Return type
str
Note
If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
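The fallback described in the note can be sketched as follows. To stay self-contained, the example replaces magic with a tiny hypothetical `sniff` helper that recognises only a few file signatures. Assuming the python-magic package is the library meant, `magic.from_buffer(content, mime=True)` could take its place. The function names, the header normalisation and the `application/octet-stream` default are likewise assumptions.

```python
from typing import Mapping

# A few well-known file signatures, standing in for libmagic here.
_SIGNATURES = (
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'%PDF-', 'application/pdf'),
    (b'\xff\xd8\xff', 'image/jpeg'),
)


def sniff(content: bytes) -> str:
    """Guess a MIME type from the first bytes of ``content``."""
    for signature, mime in _SIGNATURES:
        if content.startswith(signature):
            return mime
    head = content[:512].lstrip().lower()
    if head.startswith((b'<!doctype html', b'<html')):
        return 'text/html'
    return 'application/octet-stream'  # assumed default


def content_type_of(headers: Mapping[str, str], content: bytes) -> str:
    """Prefer the Content-Type header; fall back to content sniffing."""
    raw = headers.get('Content-Type')
    if raw:
        # Drop parameters such as "; charset=utf-8" and normalise case.
        return raw.split(';', 1)[0].strip().casefold()
    return sniff(content)
```

With a `requests.Response` object, the call would be `content_type_of(response.headers, response.content)`.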