Source Parsing
The darc.parse module provides auxiliary functions
to read robots.txt, sitemaps and HTML documents. It
also contains utility functions to check whether the proxy type,
hostname and content type are in any of the black and white
lists.
-
darc.parse._check(temp_list)
Check hostname and proxy type of links.
- Parameters
temp_list (List[darc.link.Link]) – List of links to be checked.
- Returns
List of links that match the requirements.
- Return type
List[darc.link.Link]
Note
If CHECK_NG is True, the function will directly call _check_ng() instead.
-
darc.parse._check_ng(temp_list)
Check content type of links through HEAD requests.
- Parameters
temp_list (List[darc.link.Link]) – List of links to be checked.
- Returns
List of links that match the requirements.
- Return type
List[darc.link.Link]
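The HEAD-based check can be pictured with the minimal sketch below. It is not the darc implementation: the function name filter_by_content_type, the injected head callable and the default allowed prefix are all assumptions made for illustration, and plain URL strings stand in for darc.link.Link objects.

```python
from typing import Callable, Iterable, List, Mapping


def filter_by_content_type(
    urls: Iterable[str],
    head: Callable[[str], Mapping[str, str]],
    allowed_prefixes: Iterable[str] = ("text/html",),
) -> List[str]:
    """Keep URLs whose HEAD response headers report an allowed content type.

    ``head`` is a hypothetical callable that performs the HEAD request and
    returns the response headers; it is injected so the sketch runs offline.
    """
    allowed = tuple(allowed_prefixes)
    kept = []
    for url in urls:
        headers = head(url)
        # Strip parameters such as "; charset=utf-8" before comparing.
        content_type = headers.get("Content-Type", "").split(";", 1)[0].strip().lower()
        if content_type.startswith(allowed):
            kept.append(url)
    return kept
```

With the requests library, head could be something like lambda url: requests.head(url, timeout=10).headers. Injecting it keeps the filtering logic separate from the network access.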
-
darc.parse.check_robots(link)
Check if link is allowed in robots.txt.
- Parameters
link (darc.link.Link) – The link object to be checked.
- Returns
Whether link is allowed in robots.txt.
- Return type
Note
For the root path of a URL, the function always returns True.
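A robots.txt check of this kind can be sketched with the standard library alone. The helper below is an illustration under assumptions, not darc's code: is_allowed is a hypothetical name, the robots.txt content is passed in as text rather than loaded by the function, and the default user agent is arbitrary.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def is_allowed(url: str, robots_txt: str, user_agent: str = "*") -> bool:
    """Return whether ``url`` may be fetched according to ``robots_txt``."""
    # Mirror the note above: the root path is always treated as allowed.
    if urlsplit(url).path in ("", "/"):
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Handling the root path before consulting the parser means even a blanket Disallow: / rule does not block the entry page of a site.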
-
darc.parse.extract_links(link, html, check=False)
Extract links from HTML document.
- Parameters
link (darc.link.Link) – Original link of the HTML document.
check (bool) – Whether to perform checks on extracted links; defaults to CHECK.
- Returns
List of extracted links.
- Return type
List[darc.link.Link]
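Link extraction can be illustrated with the standard library HTML parser. This is only a sketch: the names _AnchorCollector and extract_hrefs are hypothetical, plain strings stand in for darc.link.Link objects, and the sketch assumes that only the href attribute of anchor tags matters and that relative references are resolved against the document URL.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the raw href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(base_url: str, html: str) -> List[str]:
    """Return absolute URLs for all anchors found in ``html``."""
    collector = _AnchorCollector()
    collector.feed(html)
    # Resolve relative references against the URL of the document itself.
    return [urljoin(base_url, href) for href in collector.hrefs]
```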
-
darc.parse.get_content_type(response)
Get content type from response.
- Parameters
response (requests.Response) – Response object.
- Returns
The content type from response.
- Return type
Note
If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
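The header-first, detection-second behaviour in the note can be sketched as follows. Everything here is an assumption for illustration: content_type_of and sniff_magic_bytes are hypothetical names, the tiny byte-signature detector is only a stand-in for a real detector such as magic, and stripping parameters like charset from the header is a choice of this sketch, not documented behaviour.

```python
from typing import Callable, Mapping


def sniff_magic_bytes(body: bytes) -> str:
    """Tiny stand-in for a real content detector (hypothetical helper)."""
    if body.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if body.startswith(b"%PDF-"):
        return "application/pdf"
    if body.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def content_type_of(
    headers: Mapping[str, str],
    body: bytes,
    sniff: Callable[[bytes], str] = sniff_magic_bytes,
) -> str:
    """Prefer the Content-Type header; fall back to sniffing the body."""
    header = headers.get("Content-Type")
    if header:
        # Drop parameters such as "; charset=utf-8" (an assumption of this sketch).
        return header.split(";", 1)[0].strip().lower()
    return sniff(body)
```

With the python-magic package installed, a callable such as lambda body: magic.from_buffer(body, mime=True) could be passed as sniff instead of the stand-in.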
-
darc.parse.get_sitemap(link, text, host=None)
Fetch links to other sitemaps from a sitemap.
- Parameters
link (darc.link.Link) – Original link to the sitemap.
text (str) – Content of the sitemap.
host (Optional[str]) – Hostname of the URL to the sitemap; the value may differ from the one in link.
- Returns
List of links to sitemaps.
- Return type
List[darc.link.Link]
Note
As specified in the sitemap protocol, a sitemap may contain links to other sitemaps.
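In the sitemap protocol, nested sitemaps are listed in a sitemapindex document as sitemap/loc entries. The sketch below shows one way to pull them out with the standard library; nested_sitemaps is a hypothetical name, plain strings stand in for darc.link.Link objects, and resolving each entry against the sitemap's own URL is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urljoin

# XML namespace defined by the sitemap protocol (sitemaps.org, version 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def nested_sitemaps(base_url: str, text: str) -> List[str]:
    """Return the sitemap URLs listed in a <sitemapindex> document."""
    root = ET.fromstring(text)
    links = []
    # Only <sitemap><loc> entries point at other sitemaps.
    for loc in root.findall(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            links.append(urljoin(base_url, loc.text.strip()))
    return links
```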
-
darc.parse.match_host(host)
Check if the hostname is in the black list.
Note
If host is None, then it will always return True.
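A black list check of this shape can be sketched as below. The sketch is deliberately simplified and rests on assumptions: host_in_black_list is a hypothetical name, the black list is passed in as regular expression patterns rather than read from configuration, and white list handling is left out entirely.

```python
import re
from typing import Iterable, Optional


def host_in_black_list(host: Optional[str], black_list: Iterable[str]) -> bool:
    """Return True if ``host`` matches any pattern in ``black_list``."""
    # A missing hostname is treated as black-listed, as the note above describes.
    if host is None:
        return True
    return any(re.fullmatch(pattern, host, re.IGNORECASE) for pattern in black_list)
```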
-
darc.parse.match_proxy(proxy)
Check if the proxy type is in the black list.
Note
If proxy is script, then it will always return True.
-
darc.parse.read_robots(link, text, host=None)
Read robots.txt to fetch links to sitemaps.
- Parameters
link (darc.link.Link) – Original link to robots.txt.
text (str) – Content of robots.txt.
host (Optional[str]) – Hostname of the URL to robots.txt; the value may differ from the one in link.
- Returns
List of links to sitemaps.
- Return type
List[darc.link.Link]
Note
If the link to the sitemap is not specified in robots.txt, the fallback link /sitemap.xml will be used.
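The fallback described in the note can be sketched with urllib.robotparser, whose site_maps() method (Python 3.8 and later) returns the declared Sitemap entries. The function name sitemaps_from_robots is hypothetical and plain strings stand in for darc.link.Link objects; this is an illustration rather than darc's implementation.

```python
from typing import List
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_url: str, text: str) -> List[str]:
    """Return sitemap URLs declared in robots.txt, or the /sitemap.xml fallback."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    # site_maps() returns None when robots.txt has no Sitemap line.
    declared = parser.site_maps()
    if declared:
        return [urljoin(robots_url, url) for url in declared]
    # Fall back to the conventional location at the root of the same host.
    return [urljoin(robots_url, "/sitemap.xml")]
```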
-
darc.parse.read_sitemap(link, text, check=False)
Read sitemap.
- Parameters
link (darc.link.Link) – Original link to the sitemap.
text (str) – Content of the sitemap.
check (bool) – Whether to perform checks on extracted links; defaults to CHECK.
- Returns
List of links extracted.
- Return type
Iterator[darc.link.Link]
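Since the return type is an iterator, reading a sitemap fits naturally into a generator. The sketch below yields the url/loc entries of a urlset document; iter_sitemap_urls is a hypothetical name, plain strings stand in for darc.link.Link objects, and no checks are applied to the extracted links.

```python
import xml.etree.ElementTree as ET
from typing import Iterator
from urllib.parse import urljoin

# XML namespace defined by the sitemap protocol (sitemaps.org, version 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_sitemap_urls(base_url: str, text: str) -> Iterator[str]:
    """Lazily yield the page URLs listed in a <urlset> sitemap."""
    root = ET.fromstring(text)
    for loc in root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            # Resolve against the sitemap URL in case an entry is relative.
            yield urljoin(base_url, loc.text.strip())
```

Callers that need a list can wrap the call in list(); otherwise the links are produced one at a time as they are consumed.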