No Proxy
The darc.proxy.null module contains auxiliary functions for managing and
processing normal websites that are accessed without a proxy.
-
darc.proxy.null.fetch_sitemap(link)
Fetch sitemap.
The function will first fetch robots.txt, then fetch the sitemaps accordingly.
- Parameters
link (darc.link.Link) – Link object to fetch for its sitemaps.
- Returns
Contents of robots.txt and sitemaps.
See also
darc.parse.get_sitemap()
-
darc.proxy.null.get_sitemap(link, text, host=None)
Fetch links to other sitemaps from a sitemap.
- Parameters
link (darc.link.Link) – Original link to the sitemap.
text (str) – Content of the sitemap.
host (Optional[str]) – Hostname of the URL to the sitemap; the value may differ from the one in link.
- Returns
List of links to sitemaps.
- Return type
List[darc.link.Link]
Note
As specified in the sitemap protocol, a sitemap may contain links to other sitemaps.
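The note above refers to sitemap index files, in which each `<sitemap>` element carries a `<loc>` pointing at another sitemap. The following is a minimal sketch of extracting those nested links with the standard library. It is an illustration only, not darc's implementation: the function name `nested_sitemaps` is hypothetical, and it works on plain URL strings rather than darc.link.Link objects.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urljoin

# XML namespace used by the sitemap protocol (sitemaps.org)
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def nested_sitemaps(sitemap_url: str, text: str) -> List[str]:
    """Return the URLs of sitemaps referenced by a sitemap index (hypothetical helper)."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return []  # not well-formed XML, nothing to extract

    urls = []
    # a sitemap index lists children as <sitemap><loc>...</loc></sitemap>
    for loc in root.findall(f'{SITEMAP_NS}sitemap/{SITEMAP_NS}loc'):
        if loc.text and loc.text.strip():
            # resolve relative references against the sitemap's own URL
            urls.append(urljoin(sitemap_url, loc.text.strip()))
    return urls
```

A regular sitemap (a `<urlset>` document) contains no `<sitemap>` children, so the sketch returns an empty list for it.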
-
darc.proxy.null.have_robots(link)
Check if robots.txt already exists.
- Parameters
link (darc.link.Link) – Link object to check if robots.txt already exists.
- Returns
If robots.txt exists, return the path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt.
If not, return None.
- Return type
Optional[str]
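The path layout `<root>/<proxy>/<scheme>/<hostname>/robots.txt` and the "path or None" return convention can be sketched as below. This is an assumption-laden illustration rather than darc's code: `robots_path` and `find_robots` are hypothetical names, and `root` and `proxy` are passed in explicitly because the source does not say where those values come from.

```python
import os
from typing import Optional
from urllib.parse import urlparse


def robots_path(root: str, proxy: str, url: str) -> str:
    """Build <root>/<proxy>/<scheme>/<hostname>/robots.txt for a URL (hypothetical helper)."""
    parts = urlparse(url)
    if not parts.hostname:
        raise ValueError(f'URL has no hostname: {url!r}')
    return os.path.join(root, proxy, parts.scheme, parts.hostname, 'robots.txt')


def find_robots(root: str, proxy: str, url: str) -> Optional[str]:
    """Return the stored robots.txt path if the file exists, else None."""
    path = robots_path(root, proxy, url)
    return path if os.path.isfile(path) else None
```

Returning the path instead of a bare boolean lets the caller open the cached file directly without rebuilding the path.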
-
darc.proxy.null.have_sitemap(link)
Check if sitemap already exists.
- Parameters
link (darc.link.Link) – Link object to check if sitemap already exists.
- Returns
If sitemap exists, return the path to the sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.
If not, return None.
- Return type
Optional[str]
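A host may publish several sitemaps, which is presumably why the file name carries a `<hash>` component. The sketch below shows one way such a name could be derived. The source does not state which hash is used, so SHA-256 over the full sitemap URL is purely an assumption here, and `sitemap_path` is a hypothetical name.

```python
import hashlib
import os
from urllib.parse import urlparse


def sitemap_path(root: str, proxy: str, url: str) -> str:
    """Build <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml (hypothetical helper)."""
    parts = urlparse(url)
    if not parts.hostname:
        raise ValueError(f'URL has no hostname: {url!r}')
    # ASSUMPTION: <hash> is the SHA-256 hex digest of the sitemap URL
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()
    return os.path.join(root, proxy, parts.scheme, parts.hostname, f'sitemap_{digest}.xml')
```

With this scheme, two sitemaps on the same host share a directory but get distinct, reproducible file names.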
-
darc.proxy.null.read_robots(link, text, host=None)
Read robots.txt to fetch links to sitemaps.
- Parameters
link (darc.link.Link) – Original link to robots.txt.
text (str) – Content of robots.txt.
host (Optional[str]) – Hostname of the URL to robots.txt; the value may differ from the one in link.
- Returns
List of links to sitemaps.
- Return type
List[darc.link.Link]
Note
If the link to sitemap is not specified in robots.txt, the fallback link /sitemap.xml will be used.
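The fallback rule in the note can be illustrated with a small parser for `Sitemap:` lines. This is a minimal sketch under stated assumptions, not darc's implementation: `sitemaps_from_robots` is a hypothetical name, and it returns plain URL strings instead of darc.link.Link objects.

```python
from typing import List
from urllib.parse import urljoin


def sitemaps_from_robots(robots_url: str, text: str) -> List[str]:
    """Collect sitemap URLs from robots.txt content, with a /sitemap.xml fallback."""
    urls = []
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(':')
        # field names in robots.txt are matched case-insensitively
        if sep and key.strip().lower() == 'sitemap' and value.strip():
            urls.append(urljoin(robots_url, value.strip()))
    if not urls:
        # no sitemap declared: fall back to /sitemap.xml on the same host
        urls.append(urljoin(robots_url, '/sitemap.xml'))
    return urls
```

The standard library's `urllib.robotparser.RobotFileParser.site_maps()` (Python 3.8+) offers similar extraction, but it returns None instead of applying a fallback.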
-
darc.proxy.null.read_sitemap(link, text, check=False)
Read sitemap.
- Parameters
link (darc.link.Link) – Original link to the sitemap.
text (str) – Content of the sitemap.
check (bool) – Whether to perform checks on extracted links; defaults to CHECK.
- Returns
List of links extracted.
- Return type
List[darc.link.Link]
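Extracting page links from a sitemap, with an optional check step, might look like the following sketch. The source does not say what the checks involve, so the scheme and hostname test below is only a stand-in assumption. `page_links` is a hypothetical name, and the result is a list of URL strings rather than darc.link.Link objects.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urljoin, urlparse

# XML namespace used by the sitemap protocol (sitemaps.org)
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def page_links(sitemap_url: str, text: str, check: bool = False) -> List[str]:
    """Return the <url><loc> entries of a sitemap, optionally filtered."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return []  # not well-formed XML

    links = []
    for loc in root.findall(f'{SITEMAP_NS}url/{SITEMAP_NS}loc'):
        if not (loc.text and loc.text.strip()):
            continue
        url = urljoin(sitemap_url, loc.text.strip())
        if check:
            # ASSUMPTION: a "check" keeps only http(s) URLs that have a hostname
            parts = urlparse(url)
            if parts.scheme not in ('http', 'https') or not parts.hostname:
                continue
        links.append(url)
    return links
```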
-
darc.proxy.null.save_invalid(link)
Save link with invalid scheme.
The function will save a link with an invalid scheme to the file defined in PATH.
- Parameters
link (darc.link.Link) – Link object representing the link with invalid scheme.
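When several workers append to one shared file, writes are typically serialized with a lock, and the module does list an I/O lock (LOCK) for this file. The sketch below shows the general pattern only. The function name `save_invalid_link`, its parameters, and the one-URL-per-line format are assumptions for illustration, not darc's actual behaviour.

```python
import contextlib
import os
import threading


def save_invalid_link(url: str, path: str, lock=None) -> None:
    """Append one URL to the file at ``path`` while holding ``lock`` (hypothetical helper)."""
    # fall back to a no-op context manager when no locking is needed
    guard = lock if lock is not None else contextlib.nullcontext()
    with guard:
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        # ASSUMPTION: the file stores one URL per line
        with open(path, 'a', encoding='utf-8') as file:
            print(url, file=file)
```

Passing `threading.Lock()` or a multiprocessing lock serializes the appends. Omitting the lock suits a single-worker run, where `contextlib.nullcontext` adds no overhead.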
-
darc.proxy.null.save_robots(link, text)
Save robots.txt.
- Parameters
link (darc.link.Link) – Link object of robots.txt.
text (str) – Content of robots.txt.
- Returns
Saved path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt.
- Return type
See also
-
darc.proxy.null.save_sitemap(link, text)
Save sitemap.
- Parameters
link (darc.link.Link) – Link object of sitemap.
text (str) – Content of sitemap.
- Returns
Saved path to sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.
- Return type
See also
-
darc.proxy.null.PATH = '{PATH_MISC}/invalid.txt'
Path to the data storage of links with invalid scheme.
See also
-
darc.proxy.null.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]
I/O lock for saving links with invalid scheme to PATH.
See also