No Proxy

The darc.proxy.null module contains the auxiliary functions for managing and processing normal websites that require no proxy.
darc.proxy.null.fetch_sitemap(link, force=False)

Fetch sitemaps. The function first fetches robots.txt, then fetches the sitemaps it references.

- Parameters
  link (darc.link.Link) – Link object to fetch sitemaps for.
  force (bool) – Force refetching the sitemaps.
- Returns
  Contents of robots.txt and the sitemaps.

See also
darc.parse.get_sitemap()
darc.proxy.null.get_sitemap(link, text, host=None)

Fetch links to other sitemaps from a sitemap.

- Parameters
  link (darc.link.Link) – Original link to the sitemap.
  text (str) – Content of the sitemap.
  host (Optional[str]) – Hostname of the URL to the sitemap; the value may differ from that in link.
- Returns
  List of links to sitemaps.
- Return type
  List[darc.link.Link]

Note
As specified in the sitemap protocol, a sitemap may contain links to other sitemaps.
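The sitemap-index case can be illustrated with a small, self-contained sketch; this is not darc's actual implementation, only the element names defined by the sitemap protocol:

```python
import xml.etree.ElementTree as ET

# Namespace declared by the sitemap protocol.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_child_sitemaps(text):
    """Return the <loc> URLs of the child sitemaps in a sitemap index."""
    root = ET.fromstring(text)
    return [loc.text.strip()
            for loc in root.iterfind('sm:sitemap/sm:loc', NS)]

index = '''<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap_a.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap_b.xml</loc></sitemap>
</sitemapindex>'''

extract_child_sitemaps(index)
# → ['https://example.com/sitemap_a.xml', 'https://example.com/sitemap_b.xml']
```

darc's get_sitemap() additionally wraps each URL into a darc.link.Link object; the sketch only shows the extraction step.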
darc.proxy.null.have_robots(link)

Check if robots.txt already exists.

- Parameters
  link (darc.link.Link) – Link object to check for an existing robots.txt.
- Returns
  If robots.txt exists, the path to it, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt; otherwise, None.
- Return type
  Optional[str]
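The documented storage layout can be sketched as follows; darc derives these path components from the Link object itself, so the helper below is purely illustrative:

```python
import os

def robots_path(root, proxy, scheme, hostname):
    """Compose the documented storage path:
    <root>/<proxy>/<scheme>/<hostname>/robots.txt"""
    return os.path.join(root, proxy, scheme, hostname, 'robots.txt')

def have_robots_sketch(root, proxy, scheme, hostname):
    """Return the path if robots.txt already exists there, else None."""
    path = robots_path(root, proxy, scheme, hostname)
    return path if os.path.isfile(path) else None
```

have_sitemap() below follows the same layout, with sitemap_<hash>.xml in place of robots.txt.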
darc.proxy.null.have_sitemap(link)

Check if the sitemap already exists.

- Parameters
  link (darc.link.Link) – Link object to check for an existing sitemap.
- Returns
  If the sitemap exists, the path to it, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml; otherwise, None.
- Return type
  Optional[str]
darc.proxy.null.read_robots(link, text, host=None)

Read robots.txt to fetch links to sitemaps.

- Parameters
  link (darc.link.Link) – Original link to robots.txt.
  text (str) – Content of robots.txt.
  host (Optional[str]) – Hostname of the URL to robots.txt; the value may differ from that in link.
- Returns
  List of links to sitemaps.
- Return type
  List[darc.link.Link]

Note
If no link to a sitemap is specified in robots.txt, the fallback link /sitemap.xml will be used.
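The extraction and fallback behaviour described above can be sketched in a few lines; this is an illustration of the documented semantics, not darc's own code:

```python
from urllib.parse import urljoin

def sitemaps_from_robots(base_url, text):
    """Collect 'Sitemap:' directives from robots.txt content; if none
    are present, fall back to /sitemap.xml as documented."""
    sitemaps = []
    for line in text.splitlines():
        key, _, value = line.partition(':')
        if key.strip().lower() == 'sitemap' and value.strip():
            sitemaps.append(value.strip())
    if not sitemaps:
        # Fallback documented above: /sitemap.xml relative to the host.
        sitemaps.append(urljoin(base_url, '/sitemap.xml'))
    return sitemaps

sitemaps_from_robots('https://example.com/robots.txt', 'User-agent: *')
# → ['https://example.com/sitemap.xml']
```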
darc.proxy.null.read_sitemap(link, text, check=False)

Read a sitemap.

- Parameters
  link (darc.link.Link) – Original link to the sitemap.
  text (str) – Content of the sitemap.
  check (bool) – Whether to perform checks on the extracted links; defaults to CHECK.
- Returns
  List of extracted links.
- Return type
  List[darc.link.Link]
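Extracting page URLs from a regular (non-index) sitemap can be sketched the same way as the sitemap-index case, reading <loc> out of each <url> entry; again this only illustrates the protocol, not darc's implementation:

```python
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_page_links(text):
    """Return the <loc> URLs of all <url> entries in a sitemap."""
    root = ET.fromstring(text)
    return [loc.text.strip()
            for loc in root.iterfind('sm:url/sm:loc', NS)]

sitemap = '''<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>'''

extract_page_links(sitemap)
# → ['https://example.com/', 'https://example.com/about']
```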
darc.proxy.null.save_invalid(link)

Save a link with an invalid scheme. The function saves the link to the file defined in PATH.

- Parameters
  link (darc.link.Link) – Link object representing the link with an invalid scheme.
darc.proxy.null.save_robots(link, text)

Save robots.txt.

- Parameters
  link (darc.link.Link) – Link object of robots.txt.
  text (str) – Content of robots.txt.
- Returns
  Saved path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt.
- Return type
  str
darc.proxy.null.save_sitemap(link, text)

Save a sitemap.

- Parameters
  link (darc.link.Link) – Link object of the sitemap.
  text (str) – Content of the sitemap.
- Returns
  Saved path to the sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.
- Return type
  str
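The sitemap_<hash>.xml naming could be sketched as below. The hashing scheme darc actually uses is not specified in this section, so hashing the full link URL with SHA-256 is only an assumption made for illustration:

```python
import hashlib

def sitemap_filename(url):
    # ASSUMPTION: <hash> is a SHA-256 digest of the full URL; the
    # real scheme used by darc may differ.
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()
    return f'sitemap_{digest}.xml'

sitemap_filename('https://example.com/sitemap.xml')
# a name of the form 'sitemap_<64 hex chars>.xml'
```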
darc.proxy.null.PATH = '{PATH_MISC}/invalid.txt'

Path to the data storage of links with invalid schemes.
darc.proxy.null.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]

I/O lock for saving links with invalid schemes to PATH.