No Proxy

The darc.proxy.null module contains auxiliary functions for managing and processing normal websites that require no proxy.

darc.proxy.null.fetch_sitemap(link, force=False)[source]

Fetch sitemap.

The function first fetches robots.txt, then fetches the sitemaps it references.

Parameters
  • link (Link) – Link object to fetch for its sitemaps.

  • force (bool) – Force a refetch of its sitemaps.

Return type

None

Returns

Contents of robots.txt and sitemaps.
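
The robots.txt-then-sitemaps flow described above can be sketched with an injected fetcher. This is an illustration only, not darc's actual implementation: the real function builds Link objects, caches results on disk, and honours force; the crawl_sitemaps helper and the fetch callable here are hypothetical.

```python
def crawl_sitemaps(base_url, fetch):
    """Fetch robots.txt, then every sitemap it points to.

    `fetch` is any callable mapping a URL to its text content.
    """
    robots = fetch(base_url + '/robots.txt')
    # Collect the URLs from 'Sitemap:' lines in robots.txt.
    sitemap_urls = [line.split(':', 1)[1].strip()
                    for line in robots.splitlines()
                    if line.lower().startswith('sitemap:')]
    # Fetch each referenced sitemap in turn.
    return {url: fetch(url) for url in sitemap_urls}

# Stub "network" as a plain dict lookup for demonstration.
pages = {
    'https://www.example.com/robots.txt': 'Sitemap: https://www.example.com/sitemap.xml',
    'https://www.example.com/sitemap.xml': '<urlset/>',
}
result = crawl_sitemaps('https://www.example.com', pages.get)
print(sorted(result))
# ['https://www.example.com/sitemap.xml']
```

Injecting the fetcher keeps the control flow testable without touching the network; darc itself performs real HTTP requests at this step.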

darc.proxy.null.get_sitemap(link, text, host=None)[source]

Fetch links to other sitemaps from a sitemap.

Parameters
  • link (Link) – Original link to the sitemap.

  • text (str) – Content of the sitemap.

  • host (Optional[str]) – Hostname of the URL to the sitemap; the value may not be the same as in link.

Return type

List[Link]

Returns

List of links to sitemaps.

Note

As specified in the sitemap protocol, a sitemap may contain links to other sitemaps.

https://www.sitemaps.org/protocol.html#index
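
In a sitemap index file, the child sitemaps appear as <sitemap><loc> entries under the protocol's XML namespace. The following self-contained sketch shows how such links can be extracted with the standard library; it is not darc's implementation (darc returns Link objects rather than plain strings), but the element names follow the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_sitemap_links(text):
    """Extract links to child sitemaps from a sitemap index document."""
    root = ET.fromstring(text)
    # A <sitemapindex> contains <sitemap> entries, each with a <loc> URL.
    return [loc.text.strip() for loc in root.iterfind('sm:sitemap/sm:loc', NS)]

index = '''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap2.xml</loc></sitemap>
</sitemapindex>'''

print(extract_sitemap_links(index))
# ['https://www.example.com/sitemap1.xml', 'https://www.example.com/sitemap2.xml']
```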

darc.proxy.null.have_robots(link)[source]

Check if robots.txt already exists.

Parameters

link (Link) – Link object to check if robots.txt already exists.

Return type

Optional[str]

Returns

  • If robots.txt exists, return the path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt.

  • If not, return None.

darc.proxy.null.have_sitemap(link)[source]

Check if sitemap already exists.

Parameters

link (Link) – Link object to check if sitemap already exists.

Return type

Optional[str]

Returns

  • If sitemap exists, return the path to the sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.

  • If not, return None.
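
The paths returned by have_robots() and have_sitemap() follow the fixed layout <root>/<proxy>/<scheme>/<hostname>/…. A minimal sketch of how such paths could be derived is below; the helper names are hypothetical, and the choice of SHA-256 for <hash> is an assumption for illustration (darc may derive the hash differently).

```python
import hashlib
import os.path

def robots_path(root, proxy, scheme, hostname):
    """Candidate cache path: <root>/<proxy>/<scheme>/<hostname>/robots.txt."""
    return os.path.join(root, proxy, scheme, hostname, 'robots.txt')

def sitemap_path(root, proxy, scheme, hostname, url):
    """Candidate cache path: <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml."""
    # Hashing the sitemap URL keeps multiple sitemaps per host distinct.
    digest = hashlib.sha256(url.encode()).hexdigest()
    return os.path.join(root, proxy, scheme, hostname, f'sitemap_{digest}.xml')

print(robots_path('data', 'null', 'https', 'www.example.com'))
# on POSIX: data/null/https/www.example.com/robots.txt
```

Checking os.path.isfile() on such a candidate path, and returning None when it is absent, matches the documented return behaviour of both functions.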

darc.proxy.null.read_robots(link, text, host=None)[source]

Read robots.txt to fetch links to sitemaps.

Parameters
  • link (Link) – Original link to robots.txt.

  • text (str) – Content of robots.txt.

  • host (Optional[str]) – Hostname of the URL to robots.txt; the value may not be the same as in link.

Return type

List[Link]

Returns

List of links to sitemaps.

Note

If no link to a sitemap is specified in robots.txt, the fallback link /sitemap.xml will be used.

https://www.sitemaps.org/protocol.html#submit_robots
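
The fallback behaviour described in the note can be illustrated with a self-contained sketch. This is a simplified stand-in, not darc's code: the function name is hypothetical, and darc returns Link objects rather than URL strings.

```python
from urllib.parse import urljoin

def sitemaps_from_robots(base_url, text):
    """Collect 'Sitemap:' entries from robots.txt; fall back to /sitemap.xml."""
    links = []
    for line in text.splitlines():
        # Split only at the first colon so the URL's own 'https:' survives.
        key, _, value = line.partition(':')
        if key.strip().lower() == 'sitemap' and value.strip():
            links.append(value.strip())
    # /sitemap.xml is the conventional default location per the protocol.
    return links or [urljoin(base_url, '/sitemap.xml')]

robots = 'User-agent: *\nDisallow: /private/\nSitemap: https://www.example.com/sitemap-news.xml'
print(sitemaps_from_robots('https://www.example.com', robots))
# ['https://www.example.com/sitemap-news.xml']
print(sitemaps_from_robots('https://www.example.com', 'User-agent: *'))
# ['https://www.example.com/sitemap.xml']
```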

darc.proxy.null.read_sitemap(link, text, check=False)[source]

Read sitemap.

Parameters
  • link (Link) – Original link to the sitemap.

  • text (str) – Content of the sitemap.

  • check (bool) – Whether to perform checks on the extracted links; defaults to CHECK.

Return type

List[Link]

Returns

List of links extracted.
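
In a regular (non-index) sitemap, the page links live in <url><loc> entries. A hedged sketch of the extraction using only the standard library follows; it is not darc's implementation, which additionally builds Link objects and may validate each URL when check is enabled.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_urls(text):
    """Extract page URLs from a sitemap <urlset> document."""
    root = ET.fromstring(text)
    return [loc.text.strip() for loc in root.iterfind('sm:url/sm:loc', NS)]

sitemap = '''<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2020-01-01</lastmod></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>'''

print(extract_urls(sitemap))
# ['https://www.example.com/', 'https://www.example.com/about']
```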

darc.proxy.null.save_invalid(link)[source]

Save a link with an invalid scheme.

The function saves the link with an invalid scheme to the file defined by PATH.

Parameters

link (Link) – Link object representing the link with an invalid scheme.

Return type

None

darc.proxy.null.save_robots(link, text)[source]

Save robots.txt.

Parameters
  • link (Link) – Link object of robots.txt.

  • text (str) – Content of robots.txt.

Return type

str

Returns

Saved path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt.

darc.proxy.null.save_sitemap(link, text)[source]

Save sitemap.

Parameters
  • link (Link) – Link object of sitemap.

  • text (str) – Content of sitemap.

Return type

str

Returns

Saved path to sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.

darc.proxy.null.PATH = '{PATH_MISC}/invalid.txt'

Path to the data storage of links with an invalid scheme.

darc.proxy.null.LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]

I/O lock for saving links with an invalid scheme to PATH.
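
LOCK resolves to a multiprocessing lock, a threading lock, or a no-op contextlib.nullcontext depending on how darc's workers run; all three behave as context managers, so the saving code need not care which one it holds. A minimal sketch of the append-under-lock pattern is below; the helper name, file path, and sample URLs are illustrative, not darc's code.

```python
import os
import tempfile
import threading

# Depending on the worker model this could equally be a
# multiprocessing.Lock or a no-op contextlib.nullcontext.
LOCK = threading.Lock()

def save_invalid_link(path, url, lock=LOCK):
    """Append a link with an invalid scheme to the storage file, serialised by the lock."""
    with lock:  # prevent interleaved writes from concurrent workers
        with open(path, 'a', encoding='utf-8') as file:
            print(url, file=file)

# Demonstrate with a temporary file standing in for PATH.
path = os.path.join(tempfile.mkdtemp(), 'invalid.txt')
save_invalid_link(path, 'javascript:void(0)')
save_invalid_link(path, 'data:text/plain,hello')
with open(path, encoding='utf-8') as file:
    print(file.read().splitlines())
# ['javascript:void(0)', 'data:text/plain,hello']
```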