Source Saving

The darc.save module contains the core utilities for managing fetched files and documents.

The data storage under the root path (PATH_DB) typically looks as follows:

data
├── _queue_requests.txt
├── _queue_requests.txt.tmp
├── _queue_selenium.txt
├── _queue_selenium.txt.tmp
├── api
│   └── <proxy>
│       └── <scheme>
│           └── <hostname>
│               ├── new_host
│               │   └── <hash>_<timestamp>.json
│               ├── requests
│               │   └── <hash>_<timestamp>.json
│               └── selenium
│                   └── <hash>_<timestamp>.json
├── link.csv
├── misc
│   ├── bitcoin.txt
│   ├── data
│   │   └── <hash>_<timestamp>.<ext>
│   ├── ed2k.txt
│   ├── invalid.txt
│   ├── irc.txt
│   ├── magnet.txt
│   └── mail.txt
└── <proxy>
    └── <scheme>
        └── <hostname>
            ├── <hash>_<timestamp>.dat
            ├── <hash>_<timestamp>.json
            ├── <hash>_<timestamp>_raw.html
            ├── <hash>_<timestamp>.html
            ├── <hash>_<timestamp>.png
            ├── robots.txt
            └── sitemap_<hash>.xml

darc.save.has_folder(link)

Check whether the link belongs to a new host.

Parameters

link (darc.link.Link) – Link object to check for a new host.

Returns

  • If link is a new host, return link.base.

  • If not, return None.

Return type

Optional[str]

darc.save.has_html(time, link)

Check whether the link needs to be re-crawled by selenium.

Parameters
  • link (darc.link.Link) – Link object to check for re-crawling by selenium.

  • time (NewType.<locals>.new_type) –

Returns

  • If re-crawling is not needed, return the path to the document, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.

  • If re-crawling is needed, return None.

Return type

Optional[str]

darc.save.has_raw(time, link)

Check whether the link needs to be re-crawled by requests.

Parameters
  • link (darc.link.Link) – Link object to check for re-crawling by requests.

  • time (NewType.<locals>.new_type) –

Returns

  • If re-crawling is not needed, return the path to the document, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html, or <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.

  • If re-crawling is needed, return None.

Return type

Optional[str]
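
The lookup described above can be illustrated with a small sketch. This is not darc's implementation: sketch_find_raw, the base and name arguments, and the colon-free timestamp format are hypothetical. The documentation does not state how time is compared against stored timestamps, so the sketch only locates the most recent candidate file for a given hash.

```python
import glob
import os
import tempfile
from typing import Optional


def sketch_find_raw(base: str, name: str) -> Optional[str]:
    """Hypothetical lookup of a saved requests document for hash ``name``."""
    # Either a raw HTML document or a generic data file may have been saved.
    candidates = glob.glob(os.path.join(base, f'{name}_*_raw.html'))
    candidates += glob.glob(os.path.join(base, f'{name}_*.dat'))
    # Assumption: timestamps sort lexicographically, so max() is the latest.
    return max(candidates) if candidates else None


with tempfile.TemporaryDirectory() as base:
    none_yet = sketch_find_raw(base, 'abc123')
    for filename in ('abc123_20200101T000000_raw.html',
                     'abc123_20200102T000000.dat'):
        with open(os.path.join(base, filename), 'w') as file:
            file.write('placeholder')
    latest = sketch_find_raw(base, 'abc123')
```

Returning None when nothing matches mirrors the Optional[str] return type documented above.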

darc.save.has_robots(link)

Check if robots.txt already exists.

Parameters

link (darc.link.Link) – Link object to check if robots.txt already exists.

Returns

  • If robots.txt exists, return the path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt.

  • If not, return None.

Return type

Optional[str]
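
A minimal sketch of this kind of existence check, under stated assumptions: sketch_has_robots and its base argument (standing in for <root>/<proxy>/<scheme>/<hostname>) are hypothetical names, and the real function takes a darc.link.Link rather than a directory path.

```python
import os
import tempfile
from typing import Optional


def sketch_has_robots(base: str) -> Optional[str]:
    """Return the path to robots.txt under ``base`` if it exists, else None."""
    path = os.path.join(base, 'robots.txt')
    return path if os.path.isfile(path) else None


with tempfile.TemporaryDirectory() as base:
    # Nothing has been saved yet, so the check reports None.
    missing = sketch_has_robots(base)
    with open(os.path.join(base, 'robots.txt'), 'w') as file:
        file.write('User-agent: *\nDisallow:\n')
    # After saving, the check reports the path.
    found = sketch_has_robots(base)
```

A caller can then branch on the result: fetch and save robots.txt when None is returned, or read the existing file otherwise.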

darc.save.has_sitemap(link)

Check if sitemap already exists.

Parameters

link (darc.link.Link) – Link object to check if sitemap already exists.

Returns

  • If sitemap exists, return the path to the sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.

  • If not, return None.

Return type

Optional[str]

darc.save.sanitise(link, time=None, raw=False, data=False, headers=False, screenshot=False)

Sanitise link to path.

Parameters
  • link (darc.link.Link) – Link object to convert to a path.

  • time (datetime) – Timestamp for the path.

  • raw (bool) – If this is a raw HTML document from requests.

  • data (bool) – If this is a generic content type document.

  • headers (bool) – If this is response headers from requests.

  • screenshot (bool) – If this is the screenshot from selenium.

Returns

  • If raw is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.

  • If data is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.

  • If headers is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.

  • If screenshot is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png.

  • If none of the above, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.

Return type

str
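
The suffix selection listed above can be sketched as follows. This is an illustration only, not darc's code: sketch_sanitise, its root/proxy/scheme/hostname/name arguments, the ISO 8601 timestamp format, and the precedence of the flags when several are set are all assumptions.

```python
import os
from datetime import datetime


def sketch_sanitise(root, proxy, scheme, hostname, name, time,
                    raw=False, data=False, headers=False, screenshot=False):
    """Hypothetical stand-in for the path layout documented above."""
    # Assumption: the timestamp is rendered in ISO 8601 form.
    stem = f'{name}_{time.isoformat()}'
    base = os.path.join(root, proxy, scheme, hostname)
    # Assumption: flags are checked in the order they are documented.
    if raw:
        return os.path.join(base, f'{stem}_raw.html')
    if data:
        return os.path.join(base, f'{stem}.dat')
    if headers:
        return os.path.join(base, f'{stem}.json')
    if screenshot:
        return os.path.join(base, f'{stem}.png')
    return os.path.join(base, f'{stem}.html')


ts = datetime(2020, 1, 1, 12, 0, 0)
raw_path = sketch_sanitise('data', 'null', 'https', 'example.com',
                           'abc123', ts, raw=True)
html_path = sketch_sanitise('data', 'null', 'https', 'example.com',
                            'abc123', ts)
```

All five variants share the same directory and <hash>_<timestamp> stem, so the artefacts of one crawl sit side by side.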

darc.save.save_file(time, link, content)

Save a generic content type file.

The function also tries to create a symbolic link between the standard path of the saved file and the relative path given in the URL.

Parameters
  • time (datetime) – Timestamp of generic file.

  • link (darc.link.Link) – Link object of original URL.

  • content (bytes) – Content of generic file.

Returns

Saved path to the generic content type file, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.

Return type

str
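
A rough sketch of saving bytes and attempting a symbolic link, under stated assumptions: sketch_save_file, path and alias are hypothetical names, and the direction of the link (the URL-derived path pointing at the saved file) is a guess, since the description above does not spell it out. Link creation is treated as best effort, matching the "tries to" wording.

```python
import os
import tempfile


def sketch_save_file(path: str, alias: str, content: bytes) -> str:
    """Write ``content`` to ``path`` and try to link ``alias`` to it."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wb') as file:
        file.write(content)
    try:
        # Assumption: the alias lives at the URL-derived relative path.
        os.makedirs(os.path.dirname(alias), exist_ok=True)
        os.symlink(os.path.abspath(path), alias)
    except OSError:
        # Symbolic links may be unsupported or already present; skip quietly.
        pass
    return path


with tempfile.TemporaryDirectory() as root:
    saved = sketch_save_file(
        os.path.join(root, 'abc123_20200101T120000.dat'),
        os.path.join(root, 'files', 'report.pdf'),
        b'%PDF-1.4',
    )
    with open(saved, 'rb') as file:
        round_trip = file.read()
```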

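darc.save.save_headers(time, link, response, session)

Save HTTP response headers.

Parameters
  • time (datetime) – Timestamp of the response.

  • link (darc.link.Link) – Link object of original URL.

  • response – Response object whose headers are saved.

  • session – Session object used for the request.

Returns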

Saved path to response headers, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.

Return type

str

The saved JSON data has the following structure:

{
    "[metadata]": {
        "url": "...",
        "proxy": "...",
        "host": "...",
        "base": "...",
        "name": "..."
    },
    "Timestamp": "...",
    "URL": "...",
    "Method": "GET",
    "Status-Code": "...",
    "Reason": "...",
    "Cookies": {
        "...": "..."
    },
    "Session": {
        "...": "..."
    },
    "Request": {
        "...": "..."
    },
    "Response": {
        "...": "..."
    },
    "History": [
        {"...": "..."}
    ]
}
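
As an illustration of consuming such a record, the sketch below builds a placeholder document with the keys shown above and summarises it. All values, including their types (for example an integer Status-Code), are assumptions made for the example, and summarise is a hypothetical helper, not part of darc.

```python
import json

# Placeholder record following the documented keys; every value is made up.
saved_text = json.dumps({
    '[metadata]': {
        'url': 'https://example.com/',
        'proxy': 'null',
        'host': 'example.com',
        'base': 'data/null/https/example.com',
        'name': 'abc123',
    },
    'Timestamp': '2020-01-01T12:00:00',
    'URL': 'https://example.com/',
    'Method': 'GET',
    'Status-Code': 200,
    'Reason': 'OK',
    'Cookies': {},
    'Session': {},
    'Request': {'User-Agent': 'example'},
    'Response': {'Content-Type': 'text/html'},
    'History': [],
})


def summarise(text: str) -> str:
    """Return a one-line summary of a saved headers record."""
    record = json.loads(text)
    meta = record['[metadata]']
    return (f"{record['Method']} {meta['url']} -> "
            f"{record['Status-Code']} {record['Reason']}")


summary = summarise(saved_text)
```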

darc.save.save_html(time, link, html, raw=False)

Save response.

Parameters
  • time (datetime) – Timestamp of HTML document.

  • link (darc.link.Link) – Link object of original URL.

  • html (Union[str, bytes]) – Content of HTML document.

  • raw (bool) – If the document was fetched by requests.

Returns

Saved path to HTML document.

  • If raw is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.

  • If not, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.

Return type

str

Save the link hash database link.csv.

The CSV file has the following fields:

Parameters

link (darc.link.Link) – Link object to be saved.

darc.save.save_robots(link, text)

Save robots.txt.

Parameters
  • link (darc.link.Link) – Link object of robots.txt.

  • text (str) – Content of robots.txt.

Returns

Saved path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt.

Return type

str

darc.save.save_sitemap(link, text)

Save sitemap.

Parameters
  • link (darc.link.Link) – Link object of sitemap.

  • text (str) – Content of sitemap.

Returns

Saved path to sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.

Return type

str

darc.save._SAVE_LOCK: multiprocessing.Lock

I/O lock for saving link hash database link.csv.
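
A minimal sketch of how a lock of this kind can serialise appends to a shared CSV file. The names save_lock and sketch_append_row and the two-column rows are hypothetical; the actual columns of link.csv are not reproduced here.

```python
import csv
import multiprocessing
import os
import tempfile

# Stand-in for darc.save._SAVE_LOCK.
save_lock = multiprocessing.Lock()


def sketch_append_row(path: str, row: list) -> None:
    """Append one row to the CSV file while holding the I/O lock."""
    with save_lock:
        # newline='' lets the csv module control line endings.
        with open(path, 'a', newline='') as file:
            csv.writer(file).writerow(row)


with tempfile.TemporaryDirectory() as root:
    csv_path = os.path.join(root, 'link.csv')
    sketch_append_row(csv_path, ['abc123', 'https://example.com/'])
    sketch_append_row(csv_path, ['def456', 'https://example.org/'])
    with open(csv_path, newline='') as file:
        rows = list(csv.reader(file))
```

Holding the lock for the whole open-write-close sequence keeps rows written by concurrent worker processes from interleaving.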