Source Saving

The darc.save module contains the core utilities for managing fetched files and documents.

The data storage under the root path (PATH_DB) is typically laid out as follows:

data
├── api
│   └── <date>
│       └── <proxy>
│           └── <scheme>
│               └── <hostname>
│                   ├── new_host
│                   │   └── <hash>_<timestamp>.json
│                   ├── requests
│                   │   └── <hash>_<timestamp>.json
│                   └── selenium
│                       └── <hash>_<timestamp>.json
├── link.csv
├── misc
│   ├── bitcoin.txt
│   ├── data
│   │   └── <hash>_<timestamp>.<ext>
│   ├── ed2k.txt
│   ├── invalid.txt
│   ├── irc.txt
│   ├── magnet.txt
│   └── mail.txt
└── <proxy>
    └── <scheme>
        └── <hostname>
            ├── <hash>_<timestamp>.json
            ├── robots.txt
            └── sitemap_<hash>.xml
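The api branch of the tree above can be sketched as a small path builder. This is only an illustration of the layout: the helper name api_record_path and the string arguments are hypothetical, not part of darc, and the actual formats of <date>, <hash> and <timestamp> are not specified here, so they are passed through unchanged.

```python
from pathlib import PurePosixPath

# Sub-directories shown under <hostname> in the api tree above.
API_KINDS = ("new_host", "requests", "selenium")


def api_record_path(root, date, proxy, scheme, hostname, kind, name, timestamp):
    """Build <root>/api/<date>/<proxy>/<scheme>/<hostname>/<kind>/<hash>_<timestamp>.json.

    Hypothetical helper: every argument is a plain string that is used as-is.
    """
    if kind not in API_KINDS:
        raise ValueError(f"unknown record kind: {kind!r}")
    return PurePosixPath(root, "api", date, proxy, scheme, hostname, kind,
                         f"{name}_{timestamp}.json")


# Placeholder values, chosen only to show the resulting shape.
print(api_record_path("data", "<date>", "<proxy>", "<scheme>", "<hostname>",
                      "requests", "<hash>", "<timestamp>"))
```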
darc.save.sanitise(link, time=None, raw=False, data=False, headers=False, screenshot=False)

Sanitise link to path.

Parameters
  • link (Link) – Link object to sanitise into a path.

  • time (datetime) – Timestamp for the path.

  • raw (bool) – Whether the path is for a raw HTML document from requests.

  • data (bool) – Whether the path is for a generic content type document.

  • headers (bool) – Whether the path is for response headers from requests.

  • screenshot (bool) – Whether the path is for a screenshot from selenium.

Return type

str

Returns

  • If raw is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.

  • If data is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.

  • If headers is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.

  • If screenshot is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png.

  • If none of the above, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.
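The mapping from flags to file names listed above can be sketched roughly as follows. This is not the darc implementation: sketch_sanitise is a hypothetical stand-in that takes plain strings instead of a Link object and a datetime, and it assumes that when several flags are set, the first one in the documented order wins.

```python
from pathlib import PurePosixPath


def sketch_sanitise(root, proxy, scheme, hostname, name, timestamp,
                    raw=False, data=False, headers=False, screenshot=False):
    """Mimic the documented flag-to-suffix mapping using plain strings."""
    stem = f"{name}_{timestamp}"
    # Assumption: flags are checked in the order they are documented.
    if raw:
        filename = f"{stem}_raw.html"
    elif data:
        filename = f"{stem}.dat"
    elif headers:
        filename = f"{stem}.json"
    elif screenshot:
        filename = f"{stem}.png"
    else:
        filename = f"{stem}.html"
    return str(PurePosixPath(root, proxy, scheme, hostname, filename))


# Placeholder components, only to show the shape of each result.
for flags in ({}, {"raw": True}, {"headers": True}, {"screenshot": True}):
    print(sketch_sanitise("<root>", "<proxy>", "<scheme>", "<hostname>",
                          "<hash>", "<timestamp>", **flags))
```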

darc.save.save_headers(time, link, response, session)

Save HTTP response headers.

Parameters
  • time (datetime) – Timestamp of response.

  • link (Link) – Link object of response.

  • response (requests.Response) – Response object to be saved.

  • session (requests.Session) – Session object of response.

Return type

str

Returns

Saved path to response headers, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.

The saved JSON data has the following structure:

{
    "[metadata]": {
        "url": "...",
        "proxy": "...",
        "host": "...",
        "base": "...",
        "name": "..."
    },
    "Timestamp": "...",
    "URL": "...",
    "Method": "GET",
    "Status-Code": "...",
    "Reason": "...",
    "Cookies": {
        "...": "..."
    },
    "Session": {
        "...": "..."
    },
    "Request": {
        "...": "..."
    },
    "Response": {
        "...": "..."
    },
    "History": [
        {"...": "..."}
    ]
}
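A record with the structure above can be read back with the standard json module. The sketch below is illustrative only: summarise_headers is a hypothetical helper, the sample values are invented to make the example runnable, and the actual value types (for example whether Status-Code is stored as a string or a number) are not confirmed by this page.

```python
import json


def summarise_headers(record):
    """Pull a few top-level fields out of a saved headers record."""
    meta = record.get("[metadata]", {})
    return {
        "url": record.get("URL"),
        "method": record.get("Method"),
        "status": record.get("Status-Code"),
        "host": meta.get("host"),
        "history_length": len(record.get("History", [])),
    }


# Invented sample following the documented key names.
sample = json.loads("""
{
    "[metadata]": {"url": "http://example.com/", "proxy": "...",
                   "host": "example.com", "base": "...", "name": "..."},
    "Timestamp": "...",
    "URL": "http://example.com/",
    "Method": "GET",
    "Status-Code": "...",
    "Reason": "...",
    "Cookies": {}, "Session": {}, "Request": {}, "Response": {},
    "History": []
}
""")
print(summarise_headers(sample))
```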

Save link hash database link.csv.

The CSV file has the following fields:

Parameters

link (Link) – Link object to be saved.

Return type

None


darc.save._SAVE_LOCK: Union[multiprocessing.Lock, threading.Lock, contextlib.nullcontext]

I/O lock for saving link hash database link.csv.
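The role of such an I/O lock can be sketched as follows. This is a minimal illustration, not darc's code: SAVE_LOCK is a local threading.Lock standing in for darc.save._SAVE_LOCK, append_row is a hypothetical helper, and the row contents are placeholders rather than the real link.csv fields. All three types named in the annotation (multiprocessing.Lock, threading.Lock, contextlib.nullcontext) can be used in a with statement, which is what the sketch relies on.

```python
import csv
import os
import tempfile
import threading

# Stand-in for darc.save._SAVE_LOCK (assumption: a threading.Lock here).
SAVE_LOCK = threading.Lock()


def append_row(path, row):
    """Append one row to a CSV file while holding the I/O lock."""
    with SAVE_LOCK:
        with open(path, "a", newline="") as file:
            csv.writer(file).writerow(row)


# Several threads append to the same file; the lock serialises the writes.
csv_path = os.path.join(tempfile.mkdtemp(), "link.csv")
threads = [threading.Thread(target=append_row, args=(csv_path, [f"row-{i}"]))
           for i in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

with open(csv_path, newline="") as file:
    rows = list(csv.reader(file))
print(len(rows))
```

Holding the lock around the whole open-and-write step keeps rows from different workers from interleaving within the file.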