Source Saving
The darc.save module contains the core utilities for managing fetched files and documents.
The data storage under the root path (PATH_DB) is typically laid out as follows:
data
├── api
│   └── <date>
│       └── <proxy>
│           └── <scheme>
│               └── <hostname>
│                   ├── new_host
│                   │   └── <hash>_<timestamp>.json
│                   ├── requests
│                   │   └── <hash>_<timestamp>.json
│                   └── selenium
│                       └── <hash>_<timestamp>.json
├── link.csv
├── misc
│   ├── bitcoin.txt
│   ├── data
│   │   └── <hash>_<timestamp>.<ext>
│   ├── ed2k.txt
│   ├── invalid.txt
│   ├── irc.txt
│   ├── magnet.txt
│   └── mail.txt
└── <proxy>
    └── <scheme>
        └── <hostname>
            ├── <hash>_<timestamp>.json
            ├── robots.txt
            └── sitemap_<hash>.xml
- darc.save.sanitise(link, time=None, raw=False, data=False, headers=False, screenshot=False)
Sanitise link to path.
- Parameters:
- Return type:
- Returns:
If raw is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.
If data is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.
If headers is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.
If screenshot is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png.
If none of the above, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.
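The suffix rules above can be sketched in a few lines. This is only an illustration, not the darc.save implementation: the function name sketch_sanitise, the explicit root and proxy arguments, the SHA-256 hash, the ISO 8601 timestamp, and the precedence of the flags (taken from the order in which they are listed) are all assumptions.

```python
from datetime import datetime
from hashlib import sha256
from pathlib import PurePosixPath
from urllib.parse import urlparse


def sketch_sanitise(root, proxy, url, time, raw=False, data=False,
                    headers=False, screenshot=False):
    """Hypothetical re-creation of the suffix rules listed above."""
    parsed = urlparse(url)
    # Assumption: the hash is the SHA-256 hex digest of the URL.
    name = sha256(url.encode()).hexdigest()
    # Assumption: the timestamp is rendered as an ISO 8601 string.
    stamp = time.isoformat()
    base = PurePosixPath(root) / proxy / parsed.scheme / parsed.hostname
    # Assumption: flags are checked in the order they are documented.
    if raw:
        suffix = "_raw.html"
    elif data:
        suffix = ".dat"
    elif headers:
        suffix = ".json"
    elif screenshot:
        suffix = ".png"
    else:
        suffix = ".html"
    return str(base / f"{name}_{stamp}{suffix}")


path = sketch_sanitise("data", "null", "https://example.com/index",
                       datetime(2020, 1, 1), raw=True)
```

Here path starts with data/null/https/example.com/ and ends with _raw.html; the other flags only change the suffix.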
- darc.save.save_headers(time, link, response, session)
Save HTTP response headers.
- Parameters:
time (datetime) – Timestamp of response.
link (Link) – Link object of response.
response (requests.Response) – Response object to be saved.
session (requests.Session) – Session object of response.
- Return type:
- Returns:
Saved path to response headers, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.
The saved JSON data has the following structure:
    {
        "[metadata]": {
            "url": "...",
            "proxy": "...",
            "host": "...",
            "base": "...",
            "name": "..."
        },
        "Timestamp": "...",
        "URL": "...",
        "Method": "GET",
        "Status-Code": "...",
        "Reason": "...",
        "Cookies": {
            "...": "..."
        },
        "Session": {
            "...": "..."
        },
        "Request": {
            "...": "..."
        },
        "Response": {
            "...": "..."
        },
        "History": [
            {"...": "..."}
        ]
    }
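A saved headers document of this shape can be read back with the standard json module. The following is a minimal sketch: the helper summarise_headers is hypothetical and not part of darc.save, and every value in the sample record is a made-up placeholder that only mirrors the key names shown above.

```python
import json


def summarise_headers(text):
    """Pull a few top-level fields out of a saved headers document."""
    record = json.loads(text)
    meta = record["[metadata]"]
    return {
        "url": meta["url"],
        "proxy": meta["proxy"],
        "method": record["Method"],
        "status": record["Status-Code"],
        # "History" is a list, so its length counts the recorded entries.
        "history_entries": len(record["History"]),
    }


# Placeholder record: key names follow the documented layout,
# values are invented for illustration only.
sample = json.dumps({
    "[metadata]": {"url": "https://example.com/", "proxy": "null",
                   "host": "example.com", "base": "...", "name": "..."},
    "Timestamp": "...",
    "URL": "https://example.com/",
    "Method": "GET",
    "Status-Code": "200",
    "Reason": "OK",
    "Cookies": {},
    "Session": {},
    "Request": {},
    "Response": {},
    "History": [],
})

summary = summarise_headers(sample)
```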
- darc.save.save_link(link)
Save link hash database link.csv.
The CSV file has the following fields:
proxy type: link.proxy
URL scheme: link.url_parse.scheme
hostname: link.base
link hash: link.name
original URL: link.url
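As a rough sketch of how one such row could be produced with the standard csv module: the helper append_link_row and the SimpleNamespace stand-in for a Link object are hypothetical, and the column order is assumed to follow the field list above. Whether the real link.csv carries a header row or uses a different dialect is not stated here.

```python
import csv
import io
from types import SimpleNamespace
from urllib.parse import urlparse


def append_link_row(stream, link):
    """Write one row using the five documented fields, in listed order."""
    writer = csv.writer(stream)
    writer.writerow([
        link.proxy,             # proxy type
        link.url_parse.scheme,  # URL scheme
        link.base,              # hostname
        link.name,              # link hash
        link.url,               # original URL
    ])


# Stand-in for a Link object; attribute values are placeholders.
url = "https://example.com/index.html"
link = SimpleNamespace(proxy="null", url_parse=urlparse(url),
                       base="example.com", name="0123abcd", url=url)

buffer = io.StringIO()
append_link_row(buffer, link)
```

Writing to an in-memory buffer keeps the example free of side effects; a real caller would pass a file opened in append mode with newline="".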
- darc.save._SAVE_LOCK: multiprocessing.Lock | threading.Lock | contextlib.nullcontext
I/O lock for saving link hash database link.csv.
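The three possible types share the context-manager protocol, so calling code can use the lock the same way regardless of the concurrency mode. A minimal sketch of that idea follows; the function choose_save_lock and its two flags are hypothetical, and how darc actually selects the lock is not described here.

```python
import contextlib
import multiprocessing
import threading


def choose_save_lock(use_multiprocessing=False, use_threading=False):
    """Return an object usable in a `with` block for any mode."""
    if use_multiprocessing:
        return multiprocessing.Lock()
    if use_threading:
        return threading.Lock()
    # No concurrency: a do-nothing context manager keeps callers uniform.
    return contextlib.nullcontext()


lock = choose_save_lock(use_threading=True)
lines = []
with lock:
    # Critical section: only one writer appends at a time.
    lines.append("row")
```

Because contextlib.nullcontext accepts the same with statement, the single-process case needs no special branch at the call site.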