Source Saving
The darc.save module contains the core utilities for managing fetched files and documents.
The data storage under the root path (PATH_DB) is typically laid out as follows:
data
├── _queue_requests.txt
├── _queue_requests.txt.tmp
├── _queue_selenium.txt
├── _queue_selenium.txt.tmp
├── api
│   └── <proxy>
│       └── <scheme>
│           └── <hostname>
│               ├── new_host
│               │   └── <hash>_<timestamp>.json
│               ├── requests
│               │   └── <hash>_<timestamp>.json
│               └── selenium
│                   └── <hash>_<timestamp>.json
├── link.csv
├── misc
│   ├── bitcoin.txt
│   ├── data
│   │   └── <hash>_<timestamp>.<ext>
│   ├── ed2k.txt
│   ├── invalid.txt
│   ├── irc.txt
│   ├── magnet.txt
│   └── mail.txt
└── <proxy>
    └── <scheme>
        └── <hostname>
            ├── <hash>_<timestamp>.dat
            ├── <hash>_<timestamp>.json
            ├── <hash>_<timestamp>_raw.html
            ├── <hash>_<timestamp>.html
            ├── <hash>_<timestamp>.png
            ├── robots.txt
            └── sitemap_<hash>.xml
darc.save.has_folder(link)
Check if the link belongs to a new host.
- Parameters
link (darc.link.Link) – Link object to check for a new host.
- Returns
If link is a new host, return link.base. If not, return None.
- Return type
Optional[str]
darc.save.has_html(time, link)
Check if we need to re-crawl the link with selenium.
- Parameters
link (darc.link.Link) – Link object to check if we need to re-crawl the link with selenium.
time (NewType.<locals>.new_type) –
- Returns
If a re-crawl is not needed, return the path to the existing document, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html. If a re-crawl is needed, return None.
- Return type
Optional[str]
darc.save.has_raw(time, link)
Check if we need to re-crawl the link with requests.
- Parameters
link (darc.link.Link) – Link object to check if we need to re-crawl the link with requests.
time (NewType.<locals>.new_type) –
- Returns
If a re-crawl is not needed, return the path to the existing document, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html or <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat. If a re-crawl is needed, return None.
- Return type
Optional[str]
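The documentation above does not spell out how the decision is made, so the following is only a minimal sketch of one plausible approach: look for saved <hash>_<timestamp>_raw.html snapshots and reuse the newest one if it is recent enough. The function name sketch_has_raw, the host_dir, link_hash and max_age parameters, and the timestamp format are all assumptions made for illustration; they are not part of darc.save, and the .dat case is left out for brevity.

```python
import glob
import os
from datetime import datetime, timedelta

# Assumed timestamp format, chosen for this sketch only.
STAMP = "%Y%m%dT%H%M%S"
SUFFIX = "_raw.html"


def sketch_has_raw(host_dir, link_hash, now, max_age):
    """Return the newest <hash>_<timestamp>_raw.html that is at most
    max_age old relative to now, or None if a re-crawl would be needed."""
    pattern = glob.escape(os.path.join(host_dir, f"{link_hash}_")) + "*" + SUFFIX
    newest = None
    for path in glob.glob(pattern):
        name = os.path.basename(path)
        # Strip the "<hash>_" prefix and the "_raw.html" suffix.
        stamp = name[len(link_hash) + 1:-len(SUFFIX)]
        try:
            saved = datetime.strptime(stamp, STAMP)
        except ValueError:
            continue  # file name does not carry a timestamp we can parse
        if newest is None or saved > newest[0]:
            newest = (saved, path)
    if newest is not None and now - newest[0] <= max_age:
        return newest[1]
    return None
```

Returning the path rather than a boolean mirrors the documented Optional[str] return type: the caller can load the cached document directly when the result is not None.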
darc.save.has_robots(link)
Check if robots.txt already exists.
- Parameters
link (darc.link.Link) – Link object to check if robots.txt already exists.
- Returns
If robots.txt exists, return the path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt. If not, return None.
- Return type
Optional[str]
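The "return the path if it exists, otherwise None" contract used by has_robots and has_sitemap can be sketched in a few lines. The helper below is hypothetical: sketch_has_file and its host_dir argument (standing in for <root>/<proxy>/<scheme>/<hostname>) are illustrative names, not part of darc.save.

```python
import os


def sketch_has_file(host_dir, filename="robots.txt"):
    """Return the path to filename under host_dir if it is an existing
    regular file, otherwise None."""
    path = os.path.join(host_dir, filename)
    return path if os.path.isfile(path) else None
```

A caller can then write `path = sketch_has_file(host_dir)` and fetch robots.txt only when the result is None.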
darc.save.has_sitemap(link)
Check if sitemap already exists.
- Parameters
link (darc.link.Link) – Link object to check if sitemap already exists.
- Returns
If sitemap exists, return the path to the sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml. If not, return None.
- Return type
Optional[str]
darc.save.sanitise(link, time=None, raw=False, data=False, headers=False, screenshot=False)
Sanitise link to path.
- Parameters
link (darc.link.Link) – Link object to sanitise the path.
time (datetime) – Timestamp for the path.
raw (bool) – If this is a raw HTML document from requests.
data (bool) – If this is a generic content type document.
headers (bool) – If this is response headers from requests.
screenshot (bool) – If this is the screenshot from selenium.
- Returns
If raw is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.
If data is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.
If headers is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.
If screenshot is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png.
If none of the above, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.
- Return type
str
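The mapping from flags to file names described above can be illustrated with a short sketch. It is not the darc implementation: sketch_sanitise, host_dir (standing in for <root>/<proxy>/<scheme>/<hostname>) and link_hash are hypothetical, the ISO 8601 timestamp rendering is an assumption, and checking the flags in the listed order is a guess, since the documentation does not say which flag wins when several are set.

```python
import os
from datetime import datetime


def sketch_sanitise(host_dir, link_hash, time, raw=False, data=False,
                    headers=False, screenshot=False):
    """Build <host_dir>/<hash>_<timestamp><suffix>, choosing the suffix
    from the first flag that is set, in the documented order."""
    stem = os.path.join(host_dir, f"{link_hash}_{time.isoformat()}")
    if raw:
        return f"{stem}_raw.html"
    if data:
        return f"{stem}.dat"
    if headers:
        return f"{stem}.json"
    if screenshot:
        return f"{stem}.png"
    return f"{stem}.html"
```

Because every variant shares the same <hash>_<timestamp> stem, all artifacts from one fetch of a link sort next to each other in the host directory.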
darc.save.save_file(time, link, content)
Save file.
The function will also try to create symbolic links from the standard path of the saved file to the relative path as in the URL.
- Parameters
time (datetime) – Timestamp of generic file.
link (darc.link.Link) – Link object of original URL.
content (bytes) – Content of generic file.
- Returns
Saved path to generic content type file, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat.
- Return type
str
darc.save.save_headers(time, link, response, session)
Save HTTP response headers.
- Parameters
time (datetime) – Timestamp of response.
link (darc.link.Link) – Link object of response.
response (requests.Response) – Response object to be saved.
session (requests.Session) – Session object of response.
- Returns
Saved path to response headers, i.e. <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.json.
- Return type
str
The JSON data saved is as follows:
{
    "[metadata]": {
        "url": "...",
        "proxy": "...",
        "host": "...",
        "base": "...",
        "name": "..."
    },
    "Timestamp": "...",
    "URL": "...",
    "Method": "GET",
    "Status-Code": "...",
    "Reason": "...",
    "Cookies": {
        "...": "..."
    },
    "Session": {
        "...": "..."
    },
    "Request": {
        "...": "..."
    },
    "Response": {
        "...": "..."
    }
}
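As an illustration of how such a record might be consumed, the sketch below loads a saved file and extracts a few of the top-level keys shown above. The helper name summarise_headers is hypothetical, and the presence of a Content-Type entry inside "Response" is an assumption (the entries of that mapping are elided in the documentation), so it is read with .get() and may be None.

```python
import json


def summarise_headers(path):
    """Load a saved headers record and return a small summary dict."""
    with open(path, encoding="utf-8") as file:
        record = json.load(file)
    return {
        "url": record["URL"],
        "method": record["Method"],
        "status": record["Status-Code"],
        # Assumed key; absent entries yield None instead of raising.
        "content_type": record.get("Response", {}).get("Content-Type"),
    }
```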
darc.save.save_html(time, link, html, raw=False)
Save response.
- Parameters
time (datetime) – Timestamp of HTML document.
link (darc.link.Link) – Link object of original URL.
html (Union[str, bytes]) – Content of HTML document.
raw (bool) – If the document was fetched by requests.
- Returns
Saved path to HTML document.
If raw is True, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html.
If not, <root>/<proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html.
- Return type
str
darc.save.save_link(link)
Save link hash database link.csv.
The CSV file has the following fields:
proxy type: link.proxy
URL scheme: link.url_parse.scheme
hostname: link.base
link hash: link.name
original URL: link.url
- Parameters
link (darc.link.Link) – Link object to be saved.
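A minimal sketch of appending one such row, guarded by a lock so that concurrent writers do not interleave their output, might look like the following. It takes the five field values as plain arguments instead of a darc.link.Link; sketch_save_link and SKETCH_LOCK are hypothetical names, and the lock is created locally only so the example is self-contained.

```python
import csv
import multiprocessing

# Local stand-in for an I/O lock shared between worker processes.
SKETCH_LOCK = multiprocessing.Lock()


def sketch_save_link(csv_path, proxy, scheme, hostname, link_hash, url):
    """Append one row (in the field order listed above) to the CSV file
    while holding the lock."""
    with SKETCH_LOCK:
        with open(csv_path, "a", newline="", encoding="utf-8") as file:
            csv.writer(file).writerow([proxy, scheme, hostname, link_hash, url])
```

Using csv.writer rather than manual string joining keeps URLs that contain commas or quotes intact when the file is read back.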
darc.save.save_robots(link, text)
Save robots.txt.
- Parameters
link (darc.link.Link) – Link object of robots.txt.
text (str) – Content of robots.txt.
- Returns
Saved path to robots.txt, i.e. <root>/<proxy>/<scheme>/<hostname>/robots.txt.
- Return type
str
darc.save.save_sitemap(link, text)
Save sitemap.
- Parameters
link (darc.link.Link) – Link object of sitemap.
text (str) – Content of sitemap.
- Returns
Saved path to sitemap, i.e. <root>/<proxy>/<scheme>/<hostname>/sitemap_<hash>.xml.
- Return type
str
darc.save._SAVE_LOCK: multiprocessing.Lock
I/O lock for saving link hash database link.csv.