Data Submission
The darc project can submit fetched data and information to a web server, to support real-time cross-analysis and status display.
There are three submission events:
- New Host Submission – API_NEW_HOST
  Submitted in the crawler() function call, when the crawling URL is marked as a new host.
- Requests Submission – API_REQUESTS
  Submitted in the crawler() function call, after the crawling process of the URL using requests.
- Selenium Submission – API_SELENIUM
  Submitted in the loader() function call, after the loading process of the URL using selenium.
darc.submit.get_html(link, time)
Read HTML document.
- Parameters
link (darc.link.Link) – Link object to read document from selenium.
time (str) –
- Returns
If document exists, return the data from document.
path – relative path from document to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html
data – base64 encoded content of document
If not, return None.
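The path/data record described above can be produced with a few lines of standard-library code. The following is a minimal sketch, not the darc implementation: the helper name `read_document` and its `root` and `rel_path` parameters are assumptions made for illustration.

```python
import base64
import os
from typing import Dict, Optional


def read_document(root: str, rel_path: str) -> Optional[Dict[str, str]]:
    """Return a path/data record for a stored file, or None if it is missing.

    ``root`` stands in for the data storage root and ``rel_path`` for the
    path of the document relative to it (both are hypothetical parameters).
    """
    full_path = os.path.join(root, rel_path)
    if not os.path.isfile(full_path):
        return None
    with open(full_path, 'rb') as file:
        content = file.read()
    return {
        'path': rel_path,
        # base64 output is pure ASCII, so it can be embedded in JSON as text
        'data': base64.b64encode(content).decode('ascii'),
    }
```

A consumer of the record can recover the original bytes with `base64.b64decode(record['data'])`.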
- Return type
darc.submit.get_metadata(link)
Generate metadata field.
- Parameters
link (darc.link.Link) – Link object to generate metadata.
- Returns
The metadata from link.
url – original URL, link.url
proxy – proxy type, link.proxy
host – hostname, link.host
base – base path, link.base
name – link hash, link.name
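As an illustration of the mapping listed above, the sketch below copies the five attributes into a dictionary. `LinkStub` is a hypothetical stand-in for darc.link.Link that carries only the attributes named in this section, and `build_metadata` is an illustrative name, not the library function.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LinkStub:
    """Hypothetical stand-in exposing the attributes listed above."""
    url: str
    proxy: Optional[str]
    host: str
    base: str
    name: str


def build_metadata(link: LinkStub) -> Dict[str, Optional[str]]:
    """Collect the metadata fields from a link-like object."""
    return {
        'url': link.url,      # original URL
        'proxy': link.proxy,  # proxy type
        'host': link.host,    # hostname
        'base': link.base,    # base path
        'name': link.name,    # link hash
    }
```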
- Return type
darc.submit.get_raw(link, time)
Read raw document.
- Parameters
link (darc.link.Link) – Link object to read document from requests.
time (str) –
- Returns
If document exists, return the data from document.
path – relative path from document to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html or <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat
data – base64 encoded content of document
If not, return None.
- Return type
darc.submit.get_robots(link)
Read robots.txt.
- Parameters
link (darc.link.Link) – Link object to read robots.txt.
- Returns
If robots.txt exists, return the data from robots.txt.
path – relative path from robots.txt to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/robots.txt
data – base64 encoded content of robots.txt
If not, return None.
- Return type
darc.submit.get_screenshot(link, time)
Read screenshot picture.
- Parameters
link (darc.link.Link) – Link object to read screenshot from selenium.
time (str) –
- Returns
If screenshot exists, return the data from screenshot.
path – relative path from screenshot to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png
data – base64 encoded content of screenshot
If not, return None.
- Return type
darc.submit.get_sitemap(link)
Read sitemaps.
- Parameters
link (darc.link.Link) – Link object to read sitemaps.
- Returns
If sitemaps exist, return list of the data from sitemaps.
path – relative path from sitemap to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/sitemap_<hash>.xml
data – base64 encoded content of sitemap
If not, return None.
- Return type
darc.submit.save_submit(domain, data)
Save failed submit data.
- Parameters
domain ('new_host', 'requests' or 'selenium') – Domain of the submit data.
data (Dict[str, Any]) – Submit data.
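To make the idea of saving submit data concrete, here is a minimal sketch that writes a record to disk as JSON. The directory layout `<path_api>/<domain>/<name>_<timestamp>.json`, the function name `save_record` and the `path_api` parameter are assumptions for illustration; the layout darc actually uses is not stated in this section.

```python
import json
import os
from typing import Any, Dict

# the three domains accepted by the ``domain`` parameter described above
DOMAINS = ('new_host', 'requests', 'selenium')


def save_record(path_api: str, domain: str, data: Dict[str, Any]) -> str:
    """Write ``data`` as JSON under ``path_api`` and return the file path.

    The file layout used here is hypothetical.
    """
    if domain not in DOMAINS:
        raise ValueError(f'unknown domain: {domain!r}')
    folder = os.path.join(path_api, domain)
    os.makedirs(folder, exist_ok=True)
    # file name built from the record's own link hash and timestamp
    name = data['[metadata]']['name']
    timestamp = data['Timestamp']
    path = os.path.join(folder, f'{name}_{timestamp}.json')
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=2)
    return path
```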
darc.submit.submit_new_host(time, link)
Submit new host.
When a new host is discovered, the darc crawler will submit the host information. This includes robots.txt (if it exists) and sitemap.xml (if any).
- Parameters
time (datetime.datetime) – Timestamp of submission.
link (darc.link.Link) – Link object of submission.
If API_NEW_HOST is None, the data for submission will be saved directly through save_submit().
The submitted data should have the following format:
{
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // robots.txt from the host (if not exists, then ``null``)
    "Robots": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/robots.txt
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    },
    // sitemaps from the host (if none, then ``null``)
    "Sitemaps": [
        {
            // path of the file, relative path (to data root path ``PATH_DATA``) in container
            //   - <proxy>/<scheme>/<host>/sitemap_<name>.txt
            "path": ...,
            // content of the file (**base64** encoded)
            "data": ...,
        },
        ...
    ],
    // hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)
    "Hosts": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/hosts.txt
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    }
}
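The record format above can be assembled from the pieces returned by the reader functions. The following is a sketch under stated assumptions: `new_host_record` and its parameters are illustrative names, and the path/data records are passed in already built, with None standing for a missing file so that it serializes to JSON null.

```python
import json
from datetime import datetime
from typing import Any, Dict, List, Optional

FileRecord = Dict[str, str]  # {'path': ..., 'data': ...}


def new_host_record(metadata: Dict[str, Any],
                    time: datetime,
                    robots: Optional[FileRecord] = None,
                    sitemaps: Optional[List[FileRecord]] = None,
                    hosts: Optional[FileRecord] = None) -> Dict[str, Any]:
    """Assemble a new-host record with the keys shown in the format above."""
    return {
        '[metadata]': metadata,
        'Timestamp': time.isoformat(),
        'URL': metadata['url'],
        'Robots': robots,
        # an empty list is collapsed to None, i.e. JSON null
        'Sitemaps': sitemaps or None,
        'Hosts': hosts,
    }


record = new_host_record(
    {'url': 'https://example.com/', 'proxy': None, 'host': 'example.com',
     'base': 'null/https/example.com', 'name': 'abc123'},
    datetime(2020, 1, 1),
    robots={'path': 'null/https/example.com/robots.txt', 'data': 'VXNlci1hZ2VudDogKg=='},
)
payload = json.dumps(record)  # JSON text ready to be sent or saved
```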
darc.submit.submit_requests(time, link, response, session)
Submit requests data.
When crawling, we’ll first fetch the URL using requests, to check its availability and to save its HTTP headers information. Such information will be submitted to the web UI.
- Parameters
time (datetime.datetime) – Timestamp of submission.
link (darc.link.Link) – Link object of submission.
response (requests.Response) – Response object of submission.
session (requests.Session) – Session object of submission.
If API_REQUESTS is None, the data for submission will be saved directly through save_submit().
The submitted data should have the following format:
{
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // request method
    "Method": "GET",
    // response status code
    "Status-Code": ...,
    // response reason
    "Reason": ...,
    // response cookies (if any)
    "Cookies": {
        ...
    },
    // session cookies (if any)
    "Session": {
        ...
    },
    // request headers (if any)
    "Request": {
        ...
    },
    // response headers (if any)
    "Response": {
        ...
    },
    // requested file (if not exists, then ``null``)
    "Document": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html
        // or if the document is of generic content type, i.e. not HTML
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    },
    // redirection history (if any)
    "History": [
        // same record data as the original response
        {"...": "..."}
    ]
}
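One way to picture how such a record might be derived from a response is sketched below. It relies only on attributes that a requests.Response exposes (url, status_code, reason, cookies, headers, request, history), but uses simple stand-in objects so the example runs without the requests package. The function names and the choice to reuse the same session cookies for history entries are assumptions, not darc behaviour.

```python
from types import SimpleNamespace
from typing import Any, Dict, Mapping, Optional


def response_fields(response: Any, session_cookies: Mapping[str, str],
                    document: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Extract the per-response fields of the format above."""
    return {
        'URL': response.url,
        'Method': response.request.method,
        'Status-Code': response.status_code,
        'Reason': response.reason,
        'Cookies': dict(response.cookies),
        'Session': dict(session_cookies),
        'Request': dict(response.request.headers),
        'Response': dict(response.headers),
        'Document': document,
    }


def requests_record(metadata: Dict[str, Any], timestamp: str, response: Any,
                    session_cookies: Mapping[str, str],
                    document: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Assemble the full record, including one entry per redirect."""
    record = {'[metadata]': metadata, 'Timestamp': timestamp}
    record.update(response_fields(response, session_cookies, document))
    record['History'] = [response_fields(item, session_cookies)
                         for item in response.history]
    return record


# stand-in objects mimicking a redirect followed by a final response
request = SimpleNamespace(method='GET', headers={'User-Agent': 'demo'})
redirect = SimpleNamespace(url='http://example.com/', request=request,
                           status_code=301, reason='Moved Permanently',
                           cookies={}, headers={'Location': 'https://example.com/'},
                           history=[])
final = SimpleNamespace(url='https://example.com/', request=request,
                        status_code=200, reason='OK', cookies={'sid': '1'},
                        headers={'Content-Type': 'text/html'}, history=[redirect])
record = requests_record({'url': final.url}, '2020-01-01T00:00:00', final, {'sid': '1'})
```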
darc.submit.submit_selenium(time, link)
Submit selenium data.
After crawling with requests, we’ll then render the URL using selenium with Google Chrome and its web driver, to provide a fully rendered web page. Such information will be submitted to the web UI.
- Parameters
time (datetime.datetime) – Timestamp of submission.
link (darc.link.Link) – Link object of submission.
If API_SELENIUM is None, the data for submission will be saved directly through save_submit().
Note
This information is optional, only provided if the content type from requests is HTML, the status code is not between 400 and 600, and the HTML data is not empty.
The submitted data should have the following format:
{
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // rendered HTML document (if not exists, then ``null``)
    "Document": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.html
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    },
    // web page screenshot (if not exists, then ``null``)
    "Screenshot": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.png
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    }
}
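The three conditions in the note for this function (HTML content type, status code outside the 400 to 600 range, non-empty HTML) can be expressed as a small predicate. This is only a sketch: the name `is_renderable`, the exact `text/html` comparison and the half-open range 400 <= status < 600 are assumptions about how the conditions might be checked.

```python
def is_renderable(content_type: str, status_code: int, html: bytes) -> bool:
    """Return True if a response qualifies for the selenium step.

    Assumptions: "is HTML" means the media type equals ``text/html``, and
    "between 400 and 600" is read as 400 <= status_code < 600.
    """
    # drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(';', 1)[0].strip().lower()
    is_html = media_type == 'text/html'
    is_error = 400 <= status_code < 600
    return is_html and not is_error and len(html) > 0
```

For instance, a `text/html; charset=utf-8` response with status 200 and a non-empty body passes, while a 404 page or an `application/json` response does not.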
darc.submit.PATH_API = '{PATH_DB}/api/'
Path to the API submission records, i.e. the api folder under the root of data storage.
darc.submit.API_NEW_HOST: str
API URL for submit_new_host().
- Default
None
- Environ
darc.submit.API_REQUESTS: str
API URL for submit_requests().
- Default
None
- Environ
darc.submit.API_SELENIUM: str
API URL for submit_selenium().
- Default
None
- Environ
Note
If API_NEW_HOST, API_REQUESTS or API_SELENIUM is None, the corresponding submit function will save the JSON data in the path specified by PATH_API.
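The fallback described in this note can be sketched as a small dispatcher. The `post` and `save` callables are injected so that the example stays independent of any HTTP library; their names, the function name `submit_or_save` and the decision to also save when posting raises an error are assumptions for illustration.

```python
from typing import Any, Callable, Dict, Optional


def submit_or_save(api_url: Optional[str],
                   domain: str,
                   data: Dict[str, Any],
                   post: Callable[[str, Dict[str, Any]], None],
                   save: Callable[[str, Dict[str, Any]], None]) -> bool:
    """Post ``data`` to ``api_url``, or save it locally.

    Returns True if the data was posted and False if it was saved.
    """
    if api_url is not None:
        try:
            post(api_url, data)
            return True
        except Exception:
            # assumption: a failed submission is kept locally instead of lost
            pass
    save(domain, data)
    return False
```

With `api_url=None` the `post` callable is never invoked and the record goes straight to `save`, which mirrors the behaviour stated in the note.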