Data Submission
The darc project can submit fetched data and information
to a web server, to support real-time
cross-analysis and status display.
There are three submission events:
New Host Submission – API_NEW_HOST
Submitted in the crawler() function call, when the crawling URL is marked as a new host.
Requests Submission – API_REQUESTS
Submitted in the crawler() function call, after the crawling process of the URL using requests.
Selenium Submission – API_SELENIUM
Submitted in the loader() function call, after the loading process of the URL using selenium.
-
darc.submit.get_html(link, time)
Read HTML document.
- Parameters
link (darc.link.Link) – Link object to read document from selenium.
time (str) –
- Returns
If document exists, return the data from document.
path – relative path from document to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.html
data – base64 encoded content of document
If not, return None.
- Return type
Optional[Dict[str, Union[str, ByteString]]]
-
darc.submit.get_metadata(link)
Generate metadata field.
- Parameters
link (darc.link.Link) – Link object to generate metadata.
- Returns
The metadata from link.
url – original URL, link.url
proxy – proxy type, link.proxy
host – hostname, link.host
base – base path, link.base
name – link hash, link.name
- Return type
Dict[str, str]
-
darc.submit.get_raw(link, time)
Read raw document.
- Parameters
link (darc.link.Link) – Link object to read document from requests.
time (str) –
- Returns
If document exists, return the data from document.
path – relative path from document to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/<hash>_<timestamp>_raw.html or <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.dat
data – base64 encoded content of document
If not, return None.
- Return type
Optional[Dict[str, Union[str, ByteString]]]
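The path/data record returned by get_html() and get_raw() can be illustrated with a short sketch. This is not darc's implementation: read_record, its root argument and the example file name are hypothetical, and the data field is decoded to text here on the assumption that the record is later serialised as JSON.

```python
import base64
import os
import tempfile
from typing import Dict, Optional


def read_record(root: str, path: str) -> Optional[Dict[str, str]]:
    """Return ``{'path': ..., 'data': ...}`` for a stored file, or ``None``.

    ``root`` stands in for the data storage root and ``path`` for the
    file location; both names are assumptions made for this sketch.
    """
    if not os.path.isfile(path):
        return None
    with open(path, 'rb') as file:
        content = file.read()
    return {
        # path of the file relative to the storage root
        'path': os.path.relpath(path, root),
        # base64 text, so that binary content survives JSON serialisation
        'data': base64.b64encode(content).decode('ascii'),
    }


# Usage with a throwaway directory in place of the real storage root.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'null', 'https', 'example.com')
    os.makedirs(target)
    document = os.path.join(target, 'abc_2020-01-01.html')
    with open(document, 'wb') as file:
        file.write(b'<html></html>')
    found = read_record(root, document)
    missing = read_record(root, os.path.join(target, 'absent.html'))
```

A missing file yields None rather than an exception, which matches the "If not, return None" contract described above.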
-
darc.submit.get_robots(link)
Read robots.txt.
- Parameters
link (darc.link.Link) – Link object to read robots.txt.
- Returns
If robots.txt exists, return the data from robots.txt.
path – relative path from robots.txt to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/robots.txt
data – base64 encoded content of robots.txt
If not, return None.
- Return type
Optional[Dict[str, Union[str, ByteString]]]
-
darc.submit.get_screenshot(link, time)
Read screenshot picture.
- Parameters
link (darc.link.Link) – Link object to read screenshot from selenium.
time (str) –
- Returns
If screenshot exists, return the data from screenshot.
path – relative path from screenshot to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/<hash>_<timestamp>.png
data – base64 encoded content of screenshot
If not, return None.
- Return type
Optional[Dict[str, Union[str, ByteString]]]
-
darc.submit.get_sitemap(link)
Read sitemaps.
- Parameters
link (darc.link.Link) – Link object to read sitemaps.
- Returns
If sitemaps exist, return list of the data from sitemaps.
path – relative path from sitemap to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/sitemap_<hash>.xml
data – base64 encoded content of sitemap
If not, return None.
- Return type
Optional[List[Dict[str, Union[str, ByteString]]]]
-
darc.submit.save_submit(domain, data)
Save failed submit data.
- Parameters
domain ('new_host', 'requests' or 'selenium') – Domain of the submit data.
data (Dict[str, Any]) – Submit data.
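As a rough illustration of what saving failed submit data could look like, the sketch below writes the data as JSON under a folder per domain. The function name save_failed_submit, the api_root argument and the <api_root>/<domain>/<name>_<timestamp>.json layout are assumptions made for this example; the actual on-disk layout used by darc is not described here.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_failed_submit(api_root: str, domain: str, data: Dict[str, Any]) -> str:
    """Write ``data`` as JSON below ``api_root`` and return the file path.

    The ``<api_root>/<domain>/<name>_<timestamp>.json`` layout is an
    assumption for this sketch, not necessarily what darc uses.
    """
    if domain not in ('new_host', 'requests', 'selenium'):
        raise ValueError(f'unknown domain: {domain!r}')
    folder = os.path.join(api_root, domain)
    os.makedirs(folder, exist_ok=True)
    name = data['[metadata]']['name']
    # ':' is not portable in file names, so replace it in the timestamp
    stamp = data['Timestamp'].replace(':', '-')
    path = os.path.join(folder, f'{name}_{stamp}.json')
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=2)
    return path


# Usage with a temporary folder standing in for the API records folder.
with tempfile.TemporaryDirectory() as api_root:
    record = {
        '[metadata]': {'name': 'abc'},
        'Timestamp': '2020-01-01T00:00:00',
        'URL': 'https://example.com/',
    }
    saved_path = save_failed_submit(api_root, 'requests', record)
    with open(saved_path, encoding='utf-8') as file:
        restored = json.load(file)
```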
-
darc.submit.submit(api, domain, data)
Submit data.
- Parameters
api (str) – API URL.
domain ('new_host', 'requests' or 'selenium') – Domain of the submit data.
data (Dict[str, Any]) – Submit data.
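One way to picture how submission, retries and the save fallback might fit together is sketched below. It is an assumption-laden outline rather than darc's code: submit_with_retry, post and save are hypothetical names, the transport is injected as a callable so the example needs no network, and "retries" is read here as the total number of attempts.

```python
from typing import Any, Callable, Dict, List, Optional

Payload = Dict[str, Any]


def submit_with_retry(post: Callable[[str, Payload], bool],
                      save: Callable[[Payload], None],
                      api: Optional[str],
                      data: Payload,
                      retries: int = 3) -> bool:
    """Try ``post`` up to ``retries`` times, then fall back to ``save``."""
    if api is None:
        # no API configured: keep the data locally instead
        save(data)
        return False
    for _ in range(retries):
        try:
            if post(api, data):
                return True
        except OSError:
            # network-level failure (ConnectionError is an OSError): retry
            continue
    save(data)
    return False


# Usage: a transport that fails once, then succeeds.
attempts: List[str] = []
fallback: List[Payload] = []


def flaky_post(api: str, data: Payload) -> bool:
    attempts.append(api)
    return len(attempts) >= 2


def failing_post(api: str, data: Payload) -> bool:
    raise ConnectionError('unreachable')


succeeded = submit_with_retry(flaky_post, fallback.append,
                              'https://example.com/api', {'URL': 'a'})
failed = submit_with_retry(failing_post, fallback.append,
                           'https://example.com/api', {'URL': 'b'})
```

Injecting the transport keeps the retry logic independent of any particular HTTP client.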
-
darc.submit.submit_new_host(time, link)
Submit new host.
When a new host is discovered, the darc crawler will submit the host information. This includes robots.txt (if it exists) and sitemap.xml (if any).
- Parameters
time (datetime.datetime) – Timestamp of submission.
link (darc.link.Link) – Link object of submission.
If API_NEW_HOST is None, the data for submission will be saved directly through save_submit().
The data submitted should have the following format:
{
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // robots.txt from the host (if not exists, then ``null``)
    "Robots": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/robots.txt
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    },
    // sitemaps from the host (if none, then ``null``)
    "Sitemaps": [
        {
            // path of the file, relative path (to data root path ``PATH_DATA``) in container
            //   - <proxy>/<scheme>/<host>/sitemap_<name>.txt
            "path": ...,
            // content of the file (**base64** encoded)
            "data": ...,
        },
        ...
    ],
    // hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)
    "Hosts": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/hosts.txt
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    }
}
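To make the format above more concrete, here is a minimal sketch that assembles a dictionary with the same top-level keys. The helper build_new_host_payload and all sample values are hypothetical; missing files are represented as None, which the json module serialises as null.

```python
import base64
import json
from datetime import datetime
from typing import Any, Dict, List, Optional

Record = Dict[str, str]


def build_new_host_payload(metadata: Dict[str, str],
                           time: datetime,
                           robots: Optional[Record] = None,
                           sitemaps: Optional[List[Record]] = None,
                           hosts: Optional[Record] = None) -> Dict[str, Any]:
    """Assemble a dictionary with the keys of the new host format."""
    return {
        '[metadata]': metadata,
        'Timestamp': time.isoformat(),
        'URL': metadata['url'],
        'Robots': robots,      # None is serialised as JSON null
        'Sitemaps': sitemaps,
        'Hosts': hosts,
    }


# Usage with made-up values for a host that only has a robots.txt.
sample_metadata = {
    'url': 'https://example.com/',
    'proxy': 'null',
    'host': 'example.com',
    'base': 'null/https/example.com',
    'name': 'abc',
}
sample_robots = {
    'path': 'null/https/example.com/robots.txt',
    'data': base64.b64encode(b'User-agent: *').decode('ascii'),
}
payload = build_new_host_payload(sample_metadata, datetime(2020, 1, 1),
                                 robots=sample_robots)
encoded = json.dumps(payload)
```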
-
darc.submit.submit_requests(time, link, response, session)
Submit requests data.
When crawling, we’ll first fetch the URL using requests, to check its availability and to save its HTTP header information. Such information will be submitted to the web UI.
- Parameters
time (datetime.datetime) – Timestamp of submission.
link (darc.link.Link) – Link object of submission.
response (requests.Response) – Response object of submission.
session (requests.Session) – Session object of submission.
If API_REQUESTS is None, the data for submission will be saved directly through save_submit().
The data submitted should have the following format:
{
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // request method
    "Method": "GET",
    // response status code
    "Status-Code": ...,
    // response reason
    "Reason": ...,
    // response cookies (if any)
    "Cookies": {
        ...
    },
    // session cookies (if any)
    "Session": {
        ...
    },
    // request headers (if any)
    "Request": {
        ...
    },
    // response headers (if any)
    "Response": {
        ...
    },
    // requested file (if not exists, then ``null``)
    "Document": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html
        // or if the document is of generic content type, i.e. not HTML
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    }
}
-
darc.submit.submit_selenium(time, link)
Submit selenium data.
After crawling with requests, we’ll then render the URL using selenium with Google Chrome and its web driver, to provide a fully rendered web page. Such information will be submitted to the web UI.
- Parameters
time (datetime.datetime) – Timestamp of submission.
link (darc.link.Link) – Link object of submission.
If API_SELENIUM is None, the data for submission will be saved directly through save_submit().
Note
This information is optional; it is only provided if the content type from requests is HTML, the status code is not between 400 and 600, and the HTML data is not empty.
The data submitted should have the following format:
{
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // rendered HTML document (if not exists, then ``null``)
    "Document": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.html
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    },
    // web page screenshot (if not exists, then ``null``)
    "Screenshot": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.png
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    }
}
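The condition in the note for submit_selenium() can be expressed as a small predicate. The sketch below is only one reading of that note: should_submit_selenium is a hypothetical name, "HTML" is taken to mean the text/html media type, and "between 400 and 600" is taken as the range 400 to 599 inclusive.

```python
def should_submit_selenium(content_type: str, status_code: int, html: str) -> bool:
    """Return ``True`` if selenium data would be provided under the note.

    Assumptions of this sketch: only ``text/html`` counts as HTML, and
    status codes 400..599 count as errors.
    """
    # drop parameters such as '; charset=utf-8' before comparing
    media_type = content_type.split(';', 1)[0].strip().lower()
    is_html = media_type == 'text/html'
    is_error = 400 <= status_code < 600
    has_content = bool(html.strip())
    return is_html and not is_error and has_content


# Usage with a few representative responses.
checks = {
    'ok': should_submit_selenium('text/html; charset=utf-8', 200, '<html></html>'),
    'not_found': should_submit_selenium('text/html', 404, '<html></html>'),
    'json': should_submit_selenium('application/json', 200, '{}'),
    'empty': should_submit_selenium('text/html', 200, '   '),
}
```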
-
darc.submit.PATH_API = '{PATH_DB}/api/'
Path to the API submission records, i.e. the api folder under the root of data storage.
-
darc.submit.API_RETRY: int
Number of retries for a failed API submission.
- Default
3
- Environ
-
darc.submit.API_NEW_HOST: str
API URL for submit_new_host().
- Default
None
- Environ
-
darc.submit.API_REQUESTS: str
API URL for submit_requests().
- Default
None
- Environ
-
darc.submit.API_SELENIUM: str
API URL for submit_selenium().
- Default
None
- Environ
Note
If API_NEW_HOST, API_REQUESTS
or API_SELENIUM is None, the corresponding
submit function will save the JSON data in the path
specified by PATH_API.
See also
The darc project provides a demo on how to implement a darc-compliant
web backend for the data submission module. See the demo page
for more information.
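On the receiving side, a backend mostly needs to parse the JSON body and decode the base64 fields. The following sketch is not the demo mentioned above; decode_document and the sample body are hypothetical, and it assumes the body is plain JSON carrying a "Document" record in the path/data shape used by the requests and selenium submissions.

```python
import base64
import json
from typing import Optional, Tuple


def decode_document(body: bytes) -> Optional[Tuple[str, bytes]]:
    """Return ``(path, content)`` from a submission body, or ``None``.

    ``None`` is returned when the "Document" field is absent or null.
    """
    payload = json.loads(body)
    document = payload.get('Document')
    if document is None:
        return None
    return document['path'], base64.b64decode(document['data'])


# Usage with a hand-built body in place of a real HTTP request.
sample_body = json.dumps({
    'URL': 'https://example.com/',
    'Document': {
        'path': 'null/https/example.com/abc_2020-01-01.html',
        'data': base64.b64encode(b'<html></html>').decode('ascii'),
    },
}).encode('utf-8')
decoded_document = decode_document(sample_body)
no_document = decode_document(b'{"URL": "https://example.com/", "Document": null}')
```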