Data Submission

The darc project can submit fetched data and information to a web server, to support real-time cross-analysis and status display.

There are three submission events:

  1. New Host Submission – API_NEW_HOST

    Submitted in the crawler() function call, when the URL being crawled is marked as a new host.

  2. Requests Submission – API_REQUESTS

    Submitted in the crawler() function call, after the URL has been crawled using requests.

  3. Selenium Submission – API_SELENIUM

    Submitted in the loader() function call, after the URL has been loaded using selenium.

See also

Please refer to data schema for more information about the submission data.

darc.submit.get_hosts(link)[source]

Read hosts.txt.

Parameters

link (Link) – Link object to read hosts.txt.

Return type

Optional[File]

Returns

  • If hosts.txt exists, return the data from hosts.txt.

    • path – relative path from hosts.txt to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/hosts.txt

    • data – base64 encoded content of hosts.txt

  • If not, return None.
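
The return shape described above (a path relative to the storage root plus base64 encoded content, or None when the file is missing) can be sketched as follows. This is only an illustration: the helper name read_file_record, the root argument and the plain dict return value are assumptions, not the actual darc implementation.

```python
import base64
import os
import tempfile
from typing import Dict, Optional


def read_file_record(root: str, relative_path: str) -> Optional[Dict[str, str]]:
    """Return ``{'path': ..., 'data': ...}`` for a stored file, or ``None``.

    Hypothetical helper; ``root`` stands in for the data storage root.
    """
    full_path = os.path.join(root, relative_path)
    if not os.path.isfile(full_path):
        return None  # mirrors the "If not, return None" case
    with open(full_path, 'rb') as file:
        content = file.read()
    return {
        'path': relative_path,
        # base64 keeps arbitrary bytes safe inside a JSON document
        'data': base64.b64encode(content).decode('ascii'),
    }


# Usage: write a throwaway hosts.txt, read it back, then probe a missing file.
with tempfile.TemporaryDirectory() as root:
    relative = os.path.join('i2p', 'http', 'example.i2p', 'hosts.txt')
    os.makedirs(os.path.dirname(os.path.join(root, relative)))
    with open(os.path.join(root, relative), 'wb') as file:
        file.write(b'example.i2p=placeholder\n')
    record = read_file_record(root, relative)
    missing = read_file_record(root, os.path.join('i2p', 'http', 'example.i2p', 'robots.txt'))
```

The same pattern would apply to robots.txt and to each sitemap, with only the relative path changing.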

darc.submit.get_robots(link)[source]

Read robots.txt.

Parameters

link (Link) – Link object to read robots.txt.

Return type

Optional[File]

Returns

  • If robots.txt exists, return the data from robots.txt.

    • path – relative path from robots.txt to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/robots.txt

    • data – base64 encoded content of robots.txt

  • If not, return None.

darc.submit.get_sitemaps(link)[source]

Read sitemaps.

Parameters

link (Link) – Link object to read sitemaps.

Return type

Optional[List[File]]

Returns

  • If sitemaps exist, return list of the data from sitemaps.

    • path – relative path from sitemap to root of data storage PATH_DB, <proxy>/<scheme>/<hostname>/sitemap_<hash>.xml

    • data – base64 encoded content of sitemap

  • If not, return None.

darc.submit.save_submit(domain, data)[source]

Save failed submit data.

Parameters
  • domain ('new_host', 'requests' or 'selenium') – Domain of the submit data.

  • data (Dict[str, Any]) – Submit data.

Return type

None

Notes

The saved files will be categorised by the actual runtime day for better maintenance.
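
One way to categorise saved files by day is sketched below. The directory layout (<root>/<date>/<domain>/<random>.json), the function name save_failed_submit and the optional day argument are all assumptions made for illustration; the layout darc actually uses may differ.

```python
import datetime
import json
import os
import tempfile
import uuid
from typing import Any, Dict, Optional


def save_failed_submit(root: str, domain: str, data: Dict[str, Any],
                       day: Optional[datetime.date] = None) -> str:
    """Write ``data`` as JSON under a per-day folder and return the file path.

    Hypothetical layout: <root>/<YYYY-MM-DD>/<domain>/<random hex>.json
    """
    if day is None:
        day = datetime.date.today()  # the actual runtime day
    folder = os.path.join(root, day.isoformat(), domain)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f'{uuid.uuid4().hex}.json')
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=2)
    return path


# Usage: save one record for a fixed day and read it back.
with tempfile.TemporaryDirectory() as root:
    saved_path = save_failed_submit(
        root, 'requests', {'URL': 'http://example.com/'},
        day=datetime.date(2020, 1, 2),
    )
    with open(saved_path, encoding='utf-8') as file:
        restored = json.load(file)
    day_folder = os.path.basename(os.path.dirname(os.path.dirname(saved_path)))
```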

darc.submit.submit(api, domain, data)[source]

Submit data.

Parameters
  • api (str) – API URL.

  • domain ('new_host', 'requests' or 'selenium') – Domain of the submit data.

  • data (Dict[str, Any]) – Submit data.

Return type

None
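
How submit(), API_RETRY and save_submit() might fit together is sketched below. The control flow is an assumption, not taken from the darc source: the sketch makes one initial attempt plus retry further attempts and falls back to a save callback when all of them fail. The post and save callables are hypothetical stand-ins so that the example runs without a network.

```python
from typing import Any, Callable, Dict, List, Tuple


def submit_with_retry(
    post: Callable[[str, Dict[str, Any]], bool],
    save: Callable[[str, Dict[str, Any]], None],
    api: str,
    domain: str,
    data: Dict[str, Any],
    retry: int = 3,
) -> bool:
    """Try to post ``data``; on repeated failure hand it to ``save``."""
    for _ in range(retry + 1):  # assumed: one initial attempt plus ``retry`` retries
        try:
            if post(api, data):
                return True
        except Exception:
            # A real client would catch narrower, transport-specific errors.
            continue
    save(domain, data)  # keep the failed submission for later inspection
    return False


# Usage with a poster that always fails, so the fallback is exercised.
attempts: List[str] = []
saved: List[Tuple[str, Dict[str, Any]]] = []


def always_fail(api: str, data: Dict[str, Any]) -> bool:
    attempts.append(api)
    raise ConnectionError('unreachable')


delivered = submit_with_retry(
    always_fail,
    lambda domain, data: saved.append((domain, data)),
    'http://localhost:8000/api/new_host',
    'new_host',
    {'URL': 'http://example.com/'},
    retry=3,
)
```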

darc.submit.submit_new_host(time, link, partial=False, force=False)[source]

Submit new host.

When a new host is discovered, the darc crawler submits the host information, which includes robots.txt (if it exists) and sitemap.xml (if any).

Parameters
  • time (datetime.datetime) – Timestamp of submission.

  • link (Link) – Link object of submission.

  • partial (bool) – Whether the data is incomplete, i.e. fetching robots.txt, hosts.txt and/or sitemaps failed.

  • force (bool) – Whether the data was forcibly re-fetched, i.e. the cache had expired when checking with darc.db.have_hostname().

Return type

None

If API_NEW_HOST is None, the data for submission will be saved directly through save_submit().

The data submitted should have the following format:

{
    // partial flag - true / false
    "$PARTIAL$": ...,
    // force flag - true / false
    "$FORCE$": ...,
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...,
        // originate URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "backref": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // robots.txt from the host (if not exists, then ``null``)
    "Robots": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/robots.txt
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    },
    // sitemaps from the host (if none, then ``null``)
    "Sitemaps": [
        {
            // path of the file, relative path (to data root path ``PATH_DATA``) in container
            //   - <proxy>/<scheme>/<host>/sitemap_<name>.xml
            "path": ...,
            // content of the file (**base64** encoded)
            "data": ...,
        },
        ...
    ],
    // hosts.txt from the host (if proxy type is ``i2p``; if not exists, then ``null``)
    "Hosts": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/hosts.txt
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    }
}
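
The "[metadata]" fields in the format above can be derived from a URL roughly as sketched below. The function name build_metadata is hypothetical, the proxy type is passed in rather than detected, and using netloc for "host" is an assumption based on the comment in the schema.

```python
import hashlib
from typing import Dict, Optional
from urllib.parse import urlparse


def build_metadata(url: str, proxy: str,
                   backref: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Assemble the metadata mapping for ``url`` (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.netloc  # assumed: "host" is the netloc of the URL
    return {
        'url': url,
        'proxy': proxy,
        'host': host,
        # base folder: <proxy>/<scheme>/<host>
        'base': f'{proxy}/{parsed.scheme}/{host}',
        # sha256 of the URL, used as the stem of saved file names
        'name': hashlib.sha256(url.encode('utf-8')).hexdigest(),
        'backref': backref,
    }


metadata = build_metadata('http://example.onion/index.html', proxy='tor')
```

With such a mapping, a saved JSON log would sit at <base>/<name>_<timestamp>.json, as the comments in the format describe.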

darc.submit.submit_requests(time, link, response, session, content, mime_type, html=True)[source]

Submit requests data.

When crawling, we’ll first fetch the URL using requests, to check its availability and to save its HTTP header information. Such information will be submitted to the web UI.

Parameters
  • time (datetime.datetime) – Timestamp of submission.

  • link (Link) – Link object of submission.

  • response (requests.Response) – Response object of submission.

  • session (requests.Session) – Session object of submission.

  • content (bytes) – Raw content from the response.

  • mime_type (str) – Content type.

  • html (bool) – Whether the current document is HTML rather than another file type.

Return type

None

If API_REQUESTS is None, the data for submission will be saved directly through save_submit().

The data submitted should have the following format:

{
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...,
        // originate URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "backref": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // request method
    "Method": "GET",
    // response status code
    "Status-Code": ...,
    // response reason
    "Reason": ...,
    // response cookies (if any)
    "Cookies": {
        ...
    },
    // session cookies (if any)
    "Session": {
        ...
    },
    // request headers (if any)
    "Request": {
        ...
    },
    // response headers (if any)
    "Response": {
        ...
    },
    // content type
    "Content-Type": ...,
    // requested file (if not exists, then ``null``)
    "Document": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>_raw.html
        // or if the document is of generic content type, i.e. not HTML
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.dat
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    },
    // redirection history (if any)
    "History": [
        // same record data as the original response
        {"...": "..."}
    ]
}

darc.submit.submit_selenium(time, link, html, screenshot)[source]

Submit selenium data.

After crawling with requests, we’ll then render the URL using selenium with Google Chrome and its web driver, to provide a fully rendered web page. Such information will be submitted to the web UI.

Parameters
  • time (datetime.datetime) – Timestamp of submission.

  • link (Link) – Link object of submission.

  • html (str) – HTML source of the web page.

  • screenshot (Optional[str]) – base64 encoded screenshot.

Return type

None

If API_SELENIUM is None, the data for submission will be saved directly through save_submit().

Note

This information is optional; it is only provided if the content type from requests is HTML, the status code is not between 400 and 600, and the HTML data is not empty.
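
The three conditions in the note can be read as a single predicate, sketched below. The function name should_submit_selenium, the set of MIME types treated as HTML and the exact range boundaries (400 inclusive, 600 exclusive) are assumptions; the note itself does not spell them out.

```python
# MIME types treated as HTML in this sketch (an assumption).
HTML_MIME_TYPES = ('text/html', 'application/xhtml+xml')


def should_submit_selenium(mime_type: str, status_code: int, html: str) -> bool:
    """Return True when all three conditions from the note hold."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    is_html = mime_type.split(';')[0].strip().lower() in HTML_MIME_TYPES
    is_error = 400 <= status_code < 600  # assumed boundaries
    return is_html and not is_error and bool(html)


checks = {
    'ok page': should_submit_selenium('text/html; charset=utf-8', 200, '<html></html>'),
    'not found': should_submit_selenium('text/html', 404, '<html></html>'),
    'image': should_submit_selenium('image/png', 200, 'binary'),
    'empty body': should_submit_selenium('text/html', 200, ''),
}
```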

The data submitted should have the following format:

{
    // metadata of URL
    "[metadata]": {
        // original URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "url": ...,
        // proxy type - null / tor / i2p / zeronet / freenet
        "proxy": ...,
        // hostname / netloc, c.f. ``urllib.parse.urlparse``
        "host": ...,
        // base folder, relative path (to data root path ``PATH_DATA``) in container - <proxy>/<scheme>/<host>
        "base": ...,
        // sha256 of URL as name for saved files (timestamp is in ISO format)
        //   JSON log as this one - <base>/<name>_<timestamp>.json
        //   HTML from requests - <base>/<name>_<timestamp>_raw.html
        //   HTML from selenium - <base>/<name>_<timestamp>.html
        //   generic data files - <base>/<name>_<timestamp>.dat
        "name": ...,
        // originate URL - <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        "backref": ...
    },
    // requested timestamp in ISO format as in name of saved file
    "Timestamp": ...,
    // original URL
    "URL": ...,
    // rendered HTML document (if not exists, then ``null``)
    "Document": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.html
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    },
    // web page screenshot (if not exists, then ``null``)
    "Screenshot": {
        // path of the file, relative path (to data root path ``PATH_DATA``) in container
        //   - <proxy>/<scheme>/<host>/<name>_<timestamp>.png
        "path": ...,
        // content of the file (**base64** encoded)
        "data": ...,
    }
}

darc.submit.PATH_API = '{PATH_DB}/api/'

Path to the API submission records, i.e. the api folder under the root of data storage.

darc.submit.SAVE_DB: bool

Save submitted data to database.

Default

True

Environ

SAVE_DB

darc.submit.API_RETRY: int

Number of retries for API submission on failure.

Default

3

Environ

API_RETRY

darc.submit.API_NEW_HOST: str

API URL for submit_new_host().

Default

None

Environ

API_NEW_HOST

darc.submit.API_REQUESTS: str

API URL for submit_requests().

Default

None

Environ

API_REQUESTS

darc.submit.API_SELENIUM: str

API URL for submit_selenium().

Default

None

Environ

API_SELENIUM

Note

If any of API_NEW_HOST, API_REQUESTS or API_SELENIUM is None, the corresponding submit function will save the JSON data in the path specified by PATH_API.
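
The relationship between the three environment variables and the local fallback can be pictured with the sketch below. The helper names resolve_api_urls and describe_target are hypothetical, and how darc itself parses the environment (for instance, how it treats empty strings) is not covered here.

```python
from typing import Dict, Mapping, Optional

# Submission domain -> environment variable holding its API URL.
API_VARIABLES = {
    'new_host': 'API_NEW_HOST',
    'requests': 'API_REQUESTS',
    'selenium': 'API_SELENIUM',
}


def resolve_api_urls(environ: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Map each domain to its API URL, or None when the variable is unset."""
    return {domain: environ.get(name) for domain, name in API_VARIABLES.items()}


def describe_target(api: Optional[str], path_api: str) -> str:
    """Say where a submission for one domain would end up."""
    if api is None:
        return f'save JSON under {path_api}'
    return f'POST to {api}'


# Usage: only API_REQUESTS is configured in this example environment.
example_environ = {'API_REQUESTS': 'http://localhost:8000/api/requests'}
urls = resolve_api_urls(example_environ)
targets = {domain: describe_target(url, 'data/api/') for domain, url in urls.items()}
```

In a real deployment the mapping passed in would typically be os.environ.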

See also

The darc project provides a demo on how to implement a darc-compliant web backend for the data submission module. See the demo page for more information.