The Botify API allows you to export your crawled URLs (and their metadata) as a CSV file. Crawled URLs can be filtered in order to export only a subset of URLs. The full list of requestable fields can be found in the Analysis Datamodel.
Note that CSV exports are limited to 100,000 URLs by default. Contact your account manager if you need to increase this limit.
Some operations, like creating a CSV export, need to be done asynchronously because they can take longer than common timeouts allow. To do so, we use a polling mechanism:
To create a CSV, you need to:
- First call createUrlsExport, which returns a job id.
- Then poll getUrlsExportStatus every X seconds (using that job id) until the export is done.
Note that the Job middleware of the JavaScript SDK implements this logic, making asynchronous operations much easier to use.
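The polling loop described above can be sketched in Python. This is an illustrative helper, not part of the Botify SDK; the status-fetching call is injected as a callable so that the loop itself does not depend on any HTTP client:

```python
import time

def poll_until_done(fetch_status, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll `fetch_status()` until the export job reaches a terminal state.

    `fetch_status` is any callable returning the job payload as a dict
    (e.g. the parsed JSON body of a getUrlsExportStatus response).
    Returns the final payload, or raises if the job fails or the
    attempt budget runs out.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        status = job.get("job_status")
        if status == "DONE":
            return job
        if status == "FAILED":
            raise RuntimeError("export job failed: %r" % (job,))
        sleep(interval)  # wait before the next getUrlsExportStatus call
    raise TimeoutError("export job did not finish in time")
```

In a real client, `fetch_status` would wrap a GET on the getUrlsExportStatus endpoint using the job id returned by createUrlsExport.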
Endpoint: analyses/{username}/{project_slug}/{analysis_slug}/urls/export
Request body: Array<BQLQuery>
Response: Pagination<BQLResult>
Please refer to the BQLQuery documentation for information on how to define the fields to select and the filters to apply.
curl "https://api.botify.com/v1/analyses/${username}/${project_slug}/${analysis_slug}/urls/export" \
-H "Authorization: Token ${API_KEY}" \
-H "Content-Type: application/json" \
--data-binary "${BQLQuery}"
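An equivalent request can be built from Python using only the standard library. This is a sketch: `API_KEY` and the slug variables are placeholders, just as in the curl example above:

```python
import json
import urllib.request

API_KEY = "your-api-key"  # placeholder, as in the curl example
username, project_slug, analysis_slug = "username", "project_slug", "analysis_slug"

url = ("https://api.botify.com/v1/analyses/"
       f"{username}/{project_slug}/{analysis_slug}/urls/export")

query = {"fields": ["url"]}  # any valid BQLQuery body

req = urllib.request.Request(
    url,
    data=json.dumps(query).encode("utf-8"),
    headers={
        "Authorization": f"Token {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Sending the request returns the job payload (job_id, job_url, ...):
# with urllib.request.urlopen(req) as resp:
#     job = json.load(resp)
```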
Endpoint: analyses/{username}/{project_slug}/{analysis_slug}/urls/export/{url_export_id}
Response: CsvExportStatus
curl "https://api.botify.com/v1/analyses/${username}/${project_slug}/${analysis_slug}/urls/export/${url_export_id}" \
-X GET \
-H "Authorization: Token ${API_KEY}"
The following example of BQLQuery fetches the url and metadata.title.nb fields and filters the dataset to new URLs that respond with a 2xx HTTP code.
{
  "fields": [
    "url",
    "metadata.title.nb"
  ],
  "filters": {
    "and": [
      {
        "field": "http_code",
        "predicate": "between",
        "value": [200, 300]
      },
      {
        "not": {
          "field": "previous",
          "predicate": "exists"
        }
      }
    ]
  }
}
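The same query can be assembled as a plain dictionary and serialized before sending it as the request body. A minimal Python sketch, reusing the field names and filters from the example above:

```python
import json

# BQLQuery selecting two fields and keeping only new URLs with a 2xx code.
query = {
    "fields": ["url", "metadata.title.nb"],
    "filters": {
        "and": [
            {"field": "http_code", "predicate": "between", "value": [200, 300]},
            {"not": {"field": "previous", "predicate": "exists"}},
        ]
    },
}

# Serialized body for the createUrlsExport POST request.
body = json.dumps(query)
```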
You first call createUrlsExport to start the CSV export. The response could be the following:
{
  "job_id": 19381,
  "job_url": "https://api.botify.com/v1/analyses/username/project_slug/analysis_slug/urls/export/19381",
  "job_status": "CREATED",
  "date_created": "2016-06-01T10:07:03.309416Z",
  "query": ...,
  "area": "current",
  "nb_results": null,
  "results": null
}
Then start polling getUrlsExportStatus until job_status equals DONE (or FAILED). Once the export is finished, the response could be the following:
- nb_results gives you the number of URLs exported.
- results.download_url gives you the URL of the zip file containing your CSV export.
{
  "job_id": 19381,
  "job_url": "https://api.botify.com/v1/analyses/username/project_slug/analysis_slug/urls/export/19381",
  "job_status": "DONE",
  "date_created": "2016-06-01T10:07:03.309416Z",
  "nb_results": 275,
  "area": "current",
  "query": ...,
  "results": {
    "download_url": "https://d121xa69ioyktv.cloudfront.net/csv_exports/10ebf8c8de4a8d4e47ca1da766704d7d.zip"
  }
}
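Once the job is done, only nb_results and results.download_url are needed from the payload. A small illustrative helper (the function name is hypothetical, not part of any SDK) could extract them safely:

```python
def export_result(job):
    """Return (nb_results, download_url) from a finished export job payload.

    Raises if the job has not reached the DONE state (e.g. still running,
    or FAILED).
    """
    status = job.get("job_status")
    if status != "DONE":
        raise ValueError("export not finished, job_status=%r" % (status,))
    return job["nb_results"], job["results"]["download_url"]
```

The returned download_url points to a zip archive; fetching and unzipping it (for example with urllib.request and zipfile) yields the CSV file.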