How to get a listing of WARC files using HTTP for Common Crawl News Dataset?

I can obtain a file listing for the main Common Crawl dataset from:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz

How can I do this with the Common Crawl News Dataset?

I tried different URLs, but I always get errors:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS-2017-09/warc.paths.gz

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/09/warc.paths.gz

Solution

Because a new WARC file is added to the news dataset every few hours, a static file list does not make sense. Instead, you can list the files with the AWS CLI for any subset by year or month, e.g.:

aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2017/09/
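Since the question asks about HTTP specifically, the same listing can in principle be requested through the S3 REST API (ListObjectsV2), which returns XML. The following is a minimal sketch, not a tested recipe: it assumes the bucket permits anonymous listing over HTTP in the same way the `--no-sign-request` call above relies on, and the helper names `build_list_url`, `parse_listing` and `list_keys` are made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"
# Bucket endpoint taken from the URLs in the question.
BUCKET_URL = "https://commoncrawl.s3.amazonaws.com/"


def build_list_url(prefix, token=None):
    """Build a ListObjectsV2 request URL for the given key prefix."""
    params = {"list-type": "2", "prefix": prefix}
    if token:
        # Continue a listing that was truncated (S3 returns at most 1000 keys per page).
        params["continuation-token"] = token
    return BUCKET_URL + "?" + urllib.parse.urlencode(params)


def parse_listing(xml_text):
    """Return (keys, next_token) from one ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    keys = [el.text for el in root.iter(S3_NS + "Key")]
    # next_token is None when there are no further pages.
    next_token = root.findtext(S3_NS + "NextContinuationToken")
    return keys, next_token


def list_keys(prefix):
    """Yield every object key under prefix, following pagination.

    Assumption: anonymous listing is allowed; otherwise urlopen raises HTTPError.
    """
    token = None
    while True:
        with urllib.request.urlopen(build_list_url(prefix, token)) as resp:
            keys, token = parse_listing(resp.read())
        yield from keys
        if not token:
            break
```

Iterating over `list_keys("crawl-data/CC-NEWS/2017/09/")` would then yield the keys for September 2017, and each key can be appended to the bucket URL to download the file. If anonymous listing is not permitted, the request fails with an HTTP error and the AWS CLI with credentials remains the fallback.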

See also the news data release announcement.
