bigdata large-data large-files large-data-volumes

How can I create a large file with random but sensible English words?


I want to test my wordcount software, based on the MapReduce framework, with a very large file (over 1 GB), but I don't know how to generate it.

Are there any tools to create a large file with random but sensible English sentences? Thanks


Solution

  • I wrote this simple Python script that scrapes the Project Gutenberg site and writes the text (encoding: us-ascii; if you want to use other encodings, see http://www.gutenberg.org/files/) to a local text file. The script can be used in combination with https://github.com/c-w/gutenberg to do more accurate filtering (by language, by author, etc.; see the sketch after the script).

    import sys
    
    import requests
    
    # Usage: scraper <number_of_files>
    if len(sys.argv) != 2:
        print("[---------- ERROR ----------] Usage: scraper <number_of_files>", file=sys.stderr)
        sys.exit(1)
    
    number_of_files = int(sys.argv[1])
    
    with open("big_file.txt", "w") as text_file:
        # Project Gutenberg etext IDs start at 1.
        for i in range(1, number_of_files + 1):
            url = f"http://www.gutenberg.org/files/{i}/{i}.txt"
            resp = requests.get(url, timeout=30)
            if resp.status_code != 200:
                print("[X] resp.status_code =", resp.status_code, "for", url)
                continue
            print("[V] resp.status_code = 200 for", url)
    
            # Dummy cleaning of the text: keep only what lies between the
            # Project Gutenberg start/end markers, dropping the license
            # boilerplate around the book itself.
            try:
                content = resp.text
                body = content.split("*** START OF THIS PROJECT GUTENBERG EBOOK")[1]
                body = body.split("*** END OF THIS PROJECT GUTENBERG EBOOK")[0]
                print(body, file=text_file)
            except IndexError:
                # Markers not found in this file; skip it.
                continue
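
    If you use the c-w/gutenberg package linked above, you can skip the manual marker splitting entirely: it fetches a book by its numeric etext ID and strips the Project Gutenberg header/footer for you. A minimal sketch, assuming the package is installed (pip install gutenberg) and using a few illustrative etext IDs:

        # Sketch based on the gutenberg package's documented helpers:
        # load_etext downloads a book by its etext ID, strip_headers
        # removes the Project Gutenberg boilerplate around the text.
        from gutenberg.acquire import load_etext
        from gutenberg.cleanup import strip_headers
        
        # Illustrative IDs: Moby Dick, Pride and Prejudice,
        # Alice's Adventures in Wonderland.
        etext_ids = [2701, 1342, 11]
        
        with open("big_file.txt", "w") as text_file:
            for etext_id in etext_ids:
                text = strip_headers(load_etext(etext_id)).strip()
                print(text, file=text_file)

    To reach 1 GB you would iterate over many more IDs; the package also exposes metadata queries (gutenberg.query) that, per its documentation, can filter texts by author or language, which the raw scraper above cannot do.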