Hey there, I'm working on a text generator that should generate millions of different texts. To make the content of each text realistic, I used Zipf's Law. It is working well: the word distribution is correct.
But the following next() function executes very slowly, and since I want to generate millions of articles, it has to be changed (the while loop is the slow part). Can someone help me with this?
I implemented it like this:
public int next() {
    int rank;
    double frequency = 0;
    double dice;

    rank = rnd.nextInt(size);
    frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
    dice = rnd.nextDouble();

    while (!(dice < frequency) || (rank == 0)) {
        rank = rnd.nextInt(size);
        frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        dice = rnd.nextDouble();
    }

    return rank;
}
EDIT: I obtained the code from: http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/
The implementation that you copied ... has some issues. One could probably say that it is plainly wrong, because it uses random values, and when, in a computation like

rank = rnd.nextInt(size);
frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;

the rank value is 0, then the frequency is Infinity, which messes up some of the statistics.
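A quick standalone check illustrates this (with an arbitrary skew of 2.0 and an arbitrary positive value in place of bottom, since any finite divisor leaves the result infinite):

// Math.pow(0, skew) is 0.0, so the division yields Infinity
System.out.println((1.0d / Math.pow(0, 2.0)) / 1.5); // prints: Infinity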
I tried to correct these errors, but I have not analyzed the implementation in depth and have not compared it to the definition of the Zipf distribution function. So whoever copies my code might find out that it still "...has some issues".
The implementation of the next function is, strictly speaking, not "totally correct", in the sense that it does not necessarily terminate: there is nothing preventing the loop from running forever. Depending on the parameters, it may be more or less likely to take a while until it terminates. And I think that is also one of the main reasons for your "performance" issue: for some values, the condition (dice < frequency) is just very unlikely to be true. In fact, each iteration picks a rank uniformly and accepts it with probability frequency; since the frequencies over all ranks sum to 1, the average acceptance probability per iteration is 1/size, so the loop needs about size iterations per sample.
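A rough sketch (not from the original post) that counts the loop iterations empirically confirms this; with the assumed size of 1000, it should print an average close to 1000:

import java.util.Random;

// Rough sketch: count how many iterations the rejection loop
// needs per sample. Each iteration accepts a uniformly chosen
// rank with probability 'frequency', and the frequencies sum
// to 1, so the expected count per sample is about 'size'.
public class RejectionLoopCost
{
    public static void main(String[] args)
    {
        int size = 1000;
        double skew = 2.0;

        // Normalization constant, as in the original code
        double bottom = 0;
        for (int i = 1; i <= size; i++)
        {
            bottom += 1 / Math.pow(i, skew);
        }

        Random rnd = new Random(0);
        int samples = 10000;
        long iterations = 0;
        for (int s = 0; s < samples; s++)
        {
            while (true)
            {
                iterations++;
                int rank = rnd.nextInt(size) + 1;
                double frequency = (1.0d / Math.pow(rank, skew)) / bottom;
                if (rnd.nextDouble() < frequency)
                {
                    break; // accepted
                }
            }
        }
        System.out.println("Average iterations per sample: "
            + (double) iterations / samples);
    }
}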
Regardless of that, the goal that you want to achieve can be formulated more generically: you have a certain probability distribution, and you want a "random" function that returns values according to this distribution.

One simple and generic way to achieve this is to map the (cumulative) probability distribution to the target values with a NavigableMap. This map can then be used to quickly look up the target value, given a random value between 0.0 and 1.0 supplied by a java.util.Random instance.

There may be more efficient solutions for particular cases, but again: this is very generic and simple (and still reasonably efficient).
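As a minimal sketch of this generic idea (with made-up probabilities, not part of the actual solution below): the map keys are the cumulative probabilities, and ceilingEntry finds the value whose interval contains the given random number:

import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class CumulativeDistributionSketch
{
    public static void main(String[] args)
    {
        NavigableMap<Double, String> map = new TreeMap<Double, String>();
        map.put(0.5, "A"); // P(A) = 0.5 : values in [0.0, 0.5)
        map.put(0.8, "B"); // P(B) = 0.3 : values in [0.5, 0.8)
        map.put(1.0, "C"); // P(C) = 0.2 : values in [0.8, 1.0)

        Random random = new Random(0);
        for (int i = 0; i < 10; i++)
        {
            // nextDouble() returns a value in [0.0, 1.0), so
            // ceilingEntry never returns null here
            double value = random.nextDouble();
            System.out.println(map.ceilingEntry(value).getValue());
        }
    }
}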
I implemented this here for the Zipf distribution. Again, I did not verify everything in detail, and there are some +1/-1 oddities (related to the issues mentioned in the first paragraph), but it should show the idea: the FastZipfGenerator fills the map with the cumulative probability distribution, and its next() function just performs a lookup:
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class ZipfGeneratorTest
{
    public static void main(String[] args)
    {
        int size = 10;
        double skew = 2.0;
        ZipfGenerator z0 = new ZipfGenerator(size, skew);
        FastZipfGenerator z1 = new FastZipfGenerator(size, skew);

        long before = 0;
        long after = 0;
        int n = 5000000;

        before = System.nanoTime();
        Map<Integer, Integer> counts0 = computeCounts(z0, size, n);
        after = System.nanoTime();
        System.out.println(counts0 + ", duration " + (after - before) / 1e6);

        before = System.nanoTime();
        Map<Integer, Integer> counts1 = computeCounts(z1, size, n);
        after = System.nanoTime();
        System.out.println(counts1 + ", duration " + (after - before) / 1e6);
    }

    private static Map<Integer, Integer> computeCounts(
        ZipfGenerator z, int size, int n)
    {
        Map<Integer, Integer> counts = new LinkedHashMap<Integer, Integer>();
        for (int i = 1; i <= size; i++)
        {
            counts.put(i, 0);
        }
        for (int i = 1; i <= n; i++)
        {
            int k = z.next();
            counts.put(k, counts.get(k) + 1);
        }
        return counts;
    }

    private static Map<Integer, Integer> computeCounts(
        FastZipfGenerator z, int size, int n)
    {
        Map<Integer, Integer> counts = new LinkedHashMap<Integer, Integer>();
        for (int i = 1; i <= size; i++)
        {
            counts.put(i, 0);
        }
        for (int i = 1; i <= n; i++)
        {
            int k = z.next();
            counts.put(k, counts.get(k) + 1);
        }
        return counts;
    }
}
// Based on http://diveintodata.org/tag/zipf/
class ZipfGenerator {
    private Random rnd = new Random(0);
    private int size;
    private double skew;
    private double bottom = 0;

    public ZipfGenerator(int size, double skew) {
        this.size = size;
        this.skew = skew;
        for (int i = 1; i <= size; i++) {
            this.bottom += (1 / Math.pow(i, this.skew));
        }
    }

    // The next() method returns a random rank id.
    // The frequency of the returned rank ids follows the Zipf distribution.
    public int next() {
        int rank;
        double frequency = 0;
        double dice;

        rank = rnd.nextInt(size) + 1;
        frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        dice = rnd.nextDouble();

        while (!(dice < frequency)) {
            rank = rnd.nextInt(size) + 1;
            frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
            dice = rnd.nextDouble();
        }
        return rank;
    }

    // This method returns the probability that the given rank occurs.
    public double getProbability(int rank) {
        return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
    }
}
class FastZipfGenerator
{
    private Random random = new Random(0);
    private NavigableMap<Double, Integer> map;

    FastZipfGenerator(int size, double skew)
    {
        map = computeMap(size, skew);
    }

    private static NavigableMap<Double, Integer> computeMap(
        int size, double skew)
    {
        NavigableMap<Double, Integer> map =
            new TreeMap<Double, Integer>();

        // Compute the normalization constant
        double div = 0;
        for (int i = 1; i <= size; i++)
        {
            div += (1 / Math.pow(i, skew));
        }

        // Store the cumulative probability for each rank
        double sum = 0;
        for (int i = 1; i <= size; i++)
        {
            double p = (1.0d / Math.pow(i, skew)) / div;
            sum += p;
            map.put(sum, i - 1);
        }
        return map;
    }

    public int next()
    {
        // Look up the rank whose cumulative probability
        // interval contains the random value
        double value = random.nextDouble();
        return map.ceilingEntry(value).getValue() + 1;
    }
}
It prints a random sample result (basically, a "histogram") and some timing results. The timing results are something like

duration 6221.835052
duration 304.761282

showing that the map-based approach will most likely be faster (even though this should not be considered a "benchmark"...).