c++algorithmsortinglanguage-agnosticiterator

Get next number from range in lexicographical order (without container of all stringifications)


How have/would you design an function that on each call returns the next value in a nominated numeric range in lexicographical order of string representation...?

Example: range 8..203 --> 10, 100..109, 11, 110..119, 12, 120..129, 13, 130..139, ..., 19, 190..199, 20, 200..203, 30..79, 8, 80..89, 9, 90..99.

Constraints: indices 0..~INT_MAX, fixed space, O(range-length) performance, preferably "lazy" so if you stop iterating mid way you haven't wasted processing effort. Please don't post brute force "solutions" iterating numerically while generating strings that are then sorted.

Utility: if you're generating data that ultimately needs to be lexicographically presented or processed, a lexicographical series promises lazy generation as needed, reduces memory requirements and eliminates a sort.

Background: when answering this question today, my solution gave output in numeric order (i.e. 8, 9, 10, 11, 12), not lexicographical order (10, 11, 12, 8, 8) as illustrated in the question. I imagined it would be easy to write or find a solution, but my Google-foo let me down and it was trickier than I expected, so I figured I'd collect/contribute here....

(Tagged C++ as it's my main language and I'm personally particularly interested in C++ solutions, but anything's welcome)

Somebody voted to close this because I either didn't demonstrate a minimal understanding of the problem being solved (hmmmm!?! ;-P), or an attempted solution. My solution is posted as an answer as I'm happy for it to be commented on and regailed in the brutal winds of Stack Overflow wisdom.... O_o


Solution

  • This is a bit of a mess, so I'm curious to see how other people tackle it. There are so many edge cases explicitly handled in the increment operator!

    For range low to high:

     

    template <typename T>
    std::string str(const T& t)
    {
        std::ostringstream oss; oss << t; return oss.str();
    }
    
    template <typename T>
    class Lex_Counter
    {
      public:
        typedef T value_type;
    
        Lex_Counter(T from, T to, T first = -1)
          : from_(from), to_(to),
            min_size_(str(from).size()), max_size_(str(to).size()),
            n_(first != -1 ? first : get_first()),
            max_unit_(pow(10, max_size_ - 1)), min_unit_(pow(10, min_size_ - 1))
        { }
    
        operator T() { return n_; }
    
        T& operator++()
        {
            if (n_ == 0)
                return n_ = 1;
            if (n_ < max_unit_ && n_ * 10 <= to_)
                return n_ = n_ * 10; // e.g. 10 -> 100, 89 -> 890
            if (n_ % 10 < 9 && n_ + 1 <= to_)
                return ++n_;         // e.g. 108 -> 109
            if (min_size_ == max_size_
                ? n_ == to_
                : (n_ == max_unit_ - 1 && to_ < 10 * max_unit_ - 10 || // 99/989
                   n_ == to_ && to_ >= 10 * max_unit_ - 10))   // eg. 993
                return n_ = to_ + 1;
    
            // increment the right-most non-9 digit
            // note: all-9s case handled above (n_ == max_unit_ - 1 etc.)
            // e.g. 109 -> 11, 19 -> 2, 239999->24, 2999->3
    
            // comments below explain 230099 -> 230100
    
            // search from the right until we have exactly non-9 digit
            for (int k = 100; ; k *= 10)
                if (n_ % k != k - 1)
                {
                    int l = k / 10; // n_ 230099, k 1000, l 100
                    int r = ((n_ / l) + 1) * l; // 230100
                    if (r > to_ && r / 10 < from_)
                        return n_ = from_; // e.g. from_ 8, r 20...
                    while (r / 10 >= from_ && r % 10 == 0)
                        r /= 10; // e.g. 230100 -> 2301
                    return n_ = r <= from_ ? from_ : r;
                }
            assert(false);
        }
    
      private:
        T get_first() const
        {
            if (min_size_ == max_size_ ||
                from_ / min_unit_ < 2 && from_ % min_unit_ == 0)
                return from_;
    
            // can "fall" from e.g. 321 to 1000
            return min_unit_ * 10;
        }
        T pow(T n, int exp)
            { return exp == 0 ? 1 : exp == 1 ? n : 10 * pow(n, exp - 1); }
        T from_, to_;
        size_t min_size_, max_size_;
        T n_;
        T max_unit_, min_unit_;
    };
    

    Performance numbers

    I can count from 0 to 1 billion in under a second on a standard Intel machine / single threaded, MS compiler at -O2.

    The same machine / harness running my attempt at Shahbaz's solution - below - takes over 3.5 second to count to 100,000. Maybe the std::set isn't a good heap/heap-substitute, or there's a better way to use it? Any optimisation suggestions welcome.

    template <typename T>
    struct Shahbaz
    {
        std::set<std::string> s;
        Shahbaz(T from, T to)
          : to_(to)
        {
            s.insert(str(from));
            for (int n = 10; n < to_; n *= 10)
                if (n > from) s.insert(str(n));
            n_ = atoi(s.begin()->c_str());
        }
    
        operator T() const { return n_; }
    
        Shahbaz& operator++()
        {
            if (s.empty())
                n_ = to_ + 1;
            else
            {
                s.erase(s.begin());
                if (n_ + 1 <= to_)
                {
                    s.insert(str(n_ + 1));
                    n_ = atoi(s.begin()->c_str());
                }
            }
            return *this;
        }
    
      private:
        T n_, to_;
    };
    

    Perf code for reference...

    void perf()
    {
        DWORD start = GetTickCount();
        int to = 1000 *1000;
        // Lex_Counter<int> counter(0, to);
        Shahbaz<int> counter(0, to);
    
        while (counter <= to)
            ++counter;
        DWORD elapsed = GetTickCount() - start;
        std::cout << '~' << elapsed << "ms\n";
    }