c++matrixdouble-precision

Handling loss of precision in subtracting two doubles that are close to each other


I have a project to do where we are to solve the matrix equation AX=B for x, given that A is a tridiagonal matrix. I did this project in C++, got the program to produce the right Matrix X, but when trying to report the error back to the user, A*X-B, I get an erroneous error!! It is due to the fact that I am subtracing A*X and B, whose entries are arbitrarily close to each other. I had two ideas on how to handle this, element-by-element:

  1. According to this article, http://en.wikipedia.org/wiki/Loss_of_significance, there could be as many as -log2(1-y/x) bits lost in straight subtraction x-y. Let's scale both x and y by pow(2,bitsLost), subtract the two, and then scale them back down by dividing by pow(2,bitsLost)
  2. Stressed so much in the numeric methods course this is for: take the arithmetic conjugate! Instead of double difference = x-y; use double difference = (x*x-y*y)/(x+y);

OK, so why haven't you chose a method and moved on?

I tried all three methods (including straight subtraction) here: http://ideone.com/wfkEUp . I would like to know two things:

  1. Between the "scaling and descaling" method (for which I intentionally chose a power of two) and the arithmetic conjugate method, which one produces less error (in terms of subtracting the large numbers)?
  2. Which method is computationally more efficient? /*For this, I was going to say the scaling method was going to be more efficient with a linear complexity versus the seemed quadratic complexity of the conjugate method, but I don't know the complexity of log2()*/

Any and all help would be welcomed!!

P.S.: All three methods seem to return the same double in the sample code...


Let's see some of your code No problem; here is my Matrix.cpp code

#include "ExceptionType.h"
#include "Matrix.h"
#include "MatrixArithmeticException.h"
#include <iomanip>
#include <iostream>
#include <vector>

Matrix::Matrix()
{
    //default size for Matrix is 1 row and 1 column, whose entry is 0
    std::vector<long double> rowVector(1,0);
    this->matrixData.assign(1, rowVector);
}

Matrix::Matrix(const std::vector<std::vector<long double> >& data)
{
    this->matrixData = data;
    //validate matrixData
    validateData();
}

//getter functions
//Recall that matrixData is a vector of a vector, whose elements should be accessed like matrixData[row][column].
//Each rowVector should have the same size.
unsigned Matrix::getRowCount() const { return matrixData.size(); }

unsigned Matrix::getColumnCount() const { return matrixData[0].size(); }

//matrix validator should just append zeroes into row vectors that are of smaller dimension than they should be...
void Matrix::validateData()
{
    //fetch the size of the largest-dimension rowVector
    unsigned largestSize = 0;
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        if (largestSize < matrixData[i].size())
            largestSize = matrixData[i].size();
    }
    //make sure that all rowVectors are of that dimension
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        //if we find a rowVector where this isn't the case
        if (matrixData[i].size() < largestSize)
        {
            //add zeroes to it so that it becomes the case
            matrixData[i].insert(matrixData[i].end(), largestSize-matrixData[i].size(), 0);
        }
    }

}
//operators
//+ and - operators should check to see if the size of the first matrix is exactly the same size as that of the second matrix
Matrix Matrix::operator+(const Matrix& B)
{
    //if the sizes coincide
    if ((getRowCount() == B.getRowCount()) && (getColumnCount() == B.getColumnCount()))
    {
        //declare the matrixData
        std::vector<std::vector<long double> > summedData = B.matrixData;    //since we are in the scope of the Matrix, we can access private data members
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < getColumnCount(); j++)
            {
                summedData[i][j] += matrixData[i][j];   //add the elements together
            }
        }
        //return result Matrix
        return Matrix(summedData);
    }
    else
        throw MatrixArithmeticException(DIFFERENT_DIMENSIONS);
}

Matrix Matrix::operator-(const Matrix& B)
{
    //declare negativeB
    Matrix negativeB = B;
    //negate all entries
    for (unsigned i = 0; i < negativeB.getRowCount(); i++)
    {
        for (unsigned j = 0; j < negativeB.getColumnCount(); j++)
        {
            negativeB.matrixData[i][j] = 0-negativeB.matrixData[i][j];
        }
    }
    //simply add the negativeB
    try
    {
        return ((*this)+negativeB);
    }
    catch (MatrixArithmeticException& mistake)
    {
        //should exit or do something similar
        std::cout << mistake.what() << std::endl;
    }
}

Matrix Matrix::operator*(const Matrix& B)
{
    //the columnCount of the left operand must be equal to the rowCount of the right operand
    if (getColumnCount() == B.getRowCount())
    {
        //if it is, declare data with getRowCount() rows and B.getColumnCount() columns
        std::vector<long double> zeroVector(B.getColumnCount(), 0);
        std::vector<std::vector<long double> > data(getRowCount(), zeroVector);
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < B.getColumnCount(); j++)
            {
                long double sum = 0; //set sum to zero
                for (unsigned k = 0; k < getColumnCount(); k++)
                {
                    //add the product of matrixData[i][k] and B.matrixData[k][j] to sum
                    sum += (matrixData[i][k]*B.matrixData[k][j]);
                }
                data[i][j] = sum;   //assign the sum to data
            }
        }
        return Matrix(data);
    }
    else
    {
        throw MatrixArithmeticException(ROW_COLUMN_MISMATCH); //dimension mismatch
    }
}

std::ostream& operator<<(std::ostream& outputStream, const Matrix& theMatrix)
{
    //Here, you should use the << again, just like you would for ANYTHING ELSE.
    //first, print a newline
    outputStream << "\n";
    //setting precision (optional)
    outputStream.precision(11);
    for (unsigned i = 0; i < theMatrix.getRowCount(); i++)
    {
        //print '['
        outputStream << "[";
        //format stream(optional)
        for (unsigned j = 0; j < theMatrix.getColumnCount(); j++)
        {
            //print numbers
            outputStream << std::setw(17) << theMatrix.matrixData[i][j];
            //print ", "
            if (j < theMatrix.getColumnCount() - 1)
                outputStream << ", ";
        }
        //print ']'
        outputStream << "]\n";
    }
    return outputStream;
}

Solution

  • You computed two numbers x and y which are of a limited precision floating point type. This means that they are already rounded somehow, meaning loss of precision while computing the result. If you subtract those numbers afterwards, you compute the difference between those two already rounded numbers.

    The formula you wrote gives you the maximum error for computing the difference, but this error is with regard to the stored intermediate results x and y (again: rounded). No method other than x-y will give you a "better" result (in terms of the complete computation, not only the difference). To put it in a nutshell: the difference can't be more accurate using any formula other than x-y.

    I'd suggest taking a look at arbitrary precision arithmetic math libraries like GMP or Eigen. Use such libraries for computing your equation system. Don't use double for the matrix computations. This way you can make sure that the intermediate results x and y (or the matrices Ax and B) are as precise as you want them to be, for example 512 bits, which should definitely be enough for most cases.