java, hbase

Difference between FilterList with RowFilter vs MultiGet on HBase


During our implementation for fetching multiple records from an HBase table, we came across a discussion about the best way to get the records out.

The first implementation is something like:

      FilterList filterList = new FilterList(Operator.MUST_PASS_ONE);
      for (String rowKey : rowKeys) {
        filterList.addFilter(new RowFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(rowKey))));
      }

      Scan scan = new Scan();
      scan.setFilter(filterList);
      ResultScanner resultScanner = table.getScanner(scan);

and the second implementation is something like this:

      List<Get> listGet = rowKeys.stream()
          .map(rowKey -> new Get(Bytes.toBytes(rowKey)))
          .collect(Collectors.toList());
      Result[] results = table.get(listGet);

The only difference I see directly is that the FilterList approach would do a full table scan, whereas the multi-get would not.

But what other benefits does one have over the other? Also, when HBase sees that all the filters in the FilterList are RowFilters, would it perform some kind of optimization and do a multi-get instead of a full table scan?


Solution

  • TLDR: It depends on the number of rows (both read and wanted), the number of filters, and how close together the rows you are looking for are.

    But what other benefits does one have over the other?

    Generally, they serve different purposes. If you want to read the vast majority of the data and skip only a few rows, use a Scan with a filter. If you instead want only a couple of rows out of a large table, use a multi-get.

    When I was searching for an answer I found a discussion about HBase multi-get vs. scan with RowFilter. These are the main points:

    If the number of Gets in the multi-get is very small compared to the total number of rows, it's better to use the multi-get. However, if you are able to specify start and stop rows for the Scan, the scan is faster, because you limit the number of rows that will be read:

    new Scan().withStartRow(startRow).withStopRow(stopRow)
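
    For illustration, here is a minimal sketch of such a bounded scan, using the startRow and stopRow placeholders from the snippet above (assumed to be byte[] keys that bracket the rows you need); rows outside that range are never read from the region servers:

      Scan boundedScan = new Scan()
          .withStartRow(startRow)        // inclusive by default
          .withStopRow(stopRow, true);   // pass true to make the stop row inclusive as well

      try (ResultScanner scanner = table.getScanner(boundedScan)) {
        for (Result result : scanner) {
          // only rows within [startRow, stopRow] reach the client
        }
      }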
    

    Also, when HBase sees that all the filters in the FilterList are RowFilters, would it perform some kind of optimization and do a multi-get instead of a full table scan?

    No, I don't think it does any such optimization. If anything, a large number of filters will slow the scan down, because every row has to be evaluated against every filter. See the FilterList documentation:

    FilterList.Operator.MUST_PASS_ONE evaluates non-lazily: all filters are always evaluated.
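
    Since HBase will not derive the scan boundaries from the RowFilters for you, one option is to compute them yourself and keep the FilterList only to reject the unwanted rows that fall between the keys you asked for. A rough sketch, reusing rowKeys, filterList and table from the question (this is something you have to wire up yourself, not anything the HBase API does automatically):

      // Sort the wanted keys so that the first and last one bound the scan range.
      List<byte[]> sortedKeys = rowKeys.stream()
          .map(Bytes::toBytes)
          .sorted(Bytes.BYTES_COMPARATOR)
          .collect(Collectors.toList());

      Scan boundedScan = new Scan()
          .withStartRow(sortedKeys.get(0))
          .withStopRow(sortedKeys.get(sortedKeys.size() - 1), true)
          .setFilter(filterList);   // still needed to drop rows lying between the wanted keys

      try (ResultScanner scanner = table.getScanner(boundedScan)) {
        for (Result result : scanner) {
          // process a matching row
        }
      }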