python, numpy, performance, runtime

Can I decrease the numeric precision of scikit-learn's predict_proba() method?


The question behind the question is that I am looking to decrease the runtime of, and the computational resources expended by, my scikit-learn model when using it in production.

When I run predict_proba(), it generates an array of probabilities. Is there a way to have the method use fewer resources by not producing the result to 16 decimals of precision but to, say, 8 decimals instead? The comparison I'm thinking of is quantizing LLMs.

Rounding afterward does not accomplish this, since predict_proba(), where I'm facing the bottleneck, has already run.

To give an example, assume I've already queried rows from a table in my database and loaded them into a DataFrame I've called dat.

I then load an XGB classifier (which was trained using scikit-learn) and prepare the data:

import pandas as pd
import xgboost as xgb

# load the pre-trained classifier from disk
clf = xgb.XGBClassifier()
clf.load_model('model_xgb.json')

# one-hot encode the queried rows (dat) into the model's feature matrix
X = pd.get_dummies(dat, dtype=int)

Then I run predict_proba() on that classifier:

preds = clf.predict_proba(X)[:,1]

Solution

  • Goal #1 :

    "(...) trying to get predictions efficiently from millions of rows of tabular data"

    Details omitted for brevity, available upon request.

    Given the silicon designs of contemporary CPU/SoC/GPU devices, do not expect any remarkable boost in "efficiency" from decimating the numerical representation of values. During years of processing datasets more than three orders of magnitude "larger", with a feature-space several orders of magnitude "deeper" ( [M,N] ~ [1E9+, 1E4+] ), numpy-based processing of such datasets in both Scikit RF- and XGB-based predictors ran markedly slower when operated in float32 than in the (then) native float64. This is not the way ahead. A minimal timing sketch for checking this on your own model follows below.
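
    A self-contained micro-benchmark for checking the float32-vs-float64 effect could look like the sketch below; it trains a throw-away synthetic model, so all shapes, seeds and hyper-parameters are illustrative assumptions, not anyone's production settings:

    import time
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)

    # train a small, purely synthetic stand-in model
    X_train = rng.random((100_000, 50))
    y_train = (X_train[:, 0] > 0.5).astype(int)
    clf = xgb.XGBClassifier(n_estimators=100, max_depth=6)
    clf.fit(X_train, y_train)

    # the same scoring rows, once in float64 (numpy default), once in float32
    X64 = rng.random((1_000_000, 50))
    X32 = X64.astype(np.float32)

    for name, X in (("float64", X64), ("float32", X32)):
        t0 = time.perf_counter()
        _ = clf.predict_proba(X)[:, 1]
        print(f"{name}: {time.perf_counter() - t0:.3f} s")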

    Given the mutual independence of samples along the M-dimension, given the postulated tools ( Python, x86-architecture hardware ) and given the ex-post added remark on sizings ( leaving aside a potential slight effect of remaining in the default bool dtype along the N-dimension ), the only truly remarkable speedup for production may come from organisation: take more machines ( ref. the Python GIL, ref. memory-I/O bottlenecks known to starve cores no matter how many of them the CPU has in silicon ), equip each with the same pre-trained XGB-predictor, and split the work on the "millions of rows" among this coalition of machines ( a single-host sketch of this split-and-score pattern follows below ).
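
    For illustration only, the split-and-score pattern can be approximated on a single host with process-based workers standing in for separate machines; the chunking helper and the worker count below are assumptions, while the model file name is taken from the question:

    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from concurrent.futures import ProcessPoolExecutor

    MODEL_PATH = 'model_xgb.json'        # every worker loads the same pre-trained model

    def score_chunk(chunk: pd.DataFrame) -> np.ndarray:
        clf = xgb.XGBClassifier()
        clf.load_model(MODEL_PATH)
        return clf.predict_proba(chunk)[:, 1]

    def score_in_parallel(X: pd.DataFrame, n_workers: int = 4) -> np.ndarray:
        # split the rows into one contiguous block per worker
        bounds = np.linspace(0, len(X), n_workers + 1, dtype=int)
        chunks = [X.iloc[bounds[i]:bounds[i + 1]] for i in range(n_workers)]
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            parts = list(pool.map(score_chunk, chunks))
        return np.concatenate(parts)     # contiguous chunks, so row order is preserved

    On real, physically separate machines each worker would read its own slice of X from local or shared storage instead of receiving it from a parent process, which is exactly the SER/DES point raised under the implementation constraints below.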

    Results :

    Expected speedups are driven by the (revised) Amdahl's Law (more details on this, with interactive tools, are available); a toy estimate is sketched below.
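
    As a hedged sketch (not the exact revised formulation referenced above), one common overhead-aware form of Amdahl's Law simply adds the add-on overheads of distributing the work to the classical serial / parallel split; all numbers below are placeholders:

    def amdahl_speedup(p: float, n: int, overheads: float = 0.0) -> float:
        """Speedup on n workers for a workload whose fraction p can be
        parallelised, with add-on overheads (setup, transfers, result
        collection) expressed as a fraction of the original runtime."""
        return 1.0 / ((1.0 - p) + p / n + overheads)

    # e.g. 95 % parallelisable work, 8 machines, 2 % total add-on overheads
    print(amdahl_speedup(p=0.95, n=8, overheads=0.02))   # ~ 5.3x, not 8x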

    Additional positive side-effect on achievable speedup :

    Given that a single typical x86-architecture machine cannot move more than 2, 3 or 4 memory-I/O operations at once ( oversimplified here for brevity ), being limited by the DDR hardware, there is a positive side-effect: n-times more machines used in the distributed processing will enjoy n-times more free-flowing memory-I/O channels than if the same work were operated on one PC host. The physically distributed work therefore will not suffer from the NOPs introduced by the otherwise unavoidable in-silicon memory-I/O blocking.

    Implementation constraints :

    Add-on overheads shall be kept low by pre-organising the actual representation of the X data-storage and the data-flows of the parts of X that "feed" the distributed XGB-predictors on performance-tuned machines, typically interconnected on the same colocated VLAN, and ideally by avoiding all unnecessary SER/DES overheads ( details matter; the on-campus HPC DataCenter Technical Support can guide your implementation to best harness the available infrastructure ). A minimal partitioning sketch follows below.
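
    As a hedged sketch of such pre-organisation (the output directory, the number of machines and the Parquet format are assumptions chosen purely for illustration):

    import numpy as np
    import pandas as pd

    def partition_for_machines(X: pd.DataFrame, n_machines: int, out_dir: str) -> None:
        """Write one contiguous row-slice of X per machine to shared storage,
        so each machine later reads only its own part and no per-request
        SER/DES of the full table crosses the network."""
        bounds = np.linspace(0, len(X), n_machines + 1, dtype=int)
        for i in range(n_machines):
            X.iloc[bounds[i]:bounds[i + 1]].to_parquet(f"{out_dir}/X_part_{i:03d}.parquet")

    # machine i then loads and scores only its own slice:
    #   X_i = pd.read_parquet(f"{out_dir}/X_part_{i:03d}.parquet")
    #   preds_i = clf.predict_proba(X_i)[:, 1]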