Peracetic Acid (PAA) Disinfection Dosage Control


Project overview video

Define Problem:

As WRRFs look to upgrade old gaseous chlorine or liquid sodium hypochlorite-based disinfection processes, peracetic acid (PAA) disinfection has become an attractive retrofit option that produces fewer carcinogenic disinfection byproducts (DBPs) than chlorine-based disinfection. However, PAA is significantly more expensive than conventional sodium hypochlorite; thus, precise dose control is key to minimizing cost and climate change impacts. Unfortunately, PAA is difficult to model at full scale. For example, the Metro Water Recovery (MWR) Robert W. Hite Treatment Facility (RWHTF) has found both the monitoring and optimization of PAA difficult; analyzers are expensive to maintain and only capture some of the dynamic disinfection kinetics of PAA, which are significantly impacted by water quality. The consequence is that excessive PAA dosing is common, especially in the summer months. Therefore, data-driven models are needed to better understand and quantify real-time disinfection performance and realize chemical reductions.

Get Data:

A total of 1,236 PAA and E. coli samples were taken throughout the length of the disinfection basin (8 locations) over 126 sampling events between August 2021 and June 2022, specifically for this study. Data that are normally collected by the facility for regulatory or process purposes were also utilized: 12 in-line sensors and analyzers (primarily flows and concentrations measured in the activated sludge system) and 10 laboratory analytes (primarily daily regulatory concentrations measured in the final effluent) that together measured water quantity and quality up- and downstream of the disinfection system. Halfway through the study, a new sensor began being piloted to approximate E. coli concentrations; thus, a second study was done with half of the available laboratory data but including this new sensor.

Prepare Data:

PAA concentration measurements for each sampling event (one sample at each location, taken within 1-2 minutes) were used to fit an exponential decay curve; thus, preparation of the PAA data required consideration of how it would affect the exponential curve fit. PAA measurements below the detection limit were removed. The alternative approach, replacing laboratory measurements below the detection limit with either 0 or the detection limit, resulted in a skewed exponential model fit. Additionally, if the difference between two subsequent PAA measurements was an increase of more than 0.3 mg/L (i.e., the PAA concentration immediately downstream of an upstream measurement increased by more than 0.3 mg/L), the upstream sample was removed. The upstream sample was selected for removal after multiple instances suggested that non-ideal flow conditions (e.g., short circuiting) at the initial sampling locations led to non-representative PAA concentrations. Exponential decay curves were fit to the final subset of PAA concentrations (901 observations) for each sampling event. From this curve and a hydraulic retention time calculated from flow, the integrated CT (iCT) was calculated. iCT represents the time and active concentration over which pathogens are in contact with PAA, and it has a known mechanistic relationship with E. coli log removal.
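Under a first-order decay assumption, the curve fit and iCT integral can be sketched as follows (illustrative Python; the fitting routine, function names, and units are assumptions, not the project's exact implementation):

```python
import math

def fit_decay(times, concs):
    """Fit C(t) = C0 * exp(-k * t) by linear least squares on ln(C).

    times: minutes since PAA dosing; concs: PAA concentrations (mg/L),
    already filtered of below-detection and non-representative samples.
    """
    n = len(times)
    ys = [math.log(c) for c in concs]
    t_bar = sum(times) / n
    y_bar = sum(ys) / n
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, ys)) \
            / sum((t - t_bar) ** 2 for t in times)
    k = -slope                             # first-order decay rate (1/min)
    c0 = math.exp(y_bar - slope * t_bar)   # fitted initial concentration (mg/L)
    return c0, k

def integrated_ct(c0, k, hrt):
    """iCT = integral of C(t) from 0 to HRT = (C0/k)*(1 - exp(-k*HRT)),
    in mg*min/L, using the flow-derived hydraulic retention time."""
    return (c0 / k) * (1.0 - math.exp(-k * hrt))
```

Because the decay is assumed first-order, the integral has the closed form above and no numerical integration is needed.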

E. coli data were transformed to log removal based on the E. coli concentration measured immediately prior to PAA dosing during each sampling event. To perform the log removal calculation, E. coli concentrations in the disinfection basin had to be above the detection limit. This filter resulted in a total subset of 1,223 E. coli observations.
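The log-removal transform and detection-limit filter can be sketched as (the detection-limit value below is a placeholder, not the facility's actual limit):

```python
import math

def log_removal(c_upstream, c_basin, detection_limit=1.0):
    """E. coli log removal relative to the pre-dosing concentration.

    Returns None when the basin measurement is at or below the detection
    limit, mirroring the filter described above.
    """
    if c_basin <= detection_limit:
        return None
    return math.log10(c_upstream / c_basin)
```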

Instrumentation data were averaged over the 1-hour interval closest to the PAA and E. coli collection times. Additional laboratory data were interpolated to the latest PAA and E. coli collection time using the last-observation-carried-forward (LOCF) method, in order to prevent future information that would not be available during a true forecast from being used for model training.
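The LOCF alignment amounts to taking the most recent lab value at or before each collection time, never a later one (illustrative sketch; the actual pipeline may differ):

```python
def locf(lab_times, lab_values, target_time):
    """Last observation carried forward: return the most recent lab value
    at or before target_time, so no future information leaks into the
    feature set. lab_times must be sorted ascending."""
    carried = None
    for t, v in zip(lab_times, lab_values):
        if t > target_time:
            break
        carried = v
    return carried
```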

Kernel density estimation (KDE) was used to identify outliers for each variable (feature). KDE fits a probability density curve to the dataset; the flexibility of the curve is determined by its bandwidth. Bandwidth estimators were initially used, but demonstrated overfitting behavior. Instead, a bandwidth was selected for each feature that is 1/10th the range of values (similar to default histogram behavior). This, combined with an alpha value of 0.95, resulted in a two-sided confidence interval of ‘normal’ values. Values above or below the confidence interval were considered outliers. This was validated against several known outliers in features such as TSS and OP. 
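The fixed-bandwidth KDE outlier screen described above can be sketched as follows (a pure-Python approximation; the grid size and tail inversion are assumptions, not the project's exact implementation):

```python
import math

def kde_outlier_bounds(values, alpha=0.95, n_grid=2000):
    """Flag outliers with a fixed-bandwidth Gaussian KDE.

    Bandwidth is 1/10th of the data range (the heuristic described above);
    the 'normal' interval is where the KDE's numeric CDF sits between
    (1-alpha)/2 and 1-(1-alpha)/2. Values outside it are outliers.
    """
    lo, hi = min(values), max(values)
    bw = (hi - lo) / 10.0
    # Evaluate the KDE on a grid extended 3 bandwidths past the data range.
    g0, g1 = lo - 3 * bw, hi + 3 * bw
    step = (g1 - g0) / (n_grid - 1)
    grid = [g0 + i * step for i in range(n_grid)]
    dens = [sum(math.exp(-0.5 * ((x - v) / bw) ** 2) for v in values)
            / (len(values) * bw * math.sqrt(2 * math.pi)) for x in grid]
    # Numeric CDF via a left Riemann sum, then invert at the two tails.
    total = sum(dens) * step
    tail = (1 - alpha) / 2
    cum, lower, upper = 0.0, grid[0], grid[-1]
    for x, d in zip(grid, dens):
        cum += d * step / total
        if cum < tail:
            lower = x
        if cum < 1 - tail:
            upper = x
    return lower, upper
```

In practice a library estimator (e.g., one that accepts an absolute bandwidth) would replace the hand-rolled density loop; the sketch only shows why a fixed, range-based bandwidth avoids the overfitting seen with automatic bandwidth estimators.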

Training Models:

Linear regression, adaptive lasso with diurnal model features, and XGBoost were used to predict iCT, E. coli concentrations, and E. coli removal. Linear regression represented a simplified baseline condition. Adaptive lasso is a linear regression method with built-in feature selection; in this case, sine and cosine terms were added as features to represent common diurnal trends in water and wastewater treatment, primarily to test time-dependent behavior that was observed in some water quality sensors. The machine learning (ML) approach explored was XGBoost, a tree-based method in which each subsequent regression tree is fit to the total error (residuals) of the preceding trees. As a tree-based approach, XGBoost has been shown to be effective with (1) laboratory data that may include measurement variability or noisy sensor data and (2) model output (response) behavior that differs significantly depending on the range of input values.
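The diurnal sine/cosine features added for the adaptive lasso can be built from the hour of day (assuming a single 24-hour harmonic; the actual feature set may include more terms):

```python
import math

def diurnal_features(hour_of_day):
    """Sine/cosine encoding of a 24-hour cycle. The paired terms let a
    linear model fit a daily oscillation of any phase, and hour 0 and
    hour 24 map to the same point, so there is no midnight discontinuity."""
    angle = 2 * math.pi * hour_of_day / 24.0
    return math.sin(angle), math.cos(angle)
```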


Model training included both random and time-based 10-fold cross-validation, with 80% of the data used for model training and hyper-parameter tuning and the remaining 20% held out for testing. The time-based split was initially hypothesized to perform worse than random selection, since random splits can benefit from data leakage. However, the poor performance of models using the time-based split (Figure 1) ended up being dominated by the small sample size (and lack of process variability) of the initial training windows. Therefore, both random cross-validation and cross-validation based on a cumulative density function (CDF) of the response variable (e.g., predicted E. coli) were explored.
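A CDF-based split can be sketched by ranking observations by their response value and dealing the ranks round-robin into folds, so every fold spans the full response range (illustrative; the actual split logic may differ):

```python
def cdf_folds(response, n_folds=10):
    """Assign each observation to a fold by its rank in the empirical CDF
    of the response: sort by value, then deal indices round-robin, so
    every fold covers low, middle, and high response values."""
    order = sorted(range(len(response)), key=lambda i: response[i])
    assignment = [0] * len(response)
    for rank, idx in enumerate(order):
        assignment[idx] = rank % n_folds
    return assignment
```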

Figure 1. Training and testing split by time for pre-disinfection E. coli

Model Results:

Many different model types were considered (Figure 2); the lowest testing error that also converged with the training error was achieved by XGBoost trained on the entire profile period using CDF cross-validation.

Figure 2. RMSE training and testing errors for various model types, data (i.e., subset to Coliminder instrument, or all data excluding Coliminder), and training split (i.e., random vs CDF).

Deploy Model:

When the pre-disinfection E. coli model was deployed in 2023, model performance diverged from development results: predictions of influent E. coli concentrations tended to do no better on the test data than the status quo method of approximation (a rolling 3-day average of previous E. coli measurements).
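The status quo "persistence" baseline is just a trailing average of the most recent daily measurements, e.g. (window handling here is an assumption):

```python
def persistence_forecast(history, window=3):
    """Rolling mean of the last `window` daily E. coli measurements,
    used as the prediction for the next day (the status quo baseline)."""
    recent = history[-window:]
    return sum(recent) / len(recent)
```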

Figure 3. Prediction of the existing modeling approach (‘persistence’) and the proposed machine learning model (‘XGBoost’) vs the actual values of influent E. coli.

This was likely the result of a training dataset that did not contain sufficient variability in E. coli (or in some of the water-quality input features), combined with overfitting of the XGBoost model to that small range of variability.

Figure 4. Histogram of data collected during sampling campaign (2021-2022) and all available historical data for influent E. coli.

The prediction of iCT is impossible to validate without a second large sampling campaign. However, when XGBoost predictions of iCT are compared to the existing method of calculating CT (i.e., assuming a static decay rate k), the ML prediction appears to better follow the expected relationship between CT and log removal.

Figure 5. Current iCT approximation at MWR (‘persistence’) compared to the XGBoost prediction and the mechanistic model fit using all historical data.

Code overview video. Link to access code.