Total Solids Prediction: Acoustic Sensing and ML for Holistic Biosolids Optimization


Define Problem:

The D.C. Water Blue Plains Advanced Wastewater Treatment Plant contains a side stream process that takes solids from the mainstream processes and converts them into biogas and a biosolids product used as a fertilizer. This process requires the accurate setting of dilution water at multiple locations, including the pulper feedline (pre-dewatering, ahead of the THP feed; target 16.5% TS), the digester feedline (10.5% TS), and the final dewatering feedline (3.5% TS). The final dewatering process additionally requires polymer dosing to maintain the correct total solids concentration entering the belt filter press. Extra attention must be paid to this step: overdosing wastes polymer, while underdosing causes machine failures that require significant operator cleaning time.

Figure 1. D.C. Water Blue Plains AWTP diagram of side stream process to create a Class A biosolids product. This work focuses on the acoustic sensors at the pulper feedline, digester feedline, and final dewatering location.

The measurement of total solids is the main piece of information used to set dilution water and polymer dosage, and it is typically obtained through time-intensive laboratory procedures. Using a non-contact acoustic soft sensor with machine learning, we are able to estimate the total solids concentration within a pipe without requiring operator sensor maintenance; other total solids sensors are mounted on the inside of the pipes and foul very quickly. Machine learning is necessary for this estimation because the non-contact acoustic sensors do not output total solids directly; their signals are only known to be correlated with total solids.

Go/No-Go Decision: Machine learning may not be a viable solution to your problem, depending on the accuracy of your data, the accuracy of current process models, and your existing data infrastructure.


Get Data:

Acoustic measurements, flow measurements, and total solids laboratory samples were taken from the digester in the treatment plant. The acoustic measurements are recorded using a passive acoustic sensor (Brüel & Kjær type 4534-B-001) glued with industrial adhesive directly to a pipe in the full-scale system. The flow measurements are an existing plant data stream recorded with a sensor. Total solids samples were taken weekly under normal operating conditions at ~9am from September 2021 to May 2023. Additional samples were then taken at the digester under operator-induced experimental conditions over the summer of 2023, during which the range of total solids values was pushed to the edge of safe conditions over a period of ~6 hours.

Figure 2. 4534-B-001 Brüel & Kjær Company acoustic sensor glued to a pipe in the Blue Plains facility

After collecting all of the total solids measurements to be used to train and test the future model, the histogram of the data should be plotted to ensure that there is a sufficient range and distribution of samples. Under normal operating conditions, the range of total solids measurements was 7-12% total solids; this range excluded any extreme conditions occurring at the digester.

Machine learning algorithms are known to perform poorly outside of the range of their training data, meaning the final model would not be able to accurately interpret extreme conditions observed during deployment if it were trained exclusively on normal conditions. The digester is also a controlled process, so normal operating conditions span a very small range, and the standard deviation of the distribution of samples is close to both the goal error of the model and the error in the laboratory measurements. Under these circumstances, a model is likely to simply predict the mean of the data rather than an underlying pattern. Since the goal of many ML models is the prediction of aberrations in underlying conditions, the training data should include these conditions as well.

Getting Enough or “Right” Data:

An experiment was conducted to combat this issue by increasing the range of total solids measurements observed and ensuring there was an even distribution of normal operating condition measurements and extreme measurements. Essentially, we ran the process across a wide range of conditions to ensure that the training data would reflect the types of conditions we would also like to predict. 

This may not be feasible for many other applications, since taking other processes outside of their control regime may lead to unintended consequences. If you intend to use ML for your project, make sure the data are “interesting.” In other words, if the parameter is tightly controlled, you may not have enough data to predict anything outside of that range.

The final dataset of total solids and flow measurements is plotted in Python (shown below) before proceeding with data processing, to ensure that the new dataset from the operator experiments meets the criterion of a standard deviation greater than both the goal model error and the error in the laboratory measurements.


#Plotting the ranges of total solids and flow observed in the dataset
import matplotlib.pyplot as plt

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,5))

ax1.hist(y);
ax1.set_xlabel("TS(%)");
ax1.set_ylabel("Frequency");

ax2.hist(Flow);
ax2.set_xlabel("Flow");
ax2.set_ylabel("Frequency");

ax3.hist(Flow_Smooth);
ax3.set_xlabel("Smoothed Flow");
ax3.set_ylabel("Frequency");

Figure 3. Histograms of data measured during the experiment’s total solids, flow, and smoothed flow (left to right)
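Beyond the visual check, the same criterion can be expressed numerically by comparing the spread of the measured total solids against the error targets. The brief sketch below assumes y holds the total solids values; goal_rmse and lab_error are placeholder values, not the targets used in this work.

#Comparing the variability of the total solids data against the error targets
#goal_rmse and lab_error are placeholder values, not the targets used in this work
import numpy as np

goal_rmse = 0.5   #example goal model error, % TS
lab_error = 0.3   #example laboratory measurement error, % TS

print("TS standard deviation (% TS):", np.std(y))
print("Sufficient spread:", np.std(y) > max(goal_rmse, lab_error))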

Go/No-Go: Only proceed with your machine learning algorithm if the accuracy desired in the final predictions is finer than the variability already inherent in the data. Also, ensure there is data infrastructure in place that prevents mislabeling or data loss and preserves the data in its rawest reasonable form.

Prepare Data:

The acoustic data is recorded in its raw form as sound waves covering a range of frequencies from 0 Hz to 13,100 Hz over a 0.45 s window. To reduce the number of features, the acoustic samples are projected into the frequency domain by taking the fast Fourier transform (FFT), resulting in a dataset containing the amplitude at each of 5946 frequencies.

#Defining a function that takes the FFT of the acoustic readings and scales it
import numpy as np
import pandas as pd
import scipy.fftpack

def full_fft(raw_data, N):
    y = scipy.fftpack.fft(raw_data)
    #Keep the positive-frequency half of the spectrum and scale the amplitudes
    y = (2.0/N)*np.abs(y[:N//2])
    return y

#Taking the FFT of each measurement excluding the end of the signal
#containing artifacts from the sensor
i = 0
new_fft = []

while(i < len(df.index)):
    line = df.iloc[i][1:]
    time = df.iloc[i][0]

    fft = full_fft(line[0:11892], len(line[0:11892]))
    fft = list(fft)
    fft.insert(0, time)
    new_fft.append(fft)

    i = i + 1

#Label each frequency column with its frequency in Hz
df_fft = pd.DataFrame(new_fft)
step = (26200.0/2.0)/5946.0
cols = ['DateTime'] + [(i+1)*step for i in range(5946)]
df_fft.columns = cols
df_fft.head()

Three data streams are included in this collection, and they must be merged into one dataset. The acoustic data are recorded in triplicate every 5 minutes during the experimental measurements, the flow is always recorded every minute, and total solids values are measured every 10 minutes during the experimental period, 2.5 minutes after the most recent acoustic measurement.

To merge the total solids values with the acoustic measurements, the sample time of each acoustic measurement is compared with the set of total solids measurement times, and if there is a total solids measurement within 5 minutes, it is linked to that acoustic measurement. Based on the timing of this experiment, two acoustic measurements are linked to each total solids measurement because those acoustic measurements are equidistant in time from the total solids measurement.

#Linking each acoustic measurement to the nearest total solids measurement within 5 minutes
df_ts.sort_values("DateTime", inplace=True)
df.sort_values("DateTime", inplace=True)
df_merged = pd.merge_asof(df, df_ts, on="DateTime", direction="nearest",
                          tolerance = pd.Timedelta(5, "minute"))
#Keep only the acoustic measurements that were matched to a total solids sample
df_merged = df_merged[df_merged['Location'].notna()]
df_merged.reset_index(inplace = True)

The triplicates of the acoustic measurements are then averaged into one sample per measurement time. This averaging occurs after the merging of the acoustic and total solids measurements to reduce computation time. The triplicates could also be used for outlier removal, but in this limited dataset there were no dramatic outliers, so none were removed.

#Averaging each triplicate of acoustic measurements into a single sample
df = df.sort_values(by = 'DateTime')

avg_df = []
i = 0
while(i < len(df.index)):
    #Carry over two of the non-spectral columns for this triplicate,
    #then average the remaining (spectral) columns
    df_mean = [df.iloc[i, 1], df.iloc[i, -2]]
    df_3 = df[i:i+3][:]
    df_3.drop(['index','TS', 'DateTime', 'Location'], inplace = True, axis = 1)
    df_mean_1 = df_3.mean()
    df_mean_1 = list(df_mean_1)
    df_mean = df_mean + df_mean_1
    avg_df.append(df_mean)
    i = i + 3

avg_df = pd.DataFrame(avg_df)
    
#Cleaning the flow data and merging it with the acoustic/total solids dataset
df['DateTime'] = pd.to_datetime(df['DateTime'])

df_flow = df_flow.drop_duplicates(keep = 'first', ignore_index = True)
df_flow['DateTime'] = df_flow['DateTime'].astype(str)
df_flow = df_flow[df_flow['DateTime'] != 'nan']
df_flow['DateTime'] = [x.strip().replace('/', '-') for x in df_flow["DateTime"]]

df_flow['DateTime'] = pd.to_datetime(df_flow['DateTime'], dayfirst=False)
df_flow.sort_values("DateTime", inplace=True)

#Assign each acoustic/total solids sample the closest flow reading within 30 seconds
df_filtered = pd.merge_asof(df, df_flow, left_on="DateTime", right_on="DateTime",
                            direction="nearest", tolerance = pd.Timedelta(30, "second"))

#Move the flow-related columns to the front of the dataframe
df_filtered.insert(1,'Flow',df_filtered.pop('THP DIG 3 FD CTL 1 FLOW'))
df_filtered.insert(2,'Enable',df_filtered.pop('DIG 3 CSLDS PMP 1  RUNST'))
df_filtered.insert(3, 'Flow Smooth', df_filtered.pop('Flow Smooth 1'))
#Smoothing the acoustic data with a moving average with a window of 21 steps
width_half = 10

width = 2*width_half + 1

X_smooth = []

for raw in X:
    smooth = []
    smooth.append(np.mean(raw[0:(width)]))
    i = width_half + 1
    while(i < (len(raw)-width_half)):
        smooth.append(smooth[-1] + (raw[i+width_half]/width) - (raw[i-width_half-1]/width))
        i = i + 1

    X_smooth.append(smooth)

X_smooth = np.array(X_smooth)

After merging the entire dataset into this final form containing acoustic fast Fourier transform profiles, total solids laboratory measurements, raw flow data, and smoothed flow data, the acoustic data is smoothed to remove noise. This smoothing is conducted last to limit the number of samples that must be smoothed, reducing computation time. The smoothing uses a moving average over the fast Fourier transform with a width of 46 Hz, chosen to remove noise without erasing local minima and maxima of interest.
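For reference, the 46 Hz smoothing width corresponds to the 21-bin moving average window used in the code above through the frequency resolution of the FFT. A quick check using the 13,100 Hz bandwidth and 5946 bins from the FFT step:

#Relating the 46 Hz smoothing width to the 21-bin moving average window
freq_resolution = 13100.0/5946.0            #roughly 2.2 Hz per FFT bin
window_bins = round(46.0/freq_resolution)   #roughly 21 bins, i.e. width_half = 10
print(freq_resolution, window_bins)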

An average model is created as a comparison point for the final root mean squared error, to evaluate whether the advanced models perform better than simply guessing the mean of the training dataset. If the root mean squared error of the average model is similar to the goal root mean squared error, that is a good indicator that the dataset has too little variability to support an effective model.

#The average model of the data is plotted
#The average model predicts the mean of the training y values for every point
from sklearn import metrics

calc_mean=np.mean(y_train)
n=len(y_test)
y_pred_list_mean=[calc_mean]*n
y_pred_mean=np.array(y_pred_list_mean)
y_pred_list_train = [calc_mean]*len(y_train)
y_pred_mean_train = np.array(y_pred_list_train)

rmse_mean=np.sqrt(metrics.mean_squared_error(y_test, y_pred_mean))

print('Root Mean Squared Error (Average Model):', rmse_mean)

plt.figure(figsize=(5,5))
plt.rcParams.update({'font.size': 15});
plt.scatter(y_test, y_pred_mean,label="test")
plt.scatter(y_train, y_pred_mean_train,label="train")
plt.legend()
plt.xlim([0,20]);
plt.ylim([0,20]);
plt.xlabel('Measured')
plt.ylabel('Predicted')
plt.axline([0, 0], [1, 1],color = 'black', linewidth = 0.5)
plt.title('Average Model');


The reduced dataset as processed above is then merged with the flow data, which is collected every 1 minute at the digester. Each acoustic/total solids sample is assigned the closest flow measurement in time within a range of 30 seconds. A moving average of the raw flow data is also merged with the acoustic/total solids dataset to remove some of the noise in the flow measurement.

Figure 4. Fast Fourier transform of acoustic signals measured at the digester, shown before (left) and after (right) moving average smoothing

Go/No-Go: Only proceed with modeling the dataset if the datasets have been merged carefully, so that minimal change in value is expected within the tolerance window you set, and if noise has been removed carefully enough to preserve the underlying signal's behavior.
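One way to sanity-check the first criterion is to look at how much the flow signal typically changes between consecutive 1-minute readings, since the 30-second merge tolerance assumes the value is effectively constant over that window. A brief sketch using the flow tag from the merge code above:

#How much does the flow typically change between consecutive 1-minute readings?
#Large step changes would argue for a tighter merge tolerance
flow_step_change = df_flow['THP DIG 3 FD CTL 1 FLOW'].diff().abs()
print(flow_step_change.describe())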

Train Model:

Further feature reduction is required because the number of features, 5946, is much greater than the number of samples. The feature reduction method determined to be most effective for this application is principal component analysis. This method, like other feature reduction techniques, removes the portions of a measurement that are either irrelevant or redundant, leaving only the most essential pieces of information and as little noise as possible. The result is a model that predicts the testing data points with a similar level of accuracy to the training data points, meaning the model is effectively describing the data rather than overfitting it.

#The machine learning dataset is chosen as the smoothed acoustic dataset
#X is split into training(80% of original) and testing(20% of original) datasets
#Principal component analysis is fit and used to transform the dataset into 20 features
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

X = X_smooth

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1104)

pca = PCA(n_components=20)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
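As a quick check on the choice of 20 components, the cumulative explained variance of the fitted PCA can be inspected. A brief sketch using the pca object fit above:

#Checking how much of the spectral variance the 20 principal components capture
print("Cumulative explained variance:", np.cumsum(pca.explained_variance_ratio_)[-1])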

Establishing a Baseline:

Figure 5. The predicted versus measured testing and training data when trained on a model always predicting the average of the training data.

Before moving on to advanced machine learning algorithms, it is important to exhaust the performance of the simplest models. With principal component analysis applied, linear regression performs well and does not visually overfit the data. It is a reasonable model, and a benefit of linear regression is that it can extrapolate outside of the training dataset range. However, cross-validation still needs to be conducted to compare it against more advanced machine learning algorithms.

#Linear Regression is plotted
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print('Root Mean Squared Error (Linear Regression):', 
np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.figure(figsize=(5,5))
plt.rcParams.update({'font.size': 15});
plt.scatter(y_test, y_pred,label="test")
plt.scatter(y_train, regressor.predict(X_train),label="train")
plt.legend()
plt.xlim([0,20]);
plt.ylim([0,20]);
plt.xlabel('Measured')
plt.ylabel('Predicted')
plt.axline([0, 0], [1, 1],color = 'black', linewidth = 0.5)
plt.title('Linear Regression');

Figure 6. The predicted versus measured testing and training data when trained on a linear regression model

Increasing Model Capacity:

Multiple machine learning algorithms, including partial least squares, gradient boosted regression, light gradient boosted regression, and random forest, were compared to identify the best-performing algorithm. By comparing these algorithms, the widths of moving average smoothing, and the number of principal components with cross-validation, XGBoost was identified as having the lowest root mean squared error, with a small standard deviation, when combined with the moving average smoothing algorithm and 20 principal components.
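The full comparison is not reproduced here, but the sketch below shows how such a comparison might be set up with scikit-learn cross-validation. The candidate models and their settings are illustrative placeholders, not the exact configurations used in this work; fitting the PCA inside each fold keeps the feature reduction from seeing the test fold.

#A minimal sketch of comparing candidate regressors with cross-validation
#The models and settings are illustrative, not the configurations used in this work
from sklearn.pipeline import make_pipeline
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
import xgboost as xgb

candidates = {
    "PLS": PLSRegression(n_components=10),
    "GBR": GradientBoostingRegressor(),
    "RF": RandomForestRegressor(),
    "XGB": xgb.XGBRegressor(),
}

for name, model in candidates.items():
    #Refit PCA within each fold so the feature reduction never sees the test fold
    pipe = make_pipeline(PCA(n_components=20), model)
    scores = -cross_val_score(pipe, X_smooth, y, cv=5,
                              scoring="neg_root_mean_squared_error")
    print(name, "RMSE:", scores.mean(), "+/-", scores.std())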

#XGBoost is plotted
import xgboost as xgb

regr = xgb.XGBRegressor(colsample_bytree=0.2, gamma=0.0, learning_rate=0.01, max_depth=4,
min_child_weight=1.5, n_estimators=7200, reg_alpha=0.9, reg_lambda=0.6, subsample=0.2, seed=42)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("Root Mean Squared Error (Xgboost): ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.figure(figsize=(5,5))
plt.rcParams.update({'font.size': 15});
plt.scatter(y_test, y_pred,label="test")
plt.scatter(y_train, regr.predict(X_train),label="train")
plt.legend()
plt.xlabel('Measured', fontsize=15)
plt.ylabel('Predicted', fontsize=15)
plt.axline([0, 0], [1, 1],color = 'black', linewidth = 0.5)
plt.xlim([0,15])
plt.ylim([0,15])

plt.title("xgboost",fontsize=15);

Figure 7. The predicted versus measured testing and training data when trained on an extreme gradient boosted regression model

Cross-validation is an essential test of model performance because it limits the effect of the random training-testing split of the data. Fifty iterations of cross-validation were conducted here; in each iteration the data were re-split into training and testing sets and the models were retrained. The resulting bar graph shows that the XGBoost model reduces the overall root mean squared error by about 1% total solids and has a much smaller standard deviation of root mean squared error than linear regression. These results show that XGBoost is a better choice than linear regression despite its greater complexity and its inability to extrapolate effectively.

#Cross Validation

rmse_list_avg=[]
rmse_list_LR=[]
rmse_list_XG=[]
for i in range (50):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=i)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)
    y_test_list=y_test.tolist()
    y_train_list=y_train.tolist()
    calc_mean=np.mean(y_train_list)
    n=len(y_test)
    y_pred_list_mean=[calc_mean]*n
    y_pred_mean=np.array(y_pred_list_mean)

    #Average model baseline
    rmse_mean=np.sqrt(metrics.mean_squared_error(y_test, y_pred_mean))
    rmse_list_avg.append(rmse_mean)

    #LR
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    rmse_list_LR.append(np.sqrt(metrics.mean_squared_error(y_test, regressor.predict(X_test))))

    #XG
    regr = xgb.XGBRegressor(colsample_bytree=0.2, gamma=0.0, learning_rate=0.01, max_depth=4, min_child_weight=1.5, n_estimators=7200,
                            reg_alpha=0.9, reg_lambda=0.6, subsample=0.2, seed=42)
    regr.fit(X_train, y_train)
    rmse_list_XG.append(np.sqrt(metrics.mean_squared_error(y_test, regr.predict(X_test))))

#Plotting the mean and standard deviation of RMSE for each model (Figure 8)
x= ['avg','LR','XG']
y_=[np.mean(rmse_list_avg),np.mean(rmse_list_LR), np.mean(rmse_list_XG)]
y_error = [np.std(rmse_list_avg), np.std(rmse_list_LR), np.std(rmse_list_XG)]

plt.figure(figsize=(5,5))
plt.bar(x, y_, yerr=y_error, capsize=5)
plt.ylabel('Root Mean Squared Error');

Figure 8. A bar graph of the mean root mean squared error and standard deviation of the average, linear regression, and extreme gradient boosted regression models after cross validation

Finally, a last check of the XGBoost model's performance is a plot of all of the testing points across the 50 iterations of cross-validation. By graphing only the testing points, as predicted by the 50 differently split and trained models, we can visually verify that the model predicts the extremes of the data as well as its center, in stark contrast to the average model plotted above. Since this is the case, we can be confident that the root mean squared error of the final model is an accurate description of the model's error.

#The training predictions for every iteration of cross-validation of XGBoost are plotted
#Compare to the 1:1 line to determine if XGBoost is predicting a pattern 
#or if it is predicting the mean
y_test_vec = []
y_pred_vec = []
y_train_vec = []
y_train_pred_vec = []
for i in range (50):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=i)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)
    regr = xgb.XGBRegressor(colsample_bytree=0.2, gamma=0.0, learning_rate=0.01, max_depth=4, 
    min_child_weight=1.5, n_estimators=7200, reg_alpha=0.9, reg_lambda=0.6, subsample=0.2, 
    seed=42)
    regr.fit(X_train, y_train)
    y_pred = regr.predict(X_test)
    y_test_vec.append(y_test)
    y_pred_vec.append(y_pred)
    y_train_vec.append(y_train)
    y_train_pred_vec.append(regr.predict(X_train))

plt.figure(figsize=(10,10))
plt.scatter(y_test_vec, y_pred_vec,label="test")
plt.legend()
plt.xlabel('Measured', fontsize=15)
plt.ylabel('Predicted', fontsize=15)
plt.axline([0, 0], [1, 1],color = 'black', linewidth = 0.5)
plt.xlim([0,20])
plt.ylim([0,20])

plt.title("XGboost Cross-Validation Testing Points",fontsize=15)

#Tabulating actual versus predicted values from the final cross-validation split
df_results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df_results.head(20)

Figure 9. The predicted value compared to the measured value of each testing point from each iteration of cross validation.

Go/No-Go: Proceed with deployment of your model if the root mean squared error of the average model is greater than the goal final root mean squared error, the simple models have been sufficiently exhausted before moving on to complex models, and the final model predicts the extremes of the data with a similar amount of error as the center of the data.

Deploy Model:

The machine learning model must be deployed safely and securely, which means maintaining secure Wi-Fi networks and communicating with operators so they have a clear understanding of the use cases of these tools. To achieve this, an internal Wi-Fi network was constructed on which only the acoustic sensors are placed, and these sensors report to a single computer housed within the plant. This computer processes the data using the techniques from the data preparation section and runs it through the model chosen in the model training section. The computer posts data to the operator dashboard without a return pathway, so no additional security risk is imposed. The data is also posted to a PowerBI dashboard for process engineers to monitor.
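As an illustration of the processing chain on that computer, the sketch below strings together the steps from the earlier sections: the full_fft function, moving average smoothing equivalent to the running mean used during training, the fitted pca, and the trained regr model. The post_to_dashboard call is a hypothetical placeholder for the one-way link to the operator dashboard, not an actual plant interface.

#A sketch of the per-sample inference pipeline on the plant computer
#post_to_dashboard is a hypothetical placeholder for the one-way dashboard link
def predict_total_solids(raw_acoustic, pca, model, width_half=10):
    #FFT of the raw signal, excluding the end-of-signal artifacts, as in data preparation
    spectrum = full_fft(raw_acoustic[0:11892], 11892)
    #Moving average smoothing equivalent to the running mean used during training
    window = 2*width_half + 1
    smooth = np.convolve(spectrum, np.ones(window)/window, mode="valid")
    #Project onto the principal components fit during training, then predict
    features = pca.transform(smooth.reshape(1, -1))
    return float(model.predict(features)[0])

#ts_estimate = predict_total_solids(new_reading, pca, regr)
#post_to_dashboard(ts_estimate)   #hypothetical one-way publish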

After deployment, the model may be retrained if samples continue to be taken and the model's performance begins to degrade. The total solids behavior often changes on a multi-month scale, so the need for retraining can be identified by humans and retraining performed infrequently. This also means that operators can have confidence in the model while it operates within its range, because each retrained model is analyzed by an engineer before redeployment.
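As one illustration of how that human review could be supported, a simple automated flag might compare the deployed model's error on newly collected laboratory samples against a threshold and leave the retraining decision to staff. In this sketch the 'TS' column name, the DataFrame layout, and the threshold are assumptions, not the plant's actual configuration.

#A sketch of flagging the model for human review when recent performance degrades
#The 'TS' column, DataFrame layout, and threshold are illustrative assumptions
def needs_review(recent_df, pca, model, rmse_threshold=0.5):
    #recent_df is assumed to hold smoothed FFT features plus a 'TS' laboratory column
    X_recent = pca.transform(recent_df.drop(columns=['TS']).values)
    rmse = np.sqrt(metrics.mean_squared_error(recent_df['TS'], model.predict(X_recent)))
    return rmse > rmse_threshold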

Go/No-Go: Complete deployment of your model if there is no security risk to the plant and there is sufficient staff to monitor model performance over time and retrain the model when necessary.