TY - JOUR
T1 - Development of a data-driven ensemble regressor and its applicability for identifying contextual and collective outliers in groundwater level time-series data
AU - Kim, Yuhan
AU - Jeong, Jiho
AU - Park, Heejeong
AU - Kwon, Mijin
AU - Cho, Chunhyung
AU - Jeong, Jina
N1 - Publisher Copyright: © 2022 The Author(s)
PY - 2022/9
Y1 - 2022/9
N2 - This study develops a method that estimates the normal range of groundwater level time-series data in order to identify global, contextual, and collective outliers. To estimate the normal range, the statistical characteristics of the groundwater level data and the patterns of the precipitation time-series data were incorporated into a long short-term memory (LSTM)-based ensemble regressor (the LER model). The LER model generated multiple possible trends of the groundwater level, and general outlier identification rules (the σ rule and the Tukey's fences (TF) rule) were applied to the LER ensemble estimates to define the range of the normal data. For validation, groundwater levels observed at three monitoring wells in South Korea (Pohang–Gibuk (PG), Namwon–Dotong (ND), and Jeju–Sangyae (JS)) were used together with the corresponding precipitation data from the nearest weather stations. Simple applications of the σ and TF rules served as the reference method for comparison. For the monitoring data, the LER-based method estimates the range of values that can be explained by the modelled influences of interest (i.e., the normal data range). The developed method generally showed an outlier identification performance of >70%, whereas the performance of the σ and TF rules was mostly <50%. In particular, because the method effectively estimated the seasonal trend and variability of the groundwater level by accounting for precipitation patterns and the statistics of groundwater level variation, it outperformed the simple σ and TF rules in identifying contextual or collective outliers.
Through in-depth analysis, it is concluded that the LER-based outlier identification method effectively discriminates abnormal data by considering both the intrinsic statistical characteristics of the original data trend and exogenous factors. In terms of practical applicability, because results can be obtained automatically from real-time monitoring data, the method is expected to support more efficient maintenance of monitoring devices when the model is embedded as management software in the monitoring network system.
KW - Contextual and collective outlier identification
KW - Ensemble estimation
KW - Groundwater level fluctuation
KW - Long short-term memory
KW - Normal data range
UR - http://www.scopus.com/inward/record.url?scp=85133864776&partnerID=8YFLogxK
U2 - 10.1016/j.jhydrol.2022.128127
DO - 10.1016/j.jhydrol.2022.128127
M3 - Article
AN - SCOPUS:85133864776
SN - 0022-1694
VL - 612
JO - Journal of Hydrology
JF - Journal of Hydrology
M1 - 128127
ER -