Challenge 3 Concept Thoughts
Analyzing how environmental factors including land use, land cover, weather, atmosphere, and geology influence water quality, through geospatial data filtration and machine learning modelling.
Per the hackathon extension notice, I am submitting Challenge 3 concept ideas in lieu of a finished product.
It is commonly understood that the existing Chesapeake Bay monitoring data contain spatial and temporal gaps. The first step would be to identify an area of the bay watershed, ideally the size of a few HUC12 subwatersheds or larger, that includes fairly diverse land uses and has recent data coverage upstream and downstream of the land uses considered most impactful (e.g. agriculture, urban/developed, mining areas). Identification of this area would be accelerated by using a combination of R and GIS to 1) filter out any monitoring points that don't meet basic spatial and temporal thresholds (e.g. fewer than 10 visits per year for the last five years), 2) filter out any monitoring stations that do not report a basic subset of parameters (e.g. pH, water temperature, benthic, nitrogen, phosphorus, etc.), and 3) visually accentuate the remaining monitoring points that received concentrated attention (e.g. 50+ visits per year for the most recent five years). The more visits per year the better, as trends associated with weather events could be studied closely and the largest possible number of observations is desirable for modelling purposes. Also, if possible, it may be useful to select a headwater area to avoid substantial contributions from upstream influences.
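The three screening steps above can be sketched in a few lines. The plan itself calls for R and GIS; the Python below is only a compact stand-in for the logic. The function names, the input layout (one `(station_id, year)` pair per visit, plus a set of reported parameters per station), and the default required parameters are all assumptions made for illustration, and the thresholds are the example values from the text.

```python
from collections import defaultdict


def visits_per_year(visits):
    """Count visits per station and year from (station_id, year) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for station_id, year in visits:
        counts[station_id][year] += 1
    return counts


def screen_stations(visits, parameters, years, min_visits=10,
                    intensive_visits=50,
                    required=frozenset({"pH", "water_temperature"})):
    """Return (retained, intensive) sets of station ids.

    retained:  stations with at least `min_visits` visits in every year of
               `years` that also report every parameter in `required`.
    intensive: the subset of retained stations with at least
               `intensive_visits` visits in every year (the ones to
               highlight on a map).
    The parameter names in `required` are hypothetical placeholders.
    """
    counts = visits_per_year(visits)
    retained, intensive = set(), set()
    for station_id, by_year in counts.items():
        yearly = [by_year.get(year, 0) for year in years]
        if min(yearly) < min_visits:
            continue  # step 1: fails the temporal coverage threshold
        if not required <= parameters.get(station_id, set()):
            continue  # step 2: missing a required parameter
        retained.add(station_id)
        if min(yearly) >= intensive_visits:
            intensive.add(station_id)  # step 3: concentrated attention
    return retained, intensive
```

A station that skips even one of the five years is dropped here, which is a strict reading of the threshold; a looser rule (for example, an average over the period) would be an equally reasonable choice.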
After identification of the watershed region, additional geospatial data should be gathered, including: land use and land cover from the U.S. Geological Survey National Land Cover Database; historical weather data (e.g. precipitation, temperature) from the National Oceanic and Atmospheric Administration; and geology and/or soils data from the U.S. Geological Survey and/or the Natural Resources Conservation Service. These data sets would be joined to the existing monitoring data, likely through GIS overlay and extraction methods, using the geospatial locations of the monitoring stations.
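As a rough illustration of what "extraction" means for a gridded layer such as land cover, the sketch below looks up the grid cell that contains a station. It is a simplified Python stand-in for what a GIS or an R spatial package would do, and it assumes a north-up grid with square cells, a known top-left corner, and station coordinates already in the grid's coordinate system. The function name, argument names, and the toy grid values are hypothetical.

```python
import math


def extract_cell_value(grid, x_min, y_max, cell_size, x, y):
    """Return the value of the grid cell containing point (x, y).

    grid is a list of rows, with row 0 at the northern edge. (x_min, y_max)
    is the top-left corner of the grid. Returns None when the point falls
    outside the grid.
    """
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return grid[row][col]
    return None


# Toy 2 x 3 grid of category codes with 30-unit cells (illustrative only).
land_cover = [
    [11, 21, 41],
    [82, 81, 90],
]
stations = {"A": (45.0, 50.0), "B": (75.0, 10.0), "C": (100.0, 10.0)}
station_cover = {
    name: extract_cell_value(land_cover, 0.0, 60.0, 30.0, x, y)
    for name, (x, y) in stations.items()
}
```

Vector layers such as geology or soils polygons would need a point-in-polygon overlay instead, which is not shown here.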
With all data assembled in a single data file, the remaining data analysis steps would be completed in R. First, creating a few additional variables may be useful, for instance the land use or the length of stream lying immediately upstream of each monitoring station. Also, as with all data files, some tidying will be required to fit the intended use. In this case, if the multiple similarly named parameters (e.g. pH, pH.6, pH.9, etc.) can be given the same label, that would streamline the data set. Next, exploratory data analysis should be applied to evaluate basic univariate patterns, bivariate relationships, potential missing data, and any data oddities. Further tidying should be applied as appropriate.
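The relabelling of similarly named parameters could look roughly like the sketch below. It is written in Python only as a stand-in for the R tidying step, and it assumes that the variants differ from the base name only by a trailing ".<digits>" suffix (as in pH.6 and pH.9) and that they measure the same quantity in the same units. Both assumptions would need to be checked against the real data, since a genuine parameter name ending in a dot and digits would be merged by mistake. The helper names are hypothetical.

```python
import re

# Matches a trailing ".<digits>" suffix such as ".6" in "pH.6".
_SUFFIX = re.compile(r"\.\d+$")


def base_name(column):
    """Strip a trailing numeric suffix: 'pH.6' becomes 'pH'."""
    return _SUFFIX.sub("", column)


def harmonise_row(row):
    """Merge columns that share a base name, keeping the first non-missing value.

    row is a dict mapping column name to value, with None meaning missing.
    """
    merged = {}
    for column, value in row.items():
        key = base_name(column)
        if merged.get(key) is None:
            merged[key] = value
    return merged


record = {"station": "A", "pH": None, "pH.6": 7.2, "pH.9": None, "nitrogen": 1.4}
tidy = harmonise_row(record)
```

Keeping the first non-missing value silently ignores any disagreement between variants; flagging rows where two variants both hold values would be a sensible addition during exploratory analysis.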
After any restructuring and tidying, machine learning algorithms could be applied to evaluate which variables are related to the response variable. This concept plan assumes benthic rating as the response variable. A multiple linear regression could be tried initially, but since the response is categorical, a classification model is likely most appropriate. A suite of classifiers including multinomial logistic regression, SVM, LDA, QDA, and Random Forest could quickly be trialed through R's caret package. Further refinement could follow variable selection or dimension reduction using Random Forest, PCA, or another method, with each change evaluated by its effect on classification error.
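To make the comparison of classifiers by classification error concrete, here is a minimal cross-validation sketch. It does not use caret or any of the classifiers named above: as a dependency-free Python stand-in it compares a majority-class baseline with a simple nearest-centroid classifier, and the folds are assigned by row position rather than at random. The function names and the toy data are hypothetical; the point is only the shape of the comparison.

```python
from collections import Counter


def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda row: label


def fit_nearest_centroid(X, y):
    """Predict the label whose mean predictor vector is closest to the row."""
    counts = Counter(y)
    sums = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(row):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
        )

    return predict


def cv_error(fit, X, y, k=5):
    """k-fold cross-validated classification error (share of wrong predictions)."""
    errors = 0
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        test = [i for i in range(len(y)) if i % k == fold]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(X[i]) != y[i] for i in test)
    return errors / len(y)


# Toy data: two well-separated groups of stations with two predictors each.
X = [[0.1 * i, 0.05 * i] if i % 2 == 0 else [5 + 0.1 * i, 5 - 0.05 * i]
     for i in range(10)]
y = ["good" if i % 2 == 0 else "poor" for i in range(10)]
baseline_error = cv_error(fit_majority, X, y)
centroid_error = cv_error(fit_nearest_centroid, X, y)
```

Any candidate model is only interesting if its cross-validated error is clearly below the baseline, and the same harness can be rerun after variable selection to see whether the error changes.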
Thank you for the opportunity to submit these thoughts as a rough concept. At minimum, it was exciting to visualize a possible approach and daydream about how fun the execution would be.