Master's Projects (Mathematics and Statistics)
Recent Submissions

A Bayesian mixed multistate open robust design mark-recapture model to estimate heterogeneity in transition rates in an imperfectly detected system
Multistate mark-recapture models have long been used to assess ecological and demographic parameters such as survival, phenology, and breeding rates by estimating transition rates among a series of latent or observable states. Here, we introduce a Bayesian mixed multistate open robust design mark-recapture model (MSORD), with random intercepts and slopes to explore individual heterogeneity in transition rates and individual responses to covariates. We fit this model to simulated data sets to test whether the model could accurately and precisely estimate five parameters, set to known values a priori, under varying sampling schemes. To assess the behavior of the model integrated across replicate fits, we employed a two-stage hierarchical model fitting algorithm for each of the simulations. The majority of model fits showed no sign of inadequate convergence according to our metrics, with 81.25% of replicate posteriors for parameters of interest having general agreement among chains (r̂ < 1.1). Estimates of posterior distributions for mean transition rates and standard deviation in random intercepts were generally well defined. However, we found that models estimated the standard deviation in random slopes and the correlation among random effects relatively poorly, especially in simulations with low power to detect individuals (e.g., low detection rates, study duration, or secondary samples). We also apply this model to a dataset of 200 female grey seals breeding on Sable Island from 1985 to 2018 to estimate individual heterogeneity in reproductive rate and response to near-exponential population growth. The Bayesian MSORD estimated substantial variation among individuals in both mean transition rates and responses to population size. The correlation among effects trended positive, indicating that females with high reproductive performance (more positive intercepts) were also more likely to respond well to population growth (more positive slopes), and vice versa. Although our simulation results lend confidence to analyses that use this method on well-developed datasets from highly observable systems, we caution against using this framework in sparse-data situations.
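As an illustration of the kind of individual heterogeneity this model targets, the sketch below simulates per-individual transition probabilities from correlated random intercepts and slopes on the logit scale. This is a toy Python example, not the authors' Bayesian MSORD; all function and parameter names are invented for illustration.

```python
import numpy as np

def simulate_transition_probs(n_ind, mu_int, mu_slope, sd_int, sd_slope,
                              rho, covariate, rng=None):
    """Illustrative sketch: per-individual transition probabilities with
    correlated random intercepts and slopes on the logit scale."""
    rng = np.random.default_rng(rng)
    # joint distribution of (intercept deviation, slope deviation)
    cov = np.array([[sd_int**2, rho * sd_int * sd_slope],
                    [rho * sd_int * sd_slope, sd_slope**2]])
    effects = rng.multivariate_normal([0.0, 0.0], cov, size=n_ind)
    # logit(p_it) = (mu_int + b0_i) + (mu_slope + b1_i) * x_t
    logit = (mu_int + effects[:, 0])[:, None] + \
            (mu_slope + effects[:, 1])[:, None] * np.asarray(covariate)[None, :]
    return 1.0 / (1.0 + np.exp(-logit))  # shape: (n_ind, n_times)
```

A positive `rho` reproduces the pattern described in the abstract: individuals with higher baseline rates also respond more strongly to the covariate.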

Analysis of GNAC Volleyball using the Bradley-Terry Model
Ranking is the process by which a set of objects is assigned a linear ordering based on some property that they possess. Not surprisingly, there are many different methods of ranking used in a wide array of applications; ranking plays a vital role in sports analysis, preference testing, search engine optimization, psychological research, and many other areas. One of the more popular ranking models is Bradley-Terry, a type of aggregation ranking that has been used mostly within the realm of sports. Bradley-Terry uses the outcomes of individual matchups (paired comparisons) to create rankings via maximum-likelihood estimation. This project aims to briefly examine the motivation for modeling sporting events, review the history of ranking and aggregation ranking, communicate the mathematical theory behind the Bradley-Terry model, and apply the model to a novel volleyball dataset.
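One standard way to obtain the maximum-likelihood Bradley-Terry strengths is an iterative fixed-point (MM) update. A minimal Python sketch (the project's own implementation is not shown in the abstract, and the function name here is invented):

```python
import numpy as np

def bradley_terry(wins, n_iter=1000, tol=1e-10):
    """Fit Bradley-Terry strengths by the classical MM fixed-point update.
    wins[i, j] = number of times team i beat team j."""
    n = wins.shape[0]
    p = np.ones(n)
    total_wins = wins.sum(axis=1)
    games = wins + wins.T          # matchups played between each pair
    for _ in range(n_iter):
        # p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = np.array([
            sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        new_p = total_wins / denom
        new_p /= new_p.sum()       # normalize: strengths are scale-invariant
        if np.max(np.abs(new_p - p)) < tol:
            p = new_p
            break
        p = new_p
    return p
```

Sorting teams by the fitted `p` gives the ranking; each `p[i] / (p[i] + p[j])` is the modeled probability that team i beats team j.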

Simulating distance sampling to estimate nest abundance on the Yukon-Kuskokwim Delta, Alaska
The U.S. Fish and Wildlife Service currently conducts annual surveys to estimate bird nest abundance on the Yukon-Kuskokwim Delta, Alaska. The current method involves intensive searching on large plots with the goal of finding every nest on the plot. Distance sampling is a well-established transect-based method to estimate density or abundance that accounts for imperfect detection of objects. It relies on estimating the probability of detecting an object given its distance from the transect line, known as the detection function. Simulations were done using R to explore whether distance sampling methods on the Yukon-Kuskokwim Delta could produce reliable estimates of nest abundance. Simulations were executed both with geographic strata based on estimated Spectacled Eider (Somateria fischeri) nest densities and without stratification. Simulations with stratification, where more effort was allotted to high-density areas, tended to be more precise, but they lacked the property of pooling robustness and assumed stratum boundaries would not change over time. Simulations without stratification yielded estimates with relatively low bias and variances comparable to current estimation methods. Distance sampling appears to be a viable option for estimating the abundance of nests on the Yukon-Kuskokwim Delta.
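For intuition, a distance sampling simulation with a half-normal detection function takes only a few lines. The sketch below is in Python rather than the R used in the project, names are invented, and the detection scale `sigma` is assumed known rather than estimated from the data:

```python
import math
import numpy as np

def simulate_detections(density, w, length, sigma, rng=None):
    """Scatter nests uniformly within distance w of a transect of length L,
    then detect each with half-normal probability exp(-d^2 / (2 sigma^2))."""
    rng = np.random.default_rng(rng)
    n = rng.poisson(density * 2 * w * length)
    d = rng.uniform(0, w, size=n)              # perpendicular distances
    keep = rng.uniform(size=n) < np.exp(-d**2 / (2 * sigma**2))
    return d[keep]

def estimate_density(n_detected, w, length, sigma):
    """Density estimate via the effective strip half-width
    mu = integral_0^w exp(-d^2 / (2 sigma^2)) dd."""
    mu = sigma * math.sqrt(math.pi / 2) * math.erf(w / (sigma * math.sqrt(2)))
    return n_detected / (2 * mu * length)
```

In a real analysis, `sigma` (and hence the detection function) is estimated from the observed distances; treating it as known here keeps the sketch short.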

Multiple imputation of missing multivariate atmospheric chemistry time series data from Denali National Park
This paper explores a technique in which we impute missing values for an incomplete dataset via multiple imputation. Incomplete data is one of the most common issues in data analysis and often occurs when measuring chemical and environmental data. The dataset that we used in the model consists of 26 atmospheric particulates or elements that were measured semiweekly in Denali National Park from 1988 to 2015. Collection days alternated between three and four days apart from 3/2/88 to 9/30/00 and were consistently three days apart from 10/3/00 to 12/29/15. For this reason, the data were initially partitioned into two parts in case the separation between collection days had an impact. With further analysis, we concluded that the misalignment between the two datasets had little or no impact on our analysis, and we therefore combined the two. After running five Markov chains of 1000 iterations each, we concluded that the model stayed consistent across the five chains. We found that, in order to get a better understanding of how well the imputed values performed, more exploratory analysis on the imputed datasets would be required.
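Chained-equations imputation (the approach implemented by packages such as MICE) cycles through the incomplete variables, regressing each on the others and redrawing its missing values. A heavily simplified sketch of one such cycle, using linear models only; names are invented and this is not the paper's analysis pipeline:

```python
import numpy as np

def chained_imputation(X, n_iter=10, rng=None):
    """Toy chained-equations imputation: repeatedly regress each column
    with missing entries on the others and redraw its missing values."""
    rng = np.random.default_rng(rng)
    X = X.copy()
    miss = np.isnan(X)
    # initialize with column means so every regression has complete predictors
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            # linear regression of column j on all other (currently imputed) columns
            A = np.column_stack([np.ones(X.shape[0]), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid_sd = np.std(X[obs, j] - A[obs] @ beta)
            # redraw missing entries from the fitted model plus noise
            X[miss[:, j], j] = A[miss[:, j]] @ beta + \
                rng.normal(0, resid_sd, miss[:, j].sum())
    return X
```

Proper multiple imputation repeats this with different random draws to produce several completed datasets, then pools analyses across them.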

Multistate Ornstein-Uhlenbeck space use model reveals sex-specific partitioning of the energy landscape in a soaring bird
Understanding animals' home range dynamics is a frequent motivating question in movement ecology. Descriptive techniques are often applied, but these methods lack predictive ability and cannot capture the effects of dynamic environmental patterns, such as weather and features of the energy landscape. Here, we develop a practical approach for statistical inference into the behavioral mechanisms underlying how habitat and the energy landscape shape animal home ranges. We validated this approach by conducting a simulation study and applied it to a sample of 12 golden eagles (Aquila chrysaetos) tracked with satellite telemetry. We demonstrate that readily available software can be used to fit a multistate Ornstein-Uhlenbeck space use model to make hierarchical inference of habitat selection parameters and home range dynamics. Additionally, the underlying mathematical properties of the model allow straightforward computation of predicted space use distributions, permitting estimation of home range size and visualization of space use patterns under varying conditions. The application to golden eagles revealed effects of habitat variables that align with eagle biology. Further, we found that males and females partition their home ranges dynamically based on uplift. Specifically, changes in wind and the angle of the sun seemed to be drivers of differential space use between sexes, in particular during the late breeding season, when both sexes forage across large parts of their home range to support nestling growth.
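The Ornstein-Uhlenbeck position process underlying such space use models is mean-reverting toward an attraction point and can be simulated exactly at discrete time steps. A minimal isotropic sketch (illustrative only; the paper's multistate model with habitat covariates is considerably richer, and the names here are invented):

```python
import numpy as np

def simulate_ou(n_steps, dt, mu, beta, sigma, x0=None, rng=None):
    """Exact simulation of a 2-D Ornstein-Uhlenbeck position process
    attracted to a home-range centre mu (isotropic, illustrative).
    Stationary variance in each coordinate is sigma^2 / (2 beta)."""
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    x = np.empty((n_steps, 2))
    x[0] = mu if x0 is None else x0
    phi = np.exp(-beta * dt)                       # mean-reversion factor
    sd = sigma * np.sqrt((1 - phi**2) / (2 * beta))  # exact one-step noise sd
    for t in range(1, n_steps):
        x[t] = mu + phi * (x[t - 1] - mu) + rng.normal(0, sd, 2)
    return x
```

The closed-form stationary distribution (Gaussian around `mu`) is what makes the predicted space use distributions in the paper straightforward to compute.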

Estimating confidence intervals on accuracy in classification in machine learning
This paper explores various techniques to estimate a confidence interval on accuracy for machine learning algorithms. Confidence intervals on accuracy may be used to rank machine learning algorithms. We investigate bootstrapping, leave-one-out cross-validation, and conformal prediction. These techniques are applied to the following machine learning algorithms: support vector machines, bagging, AdaBoost, and random forests. Confidence intervals are produced on a total of nine datasets, three real and six simulated. We found that, in general, no technique was particularly successful at always capturing the accuracy. However, leave-one-out cross-validation showed the most consistency among all techniques across all datasets.
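Of the techniques compared, the percentile bootstrap is the simplest to sketch: resample (truth, prediction) pairs with replacement and take empirical quantiles of the resampled accuracies. A minimal version (invented names; not necessarily the paper's exact procedure):

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for classification accuracy."""
    rng = np.random.default_rng(rng)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    correct = (y_true == y_pred).astype(float)
    # resample the per-example correctness indicators and average each draw
    accs = np.array([correct[rng.integers(0, n, n)].mean()
                     for _ in range(n_boot)])
    return np.quantile(accs, alpha / 2), np.quantile(accs, 1 - alpha / 2)
```

Note this captures sampling variability of the test set only, not variability from retraining the model, which is one reason the paper compares it against cross-validation-based intervals.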

A geostatistical model based on Brownian motion to krige regions in R² with irregular boundaries and holes
Kriging is a geostatistical interpolation method that produces predictions and prediction intervals. Classical kriging models use Euclidean (straight-line) distance when modeling spatial autocorrelation. However, for estuaries, inlets, and bays, shortest-in-water distance may capture the system's proximity dependencies better than Euclidean distance when boundary constraints are present. Shortest-in-water distance has been used to krige such regions (Little et al., 1997; Rathbun, 1998); however, the variance-covariance matrices used in these models have not been shown to be mathematically valid. In this project, a new kriging model is developed for irregularly shaped regions in R². This model incorporates the notion of flow-connected distance into a valid variance-covariance matrix through the use of a random walk on a lattice, process convolutions, and the nonstationary kriging equations. The model developed in this paper is compared to existing methods of spatial prediction over irregularly shaped regions using water quality data from Puget Sound.

Testing multispecies coalescent simulators with summary statistics
The multispecies coalescent (MSC) model is increasingly used in phylogenetics to describe the formation of gene trees (depicting the direct ancestral relationships of sampled lineages) within species trees (depicting the branching of species from their common ancestor). A number of MSC simulators have been implemented, and these are often used to test inference methods built on the model. However, it is not clear from the literature that these simulators are always adequately tested. In this project, we formulated tools for testing these simulators and used them to show that of four well-known coalescent simulators (Mesquite, Hybrid-Lambda, SimPhy, and Phybase), only SimPhy performs correctly according to these tests.

The treatment of missing data on placement tools for predicting success in college algebra at the University of Alaska
This project investigated the statistical significance of baccalaureate student placement tools, such as test scores and completion of a developmental course, for predicting success in a college-level algebra course at the University of Alaska (UA). Students included in the study had attempted Math 107 at UA for the first time between fiscal years 2007 and 2012. The student placement information had a high percentage of missing data. A simulation study was conducted to choose the better missing data method, between complete case deletion and multiple imputation, for the student data. After the missing data methods were applied, a logistic regression was fitted with explanatory variables consisting of test scores, developmental course grade, age (category) of scores and grade, and interactions. The relevant tests were SAT math, ACT math, and AccuPlacer college-level math, and the relevant developmental course was Devm/Math 105. The response variable was success in passing Math 107 with a grade of C or above on the first attempt. The simulation study showed that under a high percentage of missing data and correlation, multiple imputation implemented by the R package Multivariate Imputation by Chained Equations (MICE) produced the least biased estimators and better confidence interval coverage compared to complete case deletion when data are missing at random (MAR) and missing not at random (MNAR). Results from the multiple imputation method on the student data showed that Devm/Math 105 grade was a significant predictor of passing Math 107. The age of Devm/Math 105, age of tests, and test scores were not significant predictors of student success in Math 107. Future studies may consider modeling with ALEKS scores and high school math course information.

Analyzing tree distribution and abundance in Yukon-Charley Rivers National Preserve: developing geostatistical Bayesian models with count data
Species distribution models (SDMs) describe the relationship between where a species occurs and underlying environmental conditions. For this project, I created SDMs for the five tree species that occur in Yukon-Charley Rivers National Preserve (YUCH) in order to gain insight into which environmental covariates are important for each species, and what effect each environmental condition has on that species' expected occurrence or abundance. I discuss some of the issues involved in creating SDMs, including whether or not to incorporate spatially explicit error terms, and if so, how to do so with generalized linear models (GLMs, which have discrete responses). I ran a total of 10 distinct geostatistical SDMs using Markov chain Monte Carlo (Bayesian methods), and I discuss the results here. I also compare these results from YUCH with results from a similar analysis conducted in Denali National Park and Preserve (DNPP).

Toward an optimal solver for the obstacle problem
An optimal algorithm for solving a problem with m degrees of freedom is one that computes a solution in O(m) time. In this paper, we discuss a class of optimal algorithms for the numerical solution of PDEs called multigrid methods. We go on to examine numerical solvers for the obstacle problem, a constrained PDE, with the goal of demonstrating optimality. We discuss two known algorithms, the so-called reduced space method (RSP) [BM03] and the multigrid-based projected full-approximation scheme (PFAS) [BC83]. We compare the performance of PFAS and RSP on a few example problems, finding numerical evidence of optimality or near-optimality for PFAS.

Reliability analysis of reconstructing phylogenies under long branch attraction conditions
In this simulation study we examined the reliability of three phylogenetic reconstruction techniques in a long branch attraction (LBA) situation: Maximum Parsimony (MP), Neighbor Joining (NJ), and Maximum Likelihood (ML). Data were simulated under five DNA substitution models (JC, K2P, F81, HKY, and GTR) from four different taxa. Two branch length parameters of four-taxon trees, ranging from 0.05 to 0.75 in increments of 0.02, were used to simulate DNA data under each model. For each model we simulated DNA sequences with 100, 250, 500, and 1000 sites, with 100 replicates each. When there are enough data, maximum likelihood is the most reliable of the three methods examined in this study for reconstructing phylogenies under LBA conditions. We also find that MP is the most sensitive to LBA conditions and that Neighbor Joining performs well under LBA conditions compared to MP.
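Under the simplest of the five models, JC69, a site changes along a branch of length t with probability 3/4(1 - e^(-4t/3)), uniformly to one of the other three bases. A toy simulator for the four-taxon tree ((A,B),(C,D)) is sketched below; names are invented and this is illustrative only, not the study's simulation machinery:

```python
import numpy as np

def evolve_jc(seq, t, rng):
    """Evolve an integer-coded sequence (0=A, 1=C, 2=G, 3=T) along a branch
    of length t under JC69: each site changes with probability
    3/4 * (1 - exp(-4t/3)), uniformly to one of the other three bases."""
    p_change = 0.75 * (1.0 - np.exp(-4.0 * t / 3.0))
    changed = rng.uniform(size=seq.size) < p_change
    out = seq.copy()
    # adding a random shift of 1..3 mod 4 picks a uniformly random *other* base
    out[changed] = (out[changed] + rng.integers(1, 4, size=changed.sum())) % 4
    return out

def simulate_four_taxon(n_sites, t_internal, t_tip, rng=None):
    """Simulate tip sequences on the four-taxon tree ((A,B),(C,D))."""
    rng = np.random.default_rng(rng)
    root = rng.integers(0, 4, size=n_sites)
    left = evolve_jc(root, t_internal, rng)
    right = evolve_jc(root, t_internal, rng)
    return {"A": evolve_jc(left, t_tip, rng), "B": evolve_jc(left, t_tip, rng),
            "C": evolve_jc(right, t_tip, rng), "D": evolve_jc(right, t_tip, rng)}
```

LBA arises when the two long branches are much longer than the others, so that parsimony groups the long-branch taxa together regardless of the true topology.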

An investigation into the effectiveness of simulation-extrapolation for correcting measurement-error-induced bias in multilevel models
This paper is an investigation into correcting the bias introduced by measurement error in multilevel models. The proposed method for this correction is simulation-extrapolation (SIMEX). The paper begins with a detailed discussion of measurement error and its effects on parameter estimation. We then describe the simulation-extrapolation method and how it corrects for the bias introduced by the measurement error. Multilevel models and their corresponding parameters are also defined before performing a simulation. The simulation involves estimating the multilevel model parameters using our true explanatory variables, the observed measurement-error variables, and two different SIMEX techniques. The estimates obtained from the true explanatory values were used as a baseline for comparing the effectiveness of the SIMEX method for correcting bias. From these results, we determined that SIMEX was very effective in correcting the bias in estimates of the fixed-effects parameters and often provided estimates that were not significantly different from those derived using the true explanatory variables. The simulation also suggested that the SIMEX approach was effective in correcting bias for the random slope variance estimates, but not for the random intercept variance estimates. Using the simulation results as a guideline, we then applied the SIMEX approach to an orthodontics dataset to illustrate the application of SIMEX to real data.
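The SIMEX idea is easiest to see in simple linear regression: add extra measurement error at several multiples lambda of the known error variance, watch the slope attenuate, and extrapolate the trend back to lambda = -1 (no measurement error). A minimal sketch with a quadratic extrapolant; names are invented, and the paper's multilevel application is more involved than this:

```python
import numpy as np

def simex_slope(x_obs, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0),
                n_rep=50, rng=None):
    """Basic SIMEX for the slope of a simple linear regression.
    sigma_u is the (known) standard deviation of the measurement error."""
    rng = np.random.default_rng(rng)

    def slope(x):
        return np.polyfit(x, y, 1)[0]

    lams = [0.0] + list(lambdas)
    means = []
    for lam in lams:
        if lam == 0.0:
            means.append(slope(x_obs))  # the naive, attenuated estimate
        else:
            # add extra error of variance lam * sigma_u^2, average over replicates
            reps = [slope(x_obs + rng.normal(0, np.sqrt(lam) * sigma_u,
                                             x_obs.size))
                    for _ in range(n_rep)]
            means.append(np.mean(reps))
    coeffs = np.polyfit(lams, means, 2)  # quadratic trend in lambda
    return np.polyval(coeffs, -1.0)      # extrapolate to lambda = -1
```

The extrapolation step is the approximate part of SIMEX: the quadratic is a working model for the attenuation curve, so the correction reduces, rather than exactly removes, the bias.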

Effect of filling methods on the forecasting of time series with missing values
The Gulf of Alaska Mooring (GAK1) monitoring data set is an irregular time series of temperature and salinity at various depths in the Gulf of Alaska. One approach to analyzing data from an irregular time series is to regularize the series by imputing, or filling in, missing values. In this project we investigated and compared four methods of doing this (denoted APPROX, SPLINE, LOCF, and OMIT). Simulation was used to evaluate the performance of each filling method on parameter estimation and forecasting precision for an autoregressive integrated moving average (ARIMA) model. Simulations showed differences among the four methods in terms of forecast precision and parameter estimate bias. These differences depended on the true values of the model parameters as well as on the percentage of data missing. Among the four methods used in this project, OMIT performed the best and SPLINE performed the worst. We also illustrate the application of the four methods to forecasting the GAK1 monitoring time series and discuss the results.
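Two of the filling methods are simple enough to sketch directly: LOCF carries the last observed value forward, while APPROX linearly interpolates between observed neighbors (as R's `approx` does). A minimal Python version with invented names; note that LOCF leaves any leading missing values unfilled:

```python
import numpy as np

def fill_locf(y):
    """Last observation carried forward (LOCF)."""
    y = y.copy()
    for i in range(1, len(y)):
        if np.isnan(y[i]):
            y[i] = y[i - 1]
    return y

def fill_approx(y):
    """Linear interpolation between observed neighbours (like R's approx)."""
    y = y.copy()
    idx = np.arange(len(y))
    obs = ~np.isnan(y)
    y[~obs] = np.interp(idx[~obs], idx[obs], y[obs])
    return y
```

SPLINE replaces the linear interpolant with a cubic spline, and OMIT simply drops the missing time points and treats the remaining observations as a shorter series.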

Extending the Lattice-Based Smoother using a generalized additive model
The Lattice-Based Smoother was introduced by McIntyre and Barry (2017) to estimate a surface defined over an irregularly shaped region. In this paper we consider extending their method to allow for additional covariates and non-continuous responses. We describe our extension, which utilizes the framework of generalized additive models. A simulation study shows that our method is comparable to the soap film smoother of Wood et al. (2008) under a number of different conditions. Finally, we illustrate the method's practical use by applying it to a real data set.

Vertex arboricity of triangle-free graphs
The vertex arboricity of a graph is the minimum number of colors needed to color the vertices so that the subgraph induced by each color class is a forest. In other words, the vertex arboricity of a graph is the fewest colors required to color a graph such that every cycle receives at least two colors. Although not standard, we will refer to vertex arboricity simply as arboricity. In this paper, we discuss properties of the chromatic number and the k-defective chromatic number and how those properties relate to the arboricity of triangle-free graphs. In particular, we find bounds on the minimum order of a graph having arboricity three. Equivalently, we consider the largest possible vertex arboricity of triangle-free graphs of fixed order.
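For small graphs, vertex arboricity can be computed by brute force: try k = 1, 2, ... and test every k-coloring for whether each color class induces a forest. A sketch with invented names (exponential time, so only for toy instances):

```python
from itertools import product

def is_forest(vertices, edges):
    """Check that a graph is acyclic using union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, w in edges:
        ru, rw = find(u), find(w)
        if ru == rw:          # edge closes a cycle
            return False
        parent[ru] = rw
    return True

def vertex_arboricity(n, edges):
    """Smallest k such that vertices 0..n-1 admit a k-coloring in which
    every color class induces a forest (brute force over all colorings)."""
    for k in range(1, n + 1):
        for coloring in product(range(k), repeat=n):
            classes = [[v for v in range(n) if coloring[v] == c]
                       for c in range(k)]
            if all(is_forest(cls, [(u, w) for u, w in edges
                                   if u in cls and w in cls])
                   for cls in classes):
                return k
    return n
```

For example, any path has arboricity 1 (it is already a forest), while any cycle needs 2 colors, matching the "every cycle gets at least two colors" characterization above.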

Bayesian predictive process models for historical precipitation data of Alaska and southwestern Canada
In this paper we apply hierarchical Bayesian predictive process models to historical precipitation data using the spBayes R package. Classical and hierarchical Bayesian techniques for spatial analysis and modeling require large matrix inversions and decompositions, which can take prohibitive amounts of time to run (n observations take time on the order of n³). Bayesian predictive process models have the same spatial framework as hierarchical Bayesian models but fit a subset of points (called knots) to the sample, which allows for large-scale dimension reduction and results in much smaller matrix inversions and faster computing times. These computationally less expensive models allow average desktop computers to analyze spatially related datasets in excess of 20,000 observations in an acceptable amount of time.

Assessing year-to-year variability of inertial oscillation in the Chukchi Sea using the wavelet transform
Three years of ocean drifter data from the Chukchi Sea were examined using the wavelet transform to investigate inertial oscillation. There was an increasing trend in the number, duration, and hence total proportion of time spent in inertial oscillation events. Additionally, the Chukchi Sea seems to facilitate inertial oscillation that is easier to discern in north-south velocity records than in east-west velocity records. The data used in this analysis were transformed using wavelets, which are generally used as a qualitative statistical method. Because of this, in addition to measurement error and random ocean noise, there is an additional source of variability and correlation that makes concrete statistical results challenging to obtain. However, wavelets were an effective tool for isolating the specific period of inertial oscillation and examining how it changed over time.

Statistical analysis of species tree inference
It is known that the STAR and USTAR algorithms are statistically consistent techniques used to infer species tree topologies from a large set of gene trees. However, if the set of gene trees is small, the accuracy of STAR and USTAR in determining species tree topologies is unknown. Furthermore, it is unknown how introducing roots on the gene trees affects the performance of STAR and USTAR. Therefore, we show that when given a set of gene trees of sizes 1, 3, 6, or 10, the STAR and USTAR algorithms with Neighbor Joining perform relatively well in two different cases: one where the gene trees are rooted at the outgroup and the STAR-inferred species tree is also rooted at the outgroup, and the other where the gene trees are not rooted at the outgroup but the USTAR-inferred species tree is rooted at the outgroup.

Gaussian process convolutions for Bayesian spatial classification
We compare three models for their ability to perform binary spatial classification. A geospatial data set consisting of observations that are either permafrost or not is used for this comparison. All three use an underlying Gaussian process. The first model considers this process to represent the log-odds of a positive classification (i.e., as permafrost). The second model uses a cutoff: any locations where the process is positive are classified positively, while those where it is negative are classified negatively. A probability of misclassification then gives the likelihood. The third model depends on two separate processes, the first representing a positive classification and the second a negative classification. Of these two, the process with the greater value at a location provides the classification. A probability of misclassification is also used to formulate the likelihood for this model. In all three cases, realizations of the underlying Gaussian processes were generated using a process convolution: a grid of knots (whose values were sampled using Markov chain Monte Carlo) was convolved with an anisotropic Gaussian kernel. All three models provided adequate classifications, but the single- and two-process models showed much tighter bounds on the border between the two states.
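A process convolution builds a Gaussian process realization by smoothing knot values with a kernel; with a Gaussian kernel this is a single matrix multiply. A minimal axis-aligned anisotropic sketch (invented names; the project's models additionally involve MCMC over the knot values and a classification likelihood):

```python
import numpy as np

def process_convolution(knot_xy, knot_values, grid_xy, sx, sy):
    """Evaluate a process-convolution GP realization: smooth the knot
    values with an axis-aligned anisotropic Gaussian kernel with scales
    sx (x direction) and sy (y direction)."""
    dx = grid_xy[:, None, 0] - knot_xy[None, :, 0]
    dy = grid_xy[:, None, 1] - knot_xy[None, :, 1]
    K = np.exp(-0.5 * ((dx / sx)**2 + (dy / sy)**2))  # kernel weights
    return K @ knot_values  # surface value at each grid point
```

In a Bayesian fit, `knot_values` are the parameters sampled by MCMC, so the smooth surface (and hence the classification) is updated implicitly through this convolution at every iteration. A fully anisotropic kernel would replace the axis-aligned scales with a rotated covariance.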