Keystroke dynamics is a novel way to authenticate a user based on how they type on a computer. Our group is working to determine overall system performance via Monte Carlo simulations. When running Monte Carlo cross-validation on a very large dataset, the code can take a long time to finish (i.e., days). The process, in outline, is to randomly sample users' keystrokes, extract features, fuse the features, and use genuine and impostor data to produce false accept rates and false reject rates. This entire process is then repeated a large number of times, as Monte Carlo cross-validation requires.
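The last step of the pipeline above, turning genuine and impostor scores into error rates, can be sketched as follows. This is a minimal illustration, not our actual code: the score convention (higher score = better match) and the toy score values are assumptions.

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """Compute false accept rate (FAR) and false reject rate (FRR)
    at a given decision threshold.

    Assumes a higher score means a better match (hypothetical convention).
    """
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    far = np.mean(impostor >= threshold)  # impostors wrongly accepted
    frr = np.mean(genuine < threshold)    # genuine users wrongly rejected
    return far, frr

# Toy match scores, purely illustrative
genuine = [0.9, 0.8, 0.85, 0.6]
impostor = [0.3, 0.5, 0.7, 0.2]
far, frr = far_frr(genuine, impostor, threshold=0.65)
```

Sweeping the threshold over a range of values traces out the trade-off between the two error rates.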
For robust cross-validation we prefer a large number of iterations in the Monte Carlo simulations. The idea behind Monte Carlo cross-validation is to select a random subset of the data for training and testing and measure the performance on that subset. This process is repeated a very large number of times, and the results are averaged to obtain the average performance. Because many different subsets are used, this average is highly representative of the dataset. Conceptually, it is easy to see that if only one subset of the data were used, the results might be very good or very bad depending on which data happened to be selected. By selecting a large number of random subsets, the variation in performance due to the data is minimized. As the number of subsets approaches infinity, the variance of the performance estimate approaches 0 and the true performance is obtained. In practice an infinite number of subsets is not needed; the variance can be reduced substantially with even 100 iterations.
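The procedure described above can be sketched in a few lines. This is a generic illustration, not our production code: the `fit_score` callable (train on one split, return a performance number for the held-out split) and the test fraction are assumptions.

```python
import numpy as np

def monte_carlo_cv(X, y, fit_score, n_iters=100, test_frac=0.3, seed=0):
    """Monte Carlo cross-validation: repeatedly draw a random train/test
    split, evaluate on it, and average the resulting scores.

    `fit_score(X_train, y_train, X_test, y_test)` is a hypothetical
    user-supplied callable that returns one performance value.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    n_test = int(n * test_frac)
    scores = []
    for _ in range(n_iters):
        perm = rng.permutation(n)            # fresh random split each iteration
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        scores.append(fit_score(X[train_idx], y[train_idx],
                                X[test_idx], y[test_idx]))
    # mean is the performance estimate; std shrinks as n_iters grows
    return float(np.mean(scores)), float(np.std(scores))
```

The returned standard deviation makes the variance-reduction argument concrete: rerunning with larger `n_iters` shows the spread of the mean estimate shrinking.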
Currently, the code (in Python) we are running can take days to compute a Monte Carlo cross-validation of 50 iterations (other parameters are varied as well). An undergraduate student with knowledge of parallelization and computational performance could work through this code and significantly increase its speed. Speed could be improved in two main ways. The first is to parallelize the code as much as possible; the code could then run on GPUs and receive a large boost in performance. There are currently many for loops in the code, and if these were parallelized instead, computing time could be reduced. The second way is to pre-sample and pre-process the subsets of data. The code could then run on already partitioned and cleaned data, so there would be no need to sample and process data inside the loops.
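Both ideas can be combined in one sketch: the data is partitioned and cleaned ahead of time, and the independent Monte Carlo iterations are then mapped onto worker processes. This is a CPU-process sketch using the standard library (a GPU version would need a library such as CuPy); `run_iteration` and its placeholder metric are hypothetical stand-ins for the real per-iteration work.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def run_iteration(partition):
    """One Monte Carlo iteration on a pre-sampled, pre-processed partition.

    `partition` is a hypothetical (train, test) pair prepared before the
    run, so no sampling or cleaning happens inside the loop.
    """
    train, test = partition
    # Placeholder metric; the real code would train and score a model here.
    return float(np.mean(test))

def parallel_mccv(partitions, max_workers=4):
    """Run the iterations in parallel and average the results.

    Iterations are independent, so they map cleanly onto worker processes.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_iteration, partitions))
    return float(np.mean(results))
```

Because each partition is prepared once up front, the per-iteration work shrinks to model fitting and scoring, and the iterations scale almost linearly with the number of workers.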
1. Solid coding skills in Python or related scripting languages;
2. Solid background in statistics and probability;
3. Prior experience with machine learning and deep learning preferred.