# NeuroDecodeR object specification

The NeuroDecodeR (NDR) package is designed around five abstract object types which are:

1. Datasources (DS): Generate training and test splits of the data.

2. Feature preprocessors (FP): Learn parameters on the training set and apply transformations to the training and test sets.

3. Classifiers (CL): Learn the relationship between experimental conditions (i.e., “labels”) and neural data on a training set, and then predict experimental conditions neural data in a test set.

4. Result metrics (RM): Aggregate results across validation splits and over resampled runs and compute and plot final decoding accuracy metrics.

5. Cross-validators (CV): Take the DS, FP, CL and RM objects and run a cross-validation decoding procedure.

By having a standard set of object types, one can easily use different instances of these five object types to do different types of analyses.

For most common analyses, one can use instances of these different object types that come with the NDR. However, in some cases, one might want to extend the functionality of the NDR to gain additional insights. For example, one might want to try a different classifier to gain a better understanding of how populations of neurons code information (e.g., see Meyers, Borzello, Freiwald and Tsao, J Neurosci 2015).

The following document describes the methods and data formats that need to be implemented to create valid DS, FP, CL, RM, and CV object types. By creating new classes of objects that conform to these interfaces, one can easily extend the NDR to try new analyses.

# Datasources (DS)

Datasources are used to generate training and tests splits of data.

### Implementing an DS: methods and data formats

All datasources must implement a get_data() method that returns a data frame that has the following variables in it:

1. train_labels: The label levels that occur on each trial in the training data set

2. test_labels: The label levels that occur on each trial in the test data set

3. time_bin: The time in the experiment where the test data comes from

4. site_XXX: A collection of variables that each has data from one site (e.g., neuron, EEG channel etc.)

5. CV_XXX: A list for each CV split whether a given row is in that train or test set

Like all NDR objects, DS objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set to allow for reproducible research.

### Example of internals of DS objects using the ds_basic object

Here is an example the data returned by the ds_basic() datasource

data_file_name <- system.file("extdata/ZD_150bins_50sampled.Rda", package="NeuroDecodeR")
ds <- ds_basic(data_file_name, 'stimulus_ID', 18)
## Automatically selecting sites_IDs_to_use. Since num_cv_splits = 18 and num_label_repeats_per_cv_split = 1, all sites that have 18 repetitions have been selected. This yields 132 sites that will be used for decoding (out of 132 total).
all_cv_data <- get_data(ds)

names(all_cv_data)
##   [1] "train_labels" "test_labels"  "time_bin"     "site_0001"    "site_0002"
##   [6] "site_0003"    "site_0004"    "site_0005"    "site_0006"    "site_0007"
##  [11] "site_0008"    "site_0009"    "site_0010"    "site_0011"    "site_0012"
##  [16] "site_0013"    "site_0014"    "site_0015"    "site_0016"    "site_0017"
##  [21] "site_0018"    "site_0019"    "site_0020"    "site_0021"    "site_0022"
##  [26] "site_0023"    "site_0024"    "site_0025"    "site_0026"    "site_0027"
##  [31] "site_0028"    "site_0029"    "site_0030"    "site_0031"    "site_0032"
##  [36] "site_0033"    "site_0034"    "site_0035"    "site_0036"    "site_0037"
##  [41] "site_0038"    "site_0039"    "site_0040"    "site_0041"    "site_0042"
##  [46] "site_0043"    "site_0044"    "site_0045"    "site_0046"    "site_0047"
##  [51] "site_0048"    "site_0049"    "site_0050"    "site_0051"    "site_0052"
##  [56] "site_0053"    "site_0054"    "site_0055"    "site_0056"    "site_0057"
##  [61] "site_0058"    "site_0059"    "site_0060"    "site_0061"    "site_0062"
##  [66] "site_0063"    "site_0064"    "site_0065"    "site_0066"    "site_0067"
##  [71] "site_0068"    "site_0069"    "site_0070"    "site_0071"    "site_0072"
##  [76] "site_0073"    "site_0074"    "site_0075"    "site_0076"    "site_0077"
##  [81] "site_0078"    "site_0079"    "site_0080"    "site_0081"    "site_0082"
##  [86] "site_0083"    "site_0084"    "site_0085"    "site_0086"    "site_0087"
##  [91] "site_0088"    "site_0089"    "site_0090"    "site_0091"    "site_0092"
##  [96] "site_0093"    "site_0094"    "site_0095"    "site_0096"    "site_0097"
## [101] "site_0098"    "site_0099"    "site_0100"    "site_0101"    "site_0102"
## [106] "site_0103"    "site_0104"    "site_0105"    "site_0106"    "site_0107"
## [111] "site_0108"    "site_0109"    "site_0110"    "site_0111"    "site_0112"
## [116] "site_0113"    "site_0114"    "site_0115"    "site_0116"    "site_0117"
## [121] "site_0118"    "site_0119"    "site_0120"    "site_0121"    "site_0122"
## [126] "site_0123"    "site_0124"    "site_0125"    "site_0126"    "site_0127"
## [131] "site_0128"    "site_0129"    "site_0130"    "site_0131"    "site_0132"
## [136] "CV_1"         "CV_2"         "CV_3"         "CV_4"         "CV_5"
## [141] "CV_6"         "CV_7"         "CV_8"         "CV_9"         "CV_10"
## [146] "CV_11"        "CV_12"        "CV_13"        "CV_14"        "CV_15"
## [151] "CV_16"        "CV_17"        "CV_18"

# Feature preprocessors (FP)

Feature preprocessors learn a set of parameters from the training data and modify both the training and the test data based on these parameters, prior to the data being sent to the classifier. The features preprocessor objects must only use the training data to learn the preprocessing parameters in order to prevent contamination between the training and test data which could bias the results.

### Implementing an FP: required methods and data formats

All feature preprocessors must implement preprocess_data(). This method takes two data frames called training_set and test_set have the following variables:

#### training_set

1. training_labels: The labels used to train the classifier.
2. site_X: a group of variables that has data from multiple sites.

#### test_set

1. test_labels: The labels used to test the classifier
2. site_X: a group of variables that has data from multiple sites
3. time_bin: character strings listing which times different rows correspond to

The preprocess_data() returns a list with the two data frames training_set and test_set but the data in these data frames has been preprocessed based on parameters learned from the training_set.

Like all NDR objects, FP objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set to allow for reproducible research.

### Example of internals of FP objects using the fp_zscore

If you want to implement a new FP object yourself, below is an example of how the FP object gets and returns data.

# create a ds_basic to get the data
data_file_name <- system.file("extdata/ZD_150bins_50sampled.Rda", package="NeuroDecodeR")
ds <- ds_basic(data_file_name, 'stimulus_ID', 18)
## Automatically selecting sites_IDs_to_use. Since num_cv_splits = 18 and num_label_repeats_per_cv_split = 1, all sites that have 18 repetitions have been selected. This yields 132 sites that will be used for decoding (out of 132 total).
cv_data <- get_data(ds)

# an example of spliting the data into a training and test set,
# this is done in the cross-validator
training_set <- dplyr::filter(cv_data,
time_bin == "time.100_250",
CV_1 == "train") %>%       # get data from the first CV split
dplyr::select(starts_with("site"), train_labels)

test_set <- dplyr::filter(cv_data, CV_1 == "test") %>%   # get data from the first CV split
dplyr::select(starts_with("site"), test_labels, time_bin)

# use the fp object to normalize the data
fp <- fp_zscore()
processed_data <- preprocess_data(fp, training_set, test_set)

# prior to z-score normalizing the mean (e.g. for site 1) is not 0
mean(training_set$site_0001) ## [1] 0.003137255 # after normalizing the data the mean is pretty much 0 mean(processed_data$training_set$site_0001) ## [1] -5.768502e-17 # Classifiers (CL) Classifiers take a set of training data and training labels, and learn a model of the relationship between the training data and the labels from the different classes. Once this model has been learned (i.e., once the classifier has been trained), the classifier is then used to make predictions about what labels were present in a new set of test data. ### Implementing a CL: required methods and data formats Objects that are classifiers must implement the get_predictions() method. This method takes two data frames called training_set and test_set have the following variables: #### training_set 1. training_labels: The labels used to train the classifier. 2. site_X: a group of variables that has data from multiple sites. #### test_set 1. test_labels The labels used to test the classifier. 2. site_X: a group of variables that has data from multiple sites. 3. time_bin: character strings listing which times different rows correspond to. The get_predictions() returns a data frame that has the following variables: 1. test_time: a character vector indicating the times which the rows come from 2. actual_labels: the labels that were actually present on each trial 3. predicted_labels: the labels that the classifier predicted 4. decision_vals.X (optional): a group of variables that has values that indicate how strongly the classifier rates a test point as coming from a particular class Like all NDR objects, CL objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set to allow for reproducible research. ### Example of internals of CL object using the cl_max_correlation If you want to implement a new CL object yourself, below is an example of how the CL object gets and returns data. # create a ds_basic to get the data data_file_name <- system.file("extdata/ZD_150bins_50sampled.Rda", package="NeuroDecodeR") ds <- ds_basic(data_file_name, 'stimulus_ID', 18) ## Automatically selecting sites_IDs_to_use. Since num_cv_splits = 18 and num_label_repeats_per_cv_split = 1, all sites that have 18 repetitions have been selected. This yields 132 sites that will be used for decoding (out of 132 total). cv_data <- get_data(ds) # an example of spliting the data into a training and test set, # this is done in the cross-validator training_set <- dplyr::filter(cv_data, time_bin == "time.100_250", CV_1 == "train") %>% # get data from the first CV split dplyr::select(starts_with("site"), train_labels) test_set <- dplyr::filter(cv_data, CV_1 == "test") %>% # get data from the first CV split dplyr::select(starts_with("site"), test_labels, time_bin) # use the cl object to make predictions cl <- cl_max_correlation() predictions <- get_predictions(cl, training_set, test_set) names(predictions) ## [1] "test_time" "actual_labels" "predicted_labels" ## [4] "decision_vals.car" "decision_vals.couch" "decision_vals.face" ## [7] "decision_vals.flower" "decision_vals.guitar" "decision_vals.hand" ## [10] "decision_vals.kiwi" # see how accurate the predictions are (chance is 1/7) predictions_at_100ms <- dplyr::filter(predictions, test_time == "time.100_250") mean(predictions_at_100ms$actual_labels == predictions_at_100ms\$predicted_labels)
## [1] 0.8571429

# Result metrics (RM)

Result metrics take the predictions made by a classifier and aggregate the results so that they can be interpreted.

### Implementing an RM: required methods and data formats

To create a result metric two methods must be implemented aggregate_CV_split_results() which aggregates the results after a set of cross-validation sweeps have been completed and aggregate_resample_run_results() which aggregates the final results across all the resample runs.

#### aggregate_CV_split_results() method

The aggregate_CV_split_results() method takes a data frame that is a concatenation of the prediction data frames made by the classifier (CL) objects across all times and cross-validation splits in one resample run. Thus the input data frame to the aggregate_CV_split_results() method has similar variables as in the output of the CL get_predictions() method, namely:

1. CV: a number indicating which cross-validation split the current row comes from

2. train_time: the train time that the current row comes from.

3. test_time: the test time that the current row comes from.

4. actual_labels: the labels that were actually present on each trial.

5. predicted_labels: the labels that the classifier predicted.

6. decision_vals.X (optional): a group of variables that has values that indicate how strongly the classifier rates a test point as coming from a particular class

The output of the aggregate_CV_split_results is a RM object of the same type that contains inherits from a data frame so that the results can be can be aggregated together (e.g., via rbind) across resample runs. The variables in the data frame can be anything that is useful to capture about the classification performance.

#### aggregate_resample_run_results() method

The aggregate_resample_run_results() method takes result metric data frames that have been aggregated together (e.g., via rbind) across resample runs. Thus this input data frame as the same variables as the data frame returned by the aggregate_CV_split_results() along with one additional variable indicating which resample run each row comes from.

The output of this method should be a RM object of the same type that is a data frame which most likely is of a smaller size.

Like all NDR objects, RM objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set to allow for reproducible research.

RM objects can also have plot methods to allow the different aggregated results to be plotted

### Example of result metrics

Examples of using result metrics can be seen by going through the Introduction tutorial

# Cross-validators (CV)

Cross-validators take a classifier (CL), a datasource (DS) feature preprocessors (FP) objects, and result metric (RM) objects and they run a cross-validation decoding scheme by training and testing the classifier with data generated from the datasource object (and possibly fed through the feature pre-processing first).

### Implementing a CV: required methods and data formats

All cross-validators must implement run_decoding() method. This method does not take any additional arguments (apart from the cross-validator itself).

The cross-validator returns a list DECODING_RESULTS which contains different RM objects that can be used to assess how accurately the classifier made predictions at different points in time.

Like all NDR objects, CV objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set and also pulls all properties from the other NDR objects (e.g., from the DS, FP, CL and RM objects) to allow for reproducible research.

### Example of cross-validators

Examples of using the cv_standard can be seen by going through the Introduction tutorial