Pre-processing datasets
The pre_process() function helps process input data and converts it into a standardized format for later use. It accepts two types of input: a list of datasets from different sources, or a single long dataset whose last column is type.
Usage
pre_process(
  data,
  typenameList = NULL,
  replaceNA = TRUE,
  scale = TRUE,
  autoColName = "Sec_"
)
Arguments
- data
  A data.frame that describes one feature per row. It should contain the variables ID, value on time_1, ..., value on time_k, and type, which are used to extract patterns across time. Note that the first and last columns must be named exactly ID and type. To analyze several data.frames in this format, pass a list of data.frames instead. In that case the type variable is not required; it is generated from the next argument, typenameList.
- typenameList
  A vector of strings that gives the source or name of each data.frame. It is only applicable when data is a list of data.frames. By default, the names are set to "Dataset_1", "Dataset_2", and so on.
- scale
  Logical; if scale is TRUE (default), each row of the data.frame is standardized with base::scale, which converts each original value into a z-score. See also scale_by_row__().
- autoColName
  A string; if autoColName is not NULL (default "Sec_"), uniform column names are set automatically for all the data.frames. It is only applicable when data is a list of data.frames.
- replaceNA
  Logical; if replaceNA is TRUE (default), NA values are replaced with 0.
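The effect of replaceNA and scale can be sketched in base R. This is a minimal illustration, not the package's internal code: the helper name scale_rows() and its replace_na argument are hypothetical, and it assumes that NA values are replaced before scaling and that all columns other than ID and type hold the time-point values. The toy rows are rows 1 and 5 of test_data as printed in the Examples section.

```r
# Hypothetical helper; a sketch of NA replacement plus row-wise z-scoring.
scale_rows <- function(df, replace_na = TRUE) {
  # Assumption: every column except ID and type is a time-point value
  time_cols <- setdiff(names(df), c("ID", "type"))
  m <- as.matrix(df[, time_cols, drop = FALSE])
  if (replace_na) m[is.na(m)] <- 0
  # base::scale() standardizes columns, so transpose before and after
  z <- t(scale(t(m)))
  df[time_cols] <- as.data.frame(z)
  df
}

toy <- data.frame(
  ID = c(1, 5),
  T1 = c(1, 4), T2 = c(0, 3), T3 = c(0, NA), T4 = c(1, 2), T5 = c(1, 5),
  T6 = c(0, 5), T7 = c(0, 0), T8 = c(1, 0), T9 = c(6, 0), T10 = c(6, 0),
  type = "type_A"
)
scaled <- scale_rows(toy)
round(scaled$T1, 4)
```

Under these assumptions the T1 values agree with the corresponding rows of the scaled output shown in Examples. Because base::scale() uses the sample standard deviation, a row with constant values would produce NaN in this sketch.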
Value
The function returns a long data.frame with columns ID, value on time_1, ..., value on time_k, and type.
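As a hedged illustration of this format, the following base-R check tests whether a data.frame has ID as its first column, type as its last column, and only numeric columns in between. The helper name has_long_format() is hypothetical and not part of the package, and the requirement that value columns be numeric is an assumption.

```r
# Hypothetical helper; not exported by the package.
has_long_format <- function(df) {
  if (!is.data.frame(df) || ncol(df) < 3) return(FALSE)
  nms <- names(df)
  # Everything between the first and last column is treated as a value column
  value_cols <- df[, -c(1, ncol(df)), drop = FALSE]
  nms[1] == "ID" &&
    nms[length(nms)] == "type" &&
    all(vapply(value_cols, is.numeric, logical(1)))
}

ok  <- data.frame(ID = 1:2, T1 = c(0.5, -0.5), T2 = c(-0.5, 0.5), type = "type_A")
bad <- data.frame(type = "type_A", T1 = c(0.5, -0.5), ID = 1:2)
has_long_format(ok)   # TRUE
has_long_format(bad)  # FALSE
```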
Details
We consider two distinct scenarios for this function:
In the first scenario, several datasets are collected for the same objects from different aspects and instruments. For example, lipids, metabolites, and peptides might be measured separately from the same soil sample.
In the second scenario, all the data are of uniform quality but can be divided into larger groups that differ significantly from one another.
In both cases, pre_process() is a versatile tool. It is optional when generating the dashboard: users can do their own processing as long as the result matches the required output format. Note that the number of samples (time points) must be greater than 5 to avoid potential errors in the subsequent prediction section.
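For users who prefer to do their own processing in the first scenario, the stacking step might look like the base-R sketch below. It is an illustration under stated assumptions, not the package's implementation: combine_sources() and its arguments are hypothetical names, every input is assumed to have ID as its first column and the same number of time points, and the uniform column names are assumed to be a prefix followed by a running index. The lipids and peptides data are made up for the example.

```r
# Hypothetical helper; pre_process() may work differently internally.
combine_sources <- function(data_list, type_names = NULL, col_prefix = "Sec_") {
  if (is.null(type_names)) {
    type_names <- paste0("Dataset_", seq_along(data_list))
  }
  stopifnot(length(type_names) == length(data_list))
  pieces <- Map(function(df, nm) {
    k <- ncol(df) - 1  # first column is ID, the rest are time points
    names(df) <- c("ID", paste0(col_prefix, seq_len(k)))
    df$type <- nm      # label every row with its source
    df
  }, data_list, type_names)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Made-up inputs with six time points each
lipids <- data.frame(ID = 1:2, a = c(1, 2), b = c(3, 4), c = c(5, 6),
                     d = c(7, 8), e = c(9, 10), f = c(11, 12))
peptides <- data.frame(ID = 1:3, u = 1:3, v = 4:6, w = 7:9,
                       x = 10:12, y = 13:15, z = 16:18)
long <- combine_sources(list(lipids, peptides), c("lipids", "peptides"))

# The number of time points must be greater than 5
n_timepoints <- ncol(long) - 2
stopifnot(n_timepoints > 5)
```

The final check mirrors the requirement on the number of time points, so a dataset that is too short fails early rather than in the later prediction step.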
Examples
data(test_data)
head(test_data, 10)
#> ID T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 type
#> 1 1 1 0 0 1 1 0 0 1 6 6 type_A
#> 2 2 6 0 0 0 0 3 1 0 2 1 type_A
#> 3 3 1 0 0 0 2 0 0 2 2 1 type_A
#> 4 4 4 5 3 3 7 2 1 1 0 0 type_A
#> 5 5 4 3 NA 2 5 5 0 0 0 0 type_A
#> 6 6 4 1 0 1 3 1 3 5 11 14 type_A
#> 7 7 1 0 0 0 1 3 3 1 1 1 type_A
#> 8 8 4 2 1 1 1 1 0 0 0 0 type_A
#> 9 9 1 1 1 19 22 1 2 1 1 2 type_A
#> 10 10 1 1 3 5 8 2 2 2 5 2 type_A
a <- pre_process(test_data)
head(a, 10)
#> ID T1 T2 T3 T4 T5 T6
#> 1 1 -0.25354628 -0.6761234 -0.67612340 -0.25354628 -0.25354628 -0.6761234
#> 2 2 2.41458180 -0.6678631 -0.66786305 -0.66786305 -0.66786305 0.8733594
#> 3 3 0.21764288 -0.8705715 -0.87057150 -0.87057150 1.30585725 -0.8705715
#> 4 4 0.61658123 1.0569964 0.17616607 0.17616607 1.93782672 -0.2642491
#> 5 5 0.96186009 0.5038315 -0.87025436 0.04580286 1.41988870 1.4198887
#> 6 6 -0.06459959 -0.7105955 -0.92592741 -0.71059546 -0.27993154 -0.7105955
#> 7 7 -0.09086738 -0.9995412 -0.99954118 -0.99954118 -0.09086738 1.7264802
#> 8 8 2.40535118 0.8017837 0.00000000 0.00000000 0.00000000 0.0000000
#> 9 9 -0.50260633 -0.5026063 -0.50260633 1.70395805 2.07171878 -0.5026063
#> 10 10 -0.94019379 -0.9401938 -0.04477113 0.85065153 2.19378551 -0.4924825
#> T7 T8 T9 T10 type
#> 1 -0.6761234 -0.25354628 1.85933936 1.85933936 type_A
#> 2 -0.1541222 -0.66786305 0.35961857 -0.15412224 type_A
#> 3 -0.8705715 1.30585725 1.30585725 0.21764288 type_A
#> 4 -0.7046643 -0.70466426 -1.14507943 -1.14507943 type_A
#> 5 -0.8702544 -0.87025436 -0.87025436 -0.87025436 type_A
#> 6 -0.2799315 0.15073237 1.44272411 2.08871998 type_A
#> 7 1.7264802 -0.09086738 -0.09086738 -0.09086738 type_A
#> 8 -0.8017837 -0.80178373 -0.80178373 -0.80178373 type_A
#> 9 -0.3800194 -0.50260633 -0.50260633 -0.38001942 type_A
#> 10 -0.4924825 -0.49248246 0.85065153 -0.49248246 type_A