Reshaping, Transforming and Cleaning Datasets
It is frequently necessary to perform operations on the structure or contents of datasets prior to creating a predictive model.
For example, it may be necessary to reshape some of the datasets before merging them into a single table, transform attributes into more manageable data types, or clean up the dataset by removing outliers or attributes that could cause confusion in the final model.
Task | Description | Corresponding page |
---|---|---|
Reshaping Tasks | ||
Reshape To Long | Transforms key attributes in a dataset into new rows. This operation is necessary when a table contains more than one key. | |
Reshape To Wide | Transforms key attributes in a dataset into new columns. This operation is necessary when a table contains more than one key. | |
Transpose | Converts rows into columns and vice versa. | |
Transforming Tasks | ||
Discretize | Transforms continuous attributes into a finite set of intervals. | |
Moving Window | Defines temporal windows of data of a specific size and shape. | |
Cleaning Tasks | ||
Fill/Clean | Removes attributes that could create confusion in the resulting predictive model. | |
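
To make the reshaping and transforming operations in the table more concrete, the following is a minimal sketch in plain Python (standard library only). It does not show how the tasks themselves are implemented. The sample data, the function names and the cutoff values are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical table: one row per patient, one column per visit.
one_row_per_patient = [
    {"patient": "A", "visit1": 5.1, "visit2": 5.4},
    {"patient": "B", "visit1": 6.0, "visit2": 5.8},
]

def columns_to_rows(rows, key, value_cols, var_name="variable", value_name="value"):
    """Turn each listed column into its own row (more rows, fewer columns)."""
    return [
        {key: row[key], var_name: col, value_name: row[col]}
        for row in rows
        for col in value_cols
    ]

def rows_to_columns(rows, key, var_name="variable", value_name="value"):
    """Turn each distinct variable into its own column (fewer rows, more columns)."""
    out = {}
    for row in rows:
        out.setdefault(row[key], {key: row[key]})[row[var_name]] = row[value_name]
    return list(out.values())

def transpose(matrix):
    """Swap the rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*matrix)]

def discretize(value, cutoffs):
    """Return the index of the interval that contains value (cutoffs sorted ascending)."""
    return bisect_right(cutoffs, value)

def moving_windows(series, size):
    """Return all consecutive, overlapping windows of a fixed size."""
    return [series[i:i + size] for i in range(len(series) - size + 1)]

# Reshape in one direction, then back again.
one_row_per_measurement = columns_to_rows(one_row_per_patient, "patient", ["visit1", "visit2"])
round_trip = rows_to_columns(one_row_per_measurement, "patient")

# Map a continuous value onto one of three intervals: <5.5, [5.5, 6.0), >=6.0.
interval = discretize(5.8, [5.5, 6.0])

# Windows of three consecutive samples over a short series.
windows = moving_windows([1, 2, 3, 4], 3)
```

In this sketch, reshaping in one direction and then back returns the original table, which is a quick way to check that no values were lost along the way.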
Outliers
It is also important to correctly identify and manage outliers, which are anomalous data samples that can have a negative impact on predictive models if not handled correctly.
Outlier handling is not limited to a single task. For details, see Identifying and Managing Outliers.
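
As a rough illustration of what identifying outliers can involve, the sketch below applies the interquartile range (IQR) rule, one common heuristic among many. It is an assumption made for illustration and not necessarily the method used by any specific task. The sample readings, the function name `iqr_outliers` and the multiplier `k=1.5` are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one anomalous sample.
readings = [10, 12, 11, 13, 12, 11, 95]

flagged = iqr_outliers(readings)
cleaned = [v for v in readings if v not in flagged]
```

Whether flagged samples should be removed, capped or kept depends on the dataset and on the model being built, so the filtering step here is only one possible way to manage them.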