Machine learning : How to represent data?
Best practices for Feature engineering
Machine Learning : Representation
In order to train a model, you must choose the set of features that best represent the data.
Feature engineering means transforming raw data into a feature vector. A lot of the time spent on a machine learning project goes into feature engineering.
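As a rough illustration of turning raw data into a feature vector, the sketch below copies numeric fields over as floats and one-hot encodes a categorical field. The record fields (`num_rooms`, `area_sqm`, `street_name`), the vocabulary, and the helper `to_feature_vector` are all hypothetical names chosen for this example, not part of any library.

```python
# Hypothetical vocabulary for the categorical "street_name" field.
STREET_VOCAB = ["Main Street", "Oak Avenue", "Pine Road"]


def to_feature_vector(record):
    """Map one raw record (a dict) to a flat list of floats."""
    # Numeric fields can be copied over directly as floats.
    features = [float(record["num_rooms"]), float(record["area_sqm"])]

    # One-hot encode the categorical field. The extra final slot
    # catches values that are not in the vocabulary.
    one_hot = [0.0] * (len(STREET_VOCAB) + 1)
    street = record["street_name"]
    if street in STREET_VOCAB:
        index = STREET_VOCAB.index(street)
    else:
        index = len(STREET_VOCAB)
    one_hot[index] = 1.0

    return features + one_hot


raw = {"num_rooms": 3, "area_sqm": 72.5, "street_name": "Oak Avenue"}
print(to_feature_vector(raw))  # [3.0, 72.5, 0.0, 1.0, 0.0, 0.0]
```

The out-of-vocabulary slot is one possible design choice; it keeps the vector length fixed even when an unseen street name shows up.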
Properties of a good feature :
A feature value should appear with a non-zero value more than a handful of times in the dataset.
Features should have a clear, obvious meaning.
Features shouldn't take on "magic" values, such as -1 standing in for a missing value.
The definition of a feature shouldn't change over time.
A feature's distribution should not have extreme outliers.
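Some of the properties above can be checked or enforced with a few lines of code. The following is a minimal sketch, assuming a sentinel of -1.0 marks missing values and that clipping to a fixed range is an acceptable way to handle outliers. The helper names (`rare_values`, `split_magic_value`, `clip`), the thresholds, and the sample data are hypothetical.

```python
from collections import Counter


def rare_values(values, min_count=5):
    """Return the values that appear fewer than `min_count` times."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c < min_count)


def split_magic_value(values, magic=-1.0, fill=0.0):
    """Replace a sentinel with `fill` and return a separate 0/1
    'is defined' feature alongside the cleaned values."""
    cleaned = [fill if v == magic else v for v in values]
    is_defined = [0.0 if v == magic else 1.0 for v in values]
    return cleaned, is_defined


def clip(values, lower, upper):
    """Cap every value to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Assumed sample data: -1.0 is a sentinel for "unknown".
days_on_market = [12.0, -1.0, 30.0, 400.0]

cleaned, is_defined = split_magic_value(days_on_market)
print(cleaned)                   # [12.0, 0.0, 30.0, 400.0]
print(is_defined)                # [1.0, 0.0, 1.0, 1.0]
print(clip(cleaned, 0.0, 90.0))  # [12.0, 0.0, 30.0, 90.0]
print(rare_values(["a", "a", "b"], min_count=2))  # ['b']
```

Splitting the sentinel into its own indicator feature lets the model see "missing" explicitly instead of treating -1 as a real measurement.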
Good Habits :
Visualize: Plot histograms and rank values from most to least common.
Debug: Are there duplicate examples? Missing values? Outliers? Does the data agree with dashboards? Are the training and validation data similar?
Monitor: Track feature quantiles and the number of examples over time.
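The habits above can be approximated with the standard library alone. This is a sketch under assumptions: rows are plain dicts with hashable values, a missing value is represented as `None`, and the helper names (`rank_values`, `debug_report`, `quartiles`) and the sample rows are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def rank_values(values):
    """Visualize: list (value, count) pairs from most to least common."""
    return Counter(values).most_common()


def debug_report(rows):
    """Debug: count exact duplicate rows and rows with a missing field."""
    row_counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values())
    missing = sum(1 for row in rows if any(v is None for v in row.values()))
    return {"duplicates": duplicates, "rows_with_missing": missing}


def quartiles(values):
    """Monitor: three cut points that split the values into four groups."""
    return quantiles(values, n=4)


rows = [
    {"city": "Paris", "rooms": 2},
    {"city": "Paris", "rooms": 2},   # exact duplicate of the first row
    {"city": "Lyon", "rooms": None},  # missing value
    {"city": "Paris", "rooms": 3},
]

print(rank_values([row["city"] for row in rows]))  # [('Paris', 3), ('Lyon', 1)]
print(debug_report(rows))  # {'duplicates': 1, 'rows_with_missing': 1}
print(quartiles([1, 2, 3, 4, 5, 6, 7]))
```

Computing `quartiles` separately on training and validation data, or on successive time windows, gives a simple way to spot when the two sets drift apart.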