Machine learning: How to represent data?

Best practices for feature engineering

Machine learning: Representation

To train a model, you must choose the set of features that best represents the data.

Feature engineering means transforming raw data into a feature vector. Expect to spend a significant share of a machine learning project's time on feature engineering.
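As a rough illustration of turning raw data into a feature vector, here is a minimal sketch in Python. The record fields (num_rooms, street_name), the vocabulary and the function name to_feature_vector are all hypothetical and chosen only for this example. Numeric values are copied over as floats, and the categorical value is one-hot encoded with one extra slot for values outside the vocabulary.

```python
# Hypothetical raw record; the field names are illustrative only.
raw_example = {"num_rooms": 3, "street_name": "Main Street"}

# Assumed vocabulary of known street names (an assumption for this sketch).
STREET_VOCAB = ["Main Street", "Oak Avenue", "Shoreline Blvd"]


def to_feature_vector(record):
    """Turn one raw record into a flat list of floats."""
    # Numeric fields can be copied over directly as floats.
    features = [float(record["num_rooms"])]

    # Categorical fields are one-hot encoded. The last slot is reserved
    # for any value that is not in the vocabulary.
    one_hot = [0.0] * (len(STREET_VOCAB) + 1)
    try:
        index = STREET_VOCAB.index(record["street_name"])
    except ValueError:
        index = len(STREET_VOCAB)
    one_hot[index] = 1.0

    return features + one_hot


vector = to_feature_vector(raw_example)
print(vector)  # [3.0, 1.0, 0.0, 0.0, 0.0]
```

The extra "unknown" slot is one possible design choice: it keeps the vector length fixed even when new street names show up after training.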

Properties of a good feature:

A feature should take a non-zero value more than a handful of times in the dataset.

Features should have a clear, obvious meaning.

Features shouldn't take on "magic" values, such as -1 standing in for a missing measurement. Use a separate indicator feature instead.

The definition of a feature shouldn't change over time.

The distribution of a feature's values should not have extreme outliers.
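Two of the properties above, avoiding "magic" values and avoiding extreme outliers, can be sketched in a few lines of Python. This is only one possible approach: the sentinel -1.0, the clipping bounds and the helper names split_magic_value and clip are assumptions made for the example, not values prescribed by these notes.

```python
def split_magic_value(values, magic=-1.0):
    """Replace a sentinel value with 0.0 and return a separate indicator.

    The indicator is 1.0 where the original value was defined, else 0.0.
    """
    cleaned = [0.0 if v == magic else v for v in values]
    is_defined = [0.0 if v == magic else 1.0 for v in values]
    return cleaned, is_defined


def clip(values, lower, upper):
    """Cap every value to the range [lower, upper] to tame outliers."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical feature where -1.0 was used to mean "not recorded".
watch_time = [1.5, -1.0, 3.0, 250.0]

cleaned, is_defined = split_magic_value(watch_time)
clipped = clip(cleaned, lower=0.0, upper=10.0)

print(is_defined)  # [1.0, 0.0, 1.0, 1.0]
print(clipped)     # [1.5, 0.0, 3.0, 10.0]
```

Clipping is just one way to handle outliers. Depending on the feature, a log transform or bucketing may be a better fit.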

Good habits:

Visualize: Plot histograms and rank values from most to least common.

Debug: Are there duplicate examples? Missing values? Outliers? Does the data agree with dashboards? Are the training and validation data similar?

Monitor: Track feature quantiles and the number of examples over time.
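The habits above can be partly automated. The following sketch uses only the Python standard library on a tiny, made-up dataset (the city and price fields are hypothetical). It ranks categorical values by frequency, counts exact duplicates and missing values, and computes quartiles that could be logged and compared over time.

```python
from collections import Counter
import statistics

# Tiny made-up dataset, for illustration only.
examples = [
    {"city": "Paris", "price": 10.0},
    {"city": "Paris", "price": 10.0},  # exact duplicate of the first row
    {"city": "Lyon", "price": None},   # missing value
    {"city": "Paris", "price": 12.0},
    {"city": "Nice", "price": 500.0},  # possible outlier
]

# Visualize: rank categorical values from most to least common.
city_counts = Counter(e["city"] for e in examples).most_common()

# Debug: count exact duplicate rows and missing values.
row_counts = Counter(tuple(sorted(e.items())) for e in examples)
num_duplicates = sum(count - 1 for count in row_counts.values())
num_missing = sum(1 for e in examples if e["price"] is None)

# Monitor: quartiles of the defined values, plus the number of examples.
prices = [e["price"] for e in examples if e["price"] is not None]
quartiles = statistics.quantiles(prices, n=4)
num_examples = len(examples)

print(city_counts[0])   # ('Paris', 3)
print(num_duplicates)   # 1
print(num_missing)      # 1
print(quartiles, num_examples)
```

In practice these numbers would be recorded for each new batch of data, so that a sudden shift in the quartiles or in the example count stands out.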