Scalable Automatic Feature Engineering

Automatically generates new features from existing ones for further modelling, using the SAFE algorithm proposed in a paper by Shi, Zhang, Li, Yang and Zhou. This is a direct implementation of the pseudocode given in the paper, with its conventions, notation and flaws.

SAFE(
  X_train,
  y_train,
  X_valid,
  y_valid,
  operators = list(NULL, list(`+`, `-`, `*`)),
  n_iter = 10,
  nrounds = 5,
  alpha = 0.1,
  gamma = 10,
  bins = 30,
  theta = 0.8,
  beta = Inf
)

Arguments

X_train	Matrix - data used to train model. Must be numerical.
y_train	Factor - labels for training data. Must be binary.
X_valid	Matrix - data used to test model. Must be numerical.
y_valid	Factor - labels for testing data. Must be binary.
operators	A `list` of lists of functions. The i-th list contains functions that accept `i` vectors of equal length and return one vector of the same length.
n_iter	Integer; number of iterations for the algorithm to perform.
nrounds	Integer; number of boosting rounds passed to `xgb.train`.
alpha	Threshold for information value (`IV`). Features with IV < alpha will be dropped.
gamma	Integer; number of most important feature combinations to be selected in each iteration.
bins	Integer; number of bins used to discretize features.
theta	Threshold for Pearson's correlation. Features with correlation above theta will be dropped.
beta	Integer; maximum number of features to be selected at the end of each iteration. Set to `Inf` to select all features.
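
The layout of `operators` is easiest to see with a small sketch. The snippet below is only an illustration of the "list indexed by arity" convention described above, not a call into SAFE itself. The unary functions `log1p` and `sqrt` and the vectors `x1`, `x2` are assumptions chosen for the example; whether SAFE handles these particular unary functions well has not been checked here.

```r
# Operators grouped by arity: element i holds functions that take i vectors.
operators <- list(
  list(log1p, sqrt),       # arity 1: one vector in, one vector out (illustrative choice)
  list(`+`, `-`, `*`)      # arity 2: two vectors in, one vector out (the documented default)
)

# Two toy feature columns of equal length
x1 <- c(1, 4, 9)
x2 <- c(2, 2, 2)

# Apply every unary operator to x1
unary_out <- lapply(operators[[1]], function(f) f(x1))

# Apply every binary operator to the pair (x1, x2)
binary_out <- lapply(operators[[2]], function(f) f(x1, x2))

# In the default value the first element is NULL, i.e. no unary operators;
# iterating over it simply yields an empty list.
default_ops <- list(NULL, list(`+`, `-`, `*`))
no_unary <- lapply(default_ops[[1]], function(f) f(x1))
```

Each result has the same length as its inputs, which is the contract the `operators` argument asks for.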

Value

A list with 2 elements: X_train and X_test. Both contain the transformed train and test sets, ready for further modelling. Unfortunately, this differs from the algorithm described in the paper (which returns a function), at least for now.
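
A call might look roughly like the sketch below. The synthetic data, the 150/50 split, and the reduced `n_iter` and `gamma` values are assumptions made for illustration only. The call is guarded so the snippet still runs when `SAFE` is not available in the session, and the shape of the result is taken from the description above rather than verified.

```r
set.seed(1)
n <- 200

# Numerical feature matrix with named columns (synthetic data, for illustration)
X <- matrix(rnorm(n * 4), ncol = 4,
            dimnames = list(NULL, paste0("x", 1:4)))

# Binary factor label driven by an interaction between x1 and x2
y <- factor(as.integer(X[, 1] * X[, 2] + rnorm(n, sd = 0.1) > 0))

# Simple train / validation split
train_idx <- seq_len(150)
X_train <- X[train_idx, ]
y_train <- y[train_idx]
X_valid <- X[-train_idx, ]
y_valid <- y[-train_idx]

# Only call SAFE when the function is actually available in the session
if (exists("SAFE", mode = "function")) {
  res <- SAFE(X_train, y_train, X_valid, y_valid,
              n_iter = 3, gamma = 5)
  # Per the Value section, a list with elements X_train and X_test is expected
  str(res)
}
```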