Building Input Functions with tf.estimator

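This tutorial introduces data preprocessing in TensorFlow, specifically how to create input functions in tf.estimator. You'll see how to construct an input_fn that preprocesses data and feeds it into your models. Then you'll implement an input_fn that feeds training, evaluation, and prediction data into a neural network regressor for predicting median house values.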

Custom Input Pipelines with input_fn

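The input_fn is used to pass feature and target data to the train, evaluate, and predict methods of the Estimator. You can do feature engineering or preprocessing inside the input_fn. Here is an example taken from the tf.estimator Quickstart tutorial: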

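import numpy as np
import tensorflow as tf

# IRIS_TRAINING and classifier are defined in the Quickstart tutorial.
# The built-in int replaces the np.int alias, which newer NumPy releases removed.
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TRAINING, target_dtype=int, features_dtype=np.float32)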

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(training_set.data)},
    y=np.array(training_set.target),
    num_epochs=None,
    shuffle=True)

classifier.train(input_fn=train_input_fn, steps=2000)

Anatomy of an input_fn

The following code illustrates the basic skeleton of an input function:

def my_input_fn():

    # Preprocess your data here...

    # ...then return 1) a mapping of feature columns to Tensors with
    # the corresponding feature data, and 2) a Tensor containing labels
    return feature_cols, labels

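The body of the input function contains the specific logic for preprocessing your input data, such as scrubbing out bad examples or scaling features.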

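Input functions must return the following two values, which contain the final feature and label data to be fed into your model (as shown in the code skeleton above):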

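feature_cols: A dict of key/value pairs that map feature column names to the Tensors (or SparseTensors) containing the corresponding feature data.
labels: A Tensor containing your label (target) values, that is, the values your model aims to predict.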
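The shape of that return value can be sketched in plain Python. This is only an illustration under stated assumptions: the function name sketch_input_fn, the (rooms, age, price) row layout, and the sample rows are all hypothetical, and plain lists stand in for the Tensors that a real input_fn would return.

```python
def sketch_input_fn(raw_rows):
    """Illustrative input function over plain Python data.

    raw_rows is a hypothetical list of (rooms, age, price) tuples.
    A real input_fn would return Tensors, not lists.
    """
    # Scrub bad examples: drop any row with a missing value.
    clean = [row for row in raw_rows if None not in row]

    rooms = [row[0] for row in clean]
    ages = [row[1] for row in clean]
    labels = [row[2] for row in clean]

    # Feature scaling: min-max scale a column to the range [0, 1].
    def min_max(column):
        lo, hi = min(column), max(column)
        span = hi - lo
        return [(v - lo) / span if span else 0.0 for v in column]

    # 1) a mapping of feature names to feature data, 2) the labels.
    feature_cols = {"rooms": min_max(rooms), "age": min_max(ages)}
    return feature_cols, labels


features, labels = sketch_input_fn(
    [(4, 10, 20.0), (6, None, 25.0), (8, 30, 31.5)])
print(features)
print(labels)
```

The second row is dropped because its age is missing, so each returned column holds two values.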

Converting Feature Data to Tensors

If your feature/label data is a Python array or is stored in pandas DataFrames or numpy arrays, you can use the following methods to construct an input_fn:

import numpy as np
# numpy input_fn.
my_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(x_data)},
    y=np.array(y_data),
    ...)

import pandas as pd
# pandas input_fn.
my_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=pd.DataFrame({"x": x_data}),
    y=pd.Series(y_data),
    ...)

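For sparse, categorical data (data where the majority of values are 0), you'll instead want to populate a SparseTensor, which is instantiated with three arguments: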

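dense_shape: The shape of the tensor. Takes a list indicating the number of elements in each dimension. For example, dense_shape=[3,6] specifies a two-dimensional 3x6 tensor, dense_shape=[2,3,4] specifies a three-dimensional 2x3x4 tensor, and dense_shape=[9] specifies a one-dimensional tensor with 9 elements.
indices: The indices of the elements in your tensor that contain nonzero values. Takes a list of terms, where each term is itself a list containing the index of a nonzero element. (Elements are zero-indexed, so [0,0] is the index of the element in the first row and first column of a two-dimensional tensor.) For example, indices=[[1,3], [2,4]] specifies that the elements with indices [1,3] and [2,4] have nonzero values.
values: A one-dimensional tensor of values. Term i in values corresponds to term i in indices and specifies its value. For example, given indices=[[1,3], [2,4]], the parameter values=[18, 3.6] specifies that element [1,3] of the tensor has a value of 18 and element [2,4] has a value of 3.6.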

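The following code defines a two-dimensional SparseTensor with 3 rows and 5 columns. The element with index [0,1] has a value of 6, and the element with index [2,4] has a value of 0.5 (all other values are 0):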

sparse_tensor = tf.SparseTensor(indices=[[0,1], [2,4]],
                                values=[6, 0.5],
                                dense_shape=[3, 5])

This corresponds to the following dense tensor:

[[0, 6, 0, 0, 0]
 [0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0.5]]

For more on SparseTensor, see tf.SparseTensor.
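To make the relationship between indices, values, and dense_shape concrete, here is a small pure-Python sketch that expands the three arguments into a nested list. The helper sparse_to_dense_2d is hypothetical (it is not a TensorFlow function) and handles only the two-dimensional case.

```python
def sparse_to_dense_2d(indices, values, dense_shape):
    """Expand SparseTensor-style arguments into a nested list (2-D only)."""
    rows, cols = dense_shape
    # Start from an all-zero grid of the requested shape.
    dense = [[0] * cols for _ in range(rows)]
    # Term i of values belongs at the position given by term i of indices.
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense


dense = sparse_to_dense_2d(
    indices=[[0, 1], [2, 4]], values=[6, 0.5], dense_shape=[3, 5])
for row in dense:
    print(row)
```

With the same arguments as the SparseTensor example, the printed rows match the 3x5 dense tensor shown there.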

Passing input_fn Data to Your Model

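To feed data to your model for training, simply pass the input function you've created to your train operation as the value of the input_fn parameter, for example: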

classifier.train(input_fn=my_input_fn, steps=2000)

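Note that the input_fn parameter must receive a function object (i.e., input_fn=my_input_fn), not the return value of a function call (input_fn=my_input_fn()). This means that if you try to pass parameters to the input_fn in your train call, as in the following code, it will result in a TypeError: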

classifier.train(input_fn=my_input_fn(training_set), steps=2000)

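However, if you'd like to parameterize your input function, there are other ways to do so. You can use a wrapper function that takes no arguments as your input_fn and have it invoke your input function with the desired parameters. For example: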

def my_input_fn(data_set):
  ...

def my_input_fn_training_set():
  return my_input_fn(training_set)

classifier.train(input_fn=my_input_fn_training_set, steps=2000)

Alternatively, you can use Python's functools.partial function to construct a new function object with all parameter values fixed:

import functools

classifier.train(
    input_fn=functools.partial(my_input_fn, data_set=training_set),
    steps=2000)

A third option is to wrap your input_fn invocation in a lambda and pass it to the input_fn parameter:

classifier.train(input_fn=lambda: my_input_fn(training_set), steps=2000)
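The difference between passing a callable and passing a call result can be tried without TensorFlow. In the sketch below, fake_train is a hypothetical stand-in for Estimator.train whose only job is to call input_fn itself, and the dict-based data set is made up for illustration. The real Estimator does more work, but the failure has the same cause.

```python
import functools


def fake_train(input_fn, steps):
    """Hypothetical stand-in for Estimator.train: it calls input_fn itself."""
    features, labels = input_fn()
    return len(labels)


def my_input_fn(data_set):
    # Hypothetical data layout: a dict holding "x" and "y" lists.
    return {"x": data_set["x"]}, data_set["y"]


training_set = {"x": [1.0, 2.0, 3.0], "y": [0, 1, 0]}


def my_input_fn_training_set():
    return my_input_fn(training_set)


# Three equivalent ways to hand over a zero-argument callable.
n_wrapper = fake_train(input_fn=my_input_fn_training_set, steps=1)
n_partial = fake_train(
    input_fn=functools.partial(my_input_fn, data_set=training_set), steps=1)
n_lambda = fake_train(input_fn=lambda: my_input_fn(training_set), steps=1)

# Passing the call result hands over a tuple, which cannot be called.
try:
    fake_train(input_fn=my_input_fn(training_set), steps=1)
    failed = False
except TypeError:
    failed = True

print(n_wrapper, n_partial, n_lambda, failed)
```

All three callable forms defer the call to my_input_fn until the training routine asks for data, which is why they work where the direct call does not.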

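A big advantage of designing your input pipeline as shown above, so that it accepts the data set as a parameter, is that you can pass the same input_fn to evaluate and predict operations by changing only the data set argument, for example: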

classifier.evaluate(input_fn=lambda: my_input_fn(test_set), steps=2000)

This approach makes the code easier to maintain: there is no need to define a separate input_fn for each type of operation (e.g., input_fn_train, input_fn_test, input_fn_predict).

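Finally, you can use the methods in tf.estimator.inputs to create an input_fn from numpy or pandas data sets. An additional benefit is that you can use more arguments, such as num_epochs and shuffle, to control how the input_fn iterates over the data: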

import pandas as pd

def get_input_fn_from_pandas(data_set, num_epochs=None, shuffle=True):
  return tf.estimator.inputs.pandas_input_fn(
      x=pd.DataFrame(...),
      y=pd.Series(...),
      num_epochs=num_epochs,
      shuffle=shuffle)

import numpy as np

def get_input_fn_from_numpy(data_set, num_epochs=None, shuffle=True):
  return tf.estimator.inputs.numpy_input_fn(
      x={...},
      y=np.array(...),
      num_epochs=num_epochs,
      shuffle=shuffle)

A Neural Network Model for Boston House Values

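In the remainder of this tutorial, you'll write an input function that preprocesses a subset of Boston housing data pulled from the UCI Housing Data Set, and use it to feed data to a neural network regressor for predicting median house values.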

The Boston CSV data sets you'll use to train your neural network contain the following feature data:

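Feature	Description
CRIM	Crime rate per capita
ZN	Fraction of residential land zoned to permit 25,000+ sq ft lots
INDUS	Fraction of land that is non-retail business
NOX	Concentration of nitric oxides in parts per 10 million
RM	Average number of rooms per dwelling
AGE	Fraction of owner-occupied residences built before 1940
DIS	Distance to Boston-area employment centers
TAX	Property tax rate per $10,000
PTRATIO	Student-teacher ratio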

The label your model will predict is MEDV, the median value of owner-occupied residences in thousands of dollars.

Setup

Download the following data sets: boston_train.csv, boston_test.csv, and boston_predict.csv.

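The following sections walk step by step through how to create an input function, feed these data sets into a neural network regressor, train and evaluate the model, and make house value predictions. The complete code is available here.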

Importing the Housing Data

To start, set up your imports (including pandas and tensorflow) and set the logging verbosity to INFO:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import itertools

import pandas as pd
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

Define the column names for the data set in COLUMNS. To distinguish features from the label, also define FEATURES and LABEL. Then read the three CSVs (train, test, and predict) into pandas DataFrames:

COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age",
           "dis", "tax", "ptratio", "medv"]
FEATURES = ["crim", "zn", "indus", "nox", "rm",
            "age", "dis", "tax", "ptratio"]
LABEL = "medv"

training_set = pd.read_csv("boston_train.csv", skipinitialspace=True,
                           skiprows=1, names=COLUMNS)
test_set = pd.read_csv("boston_test.csv", skipinitialspace=True,
                       skiprows=1, names=COLUMNS)
prediction_set = pd.read_csv("boston_predict.csv", skipinitialspace=True,
                             skiprows=1, names=COLUMNS)
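The effect of skiprows=1 together with names=COLUMNS (discard the first line of the file, then label the columns yourself) can be sketched with the standard csv module standing in for pandas. The file contents below are made up and the column list is shortened to three entries purely for illustration.

```python
import csv
import io

# Shortened, hypothetical column list (the tutorial uses ten columns).
COLUMNS = ["crim", "zn", "medv"]

# Made-up file contents: one header row followed by two data rows.
raw = "CRIM, ZN, MEDV\n0.1, 12.5, 24.0\n0.2, 0.0, 21.6\n"

# skipinitialspace=True ignores the space that follows each delimiter.
reader = csv.reader(io.StringIO(raw), skipinitialspace=True)
next(reader)  # Equivalent of skiprows=1: discard the first line.

# Equivalent of names=COLUMNS: attach our own names to each field.
rows = [dict(zip(COLUMNS, map(float, record))) for record in reader]
print(rows)
```

Unlike pandas, the csv module yields strings, so the sketch converts each field to float explicitly.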

Defining FeatureColumns and Creating the Regressor

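Next, create a list of FeatureColumns for the input data, which formally specify the set of features to use for training. Because all features in the housing data set contain continuous values, you can create their FeatureColumns using the tf.feature_column.numeric_column() function: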

feature_cols = [tf.feature_column.numeric_column(k) for k in FEATURES]

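Note: For a more in-depth overview of feature columns, see this introduction, and for an example of how to define FeatureColumns for categorical data, see the Linear Model Tutorial.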

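Now, instantiate a DNNRegressor for the neural network regression model. You'll need to provide two arguments here: hidden_units, a hyperparameter specifying the number of nodes in each hidden layer (here, two hidden layers with 10 nodes each), and feature_columns, containing the list of FeatureColumns you just defined. The code also sets model_dir, the directory where checkpoints are saved: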

regressor = tf.estimator.DNNRegressor(feature_columns=feature_cols,
                                      hidden_units=[10, 10],
                                      model_dir="/tmp/boston_model")
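As a rough sense of scale for hidden_units=[10, 10], the sketch below counts trainable parameters. It assumes plain fully connected layers with one bias per node and a single output value; the helper count_dense_params is hypothetical and not part of TensorFlow.

```python
def count_dense_params(n_inputs, hidden_units, n_outputs=1):
    """Count weights and biases in a stack of fully connected layers."""
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


# 9 input features, two hidden layers of 10 nodes, 1 predicted value:
# (9*10 + 10) + (10*10 + 10) + (10*1 + 1) = 100 + 110 + 11
n_params = count_dense_params(n_inputs=9, hidden_units=[10, 10])
print(n_params)  # 221
```

Under those assumptions the model is small, which is consistent with training for thousands of steps on a modest CSV data set.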

Building the input_fn

To pass input data into the regressor, write a factory method that accepts a pandas DataFrame and returns an input_fn:

def get_input_fn(data_set, num_epochs=None, shuffle=True):
  return tf.estimator.inputs.pandas_input_fn(
      x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
      y=pd.Series(data_set[LABEL].values),
      num_epochs=num_epochs,
      shuffle=shuffle)

Note that the input data is passed in through the data_set argument, which means the function can process any of the DataFrames you've imported: training_set, test_set, and prediction_set.

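Two additional arguments are provided:

num_epochs: Controls the number of epochs to iterate over the data. For training, set this to None, so that the input_fn keeps returning data until the required number of training steps is reached. For evaluation and prediction, set this to 1, so that the input_fn iterates over the data once and then raises OutOfRangeError. That error signals the Estimator to stop evaluation or prediction.
shuffle: Whether to shuffle the data. For evaluation and prediction, set this to False, so that the input_fn iterates over the data sequentially. For training, set this to True.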
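The factory pattern and the meaning of num_epochs and shuffle can be sketched in plain Python. This is an analogy rather than the TensorFlow implementation: the rows, the seed parameter, and the generator-based design are assumptions for illustration, and where the real input_fn raises OutOfRangeError the generator here simply stops.

```python
import itertools
import random


def get_input_fn(rows, num_epochs=None, shuffle=True, seed=0):
    """Return a zero-argument callable that yields rows epoch by epoch."""
    def input_fn():
        rng = random.Random(seed)
        # num_epochs=None repeats forever; otherwise make that many passes.
        epochs = itertools.count() if num_epochs is None else range(num_epochs)
        for _ in epochs:
            order = list(rows)
            if shuffle:
                rng.shuffle(order)
            for row in order:
                yield row
    return input_fn


rows = ["a", "b", "c"]

# Evaluation/prediction style: one ordered pass, then the data runs out.
eval_rows = list(get_input_fn(rows, num_epochs=1, shuffle=False)())

# Training style: endless shuffled data; the caller decides when to stop.
train_rows = list(itertools.islice(get_input_fn(rows)(), 7))

print(eval_rows)
print(len(train_rows))
```

Note that get_input_fn(rows) is a call, but what it returns is itself a callable, so it can be passed as an input_fn argument.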

Training the Regressor

To train the neural network regressor, run train with the training_set passed to the input_fn, as follows:

regressor.train(input_fn=get_input_fn(training_set), steps=5000)

You should see log output similar to the following, which reports the training loss every 100 steps:

INFO:tensorflow:Step 1: loss = 483.179
INFO:tensorflow:Step 101: loss = 81.2072
INFO:tensorflow:Step 201: loss = 72.4354
...
INFO:tensorflow:Step 1801: loss = 33.4454
INFO:tensorflow:Step 1901: loss = 32.3397
INFO:tensorflow:Step 2001: loss = 32.0053
INFO:tensorflow:Step 4801: loss = 27.2791
INFO:tensorflow:Step 4901: loss = 27.2251
INFO:tensorflow:Saving checkpoints for 5000 into /tmp/boston_model/model.ckpt.
INFO:tensorflow:Loss for final step: 27.1674.

Evaluating the Model

Next, see how the trained model performs against the test data set. Run evaluate, this time passing the test_set to the input_fn:

ev = regressor.evaluate(
    input_fn=get_input_fn(test_set, num_epochs=1, shuffle=False))

Retrieve the loss from the ev results and print it:

loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))

You should see results similar to the following:

INFO:tensorflow:Eval steps [0,1) for training step 5000.
INFO:tensorflow:Saving evaluation summary for 5000 step: loss = 11.9221
Loss: 11.922098

Making Predictions

Finally, you can use the model to predict median house values for the prediction_set, which contains feature data but no labels for six examples:

y = regressor.predict(
    input_fn=get_input_fn(prediction_set, num_epochs=1, shuffle=False))
# .predict() returns an iterator of dicts; convert to a list and print
# predictions
predictions = list(p["predictions"] for p in itertools.islice(y, 6))
print("Predictions: {}".format(str(predictions)))

Your results should contain six house-value predictions, for example:

Predictions: [ 33.30348587  17.04452896  22.56370163  34.74345398  14.55953979
  19.58005714]
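The prediction code treats the result of predict as a lazy iterator of dicts and uses itertools.islice to take the first six. That pattern can be tried in isolation with a stand-in generator. Here fake_predict is hypothetical, and its values are arbitrary placeholders rather than model output.

```python
import itertools


def fake_predict(values):
    """Hypothetical stand-in for predict: yields one dict per example."""
    for v in values:
        yield {"predictions": v}


# Seven arbitrary placeholder values; only the first six are consumed.
y = fake_predict([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
predictions = [p["predictions"] for p in itertools.islice(y, 6)]
print("Predictions: {}".format(predictions))

# The iterator is lazy: the seventh item is still waiting to be read.
remaining = next(y)
print(remaining)
```

Because islice stops after six items, nothing beyond the requested examples is computed.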

Additional Resources

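This tutorial focused on creating an input_fn for a neural network regressor. To learn more about using input_fn with other types of models, check out the following resources: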

Large-scale Linear Models with TensorFlow: This
introduction to linear models in TensorFlow provides a high-level overview
of feature columns and techniques for transforming input data.
TensorFlow Linear Model Tutorial: This tutorial covers
creating FeatureColumns and an input_fn for a linear classification
model that predicts income range based on census data.
TensorFlow Wide & Deep Learning Tutorial: Building on
the Linear Model Tutorial, this tutorial covers
FeatureColumn and input_fn creation for a “wide and deep” model that
combines a linear model and a neural network using
DNNLinearCombinedClassifier.

References

Building Input Functions with tf.estimator