pythonML: ML book (Hands-On Machine Learning with Scikit-Learn and TensorFlow ) (CH2 End-to-End Machine Learning Project)

ML book (Hands-On Machine Learning with Scikit-Learn and TensorFlow )

Ch 2

End-to-End Machine Learning Project

Here are the main steps you will go through:

1. Look at the big picture.

2. Get the data.

3. Discover and visualize the data to gain insights.

4. Prepare the data for Machine Learning algorithms.

5. Select a model and train it.

6. Fine-tune your model.

7. Present your solution.

8. Launch, monitor, and maintain your system.

Performance Measure

Root Mean Square Error (RMSE)

Mean Absolute Error (MAE)
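To make the two measures concrete, here is a minimal sketch that computes both with NumPy. The arrays are made-up numbers, not values from the housing data. RMSE squares each error before averaging, so it gives more weight to large errors than MAE does.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_pred - y_true               # [-1, 0, 2, -3]

rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 0 + 4 + 9) / 4) = sqrt(3.5)
mae = np.mean(np.abs(errors))          # (1 + 0 + 2 + 3) / 4 = 1.5

print("RMSE:", rmse)
print("MAE:", mae)
```

On the same errors RMSE is never smaller than MAE; the gap widens when a few errors are much larger than the rest.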

Step 1

Treatment of Data

a) Use .info(), .head(), and .describe() to take a first look at the data.

b) For a column of categories, use .value_counts() to see which categories exist and how many rows fall in each.

c) Plot a histogram of each numerical attribute:

import matplotlib.pyplot as plt

df.hist(bins=50, figsize=(20,15))

plt.show()
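The inspection calls from a) and b) can be tried on a tiny stand-in frame. This is only a sketch: the DataFrame below is made up, and the column names are borrowed from the housing data for readability.

```python
import pandas as pd

# Tiny made-up frame standing in for the housing data
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND"],
})

df.info()              # column dtypes and non-null counts
print(df.head())       # first rows
print(df.describe())   # summary statistics of the numeric columns

# Rows per category, most frequent first
counts = df["ocean_proximity"].value_counts()
print(counts)
```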

Step 2

Graph

Correlation

code

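# numeric_only=True is needed in pandas 2.0+, which no longer skips text columns such as ocean_proximity
corr_matrix = housing.corr(numeric_only=True)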

Now let’s look at how much each attribute correlates with the median house value:

>>> corr_matrix["median_house_value"].sort_values(ascending=False)

median_house_value 1.000000

median_income 0.687170

total_rooms 0.135231

housing_median_age 0.114220

households 0.064702

total_bedrooms 0.047865

population -0.026699

longitude -0.047279

latitude -0.142826

Name: median_house_value, dtype: float64
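To read such a listing, it helps to see the two extremes. The sketch below uses a made-up frame in which one column rises exactly with x and another falls exactly with x, so the coefficients come out at 1 and -1. Keep in mind that this coefficient only measures linear relationships.

```python
import pandas as pd

# Made-up data: "up" is 2 * x, "down" is 6 - x
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "up": [2.0, 4.0, 6.0, 8.0, 10.0],
    "down": [5.0, 4.0, 3.0, 2.0, 1.0],
})

corr_matrix = df.corr()

# Same ranking call as above, applied to the toy column "x"
ranked = corr_matrix["x"].sort_values(ascending=False)
print(ranked)
```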

scatterplot

code

# pandas.tools.plotting was removed; scatter_matrix now lives in pandas.plotting
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Handling missing data

code

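The total_bedrooms attribute has some missing values. You can drop the affected rows (option 1), drop the whole attribute (option 2), or fill the gaps with some value such as the median (option 3). You can accomplish these easily using DataFrame’s dropna(), drop(), and fillna() methods:

housing.dropna(subset=["total_bedrooms"]) # option 1
housing.drop("total_bedrooms", axis=1) # option 2
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median) # option 3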
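One detail worth checking: each of these calls returns a new object and leaves the original frame unchanged unless you assign the result back. A minimal sketch on a made-up four-row frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing total_bedrooms value
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

dropped_rows = housing.dropna(subset=["total_bedrooms"])   # option 1: 3 rows left
dropped_col = housing.drop("total_bedrooms", axis=1)       # option 2: 1 column left
median = housing["total_bedrooms"].median()                # NaN is skipped, so 190.0
filled = housing["total_bedrooms"].fillna(median)          # option 3: new Series

# housing itself still contains the NaN until the result is assigned back
still_missing = int(housing["total_bedrooms"].isna().sum())
housing["total_bedrooms"] = filled
```

If you use option 3, keep the median you computed so the same value can be applied to the test set and to new data later.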

Handling Text and Categorical attributes

code

Scikit-Learn provides a transformer for this task called LabelEncoder:

>>> from sklearn.preprocessing import LabelEncoder

>>> encoder = LabelEncoder()

>>> housing_cat = housing["ocean_proximity"]

>>> housing_cat_encoded = encoder.fit_transform(housing_cat)

>>> housing_cat_encoded

array([1, 1, 4, ..., 1, 0, 3])
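To see which integer stands for which category, look at the encoder's classes_ attribute. The sketch below uses a short made-up list of categories rather than the real column. Note that the integer codes imply an ordering between categories that does not really exist, which is why one-hot encoding is often preferred for this kind of attribute.

```python
from sklearn.preprocessing import LabelEncoder

# Short made-up sample of ocean_proximity values
cats = ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "ISLAND"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(cats)

# classes_ is sorted; the position of a class is its integer code
print(encoder.classes_)
print(encoded)

# Map codes back to the original labels
decoded = encoder.inverse_transform(encoded)
```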

Feature Scaling

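a) Min-max scaling

Rescales values to the range 0 to 1: (x - min) / (max - min)

code >> MinMaxScaler (use the feature_range hyperparameter for a range other than 0 to 1)

b) Standardization

Subtracts the mean and divides by the standard deviation, so the result has zero mean and unit variance.

code >> StandardScaler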
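A minimal sketch of both scalers on a single made-up column, so the results can be checked by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature with values 1..5
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling: (x - 1) / (5 - 1)
minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardization: (x - mean) / std, with mean 3 and std sqrt(2)
standard = StandardScaler().fit_transform(X)

print(minmax.ravel())
print(standard.ravel())
```

As with any transformer, fit the scaler on the training set only and reuse it to transform the test set.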

Transformation Pipelines

code

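from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed from sklearn.preprocessing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# CombinedAttributesAdder is the custom transformer defined in the book
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])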

housing_num_tr = num_pipeline.fit_transform(housing_num)

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like.

When you call the pipeline’s fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.

The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform() method that we could have used instead of calling fit() and then transform()).
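The behaviour described above can be checked with a self-contained sketch. It leaves out the book's custom CombinedAttributesAdder step and runs on a small made-up array, so treat it as an illustration rather than the chapter's actual pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data with one missing value in the second column
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 20.0]])

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform() in one call ...
X_a = pipeline.fit_transform(X)
# ... gives the same result as fit() followed by transform()
X_b = pipeline.fit(X).transform(X)

# Each fitted step stays accessible by the name given in the list
imputer_medians = pipeline.named_steps["imputer"].statistics_
```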

Better Evaluation Using Cross-Validation

One way to evaluate the Decision Tree model would be to use the train_test_split function to split the training set into a smaller training set and a validation set, then train your models against the smaller training set and evaluate them against the validation set. It’s a bit of work, but nothing too difficult and it would work fairly well.
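The manual split described above might look like the following sketch. The data is synthetic (generated with NumPy), and the 80/20 split ratio is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared training set
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(200)

# Hold out 20% of the training set as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree_reg.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree_reg.predict(X_val)))
print("train RMSE:", train_rmse)
print("validation RMSE:", val_rmse)
```

An unconstrained decision tree memorizes its training data, so the training error is essentially zero here; only the validation error says anything about how the model generalizes.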

A great alternative is to use Scikit-Learn’s cross-validation feature. The following code performs K-fold cross-validation: it splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:

code

import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
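The minus sign is there because Scikit-Learn's scoring expects a utility (greater is better), so the scorer returns the negative of the MSE. Here is a self-contained version of the same call on synthetic data (the data and the model settings are assumptions for illustration), followed by the mean and standard deviation of the fold scores:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for housing_prepared / housing_labels
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(200)

scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                         scoring="neg_mean_squared_error", cv=10)

# scores holds negative MSE values, one per fold
rmse_scores = np.sqrt(-scores)

print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```

The standard deviation gives a rough idea of how precise the estimate is, which a single validation set cannot provide.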

<imp> Fine-Tune model

Rather than fiddling with the hyperparameters by hand, you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation. For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:

code

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')

grid_search.fit(housing_prepared, housing_labels)

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of n_estimators and max_features hyperparameter values specified in the first dict (don’t worry about what these hyperparameters mean for now; they will be explained in Chapter 7), then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to False instead of True (which is the default value for this hyperparameter).

All in all, the grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values, and it will train each model five times (since we are using five-fold cross-validation). In other words, there will be 18 × 5 = 90 rounds of training! It may take quite a long time, but when it is done you can get the best combination of parameters like this:

>>> grid_search.best_params_

{'max_features': 6, 'n_estimators': 30}
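The 18 combinations and 90 training rounds mentioned above can be verified without training anything, using Scikit-Learn's ParameterGrid to enumerate the same param_grid:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

grid = ParameterGrid(param_grid)
n_combinations = len(grid)        # 12 from the first dict + 6 from the second
n_fits = n_combinations * 5       # five-fold cross-validation

print(n_combinations, "combinations,", n_fits, "cross-validation fits")
```

With its default refit=True, GridSearchCV also retrains the best combination once on the whole training set after the search, which is what makes best_estimator_ available.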

You can also get the best estimator directly:

>>> grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,

max_features=6, max_leaf_nodes=None, min_samples_leaf=1,

min_samples_split=2, min_weight_fraction_leaf=0.0,

n_estimators=30, n_jobs=1, oob_score=False, random_state=None,

verbose=0, warm_start=False)

And of course the evaluation scores are also available:

>>> cvres = grid_search.cv_results_

>>> for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
...     print(np.sqrt(-mean_score), params)

...

64912.0351358 {'max_features': 2, 'n_estimators': 3}

55535.2786524 {'max_features': 2, 'n_estimators': 10}

52940.2696165 {'max_features': 2, 'n_estimators': 30}

60384.0908354 {'max_features': 4, 'n_estimators': 3}

52709.9199934 {'max_features': 4, 'n_estimators': 10}

50503.5985321 {'max_features': 4, 'n_estimators': 30}

59058.1153485 {'max_features': 6, 'n_estimators': 3}

52172.0292957 {'max_features': 6, 'n_estimators': 10}

49958.9555932 {'max_features': 6, 'n_estimators': 30}

59122.260006 {'max_features': 8, 'n_estimators': 3}

52441.5896087 {'max_features': 8, 'n_estimators': 10}

50041.4899416 {'max_features': 8, 'n_estimators': 30}

62371.1221202 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}

54572.2557534 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}

59634.0533132 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}

52456.0883904 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}

58825.665239 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}

52012.9945396 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
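A long listing like this is easier to read when sorted by score. The sketch below runs a much smaller grid search on synthetic data (the data, the grid, and cv=3 are all assumptions chosen to keep it fast) and prints the results from best to worst RMSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + 0.1 * rng.randn(100)

param_grid = {"n_estimators": [3, 10], "max_features": [2, 4]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                           cv=3, scoring="neg_mean_squared_error")
grid_search.fit(X, y)

cvres = grid_search.cv_results_
rmse = np.sqrt(-cvres["mean_test_score"])

# Indices of the combinations, lowest RMSE first
order = np.argsort(rmse)
for i in order:
    print(rmse[i], cvres["params"][i])

best_params_by_hand = cvres["params"][order[0]]
```

The first row printed corresponds to grid_search.best_params_.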

Analyze the Best Models and Their Errors

You will often gain good insights on the problem by inspecting the best models. For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions:

>>> feature_importances = grid_search.best_estimator_.feature_importances_

>>> feature_importances

array([ 7.14156423e-02, 6.76139189e-02, 4.44260894e-02,

1.66308583e-02, 1.66076861e-02, 1.82402545e-02,

1.63458761e-02, 3.26497987e-01, 6.04365775e-02,

1.13055290e-01, 7.79324766e-02, 1.12166442e-02,

1.53344918e-01, 8.41308969e-05, 2.68483884e-03,

3.46681181e-03])

Let’s display these importance scores next to their corresponding attribute names:

>>> extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]

>>> cat_one_hot_attribs = list(encoder.classes_)

>>> attributes = num_attribs + extra_attribs + cat_one_hot_attribs

>>> sorted(zip(feature_importances, attributes), reverse=True)

[(0.32649798665134971, 'median_income'),

(0.15334491760305854, 'INLAND'),

(0.11305529021187399, 'pop_per_hhold'),

(0.07793247662544775, 'bedrooms_per_room'),

(0.071415642259275158, 'longitude'),

(0.067613918945568688, 'latitude'),

(0.060436577499703222, 'rooms_per_hhold'),

(0.04442608939578685, 'housing_median_age'),

(0.018240254462909437, 'population'),

(0.01663085833886218, 'total_rooms'),

(0.016607686091288865, 'total_bedrooms'),

(0.016345876147580776, 'households'),

(0.011216644219017424, '<1H OCEAN'),

(0.0034668118081117387, 'NEAR OCEAN'),

(0.0026848388432755429, 'NEAR BAY'),

(8.4130896890070617e-05, 'ISLAND')]

With this information, you may want to try dropping some of the less useful features (e.g., apparently only one ocean_proximity category is really useful, so you could try dropping the others).

You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or, on the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).
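One way to try dropping the less useful features is to keep only the columns with the k largest importance scores. The sketch below is an assumption about how that could be done, not the book's code: the five importances are rounded from the listing above, while k and the array X are made up.

```python
import numpy as np

# Five importances rounded from the listing above
feature_importances = np.array([0.3265, 0.1533, 0.1131, 0.0027, 0.00008])
attributes = ["median_income", "INLAND", "pop_per_hhold", "NEAR BAY", "ISLAND"]

k = 3  # hypothetical number of features to keep

# argsort is ascending, so the last k indices belong to the k largest scores;
# sorting them again keeps the columns in their original order
top_k_idx = np.sort(np.argsort(feature_importances)[-k:])
kept = [attributes[i] for i in top_k_idx]

# Made-up data matrix with one column per attribute
X = np.arange(10.0).reshape(2, 5)
X_reduced = X[:, top_k_idx]

print(kept)
print(X_reduced.shape)
```

Whether dropping features actually helps should be checked with cross-validation, just like any other change to the preparation steps.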

Tuesday, September 12, 2017
