ML book: Hands-On Machine Learning with Scikit-Learn and TensorFlow (Ch. 3: Classification)
link: https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb
Confusion matrix
Now you are ready to get the confusion matrix using the confusion_matrix() function.
Just pass it the target classes (y_train_5) and the predicted classes (y_train_pred).
Each row in the resulting matrix represents an actual class, while each column
represents a predicted class:
code:
>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_train_5, y_train_pred)
array([[53272, 1307],
       [ 1077, 4344]])
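Note: y_train_pred above should be out-of-fold predictions rather than predictions on the data the model was fit on; in the notebook it comes from cross_val_predict(). A minimal sketch, assuming an SGDClassifier named sgd_clf and the names X_train / y_train_5 from the notebook:
code
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)  # binary "is it a 5?" detector
# clean out-of-fold predictions on the training set, using 3-fold cross-validation
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)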
Precision and Recall
Scikit-Learn provides several functions to compute classifier metrics, including precision
and recall:
code
>>> from sklearn.metrics import precision_score, recall_score
>>> precision_score(y_train_5, y_train_pred) # == 4344 / (4344 + 1307)
0.76871350203503808
>>> recall_score(y_train_5, y_train_pred) # == 4344 / (4344 + 1077)
0.79136690647482011
F1 score
The F1 score is the harmonic mean of precision and recall:
F1 = 2 × precision × recall / (precision + recall).
Whereas the regular mean treats all values equally, the harmonic mean gives much
more weight to low values. As a result, the classifier only gets a high F1 score if
both recall and precision are high.

code
>>> from sklearn.metrics import f1_score
>>> f1_score(y_train_5, y_train_pred)
0.78468208092485547
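As a sanity check, the same value can be recomputed by hand from the precision and recall obtained above (a small sketch reusing y_train_pred):
code
p = precision_score(y_train_5, y_train_pred)
r = recall_score(y_train_5, y_train_pred)
f1_manual = 2 * p * r / (p + r)  # harmonic mean of precision and recall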
Precision/Recall Tradeoff
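Raising the decision threshold increases precision but lowers recall, and vice versa; you cannot have both arbitrarily high. To study the tradeoff you need decision scores rather than hard 0/1 predictions. A minimal sketch, assuming the same sgd_clf and variable names as above (this also produces the y_scores used for the ROC curve below):
code
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# get decision_function scores instead of hard predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)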
The ROC curve
code
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
Then you can plot the FPR against the TPR using Matplotlib. This code produces the
plot in Figure 3-6:
import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal = what a purely random classifier would give
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr)
plt.show()
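One way to compare classifiers is to measure the area under the ROC curve (AUC): a perfect classifier has an AUC of 1, while a purely random one has an AUC of 0.5. A short sketch reusing y_scores from above:
code
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_train_5, y_scores)  # area under the ROC curve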
Multi-class Classification
Some algorithms (such as Random Forest or naive Bayes classifiers) can handle multiple
classes directly. Others (such as Support Vector Machine or linear classifiers) are
strictly binary, so they need a strategy such as one-versus-all (OvA) or
one-versus-one (OvO).
For example, one way to create a system that can classify the digit images into 10
classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a
1-detector, a 2-detector, and so on). Then when you want to classify an image, you get
the decision score from each classifier for that image and you select the class whose
classifier outputs the highest score. This is called the one-versus-all (OvA) strategy
(also called one-versus-the-rest).
Another strategy is to train a binary classifier for every pair of digits: one to distinguish
0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on.
This is called the one-versus-one (OvO) strategy. If there are N classes, you need to
train N × (N – 1) / 2 classifiers. For the MNIST problem, this means training 45
binary classifiers! When you want to classify an image, you have to run the image
through all 45 classifiers and see which class wins the most duels. The main advantage
of OvO is that each classifier only needs to be trained on the part of the training
set for the two classes that it must distinguish.
Some algorithms (such as Support Vector Machine classifiers) scale poorly with the
size of the training set, so for these algorithms OvO is preferred since it is faster to
train many classifiers on small training sets than training few classifiers on large
training sets. For most binary classification algorithms, however, OvA is preferred.
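Scikit-Learn detects when you use a binary classifier for a multiclass task and automatically runs OvA (except for SVM classifiers, for which it uses OvO). To force a particular strategy you can wrap the estimator yourself. A minimal sketch, assuming X_train / y_train hold the full 10-class MNIST data (names as in the notebook):
code
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

# OvO: trains N * (N - 1) / 2 = 45 binary classifiers for the 10 digits
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)

# OvA: trains one binary detector per class (10 in total)
ova_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ova_clf.fit(X_train, y_train)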