Python Training Week 4

Preface

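This is the fourth session of the Python course in the company's "Apprentice Program" training camp. The first three sessions covered web frameworks (Django and a from-scratch approach), the selenium automated testing tool, and web crawlers (Scrapy and a from-scratch approach).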

Starting with this session, we move on to the main subject of the course: machine learning.

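This session covers four groups of Python libraries:

  1. Numpy: data processing
  2. Pandas: reading files
  3. Matplotlib: plotting
  4. Scipy (advanced scientific computing) and scikit-learn (machine learning)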

Numpy

import numpy as np

array_one = np.array([1, 2, 3, 4, 5, 6])
print(array_one[0])

array_two = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(array_two)
print(array_two.shape)
print(array_two[0][0])  # take a single element
print(array_two[1, :])  # take a row; same as print(array_two[1])
print(array_two[:, 2])  # take a column

Output

1
[[1 2 3]
 [4 5 6]
 [7 8 9]]
(3, 3)
1
[4 5 6]
[3 6 9]

Taking array_two as a reference: a block of data (a matrix) is a dataset, each row is a sample (like a database table, one record per row), and each column is a feature.
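To make the dataset/sample/feature vocabulary concrete, here is a minimal sketch that reads the number of samples and features off `array_two.shape`. The names `n_samples`, `n_features`, `first_sample` and `last_feature` are illustrative and do not come from the course material.

```python
import numpy as np

# Same 3x3 matrix as above: 3 samples (rows), 3 features (columns).
array_two = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

n_samples, n_features = array_two.shape
first_sample = array_two[0, :]   # one row = all features of one sample
last_feature = array_two[:, -1]  # one column = one feature across all samples

print(n_samples, n_features)  # 3 3
print(first_sample)           # [1 2 3]
print(last_feature)           # [3 6 9]
```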

reshape and resize

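reshape: returns the data in a new shape, typically a lower-dimensional one to reduce computational complexity. The original matrix keeps its shape.
For example, a 3*3 matrix becomes a single 1*9 row.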

array_two = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(array_two.reshape(1, -1))  # the more common form; same as array_two.reshape(1, 9)
print(array_two)

Output:

[[1 2 3 4 5 6 7 8 9]]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

A 1 x 9 row like this is the "one sample, multiple features" layout, which is the common case.
Similarly, `array_two.reshape(-1, 1)` gives the "one feature, multiple samples" layout. This is less common, since data rarely has only one feature.
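The two layouts can be compared side by side. The sketch below only prints shapes; `row` and `column` are names chosen here for illustration.

```python
import numpy as np

array_two = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row = array_two.reshape(1, -1)     # one sample, nine features
column = array_two.reshape(-1, 1)  # nine samples, one feature

print(row.shape)        # (1, 9)
print(column.shape)     # (9, 1)
print(array_two.shape)  # (3, 3): reshape left the original as it was
```

The `-1` asks NumPy to work out that dimension from the total number of elements, which is why it is usually preferred over writing `9` by hand.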


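**resize**: similar to reshape, but it changes the shape of the original matrix in place.
```python
array_two = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

array_two.resize(1, 9)  # modifies array_two in place; returns None
print(array_two)
```

Output:

[[1 2 3 4 5 6 7 8 9]]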

Pandas

The examples use the Iris dataset: https://archive.ics.uci.edu/dataset/53/iris

import pandas as pd

dataset = pd.read_excel("data.xlsx", sheet_name='employee')
print(dataset)

# iris.data.csv has no header row, so by default the first record is used as column names
dataset = pd.read_csv("iris.data.csv")
print(dataset)

Output:

   id code   name
0   1  E01   Jack
1   2  E02   Jane
2   3  E03   Luke
3   4  E04   Isha
4   5  E05  Jerry
     5.1  3.5  1.4  0.2     Iris-setosa
0    4.9  3.0  1.4  0.2     Iris-setosa
1    4.7  3.2  1.3  0.2     Iris-setosa
2    4.6  3.1  1.5  0.2     Iris-setosa
3    5.0  3.6  1.4  0.2     Iris-setosa
4    5.4  3.9  1.7  0.4     Iris-setosa
..   ...  ...  ...  ...             ...
144  6.7  3.0  5.2  2.3  Iris-virginica
145  6.3  2.5  5.0  1.9  Iris-virginica
146  6.5  3.0  5.2  2.0  Iris-virginica
147  6.2  3.4  5.4  2.3  Iris-virginica
148  5.9  3.0  5.1  1.8  Iris-virginica

[149 rows x 5 columns]
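In the output above, the first Iris record (5.1, 3.5, 1.4, 0.2, Iris-setosa) appears as the column header and only 149 rows remain, because the file has no header line. One possible fix is sketched below, assuming a comma-separated file. The three inline rows stand in for `iris.data.csv` so the snippet runs on its own, and the column names are hypothetical, picked for illustration.

```python
import io

import pandas as pd

# Three inline rows stand in for iris.data.csv, which has no header line.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "6.7,3.0,5.2,2.3,Iris-virginica\n"
)

# Hypothetical column names, chosen here for illustration.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None tells pandas that the first line is data, not column names.
dataset = pd.read_csv(raw, header=None, names=columns)
print(dataset.shape)  # (3, 5): no row was consumed as a header
print(dataset.head())
```

With the real file, passing the path `"iris.data.csv"` in place of `raw` should keep all 150 records.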

Matplotlib

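Matplotlib displays data graphically, with more chart styles than Excel and a more professional look.
More chart styles are available in the official gallery at https://matplotlib.org/stable/gallery/index; download an example and substitute your own data.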

import matplotlib.pyplot as plt

plt.plot([2,4,6,8], [1,2,3,4])
plt.xlabel('x')
plt.ylabel('y')
plt.show()

value = [0.1, 0.3, 0.5, 0.1]
label = ['A', 'B', 'C', 'D']
plt.pie(x=value, labels=label, autopct='%.2f%%')
plt.show()

Output:
image.png

image.png
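As an illustration of "substituting your own data" into another chart style, the sketch below draws the pie-chart values as a bar chart. It is not taken from the course material: the `Agg` backend and the in-memory `buffer` are assumptions made so that the snippet runs without a display and without writing files.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
values = [0.1, 0.3, 0.5, 0.1]

fig, ax = plt.subplots()
bars = ax.bar(labels, values)  # one bar per label
ax.set_xlabel("category")
ax.set_ylabel("share")
ax.set_title("Pie-chart data drawn as bars")

# Render to an in-memory PNG; pass a file name instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

On a desktop, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` instead of `savefig` would open the chart in a window, as in the examples above.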

Scipy (advanced scientific computing) and scikit-learn (machine learning)

Installing scikit-learn:

(1) pip uninstall numpy

pip install numpy==1.19.0 -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

(2) Copy Installer.zip to the following directory and unzip it: C:\Program Files (x86)\Microsoft Visual Studio

(3) pip install scipy==1.5.1 -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

(4) pip install scikit-learn==0.23.1 -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
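After the installation steps, a quick check like the following can confirm that the three packages import correctly. This is only a sketch; the versions it prints depend on what was actually installed.

```python
import numpy
import scipy
import sklearn

# Collect the installed versions so they can be compared with the pinned ones.
versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```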

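By convention, uppercase X denotes the data and lowercase y denotes the label/target. A suffix indicates whether the data is for training or testing, as in X_train and y_test.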
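The X/y convention can be seen on the handwritten digits dataset bundled with scikit-learn. This short sketch only inspects shapes.

```python
from sklearn import datasets

# X holds the samples (one row each); y holds one label per sample.
X, y = datasets.load_digits(return_X_y=True)

print(X.shape)  # (1797, 64): 1797 samples, 64 pixel features each
print(y.shape)  # (1797,): one label per sample
print(sorted(set(y.tolist())))  # the ten digit classes, 0 through 9
```

X is two-dimensional (samples by features), while y is a one-dimensional array with one entry per sample.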

Prediction model

image.png
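The model learns from the training data. Given test data, it then predicts the result by matching the most similar label among what it has learned.
The example below learns handwritten digits and then predicts which digit the given test_data represents, using the KNN algorithm.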

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import numpy as np


class DigitIdentify(object):

    def digit_identify(self):
        # X: 8x8 digit images flattened to 64 features; y: the digit labels
        X, y = datasets.load_digits(return_X_y=True)
        knn = KNeighborsClassifier()
        knn.fit(X, y)
        # one handwritten sample: 64 pixel intensities, listed row by row
        test_data = [0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 10, 0, 15, 4, 0,
                     0, 0, 3, 16, 12, 14, 2, 0,
                     0, 0, 4, 16, 16, 2, 0, 0,
                     0, 3, 16, 8, 10, 13, 2, 0,
                     0, 1, 15, 1, 3, 16, 8, 0,
                     0, 0, 11, 16, 15, 11, 1, 0]
        # reshape(1, -1): one sample, multiple features
        y_pred = knn.predict(np.array(test_data).reshape(1, -1))
        print(y_pred)


if __name__ == '__main__':
    di = DigitIdentify()
    di.digit_identify()
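To show what "matching the most similar label" means, here is a minimal from-scratch sketch of the k-nearest-neighbours idea in NumPy. It is not how `KNeighborsClassifier` is implemented internally. The function `knn_predict` and the toy data are made up for illustration, and `k=3` is an arbitrary choice (scikit-learn's default is `n_neighbors=5`).

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to every sample
    nearest = np.argsort(distances)[:k]                  # indices of the k smallest distances
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two clusters labelled 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.3, 0.3])))  # 0
print(knn_predict(X_train, y_train, np.array([4.9, 5.0])))  # 1
```

"Training" here amounts to storing the data; all the work happens at prediction time, when the distance to every stored sample is computed.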

Evaluating the model

image.png

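The dataset is split into two parts: the larger part is used for training and the smaller part for testing. Finally, the true labels of the test data are compared with the predicted labels, and the fraction that match is the accuracy score.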

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class DigitIdentify(object):

    def digit_identify(self):
        X, y = datasets.load_digits(return_X_y=True)
        # hold out 25% of the samples for testing; the split is random on each run
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
        knn = KNeighborsClassifier()
        knn.fit(X_train, y_train)

        y_pred = knn.predict(np.array(X_test))
        # compare the true labels of the test set with the predicted labels
        print(accuracy_score(y_test, y_pred))


if __name__ == '__main__':
    di = DigitIdentify()
    di.digit_identify()
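The accuracy score is the fraction of test samples whose predicted label equals the true label. The sketch below checks this by hand against `accuracy_score`, using made-up labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 6 of the 8 predictions are correct.
y_test = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

manual = np.mean(y_test == y_pred)  # fraction of positions where the labels agree
print(manual)                          # 0.75
print(accuracy_score(y_test, y_pred))  # 0.75
```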

Application example

Apply the models above to the diabetes data in 'pima-indians-diabetes.csv'.

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np


class DiabetesPredict(object):

    def diabetes_predict(self):
        # prediction model: train on all rows, then predict one new sample
        data_set = pd.read_csv('pima-indians-diabetes.csv')
        feature_columns = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
        X = data_set[feature_columns]
        y = data_set.label
        dtc = DecisionTreeClassifier()
        dtc.fit(X.values, y.values)  # .values passes plain arrays without column headers
        test_data = [1, 97, 66, 15, 140, 23.2, 0.487, 32]
        y_pred = dtc.predict(np.array(test_data).reshape(1, -1))
        print(y_pred)

    def diabetes_identify(self):
        # evaluation model: train on 75% of the rows, score on the remaining 25%
        data_set = pd.read_csv('pima-indians-diabetes.csv')
        feature_columns = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
        X = data_set[feature_columns]
        y = data_set.label
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

        dtc = DecisionTreeClassifier()
        # dtc = DecisionTreeClassifier(criterion="entropy", max_depth=2)
        dtc.fit(X_train.values, y_train.values)  # .values passes plain arrays without column headers

        y_pred = dtc.predict(np.array(X_test))
        print(accuracy_score(y_test, y_pred))


if __name__ == "__main__":
    dp = DiabetesPredict()
    dp.diabetes_predict()
    dp.diabetes_identify()

Output

[0]
0.56

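This means the prediction model's result is 0, i.e. no diabetes, and the evaluation gives an accuracy of 0.56 (the data is split randomly on each run, so the result varies).
To get a higher accuracy, the algorithm's parameters can be adjusted: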

dtc = DecisionTreeClassifier(criterion="entropy",max_depth=2)  
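Because the split is random, scores from different runs are hard to compare while tuning. One way to make the comparison repeatable is to fix `random_state` and try several `max_depth` values in a loop, as sketched below. The diabetes CSV is not bundled with scikit-learn, so the built-in breast cancer dataset is used here purely as a stand-in, and the candidate depths are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the diabetes CSV is not bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Fixing random_state makes the split, and therefore the score, repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for depth in (1, 2, 3, 5, None):  # None lets the tree grow without a depth limit
    dtc = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    dtc.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_test, dtc.predict(X_test))

for depth, score in scores.items():
    print(depth, round(score, 3))
```

Picking parameters from a single test split can overfit to that split; cross-validation (for example `sklearn.model_selection.cross_val_score`) is a more robust alternative.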

Decision tree

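Tuning parameters by trial and error is inefficient. To understand the decision tree, a graphical tool can render every split as an image.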

  1. Install pydotplus
    pip install pydotplus
  2. Download graphviz and set up its environment variable; version 2.38 is installed here
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_graphviz
    import numpy as np

    import pydotplus
    from io import StringIO  # standard library; behaves like six.StringIO on Python 3


    class DiabetesPredict(object):

        def diabetes_predict(self):
            data_set = pd.read_csv('pima-indians-diabetes.csv')
            feature_columns = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
            X = data_set[feature_columns]
            y = data_set.label
            dtc = DecisionTreeClassifier()
            dtc.fit(X.values, y.values)  # .values passes plain arrays without column headers
            test_data = [1, 97, 66, 15, 140, 23.2, 0.487, 32]
            y_pred = dtc.predict(np.array(test_data).reshape(1, -1))
            print(y_pred)

        def diabetes_identify(self):
            data_set = pd.read_csv('pima-indians-diabetes.csv')
            feature_columns = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
            X = data_set[feature_columns]
            y = data_set.label
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

            dtc = DecisionTreeClassifier(criterion="entropy", max_depth=2)
            dtc.fit(X_train.values, y_train.values)  # .values passes plain arrays without column headers

            y_pred = dtc.predict(np.array(X_test))
            print(accuracy_score(y_test, y_pred))

            # Generate the graph: export the tree in DOT format, then render it to PDF
            dot_data = StringIO()
            export_graphviz(dtc, out_file=dot_data,
                            feature_names=feature_columns,
                            class_names=['0', '1'],
                            filled=True,
                            rounded=True,
                            special_characters=True)
            graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
            graph.write_pdf('decision_tree.pdf')


    if __name__ == "__main__":
        dp = DiabetesPredict()
        dp.diabetes_predict()
        dp.diabetes_identify()

After running, the local file decision_tree.pdf looks like this:
image.png
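If installing graphviz is not an option, scikit-learn's `export_text` prints the same tree structure as plain text. The sketch below uses the bundled Iris dataset as a stand-in for the diabetes CSV, so the feature names in its output differ from the PDF above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (assumption): Iris is bundled with scikit-learn.
iris = load_iris()
dtc = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
dtc.fit(iris.data, iris.target)

# One line per node: the split condition, or the predicted class at a leaf.
tree_text = export_text(dtc, feature_names=list(iris.feature_names))
print(tree_text)
```

Each level of indentation in the printed text corresponds to one level of depth in the tree.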

Change it to:

dtc = DecisionTreeClassifier(criterion="entropy")  

After running, the local file decision_tree.pdf looks like this:
image.png
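The difference between the two PDFs comes from `max_depth`: without a limit, the tree keeps splitting, so it becomes much deeper and has many more leaves. The sketch below measures this with `get_depth()` and `get_n_leaves()`, again on the bundled breast cancer dataset as a stand-in for the diabetes CSV. The names `shallow` and `full` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the diabetes CSV is not bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

shallow = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0).fit(X, y)
full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Depth and number of leaves of each fitted tree.
print("max_depth=2:", shallow.get_depth(), shallow.get_n_leaves())
print("no limit:   ", full.get_depth(), full.get_n_leaves())
```

A deeper tree fits the training data more closely, which does not necessarily mean better accuracy on unseen data.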