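December 21: Baidu PaddlePaddle AI Fast Track, ElasticCTR special session (Beijing)

This event was about deep learning, and Baidu's AI work is genuinely impressive. The hands-on session used a k8s cluster, Docker, HDFS and related technologies, so I went.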

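I. Lucky enough to pass the registration review

II. The venue

III. There were small gifts too

A power bank

IV. The offices have very distinctive names

V. Slides from the event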

https://www.lanzous.com/b03ybasob
Password: duzy

5.1 Pre-reading materials

5.3 ElasticCTR: the PaddlePaddle recommender system solution

The Elastic open-source software stack

ElasticCTR architecture


VI. Lab environment

ElasticCTR architecture

6.1 Servers

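On site, everyone was given a server account; connect with Xshell and you can start right away. Every 10 people share one k8s cluster. Each cluster has 10 namespaces, which isolate each person's environment from the others, and 20 nodes, so it is quite well provisioned. There is also an EXTERNAL-IP, and each person reaches the outside through their own port.

If you build the environment yourself, it needs to meet the following; this is the node list of the cluster we used: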

[user8@instance-w0oy7npe-3 elastic-ctr-cli]$ kubectl get nodes -o wide
NAME            STATUS   ROLES    AGE   VERSION    INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
192.168.50.13   Ready    <none>   45h   v1.13.10   192.168.50.13   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.14   Ready    <none>   45h   v1.13.10   192.168.50.14   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.17   Ready    <none>   45h   v1.13.10   192.168.50.17   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.2    Ready    <none>   45h   v1.13.10   192.168.50.2    <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.49   Ready    <none>   45h   v1.13.10   192.168.50.49   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.50   Ready    <none>   45h   v1.13.10   192.168.50.50   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.51   Ready    <none>   45h   v1.13.10   192.168.50.51   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.52   Ready    <none>   45h   v1.13.10   192.168.50.52   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.53   Ready    <none>   45h   v1.13.10   192.168.50.53   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.54   Ready    <none>   45h   v1.13.10   192.168.50.54   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.55   Ready    <none>   45h   v1.13.10   192.168.50.55   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.56   Ready    <none>   45h   v1.13.10   192.168.50.56   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.57   Ready    <none>   45h   v1.13.10   192.168.50.57   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.58   Ready    <none>   45h   v1.13.10   192.168.50.58   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.59   Ready    <none>   45h   v1.13.10   192.168.50.59   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.60   Ready    <none>   45h   v1.13.10   192.168.50.60   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.63   Ready    <none>   45h   v1.13.10   192.168.50.63   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.65   Ready    <none>   45h   v1.13.10   192.168.50.65   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.66   Ready    <none>   45h   v1.13.10   192.168.50.66   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.72   Ready    <none>   45h   v1.13.10   192.168.50.72   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
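
Eyeballing 20 rows for a node that is not Ready gets tedious, so here is a minimal sketch of a check in Python 3. It only parses text in the format shown above; the function names `parse_nodes` and `ready_nodes` are my own and are not part of kubectl or ElasticCTR. The sample embeds two rows copied from the listing, and on a live cluster the text would come from running `kubectl get nodes -o wide` yourself.

```python
# Two rows copied from the node listing above, plus the header line.
SAMPLE = """\
NAME            STATUS   ROLES    AGE   VERSION    INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
192.168.50.13   Ready    <none>   45h   v1.13.10   192.168.50.13   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
192.168.50.14   Ready    <none>   45h   v1.13.10   192.168.50.14   <none>        CentOS Linux 7 (Core)   3.10.0-1062.4.1.el7.x86_64   docker://18.9.2
"""


def parse_nodes(text):
    """Return (name, status, version) tuples from `kubectl get nodes` output."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    # Look the columns up by header name instead of hard-coding positions.
    # NAME, STATUS and VERSION come before OS-IMAGE, whose values contain
    # spaces, so whitespace splitting is safe for these three columns.
    name_i = header.index("NAME")
    status_i = header.index("STATUS")
    version_i = header.index("VERSION")
    rows = []
    for ln in lines[1:]:
        cols = ln.split()
        rows.append((cols[name_i], cols[status_i], cols[version_i]))
    return rows


def ready_nodes(rows):
    """Names of the nodes whose STATUS column is exactly 'Ready'."""
    return [name for name, status, _ in rows if status == "Ready"]


if __name__ == "__main__":
    rows = parse_nodes(SAMPLE)
    print(len(ready_nodes(rows)), "of", len(rows), "nodes Ready")
```

A node that is cordoned shows a status such as `Ready,SchedulingDisabled`, which this sketch deliberately does not count as Ready.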

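6.2 Procedure

The steps themselves are fairly simple, but understanding what happens underneath is considerably harder!

1. Log in to the BCC node.

2. Confirm that the environment is usable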

  • kubectl get nodes

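3. Clone the code. On site there was no need to clone, and the code in the on-site environment differs slightly from the GitHub version: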

  • git clone https://github.com/PaddlePaddle/ElasticCTR.git

4. Configure the parameters

  • sh elastic-control.sh -r

    [user8@instance-w0oy7npe-3 elastic-ctr-cli]$ sh elastic-control.sh -r
    CPU=4 MEM=4 CUBE=2 TRAINER=2 PSERVER=2  CUBE=2 DATA_PATH=/app SLOT_CONF=./slot.conf VERBOSE=0  HDFS_ADDRESS=hdfs://192.168.50.158:9000 HDFS_UGI=root,i START_DATE_HR=20191221/00 END_DATE_HR=20191221/09  SPARSE_DIM=1000001 DATASET_PATH=/cluster3/train_data 
    cube.yaml written to ./cube.yaml
    transfer.yaml written to ./transfer.yaml
    File server yaml written to fileserver.yaml
    Main yaml written to fleet-ctr.yaml
    start file-server pod
    No resources found.
    deployment.apps/file-server created
    service/file-server created
    searching file-server external IP, wait a moment.
    ...
    curl --upload-file ./slot.conf 180.76.185.64:9000
    File ./slot.conf uploaded to 180.76.185.64:9000/slot.conf
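
The settings line printed by `-r` is a flat run of `KEY=VALUE` pairs, which is easy to misread. As a rough sketch in Python 3 (not part of ElasticCTR; `parse_config` and `training_hours` are names I made up), the line can be split into a dictionary. The hour count assumes that `START_DATE_HR` and `END_DATE_HR` are an inclusive range of hourly partitions in `YYYYMMDD/HH` form; the script output does not say so explicitly.

```python
from datetime import datetime

# The settings line exactly as printed by `sh elastic-control.sh -r` above.
LINE = (
    "CPU=4 MEM=4 CUBE=2 TRAINER=2 PSERVER=2  CUBE=2 DATA_PATH=/app "
    "SLOT_CONF=./slot.conf VERBOSE=0  HDFS_ADDRESS=hdfs://192.168.50.158:9000 "
    "HDFS_UGI=root,i START_DATE_HR=20191221/00 END_DATE_HR=20191221/09  "
    "SPARSE_DIM=1000001 DATASET_PATH=/cluster3/train_data "
)


def parse_config(line):
    """Split a whitespace-separated KEY=VALUE line into a dict of strings."""
    config = {}
    for token in line.split():
        # partition() splits on the first '=' only, so values such as
        # hdfs://host:port are kept whole.
        key, sep, value = token.partition("=")
        if sep:
            config[key] = value  # a repeated key (CUBE appears twice) keeps the last value
    return config


def training_hours(config):
    """Hourly partitions from START_DATE_HR to END_DATE_HR (assumed inclusive)."""
    fmt = "%Y%m%d/%H"
    start = datetime.strptime(config["START_DATE_HR"], fmt)
    end = datetime.strptime(config["END_DATE_HR"], fmt)
    return int((end - start).total_seconds() // 3600) + 1


if __name__ == "__main__":
    cfg = parse_config(LINE)
    print(cfg["TRAINER"], "trainers,", cfg["PSERVER"], "pservers,",
          training_hours(cfg), "hourly partitions")
```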
    

5. Run training

  • [user8@instance-w0oy7npe-3 elastic-ctr-cli]$ sh elastic-control.sh -a
    Waiting for pod...
    pod/cube-0 created
    service/cube-0 created
    pod/cube-1 created
    service/cube-1 created
    pod/cube-transfer created
    deployment.apps/paddleserving created
    service/paddleserving created
    No resources found.
    job.batch.volcano.sh/fleet-ctr-demo created
    waiting for mlflow...
    mlflow ready!
    

6. Monitor the status

  • [user8@instance-w0oy7npe-3 elastic-ctr-cli]$ sh elastic-control.sh -l
    Trainer 0 Log:
    2019-12-21 08:09:39,509 - __main__ - INFO - Slot:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26']
    2019-12-21 08:09:46,659 - __main__ - INFO - startup program done.
    2019-12-21 08:09:51,757 - __main__ - INFO - Training: 20191221/00
    2019-12-21 08:10:56,612 - __main__ - INFO - 0.6585546060804714
    2019-12-21 08:11:58,872 - __main__ - INFO - 0.6765955112507775
    2019-12-21 08:13:00,832 - __main__ - INFO - 0.6852686167342176
    2019-12-21 08:14:02,717 - __main__ - INFO - 0.6908312775141056
    2019-12-21 08:14:52,288 - __main__ - INFO - save inference program: ./saved_models/20191221/00_model/
    2019-12-21 08:15:06,033 - __main__ - INFO - push raw model to HDFS: 20191221/00
    2019-12-21 08:15:23,697 - __main__ - INFO - push converted model to HDFS: //cluster3/train_data/user8/output/20191221/00
    2019-12-21 08:15:23,731 - __main__ - INFO - 20191221/00 Training Done.
    2019-12-21 08:15:28,601 - __main__ - INFO - Training: 20191221/01
    2019-12-21 08:15:52,317 - __main__ - INFO - 0.6961387571218486
    2019-12-21 08:16:54,322 - __main__ - INFO - 0.6999073056549932
    2019-12-21 08:17:56,776 - __main__ - INFO - 0.7031652044967064
    
    File Server Log:
    2019-12-21 08:15:32,330 - __main__ - INFO - new model downloaded. b'{"id": "1576916114", "key": "1576916114", "input": "/output/ctr_cube/20191221/base"}\n'
    
    Cube Transfer Log:
    [all reload ok]inst:[{test_dict base 0 0 0 0 0  /cube 172.16.162.85 8027 172.16.162.85 8001 70  1576916162 1576916162 1576916169 1576916179 1576916180 0 0 1576916161} {test_dict base 0 0 0 0 1  /cube 172.16.69.62 8027 172.16.69.62 8001 70  1576916162 1576916162 1576916169 1576916179 1576916180 0 0 1576916161}]
    
    Padddle Serving Log:
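
The trainer log mixes progress messages with bare numbers that rise over time. The log does not label them; they look like a training metric, plausibly AUC, but that is my guess. A small Python 3 sketch, assuming the line format shown above, groups those numbers under the `Training: <partition>` line that precedes them. `metrics_by_partition` is a hypothetical helper, not something shipped with ElasticCTR.

```python
import re

# Lines copied from the Trainer 0 log above.
SAMPLE_LOG = """\
2019-12-21 08:09:46,659 - __main__ - INFO - startup program done.
2019-12-21 08:09:51,757 - __main__ - INFO - Training: 20191221/00
2019-12-21 08:10:56,612 - __main__ - INFO - 0.6585546060804714
2019-12-21 08:11:58,872 - __main__ - INFO - 0.6765955112507775
2019-12-21 08:13:00,832 - __main__ - INFO - 0.6852686167342176
2019-12-21 08:14:02,717 - __main__ - INFO - 0.6908312775141056
2019-12-21 08:14:52,288 - __main__ - INFO - save inference program: ./saved_models/20191221/00_model/
2019-12-21 08:15:23,731 - __main__ - INFO - 20191221/00 Training Done.
2019-12-21 08:15:28,601 - __main__ - INFO - Training: 20191221/01
2019-12-21 08:15:52,317 - __main__ - INFO - 0.6961387571218486
"""

# "<date> <time> - <logger> - <level> - <message>"
LINE_RE = re.compile(r"^(\S+ \S+) - (\S+) - (\w+) - (.*)$")


def metrics_by_partition(log_text):
    """Map each 'Training: <partition>' to the numeric messages logged after it."""
    result = {}
    current = None
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        message = match.group(4)
        if message.startswith("Training: "):
            current = message[len("Training: "):]
            result[current] = []
        elif current is not None:
            try:
                result[current].append(float(message))
            except ValueError:
                pass  # non-numeric messages (model saves, HDFS pushes) are skipped
    return result


if __name__ == "__main__":
    for partition, values in metrics_by_partition(SAMPLE_LOG).items():
        print(partition, values)
```

Grouping by partition makes it easy to see whether the metric is still improving from one hourly partition to the next.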
    

7. Prediction service

  • [user8@instance-w0oy7npe-3 elastic-ctr-cli]$ sh elastic-control.sh -c
    ...
    Try ELASTIC CTR:
    1. cd client
    2. (python) python bin/elastic_ctr.py 180.76.157.253 8010 conf/slot.conf data/ctr_prediction/data.txt
    3. (C++ native) bin/elastic_ctr_demo --test_file data/ctr_prediction/data.txt
    
  • [user8@instance-w0oy7npe-3 elastic-ctr-cli]$ cd client/
    [user8@instance-w0oy7npe-3 client]$ python bin/elastic_ctr.py 180.76.157.253 8010 conf/slot.conf data/ctr_prediction/data.txt
    auc =  0.692497772238
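
The client only prints the final number, and I did not look at how `bin/elastic_ctr.py` computes it. As background, here is a generic rank-based AUC in plain Python 3: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. This is a textbook sketch, not the ElasticCTR implementation, and the labels and scores in the example are made up.

```python
def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U); labels are 0 or 1, scores are floats."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    # Assign 1-based ranks; examples with equal scores share their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    # Subtract the smallest possible rank sum for the positives, then normalise
    # by the number of positive/negative pairs.
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    # Made-up data: one negative (0.4) outranks one positive (0.35).
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 means the ranking is no better than random and 1.0 means every positive is ranked above every negative, so the 0.69 reported above sits in between.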
    

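8. Access from the web

VII. Baidu's internal Wi-Fi

The Wi-Fi setup is quite well done: after connecting, a user has to email the administrator to request internet access, and then receives a verification code to log in and get online.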