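December 21: Baidu PaddlePaddle "AI Fast Track" ElasticCTR session (Beijing)
This event was about deep learning, and Baidu's AI work is genuinely impressive. The hands-on session used a k8s cluster, Docker, HDFS, and related technologies, so I went.
I. My registration was approved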


II. The venue


III. There were small gifts too
A power bank.


IV. The offices have quirky names

V. Slides from the session
5.1 Pre-reading material

5.3 ElasticCTR: the PaddlePaddle recommender system solution
The Elastic open-source software stack

ElasticCTR architecture

VI. Lab environment
6.1 ElasticCTR architecture

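6.1 Servers
On site, everyone is given a server account; connect with Xshell and you can start. Every 10 people share one k8s cluster. Each cluster has 10 namespaces, and the namespaces isolate each person's environment. Each cluster has 20 nodes, and the node specs are quite high. There is also an EXTERNAL-IP, and each person uses their own port for external access.
If you build the environment yourself, it needs to meet the following requirements: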

[user8@instance-w0oy7npe-3 elastic-ctr-cli]$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
192.168.50.13 Ready <none> 45h v1.13.10 192.168.50.13 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.14 Ready <none> 45h v1.13.10 192.168.50.14 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.17 Ready <none> 45h v1.13.10 192.168.50.17 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.2 Ready <none> 45h v1.13.10 192.168.50.2 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.49 Ready <none> 45h v1.13.10 192.168.50.49 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.50 Ready <none> 45h v1.13.10 192.168.50.50 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.51 Ready <none> 45h v1.13.10 192.168.50.51 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.52 Ready <none> 45h v1.13.10 192.168.50.52 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.53 Ready <none> 45h v1.13.10 192.168.50.53 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.54 Ready <none> 45h v1.13.10 192.168.50.54 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.55 Ready <none> 45h v1.13.10 192.168.50.55 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.56 Ready <none> 45h v1.13.10 192.168.50.56 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.57 Ready <none> 45h v1.13.10 192.168.50.57 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.58 Ready <none> 45h v1.13.10 192.168.50.58 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.59 Ready <none> 45h v1.13.10 192.168.50.59 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.60 Ready <none> 45h v1.13.10 192.168.50.60 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.63 Ready <none> 45h v1.13.10 192.168.50.63 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.65 Ready <none> 45h v1.13.10 192.168.50.65 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.66 Ready <none> 45h v1.13.10 192.168.50.66 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.72 Ready <none> 45h v1.13.10 192.168.50.72 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
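
The listing above shows 20 nodes, all in the Ready state. As a minimal sketch of checking that automatically, the snippet below counts Ready nodes in the text output of `kubectl get nodes`. The helper name `count_ready_nodes` is my own, and the `sample` string reuses only the header and three rows from the listing; in practice the text would come from running the command.

```python
from typing import Tuple


def count_ready_nodes(kubectl_output: str) -> Tuple[int, int]:
    """Return (ready, total) parsed from `kubectl get nodes` text output."""
    ready = 0
    total = 0
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines, malformed lines, and the header row (first field NAME).
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        total += 1
        # STATUS is the second column; it may be a comma-separated list
        # such as "Ready,SchedulingDisabled".
        if "Ready" in fields[1].split(","):
            ready += 1
    return ready, total


# Header and three rows copied from the listing above.
sample = """\
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
192.168.50.13 Ready <none> 45h v1.13.10 192.168.50.13 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.14 Ready <none> 45h v1.13.10 192.168.50.14 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
192.168.50.17 Ready <none> 45h v1.13.10 192.168.50.17 <none> CentOS Linux 7 (Core) 3.10.0-1062.4.1.el7.x86_64 docker://18.9.2
"""

ready, total = count_ready_nodes(sample)
print(f"{ready}/{total} nodes Ready")
```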
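6.2 Steps
The whole procedure is fairly simple to run, but understanding what it is doing is a good deal harder!
1. Log in to the BCC node
2. Confirm that the environment is usable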
- kubectl get nodes
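3. Clone the code. On site there is no need to clone; note that the code in the on-site environment differs slightly from the GitHub version: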
- git clone https://github.com/PaddlePaddle/ElasticCTR.git
4. Configure parameters
- sh elastic-control.sh -r
[user8@instance-w0oy7npe-3 elastic-ctr-cli]$ sh elastic-control.sh -r
CPU=4
MEM=4
CUBE=2
TRAINER=2
PSERVER=2
CUBE=2
DATA_PATH=/app
SLOT_CONF=./slot.conf
VERBOSE=0
HDFS_ADDRESS=hdfs://192.168.50.158:9000
HDFS_UGI=root,i
START_DATE_HR=20191221/00
END_DATE_HR=20191221/09
SPARSE_DIM=1000001
DATASET_PATH=/cluster3/train_data
cube.yaml written to ./cube.yaml
transfer.yaml written to ./transfer.yaml
File server yaml written to fileserver.yaml
Main yaml written to fleet-ctr.yaml
start file-server pod
No resources found.
deployment.apps/file-server created
service/file-server created
searching file-server external IP, wait a moment.
...
curl --upload-file ./slot.conf 180.76.185.64:9000
File ./slot.conf uploaded to 180.76.185.64:9000/slot.conf
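
The `-r` run echoes its settings as `KEY=VALUE` pairs before writing the YAML files. A small sketch of collecting those pairs into a dictionary, for example to record which settings a run used, might look like this. The function name `parse_settings` is hypothetical, and the sample text is just the settings portion of the output above; the sketch does not interpret what each setting means.

```python
import re


def parse_settings(output: str) -> dict:
    """Collect UPPER_CASE=value pairs from the printed output.

    Works whether the pairs are separated by spaces or by newlines.
    If a key is printed twice (CUBE is, above), the last value wins.
    """
    return dict(re.findall(r"\b([A-Z][A-Z_]*)=(\S+)", output))


# Settings portion of the output shown above.
sample = (
    "CPU=4 MEM=4 CUBE=2 TRAINER=2 PSERVER=2 CUBE=2 DATA_PATH=/app "
    "SLOT_CONF=./slot.conf VERBOSE=0 HDFS_ADDRESS=hdfs://192.168.50.158:9000 "
    "HDFS_UGI=root,i START_DATE_HR=20191221/00 END_DATE_HR=20191221/09 "
    "SPARSE_DIM=1000001 DATASET_PATH=/cluster3/train_data"
)

settings = parse_settings(sample)
for key in ("TRAINER", "PSERVER", "START_DATE_HR", "END_DATE_HR"):
    print(key, "=", settings[key])
```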
5. Run training
- sh elastic-control.sh -a
[user8@instance-w0oy7npe-3 elastic-ctr-cli]$ sh elastic-control.sh -a
Waiting for pod...
pod/cube-0 created
service/cube-0 created
pod/cube-1 created
service/cube-1 created
pod/cube-transfer created
deployment.apps/paddleserving created
service/paddleserving created
No resources found.
job.batch.volcano.sh/fleet-ctr-demo created
waiting for mlflow...
mlflow ready!
6. Monitor status
- sh elastic-control.sh -l
[user8@instance-w0oy7npe-3 elastic-ctr-cli]$ sh elastic-control.sh -l
Trainer 0 Log:
2019-12-21 08:09:39,509 - __main__ - INFO - Slot:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26']
2019-12-21 08:09:46,659 - __main__ - INFO - startup program done.
2019-12-21 08:09:51,757 - __main__ - INFO - Training: 20191221/00
2019-12-21 08:10:56,612 - __main__ - INFO - 0.6585546060804714
2019-12-21 08:11:58,872 - __main__ - INFO - 0.6765955112507775
2019-12-21 08:13:00,832 - __main__ - INFO - 0.6852686167342176
2019-12-21 08:14:02,717 - __main__ - INFO - 0.6908312775141056
2019-12-21 08:14:52,288 - __main__ - INFO - save inference program: ./saved_models/20191221/00_model/
2019-12-21 08:15:06,033 - __main__ - INFO - push raw model to HDFS: 20191221/00
2019-12-21 08:15:23,697 - __main__ - INFO - push converted model to HDFS: //cluster3/train_data/user8/output/20191221/00
2019-12-21 08:15:23,731 - __main__ - INFO - 20191221/00 Training Done.
2019-12-21 08:15:28,601 - __main__ - INFO - Training: 20191221/01
2019-12-21 08:15:52,317 - __main__ - INFO - 0.6961387571218486
2019-12-21 08:16:54,322 - __main__ - INFO - 0.6999073056549932
2019-12-21 08:17:56,776 - __main__ - INFO - 0.7031652044967064
File Server Log:
2019-12-21 08:15:32,330 - __main__ - INFO - new model downloaded.
b'{"id": "1576916114", "key": "1576916114", "input": "/output/ctr_cube/20191221/base"}\n'
Cube Transfer Log:
[all reload ok]inst:[{test_dict base 0 0 0 0 0 /cube 172.16.162.85 8027 172.16.162.85 8001 70 1576916162 1576916162 1576916169 1576916179 1576916180 0 0 1576916161} {test_dict base 0 0 0 0 1 /cube 172.16.69.62 8027 172.16.69.62 8001 70 1576916162 1576916162 1576916169 1576916179 1576916180 0 0 1576916161}]
Padddle Serving Log:
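
In the trainer log, each `Training: <date/hour>` line is followed by a series of bare floating-point values. The log does not label them; they look like a training metric that rises over time, but that reading is my assumption. A minimal sketch of grouping those values by training window, so the trend is easier to see, follows. The function name `metrics_by_window` is my own, and the sample lines are copied from the log above.

```python
import re
from collections import defaultdict

# Matches "<date> <time> - __main__ - INFO - <message>".
LOG_LINE = re.compile(r"^\S+ \S+ - __main__ - INFO - (.*)$")
NUMBER = re.compile(r"\d+\.\d+")


def metrics_by_window(log_text: str) -> dict:
    """Group the bare numeric log messages under the preceding 'Training:' window."""
    windows = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        message = match.group(1)
        if message.startswith("Training: "):
            current = message[len("Training: "):]
        elif current is not None and NUMBER.fullmatch(message):
            windows[current].append(float(message))
    return dict(windows)


# Lines copied from the trainer log above.
sample_log = """\
2019-12-21 08:09:51,757 - __main__ - INFO - Training: 20191221/00
2019-12-21 08:10:56,612 - __main__ - INFO - 0.6585546060804714
2019-12-21 08:11:58,872 - __main__ - INFO - 0.6765955112507775
2019-12-21 08:13:00,832 - __main__ - INFO - 0.6852686167342176
2019-12-21 08:14:02,717 - __main__ - INFO - 0.6908312775141056
2019-12-21 08:15:23,731 - __main__ - INFO - 20191221/00 Training Done.
2019-12-21 08:15:28,601 - __main__ - INFO - Training: 20191221/01
2019-12-21 08:15:52,317 - __main__ - INFO - 0.6961387571218486
"""

for window, values in metrics_by_window(sample_log).items():
    print(window, "first:", values[0], "last:", values[-1])
```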
7. Prediction service
- sh elastic-control.sh -c
[user8@instance-w0oy7npe-3 elastic-ctr-cli]$ sh elastic-control.sh -c
...
Try ELASTIC CTR:
1. cd client
2. (python) python bin/elastic_ctr.py 180.76.157.253 8010 conf/slot.conf data/ctr_prediction/data.txt
3. (C++ native) bin/elastic_ctr_demo --test_file data/ctr_prediction/data.txt
- python bin/elastic_ctr.py 180.76.157.253 8010 conf/slot.conf data/ctr_prediction/data.txt
[user8@instance-w0oy7npe-3 elastic-ctr-cli]$ cd client/
[user8@instance-w0oy7npe-3 client]$ python bin/elastic_ctr.py 180.76.157.253 8010 conf/slot.conf data/ctr_prediction/data.txt
auc = 0.692497772238
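
The client prints a single `auc` value. As a reminder of what that number measures, here is a minimal sketch of ROC AUC computed directly from its definition: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. This is not the code in `bin/elastic_ctr.py`, which I have not reproduced; the function name `roc_auc` and the toy labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly.

    Quadratic in the number of examples, so only suitable for small inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy data (made up): 1 = clicked, 0 = not clicked.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
print(roc_auc(toy_labels, toy_scores))  # prints 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the 0.69 reported above sits between the two.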
8. Web access

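VII. Baidu's internal Wi-Fi
The Wi-Fi setup is quite well done: after connecting, a user has to email the administrator to request internet access, and then receives a verification code to log in and get online.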
