Kubeflow private deployment and some development pitfalls

Kubeflow installation steps

Install version 1.4.1 following the procedure in kubeflow/manifests.

It is best to follow the documented order and install the components one by one, and to pay attention to the kustomize version and the Kubernetes version.
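
For reference, a minimal sketch of the component-by-component install (the directory names follow the kubeflow/manifests 1.4 README and are assumptions here; verify them against your own checkout before running):

KUSTOMIZE=./kustomize/kustomize_3.2.0_darwin_amd64
$KUSTOMIZE build manifests/common/cert-manager/cert-manager/base | kubectl apply -f -
$KUSTOMIZE build manifests/common/cert-manager/kubeflow-issuer/base | kubectl apply -f -
$KUSTOMIZE build manifests/common/istio-1-9/istio-crds/base | kubectl apply -f -
$KUSTOMIZE build manifests/common/istio-1-9/istio-namespace/base | kubectl apply -f -
$KUSTOMIZE build manifests/common/istio-1-9/istio-install/base | kubectl apply -f -
# ...continue with dex, oidc-authservice, knative and the kubeflow apps in README order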

Private deployment

First, extract as many of the image addresses used inside kubeflow as possible:

images_file=images.txt


./kustomize/kustomize_3.2.0_darwin_amd64 build manifests/example | grep image: | sed -e 's/[ ]*image:[ ]*//' -e 's/"//g' | sort -u > $images_file

Pull the images and push them to the private registry:

images_file=images.txt
push256=images256.txt
push_image=imagespush.txt
repo=xxxxx:8080/library
version=kf-manifests-1.4.1

cat $images_file | grep @sha256 | sed -e 's/[ ]*@sha256:.*//' -e 's/"//g' > $push256
cat $images_file | grep -v @sha256 > $push_image

for pull_image in $(cat $images_file)
do
  echo "开始拉取$pull_image..."
  docker pull $pull_image
done

for image in $(cat $push_image)
do
  echo "push $image..."
  docker tag $image $repo/$image
  docker push $repo/$image
done

for image in $(cat $push256)
do
  image_id=$(docker images|grep $image|awk '{print $3}'|sort -u)
  echo "push $image...$image_id"
  docker tag $image_id  $repo/$image:$version
  docker push $repo/$image:$version
done

istio/proxyv2:1.9.6

istio/proxy has to be pushed separately; it cannot simply be batch-pushed together with everything else.

Private registry images

The required images are listed in images.txt (generated by get_images.sh), plus a few extra ones that are missing from it.

push_images.sh pulls them and pushes them to the private registry.

Generate manifests.yaml and modify it:

./kustomize/kustomize_3.2.0_darwin_amd64 build manifests/example > manifests.yaml

Deploy using the modified manifests.yaml:

cat manifests.yaml | kubectl apply -f -

Problems encountered

  • gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef cannot be pulled after being pushed to the private registry


The repository address in the manifest needs to be changed.

  • For images pinned by sha256, the digest changes after they are pushed to the private registry, so they can no longer be pulled by digest


  • The image paths are nested too deeply to pull without specifying the project name, even after they were pushed into the library project


Solution

The conversion from images.txt entries to push tags is handled by push_img.sh; the image names in manifests.yaml are currently replaced by hand (a sed sketch follows the table).

images.txt → push tag → manifests.yaml

docker.io/istio/proxyv2:1.9.6
  → xxx:xx/library/docker.io/istio/proxyv2:1.9.6
  → library/docker.io/istio/proxyv2:1.9.6

gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter@sha256:0e25aa1613a3a1779b3f7b7f863e651e5f37520a7f6808ccad2164cc2b6a9b12
  → xxx:xx/library/gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter:kf-manifests-1.4.1
  → library/gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter:kf-manifests-1.4.1
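
A hedged sketch of doing the manual replacement with sed over manifests.yaml instead (the registry prefix is the same placeholder as above; digest-pinned images have to be mapped explicitly to their :kf-manifests-1.4.1 tag):

repo=xxxxx:8080/library   # placeholder private registry

# plain-tag images: just prefix the private registry
sed -i.bak "s#image: docker.io/istio/proxyv2:1.9.6#image: $repo/docker.io/istio/proxyv2:1.9.6#g" manifests.yaml

# digest-pinned images: swap the @sha256 reference for the re-tagged version
sed -i.bak "s#image: gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter@sha256:[0-9a-f]*#image: $repo/gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter:kf-manifests-1.4.1#g" manifests.yaml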

kfserving

The ConfigMap inferenceservice-config needs to be modified.

The image must be written with the full path, e.g. xxx.xxx.xx.203:8080/library/pytorch/torchserve-kfs, otherwise it cannot be pulled.
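
A minimal sketch of the edit (the exact JSON layout of the predictors field differs between kfserving versions, so inspect it first):

# dump the current predictor config to find the image fields
kubectl -n kubeflow get configmap inferenceservice-config -o jsonpath='{.data.predictors}'

# then edit, e.g. change  "image": "pytorch/torchserve-kfs"
#                     to  "image": "xxx.xxx.xx.203:8080/library/pytorch/torchserve-kfs"
kubectl -n kubeflow edit configmap inferenceservice-config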

Problems during installation

service “istio-galley” not found

    2020-05-22T14:11:35.511996Z	error	installer	failed to create "EnvoyFilter/istio-system/metadata-exchange-1.4": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.586916Z	error	installer	failed to create "EnvoyFilter/istio-system/metadata-exchange-1.5": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.626408Z	error	installer	failed to create "EnvoyFilter/istio-system/metadata-exchange-1.6": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.664555Z	error	installer	failed to create "EnvoyFilter/istio-system/stats-filter-1.4": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.712217Z	error	installer	failed to create "EnvoyFilter/istio-system/stats-filter-1.5": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.750036Z	error	installer	failed to create "EnvoyFilter/istio-system/stats-filter-1.6": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found

This is most likely left over from an incomplete uninstall of a previous istio:

kubectl delete validatingwebhookconfigurations istio-galley

Then apply again.

HTTP settings


These have already been changed in manifests.yaml:

  • centraldashboard
  • jupyter-web-app-deployment
  • katib-ui
  • kfserving-models-web-app
  • ml-pipeline-ui
  • tensorboards-web-app-deployment
  • volumes-web-app-deployment

Add the following env entry to each of the Deployments listed above (a kubectl one-liner sketch follows the snippet):

        - name: APP_SECURE_COOKIES
          value: "false"

HTTPS references:

  • https://github.com/kubeflow/kubeflow/issues/5803
  • https://github.com/kubeflow/manifests/pull/1819/files

TokenRequest problem when installing on k8s 1.19

Versions: k8s 1.19.6 installed with RKE, kubeflow 1.4

Installing the istio components fails with:

💡 MountVolume.SetUp failed for volume “istio-token” : failed to fetch token: the API server does not have TokenRequest endpoints enabled

Running the command below confirms that the cluster does not have the TokenRequest API enabled:

kubectl get --raw /api/v1 | jq '.resources[] | select(.name | index("serviceaccounts/token"))'

Solution

Set the jwtPolicy variable:

values.global.jwtPolicy=first-party-jwt

Reference: https://stackoverflow.com/questions/64641078/how-to-install-kubeflow-on-existing-on-prem-kubernetes-cluster
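
A hedged sketch of applying that over the generated manifests with sed (the assumption is that the value appears as JWT_POLICY on istiod and as jwtPolicy in the sidecar-injector values; grep first to confirm):

# confirm where the value appears
grep -n "third-party-jwt" manifests.yaml

# flip it to first-party-jwt, then re-apply
sed -i.bak 's/third-party-jwt/first-party-jwt/g' manifests.yaml
cat manifests.yaml | kubectl apply -f -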

In 1.6 the docker.io/istio/proxyv2:1.14.1 address cannot be changed

The only workaround is to pull the images from the private registry and then re-tag them manually:

docker tag library/docker.io/istio/proxyv2:1.14.1 docker.io/istio/proxyv2:1.14.1

docker tag library/gcr.io/ml-pipeline/frontend:2.0.0-alpha.5 gcr.io/ml-pipeline/frontend:2.0.0-alpha.5

docker tag library/gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.5 gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.5


In 1.6, a deployed InferenceService cannot pull images from the private registry

Reference issue

# Inspect the InferenceService details
kubectl describe InferenceService/iris-test
 
 Message:               Revision "iris-test-predictor-default-00001" failed with message: Unable to fetch image "harbor.xxx.cn/library/tensorflow/serving:2.6.2": failed to resolve image to digest: Get "https://harbor.xxx.cn/v2/": x509: certificate is not valid for any names, but wanted to match harbor.xxx.cn.

The ClusterServingRuntime CR cannot omit the registry domain (falling back to the docker.io default), otherwise it fails with:

Message:               Revision "iris-test-predictor-default-00001" failed with message: Unable to fetch image "library/tensorflow/serving:2.6.2": failed to resolve image to digest: Get "https://index.docker.io/v2/": context deadline exceeded.
    Reason:                RevisionFailed

It has to be modified like this:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-tensorflow-serving
spec:
 ....
    image: harbor.xxx.cn/library/tensorflow/serving:2.6.2 # full registry path required
    name: kserve-container

For the x509 problem, edit the knative deployment config:

kubectl -n knative-serving edit configmap config-deployment

Skip tag-to-digest resolution for the private registry (that resolution is what triggers the certificate check):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving

data:
  # List of repositories for which tag to digest resolving should be skipped
  registriesSkippingTagResolving: harbor.xxx.cn # change this

After the change, redeploy the InferenceService.
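
A minimal sketch of the redeploy, assuming the InferenceService is called iris-test and its spec is saved in iris-test.yaml (the skip-resolving setting only takes effect for newly created revisions):

# run in the namespace where the InferenceService lives
kubectl delete inferenceservice iris-test
kubectl apply -f iris-test.yaml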

Multi-user issues

  • https://v1-4-branch.kubeflow.org/docs/components/multi-tenancy/getting-started/

Adding a user

add_profile.yaml

apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: kubeflow-admin-example-com # the user's namespace
spec:
  owner:
    kind: User
    name: admin@example.com # user name
kubectl apply -f add_profile.yaml
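
A quick sketch of checking that the profile controller actually created the namespace:

kubectl get profiles.kubeflow.org
kubectl get ns kubeflow-admin-example-com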

Add the new user to dex's ConfigMap:

    staticPasswords:
    - email: user@example.com
      hash: $2y$12$4K/VkmDd1q1Orb3xAt82zu8gk7Ad6ReFR4LCP9UeYE90NLiN9Df72
      # https://github.com/dexidp/dex/pull/1601/commits
      # FIXME: Use hashFromEnv instead
      username: user
      userID: "15841185641784"
    - email: admin@example.com
      hash: $2y$12$uTQwnB7afyNhob.6ETtbdOSTtwvFgqdo9/6yyDi3RZZuk6OqWgsR6
      username: admin

Generating the hash requires pip install passlib bcrypt:

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

Type the password and press Enter to get the required hash.

Finally, restart the dex pod.
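
A minimal sketch, assuming the default layout where dex lives in the auth namespace with a ConfigMap and a Deployment both named dex (verify the names in your install):

kubectl -n auth edit configmap dex            # add the staticPasswords entry here
kubectl -n auth rollout restart deployment dex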

Development notes

Custom Jupyter images

Build custom Jupyter images, since everyone needs different package versions (a Dockerfile sketch follows).
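
A minimal sketch that layers extra packages on top of a stock notebook image; the base image reference, tag, registry and package pins are all assumptions here, so substitute whatever default notebook image your install already pulls:

cat > Dockerfile <<'EOF'
# base image is an assumption: kubeflow 1.4-era notebook images were published
# under public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/; match the tag to your install
FROM public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-scipy:v1.4
USER root
RUN pip install --no-cache-dir pandas==1.3.5 scikit-learn==1.0.2
# the stock images run as the jovyan user
USER jovyan
EOF

docker build -t harbor.xxx.cn/library/jupyter-custom:v1 .
docker push harbor.xxx.cn/library/jupyter-custom:v1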

pipeline sdk

Do not use the v2 version for now; kubeflow throws errors like:

 main.go:50] Failed to execute component: unable to get pipeline with PipelineName "pipeline/v2add" PipelineRunID "7e2bdeeb-aa6f-4109-a508-63a1be22267c": Failed GetContextByTypeAndName(type="system.Pipeline", name="pipeline/v2add")

sdk sample

kfserving

Python 3.9 is required, otherwise installing the ray[serve]==1.5.0 dependency fails.
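
A minimal sketch of setting up such an environment (conda is just one option; the environment name is arbitrary):

conda create -n kfserving-sdk python=3.9 -y
conda activate kfserving-sdk
pip install kfserving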

fairing

pip install kubeflow-fairing

It simply cannot be installed: there are dependency conflicts, and pip keeps reinstalling different versions of the same package. In short, it is uninstallable as published:

$ pip-compile -r test_req.in
Could not find a version that matches kubernetes==10.0.1,>=10.0.1,>=12.0.0 (from kubeflow-fairing==1.0.2->-r test_req.in (line 1))
Tried: 1.0.0, 1.0.0, 1.0.1, 1.0.1, 1.0.2, 1.0.2, 2.0.0, 2.0.0, 3.0.0, 3.0.0, 4.0.0, 4.0.0, 5.0.0, 5.0.0, 6.0.0, 6.0.0, 6.1.0, 6.1.0, 7.0.0, 7.0.0, 7.0.1, 7.0.1, 8.0.0, 8.0.0, 8.0.1, 8.0.1, 8.0.2, 8.0.2, 9.0.0, 9.0.0, 9.0.1, 9.0.1, 10.0.0, 10.0.0, 10.0.1, 10.0.1, 10.1.0, 10.1.0, 11.0.0, 11.0.0, 12.0.0, 12.0.0, 12.0.1, 12.0.1, 17.17.0, 17.17.0, 18.20.0, 18.20.0, 19.15.0, 19.15.0
Skipped pre-versions: 0.0.0a2, 0.0.0a2, 0.0.0a5, 0.0.0a5, 1.0.0a2, 1.0.0a2, 1.0.0a3, 1.0.0a4, 1.0.0a4, 1.0.0a5, 1.0.0a5, 1.0.0b1, 1.0.0b1, 1.0.0b2, 1.0.0b2, 1.0.0b3, 1.0.0b3, 2.0.0a1, 2.0.0a1, 2.0.0b1, 2.0.0b1, 3.0.0a1, 3.0.0a1, 3.0.0b1, 3.0.0b1, 4.0.0a1, 4.0.0a1, 4.0.0b1, 4.0.0b1, 5.0.0b1, 5.0.0b1, 6.0.0b1, 6.0.0b1, 7.0.0a1, 7.0.0a1, 7.0.0b1, 7.0.0b1, 8.0.0a1, 8.0.0a1, 8.0.0b1, 8.0.0b1, 9.0.0a1, 9.0.0a1, 9.0.0b1, 9.0.0b1, 10.0.0a1, 10.0.0a1, 11.0.0a1, 11.0.0a1, 11.0.0b1, 11.0.0b1, 11.0.0b2, 11.0.0b2, 12.0.0a1, 12.0.0a1, 12.0.0b1, 12.0.0b1, 17.14.0a1, 17.14.0a1, 17.17.0b1, 17.17.0b1, 18.17.0a1, 18.17.0a1, 18.20.0b1, 18.20.0b1, 19.15.0a1, 19.15.0a1, 19.15.0b1, 19.15.0b1, 20.11.0a1, 20.11.0a1, 20.12.0b1, 20.12.0b1
There are incompatible versions in the resolved dependencies:
  kubernetes>=10.0.1 (from kubeflow-tfjob==0.1.3->kubeflow-fairing==1.0.2->-r test_req.in (line 1))
  kubernetes>=12.0.0 (from kfserving==0.6.1->kubeflow-fairing==1.0.2->-r test_req.in (line 1))
  kubernetes>=10.0.1 (from kubeflow-pytorchjob==0.1.3->kubeflow-fairing==1.0.2->-r test_req.in (line 1))
  kubernetes==10.0.1 (from kubeflow-fairing==1.0.2->-r test_req.in (line 1))

kubeflow-fairing has several dependencies; kfserving is fairly new, but the other packages are quite old (still maintained, but never released), and fairing itself is not compatible with the latest kfserving.

The main issue is that the Kubernetes Python client switched from swagger_types to openapi_types starting with 11.x, and the other packages never caught up.

So I maintain my own fairing fork on the adapt-latest-kfserving-and-training-oprator branch and build a wheel from it:

#!/bin/bash

rm -rf build
rm -rf dist
rm -rf kubeflow_fairing.egg-info

python setup.py bdist_wheel

The resulting .whl file can be found in the project's dist directory.


Then install it with pip install xxx.whl.

Determining the model (InferenceService) status

  • Ready

    "conditions": [
               ......
                {
                    "lastTransitionTime": "2022-01-27T01:49:37Z",
                    "status": "True",
                    "type": "Ready"
                }
      ]
    

    Check that Ready is True; upstream determines this via inferenceServiceReadiness.

  • Failed

    "conditions": [
              
      
                {
                    "lastTransitionTime": "2022-01-27T01:54:52Z",
                    "message": "Revision \"test-sklearn-predictor-default-00001\" failed with message: 0/3 nodes are available: 1 Insufficient memory, 3 Insufficient cpu..",
                    "reason": "RevisionFailed",
                    "severity": "Info",
                    "status": "False",
                    "type": "PredictorRouteReady"
                },
                {
                    "lastTransitionTime": "2022-01-27T01:54:52Z",
                    "message": "Configuration \"test-sklearn-predictor-default\" does not have any ready Revision.",
                    "reason": "RevisionMissing",
                    "status": "False",
                    "type": "Ready"
                }
            ]
    

    Iterate over the conditions: if reason == "RevisionFailed", the service is in a failed state, and its message is the error to report.

  • Deploying

    Every other case is treated as still deploying (see the sketch after this list).
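
A minimal sketch of that decision logic using kubectl and jq (name and namespace are placeholders):

ns=kubeflow-user-example-com   # placeholder namespace
name=iris-test                 # placeholder InferenceService name
status=$(kubectl -n "$ns" get inferenceservice "$name" -o json)

if echo "$status" | jq -e '.status.conditions[]? | select(.type=="Ready" and .status=="True")' >/dev/null; then
  echo "ready"
elif echo "$status" | jq -e '.status.conditions[]? | select(.reason=="RevisionFailed")' >/dev/null; then
  echo "failed: $(echo "$status" | jq -r '[.status.conditions[]? | select(.reason=="RevisionFailed") | .message][0]')"
else
  echo "deploying"
fi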

To see where kserve assigns these conditions, look mainly at the InferenceServiceStatus.PropagateStatus method:

func (ss *InferenceServiceStatus) PropagateStatus(component ComponentType, serviceStatus *knservingv1.ServiceStatus) {
	....
	// propagate overall service condition
	serviceCondition := serviceStatus.GetCondition(knservingv1.ServiceConditionReady)
	if serviceCondition != nil && serviceCondition.Status == v1.ConditionTrue {
		if serviceStatus.Address != nil {
			statusSpec.Address = serviceStatus.Address
		}
		if serviceStatus.URL != nil {
			statusSpec.URL = serviceStatus.URL
		}
	}
	// propagate ready condition for each component
	readyCondition := conditionsMap[component]
	ss.SetCondition(readyCondition, serviceCondition)
	// propagate route condition for each component
	routeCondition := serviceStatus.GetCondition("ConfigurationsReady")
	routeConditionType := routeConditionsMap[component]
	ss.SetCondition(routeConditionType, routeCondition)
	// propagate configuration condition for each component
	configurationCondition := serviceStatus.GetCondition("RoutesReady")
	configurationConditionType := configurationConditionsMap[component]
	// propagate traffic status for each component
	statusSpec.Traffic = serviceStatus.Traffic
	ss.SetCondition(configurationConditionType, configurationCondition)

	ss.Components[component] = statusSpec
}

Tracing the function flow all the way through and simplifying it (honestly I can't quite work out why the last two values are assigned the way they are; the variable names look odd, almost as if they were swapped):

The mapping differs per component; serviceStatus comes from the knative client:

kserve condition              knative serviceStatus condition
PredictorReady                Ready
PredictorRouteReady           ConfigurationsReady
PredictorConfigurationReady   RoutesReady

At the top level, the reconciler first calls knative to create the components above, and only then creates the ingress-related settings (IngressReady):

reconcilers := []components.Component{
  components.NewPredictor(r.Client, r.Scheme, isvcConfig),
}
if isvc.Spec.Transformer != nil {
  reconcilers = append(reconcilers, components.NewTransformer(r.Client, r.Scheme, isvcConfig))
}
if isvc.Spec.Explainer != nil {
  reconcilers = append(reconcilers, components.NewExplainer(r.Client, r.Scheme, isvcConfig))
}
for _, reconciler := range reconcilers {
  if err := reconciler.Reconcile(isvc); err != nil {
    r.Log.Error(err, "Failed to reconcile", "reconciler", reflect.ValueOf(reconciler), "Name", isvc.Name)
    r.Recorder.Eventf(isvc, v1.EventTypeWarning, "InternalError", err.Error())
    return reconcile.Result{}, errors.Wrapf(err, "fails to reconcile component")
  }
}
//Reconcile ingress
ingressConfig, err := v1beta1api.NewIngressConfig(r.Client)
if err != nil {
  return reconcile.Result{}, errors.Wrapf(err, "fails to create IngressConfig")
}
reconciler := ingress.NewIngressReconciler(r.Client, r.Scheme, ingressConfig)
r.Log.Info("Reconciling ingress for inference service", "isvc", isvc.Name)
if err := reconciler.Reconcile(isvc); err != nil {
  return reconcile.Result{}, errors.Wrapf(err, "fails to reconcile ingress")
}

// Reconcile modelConfig
configMapReconciler := modelconfig.NewModelConfigReconciler(r.Client, r.Scheme)
if err := configMapReconciler.Reconcile(isvc); err != nil {
  return reconcile.Result{}, err
}

if err = r.updateStatus(isvc); err != nil {
  r.Recorder.Eventf(isvc, v1.EventTypeWarning, "InternalError", err.Error())
  return reconcile.Result{}, err
}