Kubeflow private deployment and some development pitfalls

Kubeflow installation steps

Install version 1.4.1 following the procedure in kubeflow/manifests.

It is best to follow the documented order and install the components one by one, and to pay attention to the kustomize version and the Kubernetes version.
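
For reference, a minimal sketch of the component-by-component install (the directory names follow the kubeflow/manifests 1.4 README and are assumptions here; verify them against your own checkout before running):

KUSTOMIZE=./kustomize/kustomize_3.2.0_darwin_amd64
$KUSTOMIZE build manifests/common/cert-manager/cert-manager/base | kubectl apply -f -
$KUSTOMIZE build manifests/common/cert-manager/kubeflow-issuer/base | kubectl apply -f -
$KUSTOMIZE build manifests/common/istio-1-9/istio-crds/base | kubectl apply -f -
$KUSTOMIZE build manifests/common/istio-1-9/istio-namespace/base | kubectl apply -f -
$KUSTOMIZE build manifests/common/istio-1-9/istio-install/base | kubectl apply -f -
# ...continue with dex, oidc-authservice, knative and the kubeflow apps in README order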

Private deployment

First, extract as many of the image addresses used inside kubeflow as possible:

images_file=images.txt


./kustomize/kustomize_3.2.0_darwin_amd64 build manifests/example | grep image: | sed -e 's/[ ]*image:[ ]*//' -e 's/"//g' | sort -u > $images_file

Pull the images and push them to the private registry:

images_file=images.txt
push256=images256.txt
push_image=imagespush.txt
repo=xxxxx:8080/library
version=kf-manifests-1.4.1

cat $images_file | grep @sha256 | sed -e 's/[ ]*@sha256:.*//' -e 's/"//g' > $push256
cat $images_file | grep -v @sha256 > $push_image

for pull_image in $(cat $images_file)
do
  echo "开始拉取$pull_image..."
  docker pull $pull_image
done

for image in $(cat $push_image)
do
  echo "push $image..."
  docker tag $image $repo/$image
  docker push $repo/$image
done

for image in $(cat $push256)
do
  image_id=$(docker images|grep $image|awk '{print $3}'|sort -u)
  echo "push $image...$image_id"
  docker tag $image_id  $repo/$image:$version
  docker push $repo/$image:$version
done

istio/proxyv2:1.9.6

istio/proxy has to be pushed separately; it cannot simply be batch-pushed together with everything else.

Private registry images

The required images are listed in images.txt (generated by get_images.sh), plus a few extra ones that are missing from it.

push_images.sh pulls them and pushes them to the private registry.

Generate manifests.yaml and modify it:

./kustomize/kustomize_3.2.0_darwin_amd64 build manifests/example > manifests.yaml

Deploy using the modified manifests.yaml:

cat manifests.yaml | kubectl apply -f -

Problems encountered

  • gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef cannot be pulled after being pushed to the private registry


The repository address in the manifest needs to be changed.

  • For images pinned by sha256, the digest changes after they are pushed to the private registry, so they can no longer be pulled by digest


  • The image paths are nested too deeply to pull without specifying the project name, even after they were pushed into the library project


Solution

The conversion from images.txt entries to push tags is handled by push_img.sh; the image names in manifests.yaml are currently replaced by hand (a sed sketch follows the table).

images.txt → push tag → manifests.yaml

docker.io/istio/proxyv2:1.9.6
  → xxx:xx/library/docker.io/istio/proxyv2:1.9.6
  → library/docker.io/istio/proxyv2:1.9.6

gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter@sha256:0e25aa1613a3a1779b3f7b7f863e651e5f37520a7f6808ccad2164cc2b6a9b12
  → xxx:xx/library/gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter:kf-manifests-1.4.1
  → library/gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter:kf-manifests-1.4.1
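
A hedged sketch of doing the manual replacement with sed over manifests.yaml instead (the registry prefix is the same placeholder as above; digest-pinned images have to be mapped explicitly to their :kf-manifests-1.4.1 tag):

repo=xxxxx:8080/library   # placeholder private registry

# plain-tag images: just prefix the private registry
sed -i.bak "s#image: docker.io/istio/proxyv2:1.9.6#image: $repo/docker.io/istio/proxyv2:1.9.6#g" manifests.yaml

# digest-pinned images: swap the @sha256 reference for the re-tagged version
sed -i.bak "s#image: gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter@sha256:[0-9a-f]*#image: $repo/gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter:kf-manifests-1.4.1#g" manifests.yaml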

kfserving

The ConfigMap inferenceservice-config needs to be modified.

The image must be written with the full path, e.g. xxx.xxx.xx.203:8080/library/pytorch/torchserve-kfs, otherwise it cannot be pulled.
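
A minimal sketch of the edit (the exact JSON layout of the predictors field differs between kfserving versions, so inspect it first):

# dump the current predictor config to find the image fields
kubectl -n kubeflow get configmap inferenceservice-config -o jsonpath='{.data.predictors}'

# then edit, e.g. change  "image": "pytorch/torchserve-kfs"
#                     to  "image": "xxx.xxx.xx.203:8080/library/pytorch/torchserve-kfs"
kubectl -n kubeflow edit configmap inferenceservice-config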

Problems during installation

service “istio-galley” not found

    2020-05-22T14:11:35.511996Z	error	installer	failed to create "EnvoyFilter/istio-system/metadata-exchange-1.4": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.586916Z	error	installer	failed to create "EnvoyFilter/istio-system/metadata-exchange-1.5": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.626408Z	error	installer	failed to create "EnvoyFilter/istio-system/metadata-exchange-1.6": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.664555Z	error	installer	failed to create "EnvoyFilter/istio-system/stats-filter-1.4": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.712217Z	error	installer	failed to create "EnvoyFilter/istio-system/stats-filter-1.5": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found
    2020-05-22T14:11:35.750036Z	error	installer	failed to create "EnvoyFilter/istio-system/stats-filter-1.6": Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: service "istio-galley" not found

This is most likely left over from an incomplete uninstall of a previous istio:

kubectl delete validatingwebhookconfigurations istio-galley

Then apply again.

HTTP settings


These have already been changed in manifests.yaml:

  • centraldashboard
  • jupyter-web-app-deployment
  • katib-ui
  • kfserving-models-web-app
  • ml-pipeline-ui
  • tensorboards-web-app-deployment
  • volumes-web-app-deployment

Add the following env entry to each of the Deployments listed above (a kubectl one-liner sketch follows the snippet):

        - name: APP_SECURE_COOKIES
          value: "false"

HTTPS references:

  • https://github.com/kubeflow/kubeflow/issues/5803
  • https://github.com/kubeflow/manifests/pull/1819/files

TokenRequest problem when installing on k8s 1.19

Versions: k8s 1.19.6 installed with RKE, kubeflow 1.4

Installing the istio components fails with:

💡 MountVolume.SetUp failed for volume “istio-token” : failed to fetch token: the API server does not have TokenRequest endpoints enabled

Running the command below confirms that the cluster does not have the TokenRequest API enabled:

kubectl get --raw /api/v1 | jq '.resources[] | select(.name | index("serviceaccounts/token"))'

Solution

Set the jwtPolicy variable:

values.global.jwtPolicy=first-party-jwt

Reference: https://stackoverflow.com/questions/64641078/how-to-install-kubeflow-on-existing-on-prem-kubernetes-cluster
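
A hedged sketch of applying that over the generated manifests with sed (the assumption is that the value appears as JWT_POLICY on istiod and as jwtPolicy in the sidecar-injector values; grep first to confirm):

# confirm where the value appears
grep -n "third-party-jwt" manifests.yaml

# flip it to first-party-jwt, then re-apply
sed -i.bak 's/third-party-jwt/first-party-jwt/g' manifests.yaml
cat manifests.yaml | kubectl apply -f -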

In 1.6 the docker.io/istio/proxyv2:1.14.1 address cannot be changed

The only workaround is to pull the images from the private registry and then re-tag them manually:

docker tag library/docker.io/istio/proxyv2:1.14.1 docker.io/istio/proxyv2:1.14.1

docker tag library/gcr.io/ml-pipeline/frontend:2.0.0-alpha.5 gcr.io/ml-pipeline/frontend:2.0.0-alpha.5

docker tag library/gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.5 gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.5


In 1.6, a deployed InferenceService cannot pull images from the private registry

Reference issue

# Inspect the InferenceService details
kubectl describe InferenceService/iris-test
 
 Message:               Revision "iris-test-predictor-default-00001" failed with message: Unable to fetch image "harbor.xxx.cn/library/tensorflow/serving:2.6.2": failed to resolve image to digest: Get "https://harbor.xxx.cn/v2/": x509: certificate is not valid for any names, but wanted to match harbor.xxx.cn.

The ClusterServingRuntime CR cannot omit the registry domain (falling back to the docker.io default), otherwise it fails with:

Message:               Revision "iris-test-predictor-default-00001" failed with message: Unable to fetch image "library/tensorflow/serving:2.6.2": failed to resolve image to digest: Get "https://index.docker.io/v2/": context deadline exceeded.
    Reason:                RevisionFailed

It has to be modified like this:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-tensorflow-serving
spec:
 ....
    image: harbor.xxx.cn/library/tensorflow/serving:2.6.2 # full registry path required
    name: kserve-container

For the x509 problem, edit the knative deployment config:

kubectl -n knative-serving edit configmap config-deployment

Skip tag-to-digest resolution for the private registry (that resolution is what triggers the certificate check):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving

data:
  # List of repositories for which tag to digest resolving should be skipped
  registriesSkippingTagResolving: harbor.xxx.cn # change this

After the change, redeploy the InferenceService.
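
A minimal sketch of the redeploy, assuming the InferenceService is called iris-test and its spec is saved in iris-test.yaml (the skip-resolving setting only takes effect for newly created revisions):

# run in the namespace where the InferenceService lives
kubectl delete inferenceservice iris-test
kubectl apply -f iris-test.yaml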

Multi-user issues

  • https://v1-4-branch.kubeflow.org/docs/components/multi-tenancy/getting-started/

Adding a user

add_profile.yaml

apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: kubeflow-admin-example-com # the user's namespace
spec:
  owner:
    kind: User
    name: admin@example.com # user name
kubectl apply -f add_profile.yaml
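
A quick sketch of checking that the profile controller actually created the namespace:

kubectl get profiles.kubeflow.org
kubectl get ns kubeflow-admin-example-com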

Add the new user to dex's ConfigMap:

    staticPasswords:
    - email: user@example.com
      hash: $2y$12$4K/VkmDd1q1Orb3xAt82zu8gk7Ad6ReFR4LCP9UeYE90NLiN9Df72
      # https://github.com/dexidp/dex/pull/1601/commits
      # FIXME: Use hashFromEnv instead
      username: user
      userID: "15841185641784"
    - email: admin@example.com
      hash: $2y$12$uTQwnB7afyNhob.6ETtbdOSTtwvFgqdo9/6yyDi3RZZuk6OqWgsR6
      username: admin

Generating the hash requires pip install passlib bcrypt:

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

Type the password and press Enter to get the required hash.

Finally, restart the dex pod.
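
A minimal sketch, assuming the default layout where dex lives in the auth namespace with a ConfigMap and a Deployment both named dex (verify the names in your install):

kubectl -n auth edit configmap dex            # add the staticPasswords entry here
kubectl -n auth rollout restart deployment dex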

Development notes

Custom Jupyter images

Build custom Jupyter images, since everyone needs different package versions (a Dockerfile sketch follows).
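
A minimal sketch that layers extra packages on top of a stock notebook image; the base image reference, tag, registry and package pins are all assumptions here, so substitute whatever default notebook image your install already pulls:

cat > Dockerfile <<'EOF'
# base image is an assumption: kubeflow 1.4-era notebook images were published
# under public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/; match the tag to your install
FROM public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-scipy:v1.4
USER root
RUN pip install --no-cache-dir pandas==1.3.5 scikit-learn==1.0.2
# the stock images run as the jovyan user
USER jovyan
EOF

docker build -t harbor.xxx.cn/library/jupyter-custom:v1 .
docker push harbor.xxx.cn/library/jupyter-custom:v1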

pipeline sdk

Do not use the v2 version for now; kubeflow throws errors like:

 main.go:50] Failed to execute component: unable to get pipeline with PipelineName "pipeline/v2add" PipelineRunID "7e2bdeeb-aa6f-4109-a508-63a1be22267c": Failed GetContextByTypeAndName(type="system.Pipeline", name="pipeline/v2add")

sdk sample

kfserving

Python 3.9 is required, otherwise installing the ray[serve]==1.5.0 dependency fails.
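
A minimal sketch of setting up such an environment (conda is just one option; the environment name is arbitrary):

conda create -n kfserving-sdk python=3.9 -y
conda activate kfserving-sdk
pip install kfserving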

fairing

pip install kubeflow-fairing

It simply cannot be installed: there are dependency conflicts, and pip keeps reinstalling different versions of the same package. In short, it is uninstallable as published:

$ pip-compile -r test_req.in
Could not find a version that matches kubernetes==10.0.1,>=10.0.1,>=12.0.0 (from kubeflow-fairing==1.0.2->-r test_req.in (line 1))
Tried: 1.0.0, 1.0.0, 1.0.1, 1.0.1, 1.0.2, 1.0.2, 2.0.0, 2.0.0, 3.0.0, 3.0.0, 4.0.0, 4.0.0, 5.0.0, 5.0.0, 6.0.0, 6.0.0, 6.1.0, 6.1.0, 7.0.0, 7.0.0, 7.0.1, 7.0.1, 8.0.0, 8.0.0, 8.0.1, 8.0.1, 8.0.2, 8.0.2, 9.0.0, 9.0.0, 9.0.1, 9.0.1, 10.0.0, 10.0.0, 10.0.1, 10.0.1, 10.1.0, 10.1.0, 11.0.0, 11.0.0, 12.0.0, 12.0.0, 12.0.1, 12.0.1, 17.17.0, 17.17.0, 18.20.0, 18.20.0, 19.15.0, 19.15.0
Skipped pre-versions: 0.0.0a2, 0.0.0a2, 0.0.0a5, 0.0.0a5, 1.0.0a2, 1.0.0a2, 1.0.0a3, 1.0.0a4, 1.0.0a4, 1.0.0a5, 1.0.0a5, 1.0.0b1, 1.0.0b1, 1.0.0b2, 1.0.0b2, 1.0.0b3, 1.0.0b3, 2.0.0a1, 2.0.0a1, 2.0.0b1, 2.0.0b1, 3.0.0a1, 3.0.0a1, 3.0.0b1, 3.0.0b1, 4.0.0a1, 4.0.0a1, 4.0.0b1, 4.0.0b1, 5.0.0b1, 5.0.0b1, 6.0.0b1, 6.0.0b1, 7.0.0a1, 7.0.0a1, 7.0.0b1, 7.0.0b1, 8.0.0a1, 8.0.0a1, 8.0.0b1, 8.0.0b1, 9.0.0a1, 9.0.0a1, 9.0.0b1, 9.0.0b1, 10.0.0a1, 10.0.0a1, 11.0.0a1, 11.0.0a1, 11.0.0b1, 11.0.0b1, 11.0.0b2, 11.0.0b2, 12.0.0a1, 12.0.0a1, 12.0.0b1, 12.0.0b1, 17.14.0a1, 17.14.0a1, 17.17.0b1, 17.17.0b1, 18.17.0a1, 18.17.0a1, 18.20.0b1, 18.20.0b1, 19.15.0a1, 19.15.0a1, 19.15.0b1, 19.15.0b1, 20.11.0a1, 20.11.0a1, 20.12.0b1, 20.12.0b1
There are incompatible versions in the resolved dependencies:
  kubernetes>=10.0.1 (from kubeflow-tfjob==0.1.3->kubeflow-fairing==1.0.2->-r test_req.in (line 1))
  kubernetes>=12.0.0 (from kfserving==0.6.1->kubeflow-fairing==1.0.2->-r test_req.in (line 1))
  kubernetes>=10.0.1 (from kubeflow-pytorchjob==0.1.3->kubeflow-fairing==1.0.2->-r test_req.in (line 1))
  kubernetes==10.0.1 (from kubeflow-fairing==1.0.2->-r test_req.in (line 1))

kubeflow-fairing has several dependencies; kfserving is fairly new, but the other packages are quite old (still maintained, but never released), and fairing itself is not compatible with the latest kfserving.

The main issue is that the Kubernetes Python client switched from swagger_types to openapi_types starting with 11.x, and the other packages never caught up.

So I maintain my own fairing fork on the adapt-latest-kfserving-and-training-oprator branch and build a wheel from it:

#!/bin/bash

rm -rf build
rm -rf dist
rm -rf kubeflow_fairing.egg-info

python setup.py bdist_wheel

The resulting .whl file can be found in the project's dist directory.


Then install it with pip install xxx.whl.

Determining the model (InferenceService) status

  • Ready

    "conditions": [
               ......
                {
                    "lastTransitionTime": "2022-01-27T01:49:37Z",
                    "status": "True",
                    "type": "Ready"
                }
      ]
    

    Check that Ready is True; upstream determines this via inferenceServiceReadiness.

  • Failed

    "conditions": [
              
      
                {
                    "lastTransitionTime": "2022-01-27T01:54:52Z",
                    "message": "Revision \"test-sklearn-predictor-default-00001\" failed with message: 0/3 nodes are available: 1 Insufficient memory, 3 Insufficient cpu..",
                    "reason": "RevisionFailed",
                    "severity": "Info",
                    "status": "False",
                    "type": "PredictorRouteReady"
                },
                {
                    "lastTransitionTime": "2022-01-27T01:54:52Z",
                    "message": "Configuration \"test-sklearn-predictor-default\" does not have any ready Revision.",
                    "reason": "RevisionMissing",
                    "status": "False",
                    "type": "Ready"
                }
            ]
    

    Iterate over the conditions: if reason == "RevisionFailed", the service is in a failed state, and its message is the error to report.

  • Deploying

    Every other case is treated as still deploying (see the sketch after this list).
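
A minimal sketch of that decision logic using kubectl and jq (name and namespace are placeholders):

ns=kubeflow-user-example-com   # placeholder namespace
name=iris-test                 # placeholder InferenceService name
status=$(kubectl -n "$ns" get inferenceservice "$name" -o json)

if echo "$status" | jq -e '.status.conditions[]? | select(.type=="Ready" and .status=="True")' >/dev/null; then
  echo "ready"
elif echo "$status" | jq -e '.status.conditions[]? | select(.reason=="RevisionFailed")' >/dev/null; then
  echo "failed: $(echo "$status" | jq -r '[.status.conditions[]? | select(.reason=="RevisionFailed") | .message][0]')"
else
  echo "deploying"
fi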

To see where kserve assigns these conditions, look mainly at the InferenceServiceStatus.PropagateStatus method:

func (ss *InferenceServiceStatus) PropagateStatus(component ComponentType, serviceStatus *knservingv1.ServiceStatus) {
	....
	// propagate overall service condition
	serviceCondition := serviceStatus.GetCondition(knservingv1.ServiceConditionReady)
	if serviceCondition != nil && serviceCondition.Status == v1.ConditionTrue {
		if serviceStatus.Address != nil {
			statusSpec.Address = serviceStatus.Address
		}
		if serviceStatus.URL != nil {
			statusSpec.URL = serviceStatus.URL
		}
	}
	// propagate ready condition for each component
	readyCondition := conditionsMap[component]
	ss.SetCondition(readyCondition, serviceCondition)
	// propagate route condition for each component
	routeCondition := serviceStatus.GetCondition("ConfigurationsReady")
	routeConditionType := routeConditionsMap[component]
	ss.SetCondition(routeConditionType, routeCondition)
	// propagate configuration condition for each component
	configurationCondition := serviceStatus.GetCondition("RoutesReady")
	configurationConditionType := configurationConditionsMap[component]
	// propagate traffic status for each component
	statusSpec.Traffic = serviceStatus.Traffic
	ss.SetCondition(configurationConditionType, configurationCondition)

	ss.Components[component] = statusSpec
}

Tracing the function flow all the way through and simplifying it (honestly I can't quite work out why the last two values are assigned the way they are; the variable names look odd, almost as if they were swapped):

The mapping differs per component; serviceStatus comes from the knative client:

kserve condition              knative serviceStatus condition
PredictorReady                Ready
PredictorRouteReady           ConfigurationsReady
PredictorConfigurationReady   RoutesReady

At the top level, the reconciler first calls knative to create the components above, and only then creates the ingress-related settings (IngressReady):

reconcilers := []components.Component{
  components.NewPredictor(r.Client, r.Scheme, isvcConfig),
}
if isvc.Spec.Transformer != nil {
  reconcilers = append(reconcilers, components.NewTransformer(r.Client, r.Scheme, isvcConfig))
}
if isvc.Spec.Explainer != nil {
  reconcilers = append(reconcilers, components.NewExplainer(r.Client, r.Scheme, isvcConfig))
}
for _, reconciler := range reconcilers {
  if err := reconciler.Reconcile(isvc); err != nil {
    r.Log.Error(err, "Failed to reconcile", "reconciler", reflect.ValueOf(reconciler), "Name", isvc.Name)
    r.Recorder.Eventf(isvc, v1.EventTypeWarning, "InternalError", err.Error())
    return reconcile.Result{}, errors.Wrapf(err, "fails to reconcile component")
  }
}
//Reconcile ingress
ingressConfig, err := v1beta1api.NewIngressConfig(r.Client)
if err != nil {
  return reconcile.Result{}, errors.Wrapf(err, "fails to create IngressConfig")
}
reconciler := ingress.NewIngressReconciler(r.Client, r.Scheme, ingressConfig)
r.Log.Info("Reconciling ingress for inference service", "isvc", isvc.Name)
if err := reconciler.Reconcile(isvc); err != nil {
  return reconcile.Result{}, errors.Wrapf(err, "fails to reconcile ingress")
}

// Reconcile modelConfig
configMapReconciler := modelconfig.NewModelConfigReconciler(r.Client, r.Scheme)
if err := configMapReconciler.Reconcile(isvc); err != nil {
  return reconcile.Result{}, err
}

if err = r.updateStatus(isvc); err != nil {
  r.Recorder.Eventf(isvc, v1.EventTypeWarning, "InternalError", err.Error())
  return reconcile.Result{}, err
}