Kubernetes: k8s is a containerized cluster management system that Google open-sourced in 2014.
Pod: a Pod is the smallest deployable unit of computing that can be created and managed in Kubernetes.
Network controller
ApiServer: the uniform interface through which all services access the cluster
ControllerManager: maintains the desired number of replicas
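As a concrete illustration of the Pod concept, a minimal manifest (the name and image are placeholders, not taken from this cluster):

```yaml
# minimal-pod.yaml — the smallest deployable unit: one container in one Pod
apiVersion: v1
kind: Pod
metadata:
  name: hello-nginx        # hypothetical name
spec:
  containers:
    - name: nginx
      image: nginx:1.16.1  # any image works; this tag also appears later in these notes
      ports:
        - containerPort: 80
```

Apply it with `kubectl apply -f minimal-pod.yaml` and inspect it with `kubectl get pod hello-nginx`.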
kubeadm
: the command to bootstrap the cluster.
kubelet
: the component that runs on all of the machines in your cluster and does things like starting pods and containers.
kubectl
: the command line util to talk to your cluster.
Run the following steps on both the master and the nodes to install:
```bash
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
exclude=kubelet kubeadm kubectl
EOF

# Set SELinux in permissive mode (effectively disabling it)
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# turn off swap
swapoff -a        # temporary
vim /etc/fstab    # permanent: comment out the swap line

sudo yum install -y kubelet-1.20.5 kubeadm-1.20.5 kubectl-1.20.5 --disableexcludes=kubernetes
sudo systemctl enable --now kubelet
```
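After installation, a quick sanity check that the pinned 1.20.5 versions actually landed (expected outputs noted as comments):

```bash
kubelet --version                 # expect: Kubernetes v1.20.5
kubeadm version -o short          # expect: v1.20.5
kubectl version --client --short  # expect: Client Version: v1.20.5
```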
Uninstall command:

```bash
sudo yum remove -y kubelet kubeadm kubectl
```
Assume the master runs on 10.176.66.58.
```bash
# Pull the images ahead of time
kubeadm config images pull --image-repository registry.aliyuncs.com/google_containers
kubeadm config images pull --image-repository 10.176.66.20/google_containers

# If the service/pod DNS cannot be pinged
sudo kubeadm init \
  --apiserver-advertise-address=10.114.130.3 \
  --image-repository registry.aliyuncs.com/google_containers \
  --kubernetes-version v1.20.5 \
  --service-cidr=10.96.0.0/12 \
  --pod-network-cidr=10.244.0.0/16

# Using a private registry
sudo kubeadm init \
  --apiserver-advertise-address=10.114.130.3 \
  --image-repository 10.176.66.20/google_containers \
  --kubernetes-version v1.20.5 \
  --service-cidr=10.96.0.0/12 \
  --pod-network-cidr=10.244.0.0/16

# On the nodes, run the join command generated above to join the cluster, e.g.
kubeadm join 10.176.66.58:6443 --token 7opg66.gcmdavb2vxiliytp \
    --discovery-token-ca-cert-hash sha256:ecb8d4930ac8489c1196560612afa1736dddf7be25244a50e64c82dca9bb2644

# Set up the kubectl config
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Check which nodes have joined
kubectl get nodes
# Check pods
kubectl get pods -o wide

# Install the CNI pod network add-on (flannel)
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
```
Regenerate the join command on the master:
```bash
kubeadm token create --print-join-command
kubeadm token list
```
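The token embedded in the join command expires after 24 hours by default. For a lab cluster, `kubeadm token create` also accepts a TTL; 0 means never expire (convenient, but a security trade-off):

```bash
# Non-expiring token — fine for a lab, risky in production
kubeadm token create --ttl 0 --print-join-command
```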
Remove a master/node from the cluster:

```bash
kubeadm reset
```
To be updated.
```bash
# List all pods and services
kubectl get pod,svc
# Watch for changes and print them as they happen
kubectl get pod -w
# List all components
kubectl get pods --all-namespaces -o wide
kubectl get componentstatuses
# Describe a pod in a given namespace
kubectl describe pods -n kube-system coredns-7f89b7bc75-hsjdl
# Describe a pod
kubectl describe pod <pod-name>
# View the logs of one container in a pod
kubectl logs <pod-name> -c <container-name>
# Delete a pod
kubectl delete pod <pod-name>
# Delete all deployments
kubectl delete deployment --all
# Delete all pods
kubectl delete pod --all
# Create objects from a config file
kubectl create -f nginx.yaml
# Delete objects defined in a config file (presumably matched by the unique identity in metadata)
kubectl delete -f nginx.yaml
# Update an object's configuration
kubectl replace -f nginx.yaml
# Open a shell inside a container
kubectl exec -it nacos-2 -- /bin/bash
```
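For reference, a sketch of what the nginx.yaml used by the create/delete/replace commands above might contain (not the exact file from these notes; the deployment name matches the nginx-deployment used in the rollout commands below):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.16.1
          ports:
            - containerPort: 80
```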
```bash
# Update the image
kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1 --record
# View rollout history
kubectl rollout history deployment.v1.apps/nginx-deployment
# Roll back to the previous revision
kubectl rollout undo deployment.v1.apps/nginx-deployment
# Roll back to a specific revision
kubectl rollout undo deployment.v1.apps/nginx-deployment --to-revision=2
# Scale
kubectl scale deployment.v1.apps/nginx-deployment --replicas=10
# Autoscale
kubectl autoscale deployment.v1.apps/nginx-deployment --min=10 --max=15 --cpu-percent=80
# Show cpu/memory usage per pod/container
kubectl top pod
kubectl top pod test-huishi-server-6f875487d7-9rzpd
kubectl top pod | grep lemes-service-common
# Show node cpu/memory usage
kubectl top nodes

sudo systemctl restart docker
```
```bash
sudo yum install -y nfs-utils rpcbind
sudo mkdir -p /data/nfs
sudo sh -c "sudo echo '/data/nfs *(rw,sync,no_root_squash)' >> /etc/exports"
sudo systemctl enable --now nfs
sudo systemctl enable --now rpcbind
```
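With the export in place, Kubernetes can consume it through a PersistentVolume. A minimal sketch, assuming the NFS server is the master host 10.176.66.58 from above (adjust server and capacity to your setup):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv              # hypothetical name
spec:
  capacity:
    storage: 10Gi           # illustrative size
  accessModes:
    - ReadWriteMany         # NFS supports shared read-write access
  nfs:
    server: 10.176.66.58    # assumption: the NFS server set up above
    path: /data/nfs
```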
```bash
# NOTE: plain "sudo cat > file" fails because the redirect runs without sudo; use tee instead
sudo tee /etc/yum.repos.d/ceph.repo << EOF
[ceph-norch]
name=ceph-norch
baseurl=https://mirrors.aliyun.com/ceph/rpm-nautilus/el7/noarch/
enabled=1
gpgcheck=0
[ceph-x86_64]
name=ceph-x86_64
baseurl=https://mirrors.aliyun.com/ceph/rpm-nautilus/el7/x86_64/
enabled=1
gpgcheck=0
EOF
sudo yum install ceph-common
```
```bash
kubectl exec -it redis-cluster-0 -- redis-cli --cluster create --cluster-replicas 1 \
  $(kubectl get pods -l app=redis-cluster -o jsonpath='{range.items[*]}{.status.podIP}:6379 {end}')
kubectl exec -it redis-cluster-0 -- redis-cli cluster nodes
```
```yaml
- --http-port=81
- --https-port=8443
```
```bash
# Install ceph
sudo yum -y install ceph-common
# Create a pool
ceph osd pool create kubernetes 16 16
# Initialize the pool
rbd pool init kubernetes
# Create a block image
rbd create -p kubernetes --image-feature layering rbd.img --size 10G

mkdir -p /data/ceph/sdb
mkdir -p /data/ceph/sdc

# Check the lv path
sudo vgscan
sudo vgdisplay -v datavg
ceph-deploy osd create --data /dev/datavg/lv_data whulpdpms01
ceph-deploy osd create --data /dev/datavg/lv_data whulpdpms02
ceph-deploy osd create --data /dev/datavg/lv_data whulpdpms03
```
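As a sanity check outside Kubernetes, the image created above can be mapped and mounted on any host with ceph-common installed. A sketch, assuming the kernel assigns /dev/rbd0 (the usual first device) and that mkfs is acceptable because the image is brand new:

```bash
# Map the rbd image created above to a local block device
sudo rbd map kubernetes/rbd.img
# Format and mount it (first use only; mkfs wipes any existing data)
sudo mkfs.ext4 /dev/rbd0
sudo mkdir -p /mnt/rbd-test
sudo mount /dev/rbd0 /mnt/rbd-test
# Clean up when done
sudo umount /mnt/rbd-test
sudo rbd unmap /dev/rbd0
```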
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: lemes-gateway-ig
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-body-size: "600m"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "Host $host";
      more_set_headers "X-Forwarded-Proto $scheme";
      more_set_headers "X-Forwarded-For $proxy_add_x_forwarded_for";
      more_set_headers "X-Real-IP $remote_addr";
spec:
  rules:
    - http:
        paths:
          - path: /lemes-api(/|$)(.*)
            pathType: Prefix
            backend:
              service:
                name: lemes-gateway-svc
                port:
                  number: 80
```
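After applying the manifest (the filename lemes-gateway-ig.yaml below is hypothetical), the rewrite rule can be verified end to end; with rewrite-target /$2 and the path /lemes-api(/|$)(.*), the /lemes-api prefix is stripped before the request reaches the backend:

```bash
kubectl apply -f lemes-gateway-ig.yaml
kubectl get ingress lemes-gateway-ig
# This request arrives at lemes-gateway-svc as /some/path
# (replace <node-ip> with a node running the ingress controller)
curl -v http://<node-ip>/lemes-api/some/path
```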
Cause and solution: https://github.com/kubernetes/ingress-nginx/issues/7837

1. Edit the ingress-nginx configmap:

```bash
kubectl edit configmap -n ingress-nginx ingress-nginx-controller
```

2. Disable snippet annotations:

```yaml
data:
  allow-snippet-annotations: "false"
```
Update the ingress-nginx-controller under <cluster name> -> System -> Resources -> Config Maps.
Add the following key-value pairs:

```yaml
compute-full-forwarded-for: true
forwarded-for-header: X-Forwarded-For
use-forwarded-headers: true
```
The change takes effect immediately.
docker stop/kill/rm -f all fail to work.
```bash
# Find the defunct (zombie) process
$ ps axo stat,ppid,pid,comm | grep -w defunct
Zl 19653 19679 java <defunct>
# Find the parent process
$ ps -f 19679
UID   PID    PPID  C STIME TTY STAT TIME CMD
root  19653  19635 0 11:02 ?   Ss   0:00 [docker-startup.]
$ sudo kill -9 19635
$ sudo systemctl restart docker
```
```bash
kubectl get pods | grep Terminating | awk '{print $1}' | xargs -I {} kubectl delete pod {} --force --grace-period=0
kubectl delete pod nginx-ingress-controller-lbftg --force --grace-period=0
```
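If a pod still lingers after the force delete, it is usually held by a finalizer; as a last resort the finalizer list can be cleared, at the cost of skipping whatever cleanup the finalizer was responsible for:

```bash
# Inspect the finalizers first
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'
# Clear them so the API server can complete the deletion
kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'
```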
```bash
sudo vi /etc/sysctl.conf
# add:
fs.file-max=9000000
fs.inotify.max_user_instances = 1000000
fs.inotify.max_user_watches = 1000000

sudo sysctl -p
sudo systemctl restart docker
```
```bash
# Check mounted volumes
cat /proc/mounts | grep "docker"
# output:
# /dev/mapper/centos-root /var/lib/docker/overlay xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
# overlay /var/lib/docker/overlay/xxxxxxxxxx
# Unmount a single volume
umount /var/lib/docker/overlay/xxxxxxxxxxx
# Unmount in batch
sudo umount `cat /proc/mounts | grep "docker" | awk '{print $2}'`
```
```bash
[lemes@slt6dhqgxev ~]$ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
# journalctl -u docker.service
accept4: too many open files
```
Problem analysis
The system-wide limit on the number of open files was reached.
Solution
```bash
sudo vi /etc/sysctl.conf
# add:
fs.file-max = 9000000
fs.inotify.max_user_instances = 1000000
fs.inotify.max_user_watches = 1000000
# apply
sudo sysctl -p
# restart the docker service
sudo systemctl restart docker
# start all stopped containers
docker start $(docker ps -a | awk '{ print $1}' | tail -n +2)
```
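To confirm the new limits took effect and see how close the system is to them:

```bash
# allocated / unused / system-wide maximum file handles
cat /proc/sys/fs/file-nr
# should echo back the values written above
sysctl fs.file-max fs.inotify.max_user_instances fs.inotify.max_user_watches
```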
Discussion: volume hostpath with subpath
Cause: the path was mapped into the kubelet container.

```bash
docker exec -it $(docker ps -aq --filter "name=kubelet") /bin/sh
```
After changing the host's DNS settings, docker must be restarted for the change to take full effect:

```bash
sudo systemctl restart docker
```
```bash
kubectl delete -A ValidatingWebhookConfiguration foobar-ingress-nginx-admission
```
The network add-on kept failing to install. Checking the logs with `docker logs -f kubelet` showed that the network plugin installation reported DiskPressure and the pod was evicted.
Problem: insufficient disk space on /data.
Solution: freed up disk space.
Symptom: pods were evicted with the error in the title.
Problem: the ephemeral storage used exceeded the node's limit (on this node).
2023-08-11 21:22: the problem occurred again.
According to the material I found, it may be caused by container logs not being size-limited.
The `/etc/docker/daemon.json` at the southern plant (`sudo vi /etc/docker/daemon.json`) was indeed configured as follows:
{ "registry-mirrors": [], "insecure-registries": [ "10.176.66.20:5000", "10.188.132.44:5000", "10.188.132.123:5000", "10.176.2.207:5000" ], "data-root":"/data/docker/system", "debug": true, "experimental": false,}
Changed it to the following, limiting each container to at most 10m of logs:
{ "registry-mirrors": [], "insecure-registries": [ "10.188.132.44:5000", "10.188.132.123:5000", "10.176.2.207:5000" ], "data-root":"/data/docker/system", "debug": true, "experimental": false, "log-driver": "json-file", "log-opts": { "max-size": "10m", "max-file": "1", "labels": "production_status", "env": "os,customer" }}
2023-08-25 16:41: the problem occurred again.
Found that the emptyDir used by tingyun was continuously being written with logs, eating up ephemeral storage.
```bash
# Total size of all json log files
sudo find /data/docker/system/containers/ -name "*-json.log" | xargs sudo ls -l | awk '{print $5}' | awk '{sum+=$1}END{print sum}'
# Remove them
sudo find /data/docker/system/containers/ -name "*-json.log" | xargs sudo rm -rf
```
2023-08-31: both smt-wh and smt-tjsc hit this problem: `The node was low on resource: ephemeral-storage. Container lemes-service-wh-report was using 1936Ki, which exceeds its request of 0.`
Exceeding a request of 0 does not mean a limit was configured; the real trigger was that disk usage on the node had reached 85%, which caused the pod eviction.
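One way to make eviction behaviour predictable is to declare ephemeral-storage requests/limits on the containers themselves; a sketch of the relevant container-spec fragment (the values are illustrative, not tuned for this cluster):

```yaml
# Fragment of a Pod/Deployment container spec; with an explicit request,
# the "exceeds its request of 0" wording disappears and the scheduler can
# account for the container's scratch-space usage
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "4Gi"
```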
Error from server (InternalError): error when creating "management-state/tmp/yaml-397511040": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s": x509: certificate is not valid for any names, but wanted to match ingress-nginx-controller-admission.ingress-nginx.svc
There is an issue that looks very similar: PLEG is not healthy K8 1.20.4/Ubuntu 20.04. According to minchieh-fay's answer there, the problem is the runc release runc-1.0.0-rc93. You can check the runc version with `docker version`; it was indeed runc-1.0.0-rc93, so I performed an offline upgrade as follows:
1. On the runc GitHub releases page, find the version to upgrade to (I chose 1.1.4) and download `runc.amd64`.
2. Rename it and make it executable: `mv runc.amd64 runc && chmod +x runc`.
3. Back up the existing runc binary: `mv /usr/bin/runc /usr/bin/runc.bak`.
4. Stop the docker service: `systemctl stop docker`.
5. Move the new runc binary into the `/usr/bin/` directory: `mv runc /usr/bin/runc`.
6. Start the docker service: `systemctl start docker`.
7. Run `docker version` to check the runc version and confirm the upgrade succeeded.
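For step 7, a quick filter on the output:

```bash
# The Server section of docker version lists the runc component version
docker version | grep -i -A 2 runc
```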