Introduction to Seccomp: BPF Linux syscall filter

Source: Internet  Published by: 六十知天命  Editor: 程序博客网  Time: 2024/06/06 05:32
  1. Seccomp Introduction
  2. Seccomp Security Profiles for Docker
    2.1 Docker default seccomp profile
    2.2 Use custom Seccomp profile
    2.3 Docker Run without the default seccomp profile
  3. Seccomp Security Profiles for Kubernetes
    3.1 Kubernetes default seccomp profile
    3.2 Kubernetes Use runtime default profile
    3.3 Kubernetes Use custom Seccomp profile
  4. Seccomp

1. Seccomp Introduction

Seccomp filtering provides a means for a process to specify a filter for
incoming system calls. This filter is defined by Berkeley Packet
Filter (BPF) rules.

Secure computing mode (Seccomp) is a Linux kernel feature. You can use it to restrict the actions (system calls) available within the container. This feature is available only if Docker has been built with seccomp and the kernel is configured with CONFIG_SECCOMP enabled, that is:

Linux Kernel 3.5 or higher && CONFIG_SECCOMP=y

To check if your kernel supports seccomp:

root@kube-master:~# cat /boot/config-`uname -r` | grep CONFIG_SECCOMP
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
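The same check can be scripted. The following is a minimal sketch, assuming the kernel config has already been read into a string (for example from /boot/config-<release>); the function names seccomp_options and kernel_supports_seccomp are hypothetical helpers, not part of any tool.

```python
def seccomp_options(config_text):
    """Return {option: value} for every CONFIG_SECCOMP* line in a kernel config."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_X is not set" and are skipped.
        if line.startswith("CONFIG_SECCOMP") and "=" in line:
            key, _, value = line.partition("=")
            options[key] = value
    return options


def kernel_supports_seccomp(config_text):
    """True when CONFIG_SECCOMP=y is present in the config text."""
    return seccomp_options(config_text).get("CONFIG_SECCOMP") == "y"


# Sample text copied from the grep output above.
sample = "CONFIG_SECCOMP_FILTER=y\nCONFIG_SECCOMP=y\n"
print(seccomp_options(sample))
print(kernel_supports_seccomp(sample))
```

On a real host the text would come from the file that the cat command above reads; some distributions expose the config elsewhere, so the path is left to the caller.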

2. Seccomp Security Profiles for Docker

Docker can use this feature only when the following conditions are met:
1. Linux Kernel 3.5 or higher && CONFIG_SECCOMP=y
2. Seccomp profiles require libseccomp 2.2.1 or higher
3. Version of Docker 1.10 or higher

2.1 Docker default seccomp profile

The default seccomp profile provides a sane default for running containers with seccomp and disables around 52 system calls out of 300+. It is moderately protective while providing wide application compatibility. The default Docker profile has a JSON layout.
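To give a feel for that JSON layout, here is a small sketch that builds a toy whitelist profile. The field names (defaultAction, syscalls, name, action) follow the prevent-chmod profile shown later in this document; newer Docker releases group syscalls under a "names" list instead. The helper make_profile is hypothetical, and a three-syscall whitelist is far too small to run a real container.

```python
import json


def make_profile(default_action, rules):
    """Build a profile dict; rules is a list of (syscall_name, action) pairs."""
    return {
        "defaultAction": default_action,
        "syscalls": [{"name": name, "action": action} for name, action in rules],
    }


# Whitelist style: deny by default, allow only the listed syscalls.
profile = make_profile(
    "SCMP_ACT_ERRNO",
    [
        ("read", "SCMP_ACT_ALLOW"),
        ("write", "SCMP_ACT_ALLOW"),
        ("exit_group", "SCMP_ACT_ALLOW"),
    ],
)
print(json.dumps(profile, indent=2))
```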

Significant syscalls blocked by the default profile

Docker’s default seccomp profile is a whitelist which specifies the calls that are allowed. The table below lists the significant (but not all) syscalls that are effectively blocked because they are not on the whitelist. The table includes the reason each syscall is blocked rather than whitelisted.

Syscall            Description
acct               Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT.
add_key            Prevent containers from using the kernel keyring, which is not namespaced.
adjtimex           Similar to clock_settime and settimeofday, time/date is not namespaced. Also gated by CAP_SYS_TIME.
bpf                Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN.
clock_adjtime      Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clock_settime      Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clone              Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_NEWUSER.
create_module      Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE.
delete_module      Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
finit_module       Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
get_kernel_syms    Deny retrieval of exported kernel and module symbols. Obsolete.
get_mempolicy      Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
init_module        Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
ioperm             Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
iopl               Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
kcmp               Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
kexec_file_load    Sister syscall of kexec_load that does the same thing, slightly different arguments. Also gated by CAP_SYS_BOOT.
kexec_load         Deny loading a new kernel for later execution. Also gated by CAP_SYS_BOOT.
keyctl             Prevent containers from using the kernel keyring, which is not namespaced.
lookup_dcookie     Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by CAP_SYS_ADMIN.
mbind              Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
mount              Deny mounting, already gated by CAP_SYS_ADMIN.
move_pages         Syscall that modifies kernel memory and NUMA settings.
name_to_handle_at  Sister syscall to open_by_handle_at. Already gated by CAP_SYS_NICE.
nfsservctl         Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1.
open_by_handle_at  Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH.
perf_event_open    Tracing/profiling syscall, which could leak a lot of information on the host.
personality        Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns.
pivot_root         Deny pivot_root, should be privileged operation.
process_vm_readv   Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
process_vm_writev  Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
ptrace             Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_SYS_PTRACE.
query_module       Deny manipulation and functions on kernel modules. Obsolete.
quotactl           Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN.
reboot             Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT.
request_key        Prevent containers from using the kernel keyring, which is not namespaced.
set_mempolicy      Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
setns              Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN.
settimeofday       Time/date is not namespaced. Also gated by CAP_SYS_TIME.
stime              Time/date is not namespaced. Also gated by CAP_SYS_TIME.
swapon             Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
swapoff            Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
sysfs              Obsolete syscall.
_sysctl            Obsolete, replaced by /proc/sys.
umount             Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
umount2            Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
unshare            Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user.
uselib             Older syscall related to shared libraries, unused for a long time.
userfaultfd        Userspace page fault handling, largely needed for process migration.
ustat              Obsolete syscall.
vm86               In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.
vm86old            In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.
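The whitelist behaviour described above (anything not listed falls through to the default action) can be sketched as a simple lookup. This is an illustration of the semantics only, not how the kernel evaluates a BPF filter: it ignores per-argument conditions and architectures, and the function action_for and the toy profile are hypothetical.

```python
def action_for(profile, syscall):
    """Return the action a profile would assign to a syscall name.

    Accepts rules written with a single "name" or with a "names" list.
    Argument filters ("args") are deliberately ignored in this sketch.
    """
    for rule in profile.get("syscalls", []):
        names = rule.get("names") or [rule.get("name")]
        if syscall in names:
            return rule["action"]
    # Not listed: the default action applies, which is how a whitelist blocks.
    return profile["defaultAction"]


whitelist = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"name": "read", "action": "SCMP_ACT_ALLOW"},
        {"names": ["write", "exit_group"], "action": "SCMP_ACT_ALLOW"},
    ],
}

print(action_for(whitelist, "write"))   # listed, so allowed
print(action_for(whitelist, "mount"))   # not listed, so the default applies
```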

2.2 Use custom Seccomp profile

When you run a container, it uses the default profile unless you override it with the --security-opt option. For example, the following explicitly specifies a profile:

$ docker run --rm -it --security-opt seccomp=/etc/docker/seccomp/profile.json hello-world

2.3 Docker Run without the default seccomp profile

You can pass unconfined (that is, no restrictions) to run a container without the default seccomp profile.

$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
    unshare --map-root-user --user sh -c whoami
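When containers are launched from a script, the two invocations in 2.2 and 2.3 differ only in the value passed to --security-opt. The sketch below assembles both argument lists without running them; docker_run_command is a hypothetical helper, and only flags that already appear in the commands above are used.

```python
import shlex


def docker_run_command(image, seccomp, command=()):
    """Build a `docker run` argv; seccomp is a profile path or "unconfined"."""
    return [
        "docker", "run", "--rm", "-it",
        "--security-opt", f"seccomp={seccomp}",
        image, *command,
    ]


custom = docker_run_command("hello-world", "/etc/docker/seccomp/profile.json")
unconfined = docker_run_command(
    "debian:jessie",
    "unconfined",
    ["unshare", "--map-root-user", "--user", "sh", "-c", "whoami"],
)
print(shlex.join(custom))
print(shlex.join(unconfined))
```

Passing either list to subprocess.run would start the container, assuming Docker is installed and the profile file exists.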

3. Seccomp Security Profiles for Kubernetes

Seccomp (secure computing mode) is used to restrict the set of system calls applications can make, allowing cluster administrators greater control over the security of workloads running in a Kubernetes cluster.

Kubernetes can use this feature only when the following conditions are met:
1. Linux Kernel 3.5 or higher && CONFIG_SECCOMP=y
2. Seccomp profiles require libseccomp 2.2.1 or higher
3. Version of Docker 1.10 or higher
4. Version of Kubernetes 1.3.0-beta.2 or higher

3.1 Kubernetes default seccomp profile

By default, containers are run with unconfined seccomp settings. In other words, Kubernetes places no restriction on the system calls made in any container it creates, which is a security risk.
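One way to confirm whether a process is actually confined is the Seccomp field in /proc/<pid>/status, where 0 means disabled, 1 strict mode, and 2 filter mode. The sketch below parses that field from status text; seccomp_mode is a hypothetical helper, and older kernels may not expose the field at all, in which case it returns None.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, "unknown")
    return None  # field absent, e.g. on a kernel that does not report it


# Illustrative excerpts, not captured from a real system.
print(seccomp_mode("Name:\tsh\nSeccomp:\t0\n"))  # an unconfined process
print(seccomp_mode("Name:\tsh\nSeccomp:\t2\n"))  # a process under a filter
```

Inside a container, the text would come from reading /proc/self/status.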

Here’s an example of a pod that uses the unconfined profile:

apiVersion: v1
kind: Pod
metadata:
  name: trustworthy-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: unconfined
spec:
  containers:
    - name: trustworthy-container
      image: sotrustworthy:latest

3.2 Kubernetes Use runtime default profile

To bind a specific profile to a Pod, you can use the following alpha annotations:

Specify a Seccomp profile for all containers of the Pod:

seccomp.security.alpha.kubernetes.io/pod

Specify a Seccomp profile for an individual container:

container.seccomp.security.alpha.kubernetes.io/${container_name}

Value                     Description
runtime/default           the default profile for the container runtime.
unconfined                unconfined profile, disable Seccomp sandboxing.
localhost/<profile-name>  the profile installed to the node’s local seccomp profile root.
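The two annotation keys and the three value forms in the table can be captured in a small helper. This is a sketch under the assumption that only the values listed above are accepted; seccomp_annotation and is_valid_profile_value are hypothetical names, not part of any Kubernetes client.

```python
POD_KEY = "seccomp.security.alpha.kubernetes.io/pod"
CONTAINER_KEY_PREFIX = "container.seccomp.security.alpha.kubernetes.io/"
LOCALHOST_PREFIX = "localhost/"


def is_valid_profile_value(value):
    """Accept runtime/default, unconfined, or localhost/<profile-name>."""
    if value in ("runtime/default", "unconfined"):
        return True
    return value.startswith(LOCALHOST_PREFIX) and len(value) > len(LOCALHOST_PREFIX)


def seccomp_annotation(value, container_name=None):
    """Return a one-entry annotations dict for a Pod or a single container."""
    if not is_valid_profile_value(value):
        raise ValueError(f"unsupported seccomp profile value: {value!r}")
    key = POD_KEY if container_name is None else CONTAINER_KEY_PREFIX + container_name
    return {key: value}


print(seccomp_annotation("unconfined"))
print(seccomp_annotation("runtime/default", container_name="explorer"))
```

The returned dict is meant to be merged into metadata.annotations of a Pod manifest.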

Example:

Here’s an example of a pod that uses a profile called runtime/default using the container-level annotation:

apiVersion: v1
kind: Pod
metadata:
  name: explorer
  annotations:
    container.seccomp.security.alpha.kubernetes.io/explorer: runtime/default
spec:
  containers:
    - name: explorer
      image: gcr.io/google_containers/explorer:1.0
      args: ["-port=8080"]
      ports:
        - containerPort: 8080
          protocol: TCP
      volumeMounts:
        - mountPath: "/mount/test-volume"
          name: test-volume
  volumes:
    - name: test-volume
      emptyDir: {}

3.3 Kubernetes Use custom Seccomp profile

Steps for using a custom Seccomp profile:

3.3.1 Specify the seccomp profile root path on every kubelet worker node

--seccomp-profile-root string   Directory path for seccomp profiles. (default "/var/lib/kubelet/seccomp")

3.3.2 Create a profile that conforms to the BPF rules in the seccomp profile root path

N/A. See Docker's default BPF rules in 2.1 for reference.

3.3.3 Specify the custom profile when creating the container

seccomp.security.alpha.kubernetes.io/pod: localhost/<profile-name>

3.3.4 Example

If you want to use custom profiles (prefixed with localhost/), you have to copy them to all worker nodes in your cluster. The default folder for profiles is /var/lib/kubelet/seccomp.
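The relationship between a localhost/ value and the file on the node can be sketched as a path join against the profile root. This assumes the kubelet resolves the name directly under the root directory, which matches the file paths used in the examples below; local_profile_path is a hypothetical helper.

```python
import posixpath

DEFAULT_PROFILE_ROOT = "/var/lib/kubelet/seccomp"  # default of --seccomp-profile-root
LOCALHOST_PREFIX = "localhost/"


def local_profile_path(value, root=DEFAULT_PROFILE_ROOT):
    """Map "localhost/<profile-name>" to the expected file path on a node."""
    if not value.startswith(LOCALHOST_PREFIX):
        raise ValueError(f"not a localhost profile: {value!r}")
    name = value[len(LOCALHOST_PREFIX):]
    if not name:
        raise ValueError("profile name is empty")
    return posixpath.join(root, name)


print(local_profile_path("localhost/prevent-chmod"))
```

A deployment script could use the result to check that the profile file has been copied to every worker node before creating the Pod.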

Example 1:

Here’s an example of a pod that uses a profile called example-explorer-profile using the container-level annotation:

Seccomp profile: /var/lib/kubelet/seccomp/example-explorer-profile

apiVersion: v1
kind: Pod
metadata:
  name: explorer
  annotations:
    container.seccomp.security.alpha.kubernetes.io/explorer: localhost/example-explorer-profile
spec:
  containers:
    - name: explorer
      image: gcr.io/google_containers/explorer:1.0
      args: ["-port=8080"]
      ports:
        - containerPort: 8080
          protocol: TCP
      volumeMounts:
        - mountPath: "/mount/test-volume"
          name: test-volume
  volumes:
    - name: test-volume
      emptyDir: {}

Example 2: How to prevent the chmod syscall

In this example we spin up two Pods. Both try to change the permissions on a file. The Pod chmod-unconfined carries no Seccomp annotation and exits successfully, while the same command in the Pod chmod-prevented fails, because its Seccomp profile does not allow it.

Seccomp profile: /var/lib/kubelet/seccomp/prevent-chmod

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "name": "chmod",
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

Pod manifests (seccomp-pods.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: chmod-unconfined
spec:
  containers:
  - name: chmod
    image: busybox
    command:
      - "chmod"
    args:
      - "666"
      - /etc/hostname
  restartPolicy: Never
---
apiVersion: v1
kind: Pod
metadata:
  name: chmod-prevented
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: localhost/prevent-chmod
spec:
  containers:
  - name: chmod
    image: busybox
    command:
      - "chmod"
    args:
      - "666"
      - /etc/hostname
  restartPolicy: Never

$ kubectl create -f seccomp-pods.yaml
pod "chmod-unconfined" created
pod "chmod-prevented" created
$ kubectl get pods -a
NAME               READY     STATUS      RESTARTS   AGE
chmod-prevented    0/1       Error       0          8s
chmod-unconfined   0/1       Completed   0          8s
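From the application's point of view, a syscall rejected with SCMP_ACT_ERRNO simply fails with an error code. The probe below is a rough sketch of how a program might observe that; try_chmod is a hypothetical helper. Two assumptions are worth stating: the runtime is assumed to report the rejection as an ordinary errno (commonly EPERM, not verified here for every runtime), and the profile above matches only the syscall named chmod, so a program whose C library issues a different syscall such as fchmodat may not be affected.

```python
import errno
import os
import tempfile


def try_chmod(path, mode):
    """Attempt chmod and return "ok" or the symbolic errno name on failure."""
    try:
        os.chmod(path, mode)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


# Without a seccomp filter in place, changing a file we own succeeds.
with tempfile.NamedTemporaryFile() as tmp:
    print(try_chmod(tmp.name, 0o666))
```

Run inside each of the two Pods, such a probe would be expected to report success in chmod-unconfined and an error name in chmod-prevented.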

Reference:

https://github.com/kubernetes/kubernetes/blob/release-1.4/docs/design/seccomp.md
https://docs.docker.com/engine/security/seccomp/
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
https://github.com/torvalds/linux/tree/master/samples/seccomp
https://blog.jetstack.io/blog/kubernetes-1-3-hidden-gems/
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#seccomp
http://www.selinuxplus.com/?p=370
