Why is requiring that all capabilities be dropped in a Kubernetes PodSecurityPolicy redundant with non-root + disallow privilege escalation?

Asked 15/11, 2018 at 21:23 Answered 27/9, 2019 at 20:54

Solved docker kubernetes containers linux-capabilities

The second example policy in the PodSecurityPolicy documentation contains the following snippet:

...
spec:
  privileged: false
  # Required to prevent escalations to root.
  allowPrivilegeEscalation: false
  # This is redundant with non-root + disallow privilege escalation,
  # but we can provide it for defense in depth.
  requiredDropCapabilities:
    - ALL
...

Why is dropping all capabilities redundant for non-root + disallow privilege escalation? You can have a non-root container process without privilege escalation that still has effective capabilities, right?

It seems like this is not possible with Docker:

$ docker run --cap-add SYS_ADMIN --user 1000 ubuntu grep Cap /proc/self/status
CapInh: 00000000a82425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
CapAmb: 0000000000000000

All effective capabilities have been dropped even when trying to explicitly add them. But other container runtimes could implement it, so is this comment just Docker specific?
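The masks above are easier to reason about once decoded. The following is a minimal sketch, not a replacement for capsh: the helper names decode_caps and parse_cap_lines are made up for illustration, and the table covers only capability bits 0 to 31 as numbered in linux/capability.h.

```python
# Capability names indexed by bit number (bits 0-31 of linux/capability.h).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid",
    "cap_setpcap", "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
]


def decode_caps(hex_mask):
    """Return the capability names whose bits are set in a hex mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def parse_cap_lines(status_text):
    """Decode every 'Cap*:' line of /proc/<pid>/status style text."""
    sets = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("Cap"):
            sets[key] = decode_caps(value.strip())
    return sets


# The output shown in the question, copied verbatim.
QUESTION_OUTPUT = """\
CapInh: 00000000a82425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
CapAmb: 0000000000000000
"""

decoded = parse_cap_lines(QUESTION_OUTPUT)
for set_name, caps in decoded.items():
    print(set_name, ", ".join(caps) if caps else "(empty)")
```

Run against the output above, this shows cap_sys_admin present in the inheritable and bounding sets while the permitted and effective sets are empty.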

Puentes answered 15/11, 2018 at 21:23 Comment(0)

Why is dropping all capabilities redundant for non-root + disallow privilege escalation?

Because a non-root process needs privilege escalation to acquire 'new' capabilities. allowPrivilegeEscalation: false sets the no_new_privs bit, which stops execve from honoring setuid/setgid bits and file capabilities, so the process cannot gain any capabilities it does not already hold.
Also, as shown in the docs: "Once the bit is set, it is inherited across fork, clone, and execve and cannot be unset". More info here.

This in combination with privileged: false renders requiredDropCapabilities: [ALL] redundant.

The equivalent Docker options here are:

--user=whatever => privileged: false
--security-opt=no-new-privileges => allowPrivilegeEscalation: false
--cap-drop=all => requiredDropCapabilities: [ALL]

It seems like this is not possible with Docker

That appears to be what Docker is doing: the moment you specify a non-root user, all of the effective capabilities are dropped (CapEff: 0000000000000000), even if you specify --cap-add SYS_ADMIN.

Combined with --security-opt=no-new-privileges, this renders --cap-drop=all redundant.

Note that the default capability mask for Docker does not include SYS_ADMIN. Compare the default mask with the decoded mask from the question:

$ docker run --rm ubuntu grep Cap /proc/self/status
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
$ capsh --decode=00000000a82425fb
0x00000000a82425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap

The default mask (00000000a80425fb) differs from the one in the question (00000000a82425fb) in a single bit, cap_sys_admin. So --cap-add SYS_ADMIN did reach the inheritable and bounding sets; it just never reached the effective set of the non-root process.
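The two masks in this thread look alike but are not identical, and an XOR makes the difference explicit. This is a small sketch; differing_bits is a hypothetical helper name, and the bit number for CAP_SYS_ADMIN is taken from linux/capability.h.

```python
CAP_SYS_ADMIN_BIT = 21  # value of CAP_SYS_ADMIN in linux/capability.h


def differing_bits(hex_a, hex_b):
    """Return the bit positions at which two hex capability masks differ."""
    diff = int(hex_a, 16) ^ int(hex_b, 16)
    return [bit for bit in range(diff.bit_length()) if diff & (1 << bit)]


default_mask = "00000000a80425fb"   # docker run without --cap-add
question_mask = "00000000a82425fb"  # docker run --cap-add SYS_ADMIN --user 1000

bits = differing_bits(default_mask, question_mask)
print(bits)                         # [21]
print(bits == [CAP_SYS_ADMIN_BIT])  # True
```

The only differing bit is 21, so the mask in the question is the Docker default plus CAP_SYS_ADMIN.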

But other container runtimes could implement it, so is this comment just Docker specific?

I suppose so. Another runtime could leave capabilities in effect even with privileged: false and allowPrivilegeEscalation: false, in which case requiredDropCapabilities would actually drop them (although I don't see why another runtime would want to change the Docker behavior).

Reubenreuchlin answered 16/11, 2018 at 1:27 Comment(7)

why do you need privilege escalation to use capabilities? Capabilities can be inherited from forking without either the parent or child needing to escalate privileges through an exec - if you do need privilege escalation to use capabilities this is no longer Docker specific, it applies to Linux more generally – Puentes 16/11, 2018 at 7:38

Yes, it is a Linux thing. Quoted from the link above: "Once the bit is set, it is inherited across fork, clone, and execve and cannot be unset". So whatever is forked even if it's inheriting capabilities, it wouldn't be able to effectively use them. – Reubenreuchlin 16/11, 2018 at 15:32

That doesn't mean it can't use capabilities, it means the process can't gain more capabilities through an exec - so if a container starts with the capabilities it needs, then setting the no_new_privs bit doesn't prevent it from retaining and using those capabilities, even over forks/clones etc. – Puentes 16/11, 2018 at 15:58

Right, I meant new capabilities, so that combined with privileged: false renders requiredDropCapabilities: ALL redundant. I suppose that's what docker is doing when you specify --user 1000 – Reubenreuchlin 16/11, 2018 at 17:3

Changed the answer a bit. Hopefully, it clarifies. Let me know any other comments/questions. – Reubenreuchlin 16/11, 2018 at 18:26

Thanks - I think the main issue here is what privileged means - here privilege literally just means being able to set privileged: true in a Pod spec and so you can have capabilities and still be non-privileged. I think this only works because Docker drops all effective capabilities of non-root users so combining that with disallowing privilege escalation prevents them from ever gaining any capabilities, but not all runtimes would necessarily do that in which case requiring that all capabilities be dropped when the container starts would actually be different - do you agree? – Puentes 17/11, 2018 at 9:59

Yes, I agree. The only thing is that I can't see why another runtime would want to change this behavior, but it's possible: effective capabilities would then not be dropped with privileged: false + allowPrivilegeEscalation: false, and some would remain that requiredDropCapabilities: <something> could drop. – Reubenreuchlin 17/11, 2018 at 19:3

There are multiple (good) sub-questions inside your question.
I want to focus on the main one:

Why is dropping all capabilities redundant for non-root + disallow privilege escalation?

To make it simpler I think we can focus on the disallow privilege escalation part and simply ask:

What happens behind the scenes when we set allowPrivilegeEscalation: false in a PodSecurityPolicy?

From the K8S docs you can see that "This bool directly controls whether the no_new_privs flag gets set on the container process".

So what happens if this flag is being set?

Quoting from the kernel docs: "When this flag is set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.
For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set".

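In other words, with allowPrivilegeEscalation: false a process can never gain capabilities it does not already hold, and a non-root container process starts with empty permitted and effective sets (CapPrm and CapEff are zero in the question's output).

This is why adding this part is considered redundant: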

 requiredDropCapabilities:
    - ALL
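The kernel rule quoted above can be illustrated with a toy model of how execve recomputes capabilities. This is a simplified sketch under stated assumptions, not kernel code: it follows the permitted-set formula from capabilities(7) for a non-root process, assumes an empty ambient set, ignores securebits and setuid handling, and models no_new_privs as "the new permitted set may not exceed the old one". The function name caps_after_execve is invented for the example.

```python
CAP_NET_RAW = 1 << 13           # bit 13 in linux/capability.h
DOCKER_DEFAULT = 0xA80425FB     # default mask shown earlier in the thread


def caps_after_execve(permitted, inheritable, bounding,
                      file_permitted, file_inheritable, file_effective,
                      no_new_privs):
    """Toy model: (permitted, effective) of a non-root process after execve."""
    # capabilities(7): P'(permitted) = (P(inh) & F(inh)) | (F(perm) & P(bounding))
    new_permitted = (inheritable & file_inheritable) | (file_permitted & bounding)
    if no_new_privs:
        # Assumption: "file capabilities will not add to the permitted set"
        # is modelled by capping the result at the old permitted set.
        new_permitted &= permitted
    # The file effective bit decides whether permitted becomes effective.
    new_effective = new_permitted if file_effective else 0
    return new_permitted, new_effective


# A non-root process (empty permitted set) executes a binary that carries
# cap_net_raw as a permitted and effective file capability.
allowed = caps_after_execve(0, DOCKER_DEFAULT, DOCKER_DEFAULT,
                            CAP_NET_RAW, 0, True, no_new_privs=False)
blocked_by_nnp = caps_after_execve(0, DOCKER_DEFAULT, DOCKER_DEFAULT,
                                   CAP_NET_RAW, 0, True, no_new_privs=True)
blocked_by_drop_all = caps_after_execve(0, 0, 0,
                                        CAP_NET_RAW, 0, True, no_new_privs=False)

print(allowed == (CAP_NET_RAW, CAP_NET_RAW))  # escalation succeeds
print(blocked_by_nnp == (0, 0))               # no_new_privs stops it
print(blocked_by_drop_all == (0, 0))          # an empty bounding set stops it too
```

In this model either measure alone blocks the escalation path for a process that starts without capabilities, which is the sense in which the policy comment calls dropping ALL redundant.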

I hope this simplifies things a bit.

I think the answers to the other questions are very clear in the accepted answer, and I have nothing to add to them.

Note: If you're running a kernel >= 4.10, you can see the value of a thread's no_new_privs attribute in the /proc/[pid]/status file, right below the capability fields:

.
.
CapInh: 00000000a82425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
CapAmb: 0000000000000000
NoNewPrivs: 0 <-----
.
.
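Reading that field programmatically is straightforward. A minimal sketch, assuming the status text has the "Key: value" layout shown above; read_no_new_privs is a hypothetical helper name, and the sample text is made up for the demonstration.

```python
def read_no_new_privs(status_text):
    """Return True/False for the NoNewPrivs field, or None if it is absent."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "NoNewPrivs":
            fields = value.split()
            # Only the first token counts; trailing annotations are ignored.
            return bool(fields) and fields[0] == "1"
    return None  # field missing, e.g. on kernels older than 4.10


# Illustrative sample, not captured from a real container.
SAMPLE_STATUS = """\
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
NoNewPrivs: 1
"""

print(read_no_new_privs(SAMPLE_STATUS))
```

On a Linux host, passing the contents of /proc/self/status (for example via open("/proc/self/status").read()) reports the flag for the current process.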

Roa answered 27/9, 2019 at 20:54 Comment(0)
