Reliable Linux process identification, pt 2/3: Ideas

Multicore world – does it complicate things?

Current chips, or at least their main computing units (CPUs), are made up of many cores. The full reasons are beyond the scope of this article, but the main ones are the total throughput of a processor and thermal limits (physical heat generation). Lowering the voltage eases the thermal problem, but at the same time signals weaken, and the processor's ability to relay information precisely is put at risk.

With a multicore processor you can keep some parts of the chip physically resting (idle or shut down) and run the necessary workload on only part of the processor.

Kernel is just… a process!

Now to blow your mind, I’ll reveal a major surprise: the kernel itself is a process! I don’t want to make your brain unnecessarily wobbly, but there are interesting academic distinctions to be made in defining the very qualia of a process.

The kernel’s role is to be a reliable “master control program”. The kernel is trusted to shepherd other processes. For that, the hardware has to provide protection rings: different privilege levels at which code runs on the CPU. Otherwise the kernel wouldn’t actually be able to protect the computing environment.

Let’s go back. Imagine a very simple computer that cannot be reprogrammed. This kind of computer runs a single fixed program. Once it’s loaded, it cannot be changed. Although at first this sounds arcane, there are lots of fixed-program embedded systems. They’re often in consumer devices, like music players, DVD/Blu-ray or other media players, and in a ton of other application areas more concerned with industrial control.

Back to traditional operating systems like Linux or Windows.

Ok, a process is nothing but an executable (binary) file loaded into RAM and treated as code.

The earliest experiments I did were .COM files under MS-DOS. COM files were special, simple executables, limited to just under 64 kilobytes: the whole program had to fit into a single 65536-byte segment together with a 256-byte prefix, which leaves 65280 bytes. COM programs could not dictate any specific conditions to the loader, whereas the more sophisticated .EXE files had a header area, which told the operating system how to handle the execution.

Regardless of the steps taken, an executable image gets loaded and placed into a block of RAM that is marked as executable code. The kernel stores information about the process in its own tables.
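To get a feel for what those kernel tables hold, here is a minimal sketch that reads the text view Linux exposes under /proc. It assumes a Linux system with procfs mounted; the helper names parse_status and read_status are made up for this illustration.

```python
import os


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def read_status(pid="self"):
    # Linux-only: procfs exposes the kernel's per-process record as text.
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    me = read_status()
    # Name, PID, parent PID and scheduler state of this very script.
    print(me["Name"], me["Pid"], me["PPid"], me["State"])
```

All values come back as strings; converting "Pid" or "PPid" to int is left to the caller.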

Then execution starts. As we’re talking about multitasking operating systems, a process gets its slice of CPU time only periodically. A process is therefore often not actively “on the run” 100% of the time.

If you think about computers, they’re not alive. We do however use language that implies processes are living beings. We ‘kill’ a process in Linux.
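A rough way to see this from inside a process is to compare the CPU time it was charged with the wall-clock time that passed. The sketch below is only an illustration: cpu_share is a hypothetical helper, and sleeping means giving up the CPU voluntarily, which is not the same as being preempted by the scheduler, although the accounting looks similar.

```python
import time


def cpu_share(work):
    """Run work() and return (cpu_seconds, wall_seconds) for this process."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    work()
    return time.process_time() - cpu0, time.perf_counter() - wall0


# While sleeping, the process exists but is not running on any core.
cpu, wall = cpu_share(lambda: time.sleep(0.2))
print(f"on the CPU for {cpu:.3f}s out of {wall:.3f}s elapsed")
```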

In Linux, the file format of executables is called ELF (Executable and Linkable Format).

What about those .sh and .py files and what not? They are not executables per se. They are scripts, which are loaded through an interpreter, whereas the code in ELF files goes straight onto the “factory line” of the CPU. The instruction pointer (IP) register is pointed at the entry point recorded in the ELF file, which is the first instruction to run. The ELF file may contain other “stuff” too.
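The difference is visible in the very first bytes of a file. The following sketch tells the two apart and pulls the entry point address out of an ELF header. The function names are invented for this example, and it only looks at the fixed header fields (e_ident and e_entry); it is not a full ELF parser.

```python
import struct

ELF_MAGIC = b"\x7fELF"


def classify(header):
    """Guess how a file would be started, judging by its first bytes."""
    if header[:4] == ELF_MAGIC:
        return "elf"
    if header[:2] == b"#!":
        return "script"
    return "unknown"


def elf_entry_point(header):
    """Return e_entry, the address where execution starts."""
    if classify(header) != "elf":
        raise ValueError("not an ELF header")
    is_64bit = header[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    fmt = endian + ("Q" if is_64bit else "I")
    # e_entry follows e_ident (16 bytes), e_type (2), e_machine (2), e_version (4).
    return struct.unpack_from(fmt, header, 24)[0]
```

To try it, read the first 64 bytes of a binary such as /bin/ls (opened in "rb" mode) and pass them to both functions; a shell script should come back as "script".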

Back on track: identifying processes through string-based, PID-based, or combinations thereof

A process could be identified by

  • the process name in a process listing
  • a fingerprint

PID is not a good way to identify a process

Notably, the PID (a number) cannot be used to reliably identify a process. PIDs are just temporary serial numbers given by the Linux kernel to processes, as they are started. PIDs can be reused, after a process has died and thus released its PID.
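One common workaround, shown here only as a sketch and not necessarily the approach this series ends up with, is to pair the PID with the process start time. On Linux the start time is field 22 of /proc/<pid>/stat, counted in clock ticks since boot. The names parse_stat and process_key are hypothetical.

```python
def parse_stat(line):
    """Extract (pid, comm, starttime) from one /proc/<pid>/stat line."""
    # comm sits in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    lparen, rparen = line.index("("), line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    rest = line[rparen + 1:].split()   # rest[0] is field 3 (state)
    starttime = int(rest[19])          # field 22: start time in clock ticks
    return pid, comm, starttime


def process_key(pid):
    # Linux-only. A recycled PID will almost always come with a different
    # start time, so the pair is a better key than the PID alone.
    with open(f"/proc/{pid}/stat") as f:
        found_pid, _comm, starttime = parse_stat(f.read())
    return found_pid, starttime
```

If the key you stored earlier no longer matches process_key(pid), the original process is gone and the number now belongs to someone else.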

What about fingerprinting processes?

A fingerprint is basically the output of an algorithm that takes an array of bytes (content) from the process memory as its input.

Example: calculate a hash from an array of bytes. A hash function is well suited for fingerprinting:

  • relatively fast to compute
  • one-way
  • you can represent larger parts in short, fixed size hash values

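What could we do for a process? Let’s calculate a hash over the first 512 bytes of the process image, and store <id,sum> pairs to identify processes. Ideally, each distinct process image gets its own sum. Theoretically, the chance of two different inputs being mistaken for each other is slim: for a well-behaved n-bit hash it is about one in 2^n (one in 2^256 for SHA-256, for example), whatever the size of the input. An astronomically small chance of error. What about reality? That depends on where in the process you sample the 512 bytes: two processes whose sampled bytes are identical get the same sum.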
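Here is a minimal sketch of the idea, assuming SHA-256 as the hash (the choice of hash is mine, for illustration) and plain byte strings standing in for process images. It also shows why the sampling location matters.

```python
import hashlib

SAMPLE_SIZE = 512  # bytes, as in the text


def fingerprint(data, size=SAMPLE_SIZE):
    """SHA-256 hex digest of the first `size` bytes of `data`."""
    return hashlib.sha256(data[:size]).hexdigest()


def fingerprint_file(path, size=SAMPLE_SIZE):
    """Fingerprint the first `size` bytes of a file on disk."""
    with open(path, "rb") as f:
        return fingerprint(f.read(size), size)


# Two made-up "images" that share their first 512 bytes.
a = b"\x7fELF" + bytes(508) + b"program A"
b = b"\x7fELF" + bytes(508) + b"program B"
print(fingerprint(a) == fingerprint(b))   # True: they differ only after byte 512
```

On Linux, fingerprint_file(f"/proc/{pid}/exe") would be one way to sample the on-disk executable behind a process, given sufficient permissions. That samples the file rather than live memory, which is a simplification.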

Simply the process name as identity?

Using the process name as an identifier is pretty trivial. In Linux, the process name is by default the same as the “bare” executable name. The question naturally arises: is the name of a process reliable?
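As a small preview of what "bare" means in practice, this sketch derives the default name from an executable path and reads the current name back from /proc/<pid>/comm. It assumes Linux, where the kernel's name field holds at most 15 visible characters; the helper names are invented for the example.

```python
import os

COMM_MAX = 15  # visible length of the kernel's comm field (16 bytes with the NUL)


def expected_comm(exe_path):
    """Default name the kernel shows for a program started from exe_path."""
    return os.path.basename(exe_path)[:COMM_MAX]


def read_comm(pid="self"):
    # Linux-only: current name of the process (strictly, of its main thread).
    with open(f"/proc/{pid}/comm") as f:
        return f.read().rstrip("\n")


print(expected_comm("/usr/bin/gnome-settings-daemon"))   # gnome-settings-
```

Note that two different long names can end up identical after truncation.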

Let’s try the tools in Part 3! Stay tuned.
