Create timestamp with fractional seconds

awk can generate a timestamp with the strftime function, e.g.

$ awk 'BEGIN {print strftime("%Y/%m/%d %H:%M:%S")}'
2019/03/26 08:50:42

But I need a timestamp with fractional seconds, ideally down to nanoseconds. GNU date can do this with the %N element:

$ date "+%Y/%m/%d %H:%M:%S.%N"
2019/03/26 08:52:32.753019800

But it is relatively inefficient to invoke date from within awk compared to calling strftime, and I need high performance as I'm processing many large files with awk and need to generate many timestamps while processing the files. Is there a way that awk can efficiently generate a timestamp that includes fractional seconds (ideally nanoseconds, but milliseconds would be acceptable)?

Adding an example of what I am trying to perform:

awk -v logFile="$logFile" -v outputFile="$outputFile" '
BEGIN {
    # FILENAME is still empty inside BEGIN, so take the name from ARGV[1]
    print "[" strftime("%Y%m%d %H%M%S") "] Starting to process " ARGV[1] "." >> logFile
}
{
    data[$1] += $2
}
END {
    print "[" strftime("%Y%m%d %H%M%S") "] Processed " NR " records." >> logFile
    for (id in data) {
        print id ": " data[id] >> outputFile
    }
}
' oneOfManyLargeFiles
Deirdra answered 26/3, 2019 at 13:58 Comment(10)
In fact, deciseconds, centiseconds, or milliseconds would help, as nanoseconds may be a little extreme. But I'm printing the timestamps and a summary of the data that has been processed thus far, and sub-second granularity would be very beneficial in monitoring the process. – Deirdra
If you are processing a large file, do you really think it is relevant to know whether it took 123.124 seconds or just 124 seconds? This just does not make any sense. Also, you might be interested in the Unix command time, which seems to do exactly what you are interested in. – Siemens
Also, any answer to this question is meaningless. You want timing that is as accurate as possible to get an idea of how fast your processing went. By calling date in awk, you slow things down by doing two system calls. By opening /proc/uptime alone, you already waste 2 ms. Really consider whether your task needs subsecond accuracy and, if it does, whether time would not be good enough. – Siemens
Well, not all files are large, and many are processed in less than a second. Also, the specific code is a little contrived to provide a simple, concrete example. So yes, it is relevant to know down to a fraction of a second what point awk is at when it writes an event to the log. What I'm seeing from the answers and comments is that awk cannot do this with internal functions. Internal functions, e.g. strftime, can only get a granularity of seconds. – Deirdra
@RustyLemur are you trying to find the bottleneck in your code to optimize? – Siemens
It's not so much for optimization as for reporting and general understanding of the time requirements for my process. – Deirdra
@EdMorton, this is why I was hoping awk could use strftime to get sub-second granularity, since it should be much faster than invoking an external tool, e.g. date. Even though calling strftime alone would cause some delay, it is assumed to be significantly faster than calling date. – Deirdra
@RustyLemur, isn't it good enough to print it up to one second? If subsequent steps are all done within the same second, you know the timing is irrelevant. If a particular step takes more than a single second, you know that the fractional part is irrelevant. Or are you trying to add the deltaT's of various steps into an array to see how much time a particular step took for the complete file? – Siemens
This awk component is part of a large application, and it's beneficial to know how much delay is being produced from input received to output produced. Once everything is finalized, the timestamp reporting will probably be relaxed to seconds granularity, but during benchmarks and performance testing, it's good to know if it takes 15 milliseconds versus 1 second to process a file. For this particular application, I can allow the external shell to record the timestamps before invoking awk and after it exits, but then I can't have awk print specific timestamps during the processing. – Deirdra
@EdMorton, thanks for the benchmark. It looks like my assumption that calling the external command would be slower may be incorrect. – Deirdra

If you really need subsecond timing, then any call to an external command such as date, or any read of an external system file such as /proc/uptime or /proc/rct, defeats the purpose of subsecond accuracy. Both approaches require too many resources to retrieve the requested information (i.e. the time).

Since the OP already uses GNU awk, a dynamic extension is an option. Dynamic extensions add new functionality to awk through functions written in C or C++ that gawk loads dynamically. How to write these functions is documented extensively in the GNU awk manual.

Luckily, GNU awk 4.2.1 comes with a set of default dynamic libraries which can be loaded at will. One of these libraries is a time library with two simple functions:

the_time = gettimeofday()
    Return the time in seconds that has elapsed since 1970-01-01 UTC as a floating-point value. If the time is unavailable on this platform, return -1 and set ERRNO. The returned time should have sub-second precision, but the actual precision may vary based on the platform. If the standard C gettimeofday() system call is available on this platform, then it simply returns the value. Otherwise, if on MS-Windows, it tries to use GetSystemTimeAsFileTime().

result = sleep(seconds)
    Attempt to sleep for seconds seconds. If seconds is negative, or the attempt to sleep fails, return -1 and set ERRNO. Otherwise, return zero after sleeping for the indicated amount of time. Note that seconds may be a floating-point (nonintegral) value. Implementation details: depending on platform availability, this function tries to use nanosleep() or select() to implement the delay.

source: GNU awk manual

It is now possible to call this function in a rather straightforward way:

$ awk '@load "time"; BEGIN { printf "%.6f\n", gettimeofday() }'
1553637193.575861

To demonstrate that this method is faster than the more classic implementations, I timed all three approaches using gettimeofday():

awk '@load "time"
     function get_uptime(   a, line) {
        # return the first field of /proc/uptime (seconds since boot)
        if ((getline line < "/proc/uptime") > 0)
           split(line, a, " ")
        close("/proc/uptime")
        return a[1]
     }
     function curtime(    cmd, line, time) {
        cmd = "date \047+%Y/%m/%d %H:%M:%S.%N\047"
        if ( (cmd | getline line) > 0 ) {
           time = line
        }
        else {
           print "Error: " cmd " failed" | "cat>&2"
        }
        close(cmd)
        return time
      }
      BEGIN{
        t1=gettimeofday(); curtime(); t2=gettimeofday();
        print "curtime()",t2-t1
        t1=gettimeofday(); get_uptime(); t2=gettimeofday();
        print "get_uptime()",t2-t1
        t1=gettimeofday(); gettimeofday(); t2=gettimeofday();
        print "gettimeofday()",t2-t1
      }'

which outputs:

curtime() 0.00519109
get_uptime() 7.98702e-05
gettimeofday() 9.53674e-07

While it is evident that curtime() is the slowest as it loads an external binary, it is rather startling to see that awk is blazingly fast in processing an extra external /proc/ file.

Siemens answered 26/3, 2019 at 22:9 Comment(2)
Some of the function calls in the benchmark program are incorrect: getimeofday should be gettimeofday. – Matos
Also, some earlier versions of GNU awk have the loadable time library. At least 4.1.4 seems to have it. The feature seems to have been added in 2012: git.savannah.gnu.org/cgit/gawk.git/commit/… – Matos

If you are on Linux, you could use /proc/uptime:

$ cat /proc/uptime 
123970.49 354146.84

to get some centiseconds (the first value is the uptime) and compute the time difference between the beginning and whenever something happens:

$ while true ; do echo ping ; sleep 0.989 ; done |        # yes | awk got confusing
awk '
function get_uptime(   a, line) {
    if((getline line < "/proc/uptime") > 0)
        split(line,a," ")
    close("/proc/uptime")
    return a[1]
}
BEGIN {
    basetime=get_uptime()                    
}
{
    if(!wut)                                 # define here the cause
        print get_uptime()-basetime          # calculate time difference
}'

Output:

0
0.99
1.98
2.97
3.97
Amin answered 26/3, 2019 at 15:56 Comment(4)
Why not put line as a parameter for get_uptime(), too? – Matos
@Matos That was just a POC but I don't see a reason not to, except that the getline and close would need to be in 2 places and function get_uptime() would kind of lose its meaning. – Amin
No, but it would make line a local variable like a, if that matters. – Matos
@Matos Oh yeah, that's right. Fixin' without testing. :D Thanks. – Amin
