Conting events between sequential stages in a process using R
Asked Answered
G

1

0

I've been trying to resolve an exercise from a textbook where I am faced with the challenge of count different events between sequential stages of an industrial process.

Information related to the process: A test Subject is made to go through a 3 stage process with stages A, B and C respectively the first one being A, second B and finally C; a test Subject may abandon the process at stages A or B and then start again from point A, each time the process takes place a dataset is created with the IDENTIFICATION of the test subject, the TIMESTAMP in which the stage took place and a unique VISIT_CODE During any stage, a test subject may trigger an "ALERT" and this will be recorded with the TIMESTAMP, ALERT_CODE and test subject IDENTIFICATION.

What is to be calculated: I have to create a code in R to count how many ALERTS where generated by the test subjects between stages A and B, between stages B and C and finally how many ALERTS where generated after C. Please note that a test subject may at some point abandon the process to later on start again from point A.

The textbook gives a HINT: "Take a closer look at the stage a test subject is currently to then determine if the ALERT is generated from stage A and before a stage B and also if the test subject triggers an ALERT after stage B and before stage C, but keep in mind that if a test subject abandons at stage A and triggers an ALERT if the TIMESTAMP of that ALERT is smaller than their next attempt at stage A then it should be credited as an ALERT after a stage A"

as another hint, the textbook reveals that the ALERTS after Stage C where only 1 and that it was trigger by test subject W-6 with ALTER_CODE AYUJ-3915716168. The datasets are:

Stage Process

TableA<-tribble(~STAGE, ~TEST_SUBJECT,~TIMESTAMP,~VISIT_CODE,
"A",    "XYU-1",    "10",   "BKO",
"A",    "XYU-1",    "15",   "JUJD",
"B",    "XYU-1",    "20",   "DUDH",
"A",    "FF-09",    "25",   "KSIWJD",
"B",    "FF-09",    "30",   "AJAKAM",
"C",    "FF-09",    "35",   "ZISKS",
"A",    "UU-89",    "40",   "NNXJD",
"B",    "UU-89",    "45",   "DDUWO",
"A",    "I-44", "50",   "JIWIW",
"A",    "W-6",  "55",   "SHDN",
"B",    "W-6",  "60",   "IWOLS",
"C",    "W-6",  "65",   "JDDD",
"A",    "U-90", "70",   "DJDKSMS",
"B",    "U-90", "75",   "NDJSM",
"A",    "T-87", "80",   "DNDJDK")

Alerts dataset

TableB<-tribble(~TEST_SUBJECT,~TIMESTAMP,~ALERT_CODE,
"XYU-1",    "11",   "AYUJ-151571406",
"XYU-1",    "12",   "AYUJ-487008829",
"XYU-1",    "28",   "AYUJ-211990388",
"FF-09",    "32",   "AYUJ-4177221842",
"W-6",  "56",   "AYUJ-1300211351",
"W-6",  "63",   "AYUJ-3014305494",
"I-44", "67",   "AYUJ-4454800551",
"U-90", "73",   "AYUJ-1079921935",
"U-90", "76",   "AYUJ-3348911727",
"U-90", "79",   "AYUJ-2381219626",
"T-87", "82",   "AYUJ-4778326278",
"W-6",  "89",   "AYUJ-3915716168")

SOLUTION:

The textbook states that the proper solution for this problem is:

Alerts between Stages A & B including alerts from test subjects that abandoned the process in the attempt nth at stage A Alerts between Stages B & C including alerts from test subjects that abandoned the process in the attempt nth at stage B Alerts after stage C
AYUJ-151571406 AYUJ-211990388 AYUJ-3915716168
AYUJ-487008829 AYUJ-3014305494
AYUJ-1300211351 AYUJ-3348911727
AYUJ-1079921935 AYUJ-4177221842
AYUJ-4778326278 AYUJ-2381219626
AYUJ-4454800551

What have I done? : I have tried grouping and "joining" the data by TEST_SUBJECT and stage and timestamp to make the counts but I and finding a really hard to set the conditions to make the code count events between stages because I am finding difficult to corelate which events took plave before and after each timespan by test subject and stage.

Thank you so much to all of you talented and cool people for your help or recomandations

Gladwin answered 10/7, 2021 at 19:12 Comment(2)
your sample data from the alert-table seems to be missing a column?Franckot
Hello! Thank you for pointing that out I have fixed it! Thank you so muchGladwin
F
1

Here is a data.table apprioach, resulting in a list of alerts after a-b-c..

library(data.table)
# Make tables data.table format
setDT(TableA)
setDT(TableB)
# set TiMESTAP to numeric
TableA[, TIMESTAMP := as.numeric(TIMESTAMP)]
TableB[, TIMESTAMP := as.numeric(TIMESTAMP)]
# Create data.table with Stage intervals by test subject
DT.interval <- TableA[, .(start = min(TIMESTAMP)), by = .(TEST_SUBJECT, STAGE)]
# Perform rolling join
TableB[, Stage := DT.interval[TableB, 
                              STAGE, 
                              on = .(TEST_SUBJECT, start = TIMESTAMP), 
                              roll = Inf]][]
# Split alerts by stage
split(TableB[,3:4], by = "Stage")
# $A
#         ALERT_CODE Stage
# 1:  AYUJ-151571406     A
# 2:  AYUJ-487008829     A
# 3: AYUJ-1300211351     A
# 4: AYUJ-4454800551     A
# 5: AYUJ-1079921935     A
# 6: AYUJ-4778326278     A
# 
# $B
#         ALERT_CODE Stage
# 1:  AYUJ-211990388     B
# 2: AYUJ-4177221842     B
# 3: AYUJ-3014305494     B
# 4: AYUJ-3348911727     B
# 5: AYUJ-2381219626     B
# 
# $C
#         ALERT_CODE Stage
# 1: AYUJ-3915716168     C
Franckot answered 12/7, 2021 at 8:37 Comment(4)
I have no idea as to how this code solves the problem but all credit goes to you sir! 10/10 I guess data.table is the future and I have to use it since I was not able to come up with a solution using TidyverseGladwin
What parts of the code do you have questions about? Key is the rolling join, where for every Alert in TableB, there is looked for the most recent Timestamp with the same test_subject in the interval-table.Franckot
thats amazing thank you so F... much thats one of the coolest functions I have heard tthis yar it seems as though data.table is a lot more powerful than tidyverse, I am an R student I will be so happy if you could recommend the packages that you beleieve are best for data wrangglingGladwin
Useful packages I seem to use a lot are: data.table, tidyverse-packages, sf, igraph, fuzzyjoin, leaflet, openxlsx. But what I'll use just depends on the job at hand. data.table versus tidyverse; most of the time you can tackle most problems with both packages. In gereral, data.table has a bit of a learning curve on start, and is (often) faster when working om large datasets. Tidyverse results in very readable code, even for non-R programmers.Franckot

© 2022 - 2024 — McMap. All rights reserved.