How can duplicates be removed from a file using COBOL?

The input file has records such as: 8712351,8712353,8712353,8712354,8712356,8712352,8712355,8712352,8712355

Using COBOL, I need to remove duplicates from the above file and write to an output file. I wrote simple logic to read records and write to an output file.

Where do I need to put the logic of removing duplicates (say, 8712353, 8712352) from the above file?

Here is the program logic:

   IDENTIFICATION DIVISION.
   PROGRAM-ID. RemoveDup.
   ENVIRONMENT DIVISION.
   INPUT-OUTPUT SECTION.
   FILE-CONTROL.
   SELECT INPUTFILEDUP ASSIGN TO 'C:\Cobol\INPUTFILEDUP.txt'
           ORGANIZATION IS LINE SEQUENTIAL.
   SELECT OUTFILEDUP ASSIGN TO 'C:\Cobol\OUTFILEDUP.txt'
               ORGANIZATION IS LINE SEQUENTIAL.

   DATA DIVISION.

   FILE SECTION.
   FD INPUTFILEDUP.
   01 INPUTFILEDUPREC.
       88 EOFINPUTFILEDUP    VALUE HIGH-VALUES.
       02 INPUTFILEID        PIC 9(07).

   FD  OUTFILEDUP.
   01 OUTFILEDUPREC         PIC 9(07).

   WORKING-STORAGE SECTION.
   77 WS-VARIABLE            PIC 9(09).
   77 REC-NOT-MATCH          PIC 9(01).
   77 CUR-VARIABLE           PIC 9(09).

   PROCEDURE DIVISION.
   BEGIN.
       OPEN INPUT  INPUTFILEDUP
       OPEN OUTPUT OUTFILEDUP

       *> Priming read, then copy each record until end of file.
       READ INPUTFILEDUP
           AT END SET EOFINPUTFILEDUP TO TRUE
       END-READ
       PERFORM UNTIL EOFINPUTFILEDUP
           WRITE OUTFILEDUPREC FROM INPUTFILEID
           READ INPUTFILEDUP
               AT END SET EOFINPUTFILEDUP TO TRUE
           END-READ
       END-PERFORM

       CLOSE INPUTFILEDUP
       CLOSE OUTFILEDUP
       STOP RUN.

I sorted the input file in ascending order as:

8712351,8712352,8712352,8712353,8712353,8712354,8712355,8712355,8712356

That worked; the modified code is below.

But suppose my file is in neither ascending nor descending order. Where do I need to put the sort logic so that it runs before the duplicates are removed? How can I update the code below for this? I tried, but did not succeed when the input file looks like this:

8712351,8712353,8712353,8712354,8712356,8712352,8712355,8712352,8712355

   IDENTIFICATION DIVISION.
   PROGRAM-ID. RemoveDup2.
   ENVIRONMENT DIVISION.
   INPUT-OUTPUT SECTION.
   FILE-CONTROL.
   SELECT INPUTFILEDUP ASSIGN TO 'C:\Cobol\INPUTFILEDUP.txt'
           ORGANIZATION IS LINE SEQUENTIAL.
   SELECT OUTFILEDUP ASSIGN TO 'C:\Cobol\OUTFILEDUP.txt'
               ORGANIZATION IS LINE SEQUENTIAL.

   DATA DIVISION.

   FILE SECTION.
   FD INPUTFILEDUP.
   01 INPUTFILEDUPREC.
       88 EOFINPUTFILEDUP    VALUE HIGH-VALUES.
       02 INPUTFILEID        PIC 9(07).

   FD  OUTFILEDUP.
   01 OUTFILEDUPREC         PIC 9(07).

   WORKING-STORAGE SECTION.
   77 WS-VARIABLE            PIC 9(09) VALUE ZERO.
   77 REC-NOT-MATCH          PIC 9(01).
   77 CUR-VARIABLE           PIC 9(7) VALUE ZERO.

   PROCEDURE DIVISION.
   BEGIN.
       OPEN INPUT  INPUTFILEDUP
       OPEN OUTPUT OUTFILEDUP

       READ INPUTFILEDUP
           AT END SET EOFINPUTFILEDUP TO TRUE
       END-READ
       PERFORM UNTIL EOFINPUTFILEDUP
           *> Write a record only when its ID differs from the last one written.
           IF INPUTFILEID NOT EQUAL TO WS-VARIABLE
               MOVE INPUTFILEID TO WS-VARIABLE
               WRITE OUTFILEDUPREC FROM INPUTFILEID
           ELSE
               DISPLAY "DUPLICATE FOUND " INPUTFILEID
           END-IF
           READ INPUTFILEDUP
               AT END SET EOFINPUTFILEDUP TO TRUE
           END-READ
       END-PERFORM

       CLOSE INPUTFILEDUP
       CLOSE OUTFILEDUP
       STOP RUN.
Kentiga answered 18/11, 2009 at 18:55 Comment(1)
WOW new favorite tag! :) Question about the data from which you are removing duplicates: are the numbers such as 8712351 all going to occur within a relatively compact range, such as 8700000-8800000? Or is it possible for numbers to vary from 1-N over an enormous range? - Disfavor

Finally it worked.

Here is the code:

   IDENTIFICATION DIVISION.
   PROGRAM-ID. RemoveDup2.
   ENVIRONMENT DIVISION.
   INPUT-OUTPUT SECTION.
   FILE-CONTROL.
   SELECT INPUTFILEDUP ASSIGN TO 'C:\Cobol\INPUTFILEDUP.txt'
           ORGANIZATION IS LINE SEQUENTIAL.
   SELECT OUTFILEDUP ASSIGN TO 'C:\Cobol\OUTFILEDUP.txt'
               ORGANIZATION IS LINE SEQUENTIAL.
   SELECT WorkFile ASSIGN TO "WORK.TMP".

   DATA DIVISION.

   FILE SECTION.
   FD INPUTFILEDUP.
   01 INPUTFILEDUPREC.
       88 EOFINPUTFILEDUP    VALUE HIGH-VALUES.
       02 INPUTFILEID        PIC 9(07).

   FD  OUTFILEDUP.
   01 OUTFILEDUPREC         PIC 9(07).

   SD WorkFile.
   01 WORKREC.
      02 WINPUTFILEID       PIC 9(07).

   WORKING-STORAGE SECTION.
   77 WS-VARIABLE            PIC 9(09) VALUE ZERO.
   77 REC-NOT-MATCH          PIC 9(01).
   77 CUR-VARIABLE           PIC 9(7) VALUE ZERO.

   PROCEDURE DIVISION.
   BEGIN.
       *> Sort the input file in place so that duplicates become adjacent.
       SORT WorkFile ON ASCENDING KEY WINPUTFILEID
           USING INPUTFILEDUP GIVING INPUTFILEDUP

       OPEN INPUT  INPUTFILEDUP
       OPEN OUTPUT OUTFILEDUP

       READ INPUTFILEDUP
           AT END SET EOFINPUTFILEDUP TO TRUE
       END-READ
       PERFORM UNTIL EOFINPUTFILEDUP
           *> Write a record only when its ID differs from the last one written.
           IF INPUTFILEID NOT EQUAL TO WS-VARIABLE
               MOVE INPUTFILEID TO WS-VARIABLE
               WRITE OUTFILEDUPREC FROM INPUTFILEID
           ELSE
               DISPLAY "DUPLICATE FOUND    " INPUTFILEID
           END-IF
           READ INPUTFILEDUP
               AT END SET EOFINPUTFILEDUP TO TRUE
           END-READ
       END-PERFORM

       CLOSE INPUTFILEDUP
       CLOSE OUTFILEDUP

       STOP RUN.
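
As a possible variation on the program above, the sort and the duplicate check can be combined by giving SORT an OUTPUT PROCEDURE, so the input file is left untouched and the sorted records are read back with RETURN instead of a second pass over the file. This is only a sketch: the program name, file names, and the PREV-ID, FIRST-REC-SW and EOF-SW fields are made up for illustration, and it assumes a compiler that accepts COBOL-85 style paragraphs as output procedures (GnuCOBOL, for example).

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SortDedup.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
      * File names below are placeholders, not the asker's paths.
           SELECT INFILE ASSIGN TO 'input.txt'
               ORGANIZATION IS LINE SEQUENTIAL.
           SELECT OUTFILE ASSIGN TO 'unique.txt'
               ORGANIZATION IS LINE SEQUENTIAL.
           SELECT WORKFILE ASSIGN TO 'WORK.TMP'.
       DATA DIVISION.
       FILE SECTION.
       FD  INFILE.
       01  INREC                PIC 9(07).
       FD  OUTFILE.
       01  OUTREC               PIC 9(07).
       SD  WORKFILE.
       01  WORKREC.
           02 WORK-ID           PIC 9(07).
       WORKING-STORAGE SECTION.
      * Hypothetical helper fields: last ID written and two switches.
       01  PREV-ID              PIC 9(07) VALUE ZERO.
       01  FIRST-REC-SW         PIC X VALUE 'Y'.
           88 FIRST-REC         VALUE 'Y'.
       01  EOF-SW               PIC X VALUE 'N'.
           88 EOF-WORK          VALUE 'Y'.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * SORT reads INFILE itself and hands sorted records
      * to WRITE-UNIQUE one at a time via RETURN.
           SORT WORKFILE ON ASCENDING KEY WORK-ID
               USING INFILE
               OUTPUT PROCEDURE IS WRITE-UNIQUE
           STOP RUN.
       WRITE-UNIQUE.
           OPEN OUTPUT OUTFILE
           PERFORM UNTIL EOF-WORK
               RETURN WORKFILE
                   AT END
                       SET EOF-WORK TO TRUE
                   NOT AT END
      * Write only the first record of each run of equal IDs.
                       IF FIRST-REC OR WORK-ID NOT = PREV-ID
                           WRITE OUTREC FROM WORK-ID
                           MOVE WORK-ID TO PREV-ID
                           MOVE 'N' TO FIRST-REC-SW
                       END-IF
               END-RETURN
           END-PERFORM
           CLOSE OUTFILE.
```

The FIRST-REC switch is there so that an ID of zero in the first record is not mistaken for a duplicate of the initial PREV-ID value.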
Kentiga answered 18/11, 2009 at 20:44 Comment(0)

When Organization is Sequential, the record deleted is the last record read. The Delete statement is valid only when the last operation against the file is a successful Read statement. If not, the Delete returns a File Status value of 43. Because a Delete cannot return File Status values beginning with a 2 when the file is Open with Sequential Access, coding Invalid Key on such a Delete is not allowed.

When Dynamic or Random access is selected for the file, the Delete statement, like the Rewrite, becomes a little less restrictive. The record being deleted need not have been previously read. Simply fill in the primary Key information in the record description for the file and issue the Delete statement. If the record does not exist, a File Status of 23 is returned and an Invalid Key condition exists.

From page 274 of Sams Teach Yourself COBOL in 24 Hours (which I have just dusted off from my bookshelf).

So in your case you'll presumably set up your records to be sorted by INPUTFILEID, keep a record as you go through of occurrences of a given INPUTFILEID past its first occurrence, and Delete accordingly (after you have written the first one to your output file).

Monto answered 18/11, 2009 at 19:23 Comment(0)

If you sort the file with an external sort before the COBOL program reads it, you can remove the duplicates in the sort step itself; with DFSORT, for example, SUM FIELDS=NONE drops records with duplicate keys, and the EQUALS option makes it keep the first record of each set. If you sort the file beforehand but do not drop the duplicates there, a simple IF statement and a save field will let you skip the dups in the program.

Set up an INPUTFILEID-save field. Right after the read, test it: IF inputfileid equals inputfileid-save, read again; if not, write the record, and after the write move inputfileid to inputfileid-save. You will have to break up the current PERFORM to do this.

If you do not fully understand what I am saying and would like help changing the code, just let me know.
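
The save-field idea above could look roughly like the sketch below. It assumes the input has already been sorted by an earlier step; the program name, the file names, and the FIRST-REC-SW and EOF-SW switches are illustrative and not taken from the question.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. DedupSorted.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
      * Placeholder file names; the input must already be sorted.
           SELECT INFILE ASSIGN TO 'sorted.txt'
               ORGANIZATION IS LINE SEQUENTIAL.
           SELECT OUTFILE ASSIGN TO 'unique.txt'
               ORGANIZATION IS LINE SEQUENTIAL.
       DATA DIVISION.
       FILE SECTION.
       FD  INFILE.
       01  INREC.
           02 INPUTFILEID       PIC 9(07).
       FD  OUTFILE.
       01  OUTREC               PIC 9(07).
       WORKING-STORAGE SECTION.
      * Save field holding the last ID written to the output.
       01  INPUTFILEID-SAVE     PIC 9(07) VALUE ZERO.
       01  FIRST-REC-SW         PIC X VALUE 'Y'.
           88 FIRST-REC         VALUE 'Y'.
       01  EOF-SW               PIC X VALUE 'N'.
           88 EOF-INFILE        VALUE 'Y'.
       PROCEDURE DIVISION.
       MAIN-PARA.
           OPEN INPUT INFILE
           OPEN OUTPUT OUTFILE
           PERFORM READ-NEXT
           PERFORM UNTIL EOF-INFILE
      * Write when the ID changes; otherwise just read again.
               IF FIRST-REC OR INPUTFILEID NOT = INPUTFILEID-SAVE
                   WRITE OUTREC FROM INPUTFILEID
                   MOVE INPUTFILEID TO INPUTFILEID-SAVE
                   MOVE 'N' TO FIRST-REC-SW
               END-IF
               PERFORM READ-NEXT
           END-PERFORM
           CLOSE INFILE
           CLOSE OUTFILE
           STOP RUN.
       READ-NEXT.
           READ INFILE
               AT END SET EOF-INFILE TO TRUE
           END-READ.
```

Putting the READ in its own paragraph is one way to "break up" the PERFORM: the loop body then has a single read at the bottom regardless of whether the record was written or skipped.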

Giusto answered 18/11, 2009 at 19:32 Comment(0)

sort is the standard tool for these OS-level jobs, and using it follows the DRY principle. It takes -t for the field separator and -u for unique lines. It's written in C.

Scoutmaster answered 19/11, 2009 at 14:12 Comment(0)
