Is there a simple way to extract numbers from a string following certain rules?

I need to pull numbers from a string and put them into a list. There are some rules to this, however, such as identifying whether each extracted number is an Integer or a Float.

The task sounds simple enough but I am finding myself more and more confused as time goes by and could really do with some guidance.


Take the following test string as an example:

There are test values: P7 45.826.53.91.7, .5, 66.. 4 and 5.40.3.

The rules to follow when parsing the string are as follows:

  • Numbers cannot be preceded by a letter.

  • If it finds a number that is not followed by a decimal point, then the number is an Integer.

  • If it finds a number that is followed by a decimal point, then the number is a Float, eg 5.

  • ~ If more numbers follow the decimal point then the number is still a float, eg 5.40

  • ~ A further found decimal point should then break up the number, eg 5.40.3 becomes (5.40 Float) and (3 Float)

  • In the event of a letter for example following a decimal point, eg 3.H then still add 3. as a Float to the list (even if technically it is not valid)

Example 1

To make this a little clearer, taking the test string quoted above, the desired output should be as follows:

[Image: the test string with the extracted numbers highlighted]

In the image above, light blue illustrates Float numbers and pale red illustrates single Integers (note also how Floats joined together are split into separate Floats).

  • 45.826 (Float)
  • 53.91 (Float)
  • 7 (Integer)
  • 5 (Integer)
  • 66 . (Float)
  • 4 (Integer)
  • 5.40 (Float)
  • 3 . (Float)

Note there are deliberate spaces between 66 . and 3 . above due to the way the numbers were formatted.

Example 2:

Anoth3r Te5.t string .4 abc 8.1Q 123.45.67.8.9

[Image: the second test string with the extracted numbers highlighted]

  • 4 (Integer)
  • 8.1 (Float)
  • 123.45 (Float)
  • 67.8 (Float)
  • 9 (Integer)

To give a better idea, I created a new project whilst testing which looks like this:

[Image: screenshot of the test project]


Now onto the actual task. I thought maybe I could read each character from the string and identify what are valid numbers as per the rules above, and then pull them into a list.

To my ability, this was the best I could manage:

[Image: screenshot of the test project showing the actual output]

The code is as follows:

unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls;

type
  TForm1 = class(TForm)
    btnParseString: TButton;
    edtTestString: TEdit;
    Label1: TLabel;
    Label2: TLabel;
    Label3: TLabel;
    lstDesiredOutput: TListBox;
    lstActualOutput: TListBox;
    procedure btnParseStringClick(Sender: TObject);
  private
    FDone: Boolean;
    FIdx: Integer;
    procedure ParseString(const Str: string; var OutValue, OutKind: string);
  public
    { public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

procedure TForm1.ParseString(const Str: string; var OutValue, OutKind: string);
var
  CH1, CH2: Char;
begin
  Inc(FIdx);
  CH1 := Str[FIdx];

  case CH1 of
    '0'..'9': // Found a number
    begin
      CH2 := Str[FIdx - 1];
      if not (CH2 in ['A'..'Z']) then
      begin
        OutKind := 'Integer';

        // Try to determine float...

        //while (CH1 in ['0'..'9', '.']) do
        //begin
        //  case Str[FIdx] of
        //    '.':
        //    begin
        //      CH2 := Str[FIdx + 1];
        //      if not (CH2 in ['0'..'9']) then
        //      begin
        //        OutKind := 'Float';
        //        //Inc(FIdx);
        //      end;
        //    end;
        //  end;
        //end;
      end;
      OutValue := Str[FIdx];
    end;
  end;

  FDone := FIdx = Length(Str);
end;

procedure TForm1.btnParseStringClick(Sender: TObject);
var
  S, SKind: string;
begin
  lstActualOutput.Items.Clear;
  FDone := False;
  FIdx := 0;

  repeat
    ParseString(edtTestString.Text, S, SKind);
    if (S <> '') and (SKind <> '') then
    begin
      lstActualOutput.Items.Add(S + ' (' + SKind + ')');
    end;
  until
    FDone = True;
end;

end.

It clearly doesn't give the desired output (the failed code has been commented out) and my approach is likely wrong, but I feel I only need to make a few changes here and there for a working solution.

At this point I am rather confused and quite lost, despite thinking the answer is quite close. The task is becoming increasingly infuriating and I would really appreciate some help.


EDIT 1

Here I got a little closer, as there are no longer duplicate numbers, but the result is still clearly wrong.

[Image: screenshot of the test project showing the new actual output]

unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls;

type
  TForm1 = class(TForm)
    btnParseString: TButton;
    edtTestString: TEdit;
    Label1: TLabel;
    Label2: TLabel;
    Label3: TLabel;
    lstDesiredOutput: TListBox;
    lstActualOutput: TListBox;
    procedure btnParseStringClick(Sender: TObject);
  private
    FDone: Boolean;
    FIdx: Integer;
    procedure ParseString(const Str: string; var OutValue, OutKind: string);
  public
    { public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

// Prepare to pull hair out!
procedure TForm1.ParseString(const Str: string; var OutValue, OutKind: string);
var
  CH1, CH2: Char;
begin
  Inc(FIdx);
  CH1 := Str[FIdx];

  case CH1 of
    '0'..'9': // Found the start of a new number
    begin
      CH1 := Str[FIdx];

      // make sure previous character is not a letter
      CH2 := Str[FIdx - 1];
      if not (CH2 in ['A'..'Z']) then
      begin
        OutKind := 'Integer';

        // Try to determine float...
        //while (CH1 in ['0'..'9', '.']) do
        //begin
        //  OutKind := 'Float';
        //  case Str[FIdx] of
        //    '.':
        //    begin
        //      CH2 := Str[FIdx + 1];
        //      if not (CH2 in ['0'..'9']) then
        //      begin
        //        OutKind := 'Float';
        //        Break;
        //      end;
        //    end;
        //  end;
        //  Inc(FIdx);
        //  CH1 := Str[FIdx];
        //end;
      end;
      OutValue := Str[FIdx];
    end;
  end;

  OutValue := Str[FIdx];
  FDone := Str[FIdx] = #0;
end;

procedure TForm1.btnParseStringClick(Sender: TObject);
var
  S, SKind: string;
begin
  lstActualOutput.Items.Clear;
  FDone := False;
  FIdx := 0;

  repeat
    ParseString(edtTestString.Text, S, SKind);
    if (S <> '') and (SKind <> '') then
    begin
      lstActualOutput.Items.Add(S + ' (' + SKind + ')');
    end;
  until
    FDone = True;
end;

end.

My question is how can I extract numbers from a string, add them to a list and determine if the number is integer or float?

The left pale green listbox (desired output) shows what the results should be, the right pale blue listbox (actual output) shows what we actually got.

Please advise. Thanks.

Note I re-added the Delphi tag as I do use XE7, so please don't remove it. Although this particular problem is in Lazarus, my eventual solution should work for both XE7 and Lazarus.

Lobito answered 31/10, 2016 at 13:56 Comment(26)
Take a look at the System.Masks.MatchesMask function. I didn't try, but this could maybe help you.Abbieabbot
@DavidHeffernan That is not a fair assumption considering the time it took for me to write what I thought was a valid question (you really don't know what the question is?), and also showed my progress and effort to the best of my ability. If I wanted someone to do it all for me I would not have put so much effort thus far into this, so please don't just assume I want a copy and paste answer, I just need some guidance to help me on my way, you can only evolve as a programmer from learning and not copy and pasting so please don't assume I expect someone to do the work for me.Lobito
So what is your question. Be very specific.Sargasso
@DavidHeffernan I need to pass a string and then parse it looking for numbers (numbers can be Integer or Float). Once numbers have been found determine what type the number is, eg is it an Integer or Float. A Float cannot have more than one decimal point, for example 123.45.6 should be split into two results, the first is a Float (123.45) and the second is an Integer (6). If the example however is, 123.45.6.7 then the split would be Float (123.45) and Float (6.7).Lobito
@DavidHeffernan Sorry, to be honest I have run myself into a state of confusion from spending so much time on this task (you won't believe how long). I thought the question was clear enough but maybe it is not. If you understand it now maybe you can help improve my post because I am not sure how I can express my question in such a simple way. This is why I added images showing the desired output. The actual output list was to help me while I tried different things, so my task is to parse the test string and populate the actual output list to match the desired output list on the left.Lobito
The problem I have with the question is that it feels too vague to me. It's like you want us to debug your entire algo and fix it.Sargasso
@DavidHeffernan I understand, I just don't know how I can make my question more specific, and given the nature of it I can also see why it might come across as if I am expecting someone to fix it all for me (which I honestly am not). It's just I am not too good with this sort of string handling, to get as far as I have was quite a challenge for me and I feel I have almost reached my limit (I start doubting myself and my ability), I just need some guidance and a push in the right direction more than anything :(Lobito
I do sympathise, and it's clear you are putting lots of effort in.Sargasso
@DavidHeffernan Thanks, I just really wish I could absorb and understand these tasks much better and more clearly, because the longer they go unsolved the harder it becomes. This is my proposal: hopefully no one posts a solution, and in the meantime I am going to take some time away from the problem to gather my thoughts and ease the stress and confusion, and then hopefully come back, attack it and make more progress. A finite state machine looks way too advanced, so I will attempt to keep going with my current approach. Wish me luck :)Lobito
This is pure curiosity, but wherever are you getting such chaotic input data?Locklear
On what planet does 45.826.53.91.7 parse out to 45.826, 53.91 and 7? How do you determine it's not 45, 826.53, and 91.7, or 45, 826, 53.91 and 7? Where are you getting this random-noise filled data?Snorkel
To everyone, this is just an exercise I set for myself to see how to manipulate such data, if it proves too difficult I will just abandon the whole thing. I know for me it is proving quite difficult and confusing but I sometimes like setting myself what may seem like quirky tasks just to see what I can or cannot do. Having failed so far I would have been interested to know how it should be done, but I will keep trying and come back tomorrow with hopefully some more progress.Lobito
@KenWhite because the number is broken down, the result of 45.826.53.91.7 would be 45.826 and 53.91 (a float cannot have more than one decimal). So you should be visualising the numbers like so: |45.826|53.91|7| with the first two broken down numbers being identified as floats and the remaining 7 a single Integer. This is because that particular number is continuous without spaces or letters, just numbers and decimal points.Lobito
What would the sequence look like if 45 was to be indeed an integer?Costanza
@SertacAkyuz the 45 would only be an Integer if there was no decimal point afterwards; it should be identified as a float because a decimal point does follow. Please see the edited question, where I have added an image to hopefully show this more clearly.Lobito
But if there would be two integers after each other, how would they be separated? By a space, or ...?Juanjuana
@TomBrunberg Yes, It would only be an integer if there is a space between making it a separate number, eg take this number 12.34 if there is a space after the 2 (12 .34) then you have two integers (12 and 34). The decimal points keep it continuous.Lobito
@Lobito Kudos to you for trying to set yourself a challenge to help your learning. But there is perhaps a more important lesson you need to learn first. Most programmers fall into the trap of making things overcomplicated and difficult for themselves. You have made this mistake. By allowing the symbol . to perform double-duty as both decimal-point and item separator you turn what could have been an interesting parsing exercise into an unrealistic problem that you're unlikely to learn anything useful from. Most important lesson: Don't overcomplicate things.Lessee
PS: If you want to learn by solving programming challenges: search for sites like hackerrank and codinggame. These sites offer a variety of challenges that are more reasonable; and often require application of more real-world algorithms. They might not support Delphi, but there's a good chance they'll support Pascal. And even if you're forced to use a different language - algorithmic principles are the same regardless of language.Lessee
@CraigYoung thanks for the tips and advice, really appreciate it.Lobito
I tried some more today but kept failing. A few things came to mind, but I am not sure if they are on the right track or not. What I thought to do was something like building a string inside the ParseString procedure, so instead of the list being populated one char at a time like it is right now, build up a string, and if the string contains more than 1 decimal point, or a space / letter etc, then break and start again looking for the next number. These are just fuzzy ideas I am getting at the moment though.Lobito
Your only output kinds are integer and float - and you don't set any value on entry. So how can you ever get anything but integer (as float is commented out)? So your problem is (or should be ) obvious. You always output something (the next bit of the string) and it is always described integer (once any integer is found) - which is exactly what your output shows! Please learn to use the debugger and you will see where you are going wrong.Amadeo
@Amadeo the commented out code was just failed attempts which I will revise and try to correct. As for the debugging, I know how to use it but I don't currently have Delphi installed, and let's just say the Lazarus debugger is not as intuitive. Sure, you can break and step etc, but trying to use the watch list etc is not as good as Delphi. I won't use that as any kind of excuse though; my main problem is trying to unconfuse myself and have a clear plan of what I am trying to do. In fact I doubt there would be many circumstances where this would ever be needed, but since I started it I just want it finished.Lobito
OK. I have never used Lazarus. But at a minimum you need to initialise OutKind with something like 'None' in the first line. That will go a long way towards solving your issues in terms of confusing yourself. Also remove the first OutKind = float - that would just make every number a float, which is not what you want. Finally you need an else branch for the if statement concerning '..', otherwise something like 1..3 will come out as integer!Amadeo
But that still won't follow your rules. 1.2.3 won't break. To handle that sort of situation you really do need to investigate FSMs as suggested by MBoAmadeo
@CraigYoung cool websites you linked earlier by the way ;)Lobito
S
14

Your rules are rather complex, so you could try building a finite state machine (FSM), also known as a deterministic finite automaton (DFA).

Every character causes a transition between states.

For example, when you are in the state "integer started" and meet a space character, you yield an integer value and the FSM goes into the state "anything wanted".

If you are in the state "integer started" and meet '.', the FSM goes into the state "float or integer list started", and so on.
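As a rough illustration of those transitions, here is a minimal sketch. The state names (stAnything, stIntegerStarted, stFloatOrListStarted) and the NextState function are hypothetical, chosen only to mirror the wording above. The sketch tracks the state only: it does not collect or yield values, and it deliberately ignores the "no preceding letter" rule, which would need a further state.

```pascal
program FsmSketch;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical state names, mirroring the descriptions above.
  TState = (stAnything, stIntegerStarted, stFloatOrListStarted);

const
  StateNames: array[TState] of string =
    ('anything wanted', 'integer started', 'float or integer list started');

// Returns the state reached after reading one character.
// Assumption: any character without an explicit transition
// sends the machine back to "anything wanted".
function NextState(State: TState; Ch: Char): TState;
begin
  Result := State;
  case State of
    stAnything:
      case Ch of
        '0'..'9': Result := stIntegerStarted;
      end;
    stIntegerStarted:
      case Ch of
        '0'..'9': Result := stIntegerStarted;
        '.': Result := stFloatOrListStarted;
      else
        Result := stAnything; // e.g. a space ends the integer
      end;
    stFloatOrListStarted:
      case Ch of
        '0'..'9': Result := stFloatOrListStarted;
      else
        Result := stAnything; // a second '.' or a space ends the float
      end;
  end;
end;

var
  S: string;
  I: Integer;
  State: TState;
begin
  S := '7 45.8';
  State := stAnything;
  // Feed the string one character at a time and show each transition.
  for I := 1 to Length(S) do
  begin
    State := NextState(State, S[I]);
    WriteLn('''', S[I], ''' -> ', StateNames[State]);
  end;
end.
```

For the input '7 45.8' the machine drops back to "anything wanted" at the space and moves to "float or integer list started" at the dot. A real parser would also append characters to a buffer on each transition and add the buffer to the list whenever a number ends.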

Stormy answered 31/10, 2016 at 14:23 Comment(6)
A state machine is the way to go.Lira
Wow, looks like I underestimated the task by quite a bit if this is the kind of thing involved. I thought I could simply iterate each char in the string and pick out valid numbers :)Lobito
Yes you can, but depending on the state you must interpret the characters differently. Just as MBo described.Lira
In your code OutKind already (at least partly) represents your states, so you are already on the way to an FSM without realising it. FSM formalises the idea and makes the code clearer and more robust than yours. You may need more intermediate states and you tend to code for each state separately, to reduce risk of errors and isolate them when they do occur. But it is not that much further on than where you are. So don't despair.Amadeo
I don't recall the name, but I have used some math parsing/formula engine which I integrated into my own script. There are plenty of things out there already that can do the job much better than your trial and error.Nablus
Having now fully read the page linked, this does seem very likely the way to go. The examples given seem extremely technical, but the overview and image samples make it a bit easier to understand.Lobito
A
6

The answer is quite close, but there are several basic errors. To give you some hints (without writing your code for you): within the while loop you MUST ALWAYS increment (the increment should not be where it is, otherwise you get an infinite loop), and you MUST check that you have not reached the end of the string (otherwise you get an exception). Finally, your while loop should not be dependent on CH1, because that never changes (again resulting in an infinite loop). But my best advice here is to trace through your code with the debugger; that is what it is there for. Then your mistakes would become obvious.
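To show just the loop shape these hints describe (not a full solution), here is a minimal sketch. ReadDigits is a hypothetical helper, not part of the question's code: it tests the index against the string length first, re-reads Str[Idx] on every pass instead of relying on a cached character, and always advances the index.

```pascal
program ScanDigits;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: collects the run of digits starting at Idx and
// leaves Idx on the first character after that run.
function ReadDigits(const Str: string; var Idx: Integer): string;
begin
  Result := '';
  // The bounds check comes first. Boolean evaluation short-circuits by
  // default, so Str[Idx] is never read past the end of the string.
  while (Idx <= Length(Str)) and (Str[Idx] >= '0') and (Str[Idx] <= '9') do
  begin
    Result := Result + Str[Idx];
    Inc(Idx); // always advance, otherwise the loop never ends
  end;
end;

var
  Idx: Integer;
begin
  Idx := 1;
  WriteLn(ReadDigits('826.53', Idx)); // prints 826
  WriteLn(Idx);                       // prints 4, the position of the '.'
end.
```

The caller can then inspect Str[Idx] (after its own bounds check) to decide whether a decimal point follows and the number should be treated as a float.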

Amadeo answered 31/10, 2016 at 14:24 Comment(0)
A
3

There are so many basic errors in your code I decided to correct your homework, as it were. This is still not a good way to do it, but at least the basic errors are removed. Take care to read the comments!

procedure TForm1.ParseString(const Str: string; var OutValue,
  OutKind: string);
//var
//  CH1, CH2: Char;      <<<<<<<<<<<<<<<< Don't need these
begin
  (*************************************************
   *                                               *
   * This only corrects the 'silly' errors. It is  *
   * NOT being passed off as GOOD code!            *
   *                                               *
   *************************************************)

  Inc(FIdx);
  // CH1 := Str[FIdx]; <<<<<<<<<<<<<<<<<< Not needed but OK to use. I removed them because they seemed to cause confusion...
  OutKind := 'None';
  OutValue := '';

  try
  case Str[FIdx] of
    '0'..'9': // Found the start of a new number
    begin
      // CH1 := Str[FIdx]; <<<<<<<<<<<<<<<<<<<< Not needed

      // make sure previous character is not a letter
      // >>>>>>>>>>> make sure we are not at beginning of string
      if FIdx > 1 then
      begin
        //CH2 := Str[FIdx - 1];
        if (Str[FIdx - 1] in ['A'..'Z', 'a'..'z']) then // <<<<< don't forget lower case!
        begin
          exit; // <<<<<<<<<<<<<<
        end;
      end;
      // else we have a digit and it is not preceded by a letter, so must be at least integer
      OutKind := 'Integer';

      // <<<<<<<<<<<<<<<<<<<<< WHAT WE HAVE SO FAR >>>>>>>>>>>>>>
      OutValue := Str[FIdx];
      // <<<<<<<<<<<<< Carry on...
      inc( FIdx );
      // Try to determine float...

      while (Fidx <= Length( Str )) and  (Str[ FIdx ] in ['0'..'9', '.']) do // <<<<< note: not CH1!
      begin
        OutValue := Outvalue + Str[FIdx]; //<<<<<<<<<<<<<<<<<<<<<< Note you were storing just 1 char. EVER!
        //>>>>>>>>>>>>>>>>>>>>>>>>>  OutKind := 'Float';  ***** NO! *****
        case Str[FIdx] of
          '.':
          begin
            OutKind := 'Float';
            // now just copy any remaining integers - that is all rules ask for
            inc( FIdx );
            while (Fidx <= Length( Str )) and  (Str[ FIdx ] in ['0'..'9']) do // <<<<< note '.' excluded here!
            begin
              OutValue := Outvalue + Str[FIdx];
              inc( FIdx );
            end;
            exit;
          end;
            // >>>>>>>>>>>>>>>>>>> all the rest is unnecessary
            //CH2 := Str[FIdx + 1];
            //      if not (CH2 in ['0'..'9']) then
            //      begin
            //        OutKind := 'Float';
            //        Break;
            //      end;
            //    end;
            //  end;
            //  Inc(FIdx);
            //  CH1 := Str[FIdx];
            //end;

        end;
        inc( fIdx );
      end;

    end;
  end;

  // OutValue := Str[FIdx]; <<<<<<<<<<<<<<<<<<<<< NO! Only ever gives 1 char!
  // FDone := Str[FIdx] = #0; <<<<<<<<<<<<<<<<<<< NO! #0 does NOT terminate Delphi strings

  finally   // <<<<<<<<<<<<<<< Try.. finally clause added to make sure FDone is always evaluated.
            // <<<<<<<<<< Note there are better ways!
    if FIdx > Length( Str ) then
    begin
      FDone := TRUE;
    end;
  end;
end;
Amadeo answered 1/11, 2016 at 16:38 Comment(1)
No need to be condescending, it wasn't homework. Read the comments more closely.Incandesce
J
3

You have got answers and comments that suggest using a state machine, and I support that fully. From the code you show in Edit1, I see that you still did not implement a state machine. From the comments I guess you don't know how to do that, so to push you in that direction here's one approach:

Define the states you need to work with:

type
  TReadState = (ReadingIdle, ReadingText, ReadingInt, ReadingFloat);
  // ReadingIdle, initial state or if no other state applies
  // ReadingText, needed to deal with strings that include digits (P7..)
  // ReadingInt, state that collects the characters that form an integer
  // ReadingFloat, state that collects characters that form a float

Then define the skeleton of your state machine. To keep it as easy as possible I chose a straightforward procedural approach, with one main procedure and four subprocedures, one for each state.

procedure ParseString(const s: string; strings: TStrings);
var
  ix: integer;
  ch: Char;
  len: integer;
  str,           // to collect characters which form a value
  res: string;   // holds a final value if not empty
  State: TReadState;

  // subprocedures, one for each state
  procedure DoReadingIdle(ch: char; var str, res: string);
  procedure DoReadingText(ch: char; var str, res: string);
  procedure DoReadingInt(ch: char; var str, res: string);
  procedure DoReadingFloat(ch: char; var str, res: string);

begin
  State := ReadingIdle;
  len := Length(s);
  res := '';
  str := '';
  ix := 1;
  repeat
    ch := s[ix];
    case State of
      ReadingIdle:  DoReadingIdle(ch, str, res);
      ReadingText:  DoReadingText(ch, str, res);
      ReadingInt:   DoReadingInt(ch, str, res);
      ReadingFloat: DoReadingFloat(ch, str, res);
    end;
    if res <> '' then
    begin
      strings.Add(res);
      res := '';
    end;
    inc(ix);
  until ix > len;
  // if State is either ReadingInt or ReadingFloat, the input string
  // ended with a digit as final character of an integer, resp. float,
  // and we have a pending value to add to the list
  case State of
    ReadingInt: strings.Add(str + ' (integer)');
    ReadingFloat: strings.Add(str + ' (float)');
  end;
end;

That is the skeleton. The main logic is in the four state procedures.

  procedure DoReadingIdle(ch: char; var str, res: string);
  begin
    case ch of
      '0'..'9': begin
        str := ch;
        State := ReadingInt;
      end;
      ' ','.': begin
        str := '';
        // no state change
      end
      else begin
        str := ch;
        State := ReadingText;
      end;
    end;
  end;

  procedure DoReadingText(ch: char; var str, res: string);
  begin
    case ch of
      ' ','.': begin  // terminates ReadingText state
        str := '';
        State := ReadingIdle;
      end
      else begin
        str := str + ch;
        // no state change
      end;
    end;
  end;

  procedure DoReadingInt(ch: char; var str, res: string);
  begin
    case ch of
      '0'..'9': begin
        str := str + ch;
      end;
      '.': begin  // ok, seems we are reading a float
        str := str + ch;
        State := ReadingFloat;  // change state
      end;
      ' ',',': begin // end of int reading, set res
        res := str + ' (integer)';
        str := '';
        State := ReadingIdle;
      end;
    end;
  end;

  procedure DoReadingFloat(ch: char; var str, res: string);
  begin
    case ch of
      '0'..'9': begin
        str := str + ch;
      end;
      ' ','.',',': begin  // end of float reading, set res
        res := str + ' (float)';
        str := '';
        State := ReadingIdle;
      end;
    end;
  end;

The state procedures should be self-explanatory. But just ask if something is unclear.

Both your test strings result in the values listed as you specified. One of your rules was a little bit ambiguous and my interpretation might be wrong.

numbers cannot be preceded by a letter

The example you provided is "P7", and in your code you only checked the immediately preceding character. But what if it read "P71"? I interpreted it to mean that "1" should be omitted just like the "7", even though the previous character of "1" is "7". This is the main reason for the ReadingText state, which ends only on a space or period.

Juanjuana answered 2/11, 2016 at 9:59 Comment(1)
So many answers and comments it's going to take me a while to let it all sink in. As for your assumption based on "P71" then yes both numbers would be ignored as the string did not start with a number.Lobito
D
1

Here's a solution using regex. I implemented it in Delphi (tested in 10.1, but it should also work with XE8). I'm sure you can adapt it for Lazarus; I'm just not sure which regex libraries work over there. The regex pattern uses alternation to match numbers as integers or floats following your rules:

Integer:

(\b\d+(?![.\d]))
  • started by a word boundary (so no letter, number or underscore before - if underscores are an issue you could use (?<![[:alnum:]]) instead)
  • then match one or more digits
  • that are followed by neither a digit nor a dot

Float:

(\b\d+(?:\.\d+)?)
  • started by a word boundary (so no letter, number or underscore before - if underscores are an issue you could use (?<![[:alnum:]]) instead)
  • then match one or more digits
  • optionally match dot followed by further digits

A simple console application looks like

program Test;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, RegularExpressions;

procedure ParseString(const Input: string);
var
  Match: TMatch;
begin
  WriteLn('---start---');
  Match := TRegex.Match(Input, '(\b\d+(?![.\d]))|(\b\d+(?:\.\d+)?)');
  while Match.Success do
  begin
    if Match.Groups[1].Value <> '' then
      writeln(Match.Groups[1].Value + '(Integer)')
    else
      writeln(Match.Groups[2].Value + '(Float)');
    Match := Match.NextMatch;
  end;
  WriteLn('---end---');
end;

begin
  ParseString('There are test values: P7 45.826.53.91.7, .5, 66.. 4 and 5.40.3.');
  ParseString('Anoth3r Te5.t string .4 abc 8.1Q 123.45.67.8.9');
  ReadLn;
end.
Damper answered 1/11, 2016 at 10:12 Comment(0)
