In C#, what is the best way to parse this WIKI markup?
Asked Answered
G

7

13

I need to take data that I am reading in from a WIKI markup page and store it as a table structure. I am trying to figure out how to properly parse the below markup syntax into some table data structure in C#

Here is an example table:

 || Owner || Action || Status || Comments ||
 | Bill | Fix the lobby | In Progress | This is easy |
 | Joe | Fix the bathroom | In Progress | Plumbing \\
 \\
  Electric \\
 \\
 Painting \\
 \\
 \\ | 
 | Scott | Fix the roof | Complete | This is expensive |

and here is how it comes in directly:

|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive| 

So as you can see:

  • The column headers have "||" as the separator
  • A row columns have a separator or "|"
  • A row might span multiple lines (as in the second data row example above) so i would have to keep reading until I hit the same number of "|" (cols) that I have in the header row.

I tried reading in line by line and then concatenating lines that had "\" in between then but that seemed a bit hacky.

I also tried to simply read in as a full string and then just parse by "||" first and then keep reading until I hit the same number of "|" and then go to the next row. This seemed to work but it feel like there might be a more elegant way using regular expressions or something similar.

Can anyone suggest the correct way to parse this data?

Gentile answered 9/4, 2015 at 3:39 Comment(8)
Are the \ in the source, or there to highlight an extended row in the example?Winch
no . . if i read it in to a raw string, this whole thing shows up as a single line. I broke it out above to make it easier to see what table i want to seeGentile
what effort have you made?Wharfage
I have updated the question with what i have tried. I have it working using my second approach but i feel like there is probably a much simpler way to achieve thisGentile
@Gentile You mentioned that the raw data comes in a single line, but you formatted it for the question. Please add the raw input to your question.Dunt
@Jon Tirjan - I have updated the question to include the raw inputGentile
@Gentile - Thanks. I added an answer with a working solution.Dunt
Assuming it's possible, have you tried looking at the source code of the wiki engine that already handles this syntax?Conurbation
H
7

I have largely replaced the previous answer, due to the fact that the format of the input after your edit is substantially different from the one posted before. This leads to a somewhat different solution.

Because there are no longer any line breaks after a row, the only way to determine for sure where a row ends, is to require that each row has the same number of columns as the table header. That is at least if you don't want to rely on some potentially fragile white space convention present in the one and only provided example string (i.e. that the row separator is the only | not preceded by a space). Your question at least does not provide this as the specification for a row delimiter.

The below "parser" provides at least the error handling validity checks that can be derived from your format specification and example string and also allows for tables that have no rows. The comments explain what it is doing in basic steps.

public class TableParser
{
    const StringSplitOptions SplitOpts = StringSplitOptions.None;
    const string RowColSep = "|";
    static readonly string[] HeaderColSplit = { "||" };
    static readonly string[] RowColSplit = { RowColSep };
    static readonly string[] MLColSplit = { @"\\" };

    public class TableRow
    {
        public List<string[]> Cells;
    }

    public class Table
    {
        public string[] Header;
        public TableRow[] Rows;
    }

    public static Table Parse(string text)
    {
        // Isolate the header columns and rows remainder.
        var headerSplit = text.Split(HeaderColSplit, SplitOpts);
        Ensure(headerSplit.Length > 1, "At least 1 header column is required in the input");

        // Need to check whether there are any rows.
        var hasRows = headerSplit.Last().IndexOf(RowColSep) >= 0;
        var header = headerSplit.Skip(1)
            .Take(headerSplit.Length - (hasRows ? 2 : 1))
            .Select(c => c.Trim())
            .ToArray();

        if (!hasRows) // If no rows for this table, we are done.
            return new Table() { Header = header, Rows = new TableRow[0] };

        // Get all row columns from the remainder.
        var rowsCols = headerSplit.Last().Split(RowColSplit, SplitOpts);

        // Require same amount of columns for a row as the header.
        Ensure((rowsCols.Length % (header.Length + 1)) == 1, 
            "The number of row colums does not match the number of header columns");
        var rows = new TableRow[(rowsCols.Length - 1) / (header.Length + 1)];

        // Fill rows by sequentially taking # header column cells 
        for (int ri = 0, start = 1; ri < rows.Length; ri++, start += header.Length + 1)
        {
            rows[ri] = new TableRow() { 
                Cells = rowsCols.Skip(start).Take(header.Length)
                    .Select(c => c.Split(MLColSplit, SplitOpts).Select(p => p.Trim()).ToArray())
                    .ToList()
            };
        };

        return new Table { Header = header, Rows = rows };
    }

    private static void Ensure(bool check, string errorMsg)
    {
        if (!check)
            throw new InvalidDataException(errorMsg);
    }
}

When used like this:

public static void Main(params string[] args)
{
        var wikiLine = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
        var table = TableParser.Parse(wikiLine);

        Console.WriteLine(string.Join(", ", table.Header));
        foreach (var r in table.Rows)
            Console.WriteLine(string.Join(", ", r.Cells.Select(c => string.Join(Environment.NewLine + "\t# ", c))));
}

It will produce the below output:

output

Where "\t# " represents a newline caused by the presence of \\ in the input.

Husein answered 13/4, 2015 at 5:24 Comment(1)
See my updated question as I have included the raw text without any formatting as I tried this solution and it didn't seem to give me the right column or row data.Gentile
D
4

Here's a solution which populates a DataTable. It does require a litte bit of data massaging (Trim), but the main parsing is Splits and Linq.

var str = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";

var headerStop = str.LastIndexOf("||");
var headers = str.Substring(0, headerStop).Split(new string[1] { "||" }, StringSplitOptions.None).Skip(1).ToList();
var records = str.Substring(headerStop + 4).TrimEnd(new char[2] { ' ', '|' }).Split(new string[1] { "| |" }, StringSplitOptions.None).ToList();

var tbl = new DataTable();
headers.ForEach(h => tbl.Columns.Add(h.Trim()));
records.ForEach(r =>  tbl.Rows.Add(r.Split('|')));
Dunt answered 14/4, 2015 at 12:55 Comment(0)
A
2

This makes some assumptions but seems to work for your sample data. I'm sure if I worked at I could combine the expressions and clean it up but you'll get the idea. It will also allow for rows that do not have the same number of cells as the header which I think is something confluence can do.

List<List<string>> table = new List<List<string>>();


var match = Regex.Match(raw, @"(?:(?:\|\|([^|]*))*\n)?");
if (match.Success)
{
    var headersWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
    List<String> headerRow = headersWithExtra.Take(headersWithExtra.Count()-1).ToList();
    if (headerRow.Count > 0)
    {
        table.Add(headerRow);
    }
}

match = Regex.Match(raw + "\r\n", @"[^\n]*\n" + @"(?:\|([^|]*))*");
var cellsWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);

List<string> row = new List<string>();
foreach (string cell in cellsWithExtra)
{
    if (cell.Trim(' ', '\t') == "\r\n")
    {
        if (!table.Contains(row) && row.Count > 0)
        {
            table.Add(row);
        }
        row = new List<string>();
    }
    else
    {

        row.Add(cell);
    }
}
Arnold answered 11/4, 2015 at 23:44 Comment(0)
H
2

This ended up very similar to Jon Tirjan's answer, although it cuts the LINQ to a single statement (the code to replace that last one was horrifically ugly) and is a bit more extensible. For example, it will replace the Confluence line breaks \\ with a string of your choosing, you can choose to trim or not trim whitespace from around elements, etc.

private void ParseWikiTable(string input, string newLineReplacement = " ")
{
    string separatorHeader = "||";
    string separatorRow = "| |";
    string separatorElement = "|";

    input = Regex.Replace(input, @"[ \\]{2,}", newLineReplacement);

    string inputHeader = input.Substring(0, input.LastIndexOf(separatorHeader));
    string inputContent = input.Substring(input.LastIndexOf(separatorHeader) + separatorHeader.Length);

    string[] headerArray = SimpleSplit(inputHeader, separatorHeader);
    string[][] rowArray = SimpleSplit(inputContent, separatorRow).Select(r => SimpleSplit(r, separatorElement)).ToArray();

    // do something with output data
    TestPrint(headerArray);
    foreach (var r in rowArray) { TestPrint(r); }
}

private string[] SimpleSplit(string input, string separator, bool trimWhitespace = true)
{
    input = input.Trim();
    if (input.StartsWith(separator)) { input = input.Substring(separator.Length); }
    if (input.EndsWith(separator)) { input = input.Substring(0, input.Length - separator.Length); }

    string[] segments = input.Split(new string[] { separator }, StringSplitOptions.None);
    if (trimWhitespace)
    {
        for (int i = 0; i < segments.Length; i++)
        {
            segments[i] = segments[i].Trim();
        }
    }

    return segments;
}

private void TestPrint(string[] lst)
{
    string joined = "[" + String.Join("::", lst) + "]";
    Console.WriteLine(joined);
}

Console output from your direct input string:

[Owner::Action::Status::Comments]

[Bill::fix the lobby::In Progress::This is eary]

[Joe::fix the bathroom::In progress::plumbing Electric Painting]

[Scott::fix the roof::Complete::this is expensive]

Huckaback answered 16/4, 2015 at 9:45 Comment(0)
C
2

A generic regex solution that populate a datatable and is a little flexible with the syntax.

           var text = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";

        // Get Headers
        var regHeaders = new Regex(@"\|\|\s*(\w[^\|]+)", RegexOptions.Compiled);
        var headers = regHeaders.Matches(text);

        //Get Rows, based on number of headers columns
        var regLinhas = new Regex(String.Format(@"(?:\|\s*(\w[^\|]+)){{{0}}}", headers.Count));
        var rows = regLinhas.Matches(text);

        var tbl = new DataTable();

        foreach (Match header in headers)
        {
            tbl.Columns.Add(header.Groups[1].Value);
        }

        foreach (Match row in rows)
        {
            tbl.Rows.Add(row.Groups[1].Captures.OfType<Capture>().Select(col => col.Value).ToArray());
        }
Carce answered 17/4, 2015 at 17:46 Comment(0)
L
2

Here's a solution involving regular expressions. It takes a single string as input and returns a List of headers and a List> of rows/columns. It also trims white space, which may or may not be the desired behavior, so be aware of that. It even prints things nicely :)

enter image description here

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace parseWiki
{
    class Program
    {
        static void Main(string[] args)
        {
            string content = @"|| Owner || Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
            content = content.Replace(@"\\", "");
            string headerContent = content.Substring(0, content.LastIndexOf("||") + 2);
            string cellContent = content.Substring(content.LastIndexOf("||") + 2);
            MatchCollection headerMatches = new Regex(@"\|\|([^|]*)(?=\|\|)", RegexOptions.Singleline).Matches(headerContent);
            MatchCollection cellMatches = new Regex(@"\|([^|]*)(?=\|)", RegexOptions.Singleline).Matches(cellContent);

            List<string> headers = new List<string>();
            foreach (Match match in headerMatches)
            {
                if (match.Groups.Count > 1)
                {
                    headers.Add(match.Groups[1].Value.Trim());
                }
            }

            List<List<string>> body = new List<List<string>>();
            List<string> newRow = new List<string>();
            foreach (Match match in cellMatches)
            {
                if (newRow.Count > 0 && newRow.Count % headers.Count == 0)
                {
                    body.Add(newRow);
                    newRow = new List<string>();
                }
                else
                {
                    newRow.Add(match.Groups[1].Value.Trim());
                }
            }
            body.Add(newRow);

            print(headers, body);
        }

        static void print(List<string> headers, List<List<string>> body)
        {
            var CELL_SIZE = 20;

            for (int i = 0; i < headers.Count; i++)
            {
                Console.Write(headers[i].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + "  ");
            }
            Console.WriteLine("\n" + "".PadRight( (CELL_SIZE + 2) * headers.Count, '-'));

            for (int r = 0; r < body.Count; r++)
            {
                List<string> row = body[r];
                for (int c = 0; c < row.Count; c++)
                {
                    Console.Write(row[c].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + "  ");
                }
                Console.WriteLine("");
            }

            Console.WriteLine("\n\n\n");
            Console.ReadKey(false);
        }
    }

    public static class StringExt
    {
        public static string Truncate(this string value, int maxLength)
        {
            if (string.IsNullOrEmpty(value) || value.Length <= maxLength) return value;
            return value.Substring(0, maxLength - 3) + "...";

        }
    }
}
Leaper answered 17/4, 2015 at 21:2 Comment(0)
L
0

Read the input string one character at a time and use a state-machine to decide what should be done with each input character. This approach probably needs more code, but it will be easier to maintain and to extend than regular expressions.

Levant answered 15/4, 2015 at 6:10 Comment(2)
I guess that my answer needs some code examples, but I don't understand why that reference link was added to the answer. It does not point to any example code or a more detailed explanation.Levant
Quite right. That was an unrelated link and should never have been added. I rolled back the change. In the future feel free to roll back edits such as these. Just click on the "edited" link and you will see the post history - the history allows you to roll things back.Transom

© 2022 - 2024 — McMap. All rights reserved.