Hadoop Cascading : CascadeException "no loops allowed in cascade" when cogroup pipes twice

I'm trying to write a Cascading (v1.2) cascade (http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/#N20844) consisting of two flows:

1) The first flow outputs urls to a db table (where each one is automatically assigned an id via an auto-incrementing column). This flow also outputs pairs of urls into a SequenceFile with field names "urlTo", "urlFrom".

2) The second flow reads from both these sources and tries to do a CoGroup on "urlTo" (from the SequenceFile) and "url" (from the db source) to get the db record "id" for each "urlTo".

It then does a CoGroup on "urlFrom" and "url" to get the db record "id" for each "urlFrom".

The two flows work individually if I call flow.complete() on the first before running the second flow. But if I put the two flows in a Cascade object, I get the error

cascading.cascade.CascadeException: no loops allowed in cascade, flow: urlLink*url*url, source: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='urls', columnNames=null, columnDefs=null, primaryKeys=null}}, sink: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='url_link', columnNames=[urlLinkFrom, urlLinkTo], columnDefs=[bigint(20), bigint(20)], primaryKeys=[urlLinkFrom, urlLinkTo]}}

on trying to configure the cascade.

I can see it's coming from the addEdgeFor function of the CascadeConnector but I'm not clear on how to resolve this problem.

I've never used Cascade / CascadeConnector before. Is there something I'm missing?

Serviceable asked 16/7, 2013 at 14:30 Comment(0)

"A Tap is not given an explicit name by design. This is so a given Tap instance can be re-used in different {@link Flow}s that may expect a source or sink by a different logical name, but are the same physical resource."

"In general, two instances of the same Tap class must have differing Identifiers (and different #equals)."

It turns out that JDBCTaps generate their identifier from the connection url alone (and do not include the table name). Because I was reading from one table and writing to a different table in the same database, both Taps had the same identifier, so it looked as if I was reading from and writing to the same Tap, which caused the loop.

As a work-around, I'm going to subclass the JDBCTap and override the getIdentifier() method to include the table name.
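To illustrate the idea, here is a minimal, self-contained sketch. It does not use the real Cascading classes: `UrlOnlyTap` and `TableAwareTap` are hypothetical stand-ins, and the only thing borrowed from the description above is the `getIdentifier()` method name and the fact that the identifier is built from the connection url. The `"/" + tableName` format is my own assumption.

```java
import java.util.Objects;

// Hypothetical stand-in for a JDBC-backed tap. This is NOT the real JDBCTap;
// it only models "identifier is derived from the connection url alone".
class UrlOnlyTap {
    protected final String connectionUrl;
    protected final String tableName;

    UrlOnlyTap(String connectionUrl, String tableName) {
        this.connectionUrl = Objects.requireNonNull(connectionUrl);
        this.tableName = Objects.requireNonNull(tableName);
    }

    // The table name is ignored, so two tables in one database collide.
    public String getIdentifier() {
        return connectionUrl;
    }
}

// The work-around: a subclass whose identifier also includes the table name.
// The separator and format are assumptions made for this sketch.
class TableAwareTap extends UrlOnlyTap {
    TableAwareTap(String connectionUrl, String tableName) {
        super(connectionUrl, tableName);
    }

    @Override
    public String getIdentifier() {
        return connectionUrl + "/" + tableName;
    }
}

class IdentifierDemo {
    public static void main(String[] args) {
        String db = "jdbc:mysql://localhost:3306/mydb";

        // Same database, different tables, identical identifiers.
        boolean collides = new UrlOnlyTap(db, "urls").getIdentifier()
                .equals(new UrlOnlyTap(db, "url_link").getIdentifier());
        System.out.println("url-only identifiers collide: " + collides); // true

        // With the table name included, the identifiers differ.
        boolean stillCollides = new TableAwareTap(db, "urls").getIdentifier()
                .equals(new TableAwareTap(db, "url_link").getIdentifier());
        System.out.println("table-aware identifiers collide: " + stillCollides); // false
    }
}
```

In a real subclass it would probably also be worth checking that equals() and hashCode() distinguish the two tables, since the quoted documentation mentions "different #equals" as well.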

Serviceable answered 17/7, 2013 at 15:33 Comment(0)

It seems that some of your source and sink paths are the same.

A Cascade uses a directed graph to work out the order of its flows, so if a flow's source and sink point to the same location, that in essence creates a loop, which the Cascade does not allow. The flow

does not go from:

  • Source Location A to Sink Location B

but instead goes from:

  • Source Location A to Sink Location A.
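The loop can be sketched with a small, self-contained example. This is not Cascading's actual planner code; `TapGraphSketch` is a hypothetical class that treats each Tap identifier as a node, adds one edge per flow from source to sink, and reports a loop with a depth-first search. The identifiers `input` and `pairs.seq` are made-up placeholders.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: nodes are tap identifiers, edges go from source to sink.
class TapGraphSketch {
    private final Map<String, Set<String>> edges = new HashMap<>();

    // Record that some flow reads from sourceId and writes to sinkId.
    void addFlow(String sourceId, String sinkId) {
        edges.computeIfAbsent(sourceId, k -> new HashSet<>()).add(sinkId);
        edges.computeIfAbsent(sinkId, k -> new HashSet<>());
    }

    // True if the graph contains a cycle, including a node pointing at itself.
    boolean hasLoop() {
        Set<String> done = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : edges.keySet()) {
            if (visit(node, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private boolean visit(String node, Set<String> done, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true;  // reached a node already on the current path
        }
        if (done.contains(node)) {
            return false; // fully explored earlier, no loop through it
        }
        onPath.add(node);
        for (String next : edges.get(node)) {
            if (visit(next, done, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        String db = "jdbc:mysql://localhost:3306/mydb";

        // Both tables share one identifier: the second flow goes from A to A.
        TapGraphSketch sameId = new TapGraphSketch();
        sameId.addFlow("input", db);       // flow 1 writes the urls table
        sameId.addFlow("input", "pairs.seq");
        sameId.addFlow(db, db);            // flow 2 reads urls, writes url_link
        sameId.addFlow("pairs.seq", db);
        System.out.println(sameId.hasLoop()); // true

        // Distinct identifiers per table: the second flow goes from A to B.
        TapGraphSketch distinctIds = new TapGraphSketch();
        distinctIds.addFlow("input", db + "/urls");
        distinctIds.addFlow("input", "pairs.seq");
        distinctIds.addFlow(db + "/urls", db + "/url_link");
        distinctIds.addFlow("pairs.seq", db + "/url_link");
        System.out.println(distinctIds.hasLoop()); // false
    }
}
```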
Koran answered 16/7, 2013 at 21:34 Comment(2)
Thanks for your response! I'm not sure I quite understand. Are you suggesting that one of my flows is pointing to the same sink as it's using for source, or that the sink of my first flow is the source of my second flow and that's where the issue lies? If I'm not supposed to use the sink of the first flow as the source of the second, how do I create the cascade where the second flow of data relies on the output of the first? I'm looking at this example github.com/cwensel/cascading.samples/blob/master/wordcount/src/… but I still don't see where I'm going wrong. - Serviceable
Yes, I am suggesting that within the flow named "urlLink*url*url" the source and sink have the same Tap path, or Cascading thinks they do. It looks like the naming is where the confusion lies. This is conjecture, but I think the two "url" names ("url<TAP-sink>*url<TAP-source>") are confusing Cascading into thinking that you are using the same source and sink Taps within the same flow. - Koran