How can you write to multiple outputs, depending on the key, in a single MapReduce job using Scalding (/Cascading)? I could of course use .filter
for every possible key, but that is a horrible hack which will fire up many jobs.
Write to multiple outputs by key Scalding Hadoop, one MapReduce Job
There is TemplatedTsv in Scalding (from version 0.9.0rc16 onwards), directly analogous to Cascading's TemplateTap.
Tsv(args("input"), ('COUNTRY, 'GDP))
.read
.write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
// In Hadoop mode, this creates one directory per country under the "output" path.
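For copy-and-paste purposes, the snippet above might be embedded in a complete job roughly like this (a sketch assuming Scalding 0.9.0rc16+; the class name and argument keys are illustrative):

```scala
import com.twitter.scalding._

class SplitByCountryJob(args: Args) extends Job(args) {
  // Read (COUNTRY, GDP) pairs from a tab-separated input.
  Tsv(args("input"), ('COUNTRY, 'GDP))
    .read
    // "%s" is filled in with each tuple's COUNTRY value, so in Hadoop
    // mode each distinct country gets its own sub-directory under the
    // output path, e.g. output/US/, output/FR/.
    .write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
}
```

Run it like any other Scalding job, e.g. with `--input` and `--output` arguments on the command line.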
This looks even more flexible than what I requested! Thanks. Could you say which Scalding version this is from? Is it 0.10.0 and above, or 0.9.0? –
Blues
From the codebase, it looks to be available from version 0.9.0rc16. –
Stith
@Stith Is there any way to drop the fields used in the template string? In your example, I basically want the resulting files to contain only the 'GDP field. –
Umbles
@Umbles That is a really good question, but I do not know whether it is possible with the current TemplatedTsv implementation. However, you can write your own MyTemplatedTsv, like here github.com/twitter/scalding/blob/0.11.0/scalding-core/src/main/…, add "override val fields = Fields.ALL", and specify the fields to be written when calling that tap. Could you please reply here if you test that? –
Stith
Use MultipleOutputFormat and write a custom output class on top of that output format, extrapolating from these other SO questions: Create Scalding Source like TextLine that combines multiple files into single mappers, and Compress Output Scalding / Cascading TsvCompressed.
This suggestion on the Cascading user group recommends using Cascading's TemplateTap. I am not sure how to connect it to Scalding, though.
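One way to connect the two would be a custom Scalding Source whose tap is a TemplateTap wrapping an Hfs tap. This is only a rough sketch: the class name is made up, and Cascading's constructor signatures differ between versions, so treat it as a starting point rather than working code.

```scala
import cascading.scheme.hadoop.TextDelimited
import cascading.tap.Tap
import cascading.tap.hadoop.{ Hfs, TemplateTap }
import cascading.tuple.Fields
import com.twitter.scalding._

// Hypothetical source (name is illustrative): routes each tuple to a
// sub-directory under basePath chosen by filling `template` with the
// tuple's field values, e.g. template = "%s" keyed on the first field.
case class TemplateTapTsv(basePath: String, template: String) extends Source {

  override def createTap(readOrWrite: AccessMode)(implicit mode: Mode): Tap[_, _, _] = {
    val parent = new Hfs(new TextDelimited(Fields.ALL, "\t"), basePath)
    // TemplateTap opens one child tap per distinct formatted path.
    new TemplateTap(parent, template)
  }
}
```

With something like this, the write side would look the same as any other Scalding sink: `pipe.write(TemplateTapTsv(args("output"), "%s"))`.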
That certainly looks promising; care to provide Scalding code for people's copy-and-paste needs? :) –
Blues