Joe Conley Tagged zeppelin Random thoughts on technology, books, golf, and everything else that interests me Notebook Driven Development <p>Fellow Spark developers, hearken to me! How fast is your Spark development cycle? Slow? Really slow? You could use this <a href="">super awesome template</a> to enable running your Spark jobs in IntelliJ, but sometimes you’re constrained by the size/locality of the data you’re working with, and you find that each re-run takes time (which is precious and finite and all that so yes, this stuff matters).</p> <p>The craftier of you might turn to that most estimable of tools, the REPL (Read-Evaluate-Process-Loop) for quick command-line iteration. And that’s a good start. I use the Scala REPL on a daily basis, mostly to verify proper date/time formats and regex testing. Using the REPL with Spark, you don’t have the overhead of starting up/shutting down the SparkContext and you can quickly test out things with immediate feedback (cool). And you can enter the REPL from SBT using the <code class="highlighter-rouge">console</code> command, giving you access to the classes/utilities you’ve built in that project and the project dependencies (very cool).</p> <h2 id="a-better-way">A Better Way</h2> <p>So yes, the REPL is nice and all but you can go even FURTHER, FASTER with notebooks like <a href="">Apache Zeppelin</a>. Zeppelin (like <a href="">Jupyter</a>) allows you to write snippets of runnable code in notebooks and execute them from the browser. What separates Zeppelin from Jupyter is how well it works out of the box with Spark. Spark is the default interpreter for Zeppelin and provides the spark and sql contexts for you implicitly. You also get great visualizations of SQL queries for free.</p> <table class="image"> <caption align="bottom">Simple SQL query using Zeppelin's bank example</caption> <tr><td><img src="/assets/zeppelin-sql.png" alt="Simple SQL query using Zeppelin's bank example" /></td></tr> </table> <p><br /></p> <table class="image"> <caption align="bottom">Simple SQL query with bar graph and form input</caption> <tr><td><img src="/assets/zeppelin-bar.png" alt="Simple SQL query with bar graph" /></td></tr> </table> <p><br /></p> <p>With Zeppelin, if you’re trying to query some dataset and want to understand its total size, the cardinality of a column, or simple descriptive statistics, you can do that immediately from the notebook itself with simple SQL queries. This sounds trivial but it ABSOLUTELY saves you time and effort by giving you a tight feedback loop when asking questions of data and not having to reload it every single time (when you use <code class="highlighter-rouge">cache</code>). In addition, you get documentation for free with Markdown, data visualization support with Angular, a growing ecosystem of modules in the Big Data ecosystem, and simple support for collaboration and sharing among your team.</p> <p>I also think Zeppelin helps you write more scalable Spark code. Writing code in paragraphs reinforces the idea of making methods as small and concise as possible. Once these chunks of code are worked out, building out your codebase is more or less a matter of composing these chunks into logical classes or methods.</p> <p>Zeppelin does have it’s drawbacks. Switching between your actual code and the notebook can be challenging, so you need dedicated contexts of exploration (Zeppelin) vs. crafting a solution (codebase) and stick to them. Also, dependency management is too manual. I would love for Zeppelin to know everything my Spark job knows through some Vulcan mindmeld or something (did I use that term correctly? I’m not a Trekkie. I’m a whatever-you-call-Tolkien-book-lover-two-generations-removed. Ringer? Inkling? Istari?).</p> <h2 id="big-idea-section">Big Idea Section</h2> <p>Ultimately, I think Zeppelin is a great tool if you’re a Spark developer trying to build scalable systems in a reasonable amount of time. I think notebooks are <a href="">“what’s next”</a>. I think speed of development can be a big bottleneck to the software engineering process, especially when working with large volumes of data. I also think, most importantly, that any company of reasonable size needs a certain level of useful, live documentation to understand just what the hell they’re doing.</p> <p>Because knowledge is power right? Isn’t all of this “coding”, “documentation”, and “testing” just different ways to represent knowledge? Ultimately <a href="">knowledge is just a tool</a>, a means to achieve some goal. It’s incumbent on us as engineers to use the best tools we can to accomplish our goals. I think Zeppelin is one such tool. I also think we could take this idea further and eventually get to the point where all of the code we write is just simple chunks, easily composable with minimal overhead (why do we spend so much time on packaging and deployment?). Or maybe we’re wasting our time and we should let <a href="">AI do our dirty work</a> for us? Who knows, but for now, I guess we keep on…</p> <iframe width="560" height="315" src="" frameborder="0" allowfullscreen=""></iframe> <p><br /></p> Tue, 28 Nov 2017 00:00:00 +0000 Chariot Day 2017 Talk <p>This past weekend I had the pleasure of attending and speaking at Chariot Day 2017, an internal conference put on by my fellow Charioteers. This is the third I’ve attended, and I was amazed by the impressive breadth and depth of the talks. I feel very lucky to be working with people much smarter than me!</p> <p>I gave a talk on how to use Apache Spark and Apache Zeppelin to provide quick SQL-based visualizations for your personal finances. Here are the slides if you’re interested. I’ll be delving more into notebooks and developer productivity in a later post. I’m working on exposing the Zeppelin notebooks that were part of the demo as well, if you’re interested in running the notebooks against your own personal finance dataset let me know!</p> <iframe src=";loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe> Mon, 09 Oct 2017 00:00:00 +0000