Joe Conley Tagged sbt Random thoughts on technology, books, golf, and everything else that interests me Notebook Driven Development <p>Fellow Spark developers, hearken to me! How fast is your Spark development cycle? Slow? Really slow? You could use this <a href="">super awesome template</a> to enable running your Spark jobs in IntelliJ, but sometimes you’re constrained by the size/locality of the data you’re working with, and you find that each re-run takes time (which is precious and finite and all that so yes, this stuff matters).</p> <p>The craftier of you might turn to that most estimable of tools, the REPL (Read-Eval-Print Loop), for quick command-line iteration. And that’s a good start. I use the Scala REPL on a daily basis, mostly to verify date/time formats and test regexes. Using the REPL with Spark, you don’t have the overhead of starting up and shutting down the SparkContext on every run, and you can quickly test things out with immediate feedback (cool). And you can enter the REPL from SBT using the <code class="highlighter-rouge">console</code> command, giving you access to the classes/utilities you’ve built in that project and the project dependencies (very cool).</p> <h2 id="a-better-way">A Better Way</h2> <p>So yes, the REPL is nice and all but you can go even FURTHER, FASTER with notebooks like <a href="">Apache Zeppelin</a>. Zeppelin (like <a href="">Jupyter</a>) allows you to write snippets of runnable code in notebooks and execute them from the browser. What separates Zeppelin from Jupyter is how well it works out of the box with Spark. Spark is the default interpreter for Zeppelin, and it provides the Spark and SQL contexts for you implicitly. 
You also get great visualizations of SQL queries for free.</p> <table class="image"> <caption align="bottom">Simple SQL query using Zeppelin's bank example</caption> <tr><td><img src="/assets/zeppelin-sql.png" alt="Simple SQL query using Zeppelin's bank example" /></td></tr> </table> <p><br /></p> <table class="image"> <caption align="bottom">Simple SQL query with bar graph and form input</caption> <tr><td><img src="/assets/zeppelin-bar.png" alt="Simple SQL query with bar graph" /></td></tr> </table> <p><br /></p> <p>With Zeppelin, if you’re trying to query some dataset and want to understand its total size, the cardinality of a column, or simple descriptive statistics, you can do that immediately from the notebook itself with simple SQL queries. This sounds trivial but it ABSOLUTELY saves you time and effort by giving you a tight feedback loop when asking questions of data, without having to reload it every single time (when you use <code class="highlighter-rouge">cache</code>). In addition, you get documentation for free with Markdown, data visualization support with Angular, a growing set of modules from the Big Data ecosystem, and simple support for collaboration and sharing among your team.</p> <p>I also think Zeppelin helps you write more scalable Spark code. Writing code in paragraphs reinforces the idea of making methods as small and concise as possible. Once these chunks of code are worked out, building out your codebase is more or less a matter of composing these chunks into logical classes or methods.</p> <p>Zeppelin does have its drawbacks. Switching between your actual code and the notebook can be challenging, so you need to keep separate contexts for exploration (Zeppelin) and for crafting a solution (codebase), and stick to them. Also, dependency management is too manual. I would love for Zeppelin to know everything my Spark job knows through some Vulcan mind meld or something (did I use that term correctly? I’m not a Trekkie. 
I’m a whatever-you-call-Tolkien-book-lover-two-generations-removed. Ringer? Inkling? Istari?).</p> <h2 id="big-idea-section">Big Idea Section</h2> <p>Ultimately, I think Zeppelin is a great tool if you’re a Spark developer trying to build scalable systems in a reasonable amount of time. I think notebooks are <a href="">“what’s next”</a>. I think speed of development can be a big bottleneck in the software engineering process, especially when working with large volumes of data. I also think, most importantly, that any company of reasonable size needs a certain level of useful, live documentation to understand just what the hell they’re doing.</p> <p>Because knowledge is power, right? Isn’t all of this “coding”, “documentation”, and “testing” just different ways to represent knowledge? Ultimately <a href="">knowledge is just a tool</a>, a means to achieve some goal. It’s incumbent on us as engineers to use the best tools we can to accomplish our goals. I think Zeppelin is one such tool. I also think we could take this idea further and eventually get to the point where all of the code we write is just simple chunks, easily composable with minimal overhead (why do we spend so much time on packaging and deployment?). Or maybe we’re wasting our time and we should let <a href="">AI do our dirty work</a> for us? Who knows, but for now, I guess we keep on…</p> <iframe width="560" height="315" src="" frameborder="0" allowfullscreen=""></iframe> <p><br /></p> Tue, 28 Nov 2017 00:00:00 +0000 High-Leverage Development with Giter8 Templates <p><img src="" /> <img src="" /></p> <p><a href="">Edmond Lau</a> talks a lot about leverage in his book <a href="">The Effective Engineer</a>, a term he borrowed from Andy Grove’s <a href=";from_search=true">High Output Management</a>. Both are excellent reads, especially for programmers looking to maximize the impact they have on their teams. The term <em>leverage</em> gets to the heart of this. 
It describes activities that create a disproportionate amount of value. This feels like a much more elegant description than “10x/rockstar/ninja developer” or whatever cliché stokes the egos of the programmer-gods. It places the focus on <em>output</em>, where it belongs!</p> <p>Some examples of high-leverage activities Lau mentions include:</p> <ul> <li>improving the onboarding processes for new hires via tutorials, documentation, and notebooks (i.e. labs)</li> <li>creating tight feedback loops to quickly validate ideas (e.g. use a REPL or a notebook!)</li> <li>writing tools to make you and other developers more efficient</li> </ul> <p>In this spirit, I’ve created a <a href="">Giter8</a> template to show how to <a href="">create an SBT-based Spark project</a> with the following accouterments:</p> <ul> <li>utilities for logging and writing dataframes in common formats</li> <li>configuration via <a href="">Typesafe Config</a></li> <li>building the fat jar via <a href="">sbt-assembly</a></li> <li>release support via <a href="">sbt-release</a></li> <li>support for running your Spark job in IntelliJ</li> </ul> <p>This has saved me a significant amount of time in starting new Spark jobs or testing out quick proofs of concept. Simply call <code class="highlighter-rouge">sbt new josephpconley/spark-seed.g8</code> and you’re all set! Enjoy!</p> Thu, 12 Oct 2017 00:00:00 +0000 Roll Your Own Notification Service <p>Have you ever wished you could receive customized updates whenever your favorite websites update their content? Most sites offer the means to get notified when a new blog post hits the wire or new products are added to their catalog (RSS, social media, e-mail, etc.). But what if the site doesn’t use any of these services? Or what if you only want specific updates (e.g. blog posts from author X, new products containing the name Y)? 
Then you’re left with only one course of action: build your own notification service!</p> <p>Armed with the mighty powers of HTML scraping, the Scala programming language, and a recurring scheduling mechanism (in this case Play’s Akka scheduler), you have all the tools you need to set up your custom notification.</p> <h2 id="my-new-ebook-notification-service">My New EBook Notification Service</h2> <p>Let’s create a notification service which lets us know when new ebooks are available at my local digital library, <a href="">Delaware County Library System</a>. At the time of this writing, no such notification service exists. As I’d prefer not to miss any notifications, I’d like to set up an RSS feed. Specifically, we’ll write a process which periodically checks the digital library site for new ebooks and updates an RSS feed accordingly.</p> <h3 id="scalasbt">Scala/SBT</h3> <p>We’ll start out by creating a basic Scala application using SBT (you can check out a skeleton project <a href="">here</a>). Let’s add the <a href="">HTMLUnit</a> and <a href="!/overview">Scala IO</a> libraries to our project. We’ll use HTMLUnit to parse the HTML code of the library’s website, and we’ll use Scala IO to write our XML to a file. Your build file should now look like this (assuming you named your project “ebook”):</p> <script src=""></script> <h3 id="scala-xml">Scala XML</h3> <p>Let’s start by building an abstraction for an RSS feed (you can read about the basics of RSS <a href="">here</a>). We’ll start with an Item case class which holds the basic properties of an RSS item and a method to generate XML. Similarly, we define the basic properties of a Feed using a trait. We’ll keep this trait abstract in anticipation of reusing it for other feeds.</p> <script src=""></script> <h3 id="screen-scraping-with-htmlunit">Screen Scraping with HTMLUnit</h3> <p>Let’s build a NewEBookFeed which implements Feed. 
When we implement the items method, we’ll use HTMLUnit to parse the HTML code from <a href="">Delaware County Library System</a> to find the newest items. This requires digging around the source HTML a bit to understand the structure and find useful patterns. Basic knowledge of <a href="">XPath</a> is required to leverage those patterns. After inspecting the source code and following the appropriate links, we can view the New Ebook page source and parse out the new titles, authors, and image URLs.</p> <script src=""></script> <p>That’s it! You can find my complete code as part of my <a href="">scrape library</a>, specifically the com.josephpconley.books and com.josephpconley.rss packages. We can test the code by running the following:</p> <script src=""></script> <h2 id="deploy-using-play">Deploy using Play</h2> <p>Now that we have a way to generate an up-to-date RSS feed, we need a way to update our feed periodically and make it publicly available to an RSS reader like <a href="">feedly</a> (my personal favorite). We could handle this a few different ways (e.g. schedule a cron job to push a file to our Dropbox folder); however, I’d like to demonstrate how to handle both the scheduling and file writing/serving using the <a href="">Play Framework</a>.</p> <p>Start a new Play Scala project, and either package our ebook project as a jar and copy it to the lib folder, or just copy and paste the source code into the new Play project (I’ve done the former).</p> <h3 id="akka-scheduler">Akka Scheduler</h3> <p>To hook into Play’s Akka scheduler, we create a Global object in the app folder and override the onStart method, which allows us to run code once the application starts. The Akka system scheduler allows you to schedule a recurring process that repeats at an interval given as a Duration. 
In our case, since the site doesn’t update that frequently and we want to be respectful by not overloading the site with requests, we’ll set the duration to 12 hours.</p> <script src=""></script> <p>From there, it’s simply a matter of building out a controller with some routes to host the updated file (a straightforward exercise I’ll leave to the reader). I personally included this code and hosted the RSS feeds in <a href="">my own Play app</a> running on Heroku.</p> <h2 id="drawbacks">Drawbacks</h2> <p>One drawback you might have noticed from this specific example is the possibility of the target site’s source code changing. We relied on very specific HTML tags, text, and class attributes to query the information we needed, and should the site be rewritten significantly, we would likely have to rewrite our scraping code to accommodate the change.</p> <h2 id="conclusion">Conclusion</h2> <p>Managing the daily flow of information can be a challenge. With a little bit of coding, however, we can gain finer control over the information we consume, helping us be more productive in our everyday life.</p> Mon, 27 Jan 2014 00:00:00 +0000