Joe Conley Tagged scala Random thoughts on technology, books, golf, and everything else that interests me http://www.josephpconley.com/name/scala Notebook Driven Development <p>Fellow Spark developers, hearken to me! How fast is your Spark development cycle? Slow? Really slow? You could use this <a href="http://www.josephpconley.com/2017/10/12/spark-template.html">super awesome template</a> to enable running your Spark jobs in IntelliJ, but sometimes you’re constrained by the size/locality of the data you’re working with, and you find that each re-run takes time (which is precious and finite and all that so yes, this stuff matters).</p> <p>The craftier of you might turn to that most estimable of tools, the REPL (Read-Eval-Print Loop) for quick command-line iteration. And that’s a good start. I use the Scala REPL on a daily basis, mostly to verify proper date/time formats and regex testing. Using the REPL with Spark, you don’t have the overhead of starting up/shutting down the SparkContext and you can quickly test out things with immediate feedback (cool). And you can enter the REPL from SBT using the <code class="highlighter-rouge">console</code> command, giving you access to the classes/utilities you’ve built in that project and the project dependencies (very cool).</p> <h2 id="a-better-way">A Better Way</h2> <p>So yes, the REPL is nice and all but you can go even FURTHER, FASTER with notebooks like <a href="https://zeppelin.apache.org/">Apache Zeppelin</a>. Zeppelin (like <a href="http://jupyter.org/">Jupyter</a>) allows you to write snippets of runnable code in notebooks and execute them from the browser. What separates Zeppelin from Jupyter is how well it works out of the box with Spark. Spark is the default interpreter for Zeppelin and provides the spark and sql contexts for you implicitly. 
You also get great visualizations of SQL queries for free.</p> <table class="image"> <caption align="bottom">Simple SQL query using Zeppelin's bank example</caption> <tr><td><img src="/assets/zeppelin-sql.png" alt="Simple SQL query using Zeppelin's bank example" /></td></tr> </table> <p><br /></p> <table class="image"> <caption align="bottom">Simple SQL query with bar graph and form input</caption> <tr><td><img src="/assets/zeppelin-bar.png" alt="Simple SQL query with bar graph" /></td></tr> </table> <p><br /></p> <p>With Zeppelin, if you’re trying to query some dataset and want to understand its total size, the cardinality of a column, or simple descriptive statistics, you can do that immediately from the notebook itself with simple SQL queries. This sounds trivial but it ABSOLUTELY saves you time and effort by giving you a tight feedback loop when asking questions of data and by sparing you from reloading it every single time (when you use <code class="highlighter-rouge">cache</code>). In addition, you get documentation for free with Markdown, data visualization support with Angular, a growing set of modules in the Big Data ecosystem, and simple support for collaboration and sharing among your team.</p> <p>I also think Zeppelin helps you write more scalable Spark code. Writing code in paragraphs reinforces the idea of making methods as small and concise as possible. Once these chunks of code are worked out, building out your codebase is more or less a matter of composing these chunks into logical classes or methods.</p> <p>Zeppelin does have its drawbacks. Switching between your actual code and the notebook can be challenging, so you need to keep separate contexts for exploration (Zeppelin) and for crafting a solution (codebase), and stick to them. Also, dependency management is too manual. I would love for Zeppelin to know everything my Spark job knows through some Vulcan mindmeld or something (did I use that term correctly? I’m not a Trekkie. 
I’m a whatever-you-call-Tolkien-book-lover-two-generations-removed. Ringer? Inkling? Istari?).</p> <h2 id="big-idea-section">Big Idea Section</h2> <p>Ultimately, I think Zeppelin is a great tool if you’re a Spark developer trying to build scalable systems in a reasonable amount of time. I think notebooks are <a href="https://www.youtube.com/watch?v=oHGK96-WixU">“what’s next”</a>. I think speed of development can be a big bottleneck to the software engineering process, especially when working with large volumes of data. I also think, most importantly, that any company of reasonable size needs a certain level of useful, live documentation to understand just what the hell they’re doing.</p> <p>Because knowledge is power right? Isn’t all of this “coding”, “documentation”, and “testing” just different ways to represent knowledge? Ultimately <a href="http://www.lifeissues.net/writers/gro/gro_056heidegger.html">knowledge is just a tool</a>, a means to achieve some goal. It’s incumbent on us as engineers to use the best tools we can to accomplish our goals. I think Zeppelin is one such tool. I also think we could take this idea further and eventually get to the point where all of the code we write is just simple chunks, easily composable with minimal overhead (why do we spend so much time on packaging and deployment?). Or maybe we’re wasting our time and we should let <a href="https://www.oreilly.com/ideas/artificial-intelligence-in-the-software-engineering-workflow">AI do our dirty work</a> for us? 
Who knows, but for now, I guess we keep on…</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/_h9MxNn8P7w?rel=0" frameborder="0" allowfullscreen=""></iframe> <p><br /></p> Tue, 28 Nov 2017 00:00:00 +0000 http://www.josephpconley.com/2017/11/28/notebook-drvien-development.html http://www.josephpconley.com/2017/11/28/notebook-drvien-development.html High-Leverage Development with Giter8 Templates <p><img src="https://images.gr-assets.com/books/1427583285l/25238425.jpg" /> <img src="https://images.gr-assets.com/books/1347800461l/324750.jpg" /></p> <p><a href="https://twitter.com/edmondlau">Edmond Lau</a> talks a lot about leverage in his book <a href="https://www.goodreads.com/book/show/25238425-the-effective-engineer?from_search=true">The Effective Engineer</a>, a term he borrowed from Andy Grove’s <a href="https://www.goodreads.com/book/show/324750.High_Output_Management?ac=1&amp;from_search=true">High Output Management</a>. Both are excellent reads, especially for programmers looking to maximize the impact they have on their teams. The term <em>leverage</em> gets to the heart of this. It describes activities that create a disproportionate amount of value. This feels like a much more elegant description than “10x/rockstar/ninja developer” or whatever cliche that stokes the egos of the programmer-gods. It places the focus on <em>output</em>, where it belongs!</p> <p>Some examples of high-leverage activities Lau mentions include:</p> <ul> <li>improving the onboarding processes for new hires via tutorials, documentation, and notebooks (i.e. labs)</li> <li>creating tight feedback loops to quickly validate ideas (e.g. 
use a REPL or a notebook!)</li> <li>writing tools to make you and other developers more efficient</li> </ul> <p>In this spirit, I’ve created a <a href="http://www.foundweekends.org/giter8/">Giter8</a> template to show how to <a href="https://github.com/josephpconley/spark-seed.g8">create an SBT-based Spark project</a> with the following accouterments:</p> <ul> <li>utilities for logging and writing dataframes in common formats</li> <li>configuration via <a href="https://github.com/typesafehub/config">Typesafe Config</a></li> <li>building the fat jar via <a href="https://github.com/sbt/sbt-assembly">sbt-assembly</a></li> <li>release support via <a href="https://github.com/sbt/sbt-release">sbt-release</a></li> <li>support for running your Spark job in IntelliJ</li> </ul> <p>This has saved me a significant amount of time in starting new Spark jobs or testing out quick proof-of-concepts. Simply run <code class="highlighter-rouge">sbt new josephpconley/spark-seed.g8</code> and you’re all set! Enjoy!</p> Thu, 12 Oct 2017 00:00:00 +0000 http://www.josephpconley.com/2017/10/12/spark-template.html http://www.josephpconley.com/2017/10/12/spark-template.html Real World Spark Lessons <p>I’ve enjoyed learning the ins and outs of <a href="https://spark.apache.org/">Spark</a> at my current client. I’ve got a nice base SBT project going where I use Scala to write the Spark job, <a href="https://github.com/typesafehub/config">Typesafe Config</a> to handle configuration, <a href="https://github.com/sbt/sbt-assembly">sbt-assembly</a> to build out my artifacts, and <a href="https://github.com/sbt/sbt-release">sbt-release</a> to cut releases. Using this as my foundation, I recently built a Spark job that runs every morning to collect the previous day’s data from a few different datasources, join some reference data, perform a few aggregations and write all of the results to Cassandra. 
All in roughly three minutes (not too shabby).</p> <p>Here are some initial lessons learned:</p> <ul> <li>Be mindful of when to use <code class="highlighter-rouge">cache()</code>. It marks a point in your DAG whose result is kept in memory once computed, so later actions don’t need to re-compute the same instructions. I ended up using this before performing my multiple aggregations.</li> <li><a href="https://avro.apache.org/">Apache Avro</a> is really really good at data serialization. It should be the default choice for large-scale data writing into HDFS.</li> <li>When using <code class="highlighter-rouge">pivot(column, range)</code>, it REALLY helps if you can enumerate the entire range of the pivot column values. My job time was cut in half as a result of passing all possible values. More here on <a href="https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html">the Databricks blog</a>.</li> <li>Cassandra does upserting by default, so I didn’t even need to worry about primary key constraints if data needs to be re-run (idempotency is badass).</li> </ul> <p>Recently, I was asked to update my job to run every 15 minutes to grab the latest 15 minutes of data (people always want more of a good thing). So I somewhat mindlessly updated my cronjob and didn’t re-tune any of the configuration parameters (spoiler alert: bad idea). Everything looked good locally and on our test cluster, but when it came time for production, WHAM! My job was now taking 5-7 minutes even though it was processing only a fraction of the data from the daily runs. Panic time!</p> <p><img src="/assets/fry-panic.jpg" alt="Philip J. Fry Panicking" /><br /></p> <p>After wading through my own logs and some cryptic YARN stacktraces, it dawned on me to check my configuration properties. One thing in particular jumped out at me:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark.sql.shuffle.partitions = 2048 </code></pre></div></div> <p>I had been advised to set this value when running my job in production. 
And it worked well for my daily job (cutting down on processing time by 30s). However, now that I was working with data in a 15-minute time window, this was WAY too many partitions. The additional runtime resulted from the overhead of using so many partitions for so little data (my own theory, correct me if I’m wrong). So I disabled this property (defaulting to 200) and my job started running in ~2 minutes, much better!</p> <p><img src="/assets/futurama-happy.jpg" alt="Futurama gang happy" /><br /></p> <p>(UPDATE: after some experimentation on the cluster, I set the number of partitions to 64)</p> <p>More lessons learned:</p> <ul> <li>ALWAYS test your Spark job on a production-like cluster as soon as you make any changes. Running your job locally vs. running your job on a YARN/Mesos cluster is about as similar as running them on Earth vs. Mars, give or take.</li> <li>You REALLY should know the memory/CPU stats of your cluster to help inform your configuration choices. You should also be mindful of what other jobs run on the cluster and when.</li> <li>Develop at least a basic ability to <a href="https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html">read and understand the Spark UI</a>.<br /> It’s got a lot of useful info, and with event logging you can see the improvements of your incremental changes in real time.</li> </ul> <p>Let me give another shout-out to Typesafe Config for making my life easier. I have three different ways (env variables, properties file, command line args) to pass configuration to my Spark job and I was able to quickly tune parameters using all of these options. Interfaces are just as important to developers as they are to end users!</p> <p>All in all this was a fun learning experience. I try to keep up on different blogs about Spark but you really don’t get a good feel for it until you’re actually working on a problem with production-scale infrastructure and data. 
I think this is a good lesson for any knowledge work. You need to <a href="https://www.farnamstreetblog.com/2013/04/the-work-required-to-have-an-opinion/">do the work</a> to acquire knowledge. This involves not just reading but challenging assumptions, proving out ideas, and <a href="http://www.nytimes.com/1997/07/27/sports/hogan-constant-focus-on-perfection.html?src=pm">digging knowledge out of the dirt</a>. Active engagement using quick feedback loops will lead to much deeper and usable knowledge, and that’ll make you, as Mick would say, <a href="https://www.youtube.com/watch?v=o0CXUv-xxtY">“a very dangerous person!”</a></p> <p>Party on!</p> <p><img src="https://media.giphy.com/media/vMnuZGHJfFSTe/giphy.gif" alt="Wayne Zang" /><br /></p> Wed, 31 May 2017 00:00:00 +0000 http://www.josephpconley.com/2017/05/31/real-world-spark-lessons.html http://www.josephpconley.com/2017/05/31/real-world-spark-lessons.html Scala By The Schuylkill Recap <p>This past Tuesday I had the pleasure of attending the <a href="http://scala.comcast.com/">Scala by the Schuylkill conference</a> at Comcast headquarters in downtown Philadelphia. Initially begun as an internal Scala conference, the organizers opened the conference this year to external folks interested in Scala. I learned a lot from this event, gaining perspective on trends in the Scala community and sparking curiosity in several interesting applications of the Scala language.</p> <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Our <a href="https://twitter.com/hashtag/ScalaByTheSchuylkill?src=hash">#ScalaByTheSchuylkill</a> organizers with keynote speaker <a href="https://twitter.com/sreekotay">@sreekotay</a>! 
<a href="https://twitter.com/hashtag/onbreak?src=hash">#onbreak</a> <a href="https://twitter.com/hashtag/scala?src=hash">#scala</a> <a href="https://t.co/yyJoTfkljm">pic.twitter.com/yyJoTfkljm</a></p>&mdash; Comcast Careers (@comcastcareers) <a href="https://twitter.com/comcastcareers/status/823903924394610694">January 24, 2017</a></blockquote> <script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script> <p><br /></p> <p>The keynote speeches were the highlight of the conference for me. Comcast’s CTO, <a href="https://twitter.com/sreekotay">Sree Kotay</a>, gave an engaging talk on the culture of innovation at Comcast and how they’ve evolved into a “technology first” company (as quoted recently by their CEO Brian Roberts). He also explained their rationale for using Scala for certain projects, noting the interoperability with Java, modularity, and its ability to draw top talent as key factors of adoption. He even showed off his geek credentials by detailing his love/hate relationship with a certain Scala web service library. It’s clear that Sree is an engineer at heart and it was refreshing to see that the CTO of a multi-billion dollar company still enjoys tinkering with code.</p> <p><a href="https://twitter.com/mpilquist">Michael Pilquist</a> gave the other keynote, doing a masterful job in explaining the <a href="https://speakerdeck.com/mpilquist/realistic-functional-programming">value of functional programming</a>. He boiled down the essence of FP as managing the complexity of both state and control flow via composability and small expressions in isolation. He also demystified category theory, an area of mathematics I’ve always found interesting but never really saw the practical use for until now. He stressed that category theory in programming is used to achieve precision by finding the appropriate level of abstraction for a given problem to focus on the essential. 
Michael put these ideas in an accessible and interesting context, and I also appreciated his book recommendation, <a href="https://www.goodreads.com/book/show/23360039-how-to-bake-pi"><em>How to Bake Pi</em></a> by Eugenia Cheng, which I’m currently devouring.</p> <p>A great variety of talks followed, touching on interesting topics like GIS, machine learning, microservices, and streaming with a focus on tools like Akka and Spark. About half of the speakers were from Comcast, and it was interesting to see the problems they’ve had to solve and why they chose Scala to solve them (hint: they work with data, a LOT of it). I came away with at least a dozen different TODOs to research new libraries or techniques. I also enjoyed meeting new people and catching up with some past colleagues. As an introvert, I don’t focus much on networking and relationship building, but a tech conference focused on a specific technology like Scala creates an environment that’s very conducive to meeting new people and learning about their work.</p> <p>I’m happy to see an important tech company like Comcast invest so much time and energy into both the Scala ecosystem and the local Scala community here in Philadelphia. It’s clear that, regardless of what you may have heard, Scala is here to stay!</p> <p>Special thanks to Chariot for sponsoring my attendance!</p> Fri, 27 Jan 2017 00:00:00 +0000 http://www.josephpconley.com/2017/01/27/scala-by-the-schuylkill.html http://www.josephpconley.com/2017/01/27/scala-by-the-schuylkill.html Help! My Monads are Nesting! <p>Do you build reactive applications using Scala? 
Then chances are you’ve had to deal with a <code class="highlighter-rouge">Future[Monad[T]]</code>, where <code class="highlighter-rouge">Monad</code> could be <code class="highlighter-rouge">Option</code>, <code class="highlighter-rouge">Either</code>, or something <a href="http://www.josephpconley.com/2016/07/18/an-ode-to-or.html">more wonderful</a> like <code class="highlighter-rouge">Or</code>. While these monads do nest as expected, the syntax and code flow can start to get pretty messy (motivating example below).</p> <p>Enter <a href="https://github.com/chariotsolutions/scala-commons#futureor">FutureOr</a>! This utility makes it super-simple to sequence several <code class="highlighter-rouge">Future[Or[T]]</code> calls. It’s also fairly easy to implement, so you could easily clone this and customize for your favorite nested monad combination.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.scalactic._

// three service calls which return Future[Or[T]], each depending on the previous call
trait IntService {
  def callA: Future[Int Or One[ErrorMessage]]
  def callB(a: Int): Future[Int Or One[ErrorMessage]]
  def callC(b: Int): Future[Int Or One[ErrorMessage]]
}

val service: IntService = ...

// without FutureOr, really ugly! I wouldn't wish this on my worst enemy!
// each step pattern matches on the Or inside the Future before making the next call
val result: Future[Int Or One[ErrorMessage]] = service.callA.flatMap {
  case Good(goodA) =&gt; service.callB(goodA).flatMap {
    case Good(goodB) =&gt; service.callC(goodB)
    case Bad(e) =&gt; Future.successful(Bad(e))
  }
  case Bad(e) =&gt; Future.successful(Bad(e))
}

// with FutureOr, so much better!
val result: Future[Int Or One[ErrorMessage]] = (for {
  a &lt;- FutureOr(service.callA)
  b &lt;- FutureOr(service.callB(a))
  c &lt;- FutureOr(service.callC(b))
} yield c).future
</code></pre></div></div> Thu, 17 Nov 2016 00:00:00 +0000 http://www.josephpconley.com/2016/11/17/future-or.html http://www.josephpconley.com/2016/11/17/future-or.html An Ode to Or <blockquote> <p>How do I love <a href="http://www.scalactic.org/user_guide/OrAndEvery">Or</a>? Let me <a href="http://doc.scalatest.org/2.2.6/#org.scalactic.Accumulation">Accumulate</a> the ways.<br /> I love thee to the depth and breadth and height<br /> My IDE can reach, when feeling uncompiled<br /> For the Ends of Concise Code and ideal Control Flow.<br /> I love thee to the level of every day’s<br /> Most quiet need, a quiet workspace and thorough documentation.<br /> I love thee freely, as my functions strive for <a href="http://doc.scalatest.org/2.2.6/#org.scalactic.Good">Good</a>.<br /> I love thee purely, as they capture errors with <a href="http://doc.scalatest.org/2.2.6/#org.scalactic.Bad">Bad</a>.<br /> I love thee with the passion put to use<br /> In my old griefs with <a href="http://danielwestheide.com/blog/2013/01/02/the-neophytes-guide-to-scala-part-7-the-either-type.html">Either</a>, and with my childhood’s faith.<br /> I love thee with a love I seemed to lose<br /> With my lost saints of Java! – I love thee with the breath,<br /> Smiles, tears, of all my past programs! 
– and, if <a href="https://twitter.com/bvenners">Venners</a> choose,<br /> I shall but love thee better after sys.exit(0).</p> </blockquote> <p>A modern interpretation of <a href="http://imaginativeliteratureforeccentrics.blogspot.com/2008/03/sonnet-43.html">Elizabeth Barrett Browning’s Sonnet 43</a> inspired by the estimable <a href="http://www.scalactic.org/user_guide/OrAndEvery">Scalactic</a> library.</p> Mon, 18 Jul 2016 00:00:00 +0000 http://www.josephpconley.com/2016/07/18/an-ode-to-or.html http://www.josephpconley.com/2016/07/18/an-ode-to-or.html JSONPath Library for Play <p>I’ve been working on a platform that transforms, composes, and serves data. As part of this effort, I’ve developed a <a href="https://github.com/josephpconley/play-jsonpath">library for Play</a> that performs a JSONPath query on a Play JsValue. You can learn about JSONPath by reading <a href="http://goessner.net/articles/JsonPath/">Stefan Goessner’s blog post</a> on the subject.</p> <p>I use <a href="https://github.com/gatling/jsonpath">Gatling’s jsonpath library</a> to parse the JSONPath expression. I then fold over the tokens, performing a pattern match on each to construct the appropriate JsValue. This parser supports all queries except for queries that rely on expressions of the underlying language like <code class="highlighter-rouge">$..book[(@.length-1)]</code>. However, there’s usually a ready workaround as you can execute the same query using <code class="highlighter-rouge">$..book[-1:]</code>.</p> <h2 id="example">Example</h2> <p>Here’s a Scala worksheet which traces the examples on Stefan’s post:</p> <script src="https://gist.github.com/josephpconley/10647739.js"></script> <h2 id="deviation-from-jsonpath">Deviation from JSONPath</h2> <p>One conscious choice I made as far as deviating from JSONPath is to always flatten the results of a recursive query. 
Using the bookstore example, typically a query of <code class="highlighter-rouge">$..book</code> will return an array with one element, the array of books. If there were another book array somewhere in the document, then <code class="highlighter-rouge">$..book</code> would return an array with two elements, both arrays of books. However, if you were to query <code class="highlighter-rouge">$..book[2]</code> for our example, you would get the third book in the first array (indices are zero-based), which assumes that the <code class="highlighter-rouge">$..book</code> result has been flattened. In order to make recursion easier and the code simpler, I always flatten the result of recursive queries regardless of the context.</p> <p>If you have any questions, comments, or suggestions please let me know. I hope to be introducing an early iteration of my data platform shortly so stay tuned!</p> Tue, 15 Apr 2014 00:00:00 +0000 http://www.josephpconley.com/2014/04/15/jsonpath-for-play.html http://www.josephpconley.com/2014/04/15/jsonpath-for-play.html Building better APIs with Play! <p>This is the technical companion to my Point.io post, <a href="http://point.io/article/building-better-apis-play">Building better APIs with Play!</a>. Herein lies coding examples galore!</p> <h2 id="restful-architecture---routing">RESTful Architecture - Routing</h2> <p>The routes file of a Play app allows you to define the HTTP verb, the route path/pattern, and the corresponding method from the controller. In addition to denoting basic path variables, you can also use regular expressions to match on specific routes (e.g. xml or html formats). 
What’s great about this approach is that this file is compiled along with the source code, so any mistakes like an incorrect controller method or invalid HTTP verb will be caught at compile time.</p> <script src="https://gist.github.com/josephpconley/9337208.js"></script> <h2 id="action-composition">Action composition</h2> <p>We define two types of custom actions: atomic and composed. Atomic actions can be used stand-alone or as building blocks to be composed with other actions. We use the following pattern for building an atomic action.</p> <script src="https://gist.github.com/josephpconley/9345681.js"></script> <p>The object allows us to use the action by itself, and the case class allows us to compose this action with other actions.</p> <script src="https://gist.github.com/josephpconley/9345730.js"></script> <p>A composed action is strictly syntactic sugar, making it more convenient to combine behaviors and keeping the controller code more concise. We define composed actions using just an object.</p> <script src="https://gist.github.com/josephpconley/9345746.js"></script> <h2 id="filters">Filters</h2> <p><a href="http://www.playframework.com/documentation/2.2.2/ScalaHttpFilters">Filters</a> are handy for cross-cutting concerns. We’ve had one use case where it was necessary to modify every JSON response with links to metadata. We achieved this using a filter and the <a href="http://www.playframework.com/documentation/2.2.2/Enumeratees">Play Enumeratee library</a>.</p> <script src="https://gist.github.com/josephpconley/9345957.js"></script> <h2 id="json---global-messaging">JSON - Global messaging</h2> <p>Building an effective API requires being responsive to users in a comprehensive manner. All conceivable events should be handled appropriately, such as incorrect requests from the client or internal server errors. Creating a Global object allows you to generically craft responses to handle these situations. 
We define methods to handle events like internal errors, route not found, or a bad request.</p> <script src="https://gist.github.com/josephpconley/9345819.js"></script> <h2 id="conclusion">Conclusion</h2> <p>Play is well-equipped to handle the nuances of API development and maintenance. We’re pleased with the stability and performance we’ve seen thus far and are looking forward to continuing down this path of <a href="http://www.reactivemanifesto.org/">reactive goodness</a>.</p> Tue, 04 Mar 2014 00:00:00 +0000 http://www.josephpconley.com/2014/03/04/building-better-apis-with-play.html http://www.josephpconley.com/2014/03/04/building-better-apis-with-play.html (Triz)Swagging out at the Philly Codefest <p>This past weekend I teamed up with some buddies from <a href="http://point.io">Point.io</a> (<a href="https://twitter.com/twrivera">Angel</a>, <a href="https://twitter.com/jxshin75">Jon</a>, and <a href="https://twitter.com/dyang_pointio">Dylan</a>) to participate in my first hackathon, <a href="http://phillycodefest.com/">Philly Codefest</a>. We spent the weekend bringing Angel’s dream to life: a platform called TrizSwagger to analyze and track the use of “swag” (i.e. T-shirts, office supplies, and other marketing mathoms). We leveraged social media and geolocation to give companies real-time visibility to their marketing campaigns. Feel free to check out the app <a href="http://www.trizswagger.com/">here</a>.</p> <table class="image"> <caption align="bottom">Angel demoing Point.io's apiDoc service</caption> <tr><td><img src="/assets/angel.jpg" alt="Angel promoting Point.io" /></td></tr> </table> <p> <br /></p> <h2 id="lessons-learned">Lessons Learned</h2> <p>Good programming is always concerned with simplicity and efficiency, whether it’s using efficient data structures and algorithms, conciseness in your codebase, or even naming variables properly. Building an app in 24 hours, however, throws the need for simplicity and efficiency into sharp relief. 
Here are a few takeaways from my experience.</p> <h3 id="coast-to-coast-json">Coast-to-Coast JSON</h3> <p>I’ve always been a big fan of <a href="http://en.wikipedia.org/wiki/Domain-driven_design">Domain-driven Design</a>. Writing POJOs in Java and case classes in Scala can provide a clear crystallization of the main actors of your application. However, models may not always be necessary, especially if your app is backed by a service/database which gives you JSON (TrizSwagger is backed by <a href="http://www.mongodb.com/">MongoDB</a> and served by <a href="http://flask.pocoo.org/">Flask</a>). The extra layer of complexity in marshalling/unmarshalling between JSON and your model can hinder the performance and readability of your code, especially if you’re using heavy ORM frameworks like Hibernate. During a hackathon, if you’re rapidly making changes to the model you’ll surely get slowed down. For a more thorough treatment of JSON Coast-to-Coast, check out <a href="http://mandubian.com/2013/01/13/JSON-Coast-to-Coast/">the Mandubian Blog</a>.</p> <h3 id="knockoutjs-mapping-plugin">Knockout.js Mapping plugin</h3> <p>In order to implement coast-to-coast design effectively, it’s important to have a front-end framework that manages JSON well. One such framework I’m fond of is Knockout.js, more specifically their <a href="http://knockoutjs.com/documentation/plugins-mapping.html">mapping plugin</a>. This plugin automatically maps a JSON message into a JavaScript observable object. You can then code the front-end directly against object properties without having to pre-define a viewmodel. You can also customize how objects are mapped by either modifying or enhancing the created object. This strategy proved quite helpful during the hackathon as any change to our back-end API only had to be handled in one place (the front-end).</p> <p>One caveat is the creation of a new object using this plugin. 
Since the plugin requires a JSON object to build out the observable, I wrote a basic method in Play to take an expected JSON object and “empty” it, setting default values that would be used in the new object form.</p> <script src="https://gist.github.com/josephpconley/9207995.js"></script> <h3 id="understanding-your-tools">Understanding your tools</h3> <p>TrizSwagger integrates with both Twitter and Facebook. Understanding and setting up those integrations, however, occupied a lot of our time. We also ran into issues with a server on OpenShift, slowing us down further. Ultimately I think simpler is better, and every choice in technology needs to be well thought-out and well-suited to its use case.</p> <h2 id="conclusion">Conclusion</h2> <p>Overall it was an awesome experience. Even though we didn’t win (or place, or show for that matter), we learned a lot and we still took the time to mentor other teams who were using the <a href="http://point.io/pointio-platform">Point.io API</a>. Great job Team TrizSwagger!</p> <table class="image"> <caption align="bottom">Jon, Dylan, and Joe doing some last-minute coding</caption> <tr><td><img src="/assets/triz.jpg" alt="Jon, Dylan, and Joe doing some last-minute coding" /></td></tr> </table> Tue, 25 Feb 2014 00:00:00 +0000 http://www.josephpconley.com/2014/02/25/phillycodefest-trizswagger-lessons-learned.html http://www.josephpconley.com/2014/02/25/phillycodefest-trizswagger-lessons-learned.html Accenture Match Play Statistics <p>February madness is here! The <a href="http://www.worldgolfchampionships.com/accenture-match-play-championship.html">Accenture Match Play Championship</a> starts today, and although the usual big names of Tiger, Phil, and Adam Scott are absent this year, there still promises to be some drama. Can Matt Kuchar become the first player not named Tiger to go back-to-back? Will Jimmy Walker improve upon his obscene winning percentage this year? 
Will past Ryder Cup emotions fuel players to victory (looking at you Mr. Poulter)? We’ll find out.</p> <p>The other source of drama is due to the vagaries of the match play format. Underdogs regularly upset favorites, and only one person has won the event as a number one seed (<a href="http://en.wikipedia.org/wiki/WGC-Accenture_Match_Play_Championship">Tiger</a>). This makes completing a bracket the ultimate exercise in futility. While I’m sure <a href="http://www.cbssports.com/collegebasketball/eye-on-college-basketball/24416823/warren-buffett-dan-gilbert-offering-1-billion-for-perfect-tourney-bracket">Warren Buffett’s billion dollar NCAA wager</a> is very safe at odds of 1 in 9.2 quintillion, I’d imagine a similar wager on this tournament would be even safer (though statistically it’s the same odds).</p> <p>Anyways, I’ll be looking at some random statistics as the tournament progresses. One such stat is average holes played per match. If you fill out your bracket on <a href="http://fantasy.pgatour.com/">pgatour.com</a>, you’re asked to put in a tiebreaker of total holes played by the champion of the tournament. A champion will have played six matches in total. For each match, a winner can play fewer than 18 holes, the full 18, or more than 18 if still tied (here’s a good primer on <a href="http://golf.about.com/od/beginners/a/matchplayscore.htm">match play scoring</a> for the uninitiated). I took a rough guess that on average, a match ends after the 16th hole. But I wanted to be sure (I’m in a money league for my bracket, can’t hold anything back). So naturally, I turned to programming.</p> <p>Here’s a very simple <a href="https://github.com/josephpconley/scala/blob/master/scrape/src/main/scala/com/josephpconley/golf/MatchPlay.scala">program to gather data on past matches</a>. This parses out the holes played for roughly 2,000 matches in this event from 2005 to present. The result was that the average holes played per match is 16.64116095. 
For the tiebreaker, you’d multiply this by 6 to get (roughly) 100.</p> <p>That’s it for now, I’ll try to dig deeper and come up with more interesting stats. If you have any suggestions feel free to comment!</p> Wed, 19 Feb 2014 00:00:00 +0000 http://www.josephpconley.com/2014/02/19/accenture-match-play-stats.html http://www.josephpconley.com/2014/02/19/accenture-match-play-stats.html Hacking NPR's Sunday Puzzle <p>I’m a big fan of puzzles. I’ll often start my day attempting the Philadelphia Inquirer’s jumble and crossword, with varying degrees of success. One puzzle I never miss is <a href="http://www.npr.org/series/4473090/sunday-puzzle">NPR’s Weekend Edition Puzzle</a> featuring New York Times puzzle editor Will Shortz. At the end of each segment, he poses a question to the audience, and occasionally these questions can be solved with the help of programming. To that end, I’ve built an app to help non-programmers solve these puzzles. I’ve also added common puzzle utilities like an Anagram checker and a Scrabble solution generator.</p> <p>You can find a running version of the puzzle solver <a href="http://app.josephpconley.com/puzzles">here</a>. The Scala library of the puzzle utilities can be found <a href="https://github.com/josephpconley/scala/tree/master/puzzles">here</a>. This project also has an npr package which shows examples of programs written to solve past NPR puzzles.</p> <h2 id="puzzle-solver">Puzzle Solver</h2> <p>This single-page app searches through a specified list of words, looking for one of three things: anagrams, potential Scrabble solutions, or most powerfully, a regular expression. We’ll use this last mode to solve a recent NPR puzzle.</p> <h3 id="puzzle-modes">Puzzle Modes</h3> <h4 id="anagrams">Anagrams</h4> <p>This mode will search for all potential anagrams of the input word. Helpful for solving the jumble commonly found in your newspaper. 
For example, here are today’s four jumble clues from the Inquirer:</p> <div class="well well-lg"> GREEV WORNC KNITSY KRUTYE </div> <p>Setting the app controls to Mode = Anagram and Word List Source = Scrabble (a good list source for most purposes), we set Input once for each jumble and after hitting Submit, we get one proper anagram for each jumble.</p> <h4 id="scrabble">Scrabble</h4> <p>This mode will search for all possible valid Scrabble words based on the letters (i.e. your Scrabble hand) provided. You can also specify how many wild cards (i.e. “blanks”) are in your hand. This is useful not only to help find solutions but to verify solutions as well (faster than leafing through a Scrabble dictionary).</p> <h4 id="regular-expressions">Regular expressions</h4> <p>This is the most powerful mode. This will return all words matching a valid Java regular expression. You can find a good tutorial about Java regular expressions <a href="http://www.vogella.com/tutorials/JavaRegularExpressions/article.html">here</a>.</p> <h3 id="word-lists">Word Lists</h3> <p>I’ve gathered two common word lists, a list of valid Scrabble words mentioned <a href="http://pzxc.com/embed-flash-scrabble-dictionary-text-file">here</a> and the <a href="http://www.freebsd.org/cgi/cvsweb.cgi/src/share/dict/web2?rev=1.12;content-type=text%2Fplain">UNIX word list</a>. I’ve also added a space to add a custom list of words to search.</p> <h2 id="technology">Technology</h2> <p>I’ve built this app using <a href="http://www.playframework.com/documentation/2.2.x/ScalaHome">Play Scala</a> as the backend. After importing my puzzles library, here’s the relevant controller code:</p> <script src="https://gist.github.com/josephpconley/8862621.js"></script> <p>I probably could have handled the JSON a bit more safely by using a Reads[T] object to handle the parsing, but as this app was fairly simple I used the unsafe conversion JsValue.as. 
Please don’t think less of me!</p> <p>I’ve also employed <a href="http://knockoutjs.com">Knockout.js</a> to manage the front-end functionality. Knockout.js is a lightweight MVVM framework which manages DOM updates automatically and succinctly, ensuring that your front-end code is not a monolith of jQuery calls. Here’s the code for the front-end:</p> <script src="https://gist.github.com/josephpconley/8862697.js"></script> <p>And that’s it! A good future exercise would be to stream the solutions reactively, especially when dealing with a long word list. This would be done using Play’s <a href="http://www.playframework.com/documentation/2.1.x/Enumeratees">Enumeratee library</a>. If anyone’s interested in that I can post a follow-up detailing that solution.</p> <h2 id="puzzle-solution---double-s-non-programmers">Puzzle Solution - Double S (Non-Programmers)</h2> <p>Now our tool is ready to help us solve a recent puzzle. Here’s the question, reprinted from <a href="http://www.npr.org/2014/01/26/266210037/take-synonyms-for-a-spin-or-pirouette">NPR’s website</a>:</p> <div class="well well-lg">What word, containing two consecutive S's, becomes its own synonym if you drop those S's?</div> <p>Using Regex mode and our Scrabble word list, we can define the regular expression <code class="highlighter-rouge">.*ss.*</code> to return all words with a double S:</p> <p><img src="/assets/regex.bmp" alt="Regex Step One" /></p> <p>From there, we use Excel to store this list in the first column. We can then copy this list into the second column and do a Find/Replace to remove instances of “SS”. We could inspect each row for synonyms but as we have 1976 results, that’s a lot of words to inspect. To further narrow down our choices, I used Excel to transpose the second column into one row, copy that into a text file and do another Find/Replace by highlighting the space between each word and replacing with a vertical bar. 
This gives us another regular expression that should look like this:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abbe|aby|abyes|...|zestle|ziple|zonele </code></pre></div></div> <p>Using that regular expression, we can use our Puzzle Solver to determine which words in the list are valid words and which aren’t (we only care about words that are valid). This second search takes significantly longer, but we’ll eventually get back a list of 156 valid words. Plugging this list back into Excel, we can use a VLOOKUP to identify the pairs of words which are valid. After sorting to see valid word pairs first, we can inspect to see which pairs are synonyms. Luckily, the answer appears fairly early in the list (spreadsheet can be downloaded <a href="https://github.com/josephpconley/scala/raw/master/puzzles/src/main/resources/SS.xlsx">here</a>):</p> <p><img src="/assets/SS.bmp" alt="Regex Solution" /></p> <h2 id="puzzle-solution---double-s-programmers">Puzzle Solution - Double S (Programmers)</h2> <p>The previous non-coding solution might have seemed a bit convoluted. A much simpler method would be to use my Scala library directly to find the solutions, which can be done with as little as four lines of code:</p> <script src="https://gist.github.com/josephpconley/8915428.js"></script> <h2 id="conclusion">Conclusion</h2> <p>If you enjoy NPR’s Sunday Puzzle, I’d highly recommend <a href="http://puzzles.blainesville.com/">Blaine’s Puzzle Blog</a> as an excellent companion resource. This blog community offers tantalizing, interesting hints for the solution of the puzzle and often digresses into other challenging puzzles as well.</p> <p>Happy puzzling!</p> Mon, 10 Feb 2014 00:00:00 +0000 http://www.josephpconley.com/2014/02/10/hacking-npr-sunday-puzzle.html http://www.josephpconley.com/2014/02/10/hacking-npr-sunday-puzzle.html Scala 101 <p>I recently gave an introductory talk about Scala for my uninitiated Point.io hackers. 
Here are the slides for future reference. Enjoy!</p> <div class="row"> <iframe src="//www.slideshare.net/slideshow/embed_code/43254707" width="800" height="600" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe> </div> Wed, 29 Jan 2014 00:00:00 +0000 http://www.josephpconley.com/2014/01/29/scala-101.html http://www.josephpconley.com/2014/01/29/scala-101.html Roll Your Own Notification Service <p>Have you ever wished you could receive customized updates whenever your favorite websites update their content? Most sites offer the means to get notified when a new blog post hits the wire or new products are added to their catalog (RSS, social media, e-mail, etc.). But what if the site doesn’t use any of these services? Or what if you only want specific updates (e.g. blog posts from author X, new products containing the name Y)? Then you’re left with only one course of action: build your own notification service!</p> <p>Armed with the mighty powers of HTML scraping, the Scala programming language, and a recurring scheduling mechanism (in this case Play’s Akka scheduler), you have all the tools you need to set up your custom notification.</p> <h2 id="my-new-ebook-notification-service">My New EBook Notification Service</h2> <p>Let’s create a notification service which lets us know when new ebooks are available at my local digital library, <a href="http://digitallibrary.delcolibraries.org/">Delaware County Library System</a>. At the time of this writing, no such notification service exists. As I’d prefer not to miss any notifications, I’d like to set up an RSS feed. Specifically, we’ll write a process which periodically checks the digital library site for new ebooks and updates an RSS feed accordingly.</p> <h3 id="scalasbt">Scala/SBT</h3> <p>We’ll start out by creating a basic Scala application using SBT (you can check out a skeleton project <a href="https://github.com/josephpconley/scala/tree/master/hello-world-sbt">here</a>). 
Let’s add the <a href="http://htmlunit.sourceforge.net/">HTMLUnit</a> and <a href="http://jesseeichar.github.io/scala-io-doc/0.2.0/index.html#!/overview">Scala IO</a> libraries to our project. We’ll use HTMLUnit to parse the HTML code of the library’s website, and we’ll use Scala IO to write our XML to a file. Your build file should now look like this (assuming you named your project “ebook”):</p> <script src="https://gist.github.com/josephpconley/8584992.js"></script> <h3 id="scala-xml">Scala XML</h3> <p>Let’s start by building an abstraction for an RSS feed (you can read about the basics of RSS <a href="http://www.w3schools.com/rss/rss_reference.asp">here</a>). We’ll start with an Item case class which holds the basic properties of an RSS item and a method to generate XML. Similarly, we define the basic properties of a Feed using a trait. We’ll make this abstract in anticipation of re-using it for other feeds.</p> <script src="https://gist.github.com/josephpconley/8590722.js"></script> <h3 id="screen-scraping-with-htmlunit">Screen Scraping with HTMLUnit</h3> <p>Let’s build a NewEBookFeed which implements Feed. When we implement the items method, we’ll use HTMLUnit to parse the HTML code from <a href="http://digitallibrary.delcolibraries.org/">Delaware County Library System</a> to find out the newest items. This requires digging around the source HTML a bit to understand the structure and find useful patterns. Basic knowledge of <a href="http://www.w3schools.com/xpath/">XPath</a> is required to leverage those patterns. After inspecting the source code and following the appropriate links, we can view the New Ebook page source and parse out the new titles, authors, and image URLs.</p> <script src="https://gist.github.com/josephpconley/8590984.js"></script> <p>That’s it! 
You can find my complete code as part of my <a href="https://github.com/josephpconley/scala/tree/master/scrape">scrape library</a>, specifically the com.josephpconley.books and com.josephpconley.rss packages. We can test the code by running the following:</p> <script src="https://gist.github.com/josephpconley/8591058.js"></script> <h2 id="deploy-using-play">Deploy using Play</h2> <p>Now that we have a way to generate an up-to-date RSS feed, we need a way to update our feed periodically and make it publicly available to an RSS Reader like <a href="http://feedly.com">feedly</a> (my personal favorite). We could handle this a few different ways (e.g. schedule a cron job to push a file to our Dropbox folder); however, I’d like to demonstrate how to handle both the scheduling and file writing/serving using the <a href="http://www.playframework.com/">Play Framework</a>.</p> <p>Start a new Play Scala project, and either package our ebook project as a jar and copy it to the lib folder, or just copy and paste the source code into the new Play project (I’ve done the former).</p> <h3 id="akka-scheduler">Akka Scheduler</h3> <p>To hook into Play’s Akka scheduler, we create a Global object in the app folder and override the onStart method, which allows us to run code once the application starts. The Akka system scheduler allows you to schedule a recurring process for a given Duration. In our case, since the site doesn’t update that frequently and we want to be respectful by not overloading the site with requests, we’ll set the duration to 12 hours.</p> <script src="https://gist.github.com/josephpconley/8605053.js"></script> <p>From there, it’s simply a matter of building out a controller with some routes to host the updated file (a straightforward exercise I’d leave to the reader). 
I personally included this code and hosted the RSS feeds in <a href="http://app.josephpconley.com/rss">my own Play app</a> running on Heroku.</p> <h2 id="drawbacks">Drawbacks</h2> <p>One drawback you might have noticed from this specific example is the possibility of the target site’s source code changing. We relied on very specific HTML tags, text and class attributes to query the information we needed, and should the site be re-written significantly, it’s possible that we would have to re-write our scraping code to accommodate.</p> <h2 id="conclusion">Conclusion</h2> <p>Managing the daily flow of information can be a challenge. With a little bit of coding, however, we can gain finer control over the information we consume, helping us be more productive in our everyday life.</p> Mon, 27 Jan 2014 00:00:00 +0000 http://www.josephpconley.com/2014/01/27/roll-your-own-notification-service.html http://www.josephpconley.com/2014/01/27/roll-your-own-notification-service.html