<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>reorchestrate</title><link>https://reorchestrate.com/</link><description>Recent content on reorchestrate</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 05 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://reorchestrate.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Your binary is no longer safe: Conversion</title><link>https://reorchestrate.com/posts/your-binary-is-no-longer-safe-conversion/</link><pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/your-binary-is-no-longer-safe-conversion/</guid><description>&lt;p&gt;This post is the continuation of &lt;a href="https://reorchestrate.com/posts/your-binary-is-no-longer-safe-decompilation/" &gt;Your binary is no longer safe: Decompilation&lt;/a&gt; about the brute-force reverse engineering of binary (compiled) programs using &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener" &gt;Large Language Models&lt;/a&gt; (LLMs) to automate this two-part problem: decompilation and conversion to a modern programming language.&lt;/p&gt;
&lt;p&gt;This post covers the second part of the problem: &lt;strong&gt;conversion&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="claude-enters-the-game-"&gt;Claude enters the game &amp;hellip;&lt;/h2&gt;
&lt;p&gt;Here is a 1:1 translation produced by &lt;code&gt;claude-opus-4.5&lt;/code&gt;. You can see that my implementation does have a different signature to deal with the &lt;code&gt;rust&lt;/code&gt; borrow-checker - but you can safely ignore this for this discussion.&lt;/p&gt;</description></item><item><title>Your binary is no longer safe: Decompilation</title><link>https://reorchestrate.com/posts/your-binary-is-no-longer-safe-decompilation/</link><pubDate>Wed, 04 Feb 2026 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/your-binary-is-no-longer-safe-decompilation/</guid><description>&lt;p&gt;This post is about the brute-force reverse engineering of binary (compiled) programs using &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener" &gt;Large Language Models&lt;/a&gt; (LLMs) to automate this two-part problem: decompilation and conversion to a modern programming language.&lt;/p&gt;
&lt;p&gt;This post covers the first part of the problem: &lt;strong&gt;decompilation&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="callout info"&gt;
 
 &lt;div class="callout-title"&gt;Update 24 February 2026&lt;/div&gt;
 
 &lt;div class="callout-content"&gt;
 &lt;p&gt;Based on feedback I have split this article into two posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The first (this post) describes the process of &lt;a href="https://en.wikipedia.org/wiki/Decompiler" target="_blank" rel="noopener" &gt;decompilation&lt;/a&gt; which can be skipped if you are already familiar with this topic.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://reorchestrate.com/posts/your-binary-is-no-longer-safe-conversion/" &gt;second post&lt;/a&gt; describes the code-conversion and &lt;a href="https://en.wikipedia.org/wiki/Differential_testing" target="_blank" rel="noopener" &gt;differential testing&lt;/a&gt; approach used to verify the conversion is equivalent to the original binary.&lt;/li&gt;
&lt;/ul&gt;
 &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In this post an old &lt;a href="https://en.wikipedia.org/wiki/Multi-user_dungeon" target="_blank" rel="noopener" &gt;Multi-user Dungeon&lt;/a&gt; (MUD) game binary has been targeted (see the reasoning below) but the approach applies equally well to other tasks, such as modernizing binaries or converting legacy &lt;code&gt;COBOL&lt;/code&gt; to a modern language.&lt;/p&gt;</description></item><item><title>SQLite Transactions</title><link>https://reorchestrate.com/posts/sqlite-transactions/</link><pubDate>Tue, 16 Jul 2024 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/sqlite-transactions/</guid><description>&lt;h2 id="what-is-sqlite"&gt;What is SQLite?&lt;/h2&gt;
&lt;p&gt;In the past few years &lt;a href="https://www.sqlite.org/" target="_blank" rel="noopener" &gt;SQLite&lt;/a&gt; (not SQL-light) has had a surge of popularity as people have come to realise its power as an in-process, highly reliable SQL database engine as a backend for server processes rather than its traditional role of client or edge applications. This change in stance for SQLite has happened despite the authors almost &lt;a href="https://www.sqlite.org/whentouse.html#checklist_for_choosing_the_right_database_engine" target="_blank" rel="noopener" &gt;actively discouraging&lt;/a&gt; its use for this purpose.&lt;/p&gt;
&lt;p&gt;I am interested in SQLite for some key reasons:&lt;/p&gt;</description></item><item><title>Custom JWT Claims with Ory Kratos</title><link>https://reorchestrate.com/posts/custom-jwt-claims-with-ory-kratos/</link><pubDate>Tue, 02 Jul 2024 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/custom-jwt-claims-with-ory-kratos/</guid><description>&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/JSON_Web_Token" target="_blank" rel="noopener" &gt;JSON Web Tokens&lt;/a&gt; (&lt;code&gt;JWT&lt;/code&gt;) are data structures that allow the holder of the token to assert &lt;code&gt;claims&lt;/code&gt; that can be cryptographically verified. For example, a token could hold a &lt;code&gt;role&lt;/code&gt; claim such as &lt;code&gt;administrator&lt;/code&gt; which, if proven to be unaltered (using a shared public key), could then be used to provide access to certain &lt;code&gt;administrator&lt;/code&gt; functions.&lt;/p&gt;
&lt;p&gt;Tools like &lt;a href="https://postgrest.org/en/v12/" target="_blank" rel="noopener" &gt;PostgREST&lt;/a&gt;, &lt;a href="https://hasura.io/" target="_blank" rel="noopener" &gt;Hasura&lt;/a&gt; and &lt;a href="https://supabase.com/" target="_blank" rel="noopener" &gt;Supabase&lt;/a&gt; (a managed PostgREST) rely on &lt;code&gt;JWT&lt;/code&gt;s to carry a user&amp;rsquo;s &lt;code&gt;claims&lt;/code&gt; where the &lt;code&gt;role&lt;/code&gt; claim is used to &lt;a href="https://www.postgresql.org/docs/current/sql-set-role.html" target="_blank" rel="noopener" &gt;set a Postgres role&lt;/a&gt; to enforce role-based access control. Tools like &lt;a href="https://auth0.com/" target="_blank" rel="noopener" &gt;Auth0&lt;/a&gt; and &lt;a href="https://github.com/supabase/auth" target="_blank" rel="noopener" &gt;Supabase Auth&lt;/a&gt; are commonly used to generate the &lt;code&gt;JWT&lt;/code&gt; and can be configured to create tokens that contain the correct &lt;code&gt;claims&lt;/code&gt; format for the target system.&lt;/p&gt;</description></item><item><title>Plugins for Rust</title><link>https://reorchestrate.com/posts/plugins-for-rust/</link><pubDate>Fri, 27 Jan 2023 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/plugins-for-rust/</guid><description>&lt;p&gt;Plugins are a useful way to allow advanced users to add new functions to your software without having to modify the main program itself. With &lt;a href="https://en.wikipedia.org/wiki/Interpreter_%28computing%29" target="_blank" rel="noopener" &gt;interpreted languages&lt;/a&gt; like JavaScript or Python this can be relatively easy as the runtime itself is able to execute arbitrary instructions without them having to be compiled first. 
Rust is a compiled language so does not have a method of executing arbitrary instructions prior to compilation so adding functions generally requires rewriting them in Rust and redeploying the binary.&lt;/p&gt;</description></item><item><title>Debezium does not impact source database performance</title><link>https://reorchestrate.com/posts/debezium-performance-impact/</link><pubDate>Mon, 01 Mar 2021 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/debezium-performance-impact/</guid><description>&lt;p&gt;&lt;a href="https://debezium.io/" target="_blank" rel="noopener" &gt;Debezium&lt;/a&gt; is a Database Change-Data-Capture (aka CDC) tool that is able to decode open source and proprietary database logs, normalize them to a standard payload format and push them into a series of &lt;a href="https://kafka.apache.org/" target="_blank" rel="noopener" &gt;Kafka&lt;/a&gt; topics. It implements the Confluent &lt;a href="https://docs./platform/current/connect/index.html" target="_blank" rel="noopener" &gt;Kafka Connect&lt;/a&gt; interface so is built to be highly-available, has development supported by &lt;a href="https://redhat.com" target="_blank" rel="noopener" &gt;RedHat&lt;/a&gt; and commercial support from &lt;a href="https://confluent.io" target="_blank" rel="noopener" &gt;Confluent&lt;/a&gt;. 
If you are even moderately interested in data and the reason for Change-Data-Capture is not obvious head immediately to &lt;a href="https://martin.kleppmann.com/2015/11/05/database-inside-out-at-oredev.html" target="_blank" rel="noopener" &gt;watch this&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>DeltaLake: A clever solution to a big (data) problem</title><link>https://reorchestrate.com/posts/deltalake-a-clever-solution-to-a-big-data-problem/</link><pubDate>Fri, 09 Aug 2019 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/deltalake-a-clever-solution-to-a-big-data-problem/</guid><description>&lt;h2 id="why-blob-storage-is-risky"&gt;Why Blob Storage Is Risky&lt;/h2&gt;
&lt;p&gt;Many people use &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener" &gt;Amazon S3&lt;/a&gt; or equivalent services to replace their relational database data warehouses as blob storage coupled with technologies like the compressed-columnar &lt;a href="https://parquet.apache.org/" target="_blank" rel="noopener" &gt;parquet&lt;/a&gt; file format offers a reasonably performant, massively scalable and cheap alternative.&lt;/p&gt;
&lt;p&gt;To understand why blob storage can be risky it is important to understand how Apache Spark executes writes to non-transactional storage systems like S3 or other blob storage. To do this we are going to use a contrived example where every day a user extracts their bank account transactions to a CSV file named &lt;code&gt;yyyy-MM-dd_transactions.csv&lt;/code&gt;:&lt;/p&gt;</description></item><item><title>Code doesn't scale for ETL</title><link>https://reorchestrate.com/posts/code-doesnt-scale-for-etl/</link><pubDate>Fri, 19 Jul 2019 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/code-doesnt-scale-for-etl/</guid><description>&lt;h2 id="history-of-the-problem"&gt;History of the problem&lt;/h2&gt;
&lt;p&gt;Since Hadoop was released in 2007 users have been struggling to use it to deploy reliable and scalable Extract-Transform-Load (ETL) data pipelines. This was exacerbated in the early days by the ecosystem still being in flux - it felt like every day there was another major Apache Foundation project being announced adding to the Hadoop ecosystem. Now with the convergence of the adoption of cloud infrastructure, much more powerful hardware/faster networking and maturity of open-source solutions like Apache Spark it should be easier than ever to build ETL pipelines - but organisations are still grappling with how to rapidly deliver reliable and scalable ETL pipelines.&lt;/p&gt;</description></item><item><title>Using Apache Spark Neural Networks to Recognise Digits</title><link>https://reorchestrate.com/posts/using-apache-spark-neural-networks-to-recognise-digits/</link><pubDate>Sat, 12 Mar 2016 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/using-apache-spark-neural-networks-to-recognise-digits/</guid><description>&lt;p&gt;One of the famous machine learning challenges is performing handwritten character recognition (classification) over the &lt;a href="http://yann.lecun.com/exdb/mnist/" target="_blank" rel="noopener" &gt;MNIST database of handwritten digits&lt;/a&gt;. The MNIST dataset has a training set of 60,000 and a test set of 10,000 28x28 pixel images of handwritten digits and an integer value between 0 and 9 containing their true value.&lt;/p&gt;
&lt;p&gt;The current best scores are &lt;a href="https://en.wikipedia.org/wiki/MNIST_database" target="_blank" rel="noopener" &gt;available on Wikipedia&lt;/a&gt; with the best score for &lt;a href="https://en.wikipedia.org/wiki/Artificial_neural_network" target="_blank" rel="noopener" &gt;Neural Networks&lt;/a&gt; currently at 0.35% error.&lt;/p&gt;</description></item><item><title>AffineTransform Transformer for Apache Spark ML</title><link>https://reorchestrate.com/posts/affinetransform-transformer-for-apache-spark-ml/</link><pubDate>Sun, 06 Mar 2016 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/affinetransform-transformer-for-apache-spark-ml/</guid><description>&lt;p&gt;Whilst playing with the MNIST dataset I found I needed a way of rotating images and so I decided to build an &lt;a href="https://en.wikipedia.org/wiki/Affine_transformation" target="_blank" rel="noopener" &gt;Affine Transform&lt;/a&gt; Transformer for Apache Spark ML. I have implemented the basic Affine Transformation operations: &lt;code&gt;rotate&lt;/code&gt;, &lt;code&gt;scaleX&lt;/code&gt;, &lt;code&gt;scaleY&lt;/code&gt;, &lt;code&gt;shearX&lt;/code&gt;, &lt;code&gt;shearY&lt;/code&gt;, &lt;code&gt;translateX&lt;/code&gt;, &lt;code&gt;translateY&lt;/code&gt;. Any pixel which exceeds the image dimensions will be discarded. I am sure the code could be improved but this is a good starting point.&lt;/p&gt;
&lt;p&gt;To use this transformer I assume your data is in &lt;code&gt;Dense&lt;/code&gt; or &lt;code&gt;Sparse&lt;/code&gt; &lt;code&gt;Vector&lt;/code&gt; form where each pixel value is indexed. To perform the operations you need to provide the dimensions of the image with &lt;code&gt;Width&lt;/code&gt; and &lt;code&gt;Height&lt;/code&gt;, define the &lt;code&gt;Operation&lt;/code&gt; and the &lt;code&gt;Factor&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>A Date Hierarchy for Neo4j</title><link>https://reorchestrate.com/posts/date-hierarchy-for-neo4j/</link><pubDate>Sat, 27 Feb 2016 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/date-hierarchy-for-neo4j/</guid><description>&lt;p&gt;I wrote this a while ago based on this &lt;a href="http://www.markhneedham.com/blog/2014/04/19/neo4j-cypher-creating-a-time-tree-down-to-the-day/" target="_blank" rel="noopener" &gt;excellent post&lt;/a&gt; and added a few more attributes. Given that &lt;a href="http://neo4j.com/" target="_blank" rel="noopener" &gt;Neo4j&lt;/a&gt; doesn&amp;rsquo;t have a datatype to deal with dates it might come in handy for you too.&lt;/p&gt;
&lt;p&gt;It will generate a calendar between the years specified at the top of the script (&lt;code&gt;1970&lt;/code&gt; to &lt;code&gt;2050&lt;/code&gt;) and create &lt;code&gt;Day&lt;/code&gt; vertexes with attributes of &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;day&lt;/code&gt;, &lt;code&gt;dayName&lt;/code&gt; (day of week) and &lt;code&gt;workDay&lt;/code&gt; (binary). It will then create &lt;code&gt;NEXT&lt;/code&gt; directed edges between each &lt;code&gt;Day&lt;/code&gt;, &lt;code&gt;Month&lt;/code&gt; and &lt;code&gt;Year&lt;/code&gt; object and create the &lt;code&gt;HAS_MONTH&lt;/code&gt; and &lt;code&gt;HAS_DAY&lt;/code&gt; edges to join &lt;code&gt;Year&lt;/code&gt; to &lt;code&gt;Month&lt;/code&gt; to &lt;code&gt;Day&lt;/code&gt; so you can traverse the hierarchy quickly.&lt;/p&gt;</description></item><item><title>A better Binarizer for Apache Spark ML</title><link>https://reorchestrate.com/posts/a-better-binarizer-for-apache-spark-ml/</link><pubDate>Sat, 16 Jan 2016 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/a-better-binarizer-for-apache-spark-ml/</guid><description>&lt;p&gt;&lt;strong&gt;Update: This code has been approved and should appear in Apache Spark 2.0.0.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Binarizer&lt;/code&gt; transformer (&lt;a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.Binarizer" target="_blank" rel="noopener" &gt;API&lt;/a&gt;) is part of the core Apache Spark ML package. Its job is simple: compare a series of numbers against a threshold value and if the value is greater than the threshold then output &lt;code&gt;1.0&lt;/code&gt; and if less than (or equal to) the threshold then output &lt;code&gt;0.0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For example, if we apply the &lt;code&gt;Binarizer&lt;/code&gt; transformer to the Iris standard machine learning dataset we would get these results (the threshold value is &lt;code&gt;4.9&lt;/code&gt;):&lt;/p&gt;</description></item><item><title>Porter Stemming in Apache Spark ML</title><link>https://reorchestrate.com/posts/porter-stemming-in-apache-spark-ml/</link><pubDate>Sun, 13 Dec 2015 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/porter-stemming-in-apache-spark-ml/</guid><description>&lt;p&gt;As I have been playing with Apache Spark ML and needed a &lt;a href="https://en.wikipedia.org/wiki/Stemming" target="_blank" rel="noopener" &gt;stemming algorithm&lt;/a&gt; I decided to have a go and write a custom transformer myself.&lt;/p&gt;
&lt;p&gt;As of Spark 1.5.2 Stemming has not been introduced (&lt;a href="https://issues.apache.org/jira/browse/SPARK-9578" target="_blank" rel="noopener" &gt;should be in 1.7.0&lt;/a&gt;) but I have taken the &lt;a href="http://tartarus.org/martin/PorterStemmer/" target="_blank" rel="noopener" &gt;Porter Stemmer&lt;/a&gt; Algorithm implemented in Scala by the &lt;a href="https://github.com/scalanlp/chalk" target="_blank" rel="noopener" &gt;ScalaNLP&lt;/a&gt; project and wrapped it as a Spark Transformer. Unfortunately, you are going to have to build Spark from source to use it.&lt;/p&gt;</description></item><item><title>Natural Language Processing with Apache Spark ML and Amazon Reviews (Part 2)</title><link>https://reorchestrate.com/posts/natural-language-processing-with-apache-spark-ml-and-amazon-reviews-part-2/</link><pubDate>Sun, 06 Dec 2015 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/natural-language-processing-with-apache-spark-ml-and-amazon-reviews-part-2/</guid><description>&lt;p&gt;Continues from &lt;a href="https://reorchestrate.com/posts/natural-language-processing-with-apache-spark-ml-and-amazon-reviews-part-1/" &gt;Part 1&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="4-execution"&gt;4 Execution&lt;/h4&gt;
&lt;h5 id="41-the-pipeline"&gt;4.1 The Pipeline&lt;/h5&gt;
&lt;p&gt;Now we have all the components of the pipeline ready all that is needed is to load them into the Spark ML &lt;code&gt;Pipeline()&lt;/code&gt;. A pipeline helps with the sequencing of stages so that we can automate the pipeline in the image at the top of this post.&lt;/p&gt;
&lt;p&gt;When &lt;code&gt;Pipeline()&lt;/code&gt; is &lt;code&gt;fit()&lt;/code&gt; to the &lt;code&gt;training&lt;/code&gt; set it will call the &lt;code&gt;fit()&lt;/code&gt; method of the two estimator stages (&lt;code&gt;StringIndexer()&lt;/code&gt; and &lt;code&gt;NaiveBayes()&lt;/code&gt; which will both produce models &lt;code&gt;StringIndexerModel()&lt;/code&gt; and &lt;code&gt;NaiveBayesModel()&lt;/code&gt; respectively) and &lt;code&gt;transform()&lt;/code&gt; on all the rest of the stages. When the &lt;code&gt;Pipeline()&lt;/code&gt; is called to &lt;code&gt;transform()&lt;/code&gt; the &lt;code&gt;test&lt;/code&gt; set it will call &lt;code&gt;transform()&lt;/code&gt; on all the stages (including the models).&lt;/p&gt;</description></item><item><title>Natural Language Processing with Apache Spark ML and Amazon Reviews (Part 1)</title><link>https://reorchestrate.com/posts/natural-language-processing-with-apache-spark-ml-and-amazon-reviews-part-1/</link><pubDate>Sat, 05 Dec 2015 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/natural-language-processing-with-apache-spark-ml-and-amazon-reviews-part-1/</guid><description>&lt;p&gt;The most exciting feature of Apache Spark is its &amp;lsquo;generality&amp;rsquo;, meaning the ability to rapidly take some text data, transform it to a graph structure and perform some network analysis with &lt;a href="https://spark.apache.org/graphx/" target="_blank" rel="noopener" &gt;GraphX&lt;/a&gt;, take that dataset and apply some machine learning algorithms with &lt;a href="https://spark.apache.org/mllib/" target="_blank" rel="noopener" &gt;SparkML&lt;/a&gt;, and store it in memory and query it using &lt;a href="https://spark.apache.org/sql/" target="_blank" rel="noopener" &gt;SparkSQL&lt;/a&gt;, all within a single program of very little code.&lt;/p&gt;
&lt;p&gt;In this post I wanted to write about the Spark ML framework and how easy and effective it is to do scalable machine learning by creating a pipeline which performs Natural Language Processing to assess whether a user&amp;rsquo;s plain text review aligns with their rating (from 1 to 5).&lt;/p&gt;</description></item><item><title>Performance Tuning Spark WikiPedia PageRank</title><link>https://reorchestrate.com/posts/performance-tuning-spark-pagerank/</link><pubDate>Sat, 21 Nov 2015 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/performance-tuning-spark-pagerank/</guid><description>&lt;p&gt;In my previous post I wrote some code to demonstrate how to go from the raw database extracts provided monthly by WikiPedia through to loading into Apache Spark GraphX and running PageRank.&lt;/p&gt;
&lt;p&gt;In this post I will discuss my efforts to make that process more efficient which may be relevant to some of you trying to do proof-of-concept activities on less than ideal hardware. My test box (a standalone Intel Core i3-3217U (17W) with 16GB RAM and 60GB SSD storage) cannot complete the full graph build due to insufficient resources. At most I can process around 10-20% of the dataset before hitting these resource constraints.&lt;/p&gt;</description></item><item><title>Computing WikiPedia's internal PageRank with Apache Spark</title><link>https://reorchestrate.com/posts/computing-wikipedias-internal-pagerank-with-spark/</link><pubDate>Sat, 14 Nov 2015 00:00:00 +0000</pubDate><guid>https://reorchestrate.com/posts/computing-wikipedias-internal-pagerank-with-spark/</guid><description>&lt;p&gt;Recently I have spent a lot of time reading and learning about graphs and graph analytics which naturally drew me to &lt;a href="https://spark.apache.org/graphx/" target="_blank" rel="noopener" &gt;Apache Spark GraphX&lt;/a&gt; having previously played with &lt;a href="http://neo4j.com/" target="_blank" rel="noopener" &gt;Neo4J&lt;/a&gt;. The benefits of GraphX are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fully open source&lt;/li&gt;
&lt;li&gt;scalable using the Apache Spark model&lt;/li&gt;
&lt;li&gt;written in Scala which I have been meaning to learn&lt;/li&gt;
&lt;li&gt;already has basic graph algorithms such as &lt;a href="https://en.wikipedia.org/wiki/PageRank" target="_blank" rel="noopener" &gt;PageRank&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is a great resource for learning the basics of Apache Spark provided by Berkeley&amp;rsquo;s &lt;a href="http://ampcamp.berkeley.edu/" target="_blank" rel="noopener" &gt;AMP Camp&lt;/a&gt; where you can follow very well documented tutorials including one on calculating PageRank on a very small subset of Wikipedia data which has been preprocessed by them to simplify the tutorials. This post extends their tutorial to loading a full set of Wikipedia&amp;rsquo;s most recent backup to the point of executing the PageRank.&lt;/p&gt;</description></item></channel></rss>