reorchestrate

Your binary is no longer safe: Conversion

Thu, 05 Feb 2026 00:00:00 +0000

This post continues Your binary is no longer safe: Decompilation, which covers the brute-force reverse engineering of binary (compiled) programs using Large Language Models (LLMs) to automate a two-part problem: decompilation and conversion to a modern programming language.

This post covers the second part of the problem: conversion.

Claude enters the game …

Here is a 1:1 translation produced by claude-opus-4.5. You can see that my implementation has a different signature to deal with the Rust borrow checker - but you can safely ignore this for this discussion.

Your binary is no longer safe: Decompilation

Wed, 04 Feb 2026 00:00:00 +0000

This post is about the brute-force reverse engineering of binary (compiled) programs using Large Language Models (LLMs) to automate this two-part problem: decompilation and conversion to a modern programming language.

This post covers the first part of the problem: decompilation.

Update 24 February 2026

Based on feedback I have split this article into two posts:

The first (this post) describes the process of decompilation which can be skipped if you are already familiar with this topic.
The second post describes the code-conversion and differential testing approach used to verify the conversion is equivalent to the original binary.

In this post an old Multi-user Dungeon (MUD) game binary has been targeted (see the reasoning below) but the approach applies equally well to other tasks, such as modernizing binaries or converting legacy COBOL to a modern language.

SQLite Transactions

Tue, 16 Jul 2024 00:00:00 +0000

What is SQLite?

In the past few years SQLite (not SQL-light) has had a surge of popularity as people have come to realise its power as an in-process, highly reliable SQL database engine as a backend for server processes rather than its traditional role of client or edge applications. This change in stance for SQLite has happened despite the authors almost actively discouraging its use for this purpose.

I am interested in SQLite for some key reasons:

Custom JWT Claims with Ory Kratos

Tue, 02 Jul 2024 00:00:00 +0000

JSON Web Tokens (JWT) are data structures that allow the holder of the token to assert claims that can be cryptographically verified. For example, a token could hold a role claim such as administrator that, if proven to be unaltered (using a shared public key), could be used to grant access to certain administrator functions.

Tools like PostgREST, Hasura and Supabase (a managed PostgREST) rely on JWTs to carry a user’s claims where the role claim is used to set a Postgres role to enforce role-based access control. Auth0 and Supabase Auth are commonly used to generate the JWT and can be configured to create tokens that contain the correct claims format for the target system.

Plugins for Rust

Fri, 27 Jan 2023 00:00:00 +0000

Plugins are a useful way to allow advanced users to add new functions to your software without having to modify the main program itself. With interpreted languages like JavaScript or Python this can be relatively easy as the runtime itself is able to execute arbitrary instructions without them having to be compiled first. Rust is a compiled language so does not have a method of executing arbitrary instructions prior to compilation, so adding functions generally requires rewriting them in Rust and redeploying the binary.

Debezium does not impact source database performance

Mon, 01 Mar 2021 00:00:00 +0000

Debezium is a Database Change-Data-Capture (aka CDC) tool that is able to decode open source and proprietary database logs, normalize them to a standard payload format and push them into a series of Kafka topics. It implements the Confluent Kafka Connect interface so is built to be highly available, has development supported by Red Hat and commercial support from Confluent. If you are even moderately interested in data and the reason for Change-Data-Capture is not obvious head immediately to watch this.

DeltaLake: A clever solution to a big (data) problem

Fri, 09 Aug 2019 00:00:00 +0000

Why Blob Storage Is Risky

Many people use Amazon S3 or equivalent services to replace their relational database data warehouses as blob storage coupled with technologies like the compressed-columnar parquet file format offers a reasonably performant, massively scalable and cheap alternative.

To understand why blob storage can be risky it is important to understand how Apache Spark executes writes to non-transactional storage systems like S3 or other blob storage. To do this we are going to use a contrived example where every day a user extracts their bank account transactions to a CSV file named yyyy-MM-dd_transactions.csv:

Code doesn't scale for ETL

Fri, 19 Jul 2019 00:00:00 +0000

History of the problem

Since Hadoop was released in 2007 users have been struggling to use it to deploy reliable and scalable Extract-Transform-Load (ETL) data pipelines. This was exacerbated in the early days by the ecosystem still being in flux - it felt like every day there was another major Apache Foundation project being announced adding to the Hadoop ecosystem. Now with the convergence of the adoption of cloud infrastructure, much more powerful hardware/faster networking and maturity of open-source solutions like Apache Spark it should be easier than ever to build ETL pipelines - but organisations are still grappling with how to rapidly deliver reliable and scalable ETL pipelines.

Using Apache Spark Neural Networks to Recognise Digits

Sat, 12 Mar 2016 00:00:00 +0000

One of the famous machine learning challenges is performing handwritten character recognition (classification) over the MNIST database of handwritten digits. The MNIST dataset has a training set of 60,000 and a test set of 10,000 28x28 pixel images of handwritten digits, each with an integer value between 0 and 9 containing its true value.

The current best scores are available on Wikipedia with the best score for Neural Networks currently at 0.35% error.

AffineTransform Transformer for Apache Spark ML

Sun, 06 Mar 2016 00:00:00 +0000

Whilst playing with the MNIST dataset I found I needed a way of rotating images and so I decided to build an Affine Transform Transformer for Apache Spark ML. I have implemented the basic Affine Transformation operations: rotate, scaleX, scaleY, shearX, shearY, translateX, translateY. Any pixel which exceeds the image dimensions will be discarded. I am sure the code could be improved but this is a good starting point.

To use this transformer I assume your data is in Dense or Sparse Vector form where each pixel value is indexed. To perform the operations you need to provide the dimensions of the image with Width and Height, define the Operation and the Factor.
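As a rough illustration of the idea (not the transformer's actual code, which targets Spark ML), the sketch below applies a translation to an image stored as a flattened, row-major list of pixel values. The function name `translate`, its parameters and the choice to fill vacated positions with 0.0 are assumptions made for this example only.

```python
def translate(pixels, width, height, dx, dy):
    """Shift a flattened row-major image by (dx, dy) pixels.

    Pixels that land outside the width x height grid are discarded.
    Positions left empty are filled with 0.0 (an assumption for this sketch).
    """
    if len(pixels) != width * height:
        raise ValueError("pixel count must equal width * height")
    out = [0.0] * (width * height)
    for index, value in enumerate(pixels):
        # Recover the 2D coordinates from the flat index.
        x, y = index % width, index // width
        nx, ny = x + dx, y + dy
        # Keep the pixel only if it still falls inside the image.
        if 0 <= nx < width and 0 <= ny < height:
            out[ny * width + nx] = value
    return out


# A hypothetical 2x2 image shifted one pixel to the right.
image = [1.0, 2.0,
         3.0, 4.0]
shifted = translate(image, 2, 2, 1, 0)
print(shifted)  # [0.0, 1.0, 0.0, 3.0]
```

The right-hand column (2.0 and 4.0) falls outside the image after the shift and is dropped, which is the discarding behaviour described above.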

A Date Hierarchy for Neo4j

Sat, 27 Feb 2016 00:00:00 +0000

I wrote this a while ago based on this excellent post and added a few more attributes. Given that Neo4j doesn’t have a datatype to deal with dates it might come in handy for you too.

It will generate a calendar between the years specified at the top of the script (1970 to 2050) and create Day vertexes with attributes of year, month, day, dayName (day of week) and workDay (binary). It will then create NEXT directed edges between each Day, Month and Year object and create the HAS_MONTH and HAS_DAY edges to join Year to Month to Day so you can traverse the hierarchy quickly.
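To make the shape of the Day data concrete, here is a minimal Python sketch (not the original script) that builds Day records with the attributes listed above and the NEXT edges between consecutive days. The function name `build_days` and the rule that Monday to Friday counts as a work day are assumptions for illustration; Month and Year nodes and the HAS_MONTH and HAS_DAY edges are left out for brevity.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def build_days(start, end):
    """Return (days, next_edges) for every date from start to end inclusive."""
    days = []
    current = start
    while current <= end:
        days.append({
            "year": current.year,
            "month": current.month,
            "day": current.day,
            "dayName": DAY_NAMES[current.weekday()],
            # Assumption: Monday-Friday are work days (1), weekends are not (0).
            "workDay": 1 if current.weekday() < 5 else 0,
        })
        current += timedelta(days=1)
    # NEXT edges link each day (by list position) to the day that follows it.
    next_edges = [(i, i + 1) for i in range(len(days) - 1)]
    return days, next_edges


days, next_edges = build_days(date(1970, 1, 1), date(1970, 1, 7))
print(days[0]["dayName"], len(next_edges))  # Thursday 6
```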

A better Binarizer for Apache Spark ML

Sat, 16 Jan 2016 00:00:00 +0000

Update: This code has been approved and should appear in Apache Spark 2.0.0.

The Binarizer transformer (API) is part of the core Apache Spark ML package. Its job is simple: compare a series of numbers against a threshold value and if the value is greater than the threshold then output 1.0 and if less than (or equal to) the threshold then output 0.0.
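The thresholding rule itself can be sketched in a few lines of plain Python. This is not the Spark ML API, only the comparison it performs; the function name `binarize` and the sample values are made up for illustration.

```python
def binarize(values, threshold):
    """Return 1.0 where a value is strictly greater than threshold, else 0.0."""
    return [1.0 if value > threshold else 0.0 for value in values]


# Hypothetical measurements, not taken from any real dataset.
sample = [4.6, 4.9, 5.1]
print(binarize(sample, 4.9))  # [0.0, 0.0, 1.0]
```

Note that a value equal to the threshold maps to 0.0, matching the "less than (or equal to)" rule above.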

For example, if we apply the Binarizer transformer to the Iris standard machine learning dataset we would get these results (the threshold value is 4.9):

Porter Stemming in Apache Spark ML

Sun, 13 Dec 2015 00:00:00 +0000

As I have been playing with Apache Spark ML and needed a stemming algorithm I decided to have a go and write a custom transformer myself.

As of Spark 1.5.2 Stemming has not been introduced (should be in 1.7.0) but I have taken the Porter Stemmer Algorithm implemented in Scala by the ScalaNLP project and wrapped it as a Spark Transformer. Unfortunately, you are going to have to build Spark from source to use it.

Natural Language Processing with Apache Spark ML and Amazon Reviews (Part 2)

Sun, 06 Dec 2015 00:00:00 +0000

Continues from Part 1.

4 Execution

4.1 The Pipeline

Now that we have all the components of the pipeline ready, all that is needed is to load them into the Spark ML Pipeline(). A pipeline helps with the sequencing of stages so that we can automate the pipeline in the image at the top of this post.

When Pipeline() is fit() to the training set it will call the fit() method of the two estimator stages (StringIndexer() and NaiveBayes() which will both produce models StringIndexerModel() and NaiveBayesModel() respectively) and transform() on all the rest of the stages. When the Pipeline() is called to transform() the test set it will call transform() on all the stages (including the models).
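The fit()/transform() split can be illustrated with a toy pipeline in plain Python. These classes are hypothetical stand-ins and are not Spark's API; they only mirror the behaviour described above, where fitting replaces each estimator with the model it produces and transforming runs every stage in order.

```python
class Lowercase:
    """Transformer stage: it only has transform()."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyModel:
    """Model produced by VocabularyEstimator; it behaves as a transformer."""

    def __init__(self, vocab):
        self.index = {word: i for i, word in enumerate(vocab)}

    def transform(self, rows):
        # Words not seen during fit() are dropped.
        return [[self.index[w] for w in row.split() if w in self.index]
                for row in rows]


class VocabularyEstimator:
    """Estimator stage: fit() learns from the data and returns a model."""

    def fit(self, rows):
        vocab = sorted({word for row in rows for word in row.split()})
        return VocabularyModel(vocab)


class PipelineModel:
    """Fitted pipeline: every stage is a transformer or a fitted model."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted and replaced by their model.
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            # Every stage then transforms the data for the next stage.
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


model = Pipeline([Lowercase(), VocabularyEstimator()]).fit(["Good book", "Bad book"])
print(model.transform(["GOOD bad unknown"]))  # [[2, 0]]
```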

Natural Language Processing with Apache Spark ML and Amazon Reviews (Part 1)

Sat, 05 Dec 2015 00:00:00 +0000

The most exciting feature of Apache Spark is its ‘generality’, meaning the ability to rapidly take some text data, transform it to a graph structure and perform some network analysis with GraphX, take that dataset and apply some machine learning algorithms with SparkML, then store it in memory and query it using SparkSQL, all within a single program of very little code.

In this post I wanted to write about the Spark ML framework and how easy and effective it is to do scalable machine learning by creating a pipeline which performs Natural Language Processing to assess whether a user’s plain text review aligns with their rating (from 1 to 5).

Performance Tuning Spark WikiPedia PageRank

Sat, 21 Nov 2015 00:00:00 +0000

In my previous post I wrote some code to demonstrate how to go from the raw database extracts provided monthly by WikiPedia through to loading into Apache Spark GraphX and running PageRank.

In this post I will discuss my efforts to make that process more efficient which may be relevant to some of you trying to do proof-of-concept activities on less than ideal hardware. My test box (a standalone Intel Core i3-3217U (17W) with 16GB RAM and 60GB SSD storage) cannot complete the full graph build due to insufficient resources. At most I can process around 10-20% of the dataset before hitting these resource constraints.

Computing WikiPedia's internal PageRank with Apache Spark

Sat, 14 Nov 2015 00:00:00 +0000

Recently I have spent a lot of time reading and learning about graphs and graph analytics which naturally drew me to Apache Spark GraphX, having previously played with Neo4j. The benefits of GraphX are:

fully open source
scalable using the Apache Spark model
written in Scala which I have been meaning to learn
already has basic graph algorithms such as PageRank

There is a great resource for learning the basics of Apache Spark provided by Berkeley’s AMP Camp where you can follow very well documented tutorials including one on calculating PageRank on a very small subset of Wikipedia data which has been preprocessed by them to simplify the tutorials. This post extends their tutorial to loading a full set of Wikipedia’s most recent backup to the point of executing the PageRank.