Solidity Compiler Version Identification on Smart Contract Bytecode

Lakshmi Prasanna Katyayani Devasani, Wright State University

Abstract

Identifying the version of the Solidity compiler used to create an Ethereum contract is a challenging task, especially when the contract bytecode is obfuscated and lacks explicit metadata. Ethereum bytecode is complex because the Solidity compiler translates high-level programming constructs into low-level, stack-based code. In addition, the compiler is updated frequently, so the bytecode patterns it emits evolve continuously. To address this challenge, we propose using deep learning models to analyze Ethereum bytecode and infer the compiler version that produced it. We train these models on a large dataset of Ethereum contracts labeled with their corresponding compiler versions, covering a range of Solidity compiler versions. We preprocess the dataset to extract opcode sequences from the bytecode, which serve as inputs to the models. We use advanced sequence learning methods, including bidirectional long short-term memory (Bi-LSTM), convolutional neural network (CNN), CNN+Bi-LSTM, Transformer, and Sentence-BERT (SBERT), to capture the semantics of the opcode sequences. We evaluate each model using accuracy, precision, recall, and F1-score. Our results show that the models identify the Solidity compiler version used in smart contracts with high accuracy. We also compare our methods with non-sequence learning models and find that ours outperform them in most cases, which highlights the advantages of the proposed approaches for identifying Solidity compiler versions from Ethereum bytecode.
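
To make the preprocessing step concrete, the following is a minimal sketch of how an opcode sequence might be extracted from hex-encoded EVM bytecode. It is an illustration under stated assumptions, not the pipeline used in this work: the function name `bytecode_to_opcodes` is hypothetical, the opcode table is deliberately partial, and dropping the PUSH operands (keeping only mnemonics) is an assumed design choice.

```python
# Illustrative sketch only; not the preprocessing code used in this work.
# The table below is intentionally partial. Any byte value not listed is
# reported as UNKNOWN_0x?? rather than guessed.
BASE_OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x02: "MUL", 0x03: "SUB", 0x04: "DIV",
    0x10: "LT", 0x11: "GT", 0x14: "EQ", 0x15: "ISZERO", 0x16: "AND",
    0x34: "CALLVALUE", 0x35: "CALLDATALOAD", 0x36: "CALLDATASIZE",
    0x39: "CODECOPY", 0x50: "POP", 0x51: "MLOAD", 0x52: "MSTORE",
    0x54: "SLOAD", 0x55: "SSTORE", 0x56: "JUMP", 0x57: "JUMPI",
    0x5B: "JUMPDEST", 0x5F: "PUSH0", 0xF3: "RETURN", 0xFD: "REVERT",
    0xFE: "INVALID", 0xFF: "SELFDESTRUCT",
}


def build_opcode_table():
    """Extend the base table with the regular PUSH/DUP/SWAP/LOG families."""
    table = dict(BASE_OPCODES)
    for n in range(1, 33):
        table[0x5F + n] = f"PUSH{n}"   # 0x60..0x7f
    for n in range(1, 17):
        table[0x7F + n] = f"DUP{n}"    # 0x80..0x8f
        table[0x8F + n] = f"SWAP{n}"   # 0x90..0x9f
    for n in range(5):
        table[0xA0 + n] = f"LOG{n}"    # 0xa0..0xa4
    return table


OPCODES = build_opcode_table()


def bytecode_to_opcodes(hex_bytecode):
    """Return the list of opcode mnemonics for a hex-encoded bytecode string.

    Assumption: PUSH operands are skipped, so only mnemonics remain.
    """
    if hex_bytecode.startswith(("0x", "0X")):
        hex_bytecode = hex_bytecode[2:]
    code = bytes.fromhex(hex_bytecode)

    sequence = []
    pc = 0
    while pc < len(code):
        byte = code[pc]
        sequence.append(OPCODES.get(byte, f"UNKNOWN_0x{byte:02x}"))
        pc += 1
        if 0x60 <= byte <= 0x7F:
            # PUSHn carries n bytes of immediate data; skip them so they
            # are not misread as instructions.
            pc += byte - 0x5F
    return sequence


if __name__ == "__main__":
    print(bytecode_to_opcodes("0x6080604052"))
```

Skipping the immediate bytes after each PUSH matters because those bytes are data; decoding them as instructions would corrupt the sequence. Whether operands should be kept as additional tokens is a modeling decision that this sketch does not settle.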