🎙 Google’s Allen Day on Using ML in the Cryptocurrency Space
It’s so inspiring to learn from practitioners and thinkers. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration.
👤 Quick bio
Allen Day (AD): I work in Google Cloud’s developer relations team. Our mission is to build a best-in-class experience for cloud devs. Within this team, I advocate for Google Cloud's web3 and data & analytics products. These span the full range of data engineering pipelines, from ingest through analytics and machine learning. I spend most of my time with the data processing and transformation products. As for how I got into machine learning, it wasn’t through deliberate intention but rather the result of my lifelong interest in exploring and building at the intersection of computer code and DNA-based biocode. This started with self-study and learning to program a computer at six years old, and led me to pursue a graduate degree in bioinformatics, during which I learned how to use distributed systems to implement machine learning algorithms for research in human genetics.
🛠 ML Work
AD: I got interested in cryptocurrencies in 2013 but didn't get around to learning about the blockchain data structures until Ethereum's ICO boom in 2017. I noticed there were some structural parallels: the blockchain transaction graph looks like the graph of genetic interactions inside a cell. So I decided to apply some simple analyses to find, for example, central nodes, and to write a blog post about how I did it. Getting to a few charts in a Jupyter notebook ended up being more data engineering work than I expected. I decided that nobody should need to do that work again, so I open-sourced the ETL and put the processed data into a free-to-access BigQuery dataset. Then I wrote my blog post.
It was very well received by the blockchain community. Many analysts and engineers reached out to me, and a community formed around the open data. It became clear that we needed to address two key challenges to meet the community's needs: (1) a robust DevOps architecture (Kubernetes, Docker) to keep up to date with a blockchain network's consensus state, and (2) an extensible architecture for ETLing complex streaming data (Pub/Sub, Dataflow, Airflow) so that we could work with other blockchains such as Ethereum. I teamed up with a talented data engineer, Evgeny Medvedev, and we built the Blockchain ETL community and open-source software project. Today at GCP we maintain ~20 of these datasets in BigQuery. There's a Kaggle community analyzing them, and Evgeny went on to build a blockchain analytics company, Nansen, based on our work.
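To make the idea of finding central nodes in a transaction graph concrete, here is a minimal sketch using PageRank by power iteration. The interview does not say which centrality measure was used, so PageRank is an assumption chosen for illustration, and the addresses and transfers are hypothetical.

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (sender, receiver) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical transfers: several addresses pay into "exchange".
transfers = [
    ("a", "exchange"), ("b", "exchange"), ("c", "exchange"),
    ("exchange", "d"), ("d", "a"),
]
scores = pagerank(transfers)
most_central = max(scores, key=scores.get)
```

On real ledger data the edge list would come from a query over a transactions table rather than a hand-written list, and edges would likely be weighted by transferred value.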
AD: If we consider all of the data on all of the public blockchains, there are indeed some small areas that are effectively invisible. For the majority of the data, though, we can see the transactions. Some blockchains are account-based, so we can directly see system actors. Other blockchains are transaction-based, and we need to use clustering methods to build synthetic identities. In all cases, we can reduce the ledger activity to a working set of system actors. From here, it's common to create continuous features, for example by using dispersion modeling to estimate contamination from a ransomware payment address. It's also common to use public label data to create categorical features, for example by using a random forest to find look-alikes to known labeled actors (miners, traders) based on their activity aggregated over time.
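One way to picture the clustering step for transaction-based chains is the common-input heuristic: addresses that sign inputs of the same transaction are assumed to belong to one actor. The interview does not name a specific method, so this heuristic, the function name `cluster_addresses`, and the sample data are assumptions for illustration only.

```python
def cluster_addresses(transactions):
    """Group addresses that co-appear as inputs of the same transaction.

    `transactions` is a list of input-address lists, one per transaction.
    Returns a sorted list of sorted address clusters (synthetic identities).
    """
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for input_addresses in transactions:
        if not input_addresses:
            continue
        first = input_addresses[0]
        find(first)  # register single-input addresses too
        for other in input_addresses[1:]:
            union(first, other)

    clusters = {}
    for addr in list(parent):
        clusters.setdefault(find(addr), set()).add(addr)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical input sets: A and B co-spend, then B and C co-spend.
identities = cluster_addresses([["A", "B"], ["B", "C"], ["D"], ["E", "F"]])
```

Each resulting cluster can then be treated as one system actor when aggregating activity into features. The heuristic can merge unrelated parties (for example, in collaborative transactions), so clusters are best read as estimates.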
AD: Yes, definitely! Investment in graph databases and the popularity of graph analytics workloads continue to grow. Their data access capabilities are on the cusp of being generally usable, and there is an opportunity to apply graph databases to blockchain data structures. Why do we care about graph data structures at all? A graph is the ultimate generalized data structure. It can represent the blockchain data with high fidelity, and it can encode rich relationships between nodes (temporal, semantic, social, spatial, functional). We've already demonstrated that there's useful inductive bias for non-graph-based methods. It seems reasonable to expect that graph-aware models like GNNs will outperform the more basic methods.
I also think it's the right time to be thinking about this. As I described earlier, most of the activity on-chain is open for all to see. But we should expect these data to become more obfuscated and opaque over time. After all, one of the fundamental technologies upon which blockchains are based is cryptography. So more hiding capabilities will be introduced, and on-chain actors will become more aware that they're living in a dark forest. This becomes an adversarial ML problem.
So we'll need the more powerful capabilities that are unlocked with GNNs, like identifying anomalous transactions and, conversely, transactions that don't exist (...yet) but should. Classifying nodes with GNN embeddings and applying graph kernels to characterize neighborhoods will also prove useful.
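For readers new to GNNs, the core operation is neighborhood aggregation: each node updates its representation from its neighbors' features. The sketch below shows one round of mean aggregation in plain Python. It is a simplification under stated assumptions: a real GNN layer would also apply a learned weight matrix and a nonlinearity, and the node names and feature vectors here are hypothetical.

```python
def aggregate_neighbors(features, edges):
    """One round of mean aggregation over an undirected graph.

    `features` maps node -> list of floats; `edges` is a list of (u, v) pairs.
    Each node's new vector is the mean over itself and its neighbors.
    """
    neighbors = {node: {node} for node in features}  # include a self-loop
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    dim = len(next(iter(features.values())))
    updated = {}
    for node, nbrs in neighbors.items():
        updated[node] = [
            sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)
        ]
    return updated

# Hypothetical per-address features, e.g. [is_labeled_miner, is_labeled_trader].
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
edges = [("a", "b"), ("b", "c")]
embeddings = aggregate_neighbors(features, edges)
```

After one round, the unlabeled node "c" picks up signal from its labeled neighbor "b". Stacking several rounds lets information travel further across the transaction graph, which is what makes the resulting embeddings useful for node classification.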
AD: The theme of your question seems to be about building ML microservices that use a blockchain backplane. We're already seeing this today with blockchain oracles: middleware solutions that address the software oracle problem. I pioneered the concept of hybrid blockchain/cloud applications with Chainlink, and the essential problem we solved was how to run intense workloads by decoupling the on-chain compute that logs the transaction from the resources needed to deliver the result. As a concrete example, this design pattern allows spinning up a Docker container to train a model or perform inference using a GPU and get results delivered on-chain. Blockchains employ checksums everywhere, so a nice feature that you get for free by doing this is responsible AI: the input dataset can be transparent and verified, and the model training/inference processes are deterministic and reproducible. Regarding federated learning, I haven't seen an implementation that coordinates through a blockchain, but it seems possible. We can reuse the same oracle-based worker pattern described above, and converge with a MapReduce orchestrator. The techniques used to survive in the dark forest, like zero-knowledge proofs, may also be helpful here for managing privacy as blockchain-integrated ML models are brought to market.
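The decoupling described above can be sketched as a toy request/response flow: a request containing a dataset checksum is logged, an off-chain worker verifies the checksum, runs a deterministic job, and reports the result with its own checksum. This is an illustrative sketch only and not Chainlink's API. The `ledger` list, the `oracle_worker` function, and the record fields are hypothetical stand-ins, and the "model" is just a mean.

```python
import hashlib
import json

def sha256_hex(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def run_job(dataset):
    """Deterministic stand-in for training or inference: mean of the inputs."""
    return sum(dataset) / len(dataset)

def oracle_worker(request, dataset):
    """Off-chain worker: verify the input checksum, compute, and report."""
    payload = json.dumps(dataset, sort_keys=True).encode()
    if sha256_hex(payload) != request["dataset_sha256"]:
        raise ValueError("dataset does not match the checksum logged on-chain")
    result = run_job(dataset)
    return {
        "request_id": request["request_id"],
        "result": result,
        "result_sha256": sha256_hex(json.dumps(result).encode()),
    }

dataset = [1.0, 2.0, 3.0, 6.0]
ledger = []  # toy append-only list standing in for on-chain storage (assumption)
request = {
    "request_id": 1,
    "dataset_sha256": sha256_hex(json.dumps(dataset, sort_keys=True).encode()),
}
ledger.append(request)                          # cheap on-chain step: log the request
ledger.append(oracle_worker(request, dataset))  # heavy step happens off-chain
```

Because the job is deterministic and both checksums are recorded, anyone holding the same dataset can rerun the job and confirm that the logged result matches.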
AD: With regard to ML and NFTs, we're seeing NFTs that grant the owner access to ML-linked products and experiences, acting like a license key or a config file. ML is already being used in off-chain trading systems, and I expect we'll also see the on-chain equivalent of this, where oracle-linked ML models are integral to the automated protocols that power decentralized finance and games. It's a great time to get involved at the intersection of ML and crypto, and it's been an honor to share with your audience some current market opportunities and areas of open inquiry. I'm excited to see more ML practitioners get involved and see what they'll create.
💥 Miscellaneous – a set of rapid-fire questions
AD: The Elements of Statistical Learning (free PDF) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; Introduction to Information Retrieval (free PDF) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze; Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze.
AD: I don’t think they’re equal, no. If P=NP there are of course major ramifications for cryptography and the entire stack of blockchain applications built on top of that. But it’s a tiny disruption in relation to all of our assumed limitations that get broken. Perhaps this question is so captivating because of how close it is to the human condition. We want unlimited reach (P=NP) while operating from a place of total safety (P!=NP). But the math sublimely indicates we can’t have it both ways; this is both beautiful and terrifying.