🎙 Google’s Allen Day on Using ML in the Cryptocurrency Space
It’s so inspiring to learn from practitioners and thinkers. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration.
👤 Quick bio
Allen Day (AD): I work in Google Cloud’s developer relations team. Our mission is to build a best-in-class experience for cloud devs. Within this team, I advocate for Google Cloud's web3 and data & analytics products. These span the range of engineering data pipelines, from ingest through analytics and machine learning. I spend most of my time with the data processing and transformation products. Regarding how I got into machine learning, it wasn’t through deliberate intention but rather the result of my lifelong interest in exploring and building at the intersection of computer code and DNA-based biocode. This started with teaching myself to program a computer at six years old and led me to pursue a graduate degree in bioinformatics, during which I learned how to use distributed systems to implement machine learning algorithms for research in human genetics.
🛠 ML Work
AD: I got interested in cryptocurrencies in 2013 but didn't get around to learning about the blockchain data structures until Ethereum's ICO boom in 2017. I noticed there were some structural parallels: the blockchain transaction graph looks like the graph of genetic interactions inside a cell. So I decided to apply some simple analyses, for example finding central nodes, and write a blog post about how I did it. Getting to a few charts in a Jupyter notebook took more data engineering work than I expected. I decided that nobody should need to do that work again, so I open-sourced the ETL and put the processed data into a free-to-access BigQuery dataset. Then I wrote my blog post. It was very well received by the blockchain community. Many analysts and engineers reached out to me, and a community formed around the open data. It became clear that we needed to address two key challenges to meet the community's needs: (1) a robust DevOps architecture (Kubernetes, Docker) to keep up to date with a blockchain network's consensus state, and (2) an extensible architecture for ETLing complex streaming data (Pub/Sub, Dataflow, Airflow) so that we could work with other blockchains such as Ethereum. I teamed up with a talented data engineer, Evgeny Medvedev, and we built the Blockchain ETL community and open-source software project. Today at GCP we maintain ~20 of these datasets in BigQuery. There's a Kaggle community analyzing them, and Evgeny went on to build a blockchain analytics company, Nansen, based on our work.
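As a rough illustration of what "finding central nodes" in a transaction graph can look like, here is a minimal sketch that ranks addresses with PageRank by power iteration. PageRank is only one possible centrality measure and is an assumption here, not necessarily the analysis used in the blog post. The function name `pagerank` and the toy edge list are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (sender, receiver) edges.

    Hypothetical helper for illustration; `edges` stands in for
    transfers extracted from a ledger.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for sender, receiver in edges:
        out_links[sender].append(receiver)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for sender, receivers in out_links.items():
            if not receivers:
                continue
            share = damping * rank[sender] / len(receivers)
            for receiver in receivers:
                new_rank[receiver] += share
        rank = new_rank
    return rank


# Toy graph: three addresses pay into "exchange", which pays out to "d".
toy_edges = [
    ("a", "exchange"),
    ("b", "exchange"),
    ("c", "exchange"),
    ("exchange", "d"),
    ("d", "a"),
]
ranks = pagerank(toy_edges)
most_central = max(ranks, key=ranks.get)
```

On this toy graph the address receiving the most inbound transfers ends up with the highest score. A real ledger would need the edge list pulled from the processed dataset first.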
AD: If we consider all of the data on all of the public blockchains, there are indeed some small areas that are effectively invisible. For the majority of the data, though, we can see the transactions. Some blockchains are account-based, so we can directly see system actors. Other blockchains are transaction-based, and we need to use clustering methods to build synthetic identities. In all cases, we can reduce the ledger activity to a working set of system actors. From here, it's common to create continuous features, for example by using dispersion modeling to estimate contamination from a ransomware payment address. It's also common to use public label data to create categorical features, for example by using a random forest to find look-alikes to known labeled actors (miners, traders) based on their activity aggregated over time.
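One way to picture the clustering step for transaction-based chains is the following minimal sketch. It assumes the widely used common-input-ownership heuristic (addresses spent together as inputs of one transaction are treated as one actor). The interview does not name a specific clustering method, so this choice is an assumption, as are the function name `cluster_addresses` and the toy input format.

```python
def cluster_addresses(transactions):
    """Group addresses that appear together as inputs of a transaction.

    `transactions` is a list of input-address lists (hypothetical format).
    Uses union-find to merge addresses into synthetic identities.
    """
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for input_addresses in transactions:
        if not input_addresses:
            continue
        first = input_addresses[0]
        find(first)  # register single-input transactions too
        for other in input_addresses[1:]:
            union(first, other)

    clusters = {}
    for addr in parent:
        clusters.setdefault(find(addr), set()).add(addr)
    return sorted(sorted(members) for members in clusters.values())


# "A" and "B" are co-spent, then "B" and "C", so all three merge.
example = cluster_addresses([["A", "B"], ["B", "C"], ["D"], ["E", "F"]])
# example == [["A", "B", "C"], ["D"], ["E", "F"]]
```

Each resulting cluster can then serve as one "system actor" whose activity is aggregated into features. The heuristic is known to produce false merges in some cases (for example with collaborative transactions), so the output is an estimate rather than ground truth.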
AD: Yes, definitely! Investment in graph databases and the popularity of graph analytics workloads continue to grow. Their data access capabilities are on the cusp of being generally usable, and there is an opportunity to apply graph databases to blockchain data structures. Why do we care about graph data structures at all? A graph is the ultimate generalized data structure. It can represent the blockchain data with high fidelity, and it can encode rich relationships between nodes (temporal, semantic, social, spatial, functional). We've already demonstrated that there's useful inductive bias for non-graph-based methods. It seems reasonable to expect that graph-aware models like GNNs will outperform the more basic methods. I also think it's the right time to be thinking about this. As I described earlier, most of the activity on-chain is open for all to see. But we should expect these data to become more obfuscated and opaque over time. After all, one of the fundamental technologies upon which blockchains are based is cryptography. So more hiding capabilities will be introduced, and on-chain actors will become more aware that they're living in a dark forest. This becomes an adversarial ML problem. So we'll need the more powerful capabilities that are unlocked with GNNs, like identifying anomalous transactions and, conversely, identifying transactions that don't exist (yet) but should. Classifying nodes with GNN embeddings and applying graph kernels to characterize neighborhoods will also prove useful.
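To make "graph-aware" concrete, the sketch below shows the neighborhood aggregation step that GNNs build on: each node's feature vector is replaced by the mean of its own features and those of its neighbors. This is a deliberately stripped-down illustration with no learned weights and no nonlinearity, so it is not a trained GNN. The function name `propagate`, the undirected treatment of edges, and the toy features are all assumptions.

```python
def propagate(features, edges):
    """One round of mean aggregation over each node and its neighbors.

    `features` maps node -> list of floats; every node that appears in
    `edges` is assumed to have an entry. Edges are treated as undirected.
    """
    neighbors = {node: {node} for node in features}  # include a self-loop
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    dim = len(next(iter(features.values())))
    updated = {}
    for node, hood in neighbors.items():
        updated[node] = [
            sum(features[member][i] for member in hood) / len(hood)
            for i in range(dim)
        ]
    return updated


# Toy features, e.g. [is_known_miner, is_known_trader] from label data.
toy_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
toy_edges = [("a", "b"), ("b", "c")]
smoothed = propagate(toy_features, toy_edges)
```

After one round, the unlabeled node "c" picks up some of the signal from its neighbor "b". Stacking such rounds with learned weight matrices and nonlinearities between them is the basic recipe behind graph convolutional models.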
AD: The theme of your question seems to be about building ML microservices that use a blockchain backplane. We're already seeing this today with blockchain oracles: middleware solutions that address the software oracle problem. I pioneered the concept of hybrid blockchain/cloud applications with Chainlink, and the essential problem we solved was how to run intense workloads by decoupling the on-chain compute for logging the transaction from the resources needed to deliver the result. As a concrete example, this design pattern allows spinning up a Docker container to train a model or perform inference using a GPU and get results delivered on-chain. Blockchains employ checksums everywhere, so a nice feature that you get for free by doing this is responsible AI: the input dataset can be transparent and verified, and the model training/inference processes are deterministic and reproducible. Regarding federated learning, I haven't seen an implementation that coordinates it with a blockchain, but it seems possible. We can reuse the same oracle-based worker pattern described above, and converge with a MapReduce orchestrator. The techniques used to survive in the dark forest, like zero-knowledge proofs, may also be helpful here for managing privacy as blockchain-integrated ML models are brought to market.
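The checksum idea can be sketched in a few lines: an off-chain worker refuses to train unless the dataset matches a digest that was committed beforehand, and the training step is seeded so that it can be replayed. Everything here is a hypothetical stand-in. The function names, the JSON canonicalization, and the trivial "model" are assumptions for illustration, and no real oracle or Chainlink API is shown.

```python
import hashlib
import json
import random


def dataset_digest(rows):
    """SHA-256 over a canonical JSON encoding of the dataset."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def train(rows, seed=0):
    """Stand-in for training: a seeded, repeatable computation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    sample = shuffled[: max(1, len(shuffled) // 2)]
    # The "model" is just a sample mean; the point is reproducibility.
    return sum(row["value"] for row in sample) / len(sample)


def verified_train(rows, committed_digest, seed=0):
    """Train only if the data matches the digest committed beforehand."""
    if dataset_digest(rows) != committed_digest:
        raise ValueError("dataset does not match the committed digest")
    return train(rows, seed)


rows = [{"value": 1.0}, {"value": 2.0}, {"value": 3.0}, {"value": 4.0}]
digest = dataset_digest(rows)  # in the pattern above, this would be logged on-chain
result = verified_train(rows, digest, seed=7)
```

Anyone holding the same rows, digest, and seed can rerun `verified_train` and compare results. Real GPU training is harder to make bit-for-bit deterministic than this toy computation, so the reproducibility guarantee depends on the training stack.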
AD: With regard to ML and NFTs, we're seeing NFTs that grant the owner access to ML-linked products and experiences, acting like a license key or a config file. ML is already being used in off-chain trading systems, and I expect we'll also see the on-chain equivalent of this, where oracle-linked ML models are integral to the automated protocols that power decentralized finance and games. It's a great time to get involved at the intersection of ML and crypto, and it's been an honor to share with your audience some current market opportunities and areas of open inquiry. I'm excited to see more ML practitioners get involved and see what they'll create.
💥 Miscellaneous – a set of rapid-fire questions
AD: The Elements of Statistical Learning (free PDF) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; Introduction to Information Retrieval (free PDF) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze; Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze.
AD: I don’t think they’re equal, no. If P=NP, there are of course major ramifications for cryptography and the entire stack of blockchain applications built on top of it. But that’s a tiny disruption relative to all of the other assumed limitations that would be broken. Perhaps this question is so captivating because of how close it is to the human condition. We want unlimited reach (P=NP) while operating from a place of total safety (P≠NP). But the math sublimely indicates we can’t have it both ways; this is both beautiful and terrifying.