Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Learning JAX as a PyTorch developer
Assuming you already know PyTorch, this is what I’ve found that you need to know to get up to speed with JAX. (This isn’t really an article to try to sell you on JAX. My assumption is that you already know you want to use it for its cool autoparallel / advanced autodiff / ludicrous speed / scientific ecosystem / etc. etc. And I completely acknowledge that these awesome features come at the cost of something that has a slightly higher learning curve than PyTorch.) Let’s get started. We’ll cover 9 bullet points in total…
Nvidia Envy: understanding the GPU gold rush
In 2023, thousands of companies and countries begged Nvidia to purchase more GPUs. Can the exponential demand endure?…Will the exponential growth of Nvidia demand continue to outpace supply? Answering the trillion dollar shortage question is challenging – with FUD and propaganda driving the GPU conversation, it is hard to see through the noise to develop an intuition for the supply and demand dynamics at play. A few factors amplified the GPU shortage, and understanding them should help us understand how the 2020s will unfold…
6 months as a Data Science freelance - some tips
I have been a freelance Data Scientist for 6 months now, and I have more job offers than I can manage. I get offers almost every day and I have to turn down offers almost every week. Some people have written me to get some tips on how to start and get some clients, so I wanted to share a few things I tried to find clients on Upwork, LinkedIn and online communities…
Hex is a collaborative workspace for data science and analytics. Now data teams can run their queries, notebooks, and interactive reports — all in one place.
Hex has Magical AI tools that can generate queries and code, create visualizations, and even kickstart a whole analysis, all from natural language prompts, allowing teams to accelerate work and focus on what matters.
Join hundreds of data teams like Notion, AllTrails, Loom, Brex, and Algolia using Hex every day to make their work more impactful. Sign up today at hex.tech/datascienceweekly to get a 30-day free trial of the Hex Team plan!
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Sample size and predictive performance of machine learning methods with survival data: A simulation study
This work develops a time-to-event simulation framework to evaluate performances of Cox regression compared, among others, to tuned random survival forests, gradient boosting, and neural networks at varying sample sizes. Simulations were based on replications of subjects from publicly available databases, where event times were simulated according to a Cox model with nonlinearities on continuous variables and time-varying effects and on the SEER registry data…
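For readers who have not simulated survival data before, here is a minimal sketch of drawing event times from a Cox proportional hazards model. It is not the paper's framework: it assumes a constant baseline hazard and purely linear covariate effects (the paper adds nonlinearities and time-varying effects), and the function name `simulate_cox_times` and its parameters are hypothetical.

```python
import math
import random


def simulate_cox_times(covariates, betas, baseline_rate=0.1, seed=0):
    """Draw one event time per subject under h(t | x) = baseline_rate * exp(x . beta).

    Assumes a constant baseline hazard, which is a simplification.
    """
    rng = random.Random(seed)
    times = []
    for x in covariates:
        linear_predictor = sum(b * xi for b, xi in zip(betas, x))
        rate = baseline_rate * math.exp(linear_predictor)
        # Inverse-transform sampling: S(t) = exp(-rate * t), so T = -log(U) / rate.
        u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
        times.append(-math.log(u) / rate)
    return times


# Two hypothetical groups: x = 0 (low risk) and x = 1 (high risk), log hazard ratio 1.0.
low_risk = simulate_cox_times([[0.0]] * 2000, betas=[1.0], seed=1)
high_risk = simulate_cox_times([[1.0]] * 2000, betas=[1.0], seed=2)
mean_low = sum(low_risk) / len(low_risk)
mean_high = sum(high_risk) / len(high_risk)
```

With a positive coefficient, the high-risk group has a larger hazard, so its average simulated event time should come out shorter than the low-risk group's.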
Machine learning for microbiologists
Machine learning is increasingly important in microbiology where it is used for tasks such as predicting antibiotic resistance and associating human microbiome features with complex host diseases. The applications in microbiology are quickly expanding and the machine learning tools frequently used in basic and clinical research range from classification and regression to clustering and dimensionality reduction. In this Review, we examine the main machine learning concepts, tasks and applications that are relevant for experimental and clinical microbiologists. We provide the minimal toolbox for a microbiologist to be able to understand, interpret and use machine learning in their experimental and translational activities…
Vlite – simple vector database written in less than 200 lines of code
A blazing fast, lightweight, and simple vector database written in less than 200 lines of code…VLite is a vector database built for agents, ChatGPT Plugins, and other AI apps that need a fast and simple database to store vectors…It uses Apple's Metal Performance Shaders via PyTorch to accelerate vector loading. It uses CPU threading to accelerate vector queries to reduce time spent copying vectors from the GPU(MPS) to the CPU…
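To give a feel for how little code a basic vector store needs, here is a minimal sketch in plain Python. It is not VLite's API or implementation: the class `TinyVectorStore` and its methods are hypothetical, it runs on the CPU only, and it assumes cosine similarity as the ranking metric.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot use a zero vector")
    return [x / norm for x in vector]


class TinyVectorStore:
    """Hypothetical in-memory store for illustration; not VLite's actual API."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        # Normalize once at insert time to keep queries cheap.
        self.texts.append(text)
        self.vectors.append(_normalize(vector))

    def query(self, vector, top_k=1):
        # Brute-force scan: score every stored vector, then keep the best top_k.
        q = _normalize(vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), text)
            for text, v in zip(self.texts, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
store.add("cats", [1.0, 0.0])
store.add("dogs", [0.0, 1.0])
store.add("kittens", [0.9, 0.1])
results = store.query([1.0, 0.05], top_k=2)
```

Here `results` is a list of `(score, text)` pairs, best match first. A real library would get embeddings from a model and use vectorized math, but the store-then-scan shape is the same.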
What is the future for ML researchers and startups? [Reddit]
With the advent of LLMs, multimodality and “general purpose” AIs which sit on unimaginable amounts of money, computing power and data. I’m graduating and want to start a PhD, but feel quite disheartened given the huge results obtained simply by “brute-forcing” and by the ever-growing hype in machine learning that could result in a bubble of data scientists, ML researchers and so on…
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
Engaging in the deliberate generation of abnormal outputs from large language models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We relate and connect this activity between its practitioners' motivations and goals; the strategies and techniques they deploy; and the crucial role the community plays. As a result, this paper presents a grounded theory of how and why people attack large language models: LLM red teaming in the wild…
The Moat for Enterprise AI is RAG + Fine Tuning
Enterprise-ready generative AI must be:
Secure & private: Your AI application must ensure that your data is secure, private, and compliant, with proper access controls. Think: SecOps for AI.
Scalable: Your AI application must be easy to deploy, use, and upgrade, as well as be cost-efficient. You wouldn’t purchase – or build – a data application if it took months to deploy, was tedious to use, and impossible to upgrade without introducing a million other issues. We shouldn’t treat AI applications any differently.
Trusted: Your AI application should be sufficiently reliable and consistent. I’d be hard-pressed to find a CTO who is willing to bet her career on buying or building a product that produces unreliable code or generates insights that are haphazard and misleading.
With these guardrails in mind, it’s time we start giving generative AI the diligence it deserves…
NumPy for Numpties
The series will cover topics in no particular order. It will be a pseudo-random walk across NumPy. Some articles will focus on specific NumPy functions. Others will be mini-projects centred around NumPy and, inevitably, some Matplotlib. This is what I mean by a "loose series"…The aim of this post is to introduce the NumPy for Numpties series. But I can't really say my goodbyes without any Python code, can I?…So here's a taste of NumPy. The first time I saw the following trick, I thought it was magic. Full disclosure: this "first time" was using MATLAB, another programming language I used in the past, but the Python NumPy version is very similar…
LLM Apps Are Mostly Data Pipelines
While the existing LLM app tools like LangChain and LlamaIndex are useful for building LLM apps, their data loading capabilities aren’t recommended outside of initial experimentation. As I built and tested my LLM app pipeline, I was able to feel the pain of some of the aspects that are underdeveloped and hacked together. If you’re planning to build a production-ready data pipeline to fuel your LLM apps, you should strongly consider using an EL tool purpose-built for the job…
Maxar's Open Satellite Feed
Maxar operates a fleet of satellites that capture imagery of the Earth. Some of these satellites offered the best resolution commercially available when they were first launched into space. Last year, Maxar earned ~$1.6B largely from selling imagery these satellites produced…there is a freely available SpatioTemporal Asset Catalog (STAC) published by Maxar that details imagery URLs and metadata for 28 disaster events. Below is a map of these locations and event details…In this post, I'll download and examine Maxar's freely available satellite imagery…
Learning skillful medium-range global weather forecasting
We introduce “GraphCast,” a machine learning-based method trained directly from reanalysis data. It predicts hundreds of weather variables, over 10 days at 0.25° resolution globally, in under one minute. GraphCast significantly outperforms the most accurate operational deterministic systems on 90% of 1380 verification targets, and its forecasts support better severe event prediction, including tropical cyclones tracking, atmospheric rivers, and extreme temperatures. GraphCast is a key advance in accurate and efficient weather forecasting, and helps realize the promise of machine learning for modeling complex dynamical systems…
Gen-AI/LLM - Interview prep [Reddit]
I have an interview later this week for a role that involves implementing generative AI within the company's workflow: using LLMs with fine-tuning/in-context learning on system logs, that kind of stuff. I have studied machine learning and have worked for a few years now as well. I have a good understanding of that stuff but have never tried fine-tuning hands-on. I've worked mostly on computer-vision applications but think that I've lagged a bit on the LLM side. Any suggestions, recommended papers, courses, or videos I could go through?…
In this 6-week course, led by Thomas Nield, author of O'Reilly's "Essential Math for Data Science" and instructor at the University of Southern California, the mission is straightforward: bridge your knowledge gap in mathematics and connect you to the fundamental concepts that drive Artificial Intelligence, Machine Learning, and Data Science in a meaningful way.
Don't settle for superficial learning. Join other like-minded students on this exciting journey to not just learn math but to love it. Transform your career, and become an indispensable asset in the ever-evolving world of engineering and data science.
Check it out here → https://clas.pt/data-science-weekly
We are BCG X.
BCG X is the tech build & design unit of BCG. Turbocharging BCG’s deep industry and functional expertise, BCG X brings together advanced tech knowledge and ambitious entrepreneurship to help organizations enable innovation at scale. With nearly 3,000 technologists, scientists, programmers, engineers, and human-centered designers located across 80+ cities, BCG X builds and designs platforms and software to address the world’s most important challenges and opportunities.
Our BCG X teams own the full analytics value-chain end to end: framing new business challenges, designing innovative algorithms, implementing, and deploying scalable solutions, and enabling colleagues and clients to fully embrace AI. Our product offerings span from fully custom-builds to industry specific leading edge AI software solutions.
Our Data Scientists and Senior Data Scientists are part of our rapidly growing team, applying data science methods and analytics to real-world business situations across industries to drive significant business impact. You'll have the chance to partner with clients in a variety of BCG regions and industries, and on key topics like climate change, enabling them to design, build, and deploy new and innovative solutions.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
BCB 731: Critical readings in biomedical statistics and machine learning
BCB 731 (a.k.a Defense Against the Dark Arts) surveys recurring statistical errors and pitfalls sometimes used to exaggerate the weight of evidence for novel biological claims or inflate the estimated accuracy of proposed predictive biomedical models. This course focuses on misapplied analyses of data sources where a small number of biological samples are quantified into very high dimensional feature spaces, such as in genomics, proteomics, and biomedical imaging…
T81 558: Applications of Deep Neural Networks
This course will introduce the student to classic neural network structures, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), Generative Adversarial Networks (GAN) and reinforcement learning. Application of these architectures to computer vision, time series, security, natural language processing (NLP), and data generation will be covered. High Performance Computing (HPC) aspects will demonstrate how deep learning can be leveraged both on graphics processing units (GPUs), as well as grids. Focus is primarily upon the application of deep learning to problems, with some introduction to mathematical foundations. Students will use the Python programming language to implement deep learning using PyTorch…
Data Science Basics in Python [YouTube Playlist]
While teaching data analytics, geostatistics and machine learning to many students and working professionals, I see many struggle with basic Python concepts. So I decided to make a new series of Data Science Basics in Python. I plan to publish one live code demonstration per week on topics such as: 1. Working with Tabular Data with Pandas 2. Working with Gridded Data with NumPy 3. Basic Plotting with Matplotlib 4. Basic Statistical Analytics with SciPy, etc…
* Based on unique clicks.
** Find last week's issue #520 here.
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.