Descriptive and Inferential Statistics

Hello everyone! Today I’ll be writing a bit about descriptive and inferential statistics.

Data Scientists invest a lot of time in pre-processing data, and this requires a good understanding of statistics. Statistics is a branch of mathematics that studies the collection, presentation, analysis, and interpretation of data. Statistical modeling lies at the heart of Data Science and Analysis!

Two main statistical methods are used in data analysis: descriptive statistics and inferential statistics. Descriptive statistics is solely concerned with the properties of the observed data; it does not rest on the assumption that the data come from a larger population. (Note that in machine learning, the term ‘inference’ is sometimes used in a different sense, to mean ‘making a prediction by evaluating an already trained model’.)

Some of the main concepts that fall under descriptive statistics are as follows (a short code sketch follows the list):

  1. Measures of central tendency (Mean, Median, Mode, Percentiles and Quantiles)
  2. Measures of variability (Range, Variance and Standard Deviation, Standard Error)
  3. Measures of association between two variables (Covariance, Correlation Coefficient)
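
As a minimal sketch of how these measures can be computed in practice, here they are with pandas; the numbers are made up purely for illustration:

```python
import pandas as pd

# Made-up sample data (any numeric column works the same way)
data = pd.Series([21.5, 23.0, 22.1, 25.4, 24.8, 23.0, 26.2, 22.9])

# Measures of central tendency
print("mean:            ", data.mean())
print("median:          ", data.median())
print("mode:            ", data.mode().tolist())
print("90th percentile: ", data.quantile(0.90))

# Measures of variability
print("range:           ", data.max() - data.min())
print("sample variance: ", data.var())   # pandas uses ddof=1 by default
print("sample std dev:  ", data.std())
print("standard error:  ", data.sem())

# Measures of association between two variables
other = pd.Series([1.2, 1.6, 1.4, 2.1, 1.9, 1.5, 2.3, 1.4])
print("covariance:      ", data.cov(other))
print("correlation:     ", data.corr(other))   # Pearson's r
```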

Inferential statistics, on the other hand, is the process of using data analysis to infer properties of an underlying probability distribution. The observed data set is assumed to be sampled from a larger population, and the goal is to draw conclusions about that population based on the sample, e.g., by testing hypotheses and deriving estimates.

Inferential statistics uses probability theory and statistical inference techniques to estimate population parameters, test hypotheses, and make predictions about future outcomes. It involves analyzing data to determine whether observed differences or relationships are statistically significant, meaning they are unlikely to have occurred by chance.

Some of the main concepts that fall under inferential statistics are as follows, all falling under the topic of Test Statistics and Statistical Significance (see the sketch after the list):

  1. Z-Tests
  2. T-Tests
  3. Confidence Intervals
  4. Chi-Squared Tests
  5. F-Tests
  6. P-Values
  7. A/B Testing
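
As a minimal sketch of such a test, here is a two-sample t-test (plus a confidence interval) in Python with SciPy; the groups and effect size are made up, and the other tests listed above follow the same pattern:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical A/B test: scores for two page variants
group_a = rng.normal(loc=5.0, scale=1.0, size=200)
group_b = rng.normal(loc=5.3, scale=1.0, size=200)

# Two-sample t-test: are the group means significantly different?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# 95% confidence interval for the mean of group A
ci = stats.t.interval(0.95, df=len(group_a) - 1,
                      loc=np.mean(group_a), scale=stats.sem(group_a))
print("95% CI for the mean of A:", ci)

# Conventional reading: significant at the 5% level if p < 0.05
print("significant" if p_value < 0.05 else "not significant")
```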

Don’t forget to comment and let me know what you think!

My new book is live on Amazon!!

Good day everyone!

I’m excited to announce that my book just went live on Amazon and is ready to order! You can find info on the book, free sample content, and purchase links on the blog’s Bookstore.

I have been working on the book since the first of January this year, so it’s really nice to see it reach the production stage. I’m now ready to return to my other projects.

Cheers

Announcement: my parallel, philosophy blog is up and running!!

As you know, this blog is on technical, math-heavy issues in data science. Since my professional philosophy training, particularly as it relates to the Philosophy of AI, Data, and IT Ethics, called for a platform to share my thoughts, I thought I’d start a parallel blog on Medium just for those purposes: https://medium.com/@philanddata.

Make sure to subscribe and follow me on Medium; I have a lot to share on the philosophical side of things 🙂

On the Cross-Section of the Most Relevant Literature on Large Language Models

I just came across this blog post by Sebastian Raschka from the Ahead of AI magazine. It gives a nice, structured overview of relevant, to-the-point literature on different aspects of LLMs, helping readers get up to speed with the recent advances in this area. The main focus is on academic research papers, but many additional resources, aimed at experts or general audiences, are cited as well.

I hope you enjoy reading this article as much as I did! Here’s the link to it.

Note: In the near future, I will be posting my own content on LLMs, so stay tuned for that too!

Announcement: my new book will be out soon!

The Front Cover

Hello good people!

I’m very excited to announce that my new book will be out soon. The book is titled No Bullshit Math for Data Science, and it is exactly what its title suggests! If you’re like me, tired of purchasing ~350-page books on data science or its formal foundations for about $50-100, only to learn lots of coding but much less actual math, this book is for you.

I won’t spoil it too much now, but the book covers all the core math topics required for a deep understanding of data science — linear algebra, calculus, probability theory and statistics — in great detail.

Moreover, it dives deep into the mathematics behind some of the most famous machine-learning algorithms. Finally, even though the theme of the book is mainly mathematics, the ML algorithms are also accompanied by sample Python code for production.

Stay tuned for the rollout announcement!!

Book Review: “A Guide for Making Black Box Models Explainable”

This book review is for Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, by Christoph Molnar, the second edition of which was self-published in 2022 and spans 319 pages. The book is about making machine learning models and their decisions interpretable, which is a significant issue because computers do not usually explain their predictions, hindering the adoption of machine learning.

The lack of interpretability of machine learning models has been a major obstacle in their adoption, despite their great potential for improving products, processes, and research. Computers are not transparent about how they arrive at their predictions, and this can create a lack of trust in their decisions. This is especially problematic in high-stakes applications such as medical diagnosis, where a wrong prediction could have serious consequences.

Fortunately, there has been a growing interest in making machine learning models interpretable, and this is the focus of the book Interpretable Machine Learning by Christoph Molnar. The author recognizes that many existing models are considered “black boxes” because their internal workings are not transparent, and he sets out to provide practical tools for interpreting them.

The book begins with exploring the concept of interpretability, followed by a discussion of simple, interpretable models such as decision trees, decision rules, and linear regression. The main focus of the book is on model-agnostic methods for interpreting black box models, such as feature importance and accumulated local effects, and explaining individual predictions using Shapley values and LIME. In addition, the book presents methods specific to deep neural networks.
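
As a minimal, self-contained illustration of one such model-agnostic method (my own sketch, not code from the book), here is permutation feature importance computed with scikit-learn: each feature is shuffled in turn, and the resulting drop in model performance measures how much the model relies on it.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Fit a "black box" model on a standard dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature and measure the drop in the test score;
# a large drop means the model relies heavily on that feature
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean),
                key=lambda pair: -pair[1])
for name, importance in ranked:
    print(f"{name:>6}: {importance:.4f}")
```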

Each interpretation method is explained in depth, with a critical evaluation of its strengths and weaknesses and of how its outputs can be interpreted. The book enables machine learning practitioners, data scientists, statisticians, and anyone interested in making machine learning models interpretable to select and apply the most appropriate interpretation method for their project.

The book reads like a collection, or little encyclopedia, of methods and model performance metrics, favoring exhaustiveness over selectivity. The level and amount of mathematics are appropriate, giving a real feel for what the techniques do.

The author provides applications based on real or “prototype” data, with source code on the author’s GitHub repository and website. You can also purchase the published version on Amazon. However, looking through the reviews, the print version’s quality seems to have room for improvement, and a PDF version with clickable links would be preferable.

Overall, the book is highly recommended. Its comprehensive and critical evaluation of interpretation methods makes it a valuable resource for anyone seeking to improve their understanding of interpretability in machine learning and its practical applications.

Algebra and its Applications in Data Science

Algebra is a branch of mathematics that deals with the study of mathematical symbols and the rules for manipulating these symbols. It is a powerful tool in data science, where it is used to model and analyze complex systems. Data science involves the use of mathematical, statistical, and computational methods to extract meaningful insights from data. In this article, we will explore the applications of algebra in data science and its significance in this field.

Algebraic equations can be used to model real-world problems, such as predicting the future value of a stock or forecasting the weather. These models are based on mathematical formulas that describe the relationships between various variables.

One of the most common algebraic equations used in data science is the linear equation. Linear equations are used to model relationships between two variables, such as the relationship between height and weight or the relationship between the price of a product and the number of units sold.
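
As a quick sketch of the price-vs-units example (with made-up numbers), a linear equation can be fitted to data in a couple of lines with NumPy:

```python
import numpy as np

# Hypothetical data: product price vs. units sold
price = np.array([10, 12, 15, 18, 20, 24])
units = np.array([180, 165, 140, 120, 105, 80])

# Least-squares fit of the linear equation: units = a * price + b
a, b = np.polyfit(price, units, deg=1)
print(f"units ≈ {a:.2f} * price + {b:.2f}")

# Use the fitted line to predict sales at a new price point
print("predicted units at price 16:", a * 16 + b)
```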

In addition to linear equations, data scientists use a variety of other algebraic equations, such as quadratic equations and exponential equations, to model complex systems. Algebraic methods are used in data science to identify patterns in data. These patterns can be used to make predictions about future outcomes or to understand the behavior of a system.

Algebraic techniques are also used in data science to perform data analysis. For example, algebraic methods can be used to transform data into a more useful format or to identify outliers in a dataset. Algebraic concepts such as matrices, vectors, and tensors are used extensively in machine learning, a subfield of data science. Machine learning algorithms use these concepts to analyze and manipulate large datasets.
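
To make these objects concrete, here is a minimal NumPy sketch of a matrix, a vector, and a tensor acting on data (the numbers are made up):

```python
import numpy as np

# A feature matrix X (3 samples, 2 features) and a weight vector w
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])

# Matrix-vector product: one prediction per sample in a single operation
predictions = X @ w
print(predictions)    # [-1.5 -2.5 -3.5]

# Stacking matrices yields a tensor (a 3-D array), the basic data
# structure of deep learning
batch = np.stack([X, X + 1.0])
print(batch.shape)    # (2, 3, 2)
```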

Linear algebra, a branch of algebra that deals with linear equations and their representations, is particularly important in machine learning. Linear algebra is used to represent data in a high-dimensional space, where it can be analyzed and manipulated by machine learning algorithms.

Algebraic methods are used in data science to perform statistical analysis. For example, linear regression, a statistical method used to model the relationship between two variables, is based on algebraic equations.

Algebraic techniques are also used in data science to perform optimization. Optimization involves finding the optimal solution to a problem, such as the maximum or minimum value of a function. Iterative methods, such as gradient descent, are used to find these optimal solutions.

Finally, algebraic concepts are used in data science to perform text analysis, a technique used to extract information from textual data. Algebraic techniques, such as latent semantic analysis, are used to identify patterns in the text and to extract meaning from it.
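
As a rough sketch of how these pieces fit together, here is gradient descent minimizing the mean squared error of a simple linear regression; the data are made up, and the loop is hand-rolled for illustration rather than production use:

```python
import numpy as np

# Made-up data roughly following y = 2x + 1, plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Fit y = a*x + b by gradient descent on the mean squared error
a, b = 0.0, 0.0
lr = 0.02                             # learning rate (step size)
for _ in range(2000):
    err = (a * x + b) - y             # residuals
    a -= lr * 2 * np.mean(err * x)    # partial derivative of MSE w.r.t. a
    b -= lr * 2 * np.mean(err)        # partial derivative of MSE w.r.t. b

print(f"fitted: y ≈ {a:.2f}x + {b:.2f}")   # should land close to y = 2x + 1
```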

That’s about it for now! Please let me know in the comments or through the Contact page if you have any questions or suggestions.

Welcome!

Greetings and welcome to my blog, Math and Data!

My name is Amir, and I’m excited to share my knowledge and passion for data, its mathematics and philosophy! I have a degree and several years of experience teaching various topics in mathematics. I also have a PhD in philosophy, specializing in ontologies, data ethics and the philosophy of AI.

In this blog, I write about data science, its mathematics, and philosophically relevant issues such as data ethics, the question of machine consciousness, and, more generally, the philosophy of AI. These will constitute the educational articles of the blog. I will also be reviewing various resources, such as books and articles, on these topics.

Whether you’re just starting out on your math and data science journey or you’re looking to take your skills to the next level, I hope this blog will be a helpful resource for you!

You can contact me through the Contact page or [email protected].

Cheers