Machine Learning in Python

  • 1. Machine Learning

    • 1.1. What is machine learning?

      • In 1959, Arthur Samuel thought of a new approach. He wondered if computers could infer logic instead of being given explicit instructions. In other words, he wondered if machines could learn.
      • What if we give a computer just the input data and the end results of a previously accomplished task? Could the computer figure out the best set of instructions that would yield the given output based on the data provided to it?
        • Let's say we give the computer the numbers on the left (input) and the numbers on the right (output).
        • If the computer is able to identify what mathematical operation to apply to get the output, then we say the machine is learning, as in the sketch below.
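        • A minimal sketch of this idea (the numbers and the use of scikit-learn's LinearRegression are illustrative assumptions; the notes don't name a library):
          # The machine infers the hidden operation (here: multiply by 2) from examples
          from sklearn.linear_model import LinearRegression

          inputs = [[1], [2], [3], [4]]   # numbers on the left (input)
          outputs = [2, 4, 6, 8]          # numbers on the right (output)

          model = LinearRegression()
          model.fit(inputs, outputs)      # "learn" the instruction from the pairs
          print(model.predict([[5]]))     # ~[10.], i.e. it inferred "multiply by 2"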
      • Supervised learning
        • After we train a model (a model that has learned the right set of instructions for a given task), going forward we simply give it the input data and it applies its internal instructions to produce the output.
        • Useful in solving problems such as image recognition, text prediction, and spam filtering.
      • Unsupervised learning
        • In unsupervised learning we simply ask the machine to evaluate the input data and identify any hidden patterns or relationships that exist in the data.
        • Useful in movie recommendation systems.
      • Reinforcement learning
        • In reinforcement learning there are two primary entities
          • i.e. Agent and Environment
        • The agent figures out the best way to accomplish a task through a series of cycles in which it takes an action and receives immediate positive or negative feedback on that action from the environment.
        • Useful in PC game engines, robotics, and self-driving cars.
    • 1.2. What is not machine learning?

      • It is sometimes said that ML is simply glorified statistics. It is true that ML borrows a lot of concepts from statistics; however, ML also borrows from IT, calculus, and even biology.
      • Objective of ML: to predict what's going to happen in the future (What's next?)
      • Objective of a statistical model: mostly concerned with the relationship between variables (What is?)
        • With a statistical model we understand what happens to variable B as a result of a change in variable A.
      • Difference between ML and Data Mining
        • ML approaches are primarily focused on prediction; they make predictions based on the known properties of the data.
        • Data mining is focused on the discovery of previously unknown properties in the data.
      • In field of business analytics
        • ML = Predictive analytics
        • Data Mining = Descriptive analytics
        • Optimization = Prescriptive analytics
      • We use descriptive analytics to track and analyze existing data in order to identify new patterns.
      • We use predictive analytics to analyze past trends in order to predict the likelihood of future outcomes.
      • We use prescriptive analytics to recommend actions based on prior performance.
    • 1.3. What is unsupervised learning?

      • Unsupervised learning is the process of building a descriptive model.
        • Descriptive models are used to summarize and group data in new and interesting ways.
      • e.g. Customer Segmentation
        • Suppose you want to group your customers based on how similar they are to each other in order to better market your products to them.
        • Here we have two kinds of info:
          • 1) Historical info about the spending habits of our customers
          • 2) Demographic info about each customer (age, gender, etc.)
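      • A minimal clustering sketch for this scenario (the data, column names, and choice of k-means are illustrative assumptions):
        import pandas as pd
        from sklearn.cluster import KMeans

        # Hypothetical customers: spending habits + demographics
        customers = pd.DataFrame({
            'annual_spend': [500, 5200, 480, 5100, 2600],
            'age':          [22, 45, 25, 47, 34],
        })

        # No labels are given; the algorithm discovers the groups itself
        kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
        customers['segment'] = kmeans.fit_predict(customers)
        print(customers)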
    • 1.4. What is supervised learning?

      • Supervised learning is the process of training a predictive model.
      • Predictive models are used to assign labels to unlabeled data based on patterns learned from previously labeled historical data.
      • If we want to predict the outcome of a new event, we can use a predictive model that has been trained on similar events.
      • e.g. Loan outcomes
        • Suppose you want to predict the loan risk of an individual customer based on the info they provide on their loan application.
        • Develop an ML model that predicts whether a particular customer will or will not default on a loan.
        • Info available to us:
          • 1) Descriptive data about each loan (loan amount, annual salary, etc.)
          • 2) The outcome of each previous loan
        • Independent data + Dependent data = Training data
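        • A minimal training sketch (the data, column names, and choice of logistic regression are illustrative assumptions):
          import pandas as pd
          from sklearn.linear_model import LogisticRegression

          # Hypothetical loan history: descriptive data + known outcomes
          loans = pd.DataFrame({
              'loan_amount':   [5000, 24000, 3000, 30000, 12000, 28000],
              'annual_salary': [60000, 35000, 80000, 30000, 55000, 32000],
              'defaulted':     [0, 1, 0, 1, 0, 1],
          })

          X = loans[['loan_amount', 'annual_salary']]   # independent data
          y = loans['defaulted']                        # dependent data (labels)
          model = LogisticRegression(max_iter=1000).fit(X, y)  # trained on labeled history

          new_applicant = pd.DataFrame({'loan_amount': [10000], 'annual_salary': [50000]})
          print(model.predict(new_applicant))           # predicted outcome: 0 or 1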
    • 1.5. What is Reinforcement learning?

      • Reinforcement learning is the science of learning to make decisions from interaction.
        • It is similar to early childhood learning
      • Reinforcement learning attempts to accomplish two distinct learning objectives:
        • 1) Finding previously unknown solutions to a problem
          • e.g. Machine playing chess
        • 2) Finding online solutions to problems that arise due to unforeseen circumstances
          • e.g. A machine that is able to find an alternate route due to a landslide on the current route
        • Feedback: (State, Reward)
          • The state describes the impact of the agent's previous action on its environment and the possible actions the agent can take. Each action is associated with a numeric reward that the agent receives as a result of taking that action.
        • Exploitation: choosing the action that maximizes reward
        • Exploration: choosing an action with no consideration of the reward
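        • A minimal epsilon-greedy sketch of this trade-off (a toy bandit problem; all numbers are hypothetical):
          import random

          true_rewards = {'A': 1.0, 'B': 5.0, 'C': 2.0}   # hidden from the agent
          estimates = {a: 0.0 for a in true_rewards}      # agent's running estimates
          counts = {a: 0 for a in true_rewards}
          epsilon = 0.1                                   # explore 10% of the time

          for step in range(1000):
              if random.random() < epsilon:
                  action = random.choice(list(true_rewards))      # exploration
              else:
                  action = max(estimates, key=estimates.get)      # exploitation
              reward = true_rewards[action] + random.gauss(0, 1)  # noisy feedback
              counts[action] += 1
              estimates[action] += (reward - estimates[action]) / counts[action]

          print(max(estimates, key=estimates.get))  # converges to 'B'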
    • 1.6. What are the steps to Machine Learning?

      • 1) Data Collection: Identify and acquire the data you need for ML
      • 2) Data Exploration: Understand your data by describing and visualizing it
      • 3) Data Preparation: Modify your data so it works for the type of ML you intend to do
      • About 80% of the time is usually spent on the above three stages
      • 4) Modelling: Apply a ML approach to your data
      • 5) Evaluation: Assess how well your ML approach worked
      • Iterate over the above two steps (modelling and evaluation) to find the best model
  • 2. Collecting Data for ML

    • 2.1. Things to consider when collecting data

      • Identify and acquire data you need for ML
      • There are 5 key considerations we need to keep in mind
      • 1) Accuracy
      • 2) Relevance
      • 3) Quantity
      • 4) Variability
      • 5) Ethics
    • 2.2. How to import data in Python

      • Pandas package: it provides several easy-to-use functions for creating, structuring, and importing data
      • import pandas as pd: here pd is an alias, which lets us refer to the package's functions simply as pd.functionName()
      • Ways of representing data:
        • 1) Series: a heterogeneous, 1-D, array-like data structure with labelled rows
          • brics1=pd.Series(members)
        • 2) DataFrame: a heterogeneous, 2-D data structure with labelled rows and columns
          • We can think of a DataFrame as a collection of several Series
          • A DataFrame is very similar to a spreadsheet or a relational database table
            • brics2=pd.DataFrame(members)
        • 3) Import data from a csv file
          • brics3=pd.read_csv("brics.csv")
        • 4) Import data from an excel file
          • brics4=pd.read_excel("brics.xlsx")
          • For excel with multiple sheets:
          • brics5=pd.read_excel("brics.xlsx",sheet_name="xyzabc")
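      • Putting it together (a minimal sketch; the members dictionary is a hypothetical stand-in for the BRICS data used above):
        import pandas as pd

        # Hypothetical BRICS data for the Series/DataFrame examples above
        members = {
            'country':    ['Brazil', 'Russia', 'India', 'China', 'South Africa'],
            'capital':    ['Brasilia', 'Moscow', 'New Delhi', 'Beijing', 'Pretoria'],
            'population': [208, 144, 1352, 1393, 58],   # millions (approximate)
        }

        brics1 = pd.Series(members['country'])   # 1-D: labelled values
        brics2 = pd.DataFrame(members)           # 2-D: labelled rows and columns
        print(brics2.head())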
  • 3. Understanding data for machine learning

    • 3.1. Describe your data

      • Data exploration: Understand your data by describing it and visualizing it
      • Data exploration enables us to answer questions such as:
        • How many rows and columns are there in data?
        • What type of data do we have?
        • Are there missing, inconsistent or duplicate values in the data?
      • In ML we use certain key terms to describe the structure and nature of our data
        • Instance: An instance is an individual independent example of the concept represented by the data set
        • Feature: Property or characteristics of an instance
        • Categorical feature: Attribute that holds data stored in discrete form
        • Continuous feature: Attribute that holds data stored in the form of an integer or real no.
        • Dimensionality: The no. of features in a dataset
        • Sparsity & Density: The degree to which data exists in a dataset
          • e.g. If 20% of the values in the dataset are missing or undefined, we say that the data is 20% sparse. Density is the complement of sparsity.
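        • A minimal sketch of computing sparsity and density with pandas (hypothetical data):
          import numpy as np
          import pandas as pd

          df = pd.DataFrame({'a': [1, 2, np.nan, 4, 5],
                             'b': [np.nan, 2, 3, 4, np.nan]})

          sparsity = df.isnull().to_numpy().mean()   # fraction of missing values
          print(f"sparsity: {sparsity:.0%}, density: {1 - sparsity:.0%}")  # 30%, 70%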
    • 3.2. How to summarize data in Python

      • The pandas DataFrame provides several easy-to-use methods that help us describe and summarize data
      • One of these methods is info()
      • import pandas as pd
      • washers=pd.read_csv("washers.csv")
      • washers.info() : gives a concise summary of the rows and columns
      • washers.head() : returns the first few rows in the DataFrame
      • Simple aggregation:
        • describe() It returns a statistical summary for each of the columns in a DataFrame
          • NOTE: The descriptive stats returned by the describe() depends on the data type of a column
          • washers[['BrandName']].describe()
          • washers[['Volume']].describe()
        • value_counts() : returns a Series containing counts of unique values
          • The resulting object is in descending order, so the first element is the most frequently occurring element
          • washers['BrandName'].value_counts()
          • washers['BrandName'].value_counts(normalize=True) : to get the output as proportions instead of raw counts
        • mean() : returns the average
          • washers[['Volume']].mean()
      • Group aggregation:
        • We get specific aggregations at the group level
          • For e.g. We can compute the average volume of washers by brand
        • washers.groupby('BrandName')[['Volume']].mean() : the result is sorted by BrandName
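      • A runnable sketch tying the above together (assumes a local washers.csv with BrandName and Volume columns, as in the examples above):
        import pandas as pd

        washers = pd.read_csv('washers.csv')

        washers.info()                                # rows, columns, dtypes
        print(washers.head())                         # first five rows
        print(washers[['Volume']].describe())         # numeric summary stats
        print(washers['BrandName'].value_counts(normalize=True))  # proportions
        print(washers.groupby('BrandName')[['Volume']].mean())    # group means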
    • 3.3. Visualize your data

      • Depending on type of question we are trying to answer there are 4 major types of visualization we could use:
        • 1) Comparison
          • Comparison visualization illustrates the difference between two or more items at a given point in time or over a period of time.
          • It answers questions such as:
            • Is a feature important?
            • Does the median value of a feature differ between subgroups?
            • Does a feature have outliers?
        • 2) Relationship
          • Relationship visualization illustrates the correlation between two or more variables
          • It answers questions such as:
            • How do two features interact with each other?
            • Is a feature important?
            • Does a feature have outliers?
        • 3) Distribution
          • Distribution visualization shows the statistical distribution of the value of a feature
          • It answers questions such as:
            • Does a feature have outliers?
            • How spread out are the values of a feature?
            • Are the values of a feature symmetric or not?
        • 4) Composition
          • Composition visualization shows the component makeup of the data
          • It answers questions such as:
            • How much does a subgroup contribute to the total?
            • What is the relative or absolute change in the composition of a subgroup over time?
    • 3.4. How to visualize data in Python

      • One of the most popular visualization packages in python is matplotlib
        • %matplotlib inline : to ensure that the plots we create appear right after our code (a Jupyter magic command)
      • 1) Relationship visualization:
        • These types of visualizations are used to illustrate the correlation between two or more continuous variables
        • Scatter plots are among the most commonly used relationship visualizations; they show how one variable changes in response to a change in another
          • vehicles.plot(kind='scatter', x='citympg', y='co2emissions')
      • 2) Distribution visualization:
        • Distribution visualization illustrates the statistical distribution of the values of a feature
        • One of the most commonly used distribution visualization is histogram
          • With histogram we can figure out which values are most common for a feature
          • vehicles['co2emissions'].plot(kind='hist')
      • 3) Comparison visualization:
        • Comparison visualizations are used to illustrate the difference between two or more items at a given point in time or over a period of time
        • Most commonly used comparison visualization is box-plot
          • Using box-plot we can compare the distribution of values for a continuous feature against the values of a categorical feature
      • 4) Composition visualization:
        • These type of visualization show the component makeup of data
        • Most commonly used composition visualization is stacked bar
          • Stacked bar shows how much a sub-group contributes to the whole
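      • A minimal sketch of all four types with pandas/matplotlib (the vehicles.csv file and its citympg, co2emissions, drive, and year columns are assumptions carried over from the examples above):
        import pandas as pd
        import matplotlib.pyplot as plt

        vehicles = pd.read_csv('vehicles.csv')

        # 1) Relationship: scatter plot of two continuous features
        vehicles.plot(kind='scatter', x='citympg', y='co2emissions')

        # 2) Distribution: histogram of one continuous feature
        vehicles['co2emissions'].plot(kind='hist')

        # 3) Comparison: box plot of a continuous feature split by a categorical one
        vehicles.boxplot(column='co2emissions', by='drive')

        # 4) Composition: stacked bar of subgroup counts over a category
        pd.crosstab(vehicles['year'], vehicles['drive']).plot(kind='bar', stacked=True)

        plt.show()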
  • 4. Preparing data for ML

    • 4.1. Common data quality issues

      • Missing data
      • Outliers
    • 4.2. How to resolve missing data in python

      • mask=students['state'].isnull()
        mask
        # output: a Boolean Series that is True wherever state is null

      • dropna()
        • It is used to remove missing values
        • Examples:
          • students.dropna()
          • students=students.dropna(subset=['state','zip'],how='all')
      • fillna()
        • It replaces the missing value with something else
        • students=students.fillna({'gender':'Female'})
      • .loc[ ]
        • To apply mask as a row filter
        • students.loc[mask,:]
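      • A minimal end-to-end sketch (the students data is hypothetical):
        import numpy as np
        import pandas as pd

        students = pd.DataFrame({
            'name':   ['Ana', 'Ben', 'Cara', 'Dev'],
            'gender': ['Female', np.nan, 'Female', 'Male'],
            'state':  ['TX', np.nan, 'CA', np.nan],
            'zip':    ['75001', np.nan, '94103', '30301'],
        })

        mask = students['state'].isnull()
        print(students.loc[mask, :])                      # rows where state is null

        students = students.fillna({'gender': 'Female'})  # impute a constant value
        # drop only rows where BOTH state and zip are missing
        students = students.dropna(subset=['state', 'zip'], how='all')
        print(students)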
    • 4.3. Normalizing your data

      • Normalization ensures that values share a common property
      • Normalization often involves scaling data to fall within a small or specified range
      • Normalization is often required; it reduces complexity and improves interpretability
      • Ways of normalization:
        • 1) Z-score normalization
        • 2) Min-max normalization
        • 3) Log transformation
    • 4.4. How to normalize data in python

      • Refer to the online documentation; a minimal sketch of the three approaches from 4.3 is given below
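      • A minimal sketch of the three approaches from 4.3 (hypothetical data; scikit-learn's scalers are an assumption):
        import numpy as np
        import pandas as pd
        from sklearn.preprocessing import MinMaxScaler, StandardScaler

        df = pd.DataFrame({'income': [30000, 52000, 61000, 250000]})

        # 1) Z-score normalization: rescale to mean 0, standard deviation 1
        df['z'] = StandardScaler().fit_transform(df[['income']]).ravel()

        # 2) Min-max normalization: rescale to the [0, 1] range
        df['minmax'] = MinMaxScaler().fit_transform(df[['income']]).ravel()

        # 3) Log transformation: compress large, right-skewed values
        df['log'] = np.log(df['income'])

        print(df)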
    • 4.5. Sampling your data

      • This is the process of selecting a subset of the instances in a dataset as a proxy for the whole
      • The original dataset is referred to as the population, while the subset is known as a sample
        • Sampling without replacement
        • Sampling with replacement
        • Stratified sampling
    • 4.6. How to sample data in python

      • We usually have to split the rows in our data into training and test sets using one of several sampling approaches
        • Use train_test_split() : by default, train_test_split() allocates 25% of the original data to the test set
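        • A minimal sketch (the vehicles.csv file and its columns are assumptions; the stratify argument implements stratified sampling):
          import pandas as pd
          from sklearn.model_selection import train_test_split

          vehicles = pd.read_csv('vehicles.csv')
          X = vehicles.drop(columns=['co2emissions'])
          y = vehicles['co2emissions']

          # 25% of rows go to the test set by default; random_state makes it repeatable
          X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1234)

          # Stratified sampling: preserve the mix of a categorical column in both sets
          X_train, X_test, y_train, y_test = train_test_split(
              X, y, stratify=X['drive'], random_state=1234)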
    • 4.7. Reducing the dimensionality of your data

      • The process of reducing the no. of features in a dataset prior to modelling
      • Reduces complexity and helps avoid the curse of dimensionality
        • The curse of dimensionality means that prediction error increases as the no. of features grows
      • Two approaches to dimensionality reduction:
        • 1) Feature selection
        • 2) Feature abstraction
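      • A minimal sketch of both approaches (the iris data and the SelectKBest/PCA choices are illustrative assumptions):
        from sklearn.datasets import load_iris
        from sklearn.decomposition import PCA
        from sklearn.feature_selection import SelectKBest, f_classif

        X, y = load_iris(return_X_y=True)   # 150 instances, 4 features

        # 1) Feature selection: keep the 2 most predictive ORIGINAL features
        X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

        # 2) Feature abstraction: project onto 2 NEW composite features
        X_abstracted = PCA(n_components=2).fit_transform(X)

        print(X_selected.shape, X_abstracted.shape)   # (150, 2) (150, 2)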
  • 5. Types of ML models

    • Depending on the nature of the dependent variable, supervised ML can be divided into:
      • Classification:
        • Supervised ML problem where the dependent variable is categorical
          • e.g. Yes/No, colour, etc.
      • Regression:
        • Supervised ML problem where the dependent variable is continuous
          • e.g. age, income, temperature etc.
    • Evaluation
      • In order to get an unbiased evaluation of the performance of our model:
        • You must train the model with one dataset (training data) and evaluate it with a different dataset (test data)
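      • A minimal evaluation sketch (the iris data, decision tree, and accuracy metric are illustrative assumptions):
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)

        # Train and evaluate on DIFFERENT rows to avoid an optimistically biased score
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        model = DecisionTreeClassifier().fit(X_train, y_train)  # training data only
        print(model.score(X_test, y_test))                      # accuracy on test data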