Big Dataset → Vectorization > Loops!
Why use loops when you can use vectorization in Python?
Loops are one of the first major concepts we learn when we begin programming. They come naturally, and they exist in almost every programming language, so we reach for them whenever we deal with repetitive operations. But when the number of iterations is huge, say millions or billions, loops are not a great idea. You might be stuck waiting for hours, only to find out later that the code threw an error. This is where vectorization in Python becomes vital.
Vectorization
Vectorization is the technique of applying NumPy array operations to a whole dataset at once. Under the hood, the operation is applied to all the elements of an array or Series in one go, inside NumPy's compiled code, whereas a Python ‘for’ loop processes one row at a time.
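To make the difference concrete, here is a minimal sketch on a five-element array. The variable names are purely illustrative; the point is that the loop touches one element per iteration while the vectorized expression handles the whole array in a single statement.

```python
import numpy as np

values = np.arange(1, 6)  # the integers 1 through 5

# Loop version: square one element at a time in Python
squared_loop = np.empty_like(values)
for i in range(len(values)):
    squared_loop[i] = values[i] ** 2

# Vectorized version: one expression applied to the whole array
squared_vec = values ** 2

print(squared_vec.tolist())  # [1, 4, 9, 16, 25]
```

Both versions produce the same numbers; only the way the work is dispatched differs.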
In this post, we will look at some use cases where Python loops can easily be replaced with vectorization, making your code more efficient.
Use Case — Mathematical Operations
As a data scientist working with pandas DataFrames, I often need to create new derived columns using mathematical operations, and loops are the obvious first tool. The following example shows how easily they can be replaced with vectorization in such cases.
Let’s first create a pandas DataFrame with 5 million rows and 4 columns, filled with random integers from 0 to 49.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 50, size=(5000000, 4)), columns=['a', 'b', 'c', 'd'])
df.shape
#-->(5000000, 4)
df.head()
Now we will create a new column ‘ratio’ holding the ratio of column ‘a’ to column ‘d’, expressed as a percentage.
Using Loops
```python
import time
start = time.time()
# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column
    df.at[idx, 'ratio'] = 100 * (row["a"] / row["d"])
end = time.time()
print(end - start)
### 119 Seconds
```
Using Vectorization
start = time.time()
df["ratio"] = 100 * (df["a"] / df["d"])
end = time.time()
print(end - start)
### 0.14 seconds
We can clearly see a significant improvement: the vectorized operation is roughly 850x faster than the Python loop (119 seconds versus 0.14 seconds).
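One detail worth keeping in mind: because the random integers include 0, column ‘d’ can be zero, and the division then yields inf or NaN instead of raising an error. The sketch below uses a tiny hand-made frame to show this, along with one possible way to mask such values. The names `small` and `ratio_clean` are illustrative assumptions, not part of the example above.

```python
import numpy as np
import pandas as pd

# Tiny frame with zeros in 'd' to expose the edge cases
small = pd.DataFrame({"a": [10, 0, 7, 0], "d": [5, 4, 0, 0]})

# Same vectorized expression as in the example above
small["ratio"] = 100 * (small["a"] / small["d"])

# 7 / 0 gives inf and 0 / 0 gives NaN; here we treat both as missing
small["ratio_clean"] = small["ratio"].replace([np.inf, -np.inf], np.nan)

print(small)
```

Whether infinite ratios should become missing values, be clipped, or be filtered out beforehand depends on what the column is used for downstream.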
Use Case — If-else Statement
A lot of our operations rely on conditional logic that requires ‘if-else’ statements. This logic can easily be replaced with vectorized operations in Python.
Let’s see an example using the same data frame we created earlier.
Imagine we want to create a new column ‘e’ based on some conditions on column ‘a’.
Using Loops
import time
start = time.time()
# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx, 'e'] = row.d
    elif 0 < row.a <= 25:
        df.at[idx, 'e'] = row.b - row.c
    else:
        df.at[idx, 'e'] = row.b + row.c
end = time.time()
print(end - start)
### Time taken: 177 seconds
Using Vectorization
start = time.time()
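# default case: a > 25
df['e'] = df['b'] + df['c']
# overwrite rows where a <= 25 (this still includes a == 0)
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
# finally overwrite rows where a == 0
df.loc[df['a'] == 0, 'e'] = df['d']
end = time.time()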
print(end - start)
## 0.28007707595825195 sec
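As an alternative to the chained `.loc` assignments, the same branching can be written with NumPy's `np.select`, which takes a list of conditions and a matching list of values and uses the first condition that holds for each row. This is a different technique from the one timed above, shown here on a small hand-made frame (the name `small` and its values are illustrative) rather than on the 5 million row data.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "a": [0, 10, 25, 40],
    "b": [5, 6, 7, 8],
    "c": [1, 2, 3, 4],
    "d": [9, 9, 9, 9],
})

# Conditions are checked in order; the first match wins for each row
conditions = [small["a"] == 0, small["a"] <= 25, small["a"] > 25]
choices = [small["d"], small["b"] - small["c"], small["b"] + small["c"]]

small["e"] = np.select(conditions, choices)

print(small["e"].tolist())  # [9, 4, 4, 12]
```

Listing the conditions in the same order as the if/elif/else chain keeps the vectorized version easy to compare against the loop.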
Use Case — Using it on Machine Learning
Machine learning, and deep learning in particular, requires us to evaluate many complex equations across millions or even billions of rows. Running Python loops over these equations is very slow, and vectorization is a far better fit.
For example, consider calculating the value of y for millions of rows in the following multiple linear regression equation:
y = m1*x1 + m2*x2 + m3*x3 + …
Here we can replace loops with vectorization. The values of m1, m2, m3… are determined by solving the above equation using millions of values corresponding to x1, x2, x3…
import numpy as np
# setting initial values of m
m = np.random.rand(1,5)
# input values for 5 million rows
x = np.random.rand(5000000,5)
Using Loops
import time
import numpy as np
m = np.random.rand(1,5)
x = np.random.rand(5000000,5)
# pre-allocate the output array, one value of y per row
zer = np.zeros(5000000)
tic = time.process_time()
for i in range(0,5000000):
    total = 0
    for j in range(0,5):
        total = total + x[i][j]*m[0][j]
    zer[i] = total
toc = time.process_time()
print("Computation time = " + str(toc - tic) + " seconds")
####Computation time = 28.228 seconds
Using Vectorization
tic = time.process_time()
#dot product
np.dot(x,m.T)
toc = time.process_time()
print ("Computation time = " + str((toc - tic)) + "seconds")
####Computation time = 0.107 seconds
As you can see, the vectorized version is roughly 260x faster than the loops (28.228 seconds versus 0.107 seconds).
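Speed only matters if the two versions agree, so it is worth checking that on a small input. The sketch below does so for 1,000 rows. The seeded generator and the names `y_loop`, `y_vec` and `y_flat` are illustrative choices, not part of the timing code above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the check is repeatable
m = rng.random((1, 5))
x = rng.random((1000, 5))

# Loop version: accumulate the weighted sum row by row
y_loop = np.zeros(len(x))
for i in range(len(x)):
    total = 0.0
    for j in range(x.shape[1]):
        total += x[i, j] * m[0, j]
    y_loop[i] = total

# Vectorized version: a single matrix product
y_vec = np.dot(x, m.T)   # one column, one row per input row
y_flat = y_vec.ravel()   # flatten to match the loop output

print(y_vec.shape)                  # (1000, 1)
print(np.allclose(y_loop, y_flat))  # True
```

Note that `np.dot(x, m.T)` returns a two-dimensional column, so it has to be flattened before comparing it with the one-dimensional loop result.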
Conclusion
Vectorization in Python is dramatically faster than explicit loops and should be preferred whenever we are working with large datasets.