- Details
- Written by Francois Pacull
- Views: 506

Temperatures taken from this website:

https://www.historique-meteo.net/france/rh-ne-alpes/lyon/

This dataset is updated monthly (to be updated in early September with August temperatures).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
```

Read more: Is this summer warmer than usual here in Lyon (so far)

- Details
- Written by Francois Pacull
- Views: 1162

I recently stumbled on this interesting post on RealPython (excellent website by the way!):

**Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects**

This post covers several Pandas-related topics:
- creating a `datetime` column
- looping over Pandas data
- saving/loading HDF data stores
- ...

I focused on the *looping over Pandas data* part. They compare different approaches for looping over a dataframe and applying a basic (piecewise linear) function:
- a "crappy" loop with `.iloc` to access the data
- `iterrows()`
- `apply()` with a lambda function
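To make the three approaches above concrete, here is a minimal sketch. The `rate` function and the small dataframe are made up for illustration (a hypothetical piecewise-constant tariff by hour of day); they are not the data or the exact function used in the RealPython post.

```python
import pandas as pd


def rate(hour):
    """Hypothetical piecewise-constant tariff (cents per kWh) by hour of day."""
    if 0 <= hour < 7:
        return 12
    elif 7 <= hour < 17:
        return 20
    return 28


# Made-up data: an hour of the day and an energy reading in kWh.
df = pd.DataFrame({
    "hour": [0, 6, 7, 12, 17, 23],
    "energy_kwh": [0.5, 0.4, 1.0, 1.2, 2.0, 1.5],
})

# 1. Explicit loop, fetching each row with .iloc
cost_iloc = []
for i in range(len(df)):
    row = df.iloc[i]
    cost_iloc.append(rate(row["hour"]) * row["energy_kwh"])

# 2. iterrows(): yields (index, row-as-Series) pairs
cost_iterrows = [rate(row["hour"]) * row["energy_kwh"]
                 for _, row in df.iterrows()]

# 3. apply() with a lambda, called once per row (axis=1)
cost_apply = df.apply(lambda row: rate(row["hour"]) * row["energy_kwh"], axis=1)
```

All three variants compute the same values; they differ only in how much per-row overhead Pandas adds.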

But I was a little disappointed to see that they did not actually implement the following other approaches:
- `itertuples()`, about which they write: "While `.itertuples()` tends to be a bit faster, let’s stay in Pandas and use `.iterrows()` in this example, because some readers might not have run across `namedtuple`."
- Numpy vectorize
- Numpy (just a loop over Numpy vectors)
- Cython
- Numba

So I just wanted to complete their post by adding the latter approaches to the performance comparison, using the same `.csv` file. In order to compare all the different implementations on the same computer, I also copied and re-ran their code.
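As a rough illustration of some of the additional approaches, here is a sketch of `itertuples()`, `np.vectorize` and a plain loop over NumPy arrays. The `rate` function and the data are hypothetical stand-ins, not the function or the `.csv` file from the original comparison; Cython and Numba are left out because they need a compilation step or an extra dependency.

```python
import numpy as np
import pandas as pd


def rate(hour):
    """Hypothetical piecewise-constant tariff (cents per kWh) by hour of day."""
    if 0 <= hour < 7:
        return 12
    elif 7 <= hour < 17:
        return 20
    return 28


# Made-up data, for illustration only.
df = pd.DataFrame({
    "hour": [0, 6, 7, 12, 17, 23],
    "energy_kwh": [0.5, 0.4, 1.0, 1.2, 2.0, 1.5],
})

# itertuples(): rows come back as namedtuples, so columns are attributes.
cost_itertuples = [rate(row.hour) * row.energy_kwh
                   for row in df.itertuples(index=False)]

# np.vectorize: a convenience wrapper that still calls `rate` per element.
vectorized_rate = np.vectorize(rate)
hours = df["hour"].to_numpy()
energy = df["energy_kwh"].to_numpy()
cost_vectorize = vectorized_rate(hours) * energy

# Plain Python loop over the underlying NumPy arrays.
cost_loop = np.empty(len(df))
for i in range(len(df)):
    cost_loop[i] = rate(hours[i]) * energy[i]
```

Timing each variant (for example with `%timeit` in a notebook) on the real dataset is what the comparison is about; this toy dataframe is far too small to give meaningful timings.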

Note: my laptop CPU is an `Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz` (with some DDR4-2400 RAM).

- Details
- Written by Francois Pacull
- Views: 1746

k-means is a clustering algorithm, which belongs to the family of unsupervised machine learning models. It aims at finding $k$ groups of similar data (clusters) in an unlabeled multidimensional dataset.

Let $(x_1, ..., x_n)$ be a set of $n$ observations with $x_i \in \mathbb{R}^{d}$, for $1 \leq i \leq n$. The aim of the k-means algorithm is to find a partition $S=\{S_1, ..., S_k\}$ of the $n$ observations into $k \leq n$ disjoint clusters, minimizing $D$, the within-cluster squared distance to center:
$$ D(S) = \sum_{i=1}^k \sum_{x \in S_i} \| x - \mu_i \|^2 $$
where $\mu_i$ is the $i$-th cluster center (i.e. the arithmetic mean of the cluster observations): $\mu_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$, for $1 \leq i \leq k$.
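The objective $D(S)$ can be evaluated directly from its definition. The sketch below assumes the partition is given as an integer label per observation; the function name `within_cluster_distance` and the tiny dataset are made up for illustration.

```python
import numpy as np


def within_cluster_distance(X, labels, k):
    """Return D(S): sum of squared distances of each point to its cluster mean."""
    total = 0.0
    for i in range(k):
        members = X[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing to the sum
        mu_i = members.mean(axis=0)  # arithmetic mean of the cluster
        total += np.sum((members - mu_i) ** 2)
    return total


# Four points in the plane, split into two clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
D = within_cluster_distance(X, labels, k=2)
```

Here the cluster means are $(0, 1)$ and $(10, 2)$, so $D = (1 + 1) + (4 + 4) = 10$.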

Unfortunately, finding the exact solution of this problem is very tough (NP-hard) and a local minimum is generally sought using a heuristic.

Here is a simple description of the algorithm taken from the book "Data Science from Scratch" by Joel Grus (O'Reilly):

1. Start with a set of k-means, which are $k$ points in $d$-dimensional space.
2. Assign each point to the mean to which it is closest.
3. If no point’s assignment has changed, stop and keep the clusters.
4. If some point’s assignment has changed, recompute the means and return to step 2.
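The four steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation from the book: the initialization (picking $k$ distinct observations at random), the `max_iter` safeguard and the handling of empty clusters are assumptions made here.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the four steps described above."""
    rng = np.random.default_rng(seed)
    # Step 1: start with k means (assumption: k distinct random observations).
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each point to the mean to which it is closest.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop if no assignment has changed.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute the means (an empty cluster keeps its old mean).
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                means[i] = members.mean(axis=0)
    return means, labels


# Two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
means, labels = kmeans(X, k=2)
```

On this toy dataset the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initialization.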

This algorithm is an iterative refinement procedure. In his book "Python Data Science Handbook" (O'Reilly), Jake VanderPlas refers to this algorithm as a kind of Expectation–Maximization (E–M). Since step 1 is the algorithm initialization and step 3 the stopping criterion, we can see that the algorithm consists of only two alternating steps:

- Step 2 is the *Expectation*: "updating our expectation of which cluster each point belongs to".
- Step 4 is the *Maximization*: "maximizing some fitness function that defines the location of the cluster centers".

This is described in more detail in the following link.

An interesting geometrical interpretation is that step 2 corresponds to partitioning the observations according to the Voronoi diagram generated by the centers computed previously (either in step 1 or step 4). This is why the standard k-means algorithm is also called Lloyd's algorithm, which is a Voronoi iteration method for finding evenly spaced sets of points in subsets of Euclidean spaces.

Let us have a look at the Voronoi diagram generated by the $k$ means.

As in Jake VanderPlas' book, we generate some fake observation data using scikit-learn 2-dimensional blobs, in order to easily plot them.

```python
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# k blobs of 2-dimensional points, n observations in total
k = 20
n = 1000
X, _ = make_blobs(n_samples=n, centers=k, cluster_std=0.70, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=5);
```
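One possible way to get from these observations to a Voronoi diagram of the $k$ means is sketched below. It assumes scikit-learn's `KMeans` for the clustering and SciPy's `scipy.spatial.Voronoi` for the diagram; the `n_init=10` setting is an arbitrary choice made here, and this is not necessarily how the rest of the post proceeds.

```python
from scipy.spatial import Voronoi
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same fake observations as above.
k = 20
n = 1000
X, _ = make_blobs(n_samples=n, centers=k, cluster_std=0.70, random_state=0)

# Fit k-means and retrieve the k cluster centers (the "means").
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
centers = kmeans.cluster_centers_

# Voronoi diagram generated by the k centers.
vor = Voronoi(centers)
```

The diagram can then be drawn with `scipy.spatial.voronoi_plot_2d(vor)` on top of the scatter plot of `X`.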

- Details
- Written by Francois Pacull
- Views: 1022

In this post we are simply going to retrieve the restaurants of the city of Lyon, France from OpenStreetMap, and then plot them with Bokeh.

Downloading the restaurant names and coordinates is done using a fork of the great OSMnx library. From what I understand, the OSM-POI feature of this fork will probably soon be added to OSMnx (issue).

First we create a fresh conda env and install jupyterlab and bokeh (the following lines show the Linux way to do it, but something similar can be done on Windows):

```
$ conda create -n restaurants python=3.6
$ source activate restaurants
$ conda install jupyterlab
$ conda install -c bokeh bokeh
$ jupyter labextension install jupyterlab_bokeh
$ jupyter lab osm_restaurants.ipynb
```

The jupyterlab extension allows the rendering of JS Bokeh content.

Then we need to install the POI fork of OSMnx:

```
$ git clone git@github.com:HTenkanen/osmnx.git
$ cd osmnx/
osmnx $ git checkout 1-osm-poi-dev
osmnx $ pip install .
osmnx $ cd ..
```

And we are ready to run the notebook:

```
jupyter lab osm_restaurants.ipynb
```


```
import osmnx as ox

place = "Lyon, France"
restaurant_amenities = ['restaurant', 'cafe', 'fast_food']
restaurants = ox.pois_from_place(place=place,
                                 amenities=restaurant_amenities)[['geometry',
                                                                  'name',
                                                                  'amenity',
                                                                  'cuisine',
                                                                  'element_type']]
```

We are looking for 3 kinds of amenities related to food: restaurants, cafés and fast-food outlets. The collected data is returned as a GeoDataFrame, which is basically a Pandas DataFrame associated with a GeoSeries of Shapely geometries. Along with the geometry, we are only keeping 4 columns:

- restaurant name,
- amenity type (restaurant, café or fast_food),
- cuisine type and
- element_type (OSM types: node, way, relation).
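To illustrate what such a structure looks like, here is a tiny hand-built GeoDataFrame with the same columns. The rows are invented placeholders (not real OSM data), and the `EPSG:4326` coordinate reference system is an assumption made for this sketch.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Made-up rows, only to show the structure: attribute columns + geometry.
gdf = gpd.GeoDataFrame(
    {
        "name": ["Example Bistro", "Example Café"],
        "amenity": ["restaurant", "cafe"],
        "cuisine": ["french", None],
        "element_type": ["node", "node"],
        "geometry": [Point(4.835, 45.764), Point(4.850, 45.750)],
    },
    crs="EPSG:4326",  # assumed: longitude/latitude in WGS84
)

# A GeoDataFrame is a Pandas DataFrame subclass...
is_dataframe = isinstance(gdf, pd.DataFrame)
# ...whose geometry column is a GeoSeries of Shapely objects.
geometry_type = type(gdf.geometry).__name__
# For point geometries, coordinates are available as .x and .y.
longitudes = gdf.geometry.x.tolist()
latitudes = gdf.geometry.y.tolist()
```

Regular Pandas operations such as `gdf["amenity"].value_counts()` work unchanged on the attribute columns.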


```
restaurants.head()
```



```
ax = restaurants.plot()
```

© 2001 - 2019 Polymorphe.org