Wow, thank you for taking the time to answer my question Piotr!
Now it makes sense why you transposed them; it makes the coding a LOT easier :)
Thank you for clarifying the bias part and for the TensorFlow link. You are right, the bias is a 1D vector; it was silly of me to think otherwise.
So, broadcasting is just a fancy way of saying that we make copies of the same 1D bias vector so it fits the size of W*A? (I hope I got that right.) By doing this, the same bias vector will be applied (be "broadcast") to each example in the batch. Something like the little sketch below:
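Here is a minimal NumPy sketch of how I picture it, with made-up layer sizes (3 neurons in this layer, 4 in the previous one, a batch of 5 examples), just to check my understanding:

```python
import numpy as np

# Hypothetical shapes, purely for illustration
W = np.random.randn(3, 4)   # weight matrix
A = np.random.randn(4, 5)   # previous-layer activations, one column per example
b = np.random.randn(3, 1)   # bias, one value per neuron

# b (3x1) is broadcast across all 5 columns of W*A (3x5)
Z = np.dot(W, A) + b
print(Z.shape)               # (3, 5)

# Broadcasting gives the same result as explicitly copying (tiling) b,
# just without actually materializing the copies
Z_tiled = np.dot(W, A) + np.tile(b, (1, 5))
print(np.allclose(Z, Z_tiled))  # True
```

So NumPy never really makes the copies in memory, it just behaves as if it did.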
Thank you again for this amazing article and for your thorough reply!