The aim of our research is to develop an anthropomorphic flutist robot that, on the one hand, reproduces the human motor skills required for playing the flute and, on the other hand, displays the cognitive capabilities needed to interact with other (human) musicians. In this paper, we detail recent mechanical improvements to the Waseda Flutist Robot (WF-4RIV) that enhance the realism of the produced flute sound. In particular, we introduce improved lips, an improved oral cavity, and an improved tonguing mechanism, and describe their designs: the lips can be deformed with 3 DOF, allowing accurate control of the air-stream characteristics (width, thickness, and angle), and a redesigned tonguing mechanism (1 DOF) reproduces double tonguing. Furthermore, we present the implementation of a real-time interaction system with human partners. As a first approach, we developed a vision-processing algorithm to track the 3D orientation and position of a musical instrument: image data is recorded by two cameras attached to the robot's head and processed in real time. The proposed algorithm combines color-histogram matching with particle filtering to follow the position of a musician's hands on the instrument; analysis of the tracked data yields the orientation and location of the instrument. We map these parameters to musical performance parameters of the WF-4RIV, such as sound vibrato and sound volume. A set of experiments was carried out to verify the effectiveness of the proposed tracking system during interaction with a human player. We conclude that the quality of the WF-4RIV's musical performance and its ability to interact with musical partners have been significantly improved by the techniques proposed in this paper.
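The abstract does not give implementation details of the tracker. As an illustration of the general technique it names (color-histogram matching combined with a particle filter), the following is a minimal, self-contained Python/NumPy sketch of 2D position tracking. All names, parameter values, and design choices here (e.g. `ParticleTracker`, the Bhattacharyya similarity, the sharpening exponent) are our own assumptions for illustration, not taken from the WF-4RIV system.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Coarse per-channel color histogram of an RGB patch, normalized to sum to 1."""
    hist = np.concatenate([
        np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]).astype(float)
    return hist / (hist.sum() + 1e-9)

def bhattacharyya(h1, h2):
    """Histogram similarity in [0, 1]; 1 means identical distributions."""
    return float(np.sum(np.sqrt(h1 * h2)))

class ParticleTracker:
    """Tracks an image region whose color histogram matches a reference."""

    def __init__(self, ref_hist, frame_shape, n_particles=300,
                 patch=15, motion_std=5.0, seed=0):
        self.ref = ref_hist
        self.h, self.w = frame_shape[:2]
        self.n = n_particles
        self.r = patch // 2          # half-size of the comparison patch
        self.std = motion_std        # Gaussian motion-model noise (pixels)
        self.rng = np.random.default_rng(seed)
        # Initialize particles (x, y) uniformly over the image.
        self.p = np.column_stack([
            self.rng.uniform(self.r, self.w - self.r, n_particles),
            self.rng.uniform(self.r, self.h - self.r, n_particles),
        ])

    def step(self, frame):
        """Process one frame; returns the (x, y) position estimate."""
        # Predict: diffuse particles with Gaussian motion noise.
        self.p += self.rng.normal(0.0, self.std, self.p.shape)
        self.p[:, 0] = np.clip(self.p[:, 0], self.r, self.w - self.r - 1)
        self.p[:, 1] = np.clip(self.p[:, 1], self.r, self.h - self.r - 1)
        # Update: weight each particle by histogram similarity of its patch.
        w = np.empty(self.n)
        for i, (x, y) in enumerate(self.p.astype(int)):
            patch = frame[y - self.r:y + self.r + 1,
                          x - self.r:x + self.r + 1]
            w[i] = bhattacharyya(self.ref, color_histogram(patch))
        w = w ** 8                   # sharpen the likelihood (heuristic choice)
        w /= w.sum()
        est = (self.p * w[:, None]).sum(axis=0)   # weighted-mean estimate
        # Resample: duplicate particles in proportion to their weights.
        self.p = self.p[self.rng.choice(self.n, self.n, p=w)]
        return est
```

In a setup like the one described, the per-frame estimates for both hands (and both cameras) would then be combined to infer instrument orientation and position, which are mapped to performance parameters such as vibrato and volume; that mapping is outside the scope of this sketch.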