Grab all URLs from a saved HTML file with Python and BeautifulSoup

  1. Prepare the HTML file: in Chrome, open youtube.com, press Ctrl+S, and save the page as youtube.html
  2. Install the pip dependencies (lxml is the parser the script passes to BeautifulSoup):
    pip install beautifulsoup4 lxml
  3. Create a grab-links.py file with the following contents:
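    """Print the href of every <a> tag in one or more saved HTML files."""
    import argparse

    from bs4 import BeautifulSoup

    def get_args():
        parser = argparse.ArgumentParser(
            description=__doc__,
            formatter_class=argparse.RawDescriptionHelpFormatter
        )
        parser.add_argument(
            'file',
            type=str,
            nargs='+',
            help='saved html file(s) to get urls from'
        )
        return parser.parse_args()

    def main():
        args = get_args()

        # nargs='+' yields a list, so handle every file given on the command line.
        for path in args.file:
            print(path)
            # Assumes the page was saved as UTF-8; undecodable bytes are replaced.
            with open(path, encoding='utf-8', errors='replace') as f:
                soup = BeautifulSoup(f, 'lxml')

            # href=True skips anchors without an href, which would print "None".
            for link in soup.find_all('a', href=True):
                print(link['href'])

    if __name__ == "__main__":
        main()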
    
  4. Grab the links from the saved file using the newly created script:
    python3 grab-links.py youtube.html
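
If installing bs4 and lxml is not an option, roughly the same extraction can be sketched with only the Python standard library. This is an alternative to the script above, not a drop-in replacement: it swaps BeautifulSoup for html.parser, and it also resolves relative links and drops duplicates, which the original script does not do. The names LinkCollector and extract_links, and the base URL https://www.youtube.com/, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # hypothetical: the site the page was saved from
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors whose href is missing or empty.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return unique absolute links in the order they first appear."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    # dict.fromkeys keeps insertion order while removing duplicates.
    return list(dict.fromkeys(collector.links))


if __name__ == "__main__":
    # Assumes the file from step 1 exists and was saved as UTF-8.
    with open("youtube.html", encoding="utf-8", errors="replace") as f:
        for link in extract_links(f.read(), "https://www.youtube.com/"):
            print(link)
```

If the saved file already contains absolute links, urljoin leaves them unchanged, so passing a base URL is harmless in that case.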